The Data Day, Today: August 8 2012

Who loves Hadoop? Who doesn’t?

And that’s the Data Day, today.

The Data Day, Today: Apr 25 2012

Splunk soars on IPO. VMware acquires Cetas. Vertica retain autonomy. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* For 451 Research clients

# Splunk IPO: $3bn and counting M&A Insight

# VMware snaps up Cetas Software for ‘big data’ analytics Deal Analysis

# HP’s Vertica retains its autonomy, continues integration with Autonomy Impact Report

# SAP makes long-awaited predictive analytics move of its own Impact Report

# Sanbolic pitches data management platform for server, desktop and database consolidation Impact Report

* Splunk IPO kills, lives up to expectations

* VMware acquires Cetas Software for Cloud and Big Data Analytics

* Opera Solutions Acquires Procurement Analytics Tools and Services from BIQ and Lexington Analytics

* Terascala Announces $14M Series B Funding Round Led by Strategic Partner Consortium

* Ravel Acquired by W2O Group To Expand Big Data Client Services And Enrich In-House Analytics and Insights Technology

* Teradata Active Data Warehouses Provide Private Cloud Benefits

* Pentaho Introduces New Interactive Visualization and Expanded Big Data Analytics

* Teradata Unveils New Purpose-Built Appliance for SAS High-Performance Analytics

* SAP Establishes Global Managing Board to Lead Company

* Oracle to Hadoop Under OneAppliance: GridIron Introduces First All-Flash Appliance Line With Unprecedented Performance to Tackle Unified Big Data Processing

* Lucid Imagination Technology Integration with SugarCRM Lets Customers Enjoy Improved Global Search Capabilities with Apache Lucene/Solr

* The Apache Software Foundation Announces Apache Cassandra v1.1

* Miso project: how it will help you make your own Guardian-style infographics and data visualisations

And that’s the Data Day, today.

Update on the relative popularity of NoSQL database skills

Back in December we ran a series of posts looking at the geographic distribution of NoSQL skills, according to the results of searching LinkedIn member profiles, culminating in a look at the relative overall popularity of the major NoSQL databases.

This week I took another look at LinkedIn to update the results for a forthcoming report, which gives us the opportunity to see how the results have changed over the past quarter:

While this provides us with an interesting opportunity to track LinkedIn profile mentions over time there isn’t a huge amount we can learn from this first update – other than that MongoDB seems to be increasing its dominance.

The only significant change that isn’t immediately obvious from looking at the chart is that Apache HBase has overtaken Apache CouchDB by a tiny margin to claim third place overall.

As we noted last time, however, Apache HBase is more reliant on the US than other NosQL databases for its LinkedIn mentions: it is the second most prevalent NoSQL database mentioned in the USA but fourth in the rest of the world.

Two other points to take into consideration:

– The results for Apache Cassandra are probably disproportionately low since we have to search for the full phrase in order to avoid including people called Cassandra.

– Previously we only searched for Membase. This time we added together the search results for both Membase and Couchbase. This may mean the result for Couch/Membase is disproportionately high since some members probably listed both.

This is not meant to be a comprehensive analysis, however, but rather a snapshot of one particular data source.

The Data Day, Today: Jan 10 2012

Oracle OEMs Cloudera. The future of Apache CouchDB. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Oracle announced the general availability of Big Data Appliance, and an OEM agreement with Cloudera for CDH and Cloudera Manager.

* The Future of Apache CouchDB Cloudant confirms intention to integrate the core capabilities of BigCouch into Apache CouchDB.

* Reinforcing Couchbase’s Commitment to Open Source and CouchDB Couchbase CEO Bob Wiederhold attempts to clear up any confusion.

* Hortonworks Appoints Shaun Connolly to Vice President of Corporate Strategy Former vice president of product strategy at VMware.

* Splunk even more data with 4.3 Introducing the latest Splunk release.

* Announcement of Percona XtraDB Cluster (alpha release) Based on Galera.

* Bringing Value of Big Data to Business: SAP’s Integrated Strategy Forbes interview with with Sanjay Poonen, President and corporate officer of SAP Global Solutions.

* New Release of Oracle Database Firewall Extends Support to MySQL and Enhances Reporting Capabilities Self-explanatory.

* Big data and the disruption curve “Many efforts are being funded by business units and not the IT department and money is increasingly being diverted from large enterprise vendors.”

* Get your SQL Server database ready for SQL Azure Microsoft “codename” SQL Azure Compatibility Assessment.

* An update on Apache Hadoop 1.0 Cloudera’s Charles Zedlewski helpfully explains Apache Hadoop branch numbering.

* Xeround and the CAP Theorem So where does Xeround fit in the CAP Theorem?

* Can Yahoo’s new CEO Thompson harness big data, analytics? Larry Dignan thinks Scott Thompson might just be the right guy for the job.

* US Companies Face Big Hurdles in ‘Big Data’ Use “21% of respondents were unsure how to best define Big Data”

* Schedule Your Agenda for 2012 NoSQL Events Alex Popescu updates his list of the year’s key NoSQL events.

* DataStax take Apache Cassandra Mainstream in 2011; Poised for Growth and Innovation in 2012 The usual momentum round-up from DataStax.

* Objectivity claimed significant growth in adoption of its graph database, InfiniteGraph and flagship object database, Objectivity/DB.

* Cloudera Connector for Teradata 1.0.0 Self-explanatory.

* For 451 Research clients

# SAS delivers in-memory analytics for Teradata and Greenplum Market Development report

# With $84m in funding, Opera sets out predictive-analytics plans Market Development report

* Google News Search outlier of the day: First Dagger Fencing Competition in the World Scheduled for January 14, 2012

And that’s the Data Day, today.

User perspectives on NoSQL

The NoSQL EU event in London this week was a great event with interesting perspectives from both vendors – Basho, Neo Technology, 10gen, Riptano – and also users – The Guardian, the BBC, Amazon, Twitter. In particular I was interested in learning from the latter about how and why they ended up using alternatives to the traditional relational database model.

Some of the reasons for using NoSQL have been well-documented: Amazon CTO Werner Vogels talked about how the traditional database offerings were unable to meet the scalability Amazon.com requires. Filling a functionality void also explains why Facebook created Cassandra, Google created BigTable, and Twitter created FlockDB (etc etc). As Werner said, “We couldn’t bet the company on other companies building the answer for us.”

As Werner also explained, however, the motivation for creating Dynamo was also about enabling choice and ensuring that Amazon was not trying to force the relational database to do something it was not designed to do. “Choosing the right tool for the job” was a recurring theme at NoSQL EU.

Given the NoSQL name it is easy to assume that this means that the relational database is by default “the wrong tool”. However, the most important element in that statement is arguably not “tool”, but “job” and The Guardian discussed how it was using non-relational data tools to create new applications that complement its ongoing investment in the Oracle database.

For example, the Guardian’s application to manage the progress of crowdsourcing the investigation of MP’s expenses is based on Redis, while the Zeitgeist trending news application runs on Google’s AppEngine, as did its live poll during the recent leader’s election debate. Datablog, meanwhile, relies on Google Spreadsheets to serve up usable and downloadable data – we’ll ignore for a moment whether Google Spreadsheets is a NoSQL database 😉

Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. The overarching theme, as Matthew Wall and Simon Willison explained, is that the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.

On the subject of choosing the right tool for the job, Basho’s engineering manager Brian Fink pointed out that using NoSQL technology alongside relational SQL database technology may actually improve the performance of the SQL database since storing data in a relational database that does not need SQL features slows down access to data that does need SQL features.

Another perspective on this came from Werner Vogels who noted that unlike database administrators/ systems architects, users don’t care about where data resides or what model it uses – as long as they get the service they require. Werner explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems. Users do not think about data sources in isolation, they care about the amalgamated service.

This was also a theme that cropped up in the presentation by Enda Farrell, software architect at the BBC, who noted that the BBC’s homepage is a PHP application integrated with multiple data sources at multiple data centers, and also Twitter‘s analytics lead Kevin Weil, who described Twitter’s use of Hadoop, Pig, HBase, Cassandra and FlockDB.

While the company is using HBase for low-latency analytic applications such as people search and moving to Cassandra from MySQL for its online applications, it uses its recently open-sourced FlockDB graph database to serve up data on followers and correlate the intersection of followers to (for example) ensure that Tweets between two people are only sent to the followers of both. (As something of an aside, Twitter is using Hadoop to store the 7TB of of data its generates a day from Tweets, and Pig for non-real time analytics).

Kevin noted that the company is also working with Digg to build real-time analytics for Cassandra and will be releasing the results as open source, and also discussed how Twitter has made use of open source technologies created by others such as Facebook (both Cassandra and the Scribe log data aggregation server.

One of the issues that has arisen from the fact that organizations such as Amazon and Facebook have had to create their own data management technologies is the proliferation of NoSQL databases and a certain amount of wheel re-invention.

Werner explained that SmugMug creator Don Macaskill ended up being a MySQL expert not because he necessarily wanted to be, but because he needed to be because he had to be to keep his applications running.

“He doesn’t want to have to become an expert in Cassandra,” noted Werner. “What he wants is to have someone run it for him and take care of that.” Presumably Riptano, the new Cassandra vendor formed by Jonathan Ellis – project chair for the Cassandra database – will take care of that, but in the meantime Werner raised another long-term alternative.

“We shouldn’t all be doing this,” he said, adding that Dynamo is not as popular within Amazon Web Services as it once was as it is a product, that requires configuration and management, rather than a service, and Amazon employees “have better things to do.”

Which raises the question – don’t Twitter, Facebook, the BBC, the Guardian et al have better things to do than developing and maintaining database architecture? In a perfect world, yes. But in a perfect world they’d all have strongly consistent, scalable distributed database systems/services that are suited to their various applications.

Interestingly, describing S3 as “a better key/value store than Dynamo”, Werner noted that SimpleDB and S3 are “a good start to provide that service”.

Looking forward to NoSQL EU

I was asked a few weeks ago whether I thought NoSQL was largely a US, (and specifically) West Coast phenomenon. While it might seem that way for some of those in the bubble that is the Bay Area (and to be fair that’s where I was at the time), the answer is a definite “no”.

As if to prove it, NoSQL EU is being held London next week with a great program of presentations from NoSQL vendors, projects and users.

April 20 features presentations on The Guardian’s use of NoSQL, as well as an overview from Alex Popescu of MyNoSQL, followed by presentations from Basho, 10gen, Rackspace and Neo Technology.

April 21 sees Amazon CTO Werner Vogels describing the birth of Dynamo, as well as presentations on the use of NoSQL databases from the BBC, Twitter, and Comcast. That is followed by presentations on Redis, Tokyo Cabinet (et al) and “the fate of the relational database”. Oh, and a panel debate moderated by some bloke called James Governor 😉

Then on the 22nd there’s a day of workshops involving MongoDB, Redis, Riak and Neo4J.

It’s shaping up to be a great event and I’m really looking forward to it. If you’re going to be there and want to say hi (between sessions!) let me know.

How will pro-SQL respond to NoSQL?

Gear6’s Mark Atwood is less than impressed with my recent statement: “Memcached is not a key value store. It is a cache. Hence the name.”

Mark has responded with a post in which he explains how memcached can be used as a key value store with the assistance of “persistent memcached” from Gear6, or by combining memcached with something like Tokyo Cabinet.

As much as I agree with Mark that other technologies can be used to turn memcached into a key value store I can’t help thinking his post actually proves my point: that memcached itself is not a key value store.

Either way it brings me to the next post in the NoSQL series (see also The 451 Group’s recent Spotlight report), looking at what the existing technology providers are likely to do in response.

I spent last week in San Francisco at the Open Source Business Conference where David Recordon, head of open source initiatives at Facebook, outlined how the company makes use of various open source projects, including memcached and MySQL, to scale its infrastructure.

It was an interesting presentation, although the thing that stood out for me was that Recordon didn’t once mention Cassandra, the open source key value store created by Facebook, despite being asked directly about the company’s plans for what was rather quaintly referred to as “non-relational databases”.

In fact, this recent post from Recordon puts Cassandra in context: “we use it for Inbox search, but the majority of development is now being led by Digg, Rackspace, and Twitter”. It is technologies like MySQL and memcached that Facebook is scaling to provide its core horsepower.

The death of memcached, as they say, has been greatly exaggerated.

That said, it is clear that to some extent the rise of NoSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently. Sharding is a popular method of increasing the scalability of the MySQL database to serve the requirements of high-traffic websites, but it’s manually intensive. The memcached distributed memory object-caching system can also be used to improve performance, but does not provide persistence.

An alternative to throwing out investments in MySQL and memcached in favor of NoSQL is to improve the MySQL/memcached combination, however. A number of vendors, including Gear6 and NorthScale, are developing and delivering technologies that add persistence to memcached (see recent 451 Group coverage on Gear6 and NorthScale), while appliance providers such as Schooner Information Technology (451 coverage) and Virident Systems (451 coverage) have taken an appliance-based approach to adding persistence.

Another approach would be to improve the performance of MySQL itself. ScaleDB (451 coverage) has a shared-disk storage engine for MySQL that promises to improve its scalability. We have also recently come across GenieDB, (451 coverage) which is promising a massively distributed data storage engine for MySQL. Additionally, Tokutek’s TokuDB MySQL storage engine is based on Fractal Tree indexing technology that reduces data-insertion times, improving the performance of MySQL for both read and write applications, for example.

As we noted in our recent assessment of Tokutek, while TokuDB is effectively an operational database technology, it does blur the line between operations and analytics since the company claims it delivers a performance improvement sufficient to run ad hoc queries against live data.

Beyond MySQL, while we expect the database incumbents to feel the impact of NoSQL in certain use cases, the lack of consistency (in the CAP Theorem sense) inevitably enables quick dismissal of their wider applicability. Additionally, we expect to see the data management vendors take steps to improve performance and scalability. One method is through the use of in-memory databases to improve performance for repeatedly accessed data, another is through the use of in-memory data grid caching technologies, which are designed to solve both performance and scalability issues.

Although these technologies do not provide the scalability required by Facebook, Amazon, et al., the question is, how many applications need that level of scalability? Returning again to CAP Theorem, if we assume that most applications do not require the levels of partition tolerance seen at Google, expect the incumbents to argue that what they lack in partition tolerance they can make up for in consistency and availability.

Somewhat inevitably, the requirements mandated by NoSQL advocates will be watered down for enterprise adoption. At that level, it may arguably be easier for incumbent vendors to sacrifice a little consistency and availability for partition tolerance than it will be for NoSQL projects to add consistency and availability.

Much will depend on the workload in question, which is something that is being hidden by debates that assume a confrontational relationship between SQL and NoSQL databases. As the example of Facebook suggests, there is room for both MySQL/memcached and NoSQL

Saying yes to NoSQL

As a company, The 451 Group has built its reputation on taking a lead in covering disruptive technologies and vendors. Even so, with a movement as hyped as NoSQL databases, it sometimes pays to be cautious.

In my role covering data management technologies for The 451 Group’s Information Management practice I have been keeping an eye on the NoSQL database movement for some time, taking the time to understand the nuances of the various technologies involved and their potential enterprise applicability.

That watching brief has now spilled over into official coverage, following our recent assessment of 10gen. I also recently had the chance to meet up with Couchio’s VP of business development, Nitin Borwankar (see coverage initiation of Couchio). I’ve also caught up with Basho Technologies sooner rather than later. A report on that is now imminent.

There are a couple of reasons why I have formally began covering the NoSQL databases. The first is the maturing of the technologies, and the vendors behind them, to the point where they can be considered for enterprise-level adoption. The second is the demand we are getting from our clients to provide our view of the NoSQL space and its players.

This is coming both from the investment community and from existing vendors, either looking for potential partnerships or fearing potential competition. The number of queries we have been getting related to NoSQL and big data have encouraged articulation of my thoughts, so look-out for a two-part spotlight on the implications for the operational and analytical database markets in the coming weeks.

The biggest reason, however, is the recognition that the NoSQL movement is a user-led phenomena. There is an enormous amount of hype surrounding NoSQL but for the most part it is not coming from vendors like 10gen, Couchio and Basho (although they may not be actively discouraging it) but from technology users.

A quick look at the most prominent key-value and column-table NoSQL data stores highlights this. Many of these have been created by user organizations themselves in order fill a void and overcome the limitations of traditional relational databases – for example Google (BigTable), Yahoo (Hbase), Zvents (Hypertable), LinkedIn (Voldemort), Amazon (Dynamo), and Facebook (Cassandra).

It has become clear that traditional database technologies do need meet the scalability and performance requirements of dealing with big data workloads, particularly at a scale experienced by social networking services.

That does raise the question of how applicable these technologies will be to enterprises that do not share the architecture of the likes of Google, Facebook and LinkedIn – at least in the short-term. Although there are users – Cassandra users include Rackspace, Digg, Facebook, and Twitter, for example.

What there isn’t – for the likes of Cassandra and Voldemort, at least – is vendor-based support. That inevitably raises questions about the general applicability of the key-value/column table stores. As Dave Kellog notes, “unless you’ve got Google’s business model and talent pool, you probably shouldn’t copy their development tendencies”.

Given the levels of adoption it seems inevitable that vendors will emerge around some of these projects, not least since, as Dave puts it, “one day management will say: ‘Holy Cow folks, why in the world are we paying programmers to write and support software at this low a level?'”

In the meantime, it would appear that the document-oriented data stores (Couchio’s CouchDB, 10gen’s MongoDB, Basho’s Riak) are much more generally applicable, both technologically and from a business perspective. UPDATE – You can also add Neo Technology and its graph database technology to that list).

In our forthcoming two-part spotlight on this space I’ll articulate in more detail our view on the differentiation of the various NoSQL databases and other big data technologies and their potential enterprise applicability. The first part, on NoSQL and operational databases, is here.