The Data Day, A few days: May 1-May 8, 2015

Google launches Cloud Bigtable. And more.

And that’s the data day, today.

It’s the end of NoSQL as we know it (and I feel fine)

Last week I tweeted that this week was shaping up to be a watershed week in the history of NoSQL. I was referring, of course, to MongoDB launching 3.0 and DataStax acquiring Aurelius – although more specifically what the context of these two announcements tells us about the future of NoSQL.

While each of these announcements could be considered significant in its own right in combination they suggest a new stage in the evolution of NoSQL and a clear signal that the future of NoSQL will be driven by database products that support multiple data models.

When we formally started covering NoSQL in 2010 it made sense to divide the various projects into four groups: key value stores, distributed (wide) column stores (or BigTable clones), graph databases, and document-oriented databases.

By early 2013 it had become obvious that there was another emerging category: multi-model databases.

Multi-model NoSQL databases have therefore been around for several years but while we have seen growing interest in these multi-model databases, in terms of widespread adoption they still lagged behind the early specialist NoSQL databases. That’s what makes the recent announcements by MongoDB and DataStax so significant.

    1. Along with releasing version 3.0 of its document database, MongoDB also began to share (at least with us) its long-term multi-model vision for MongoDB, explaining how the pluggable storage engine architecture could enable the database to support multiple data models – such as key value, graph and relational.
    1. Meanwhile DataStax described how its acquisition of Aurelius will see it developing a graph database to complement Apache Cassandra’s wide column key value model, and explained its multi-model strategy.
  • Multi-model momentum may have been growing for years but the fact that the commercial providers behind the two most popular NoSQL databases have detailed their plans to go multi-model confirms that the multi-model approach is the future of NoSQL.

    Indeed, since we expect to see similar moves from other NoSQL players it will become increasingly difficult to divide the NoSQL space in terms of key value stores, wide column stores, graph databases, and document-oriented databases. Instead it makes sense to divide the NoSQL projects in terms of whether they are single-model or multi-model.

    451 Research clients can read more about our perspectives on MongoDB’s strategic direction, as well as DataStax’s acquisition of Aurelius, and the wider implications for the NoSQL sector.

    Necessity is the mother of NoSQL

    As we noted last week, necessity is one of the six key factors that are driving the adoption of alternative data management technologies identified in our latest long format report, NoSQL, NewSQL and Beyond.

    Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for the incumbent database vendor to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regards to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.

    The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s BigTable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating System Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.

    The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects:

    Not only were the existing database products and vendors were not suitable to meet their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, PowerSet and Zvents, could not rely on the incumbent vendors to develop anything suitable, given the vendors’ desire to protect their existing technologies and installed bases.

    Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.

    Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.

    That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and Linkedin to create their own implementations.

    These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

    Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.

    Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.

    We would also note that even some of the major vendor-led projects, such as Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former Doubleclick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).

    In this context it is interesting to note, therefore, that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.

    These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.

    NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

    User perspectives on NoSQL

    The NoSQL EU event in London this week was a great event with interesting perspectives from both vendors – Basho, Neo Technology, 10gen, Riptano – and also users – The Guardian, the BBC, Amazon, Twitter. In particular I was interested in learning from the latter about how and why they ended up using alternatives to the traditional relational database model.

    Some of the reasons for using NoSQL have been well-documented: Amazon CTO Werner Vogels talked about how the traditional database offerings were unable to meet the scalability Amazon.com requires. Filling a functionality void also explains why Facebook created Cassandra, Google created BigTable, and Twitter created FlockDB (etc etc). As Werner said, “We couldn’t bet the company on other companies building the answer for us.”

    As Werner also explained, however, the motivation for creating Dynamo was also about enabling choice and ensuring that Amazon was not trying to force the relational database to do something it was not designed to do. “Choosing the right tool for the job” was a recurring theme at NoSQL EU.

    Given the NoSQL name it is easy to assume that this means that the relational database is by default “the wrong tool”. However, the most important element in that statement is arguably not “tool”, but “job” and The Guardian discussed how it was using non-relational data tools to create new applications that complement its ongoing investment in the Oracle database.

    For example, the Guardian’s application to manage the progress of crowdsourcing the investigation of MP’s expenses is based on Redis, while the Zeitgeist trending news application runs on Google’s AppEngine, as did its live poll during the recent leader’s election debate. Datablog, meanwhile, relies on Google Spreadsheets to serve up usable and downloadable data – we’ll ignore for a moment whether Google Spreadsheets is a NoSQL database 😉

    Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. The overarching theme, as Matthew Wall and Simon Willison explained, is that the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.

    On the subject of choosing the right tool for the job, Basho’s engineering manager Brian Fink pointed out that using NoSQL technology alongside relational SQL database technology may actually improve the performance of the SQL database since storing data in a relational database that does not need SQL features slows down access to data that does need SQL features.

    Another perspective on this came from Werner Vogels who noted that unlike database administrators/ systems architects, users don’t care about where data resides or what model it uses – as long as they get the service they require. Werner explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems. Users do not think about data sources in isolation, they care about the amalgamated service.

    This was also a theme that cropped up in the presentation by Enda Farrell, software architect at the BBC, who noted that the BBC’s homepage is a PHP application integrated with multiple data sources at multiple data centers, and also Twitter‘s analytics lead Kevin Weil, who described Twitter’s use of Hadoop, Pig, HBase, Cassandra and FlockDB.

    While the company is using HBase for low-latency analytic applications such as people search and moving to Cassandra from MySQL for its online applications, it uses its recently open-sourced FlockDB graph database to serve up data on followers and correlate the intersection of followers to (for example) ensure that Tweets between two people are only sent to the followers of both. (As something of an aside, Twitter is using Hadoop to store the 7TB of of data its generates a day from Tweets, and Pig for non-real time analytics).

    Kevin noted that the company is also working with Digg to build real-time analytics for Cassandra and will be releasing the results as open source, and also discussed how Twitter has made use of open source technologies created by others such as Facebook (both Cassandra and the Scribe log data aggregation server.

    One of the issues that has arisen from the fact that organizations such as Amazon and Facebook have had to create their own data management technologies is the proliferation of NoSQL databases and a certain amount of wheel re-invention.

    Werner explained that SmugMug creator Don Macaskill ended up being a MySQL expert not because he necessarily wanted to be, but because he needed to be because he had to be to keep his applications running.

    “He doesn’t want to have to become an expert in Cassandra,” noted Werner. “What he wants is to have someone run it for him and take care of that.” Presumably Riptano, the new Cassandra vendor formed by Jonathan Ellis – project chair for the Cassandra database – will take care of that, but in the meantime Werner raised another long-term alternative.

    “We shouldn’t all be doing this,” he said, adding that Dynamo is not as popular within Amazon Web Services as it once was as it is a product, that requires configuration and management, rather than a service, and Amazon employees “have better things to do.”

    Which raises the question – don’t Twitter, Facebook, the BBC, the Guardian et al have better things to do than developing and maintaining database architecture? In a perfect world, yes. But in a perfect world they’d all have strongly consistent, scalable distributed database systems/services that are suited to their various applications.

    Interestingly, describing S3 as “a better key/value store than Dynamo”, Werner noted that SimpleDB and S3 are “a good start to provide that service”.

    Saying yes to NoSQL

    As a company, The 451 Group has built its reputation on taking a lead in covering disruptive technologies and vendors. Even so, with a movement as hyped as NoSQL databases, it sometimes pays to be cautious.

    In my role covering data management technologies for The 451 Group’s Information Management practice I have been keeping an eye on the NoSQL database movement for some time, taking the time to understand the nuances of the various technologies involved and their potential enterprise applicability.

    That watching brief has now spilled over into official coverage, following our recent assessment of 10gen. I also recently had the chance to meet up with Couchio’s VP of business development, Nitin Borwankar (see coverage initiation of Couchio). I’ve also caught up with Basho Technologies sooner rather than later. A report on that is now imminent.

    There are a couple of reasons why I have formally began covering the NoSQL databases. The first is the maturing of the technologies, and the vendors behind them, to the point where they can be considered for enterprise-level adoption. The second is the demand we are getting from our clients to provide our view of the NoSQL space and its players.

    This is coming both from the investment community and from existing vendors, either looking for potential partnerships or fearing potential competition. The number of queries we have been getting related to NoSQL and big data have encouraged articulation of my thoughts, so look-out for a two-part spotlight on the implications for the operational and analytical database markets in the coming weeks.

    The biggest reason, however, is the recognition that the NoSQL movement is a user-led phenomena. There is an enormous amount of hype surrounding NoSQL but for the most part it is not coming from vendors like 10gen, Couchio and Basho (although they may not be actively discouraging it) but from technology users.

    A quick look at the most prominent key-value and column-table NoSQL data stores highlights this. Many of these have been created by user organizations themselves in order fill a void and overcome the limitations of traditional relational databases – for example Google (BigTable), Yahoo (Hbase), Zvents (Hypertable), LinkedIn (Voldemort), Amazon (Dynamo), and Facebook (Cassandra).

    It has become clear that traditional database technologies do need meet the scalability and performance requirements of dealing with big data workloads, particularly at a scale experienced by social networking services.

    That does raise the question of how applicable these technologies will be to enterprises that do not share the architecture of the likes of Google, Facebook and LinkedIn – at least in the short-term. Although there are users – Cassandra users include Rackspace, Digg, Facebook, and Twitter, for example.

    What there isn’t – for the likes of Cassandra and Voldemort, at least – is vendor-based support. That inevitably raises questions about the general applicability of the key-value/column table stores. As Dave Kellog notes, “unless you’ve got Google’s business model and talent pool, you probably shouldn’t copy their development tendencies”.

    Given the levels of adoption it seems inevitable that vendors will emerge around some of these projects, not least since, as Dave puts it, “one day management will say: ‘Holy Cow folks, why in the world are we paying programmers to write and support software at this low a level?'”

    In the meantime, it would appear that the document-oriented data stores (Couchio’s CouchDB, 10gen’s MongoDB, Basho’s Riak) are much more generally applicable, both technologically and from a business perspective. UPDATE – You can also add Neo Technology and its graph database technology to that list).

    In our forthcoming two-part spotlight on this space I’ll articulate in more detail our view on the differentiation of the various NoSQL databases and other big data technologies and their potential enterprise applicability. The first part, on NoSQL and operational databases, is here.