NoSQL Road Show, Hadoop Tuesdays and Hadoop World

I’ll be taking our data management research out on the road in the next few months with a number of events, webinars and presentations.

On October 12 I’m taking part in the NoSQL Road Show Amsterdam, with Basho, Trifork and Erlang Solutions, where I’ll be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

The following week, October 18, I’m taking part in the Hadoop Tuesdays series of webinars, presented by Cloudera and Informatica, specifically talking about the Hadoop Ecosystem.

The Apache Hadoop ecosystem will again be the focus of attention on November 8 and 9, when I’ll be in New York for Hadoop World, presenting The Blind Men and the Elephant.

Then it’s back to NoSQL with two more stops on the NoSQL Road Show, in London on November 29 and Stockholm on December 1, where I’ll once again be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

I hope you can join us for at least one of these events, and am looking forward to learning a lot about NoSQL and Apache Hadoop adoption, interest and concerns.

Red Hat considering NoSQL/Hadoop acquisition

Idle speculation over on our CAOS Theory blog.

Couchbase and SQLite cry UnQL for unstructured queries

NoSQL has never really been about SQL. As we pointed out in our NoSQL, NewSQL and Beyond report, “[one] of the NoSQL idiosyncrasies is that in most cases SQL itself is not the ‘problem’ being avoided. Indeed, a better term might be ‘NoSchema,’ given that a more common quality is a rejection of fixed table schema and join operations”.

Nevertheless the NoSQL term has stuck, and also inspired NewSQL (which, as critics have pointed out, is not really about SQL either), while a number of NoSQL providers started to look at how they could actually add support for SQL queries to their respective databases.

The recently-released version 0.8 of Apache Cassandra features the first implementation of Cassandra Query Language (CQL), an SQL-like query language, for Cassandra.

Meanwhile Couchbase and SQLite have teamed up to create UnQL (Unstructured Query Language), a new data query language for unstructured data. Pronounced ‘uncle’, UnQL is designed to remove the burden of query planning, optimization and execution from NoSQL developers by providing an adaptation of the SQL structured query language for unstructured data models.

As can be seen by an example of the draft syntax, UnQL is designed to be familiar to SQL developers, while also enabling querying over complex and unstructured storage models, such as document models.

UnQL was created by Couchbase CTO and CouchDB creator Damien Katz, alongwith SQLite creator and founder Richard Hipp and both Couchbase and SQLite have committed to implementing UnQL in future versions of their database products.

UnQL is not designed to be specific to select database products, however, and the specification is being released to the public domain at www.unqlspec.org. There is also the potential that open source parsers and query planning implementations will be created to foster adoption.

One of the principle drivers behind UnQL’s development is that a common query language is necessary to drive NoSQL adoption in the same way SQL drove adoption in the relational database market.

It remains to be seen whether UnQL will be picked up by other projects, although the release to the public domain should give confidence that this is not an attempt to force the industry to adopt a ‘standard’ from a single vendor.

Hadoop and NoSQL job trends – in context

Recently there have been a spate of postings regarding job trends for distributed data management technologies including Hadoop and the various NoSQL databases.

One thing you rarely see on these job trends charts is comparison with an incumbent technology, for context. There’s a reason for that, as this comparison of database-related jobs from Indeed.com illustrates:

Although there has been a recent increase in job postings related to Hadoop and MongoDB, both are dwarfed, in absolute terms, by the number of job postings involving SQL Server and MySQL.

So why all the fuss about Hadoop and NoSQL, from a corporate perspective? This chart, showing the relative growth for the same data management technologies, says it all:

NoSQL/NewSQL/MySQL is not a zero sum game

It has been fascinating to watch how the industry has responded to ‘NewSQL’ since we published our first report using the term.

From day one the term has taken on a life of its own as the vendors such as ScaleBase, VoltDB, NimbusDB and Xeround have picked it up and run with it , while the likes of Marten Mickos and Michael Stonebraker have also adopted the term.

The reaction hasn’t been all positive, of course, although much of the criticism has been of the “are you kidding?” or “this is getting silly” variety rather than constructive debate about either the term or the associated technologies.

Another popular response is along the lines of “does this mean the end of NoSQL?”. I think it is important to address this question because it depends on a common misunderstanding about technology: that in order for the latest technology to succeed it is necessary for the technology that immediately preceded it to fail.

While our report into NoSQL, NewSQL and Beyond identified common drivers for interest in NoSQL and NewSQL databases, as well as data caching/grid technologies, in truth there is a significant difference between the requirements for databases that provide relaxed consistency and/or schema dependency and those that retain the ACID properties of transactional database systems.

Although there will be isolated examples, it is going to be rare, therefore, that any potential adopter would be directly comparing NoSQL and NewSQL technologies unless they are still at the stage trying to figure out the level of consistency required for an individual application.

The other option they would have is to use an existing SQL database, particularly Oracle’s MySQL, which provides the middle ground that overlaps both NoSQL and NewSQL. A significant number of the NoSQL deployments we have identified have migrated from MySQL, while existing MySQL deployments (although probably not the same ones) are also targets for the numerous NewSQL vendors.

VoltDB is a primary example, as last’s week’s GigaOm article covering CTO Michael Stonebraker’s view on Facebook’s MySQL ‘fate worse than death’ illustrated.

Much debate (125 comments at last count) has followed Stonebraker’s assertion that Facebook would be better off migrating to a NewSQL offering like VoltDB, most of which has not supported his view.

There’s a good reason for that. There is a good argument to be made that if you were trying to create Facebook from scratch today you probably wouldn’t choose the shard management overhead involved in MySQL. In that regard, Stonebraker has a point.

However, the fact is that MySQL was pretty much the only logical choice when Facebook began and its commitment to MySQL has grown over the years. The company is now probably one of the world’s experts in scaling and managing MySQL – to the extent that Facebook engineer Domas Mituzas argues that the operational overhead in handling sharding and availability of MySQL has become a constant cost.

Under those circumstances it would take something significant for a company like Facebook to even consider migrating to a MySQL alternative. Database migration projects are costly and complex and extremely rare – even at non-Facebook scale.

And it is not as if the company hasn’t experimented with other database technologies – having created Apache Cassandra and adopted Apache HBase for its Messages update.

This is exactly the polyglot persistence strategy we are seeing from NoSQL and NewSQL adopters: retaining MySQL (or another SQL database) where is makes sense to do so, while adding NoSQL and perhaps NewSQL for new projects and applications for which it is appropriate.

One other point to note, however, is that adopting a NewSQL technology might not require migrating away from MySQL. While the NewSQL category includes new database products such as VoltDB, it also includes alternative MySQL storage engines and database load balancing and clustering products such as ScaleBase and ScalArc, which are specifically designed to improve the scalability of MySQL (with other SQL databases to come) in order to avoid migration to an alternative database.

Adoption of these technologies does not require the complete abandonment of ‘standard MySQL’ any more than the adoption of NoSQL for non-ACID application requirements does, and it certainly doesn’t require the abandonment of NoSQL.

Presenting NoSQL, NewSQL and Beyond at OSBC

Next Monday, May 16, I will be hosting session at the Open Source Business Conference in San Francisco focused on NoSQL, NewSQL and Beyond.

The presentation covers our recently published report of the same name, and provides some additional context on the role of open source in driving innovation in distributed data management.

Specifically, the presentation looks at the evolving influence of open source in the database market and the context for the emergence of new database alternatives.

I’ll be walking through the six core drivers that have driven the development and adoption of NoSQL and NewSQL databases, as well as data grid/cache technologies – scalability, performance, relaxed consistency, agility, intricacy and necessity – providing some user adoption examples for each.

The presentation also discusses the broader trends impacting the data management, providing an introduction to our total data concept and how some of the drivers behind NoSQL and NewSQL are also impacting the role of the enterprise data warehouse, Hadoop, and data management in the cloud.

The presentation begins at 3pm PT on Monday 16. The event is taking place at the Hilton San Francisco Union Square. I hope to see you there.

Necessity is the mother of NoSQL

As we noted last week, necessity is one of the six key factors that are driving the adoption of alternative data management technologies identified in our latest long format report, NoSQL, NewSQL and Beyond.

Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for the incumbent database vendor to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regards to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.

The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s BigTable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating System Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.

The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects:

Not only were the existing database products and vendors were not suitable to meet their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, PowerSet and Zvents, could not rely on the incumbent vendors to develop anything suitable, given the vendors’ desire to protect their existing technologies and installed bases.

Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.

Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.

That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and Linkedin to create their own implementations.

These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.

Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.

We would also note that even some of the major vendor-led projects, such as Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former Doubleclick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).

In this context it is interesting to note, therefore, that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.

These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.

NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

NoSQL, NewSQL and Beyond: The answer to SPRAINed relational databases

The 451 Group’s new long format report on emerging database alternatives, NoSQL, NewSQL and Beyond, is now available.

The report examines the changing database landscape, investigating how the failure of existing suppliers to meet the performance, scalability and flexibility needs of large-scale data processing has led to the development and adoption of alternative data management technologies.

Specifically, the report covers:

  • NoSQL databases designed to meet scalability requirements of distributed architectures and/or schema-less data management requirements, including big tables, key value stores, document database and graph databases
  • NewSQL databases designed to meet scalability requirements of distributed architectures or to improve performance such that horizontal scalability is no longer a necessity, including new MySQL storage engines, transparent sharding technologies, software and hardware appliances, and completely new databases
  • Data grid/cache products designed to store data in memory to increase application and database performance, covering a spectrum of data management capabilities from non-persistent data caching to persistent caching, replication, and distributed data and compute grid functionality

You can see how these products fit into the wider data management landscape from the chart below. The shaded areas are those specifically covered in this report.

The answer to SPRAINed relational databases

SPRAIN, used in the above graphic, is an acronym that refers to the six key factors driving the adoption of alternative data management technologies to traditional relational databases that are being ‘sprained’ as a result of being stretched beyond their normal capacity by the needs of high-volume, highly distributed or highly complex applications.

Those six key drivers, and their associated sub-drivers, are as follows:

  • Scalability – hardware economics
  • Performance – MySQL limitations
  • Relaxed consistency – CAP theorem
  • Agility – polyglot persistence
  • Intricacy – big data, total data
  • Necessity – open source

The report examines each of these drivers and sub-drivers in turn, investigating how they are driving interest in alternative database approaches in general, and how they prompted the development of specific NoSQL, NewSQL and data grid/cache products and services.

It continues with profiles of the individual database alternatives and their use cases and case studies before concluding with a discussion of the impact of these database alternatives on the wider database market and the likely consolidation, confluence and proliferation of various technologies looking forward.

Here’s a selection of some of our key findings:

  • The database market remains dominated by relational databases and the incumbent industry giants, but the emergence of NoSQL and NewSQL alternatives has in part been driven by the inability of these products to address emerging distributed and schema-less data management requirements.
  • Polyglot persistence, and the associated trend toward polyglot programming, is driving developers toward making use of multiple database products depending on which might be suitable for a particular task.
  • The NoSQL projects were developed in response to the failure of existing suppliers to address the performance, scalability and flexibility requirements of large-scale data processing, particularly for Web and cloud computing applications.
  • NewSQL and data-grid products have emerged to meet similar requirements among enterprises, a sector that is now also being targeted by NoSQL vendors.
  • While NoSQL is seen as a software innovation prompted by the need to deal with large volumes of data, the software innovation was a direct response to the improved performance of commodity hardware clusters and the ability to spread data storage and processing across that hardware.
  • Changing hardware economics mean that distributed server architecture is increasingly being adopted in traditional enterprise environments. The emergence of NewSQL providers is a direct response to the increasing need for scalable data management products to make more efficient use of this architecture.
  • Distributed data-grid/cache products are increasingly being positioned as potential alternatives to relational databases as the primary platform for distributed data management, with a relational database relegated to a supporting role.

The report is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

What we talk about when we talk about NewSQL

Yesterday The 451 Group published a report asking “How will the database incumbents respond to NoSQL and NewSQL?”

That prompted the pertinent question, “What do you mean by ‘NewSQL’?”

Since we are about to publish a report describing our view of the emerging database landscape, including NoSQL, NewSQL and beyond (now available), it probably is a good time to define what we mean by NewSQL (I haven’t mentioned the various NoSQL projects in this post, but they are covered extensively in the report. More on them another day).

“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report.

And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

So who would be consider to be the NewSQL vendors? Like NoSQL, NewSQL is used to describe a loosely-affiliated group of companies (ScaleBase has done a good job of identifying, some of the several NewSQL sub-types) but what they have in common is the development of new relational database products and services designed to bring the benefits of the relational model to distributed architectures, or to improve the performance of relational databases to the extent that horizontal scalability is no longer a necessity.

In the first group we would include (in no particular order) Clustrix, GenieDB, ScalArc, Schooner, VoltDB, RethinkDB, ScaleDB, Akiban, CodeFutures, ScaleBase, Translattice, and NimbusDB, as well as Drizzle, MySQL Cluster with NDB, and MySQL with HandlerSocket. The latter group includes Tokutek and JustOne DB. The associated “NewSQL-as-a-service” category includes Amazon Relational Database Service, Microsoft SQL Azure, Xeround, Database.com and FathomDB.

(Links provide access to 451 Group coverage for clients. Non-clients can also apply for trial access).

Clearly there is the potential for overlap with NoSQL. It remains to be seen whether RethinkDB will be delivered as a NoSQL key value store for memcached or a “NewSQL” storage engine for MySQL, for example. While at least one of the vendors listed above is planning to enable the use of its database as a schema-less store, we also expect to see support for SQL queries added to some NoSQL databases. We are also sure that Citrusleaf won’t be the last NoSQL vendor to claim support for ACID transactions.

NewSQL is not about attempting to re-define the database market using our own term, but it is useful to broadly categorize the various emerging database products at this particular point in time.

Another clarification: ReadWriteWeb has picked up on this post and reported on the “NewSQL Movement”. I don’t think there is a movement in that sense that we saw the various NoSQL projects/vendors come together under the NoSQL umbrella with a common purpose. Perhaps the NewSQL players will do so (VoltDB and NimbusDB have reacted positively to the term, and Tokutek has become the first that I am aware of to explicitly describe its technology as NewSQL). As Derek Stainer notes, however: ” In the end it’s just a name, a way to categorize a group of similar solutions.”

In the meantime, we have already noted the beginning for the end of NoSQL, and the lines are blurring to the point where we expect the terms NoSQL and NewSQL will become irrelevant as the focus turns to specific use cases.

The identification of specific adoption drivers and use cases is the focus of our forthcoming long-form report on NoSQL, NewSQL and beyond, from which the 451 Group reported cited above is excerpted.

The report contains an overview of the roots of NoSQL and profiles of the major NoSQL projects and vendors, as well as analysis of the drivers behind the development and adoption of NoSQL and NewSQL databases, the evolving role of data grid technologies, and associated use cases.

It will be available very soon from the Information Management and CAOS practices and we will also publish more details of the key drivers as we see them and our view of the current database landscape here.

MySQL NoSQL survey highlights role of polyglot persistence

The MySQL developer website is currently running a poll to gauge the adoption of NoSQL database projects by MySQL developers.

The results are interesting, particularly in relation to our research report on the emergence and adoption of NoSQL and NewSQL databases, which I am completing this week.

Our research has shown that one of the drivers of NoSQL has been performance, and in particular the failure of MySQL to provide predictable performance at scale. We do see NoSQL being deployed for applications that previously ran on MySQL, or for which MySQL would previously have been the natural choice.

For example, while Facebook continues to run its core applications on MySQL running the InnoDB storage engine and memcached it also created what became Apache Cassandra to power its inbox search, and selected Apache HBase for its Messages application, which was updated in late 2010 to combine chat, email, and SMS, having found that MySQL was unable to deliver the performance required for large data sets.

Similarly, content discovery service StumbleUpon adopted HBase following problems with MySQL failover, Digg replaced its MySQL cluster with Apache Cassandra, and Wordnik replaced MySQL with MongoDB.

Clearly, however, not every MySQL application is suitable for a NoSQL database. Just because almost 80% of the MySQL survey respondents are adopting NoSQL database, does not mean they are replacing MySQL with NoSQL.

Like Facebook, many major NoSQL users also continue to use MySQL, including Twitter which back-tracked on a planned migration of its core status table to Apache Cassandra in 2010. It continues to use MySQL, but is adopting Cassandra for newer projects.

The adoption of multiple database products depending on the nature of the application is another of the six major drivers for NoSQL and NewSQL adoption highlighted by our research.

The theory of polyglot persistence has developed based on the fact that different data storage models have their own strengths and the acceptance that while the relational model is suitable for a large proportion of data storage requirements, there are times when a document, graph, or object database might be more suitable, or even a distributed file system.

Facebook and Twitter are prime examples of polyglot persistence in action, and the survey of MySQL developers shows that the practice is widespread. At the time of writing 205 people have responded to the survey, providing 421 responses.

If we exclude the 42 that indicate they are not using a NoSQL database, that means that the remaining 163 people are using 379 NoSQL databases, which equates to 2.33 databases per respondent, not including their existing use of MySQL or other traditional or NewSQL databases.

I’ll provide more details of the research report, including the other four adoption drivers, once the report is published. The report contains analysis of the drivers behind the development and adoption of NoSQL and NewSQL databases, as well as the evolving role of data grid technologies, as well as the associated use cases. It will be available soon for clients of our Information Management and CAOS practices.