Search Results for 'nosql'

Red Hat considering NoSQL/Hadoop acquisition

Idle speculation over on our CAOS Theory blog.

Hadoop and NoSQL job trends – in context

Recently there has been a spate of postings regarding job trends for distributed data management technologies including Hadoop and the various NoSQL databases.

One thing you rarely see on these job trends charts is comparison with an incumbent technology, for context. There’s a reason for that, as this comparison of database-related jobs from Indeed.com illustrates:

Although there has been a recent increase in job postings related to Hadoop and MongoDB, both are dwarfed, in absolute terms, by the number of job postings involving SQL Server and MySQL.

So why all the fuss about Hadoop and NoSQL, from a corporate perspective? This chart, showing the relative growth for the same data management technologies, says it all:

NoSQL/NewSQL/MySQL is not a zero sum game

It has been fascinating to watch how the industry has responded to ‘NewSQL’ since we published our first report using the term.

From day one the term has taken on a life of its own as vendors such as ScaleBase, VoltDB, NimbusDB and Xeround have picked it up and run with it, while the likes of Marten Mickos and Michael Stonebraker have also adopted the term.

The reaction hasn’t been all positive, of course, although much of the criticism has been of the “are you kidding?” or “this is getting silly” variety rather than constructive debate about either the term or the associated technologies.

Another popular response is along the lines of “does this mean the end of NoSQL?”. I think it is important to address this question because it depends on a common misunderstanding about technology: that in order for the latest technology to succeed it is necessary for the technology that immediately preceded it to fail.

While our report into NoSQL, NewSQL and Beyond identified common drivers for interest in NoSQL and NewSQL databases, as well as data caching/grid technologies, in truth there is a significant difference between the requirements for databases that provide relaxed consistency and/or schema dependency and those that retain the ACID properties of transactional database systems.

Although there will be isolated examples, it is going to be rare, therefore, that any potential adopter would be directly comparing NoSQL and NewSQL technologies unless they are still at the stage of trying to figure out the level of consistency required for an individual application.

The other option they would have is to use an existing SQL database, particularly Oracle’s MySQL, which provides the middle ground that overlaps both NoSQL and NewSQL. A significant number of the NoSQL deployments we have identified have migrated from MySQL, while existing MySQL deployments (although probably not the same ones) are also targets for the numerous NewSQL vendors.

VoltDB is a primary example, as last week’s GigaOm article covering CTO Michael Stonebraker’s view on Facebook’s MySQL ‘fate worse than death’ illustrated.

Much debate (125 comments at last count) has followed Stonebraker’s assertion that Facebook would be better off migrating to a NewSQL offering like VoltDB, most of which has not supported his view.

There’s a good reason for that. There is a good argument to be made that if you were trying to create Facebook from scratch today you probably wouldn’t choose the shard management overhead involved in MySQL. In that regard, Stonebraker has a point.

However, the fact is that MySQL was pretty much the only logical choice when Facebook began and its commitment to MySQL has grown over the years. The company is now probably one of the world’s experts in scaling and managing MySQL – to the extent that Facebook engineer Domas Mituzas argues that the operational overhead in handling sharding and availability of MySQL has become a constant cost.

Under those circumstances it would take something significant for a company like Facebook to even consider migrating to a MySQL alternative. Database migration projects are costly and complex and extremely rare – even at non-Facebook scale.

And it is not as if the company hasn’t experimented with other database technologies – having created Apache Cassandra and adopted Apache HBase for its Messages update.

This is exactly the polyglot persistence strategy we are seeing from NoSQL and NewSQL adopters: retaining MySQL (or another SQL database) where it makes sense to do so, while adding NoSQL and perhaps NewSQL for new projects and applications for which it is appropriate.

One other point to note, however, is that adopting a NewSQL technology might not require migrating away from MySQL. While the NewSQL category includes new database products such as VoltDB, it also includes alternative MySQL storage engines and database load balancing and clustering products such as ScaleBase and ScalArc, which are specifically designed to improve the scalability of MySQL (with other SQL databases to come) in order to avoid migration to an alternative database.

Adoption of these technologies does not require the complete abandonment of ‘standard MySQL’ any more than the adoption of NoSQL for non-ACID application requirements does, and it certainly doesn’t require the abandonment of NoSQL.

Presenting NoSQL, NewSQL and Beyond at OSBC

Next Monday, May 16, I will be hosting a session at the Open Source Business Conference in San Francisco focused on NoSQL, NewSQL and Beyond.

The presentation covers our recently published report of the same name, and provides some additional context on the role of open source in driving innovation in distributed data management.

Specifically, the presentation looks at the evolving influence of open source in the database market and the context for the emergence of new database alternatives.

I’ll be walking through the six core drivers behind the development and adoption of NoSQL and NewSQL databases, as well as data grid/cache technologies – scalability, performance, relaxed consistency, agility, intricacy and necessity – providing some user adoption examples for each.

The presentation also discusses the broader trends impacting data management, providing an introduction to our total data concept and how some of the drivers behind NoSQL and NewSQL are also impacting the role of the enterprise data warehouse, Hadoop, and data management in the cloud.

The presentation begins at 3pm PT on Monday 16. The event is taking place at the Hilton San Francisco Union Square. I hope to see you there.

Necessity is the mother of NoSQL

As we noted last week, necessity is one of the six key factors that are driving the adoption of alternative data management technologies identified in our latest long format report, NoSQL, NewSQL and Beyond.

Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for the incumbent database vendor to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regards to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.

The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s Bigtable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating Systems Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.

The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects:

Not only were the existing database products and vendors not suitable to meet their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, Powerset and Zvents, could not rely on the incumbent vendors to develop anything suitable, given the vendors’ desire to protect their existing technologies and installed bases.

Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.

Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.

That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and LinkedIn to create their own implementations.

These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.

Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.

We would also note that even some of the major vendor-led projects, such as Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former DoubleClick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).

In this context it is interesting to note, therefore, that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.

These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.

NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

NoSQL, NewSQL and Beyond: The answer to SPRAINed relational databases

The 451 Group’s new long format report on emerging database alternatives, NoSQL, NewSQL and Beyond, is now available.

The report examines the changing database landscape, investigating how the failure of existing suppliers to meet the performance, scalability and flexibility needs of large-scale data processing has led to the development and adoption of alternative data management technologies.

Specifically, the report covers:

  • NoSQL databases designed to meet scalability requirements of distributed architectures and/or schema-less data management requirements, including big tables, key value stores, document databases and graph databases
  • NewSQL databases designed to meet scalability requirements of distributed architectures or to improve performance such that horizontal scalability is no longer a necessity, including new MySQL storage engines, transparent sharding technologies, software and hardware appliances, and completely new databases
  • Data grid/cache products designed to store data in memory to increase application and database performance, covering a spectrum of data management capabilities from non-persistent data caching to persistent caching, replication, and distributed data and compute grid functionality

You can see how these products fit into the wider data management landscape from the chart below. The shaded areas are those specifically covered in this report.

The answer to SPRAINed relational databases

SPRAIN, used in the above graphic, is an acronym that refers to the six key factors driving the adoption of alternative data management technologies to traditional relational databases that are being ‘sprained’ as a result of being stretched beyond their normal capacity by the needs of high-volume, highly distributed or highly complex applications.

Those six key drivers, and their associated sub-drivers, are as follows:

  • Scalability – hardware economics
  • Performance – MySQL limitations
  • Relaxed consistency – CAP theorem
  • Agility – polyglot persistence
  • Intricacy – big data, total data
  • Necessity – open source

The report examines each of these drivers and sub-drivers in turn, investigating how they are driving interest in alternative database approaches in general, and how they prompted the development of specific NoSQL, NewSQL and data grid/cache products and services.

It continues with profiles of the individual database alternatives and their use cases and case studies before concluding with a discussion of the impact of these database alternatives on the wider database market and the likely consolidation, confluence and proliferation of various technologies looking forward.

Here’s a selection of some of our key findings:

  • The database market remains dominated by relational databases and the incumbent industry giants, but the emergence of NoSQL and NewSQL alternatives has in part been driven by the inability of these products to address emerging distributed and schema-less data management requirements.
  • Polyglot persistence, and the associated trend toward polyglot programming, is driving developers toward making use of multiple database products depending on which might be suitable for a particular task.
  • The NoSQL projects were developed in response to the failure of existing suppliers to address the performance, scalability and flexibility requirements of large-scale data processing, particularly for Web and cloud computing applications.
  • NewSQL and data-grid products have emerged to meet similar requirements among enterprises, a sector that is now also being targeted by NoSQL vendors.
  • While NoSQL is seen as a software innovation prompted by the need to deal with large volumes of data, the software innovation was a direct response to the improved performance of commodity hardware clusters and the ability to spread data storage and processing across that hardware.
  • Changing hardware economics mean that distributed server architecture is increasingly being adopted in traditional enterprise environments. The emergence of NewSQL providers is a direct response to the increasing need for scalable data management products to make more efficient use of this architecture.
  • Distributed data-grid/cache products are increasingly being positioned as potential alternatives to relational databases as the primary platform for distributed data management, with a relational database relegated to a supporting role.

The report is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

MySQL NoSQL survey highlights role of polyglot persistence

The MySQL developer website is currently running a poll to gauge the adoption of NoSQL database projects by MySQL developers.

The results are interesting, particularly in relation to our research report on the emergence and adoption of NoSQL and NewSQL databases, which I am completing this week.

Our research has shown that one of the drivers of NoSQL has been performance, and in particular the failure of MySQL to provide predictable performance at scale. We do see NoSQL being deployed for applications that previously ran on MySQL, or for which MySQL would previously have been the natural choice.

For example, while Facebook continues to run its core applications on MySQL running the InnoDB storage engine and memcached, it also created what became Apache Cassandra to power its inbox search, and selected Apache HBase for its Messages application, which was updated in late 2010 to combine chat, email, and SMS, having found that MySQL was unable to deliver the performance required for large data sets.

Similarly, content discovery service StumbleUpon adopted HBase following problems with MySQL failover, Digg replaced its MySQL cluster with Apache Cassandra, and Wordnik replaced MySQL with MongoDB.

Clearly, however, not every MySQL application is suitable for a NoSQL database. Just because almost 80% of the MySQL survey respondents are adopting NoSQL databases does not mean they are replacing MySQL with NoSQL.

Like Facebook, many major NoSQL users also continue to use MySQL, including Twitter which back-tracked on a planned migration of its core status table to Apache Cassandra in 2010. It continues to use MySQL, but is adopting Cassandra for newer projects.

The adoption of multiple database products depending on the nature of the application is another of the six major drivers for NoSQL and NewSQL adoption highlighted by our research.

The theory of polyglot persistence has developed based on the fact that different data storage models have their own strengths and the acceptance that while the relational model is suitable for a large proportion of data storage requirements, there are times when a document, graph, or object database might be more suitable, or even a distributed file system.

Facebook and Twitter are prime examples of polyglot persistence in action, and the survey of MySQL developers shows that the practice is widespread. At the time of writing 205 people have responded to the survey, providing 421 responses.

If we exclude the 42 that indicate they are not using a NoSQL database, that means that the remaining 163 people are using 379 NoSQL databases, which equates to 2.33 databases per respondent, not including their existing use of MySQL or other traditional or NewSQL databases.
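The arithmetic behind that figure can be checked with a short sketch. It assumes, as the subtraction above implies, that each of the 42 people not using NoSQL contributed exactly one response; the function and variable names are illustrative only and are not part of the survey.

```python
def nosql_databases_per_user(respondents, responses, non_users):
    """Average number of NoSQL databases per respondent who uses at least one.

    Assumes every non-user gave exactly one response ("not using NoSQL"),
    so those responses are removed from both totals.
    """
    users = respondents - non_users          # people using some NoSQL database
    nosql_responses = responses - non_users  # responses naming a NoSQL database
    return users, nosql_responses, nosql_responses / users


# Figures quoted in the post at the time of writing.
users, dbs, avg = nosql_databases_per_user(respondents=205, responses=421, non_users=42)
print(users, dbs, round(avg, 2))  # 163 379 2.33
```

Because the poll was still open when these numbers were taken, the average is a snapshot rather than a final result.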

I’ll provide more details of the research report, including the other four adoption drivers, once the report is published. The report contains analysis of the drivers behind the development and adoption of NoSQL and NewSQL databases, the evolving role of data grid technologies, and the associated use cases. It will be available soon for clients of our Information Management and CAOS practices.

Webinar: NoSQL and Hadoop in action at AOL

Next Thursday, February 24 (at 10am PT), I’ll be taking part in a webinar with Pero Subasic, Chief Architect, AOL to discuss the use cases for NoSQL databases and Hadoop.

More specifically, Pero will be presenting how AOL Advertising leverages Hadoop and Membase NoSQL database technology to rapidly process operational user data to achieve sub-millisecond performance. Before that, I will be providing some context with a presentation about the changing data management landscape, the drivers behind the adoption of NoSQL databases and Hadoop, and their respective use cases.

My presentation provides a sneak peek into our ongoing research into the drivers and use cases for emerging database technologies, which will be delivered in a new long format report due in early April.

Following Pero’s presentation we will be joined by executives from Couchbase and Cloudera to answer any questions. You can register for the event here, while Couchbase’s James Phillips has provided a taster of what to expect here.

NoSQL consolidation begins…

The predicted consolidation of the NoSQL database landscape has begun. Membase and CouchOne have announced that they are merging to form Couchbase.

And in more interesting NoSQL news, Danish IT company Trifork has announced that it has acquired an 8% stake in Basho as part of the NoSQL vendor’s $7.4m series D round, and has become the European distributor for Riak.

The formation of Couchbase brings together two of the leading companies in the NoSQL space, and the complementary nature of their technology and business plans highlights that the term NoSQL has been applied to many different database technologies which are being adopted for different reasons.

While Membase had focused on improving the performance of distributed applications through its Membase Server distributed database, CouchOne focused on developer interest in flexible document data stores and mobile applications, rather than performance at scale.

Additionally while Membase was focused on operational adoption with a small (albeit significant) developer community, the priority with CouchOne has been on growing adoption of Apache CouchDB, with commercial efforts only recently becoming the focus of attention.

The technology is also complementary. Couchbase will combine the Membase and CouchDB projects to form a new distributed document store project of the same name that combines the caching and clustering technology of Membase with the CouchDB document data store.

The result will be a new distributed document database covering a variety of use cases from mobile applications (Mobile Couchbase) to scalable clusters (Elastic Couchbase), with synchronization of data between the various Couchbase implementations enabled by CouchSync.

The merged company will be led by Bob Wiederhold, formerly CEO of Membase, while Damien Katz, formerly CEO of CouchOne and creator of the CouchDB database, becomes CTO.

Couchbase is claiming more than 200 customers, which would indicate phenomenal growth for both companies since the launch of their CouchOne Mobile and Membase Server products in September and October 2010 respectively.

Prior to the launch of those products they claimed just a handful of customers each, although CouchOne had signed up thousands of users to its free hosted services, so it had a large and willing audience ready for conversion.

Additionally the company claims millions of combined users since CouchDB has been included in every installation of the Ubuntu Linux distribution since late 2009 and Heroku (now part of Salesforce.com) offers a Membase-driven service to thousands of its hosting customers.

We previously predicted that we would see the NoSQL market both consolidate and proliferate this year, and it is worth noting that the merger of CouchOne and Membase will not result in a similar consolidation of open source projects.

While Couchbase.org can be expected to replace membase.org over time, the Couchbase project will be independent of the Apache CouchDB project, which will not be impacted by the merger. Couchbase will continue to contribute to both CouchDB and the memcached project.

While we’re on the subject of NoSQL, it is also interesting to see that Danish IT vendor Trifork has not only signed up to be European distributor of the Riak database, but has also taken a stake in Basho Technologies.

Trifork has acquired newly issued shares in Basho representing 8.35% of the company as part of its series D round, with an option to acquire an additional 3.96% at the end of Q1 2011.

NoSQL – consolidating and proliferating in 2011

Among the numerous prediction pieces doing the rounds at the moment, Bradford Stephens, founder of Drawn to Scale, suggested we could be in for continued proliferation of NoSQL database technologies in 2011, while RedMonk’s Stephen O’Grady predicted consolidation. I agree with both of them.

To understand how NoSQL could both proliferate and consolidate in 2011 it’s important to look at the small print. Bradford was talking specifically about open source tools, while Stephen was writing about commercially successful projects.

Given the levels of interest in NoSQL database technologies, the vast array of use cases, and the various interfaces and development languages – most of which are open source – I predict we’ll continue to see cross-pollination and the emergence of new projects as developers (corporate and individual) continue to scratch their own data-based itches.

However, I think we are also beginning to see a narrowing of the commercial focus on those projects and companies that have enough traction to generate significant business opportunities and revenue, and that a few clear leaders will emerge in the various NoSQL sub-categories (key-value stores, document stores, graph databases and distributed column stores).

We can see previous evidence of the dual impact of proliferation and consolidation in the Linux market. While commercial opportunities are dominated by Red Hat, Novell and Canonical, that has not stopped the continued proliferation of Linux distributions.

The main difference between NoSQL and Linux markets, of course, is that the various Linux distributions all have a common core, and the diversity in the NoSQL space means that we are unlikely to see proliferation on the scale of Linux.

However, I think we’ll see a similar two-tier market emerge with a large number of technically interesting and differentiated open source projects, and a small number of commercially-viable general-purpose category leaders.