How will pro-SQL respond to NoSQL?

Gear6’s Mark Atwood is less than impressed with my recent statement: “Memcached is not a key value store. It is a cache. Hence the name.”

Mark has responded with a post in which he explains how memcached can be used as a key value store with the assistance of “persistent memcached” from Gear6, or by combining memcached with something like Tokyo Cabinet.

As much as I agree with Mark that other technologies can be used to turn memcached into a key value store, I can’t help thinking his post actually proves my point: that memcached itself is not a key value store.
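To illustrate the distinction, here is a toy Python sketch of a bounded cache that evicts its least recently used entry when full. It is a simplified model for illustration, not memcached's actual implementation, and the class name `TinyLRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class TinyLRUCache:
    """A bounded in-memory cache that silently drops old entries when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order doubles as recency order

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)  # refresh recency on overwrite
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self.items:
            return None  # a miss: the caller must go back to the real data source
        self.items.move_to_end(key)  # a hit also refreshes recency
        return self.items[key]


cache = TinyLRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.set("c", 3)  # capacity exceeded, so "a" is evicted
```

After the third write, `cache.get("a")` returns `None` even though the write succeeded earlier. A cache is allowed to forget in this way; a key value store that is treated as the system of record is not.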

Either way it brings me to the next post in the NoSQL series (see also The 451 Group’s recent Spotlight report), looking at what the existing technology providers are likely to do in response.

I spent last week in San Francisco at the Open Source Business Conference where David Recordon, head of open source initiatives at Facebook, outlined how the company makes use of various open source projects, including memcached and MySQL, to scale its infrastructure.

It was an interesting presentation, although the thing that stood out for me was that Recordon didn’t once mention Cassandra, the open source key value store created by Facebook, despite being asked directly about the company’s plans for what was rather quaintly referred to as “non-relational databases”.

In fact, this recent post from Recordon puts Cassandra in context: “we use it for Inbox search, but the majority of development is now being led by Digg, Rackspace, and Twitter”. It is technologies like MySQL and memcached that Facebook is scaling to provide its core horsepower.

The death of memcached, as they say, has been greatly exaggerated.

That said, it is clear that to some extent the rise of NoSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently. Sharding is a popular method of increasing the scalability of the MySQL database to serve the requirements of high-traffic websites, but it requires significant manual effort. The memcached distributed memory object-caching system can also be used to improve performance, but does not provide persistence.
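To make that manual effort concrete, here is a minimal Python sketch of the application-side logic that a sharded database with a cache in front of it implies. Plain dictionaries stand in for the MySQL shards and for memcached, and no real client library is used. The names `ShardedStore`, `shard_for` and `NUM_SHARDS` are hypothetical, chosen for illustration only.

```python
import hashlib

NUM_SHARDS = 4  # assumption: a fixed shard count chosen up front


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Cache-aside reads over hash-sharded storage, with dicts as stand-ins."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]  # stand-in for MySQL shards
        self.cache = {}  # stand-in for memcached: volatile, not persistent
        self.db_reads = 0  # counts reads that had to go to a shard

    def put(self, key, value):
        self.shards[shard_for(key, self.num_shards)][key] = value
        self.cache.pop(key, None)  # invalidate so readers do not see stale data

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no shard access needed
        self.db_reads += 1
        value = self.shards[shard_for(key, self.num_shards)].get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def restart_cache(self):
        """Simulate a cache restart: cached entries vanish, the shards keep the data."""
        self.cache.clear()
```

In this sketch the routing, invalidation and cache population all live in application code, which is where the manual effort comes from. Note also that with simple modulo routing, changing `NUM_SHARDS` remaps most keys, so adding a shard means moving data around.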

An alternative to throwing out investments in MySQL and memcached in favor of NoSQL, however, is to improve the MySQL/memcached combination. A number of vendors, including Gear6 and NorthScale, are developing and delivering technologies that add persistence to memcached (see recent 451 Group coverage on Gear6 and NorthScale), while Schooner Information Technology (451 coverage) and Virident Systems (451 coverage) have taken an appliance-based approach to adding persistence.

Another approach would be to improve the performance of MySQL itself. ScaleDB (451 coverage) has a shared-disk storage engine for MySQL that promises to improve its scalability. We have also recently come across GenieDB (451 coverage), which is promising a massively distributed data storage engine for MySQL. Additionally, Tokutek’s TokuDB MySQL storage engine is based on Fractal Tree indexing technology that reduces data-insertion times, improving the performance of MySQL for both read and write applications.

As we noted in our recent assessment of Tokutek, while TokuDB is effectively an operational database technology, it does blur the line between operations and analytics since the company claims it delivers a performance improvement sufficient to run ad hoc queries against live data.

Beyond MySQL, while we expect the database incumbents to feel the impact of NoSQL in certain use cases, the lack of consistency (in the CAP Theorem sense) in NoSQL databases inevitably enables quick dismissal of their wider applicability. Additionally, we expect to see the data management vendors take steps to improve performance and scalability. One method is through the use of in-memory databases to improve performance for repeatedly accessed data; another is through the use of in-memory data grid caching technologies, which are designed to solve both performance and scalability issues.

Although these technologies do not provide the scalability required by Facebook, Amazon, et al., the question is, how many applications need that level of scalability? Returning again to CAP Theorem, if we assume that most applications do not require the levels of partition tolerance seen at Google, expect the incumbents to argue that what they lack in partition tolerance they can make up for in consistency and availability.

Somewhat inevitably, the requirements mandated by NoSQL advocates will be watered down for enterprise adoption. At that level, it may arguably be easier for incumbent vendors to sacrifice a little consistency and availability for partition tolerance than it will be for NoSQL projects to add consistency and availability.

Much will depend on the workload in question, which is something that is being hidden by debates that assume a confrontational relationship between SQL and NoSQL databases. As the example of Facebook suggests, there is room for both MySQL/memcached and NoSQL.

Categorizing the “Foo” fighters – making sense of NoSQL

One of the essential problems with covering the NoSQL movement is that it describes not what the associated databases are, but what they are not (and doesn’t even do that very well since SQL itself is in many cases orthogonal to the problem the databases are designed to solve).

It is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no one has adequately defined what “Foo” is.

Curt has proposed HVSP (High-Volume Simple Processing) as an alternative to NoSQL, and while I’m not jumping on the bandwagon just yet, it does pass the Ronseal test (it does what it says on the tin), and it also matches my view of what defines these distributed data store technologies.

Some observations:

  • I agree with Curt’s view that object-oriented and XML databases should not be considered part of this new breed of distributed data store technologies. There is a danger that NoSQL simply comes to mean non-relational.
  • I also agree that MapReduce and Hadoop should not be considered part of this category of data management technologies (which is somewhat ironic since if there is any technology for which the terms NoSQL or Not Only SQL are applicable, it is MapReduce).
  • The vendors associated with the NoSQL movement (Basho, Couchio and MongoDB) are in a problematic position. While they are benefiting from, and to some extent encouraging, interest in NoSQL, the overall term masks their individual benefits. My sense is they will look to move away from it sooner rather than later.
  • Memcached is not a key value store. It is a cache. Hence the name.
    There are numerous categorizations of the various NoSQL technologies available on the Internet. Without wishing to add yet another to the mix, I have created another one – more for my benefit than anything else.

    It includes a list of users for the various projects (where available), and also some sense of whether the various projects fit into CAP Theorem, an understanding of which is, to my mind, essential for understanding how and why the NoSQL/HVSP movement has emerged (look out for more on CAP Theorem in a follow-up post on alternatives to NoSQL).

    Here’s my take, for those that are interested. As you can see there’s a graph database-shaped hole in my knowledge. I’m hoping to fill that sooner rather than later.

    By the way, our Spotlight report introducing The 451 Group’s formal coverage of NoSQL databases will be available here imminently.

    Update: VMware has announced that it has hired Redis creator Salvatore Sanfilippo, and is taking on the Redis key value store project. The image below has been updated to reflect that, as well as the launch of NorthScale’s Membase.