Categorizing the “Foo” fighters – making sense of NoSQL

One of the essential problems with covering the NoSQL movement is that the term describes not what the associated databases are, but what they are not (and doesn’t even do that very well, since SQL itself is in many cases orthogonal to the problem these databases are designed to solve).

It is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no one has adequately defined what “Foo” is.

Curt has proposed HVSP (High-Volume Simple Processing) as an alternative to NoSQL, and while I’m not jumping on the bandwagon just yet, it does pass the Ronseal test (it does what it says on the tin), and it also matches my view of what defines these distributed data store technologies.

Some observations:

  • I agree with Curt’s view that object-oriented and XML databases should not be considered part of this new breed of distributed data store technologies. There is a danger that NoSQL simply comes to mean non-relational.
  • I also agree that MapReduce and Hadoop should not be considered part of this category of data management technologies (which is somewhat ironic since if there is any technology for which the terms NoSQL or Not Only SQL are applicable, it is MapReduce).
  • The vendors associated with the NoSQL movement (Basho, Couchio and MongoDB backer 10gen) are in a problematic position. While they are benefiting from, and to some extent encouraging, interest in NoSQL, the catch-all term masks their individual strengths. My sense is they will look to move away from it sooner rather than later.
  • Memcached is not a key value store. It is a cache. Hence the name. (A quick sketch of the difference follows below this list.)
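To illustrate that last point, here is a minimal Python sketch of the read-through pattern memcached exists to serve (plain dictionaries and names of my own invention stand in for memcached and the backing database): a cache miss is routine, entries can expire or be evicted at any time, and the authoritative copy always lives elsewhere. A key value store, by contrast, is itself the system of record.

```python
import time

cache = {}                       # stands in for memcached: disposable copies
database = {"user:1": "alice"}   # stands in for the system of record

TTL_SECONDS = 60

def cache_get(key):
    """Return a cached value, or None if missing or expired."""
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.time() > expires_at:
        del cache[key]           # expired data simply vanishes; nothing is lost
        return None
    return value

def read_through(key):
    """A miss is normal; the database remains authoritative."""
    value = cache_get(key)
    if value is None:
        value = database[key]    # fall back to the real store
        cache[key] = (value, time.time() + TTL_SECONDS)
    return value

print(read_through("user:1"))    # miss: loaded from the database
print(read_through("user:1"))    # hit: served from the cache
```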
There are numerous categorizations of the various NoSQL technologies available on the Internet. At the risk of adding yet another to the mix, I have created my own – more for my benefit than anything else.

It includes a list of users for the various projects (where available), and also some sense of where the various projects fit with respect to the CAP theorem – which, briefly, holds that a distributed system can guarantee at most two of consistency, availability and partition tolerance. An understanding of that trade-off is, to my mind, essential for understanding how and why the NoSQL/HVSP movement has emerged (look out for more on the CAP theorem in a follow-up post on alternatives to NoSQL).

Here’s my take, for those that are interested. As you can see there’s a graph database-shaped hole in my knowledge. I’m hoping to fill that sooner rather than later.

    By the way, our Spotlight report introducing The 451 Group’s formal coverage of NoSQL databases will be available here imminently.

    Update: VMware has announced that it has hired Redis creator Salvatore Sanfilippo, and is taking on the Redis key value store project. The image below has been updated to reflect that, as well as the launch of NorthScale’s Membase.

    Because 20+ data warehousing vendors is never enough

    In our recent report on the data warehousing market we speculated that there would soon be a change in the number of vendors operating in what is a crowded market. We were anticipating that the number of vendors would go down, rather than up, but – in the short term at least – we have been proved wrong, as two new open source analytical databases emerged this week.

First came the formation of Dynamo Business Intelligence Corp (aka Dynamo BI), which will provide a commercially supported distribution of LucidDB and sponsor the project. Then came the launch of InfiniDB Community Edition, a new open source analytic database from Calpont, based on MySQL.

We actually included Calpont in our report, but its product plans at that time looked precarious to say the least, as its intention to launch a data warehousing platform based on MySQL was overshadowed by Oracle’s acquisition of Sun.

    We were somewhat sceptical about whether Calpont – which has had a couple of false starts in the past – would find a way to bring something to market and we are impressed that the company has reached a licensing agreement with Sun that supports its open source and commercial aims.

Specifically, the company has arranged an OEM agreement with Sun for the MySQL Community Server version that enables it to be used with both Calpont’s open source and commercially licensed products. The first of those is InfiniDB Community Edition, a column-oriented, multi-threaded data warehouse platform that acts as a storage engine for MySQL.
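As a storage engine, InfiniDB should be selectable per table through otherwise ordinary MySQL DDL. The sketch below is illustrative rather than definitive: it assumes the mysql-connector-python driver, a local MySQL server with the plugin installed, and that the engine registers under the name InfiniDB; the schema and credentials are invented.

```python
import mysql.connector  # assumes the mysql-connector-python package

# Hypothetical connection details; adjust for a real deployment.
cnx = mysql.connector.connect(user="root", password="secret", database="dw")
cur = cnx.cursor()

# The engine is chosen per table, so a columnar analytic table can sit
# alongside ordinary InnoDB tables in the same server.
cur.execute("""
    CREATE TABLE sales (
        sale_date DATE,
        region    VARCHAR(32),
        amount    DECIMAL(12, 2)
    ) ENGINE=InfiniDB   -- engine name assumed for illustration
""")

# Queries are plain SQL; a column store only scans the columns touched.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cur:
    print(region, total)

cnx.close()
```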

The GPLv2 Community Edition will be available only for single-server deployment and without formal support from Calpont, and is primarily aimed at raising interest among MySQL developers. A fully certified and supported commercial version will follow, although Calpont is reticent about providing details at the moment, other than that it will make use of Calpont’s massively parallel processing capabilities and modular architecture to scale out as well as up.

Calpont faces some competition in the MySQL segment from Kickfire and Infobright, particularly the latter given their similar open source software strategies (Kickfire is a MySQL appliance). Infobright has grown rapidly since going open source and now boasts more than 100 customers, although Calpont maintains that still leaves plenty of opportunity amongst MySQL users.

We would agree with that, and also with the company’s claim to offer something different from Infobright technologically. Infobright offers column-based storage but not massively parallel processing (although it is working on a shared-everything, peer-to-peer architecture). We should note that InfiniDB Community Edition is likewise restricted to a single server, but this is the result of a strategic decision rather than a technical limitation. The commercial version will be fully MPP.

    We recently noted that LucidDB is another open source database that is often overlooked since the LucidDB code is not commercially supported.

    Any concern over the future of LucidDB following the demise of LucidEra should be put to bed by the formation of Dynamo BI with the intention to provide a commercially supported distribution of LucidDB.

    As LucidDB project lead John Sichi wrote:

“This is an offering which has been completely missing up until now, and which I and others such as Julian Hyde believe to be essential for accelerating adoption of LucidDB. LucidEra provided much of the critical development effort, but never offered commercial support on LucidDB since that was not part of its software-as-a-service business model. Eigenbase provides community infrastructure and development coordination, but a commercial offering is not part of its non-profit charter. So in the past, when individuals and companies have asked me whom they should talk to in order to purchase support for LucidDB, I have never had a good answer.”

    Meanwhile Nicholas Goodman revealed that the company has acquired the commercial rights to LucidDB and plans to offer DynamoDB as a prepackaged, assembled distribution. It will also be fully open source and all new features will be contributed to LucidDB.

    It is very early days for Dynamo BI, which doesn’t even have a website as yet, so it’s difficult to judge the company’s plans, but with some of the lead LucidDB developers involved and a solid starting project – “the best database no one ever told you about” – it has every chance. We’ll be looking to catch up with the company just as soon as it gets up and running.

    The data warehousing sector is extremely crowded and we continue to believe that there will be a shakeout in the near future, but there are opportunities for companies that are able to differentiate themselves from the pack. Starting a data warehousing company is generally not something that we would recommend right now, but both Calpont and Dynamo BI have opportunities to establish themselves.

    The future of the database is… plaid?

    Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.

Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented C-Store (PDF), a column-oriented database, at the 31st VLDB Conference in 2005, the database industry has debated the relative merits of row- and column-store databases.

While row-based databases dominated the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, ParAccel and SenSage pushing column-based data warehousing products, based on the argument that column-based storage favors the read performance required for query processing.

The debate took a fresh twist recently as former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory column-based databases for both analytical and transaction processing.

    As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.

As Curt Monash recently noted, there are a couple of approaches emerging to hybrid row/column stores.

Oracle’s approach, as revealed in a white paper (PDF), has been to add new hybrid columnar compression capabilities to its Exadata Storage Servers.

This approach maintains row-based storage in the Oracle Database itself while enabling the use of column-based storage to improve compression rates in Exadata, with Oracle claiming a compression ratio of up to 10x without any loss of query performance, and up to 40x for historical data.

    As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
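To make the compression-unit idea concrete, here is a toy Python sketch (my own illustration, not Oracle’s on-disk format): rows are batched into units, each unit is pivoted into columns so that like values sit together, and each column is compressed separately. Low-cardinality columns shrink dramatically, while reconstructing a row only requires reading a single unit.

```python
import json
import zlib

# Synthetic rows: one low-cardinality column, one high-cardinality column.
rows = [{"region": ["EMEA", "APAC", "AMER"][i % 3], "amount": i}
        for i in range(10_000)]

UNIT_SIZE = 1_000  # rows per compression unit (spanning several blocks)

units = []
for start in range(0, len(rows), UNIT_SIZE):
    batch = rows[start:start + UNIT_SIZE]
    # Pivot the batch into columns so that like values are adjacent...
    columns = {name: [row[name] for row in batch] for name in batch[0]}
    # ...then compress each column independently.
    units.append({name: zlib.compress(json.dumps(values).encode())
                  for name, values in columns.items()})

for name in ("region", "amount"):
    size = sum(len(unit[name]) for unit in units)
    print(f"{name}: {size:,} compressed bytes across all units")
```

Running it shows the repetitive region column collapsing to a tiny fraction of the size of the high-cardinality amount column, which is the effect the quoted ratios rely on.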

Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of its column-store engine, including the ability to group a small number of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB) to improve query performance for whole rows, or specified based on the nature of the column data (for example, bid, ask and date columns for a financial application) to improve query performance for columns that are commonly accessed together.

Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data warehouse and analytic workloads, that makes use of vectorized execution, which sees operations applied to batches (vectors) of values at a time. The VectorWise architecture makes use of Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data within each block in columns.

Update: Daniel Abadi has provided an overview of the different approaches to hybrid row/column architectures and suggests something I had suspected: that Oracle is also using the PAX approach, albeit outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going a third route: fractured mirrors.

    Perhaps the future of the database may not be row- or column-based, but plaid.

    Lowering barriers to data warehousing adoption with open source

Since the start of this year I’ve been covering data warehousing as part of The 451 Group’s information management practice, adding to my ongoing coverage of databases, data caching, and CEP, and contributing to the CAOS research practice.

    I’ve covered data warehousing before but taking a fresh look at this space in recent months it’s been fascinating to see the variety of technologies and strategies that vendors are applying to the data warehousing problem. It’s also been interesting to compare the role that open source has played in the data warehousing market, compared to the database market.

I’m preparing a major report on the data warehousing sector, for publication in the next couple of months. In preparation for that I’ve published a rough outline of the role open source has played in the sector over on our CAOS Theory blog. Any comments or corrections much appreciated.

    Ingres launches project for in-memory, columnar, vectorized database engine

    Interesting news from Ingres today that it is teaming up with VectorWise, a database engine spin-off from Amsterdam’s Centrum Wiskunde & Informatica (CWI) scientific research establishment, to collaborate on a new database kernel project.

The Ingres VectorWise project will create a new open source storage engine for the Ingres Database that will better enable it to be positioned as a platform for data warehouse and analytic workloads, although Ingres does not have detailed plans for the productization of the technology at this stage. The starting point for the project is the theory that modern multi-core processors now look and behave like symmetric multiprocessing (SMP) servers, and that on-chip cache memory is taking the place of RAM, but that database software has not been updated to take advantage of these processor developments.

To do so, Ingres and VectorWise will be collaborating on vectorized execution, which sees operations applied to batches (vectors) of values at a time, and on in-cache processing, through which execution occurs within the CPU cache and main memory is effectively treated like disk. The result, according to Ingres, is a reduced I/O bottleneck for query processing. Additionally, the VectorWise engine enables on-the-fly decompression and in-memory operation handling, and includes a compressed column store.

    It is claimed that the Ingres VectorWise project will deliver 10x performance increases over the current Ingres database.
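The intuition behind that claim is easy to demonstrate. In the rough Python sketch below, NumPy stands in for a vectorizing engine (none of this is VectorWise code): processing a whole column as one batch amortizes per-tuple interpretation overhead into tight native loops over contiguous, cache-friendly data.

```python
import time
import numpy as np

prices = np.random.rand(5_000_000)  # one column, contiguous in memory

# Tuple-at-a-time: one interpreted step per row, heavy per-row overhead.
start = time.perf_counter()
total = 0.0
for p in prices:
    if p > 0.5:
        total += p
print(f"row-at-a-time: {time.perf_counter() - start:.2f}s  total={total:,.0f}")

# Vectorized: whole-column primitives run as tight native loops.
start = time.perf_counter()
total = prices[prices > 0.5].sum()
print(f"vectorized:    {time.perf_counter() - start:.2f}s  total={total:,.0f}")
```

On a typical machine the vectorized version is one to two orders of magnitude faster, which is the kind of gap the project is aiming to exploit.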

VectorWise spun off from CWI in 2008 to commercialize the X100 system previously created by its database architecture research group. Development of X100, now also known as VectorWise, has been led by respected research scientists Peter Boncz and Marcin Zukowski.

    Ingres maintains that by working with the CWI research scientists it has proven that their theories are technically feasible in a commercial product. Bringing such a commercial product to general availability is the next step, and history has proven that can be easier said than done. With that caveat we are impressed with the vision and ambition that Ingres is demonstrating.

    On the opportunities for cloud-based databases and data warehousing

    At last year’s 451 Group client event I presented on the topic of database management trends and databases in the cloud.

    At the time there was a lot of interest in cloud-based data management as Oracle and Microsoft had recently made their database management systems available on Amazon Web Services and Microsoft was about to launch the Azure platform.

    In the presentation I made the distinction between online distributed databases (BigTable, HBase, Hypertable), simple data query services (SimpleDB, Microsoft SSDS as was), and relational databases in the cloud (Oracle, MySQL, SQL Server on AWS etc) and cautioned that although relational databases were being made available on cloud platforms, there were a number of issues to be overcome, such as licensing, pricing, provisioning and administration.

    Since then we have seen very little activity from the major database players with regards to cloud computing (although Microsoft has evolved SQL Data Services to be a full-blown relational database as a service for the cloud, see the 451’s take on that here).

In comparison there has been a lot more activity in the data warehousing space with regards to cloud computing. On the one hand the data warehousing players have come to the cloud later; on the other, they are more advanced, and for a couple of reasons I believe data warehousing is better suited to cloud deployments than the general-purpose database.

  • For one thing, the massively parallel architectures of most analytical databases are a natural fit for clustered and virtualized cloud environments.
  • And for another, (some) analytics applications are perhaps better suited to cloud environments since they require large amounts of data to be stored for long periods but processed infrequently.
  • We have therefore seen more progress from analytical than transactional database vendors this year with regards to cloud computing. Vertica Systems launched its Vertica Analytic Database for the Cloud on EC2 in May 2008 (and is working on cloud computing services from Sun and Rackspace), Aster Data followed suit with the launch of Aster nCluster Cloud Edition for Amazon and AppNexus in February this year, and February also saw Netezza partner with AppNexus on a data warehouse cloud service. The likes of Teradata and illuminate are also thinking about, if not talking about, cloud deployments.

To be clear, the early interest in cloud-based data warehousing appears to be in development and test rather than mission-critical analytics applications, although there are early adopters: ShareThis, the online information-sharing service, is up and running on Amazon Web Services’ EC2 with Aster Data; search marketing firm Didit is running nCluster Cloud Edition on AppNexus’ PrivateScale; and Sonian is using the Vertica Analytic Database for the Cloud on EC2.

Greenplum today launched its take on data warehousing in the cloud, focusing its attention initially on private cloud deployments with its Enterprise Data Cloud initiative and its plan to deliver “a new vision for bringing the power of self-service to data warehousing and analytics”.

    That may sound a bit woolly (and we do see the EDC as the first step towards private cloud deployments) but the plan to enable the Greenplum Database to act as a flexible pool of warehoused data from which business users will be able to provision data marts makes sense as enterprises look to replicate the potential benefits of cloud computing in their datacenters.

Functionality such as self-service provisioning and elastic scalability is still to come, but version 3.3 does include online data-warehouse expansion capabilities and is available now. Greenplum also notes that it has customers using the Greenplum Database in private cloud environments, including Fox Interactive Media’s MySpace, Zions Bancorporation and Future Group.

    The initiative will also focus on agile development methodologies and an ecosystem of partners, and while we were somewhat surprised by the lack of virtualization and cloud provisioning vendors involved in today’s announcement, we are told they are in the works.

In the meantime we are confident that Greenplum’s won’t be the last announcement from a data management vendor focused on enabling private cloud computing deployments. While much of the initial attention around cloud-based data management naturally centered on the likes of SimpleDB, the ability to deliver flexible access to, and processing of, enterprise data is more likely to be taking place behind the firewall while users consider which data and which applications are suitable for the public cloud.

Also worth mentioning while we’re on the subject is RainStor, the new cloud archive service recently launched by Clearpace Software, which enables users to retire data from legacy applications to Amazon S3 while ensuring that the data remains available for ad hoc querying using EC2. It’s an idea that resonates, thanks to compliance-driven requirements for long-term data retention combined with the cost of storing and accessing that data.

    451 Group subscribers should stay tuned for our formal take on RainStor, which should be published any day now, while I think it’s probably fair to say you can expect more of this discussion at this year’s client event.

    CEP consolidation begins as Aleri acquires Coral8

Covering the complex event processing specialists just got 25% easier. We noted in September last year that the complex event processing (CEP) specialists StreamBase Systems, Aleri and Coral8 were attractive acquisition targets and that it would only be a matter of time before we saw consolidation in the event processing sector. Consolidation among those vendors wasn’t exactly what we had in mind, but that is what has come to pass, as Aleri has announced the acquisition of Coral8 for an undisclosed fee.

    The combined entity, which continues to use the Aleri name, is now claiming to be the largest CEP specialist on the market, although that is debatable and we expect it to be strongly debated by StreamBase and Progress Software’s Apama division.

    Here are the numbers to be debated: All of Coral8’s 45 employees are joining Aleri, which will have a combined headcount of 95 and will boast 80 paying customers, less than five of which are existing customers of both companies.

    We will have a full assessment of the deal and its implications out later today, but our first impressions are as follows:

While the acquisition of Coral8 by Aleri may appear at first glance like a combination of near-equals, the resulting business stands to benefit from complementary product and sales strategies that should bring about cost savings via reduced duplication of effort and enable further expansion outside financial services.

CEP is becoming a core enabling technology for data processing and analysis, and the new Aleri is well positioned to build on its established position in capital markets and exploit partnerships with business intelligence and data warehousing vendors for wider adoption.

    W(h)ither Syndera

    Recent attempts to reach business event processing vendor Syndera by email proved unsuccessful, and just as I was about to reach out by more traditional means comes speculation that the company has shut down. Certainly www.syndera.com appears to no longer be operational.

    We previously noted that Tibco acquired ‘certain assets’ of the real-time BI software vendor for $1m in July, and those continue to be available in the form of the TIBCO Syndera Operation Suite.

    As Marco Seiriö notes in his speculation, it is somewhat surprising that the company, which had raised over $20m in VC funding, only managed a return in the region of $1m. A sign of the times or a special case?

    10gen, Babble, MongoDB and the changing nature of the database

    Back in July last year we reported on the formation of a new open source cloud computing start-up called 10gen on our Cloud Cover and CAOS Theory blogs.

    Seven months later and there have been a few changes at 10gen, such that this information management blog is arguably the most suitable venue for discussion of the implications of 10gen’s MongoDB, the cloud computing database which has now become its major focus.

    A quick recap: 10gen launched as an open source platform-as-a-service play offering the MongoDB object database as well as an application server and file system. So far, so cloud stack.

    However, the file system quickly became an interface layer to MongoDB while the company more recently decided that its application server runtime and MongoDB are better off apart and shifted its attention to the database, a standalone beta version of which was released last week.

    As the two projects have diverged so will this post. To continue reading about the future of the Babble application server head for CAOS Theory, otherwise:

    As this post from Geir Magnusson Jr, 10gen VP of Engineering & Co-Founder, at Codehaus describes, MongoDB is not your traditional database.

“As I argue when people give me the chance to speak about it, databases are changing – just look at what is available in the so-called “cloud” arena. It tends not to be a RDBMS if it’s scalable. The storage engine under AppEngine, or Amazon’s SimpleDB, or any of the Dynamo implementations, etc, all of which change your programming model to one that isn’t “tables and joins”. Or look at the excellent CouchDB, a JSON store. If the RDBMS isn’t being replaced outright (like it has to be in “the cloud”), it can be augmented with other persistence technologies that are better suited for a portion of the data requirements of a system.”

    This was one of the themes of my talk at our client event in Boston last year, and nothing has happened since then to change my mind. As Geir explains, the interesting thing about the new cloud databases (for want of a better term) is that they force users to think differently about what a database is for – and specifically to think beyond the realms of the relational.

We see similar forces at work in the data warehousing space, driven by column-oriented architectures, but the end result is the same: users are increasingly thinking beyond what they already know to consider the best database management tool for the job at hand.

    As Geir adds of MongoDB: “It works fine as a database, but you can’t think relational. If you want to just replace MySQL with something else, but don’t want to rethink your data model, MongoDB isn’t for you.”
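For the curious, here is a minimal sketch of what not thinking relational looks like in practice, using the pymongo client against a local MongoDB instance (the collection, fields and values are my own invention): data that a relational design would normalize into joined tables is embedded in a single document and retrieved in one lookup.

```python
from pymongo import MongoClient  # assumes pymongo and a local mongod

db = MongoClient("mongodb://localhost:27017")["shop"]

# Relational instinct: an orders table joined to an order_items table.
# Document instinct: embed the line items inside the order itself.
db.orders.insert_one({
    "_id": 1001,
    "customer": "alice",
    "items": [
        {"sku": "widget", "qty": 3, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
})

# One lookup returns the whole aggregate; no join required.
order = db.orders.find_one({"_id": 1001})
print(order["customer"],
      sum(item["qty"] * item["price"] for item in order["items"]))
```

The flip side, as Geir notes, is that the data model has to be rethought around such aggregates rather than simply ported across from a normalized schema.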