The Data Day, Today: Jan 24 2012

Thoughts on Splunk’s IPO and DynamoDB. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Thoughts on the Splunk IPO and S-1, by Dave Kellogg.

* Thoughts on SimpleDB, DynamoDB and Cassandra, by Adrian Cockcroft.

* Recommind’s Revenue Leaps 95% in Record-Setting 2011. Predictable.

* Hewlett-Packard Expands to Cambridge via Vertica’s “Big Data” Center. Moving.

* Announcing SkySQL Enterprise HA for the MariaDB & MySQL databases

* Membase Server is Now Couchbase Server. But not *the* Couchbase Server.

* Cloudera Teams With O’Reilly Media to Merge Hadoop World and Strata Conferences

* Survey results: How businesses are adopting and dealing with data. Results from 100 Strata Online Conference attendees.

* Big data market survey: Hadoop solutions

* LinkedIn released SenseiDB, an open source distributed, realtime, semi-structured database.

* For 451 Research clients

# VMware: not your father’s database company Impact Report

# Sparsity Technologies draws up plans for graph database adoption Impact Report

# Amazon launches DynamoDB, an auto-configuring database as a service Market Development report

# NuoDB targets Q2 release for elastic relational database Market Development report

# ADVIZOR illuminates growth strategy, roadmap in data discovery and analysis Market Development report

# Birst adds own analytic engine for BI, OEM agreement with ParAccel Market Development report

* Google News Search outlier of the day: RentAGrandma.com Recruiting Wonderful Grandmas

And that’s the Data Day, today.

Job trends highlight post-M&A analytic database investment

When I was messing around with Indeed.com job trends the other day I was struck by an interesting trend relating to the five recent major M&A deals involving analytic database vendors: Netezza, Sybase, Greenplum, Vertica and Aster Data.


[Indeed.com job trends chart: netezza, sybase iq, greenplum, vertica, aster data]

The trends aren’t immediately obvious from that chart, but if we break them out individually and add a black dot to indicate the approximate date of the acquisition announcement it all becomes clear.


(Note: scale varies from chart to chart)

While the acquisitions have accelerated job postings for all acquired analytic databases, Greenplum has clearly been the biggest beneficiary. Indeed.com’s data also explains why this might be: EMC/Greenplum is responsible for over 50% of the current Greenplum-related job postings on the site (excluding recruiter postings).

Greenplum had 140 employees when it was acquired in July 2010. Based on the hiring growth illustrated above, EMC’s Data Computing Products Division is set to reach 650 by the end of the year.

Netezza started with a much larger base, but IBM is expected to increase headcount at Netezza from 500 in September 2010 to a target of 800 by year-end. Thanks, no doubt, to Netezza’s larger installed base, IBM is responsible for just 7.7% of Netezza job postings.

This highlights something we recently noted in a 451 Group M&A Insight report: in order to make a considerable dent in the dominance of the big four, any acquiring company will not only have to buy a data-warehousing player but also invest in its growth.

While Vertica and Aster Data are both heading in the right direction, we believe that HP and Teradata will have to accelerate their investment in the Vertica subsidiary and the new Aster Data ‘center of excellence’ respectively.

HP recently told us headcount has grown about 40% since the acquisition (it wasn’t being specific, but Vertica reported 100 employees in January). HP/Vertica is currently responsible for 13.9% of Vertica-related job postings on Indeed.com.

We had speculated that Teradata would need to similarly boost the headcount at Aster Data beyond the estimated 100 employees. Teradata/Aster Data is responsible for 24% of job postings for Aster Data.

But what of Sybase? While Sybase IQ also has a larger installed base, SAP/Sybase are responsible for just 6.4% of the Sybase IQ-related job postings on Indeed.com. The Sybase IQ chart illustrates some common sense investment advice: the value of your investment can go down as well as up.

The future of the database is… plaid?

Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.

Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented (PDF) C-Store, a column-oriented database, at the 31st VLDB Conference in 2005, the database industry has debated the relative merits of row- and column-store databases.

While row-based databases dominated the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, ParAccel and SenSage pushing column-based data warehousing products based on the argument that column-based storage favors the read performance required for query processing.
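The I/O argument behind that debate can be illustrated with a toy sketch (hypothetical code, not any vendor’s implementation): an analytic query that touches one column of a wide table scans far fewer bytes from a column-oriented layout than from a row-oriented one.

```python
# Toy comparison of row-store vs column-store scan cost for an
# analytic query that reads a single column. Illustrative only.

import struct

# A toy table of 1,000 rows with four 8-byte numeric columns.
ROWS, COLS = 1000, 4
table = [[float(r * COLS + c) for c in range(COLS)] for r in range(ROWS)]

# Row-store layout: all of a row's values are adjacent on disk.
row_store = b"".join(struct.pack("4d", *row) for row in table)

# Column-store layout: all of a column's values are adjacent on disk.
col_store = [b"".join(struct.pack("d", row[c]) for row in table)
             for c in range(COLS)]

# SELECT sum(col0): a row store must scan every row (all columns)...
bytes_scanned_row = len(row_store)      # 1000 rows * 4 cols * 8 bytes
# ...while a column store reads only the one column it needs.
bytes_scanned_col = len(col_store[0])   # 1000 rows * 8 bytes

print(bytes_scanned_row, bytes_scanned_col)  # 32000 8000
```

The write-side trade-off runs the other way: inserting one row into the column layout means touching all four column files, which is why row stores have held the operational workloads.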

The debate took a fresh twist recently as former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory column-based storage databases for both analytical and transaction processing.

As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.

As Curt Monash recently noted, there are a couple of approaches emerging to hybrid row/column stores.

Oracle’s approach, as revealed in a white paper (PDF), has been to add new hybrid columnar compression capabilities in its Exadata Storage servers.

This approach maintains row-based storage in the Oracle Database itself while enabling the use of column storage to improve compression rates in Exadata, claiming a compression ratio of up to 10x without any loss of query performance, and up to 40x for historical data.

As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
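The idea Closson describes can be sketched in a few lines (a hypothetical illustration of the concept, not Oracle’s actual on-disk format): a group of rows is pivoted into column order inside a compression unit, where runs of like values compress well, and metadata lets the unit be mapped back to whole rows.

```python
# Sketch of the compression-unit idea: store a batch of rows
# column-by-column and run-length encode the like values.
# Illustrative only -- not Oracle's actual format.

from itertools import groupby

def build_compression_unit(rows):
    """Pivot a group of rows to column order and run-length encode."""
    columns = list(zip(*rows))              # row-major -> column-major
    return [[(value, len(list(run)))        # (value, run_length) pairs
             for value, run in groupby(col)]
            for col in columns]

def read_rows(unit):
    """Decompress a unit back into its original rows."""
    columns = [[v for v, n in col for _ in range(n)] for col in unit]
    return [list(row) for row in zip(*columns)]

rows = [["US", "open"], ["US", "open"], ["US", "closed"], ["UK", "open"]]
unit = build_compression_unit(rows)
assert read_rows(unit) == rows   # the metadata maps back to the rows
print(unit[0])                   # [('US', 3), ('UK', 1)]
```

Because like values sit adjacently within the unit, even this naive run-length scheme shrinks the first column from four values to two pairs; real systems layer far more aggressive encodings on the same layout.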

Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of its column-store engine, including the ability to group a small number of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB), or specified based on the nature of the column data (for example, bid, ask and date columns for a financial application), to improve query performance.

Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data-warehouse and analytic workloads, making use of vectorized execution, which sees multiple instructions processed simultaneously. The VectorWise architecture makes use of Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data in columns within each block.
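A rough sketch of the PAX layout (hypothetical code, not the VectorWise implementation): rows are grouped into fixed-size blocks, but within each block values are stored column by column, so a whole row lives in one block while a single-column scan reads contiguous runs.

```python
# Sketch of the PAX (Partition Attributes Across) layout:
# row groups as blocks, column-major storage within each block.
# Illustrative only; block size is arbitrary here.

BLOCK_ROWS = 4  # rows per block (real systems size blocks to disk pages)

def to_pax_blocks(rows):
    """Split rows into blocks, storing each block column-major."""
    blocks = []
    for i in range(0, len(rows), BLOCK_ROWS):
        chunk = rows[i:i + BLOCK_ROWS]
        blocks.append([list(col) for col in zip(*chunk)])  # pivot block
    return blocks

def scan_column(blocks, col_index):
    """Read one column: only that column's run in each block is touched."""
    for block in blocks:
        yield from block[col_index]

rows = [[r, r * 10] for r in range(10)]
blocks = to_pax_blocks(rows)
assert list(scan_column(blocks, 1)) == [r * 10 for r in range(10)]
```

This is the sense in which PAX is a hybrid: block granularity behaves like a row store (one block fetch reconstructs a row), while within-block layout behaves like a column store.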

Update – Daniel Abadi has provided an overview of the different approaches to hybrid row-column architectures and suggests something I had suspected: that Oracle is also using the PAX approach, except outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going a third route, fractured mirrors – Update

Perhaps the future of the database is neither row- nor column-based, but plaid.

Lowering barriers to data warehousing adoption with open source

Since the start of this year I’ve been covering data warehousing as part of The 451 Group’s information management practice, adding to my ongoing coverage of databases, data caching, and CEP, and contributing to the CAOS research practice.

I’ve covered data warehousing before but taking a fresh look at this space in recent months it’s been fascinating to see the variety of technologies and strategies that vendors are applying to the data warehousing problem. It’s also been interesting to compare the role that open source has played in the data warehousing market, compared to the database market.

I’m preparing a major report on the data warehousing sector for publication in the next couple of months. In preparation for that I’ve published a rough outline of the role open source has played in the sector over on our CAOS Theory blog. Any comments or corrections much appreciated.

On the opportunities for cloud-based databases and data warehousing

At last year’s 451 Group client event I presented on the topic of database management trends and databases in the cloud.

At the time there was a lot of interest in cloud-based data management as Oracle and Microsoft had recently made their database management systems available on Amazon Web Services and Microsoft was about to launch the Azure platform.

In the presentation I made the distinction between online distributed databases (BigTable, HBase, Hypertable), simple data query services (SimpleDB, Microsoft SSDS as was), and relational databases in the cloud (Oracle, MySQL, SQL Server on AWS etc) and cautioned that although relational databases were being made available on cloud platforms, there were a number of issues to be overcome, such as licensing, pricing, provisioning and administration.

Since then we have seen very little activity from the major database players with regards to cloud computing (although Microsoft has evolved SQL Data Services to be a full-blown relational database as a service for the cloud, see the 451’s take on that here).

In comparison there has been a lot more activity in the data warehousing space with regards to cloud computing. On the one hand, data warehousing players are later to the cloud; on the other, they are more advanced, and for a couple of reasons I believe data warehousing is better suited to cloud deployments than the general-purpose database.

  • For one thing, most analytical databases are well suited to deployment in the cloud, thanks to their massively parallel architectures being a good fit for clustered and virtualized cloud environments.
  • And for another, (some) analytics applications are perhaps better suited to cloud environments since they require large amounts of data to be stored for long periods but processed infrequently.
We have therefore seen more progress from analytical than transactional database vendors this year with regards to cloud computing. Vertica Systems launched its Vertica Analytic Database for the Cloud on EC2 in May 2008 (and is working on cloud computing services from Sun and Rackspace). Aster Data followed suit with the launch of Aster nCluster Cloud Edition for Amazon and AppNexus in February this year, and February also saw Netezza partner with AppNexus on a data warehouse cloud service. The likes of Teradata and illuminate are also thinking about, if not talking about, cloud deployments.

To be clear, the early interest in cloud-based data warehousing appears to be in development and test rather than mission-critical analytics applications, although there are early adopters: ShareThis, the online information-sharing service, is up and running on Amazon Web Services’ EC2 with Aster Data; search marketing firm Didit is running nCluster Cloud Edition on AppNexus’ PrivateScale; and Sonian is using the Vertica Analytic Database for the Cloud on EC2.

Greenplum today launched its take on data warehousing in the cloud, focusing its attention initially on private cloud deployments with its Enterprise Data Cloud initiative and plans to deliver “a new vision for bringing the power of self-service to data warehousing and analytics”.

That may sound a bit woolly (and we do see the EDC as the first step towards private cloud deployments) but the plan to enable the Greenplum Database to act as a flexible pool of warehoused data from which business users will be able to provision data marts makes sense as enterprises look to replicate the potential benefits of cloud computing in their datacenters.

Functionality such as self-service provisioning and elastic scalability is still to come, but version 3.3 does include online data-warehouse expansion capabilities and is available now. Greenplum also notes that it has customers using the Greenplum Database in private cloud environments, including Fox Interactive Media’s MySpace, Zions Bancorporation and Future Group.

The initiative will also focus on agile development methodologies and an ecosystem of partners, and while we were somewhat surprised by the lack of virtualization and cloud provisioning vendors involved in today’s announcement, we are told they are in the works.

In the meantime we are confident that Greenplum’s won’t be the last announcement from a data management vendor focused on enabling private cloud computing deployments. While much of the initial attention around cloud-based data management was naturally focused on the likes of SimpleDB, the ability to deliver flexible access to, and processing of, enterprise data is more likely to take place behind the firewall while users consider which data and applications are suitable for the public cloud.

Also worth mentioning while we’re on the subject is RainStor, the new cloud archive service recently launched by Clearpace Software, which enables users to retire data from legacy applications to Amazon S3 while ensuring that the data remains available for ad hoc querying using EC2. It’s an idea that resonates thanks to compliance-driven requirements for long-term data storage, combined with the cost of storing and accessing that data.

451 Group subscribers should stay tuned for our formal take on RainStor, which should be published any day now, and I think it’s probably fair to say you can expect more of this discussion at this year’s client event.