The geographic distribution of Hadoop skills: in context

NC State University’s Institute for Advanced Analytics recently published some interesting statistics on Apache Hadoop adoption based on a search of LinkedIn data.

The statistics graphically illustrate what a lot of people were already pretty sure of: that the geographic distribution of Hadoop skills (and presumably, therefore, adoption) is heavily weighted in favour of the USA, and in particular the San Francisco Bay Area.

The statistics showed that 64% of the 9,079 LinkedIn members with “Hadoop” in their member profiles (by no means a perfect measure, but an insightful one nonetheless) are based in the US, and that a large proportion of those are in the Bay Area.

The results are what we would expect to see given the relative level of immaturity of Apache Hadoop adoption, as well as the nature and location of the early Hadoop adopters and Hadoop-related vendors.

The results got me thinking two things:
– how does the geographic spread compare to a more maturely adopted project?
– how does it compare to the various NoSQL projects?

So I did some searching of LinkedIn to find out.

To answer the first question I performed the same search for MySQL, as an example of a mature, widely-adopted open source project.

The results show that just 32% of the 366,084 LinkedIn members with “MySQL” in their member profiles are based in the US (precisely half the proportion for Hadoop), while only 4.4% are in the Bay Area, compared with 28.2% of the 9,079 LinkedIn members with “Hadoop” in their member profiles.
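Translating those percentages back into absolute profile counts makes the gap more concrete; the quick calculation below uses only the search totals and percentages quoted above:

```python
# Rough absolute profile counts behind the percentages quoted above.
hadoop_total = 9_079
mysql_total = 366_084

hadoop_us = hadoop_total * 0.64         # ~5,811 US-based Hadoop profiles
hadoop_bay = hadoop_total * 0.282       # ~2,560 Bay Area Hadoop profiles

mysql_us = mysql_total * 0.32           # ~117,147 US-based MySQL profiles
mysql_bay = mysql_total * 0.044         # ~16,108 Bay Area MySQL profiles

print(f"Hadoop: {hadoop_us:,.0f} US / {hadoop_bay:,.0f} Bay Area of {hadoop_total:,}")
print(f"MySQL:  {mysql_us:,.0f} US / {mysql_bay:,.0f} Bay Area of {mysql_total:,}")
```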

The charts below illustrate the difference in geographic distribution between Hadoop and MySQL; the size of each box is proportional to the number of search results (click each image for a larger version).

With regard to the second question, I also ran searches for MongoDB, Riak, CouchDB, Apache Cassandra*, Membase*, Neo4j, HBase, and Redis.

I’ll be posting the results for each of those over the next week or so, but in the meantime, the graphic below shows the split between the USA and Rest of the World (ROW) for all ten projects.

It illustrates, as I suspected, that the distribution of skills for NoSQL databases is more geographically dispersed than it is for Hadoop.

I have some theories as to why that is – but I’d love to hear anyone else’s take on the results.

*I had to use the ‘Apache’ qualifier with Cassandra to filter out anyone called Cassandra, while Membase returned a more statistically relevant result than Couchbase.

World map image: Owen Blacker

Why SAP should march in the direction of ANTs

SAP faces a number of challenges in making the most of its proposed $5.8bn acquisition of Sybase, not the least of which is that the company’s core enterprise applications do not currently run on Sybase’s database software.

As we suggested last week, that should be pretty easy to fix technically, but even if SAP gets its applications, BI software and data warehousing products up and running on Sybase ASE and IQ in short order, it still faces a challenge to persuade the estimated two-thirds of SAP users that run on an Oracle database to deploy Sybase for new workloads, let alone migrate existing deployments.

Even if SAP were to bundle ASE and IQ at highly competitive rates (which we expect it to do) it will have a hard time convincing die-hard Oracle users to give up on their investments in Oracle database administration skills and tools. As Hasso Plattner noted yesterday, “they do not want to risk what they already have.”

Hasso was talking about the migration from disk-based to in-memory databases, and that is clearly SAP’s long-term goal, but even if we “assume for a minute that it really works,” as Hasso advised, there is going to be a long period during which SAP’s customers remain on disk-based databases, and SAP is going to need to move at least some of those to Sybase to prove the wisdom of the acquisition.

A solution may have appeared today from an unlikely source, with IBM’s release of DB2 SQL Skin for Sybase ASE, a new feature for its DB2 database product that provides compatibility with applications developed for Sybase’s Adaptive Server Enterprise (ASE) database. Most Sybase applications should be able to run on DB2 unchanged, according to the companies, while users are also able to retain their Sybase database tools, as well as their administration skills.

That may not sound like particularly good news for SAP or Sybase, but the underlying technology could be an answer to SAP’s problems. DB2 SQL Skin for Sybase ASE was developed with ANTs Software and is based on its ANTs Compatibility Server (ACS).

ACS is not specific to DB2. It is designed to support the API language of an application written for one database and translate it to the language of the new database – and ANTs maintains that re-purposing the technology to support other databases is a matter of metadata changes. In fact the first version of ACS, released in 2008, targeted migration from Sybase to Oracle databases.
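To make the general idea concrete, here is a minimal sketch of how a dialect-compatibility layer of this kind works in principle. The two rewrite rules are purely illustrative assumptions on my part and bear no relation to ANTs’ actual, proprietary rule set:

```python
import re

# Toy sketch of a SQL-dialect compatibility layer: statements written for a
# Sybase ASE application are rewritten into a form the target database accepts.
# The rules are illustrative only, not ANTs' actual implementation.
REWRITE_RULES = [
    # T-SQL style "SELECT TOP n ..." -> standard "... FETCH FIRST n ROWS ONLY"
    (re.compile(r"SELECT\s+TOP\s+(\d+)\s+(.*)", re.IGNORECASE | re.DOTALL),
     lambda m: f"SELECT {m.group(2)} FETCH FIRST {m.group(1)} ROWS ONLY"),
    # T-SQL GETDATE() -> standard CURRENT_TIMESTAMP
    (re.compile(r"GETDATE\(\)", re.IGNORECASE), lambda m: "CURRENT_TIMESTAMP"),
]

def translate(statement: str) -> str:
    """Apply each rewrite rule in turn to a single SQL statement."""
    for pattern, replacement in REWRITE_RULES:
        statement = pattern.sub(replacement, statement)
    return statement

print(translate("SELECT TOP 10 name FROM customers ORDER BY name"))
print(translate("SELECT GETDATE()"))
```

The appeal of this approach is that the application and its administrators keep speaking the dialect they already know, while only the translation layer needs to be taught about a new target database.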

Sybase should be pretty familiar with ANTs. In 2008 it licensed components of the company’s ANTs Data Server (ADS) real-time database product (now FourJ’s Genero db), while also entering into a partnership agreement to create a version of ACS that would enable migrations from Microsoft’s SQL Server to Sybase Adaptive Server Enterprise and Sybase IQ (451 Group coverage).

That agreement was put on hold when ANTs’ IBM opportunity arose, and while ANTs is likely to have its hands full dealing with IBM migration projects, we would not be surprised to see Sybase reviving its interest in a version that targets Oracle.

It might not reduce the time it takes to port SAP to Sybase – it would take time to create a version of ACS for Oracle-Sybase migrations (DB2 SQL Skin for Sybase was in development and testing for most of 2009) – but it would potentially enable SAP to deploy Sybase databases for new workloads without asking its users to retool and re-train.

Saying yes to NoSQL

As a company, The 451 Group has built its reputation on taking a lead in covering disruptive technologies and vendors. Even so, with a movement as hyped as NoSQL databases, it sometimes pays to be cautious.

In my role covering data management technologies for The 451 Group’s Information Management practice I have been keeping an eye on the NoSQL database movement for some time, taking the time to understand the nuances of the various technologies involved and their potential enterprise applicability.

That watching brief has now spilled over into official coverage, following our recent assessment of 10gen. I also recently had the chance to meet up with Couchio’s VP of business development, Nitin Borwankar (see coverage initiation of Couchio), and I’ve caught up with Basho Technologies as well – a report on that is imminent.

There are a couple of reasons why I have formally begun covering the NoSQL databases. The first is the maturing of the technologies, and the vendors behind them, to the point where they can be considered for enterprise-level adoption. The second is the demand we are getting from our clients to provide our view of the NoSQL space and its players.

This is coming both from the investment community and from existing vendors, either looking for potential partnerships or fearing potential competition. The number of queries we have been getting related to NoSQL and big data has encouraged me to articulate my thoughts, so look out for a two-part spotlight on the implications for the operational and analytical database markets in the coming weeks.

The biggest reason, however, is the recognition that the NoSQL movement is a user-led phenomenon. There is an enormous amount of hype surrounding NoSQL, but for the most part it is not coming from vendors like 10gen, Couchio and Basho (although they may not be actively discouraging it) but from technology users.

A quick look at the most prominent key-value and column-table NoSQL data stores highlights this. Many of these have been created by user organizations themselves in order to fill a void and overcome the limitations of traditional relational databases – for example Google (BigTable), Yahoo (HBase), Zvents (Hypertable), LinkedIn (Voldemort), Amazon (Dynamo), and Facebook (Cassandra).

It has become clear that traditional database technologies do not meet the scalability and performance requirements of dealing with big data workloads, particularly at the scale experienced by social networking services.

That does raise the question of how applicable these technologies will be to enterprises that do not share the architecture of the likes of Google, Facebook and LinkedIn – at least in the short term. There are users beyond that group, though – Cassandra adopters include Rackspace, Digg, Facebook, and Twitter, for example.

What there isn’t – for the likes of Cassandra and Voldemort, at least – is vendor-based support. That inevitably raises questions about the general applicability of the key-value/column-table stores. As Dave Kellogg notes, “unless you’ve got Google’s business model and talent pool, you probably shouldn’t copy their development tendencies”.

Given the levels of adoption it seems inevitable that vendors will emerge around some of these projects, not least since, as Dave puts it, “one day management will say: ‘Holy Cow folks, why in the world are we paying programmers to write and support software at this low a level?'”

In the meantime, it would appear that the document-oriented data stores (Couchio’s CouchDB, 10gen’s MongoDB, Basho’s Riak) are much more generally applicable, both technologically and from a business perspective. UPDATE – You can also add Neo Technology and its graph database technology to that list.

In our forthcoming two-part spotlight on this space I’ll articulate in more detail our view on the differentiation of the various NoSQL databases and other big data technologies and their potential enterprise applicability. The first part, on NoSQL and operational databases, is here.

Is Sybase buying Aleri?

Marc Adler and Marco Seiriö seem to think so.

Such a deal would seem a little strange coming less than a year after Sybase licensed the underlying complex event processing (CEP) engine for Sybase CEP from Coral8, immediately prior to Coral8’s acquisition by Aleri.

The terms of that licensing agreement provide a clue as to why Sybase would consider opening up its wallet again to snap up Aleri, however.

As Aleri insisted last March, “The licensing arrangement allows Sybase to embed CEP capabilities within and ONLY WITHIN Sybase products such as RAP”.

Sybase later confirmed (clients only) to us that this was indeed the arrangement and maintained that its strategy for CEP was to embed it within larger platform products.

As well as RAP – The Trading Edition, the company’s risk-analytics platform, Sybase also had plans to target opportunities in the telecommunications, healthcare and government sectors.

One justification for the acquisition of Aleri would be that it would allow Sybase to target those markets and other opportunities with a standalone CEP offering based on Aleri’s next-generation engine, codenamed Ohio, which is slated for roll-out in 2010 and is designed to combine the best features of the Aleri Streaming Platform and the Coral8 Engine while remaining backwards-compatible with both.

Then of course there are the Aleri/Coral8 assets beyond the core CEP engine, including the Aleri Studio visual modeling application, as well as dashboard and OLAP server capabilities, and packaged applications for risk and liquidity analysis and management.

As for why Aleri would sell out to Sybase – we certainly noted some trepidation from the company when we caught up (clients only) in September last year. While the company was buoyant about its plans for Ohio it was reticent to discuss details of customer wins/successes.

The only thing the company would say was that it had more than 80 customers – the same number of combined customers it had when the merger closed.

At that point it was somewhat more confident, claiming (clients only) to be the largest pure-play CEP vendor in terms of headcount, customer base and revenue (although, with none of the CEP vendors disclosing revenue figures, that last claim was always highly debatable).

Two data management webinars this week

In addition to the 451 Group’s own data warehousing webinar on Thursday I will also be taking part in a webinar on Wednesday with EnterpriseDB on the subject of open source database adoption in the enterprise.

During the webinar we will provide recommendations for how organizations can effectively leverage open source software. Attendees will learn about open source software trends for 2010, top considerations when using open source databases, and best practices for successful deployments of open source software.

I’ll be providing some data points from our recent surveys on database adoption and open source adoption while EnterpriseDB’s Larry Alston will also showcase successful enterprise deployments of Postgres Plus.

The open source database webinar is Wednesday, December 16, at 1 pm ET. To register, visit this link.

The data warehousing webinar is Thursday, December 17th, at 1 pm ET. To register, visit this link.

Ten considerations for choosing/building a data warehouse

There is healthy competition in data warehousing, with more than 20 vendors competing for the attention of would-be customers with a variety of technologies, architectures and implementation methodologies.

With choice comes potential confusion, since users have to identify and compare different products and features, as well as vendor viability, to ensure they are investing their IT budgets wisely – especially in the current economic climate.

Our latest special report, Warehouse Optimization – Ten considerations for choosing/building a data warehouse, is designed to help reduce that confusion and is now available for existing 451 Group clients to download and non-clients to purchase. An executive summary is also available.

The report provides an overview of the data-warehousing vendor landscape, as tracked by The 451 Group, and examines the business and technology trends driving this market. It identifies 10 key technology trends in data warehousing and assesses how they can be used to choose the technologies and vendors that are best suited to a would-be customer and its specific application.

The report is not designed to make recommendations on particular vendors or technologies, but to provide an independent overview of the sector, which could be used by customers as part of a vendor-evaluation process. The report also examines the potential for consolidation and identifies some potential merger and acquisition drivers, as well as providing profiles of the data-warehousing vendors being tracked by The 451 Group as part of its ongoing coverage of this sector.

Look out also for a forthcoming webinar in which we will present the key findings and implications. We’ll keep you posted on the details.

The future of the database is… plaid?

Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.

Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented (PDF) C-Store, a column-oriented database, at the 31st VLDB Conference in 2005, the database industry has debated the relative merits of row- and column-store databases.

While row-based databases dominated the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, ParAccel and SenSage pushing column-based data warehousing products, based on the argument that column-based storage favors the read performance required for query processing.
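For readers new to the debate, a minimal sketch of the two layouts (illustrative only, not any particular vendor’s implementation) shows why column-based storage suits analytic queries that scan a few attributes across many rows:

```python
# Illustrative sketch: the same three-row table laid out row-wise and column-wise.
rows = [
    {"id": 1, "region": "EMEA", "revenue": 100},
    {"id": 2, "region": "APAC", "revenue": 250},
    {"id": 3, "region": "EMEA", "revenue": 175},
]

# Row store: each record is stored (and read) as a whole.
row_store = rows

# Column store: each attribute is stored contiguously.
column_store = {
    "id": [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "revenue": [100, 250, 175],
}

# An analytic query such as SUM(revenue) only needs to read one column here...
total_from_columns = sum(column_store["revenue"])

# ...whereas the row store has to touch every field of every row to get there.
total_from_rows = sum(r["revenue"] for r in row_store)

assert total_from_columns == total_from_rows == 525
```

Columnar layouts also tend to compress well, since values from the same domain sit next to each other – a property that is directly relevant to Oracle’s approach described below.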

The debate took a fresh twist recently, as former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory, column-based storage databases for both analytical and transaction processing.

As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.

As Curt Monash recently noted, there are a couple of approaches emerging to hybrid row/column stores.

Oracle’s approach, as revealed in a white paper (PDF), has been to add new hybrid columnar compression capabilities to its Exadata Storage servers.

This approach maintains row-based storage in the Oracle Database itself while enabling the use of column storage to improve compression rates in Exadata, with Oracle claiming a compression ratio of up to 10x without any loss of query performance, and up to 40x for historical data.

As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
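Conceptually, a compression unit can be pictured as a batch of rows pivoted into per-column segments, each compressed independently, with metadata retained to reassemble any individual row. The sketch below is a loose illustration of that idea under my own assumptions, not a representation of Oracle’s on-disk format:

```python
import json
import zlib

def build_compression_unit(rows):
    """Pivot a batch of rows into compressed per-column segments."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return {
        "row_count": len(rows),
        "columns": {key: zlib.compress(json.dumps(values).encode())
                    for key, values in columns.items()},
    }

def read_row(unit, row_index):
    """Reconstruct a single row from the compressed column segments."""
    return {key: json.loads(zlib.decompress(blob))[row_index]
            for key, blob in unit["columns"].items()}

rows = [{"id": i, "region": "EMEA", "revenue": i * 10} for i in range(1000)]
unit = build_compression_unit(rows)
assert read_row(unit, 42) == rows[42]
```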

Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of the column-store engine, including the ability to group a small number of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB) to improve query performance for whole rows, or specified manually based on the nature of the column data (for example, bid, ask and date columns for a financial application).

Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data-warehouse and analytic workloads, that makes use of vectorized execution, which sees multiple data values processed simultaneously by a single instruction. The VectorWise architecture makes use of Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data in columns.
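As a rough illustration of what vectorized execution buys (a sketch under my own assumptions, not VectorWise’s engine), the difference is between invoking an operator once per tuple and invoking it once per block of column values, which amortizes per-call overhead and is friendlier to CPU caches and SIMD units:

```python
import numpy as np

# Illustrative comparison of tuple-at-a-time vs vectorized execution over one column.
prices = np.random.rand(1_000_000)

def total_tuple_at_a_time(values):
    """Invoke the 'operator' once per value, as a classic iterator engine would."""
    total = 0.0
    for v in values:              # one round-trip per tuple
        total += v * 1.2          # e.g. apply a multiplier, then aggregate
    return total

def total_vectorized(values, block_size=1024):
    """Invoke the 'operator' once per block of values."""
    total = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += float(np.sum(block * 1.2))   # whole block processed per call
    return total

print(total_tuple_at_a_time(prices))
print(total_vectorized(prices))   # same result, far fewer operator invocations
```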

Update – Daniel Abadi has provided an overview of the different approaches to hybrid row-column architectures and suggests something I had suspected: that Oracle is also using the PAX approach, except outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going the third route, fractured mirrors – Update

Perhaps the future of the database is not row- or column-based, but plaid.