VC funding for Hadoop and NoSQL tops $350m

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but below is a snapshot of the report, along with a graphic representation of the recent uptick in funding.

According to our figures, between the beginning of 2008 and the end of 2010 $95.8m had been invested in the various Apache Hadoop- and NoSQL-related vendors. That figure now stands at more than $350.8m, up 266%.
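For reference, the headline percentage follows directly from the two totals quoted above:

\[
\frac{\$350.8\text{m} - \$95.8\text{m}}{\$95.8\text{m}} \approx 2.66,\quad \text{i.e. an increase of roughly } 266\%.
\]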

That statistic does not really do justice to the sudden uptick of interest, however. The figures indicate that funding for Apache Hadoop- and NoSQL-related firms has more than doubled since the end of August, at which point the total stood at $157.5m.

A substantial reason for that huge jump is the staggering $84m series A funding round raised by Apache Hadoop-based analytics service provider Opera Solutions.

The original commercial supporter of Apache Hadoop, Cloudera, has also contributed strongly with a recent $40m series D round. In addition, MapR Technologies raised $20m to invest in its Apache Hadoop distribution, while we know that Hortonworks also raised a substantial round (unconfirmed, but reportedly $20m) from Benchmark Capital and former parent Yahoo as it was spun off in June. Index Ventures also recently announced that it has become an investor in Hortonworks.

I am reliably informed that if you factor in Hortonworks’ two undisclosed rounds, the total funding for Hadoop and NoSQL vendors is actually closer to $400m.

The various NoSQL database providers have also played a part in the recent burst of investment, with 10gen raising a $20m series D round and Couchbase raising $15m. DataStax, which has interests in both Apache Cassandra and Apache Hadoop, raised an $11m series B round, while Neo Technology raised a $10.6m series A round. Basho Technologies raised $12.5m in series D funding in three chunks during 2011.

Additionally, there are a variety of associated players, including Hadoop-based analytics providers such as Datameer, Karmasphere and Zettaset, as well as hosted NoSQL firms such as MongoLab, MongoHQ and Cloudant.

One investor name that crops up more than most in the list above is Accel Partners, which was an original investor in both Cloudera and Couchbase, and backed Opera Solutions via its Accel-KKR joint venture with Kohlberg Kravis Roberts.

It appears that those investments have merely whetted Accel’s appetite for big data, however, as the firm last week announced a $100m Big Data Fund to invest in new businesses targeting storage, data management and analytics, as well as data-centric applications and tools.

While Accel is the first VC shop that we are aware of to create a fund specifically for big data investments, we are confident both that it won’t be the last and that other VCs have already informally earmarked funds for data-related investments.

451 clients can get more details on funding and M&A involving more traditional database vendors, as well as our perspective on potential M&A suitors for the Hadoop and NoSQL players.

What is the point of Hadoop?

Among the many calls we have fielded from users, investors and vendors about Apache Hadoop, the most common underlying question we hear could be paraphrased as ‘what is the point of Hadoop?’.

It is a more fundamental question than ‘what analytic workloads is Hadoop used for’ and really gets to the heart of uncovering why businesses are deploying or considering deploying Apache Hadoop. Our research suggests there are three core roles:

– Big data storage: Hadoop as a system for storing large, unstructured, data sets
– Big data integration: Hadoop as a data ingestion/ETL layer (sketched briefly below)
– Big data analytics: Hadoop as a platform for new exploratory analytic applications
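To make the integration role a little more concrete, here is a minimal sketch of Hadoop acting as an ingestion/ETL layer via Hadoop Streaming. The log format, field names and record layout below are hypothetical, chosen for illustration rather than taken from any particular deployment.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: Hadoop as a data ingestion/ETL layer.
# Reads raw, semi-structured log lines from stdin and emits clean,
# tab-separated records suitable for loading into a warehouse or for
# querying with Hive or Pig. The log format here is hypothetical.
import sys

def parse(line):
    # Expected (hypothetical) format:
    # "10.0.0.1 2011-10-01T12:00:00 /products/42 200"
    parts = line.strip().split()
    if len(parts) != 4:
        return None  # drop malformed records at ingestion time
    return parts  # [ip, timestamp, path, status]

for line in sys.stdin:
    record = parse(line)
    if record:
        print("\t".join(record))
```

A map-only job along these lines would typically be submitted through the hadoop-streaming jar with the number of reducers set to zero; the structured output can then be loaded into an existing warehouse or analyzed in place.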

While much of the attention paid to Apache Hadoop use cases focuses on the innovative new analytic applications it has enabled in this latter role, thanks to its high-profile adoption at Web properties, for more traditional enterprises and later adopters the first two, more mundane, roles are more likely to be the trigger for initial adoption. Indeed, there are some good examples of these three roles representing an adoption continuum.

We also see the multiple roles playing out at a vendor level, with regard to strategies for Hadoop-related products. Oracle’s Big Data Appliance (451 coverage), for example, is focused very specifically on Apache Hadoop as a pre-processing layer for data to be analyzed in Oracle Database.

While Oracle focuses on Hadoop’s ETL role, it is no surprise that the other major incumbent vendors showing interest in Hadoop can be grouped into three main areas:

– Storage vendors
– Existing database/integration vendors
– Business intelligence/analytics vendors

The impact of these roles on vendor and user adoption plans will be reflected in my presentation at Hadoop World in November, The Blind Men and the Elephant.

You can help shape this presentation, and our ongoing research into Hadoop adoption drivers and trends, by taking our survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

NoSQL Road Show, Hadoop Tuesdays and Hadoop World

I’ll be taking our data management research out on the road in the next few months with a number of events, webinars and presentations.

On October 12 I’m taking part in the NoSQL Road Show Amsterdam, with Basho, Trifork and Erlang Solutions, where I’ll be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

The following week, October 18, I’m taking part in the Hadoop Tuesdays series of webinars, presented by Cloudera and Informatica, specifically talking about the Hadoop Ecosystem.

The Apache Hadoop ecosystem will again be the focus of attention on November 8 and 9, when I’ll be in New York for Hadoop World, presenting The Blind Men and the Elephant.

Then it’s back to NoSQL with two more stops on the NoSQL Road Show, in London on November 29 and Stockholm on December 1, where I’ll once again be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

I hope you can join us for at least one of these events, and am looking forward to learning a lot about NoSQL and Apache Hadoop adoption, interest and concerns.

The dawn of polyglot analytics

While there has been a significant amount of interest in the volume, velocity and variety of big data (and perhaps a few other Vs, depending on who you speak to), it has become increasingly clear that the trends driving new approaches to data management relate not just to the nature of the data itself, but to how the user wants to interact with the data.

As we previously noted, if you turn your attention to the value of the data then you have to take into account the trend towards storing and processing all data (or at least as much as is economically feasible), and the preferred rate of query (the acceptable time taken to generate the result of a query, as well as the time between queries). Another factor to be added to the mix is the way in which the user chooses to analyze the data: are they focused on creating a data model and schema to answer pre-defined queries, or engaging in exploratory analytic approaches in which data is extracted and the schema defined in response to the nature of the query?
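The distinction can be illustrated with a rough sketch (all names and the event format below are hypothetical): in the first case the schema is fixed before any question is asked, while in the second the raw events are kept as-is and structure is imposed only when a question arrives.

```python
import json

# Pre-defined queries (schema-on-write): the data model is designed up
# front to answer anticipated questions, e.g. a fixed fact table queried
# with fixed SQL.
PREDEFINED_REPORT = """
    SELECT region, SUM(revenue)
    FROM sales_fact
    GROUP BY region
"""

# Exploratory analysis (schema-on-read): raw events are stored untouched
# and the 'schema' is extracted at query time, shaped by the question.
def explore(raw_lines, question):
    for line in raw_lines:
        event = json.loads(line)   # structure recovered on demand
        if question(event):        # the query itself defines what matters
            yield event

# An ad hoc question that did not exist when the data was captured:
# mobile_events = list(explore(open("events.log"), lambda e: e.get("device") == "mobile"))
```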

All of these factors have significant implications for which technology is chosen to store and analyze the data, and another user-driven factor is the increased desire to use specialist data management technologies depending on the specific requirement. As we noted in NoSQL, NewSQL and Beyond, in the operational database world this approach has become known as polyglot persistence. Clearly though, in the analytic database market we are talking not just about approaches to storing the data, but also analyzing it. That is why we have begun using the term ‘polyglot analytics’ to describe the adoption of multiple query-processing technologies depending on the nature of the query.
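What this means in practice can be sketched, in deliberately simplified form, as an application that routes each query to a different processing technology depending on its nature. The backends and the routing rule below are hypothetical placeholders rather than a description of any particular product.

```python
# Illustrative only: choose the query-processing engine based on the
# nature of the query, not just the nature of the data.

def run_query(query):
    if query["kind"] == "predefined_report":
        return run_on_warehouse(query["sql"])    # e.g. the EDW or an analytic RDBMS
    if query["kind"] == "exploratory":
        return run_mapreduce_job(query["job"])   # e.g. Hadoop over raw data in HDFS
    raise ValueError("unknown query kind: %r" % query["kind"])

def run_on_warehouse(sql):
    # Placeholder: submit SQL to the warehouse via its client interface.
    raise NotImplementedError

def run_mapreduce_job(job):
    # Placeholder: submit a MapReduce job against raw files in HDFS.
    raise NotImplementedError
```

The point is not the dispatch code itself, but that neither engine is asked to do the other’s job.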

Polyglot analytics explains why we are seeing adoption of Hadoop and MapReduce as a complement to existing data warehousing deployments. It explains, for example, why a company like LinkedIn might adopt Hadoop for its People You May Know feature while retaining its investment in Aster Data for other analytic use cases. Polyglot analytics also explains why a company like eBay would retain its Teradata Enterprise Data Warehouse for storing and analyzing traditional transactional and customer data, as well as adopting Hadoop for storing and analyzing clickstream, user behaviour and other un/semi-structured data, while also adopting an exploratory analytic platform based on Teradata’s Extreme Data Appliance for extreme analytics on a combination of transaction and user behaviour data pulled from both its EDW and Hadoop deployments.

The emergence of this kind of exploratory analytic platform exemplifies the polyglot analytics approach of adopting a different platform based on the user’s approach to analytics rather than the nature of the data. It also highlights some of the thinking behind Teradata’s acquisition of Aster Data and IBM’s acquisition of Netezza, as well as HP’s acquisition of Vertica and the potential future role of vendors such as ParAccel and Infobright.

We are about to embark on a major survey of data management users to assess their attitudes to polyglot analytics and the drivers for adopting specific data management/analytics technologies. The results will be delivered as part of our Total Data report later this year. Stay tuned for more details on the survey in the coming weeks.

Red Hat considering NoSQL/Hadoop acquisition

Idle speculation over on our CAOS Theory blog.

Top Issues IT faces with Hadoop MapReduce: a Webinar with Platform Computing

Next Tuesday, August 3, at 8.30 AM PDT I’ll be taking part in a Webinar with Platform Computing to discuss the benefits and challenges of Hadoop and MapReduce. Here are the details:

With the explosion of data in the enterprise, especially unstructured data which constitutes about 80% of the total data in the enterprise, new tools and techniques are needed for business intelligence and big data processing. Apache Hadoop MapReduce is fast becoming the preferred solution for the analysis and processing of this data.

The speakers will address the issues facing enterprises deploying open source solutions. They will provide an overview of the solutions available for Big Data, discuss best practices, lessons learned, case studies and actionable plans to move your project forward.

To register for the event please visit the registration page.

Who is hiring Hadoop and MapReduce skills?

Continuing my exploration of Indeed.com’s job posting trends and data, I have recently been taking a look at which organizations (excluding recruitment firms) are hiring Hadoop and MapReduce skills. The results are pretty interesting.

When it comes to who is hiring Hadoop skills, the answer, put simply, is Amazon, or more generally new media:


Source: Indeed.com. Correct as of August 2, 2011.

This is indicative of the early stage of adoption, and perhaps reflects the fact that many new media Hadoop adopters have chosen to self-support rather than turn to the Hadoop support providers/distributors.

It is no surprise to see those vendors also listed as they look to staff up to meet the expected levels of enterprise adoption (and it is worth noting that Amazon could also be included in the vendors category, given its Elastic MapReduce service).

It is fascinating to see that, of the vendors, VMware currently has the most job postings on Indeed.com referencing Hadoop, while Microsoft also makes an appearance.

Meanwhile, the appearance of Northrop Grumman and Sears Holdings on this list indicates the potential for adoption among more traditional sectors, such as government and retail.

It is interesting to compare the results for Hadoop job postings with those mentioning Teradata, which shows a much more varied selection of retail, health, telecoms, and financial services providers, as well as systems integrators, government contractors, new media and vendors.

It is also interesting to compare Hadoop-related job postings with those specifying MapReduce skills. There are far fewer of them, for a start, and while new media companies are well represented, there is much greater interest from government contractors.


Source: Indeed.com. Correct as of August 2, 2011.

Hadoop and NoSQL job trends – in context

Recently there has been a spate of postings regarding job trends for distributed data management technologies, including Hadoop and the various NoSQL databases.

One thing you rarely see on these job trends charts is a comparison with an incumbent technology, for context. There’s a reason for that, as this comparison of database-related jobs from Indeed.com illustrates:

Although there has been a recent increase in job postings related to Hadoop and MongoDB, both are dwarfed, in absolute terms, by the number of job postings involving SQL Server and MySQL.
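The base effect behind that contrast is easy to see with some purely invented figures (these are illustrative numbers, not Indeed.com data):

```python
# Hypothetical posting counts, for illustration only: a small absolute
# base produces dramatic relative growth even while the incumbent
# remains far larger in absolute terms.
postings = {
    # technology: (postings a year ago, postings now) -- invented figures
    "SQL Server": (50000, 55000),
    "Hadoop": (200, 1400),
}

for tech, (then, now) in sorted(postings.items()):
    growth = 100.0 * (now - then) / then
    print("%-10s  %6d -> %6d  relative growth: %+.0f%%" % (tech, then, now, growth))
```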

So why all the fuss about Hadoop and NoSQL, from a corporate perspective? This chart, showing the relative growth for the same data management technologies, says it all:

The cathedral in the bazaar: the future of the EDW

Kalido’s Winston Chen published an interesting post this week comparing Enterprise Data Warehouse projects to the building of a cathedral, notably Gaudi’s Sagrada Familia: “A timeless edifice cast in stone, for a client who’s not in a hurry.”

The requirement of many data warehousing projects to deliver on an immutable vision is one of their most likely failings. As we noted in our 2009 report on considerations for building a data warehouse:

“One of the most significant inefficiencies of data warehousing is that users have traditionally had to design their data-warehouse models to match their planned queries, and it can often be difficult to change schema once the data warehouse has been created. This approach is too rigid in a world of rapidly changing business requirements and real-time decision-making.”

Not only is it too rigid, as we added in our 2010 data warehousing market sizing report:

“It is also self-defeating, since a business analyst or executive that is unable to get the answers to queries they require from the EDW is likely to find their own ways to answer these queries – resulting in data silos and the exact redundancy and duplication issues the EDW was apparently designed to avoid.”

Given my dual focus on open source software, whenever I hear the term ‘cathedral’ used in the context of software, I can’t help but think of Eric Raymond’s seminal essay The Cathedral and the Bazaar, in which he made the case for collaborative open source development – the bazaar model – as an alternative to proprietary software development approaches, where software is carefully crafted, like cathedrals, to an immutable plan.

Mixing metaphors, I realised that the comparison between the cathedral and the bazaar can also be used to explain the previously discussed changing role of the enterprise data warehouse.

Whereas traditional approaches to analytics focused on building the EDW as the single source of the truth, and the timeless data cathedral Winston describes, today companies are focused more on taking advantage of multiple data storage and processing technologies in what would better be described as a data bazaar.

However, it is not a matter of choosing between the cathedral and the bazaar. What we are seeing is the EDW becoming part of a broader data analytics architecture, retaining the data-quality and security rules and schema applied to core enterprise data while other technologies such as Hadoop, specialist analytic appliances, and online repositories are deployed for more flexible ad hoc analytic use cases and analyzing alternative sources of data – including log and other machine-generated data.

The cathedral, in this instance, is part of the bazaar.

Managing this data bazaar is essentially what our total data concept is all about: selecting the most appropriate data storage/processing technology for a particular use case, while enabling access to any data that might be applicable to the query at hand, whether that data is structured or unstructured, whether it resides in the data warehouse, or Hadoop, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.

It is also essentially what IBM’s recently disclosed Smart Consolidation approach is all about: providing multiple technologies for operational analytics, ad hoc analytics, stream and batch processing, queryable archives, all connected by an “enterprise data hub”, and choosing the most appropriate query processing technology for the specific workload (so after polyglot programming and polyglot persistence comes polyglot analytics).

Two of my fellow database analysts, Curt Monash and Jim Kobielus, have recently been kicking around the question of what will be the “nucleus of the next-generation cloud EDW”.

While the data bazaar will rely on a core data integration/virtualization/federation hub, it seems to me that the idea that future data management architectures require a nucleus is a remnant of ‘cathedral thinking’.

Like Curt, I think it is most likely that there will be no nucleus – or, to put it another way, that each user will have a different perspective on the nucleus based on their role. For some, Hadoop will be that nucleus; for others, it will be the regional or departmental data mart. For others, it will be an ad hoc analytics database. And for some, it will remain the enterprise data warehouse.

I will be presenting more details about our total data concept and the various elements of the data bazaar at The 451 Group’s client event in London on June 27.

Information management preview of 2011

Our clients will have seen our preview of 2011 last week. For those that aren’t (yet!) clients and therefore can’t see the whole 3,500-word report, here’s the introduction, followed by the titles of the sections, to give you an idea of what we think will shape the information management market in 2011 and beyond. Of course, the IT industry, like most others, doesn’t rigorously follow the wiles of the Gregorian calendar, so some of these things will happen next year while others may not occur until 2012 and beyond. But happen they will, we believe.

We think information governance will play a more prominent role in 2011 and in the years beyond that. Specifically, we think master data management and data governance applications will appear in 2011 to replace the gaggle of spreadsheets, dashboards and scorecards commonly used today. Beyond that, we think information governance will evolve in the coming years, kick-started by end users who are asking for a more coherent way to manage their data, driven in part by their experience with the reactive and often chaotic nature of e-discovery.

In e-discovery itself, we expect to see a twin-track adoption trend: cloud-based products have proven popular, while at the same time more enterprises are buying e-discovery appliances.

‘Big data’ has become a bit of a catchall term to describe the masses of information being generated, but in 2011 we expect to see a shift to what we term a ‘total data’ approach to data management, as well as to the analytics applications and tools that enable users to generate business intelligence from their big data sets. Deeper down, the tools used in this process will include new BI tools that exploit Hadoop, and we expect a push in predictive analytics beyond the statisticians and into finance, marketing and sales departments.

SharePoint 2010 may have come out in the year for which it is named, but its use will become truly widespread in 2011 as the first service pack is released and the ISV community around it completes its updates from SharePoint 2007. However, we don’t think cloud-based SharePoint will grow quite as fast as some people may expect. Finally, in the Web content management (WCM) market – so affected by SharePoint, as well as the open source movement – we expect a stratification between the everyday WCM-type scenario and Web experience management (WEM) for those organizations that need to tie WCM, Web analytics, online marketing and commerce features together.

  • Governance family reunion: Information governance, meet governance, risk and compliance; meet data governance….
  • Master data management, data quality, data integration: the road to data governance
  • E-discovery post price war: affordable enough, or still too strategic to risk?
  • Data management – big, bigger, biggest
  • Putting the BI into big data in Hadoop
  • The business of predictive analytics
  • SharePoint 2010 gets real in 2011
  • WCM, WEM and stratification

And with that we’d like to wish all readers of Too Much Information a happy holiday season and a healthy and successful 2011.