Sizing the big data problem: ‘big data’ is the problem

Big data has been one of the big topics of the year in terms of client queries coming into The 451 Group, and one of the recurring questions (especially from vendors and investors) has been: “how big is the big data market?”

The only way to answer that is to ask another question: “what do you mean by ‘big data’?” We have mentioned before that the term is ill-defined, so it is essential to work out what an individual means when they use the term.

In our experience they usually mean one of two things:

  • Big data as a subset of overall data: specific volumes or classes of data that cannot be processed or analyzed by traditional approaches
  • Big data as a superset of the entire data management market, driven by the ever-increasing volume and complexity of data

Our perspective is that big data, if it means anything at all, represents a subset of overall data. However, it is not one that can be measurably defined by the size of the data volume. Specifically, as we recently articulated, we believe:

    “Big data is a term applied to data sets that are large, complex and dynamic (or a combination thereof) and for which there is a requirement to capture, manage and process the data set in its entirety, such that it is not possible to process the data using traditional software tools and analytic techniques within tolerable time frames.”

The confusion around the term big data also partly explains why we introduced the term “total data” to refer to a broader approach to data management: managing the storage and processing of all available data to deliver the necessary business intelligence.

The distinction is clearly important when it comes to sizing the potential opportunity. I recently came across a report from one of the big banks that put a figure on what it referred to as the “big data market”. However, it had used the superset definition.

The result was therefore not a calculation of the big data market but of the total data management sector, since the approach taken was to add together the revenue estimates for all data management technologies, traditional and non-traditional (although the method is in itself too simplistic for us to endorse the end result).

Specifically, the bank had added up current market estimates for database software, storage and servers for databases, BI and analytics software, data integration, master data management, text analytics, database-related cloud revenue, complex event processing and NoSQL databases.
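The superset arithmetic can be sketched in a few lines of Python. The segment names follow the list above, but every figure here is a hypothetical placeholder, not a number from the bank's report:

```python
# Sketch of the superset approach: summing revenue estimates for every
# data management segment. All dollar figures are hypothetical placeholders.
segment_estimates_bn = {
    "database software": 25.0,
    "storage and servers for databases": 30.0,
    "BI and analytics software": 10.0,
    "data integration": 4.0,
    "master data management": 1.5,
    "text analytics": 1.0,
    "database-related cloud revenue": 0.5,
    "complex event processing": 0.3,
    "NoSQL databases": 0.2,
}

superset_total = sum(segment_estimates_bn.values())
print(f"'Big data market' (superset definition): ${superset_total:.1f}bn")
```

Whatever the inputs, the output of such a sum is a figure for the whole data management sector, not for the big data subset.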

In comparison, the big data market is clearly a lot smaller, and represents a subset of revenue from traditional and non-traditional data management technologies, with a leaning towards the non-traditional technologies.

It is important to note, however, that big data cannot be measurably defined by the technology used to store and process it. As we have recently seen, not every use case for Hadoop or a NoSQL database – for example – involves big data.

Clearly this is a market that is a lot smaller than the one calculated by the bank, and the calculation required is a lot more complicated. We know, for example, that Teradata generated revenue of $489m in its third quarter. How much of that was attributable to big data?

Answering that requires a stricter definition of big data than anyone currently uses. But as we have noted above, ‘big data’ cannot be defined by data volume, or by the technology used to store or process it.

There’s a lot of talk about the “big data problem”. The biggest problem with big data, however, is that the term has not been – and arguably cannot be – defined in any measurable way.

How big is the big data market? You may as well ask “how long is a piece of string?”

If we are to understand the opportunity for storing and processing big data sets then the industry needs to get much more specific about what it is that is being stored and processed, and what we are using to store and process it.

Google Trends: Hadoop versus Big Data versus MapReduce

I was just looking through some slides handed to me by Cloudera’s John Kreisa and Mike Olson during our meeting last week and one of them jumped out at me.

It contains a graphic showing a Google Trends result for searches for Hadoop and “Big Data”. See for yourself why the graphic stood out:

In case you hadn’t guessed, Hadoop is in blue and “Big Data” is in red.

Even taken with a pinch of salt, it’s a huge validation of the level of interest in Hadoop. Removing the quotes to search for Big Data (red) doesn’t change the overall picture.

See also Hadoop (blue) versus MapReduce (red).

UPDATE: eBay’s Oliver Ratzesberger puts the comparisons above in perspective somewhat by comparing Joomla vs Hadoop.

Webinar: navigating the changing landscape of open source databases

When we published our 2008 report on the impact of open source on the database market the overall conclusion was that adoption had been widespread but shallow.

Since then we’ve seen increased adoption of open source software, as well as the acquisition of MySQL by Oracle. Perhaps the most significant shift in the market since early 2008 has been the explosion in the number of open source database and data management projects, including the various NoSQL data stores, and of course Hadoop and its associated projects.

On Tuesday, November 9, 2010 at 11:00 am EST I’ll be joining Robin Schumacher, Director of Product Strategy at EnterpriseDB, to present a webinar on navigating the changing landscape of open source databases.

Among the topics to be discussed are:

  • the needs of organizations with hybrid, mixed-workload environments
  • how to choose the right tool for the job
  • the involvement of user corporations (for better or for worse) in open source projects today

You can find further details about the event and register here.