The Data Day: July 20, 2018

The ever-expanding data science and analytics landscape. And more

And that’s the data day, today.

7 Hadoop questions. Q7: Hadoop’s role

What is the point of Hadoop? It’s a question we’ve asked a few times on this blog, and it continues to be a significant question asked by users, investors and vendors about Apache Hadoop. That is why it is one of the major questions being asked as part of our 451 Research 2013 Hadoop survey.


As I explained during our keynote presentation at the inaugural Hadoop Summit Europe earlier this year, our research suggests there are hundreds of potential workloads that are suitable for Hadoop, but three core roles:

  • Big data storage: Hadoop as a system for storing large, unstructured, data sets
  • Big data processing/integration: Hadoop as a data ingestion/ETL layer
  • Big data analytics: Hadoop as a platform for new exploratory analytic applications

And we’re not the only ones that see it that way. This blog from Cloudera CTO Amr Awadallah outlines three very similar, if differently-named use-cases (Transformation, Active Archive, and Exploration).

In fact, as I also explained during the Hadoop Summit keynote, we see these three roles as a process of maturing adoption, starting with low cost storage, moving on to high-performance data aggregation/ingestion, and finally exploratory analytics.


As such it is interesting to view the current results of our Hadoop survey, which show that the highest proportion of respondents (63%) have implemented or plan to implement Hadoop for data analytics, followed by 48% for data integration and 43% for data storage.

This would suggest that our respondents include some significantly early Hadoop adopters. I look forward to properly analysing the results to see what they can tell us, but in the meantime it is interesting to note that the percentage of respondents using Hadoop for analytics is significantly higher among those that adopted Hadoop prior to 2012 (88%) compared to those that adopted in 2012 or 2013 (65%).

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

What is the point of Hadoop?

Among the many calls we have fielded from users, investors and vendors about Apache Hadoop, the most common underlying question we hear could be paraphrased ‘what is the point of Hadoop?’.

It is a more fundamental question than ‘what analytic workloads is Hadoop used for’ and really gets to the heart of uncovering why businesses are deploying or considering deploying Apache Hadoop. Our research suggests there are three core roles:

– Big data storage: Hadoop as a system for storing large, unstructured, data sets
– Big data integration: Hadoop as a data ingestion/ETL layer
– Big data analytics: Hadoop as a platform for new exploratory analytic applications

While much of the attention for Apache Hadoop use-cases focuses on the innovative new analytic applications it has enabled in this latter role thanks to its high-profile adoption at Web properties, for more traditional enterprises and later adopters the first two, more mundane, roles are more likely the trigger for initial adoption. Indeed there are some good examples of these three roles representing an adoption continuum.

We also see the multiple roles playing out at a vendor level, with regards to strategies for Hadoop-related products. Oracle’s Big Data Appliance (451 coverage), for example, is focused very specifically on Apache Hadoop as a pre-processing layer for data to be analyzed in Oracle Database.

While Oracle focuses on Hadoop’s ETL role, it is no surprise that the other major incumbent vendors showing interest in Hadoop can be grouped into three main areas:

– Storage vendors
– Existing database/integration vendors
– Business intelligence/analytics vendors

The impact of these roles on vendor and user adoption plans will be reflected in my presentation at Hadoop World in November, the Blind Men and The Elephant.

You can help shape this presentation, and our ongoing research into Hadoop adoption drivers and trends, by taking our survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

Information management preview of 2011

Our clients will have seen our preview of 2011 last week. For those that aren’t (yet!) clients and therefore can’t see the whole 3,500-word report, here’s the introduction, followed by the titles of the sections to give you an idea of what we think will shape the information management market in 2011 and beyond. Of course the IT industry, like most others, doesn’t rigorously follow the wiles of the Gregorian calendar, so some of these things will happen next year while others may not occur till 2012 and beyond. But happen they will, we believe.

We think information governance will play a more prominent role in 2011 and in the years beyond that. Specifically, we think master data management and data governance applications will appear in 2011 to replace the gaggle of spreadsheets, dashboards and scorecards commonly used today. Beyond that, we think information governance will evolve in the coming years, kick-started by end users who are asking for a more coherent way to manage their data, driven in part by their experience with the reactive and often chaotic nature of e-discovery.

In e-discovery itself, we expect to see a twin-track adoption trend. While cloud-based products have proven popular, more enterprises are at the same time buying e-discovery appliances.

‘Big data’ has become a bit of a catchall term to describe the masses of information being generated, but in 2011 we expect to see a shift to what we term a ‘total data’ approach to data management, as well as the analytics applications and tools that enable users to generate the business intelligence from their big data sets. Deeper down, the tools used in this process will include new BI tools to exploit Hadoop, as well as a push in predictive analytics beyond the statisticians and into finance, marketing and sales departments.

SharePoint 2010 may have come out in the year for which it is named, but its use will become truly widespread in 2011 as the first service pack is released and the ISV community around it completes its updates from SharePoint 2007. However, we don’t think cloud-based SharePoint will grow quite as fast as some people may expect. Finally, in the Web content management (WCM) market – so affected by SharePoint, as well as the open source movement – we expect a stratification between the everyday WCM-type scenario and Web experience management (WEM) for those organizations that need to tie WCM, Web analytics, online marketing and commerce features together.

  • Governance family reunion: Information governance, meet governance, risk and compliance; meet data governance….
  • Master data management, data quality, data integration: the road to data governance
  • E-discovery post price war: affordable enough, or still too strategic to risk?
  • Data management – big, bigger, biggest
  • Putting the BI into big data in Hadoop
  • The business of predictive analytics
  • SharePoint 2010 gets real in 2011
  • WCM, WEM and stratification

And with that we’d like to wish all readers of Too Much Information a happy holiday season and a healthy and successful 2011.

Data as a natural energy source

A number of analogies have arisen in recent years to describe the importance of data and its role in shaping new business models and business strategies. Among these is the concept of the “data factory”, recently highlighted by Abhi Mehta of Bank of America to describe businesses that have realized that their greatest asset is data.

WalMart, Google and Facebook are good examples of data factories, according to Mehta, who is working to ensure that BofA joins the list as the first data factory for financial services.

Mehta lists three key concepts that are central to building a data factory:

  • Believe that your core asset is data
  • Be able to automate the data pipeline
  • Know how to monetize your data assets

The idea of the data factory is useful in describing the organizations that we see driving the adoption of new data management and analytics concepts (Mehta has also referred to this as the “birth of the next industrial revolution”) but it has some less useful connotations.

In particular, the focus on data as something that is produced or manufactured encourages the obsession with data volume and capacity that has put the Big in Big Data.

Size isn’t everything, and the ability to store vast amounts of data is only really impressive if you also have the ability to process and analyze that data and gain valuable business insight from it.

While the focus in 2010 has been on Big Data, we expect the focus to shift in 2011 towards big data analytics. While the data factory concept describes what these organizations are, it does not describe what it is that they do to gain analytical insight from their data.

Another analogy that has been kicking around for a few years is the idea of data as the new oil. There are a number of parallels that can be drawn between oil and gas companies exploring the landscape in search of pockets of crude, and businesses exploring their data landscape in search of pockets of useable data.

A good example of this is eBay’s Singularity platform for deep contextual analysis, one use of which was to combine transactional data from the company’s data warehouse with behavioural data on its buyers and sellers, enabling identification of top sellers and driving increased revenue from those sellers.

By exploring information from multiple sources in a single platform the company was able to gain a better perspective over its data than would be possible using data sampling techniques, revealing a pocket of data that could be used to improve business performance.

However, exploring data within the organization is only scratching the surface of what eBay has achieved. The real secret to eBay’s success has been in harnessing that retail data in the first place.

This is a concept I have begun to explore recently in the context of turning data into products. It occurs to me that the companies that represent the most success in this regard are those that are not producing data, but harnessing naturally occurring information streams to capture the raw data that can be turned into usable data via analytics.

There is perhaps no greater example of this than Facebook, now home to over 500 million people using it to communicate, share information and photos, and join groups. While Facebook is often cited as an example of new data production, that description is inaccurate.

Consider what these 500 million people did before Facebook. The answer, of course, is that they communicated, shared information and photos, and joined groups. The real genius of Facebook is that it harnesses a naturally occurring information stream and accelerates it.

Natural sources of data are everywhere, from the retail data that has been harnessed by the likes of eBay and Amazon, to the Internet search data that has been harnessed by Google, but also the data being produced by the millions of sensors in manufacturing facilities, data centres and office buildings around the world.

Harnessing that data is the first problem to solve; applying data analytics techniques to it, automating the data pipeline, and knowing how to monetize the data assets complete the picture.

Mike Loukides of O’Reilly recently noted: “the future belongs to companies and people that turn data into products.” The companies and people that stand to gain the most are those who treat data not as something to be produced and stockpiled, but as a natural energy source to be processed and analyzed.