The Data Day, A few days: November 15-21 2013

Storage and big data: what went wrong? And more

And that’s the data day, today.

7 Hadoop questions. Q7: Hadoop’s role

What is the point of Hadoop? It’s a question we’ve asked a few times on this blog, and one that continues to be asked by users, investors and vendors about Apache Hadoop. That is why it is one of the major questions being asked as part of our 451 Research 2013 Hadoop survey.


As I explained during our keynote presentation at the inaugural Hadoop Summit Europe earlier this year, our research suggests there are hundreds of potential workloads that are suitable for Hadoop, but three core roles:

  • Big data storage: Hadoop as a system for storing large, unstructured data sets
  • Big data processing/integration: Hadoop as a data ingestion/ETL layer
  • Big data analytics: Hadoop as a platform for new exploratory analytic applications

And we’re not the only ones that see it that way. This blog from Cloudera CTO Amr Awadallah outlines three very similar, if differently named, use cases (Transformation, Active Archive, and Exploration).

In fact, as I also explained during the Hadoop Summit keynote, we see these three roles as a process of maturing adoption, starting with low cost storage, moving on to high-performance data aggregation/ingestion, and finally exploratory analytics.
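To make the middle, ingestion/ETL role a little more concrete, below is a minimal, illustrative sketch of a Hadoop Streaming-style mapper (Hadoop Streaming runs any executable that reads records on stdin and writes key/value pairs to stdout). The log layout and field names are hypothetical, used purely for illustration rather than drawn from any surveyed deployment.

```python
#!/usr/bin/env python
# Minimal sketch of Hadoop in its ingestion/ETL role, written as a
# Hadoop Streaming mapper: read raw, semi-structured log lines from
# stdin, clean and restructure them, and emit tab-separated records
# for downstream storage or analytic jobs. The three-field layout
# below is a hypothetical example.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) < 3:
        continue  # discard malformed lines at ingestion time
    timestamp, user_id, url = fields[0], fields[1], fields[2]
    # Normalise values before handing the record on to the analytic layer
    print("\t".join([timestamp, user_id.lower(), url.rstrip("/")]))
```

In practice a script like this would typically be submitted via the hadoop-streaming jar, with both the raw input and the cleaned output living in HDFS, which is how the storage, integration and analytics roles end up layered on the same cluster.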


As such it is interesting to view the current results of our Hadoop survey, which show that the highest proportion of respondents have implemented or plan to implement Hadoop for data analytics (63%), followed by data integration (48%) and data storage (43%).

This would suggest that our respondents include a significant number of early Hadoop adopters. I look forward to properly analysing the results to see what they can tell us, but in the meantime it is interesting to note that the percentage of respondents using Hadoop for analytics is significantly higher among those that adopted Hadoop prior to 2012 (88%) than among those that adopted it in 2012 or 2013 (65%).

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

What is the point of Hadoop?

Among the many calls we have fielded from users, investors and vendors about Apache Hadoop, the most common underlying question we hear could be paraphrased as ‘what is the point of Hadoop?’.

It is a more fundamental question than ‘what analytic workloads is Hadoop used for’ and really gets to the heart of uncovering why businesses are deploying or considering deploying Apache Hadoop. Our research suggests there are three core roles:

– Big data storage: Hadoop as a system for storing large, unstructured data sets
– Big data integration: Hadoop as a data ingestion/ETL layer
– Big data analytics: Hadoop as a platform for new exploratory analytic applications

While much of the attention paid to Apache Hadoop use cases focuses on the innovative new analytic applications it has enabled in this latter role, thanks to its high-profile adoption at Web properties, for more traditional enterprises and later adopters the first two, more mundane, roles are more likely to be the trigger for initial adoption. Indeed, there are some good examples of these three roles representing an adoption continuum.

We also see the multiple roles playing out at a vendor level, with regards to strategies for Hadoop-related products. Oracle’s Big Data Appliance (451 coverage), for example, is focused very specifically on Apache Hadoop as a pre-processing layer for data to be analyzed in Oracle Database.

While Oracle focuses on Hadoop’s ETL role, it is no surprise that the other major incumbent vendors showing interest in Hadoop can be grouped into three main areas that mirror the three roles above:

– Storage vendors
– Existing database/integration vendors
– Business intelligence/analytics vendors

The impact of these roles on vendor and user adoption plans will be reflected in my presentation at Hadoop World in November, ‘The Blind Men and the Elephant’.

You can help shape this presentation, and our ongoing research into Hadoop adoption drivers and trends, by taking our survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

Upcoming presentation on virtualization and storage

I’m going to be presenting the introductory session at a BrightTalk virtual conference on March 25 on the role and impact of the virtual server revolution on the storage infrastructure. Although it’s been evident for some time that the emergence of server virtualization has had — and continues to have — a meaningful impact on the storage world, the sheer pace of change here makes this a worthwhile topic to revisit. As the first presenter of the event — the conference runs all day — it’s my job to set the scene; as well as introducing the topic within the context of the challenges that IT and storage managers face, I’ll outline a few issues that will hopefully serve as discussion points throughout the day.

Deciding which issues to focus on is actually a lot harder than it sounds (I only have 45 minutes) because, when you start digging into it, the impact of virtualization on storage is profound on just about every level: performance, capacity (and, more importantly, capacity utilization), data protection and reliability, and management.

I’ll aim to touch on as many of these points as time allows, as well as provide some thoughts on the questions that IT and storage managers should be asking when considering how to improve their storage infrastructure to get the most out of an increasingly virtualized datacenter.

The idea is to make this a thought-provoking and interactive session. Register for the live presentation here: http://www.brighttalk.com/webcast/6907.  After registering you will receive a confirmation email as well as a 24-hour reminder email.  As a live attendee you will be able to interact with me by posing questions which I will be able to answer on air.  If you are unable to watch live, the presentation will remain available via the link above for on-demand participation.

Bridging the “storage-information” gap

When Nick first unveiled this blog last month he rightly noted ‘storage’ as one of the many categories that fall into a capacious bucket we term ‘information management.’ With this in mind he reminded me that it would be appropriate for the 451 Group’s storage research team to contribute to the debate, so here it is!

For the uninitiated, storage can appear to be either a bit of a black hole, or just a lot of spinning rust, so I’m not going to start with a storage 101 (although if you have a 451 password you can peruse our recent research here). Suffice to say that storage is just one element of the information management infrastructure, but its role is certainly evolving.

Storage systems and associated software traditionally have provided applications and users with the data they need, when they need it, along with the required levels of protection. Clearly, storage has had to become smarter (not to mention cheaper) to deal with issues like data growth; technologies such as data deduplication help firms grapple with the “too much” part of information management. But up until now the lines of demarcation between “storage” (and data management) and “information” management have been fairly clear. Even though larger “portfolio” vendors such as EMC and IBM have feet in both camps, the reality is that such products and services are organized, managed and sold separately.

That said, there’s no doubt these worlds are coming together. The issues we as analysts are grappling with relate to where and why this is taking place, how it manifests itself, the role of technology, and the impact of this on vendor, investor and end-user strategies. At the very least there is a demand for technologies that help organizations bridge the gap – and the juxtaposition – between the fairly closeted, back-end storage “silo” and the more, shall we say, liberated, front-end interface where information meets its consumers.

Here, a number of competing forces are challenging, even forcing, organizations to become smarter about understanding what “information” they have in their storage infrastructure: data retention vs. data disposition, regulated vs. unregulated data, and public vs. private data being just three. Armed with such intelligence, firms can, in theory, make better decisions about how (and how long) data is stored, protected, retained and made available to support changing business requirements.

“Hang on a minute,” I hear you cry. “Isn’t this what Information Lifecycle Management (ILM) was supposed to be about?” Well, yes, I’m afraid it was. And one thing that covering the storage industry for almost a decade has taught me is that it moves at a glacial pace. In the case of ILM, the iceberg has probably lapped it by now. The hows and whys of ILM’s failure to capture the imagination of the industry are probably best left for another day, but I believe that at least one aim of ILM – helping organizations better understand their data so it can better support the business – still makes perfect sense.

What we are now seeing is the emergence of some real business drivers that are compelling a variety of stakeholders – from CIOs to General Counsel – to take an active interest in better understanding their data. This, in turn, is driving industry consolidation as larger vendors in particular move to fill out their product portfolios; the latest example of this is the news of HP’s acquisition of Australia-based records management specialist Tower Software. Over the next few weeks I’ll be exploring in more detail three areas where we think this storage-information gap is being bridged: eDiscovery, archiving and security. Stay tuned for our deeper thoughts and perspectives in this fast-moving space.