‘Big Data’ Survival Guide: A 10-step guide to surviving the ‘big data’ deluge

Earlier today I presented a ‘Big Data’ Survival Guide at our HCTSEU event in London. The presentation was in effect a 10-step guide to surviving the ‘big data’ deluge.

Here’s a taster of what was discussed:

1. There’s no such thing as “big” data.
Or, more to the point: The problem is not “big” data – it’s more data. The increased use of interactive applications and websites – as well as sensors, meters and other data-generating machines – has increased the volume, velocity and variety of data to store and process.

2. ‘Big Data’ has the potential to revolutionize the IT industry.
Here we are talking less about the three Vs of big data and more about ‘big data’ as a concept, which describes the realization of greater business intelligence by storing, processing and analyzing that increased volume, velocity and variety of data. It can be summed up by the statement from Google’s paper The Unreasonable Effectiveness of Data that “Simple models and a lot of data trump more elaborate models based on less data.”

3. Never use the term ‘big data’ when ‘data’ will do.
“Big Data” is nearing/at/over the hype peak. Be cautious about how you use it. “Big Data” and technologies like Hadoop will eventually become subsumed into the fabric of the IT industry and will simply become part of the way we do business.

4. (It’s not how big it is) It’s what you do with it that counts.
Generating value from data is about more than just the volume, variety, and velocity of data. The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements. That is the essence of our Total Data management concept, which builds on the three Vs to also assess Totality, Exploration, Frequency and Dependency, which can be explained via:

5. All data has potential value.
Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.

6. You may have to search for it.
Exploration: The interest in exploratory analytic approaches, in which the schema is defined in response to the nature of the query, rather than in advance.
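
The schema-on-read idea behind exploratory analytics can be illustrated with a small sketch: raw records are stored untouched, and a schema is projected onto them only when a query is posed. The record fields and query below are hypothetical, chosen purely for illustration:

```python
import json

# Raw events are stored as-is; no table schema is declared up front.
raw_events = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/checkout", "ms": 340, "cart_total": 99.5}',
    '{"user": "alice", "page": "/checkout", "ms": 280, "cart_total": 42.0}',
]

def query(records, fields, where=lambda r: True):
    """Project a schema (a list of fields) onto raw records at query time."""
    rows = []
    for line in records:
        record = json.loads(line)
        if where(record):
            # Missing fields become None rather than causing a load-time error.
            rows.append({f: record.get(f) for f in fields})
    return rows

# The 'schema' (user, cart_total) is defined by this query, not by the store.
checkouts = query(raw_events, ["user", "cart_total"],
                  where=lambda r: r.get("page") == "/checkout")
```

The point of the sketch is that a different question simply projects a different schema onto the same raw store, which is the flexibility exploratory analysis depends on.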

7. Time is of the essence.
Frequency: The desire to increase the rate of analysis to generate more accurate and timely business intelligence.

8. Make the most of what you have.
Dependency: The need to balance investment in existing technologies and skills with the adoption of new techniques.

9. Choose the right tool for the job.
There is no shortcut to determining which is the best technology to deploy for a particular workload. Several companies have developed their own approaches to solving this problem, which do provide some general guidance.

10. If your data is “big” the way you manage it should be “total”.
Everything I talked about in the presentation, including examples from eBay, Orbitz, Expedia, Vestas Wind Systems, and Disney (and several others) that I did not have space to address in this post, is included in our Total Data report. It examines the trends behind ‘big data’, explains the new and existing technologies used to store, process and deliver value from data, and outlines a Total Data management approach focused on selecting the most appropriate data storage and processing technology to deliver value from big data.

Previewing data management and analytics in 2012

451 Research yesterday announced that it has published its 2012 Previews report, an all-encompassing report highlighting the most disruptive and significant trends that our analysts expect to dominate and drive the enterprise IT industry agenda over the coming year.

The 93-page report provides an outlook and assessment across all 451 Research technology sectors and practice areas – including software infrastructure, cloud enablement, hosting, security, datacenter technologies, hardware, information management, mobility, networking and eco-efficient IT – with input from our team of 40+ analysts. The 2012 Previews report is available upon request here.

IM research director Simon Robinson has already provided a taster of our predictions as they relate to the information-centric landscape. Below I have outlined some of our core predictions related to the data-centric ecosystem:

The overall trend predicted for 2012 could best be described as a shift in focus from the volume, velocity and variety of data to delivering value. Our concept of Total Data reflects the path from the volume, velocity and variety of information sources to the all-important endgame of deriving value from data. We expect to see increased interest in data integration and analytics technologies and approaches designed specifically to exploit the potential benefits of ‘big data’, as well as mainstream adoption of Hadoop and other new data management technologies.

We also anticipate, and are beginning to see, increased focus on technologies that enable access to data in different storage platforms without requiring data movement. We believe there is an emerging role for what we are calling the ‘data hub’ – an independent platform that is responsible for managing access to data on the various data storage and processing technologies.

Increased understanding of the value of analytics will also increase interest in the integration of analytics into operational applications. Embedded analytics is nothing new, but has the potential to achieve mainstream adoption this year as the dominant purveyors of applications used to run operations are increasingly focused on serving up embedded analytics as a key component within their product portfolios. Equally importantly, many of them now have database platforms capable of uniting previously disparate technologies to deliver true embedded analysis.

There has been a growing recognition over the past year or so that any type of data management project – whether focused on master data management (MDM), data or application integration, or data quality – needs to bring real benefits to business processes. Some may see this assertion as obvious, and the goal as pretty easy to achieve, but that’s not necessarily the case. However, it is likely to become more so over the next 12-18 months as companies realize that a process-driven approach to most data management programs makes sense and vendors deliver capabilities to meet this demand.

While ‘big data’ presents a number of opportunities, it also poses many challenges, not the least of which is the lack of developers, managers, analysts and scientists with analytics skills. The users and investors placing a bet on the opportunities offered by new data management products are unlikely to be laughing if it turns out that they cannot employ people to deploy, manage and run those products, or analysts to make sense of the data they produce. It is not surprising, therefore, that the vendors that supply those technologies are investing in ensuring that there is a competent workforce to support existing and new projects.

Finally, while cloud computing may be one of the technology industry’s hot topics, it has had relatively little impact on the data management sector to date. That is not to say that databases are not available on cloud computing platforms, but we must make a distinction between databases that are deployed in public clouds, and ‘cloud databases’ that have the potential to fulfil the role of emerging databases in building private and hybrid clouds. The former have been available for many years. The latter are just beginning to come to fruition based on NoSQL databases, as well as a new breed of NewSQL relational databases, designed to meet the performance, scalability and flexibility needs of large-scale data processing.

451 Research clients can get more details of these specific predictions via our 2012 preview – Information Management, Part 2. Non-clients can apply for trial access at the same link, while the entire 2012 Previews report is available here.

Also, mark your diaries for a webinar discussing report highlights on Thursday Feb 9 at noon ET, which will be open for clients and non-clients to attend. Registration details to follow soon…

Previewing Information Management in 2012

Every New Year affords us the opportunity to dust down our collective crystal balls and predict what we think will be the key trends and technologies dominating our respective coverage areas over the coming 12 months. We at 451 Research just published our 2012 Preview report; at almost 100 pages it’s a monster, but it offers some great insights across twelve technology subsectors, spanning from managed hosting and the future of cloud to the emergence of software-defined networking and solid state storage, and everything in between. The report is available to both 451 Research clients and non-clients (in return for a few details); access the landing page here. There’s a press release of highlights here. Also, mark your diaries for a webinar discussing report highlights on Thursday Feb 9 at noon ET, which will be open for clients and non-clients to attend. Registration details to follow soon…

Here are a selection of key takeaways from the first part of the Information Management preview, which focuses on information governance, ediscovery, search, collaboration and file sharing. (Matt Aslett will be posting highlights of part 2, which focuses more on data management and analytics, shortly.)

  • One of the most obvious common themes that will continue to influence technology spending decisions in the coming year is the impact of continued explosive data and information growth. This continues to shape new legal frameworks and technology stacks around information governance and e-discovery, as well as to drive a new breed of applications growing up around what we term the ‘Total Data’ landscape.
  • Data volumes and distributed data drive the need for more automation, and auto-classification capabilities will continue to emerge more successfully in the e-discovery, information governance and data protection veins — indeed, we expect to see more intersection between these, as we noted in a recent post.
  • The maturing of the cloud model – especially as it relates to file sharing and collaboration, but also from a more structured database perspective – will drive new opportunities and challenges for IT professionals in the coming year.  Looks like 2012 may be the year of ‘Dropbox for the enterprise.’
  • One of the big emerging issues that rose to the fore in 2011, and is bound to get more attention as the New Year proceeds, is around the dearth of IT and business skills in some of these areas, without which the industry at large will struggle to harness and truly exploit the attendant opportunities.
  • The changes in information management in recent years have encouraged (or forced) collaboration between IT departments, as well as between IT and other functions. Although this highlights that many of the issues here are as much about people and processes as they are about technology, the organizations able to leap ahead in 2012 will be those that can most effectively manage the interaction of all three.
  • We also see more movement of underlying information management infrastructures into the applications arena.  This is true with search-based applications, as well as in the Web-experience management vein, which moves beyond pure Web content management.  And while Microsoft SharePoint continues to gain adoption as a base layer of content-management infrastructure, there is also growth in the ISV community that can extend SharePoint into different areas at the application-level.

There is a lot more in the report about proposed changes in the e-discovery arena, advances in the cloud, enterprise search, and the impact of mobile devices and bring-your-own-device policies on information management.

Who said you can’t go home again?

Every new year represents some change; the hope of new challenges and opportunities. It is not all that often that a fresh new year also brings such literal and fundamental change, as it has for me this year. I ended 2011 on the vendor-side of things – and I am starting 2012 on the analyst side.

Of course, this is highly familiar ground for me. I was not only an analyst with 451 Research previously, but I have also been a demi-analyst of sorts through my blogging and other non-traditional marketing activities with both SugarCRM and Basho Technologies.

Coming back to 451 Research is exciting for many reasons: this has always been a great team of highly intelligent individuals with great vision, and the type of analysis here is right up my alley.

In that vein, I wanted to give a heads up around the kinds of technology innovation I plan to make my area of focus. I will cover, as I did in my first go-round here, core CRM, ERP and other packaged applications. But the world of applications is changing, rapidly and in fascinating ways. I will also cover how social media (and other data sets) are influencing how developers build applications – and how end users interact with them. Also, I see the cloud and platform-as-a-service creating new and exciting application choices for businesses of all sizes. PaaS means many things to many people, but I believe we will see even more PaaS development around enterprise apps in the coming months.

As noted above, data in all its forms and sources is changing how we approach business. We have moved from leaving most of our enterprise data out of the applications we use daily to thinking about “Total Data” in just a few short years. This is an exciting area of technology development, and how data analysis plays into modern apps will be a focus. I am excited about working with the likes of Matt Aslett and other team members on this research.

I am also excited to be working with Kathleen Reidy around how technologies such as enterprise search, text analytics, and collaboration/content management tools are shaping new concepts like the “social enterprise.”

Mobile apps – in the business sense – have taken much more of a front seat since I last covered applications, so I will try to keep on top of mobile as well. And again, this will be a collaborative effort, augmenting the existing strong mobile coverage here.

To sum up…essentially, if it lies at the top of the stack, and is indicative of “cool new tech” – I will probably be interested.

I look forward to speaking with some new, old and familiar technology providers. A lot has changed in the five years since I last wore an analyst’s hat. But following this change from the vendor-side has given me an interesting angle. I hope my research and ideas offered through 451 Research’s many outlets reflect this in a positive and valuable manner for our ever-growing audience.

Our Total Data report is now totally available

…and it’s totally awesome.

Data volumes are exploding. Enterprises need better techniques to analyze, for example, IT management data or customer behavior statistics. The term ‘big data’ has emerged to describe new data management challenges posed by the growing volume, variety and velocity of data being produced by interactive applications and websites, as well as sensors, meters and other data-generating machines.

Our term ‘Total Data’ denotes a broad approach to data management that makes use of all available data, regardless of where it resides, to improve the efficiency and accuracy of business intelligence.

Total Data describes how users are deploying specialist data management technologies to maximize the benefit from individual operational or analytic workloads, while avoiding the creation of data silos by applying a unified approach to management that enables efficient data movement and integration.

This report examines the trends behind big data, as well as the new and existing technologies used to store and process this data, and outlines a Total Data management approach that is focused on selecting the most appropriate data storage and processing technology to deliver value from big data.

For more details of our Total Data report, and how to get it, see this page.

Valeriy Lobanovskyi: soccer manager… big data visionary

The increased focus on the value of data, combined with the recent release of Moneyball, has focused much attention on Oakland Athletics general manager Billy Beane and his successful use of data to improve performance.

Beane was by no means the first to realize the potential use of data in sports, however. That title could arguably go to Valeriy Lobanovskyi, manager of the Dynamo Kyiv soccer team between 1974 and 1990.

Lobanovskyi’s name is unlikely to be well known to even the most ardent football fans, but our research into Total Football as an inspiration for our total data concept has highlighted the fact that Lobanovskyi was as much a big data visionary as he was a footballing visionary.

Total football is most readily associated with Rinus Michels and his teams: Ajax of Amsterdam, Barcelona, and the Dutch national side of the 1970s. But while Michels was busy winning Dutch league titles and European Cups, Lobanovskyi was similarly busy at Dynamo Kyiv, winning the Soviet League eight times, the Ukrainian league five times, and the European Cup Winners’ Cup twice with an approach known as Universality.

Describing the concept of Universality, Lobanovskyi once stated that “the most important thing in football is what a player is doing on a pitch when he is not in possession of the ball.”

Total football devotees will recognize the description, and as Hortonworks co-founder Arun C Murthy recently noted, Lobanovskyi arguably deserves as much credit as Michels for coming up with what would eventually become known as total football.

So far, so football visionary. What separates Lobanovskyi from Michels is the fact that he based much of his vision on data, and the analysis of data. Originally trained as an engineer, Lobanovskyi saw the potential value of a scientific, data-led approach to sport.

Together with statistician Anatoliy Zelentsov, Lobanovskyi devised a method of recording and analyzing the events and actions in a game of football and using it to provide players with a statistical analysis of their performance and set targets designed to meet the style he wanted the team to play (squeezing, pressing, or combination).
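
In spirit, their method amounted to tallying in-game actions per player and comparing the counts against quotas set by the chosen playing style. The sketch below is a loose, hypothetical reconstruction of that idea; the action names, player names and target numbers are illustrative, not the actual metrics Lobanovskyi and Zelentsov used:

```python
# Hypothetical reconstruction: tally match events per player and check
# them against the quotas for a 'pressing' game plan.
match_events = [
    ("blokhin", "press"), ("blokhin", "press"), ("blokhin", "tackle"),
    ("veremeyev", "press"), ("veremeyev", "pass"), ("blokhin", "press"),
]

# Targets for a pressing style (illustrative values only).
targets = {"press": 3, "tackle": 1}

def rate_player(player, events, targets):
    """Return True if the player met every targeted action quota."""
    counts = {}
    for who, action in events:
        if who == player:
            counts[action] = counts.get(action, 0) + 1
    return all(counts.get(action, 0) >= quota
               for action, quota in targets.items())
```

Crude as it is, a tally like this captures the shift Lobanovskyi made: judging players against measurable, style-specific targets rather than impressions.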

“All life,” Lobanovskyi once said, “is a number”.

An example of Lobanovskyi and Zelentsov’s targets, as explained in Inverting the Pyramid: A History of Football Tactics, by Jonathan Wilson, is displayed below:

To put this in some context, Lobanovskyi was using statistics and data as a means of gaining competitive advantage in sport 20 years before the formation of Opta Sports and Prozone, and almost 30 years before Beane and the 2002 Oakland Athletics.

Clients can read more about Total Football, and our description of approaches to data management in an era of ‘big data’, in our Total Data report, to be released in the coming days.

VC funding for Hadoop and NoSQL tops $350m

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but below is a snapshot of the report, along with a graphic representation of the recent up-tick in funding.

According to our figures, between the beginning of 2008 and the end of 2010 $95.8m had been invested in the various Apache Hadoop- and NoSQL-related vendors. That figure now stands at more than $350.8m, up 266%.

That statistic does not really do justice to the sudden uptick of interest, however. The figures indicate that funding for Apache Hadoop- and NoSQL-related firms has more than doubled since the end of August, at which point the total stood at $157.5m.
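
The growth figures quoted above can be checked with a couple of lines of arithmetic (all amounts in $m, taken directly from the text):

```python
# Sanity-check the funding figures quoted above (all in $m).
start_2008_to_2010 = 95.8   # invested between the start of 2008 and end of 2010
end_of_august = 157.5       # running total at the end of August
latest_total = 350.8        # running total at the time of writing

# Growth from the 2008-2010 base to the latest total.
growth_pct = (latest_total - start_2008_to_2010) / start_2008_to_2010 * 100
print(round(growth_pct))    # ~266, matching the 'up 266%' figure

# The total has indeed more than doubled since the end of August.
print(latest_total / end_of_august > 2)
```
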

A substantial reason for that huge jump is the staggering $84m series A funding round raised by Apache Hadoop-based analytics service provider Opera Solutions.

The original commercial supporter of Apache Hadoop, Cloudera, has also contributed strongly with a recent $40m series D round. In addition, MapR Technologies raised $20m to invest in its Apache Hadoop distribution, while we know that Hortonworks also raised a substantial round (unconfirmed, but reportedly $20m) from Benchmark Capital and former parent Yahoo as it was spun off in June. Index Ventures also recently announced that it has become an investor in Hortonworks.

I am reliably informed that if you factor in Hortonworks’ two undisclosed rounds, the total funding for Hadoop and NoSQL vendors is actually closer to $400m.

The various NoSQL database providers have also played a part in the recent burst of investment, with 10gen raising a $20m series D round and Couchbase raising $15m. DataStax, which has interests in both Apache Cassandra and Apache Hadoop, raised an $11m series B round, while Neo Technology raised a $10.6m series A round. Basho Technologies raised $12.5m in series D funding in three chunks during 2011.

Additionally, there are a variety of associated players, including Hadoop-based analytics providers such as Datameer, Karmasphere and Zettaset, as well as hosted NoSQL firms such as MongoLab, MongoHQ and Cloudant.

One investor company name that crops up more than most in the list above is Accel Partners, which was an original investor in both Cloudera and Couchbase, and backed Opera Solutions via its Accel-KKR joint venture with Kohlberg Kravis Roberts.

It appears that those investments have merely whetted Accel’s appetite for big data, however, as the firm last week announced a $100m Big Data Fund to invest in new businesses targeting storage, data management and analytics, as well as data-centric applications and tools.

While Accel is the first VC shop that we are aware of to create a fund specifically for big data investments, we are confident both that it won’t be the last and that other VCs have already informally earmarked funds for data-related investments.

451 clients can get more details on funding and M&A involving more traditional database vendors, as well as our perspective on potential M&A suitors for the Hadoop and NoSQL players.

What is the point of Hadoop?

Among the many calls we have fielded from users, investors and vendors about Apache Hadoop, the most common underlying question we hear could be paraphrased ‘what is the point of Hadoop?’.

It is a more fundamental question than ‘what analytic workloads is Hadoop used for’ and really gets to the heart of uncovering why businesses are deploying or considering deploying Apache Hadoop. Our research suggests there are three core roles:

– Big data storage: Hadoop as a system for storing large, unstructured data sets
– Big data integration: Hadoop as a data ingestion/ETL layer
– Big data analytics: Hadoop as a platform for new exploratory analytic applications
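
The integration/ETL role can be sketched in the style of a Hadoop Streaming job: a mapper parses raw log lines into structured key/value pairs, and a reducer aggregates them before the results are loaded elsewhere. The log format below is hypothetical, and the map/shuffle/reduce pipeline is simulated locally; on a real cluster each function would read from stdin and write to stdout:

```python
def mapper(line):
    """Parse one raw access-log line into (page, bytes) pairs."""
    parts = line.split()
    if len(parts) == 3:               # expected format: timestamp page bytes
        _, page, size = parts
        yield page, int(size)

def reducer(key, values):
    """Sum bytes served per page."""
    yield key, sum(values)

def run_job(lines):
    """Simulate the map -> shuffle -> reduce pipeline locally."""
    shuffled = {}
    for line in lines:
        for key, value in mapper(line):
            shuffled.setdefault(key, []).append(value)
    return dict(kv for key in sorted(shuffled)
                   for kv in reducer(key, shuffled[key]))

raw_logs = [
    "2011-10-01T00:00:01 /home 512",
    "2011-10-01T00:00:02 /checkout 2048",
    "2011-10-01T00:00:03 /home 256",
]
print(run_job(raw_logs))   # {'/checkout': 2048, '/home': 768}
```

The same shape – parse in the map phase, aggregate in the reduce phase – is what makes Hadoop useful as a pre-processing layer in front of a data warehouse.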

While much of the attention on Apache Hadoop use cases focuses on the innovative new analytic applications it has enabled in this latter role, thanks to its high-profile adoption at Web properties, for more traditional enterprises and later adopters the first two, more mundane, roles are more likely to be the trigger for initial adoption. Indeed, there are some good examples of these three roles representing an adoption continuum.

We also see the multiple roles playing out at a vendor level, with regards to strategies for Hadoop-related products. Oracle’s Big Data Appliance (451 coverage), for example, is focused very specifically on Apache Hadoop as a pre-processing layer for data to be analyzed in Oracle Database.

While Oracle focuses on Hadoop’s ETL role, it is no surprise that the other major incumbent vendors showing interest in Hadoop can be grouped into three main areas:

– Storage vendors
– Existing database/integration vendors
– Business intelligence/analytics vendors

The impact of these roles on vendor and user adoption plans will be reflected in my presentation at Hadoop World in November, The Blind Men and the Elephant.

You can help shape this presentation, and our ongoing research into Hadoop adoption drivers and trends, by taking our survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

Our big data/total data survey is now live

The 451 Group is conducting a survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

Created in conjunction with TheInfoPro, a division of The 451 Group focused on real-world perspectives on the IT customer, the survey contains fewer than 20 questions and does not ask for details of specific projects. It covers data volumes and complexity, as well as attitudes to emerging data management technologies – such as Hadoop and exploratory analytics, NoSQL and NewSQL – for certain workloads.

In return for your participation, you will receive a copy of a forthcoming long-format report introducing Total Data, The 451 Group’s concept for explaining the changing data management landscape, which will include the results. Respondents will also have the opportunity to become members of TheInfoPro’s peer network.

The survey is expected to close in late October, and we also plan to provide a snapshot of the results in our presentation, The Blind Men and The Elephant, at Hadoop World in early November.

Many thanks in advance for your participation in this survey. We look forward to sharing the results with you. The survey can be found at http://bit.ly/451data

NoSQL Road Show, Hadoop Tuesdays and Hadoop World

I’ll be taking our data management research out on the road in the next few months with a number of events, webinars and presentations.

On October 12 I’m taking part in the NoSQL Road Show Amsterdam, with Basho, Trifork and Erlang Solutions, where I’ll be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

The following week, October 18, I’m taking part in the Hadoop Tuesdays series of webinars, presented by Cloudera and Informatica, specifically talking about the Hadoop Ecosystem.

The Apache Hadoop ecosystem will again be the focus of attention on November 8 and 9, when I’ll be in New York for Hadoop World, presenting The Blind Men and the Elephant.

Then it’s back to NoSQL with two more stops on the NoSQL Road Show, in London on November 29 and Stockholm on December 1, where I’ll once again be presenting NoSQL, NewSQL, Big Data…Total Data – The Future of Enterprise Data Management.

I hope you can join us for at least one of these events, and am looking forward to learning a lot about NoSQL and Apache Hadoop adoption, interest and concerns.