Hadoop’s potential to revolutionize the IT industry

Platfora’s CEO Ben Werther recently wrote a great post explaining the benefits of Apache Hadoop and its potential to play a major role in a modern-day equivalent of the industrial revolution.

Ben highlights one of the important aspects of our Total Data concept: that generating value from data is about more than just the volume, variety and velocity of ‘big data’; it is also about the way in which users want to interact with their data.

“What has changed – the heart of the ‘big data’ shift – is only peripherally about the volume of data. Companies are realizing that there is surprising value locked up in their data, but in unanticipated ways that will only emerge down the road.”

He also rightly points out that while Hadoop provides what is fast becoming the platform of choice for storing all of this data, from an industrial-revolution perspective we are still reliant on the equivalent of expert blacksmiths to make sense of all the data.

“Since every company of any scale is going to need to leverage big data, as an industry we either need to train up hundreds of thousands of expert blacksmiths (aka data scientists) or find a way into the industrialized world (aka better tools and technology that dramatically lower the bar to harnessing big data).”

This is a point that Cloudera CEO Mike Olson has been making in recent months. As he stated during his presentation at last month’s OSBC: “we need to see a new class of applications that exploit the benefits and architecture of Hadoop.”

There has been a tremendous amount of effort in the past 12-18 months to integrate Hadoop into the existing data management landscape, via the development of uni- and bi-directional connectors and translators that enable the co-existence of Hadoop with existing relational and non-relational databases and SQL analytics and reporting tools.

This is extremely valuable – especially for enterprises with a heavy investment in SQL tools and skills. As Larry Feinsmith, Managing Director, Office of the CIO, JPMorgan Chase pointed out at last year’s Hadoop World: “it is vitally important that new big data tools integrate with existing products and tools”.

This is why ‘dependency’ (on existing tools/skills) is an integral element of the Total Data concept alongside totality, exploration and frequency.

However, this integration of Hadoop into the established data management market really only gets the industry so far, and in doing so maintains the SQL-centric view of the world that has dominated for decades.

As Ben suggests, the true start of the ‘industrial revolution’ will begin with the delivery of tools that are specifically designed to take advantage of Hadoop and other technologies and that bring the benefits of big data to the masses.

We are just beginning to see the delivery of these tools, and to think beyond the SQL-centric perspective, with analytics approaches specifically designed to take advantage of MapReduce and/or the Hadoop Distributed File System. Even this, though, signals only the end of the beginning of the revolution.
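To make the MapReduce programming model mentioned above concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to the canonical word-count problem. Hadoop distributes these phases across a cluster; the function names and sample documents here are invented purely for illustration.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key (Hadoop's framework does
    # this between the map and reduce phases).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big tools", "data tools for big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The point of the model is that each phase is trivially parallelizable: mappers run independently per input split, and reducers run independently per key.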

‘Big data’ describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies.

The true impact of ‘big data’ will only be realized once people and companies begin to change their behavior, using the greater business intelligence gained from tools specifically designed to exploit the benefits and architecture of Hadoop and other emerging data-processing technologies to alter business processes and practices.

Forthcoming Webinar: Real World Success from Big Data

The initial focus of ‘big data’ has been about its increasing volume, velocity and variety — the “three Vs” — with little mention of real world application. Now is the time to get down to business.

On Wednesday, May 30, at 9am PT I’ll be taking part in a webinar with Splunk to discuss real world successes with ‘big data’.

451 Research believes that in order to deliver value from ‘big data’, businesses need to look beyond the nature of the data and re-assess the technologies, processes and policies they use to engage with that data.

I will outline 451 Research’s ‘total data’ concept for delivering business value from ‘big data’, providing examples of how companies are seeking agile new data management technologies, business strategies and analytical approaches to turn the “three Vs” of data into actionable operational intelligence.

I’ll be joined by Sanjay Mehta, Vice President of Product Marketing at Splunk, which was founded specifically to focus on the opportunity of effectively getting value from massive and ever-changing amounts of machine-generated data, one of the fastest-growing and most complex segments of ‘big data’.

Sanjay will share big data achievements from three Splunk customers, Groupon, Intuit and CenturyLink. Using Splunk, these companies are turning massive volumes of unstructured and semi-structured machine data into powerful insights.

Register here.

The Data Day, Today: May 8 2012

IBM acquires Vivisimo. Funding for Birst, ParAccel, Metamarkets and DataSift. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* For 451 Research clients

# IBM picks up Vivisimo to search for value in ‘big data’ Deal Analysis

# Teradata delivers on analytic cloud vision with Active Data Warehouse Private Cloud Impact Report

# The Big Blue picture for ‘big data’ analytics: IBM sheds light on BigSheets Impact Report

# Oversight Systems’ Continuous Analysis extracts actionable insight from data Impact Report

# Kalido updates MDM offering with business users, operationalizing master data in mind Impact Report

# Delphix reaps reward from agile approach to database virtualization Impact Report

# Automated Insights looks to pitch narrative, visuals and stats to enterprises Impact Report

# myDIALS eyes indirect sales in quest to be Internet access layer for analytics Impact Report

* IBM Advances Big Data Analytics with Acquisition of Vivisimo Also announces support for Cloudera.

* Teradata Announces 2012 First Quarter Results Revenue up 21% (PDF)

* Actuate Reports First Quarter 2012 Financial Results Revenue up 9% (PDF)

* Birst Secures $26 Million in Financing Led By Sequoia Capital

* ParAccel Closes Record Q1 Revenues and $20 Million Investment Round

* Metamarkets Raises $15 Million to Deliver Data Science-as-a-Service

* DataSift adds $7.2M: The story so far and focus for the future

* Teradata to Acquire eCircle (PDF)

* Google BigQuery brings Big Data analytics to all businesses

* TIBCO Spotfire Brings the Power of Data Discovery to Big Data and Extreme Information

* Jaspersoft Teams with VMware To Deliver Business Intelligence for Data-Driven Cloud Applications

* Kalido and Teradata Sign Global Reseller Agreement

* Actuate Announces Cloudera Alliance to Support Apache Hadoop and BIRT Developers in Big Data Integration

* Hortonworks and Kognitio Announce Technical Partnership Driving Apache Hadoop Adoption in Big Data Analytics Implementations

* Tokutek and PalominoDB Partner to Bring Scale, Performance to Database Deployments

* Acunu is pleased to announce v2 of the Acunu Data Platform!

* Is Yahoo really threatening memcached and Open Compute?

* Introducing Zend DBi as a MySQL Replacement on IBM i

* Zettaset and Hyve Solutions Build First Fully Integrated Enterprise OS Hadoop Solution

* Cloudera Announces New Japanese Subsidiary

* Bull Announces the Formation of Database Migration Business Unit

* Couchbase to Run Native with Key-Value API for ioMemory

* The Big Data Value Continuum

* Big Data is Business Intelligence plus Attention Deficit Disorder

* Nokia released Dempsy an open source stream data processing platform.

And that’s the Data Day, today.

‘Big Data’ Survival Guide: A 10-step guide to surviving the ‘big data’ deluge

Earlier today I presented a ‘Big Data’ Survival Guide at our HCTSEU event in London. The presentation was in effect a 10-step guide to surviving the ‘big data’ deluge.

Here’s a taster of what was discussed:

1. There’s no such thing as “big” data.
Or, more to the point: The problem is not “big” data – it’s more data. The increased use of interactive applications and websites – as well as sensors, meters and other data-generating machines – has increased the volume, velocity and variety of data to store and process.

2. ‘Big Data’ has the potential to revolutionize the IT industry.
Here we are talking less about the three Vs of big data and more about ‘big data’ as a concept, which describes the realization of greater business intelligence by storing, processing and analyzing that increased volume, velocity and variety of data. It can be summed up by the statement from Google’s The Unreasonable Effectiveness of Data that “Simple models and a lot of data trump more elaborate models based on less data”.

3. Never use the term ‘big data’ when ‘data’ will do.
“Big Data” is nearing/at/over the hype peak. Be cautious about how you use it. “Big Data” and technologies like Hadoop will eventually become subsumed into the fabric of the IT industry and will simply become part of the way we do business.

4. (It’s not how big it is) It’s what you do with it that counts.
Generating value from data is about more than just the volume, variety, and velocity of data. The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements. That is the essence of our Total Data management concept, which builds on the three Vs to also assess Totality, Exploration, Frequency and Dependency, which can be explained via:

5. All data has potential value.
Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.
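The value of totality over sampling is easy to demonstrate with skewed data, where a naive sample can miss exactly the surprising records the post argues are locked up in companies’ data. The figures below are invented for illustration.

```python
# 990 small transactions plus 10 rare, very large ones.
transactions = [10] * 990 + [5000] * 10

# Sample-and-extrapolate: a naive 10% sample happens to contain
# only small transactions, so scaling it up badly underestimates.
sample = transactions[:100]
estimated_total = sum(sample) * 10   # extrapolated estimate

# Totality: scan the data in its entirety for the exact answer.
actual_total = sum(transactions)
```

Here the extrapolated estimate is 10,000 against an actual total of 59,900; the rare high-value records dominate the answer but never appear in the sample.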

6. You may have to search for it.
Exploration: The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.
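This schema-on-read idea can be sketched in a few lines: raw records are stored as-is, and each query imposes only the structure it needs at read time, rather than forcing every record into a fixed schema up front as a traditional warehouse would. The field names and log records below are invented for illustration.

```python
import json

# Raw records stored without a fixed schema (here, JSON strings).
raw_log = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "purchase", "amount": 9.99}',
    '{"user": "a", "event": "click", "ms": 85}',
]

def query(raw_records, fields, where=lambda r: True):
    # The 'schema' is simply the field list the caller asks for,
    # applied at read time in response to the query.
    for line in raw_records:
        record = json.loads(line)
        if where(record):
            yield tuple(record.get(f) for f in fields)

# One query cares about click latency...
clicks = list(query(raw_log, ["user", "ms"], lambda r: r["event"] == "click"))
# ...another, written later, cares about revenue; no reload or
# re-modelling of the stored data is needed.
sales = list(query(raw_log, ["user", "amount"], lambda r: r["event"] == "purchase"))
```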

7. Time is of the essence.
Frequency: The desire to increase the rate of analysis to generate more accurate and timely business intelligence.

8. Make the most of what you have.
Dependency: The need to balance investment in existing technologies and skills with the adoption of new techniques.

9. Choose the right tool for the job.
There is no shortcut to determining which is the best technology to deploy for a particular workload. Several companies have developed their own approaches to solving this problem, which does provide some general guidance.

10. If your data is “big” the way you manage it should be “total”.
Everything I talked about in the presentation, including examples from eBay, Orbitz, Expedia, Vestas Wind Systems, and Disney (and several others) that I did not have space to address in this post, is included in our Total Data report. It examines the trends behind ‘big data’, explains the new and existing technologies used to store and process and deliver value from data, and outlines a Total Data management approach focused on selecting the most appropriate data storage and processing technology to deliver value from big data.

The Data Day, Today: Mar 22 2012

Oracle reports Q3. EMC acquires Pivotal Labs. ClearStory launches. And much, much more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Oracle Reports Q3 GAAP EPS Up 20% to 49 Cents; Q3 Non-GAAP EPS Up 15% to 62 Cents Database and middleware revenue up 10%.

* EMC Goes Social, Open and Agile With Big Data EMC acquires Pivotal Labs, plans to release Chorus as an open source project

* ClearStory Data Launches With Investment From Google Ventures, Andreessen Horowitz and Khosla Ventures

* HP Lead Big Data Exec Chris Lynch Resigns

* Hortonworks Names Ari Zilka Chief Products Officer

* DataStax Enterprise 2.0 Adds Enterprise Search Capabilities to Smart Big Data Platform

* MapR Unveils Most Comprehensive Data Connection Options for Hadoop

* New Web-Based Alpine Illuminator Integrates with EMC Greenplum Chorus, The Social Data Science Platform

* RainStor and IBM InfoSphere BigInsights to Address Growing Big Data Challenges

* IBM Introduces New Predictive Analytics Services and Software to Reduce Fraud, Manage Financial Performance and Deliver Next Best Action

* Datameer Releases Major New Version of Analytics Platform

* Kognitio Announces Formation of “Kognitio Cloud” Business Unit

* HStreaming Announces Free Community Edition of Its Real-Time Analytics Platform for Hadoop

* Talend and MapR Announce Certification of Big Data Integration and Big Data Quality

* Schooner Information Technology Releases Membrain 4.0

* Gazzang Launches Big Data Encryption and Key Management Platform

* Logicworks Solves Big Data Hosting Challenges With New Infrastructure Services for Hadoop

* “Big Data” Among Most Confusing Tech Buzzwords

* For 451 Research clients

# Infochimps launches Chef-based platform for Hadoop deployment Impact Report

# Big-data security, or SIEM buzzword parity? Spotlight report

# DataStax adds enterprise search and elastic reprovisioning to database platform Market Development report

# With a new CEO and IBM as a reseller, Revolution Analytics charts next growth phase Market Development report

# Cray branches out, offering storage and a ‘big data’ appliance Market Development report

# CodeFutures sees a future beyond database sharding Market Development report

# Third time lucky for ScaleOut StateServer 5.0? Market Development report

# Attunity looks to 2012 for turnaround; up to the cloud and ‘big data’ movement Market Development report

# Panorama rides Microsoft’s coattails into in-memory social BI using SQL Server 2012 Market Development report

And that’s the Data Day, today.

Updated: sizing the big data problem: ‘big data’ is *still* the problem

In late 2010 I published a post discussing the problems associated with trying to size the ‘big data’ market, given the lack of clarity about the definition of the term and the technologies to which it applies.

In that post we discussed a 2010 Bank of America Merrill Lynch report that estimated that ‘big data’ represented a total addressable market worth $64bn. This week Wikibon estimated that the big data market currently stands at just over $5bn in factory revenue, growing to over $50bn by 2017, while Deloitte estimated that industry revenues will likely be in the range of $1-1.5bn this year.

To put that in perspective: Bank of America Merrill Lynch estimated the total addressable market for ‘big data’ in 2010 at $64bn; Wikibon estimates the ‘big data’ market in 2012 at just over $5bn; and Deloitte estimates the ‘big data’ market in 2012 at $1-1.5bn.

UPDATE – IDC has become the first of the big analyst firms to break out its big data abacuses (abaci?). IDC puts the ‘big data’ market in 2010 at $3.2bn.

Not surprisingly they came to their numbers by different means. BoA added up market estimates for database software, storage and servers for databases, BI and analytics software, data integration, master data management, text analytics, database-related cloud revenue, complex event processing and NoSQL databases.

Wikibon came to its estimate by adding up revenue associated with a select group of technologies and a select group of vendors, while Deloitte added up revenue estimates for database, ERP and BI software, reduced the total by 90% to reflect the proportion of data warehouses with more than five terabytes of data, and reduced that total by 80-85% to reflect the low level of current adoption.

IDC, meanwhile, went through a slightly tortuous route of defining the market based on the volume of data collected, OR deployments of ultra-high-speed messaging technology, OR rapidly growing data sets, AND the use of scale-out architecture, AND the use of two or more data types OR high-speed data sources.

There is something to be said for each of these definitions. But equally each can be easily dismissed. We previously described our issues with the all-inclusive nature of the BoA numbers, and while we find Wikibon’s process much more agreeable, some of the individual numbers they have come up with are highly questionable. Deloitte’s methodology is surreal, but defensible. IDC’s just illustrates the problem.

What this highlights is that the essential problem is the lack of definition for ‘big data’. As we stated in 2010: “The biggest problem with ‘big data’… is that the term has not been – and arguably cannot be – defined in any measurable way. How big is the ‘big data’ market? You may as well ask ‘how long is a piece of string?'”

The Data Day, Today: Jan 27 2012

Amazon launches AWS Storage Gateway. Postgres Plus Cloud Server. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Amazon Web Services Announces AWS Storage Gateway to Connect Enterprise Data with the Cloud

* EnterpriseDB Announces Availability of Postgres Plus Cloud Database

* Big VCs Invest In Big Data Startup Continuuity

* At Davos, Discussions of a Global Data Deluge

* Zimory Names New Head of zimory®scale; the Cloud Database Elasticity Division

* Jaspersoft’s Java Reporting Engine Integrated with Cloud Foundry

* IBM Debuts New Analytics Appliance to Help Retailers Transform Big Data Into Business Opportunities

* The Mass Technology Leadership Council published its report on big data and analytics.

* Apache HBase 0.92.0 has been released

* Is Security An Afterthought For NoSQL?

* What’s the big deal about Big Data?

* Hadoop Summit 2012 Announced to Showcase Apache Hadoop as Next Generation Enterprise Data Platform

* Announcing BigCouch 0.4

* Microsoft’s plan for Hadoop and big data

* Google Goes MoreSQL With Tenzing – SQL Over MapReduce

* Seismic Data Science: Reflection Seismology and Hadoop

* GoodData Posts Record-Breaking 600% Year-Over-Year Revenue Growth In 2011

* For 451 Research clients

# 2012 M&A Outlook – Software Assessing the runners and riders for M&A and IPOs in 2012

# RJMetrics scores $1.2m debt funding, sets out SaaS BI stall Impact report

* Google News Search outlier of the day: Pork Tenderloin: A Healthy Eating Hero

And that’s the Data Day, today.

Previewing data management and analytics in 2012

451 Research yesterday announced that it has published its 2012 Previews report, an all-encompassing report highlighting the most disruptive and significant trends that our analysts expect to dominate and drive the enterprise IT industry agenda over the coming year.

The 93-page report provides an outlook and assessment across all 451 Research technology sectors and practice areas – including software infrastructure, cloud enablement, hosting, security, datacenter technologies, hardware, information management, mobility, networking and eco-efficient IT – with input from our team of 40+ analysts. The 2012 Previews report is available upon request here.

IM research director Simon Robinson has already provided a taster of our predictions as they relate to the information-centric landscape. Below I have outlined some of our core predictions related to the data-centric ecosystem:

The overall trend predicted for 2012 could best be described as the shifting focus from volume, velocity and variety to delivering value. Our concept of Total Data reflects the path from the volume, velocity and variety of information sources to the all-important endgame of deriving value from data. We expect to see increased interest in data integration and analytics technologies and approaches designed specifically to exploit the potential benefits of ‘big data’, and mainstream adoption of Hadoop and other new sources of data.

We also anticipate, and are beginning to see, an increased focus on technologies that enable access to data in different storage platforms without requiring data movement. We believe there is an emerging role for what we are calling the ‘data hub’: an independent platform that is responsible for managing access to data across the various data storage and processing technologies.
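The ‘data hub’ role described above can be sketched as a thin mediation layer that routes requests to whichever storage platform holds the data, so that applications need not know about, or copy data between, the underlying stores. The class and method names below are invented for illustration, not taken from any product.

```python
class DataHub:
    """Hypothetical hub mediating access to multiple data stores."""

    def __init__(self):
        self._stores = {}

    def register(self, name, store):
        # 'store' is anything with a get(key) method: a relational
        # database wrapper, an HDFS client, a key-value store, etc.
        self._stores[name] = store

    def get(self, store_name, key):
        # Access is mediated by the hub; no data moves between stores.
        return self._stores[store_name].get(key)

class DictStore:
    # In-memory stand-in for a real storage backend.
    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data.get(key)

hub = DataHub()
hub.register("warehouse", DictStore({"q1_revenue": 1000}))
hub.register("hadoop", DictStore({"clickstream_rows": 5000000}))

revenue = hub.get("warehouse", "q1_revenue")
```

The design point is that adding a new storage platform means registering one more adapter with the hub, rather than rewiring every application that consumes the data.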

Increased understanding of the value of analytics will also increase interest in the integration of analytics into operational applications. Embedded analytics is nothing new, but has the potential to achieve mainstream adoption this year as the dominant purveyors of applications used to run operations are increasingly focused on serving up embedded analytics as a key component within their product portfolios. Equally importantly, many of them now have database platforms capable of uniting previously disparate technologies to deliver true embedded analysis.

There has been a growing recognition over the past year or so that any type of data management project – whether focused on master data management (MDM), data or application integration, or data quality – needs to bring real benefits to business processes. Some may see this assertion as obvious and pretty easy to achieve, but that’s not necessarily the case. However, it is likely to become more so in the next 12-18 months as companies realize a process-driven approach to most data management programs makes sense and vendors deliver capabilities to meet this demand.

While ‘big data’ presents a number of opportunities, it also poses many challenges, not the least of which is the lack of developers, managers, analysts and scientists with analytics skills. The users and investors placing a bet on the opportunities offered by new data management products are unlikely to be laughing if it turns out that they cannot employ people to deploy, manage and run those products, or analysts to make sense of the data they produce. It is not surprising, therefore, that the vendors supplying those technologies are investing in ensuring that there is a competent workforce to support existing and new projects.

Finally, while cloud computing may be one of the technology industry’s hot topics, it has had relatively little impact on the data management sector to date. That is not to say that databases are not available on cloud computing platforms, but we must make a distinction between databases that are deployed in public clouds and ‘cloud databases’ that have the potential to fulfill the role of emerging databases in building private and hybrid clouds. The former have been available for many years. The latter are just beginning to come to fruition, based on NoSQL databases as well as a new breed of NewSQL relational databases designed to meet the performance, scalability and flexibility needs of large-scale data processing.

451 Research clients can get more details of these specific predictions via our 2012 preview – Information Management, Part 2. Non-clients can apply for trial access at the same link, while the entire 2012 Previews report is available here.

Also, mark your diaries for a webinar discussing report highlights on Thursday Feb 9 at noon ET, which will be open for clients and non-clients to attend. Registration details to follow soon…

Previewing Information Management in 2012

Every New Year affords us the opportunity to dust down our collective crystal balls and predict what we think will be the key trends and technologies dominating our respective coverage areas over the coming 12 months. We at 451 Research just published our 2012 Previews report; at almost 100 pages it’s a monster, but it offers some great insights across twelve technology subsectors, spanning from managed hosting and the future of cloud to the emergence of software-defined networking and solid-state storage, and everything in between. The report is available to both 451 Research clients and non-clients (in return for a few details); access the landing page here. There’s a press release of highlights here. Also, mark your diaries for a webinar discussing report highlights on Thursday Feb 9 at noon ET, which will be open for clients and non-clients to attend. Registration details to follow soon…

Here are a selection of key takeaways from the first part of the Information Management preview, which focuses on information governance, ediscovery, search, collaboration and file sharing. (Matt Aslett will be posting highlights of part 2, which focuses more on data management and analytics, shortly.)

  • One of the most obvious common themes that will continue to influence technology spending decisions in the coming year is the impact of continued explosive data and information growth. This continues to shape new legal frameworks and technology stacks around information governance and e-discovery, as well as to drive a new breed of applications growing up around what we term the ‘Total Data’ landscape.
  • Data volumes and distributed data drive the need for more automation; auto-classification capabilities will continue to emerge more successfully in e-discovery, information governance and data protection. Indeed, we expect to see more intersection between these areas, as we noted in a recent post.
  • The maturing of the cloud model – especially as it relates to file sharing and collaboration, but also from a more structured database perspective – will drive new opportunities and challenges for IT professionals in the coming year. It looks like 2012 may be the year of ‘Dropbox for the enterprise.’
  • One of the big emerging issues that rose to the fore in 2011, and is bound to get more attention as the New Year proceeds, is the dearth of IT and business skills in some of these areas, without which the industry at large will struggle to harness and truly exploit the attendant opportunities.
  • The changes in information management in recent years have encouraged (or forced) collaboration between IT departments, as well as between IT and other functions. Although this highlights that many of the issues here are as much about people and processes as they are about technology, the organizations able to leap ahead in 2012 will be those that can most effectively manage the interaction of all three.
  • We also see more movement of underlying information management infrastructures into the applications arena. This is true of search-based applications, as well as of Web-experience management, which moves beyond pure Web content management. And while Microsoft SharePoint continues to gain adoption as a base layer of content-management infrastructure, there is also growth in the ISV community that can extend SharePoint into different areas at the application level.

There is a lot more in the report about proposed changes in the e-discovery arena, advances of the cloud, enterprise search and impact of mobile devices and bring-your-device-to-work on information management.

The Data Day, Today: Jan 24 2012

Thoughts on Splunk’s IPO and DynamoDB. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Thoughts on the Splunk IPO and S-1 By Dave Kellogg.

* Thoughts on SimpleDB, DynamoDB and Cassandra By Adrian Cockcroft.

* Recommind’s Revenue Leaps 95% in Record-Setting 2011 Predictable.

* Hewlett-Packard Expands to Cambridge via Vertica’s “Big Data” Center Moving.

* Announcing SkySQL Enterprise HA for the MariaDB & MySQL databases

* Membase Server is Now Couchbase Server But not *the* Couchbase Server.

* Cloudera Teams With O’Reilly Media to Merge Hadoop World and Strata Conferences

* Survey results: How businesses are adopting and dealing with data 100 Strata Online Conference attendees.

* Big data market survey: Hadoop solutions

* LinkedIn released SenseiDB, an open source distributed, realtime, semi-structured database.

* For 451 Research clients

# VMware: not your father’s database company Impact Report

# Sparsity Technologies draws up plans for graph database adoption Impact Report

# Amazon launches DynamoDB, an auto-configuring database as a service Market Development report

# NuoDB targets Q2 release for elastic relational database Market Development report

# ADVIZOR illuminates growth strategy, roadmap in data discovery and analysis Market Development report

# Birst adds own analytic engine for BI, OEM agreement with ParAccel Market Development report

* Google News Search outlier of the day: RentAGrandma.com Recruiting Wonderful Grandmas

And that’s the Data Day, today.