7 Hadoop questions. Q1: Hadoop and the data warehouse

What is the relationship between Hadoop and the data warehouse? That’s one of the primary questions being asked in the 451 Research 2013 Hadoop survey. Through our conversations with Hadoop users to date we’ve seen that the answer differs from company to company, depending on how far along they are in their adoption.

For the most part, we see Hadoop being used for workloads that were not previously on the data warehouse, as part of a strategy of storing, processing and analyzing data that was previously ignored because it was unsuitable – either in terms of cost or data format – for analysis using a relational data warehouse.

However, we also see some companies taking advantage of the lower cost of storing data in Hadoop to offload workloads from the data warehouse, either temporarily or permanently.

And at the other end of the spectrum we also see companies in which Hadoop is being used, or at least considered at this stage, as a replacement for the data warehouse.

Which use-cases are most popular? That’s one of the things our survey is designed to find out. The early results indicate a greater preference for using Hadoop for workloads that were not previously on the data warehouse, as well as for permanently migrating some workloads from the data warehouse, but it is still early days.

While that accounts for the way in which Hadoop is being used today, it doesn’t get to the heart of the long-term potential for Hadoop in relation to the data warehouse. Therefore, the survey also asks about the long-term potential to replace the data warehouse.

Again we see a spectrum of strategies in action: some companies plan for Hadoop eventually to replace the data warehouse completely, some are moving the majority of workloads to Hadoop, others are moving a minority of workloads, and some believe Hadoop will never replace the data warehouse.

Again the early survey results are interesting, with ‘a minority of workloads will move to Hadoop’ and ‘Hadoop will never replace the data warehouse’ being the most popular answers at this early stage.

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

Forthcoming webinar on ‘big data’ and the ‘single version of the truth’

Many enterprises were persuaded to adopt enterprise data warehousing (EDW) technology to achieve a ‘single version of the truth’ for enterprise data.

In reality, those promises were rarely fulfilled, with many stories of failed, lengthy and over-budget projects. Even when an EDW project reached deployment, the warehouse schema was designed to answer a specific set of queries and was too inflexible to change and accommodate a growing variety of data.

On April 30 at 1pm ET I’ll be taking part in a webinar with NGDATA to discuss whether ‘big data’ technologies such as Hadoop, HBase and Solr can deliver on the promise of “single version of truth” by providing a real-time, 360° view of customers and products.

In this webinar, you will learn:

  • Why the inflexibility of EDWs failed to deliver a 360° view
  • How big data technologies can finally make a 360° view a reality
  • Overview of an interactive Big Data management solution
  • Best practices and success stories from leading companies

For more details, and to register, click here.

The dawn of polyglot analytics

While there has been a significant amount of interest in the volume, velocity and variety of big data (and perhaps a few other Vs depending on who you speak to), it has become increasingly clear to us that the trends driving new approaches to data management relate not just to the nature of the data itself, but also to how the user wants to interact with the data.

As we previously noted, if you turn your attention to the value of the data then you have to take into account the trend towards storing and processing all data (or at least as much as is economically feasible), and the preferred rate of query (the acceptable time taken to generate the result of a query, as well as the time between queries). Another factor to be added to the mix is the way in which the user chooses to analyze the data: are they focused on creating a data model and schema to answer pre-defined queries, or engaging in exploratory analytic approaches in which data is extracted and the schema defined in response to the nature of the query?
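
To make that contrast concrete, here is a toy sketch in Python, using a made-up log line: a warehouse-style schema fixed before any query exists, versus keeping the raw record and letting each exploratory query define the fields it needs.

```python
raw = "2013-04-02T10:15:01 user=42 action=click item=sku-9 referrer=email"

def parse(record):
    """Split a raw log line into a dict of fields (illustrative format only)."""
    ts, rest = record.split(" ", 1)
    fields = dict(kv.split("=", 1) for kv in rest.split())
    fields["ts"] = ts
    return fields

# Schema-on-write: columns are fixed up front to answer pre-defined queries;
# anything not modelled is discarded at load time.
FIXED_COLUMNS = ("ts", "user", "action")
warehouse_row = {col: parse(raw).get(col) for col in FIXED_COLUMNS}

# Schema-on-read (exploratory): keep the raw record and project out whatever
# fields this particular query happens to need.
def query(record, wanted):
    return {col: parse(record).get(col) for col in wanted}

print(warehouse_row)
print(query(raw, ["referrer", "item"]))  # fields the fixed schema would have dropped
```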

All of these factors have significant implications for which technology is chosen to store and analyze the data, and another user-driven factor is the increased desire to use specialist data management technologies depending on the specific requirement. As we noted in NoSQL, NewSQL and Beyond, in the operational database world this approach has become known as polyglot persistence. Clearly though, in the analytic database market we are talking not just about approaches to storing the data, but also analyzing it. That is why we have begun using the term ‘polyglot analytics’ to describe the adoption of multiple query-processing technologies depending on the nature of the query.
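
As a minimal illustration of what we mean by polyglot analytics, the Python sketch below (with entirely hypothetical engine names and query attributes) routes each query to a different processing technology based on the nature of the query rather than simply on where the data lives.

```python
from dataclasses import dataclass

@dataclass
class Query:
    description: str
    exploratory: bool = False     # schema defined in response to the query?
    over_raw_files: bool = False  # clickstream, logs or other un/semi-structured data?

def route(query: Query) -> str:
    """Pick a query-processing technology based on how the user wants to analyze the data."""
    if query.exploratory or query.over_raw_files:
        return "hadoop_mapreduce"   # schema-on-read, batch exploration
    return "relational_edw"         # pre-defined schema, known reports

if __name__ == "__main__":
    print(route(Query("daily revenue by region")))                           # relational_edw
    print(route(Query("sessionize raw clickstream", over_raw_files=True)))   # hadoop_mapreduce
```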

Polyglot analytics explains why we are seeing adoption of Hadoop and MapReduce as a complement to existing data warehousing deployments. It explains, for example, why a company like LinkedIn might adopt Hadoop for its People You May Know feature while retaining its investment in Aster Data for other analytic use cases. Polyglot analytics also explains why a company like eBay would retain its Teradata Enterprise Data Warehouse for storing and analyzing traditional transactional and customer data, while adopting Hadoop for storing and analyzing clickstream, user behaviour and other semi- and unstructured data. At the same time, eBay has adopted an exploratory analytic platform, based on Teradata’s Extreme Data Appliance, for extreme analytics on a combination of transaction and user behaviour data pulled from both its EDW and Hadoop deployments.
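
For illustration only, here is a toy friends-of-friends calculation in the MapReduce spirit of a People You May Know feature – counting mutual connections between members who are not yet connected. This is a generic sketch, not LinkedIn’s actual implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy connection graph (symmetric).
connections = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

# 'Map' step: each member emits every pair of their connections, since that
# pair shares the member as a mutual connection.
mutual_counts = defaultdict(int)
for member, friends in connections.items():
    for a, b in combinations(sorted(friends), 2):
        if b not in connections[a]:        # ignore pairs that are already connected
            mutual_counts[(a, b)] += 1

# 'Reduce' step: rank candidate introductions by number of mutual connections.
for (a, b), count in sorted(mutual_counts.items(), key=lambda kv: -kv[1]):
    print(f"suggest {a} <-> {b}: {count} mutual connection(s)")
```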

The emergence of this kind of exploratory analytic platform exemplifies the polyglot analytics approach of adopting a different platform based on the user’s approach to analytics rather than the nature of the data. It also highlights some of the thinking behind Teradata’s acquisition of Aster Data, IBM’s acquisition of Netezza and HP’s acquisition of Vertica, as well as the potential future role of vendors such as ParAccel and Infobright.

We are about to embark on a major survey of data management users to assess their attitudes to polyglot analytics and the drivers for adopting specific data management/analytics technologies. The results will be delivered as part of our Total Data report later this year. Stay tuned for more details on the survey in the coming weeks.

Data cloud, datastructure, and the end of the EDW

There has been a spate of reports and blog posts recently postulating about the potential demise of the enterprise data warehouse (EDW) in the light of big data and evolving approaches to data management.

There are a number of connected themes that have led the likes of Colin White and Barry Devlin to ponder the future of the EDW, and as it happens I’ll be talking about these during our 451 Client event in San Francisco on Wednesday.

While my presentation doesn’t speak directly to the future of the EDW, it does cover the trends that are driving the reconsideration of the assumption that the EDW is, and should be, the central source of business intelligence in the enterprise.

As Colin points out, this is an assumption based on historical deficiencies with alternative data sources that evolved into best practices. “Although BI and BPM applications typically process data in a data warehouse, this is only because of… issues… concerning direct access [to] business transaction data. If these issues could be resolved then there would be no need for a data warehouse.”

The massive improvements in processing performance seen since the advent of data warehousing mean that it is now more practical to process data where it resides or is generated, rather than forcing it to be held in a central data warehouse.

For example, while distributed caching was initially adopted to improve the performance of Web and financial applications, it also provides an opportunity to perform real-time analytics on application performance and user behaviour (enabling targeted ads, for example) long before the data gets anywhere near the data warehouse.
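
Here is a minimal sketch of that idea, with a made-up event format: the application tier updates lightweight counters in its caching layer as events arrive (a plain in-process dictionary stands in for a distributed cache), so a real-time signal such as a user’s top interest is available long before any ETL into the warehouse.

```python
from collections import Counter, defaultdict
from typing import Optional

page_views = Counter()                     # rolling per-page counts
interests_by_user = defaultdict(Counter)   # per-user behaviour, e.g. for ad targeting

def on_event(user_id: str, page: str, category: str) -> None:
    """Called inline as events flow through the application/caching tier."""
    page_views[page] += 1
    interests_by_user[user_id][category] += 1

def top_interest(user_id: str) -> Optional[str]:
    """Real-time signal available before the data reaches the warehouse."""
    counts = interests_by_user.get(user_id)
    return counts.most_common(1)[0][0] if counts else None

if __name__ == "__main__":
    on_event("u1", "/laptops/x200", "laptops")
    on_event("u1", "/laptops/x300", "laptops")
    on_event("u1", "/phones/p9", "phones")
    print(top_interest("u1"))   # laptops
```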

While the central EDW approach has some advantages in terms of data control, security and reliability, those advantages have always been more theoretical than practical: regional and departmental data marts are still needed, and users continue to work with local copies of data.

As we put it in last year’s Data Warehousing 2009-2013 report:

“The approach of many users now is not to stop those distributed systems from being created, but rather to ensure that they can be managed according to the same data-quality and security rules as the EDW.

With the application of cloud computing capabilities to on-premises infrastructure, users now have the promise of distributed pools of enterprise data that marry central management with distributed use and control, empowering business users to create elastic and temporary data marts without the risk of data-mart proliferation.”

The concept of the “data cloud” is nascent, but companies such as eBay are pushing in that direction, while also making use of data storage and processing technologies above and beyond traditional databases.

Hadoop is a prime example, but so too are the infrastructure components that are generating vast amounts of data that can be used by the enterprise to better understand how the infrastructure is helping or hindering the business in responding to changing demands.

For the 451 client event we have come up with the term ‘datastructure’ to describe these infrastructure elements. What is ‘datastructure’? It’s the machines that are responsible for generating machine-generated data.

While that may sound like we’ve just slapped a new label on existing technology, we believe that those data-generating machines will evolve over time to take advantage of improved available processing power with embedded data analytics capabilities.

Just as in-database analytics has enabled users to reduce data processing latency by taking the analytics to the data in the database, it seems likely that users will look to do the same for machine-generated data by taking the analytics to the data in the ‘datastructure’.
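
A small sketch of the difference, using Python’s built-in sqlite3 purely as a stand-in for an analytic database: in the first case every row is pulled into the application before the calculation is done; in the second, only the aggregated result leaves the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 1.0), ("s1", 3.0), ("s2", 2.0)])

# Client-side analytics: ship every row to the application, then compute.
grouped = {}
for sensor, value in conn.execute("SELECT sensor, value FROM readings"):
    grouped.setdefault(sensor, []).append(value)
client_avgs = {s: sum(v) / len(v) for s, v in grouped.items()}

# In-database analytics: push the aggregation to the data; only the small
# result set crosses the wire.
in_db_avgs = dict(conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"))

assert client_avgs == in_db_avgs
print(in_db_avgs)   # {'s1': 2.0, 's2': 2.0}
```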

This ‘datastructure’ with embedded database and analytics capabilities therefore becomes part of the wider ‘data cloud’, alongside regional and departmental data marts and the central business application data warehouse, as well as the ability to spin up and provision virtual data marts.

As Barry Devlin puts it: “A single logical storehouse is required with both a well-defined, consistent and integrated physical core and a loose federation of data whose diversity, timeliness and even inconsistency is valued.”

Making this work will require new data cloud management capabilities, as well as an approach to data management that we have called “total data”. As we previously explained:

“Total data is about more than data volumes. It’s about taking a broad view of available data sources and processing and analytic technologies, bringing in data from multiple sources, and having the flexibility to respond to changing business requirements…

Total data involves processing any data that might be applicable to the query at hand, whether that data is structured or unstructured, and whether it resides in the data warehouse, or a distributed Hadoop file system, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.”

As for the end of the EDW, both Colin and Barry argue, and I agree, that what we are seeing does not portend the end of the EDW but rather a recognition that the EDW is a component of business intelligence, not the source of all business intelligence itself.

Going once, going twice… any more bids for Netezza?

No sooner had IBM announced its intention to acquire Netezza this morning than the New York Times came knocking for some perspective on the deal. There were two main questions: will anyone else bid for Netezza, and will someone now bid for Teradata?

While there is no guarantee of a 3Par-style bidding war, I believe Netezza has the potential to spark one. Just last week we stated that Netezza would be the prime candidate for any firm looking to make an impact in the data-warehousing sector. In a crowded market it offers the right mix of established presence, technological differentiation and growth potential.

According to the 451 Group’s recent Information Management report, Data Warehousing: 2009-2013, Netezza is the fifth-placed data warehousing vendor, albeit some distance behind the established players. The company is predicted to deliver full-year revenue of just under $250m in 2010, in the region of 10% of the data-warehousing revenue of Oracle and IBM, but easily double the revenue of the sixth-placed vendor.

We also think rivals may see some potential to beat IBM’s offer price. As my 451 colleague Brenon Daly notes, the $27 per share purchase price represents an 80% premium against where Netezza was trading a month ago, but just 10% on the previous day’s close. Additionally, IBM is paying 6.8x projected sales which, while a relatively rich valuation, is much lower than rival EMC paid for Greenplum.
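
As a back-of-envelope check on those figures, using only the numbers quoted above and the roughly $250m projected 2010 revenue mentioned earlier (share counts are not given, so the deal value is simply inferred from the stated multiple):

```python
offer_per_share = 27.00                    # IBM's offer, $ per share
price_month_ago = offer_per_share / 1.80   # 80% premium  -> ~$15.00
previous_close  = offer_per_share / 1.10   # 10% premium  -> ~$24.55
implied_deal_value_m = 6.8 * 250           # 6.8x ~$250m projected 2010 sales -> ~$1,700m

print(round(price_month_ago, 2), round(previous_close, 2), implied_deal_value_m)
```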

One of the reasons we think Netezza could spark a bidding war is that it is differentiated by its growth potential and established market share. It may not be in 3Par territory in terms of the scarcity of comparable rivals (we are tracking 20+ data warehousing providers), but if the likes of HP and Dell are looking to make a significant impact in data warehousing, Netezza is the prime candidate.

The other option would be to make a bid for Teradata, which delivers in market share what it lacks in growth. The company is the largest data warehousing specialist by a considerable margin and has repositioned its product set to improve growth, so it is no surprise to see speculation that it could be the next acquisition target.

Given Teradata’s $6.2bn market cap, potential acquirers may consider there is more value in trying to outbid IBM. Either way, IBM’s bid for Netezza may not be the last bid to acquire a data warehousing player we will see this year.

One other thing – Netezza is being advised on this deal by Qatalyst Partners. No prizes for guessing who advised 3Par. Qatalyst’s other notable advisory role? The six-week bidding war that resulted in EMC acquiring Data Domain.

Sizing the data warehousing opportunity

The data warehousing market will see a compound annual growth rate of 11.5% from 2009 through 2013 to reach a total of $13.2bn in revenues.

That is the main finding highlighted by the latest report from The 451 Group’s Information Management practice, which provides market-sizing information for the data-warehousing sector from 2009 to 2013.
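
As a quick sanity check on the headline figures – and assuming the 11.5% rate compounds over the four annual steps from 2009 to 2013 – the implied 2009 base works out at roughly $8.5bn:

```python
cagr = 0.115
revenue_2013_bn = 13.2
implied_2009_bn = revenue_2013_bn / (1 + cagr) ** 4   # four years of compounding
print(round(implied_2009_bn, 1))                      # ~8.5
```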

The report includes revenue estimates and growth projections, and examines the business and technology trends driving the market.

It was put together with the assistance of Market Monitor – the new market-sizing service from The 451 Group and Tier1 Research. Props to Greg Zwakman and Elizabeth Nelson for their number-crunching.

Among the key findings, available via the executive summary (PDF), are:

  • Four vendors dominate the data-warehouse market, with 93.6% of total revenue in 2010. These vendors are expected to retain their advantage and generate 92.2% of revenue in 2013.
  • Analytic databases are now able to take advantage of greater processor performance at a lower cost, improving price/performance and lowering barriers to entry.
  • With the application of cloud capabilities, users now have the promise of pools of enterprise data that marry central management with distributed use and control.
  • Products that take advantage of improved hardware performance will drive revenue growth for all vendors, and will protect the market share of incumbents.
  • As a result of systems performance improvements, data-warehousing vendors are also taking advantage of the opportunity to bring more advanced analytic capabilities to the DB engine.
  • Although we expect many smaller vendors to grow at a much faster rate between now and 2013, it will not be at the expense of the market’s dominant vendors.
  • While the Hadoop Core is not a direct alternative to traditional analytic DBs, the increased maturity of associated projects means that use cases for Hadoop- and MapReduce-enabled analytic DBs will overlap.

There is, of course, much more detail in the full report. 451 Group clients can download the report here, while non-clients can also use the same link to purchase the report, or request more information.

Forthcoming webinar on data warehousing

Following the recent publication of our special report, Warehouse Optimization – Ten considerations for choosing/building a data warehouse, I will be presenting an overview of some of the key findings in a webinar on December 17.

The report provides an overview of the data-warehousing vendor landscape, as tracked by The 451 Group, and examines the business and technology trends driving this market. It identifies 10 key technology trends in data warehousing and assesses how they can be used to choose the technologies and vendors that are best suited to a would-be customer and its specific application.

During the webinar I will present some details of those ten key trends and how we see consensus forming around some technologies that have previously divided the industry, enabling the conversation to move on to business-oriented issues. As the market continues to mature, differentiation among vendors will shift from a focus on specific technologies to a reflection of various business processes.

The webinar is scheduled for Thursday, December 17th, at 1 pm ET. I will present for about 30 minutes, followed by Q&A.

If you are interested you can register for the event, and download an executive summary of the report, here.

On the opportunities for cloud-based databases and data warehousing

At last year’s 451 Group client event I presented on the topic of database management trends and databases in the cloud.

At the time there was a lot of interest in cloud-based data management as Oracle and Microsoft had recently made their database management systems available on Amazon Web Services and Microsoft was about to launch the Azure platform.

In the presentation I made the distinction between online distributed databases (BigTable, HBase, Hypertable), simple data query services (SimpleDB, Microsoft SSDS as was), and relational databases in the cloud (Oracle, MySQL, SQL Server on AWS etc) and cautioned that although relational databases were being made available on cloud platforms, there were a number of issues to be overcome, such as licensing, pricing, provisioning and administration.

Since then we have seen very little activity from the major database players with regards to cloud computing (although Microsoft has evolved SQL Data Services to be a full-blown relational database as a service for the cloud, see the 451’s take on that here).

In comparison, there has been a lot more activity in the data warehousing space with regards to cloud computing. On the one hand, the data warehousing players are later to the cloud; on the other, they are more advanced, and for a couple of reasons I believe data warehousing is better suited to cloud deployments than the general-purpose database.

  • For one thing, most analytical databases are better suited to deployment in the cloud because their massively parallel architectures are a better fit for clustered and virtualized cloud environments.
  • And for another, (some) analytics applications are perhaps better suited to cloud environments since they require large amounts of data to be stored for long periods but processed infrequently.

We have therefore seen more progress from analytical than transactional database vendors this year with regards to cloud computing. Vertica Systems launched its Vertica Analytic Database for the Cloud on EC2 in May 2008 (and is working on cloud computing services from Sun and Rackspace), Aster Data followed suit with the launch of Aster nCluster Cloud Edition for Amazon and AppNexus in February this year, and February also saw Netezza partner with AppNexus on a data warehouse cloud service. The likes of Teradata and illuminate are also thinking about, if not talking about, cloud deployments.

To be clear, the early interest in cloud-based data warehousing appears to be in development and test rather than mission-critical analytics applications, although there are early adopters: ShareThis, the online information-sharing service, is up and running on Amazon Web Services’ EC2 with Aster Data; search marketing firm Didit is running nCluster Cloud Edition on AppNexus’ PrivateScale; and Sonian is using the Vertica Analytic Database for the Cloud on EC2.

Greenplum today launched its take on data warehousing in the cloud, focusing its attention initially on private cloud deployments with its Enterprise Data Cloud initiative and plans to deliver “a new vision for bringing the power of self-service to data warehousing and analytics”.

That may sound a bit woolly (and we do see the EDC as the first step towards private cloud deployments), but the plan to enable the Greenplum Database to act as a flexible pool of warehoused data from which business users will be able to provision data marts makes sense as enterprises look to replicate the potential benefits of cloud computing in their datacenters.

Functionality including self-service provisioning and elastic scalability is still to come, but version 3.3 does include online data-warehouse expansion capabilities and is available now. Greenplum also notes that it has customers using the Greenplum Database in private cloud environments, including Fox Interactive Media’s MySpace, Zions Bancorporation and Future Group.

The initiative will also focus on agile development methodologies and an ecosystem of partners, and while we were somewhat surprised by the lack of virtualization and cloud provisioning vendors involved in today’s announcement, we are told they are in the works.

In the meantime, we are confident that Greenplum’s won’t be the last announcement from a data management vendor focused on enabling private cloud computing deployments. While much of the initial attention around cloud-based data management was naturally focused on the likes of SimpleDB, the ability to deliver flexible access to, and processing of, enterprise data is more likely to be taking place behind the firewall while users consider what data and which applications are suitable for the public cloud.

Also worth mentioning while we’re on the subject is RainStor, the new cloud archive service recently launched by Clearpace Software, which enables users to retire data from legacy applications to Amazon S3 while ensuring that the data remains available for querying on an ad hoc basis using EC2. It’s an idea that resonates thanks to compliance-driven requirements for long-term data storage, combined with the cost of storing and accessing that data.
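
For illustration, a minimal sketch of the retire-to-S3 pattern using boto3; the bucket, key and file names are made up and this is not RainStor’s API – in RainStor’s case the ad hoc querying runs on EC2:

```python
import boto3

s3 = boto3.client("s3")

# Retire an extract from a legacy application to low-cost archive storage.
s3.upload_file(
    Filename="orders_2004.csv.gz",
    Bucket="example-legacy-archive",
    Key="retired/orders/2004/orders.csv.gz",
)

# Later, an ad hoc query job (e.g. running on EC2) pulls the data back down.
s3.download_file(
    Bucket="example-legacy-archive",
    Key="retired/orders/2004/orders.csv.gz",
    Filename="/tmp/orders_2004.csv.gz",
)
```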

451 Group subscribers should stay tuned for our formal take on RainStor, which should be published any day now, while I think it’s probably fair to say you can expect more of this discussion at this year’s client event.