The Data Day, Two days: August 27/28 2012

Citrusleaf. Aerospike. AlchemyDB. Sqrrl. Percolator. Dremel. Pregel. And more.

And that’s the Data Day, today.

Hadoop is dead. Long live Hadoop.

GigaOM published an interesting article over the weekend, written by Cloudant’s Mike Miller, about why the days are numbered for Hadoop as we know it.

Miller argues that while Google’s MapReduce and file system research inspired the rise of the Apache Hadoop project, Google’s subsequent research into areas such as incremental indexing, ad hoc analytics and graph analysis is likely to inspire the next generation of data management technologies.

We’ve made similar observations ourselves but would caution against assuming, as some people appear to have done, that implementations of Google’s Percolator, Dremel and Pregel projects are likely to lead to Hadoop’s demise. Hadoop’s days are not numbered. Just Hadoop as we know it.

Miller makes this point himself when he writes “it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology.”

As we noted in our 2011 Total Data report:

“it may be that we see more success for distributed data processing technologies that extend beyond Hadoop’s batch processing focus… Advances in the next generation of Hadoop delivered in the 0.23 release will actually enable some of these frameworks to run on the HDFS, alongside or in place of MapReduce.”

With the ongoing development of that 0.23 release (now known as Apache Hadoop 2.0) we are beginning to see that process in action. Hadoop 2.0 includes the delivery of the much-anticipated MapReduce 2.0 (also known as YARN, or NextGen MapReduce). Whatever you choose to call it, it is a new architecture that splits the JobTracker into its two major functions: resource management and application lifecycle management. The result is that multiple versions of MapReduce can run in the same cluster, and that MapReduce becomes one of several frameworks that can run on the Hadoop Distributed File System.
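To make that split concrete, here is a minimal configuration sketch showing how a Hadoop 2.0 cluster points at YARN’s central ResourceManager and tells MapReduce to run as just one framework among others on YARN; the hostname is a placeholder and exact property names and defaults may vary across releases and distributions:

```xml
<!-- yarn-site.xml: the ResourceManager takes over cluster-wide
     resource management, one of the JobTracker's former roles -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value> <!-- placeholder hostname -->
  </property>
</configuration>

<!-- mapred-site.xml: MapReduce becomes one pluggable framework on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

The JobTracker’s other former role, per-job lifecycle management, moves into a per-application ApplicationMaster that YARN launches for each job, which is what allows frameworks other than MapReduce to share the same cluster and the same data in HDFS.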

The first of these is Apache HAMA, a bulk synchronous parallel computing framework for scientific computations, but we will also see other frameworks supported by Hadoop (thanks to Arun C Murthy for pointing to two of them), and we fully expect the likes of incremental indexing, ad hoc analytics and graph analysis to be among them.

As we added in Total Data:

“This supports the concept recently raised by Apache Hadoop creator Doug Cutting that what we currently call ‘Hadoop’ could perhaps be thought of as a set of replaceable components in a wider distributed data processing ecosystem… the definition of Hadoop might therefore evolve over time to encompass some of the technologies that could currently be seen as potential alternatives…”

The future of Hadoop is… Hadoop.