Entries from October 2013 ↓

The Data Day, A few days: October 19-25 2013

Hadoop and Teradata go to the cloud. And more.

And that’s the data day, today.

7 Hadoop questions. Q7: Hadoop’s role

What is the point of Hadoop? It’s a question we’ve asked a few times on this blog, and one that users, investors and vendors continue to ask about Apache Hadoop. That is why it is one of the major questions being asked as part of our 451 Research 2013 Hadoop survey.

As I explained during our keynote presentation at the inaugural Hadoop Summit Europe earlier this year, our research suggests there are hundreds of potential workloads that are suitable for Hadoop, but three core roles:

  • Big data storage: Hadoop as a system for storing large, unstructured data sets
  • Big data processing/integration: Hadoop as a data ingestion/ETL layer
  • Big data analytics: Hadoop as a platform for new exploratory analytic applications

And we’re not the only ones who see it that way. This blog post from Cloudera CTO Amr Awadallah outlines three very similar, if differently named, use-cases (Transformation, Active Archive, and Exploration).

In fact, as I also explained during the Hadoop Summit keynote, we see these three roles as a process of maturing adoption, starting with low cost storage, moving on to high-performance data aggregation/ingestion, and finally exploratory analytics.

As such, it is interesting to view the current results of our Hadoop survey, which show that the highest proportion of respondents (63%) have implemented or plan to implement Hadoop for data analytics, followed by 48% for data integration and 43% for data storage.

This would suggest that our respondents include some significantly early Hadoop adopters. I look forward to properly analysing the results to see what they can tell us, but in the meantime it is interesting to note that the percentage of respondents using Hadoop for analytics is significantly higher among those that adopted Hadoop prior to 2012 (88%) compared to those that adopted in 2012 or 2013 (65%).

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

The Data Day, A few days: October 12-18 2013

Apache Hadoop 2 goes GA. Teradata cuts guidance. And more.

And that’s the data day, today.

7 Hadoop questions. Q6: Hadoop’s shortcomings

What are the major shortcomings of Hadoop? The answer to that question looks set to shape the future development roadmap for the open source data processing framework, which is why it is one of the major questions being asked as part of our 451 Research 2013 Hadoop survey.

The limitations of Hadoop have been widely reported over the years, but as the Apache Hadoop community and related vendors have responded to issues such as reliability and high availability – not least via the now generally available Apache Hadoop 2 – so attention turns to other areas such as security, administration and performance, as well as more advanced functionality requirements, including graph processing, stream processing, improved SQL support and virtualization support.

The list of potential improvements is therefore fairly long, and as we near the end of our survey it is interesting to see that the key advances respondents are looking for in order to increase adoption of Hadoop are spread fairly widely across that list.

So far the responses to our Hadoop survey suggest administration tooling and performance top the list, followed by reliability, SQL support and backup and recovery, but development tools and authentication and access control are not far behind.

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

The Data Day, A few days: October 5-11 2013

TransLattice acquires StormDB. Funding for Cirro and TempoDB. And more.

And that’s the data day, today.

7 Hadoop questions. Q5: SQL in Hadoop, SQL on Hadoop, or SQL and Hadoop?

What is your preferred approach to integrating SQL and Hadoop? Until recently that was a straight shoot-out between Hive and Pig, but in 2013 the options for making use of existing SQL skills to analyze data in Hadoop have increased dramatically. That’s why the choice of approach to SQL in/on/and Hadoop is one of the primary questions being asked in the 451 Research 2013 Hadoop survey.

I write “in/on/and” because I believe it is a good way of understanding the various approaches and how they compare at this point.

SQL in Hadoop
Hive’s classic approach of converting SQL queries into MapReduce jobs falls into this category, but lacks the performance that some users are looking for to enable more interactive analysis. Hortonworks has started the Stinger Initiative to align HiveQL more closely with standard SQL, optimize Hive’s query execution plans and introduce a new columnar file format for storing Hive data.
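To make the "SQL-via-MapReduce" idea concrete, here is a minimal sketch of how a simple aggregate query decomposes into map, shuffle and reduce steps. It is a conceptual illustration only, not Hive's actual planner or API: the table, the column names (`page`, `hits`) and the function names are all hypothetical, and everything runs in a single Python process rather than across a cluster.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS.
# Conceptually we want: SELECT page, SUM(hits) FROM visits GROUP BY page
ROWS = [
    {"page": "/home", "hits": 3},
    {"page": "/docs", "hits": 5},
    {"page": "/home", "hits": 4},
]


def map_phase(rows):
    """Emit one (group key, value) pair per input row."""
    for row in rows:
        yield row["page"], row["hits"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate (here SUM) to each key's list of values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases to answer the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

In a real cluster each phase is a batch step with its own scheduling and disk I/O, which is one reason this style of execution can feel slow for interactive analysis.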

SQL on Hadoop
Rather than attempting to improve the performance of SQL-via-MapReduce, several efforts are underway to create a SQL engine that enables native SQL-based processing of data in HDFS while avoiding MapReduce. Key efforts include Cloudera’s Impala project and Cloudera Enterprise RTQ product, the MapR-initiated Apache Drill project, Pivotal’s HAWQ and JethroData. IBM’s Big SQL also appears to fit into this category.

SQL and Hadoop
Co-location of relational database technologies and Hadoop enables data to be processed in each platform, using SQL in the RDBMS and MapReduce in HDFS. Hadapt pioneered this approach, while RainStor launched RainStor Big Data Analytics on Hadoop in early 2012, combining its column-based database software with Hadoop. Microsoft has been previewing PolyBase, which will offer the ability to join tables from SQL Server PDW with data from HDFS to return a combined result. SQL and Hadoop is a broader category in which we would also include Citus Data, which takes advantage of PostgreSQL’s foreign data wrapper technology to query data in HDFS via local query execution, as well as Teradata’s SQL-H, which enables SQL analysts to invoke MapReduce and SQL-MapReduce jobs against Hadoop from Teradata’s databases. We would absolutely concede that there are distinct differences between the approaches in this category.

It is naturally early days for most of these approaches, given that most of them only appeared in 2013 and some are still in development and testing. So far the responses to our Hadoop survey suggest higher levels of interest in Cloudera Impala, Cloudera RTQ, and Apache Drill, followed by IBM Big SQL, Hadapt and Pivotal HAWQ.

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

The Data Day, A few days: October 1-4 2013

MongoDB raises $150m. And more.

And that’s the data day, today.

7 Hadoop questions. Q4: alternative file systems

Which is your preferred Hadoop file system? The obvious answer is likely to be the Hadoop Distributed File System itself, although in recent years we’ve seen an increasing number of vendors pitching their own file system technologies as potential alternatives to HDFS. That’s why the use of alternative file systems is one of the primary questions being asked in the 451 Research 2013 Hadoop survey.

The limitations of HDFS are well-publicised, and it is no surprise that many vendors see an opportunity to pitch their existing file system technologies as alternatives to HDFS.

There is now a large number of HDFS alternatives to choose from, including: Cleversafe Dispersed Storage Network, DataStax CassandraFS, EMC Isilon OneFS, IBM GPFS, InkTank Ceph, MapR NFS, Quantcast QFS, Red Hat Storage (GlusterFS), and Symantec Veritas CFS.

Our research indicates that adoption of alternatives to HDFS is limited at this stage and early efforts, such as Appistry’s CloudIQ Storage Hadoop Edition, have come and gone.

However, as adoption of Hadoop grows into more mainstream enterprises, we increasingly see interest in some of these HDFS alternatives, particularly in relation to attempts to reduce duplication of effort with regard to file system management and maintenance.

The early responses to our Hadoop survey are therefore interesting: MapR NFS has scored highest in terms of adoption so far, but there is interest across the board (especially Red Hat Storage, CassandraFS, GPFS, OneFS and Ceph). By and large, though, it’s true to say that most respondents have not considered, tested or adopted an alternative file system to date.

To give your view on this and other questions related to the adoption of Hadoop, please take our 451 Research 2013 Hadoop survey.

NoSQL LinkedIn Skills Index – September 2013

With our rebooted NoSQL LinkedIn Skills Index, based on the number of LinkedIn member profiles mentioning each of the NoSQL projects, now into its second year, I thought it was a good time to add some newer projects to the list; specifically: ArangoDB, FoundationDB, RethinkDB, and Titan.

It shouldn’t surprise anyone to find that those four new additions failed to make a dent in the top ten list of the NoSQL databases most often cited in LinkedIn profiles. However, there is still some interesting activity this quarter, with Riak leapfrogging MarkLogic (as predicted).

[Chart: NoSQL LinkedIn Skills Index, Q3 2013]

Outside the top ten, Apache Accumulo overtook Voldemort, and saw the second-fastest growth in mentions in Q3, behind only DynamoDB and ahead of Neo4j, MongoDB, and Cassandra.

That growth saw MongoDB extend its lead as the most popular NoSQL database, according to LinkedIn profile mentions. As the chart below illustrates, it now accounts for 49% of all mentions of NoSQL technologies in LinkedIn profiles, according to our sample, compared with 47% in June.

[Chart: share of all NoSQL mentions in LinkedIn profiles, Q3 2013]

Incidentally, adding the four new NoSQL databases to the analysis did not have a significant impact on MongoDB’s share. Without them it still registered 49%. Expect MongoDB to pass the 50% threshold in Q4, however, and Couchbase to overtake MarkLogic.

Of course, we would also note that this is not meant to be a comprehensive analysis, but rather a snapshot of one particular data source.