TIBCO targets ‘analytics for all’. And more.
And that’s the data day, today.
Dell acquiring EMC. And more.
And that’s the data day, today.
Strata+Hadoop World special
And that’s the data day, today.
Three years after we (re)started tracking mentions of NoSQL databases in LinkedIn member profiles, it is time to retire the NoSQL LinkedIn Skills Index – at least in terms of regular updates.
We started tracking mentions of NoSQL databases in LinkedIn member profiles in order to keep an eye on trends that could shape the industry, but after three years it has become clear that, as far as LinkedIn member profiles are concerned, there is only one trend: the total dominance of MongoDB.
Once again MongoDB was responsible for more than 50% of all mentions of NoSQL databases in LinkedIn member profiles in Q3, placing it way, way ahead of the nearest competitor.
As always there were changes of position further down the rankings, with OrientDB overtaking Accumulo and RethinkDB overtaking Voldemort. We are talking about very small numbers, however. To be honest, tracking these numbers has become something of a chore given the lack of change, and even the addition of Microsoft Azure DocumentDB and Google Cloud Bigtable couldn’t lift our interest.
For the record, the fastest growth in the quarter was recorded by RethinkDB, with mentions up 36.2%, followed by multi-model players OrientDB (28.0%) and ArangoDB (23.0%), as well as Aerospike (22.1%). Inside the top ten, DynamoDB had the fastest growth (16.5%).
However, since none of the top 10 look like changing places any time soon, and none of the players outside stand any chance of breaking into the top 10, the time has come to retire the NoSQL LinkedIn Skills Index.
Perhaps we’ll pull it out and freshen it up on special occasions, however.
Of course, we would also note that this is not meant to be a comprehensive analysis, but rather a snapshot of one particular data source.
What is Hadoop?
It should be fairly simple: in the beginning there was the Hadoop Distributed File System, Hadoop MapReduce, and the Hadoop Common set of utilities. Even with the addition of Apache YARN in 2013, just four projects officially form the core of Apache Hadoop.
However, this is not what most people refer to when they use the term ‘Hadoop’. Instead most people mean the set of Hadoop-related projects that are combined with the Hadoop core to create Hadoop distributions.
As 451 Research’s Periodic Table of Hadoop illustrates, there are at least 40 projects that could be considered part of the Hadoop ecosystem (our table comprises Hadoop-related Apache Software Foundation projects, as well as other open source projects included in more than one Hadoop distribution). So ‘Hadoop’ represents pretty much any combination of 40 or more projects.
Hadoop’s creator Doug Cutting has asserted that Hadoop will evolve over time from a batch-processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem. At the same time the word ‘Hadoop’ has evolved to become a catch-all brand for that wider distributed data-processing ecosystem.
That is potentially confusing, especially for later mainstream adopters as they seek to get their heads around what Hadoop is and what it is for. However, that’s not what this blog post is about. I’m less interested in defining what Hadoop is than I am in identifying what isn’t Hadoop.
When is Hadoop not Hadoop?
Recent announcements from the original Hadoop commercial supporter, Cloudera, have highlighted the significance of this question. First it anointed Spark as the successor to MapReduce, then it launched Kudu, a new storage engine and potential alternative to the Hadoop Distributed File System (HDFS).
If the company’s plans for Spark and Kudu play out, pretty soon we could see a whole lot of ‘Hadoop deployments’ that make use of neither MapReduce nor HDFS – the primary initial Hadoop core projects. This isn’t just a potential outcome. Already today it is perfectly plausible that a ‘Hadoop deployment’ might not involve MapReduce or HDFS – it could involve Spark accessing data in AWS S3 for example.
Both Spark and Kudu are open source and are clearly part of the wider Hadoop ecosystem, but where do you draw the line in terms of what is and isn’t ‘Hadoop’?
Vendors are increasingly layering additional proprietary components on top of this Hadoop ecosystem for differentiation. MapR has most obviously blurred the lines between Hadoop and not Hadoop, but Cloudera Enterprise could also arguably be put in a ‘Hadoop+’ category along with things like Pivotal Big Data Suite, and IBM BigInsights.
Then there are things that aren’t even claimed to be Hadoop but on closer inspection bear a close resemblance as ‘Hadoop’ evolves beyond its core. For example, the Stratio Platform is based on Apache Spark and other Apache projects including Flume and Kafka. It isn’t claimed to be Hadoop, but it enables data to be stored in the Hadoop Distributed File System (as well as AWS S3, Elasticsearch, MongoDB, Apache Cassandra, Redis, and relational databases), so it is surely part of the same wider family of data platforms.
If not Hadoop, then what?
So what should we call this wider family of data platforms – including Hadoop+ and ‘other’? Due to the pick-and-mix nature of the Hadoop ecosystem there is no easy way to answer that in terms of technology or use-cases. The products and services will be designed specifically to deliver a mix of data processing and storage capabilities, including MapReduce, SQL engines and stream processing, as well as HDFS, HBase, S3 and Kudu, and much more besides, both proprietary and open source.
Indeed it is probably easier to think about this not in terms of technologies but the symbols that represent them. If Hadoop was originally symbolised by an elephant then what symbol best conveys the category of data platforms based on the wider Hadoop ecosystem and beyond?
Given the veritable menagerie of animals (and inanimate objects) that represent the various Hadoop ecosystem projects – elephant, pig, bee, tortoise, falcon, giraffe, orca, squirrel, hippopotamus, antelope, phoenix, kylin, roadrunner, hummingbird – there is surely only one choice: the Chimera.
Source: Wikimedia
For those not acquainted with Greek mythology, the Chimera was a fire-breathing, multi-headed hybrid creature composed of the parts of more than one animal. While the Chimera was classically composed of the features of a lion, a snake and a goat, the term chimera can be used to describe any animal with parts taken from various animals.
As such it is perfect to symbolise the multi-headed hybrid Hadoop-based data platforms we see evolving. We are therefore tempted to use the term Chimeric Data Platform to describe this wider category of data platforms that are building on and expanding from Hadoop.
The fact that Merriam-Webster further defines chimera as “something that exists only in the imagination and is not possible in reality” is an added bonus that appeals to our sense of humour.
Data Management of Things. And more.
And that’s the data day, today.
Cloudera anoints Spark. Paxata raises $18m. And more.
And that’s the data day, today.
Hortonworks acquires Onyara. And more.
And that’s the data day, today.
1010data acquired. And more.
And that’s the data day, today.
IBM acquires Compose. And more.
And that’s the data day, today.