HBase — Too much information

The Data Day: May 19, 2017

May 19th, 2017 — Data management

At no time — at no time — were data sources or analytics methods discussed.

For @451Research clients: Total Data: Data Platforms & Analytics Market Monitor https://t.co/CnZIJY5cIp

— Matt Aslett (@maslett) May 17, 2017

For @451Research clients: Cloud giants get tooled up for data-driven battle royale https://t.co/n74TZDmyf7 By me and @owenrog pic.twitter.com/d0rvi3ibrm

— Matt Aslett (@maslett) May 17, 2017

For @451Research clients: @TIBCO bolsters machine learning and predictive analytics with @Statistica https://t.co/ifhFd6ZTss By Krishna Roy

— Matt Aslett (@maslett) May 17, 2017

For @451Research clients: Prescriptive analytics: Just what the doctor ordered, a little too early https://t.co/jySajnEB7e By Krishna Roy

— Matt Aslett (@maslett) May 19, 2017

For @451Research clients: @Infoworksio expands Hadoop-based data-warehousing vision following $15m series B https://t.co/5LXv4KCTJn

— Matt Aslett (@maslett) May 15, 2017

For @451Research clients: @wherescape automates Data Vault 2.0 data warehouses with Data Vault Express https://t.co/KxUTcfgCts

— Matt Aslett (@maslett) May 16, 2017

Mode Analytics raises $13m series B, led by REV Venture Partners. https://t.co/bZQIF7TqKo

— Matt Aslett (@maslett) May 16, 2017

GE Ventures has made a strategic equity investment in Tamr https://t.co/uNGCi5dKii

— Matt Aslett (@maslett) May 18, 2017

FogHorn Systems extends series A funding for edge analytics and machine learning platform https://t.co/ujo1jZUKLM

— Matt Aslett (@maslett) May 16, 2017

TIBCO Software to acquire Statistica (from the for-some-reason-unmentioned Quest Software) https://t.co/BOYwk67XgD

— Matt Aslett (@maslett) May 15, 2017

SAP announces SAP Leonardo Machine Learning https://t.co/jVbCwAQ8i3

— Matt Aslett (@maslett) May 16, 2017

SAP announces Analytics Cloud update, BusinessObjects 4.2, Lumira 2.0 and Analytics Hub https://t.co/iczjvJmWBN

— Matt Aslett (@maslett) May 17, 2017

Informatica adds CLAIRE metadata-driven AI engine to its Informatica Intelligent Data Platform. https://t.co/udifmwGj6E

— Matt Aslett (@maslett) May 16, 2017

Informatica announces Informatica Intelligent Cloud Services iPaaS https://t.co/UO8wHhJDvi

— Matt Aslett (@maslett) May 16, 2017

Informatica announces Informatica Data Governance & Compliance https://t.co/0hZMrI64pI

— Matt Aslett (@maslett) May 17, 2017

SnapLogic delivers AI-powered Integration Assistant with its Spring 2017 release. https://t.co/TwfwkO0Vb1

— Matt Aslett (@maslett) May 18, 2017

Qlik previews new intuitive analytics in latest version of Qlik Sense https://t.co/yWksQYxxnt

— Matt Aslett (@maslett) May 16, 2017

Microsoft announces the public preview of HDInsight HBase on Azure Data Lake Store https://t.co/96Knt1wNEa

— Matt Aslett (@maslett) May 19, 2017

MarkLogic launches version 9 of its MarkLogic Enterprise NoSQL database. https://t.co/DQCcB6BjlX

— Matt Aslett (@maslett) May 16, 2017

Pivotal launches Spring Cloud Data Flow 1.2 https://t.co/qQEsQcGPlN

— Matt Aslett (@maslett) May 16, 2017

Host Analytics releases Cloud EPM platform update https://t.co/wh5g9SKyvW

— Matt Aslett (@maslett) May 17, 2017

DataTorrent launches Real-Time Streaming (RTS) 3.8 https://t.co/d40nstwtpx

— Matt Aslett (@maslett) May 18, 2017

Crate.io introduces CrateDB 2.0 Enterprise and Open Source Editions https://t.co/XzC6iqAlTP

— Matt Aslett (@maslett) May 17, 2017

Google says Cloud Spanner is now production-ready https://t.co/kQIAEUyFM5

— Matt Aslett (@maslett) May 16, 2017

WhereScape launches Data Vault Express to automate Data Vault 2.0 data warehouse development https://t.co/E7nwPGVzL2

— Matt Aslett (@maslett) May 16, 2017

The Apache Software Foundation Announces Apache Beam v2.0.0 https://t.co/BxZyQ38gbu

— Matt Aslett (@maslett) May 17, 2017

The Apache Software Foundation Announces Apache Samza v0.13 https://t.co/aX7V34wXhC

— Matt Aslett (@maslett) May 15, 2017

And that’s the data day, today.

NoSQL LinkedIn Skills Index – An Interesting Occasional Update

December 19th, 2016 — Data management

I was recently prompted by OrientDB CEO Luca Garulli to take another look at the NoSQL LinkedIn Skills Index, which we previously updated on a regular basis between September 2012 and 2015.

Hey @maslett, Do you have any plan to update the #NoSQL LinkedIn Skill Index report? Last was more than 1y ago: https://t.co/HKRWJID8nz

— Luca Garulli (@lgarulli) December 9, 2016

I wouldn’t read too much into the results since there’s been such a long period between updates, and this is – as ever – just a snapshot of one particular data source. However, they are definitely interesting, especially when you consider that we retired the NoSQL LinkedIn Skills Index primarily because the results had become so boringly predictable.

As such I’d make the following observations without any additional comment:

It is interesting to note that MongoDB’s share of mentions of NoSQL databases in LinkedIn member profiles has declined since September 2015, from 51% to 48%. Of course, MongoDB remains the number one by a considerable margin.

It is also interesting to note that Redis has climbed above Cassandra to claim second spot.

Similarly it is interesting that Neo4j has climbed above CouchDB for fifth place.

And it is also interesting that DynamoDB has overtaken Couchbase for eighth place.

It is also interesting that the two fastest growing NoSQL databases, in terms of mentions in LinkedIn profiles, are Google Cloud Bigtable (up 557%) and Azure DocumentDB (up 254%).

And it is also interesting that the third fastest growth came from RethinkDB, despite the recent demise of the company of the same name.

Those growth rates saw Google Cloud Bigtable climb above Voldemort, ArangoDB, Hypertable and AllegroGraph, while Azure DocumentDB climbed above Titan and Voldemort, and RethinkDB climbed above Titan and Accumulo.

Since Luca prompted another look at the results, I should also probably point out that mentions of OrientDB grew at a healthy 83% as OrientDB held on to 11th place in the Index.

Interesting…

The Data Day, A few days: February 21-27, 2015

February 27th, 2015 — Data management

Hortonworks reports first financial results

For @451Research clients: @Hortonworks highlights billings growth as heavy losses continue in first public quarter http://t.co/uNM7f1bs3k

— Matt Aslett (@maslett) February 25, 2015

For @451Research clients: @Pivotal goes all-out open source for big data, launches Hadoop alliance with @Hortonworks http://t.co/4YmGoWPQTq

— Matt Aslett (@maslett) February 23, 2015

For @451Research clients: @VoltDB revs in-memory database with version 5.0 http://t.co/COoJe3xcf5 By @jasonstamper

— Matt Aslett (@maslett) February 25, 2015

For @451Research clients: Are Small Consultancies Best for Big Data Projects? http://t.co/1itOsfjW9i New TBI report by @drkatyring

— Matt Aslett (@maslett) February 25, 2015

For @451Research clients: @MongoLab prepares Telemetry performance-analysis service for hosted MongoDB databases http://t.co/azD58DeTa2

— Matt Aslett (@maslett) February 26, 2015

For @451Research clients: @brytlytUK shines a light on GPU-based analytics, seeks funding http://t.co/yXLjuNitzn By @jasonstamper

— Matt Aslett (@maslett) February 23, 2015

For @451Research clients: @UforaInc builds a full integrated stack for data analytics on parallel systems http://t.co/Bbm6fg5yvy By @sfjohna

— Matt Aslett (@maslett) February 26, 2015

Also by @sfjohna for @451Research clients: @Oracle now competing on price with new-generation X5 engineered systems http://t.co/9WpYPbkEEY

— Matt Aslett (@maslett) February 26, 2015

For @451Research clients: @InformaticaCorp defines a framework for API integration in iPaaS Cloud http://t.co/DSQzf6PnH0 By @CarlLehmann1

— Matt Aslett (@maslett) February 27, 2015

For @451Research clients: @awscloud's DynamoDB becomes cheaper for those willing to commit http://t.co/ji4TwjqGfK By @owenrog

— Matt Aslett (@maslett) February 27, 2015

Hortonworks: Q4 net loss of $90.6m on revenue up 55% to $12.7m, FY14 net loss of $177.4m on revenue up 91% to $46.0m http://t.co/T54FV7AdTv

— Matt Aslett (@maslett) February 24, 2015

Splunk reports Q4 net loss of $57m on revenue up 48% to $147.4m, FY net loss of $217.1m on revenue up 49% to $450.9m http://t.co/6wtvIQ83sz

— Matt Aslett (@maslett) February 27, 2015

Fortune: That IPO? Cloudera bides its time http://t.co/yAOKrhsWro

— Matt Aslett (@maslett) February 24, 2015

Metamarkets closes $15m in new funding Led by Data Collective http://t.co/CeliXof21z

— Matt Aslett (@maslett) February 25, 2015

Teradata appoints co-presidents to lead Marketing Applications and Data and Analytics divisions. http://t.co/2MSuEBeUpk

— Matt Aslett (@maslett) February 26, 2015

I missed this recent post from HP explaining why it rejected the Open Data Platform – “a risk to innovation” http://t.co/S9I9HYGLhP

— Matt Aslett (@maslett) February 25, 2015

The Apache Software Foundation Announces Apache HBase v1.0 http://t.co/jFrbGOrml4

— Matt Aslett (@maslett) February 24, 2015

Confluent announces the general availability of Confluent Platform 1.0 stream data platform powered by Apache Kafka http://t.co/g5lWlWmsHD

— Matt Aslett (@maslett) February 26, 2015

Penguin Computing announces Scyld ClusterWare for Hadoop http://t.co/r19Lq43tfu

— Matt Aslett (@maslett) February 25, 2015

Deep IS appoints former LogMeIn execs Les Yetton and Chad Jones as CEO and chief strategy officer, respectively. http://t.co/EljEmSuePs

— Matt Aslett (@maslett) February 24, 2015

And that’s the data day, today.

Update on the relative popularity of NoSQL database skills

March 27th, 2012 — Data management

Back in December we ran a series of posts looking at the geographic distribution of NoSQL skills, according to the results of searching LinkedIn member profiles, culminating in a look at the relative overall popularity of the major NoSQL databases.

This week I took another look at LinkedIn to update the results for a forthcoming report, which gives us the opportunity to see how the results have changed over the past quarter:

While this provides us with an interesting opportunity to track LinkedIn profile mentions over time there isn’t a huge amount we can learn from this first update – other than that MongoDB seems to be increasing its dominance.

The only significant change that isn’t immediately obvious from looking at the chart is that Apache HBase has overtaken Apache CouchDB by a tiny margin to claim third place overall.

As we noted last time, however, Apache HBase is more reliant on the US than other NoSQL databases for its LinkedIn mentions: it is the second most prevalent NoSQL database mentioned in the USA but fourth in the rest of the world.

Two other points to take into consideration:

– The results for Apache Cassandra are probably disproportionately low since we have to search for the full phrase in order to avoid including people called Cassandra.

– Previously we only searched for Membase. This time we added together the search results for both Membase and Couchbase. This may mean the result for Couch/Membase is disproportionately high since some members probably listed both.

This is not meant to be a comprehensive analysis, however, but rather a snapshot of one particular data source.
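The Membase/Couchbase caveat above can be made concrete with a little arithmetic. The sketch below is only an illustration: it uses the December 2011 counts quoted elsewhere on this page (170 for “Membase”, 85 for “Couchbase”), not the March 2012 figures, which are not given here, and the function name is hypothetical. Adding two search results gives an upper bound on the number of distinct profiles; the true figure lies somewhere between the larger single count and the sum.

```python
def combined_bounds(count_a, count_b):
    """Return (lower, upper) bounds on distinct profiles matching either term.

    lower: every profile in the smaller result also appears in the larger one.
    upper: no profile lists both terms, so the simple sum is exact.
    """
    return max(count_a, count_b), count_a + count_b


if __name__ == "__main__":
    # Illustrative inputs only: December 2011 counts, not the March 2012 ones.
    lower, upper = combined_bounds(170, 85)
    print(f"distinct profiles: between {lower} and {upper}")
    print(f"maximum possible overcount from summing: {upper - lower}")
```

In other words, summing the two searches can overstate the combined figure by at most the size of the smaller result.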

The Data Day, Today: Jan 27 2012

January 27th, 2012 — Data management

Amazon launches AWS Storage Gateway. Postgres Plus Cloud Server. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Amazon Web Services Announces AWS Storage Gateway to Connect Enterprise Data with the Cloud

* EnterpriseDB Announces Availability of Postgres Plus Cloud Database

* Big VCs Invest In Big Data Startup Continuuity

* At Davos, Discussions of a Global Data Deluge

* Zimory Names New Head of zimory®scale; the Cloud Database Elasticity Division

* Jaspersoft’s Java Reporting Engine Integrated with Cloud Foundry

* IBM Debuts New Analytics Appliance to Help Retailers Transform Big Data Into Business Opportunities

* The Mass Technology Leadership Council published its report on big data and analytics.

* Apache HBase 0.92.0 has been released

* Is Security An Afterthought For NoSQL?

* What’s the big deal about Big Data?

* Hadoop Summit 2012 Announced to Showcase Apache Hadoop as Next Generation Enterprise Data Platform

* Announcing BigCouch 0.4

* Microsoft’s plan for Hadoop and big data

* Google Goes MoreSQL With Tenzing – SQL Over MapReduce

* Seismic Data Science: Reflection Seismology and Hadoop

* GoodData Posts Record-Breaking 600% Year-Over-Year Revenue Growth In 2011

* For 451 Research clients

# 2012 M&A Outlook – Software Assessing the runners and riders for M&A and IPOs in 2012

# RJMetrics scores $1.2m debt funding, sets out SaaS BI stall Impact report

* Google News Search outlier of the day: Pork Tenderloin: A Healthy Eating Hero

And that’s the Data Day, today.

The geographic distribution of NoSQL skills – just one more thing

December 12th, 2011 — Data management

Hidden away amongst the details of our little tour around LinkedIn statistics on NoSQL and Hadoop skills was some interesting information on how many LinkedIn members list the various data management technologies in our sample in their profiles.

Our original post contained the fact that there were 9,079 LinkedIn members with “Hadoop” in their member profiles, for example, compared to 366,084 with “MySQL” in their member profiles.

Later posts showed there were 170 with “Membase” and 1,687 with “HBase”, 787 with “Apache Cassandra” and 376 with “Riak”, 6,048 with “MongoDB” and 2,152 with “Redis”, and finally, 1,844 with “CouchDB” and 268 with “Neo4j”.

This gives us an interesting perspective on the relative adoption of the various NoSQL databases:

If it wasn’t already obvious from the list above, the chart illustrates just how much more prevalent MongoDB skills are compared to the other NoSQL databases, followed by Redis, Apache CouchDB, Apache HBase and Apache Cassandra. The chart also illustrates that while HBase is the second most prevalent NoSQL skill set in the USA, it is only fourth overall given its lower prevalence in the rest of the world.
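For readers who want to reproduce the comparison, here is a minimal sketch that turns the profile counts quoted above into a ranking with relative shares. The function and variable names are illustrative. Note that these are shares of mentions rather than of members, since a member who lists several databases is counted once for each.

```python
# LinkedIn profile counts quoted in this post (December 2011 searches).
mentions = {
    "MongoDB": 6048,
    "Redis": 2152,
    "CouchDB": 1844,
    "HBase": 1687,
    "Apache Cassandra": 787,
    "Riak": 376,
    "Neo4j": 268,
    "Membase": 170,
}


def shares(counts):
    """Return (name, count, percent of all mentions) tuples, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, 100 * count / total) for name, count in ranked]


if __name__ == "__main__":
    for name, count, pct in shares(mentions):
        print(f"{name:<17}{count:>6}  {pct:5.1f}%")
```

On these figures MongoDB accounts for roughly 45% of the 13,332 mentions across the eight databases.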

In response, a representative from a certain vendor notes “Some skills are more valued not because they are more prevalent, but because they are harder to achieve.” Make of that what you will.

The geographic distribution of NoSQL skills: HBase and Membase

December 5th, 2011 — Data management

Following last week’s post putting the geographic distribution of Hadoop skills, based on a search of LinkedIn members, in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.

The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption for each.

We begin this week’s series with Membase and HBase, the two projects that proved, like Apache Hadoop, to have significantly greater adoption in the USA compared to the rest of the world.

The statistics showed that 58.2% of the 170 LinkedIn members with “Membase” in their member profiles are based in the US (as previously explained, we tried the same search with Couchbase, but with only 85 results we decided to use the Membase result set as it was more statistically relevant).

As with Hadoop, a significant proportion (27.1%) of those are in the Bay Area, the highest proportion of all the NoSQL databases we looked at. The results also indicate that Ukraine is a hot-spot for Membase skills, with 3.5%, while Membase adoption is lower in the UK (2.4%) than that of other NoSQL databases.

It should not be a great surprise that Apache HBase returned similar results to Apache Hadoop. The top eight individual regions for HBase were exactly the same as for Hadoop, although the UK (3.4%) is stronger for HBase, as is India (10.7%).

The statistics showed that 57.0% of the 1,687 LinkedIn members with “HBase” in their member profiles are based in the US, with 25.0% in the Bay area (the third highest in our sample behind Hadoop and Membase).

The series will continue later this week with MongoDB, Riak, CouchDB, Apache Cassandra, Neo4j, and Redis.

N.B. The size of the boxes is in proportion to the search result (click each image for a larger version). World map image: Owen Blacker

The geographic distribution of Hadoop skills: in context

December 2nd, 2011 — Data management

NC State University’s Institute for Advanced Analytics recently published some interesting statistics on Apache Hadoop adoption based on a search of LinkedIn data.

The statistics graphically illustrate what a lot of people were already pretty sure of: that the geographic distribution of Hadoop skills (and presumably therefore adoption) is heavily weighted in favour of the USA, and in particular the San Francisco Bay Area.

The statistics showed that 64% of the 9,079 LinkedIn members with “Hadoop” in their member profiles (by no means perfect but an insightful measure nonetheless) are based in the US, and that the vast majority of those are in the Bay Area.

The results are what we would expect to see given the relative level of immaturity of Apache Hadoop adoption, as well as the nature and location of the early Hadoop adopters and Hadoop-related vendors.

The results got me thinking two things:
– how does the geographic spread compare to a more maturely adopted project?
– how does it compare to the various NoSQL projects?

So I did some searching of LinkedIn to find out.

To answer the first question I performed the same search for MySQL, as an example of a mature, widely-adopted open source project.

The results show that just 32% of the 366,084 LinkedIn members with “MySQL” in their member profiles are based in the US (precisely half that of Hadoop) while only 4.4% are in the Bay area, compared to 28.2% of the 9,079 LinkedIn members with “Hadoop” in their member profiles.

The charts below illustrate the difference in geographic distribution between Hadoop and MySQL. The size of the boxes is in proportion to the search result (click each image for a larger version).

With regard to the second question, I also ran searches for MongoDB, Riak, CouchDB, Apache Cassandra*, Membase*, Neo4j, HBase, and Redis.

I’ll be posting the results for each of those over the next week or so, but in the meantime, the graphic below shows the split between the USA and Rest of the World (ROW) for all ten projects.

It illustrates, as I suspected, that the distribution of skills for NoSQL databases is more geographically dispersed than for Hadoop.

I have some theories as to why that is – but I’d love to hear anyone else’s take on the results.

*I had to use the ‘Apache’ qualifier with Cassandra to filter out anyone called Cassandra, while Membase returned a more statistically relevant result than Couchbase.

World map image: Owen Blacker

User perspectives on NoSQL

April 21st, 2010 — Data management

The NoSQL EU event in London this week was a great event with interesting perspectives from both vendors – Basho, Neo Technology, 10gen, Riptano – and also users – The Guardian, the BBC, Amazon, Twitter. In particular I was interested in learning from the latter about how and why they ended up using alternatives to the traditional relational database model.

Some of the reasons for using NoSQL have been well-documented: Amazon CTO Werner Vogels talked about how the traditional database offerings were unable to meet the scalability Amazon.com requires. Filling a functionality void also explains why Facebook created Cassandra, Google created BigTable, and Twitter created FlockDB (etc etc). As Werner said, “We couldn’t bet the company on other companies building the answer for us.”

As Werner also explained, however, the motivation for creating Dynamo was also about enabling choice and ensuring that Amazon was not trying to force the relational database to do something it was not designed to do. “Choosing the right tool for the job” was a recurring theme at NoSQL EU.

Given the NoSQL name it is easy to assume that this means that the relational database is by default “the wrong tool”. However, the most important element in that statement is arguably not “tool”, but “job” and The Guardian discussed how it was using non-relational data tools to create new applications that complement its ongoing investment in the Oracle database.

For example, the Guardian’s application to manage the progress of crowdsourcing the investigation of MPs’ expenses is based on Redis, while the Zeitgeist trending news application runs on Google’s AppEngine, as did its live poll during the recent leaders’ election debate. Datablog, meanwhile, relies on Google Spreadsheets to serve up usable and downloadable data – we’ll ignore for a moment whether Google Spreadsheets is a NoSQL database 😉

Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. The overarching theme, as Matthew Wall and Simon Willison explained, is that the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.

On the subject of choosing the right tool for the job, Basho’s engineering manager Bryan Fink pointed out that using NoSQL technology alongside relational SQL database technology may actually improve the performance of the SQL database, since storing data in a relational database that does not need SQL features slows down access to data that does need SQL features.

Another perspective on this came from Werner Vogels who noted that unlike database administrators/ systems architects, users don’t care about where data resides or what model it uses – as long as they get the service they require. Werner explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems. Users do not think about data sources in isolation, they care about the amalgamated service.

This was also a theme that cropped up in the presentation by Enda Farrell, software architect at the BBC, who noted that the BBC’s homepage is a PHP application integrated with multiple data sources at multiple data centers, and also Twitter’s analytics lead Kevin Weil, who described Twitter’s use of Hadoop, Pig, HBase, Cassandra and FlockDB.

While the company is using HBase for low-latency analytic applications such as people search and moving to Cassandra from MySQL for its online applications, it uses its recently open-sourced FlockDB graph database to serve up data on followers and correlate the intersection of followers to (for example) ensure that Tweets between two people are only sent to the followers of both. (As something of an aside, Twitter is using Hadoop to store the 7TB of data it generates a day from Tweets, and Pig for non-real-time analytics.)
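The follower-intersection idea is easy to picture with ordinary sets. What follows is a toy sketch in plain Python, not FlockDB’s API or Twitter’s implementation; the follower graph, user names and function name are all hypothetical. It simply computes which accounts follow both parties to a conversation.

```python
def reply_recipients(followers, sender, recipient):
    """Return the accounts that follow both the sender and the recipient."""
    return followers.get(sender, set()) & followers.get(recipient, set())


# Hypothetical follower graph: user -> set of accounts that follow that user.
followers = {
    "alice": {"carol", "dave", "erin"},
    "bob": {"dave", "erin", "frank"},
}

if __name__ == "__main__":
    # Only accounts following both alice and bob see a Tweet between them.
    print(sorted(reply_recipients(followers, "alice", "bob")))
```

At Twitter’s scale the sets involved are far too large for this in-memory approach, which is presumably part of the reason for a dedicated graph store.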

Kevin noted that the company is also working with Digg to build real-time analytics for Cassandra and will be releasing the results as open source, and also discussed how Twitter has made use of open source technologies created by others such as Facebook (both Cassandra and the Scribe log data aggregation server).

One of the issues that has arisen from the fact that organizations such as Amazon and Facebook have had to create their own data management technologies is the proliferation of NoSQL databases and a certain amount of wheel re-invention.

Werner explained that SmugMug creator Don MacAskill ended up being a MySQL expert not because he necessarily wanted to be, but because he had to be to keep his applications running.

“He doesn’t want to have to become an expert in Cassandra,” noted Werner. “What he wants is to have someone run it for him and take care of that.” Presumably Riptano, the new Cassandra vendor formed by Jonathan Ellis – project chair for the Cassandra database – will take care of that, but in the meantime Werner raised another long-term alternative.

“We shouldn’t all be doing this,” he said, adding that Dynamo is not as popular within Amazon Web Services as it once was because it is a product that requires configuration and management, rather than a service, and Amazon employees “have better things to do.”

Which raises the question – don’t Twitter, Facebook, the BBC, the Guardian et al have better things to do than developing and maintaining database architecture? In a perfect world, yes. But in a perfect world they’d all have strongly consistent, scalable distributed database systems/services that are suited to their various applications.

Interestingly, describing S3 as “a better key/value store than Dynamo”, Werner noted that SimpleDB and S3 are “a good start to provide that service”.
