The Data Day, A few days: May 21-27, 2016

Updated 451 data platforms and analytics market and vendor estimates. And more.

And that’s the data day, today.

The Data Day, A few days: June 1-12, 2015

Teradata supports Presto. And more.

And that’s the data day, today.

The Data Day, A few days: May 15-19 2014

Hortonworks acquires XA Secure. And more.

And that’s the data day, today.

The Data Day, A few days: March 22-April 4 2014

Cloudera raises $900m. Pivotal launches Big Data Suite. And more.

And that’s the data day, today.

The Data Day, A few days: October 12-18 2013

Apache Hadoop 2 goes GA. Teradata cuts guidance. And more.

And that’s the data day, today.

The Data Day, Two days: October 5/8 2012

SiSense refreshes Prism. How Twitter, PayPal and Facebook scale MySQL. And more.

And that’s the Data Day, today.

The Data Day, Today: Apr 11 2012

IBM launches Galileo database update. SAP outlines database roadmap. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Made in IBM Labs: New IBM Software Accelerates Decision Making in the Era of Big Data. IBM launches DB2 10 and InfoSphere Warehouse 10.

* SAP Unveils Unified Strategy for Real-Time Data Management to Grow Database Market Leadership

* SAP Unveils Strategy to Gain Predictive Insights From Big Data

* TIBCO Delivers Breakthrough Software to Analyze Big Data in Motion

* TIBCO Announces Intent to Acquire LogLogic

* TIBCO Spotfire and Attivio Partner to Deliver New Levels of Integration and Discovery for Data and Content

* Mortar Data, Hadoop for the Rest of Us, Gets Seed Funding

* The coming in-memory database tipping point. Microsoft’s perspective on in-memory databases.

* Jaspersoft Extends Partnership with Talend to Deliver Big Data Integration

* Oracle to Hold MySQL Connect Conference in San Francisco September 29 and 30, 2012

* Percona XtraDB Cluster Open Source Software Provides a New Approach to High Availability MySQL

* Tokutek Brings Replication Performance to MySQL and MariaDB

* Continuent Announces Tungsten Enterprise 1.5 for Multi-Master, Multi-Region MySQL Data Services in the Amazon EC2

* SkySQL, hastexo Form Highly Available Partnership

* MySQL at Twitter. Twitter releases its MySQL modifications under BSD license.

* Percona Bundles New Relic to Provide Gold and Platinum Support Customers with Comprehensive Application Visibility

* Percona Toolkit 2.1 for MySQL Enables Schema Changes without Scheduling Downtime

* Percona XtraBackup 2.0 for MySQL and Percona Server Provides Increased Performance

* Delphix Expands Agile Data Platform to Support Oracle Exadata

* Red Hat and 10gen Create Compelling Open Source Data Platform

* Announcing Pre-Production MongoDB Subscription from 10gen

* VoltDB Announces Version 2.5

* Red Hat Storage 2.0 Beta: Partners Test Big Data, Hadoop Support

* Sungard wants to sell you Hadoop as a service

* Actian and Lenovo Team to Optimize Big Data and Business Intelligence with New Appliance

* Objectivity Expands European Management Team With Former Sones Founder Mauricio Matthesius

* expressor Expands Data Integration Platform Into Big Data

* The Apache Software Foundation Announces Apache Sqoop as a Top-Level Project

* LucidDB has left Eigenbase, moved to Apache License

* For 451 Research clients

# IBM looks to the stars with Galileo relational database update Impact Report

# Indicee eyes fresh VC as it establishes beachhead for cloud BI service using OEM sales Impact Report

# Percona launches XtraDB Cluster for MySQL database high availability Impact Report

# Tokutek targets replication performance with database update Impact Report

# ‘Big data’ in the datacenter: Vigilent secures $6.7m funding round Impact Report

And that’s the Data Day, today.

The Data Day, Today: Mar 8 2012

Microsoft launches SQL Server 2012. MapR integrates with Informatica. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Microsoft Releases SQL Server 2012 to Help Customers Manage “Any Data, Any Size, Anywhere”

* SQL Server 2012 Released to Manufacturing

* SAS Access to Hadoop Links Leading Analytics, Big Data

* MapR And Informatica Announce Joint Support To Deliver High Performance Big Data Integration And Analysis

* Teradata Expands Integrated Analytics Portfolio

* New Teradata Platform Reshapes Business Intelligence Industry

* Microsoft’s Trinity: A graph database with web-scale potential

* KXEN Announces Availability of InfiniteInsight Version 6, a Predictive Analytics Solution with Unprecedented Agility, Productivity, and Ease of Use

* Software AG Announces its Strategy for the In-memory Management of Big Data

* Attunity and Hortonworks Announce Partnership to Simplify Big Data Integration with Apache Hadoop

* Schooner Information Technology and Ispirer Systems Partner to Deliver SQLWays for SchoonerSQL

* Big Data & Search-Based Applications

* Namenode HA Reaches a Major Milestone

* How Twitter is doing its part to democratize big data

* Dropping Prices Again– EC2, RDS, EMR and ElastiCache

* For 451 Research clients

# SAS outlines Hadoop strategy, previews Hadoop-based in-memory analytics Market Development report

# Pervasive rides the elephant into ‘big data’ predictive analytics Market Development report

# IBM makes desktop discovery and analysis play, shares business analytics priorities Market Development report

# Clustrix launches SDK to tap developer interest in new databases Market Development report

# Continuent and SkySQL team up for clustered MySQL support Analyst note

# MapR gets a boost from Cisco and Informatica Analyst note

And that’s the Data Day, today.

MySQL NoSQL survey highlights role of polyglot persistence

The MySQL developer website is currently running a poll to gauge the adoption of NoSQL database projects by MySQL developers.

The results are interesting, particularly in relation to our research report on the emergence and adoption of NoSQL and NewSQL databases, which I am completing this week.

Our research has shown that one of the drivers of NoSQL has been performance, and in particular the failure of MySQL to provide predictable performance at scale. We do see NoSQL being deployed for applications that previously ran on MySQL, or for which MySQL would previously have been the natural choice.

For example, Facebook continues to run its core applications on MySQL, using the InnoDB storage engine and memcached. However, it also created what became Apache Cassandra to power its inbox search, and selected Apache HBase for its Messages application, which was updated in late 2010 to combine chat, email, and SMS, having found that MySQL was unable to deliver the performance required for large data sets.

Similarly, content discovery service StumbleUpon adopted HBase following problems with MySQL failover, Digg replaced its MySQL cluster with Apache Cassandra, and Wordnik replaced MySQL with MongoDB.

Clearly, however, not every MySQL application is suitable for a NoSQL database. Just because almost 80% of the MySQL survey respondents are adopting NoSQL databases does not mean they are replacing MySQL with NoSQL.

Like Facebook, many major NoSQL users also continue to use MySQL, including Twitter, which backtracked on a planned migration of its core status table to Apache Cassandra in 2010. It continues to use MySQL, but is adopting Cassandra for newer projects.

The adoption of multiple database products depending on the nature of the application is another of the six major drivers for NoSQL and NewSQL adoption highlighted by our research.

The theory of polyglot persistence has developed based on the fact that different data storage models have their own strengths and the acceptance that while the relational model is suitable for a large proportion of data storage requirements, there are times when a document, graph, or object database might be more suitable, or even a distributed file system.

Facebook and Twitter are prime examples of polyglot persistence in action, and the survey of MySQL developers shows that the practice is widespread. At the time of writing 205 people have responded to the survey, providing 421 responses.

If we exclude the 42 that indicate they are not using a NoSQL database, that means that the remaining 163 people are using 379 NoSQL databases, which equates to 2.33 databases per respondent, not including their existing use of MySQL or other traditional or NewSQL databases.
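The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the calculation using the figures quoted at the time of writing; it assumes, as the numbers imply, that each of the 42 non-users contributed exactly one response, and the variable names are illustrative.

```python
# Survey figures quoted in the post (snapshot at the time of writing).
respondents = 205
responses = 421
not_using_nosql = 42  # assumed: one "not using NoSQL" response per person

# Remove the non-users from both the people and the response counts.
nosql_users = respondents - not_using_nosql
nosql_selections = responses - not_using_nosql

# Average number of NoSQL databases per NoSQL-using respondent.
per_user = nosql_selections / nosql_users

print(f"{nosql_users} respondents reported {nosql_selections} NoSQL databases")
print(f"{per_user:.2f} NoSQL databases per NoSQL-using respondent")
```

Note that the average is taken over the 163 respondents who use NoSQL, not over all 205.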

I’ll provide more details of the research report, including the other four adoption drivers, once the report is published. The report analyzes the drivers behind the development and adoption of NoSQL and NewSQL databases, the evolving role of data grid technologies, and the associated use cases. It will be available soon for clients of our Information Management and CAOS practices.

User perspectives on NoSQL

The NoSQL EU event in London this week was a great one, with interesting perspectives from both vendors (Basho, Neo Technology, 10gen, Riptano) and users (The Guardian, the BBC, Amazon, Twitter). In particular I was interested in learning from the latter about how and why they ended up using alternatives to the traditional relational database model.

Some of the reasons for using NoSQL have been well-documented: Amazon CTO Werner Vogels talked about how the traditional database offerings were unable to meet the scalability Amazon.com requires. Filling a functionality void also explains why Facebook created Cassandra, Google created BigTable, and Twitter created FlockDB (etc etc). As Werner said, “We couldn’t bet the company on other companies building the answer for us.”

As Werner also explained, however, the motivation for creating Dynamo was also about enabling choice and ensuring that Amazon was not trying to force the relational database to do something it was not designed to do. “Choosing the right tool for the job” was a recurring theme at NoSQL EU.

Given the NoSQL name, it is easy to assume that this means the relational database is by default “the wrong tool”. However, the most important element in that statement is arguably not “tool”, but “job”, and The Guardian discussed how it was using non-relational data tools to create new applications that complement its ongoing investment in the Oracle database.

For example, the Guardian’s application to manage the progress of crowdsourcing the investigation of MPs’ expenses is based on Redis, while the Zeitgeist trending news application runs on Google’s AppEngine, as did its live poll during the recent leaders’ election debate. Datablog, meanwhile, relies on Google Spreadsheets to serve up usable and downloadable data – we’ll ignore for a moment whether Google Spreadsheets is a NoSQL database 😉

Long term, The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. The overarching theme, as Matthew Wall and Simon Willison explained, is that the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.

On the subject of choosing the right tool for the job, Basho’s engineering manager Brian Fink pointed out that using NoSQL technology alongside relational SQL database technology may actually improve the performance of the SQL database, since storing data that does not need SQL features in a relational database slows down access to the data that does need them.

Another perspective on this came from Werner Vogels, who noted that unlike database administrators and systems architects, users don’t care about where data resides or what model it uses, as long as they get the service they require. Werner explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems. Users do not think about data sources in isolation; they care about the amalgamated service.

This was also a theme that cropped up in the presentation by Enda Farrell, software architect at the BBC, who noted that the BBC’s homepage is a PHP application integrated with multiple data sources at multiple data centers, and in that of Twitter’s analytics lead Kevin Weil, who described Twitter’s use of Hadoop, Pig, HBase, Cassandra and FlockDB.

While the company is using HBase for low-latency analytic applications such as people search and moving to Cassandra from MySQL for its online applications, it uses its recently open-sourced FlockDB graph database to serve up data on followers and correlate the intersection of followers to (for example) ensure that Tweets between two people are only sent to the followers of both. (As something of an aside, Twitter is using Hadoop to store the 7TB of data it generates a day from Tweets, and Pig for non-real-time analytics.)
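The follower-intersection idea can be illustrated with plain Python sets. This is a minimal sketch of the logic only, not FlockDB’s API or Twitter’s implementation; the `reply_audience` function and the sample accounts are hypothetical.

```python
def reply_audience(followers, sender, recipient):
    """Return the accounts that follow both the sender and the recipient."""
    # Unknown accounts are treated as having no followers.
    return followers.get(sender, set()) & followers.get(recipient, set())


# Hypothetical follower lists, keyed by the account being followed.
followers = {
    "alice": {"carol", "dave", "erin"},
    "bob": {"dave", "erin", "frank"},
}

# A Tweet from alice to bob would only be delivered to followers of both.
audience = reply_audience(followers, "alice", "bob")
print(sorted(audience))
```

At Twitter’s scale the follower sets are far too large to intersect in application memory like this, which is presumably part of the reason for using a dedicated graph store.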

Kevin noted that the company is also working with Digg to build real-time analytics for Cassandra and will be releasing the results as open source, and also discussed how Twitter has made use of open source technologies created by others such as Facebook (both Cassandra and the Scribe log data aggregation server).

One of the issues that has arisen from the fact that organizations such as Amazon and Facebook have had to create their own data management technologies is the proliferation of NoSQL databases and a certain amount of wheel re-invention.

Werner explained that SmugMug creator Don MacAskill ended up being a MySQL expert not because he necessarily wanted to be, but because he had to be in order to keep his applications running.

“He doesn’t want to have to become an expert in Cassandra,” noted Werner. “What he wants is to have someone run it for him and take care of that.” Presumably Riptano, the new Cassandra vendor formed by Jonathan Ellis – project chair for the Cassandra database – will take care of that, but in the meantime Werner raised another long-term alternative.

“We shouldn’t all be doing this,” he said, adding that Dynamo is not as popular within Amazon Web Services as it once was because it is a product that requires configuration and management, rather than a service, and Amazon employees “have better things to do.”

Which raises the question – don’t Twitter, Facebook, the BBC, the Guardian et al have better things to do than developing and maintaining database architecture? In a perfect world, yes. But in a perfect world they’d all have strongly consistent, scalable distributed database systems/services that are suited to their various applications.

Interestingly, describing S3 as “a better key/value store than Dynamo”, Werner noted that SimpleDB and S3 are “a good start to provide that service”.