January 10th, 2012 — Data management
Oracle OEMs Cloudera. The future of Apache CouchDB. And more.
An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.
* Oracle announced the general availability of Big Data Appliance, and an OEM agreement with Cloudera for CDH and Cloudera Manager.
* The Future of Apache CouchDB Cloudant confirms intention to integrate the core capabilities of BigCouch into Apache CouchDB.
* Reinforcing Couchbase’s Commitment to Open Source and CouchDB Couchbase CEO Bob Wiederhold attempts to clear up any confusion.
* Hortonworks Appoints Shaun Connolly to Vice President of Corporate Strategy Former vice president of product strategy at VMware.
* Splunk even more data with 4.3 Introducing the latest Splunk release.
* Announcement of Percona XtraDB Cluster (alpha release) Based on Galera.
* Bringing Value of Big Data to Business: SAP’s Integrated Strategy Forbes interview with Sanjay Poonen, President and corporate officer of SAP Global Solutions.
* New Release of Oracle Database Firewall Extends Support to MySQL and Enhances Reporting Capabilities Self-explanatory.
* Big data and the disruption curve “Many efforts are being funded by business units and not the IT department and money is increasingly being diverted from large enterprise vendors.”
* Get your SQL Server database ready for SQL Azure Microsoft “codename” SQL Azure Compatibility Assessment.
* An update on Apache Hadoop 1.0 Cloudera’s Charles Zedlewski helpfully explains Apache Hadoop branch numbering.
* Xeround and the CAP Theorem So where does Xeround fit in the CAP Theorem?
* Can Yahoo’s new CEO Thompson harness big data, analytics? Larry Dignan thinks Scott Thompson might just be the right guy for the job.
* US Companies Face Big Hurdles in ‘Big Data’ Use “21% of respondents were unsure how to best define Big Data”
* Schedule Your Agenda for 2012 NoSQL Events Alex Popescu updates his list of the year’s key NoSQL events.
* DataStax take Apache Cassandra Mainstream in 2011; Poised for Growth and Innovation in 2012 The usual momentum round-up from DataStax.
* Objectivity claimed significant growth in adoption of its graph database, InfiniteGraph, and its flagship object database, Objectivity/DB.
* Cloudera Connector for Teradata 1.0.0 Self-explanatory.
* For 451 Research clients
# SAS delivers in-memory analytics for Teradata and Greenplum Market Development report
# With $84m in funding, Opera sets out predictive-analytics plans Market Development report
* Google News Search outlier of the day: First Dagger Fencing Competition in the World Scheduled for January 14, 2012
And that’s the Data Day, today.
January 6th, 2012 — Data management
As I mentioned earlier this week, a major research focus for Q1 is the MySQL ecosystem, the positives and negatives of Oracle’s MySQL strategy, and the competitive overlap between MySQL, NoSQL and NewSQL.
It is impossible to think about this without reconsidering the commitments made by Oracle to customers, developers and users of MySQL in late December 2009, which played a significant part in satisfying European Commission concerns about Oracle’s acquisition of Sun.
While the commitments were both welcomed and derided when they were announced, it is worth considering today whether those commitments have been as significant in practice as they appeared to be two years ago.
For example, Oracle’s commitment to and investment in InnoDB – while positive for MySQL users – has arguably diminished the relevance of some of the storage engine-related commitments.
We will be coming to our own conclusions based on our research over the coming weeks, but I am interested in any feedback from MySQL customers, developers and users about how well Oracle has kept to its commitments and their significance in hindsight.
You can find a full list of the commitments here but the edited highlights are below:
1. Continued Availability of Storage Engine APIs.
2. Non-assertion of copyright and no requirement for a commercial license related to implementing the storage engine APIs.
3. Extension of any existing commercial storage engine licenses until December 10, 2014.
4. Commitment to continue licensing MySQL using the GNU GPL.
5. Customers would not be required to purchase support services from Oracle as a condition of obtaining a commercial license to MySQL.
6. Increase spending on MySQL research and development.
7. Commitment to create and fund a customer advisory board.
8. Commitment to create and fund a MySQL Storage Engine Vendor Advisory Board.
9. Commitment to retain the free MySQL Reference Manual.
10. Retention of annual or multi-year subscription renewals for end-users and embedded customers.
October 3rd, 2011 — Data management
We have previously speculated at The 451 Group about Oracle’s potential to respond to the growing adoption of NoSQL databases, noting that the company had a number of options at its disposal, including Berkeley DB and projects like HandlerSocket.
While some may wonder about the potential impact of Oracle NoSQL (based indeed on Berkeley DB) on the existing NoSQL vendors, I believe the launch says something very significant about NoSQL itself: specifically that its adoption is driven by more than the nature of the query language.
To get a sense of why Oracle NoSQL is significant, think about the way Oracle has traditionally responded to alternative approaches that threaten the relational model and Oracle’s dominance of it. The company’s approach has been to subsume the alternative approach, at least in part, into Oracle Database, nullifying the competitive threat.
Oracle CEO Larry Ellison explained the approach himself on a recent call with investors:
“We think that data should be integrated with a single database technology. That’s always been our strategy for Oracle. And it started as a relational database then we added objects, then we added text and then we’ve added a variety of other things like video and audio to the Oracle Database. We think that should be unified and that’s how we’re approaching the problem.”
As we recently covered (451 clients only), Oracle is in the process of replicating this strategy with MySQL, adding the ability to directly access MySQL’s InnoDB and MySQL Cluster’s NDB storage engines using the memcached API.
This ability to perform non-SQL querying of the database is part of the agility benefit of NoSQL, and if the term NoSQL were to be taken literally would perhaps be enough to discourage would-be NoSQL adopters from turning away from MySQL.
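For illustration, here is a minimal sketch of what that non-SQL access path looks like from the application side, assuming the memcached interface to InnoDB has been enabled and a table mapped to it; the host, port, key and value shown are hypothetical, and a standard memcached client library stands in for whatever an application would actually use.

    import memcache  # python-memcached client; any memcached client would do

    # Connect to the memcached listener exposed by the database server
    # (address and port are placeholders).
    mc = memcache.Client(["127.0.0.1:11211"])

    # Reads and writes address rows in the mapped InnoDB table by key,
    # bypassing the SQL parser and optimizer entirely.
    mc.set("user:42", "alice|alice@example.com")
    print(mc.get("user:42"))

The point is simply that the application speaks the memcached protocol rather than SQL, even though the data ultimately lives in InnoDB or NDB.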
As our NoSQL, NewSQL and Beyond report highlighted, however, agility is just one of six key trends we see driving adoption of NoSQL databases. Scalability, performance, relaxed consistency, intricacy and necessity will not be solved by the ability to query MySQL or MySQL Cluster using the memcached API.
The launch of Oracle NoSQL is therefore a clear indication that there are trends at work here that cannot be solved by adding non-SQL querying to existing relational databases.
There is another significant factor here: Oracle has chosen to name the product NoSQL. In one simple naming move the company has effectively disarmed the NoSQL ‘movement’.
We have previously noted that existing NoSQL vendors were turning away from the term in favor of emphasizing their individual strengths. How many of them are going to want to self-identify with an Oracle product? I’m not convinced any of them believe the brand is worth fighting for.
July 26th, 2011 — Data management
Recently there has been a spate of postings regarding job trends for distributed data management technologies, including Hadoop and the various NoSQL databases.
One thing you rarely see on these job trends charts is comparison with an incumbent technology, for context. There’s a reason for that, as this comparison of database-related jobs from Indeed.com illustrates:
Although there has been a recent increase in job postings related to Hadoop and MongoDB, both are dwarfed, in absolute terms, by the number of job postings involving SQL Server and MySQL.
So why all the fuss about Hadoop and NoSQL, from a corporate perspective? This chart, showing the relative growth for the same data management technologies, says it all: while Hadoop and NoSQL job postings remain small in absolute terms, their growth rates far outstrip those of the incumbent databases.
July 11th, 2011 — Data management
It has been fascinating to watch how the industry has responded to ‘NewSQL’ since we published our first report using the term.
From day one the term has taken on a life of its own, as vendors such as ScaleBase, VoltDB, NimbusDB and Xeround have picked it up and run with it, while the likes of Marten Mickos and Michael Stonebraker have also adopted the term.
The reaction hasn’t been all positive, of course, although much of the criticism has been of the “are you kidding?” or “this is getting silly” variety rather than constructive debate about either the term or the associated technologies.
Another popular response is along the lines of “does this mean the end of NoSQL?”. I think it is important to address this question because it rests on a common misunderstanding about technology: that in order for the latest technology to succeed, the technology that immediately preceded it must fail.
While our report into NoSQL, NewSQL and Beyond identified common drivers for interest in NoSQL and NewSQL databases, as well as data caching/grid technologies, in truth there is a significant difference between the requirements for databases that relax consistency and/or schema dependency and those that retain the ACID properties of transactional database systems.
There will be isolated examples, but it is therefore going to be rare for a potential adopter to be directly comparing NoSQL and NewSQL technologies, unless they are still at the stage of trying to figure out the level of consistency required for an individual application.
The other option they would have is to use an existing SQL database, particularly Oracle’s MySQL, which provides the middle ground that overlaps both NoSQL and NewSQL. A significant number of the NoSQL deployments we have identified have migrated from MySQL, while existing MySQL deployments (although probably not the same ones) are also targets for the numerous NewSQL vendors.
VoltDB is a primary example, as last week’s GigaOm article covering CTO Michael Stonebraker’s view on Facebook’s MySQL ‘fate worse than death’ illustrated.
Much debate (125 comments at last count) has followed Stonebraker’s assertion that Facebook would be better off migrating to a NewSQL offering like VoltDB, most of which has not supported his view.
There’s a good reason for that. There is a good argument to be made that if you were trying to create Facebook from scratch today you probably wouldn’t choose the shard management overhead involved in MySQL. In that regard, Stonebraker has a point.
However, the fact is that MySQL was pretty much the only logical choice when Facebook began and its commitment to MySQL has grown over the years. The company is now probably one of the world’s experts in scaling and managing MySQL – to the extent that Facebook engineer Domas Mituzas argues that the operational overhead in handling sharding and availability of MySQL has become a constant cost.
Under those circumstances it would take something significant for a company like Facebook to even consider migrating to a MySQL alternative. Database migration projects are costly and complex and extremely rare – even at non-Facebook scale.
And it is not as if the company hasn’t experimented with other database technologies – having created Apache Cassandra and adopted Apache HBase for its Messages update.
This is exactly the polyglot persistence strategy we are seeing from NoSQL and NewSQL adopters: retaining MySQL (or another SQL database) where it makes sense to do so, while adding NoSQL and perhaps NewSQL for new projects and applications for which it is appropriate.
One other point to note, however, is that adopting a NewSQL technology might not require migrating away from MySQL. While the NewSQL category includes new database products such as VoltDB, it also includes alternative MySQL storage engines and database load balancing and clustering products such as ScaleBase and ScalArc, which are specifically designed to improve the scalability of MySQL (with other SQL databases to come) in order to avoid migration to an alternative database.
Adoption of these technologies does not require the complete abandonment of ‘standard MySQL’ any more than the adoption of NoSQL for non-ACID application requirements does, and it certainly doesn’t require the abandonment of NoSQL.
March 24th, 2011 — Data management
The MySQL developer website is currently running a poll to gauge the adoption of NoSQL database projects by MySQL developers.
The results are interesting, particularly in relation to our research report on the emergence and adoption of NoSQL and NewSQL databases, which I am completing this week.
Our research has shown that one of the drivers of NoSQL has been performance, and in particular the failure of MySQL to provide predictable performance at scale. We do see NoSQL being deployed for applications that previously ran on MySQL, or for which MySQL would previously have been the natural choice.
For example, while Facebook continues to run its core applications on MySQL running the InnoDB storage engine and memcached it also created what became Apache Cassandra to power its inbox search, and selected Apache HBase for its Messages application, which was updated in late 2010 to combine chat, email, and SMS, having found that MySQL was unable to deliver the performance required for large data sets.
Similarly, content discovery service StumbleUpon adopted HBase following problems with MySQL failover, Digg replaced its MySQL cluster with Apache Cassandra, and Wordnik replaced MySQL with MongoDB.
Clearly, however, not every MySQL application is suitable for a NoSQL database. Just because almost 80% of the MySQL survey respondents are adopting a NoSQL database does not mean they are replacing MySQL with NoSQL.
Like Facebook, many major NoSQL users also continue to use MySQL, including Twitter, which backtracked on a planned migration of its core status table to Apache Cassandra in 2010. It continues to use MySQL, but is adopting Cassandra for newer projects.
The adoption of multiple database products depending on the nature of the application is another of the six major drivers for NoSQL and NewSQL adoption highlighted by our research.
The theory of polyglot persistence has developed based on the fact that different data storage models have their own strengths and the acceptance that while the relational model is suitable for a large proportion of data storage requirements, there are times when a document, graph, or object database might be more suitable, or even a distributed file system.
Facebook and Twitter are prime examples of polyglot persistence in action, and the survey of MySQL developers shows that the practice is widespread. At the time of writing 205 people have responded to the survey, providing 421 responses.
If we exclude the 42 that indicate they are not using a NoSQL database, that means that the remaining 163 people are using 379 NoSQL databases, which equates to 2.33 databases per respondent, not including their existing use of MySQL or other traditional or NewSQL databases.
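For clarity, that figure is just the reported totals with the non-users removed; a trivial sketch of the arithmetic:

    # Survey figures as reported at the time of writing.
    respondents, responses = 205, 421
    not_using_nosql = 42

    nosql_users = respondents - not_using_nosql   # 163 people
    nosql_picks = responses - not_using_nosql     # 379 NoSQL databases in use
    print(round(nosql_picks / nosql_users, 2))    # -> 2.33 per respondent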
I’ll provide more details of the research report, including the other four adoption drivers, once the report is published. The report contains analysis of the drivers behind the development and adoption of NoSQL and NewSQL databases, the evolving role of data grid technologies, and the associated use cases. It will be available soon for clients of our Information Management and CAOS practices.
March 22nd, 2010 — Data management
Gear6’s Mark Atwood is less than impressed with my recent statement: “Memcached is not a key value store. It is a cache. Hence the name.”
Mark has responded with a post in which he explains how memcached can be used as a key value store with the assistance of “persistent memcached” from Gear6, or by combining memcached with something like Tokyo Cabinet.
As much as I agree with Mark that other technologies can be used to turn memcached into a key value store, I can’t help thinking his post actually proves my point: that memcached itself is not a key value store.
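To illustrate the distinction, here is a minimal write-through sketch of the kind of pairing Mark describes: memcached serves reads, while a separate persistent engine (Tokyo Cabinet, in his example) supplies the durability memcached alone lacks. The client library, address and the dict standing in for the persistent store are all illustrative assumptions, not a description of Gear6’s product.

    import memcache

    cache = memcache.Client(["127.0.0.1:11211"])
    durable_store = {}  # stand-in for a persistent engine such as Tokyo Cabinet

    def put(key, value):
        durable_store[key] = value   # write to the durable store first...
        cache.set(key, value)        # ...then populate the cache

    def get(key):
        value = cache.get(key)
        if value is None:            # cache miss or eviction: fall back to the store
            value = durable_store.get(key)
            if value is not None:
                cache.set(key, value)
        return value

Remove durable_store from that sketch and what remains is a cache that can silently lose data on eviction or restart, which is precisely the point.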
Either way it brings me to the next post in the NoSQL series (see also The 451 Group’s recent Spotlight report), looking at what the existing technology providers are likely to do in response.
I spent last week in San Francisco at the Open Source Business Conference where David Recordon, head of open source initiatives at Facebook, outlined how the company makes use of various open source projects, including memcached and MySQL, to scale its infrastructure.
It was an interesting presentation, although the thing that stood out for me was that Recordon didn’t once mention Cassandra, the open source key value store created by Facebook, despite being asked directly about the company’s plans for what was rather quaintly referred to as “non-relational databases”.
In fact, this recent post from Recordon puts Cassandra in context: “we use it for Inbox search, but the majority of development is now being led by Digg, Rackspace, and Twitter”. It is technologies like MySQL and memcached that Facebook is scaling to provide its core horsepower.
The death of memcached, as they say, has been greatly exaggerated.
That said, it is clear that to some extent the rise of NoSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently. Sharding is a popular method of increasing the scalability of the MySQL database to serve the requirements of high-traffic websites, but it’s manually intensive. The memcached distributed memory object-caching system can also be used to improve performance, but does not provide persistence.
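To see why sharding is manually intensive, note that the application, not the database, has to decide which server holds each row. A simplified, hypothetical routing function (the shard names and count are invented for illustration):

    SHARDS = ["mysql-shard-0.example.com",
              "mysql-shard-1.example.com",
              "mysql-shard-2.example.com",
              "mysql-shard-3.example.com"]

    def shard_for(user_id):
        # Every query for this user must be routed here by application code;
        # adding shards means changing this mapping and migrating data by hand.
        return SHARDS[user_id % len(SHARDS)]

    print(shard_for(42))  # -> mysql-shard-2.example.com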
An alternative to throwing out investments in MySQL and memcached in favor of NoSQL is to improve the MySQL/memcached combination, however. A number of vendors, including Gear6 and NorthScale, are developing and delivering technologies that add persistence to memcached (see recent 451 Group coverage on Gear6 and NorthScale), while appliance providers such as Schooner Information Technology (451 coverage) and Virident Systems (451 coverage) have taken an appliance-based approach to adding persistence.
Another approach would be to improve the performance of MySQL itself. ScaleDB (451 coverage) has a shared-disk storage engine for MySQL that promises to improve its scalability. We have also recently come across GenieDB (451 coverage), which is promising a massively distributed data storage engine for MySQL. Additionally, Tokutek’s TokuDB MySQL storage engine is based on Fractal Tree indexing technology that reduces data-insertion times, improving the performance of MySQL for both read and write applications.
As we noted in our recent assessment of Tokutek, while TokuDB is effectively an operational database technology, it does blur the line between operations and analytics since the company claims it delivers a performance improvement sufficient to run ad hoc queries against live data.
Beyond MySQL, while we expect the database incumbents to feel the impact of NoSQL in certain use cases, the lack of consistency (in the CAP Theorem sense) of NoSQL databases inevitably enables the incumbents to quickly dismiss their wider applicability. Additionally, we expect to see the data management vendors take steps to improve performance and scalability. One method is through the use of in-memory databases to improve performance for repeatedly accessed data; another is through the use of in-memory data grid caching technologies, which are designed to solve both performance and scalability issues.
Although these technologies do not provide the scalability required by Facebook, Amazon, et al., the question is, how many applications need that level of scalability? Returning again to CAP Theorem, if we assume that most applications do not require the levels of partition tolerance seen at Google, expect the incumbents to argue that what they lack in partition tolerance they can make up for in consistency and availability.
Somewhat inevitably, the requirements mandated by NoSQL advocates will be watered down for enterprise adoption. At that level, it may arguably be easier for incumbent vendors to sacrifice a little consistency and availability for partition tolerance than it will be for NoSQL projects to add consistency and availability.
Much will depend on the workload in question, which is something that is being hidden by debates that assume a confrontational relationship between SQL and NoSQL databases. As the example of Facebook suggests, there is room for both MySQL/memcached and NoSQL.
August 6th, 2009 — Data management
Since the start of this year I’ve been covering data warehousing as part of The 451 Group’s information management practice, adding to my ongoing coverage of databases, data caching, and CEP, and contributing to the CAOS research practice.
I’ve covered data warehousing before but taking a fresh look at this space in recent months it’s been fascinating to see the variety of technologies and strategies that vendors are applying to the data warehousing problem. It’s also been interesting to compare the role that open source has played in the data warehousing market, compared to the database market.
I’m preparing a major report on the data warehousing sector, for publication in the next couple of months. In preparation for that I’ve published a rough outline of the role open source has played in the sector over on our CAOS Theory blog. Any comments or corrections much appreciated.