January 31st, 2012 — Data management
As expected, EMC has announced that it is integrating its Greenplum HD distribution of Apache Hadoop with its Isilon scale-out NAS technology. The move coincides with a re-branding of the company’s Hadoop distributions that, while slight, could prove significant.
Specifically, EMC has enabled the Hadoop Distributed File System (HDFS) as a native protocol supported on OneFS in addition to Network File System (NFS) and Common Internet File System (CIFS) support, enabling Isilon systems to provide the underlying storage layer for Hadoop processing, as well as a common storage pool for Hadoop and other systems.
EMC is talking up the benefits of combining Isilon with Greenplum HD. For the record, that’s the Hadoop distribution previously known as Greenplum HD Community Edition, based on the Apache Hadoop 0.20.1 code branch.
Greenplum HD Enterprise Edition, based on MapR Technologies’ M5 distribution, is now known as Greenplum MR, and is not supported by Isilon because it replaces HDFS with Direct Access NFS.
EMC notes that Greenplum MR is being positioned as a high-performance Hadoop offering for customers that have failed to achieve their required performance from other distributions.
While EMC is quick to stress its satisfaction with the MapR relationship and its commitment to Greenplum MR, it’s clear that tight integration with Isilon, particularly in the EMC Greenplum DCA, will result in an expanded role for Greenplum HD.
Additionally, while the company’s Greenplum Command Center provides unified management for the Greenplum Database, Greenplum HD and Greenplum Chorus as part of the recently announced Unified Analytics Platform (UAP), MapR has its own management and monitoring functionality.
Since we expect EMC to pitch the benefits of integrated software in UAP, and of integrated software and hardware in DCA, it is now clear that Greenplum HD, rather than Greenplum MR, is considered the company’s primary Hadoop distribution.
Given Greenplum HD’s starring role in the Unified Analytics Platform (UAP), Data Computing Appliance (DCA) and integration with Isilon, Greenplum MR’s role is likely to become increasingly niche.
January 27th, 2012 — Data management
451 Research yesterday announced the publication of its 2012 Previews report, a wide-ranging study highlighting the most disruptive and significant trends that our analysts expect to dominate and drive the enterprise IT industry agenda over the coming year.
The 93-page report provides an outlook and assessment across all 451 Research technology sectors and practice areas – including software infrastructure, cloud enablement, hosting, security, datacenter technologies, hardware, information management, mobility, networking and eco-efficient IT – with input from our team of 40+ analysts. The 2012 Previews report is available upon request here.
IM research director Simon Robinson has already provided a taster of our predictions as they relate to the information-centric landscape. Below I have outlined some of our core predictions related to the data-centric ecosystem:
The overall trend predicted for 2012 could best be described as a shifting focus from volume, velocity and variety to delivering value. Our concept of Total Data reflects the path from the volume, velocity and variety of information sources to the all-important endgame of deriving value from data. We expect to see increased interest in data integration and analytics technologies and approaches designed specifically to exploit the potential benefits of ‘big data’, and mainstream adoption of Hadoop and other new sources of data.
We also anticipate, and are beginning to see, increased focus on technologies that enable access to data in different storage platforms without requiring data movement. We believe there is an emerging role for what we are calling the ‘data hub’ – an independent platform that is responsible for managing access to data on the various data storage and processing technologies.
Increased understanding of the value of analytics will also increase interest in the integration of analytics into operational applications. Embedded analytics is nothing new, but has the potential to achieve mainstream adoption this year as the dominant purveyors of applications used to run operations are increasingly focused on serving up embedded analytics as a key component within their product portfolios. Equally importantly, many of them now have database platforms capable of uniting previously disparate technologies to deliver true embedded analysis.
There has been a growing recognition over the past year or so that any type of data management project – whether focused on master data management (MDM), data or application integration, or data quality – needs to bring real benefits to business processes. Some may see this assertion as obvious and pretty easy to achieve, but that’s not necessarily the case. However, it is likely to become more so in the next 12-18 months as companies realize a process-driven approach to most data management programs makes sense and vendors deliver capabilities to meet this demand.
While ‘big data’ presents a number of opportunities, it also poses many challenges, not the least of which is the shortage of developers, managers, analysts and scientists with analytics skills. The users and investors betting on the opportunities offered by new data management products will see little return if they cannot hire people to deploy, manage and run those products, or analysts to make sense of the data they produce. It is not surprising, therefore, that the vendors supplying those technologies are investing in ensuring that there is a competent workforce to support existing and new projects.
Finally, while cloud computing may be one of the technology industry’s hot topics, it has had relatively little impact on the data management sector to date. That is not to say that databases are not available on cloud computing platforms, but we must make a distinction between databases that are deployed in public clouds, and ‘cloud databases‘ that have the potential to fulfil the role of emerging databases in building private and hybrid clouds. The former have been available for many years. The latter are just beginning to come to fruition based on NoSQL databases, as well as a new breed of NewSQL relational databases, designed to meet the performance, scalability and flexibility needs of large-scale data processing.
451 Research clients can get more details of these specific predictions via our 2012 preview – Information Management, Part 2. Non-clients can apply for trial access at the same link, while the entire 2012 Previews report is available here.
Also, mark your diaries for a webinar discussing report highlights on Thursday Feb 9 at noon ET, which will be open for clients and non-clients to attend. Registration details to follow soon…
January 19th, 2012 — Data management
Amazon launches DynamoDB. Red Hat virtually supports JasperReports. And more.
An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.
* Amazon Web Services Launches Amazon DynamoDB See also blog posts from Werner Vogels and Jeff Barr, as well as reaction from DataStax and Basho.
* Jaspersoft Delivers Analytics for Red Hat Enterprise Virtualization Customers JasperReports Server is embedded in Red Hat Enterprise Virtualization 3.0.
* Tableau 7.0 Brings Simplicity to Business Intelligence Including new Data Server for data sharing and management.
* Hortonworks to Deliver Next-Generation of Apache Hadoop Pre-announcement (emphasis on the pre).
* RainStor Announces First Enterprise Database Running Natively on Hadoop as well as partnerships with Cloudera, Hortonworks, and MapR, and support from Composite Software.
* Talend Platform for Data Services Operationalizes Information and Data A common development, deployment and monitoring environment for both data management and application integration.
* Fujitsu Launches Cloud Services as a Platform for Big Data Data Utilization Platform Services.
* All you wanted to know about Hadoop, but were too afraid to ask A graphic illustration of the various versions of Apache Hadoop.
* Oracle Database or Hadoop? Another good post from Pythian’s Gwen Shapira. See also Aaron Cordova’s Do I need SQL or Hadoop?
* Meet Code 42, Accel’s first Big Data Fund investment GigaOM has the details.
* MapR CEO Sees Big Changes in Big Data in 2012 Predictive.
* Introducing DataFu: an open source collection of useful Apache Pig UDFs LinkedIn launches open source user-defined functions.
* Big Data Needs Data Scientists, Or Quants, Or Excel Jockeys … or something.
* Career of the Future: Data Scientist [INFOGRAPHIC] Infotaining.
* Knives out for Oracle. SAP and IBM offer some perspectives on Exalytics and Big Data Appliance respectively.
* For 451 Research clients
# Information Builders uses Infobright to take BI in-memory, expands SMB reach Market development report
# RainStor launches database complement to Apache Hadoop Market development report
# Heroku’s Postgres is poised for growing interest in database as a service Market development report
* Google News Search outlier of the day: This Spud’s For All of You: “2012 Is the Year of the Potato”
And that’s the Data Day, today.
January 10th, 2012 — Data management
Oracle OEMs Cloudera. The future of Apache CouchDB. And more.
An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.
* Oracle announced the general availability of Big Data Appliance, and an OEM agreement with Cloudera for CDH and Cloudera Manager.
* The Future of Apache CouchDB Cloudant confirms intention to integrate the core capabilities of BigCouch into Apache CouchDB.
* Reinforcing Couchbase’s Commitment to Open Source and CouchDB Couchbase CEO Bob Wiederhold attempts to clear up any confusion.
* Hortonworks Appoints Shaun Connolly to Vice President of Corporate Strategy Former vice president of product strategy at VMware.
* Splunk even more data with 4.3 Introducing the latest Splunk release.
* Announcement of Percona XtraDB Cluster (alpha release) Based on Galera.
* Bringing Value of Big Data to Business: SAP’s Integrated Strategy Forbes interview with Sanjay Poonen, President and corporate officer of SAP Global Solutions.
* New Release of Oracle Database Firewall Extends Support to MySQL and Enhances Reporting Capabilities Self-explanatory.
* Big data and the disruption curve “Many efforts are being funded by business units and not the IT department and money is increasingly being diverted from large enterprise vendors.”
* Get your SQL Server database ready for SQL Azure Microsoft “codename” SQL Azure Compatibility Assessment.
* An update on Apache Hadoop 1.0 Cloudera’s Charles Zedlewski helpfully explains Apache Hadoop branch numbering.
* Xeround and the CAP Theorem So where does Xeround fit in the CAP Theorem?
* Can Yahoo’s new CEO Thompson harness big data, analytics? Larry Dignan thinks Scott Thompson might just be the right guy for the job.
* US Companies Face Big Hurdles in ‘Big Data’ Use “21% of respondents were unsure how to best define Big Data”
* Schedule Your Agenda for 2012 NoSQL Events Alex Popescu updates his list of the year’s key NoSQL events.
* DataStax take Apache Cassandra Mainstream in 2011; Poised for Growth and Innovation in 2012 The usual momentum round-up from DataStax.
* Objectivity claimed significant growth in adoption of its graph database, InfiniteGraph, and its flagship object database, Objectivity/DB.
* Cloudera Connector for Teradata 1.0.0 Self-explanatory.
* For 451 Research clients
# SAS delivers in-memory analytics for Teradata and Greenplum Market Development report
# With $84m in funding, Opera sets out predictive-analytics plans Market Development report
* Google News Search outlier of the day: First Dagger Fencing Competition in the World Scheduled for January 14, 2012
And that’s the Data Day, today.
January 5th, 2012 — Data management
Apache Hadoop 1.0. The future of CouchDB (or Couchbase anyway). And more.
Welcome to the first in an occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.
* The Apache Software Foundation Announces Apache Hadoop v1.0 Self-explanatory.
* The Future of CouchDB Apache CouchDB creator Damien Katz explains why he is focusing his attention on Couchbase Server.
* Understanding Microsoft’s big-picture plans for Hadoop and Project Isotope Mary Jo Foley parses Alexander Stojanovic’s presentation.
* MongoDB Extends Leadership in NoSQL 10gen claims more than 400 commercial customers.
* 1010data’s Unique Big Data Analytics Platform Sees Stunning Growth in 2011 1010data runs the numbers on its adoption in 2011.
* TouchDB 1.0 is out TouchDB is a lightweight CouchDB-compatible database engine suitable for embedding into mobile apps.
* Data Scientist = Rock Star, Really? Virginia Backaitis is sceptical.
* Swimming with Dolphins Splunk’s connector for MySQL.
* What the Sumerians can teach us about data Pete Warden finds data inspiration at the British Museum.
* How To (Not) Get Smart About Big Data Wim Rampen on the importance of filtering noise.
* For 451 Research clients
# Total Data: exploratory analytic platforms Spotlight report
# Apache Hadoop reaches version 1.0, with more to come Analyst note
# Acunu hones focus on ‘big data’ platform for operational analytics Market development report
# Jaspersoft gets big into ‘big data,’ illuminates BI business momentum Market development report
* Google News Search outlier of the day: “Bella” Becomes Most Popular Name for Both Dogs and Cats
And that’s the Data Day, today.
December 5th, 2011 — Data management
Following last week’s post, which put the geographic distribution of Hadoop skills (based on a search of LinkedIn members) in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.
The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption of each.
We begin this week’s series with Membase and HBase, the two projects that proved, like Apache Hadoop, to have significantly greater adoption in the USA compared to the rest of the world.
The statistics showed that 58.2% of the 170 LinkedIn members with “Membase” in their member profiles are based in the US (as previously explained, we tried the same search with Couchbase, but with only 85 results we decided to use the Membase result set as it was more statistically relevant).
As with Hadoop, a significant proportion (27.1%) of those are in the Bay area, the highest proportion of all the NoSQL databases we looked at. The results also indicate that Ukraine is a hot-spot for Membase skills, with 3.5%, while Membase adoption is lower in the UK (2.4%) than for other NoSQL databases.
It should not be a great surprise that Apache HBase returned similar results to Apache Hadoop. The top eight individual regions for HBase were exactly the same as for Hadoop, although the UK (3.4%) is stronger for HBase, as is India (10.7%).
The statistics showed that 57.0% of the 1,687 LinkedIn members with “HBase” in their member profiles are based in the US, with 25.0% in the Bay area (the third highest in our sample behind Hadoop and Membase).
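As a rough sanity check, the headline percentages can be reproduced from the reported totals. A minimal sketch (the per-region member counts are back-calculated from the percentages in the post, so they are approximations, not the raw LinkedIn numbers):

```python
# Back-of-envelope check of the LinkedIn profile-search percentages.
# Totals (170 Membase, 1,687 HBase) come from the post; the US member
# counts are derived from the reported percentages, so approximate.

def share(members: int, total: int) -> float:
    """Return the percentage of `total` represented by `members`."""
    return round(100.0 * members / total, 1)

membase_total = 170
hbase_total = 1687

# 58.2% of 170 Membase profiles are US-based -> roughly 99 members
us_membase = round(membase_total * 0.582)
# 57.0% of 1,687 HBase profiles are US-based -> roughly 962 members
us_hbase = round(hbase_total * 0.570)

print(share(us_membase, membase_total))  # 58.2
print(share(us_hbase, hbase_total))      # 57.0
```

Working back from percentages this way is only approximate, but it confirms the reported shares are internally consistent with the totals.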
The series will continue later this week with MongoDB, Riak, CouchDB, Apache Cassandra, Neo4j, and Redis.
N.B. The size of the boxes is in proportion to the search result (click each image for a larger version). World map image: Owen Blacker
December 2nd, 2011 — Data management
NC State University’s Institute for Advanced Analytics recently published some interesting statistics on Apache Hadoop adoption based on a search of LinkedIn data.
The statistics graphically illustrate what a lot of people were already pretty sure of: that the geographic distribution of Hadoop skills (and presumably, therefore, adoption) is heavily weighted in favour of the USA, and in particular the San Francisco Bay Area.
The statistics showed that 64% of the 9,079 LinkedIn members with “Hadoop” in their member profiles (by no means perfect but an insightful measure nonetheless) are based in the US, and that the vast majority of those are in the Bay Area.
The results are what we would expect to see given the relative level of immaturity of Apache Hadoop adoption, as well as the nature and location of the early Hadoop adopters and Hadoop-related vendors.
The results got me thinking two things:
– how does the geographic spread compare to a more maturely adopted project?
– how does it compare to the various NoSQL projects?
So I did some searching of LinkedIn to find out.
To answer the first question I performed the same search for MySQL, as an example of a mature, widely-adopted open source project.
The results show that just 32% of the 366,084 LinkedIn members with “MySQL” in their member profiles are based in the US (precisely half that of Hadoop) while only 4.4% are in the Bay area, compared to 28.2% of the 9,079 LinkedIn members with “Hadoop” in their member profiles.
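The contrast can be summed up as a simple concentration ratio. A quick sketch using only the percentages reported above:

```python
# Compare the geographic concentration of Hadoop vs MySQL skills
# using the LinkedIn profile-search figures reported above.

hadoop_total = 9_079
mysql_total = 366_084

hadoop_us_pct, hadoop_bay_pct = 64.0, 28.2
mysql_us_pct, mysql_bay_pct = 32.0, 4.4

# Hadoop profiles are twice as concentrated in the US as MySQL profiles...
us_ratio = hadoop_us_pct / mysql_us_pct    # 2.0
# ...and more than six times as concentrated in the Bay Area.
bay_ratio = hadoop_bay_pct / mysql_bay_pct  # ~6.4

print(f"US concentration ratio:  {us_ratio:.1f}x")
print(f"Bay concentration ratio: {bay_ratio:.1f}x")
```

The much larger MySQL sample (366,084 vs 9,079 profiles) also makes its percentages considerably more stable, which is worth bearing in mind when reading the smaller NoSQL result sets that follow.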
The charts below illustrate the difference in geographic distribution between Hadoop and MySQL. The size of the boxes is in proportion to the search result (click each image for a larger version).
With regards to the second question, I also ran searches for MongoDB, Riak, CouchDB, Apache Cassandra*, Membase*, Neo4j, HBase, and Redis.
I’ll be posting the results for each of those over the next week or so, but in the meantime, the graphic below shows the split between the USA and Rest of the World (ROW) for all ten projects.
It illustrates, as I suspected, that the distribution of skills for NoSQL databases is more geographically dispersed than for Hadoop.
I have some theories as to why that is – but I’d love to hear anyone else’s take on the results.
*I had to use the ‘Apache’ qualifier with Cassandra to filter out anyone called Cassandra, while Membase returned a more statistically relevant result than Couchbase.
World map image: Owen Blacker