The Data Day, Two days: November 8/9 2012

Funding for Neo, Elasticsearch and Hadapt. And more.

And that’s the Data Day, today.

The Data Day, Two days: October 15/16 2012

NGDATA searches for consumer intelligence. Sparsity looks for social analytics partners.

And that’s the Data Day, today.

The Data Day, Two days: September 17/18 2012

Google’s Spanner. Acunu. OpTier. Opera. MarkLogic. And more.

And that’s the Data Day, today.

Hadoop is dead. Long live Hadoop.

GigaOM published an interesting article over the weekend, written by Cloudant’s Mike Miller, about why the days are numbered for Hadoop as we know it.

Miller argues that while Google’s MapReduce and file system research inspired the rise of the Apache Hadoop project, Google’s subsequent research into areas such as incremental indexing, ad hoc analytics and graph analysis is likely to inspire the next generation of data management technologies.

We’ve made similar observations ourselves but would caution against assuming, as some people appear to have done, that implementations of Google’s Percolator, Dremel and Pregel projects are likely to lead to Hadoop’s demise. Hadoop’s days are not numbered. Just Hadoop as we know it.

Miller makes this point himself when he writes “it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology.”

As we noted in our 2011 Total Data report:

“it may be that we see more success for distributed data processing technologies that extend beyond Hadoop’s batch processing focus… Advances in the next generation of Hadoop delivered in the 0.23 release will actually enable some of these frameworks to run on the HDFS, alongside or in place of MapReduce.”

With the ongoing development of that 0.23 release (now known as Apache Hadoop 2.0), we are beginning to see that process in action. Hadoop 2.0 includes the delivery of the much-anticipated MapReduce 2.0 (also known as YARN, or NextGen MapReduce). Whatever you choose to call it, it is a new architecture that splits the JobTracker into its two major functions: resource management and application lifecycle management. The result is that multiple versions of MapReduce can run in the same cluster, and that MapReduce becomes one of several frameworks that can run on the Hadoop Distributed File System.
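For a sense of what that split looks like in practice, here is a minimal, hypothetical sketch of how a non-MapReduce framework might submit itself to YARN’s ResourceManager using the Hadoop 2 YARN client API (as stabilized in later 2.x releases). The application master class, application name and memory figures are illustrative assumptions, not taken from any particular framework:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager, which now handles only resource management
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-framework"); // illustrative name

        // Describe the container that will run this framework's ApplicationMaster,
        // which takes over application lifecycle management from the old JobTracker
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx256m com.example.MyApplicationMaster")); // hypothetical AM class
        appContext.setAMContainerSpec(amContainer);

        // Resources requested for the ApplicationMaster container
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(256);
        appContext.setResource(capability);

        yarnClient.submitApplication(appContext);
    }
}
```

The point of the sketch is the division of labor: the ResourceManager schedules containers, while each framework ships its own ApplicationMaster, which is why MapReduce becomes just one framework among several on the same cluster.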

The first of these frameworks is Apache HAMA, a bulk synchronous parallel computing framework for scientific computations. But we will also see other frameworks supported by Hadoop – thanks to Arun C Murthy for pointing to two of them – and we fully expect the likes of incremental indexing, ad hoc analytics and graph analysis to be among them.

As we added in Total Data:

“This supports the concept recently raised by Apache Hadoop creator Doug Cutting that what we currently call ‘Hadoop’ could perhaps be thought of as a set of replaceable components in a wider distributed data processing ecosystem… the definition of Hadoop might therefore evolve over time to encompass some of the technologies that could currently be seen as potential alternatives…”

The future of Hadoop is… Hadoop.

The Data Day, Today: May 18 2012

SAP expands HANA. Informatica embraces big data. Gary Bloom joins MarkLogic. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* For 451 Research clients

# Informatica 9.5: ‘big data’ runs through the integration platform makeover Impact Report

# Lucid Imagination launches search-based ‘big data’ platform Impact Report

# Datameer updates Hadoop-based BI stack with an eye to more complex analysis Impact Report

# MarkLogic searches for operational analytics role with plans for SQL, MapReduce support Impact Report

# Infobright shines following shift to machine-generated data Impact Report

# Starcounter focuses on performance with in-memory database update Impact Report

# Guavus bears fruit with data-processing platform for communications operators Impact Report

# InsightSquared bags $4.5m series A funding and salesforce.com as an investor Impact Report

# MarkLogic names veteran exec Gary Bloom as new president and CEO Analyst note

* SAP Continues to Expand Capabilities and Scale of SAP HANA Platform and Ease Developer Adoption

* SAP HANA Offers Multi-Node Capabilities to Help Customers Scale Out

* Gary Bloom Joins MarkLogic as Chief Executive Officer

* Amazon RDS for SQL Server and .NET support for AWS Elastic Beanstalk

* Informatica 9.5 Unleashes the Power of Hadoop

* Informatica Brings Master Data Management to Big Data, Social, Cloud and Mobile Computing

* Talend Announces New Release of Enterprise Open Source Integration Platform

* Lucid Imagination Combines Search, Analytics and Big Data to Tackle the Problem of Dark Data

* Big Data Refinery Fuels Next-Generation Data Architecture

* 7 Key Drivers for the Big Data Market

* Google puts a price tag on Cloud SQL services

* Actuate and Hortonworks Collaborate to Visualize Big Data

* Hadapt and Cloudera Deliver Big Data Analytics with Apache Hadoop

* Cloudera Partners With Hadoop Managed Services Provider MetaScale to Help Large Traditional Enterprises Adopt Apache Hadoop

* Opera Solutions’ Big Analytics Tailor Made for SAP HANA: Signal Hub Technology

* Cloudant to Contribute Big Data Capabilities to Apache CouchDB Project

* Hortonworks and Kognitio Announce Technical Partnership

* Starcounter Unveils World’s Fastest Consistent Database

* XAP 9.0 – Geared for Real-Time Big Data Stream Processing

* How long before R overtakes SAS and SPSS?

* Betting big on live sports data, Perform lays €120 million on RunningBall

And that’s the Data Day, today.

The Data Day, Today: May 8 2012

IBM acquires Vivisimo. Funding for Birst, ParAccel, Metamarkets and DataSift. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* For 451 Research clients

# IBM picks up Vivisimo to search for value in ‘big data’ Deal Analysis

# Teradata delivers on analytic cloud vision with Active Data Warehouse Private Cloud Impact Report

# The Big Blue picture for ‘big data’ analytics: IBM sheds light on BigSheets Impact Report

# Oversight Systems’ Continuous Analysis extracts actionable insight from data Impact Report

# Kalido updates MDM offering with business users, operationalizing master data in mind Impact Report

# Delphix reaps reward from agile approach to database virtualization Impact Report

# Automated Insights looks to pitch narrative, visuals and stats to enterprises Impact Report

# myDIALS eyes indirect sales in quest to be Internet access layer for analytics Impact Report

* IBM Advances Big Data Analytics with Acquisition of Vivisimo Also announces support for Cloudera.

* Teradata Announces 2012 First Quarter Results Revenue up 21% (PDF)

* Actuate Reports First Quarter 2012 Financial Results Revenue up 9% (PDF)

* Birst Secures $26 Million in Financing Led By Sequoia Capital

* ParAccel Closes Record Q1 Revenues and $20 Million Investment Round

* Metamarkets Raises $15 Million to Deliver Data Science-as-a-Service

* DataSift adds $7.2M: The story so far and focus for the future

* Teradata to Acquire eCircle (PDF)

* Google BigQuery brings Big Data analytics to all businesses

* TIBCO Spotfire Brings the Power of Data Discovery to Big Data and Extreme Information

* Jaspersoft Teams with VMware To Deliver Business Intelligence for Data-Driven Cloud Applications

* Kalido and Teradata Sign Global Reseller Agreement

* Actuate Announces Cloudera Alliance to Support Apache Hadoop and BIRT Developers in Big Data Integration

* Hortonworks and Kognitio Announce Technical Partnership Driving Apache Hadoop Adoption in Big Data Analytics Implementations

* Tokutek and PalominoDB Partner to Bring Scale, Performance to Database Deployments

* Acunu is pleased to announce v2 of the Acunu Data Platform!

* Is Yahoo really threatening memcached and Open Compute?

* Introducing Zend DBi as a MySQL Replacement on IBM i

* Zettaset and Hyve Solutions Build First Fully Integrated Enterprise OS Hadoop Solution

* Cloudera Announces New Japanese Subsidiary

* Bull Announces the Formation of Database Migration Business Unit

* Couchbase to Run Native with Key-Value API for ioMemory

* The Big Data Value Continuum

* Big Data is Business Intelligence plus Attention Deficit Disorder

* Nokia released Dempsy, an open source stream data processing platform.

And that’s the Data Day, today.

Search by another name: enterprise search starts to mature into ‘application era’

Customers of The 451 Group would have seen my report on the enterprise search market published September 15. If you are a client, you can view it here. I thought it would be useful to provide a condensed version of the report to a wider audience, as I think the market is at an important point in its development and it merits a broader discussion.

The enterprise search market is morphing before our eyes into something new. Portions of it are disappearing, and others are moving into adjacent markets, but a core part of it will remain intact. A few key factors have caused this, we think. Some are historical, by which we mean they had their largest effect in the past, but the ongoing effect is still being felt, whereas the contemporary factors are the ones that we think are having their largest impact now, and will continue to do so in the short-term future (12-18 months).

Historical factors

  • Over-promising and under-delivery of intranet search between the last two US recessions, roughly between 2002 and 2007, resulting in a lot of failed projects.
  • A lack of market awareness and understanding of the value and risk inherent in unstructured data.
  • The entrance of Google into the market in 2002.
  • The lack of vision by certain closely related players in enterprise content management (ECM) and business intelligence (BI).

Contemporary factors

  • The lack of a clear value proposition for enterprise search.
  • The rise of open source, in particular Apache Lucene/Solr.
  • The emergence of big data, or total data.
  • The social media explosion.
  • The rapid spread of SharePoint.
  • The acquisitive growth of Autonomy Corp.
  • Acquisition of fast-growing players by major software vendors, notably Dassault Systemes, Hewlett-Packard and Microsoft.

The result of all this has been a split into roughly four markets, which we refer to as low-end, midmarket, OEM and high-end search-based applications.

Entry-level search

The low-end, or entry-level, enterprise search market has become, if not commodified, then pretty close to it. It is dominated by Google and open source. Other commercial vendors that once played in it have mostly left the market.

The result is that potential entry-level enterprise search customers face a stark choice: on the one hand, Google’s yellow search appliances, with two-year-term licenses and somewhat limited configurability (but truly plug-and-play installation); on the other, open source. It is a closed versus a very open box, and the two have different and equally enthusiastic customer bases. Google is a very popular department-level choice, often purchased by line-of-business knowledge workers frustrated with obsolete and over-engineered search engines. Open source is, of course, popular with those who want to configure their search engine themselves, or have a service provider do it, and thus retain a lot of control over how the engine works, as well as the results it delivers. Apache Lucene is also part of many commercial, high-end enterprise search products, including those of IBM.
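To illustrate how low the barrier to entry has become at this end of the market, here is a minimal sketch of indexing and querying a single document with open source Apache Lucene, written against the Lucene 3.x API (adjust the Version constant to your release; the field name and document text are invented for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MiniSearch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory(); // in-memory index for the example
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Index a single document with one analyzed text field
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("body", "enterprise search on the intranet",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Parse a user query and run it against the index
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_36, "body", analyzer)
                .parse("intranet");
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("body"));
        }
    }
}
```

Everything here – the analyzer, the field handling, the query syntax – is configurable or replaceable, which is exactly the control that attracts the do-it-yourself crowd.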

Midmarket search

Midmarket search is a somewhat vague area, where vendors are succeeding in intranet search deals of roughly $75,000-250,000. This area has thinned out as some vendors have tried to move upmarket into the world of search-based applications, but there are still many vendors making a decent living here. However, SharePoint has had a major effect on this part of the market: if enterprises already have SharePoint – and Microsoft reckons more than 70% have at least bought a license at some point – then it can be tough to offer a viable alternative. If SharePoint isn’t the main focus, though, there is still a decent business to be had offering effective enterprise search, often in specific verticals, albeit without a huge amount of vertical customization.

OEM

The OEM search business has become a lot more interesting recently, in part because of which vendors have left it, leaving space for others. Microsoft’s acquisition of FAST in early 2008 meant one of the two major vendors at the time essentially left the market entirely, since its focus moved almost wholly to SharePoint, as we recently documented. The other major OEM vendor at the time was Autonomy, and while it would still consider itself one, we think much of its OEM business, in fact, comes from document filters rather than the OEMing of the IDOL search engine. Autonomy would strongly dispute that, but it might be moot soon anyway – it now looks as if Autonomy will end up as part of Hewlett-Packard, following the August 18 announcement of its acquisition at a huge valuation.

Those exits have left room for the rise of other vendors in the space. Key markets here include archiving, data-loss prevention and e-discovery. Many tools in these areas have old or quite basic search and text analysis functionality embedded in them, and vendors are looking for more powerful alternatives.

Search-based applications

The high end of the enterprise search market has become, in effect, the market for search-based applications (SBAs) – that is, applications built on top of a search engine rather than solely a relational database (although they often work alongside a database). These were touted back in the early 2000s by FAST, but it was too early, and FAST’s toolset was too complex, for the notion to gain widespread acceptance. But in the latter part of the last decade and into this one, SBAs have emerged as an answer to the problem of generic intranet search engines getting short shrift from users dissatisfied that the engines don’t deliver what they want, when they want it.

Until recently, SBAs have mainly been a case of vendors and their implementation partners building one-off custom applications for customers. But they are now moving to the stage where out-of-the-box user interfaces are supplied for common tasks. In other words, the market is maturing in a similar way to the application software industry of 20 years ago, which was built on top of the explosion in the use of relational databases.

We’ve seen examples in manufacturing, banking and customer service, and one of the key characteristics of SBAs is their ability to combine structured and unstructured data in a single interface. That was also the goal of earlier efforts to combine search with business-intelligence tools, which often simply took the form of adding a search engine to a BI tool. That was too simplistic, and the idea didn’t really take off, in part because search vendors hadn’t paid enough attention to structured data.

But SBAs, which put much more focus on the indexing process than earlier efforts, appear to be gaining traction. If we were to get to the situation where search indexes are considered a better way of manipulating disparate data types than relational databases, that would be a major shift (see big data). Another key element of successful SBAs is that they don’t look like traditional search engines, with a large amount of white space and a search bar in the middle of the screen. Rather, they make use of facets and other navigation techniques to guide users through information, or often simply to present the relevant information to them.
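As a rough sketch of that facet-driven pattern, here is what a faceted query looks like through the open source SolrJ client. The Solr URL and the industry/region field names are hypothetical; the point is that the engine returns navigable value counts alongside the hits, rather than just a ranked list:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr core reachable at this (hypothetical) URL
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/accounts");

        // Match everything, but ask for value counts on two structured fields
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("industry"); // hypothetical field names
        query.addFacetField("region");

        QueryResponse rsp = solr.query(query);
        for (FacetField facet : rsp.getFacetFields()) {
            System.out.println(facet.getName());
            for (FacetField.Count count : facet.getValues()) {
                System.out.println("  " + count.getName() + " (" + count.getCount() + ")");
            }
        }
    }
}
```

A user interface built on those counts can guide users through the data by industry or region without anyone ever typing a keyword, which is the behavior described above.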

As I mentioned, there’s more in the full report, including more about specific vendors, total (or big) data and the impact of social media. If you’d like to know more about it, please get in touch with me.

Necessity is the mother of NoSQL

As we noted last week, necessity is one of the six key factors driving the adoption of alternative data management technologies that we identified in our latest long-format report, NoSQL, NewSQL and Beyond.

Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for incumbent database vendors to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regard to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.

The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s BigTable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating System Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.

The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects.

Not only were the existing database products and vendors unsuitable for their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, Powerset and Zvents, could not rely on the incumbent vendors to develop anything suitable, given those vendors’ desire to protect their existing technologies and installed bases.

Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.

Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.

That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and LinkedIn to create their own implementations.

These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.

Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.
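To make the Redis example concrete: the real-time analytics itch Sanfilippo was scratching reduces largely to very fast counters. Here is a minimal sketch using the open source Jedis client (the key naming scheme is invented for illustration, and a Redis server is assumed on localhost):

```java
import redis.clients.jedis.Jedis;

public class PageviewCounter {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost"); // assumes a local Redis server

        // Record a hit: one atomic, in-memory increment per pageview, no schema
        jedis.incr("pageviews:2011-05-09:/blog/nosql");

        // Read the running total back immediately, in real time
        String total = jedis.get("pageviews:2011-05-09:/blog/nosql");
        System.out.println("views so far: " + total);

        jedis.close();
    }
}
```

The workload is trivial for a key-value store but awkward for a relational database under the same write volume, which is the necessity argument in miniature.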

We would also note that even some of the major vendor-led projects, such as Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former Doubleclick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).

In this context it is interesting to note that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.

These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.

NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

Sizing and analyzing the cloud-based archiving market

The cloud archiving market will generate around $193m in revenues in 2010, growing at a CAGR of 36% to reach $664m by 2014.

This is a key finding from a new 451 report published this week, which offers an in-depth analysis of the growing opportunity around how the cloud is being utilized to meet enterprise data retention requirements.
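As a quick back-of-the-envelope check, compounding the 2010 figure at a 36% CAGR over the four years to 2014 gives:

\[
\$193\text{m} \times (1.36)^4 \approx \$660\text{m}
\]

which lines up with the $664m projection, the small gap presumably reflecting rounding in the underlying year-by-year model.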

As well as sizing the market, the 50-page report – Cloud Archiving: A New Model for Enterprise Data Retention – details market evolution, adoption drivers and benefits, plus potential drawbacks and risks.

These issues are examined in more detail via five case studies offering real-world experiences of organizations that have embraced the cloud for archiving purposes. The report also offers a comprehensive overview of the key players from a supplier perspective, with detailed profiles of cloud archive service providers, discussion of related enabling technologies that will act as a catalyst for adoption, and expected future market developments.

Profiled suppliers include:

  • Autonomy
  • Dell
  • Global Relay
  • Google
  • i365
  • Iron Mountain
  • LiveOffice
  • Microsoft
  • Mimecast
  • Nirvanix
  • Proofpoint
  • SMARSH
  • Sonian
  • Zetta

Why a dedicated report on archiving in the cloud, you may ask? It’s a fair question, and one that we encountered internally, since archiving aging data is hardly the most dynamic-sounding application for the cloud.

However, we believe cloud archiving is an important market for a couple of reasons. First, archiving is a relatively low-risk way of leveraging cloud economics for data storage and retention, and is less affected by the performance/latency limitations that have stymied enterprise adoption of other cloud-storage applications, such as online backup. For this reason, the market is already big enough in revenue terms to sustain a good number of suppliers; a broad spectrum that spans from Internet/IT giants to tiny, VC-backed startups. It is also set to experience continued healthy growth in the coming years as adoption extends from niche, highly regulated markets (such as financial services) to more mainstream organizations. This will pull additional suppliers – including some large players – into the market through a combination of organic development and acquisition.

Second, archiving is establishing itself as a crucial ‘gateway’ application for the cloud that could encourage organizations to embrace the cloud for other IT processes. Though it is still clearly early days, innovative suppliers are looking at ways in which data stored in an archive can be leveraged in other valuable ways.

All of these issues, and more, are examined in much more detail in the report, which is available to CloudScape subscribers here and Information Management subscribers here. An executive summary and table of contents (PDF) can be found here.

Finally, the report should act as an excellent primer for those interested in knowing more about how the cloud can be leveraged to help support e-discovery processes; this will be covered in much more detail in another report to be published soon by Katey Wood.

Google’s enterprise search: in the cloud & in a box

Google has changed the name and the scope of the Website search it offers to Website owners that want a little more than simply to know that their site is being indexed by Google, but don’t want to go as far as buying one of its blue or yellow search appliances. 451 clients can read what we thought of it here.

Google has three levels of Website search to offer organizations: completely free but with no control over which parts of your website are indexed and when, known as Custom Search Edition/AdSense for Search (CSE/AFS); the newly rebranded Google Site Search; and the Google search appliances, which it sells in Mini and Search Appliance form factors and which can be used for external-facing Website search as well as intranet search.

Google stopped issuing customer numbers for its appliances in October 2007, at which point it had sold to about 10,000 organizations. I suspect that number is around 11,500 now, though I don’t have any great methodology to back that up; I’m simply extrapolating from previously issued growth figures. That’s an extraordinary number of organizations with a Google box.

To give some perspective, Autonomy has ~17,000 customers now, but the vast majority came from Verity. When Autonomy bought Verity in November 2005, Verity had about 15,000 customers (and Autonomy had about 1,000). Verity, in turn, had gained about 8,000 of those customers via its acquisition of Cardiff Software in February 2004. So in about 2.5 years Autonomy has added about 1,000 customers, though of course it has done a lot of up-selling to its base and doesn’t play in the low-cost search business anymore (mainly because of Google).

The actual number of Google appliances sold is higher, of course, as many organizations have multiple appliances. I’ll never forget, 18 months or so ago, standing in a room at a top-3 Wall Street investment bank with its top ~25 technologists gathered together, and seeing about six of them put up their hands when asked who had a Google appliance – most of those appliances were unknown to the boss, and to the other technologists in the room.

But Google appliance proliferation is commonplace in large organizations. The things are so cheap and so relatively easy to install that they are often bought under the radar of IT. The problem comes when times get tough (as they certainly are in investment banking IT) and the organization wants to wring more out of the assets it has – even if it didn’t know it had those assets until relatively recently.

That’s why we strongly expect Google to come out with some sort of management layer this year to handle this sort of unintended (by the customer, that is) proliferation. Watch this space.