The dawn of polyglot analytics

While there has been a significant amount of interest in the volume, velocity and variety of big data (and perhaps a few other Vs, depending on who you speak to), it has become increasingly clear to us that the trends driving new approaches to data management relate not just to the nature of the data itself, but also to how the user wants to interact with that data.

As we previously noted, if you turn your attention to the value of the data then you have to take into account the trend towards storing and processing all data (or at least as much as is economically feasible), and the preferred rate of query (the acceptable time taken to generate the result of a query, as well as the time between queries). Another factor to be added to the mix is the way in which the user chooses to analyze the data: are they focused on creating a data model and schema to answer pre-defined queries, or engaging in exploratory analytic approaches in which data is extracted and the schema defined in response to the nature of the query?
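
To make that distinction concrete, here is a minimal Python sketch of the two styles; the data and function names are purely illustrative and not drawn from any particular product.

```python
# Minimal sketch contrasting the two analytic styles described above.
# Names (RAW_CLICKSTREAM, load_into_warehouse, etc.) are illustrative only.

import json

RAW_CLICKSTREAM = [
    '{"user": "a", "page": "/home", "ms_on_page": 1200}',
    '{"user": "b", "page": "/cart", "ms_on_page": 300, "referrer": "/home"}',
]

# Schema-on-write: the model is fixed up front; queries must fit the schema.
WAREHOUSE_SCHEMA = ("user", "page", "ms_on_page")

def load_into_warehouse(raw):
    rows = []
    for line in raw:
        event = json.loads(line)
        # Fields outside the pre-defined schema (e.g. "referrer") are dropped.
        rows.append(tuple(event.get(col) for col in WAREHOUSE_SCHEMA))
    return rows

# Schema-on-read: keep the raw data and impose structure only when a
# specific question arises, so previously ignored fields remain available.
def exploratory_query(raw, field):
    return [json.loads(line).get(field) for line in raw]

print(load_into_warehouse(RAW_CLICKSTREAM))
print(exploratory_query(RAW_CLICKSTREAM, "referrer"))
```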

All of these factors have significant implications for which technology is chosen to store and analyze the data. Another user-driven factor is the growing desire to use specialist data management technologies depending on the specific requirement. As we noted in NoSQL, NewSQL and Beyond, in the operational database world this approach has become known as polyglot persistence. In the analytic database market, however, we are talking not just about approaches to storing the data, but also about analyzing it. That is why we have begun using the term ‘polyglot analytics’ to describe the adoption of multiple query-processing technologies depending on the nature of the query.
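
As a purely illustrative sketch of what choosing a query-processing technology based on the nature of the query might look like, the following Python fragment routes a hypothetical analytic job to one of several engines; the engine names and selection rules are assumptions made for the example, not a description of any vendor's product.

```python
# Hypothetical sketch of the 'polyglot analytics' idea: route each analytic
# job to a different query-processing technology based on its characteristics.
# Engine names and rules are illustrative assumptions, not any vendor's design.

from dataclasses import dataclass

@dataclass
class AnalyticJob:
    data_shape: str      # "structured" or "semi-structured"
    query_style: str     # "pre-defined" or "exploratory"
    latency_budget: str  # "interactive" or "batch"

def choose_engine(job: AnalyticJob) -> str:
    if job.data_shape == "semi-structured" and job.latency_budget == "batch":
        return "hadoop_mapreduce"         # e.g. clickstream/log processing
    if job.query_style == "exploratory":
        return "exploratory_analytic_db"  # schema defined in response to the query
    return "enterprise_data_warehouse"    # pre-defined schema, known reports

print(choose_engine(AnalyticJob("semi-structured", "exploratory", "batch")))
print(choose_engine(AnalyticJob("structured", "pre-defined", "interactive")))
```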

Polyglot analytics explains why we are seeing adoption of Hadoop and MapReduce as a complement to existing data warehousing deployments. It explains, for example, why a company like LinkedIn might adopt Hadoop for its People You May Know feature while retaining its investment in Aster Data for other analytic use cases. It also explains why a company like eBay would retain its Teradata Enterprise Data Warehouse for storing and analyzing traditional transactional and customer data, adopt Hadoop for storing and analyzing clickstream, user behaviour and other semi- and unstructured data, and additionally adopt an exploratory analytic platform based on Teradata’s Extreme Data Appliance for extreme analytics on a combination of transaction and user behaviour data pulled from both its EDW and Hadoop deployments.

The emergence of this kind of exploratory analytic platform exemplifies the polyglot analytics approach: adopting a different platform based on the user’s approach to analytics rather than the nature of the data. It also highlights some of the thinking behind Teradata’s acquisition of Aster Data, IBM’s acquisition of Netezza and HP’s acquisition of Vertica, as well as the potential future role of vendors such as ParAccel and Infobright.

We are about to embark on a major survey of data management users to assess their attitudes to polyglot analytics and the drivers for adopting specific data management/analytics technologies. The results will be delivered as part of our Total Data report later this year. Stay tuned for more details on the survey in the coming weeks.

The cathedral in the bazaar: the future of the EDW

Kalido’s Winston Chen published an interesting post this week comparing Enterprise Data Warehouse projects to the building of a cathedral, notably Gaudi’s Sagrada Familia: “A timeless edifice cast in stone, for a client who’s not in a hurry.”

The requirement of many data warehousing projects to deliver on an immutable vision is one of the most likely reasons for them to fail. As we noted in our 2009 report on considerations for building a data warehouse:

“One of the most significant inefficiencies of data warehousing is that users have traditionally had to design their data-warehouse models to match their planned queries, and it can often be difficult to change schema once the data warehouse has been created. This approach is too rigid in a world of rapidly changing business requirements and real-time decision-making.”

Not only is it too rigid, as we added in our 2010 data warehousing market sizing report:

“It is also self-defeating, since a business analyst or executive that is unable to get the answers to queries they require from the EDW is likely to find their own ways to answer these queries – resulting in data silos and the exact redundancy and duplication issues the EDW was apparently designed to avoid.”

Given my dual focus on open source software, whenever I hear the term ‘cathedral’ used in the context of software, I can’t help but think of Eric Raymond’s seminal essay The Cathedral and the Bazaar, in which he made the case for collaborative open source development – the bazaar model – as an alternative to proprietary software development approaches, where software is carefully crafted, like cathedrals, to an immutable plan.

Mixing metaphors, I realised that the comparison between the cathedral and the bazaar can also be used to explain the previously discussed changing role of the enterprise data warehouse.

Whereas traditional approaches to analytics focused on building the EDW as the single source of truth (the timeless data cathedral Winston describes), today companies are more focused on taking advantage of multiple data storage and processing technologies in what is better described as a data bazaar.

However, it is not a matter of choosing between the cathedral and the bazaar. What we are seeing is the EDW becoming part of a broader data analytics architecture, retaining the data-quality and security rules and schema applied to core enterprise data while other technologies such as Hadoop, specialist analytic appliances, and online repositories are deployed for more flexible ad hoc analytic use cases and analyzing alternative sources of data – including log and other machine-generated data.

The cathedral, in this instance, is part of the bazaar.

Managing this data bazaar is essentially what our total data concept is all about: selecting the most appropriate data storage/processing technology for a particular use case, while enabling access to any data that might be applicable to the query at hand, whether that data is structured or unstructured, whether it resides in the data warehouse, or Hadoop, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.
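
As a rough illustration (and nothing more) of that idea, the following Python sketch shows a thin federation layer fanning the same query out to several hypothetical stores; the source names and the query interface are assumptions made for the example, not any specific product’s API.

```python
# Illustrative-only sketch of the 'data bazaar' idea: a thin federation layer
# that knows which stores might hold relevant data and fans a query out to all
# of them, regardless of where the data lives. Sources and interface are
# assumptions for illustration, not a description of any product.

from typing import Callable, Dict, List

# Each source exposes the same minimal interface: take a predicate, return rows.
DataSource = Callable[[Callable[[dict], bool]], List[dict]]

def make_source(rows: List[dict]) -> DataSource:
    return lambda predicate: [r for r in rows if predicate(r)]

SOURCES: Dict[str, DataSource] = {
    "edw": make_source([{"customer": "a", "spend": 120}]),
    "hadoop": make_source([{"customer": "a", "clicks": 42}]),
    "archive": make_source([{"customer": "a", "spend_2009": 75}]),
}

def federated_query(predicate: Callable[[dict], bool]) -> Dict[str, List[dict]]:
    """Fan the same predicate out to every registered store and collect results."""
    return {name: source(predicate) for name, source in SOURCES.items()}

print(federated_query(lambda row: row.get("customer") == "a"))
```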

It is also essentially what IBM’s recently disclosed Smart Consolidation approach is all about: providing multiple technologies for operational analytics, ad hoc analytics, stream and batch processing, and queryable archives, all connected by an “enterprise data hub”, with the most appropriate query-processing technology chosen for the specific workload (so after polyglot programming and polyglot persistence comes polyglot analytics).

Two of my fellow database analysts, Curt Monash and Jim Kobielus, have recently been kicking around the question of what will be the “nucleus of the next-generation cloud EDW”.

While the data bazaar will rely on a core data integration/virtualization/federation hub, it seems to me that the idea that future data management architectures require a nucleus is a remnant of ‘cathedral thinking’.

Like Curt, I think it is most likely that there will be no nucleus – or, to put it another way, that each user will have a different perspective on the nucleus based on their role. For some, Hadoop will be that nucleus; for others, it will be the regional or departmental data mart; for others still, an ad hoc analytics database. For some, it will remain the enterprise data warehouse.

I will be presenting more details about our total data concept and the various elements of the data bazaar at The 451 Group’s client event in London on June 27.