The cathedral in the bazaar: the future of the EDW

Kalido’s Winston Chen published an interesting post this week comparing Enterprise Data Warehouse projects to the building of a cathedral, notably Gaudi’s Sagrada Familia: “A timeless edifice cast in stone, for a client who’s not in a hurry.”

The requirement of many data warehousing projects to deliver on an immutable vision is one of their most likely failings. As we noted in our 2009 report on considerations for building a data warehouse:

“One of the most significant inefficiencies of data warehousing is that users have traditionally had to design their data-warehouse models to match their planned queries, and it can often be difficult to change schema once the data warehouse has been created. This approach is too rigid in a world of rapidly changing business requirements and real-time decision-making.”

Not only is it too rigid, as we added in our 2010 data warehousing market sizing report:

“It is also self-defeating, since a business analyst or executive that is unable to get the answers to queries they require from the EDW is likely to find their own ways to answer these queries – resulting in data silos and the exact redundancy and duplication issues the EDW was apparently designed to avoid.”

Given my dual focus on open source software, whenever I hear the term ‘cathedral’ used in the context of software, I can’t help but think of Eric Raymond’s seminal essay The Cathedral and the Bazaar, in which he made the case for collaborative open source development – the bazaar model – as an alternative to proprietary software development approaches, where software is carefully crafted, like cathedrals, to an immutable plan.

Mixing metaphors, I realised that the comparison between the cathedral and the bazaar can also be used to explain the previously discussed changing role of the enterprise data warehouse.

Whereas traditional approaches to analytics focused on building the EDW as the single source of the truth, and the timeless data cathedral Winston describes, today companies are focused more on taking advantage of multiple data storage and processing technologies in what would better be described as a data bazaar.

However, it is not a matter of choosing between the cathedral and the bazaar. What we are seeing is the EDW becoming part of a broader data analytics architecture, retaining the data-quality and security rules and schema applied to core enterprise data while other technologies such as Hadoop, specialist analytic appliances, and online repositories are deployed for more flexible ad hoc analytic use cases and analyzing alternative sources of data – including log and other machine-generated data.

The cathedral, in this instance, is part of the bazaar.

Managing this data bazaar is essentially what our total data concept is all about: selecting the most appropriate data storage/processing technology for a particular use case, while enabling access to any data that might be applicable to the query at hand, whether that data is structured or unstructured, whether it resides in the data warehouse, or Hadoop, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.

It is also essentially what IBM’s recently disclosed Smart Consolidation approach is all about: providing multiple technologies for operational analytics, ad hoc analytics, stream and batch processing, queryable archives, all connected by an “enterprise data hub”, and choosing the most appropriate query processing technology for the specific workload (so after polyglot programming and polyglot persistence comes polyglot analytics).

Two of my fellow database analysts, Curt Monash and Jim Kobelius, have recently been kicking around the question of what will be the “nucleus of the next-generation cloud EDW”.

While the data bazaar will rely on a core data integration/virtualization/federation hub, it seems to me that the idea that future data management architectures require a nucleus is a remnant of ‘cathedral thinking’.

Like Curt I think it is most likely that there will be no nucleus – or to put it another way, that each user will have a different perspective of the nucleus based on their role. For some Hadoop will be that nucleus, for others it will be the regional or departmental data mart. For others it will be an ad hoc analytics database. For some, it will remain the enterprise data warehouse.

I will be presenting more details about our total data concept and the various elements of the data bazaar, at The 451 Group’s client event in London on June 27.

Tags: , , , ,