Total Data: delivering value from big data

One of the problems I have with the term ‘big data’ is that it can cover a diverse set of products that can be applied to different problems. While ‘big data’ highlights the problem – volume, variety and velocity – and promises a solution – value – it doesn’t provide a path between the two.

The selection of appropriate technologies to deliver the value required from big data is central to the Total Data management concept, recently introduced in our long-format report of the same name.

In determining the potential value of a specific technology, we acknowledged that the volume, variety and velocity of data must be taken into account. However, we also considered the impact of processing the data in its totality, the query frequency, the desire to explore data rather than simply query it, and the dependency on existing skills and resources.

This can be expressed as:

‘Total data’ = (Volume +/- Variety +/- Velocity)
+ (Totality +/- Exploration +/- Dependency +/- Frequency)

The various technologies that can be considered ‘big data’ technologies have individual benefits based on the seven factors expressed in the equation above, with a significant amount of overlap. As such, mapping a combination of factors to a specific data management technology is no simple task. However, it is possible to express an approximation of how individual technologies relate to these seven factors.

The graphic below illustrates the relationship between the factors impacting the generation of value from big data and the technologies discussed in the report.
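As a rough illustration of that mapping exercise, the sketch below scores a handful of technologies against the seven factors. The factor weightings, the technology profiles and the rank_technologies helper are all hypothetical, invented for illustration, and are not taken from the report.

```python
# A minimal sketch of mapping the seven 'total data' factors to candidate
# technologies. All weightings below are illustrative only.
FACTORS = ["volume", "variety", "velocity",
           "totality", "exploration", "dependency", "frequency"]

# 0 = factor is not a driver for this technology, 1 = strong driver (hypothetical).
TECHNOLOGY_PROFILES = {
    "EDW":    {"volume": 1, "variety": 0, "velocity": 0,
               "totality": 0, "exploration": 0, "dependency": 1, "frequency": 1},
    "EAP":    {"volume": 1, "variety": 1, "velocity": 0,
               "totality": 1, "exploration": 1, "dependency": 1, "frequency": 1},
    "Hadoop": {"volume": 1, "variety": 1, "velocity": 0,
               "totality": 1, "exploration": 1, "dependency": 0, "frequency": 0},
}

def rank_technologies(workload_drivers):
    """Rank technologies by how many of the workload's drivers they match."""
    unknown = set(workload_drivers) - set(FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {unknown}")
    scores = {
        tech: sum(profile[f] for f in workload_drivers)
        for tech, profile in TECHNOLOGY_PROFILES.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: a workload driven primarily by variety, totality and exploration.
print(rank_technologies(["variety", "totality", "exploration"]))
```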

The fact that some technologies overlap does not necessarily mean that they should be considered appropriate for the same workloads, but it does illustrate that similar factors are driving the adoption of those technologies for their respective workloads. It illustrates, for example, that among the analytic technologies in particular, there is significant potential overlap in the drivers encouraging adoption between EDWs and exploratory analytic platforms (EAPs), and between EAPs and Hadoop.

Our research indicates that these three platforms (and others) are being used across different companies for the same workloads.

Although Hadoop is largely a complement to the EDW, we see a lot of confusion from would-be adopters about what workloads should be deployed on Hadoop, rather than on the EDW. While Hadoop is better suited to unstructured and semi-structured data and workloads that benefit from a more relaxed approach to schema, unfortunately there is no shortcut to determining which is the best technology to deploy for a particular workload.

However, we have seen several companies discuss the approaches they have taken to solving this problem, which does provide some general guidance.

For example, JPMorgan Chase has created a spider chart that assesses the relative strengths and weaknesses of traditional relational databases and what it calls ‘big-data analytics’ on Hadoop. While there is a small amount of overlap, the company has found that the strengths of traditional databases lie in transactional data update patterns, concurrent jobs, responsiveness and table join complexity. In comparison, Hadoop’s strengths lie in data volume per job, schema complexity, processing freedom and data volume in general.
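To make the spider-chart idea concrete, here is a minimal sketch of how such a comparison could be drawn with matplotlib. The axes follow the strengths listed above, but the 0–5 scores are invented for illustration and are not JPMorgan Chase’s actual assessments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 0-5 scores for illustration only.
axes_labels = ["Transactional updates", "Concurrent jobs", "Responsiveness",
               "Join complexity", "Data volume per job", "Schema complexity",
               "Processing freedom", "Overall data volume"]
rdbms  = [5, 5, 5, 5, 2, 2, 2, 2]
hadoop = [1, 2, 2, 2, 5, 5, 5, 5]

# Spread the axes evenly around the circle and close the polygon.
angles = np.linspace(0, 2 * np.pi, len(axes_labels), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for label, scores in (("Relational DB", rdbms), ("Hadoop", hadoop)):
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(axes_labels, fontsize=8)
ax.legend(loc="upper right")
plt.show()
```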

eBay has taken a similar approach, with the added level of complexity of its Singularity EAP. The company has built its own model to understand how queries perform on the various platforms – in terms of system unit cost, units consumed, query cost, latency and parallel efficiency – to help users decide whether they should be running queries against the EDW, Singularity or Hadoop. Using a standard Hive query, eBay was able to demonstrate that Hadoop performed well in terms of parallel efficiency and unit cost, while the EDW performed well in terms of units consumed and latency, and Singularity performed well in terms of query cost, latency and units consumed.
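As an illustration of how a model of that kind might inform routing decisions, the sketch below picks the cheapest platform that still meets a latency budget, using the metrics eBay is reported to track. The platform figures and the choose_platform helper are hypothetical, not eBay’s actual numbers.

```python
from dataclasses import dataclass

@dataclass
class PlatformProfile:
    name: str
    unit_cost: float            # $ per system unit (hypothetical)
    units_consumed: float       # system units a representative query consumes
    latency_s: float            # observed end-to-end latency in seconds
    parallel_efficiency: float  # 0..1, how well the query parallelises

    @property
    def query_cost(self) -> float:
        # Query cost modelled simply as unit cost x units consumed.
        return self.unit_cost * self.units_consumed

# Invented figures for illustration only.
platforms = [
    PlatformProfile("EDW", unit_cost=4.0, units_consumed=10, latency_s=30, parallel_efficiency=0.6),
    PlatformProfile("Singularity", unit_cost=1.5, units_consumed=12, latency_s=45, parallel_efficiency=0.7),
    PlatformProfile("Hadoop", unit_cost=0.5, units_consumed=80, latency_s=600, parallel_efficiency=0.9),
]

def choose_platform(latency_budget_s: float):
    """Return the cheapest platform that meets the latency budget, if any."""
    candidates = [p for p in platforms if p.latency_s <= latency_budget_s]
    return min(candidates, key=lambda p: p.query_cost) if candidates else None

best = choose_platform(latency_budget_s=60)
print(best.name if best else "no platform meets the latency budget")
```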

Disney is another company that has taken its own approach to comparing potential deployment options, also adding NoSQL databases to its financial estimates and net-present-value analysis. While the company faced hardware, support, training and learning-curve costs in adopting Hadoop and NoSQL databases, it had to weigh those against the hardware, licensing and support costs of traditional relational databases. The most critical factor, however – and the most difficult to calculate – was the opportunity cost of not adopting new technologies, which was likely to limit Disney’s ability to execute on its strategic initiatives.
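The sketch below shows, in the abstract, how a net-present-value comparison of that kind might be set up. The cash flows, discount rate and opportunity-cost estimates are invented for illustration and bear no relation to Disney’s actual figures.

```python
def npv(cash_flows, rate):
    """Discount a list of yearly cash flows (year 0 first) at the given rate."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

discount_rate = 0.08  # hypothetical discount rate

# Year-by-year costs as negative cash flows, year 0 = adoption year (all invented).
adopt_new = [-500_000,                      # hardware, support, training, learning curve
             -150_000, -150_000, -150_000]  # ongoing support
stay_put  = [-200_000,                      # incremental hardware on the existing estate
             -300_000, -300_000, -300_000]  # licensing and support
opportunity_cost = [0, -100_000, -200_000, -300_000]  # strategic initiatives forgone

cost_of_adopting = npv(adopt_new, discount_rate)
cost_of_staying = npv([a + b for a, b in zip(stay_put, opportunity_cost)], discount_rate)

print(f"NPV of adopting new platforms: {cost_of_adopting:,.0f}")
print(f"NPV of staying put (incl. opportunity cost): {cost_of_staying:,.0f}")
```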

451 Research clients can get more detail about these projects, as well as a definition of exploratory analytic platform, datastructure, and queryable archive, by taking a look at our Total Data report.