The dawn of polyglot analytics

While there has been a significant amount of interest in the volume, velocity and variety of big data (and perhaps a few other Vs, depending on who you speak to), it has become increasingly clear that the trends driving new approaches to data management relate not just to the nature of the data itself, but to how the user wants to interact with that data.

As we previously noted, if you turn your attention to the value of the data then you have to take into account the trend towards storing and processing all data (or at least as much as is economically feasible), and the preferred rate of query (the acceptable time taken to generate the result of a query, as well as the time between queries). Another factor to be added to the mix is the way in which the user chooses to analyze the data: are they focused on creating a data model and schema to answer pre-defined queries, or engaging in exploratory analytic approaches in which data is extracted and the schema defined in response to the nature of the query?

All of these factors have significant implications for which technology is chosen to store and analyze the data, and another user-driven factor is the increased desire to use specialist data management technologies depending on the specific requirement. As we noted in NoSQL, NewSQL and Beyond, in the operational database world this approach has become known as polyglot persistence. Clearly though, in the analytic database market we are talking not just about approaches to storing the data, but also analyzing it. That is why we have begun using the term ‘polyglot analytics’ to describe the adoption of multiple query-processing technologies depending on the nature of the query.
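
To make the distinction concrete, here is a minimal, purely illustrative sketch (Python, with invented engine names and a deliberately simplified notion of 'query nature') of how a polyglot analytics layer might route work to different query-processing technologies:

    # Hypothetical sketch: route a query to a processing engine based on the
    # nature of the query. Engine names and the QueryProfile classification are
    # illustrative assumptions, not a reference to any specific product.

    from dataclasses import dataclass

    @dataclass
    class QueryProfile:
        schema_known: bool       # answered by a pre-defined model and schema?
        exploratory: bool        # schema defined in response to the query?
        unstructured_data: bool  # clickstream, logs and other un/semi-structured data?

    def choose_engine(profile: QueryProfile) -> str:
        """Pick a query-processing technology based on how the user wants to work."""
        if profile.schema_known and not profile.unstructured_data:
            return "enterprise_data_warehouse"  # pre-defined queries over modelled data
        if profile.unstructured_data and not profile.exploratory:
            return "hadoop_mapreduce"           # batch processing of un/semi-structured data
        return "exploratory_analytic_platform"  # extract first, define schema per query

    # Example: an exploratory query over a mix of transaction and behaviour data
    print(choose_engine(QueryProfile(schema_known=False, exploratory=True, unstructured_data=True)))

In practice the routing logic would be far messier, but the principle is the same: the nature of the query, not just the nature of the data, determines the engine.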

Polyglot analytics explains why we are seeing adoption of Hadoop and MapReduce as a complement to existing data warehousing deployments. It explains, for example, why a company like LinkedIn might adopt Hadoop for its People You May Know feature while retaining its investment in Aster Data for other analytic use cases. Polyglot analytics also explains why a company like eBay would retain its Teradata Enterprise Data Warehouse for storing and analyzing traditional transactional and customer data, as well as adopting Hadoop for storing and analyzing clickstream, user behaviour and other un/semi-structured data, while also adopting an exploratory analytic platform based on Teradata’s Extreme Data Appliance for extreme analytics on a combination of transaction and user behaviour data pulled from both its EDW and Hadoop deployments.

The emergence of this kind of exploratory analytic platform exemplifies the polyglot analytics approach of adopting a different platform based on the user’s approach to analytics rather than the nature of the data. It also highlights some of the thinking behind Teradata’s acquisition of Aster Data, IBM’s acquisition of Netezza and HP’s acquisition of Vertica, as well as the potential future role of vendors such as ParAccel and Infobright.

We are about to embark on a major survey of data management users to assess their attitudes to polyglot analytics and the drivers for adopting specific data management/analytics technologies. The results will be delivered as part of our Total Data report later this year. Stay tuned for more details on the survey in the coming weeks.

Beyond ‘big data’

Alistair Croll published an interesting post this week entitled ‘there’s no such thing as big data’, in which he argued, prompted by a friend, that “given how much traditional companies put [big data] to work, it might as well not exist.”

Tim O’Reilly continued the theme in his follow-up post, arguing:

“companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue”

There is much to agree with – in fact I have myself argued that when it comes to data, the key issue is not how much you have, but what you do with it. However, there is also a significant change of emphasis here from the underlying principles that have driven the interest in ‘big data’ in the last 12-18 months.

Compare Tim O’Reilly’s statement with the following, from Google’s seminal research paper The Unreasonable Effectiveness of Data:

“invariably, simple models and a lot of data trump more elaborate models based on less data”

While the two statements are not entirely contradictory, they do indicate a change in emphasis related to data. There has been so much emphasis on the ‘big’ in ‘big data’, as if the growing volume, variety and velocity of data would, by itself, deliver improved business insights.

As I have argued in the introduction to our ‘total data’ management concept and the numerous presentations given on the subject this year, in order to deliver value from that data, you have to look beyond the nature of the data and consider what it is that the user wants to do with that data.

Specifically, we believe that one of the key factors in delivering value is companies focusing on storing and processing all of their data (or at least as much as is economically feasible) rather than analysing samples and extrapolating the results.

The other factor is time, and specifically how fast users can get to the results they are looking for. Another way of looking at this is in terms of the rate of query. Again, this is not about the nature of the data, but what the user wants to do with that data.

This focus on the rate of query has implications on the value of the data, as expressed in the following equation:

Value = (Volume ± Variety ± Velocity) x Totality/Time
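
Read literally the equation is a mnemonic rather than a measurement, but a toy calculation (with entirely invented scores) illustrates the intended shape: the data characteristics (which can add to or subtract from value, hence the ±) are amplified by how much of the available data you retain and eroded by the time it takes to get an answer.

    # Toy illustration of Value = (Volume ± Variety ± Velocity) x Totality / Time.
    # All inputs are arbitrary, invented scores; the point is the shape of the
    # relationship, not the units.

    def value(volume, variety, velocity, totality, time_to_result):
        """totality: fraction of available data stored and processed (0..1).
        time_to_result: time taken to generate the result of a query."""
        # The '±' in the prose: each characteristic may contribute positively
        # or negatively, so the inputs can be signed.
        return (volume + variety + velocity) * totality / time_to_result

    # Keeping all the data but answering slowly...
    print(value(volume=8, variety=3, velocity=2, totality=1.0, time_to_result=10))  # 1.3
    # ...scores the same as sampling a tenth of it but answering in a tenth of the time.
    print(value(volume=8, variety=3, velocity=2, totality=0.1, time_to_result=1))   # 1.3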

The rate of query also has significant implications in terms of which technologies are deployed to store and process the data and to actually put the data to use in delivering business insight and value.

Getting back to the points made by Alistair and Tim in relation to the Unreasonable Effectiveness of Data, it would seem that to date there has been more focus on what Google referred to as “a lot of data”, and less on the “simple models” to deliver value from that data.

There is clearly a balance to be struck, and the answer lies not in ‘big data’ but in “more clue” and in defining and delivering those “simple models”.

Top Issues IT faces with Hadoop MapReduce: a Webinar with Platform Computing

Next Tuesday, August 3, at 8.30 AM PDT I’ll be taking part in a Webinar with Platform Computing to discuss the benefits and challenges of Hadoop and MapReduce. Here are the details:

With the explosion of data in the enterprise, especially unstructured data, which constitutes about 80% of the total, new tools and techniques are needed for business intelligence and big data processing. Apache Hadoop MapReduce is fast becoming the preferred solution for the analysis and processing of this data.
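
For readers unfamiliar with the programming model behind Hadoop, the following sketch (plain Python, no Hadoop dependencies, invented log lines) shows the map and reduce steps that Hadoop parallelizes across a cluster, in this case counting terms in unstructured log data:

    # Minimal illustration of the MapReduce programming model using made-up log
    # lines. Hadoop runs map and reduce in parallel across a cluster; this
    # single-process version only shows the shape of the computation.

    from collections import defaultdict

    def map_phase(line):
        """Emit (key, 1) pairs, one per term in a raw log line."""
        for term in line.strip().split():
            yield term.lower(), 1

    def reduce_phase(key, values):
        """Aggregate all the values emitted for a key."""
        return key, sum(values)

    log_lines = [
        "user42 clicked product_page",
        "user42 clicked checkout",
        "user99 clicked product_page",
    ]

    grouped = defaultdict(list)
    for line in log_lines:
        for key, value in map_phase(line):
            grouped[key].append(value)  # the 'shuffle' step groups values by key

    print(dict(reduce_phase(k, v) for k, v in grouped.items()))
    # {'user42': 2, 'clicked': 3, 'product_page': 2, 'checkout': 1, 'user99': 1}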

The speakers will address the issues facing enterprises deploying open source solutions. They will provide an overview of the solutions available for Big Data, discuss best practices, lessons learned, case studies and actionable plans to move your project forward.

To register for the event please visit the registration page.

Variety, Velocity, and Volume: a Webinar with Azul Systems

This Wednesday, August 3, at 9 AM PDT I’ll be taking part in a Webinar with Azul Systems to discuss the performance challenges of big data in the enterprise. Here are the details:

“Big Data” is a hot topic, and the concept of “Big Data” is a useful frame for the challenges of scaling to terabytes or petabytes of data that typically cannot be addressed with traditional technologies. However, Big Data is no longer just a challenge for large social media companies – enterprises can also benefit from understanding when and how to apply these technologies and architectures.

In this Webinar Matthew Aslett of the 451 Group reviews the taxonomy of Big Data and explains how organizations are employing new data management technologies and approaches to ensure that they turn the data deluge into more accurate and efficient operations.

Gil Tene, CTO and co-founder of Azul Systems, will then highlight in greater detail the infrastructure and building block choices for enterprise architects and how to address the performance, scalability, and velocity challenges of Big Data in the enterprise.

Key takeaways:

  • New strategies for integrating Big Data applications within your existing infrastructure and operations
  • Tradeoffs between capacity and performance
  • The importance and challenges of Java for Big Data in the enterprise

To register for the event please visit the registration page.

    The cathedral in the bazaar: the future of the EDW

    Kalido’s Winston Chen published an interesting post this week comparing Enterprise Data Warehouse projects to the building of a cathedral, notably Gaudi’s Sagrada Familia: “A timeless edifice cast in stone, for a client who’s not in a hurry.”

    The requirement of many data warehousing projects to deliver on an immutable vision is one of their most likely failings. As we noted in our 2009 report on considerations for building a data warehouse:

    “One of the most significant inefficiencies of data warehousing is that users have traditionally had to design their data-warehouse models to match their planned queries, and it can often be difficult to change schema once the data warehouse has been created. This approach is too rigid in a world of rapidly changing business requirements and real-time decision-making.”

    Not only is it too rigid, as we added in our 2010 data warehousing market sizing report:

    “It is also self-defeating, since a business analyst or executive that is unable to get the answers to queries they require from the EDW is likely to find their own ways to answer these queries – resulting in data silos and the exact redundancy and duplication issues the EDW was apparently designed to avoid.”

    Given my dual focus on open source software, whenever I hear the term ‘cathedral’ used in the context of software, I can’t help but think of Eric Raymond’s seminal essay The Cathedral and the Bazaar, in which he made the case for collaborative open source development – the bazaar model – as an alternative to proprietary software development approaches, where software is carefully crafted, like cathedrals, to an immutable plan.

    Mixing metaphors, I realised that the comparison between the cathedral and the bazaar can also be used to explain the previously discussed changing role of the enterprise data warehouse.

    Whereas traditional approaches to analytics focused on building the EDW as the single source of the truth, and the timeless data cathedral Winston describes, today companies are focused more on taking advantage of multiple data storage and processing technologies in what would better be described as a data bazaar.

    However, it is not a matter of choosing between the cathedral and the bazaar. What we are seeing is the EDW becoming part of a broader data analytics architecture: it retains the data-quality and security rules and the schema applied to core enterprise data, while other technologies such as Hadoop, specialist analytic appliances and online repositories are deployed for more flexible, ad hoc analytic use cases and for analyzing alternative sources of data – including log and other machine-generated data.

    The cathedral, in this instance, is part of the bazaar.

    Managing this data bazaar is essentially what our total data concept is all about: selecting the most appropriate data storage/processing technology for a particular use case, while enabling access to any data that might be applicable to the query at hand, whether that data is structured or unstructured, whether it resides in the data warehouse, or Hadoop, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.
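
    As a purely hypothetical sketch of what that can look like in practice (the source names and fetch interface below are invented), a thin federation layer fans a single question out to whichever stores might hold applicable data:

        # Hypothetical sketch of a 'data bazaar' federation layer: one query fans
        # out to whichever sources might hold applicable data, wherever they live.
        # Source names and the fetch() interface are invented for illustration.

        from concurrent.futures import ThreadPoolExecutor

        class DataSource:
            def __init__(self, name, kind, location):
                self.name, self.kind, self.location = name, kind, location

            def fetch(self, query):
                # Stand-in for a real connector (SQL, HiveQL, key lookup, archive restore...)
                return f"{self.name}: results for '{query}'"

        sources = [
            DataSource("edw", "sql", "on-premises"),
            DataSource("hadoop", "hdfs", "on-premises"),
            DataSource("archive", "queryable-archive", "on-premises"),
            DataSource("customer_kv_store", "nosql", "cloud"),
        ]

        def total_data_query(query, relevant=lambda source: True):
            """Run the query against every source deemed relevant, in parallel."""
            targets = [s for s in sources if relevant(s)]
            with ThreadPoolExecutor() as pool:
                return list(pool.map(lambda s: s.fetch(query), targets))

        print(total_data_query("sessions abandoned at checkout last quarter"))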

    It is also essentially what IBM’s recently disclosed Smart Consolidation approach is all about: providing multiple technologies for operational analytics, ad hoc analytics, stream and batch processing, queryable archives, all connected by an “enterprise data hub”, and choosing the most appropriate query processing technology for the specific workload (so after polyglot programming and polyglot persistence comes polyglot analytics).

    Two of my fellow database analysts, Curt Monash and Jim Kobielus, have recently been kicking around the question of what will be the “nucleus of the next-generation cloud EDW”.

    While the data bazaar will rely on a core data integration/virtualization/federation hub, it seems to me that the idea that future data management architectures require a nucleus is a remnant of ‘cathedral thinking’.

    Like Curt I think it is most likely that there will be no nucleus – or to put it another way, that each user will have a different perspective of the nucleus based on their role. For some Hadoop will be that nucleus, for others it will be the regional or departmental data mart. For others it will be an ad hoc analytics database. For some, it will remain the enterprise data warehouse.

    I will be presenting more details about our total data concept and the various elements of the data bazaar at The 451 Group’s client event in London on June 27.

    Presenting NoSQL, NewSQL and Beyond at OSBC

    Next Monday, May 16, I will be hosting a session at the Open Source Business Conference in San Francisco focused on NoSQL, NewSQL and Beyond.

    The presentation covers our recently published report of the same name, and provides some additional context on the role of open source in driving innovation in distributed data management.

    Specifically, the presentation looks at the evolving influence of open source in the database market and the context for the emergence of new database alternatives.

    I’ll be walking through the six core drivers behind the development and adoption of NoSQL and NewSQL databases, as well as data grid/cache technologies – scalability, performance, relaxed consistency, agility, intricacy and necessity – providing some user adoption examples for each.

    The presentation also discusses the broader trends impacting data management, providing an introduction to our total data concept and explaining how some of the drivers behind NoSQL and NewSQL are also impacting the role of the enterprise data warehouse, Hadoop, and data management in the cloud.

    The presentation begins at 3pm PT on Monday, May 16. The event is taking place at the Hilton San Francisco Union Square. I hope to see you there.

    Data cloud, datastructure, and the end of the EDW

    There has been a spate of reports and blog posts recently speculating about the potential demise of the enterprise data warehouse (EDW) in the light of big data and evolving approaches to data management.

    There are a number of connected themes that have led the likes of Colin White and Barry Devlin to ponder the future of the EDW, and as it happens I’ll be talking about these during our 451 Client event in San Francisco on Wednesday.

    While my presentation doesn’t speak directly to the future of the EDW, it does cover the trends that are driving the reconsideration of the assumption that the EDW is, and should be, the central source of business intelligence in the enterprise.

    As Colin points out, this is an assumption based on historical deficiencies with alternative data sources that evolved into best practices. “Although BI and BPM applications typically process data in a data warehouse, this is only because of… issues… concerning direct access [to] business transaction data. If these issues could be resolved then there would be no need for a data warehouse.”

    The massive improvements in processing performance seen since the advent of data warehousing mean that it is now more practical to process data where it resides or is generated, rather than forcing data to be held in a central data warehouse.

    For example, while distributed caching was initially adopted to improve the performance of Web and financial applications, it also provides an opportunity to perform real-time analytics on application performance and user behaviour (enabling targeted ads, for instance) long before the data gets anywhere near the data warehouse.
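
    A deliberately simplified sketch of the idea (a plain dictionary stands in for the distributed cache, and the event fields are invented) shows how an aggregate can be maintained at the caching tier, long before any batch load into the warehouse:

        # Illustrative sketch only: maintain a real-time metric at the caching tier.
        # A real deployment would use a distributed cache or data grid; a dict and a
        # Counter stand in for it here, and the event fields are invented.

        from collections import Counter

        cache = {}                      # stand-in for the distributed cache (event_id -> event)
        clicks_by_category = Counter()  # live aggregate maintained on every write

        def record_event(event_id, user, category):
            """Write-through: cache the event and update the live aggregate."""
            cache[event_id] = {"user": user, "category": category}
            clicks_by_category[category] += 1  # available immediately, e.g. for ad targeting

        record_event(1, "user42", "sports")
        record_event(2, "user99", "sports")
        record_event(3, "user42", "travel")

        print(clicks_by_category.most_common(1))  # [('sports', 2)], with no warehouse round trip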

    While the central EDW approach has some advantages for data control, security and reliability, these have always been more theoretical than practical: regional and departmental data marts are still needed, and users continue to rely on local copies of data.

    As we put it in last year’s Data Warehousing 2009-2013 report:

    “The approach of many users now is not to stop those distributed systems from being created, but rather to ensure that they can be managed according to the same data-quality and security rules as the EDW.

    With the application of cloud computing capabilities to on-premises infrastructure, users now have the promise of distributed pools of enterprise data that marry central management with distributed use and control, empowering business users to create elastic and temporary data marts without the risk of data-mart proliferation.”

    The concept of the “data cloud” is nascent, but companies such as eBay are pushing in that direction, while also making use of data storage and processing technologies above and beyond traditional databases.

    Hadoop is a prime example, but so too are the infrastructure components that are generating vast amounts of data that can be used by the enterprise to better understand how the infrastructure is helping or hindering the business in responding to changing demands.

    For the 451 client event we have come up with the term ‘datastructure’ to describe these infrastructure elements. What is ‘datastructure’? It’s the machines that are responsible for generating machine-generated data.

    While that may sound like we’ve just slapped a new label on existing technology, we believe that those data-generating machines will evolve over time to take advantage of the increased processing power available to them by embedding data analytics capabilities.

    Just as in-database analytics has enabled users to reduce data processing latency by taking the analytics to the data in the database, it seems likely that users will look to do the same for machine-generated data by taking the analytics to the data in the ‘datastructure’.
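
    To illustrate the principle (SQLite is used below purely as a stand-in for an analytic database, and the schema and values are invented), compare pulling the raw data out to the application with shipping the computation to the data:

        # Minimal sketch of 'taking the analytics to the data', with SQLite standing
        # in for an analytic database. The schema and values are invented.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)",
                         [("s1", 10.0), ("s1", 14.0), ("s2", 7.0)])

        # Option A: pull every row across the wire and aggregate in the application.
        totals = {}
        for sensor, value in conn.execute("SELECT sensor, value FROM readings"):
            totals[sensor] = totals.get(sensor, 0.0) + value

        # Option B (in-database analytics): ship the computation, return only the answer.
        totals_in_db = dict(conn.execute(
            "SELECT sensor, SUM(value) FROM readings GROUP BY sensor"))

        assert totals == totals_in_db  # same result, far less data movement in option B
        print(totals_in_db)            # {'s1': 24.0, 's2': 7.0}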

    This ‘datastructure’ with embedded database and analytics capabilities therefore becomes part of the wider ‘data cloud’, alongside regional and departmental data marts, and the central business application data warehouse, as well as the ability to spin up and provision virtual data marts.

    As Barry Devlin puts it: “A single logical storehouse is required with both a well-defined, consistent and integrated physical core and a loose federation of data whose diversity, timeliness and even inconsistency is valued.”

    Making this work will require new data cloud management capabilities, as well as an approach to data management that we have called “total data”. As we previously explained:

    “Total data is about more than data volumes. It’s about taking a broad view of available data sources and processing and analytic technologies, bringing in data from multiple sources, and having the flexibility to respond to changing business requirements…

    Total data involves processing any data that might be applicable to the query at hand, whether that data is structured or unstructured, and whether it resides in the data warehouse, or a distributed Hadoop file system, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.”

    As for the end of the EDW, both Colin and Barry argue, and I agree, that what we are seeing does not portend the end of the EDW but recognition that the EDW is a component of business intelligence, rather than the source of all business intelligence itself.

    Sizing the big data problem: ‘big data’ is the problem

    Big data has been one of the big topics of the year in terms of client queries coming into The 451 Group, and one of the recurring questions (especially from vendors and investors) has been: “how big is the big data market?”

    The only way to answer that is to ask another question: “what do you mean by ‘big data’?” We have mentioned before that the term is ill-defined, so it is essential to work out what an individual means when they use the term.

    In our experience they usually mean one of two things:

    • Big data as a subset of overall data: specific volumes or classes of data that cannot be processed or analyzed by traditional approaches
    • Big data as a superset of the entire data management market, driven by the ever-increasing volume and complexity of data

    Our perspective is that big data, if it means anything at all, represents a subset of overall data. However, it is not one that can be measurably defined by the size of the data volume. Specifically, as we recently articulated, we believe:

      “Big data is a term applied to data sets that are large, complex and dynamic (or a combination thereof) and for which there is a requirement to capture, manage and process the data set in its entirety, such that it is not possible to process the data using traditional software tools and analytic techniques within tolerable time frames.”

    The confusion around the term big data also partly explains why we introduced the term “total data” to refer to a broader approach to data management, managing the storage and processing of all available data to deliver the necessary business intelligence.

    The distinction is clearly important when it comes to sizing the potential opportunity. I recently came across a report from one of the big banks that put a figure on what it referred to as the “big data market”. However, the bank had used the superset definition.

    The result was therefore not a calculation of the big data market but of the total data management sector, since the approach taken was simply to add together revenue estimates for all data management technologies – traditional and non-traditional (a method that is in itself too simplistic for us to endorse the end result).


    Specifically, the bank had added up current market estimates for database software, storage and servers for databases, BI and analytics software, data integration, master data management, text analytics, database-related cloud revenue, complex event processing and NoSQL databases.

    In comparison, the big data market is clearly a lot smaller, and represents a subset of revenue from traditional and non-traditional data management technologies, with a leaning towards the non-traditional technologies.

    It is important to note, however, that big data cannot be measurably defined by the technology used to store and process it. As we have recently seen, not every use case for Hadoop or a NoSQL database – for example – involves big data.

    Clearly this is a market that is a lot smaller than the one calculated by the bank, and the calculation required is a lot more complicated. We know, for example, that Teradata generated revenue of $489m in its third quarter. How much of that was attributable to big data?

    Answering that requires a stricter definition of big data than is currently in use (by anyone). But as we have noted above, ‘big data’ cannot be defined by data volume, or by the technology used to store or process it.

    There’s a lot of talk about the “big data problem”. The biggest problem with big data, however, is that the term has not been – and arguably cannot be – defined in any measurable way.

    How big is the big data market? You may as well ask “how long is a piece of string?”

    If we are to understand the opportunity for storing and processing big data sets then the industry needs to get much more specific about what it is that is being stored and processed, and what we are using to store and process it.

    Total data: ‘bigger’ than big data

    The 451 Group has recently published a spotlight report examining the trends that we see shaping the data management segment, including data volume, complexity, real-time processing demands and advanced analytics, as well as a perspective that no longer treats the enterprise data warehouse as the only source of trusted data for generating business intelligence.

    The report examines these trends and introduces the term ‘total data’ to describe the total opportunity and challenge provided by new approaches to data management.


    Johann Cruyff, exponent of total football, inspiration for total data. Source: Wikimedia. Attribution: Bundesarchiv, Bild 183-N0716-0314 / Mittelstädt, Rainer / CC-BY-SA

    Total data is not simply another term for big data; it describes a broader approach to data management, managing the storage and processing of all available data to deliver the necessary BI.

    Total data involves processing any data that might be applicable to the query at hand, whether that data is structured or unstructured, and whether it resides in the data warehouse, or a distributed Hadoop file system, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.

    In the report we explain how total data is influencing modern data management with respect to four key trends. To summarize:

    • beyond big: total data is about processing all your data, regardless of the size of the data set
    • beyond data: total data is not just about being able to store data, but the delivery of actionable results based on analysis of that data
    • beyond the data warehouse: total data sees organisations complementing data warehousing with Hadoop, and its associated projects
    • beyond the database: total data includes the emergence of private data clouds, and the expansion of data sources suitable for analytics beyond the database

    The term ‘total data’ is inspired by ‘total football’, the soccer tactic that emerged in the early 1970s and enabled Ajax of Amsterdam to dominate European football in the early part of the decade, and The Netherlands to reach the finals of two consecutive World Cups, having failed to qualify for the four preceding competitions.

    Unlike previous approaches that focused on each player having a fixed role to play, total football encouraged individual players to switch positions depending on what was happening around them while ensuring that the team as a whole fulfilled all the required tactical positions.

    Although total data is not meant to be directly analogous to total football, we do see a connection between the fluidity the latter gained by no longer requiring players to fulfill fixed roles and total data’s desire to break down dependence on the enterprise data warehouse as the single version of the truth, while letting go of the assumption that the relational database offers a one-size-fits-all answer to data management.

    Total data is about more than data volumes. It’s about taking a broad view of available data sources and processing and analytic technologies, bringing in data from multiple sources, and having the flexibility to respond to changing business requirements.

    A more substantial explanation of the concept of total data and its impact on information and infrastructure management methods and technologies is available here for 451 Group clients. Non-clients can also apply for trial access.