Updated Data Platforms Map – January 2016

The January 2016 edition of the 451 Research Data Platforms Map is now available for download.

Initially designed to illustrate the complexity of the data platforms market, the latest version of the map includes an updated index to help you navigate the array of current data platform providers.

There are numerous additions compared to the previous map, especially in the area of event/stream processing. We have also reconsidered our approach to Hadoop-as-a-service, narrowing it down to distinct Hadoop offerings rather than hosted Hadoop distributions.

We have also tried to clean up our approach to the convergence of Hadoop and search, although that remains a bit of a work in progress, to be honest. There’s also something in there for eagle-eyed Silicon Valley fans.

You can use this map to:

  • compare capabilities, offerings, and functionality.
  • understand where providers intersect and diverge.
  • identify shortlists of choices to suit enterprise needs.

The latest version of the map can be downloaded here.

What big data can learn from total football, and vice versa: part one

I was lucky enough to have a presentation with the title above accepted for Strata London in October. Unfortunately, due to other commitments, I will no longer be able to attend the event. Having already done some background research into the topic, it seemed a shame for it to go to waste, so to celebrate this weekend’s return of the Premier League I thought I’d publish the results here instead.

Ever since the 2003 publication of Moneyball, and its account of the use of sabermetrics by Billy Beane and the 2002 Oakland Athletics to gain competitive advantage, questions have been asked about the potential applicability of statistics to other sports.

Interest in the UK naturally spiked following the release of the film of the book, which brought Moneyball to a wider audience and prompted questions about whether football was missing out by ignoring the potential of statistical analysis for competitive advantage.

I previously noted how, almost 30 years before the 2002 Oakland Athletics, Dynamo Kyiv manager Valeriy Lobanovskyi instituted a scientific, data-led approach to tactics that enabled Dynamo Kyiv to win the Soviet league eight times, the Ukrainian league five times, and the European Cup Winners’ Cup twice.

As much as he was a visionary, Lobanovskyi was also atypical of football managers, but there is other evidence that football has been ahead of the game in realising the potential for statistical analysis. After all, the three big names in football-related statistics – Amisco, Opta and Prozone – were all founded prior to the 2002 baseball season: in 1995, 1996 and 1998 respectively.

Each of these organisations, and many more besides, produces enormous amounts of data related to football matches, which is sold to football clubs with a view to improving performance through statistical analysis.

As an example of the amount of data that can be generated by football, the BBC recently reported that GPS monitors, routinely used by clubs in training if not in actual competitive games, “can collect up to 100 pieces of player data per second on speed, distance, heart rate, dynamic stress load, accelerations and decelerations.”
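
To put that collection rate in context, here is a rough back-of-envelope calculation. It assumes, purely for illustration, that the reported 100 readings per player per second were sustained for all 22 players over a full 90-minute match, which is not necessarily how clubs deploy the monitors:

```python
# Rough back-of-envelope estimate of per-match GPS data volume, assuming
# (purely for illustration) 100 readings per player per second sustained
# for all 22 players over a full 90-minute match.
READINGS_PER_SECOND = 100
MATCH_SECONDS = 90 * 60
PLAYERS = 22

per_player = READINGS_PER_SECOND * MATCH_SECONDS  # 540,000 readings per player
per_match = per_player * PLAYERS                  # 11,880,000 readings per match

print(f"{per_player:,} readings per player, {per_match:,} per match")
```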

Having access to gobs of data is one thing; making sense of it is quite another. This is particularly the case in football, which is much more fluid than baseball and other sports, such as cricket, that are essentially a series of repeated set-plays. This has led sceptics to claim that statistics will never have the same impact in football as in baseball, due to its unpredictability.

Control the controllables
The first lesson that data management can learn from football is not to worry about what statistics can’t tell you, and to focus on what they can. Or, in the words of Southampton FC manager Nigel Adkins: “control the controllables.” This is precisely what Bolton Wanderers, one of the first football teams credited with adopting statistical analysis, did, by focusing on the aspects of the game that are set-plays.

Writing in the Financial Times, Simon Kuper quotes Gavin Fleig, head of performance analysis at current Premier League champions Manchester City and a former performance analyst at Bolton:

“We would be looking at, ‘If a defender cleared the ball from a long throw, where would the ball land? Well, this is the area it most commonly lands. Right, well that’s where we’ll put our man.’”

As a result, Bolton scored 45-50% of their goals from set-plays, compared to a league average of nearer 33%.

Perhaps the most significant set-play in football, certainly in terms of deciding between winners and losers, is the penalty shoot-out. Routinely dismissed by the losing team (England) as a lottery, the penalty shoot-out is anything but, according to statistics analyzed by Prozone.

According to Prozone’s analysis:

  • the team taking the first kick wins 75% of the time.
  • 81.2% of penalties taken to win shootouts were scored, compared with 14.2% of those needed to keep the game alive.
  • 71.4% of all penalty saves are in the lower third of the goal.
  • none of the penalties aimed at the upper third of the net were saved (although they are more likely to miss).

Source: Prozone
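
Figures like these are typically derived by tabulating a log of individual penalty events. The sketch below shows one way such a tabulation might be done; the records, field names and numbers are invented for illustration and are not Prozone’s data or methodology:

```python
# Hypothetical sketch: tabulating penalty outcomes by target zone from a
# simple event log. Records and field names are invented for illustration;
# they are not Prozone's data or methodology.
from collections import Counter

penalties = [
    {"zone": "lower", "outcome": "saved"},
    {"zone": "lower", "outcome": "scored"},
    {"zone": "middle", "outcome": "scored"},
    {"zone": "upper", "outcome": "scored"},
    {"zone": "upper", "outcome": "missed"},
]

counts = Counter((p["zone"], p["outcome"]) for p in penalties)
for zone in ("lower", "middle", "upper"):
    total = sum(n for (z, _), n in counts.items() if z == zone)
    saved = counts.get((zone, "saved"), 0)
    print(f"{zone} third: {saved} of {total} penalties saved")
```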

“Everything that can be counted doesn’t necessarily count”
While statistics such as these suggest that the penalty shoot-out is less a lottery than a mathematical puzzle, the second lesson that data management can learn from football relates to the above quote, attributed to Albert Einstein, and the danger of assuming the relevance of statistics.

Along with Stefan Szymanski, Kuper is the author of Soccernomics (so good I bought it twice) – a treasure trove of stories and information on statistics and football.

In Soccernomics, Kuper and Szymanski note that, in the early days of statistical analysis in football, players were initially judged on statistics that were easily counted: number of passes, number of tackles, number of shots, kilometres run, and so on.

That last statistic turned out to be particularly meaningless. Kuper quotes Chelsea’s performance director, Mike Forde:

“Can we find a correlation between total distance covered and winning? And the answer was invariably no.”
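
The kind of check Forde describes is simple to reproduce: compute the correlation between distance covered and results across a set of matches. Below is a minimal sketch, with figures invented purely for illustration:

```python
# Minimal sketch of correlating total distance covered with match results.
# The figures below are invented purely for illustration.
from statistics import correlation  # Pearson's r, available in Python 3.10+

distance_km = [108.2, 112.5, 105.9, 110.1, 114.3, 107.7]  # team distance per match
points      = [3, 0, 1, 3, 0, 1]                          # points earned per match

r = correlation(distance_km, points)
print(f"Pearson r between distance covered and points: {r:.2f}")
```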

Perhaps the greatest example of over-reliance on statistics comes from the surprise sale of Jaap Stam from Manchester United to Lazio in 2001.

While it was widely reported that Manchester United manager Sir Alex Ferguson sold Stam due to controversial comments in his autobiography, Kuper maintains that it was a decision based on statistics: specifically, the fact that Stam, approaching 30, was tackling less often than he previously had. According to Kuper, Ferguson sold Stam based on that statistic alone.

Whether it was the statistic or the autobiography, selling Stam was a decision Ferguson would later regret. Either way, it turns out that tackles made per game is about as useful a measure of a defender’s ability as the number of kilometres run.

The proof of that comes in the shape of Paolo Maldini – arguably one of the greatest defenders the world has ever seen. As Kuper notes, statistically Maldini only made one tackle every two games. “Maldini positioned himself so well that he didn’t need to tackle.”

All of which raises the question: if someone with the domain expertise of Sir Alex Ferguson, one of the greatest managers in the history of British football, armed with statistical evidence, can make an incorrect decision, is there really a role for statistics in football?

In part two we will explore some of the other examples of statistical analysis influencing the beautiful game, including graph analysis and network theory, the great Liverpool Moneyball experiment, and the lessons learned from Total Football.

HALF TIME

Data as a natural energy source

A number of analogies have arisen in recent years to describe the importance of data and its role in shaping new business models and business strategies. Among these is the concept of the “data factory”, recently highlighted by Abhi Mehta of Bank of America to describe businesses that have realized that their greatest asset is data.

WalMart, Google and Facebook are good examples of data factories, according to Mehta, who is working to ensure that BofA joins the list as the first data factory for financial services.

Mehta lists three key concepts that are central to building a data factory:

  • Believe that your core asset is data
  • Be able to automate the data pipeline
  • Know how to monetize your data assets

The idea of the data factory is useful in describing the organizations that we see driving the adoption of new data management and analytics concepts (Mehta has also referred to this as the “birth of the next industrial revolution”), but it has some less useful connotations.

In particular, the focus on data as something that is produced or manufactured encourages the obsession with data volume and capacity that has put the Big in Big Data.

Size isn’t everything, and the ability to store vast amounts of data is only really impressive if you also have the ability to process and analyze that data and gain valuable business insight from it.

While the focus in 2010 has been on Big Data, we expect the focus to shift in 2011 towards big data analytics. While the data factory concept describes what these organizations are, it does not describe what it is that they do to gain analytical insight from their data.

Another analogy that has been kicking around for a few years is the idea of data as the new oil. There are a number of parallels that can be drawn between oil and gas companies exploring the landscape in search of pockets of crude, and businesses exploring their data landscape in search of pockets of usable data.

A good example of this is eBay’s Singularity platform for deep contextual analysis, one use of which was to combine transactional data from the company’s data warehouse with behavioural data on its buyers and sellers, enabling the identification of top sellers and driving increased revenue from those sellers.

By exploring information from multiple sources in a single platform the company was able to gain a better perspective over its data than would be possible using data sampling techniques, revealing a pocket of data that could be used to improve business performance.
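
For a sense of what that kind of combination looks like in practice, here is a hypothetical sketch of joining transactional and behavioural data to surface top sellers. The table and column names are invented for illustration and do not reflect eBay’s Singularity platform or its schema:

```python
# Hypothetical sketch of joining transactional and behavioural data to
# surface top sellers. Table and column names are invented for illustration;
# they do not reflect eBay's Singularity platform or its schema.
import pandas as pd

transactions = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2", "s3", "s3", "s3"],
    "sale_usd":  [120.0, 80.0, 45.0, 200.0, 150.0, 95.0],
})
behaviour = pd.DataFrame({
    "seller_id":       ["s1", "s2", "s3"],
    "listings_viewed": [340, 120, 910],
})

# Aggregate revenue per seller, join with behavioural signals, and rank.
revenue = transactions.groupby("seller_id", as_index=False)["sale_usd"].sum()
top_sellers = (revenue.merge(behaviour, on="seller_id")
                      .sort_values("sale_usd", ascending=False))
print(top_sellers)
```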

However, exploring data within the organization is only scratching the surface of what eBay has achieved. The real secret to eBay’s success has been in harnessing that retail data in the first place.

This is a concept I have begun to explore recently in the context of turning data into products. It occurs to me that the companies that have had the most success in this regard are those that are not producing data, but harnessing naturally occurring information streams to capture the raw data that can be turned into usable data via analytics.

There is perhaps no greater example of this than Facebook, now home to over 500 million people using it to communicate, share information and photos, and join groups. While Facebook is often cited as an example of new data production, that description is inaccurate.

Consider what these 500 million people did before Facebook. The answer, of course, is that they communicated, shared information and photos, and joined groups. The real genius of Facebook is that it harnesses a naturally occurring information stream and accelerates it.

Natural sources of data are everywhere, from the retail data that has been harnessed by the likes of eBay and Amazon, to the Internet search data that has been harnessed by Google, to the data being produced by the millions of sensors in manufacturing facilities, data centres and office buildings around the world.

Harnessing that data is the first problem to solve; applying data analytics techniques to it, automating the data pipeline, and knowing how to monetize the data assets complete the picture.

Mike Loukides of O’Reilly recently noted: “the future belongs to companies and people that turn data into products.” The companies and people that stand to gain the most are those who treat data not as something to be produced and stockpiled, but as a natural energy source to be processed and analyzed.