Matthew Aslett — Too much information

The Data Day, Two days: September 6/7 2012

September 7th, 2012 — Data management

StackIQ. Alpine Data. OpenDremel/Apache Drill. Memcachier. And more.

For 451 Research clients: StackIQ takes direct approach to Hadoop cluster management bit.ly/NU6Tu0

— Matt Aslett (@maslett) September 6, 2012

EMC Greenplum signs formal agreement to resell Alpine Data Labs predictive analytics suite. bit.ly/NU7BHR

— Matt Aslett (@maslett) September 6, 2012

The OpenDremel project is being merged with Apache Drill bit.ly/QfnrNF

— Matt Aslett (@maslett) September 6, 2012

How MemCachier went from a favor for a friend to cloud ubiquity gigaom.com/cloud/how-memc… via @derrickharris

— Matt Aslett (@maslett) September 6, 2012

And that’s the Data Day, today.


The Data Day, Three days: September 3/4/5 2012

September 5th, 2012 — Data management

Basho joins CloudStack. Pentaho rides Hadoop wave. And more.

Basho joins Apache CloudStack project -collaborates with Citrix on distributed object storage. bit.ly/Q6RNaP

— Matt Aslett (@maslett) September 5, 2012

MariaDB Galera Cluster is here: j.mp/Q2VQ3K #synchronous #multimaster #scalability

— MariaDB (@mariadb) September 4, 2012

A new post from me on Hadapt’s blog about some new HDFS —> relational invisible loading technology. hadapt.com/blog/2012/9/5/… #bigdata #hadoop

— Daniel Abadi (@daniel_abadi) September 5, 2012

More than 70% of new Pentaho customers in Q2 are deploying on Hadoop. bit.ly/Q6Scde The rest split between RDBMS and NoSQL.

— Matt Aslett (@maslett) September 5, 2012

What Do Real-Life Hadoop Workloads Look Like? bit.ly/Q4VzNS

— Cloudera (@cloudera) September 5, 2012

And that’s the Data Day, today.


The Data Day, Today: August 31 2012

August 31st, 2012 — Data management

MongoDB. Informatica. Splunk. Stewart Downing. And more.

For 451 Research clients: 10gen eyes operational analytics use case with MongoDB update bit.ly/PUdJSc

— Matt Aslett (@maslett) August 31, 2012

For 451 clients: Informatica illuminates data-privacy business one year after ActiveBase acquisition bit.ly/PUdNl1 By Krishna Roy

— Matt Aslett (@maslett) August 31, 2012

Splunk made a net loss of $4.6m on revenue up 71% to $44.5m in Q2. bit.ly/PUdWoC

— Matt Aslett (@maslett) August 31, 2012

What big data can learn from total football, and vice versa: part two bit.ly/PFNXzD Lies, damn lies, and Stewart Downing

— Matt Aslett (@maslett) August 31, 2012

And that’s the Data Day, today.


What big data can learn from total football, and vice versa: part two

August 31st, 2012 — Data management

With transfer deadline day in full swing, it seems like as good a day as any to complete our look at the relationship between football (soccer) and big data (part one here).

Today is the last chance for a few months for football clubs to outsmart their rivals by buying the players that they hope will give them a competitive advantage for the rest of the season. How will data play a part?

Whereas the 2002 Oakland Athletics provided a clear example of how statistics can be used to gain competitive advantage in baseball player recruitment, evidence of similar success in football is harder to find. As indicated in part one, a prime example is Bolton Wanderers, which arguably punched above its weight for years and was one of the first Premier League teams to use statistics to influence strategy and player recruitment.

As Simon Kuper details, one of the key examples of the latter is the club’s 2004 signing of the late Gary Speed, who at 34 would have been judged by many as being too old to compete for much longer at the highest level.

Kuper reports how Bolton was able to compare Speed’s physical data with that of younger – and more expensive – players in similar positions, and determine that his performance was unlikely to deteriorate as much as would be assumed. Speed played for Bolton for another four years.

While there are other examples of successful purchases being influenced by data, those more sceptical about the potential for data to influence the beautiful game can also point to some high-profile failures.

If a moneyball approach was going to be successful within English football it had the perfect chance to prove itself at Liverpool in recent years. Since October 2010 the club has been owned by Fenway Sports Group and John W Henry, who once tried to hire Billy Beane as the general manager of the Boston Red Sox and in November 2010 hired the closest thing European football has to Billy Beane – Damien Comolli – as Liverpool’s Director of Football Strategy.

Statistical relevance
Quite how much Liverpool’s subsequent transfer policy was influenced by statistics is known only to Liverpool insiders, but certainly Comolli was cited as being responsible for the signings of Luis Suárez, Andy Carroll, Jordan Henderson, Charlie Adam, Stewart Downing, and José Enrique – for an estimated total of £110m – with the £35m spent on Carroll making him the most expensive British footballer of all time.

Either way, statistics have been used to judge the wisdom of those purchases with the scoring record of striker Andy Carroll (6 goals in 44 Premier League games) and winger Stewart Downing’s record of goals and assists (0 and 0 in 37 Premier League games) coming in for particular scrutiny.

Carroll yesterday joined West Ham United on loan, while Downing looks likely to have to adopt a more defensive role to stay at the club. Comolli left Liverpool in April 2012 by mutual consent.

While Liverpool’s transfer dealings are hardly a ringing endorsement for the applicability of statistical analysis to football, it would be wrong to judge their compatibility on the basis of transfers alone.

Network theory
We have also seen growing evidence of interest in applying statistical analysis to football tactics, with a number of academic research reports having been published in recent months. These include Quantifying the Performance of Individual Players in a Team Activity, which originated at the Amaral Lab for Complex Systems and Systems Biology at Northwestern University and provides the basis for Chimu Solutions’ FootballrRating.com, and A network theory analysis of football strategies by researchers at University College London and Queen Mary University of London.

Source: A network theory analysis of football strategies

Both of these use network-based analysis to understand and represent the value of players within a team’s overall strategy. As the researchers behind ‘A network theory analysis of football strategies’ explain:

“The resulting network or graph provides a direct visual inspection of a team’s strategy, from which we can identify play pattern, determine hot-spots on the play and localize potential weaknesses. Using different centrality measures, we can also determine the relative importance of each player in the game, the ‘popularity’ of a player, and the effect of removing players from the game.”
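To make the idea of a passing network concrete, here is a minimal sketch in Python. It is not the researchers' method: the pass counts are invented for illustration, and weighted degree (passes made plus passes received) stands in for the various centrality measures the paper refers to. The function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical pass counts: (passer, receiver) -> completed passes.
# These numbers are invented for illustration, not taken from any match.
PASSES = {
    ("GK", "CB"): 12,
    ("CB", "CM"): 20,
    ("CM", "CB"): 8,
    ("CM", "LW"): 15,
    ("CM", "ST"): 10,
    ("LW", "ST"): 9,
    ("LW", "CM"): 6,
    ("ST", "CM"): 4,
}


def weighted_degree(passes):
    """Return {player: passes made + passes received}.

    Weighted degree is one of the simplest centrality measures and is
    used here as an assumption, not as the paper's own measure.
    """
    totals = defaultdict(int)
    for (passer, receiver), count in passes.items():
        totals[passer] += count
        totals[receiver] += count
    return dict(totals)


def without_player(passes, player):
    """Return the pass network with one player and their passes removed."""
    return {edge: n for edge, n in passes.items() if player not in edge}


def total_passes(passes):
    """Return the total number of passes in the network."""
    return sum(passes.values())


centrality = weighted_degree(PASSES)
most_central = max(centrality, key=centrality.get)
remaining = total_passes(without_player(PASSES, most_central))

print(f"Most central player: {most_central}")
print(f"Passes left if removed: {remaining} of {total_passes(PASSES)}")
```

In this invented network the central midfielder is involved in most of the passing, so removing that player leaves only a small fraction of the original passes. That is a crude way of seeing “the effect of removing players from the game”.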

Looking at this from the perspective of someone with an interest in analytics it is fascinating to see football analyzed and represented in this way. Looking at it from the perspective of a football fan, I can’t help wondering whether this is just a matter of science being used to explain something that footballers and football fans just instinctively understand.

Another research paper, Science of Winning Soccer: Emergent pattern-forming dynamics in association football, certainly falls into the category of over-explaining the obvious. Based on quantitative analysis of a frame by frame viewing of a soccer match the researchers concluded that “local player numerical dominance is the key to defensive stability and offensive opportunity.”

In other words, the attacking team is more likely to score if it has more players in the opposition’s penalty area than there are defenders (having “numbers in the box”), while the defending team is less likely to concede if it has more defenders than there are attackers (or has “parked the bus”).

What’s more: “The winning goal of the Manchester City match occurred when a Queen Park Ranger [sic] fell down”. “She fell over!”

Which isn’t to say that there is nothing football can learn from big data – just that there are clearly areas in which statistical analysis has more value to contribute than others.

But we’ll conclude by looking at what data management can learn from football – particularly total football – the soccer tactic that emerged in the early 1970s and inspired our concept of Total Data.

In our report of the same name we briefly explained the key aspects of Total Football…

Total Football was a different strategic approach to the game that emerged in the late 1960s, most famously at Ajax of Amsterdam, that focused not on the position of the player, but on his ability to make use of the space between those positions. Players were encouraged to move into space rather than sticking to pre-defined notions of their positional role, even exchanging positions with a teammate.

While this exchange of positions came to symbolize Total Football, the maintenance of formation was important in balancing the skills and talents of individual team members with the overall team system. This was not a total abandonment of positional responsibility – the main advantage lay in enabling a fluid approach that could respond to changing requirements as the game progressed.

This fluidity relied on having players with the skill and ability to play in multiple positions, but also high levels of fitness in order to cover more of the pitch than the players whose role was determined by their position. It is no coincidence that Total Football emerged at the same time as an increased understanding of the role that sports science and diet had to play in improving athletic performance.

… and outlined four key areas in which we believe data management as a discipline can learn from Total Football in terms of delivering value from big data:

Abandonment of restrictive (self-imposed) rules about individual roles and responsibility

Accepting specialist data management technologies where appropriate, rather than forcing existing technologies to adapt to new requirements. Examples include the adoption of non-relational databases to store and process non-relational data formats, and the adoption of MapReduce to complement existing SQL skills and tools.

Promotion of individuality within the overall context of the system

This greater willingness to adopt specialist technologies where appropriate to the individual application and workload does not require the abandonment of existing investments in SQL database and data-warehousing technologies, but rather an understanding of the benefits of individual data storage and processing technologies and how they can be used in a complementary manner – or in concert – to achieve the desired result.

Enabling, and relying on, fluidity and flexibility to respond to changing requirements

The adoption of alternative platforms for ad hoc, iterative data analysis enables users to have more options to respond to new analytic requirements and to experiment with analytic processing projects without impacting the performance of the data warehouse.

Exploitation of improved performance levels

The role of more efficient hardware, processor and storage technologies is often overlooked, but it is this improved efficiency that means users are now in a position to store and process more data, more efficiently than ever.


The Data Day, Two days: August 29/30 2012

August 30th, 2012 — Data management

ParStream. MongoDB 2.2. Infochimps. BigQuery. And more.

For 451 Research clients: ParStream raises $5.6m to fund North American analytic database push bit.ly/QVQ0Q0

— Matt Aslett (@maslett) August 30, 2012

MongoDB 2.2 Released with Improved Analytics and Faster Performance soc.ai/LJ

— 10gen (@10gen) August 29, 2012

Infochimps Welcomes Jim Kaskade as CEO bit.ly/NVPLtv

— Matt Aslett (@maslett) August 29, 2012

Google adds support for batch queries and Excel connector to BigQuery. bit.ly/QVQV2X

— Matt Aslett (@maslett) August 30, 2012

MetaScale announces strategic partnership with Hortonworks. prn.to/Q1t45p

— Matt Aslett (@maslett) August 29, 2012

Quest launches Toad Business Intelligence Suite bit.ly/QWlNAn

— Matt Aslett (@maslett) August 30, 2012

Make Way for the Soccer Geeks bit.ly/NVPveh Manchester City has opened up its data to the masses via @mjasay

— Matt Aslett (@maslett) August 29, 2012

And that’s the Data Day, today.


The Data Day, Two days: August 27/28 2012

August 28th, 2012 — Data management

Citrusleaf. Aerospike. AlchemyDB. Sqrrl. Percolator. Dremel. Pregel. And more.

For 451 Research clients: Citrusleaf becomes Aerospike, ‘acq-hires’ AlchemyDB to create NoSQL/NewSQL database hybrid bit.ly/PXyeQ1

— Matt Aslett (@maslett) August 28, 2012

For 451 Research clients: sqrrl data emerges from NSA to commercialize Accumulo NoSQL database bit.ly/PXya2F

— Matt Aslett (@maslett) August 28, 2012

Percolator, Dremel and Pregel: Alternatives to Hadoop bit.ly/U7z3GQ

— Matt Aslett (@maslett) August 28, 2012

Graphs are Everywhere: Solving the Complexities of Social Connections bit.ly/U7zb9c

— Matt Aslett (@maslett) August 28, 2012

The Elephant in the Cloud – Putting Hadoop on any Cloud bit.ly/U7zxwC

— Matt Aslett (@maslett) August 28, 2012

Splunk previews two new offerings, and an open source project, to integrate with Hadoop. bit.ly/U7zsJ9

— Matt Aslett (@maslett) August 28, 2012

And that’s the Data Day, today.


The Data Day, Today: August 24 2012

August 24th, 2012 — Data management

Facebook’s Prism. CAP Theorem. Keeping MySQL open. And more.

Facebook Tackles (Really) Big Data With ‘Project Prism’ bit.ly/NKAITF Cross-data centre Hadoop clustering.

— Matt Aslett (@maslett) August 24, 2012

CAP Theorem: two out of three ain’t right. bit.ly/sdnY3T updated

— Matt Aslett (@maslett) August 24, 2012

SkySQL and MariaDB Working Together to Keep MySQL an Open Ecosystem bit.ly/QxrS66

— Matt Aslett (@maslett) August 24, 2012

AbsolutData Raises $20M from Fidelity Growth Partners India bit.ly/QxrODn

— Matt Aslett (@maslett) August 24, 2012

Nutanix Complete Cluster Brings SAN-Free Virtualization to Hadoop mwne.ws/QxrSTD

— Matt Aslett (@maslett) August 24, 2012

Updates for Percona Server, the Enhanced Drop-in Replacement for MySQL, Released mwne.ws/QxrTHg

— Matt Aslett (@maslett) August 24, 2012

And that’s the Data Day, today.


The Data Day, Two days: August 22/23 2012

August 23rd, 2012 — Data management

MetaScale. Spark. Actuate and VoltDB. And more.

For 451 Research clients: MetaScale identifies key drivers for Hadoop managed services bit.ly/O7hDWl

— Matt Aslett (@maslett) August 23, 2012

7 reason why I like Spark: Been wanting to write about my new favorite tool for Analytics, #ampcamp got me motivated!goo.gl/t7urr

— Ben Lorica (@bigdata) August 21, 2012

For 451 clients: StoredIQ’s DataIQ brings a prettier picture for lawyers looking at ‘big data’ bit.ly/O7hG4L By @davidhorrigan

— Matt Aslett (@maslett) August 23, 2012

What big data can learn from total football, and vice versa: part one bit.ly/NHQcmS Moneyball meets the beautiful game.

— Matt Aslett (@maslett) August 22, 2012

Actuate and VoltDB partner for real-time analysis and visualisation. bit.ly/PCnsN1

— Matt Aslett (@maslett) August 22, 2012

And that’s the Data Day, today.


What big data can learn from total football, and vice versa: part one

August 22nd, 2012 — Data management

I was lucky enough to have a presentation using the title above accepted for Strata London in October. Unfortunately, due to other commitments, I will no longer be able to attend the event. Having already done some background research into the topic it seemed a shame for it to go to waste. To celebrate this weekend’s return of the Premier League I thought I’d publish the results here instead.

Ever since the 2003 publication of Moneyball, and its account of the use of sabermetrics by Billy Beane and the 2002 Oakland Athletics to gain competitive advantage, questions have been asked about the potential applicability of statistics to other sports.

Interest in the UK naturally spiked following the release of the film of the book, which brought Moneyball to a wider audience and prompted questions about whether football was missing out by ignoring the potential of statistical analysis for competitive advantage.

I previously noted how, almost 30 years before the 2002 Oakland Athletics, Dynamo Kyiv manager Valeriy Lobanovskyi instituted a scientific, data-led approach to tactics that enabled the club to win the Soviet League eight times, the Ukrainian league five times, and the European Cup Winners’ Cup twice.

As much as he was a visionary, Lobanovskyi is also atypical of football managers, but there is other evidence that football has been ahead of the game in realising the potential for statistical analysis. After all, the three big names in football-related statistics – Amisco, Opta and Prozone – were all founded prior to the 2002 baseball season: in 1995, 1996 and 1998 respectively.

Each of these organisations, and many more besides, produce enormous amounts of data related to football matches which is sold to football clubs with a view to improving performance through statistical analysis.

As an example of the amount of data that can be generated by football, the BBC recently reported that GPS monitors, routinely used by clubs in training if not in actual competitive games, “can collect up to 100 pieces of player data per second on speed, distance, heart rate, dynamic stress load, accelerations and decelerations.”

Having access to gobs of data is one thing; making sense of it is quite another. This is particularly the case in football which is much more fluid than baseball and other sports such as cricket that are essentially a series of repeated set-plays. This has led to sceptics claiming that statistics will never have the same impact in football as baseball due to its unpredictability.

Control the controllables
Our first lesson that data management can learn from football is to not worry about what statistics can’t tell you, and focus on what they can. Or in the words of Southampton FC manager Nigel Adkins: “control the controllables.” This is precisely what Bolton Wanderers, one of the first football teams credited with adopting statistical analysis, did. Bolton did so by focusing on the aspects of the game that are set-plays.

Writing in the Financial Times Simon Kuper quotes Gavin Fleig, head of performance analysis at current Premier League champions Manchester City and former performance analyst at Bolton:

“We would be looking at, ‘If a defender cleared the ball from a long throw, where would the ball land? Well, this is the area it most commonly lands. Right, well that’s where we’ll put our man.’”

As a result, Bolton scored 45-50% of their goals from set-plays, compared to a league average of nearer 33%.
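The kind of question Fleig describes (where does a cleared ball most commonly land?) amounts to a frequency count over recorded events. The following is a minimal Python sketch of that idea only. The zone labels and observations are invented for illustration, the function name is hypothetical, and nothing here reflects how Bolton actually recorded or analysed its data.

```python
from collections import Counter

# Hypothetical landing zones for clearances from long throws.
# Labels and counts are invented for illustration only.
CLEARANCE_ZONES = [
    "edge of box, centre",
    "edge of box, left",
    "edge of box, centre",
    "near touchline",
    "edge of box, centre",
    "edge of box, left",
    "centre circle",
    "edge of box, centre",
]


def most_common_zone(zones):
    """Return (zone, share) for the most frequent landing zone."""
    if not zones:
        raise ValueError("no clearances recorded")
    zone, count = Counter(zones).most_common(1)[0]
    return zone, count / len(zones)


zone, share = most_common_zone(CLEARANCE_ZONES)
print(f"Put a man at: {zone} ({share:.0%} of clearances)")
```

The point of the sketch is that the analysis itself can be very simple. The value lies in choosing a repeatable situation, a set-play, where past frequencies are a reasonable guide to what happens next.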

Perhaps the most significant set-play in football, certainly in terms of deciding between winners and losers, is the penalty shoot-out. Routinely dismissed by the losing team (England) as a lottery, the penalty shoot-out is anything but according to statistics analyzed by Prozone.

According to Prozone’s analysis:

the team taking the first kick wins 75% of the time

81.2% of penalties taken to win shootouts were scored compared with 14.2% of those needed to keep the game alive

71.4% of all penalty saves are in the lower third of the goal

None of the penalties aimed at the upper third of the net were saved (although they are more likely to miss)

Source: Prozone

“Everything that can be counted doesn’t necessarily count”
While statistics such as these suggest that the penalty shoot-out is less a lottery than a mathematical puzzle, our second lesson that data management can learn from football relates to the above quote from Albert Einstein, and the danger of assuming the relevance of statistics.

Along with Stefan Szymanski, Kuper is the author of Soccernomics (so good I bought it twice) – a treasure trove of stories and information on statistics and football.

In Soccernomics, Kuper and Szymanski note that in the early days of statistical analysis in football players were initially judged on statistics that were easily counted: number of passes, number of tackles, number of shots, kilometres run etc.

That last statistic turned out to be particularly meaningless. Kuper quotes Chelsea’s performance director, Mike Forde:

“Can we find a correlation between total distance covered and winning? And the answer was invariably no.”
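Forde's question can be framed as a correlation test between two per-match series. Below is a minimal Python sketch using the Pearson correlation coefficient, which is an assumption on my part: the source does not say which measure Chelsea used. The match figures are invented for illustration and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-match figures, invented for illustration only:
# total team distance covered (km) and league points won (3, 1 or 0).
distance_km = [105.0, 110.0, 108.0, 112.0, 107.0, 109.0]
points_won = [3, 0, 1, 3, 0, 1]

r = pearson(distance_km, points_won)
print(f"Correlation between distance covered and points: {r:.2f}")
```

A coefficient near zero would match Forde's “invariably no”, while a value near 1 or -1 would suggest a strong linear relationship. Even then, a handful of matches would be far too small a sample to draw conclusions from.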

Perhaps the greatest example of over-reliance on statistics comes from the surprise sale of Jaap Stam from Manchester United to Lazio in 2001.

While it was widely reported that Manchester United manager Sir Alex Ferguson sold Stam due to controversial comments in his autobiography, Kuper maintains that it was a decision based on statistics: specifically, the fact that Stam, approaching 30, was tackling less often than he previously had. According to Kuper, Ferguson sold Stam based on that statistic alone.

Whether it was the statistic or the autobiography, selling Stam was a decision Ferguson would later regret. Either way, it turns out that tackles made per game is about as useful a measure of a defender’s ability as the number of kilometres run.

The proof of that comes in the shape of Paolo Maldini – arguably one of the greatest defenders the world has ever seen. As Kuper notes, statistically Maldini only made one tackle every two games. “Maldini positioned himself so well that he didn’t need to tackle.”

All of which begs the question: if someone with the domain expertise of Sir Alex Ferguson, one of the greatest managers in the history of British football, armed with statistical evidence, can make an incorrect decision, is there really a role for statistics in football?

In part two we will explore some of the other examples of statistical analysis influencing the beautiful game, including graph analysis and network theory, the great Liverpool Moneyball experiment, and the lessons learned from Total Football.

HALF TIME


The Data Day, Two days: August 20/21 2012

August 21st, 2012 — Data management

JustOne preps 2.0. Sqrrl raises $2m. Hadoop: can there be only one?

For 451 Research clients: JustOne Database lines up version two of dual-role OLAP/OLTP database bit.ly/OKyFZM

— Matt Aslett (@maslett) August 21, 2012

sqrrl raises $2 million, relocates from D.C. to Cambridge bit.ly/NBqQqM

— Matt Aslett (@maslett) August 20, 2012

Hadoop vendors – can there be only one? Yes: bit.ly/Rvjs6a No: bit.ly/RvjsmV Maybe: bit.ly/Rvjv1R

— Matt Aslett (@maslett) August 20, 2012

And that’s the Data Day, today.

