One of my favorite books of the past ten years was Freakonomics - Levitt is a master of using large amounts of data to find patterns that are interesting. From cheating on tests to betting in sumo wrestling, the point of the book is that, with a set of aggregate data, we can find interesting signals in the noise.
In a recent blog entry, the Freakonomics authors reference a story on the CBC news show The Fifth Estate. The Fifth almost always does really interesting exposé-type stories, and this one is no different. It's based on the work of U of T statistician Jeffrey Rosenthal, who did some really interesting analysis of the winnings of lottery clerks. From the intro to the story:
"This investigation revealed that lottery retailers won around two hundred major prizes averaging a half a million dollars each. The chance of this happening, according to Jeffrey Rosenthal, a prominent statistician at the University of Toronto, is one chance in a trillion, trillion, trillion, trillion – absolutely inconceivable."
What is most interesting about both this research and the research of the Freakonomics guys is that it would have been extremely difficult to compile even as recently as 25 years ago. Call it the Oracle Effect - as data becomes easier to search and more ubiquitously available, using advanced statistical tools to pull out patterns that would have been nearly invisible becomes almost trivial. We see it in many of the "Business Intelligence" tools that are being released today - they are able to extract patterns from a set of data that, to all human eyes, looks like noise.
While human neurology is brilliant at noticing social patterns, we simply don't do well with discovery of "low and slow" patterns that occur over large amounts of time or space - like noticing that lottery clerks happen to win the lottery far more than should be possible. Or noticing that sumo wrestlers lose meaningless matches more often than they otherwise should when certain financial incentives are in place. While some humans (especially those like Levitt) are brilliant at asking the questions, we'd be unlikely to have the right answers without an automated way to find the patterns in the data.
This trend will only continue to accelerate from here as information becomes even more easily available and more ubiquitously linked together.