Big Data: Principles and Examples Vol. 3
In this volume, we discuss Data Mining and The Birthday Paradox.
Data Mining and The Birthday Paradox
We’ve all heard of the Birthday Paradox. Put 23 randomly chosen people in a room and there is a slightly better than 50% chance that at least two of them share a birthday. Put 57 people in the room, and the chance rises to 99%. Do those people have anything more in common because they were born on the same day of the year? Astrologers will say yes, but most scientists would say there is no evidence to support that claim.
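The two figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes 365 equally likely birthdays and ignores leap years; the function name birthday_collision_probability is purely illustrative. It computes the chance that everyone has a distinct birthday and subtracts that from 1.

```python
def birthday_collision_probability(n_people: int, n_days: int = 365) -> float:
    """Chance that at least two of n_people share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    if n_people > n_days:
        return 1.0  # pigeonhole: a shared birthday is guaranteed

    # Probability that all birthdays are distinct: the k-th person
    # must avoid the k birthdays already taken.
    p_all_distinct = 1.0
    for k in range(n_people):
        p_all_distinct *= (n_days - k) / n_days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (23, 57):
        print(f"{n} people: {birthday_collision_probability(n):.4f}")
```

The probability grows so quickly because what matters is the number of pairs of people, not the number of people: 23 people already form 253 pairs, and each pair is a separate chance for a match.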
What does this have to do with big data? The answer is that generalizations of the math behind the birthday paradox tell us that if we look at enough variables, we will draw meaningless conclusions. Not merely that we can, but that with near 100% certainty we will. The number of pairs of variables grows roughly with the square of the number of variables, so the opportunities for a coincidental match multiply quickly. In fact, we can show that if we generate a large number of streams of completely random data, some of them will appear strongly correlated with others purely by chance.
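The claim about random streams is easy to try out. The following is a minimal sketch, not a rigorous statistical test: it generates independent Gaussian noise streams, computes the Pearson correlation for every pair, and reports the strongest correlation found. The function names, the stream length of 20 points, and the fixed seed are all illustrative assumptions.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_streams, n_points, seed=0):
    """Largest |correlation| among all pairs of independent noise streams."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    streams = [
        [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        for _ in range(n_streams)
    ]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(streams, 2)
    )


if __name__ == "__main__":
    for n_streams in (2, 10, 100):
        n_pairs = n_streams * (n_streams - 1) // 2
        r = strongest_chance_correlation(n_streams, n_points=20)
        print(f"{n_streams:4d} streams, {n_pairs:5d} pairs: max |r| = {r:.2f}")
```

Every stream here is pure noise, so any correlation the script reports is meaningless by construction. With more streams there are more pairs to compare, and the strongest correlation found can only stay the same or increase.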
The problem is that we can easily forget this when we look at big data sets with a great many variables. Such data sets are typical of what is called data exhaust: the vast stream of data gathered and logged by digital devices ranging from mobile phones to engine sensors in cars to video cameras in public spaces to instruments on particle accelerators.
Look at enough of this data, and you will find spurious correlations. This is what Principle 3 is all about. Statisticians have known about Principle 3 for decades, and have techniques for dealing with it. The best technique, however, is and always has been a controlled scientific experiment, as Principle 2 advocates.