What is Bonferroni?

A statistical principle for estimating how often a certain condition would be met by chance alone. It is very helpful for putting things in perspective when analyzing large data sets, which often turn up a number of false-positive "significant" results.

Potential Applications:
  • I've always wondered if Facebook cross-references the events people attend to suggest whether or not they should be friends. If this was the only metric they used to suggest friendship, how many friend suggestions would be false positives?
  • How often could potential matches on large dating websites happen by chance?
  • And of course, from my field, how many single-nucleotide polymorphisms (SNPs) could appear significant in a genome-wide association study by chance alone?

Let’s use the Facebook example and make the following assumptions:
  • There are 1 billion active users
  • Everyone attends an event 1 day in 100
  • There are 200,000 registered events within our scope, which is enough to account for the 10 million people who attend an event on a given day (about 50 people per event).
  • We wade through 1,000 days' worth of event attendance records

What is the probability that two people were at the same event on two different days?

Assuming everyone randomly attends an event, the probability that someone attends an event on any given day is 0.01 (1/100). And when they do choose an event to attend, they choose one of the 2e+05 registered events at random. Just to be clear on notation, 

2e+05 = 2 x 10^5.

  • The probability of any two people both deciding to attend an event on the same day is 0.0001 (1/100*1/100).
  • The probability that they will attend the same event is 0.0001/2e+05 (number of registered events) = 5e-10.
  • The chance that they will attend the same event on two different days is (5e-10)*(5e-10) = 2.5e-19 (note that the event can be a different one on each of the two days).
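
The three steps above can be checked with a few lines of Python. This is just a minimal sketch: the variable names are my own, and the inputs are the values used in the calculation (an attendance probability of 1/100 and 2e+05 events), with days and event choices assumed to be independent.

```python
# Inputs taken from the calculation above (names are illustrative).
p_attend = 1 / 100   # chance a given person attends an event on a given day
n_events = 2e5       # number of registered events

# Step 1: two given people both decide to attend an event on the same day.
p_both_attend = p_attend * p_attend

# Step 2: they both attend AND pick the same event out of n_events.
p_same_event = p_both_attend / n_events

# Step 3: that coincidence happens on each of two given days,
# treating the two days as independent.
p_two_days = p_same_event ** 2

print(f"{p_both_attend:.1e}")  # 1.0e-04
print(f"{p_same_event:.1e}")   # 5.0e-10
print(f"{p_two_days:.1e}")     # 2.5e-19
```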

How many of these event attendance coincidences will indicate potential friendship? Let's just say a potential friendship is a pair of people and a pair of days, such that two people were at the same event on each of the two days.

  • The number of pairs of people is (10^9 choose 2) ≈ 5e+17.
  • The number of pairs of days is (1000 choose 2) ≈ 5e+05.
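
Both counts are rounded. As a quick check, here is a small sketch that computes the exact values with `math.comb` (available in Python 3.8 and later); the variable names are my own.

```python
from math import comb

n_people = 10**9   # active users
n_days = 1000      # days of attendance records

# "n choose 2" is n * (n - 1) / 2, the number of unordered pairs.
pairs_of_people = comb(n_people, 2)
pairs_of_days = comb(n_days, 2)

print(pairs_of_people)  # 499999999500000000, i.e. roughly 5e+17
print(pairs_of_days)    # 499500, i.e. roughly 5e+05
```

The rounding is harmless here: both exact values are within about 0.1% of the rounded ones.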

So, the expected number of attendance coincidences that look like potential friendships = (the numbers of pairs of people)*(the number of pairs of days)*(the probability that any one pair of people and pair of days is an indication of potential friendship).

(5e+17)*(5e+05)*(2.5e-19) = 62,500
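
To see how sensitive this number is to the assumptions, the whole estimate can be wrapped in one function. This is a rough sketch, not anything Facebook actually does: `expected_coincidences` and its parameter names are hypothetical, and it uses exact pair counts, so the baseline lands slightly under 62,500.

```python
from math import comb

def expected_coincidences(n_people, n_days, n_events, p_attend):
    """Expected number of (pair of people, pair of days) chance matches,
    assuming independent attendance and a uniformly random event choice."""
    p_same_event = p_attend ** 2 / n_events   # same event on one given day
    p_two_days = p_same_event ** 2            # ... on each of two given days
    return comb(n_people, 2) * comb(n_days, 2) * p_two_days

baseline = expected_coincidences(10**9, 1000, 200_000, 0.01)
print(round(baseline))        # about 62,437 with exact pair counts

# Doubling the observation window roughly quadruples the count,
# because the number of pairs of days grows with the square of the days.
longer = expected_coincidences(10**9, 2000, 200_000, 0.01)
print(longer / baseline)      # close to 4
```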

This means roughly 62,500 pairs of people will look like potential friends even though they aren't. However, on the scale of a billion users, that is only about 0.00625% of the user count, which doesn't seem terrible.

In my own experience, I know I've seen a number of suggested friends who I have nothing in common with except similar mutual friends. Maybe friend suggestions can be improved by incorporating this information. 

Of course, I deactivated my Facebook account in 2009 and haven't been back since. Perhaps they already leverage this information...