Understanding and Implementing the hypergeometric test in python

Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.

I. Understanding the Hypergeometric Distribution

The hypergeometric distribution describes the probability of events in the following scenario:

Suppose you have a jar containing 10 red marbles and 90 black marbles.
You collect 10 marbles from the jar.
What is the probability you collect k red marbles?

Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.

The hypergeometric distribution is parameterized by 3 quantities:

Knowing these three quantities, we can draw a probability mass function.

As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.

II. The Hypergeometric Test

Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.
We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.

The hypergeometric test first computes the probability of drawing 7 or more red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to reject the null hypothesis as too unlikely — something must be going on with this jar.

III. Implementing the Hypergeometric Test in Python

Thanks to the great work of the open-source contributors over at scipy, implementing this test is no trouble at all, but deserves an explanation.

Scipy uses a different naming convention for their parameters (as does everyone):

We can then compute a probability of drawing X red marbles out of N from a jar containing n red marbles out of M in the following way:

from scipy.stats import hypergeom
pval = hypergeom.sf(x-1, M, n, N)

What is sf? The survival function is the inverse of the cumulative distribution function, i.e. sf = 1 — cdf . But what is the cumulative distribution function? The cumulative distribution function describes the sum of the probability mass up to some value. In our example, the cdf evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore sf evaluated at 7 is the probability of drawing 8 or more red marbles by chance.

However, we wanted to know the probability of drawing 7 or more red marbles, not 8, which is the reason for x-1 instead of x in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.

That’s it. If this helped you, I’d love to hear about it.

V. Appendix

conscious mammalian organism, fanatical tea snob.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store