Understanding and Implementing the hypergeometric test in python

Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.

I. Understanding the Hypergeometric Distribution

The hypergeometric distribution describes the probability of events in the following scenario:

Suppose you have a jar containing 10 red marbles and 90 black marbles.
You collect 10 marbles from the jar.
What is the probability you collect k red marbles?

Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.

The hypergeometric distribution is parameterized by 3 quantities:

  • The population size, usually denoted N.
    In this case, the total number of marbles in the jar: 100.

Knowing these three quantities, we can draw a probability mass function.

Image for post
Image for post

As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.

II. The Hypergeometric Test

Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.
We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.

The hypergeometric test first computes the probability of drawing 7 or more red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to reject the null hypothesis as too unlikely — something must be going on with this jar.

III. Implementing the Hypergeometric Test in Python

Thanks to the great work of the open-source contributors over at scipy, implementing this test is no trouble at all, but deserves an explanation.

Scipy uses a different naming convention for their parameters (as does everyone):

  • M is the population size (previously N)

We can then compute a probability of drawing X red marbles out of N from a jar containing n red marbles out of M in the following way:

from scipy.stats import hypergeom
pval = hypergeom.sf(x-1, M, n, N)

What is sf? The survival function is the inverse of the cumulative distribution function, i.e. sf = 1 — cdf . But what is the cumulative distribution function? The cumulative distribution function describes the sum of the probability mass up to some value. In our example, the cdf evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore sf evaluated at 7 is the probability of drawing 8 or more red marbles by chance.

However, we wanted to know the probability of drawing 7 or more red marbles, not 8, which is the reason for x-1 instead of x in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.

That’s it. If this helped you, I’d love to hear about it.

V. Appendix

  • The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw.
Image for post
Image for post

Written by

conscious mammalian organism, fanatical tea snob.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store