Understanding and Implementing the hypergeometric test in python

3 min readSep 13, 2017

Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.

I. Understanding the Hypergeometric Distribution

The hypergeometric distribution describes the probability of events in the following scenario:

Suppose you have a jar containing 10 red marbles and 90 black marbles.
You collect 10 marbles from the jar.
What is the probability you collect k red marbles?

Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.

The hypergeometric distribution is parameterized by 3 quantities:

The population size, usually denoted N.
In this case, the total number of marbles in the jar: 100.
The number of “successes” in the population, usually denoted K.
In this case, the number of red marbles in the jar: 10.
The sample size, usually denoted n.
In this case, the number of draws from the jar: 10.

Knowing these three quantities, we can draw a probability mass function.

As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.

II. The Hypergeometric Test

Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.
We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.

The hypergeometric test first computes the probability of drawing 7 or more red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to reject the null hypothesis as too unlikely — something must be going on with this jar.

III. Implementing the Hypergeometric Test in Python

Thanks to the great work of the open-source contributors over at scipy, implementing this test is no trouble at all, but deserves an explanation.

Scipy uses a different naming convention for their parameters (as does everyone):

M is the population size (previously N)
n is the number of successes in the population (previously K)
N is the sample size (previously n)
X is still the number of drawn “successes”.

We can then compute a probability of drawing X red marbles out of N from a jar containing n red marbles out of M in the following way:

from scipy.stats import hypergeom
pval = hypergeom.sf(x-1, M, n, N)

What is sf? The survival function is the inverse of the cumulative distribution function, i.e. sf = 1 — cdf . But what is the cumulative distribution function? The cumulative distribution function describes the sum of the probability mass up to some value. In our example, the cdf evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore sf evaluated at 7 is the probability of drawing 8 or more red marbles by chance.

However, we wanted to know the probability of drawing 7 or more red marbles, not 8, which is the reason for x-1 instead of x in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.

That’s it. If this helped you, I’d love to hear about it.

V. Appendix

The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw.
The hypergeometric probability mass function is given by (using the original variable convention)

You can find an online hypergeometric test calculator here.
If you’re in search of a GO enrichment tool in python, I like goenrich.

Understanding and Implementing the hypergeometric test in python

I. Understanding the Hypergeometric Distribution

II. The Hypergeometric Test

III. Implementing the Hypergeometric Test in Python

V. Appendix

Written by Alex Lenail