Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.
I. Understanding the Hypergeometric Distribution
The hypergeometric distribution describes the probability of events in the following scenario:
Suppose you have a jar containing 10 red marbles and 90 black marbles.
You collect 10 marbles from the jar.
What is the probability you collect k red marbles?
Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.
The hypergeometric distribution is parameterized by 3 quantities:
- The population size, usually denoted N.
In this case, the total number of marbles in the jar: 100.
- The number of “successes” in the population, usually denoted K.
In this case, the number of red marbles in the jar: 10.
- The sample size, usually denoted n.
In this case, the number of draws from the jar: 10.
Knowing these three quantities, we can draw a probability mass function.
As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.
II. The Hypergeometric Test
Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.
We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.
The hypergeometric test first computes the probability of drawing 7 or more red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to reject the null hypothesis as too unlikely — something must be going on with this jar.
III. Implementing the Hypergeometric Test in Python
Scipy uses a different naming convention for their parameters (as does everyone):
- M is the population size (previously N)
- n is the number of successes in the population (previously K)
- N is the sample size (previously n)
- X is still the number of drawn “successes”.
We can then compute a probability of drawing X red marbles out of N from a jar containing n red marbles out of M in the following way:
from scipy.stats import hypergeom
pval = hypergeom.sf(x-1, M, n, N)
sf? The survival function is the inverse of the cumulative distribution function, i.e.
sf = 1 — cdf . But what is the cumulative distribution function? The cumulative distribution function describes the sum of the probability mass up to some value. In our example, the
cdf evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore
sf evaluated at 7 is the probability of drawing 8 or more red marbles by chance.
However, we wanted to know the probability of drawing 7 or more red marbles, not 8, which is the reason for
x-1 instead of
x in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.
That’s it. If this helped you, I’d love to hear about it.
- The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw.
- The hypergeometric probability mass function is given by (using the original variable convention)
- You can find an online hypergeometric test calculator here.
- If you’re in search of a GO enrichment tool in python, I like goenrich.