Understanding and Implementing the hypergeometric test in python

Alex Lenail
3 min readSep 13, 2017

Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.

I. Understanding the Hypergeometric Distribution

The hypergeometric distribution describes the probability of events in the following scenario:

Suppose you have a jar containing 10 red marbles and 90 black marbles.
You collect 10 marbles from the jar.
What is the probability you collect k red marbles?

Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.

The hypergeometric distribution is parameterized by 3 quantities:

  • The population size, usually denoted N.
    In this case, the total number of marbles in the jar: 100.
  • The number of “successes” in the population, usually denoted K.
    In this case, the number of red marbles in the jar: 10.
  • The sample size, usually denoted n.
    In this case, the number of draws from the jar: 10.

Knowing these three quantities, we can draw a probability mass function.

As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.

II. The Hypergeometric Test

Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.
We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.

The hypergeometric test first computes the probability of drawing 7 or more red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to reject the null hypothesis as too unlikely — something must be going on with this jar.

III. Implementing the Hypergeometric Test in Python

Thanks to the great work of the open-source contributors over at scipy, implementing this test is no trouble at all, but deserves an explanation.

Scipy uses a different naming convention for their parameters (as does everyone):

  • M is the population size (previously N)
  • n is the number of successes in the population (previously K)
  • N is the sample size (previously n)
  • X is still the number of drawn “successes”.

We can then compute a probability of drawing X red marbles out of N from a jar containing n red marbles out of M in the following way:

from scipy.stats import hypergeom
pval = hypergeom.sf(x-1, M, n, N)

What is sf? The survival function is the inverse of the cumulative distribution function, i.e. sf = 1 — cdf . But what is the cumulative distribution function? The cumulative distribution function describes the sum of the probability mass up to some value. In our example, the cdf evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore sf evaluated at 7 is the probability of drawing 8 or more red marbles by chance.

However, we wanted to know the probability of drawing 7 or more red marbles, not 8, which is the reason for x-1 instead of x in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.

That’s it. If this helped you, I’d love to hear about it.

V. Appendix

  • The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw.
  • The hypergeometric probability mass function is given by (using the original variable convention)

--

--