# Understanding and Implementing the hypergeometric test in python

Whenever I find a topic I can’t find a sufficiently good tutorial or explanation of online, I feel compelled to offer one. I hope this helps you.

## I. Understanding the Hypergeometric Distribution

The hypergeometric distribution describes the probability of events in the following scenario:

Suppose you have a jar containing 10 red marbles and 90 black marbles.

You collect 10 marbles from the jar.

What is the probability you collect ** k** red marbles?

Collecting a single red marble seems intuitively most likely, but if you collected none or a couple, that wouldn’t be too surprising. But what if you collected 6, or 8? That might suggest a bias towards the red marbles, suggesting the possibility of some underlying, non-random process is at play.

The hypergeometric distribution is parameterized by 3 quantities:

- The population size, usually denoted
*N**.*

In this case, the total number of marbles in the jar: 100. - The number of “successes” in the population, usually denoted
*K**.*In this case, the number of red marbles in the jar: 10. - The sample size, usually denoted
.*n*

In this case, the number of draws from the jar: 10.

Knowing these three quantities, we can draw a **probability mass function**.

As we had suspected, 1 is most likely, 0 is quite likely, and 2 is not unlikely. Five or more seems very, very unlikely.

## II. The Hypergeometric Test

Suppose we suspect that this is no regular jar, and despite their fewer number, we anticipate drawing a disproportionate number of red marbles.

We draw 10 marbles, of which 7 are red (X = 7), and we’re interested to know how unlikely such a result is to occur by chance.

The hypergeometric test first computes the probability of drawing 7 *or more* red marbles from this jar under the “null hypothesis”: the hypothesis that there is nothing special about the jar. If this probability (also called the p-value) is sufficiently low, then we can decide to *reject* the null hypothesis as too unlikely — *something* must be going on with this jar.

## III. Implementing the Hypergeometric Test in Python

Thanks to the great work of the open-source contributors over at scipy, implementing this test is no trouble at all, but deserves an explanation.

Scipy uses a different naming convention for their parameters (as does everyone):

is the population size (previously*M*)*N*is the number of successes in the population (previously*n*)*K*is the sample size (previously*N*)*n*is still the number of drawn “successes”.*X*

We can then compute a probability of drawing ** X** red marbles out of

**from a jar containing**

*N***red marbles out of**

*n***in the following way:**

*M*`from scipy.stats import hypergeom`

pval = hypergeom.sf(x-1, M, n, N)

What is `sf`

? The **survival function** is the inverse of the **cumulative distribution function**, i.e. `sf = 1 — cdf`

. But what is the cumulative distribution function? The **cumulative distribution function** describes the sum of the probability mass up to some value. In our example, the `cdf`

evaluated at 7 is the probability of drawing 7 or fewer red marbles. Therefore `sf`

evaluated at 7 is the probability of drawing 8 or more red marbles by chance.

However, we wanted to know the probability of drawing **7** or more red marbles, not 8, which is the reason for `x-1`

instead of `x`

in the code. This can be the source of misreported p-values, since different packages in different languages will vary on their implementation. I myself made this mistake.

That’s it. If this helped you, I’d love to hear about it.

## V. Appendix

- The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of
successes in*k*draws*n*replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar*with*putting them back in the jar after each draw.*without* - The hypergeometric probability mass function is given by (using the original variable convention)

- You can find an online hypergeometric test calculator here.
- If you’re in search of a GO enrichment tool in python, I like goenrich.