Here’s something problematic: let’s say you run a hypothesis test at a significance level of $\alpha = 0.05$. Then, assuming you ran the test correctly and met assumptions, the probability of a type I error is only 5%. But if you instead run 100 independent tests at a significance level of $\alpha = 0.05$, and every null hypothesis is true, the probability of making at least one type I error soars to

$$1 - (1 - 0.05)^{100} \approx 99.4\%,$$

and this only gets worse as you run increasingly many tests. This is the multiple testing problem, and this blog post discusses a couple of solutions to it as well as why they work. Most of this is taken from my notes for UChicago’s STAT 278, which I’d highly recommend to anyone interested in these sorts of things.
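To get a feel for how quickly this blows up, here is a small sketch. It assumes the tests are independent and that every null hypothesis is true, and the function name is just something I made up for illustration.

```python
def prob_at_least_one_type_i_error(alpha: float, n_tests: int) -> float:
    """P(at least one false rejection) over n_tests independent tests of true nulls.

    Assumption: independence and all nulls true, so each test falsely
    rejects with probability exactly alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n_tests in (1, 10, 100):
        print(n_tests, prob_at_least_one_type_i_error(0.05, n_tests))
```

With one test this is just $\alpha$, and it climbs toward 1 as the number of tests grows.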

The setup for the rest of this post is as follows. We have p-values $p_1, \dots, p_n$ that we’ve acquired from some hypothesis tests. Some of these come from true null hypotheses, and are distributed uniformly on $[0, 1]$. The rest come from false null hypotheses, and we know nothing about their distribution.

We want to run a test that will have a significance level of $\alpha$, not for each test individually, but for all of them together. This will turn out to mean different things for different methods.

**1. Bonferroni Correction**

This is a method for testing the *global* null hypothesis, which is the hypothesis that all of the null hypotheses that generated our p-values are true (so we take the intersection of the relevant events). We do this in the most naive way possible, and it turns out to work.

We just test each of the $p_i$ at a significance level of $\alpha/n$, and reject the global null if any of the $p_i$ are rejected. That is, we reject the global null if $p_i \le \alpha/n$ for some $i$ and fail to reject otherwise.
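As a minimal sketch of this rule, rejecting the global null comes down to comparing the smallest p-value against the level $\alpha$ divided by the number of tests $n$. The function name and the example p-values below are made up for illustration.

```python
from typing import Sequence


def bonferroni_global_test(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Return True if the Bonferroni correction rejects the global null."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    # Some p_i is <= alpha / n exactly when the smallest one is.
    return min(p_values) <= alpha / n


if __name__ == "__main__":
    # Hypothetical p-values; the per-test cutoff here is 0.05 / 4 = 0.0125.
    print(bonferroni_global_test([0.001, 0.2, 0.5, 0.8]))
    print(bonferroni_global_test([0.02, 0.03, 0.04, 0.6]))
```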

In what sense does this work?

**Proposition 1.** *When testing the global null hypothesis using Bonferroni correction, the probability of making a type I error is $\le \alpha$.*

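*Proof:* This is direct. Under the global null, every $p_i$ is uniform on $[0, 1]$, and the probability that we reject the global null is

$$P\left(\min_i p_i \le \frac{\alpha}{n}\right) = P\left(\bigcup_{i=1}^n \left\{p_i \le \frac{\alpha}{n}\right\}\right),$$

which by the union bound is

$$\le \sum_{i=1}^n P\left(p_i \le \frac{\alpha}{n}\right) = n \cdot \frac{\alpha}{n} = \alpha,$$

so we have control of the family-wise error rate (FWER).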

So that’s nice. But it’s also extremely conservative, and we can afford to do things that are a bit more powerful.

**2. Simes’ Test**

Here we’re also testing the global null, but a bit more cleverly. Simes’ test rejects the global null if there is 1 p-value that is $\le \alpha/n$, or 2 p-values that are $\le 2\alpha/n$, or in general, if there are $k$ p-values that are $\le k\alpha/n$ for any $k$. This will reject when Bonferroni rejects, but also in some other cases, so it’s more powerful. This power costs something, though: the bound we prove for it below isn’t as strong as for Bonferroni.
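Here is a small sketch of the Simes rule. It relies on the observation that there are at least $k$ p-values $\le k\alpha/n$ exactly when the $k$-th smallest p-value is $\le k\alpha/n$, so sorting once is enough. The function name and the example p-values are hypothetical.

```python
from typing import Sequence


def simes_global_test(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Return True if Simes' test rejects the global null."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(p_values)
    # Reject if, for some k, the k-th smallest p-value is <= k * alpha / n.
    return any(p <= k * alpha / n for k, p in enumerate(sorted_p, start=1))


if __name__ == "__main__":
    p = [0.02, 0.024, 0.5, 0.6]  # hypothetical p-values
    print("Simes rejects:", simes_global_test(p))
    print("Bonferroni rejects:", min(p) <= 0.05 / len(p))
```

In this made-up example no single p-value clears the Bonferroni cutoff of $0.0125$, but two of them are below $2\alpha/n = 0.025$, so Simes rejects where Bonferroni does not.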

**Proposition 2.** *If the p-values are independent, then when testing the global null hypothesis using Simes’ test, the probability of making a type I error is $\le \alpha H_n$, where $H_n = 1 + \frac{1}{2} + \dots + \frac{1}{n} \approx \log n$.*

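*Proof:* We assign a score to each p-value that measures what percent of the way it gets us to rejecting the global null. Let

$$s_i = \begin{cases} 1 & \text{if } p_i \le \frac{\alpha}{n}, \\ \frac{1}{2} & \text{if } \frac{\alpha}{n} < p_i \le \frac{2\alpha}{n}, \\ \;\vdots & \\ \frac{1}{n} & \text{if } \frac{(n-1)\alpha}{n} < p_i \le \alpha, \\ 0 & \text{if } p_i > \alpha. \end{cases}$$

Then, Simes rejects if and only if, for some $k$, there are $k$ of the $p_i$’s that are $\le k\alpha/n$. Each of those has a score of at least $1/k$, which means that the sum of the $s_i$ is $\ge 1$. So now we can bound the probability of that happening.

Under the global null each $p_i$ is uniform on $[0, 1]$, so it lands in each of the $n$ intervals above with probability $\alpha/n$, and the expected value of any given score is

$$E[s_i] = \frac{\alpha}{n}\left(1 + \frac{1}{2} + \dots + \frac{1}{n}\right) = \frac{\alpha}{n} H_n, \qquad H_n = 1 + \frac{1}{2} + \dots + \frac{1}{n}.$$

Then, by Markov’s inequality,

$$P(\text{Simes rejects}) \le P\left(\sum_{i=1}^n s_i \ge 1\right) \le E\left[\sum_{i=1}^n s_i\right] = n \cdot \frac{\alpha}{n} H_n = \alpha H_n.$$

Note that for large $n$, the $H_n \approx \log n$ factor counts a lot, and the bound is sharp. The equality construction is a bit of a pain, but the idea is to choose the joint distribution of the $p_i$ to make them negatively dependent. (Looking back, the argument only used Markov’s inequality and linearity of expectation, so independence is not actually needed for this bound.)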
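To see how much that factor matters, here is a quick sketch that evaluates the harmonic sum $1 + \frac{1}{2} + \dots + \frac{1}{n}$ coming out of the expected score, along with the resulting bound on the type I error. The function name and the choice of $\alpha = 0.05$ are just for illustration.

```python
def harmonic_number(n: int) -> float:
    """Return 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    alpha = 0.05  # illustrative level
    for n in (10, 100, 1000):
        h = harmonic_number(n)
        # alpha * H_n is the bound on the type I error from the score argument.
        print(n, round(h, 3), round(alpha * h, 3))
```

For 100 tests the harmonic sum is a bit over 5, so the guarantee from this argument is roughly five times weaker than the nominal level.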

**3. Holm-Bonferroni**

This test is different from the previous two in that it does not test the global null, but instead individually tests each of the $p_i$ and then controls the family-wise error rate (FWER), which is the probability of making at least one type I error across all the p-values. As an added bonus, we can accept/reject individual p-values rather than accepting/rejecting them all at once, so we have a much better sense of how significant the results are, taken together.

The idea of Holm-Bonferroni is to run Bonferroni until acceptance. We first run Bonferroni on $p_1, \dots, p_n$. If we do not reject the global null, we stop. If we do, then we reject the smallest p-value, and repeat this procedure with the other $n - 1$ p-values, continuing until we do not reject the global null.

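Equivalently, if we order the p-values as $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, this is the same as checking if

$$p_{(k)} \le \frac{\alpha}{n + 1 - k}$$

for each $k = 1, \dots, n$ and rejecting all the p-values before the first $k$ that fails this test.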
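The ordered version of the procedure is short enough to sketch directly. This walks through the sorted p-values, compares the $k$-th smallest against $\alpha/(n + 1 - k)$, and stops at the first failure. The function name and example p-values are made up for illustration.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/fail-to-reject flag for each p-value, in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (n + 1 - step):
            reject[idx] = True
        else:
            # First failure: stop, and everything from here on is not rejected.
            break
    return reject


if __name__ == "__main__":
    print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))  # hypothetical p-values
```

In this example the two smallest p-values pass their cutoffs, the third fails, and so the fourth is never rejected even though it is below its own cutoff of $\alpha$.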

**Proposition 3.** *For Holm-Bonferroni, we have $\text{FWER} \le \alpha$.*

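*Proof:* Let $n_0$ be the number of true nulls. Then

$$P\left(\min_{i \text{ null}} p_i \le \frac{\alpha}{n_0}\right) \le \alpha$$

from a union bound. In this case, we might have false rejections, but this only happens at most $\alpha$ of the time. So we show that the other $1 - \alpha$ of the time, there are no false rejections.

Suppose $p_i > \alpha/n_0$ for all nulls $i$. Then, take the ordered p-values $p_{(1)} \le \dots \le p_{(n)}$. Let $p_{(k)}$ be the first p-value that comes from a null. We know that this p-value can’t be too small, and we know it can’t happen super late in the list, because there are only $n - n_0$ non-nulls. This means that $k \le n - n_0 + 1$. There are two cases now: the procedure stops before step $k$ (which means no false rejections because there are no nulls before $k$). Or, it gets to step $k$, and we check if $p_{(k)} \le \frac{\alpha}{n + 1 - k}$. But we know that $p_{(k)} > \alpha/n_0$ and $k \le n - n_0 + 1$, so $n + 1 - k \ge n_0$ and thus $\frac{\alpha}{n + 1 - k} \le \frac{\alpha}{n_0} < p_{(k)}$. By our assumption on the nulls then, we stop at step $k$, and there are no false rejections.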

As with the Bonferroni correction, we don’t need any assumptions of independence.

**4. Benjamini-Hochberg**

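This is another multiple testing procedure, but instead of controlling the FWER, it controls

$$\text{FDR} = E\left[\frac{\#\{\text{null } p_i \text{ rejected}\}}{\max(\#\{p_i \text{ rejected}\},\, 1)}\right],$$

the false discovery rate. The quantity we’re taking the expectation of is the false discovery proportion (FDP); the $\max$ in the denominator just sets it to 0 when nothing is rejected.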

Here, we build on Simes’ test by running Simes’, and then rejecting those $k$ p-values that are $\le k\alpha/n$, for the maximum $k$ we can do this for. Equivalently, writing $\hat{k}$ for that maximum $k$, we choose the threshold $\hat{k}\alpha/n$, and then reject all the p-values at or below this threshold.
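As a sketch of the procedure, we can find the largest $k$ for which the $k$-th smallest p-value is $\le k\alpha/n$, and then reject everything at or below the resulting threshold. The function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/fail-to-reject flag for each p-value, in the input order."""
    n = len(p_values)
    sorted_p = sorted(p_values)
    # k_hat is the largest k such that at least k p-values are <= k * alpha / n.
    k_hat = 0
    for k, p in enumerate(sorted_p, start=1):
        if p <= k * alpha / n:
            k_hat = k
    if k_hat == 0:
        return [False] * n
    threshold = k_hat * alpha / n
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9]))  # hypothetical p-values
```

Note that in this example the two smallest p-values fail their own cutoffs ($0.0125$ and $0.025$), but they are still rejected because the third one passes at $k = 3$. The loop deliberately keeps going after a failure rather than stopping at the first one.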

**Proposition 4.** *If the p-values are independent, then Benjamini-Hochberg gives $\text{FDR} \le \alpha \frac{n_0}{n}$, where $n_0$ is the number of true nulls.*

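*Proof:* Suppose we make $\hat{k}$ rejections, so our threshold is $\hat{k}\alpha/n$, meaning that $p_i$ is rejected iff $p_i \le \hat{k}\alpha/n$. The trouble is that here, the RHS also depends on $p_i$, so we have to go through some gymnastics to get around this.

Let $\hat{k}_i$ be the largest $k$ such that there are at least $k - 1$ many p-values $p_j$ with $j \ne i$ that are $\le k\alpha/n$. Call this statement the BH$_i$ statement (at $k$), and the statement that there are at least $k$ many p-values $\le k\alpha/n$ the BH statement (at $k$). So $\hat{k}$ is the largest $k$ at which BH holds, and $\hat{k}_i$ is the largest $k$ at which BH$_i$ holds.

*Claim.* *If $p_i$ is rejected, then $\hat{k} = \hat{k}_i$.*

Suppose $p_i$ is rejected. Then $p_i \le \hat{k}\alpha/n$ and BH holds for $\hat{k}$, so there are $\hat{k} - 1$ of the $p_j$, $j \ne i$, that are also $\le \hat{k}\alpha/n$. Thus BH$_i$ holds for $\hat{k}$, and so $\hat{k}_i \ge \hat{k}$.

On the other hand, $p_i \le \hat{k}\alpha/n \le \hat{k}_i\alpha/n$, and BH$_i$ is true at $\hat{k}_i$, which means there are $\hat{k}_i$ values (the $\hat{k}_i - 1$ others together with $p_i$ itself) that are $\le \hat{k}_i\alpha/n$, so BH is true at $\hat{k}_i$, and thus $\hat{k} \ge \hat{k}_i$. It follows that $\hat{k} = \hat{k}_i$.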

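With this claim, we can move onto the main result. Writing $\hat{k}$ for the number of rejections and $\hat{k}_i$ for the quantity from the claim, we have

$$\text{FDR} = E\left[\sum_{i \text{ null}} \frac{\mathbf{1}\{p_i \le \hat{k}\alpha/n\}}{\max(\hat{k}, 1)}\right] = \sum_{i \text{ null}} E\left[\frac{\mathbf{1}\{p_i \le \hat{k}\alpha/n\}}{\max(\hat{k}, 1)}\right].$$

We can replace the denominator with $\hat{k}_i$ without changing the sum, because when they differ, the numerator is 0, so it doesn’t matter. Then, this is

$$\sum_{i \text{ null}} E\left[\frac{\mathbf{1}\{p_i \le \hat{k}\alpha/n\}}{\hat{k}_i}\right] \le \sum_{i \text{ null}} E\left[\frac{\mathbf{1}\{p_i \le \hat{k}_i\alpha/n\}}{\hat{k}_i}\right].$$

This is an inequality because if the first numerator is 1, the second is definitely 1, but if the first is 0, the second might not be.

Now, we use the tower law to write this as

$$\sum_{i \text{ null}} E\left[E\left[\frac{\mathbf{1}\{p_i \le \hat{k}_i\alpha/n\}}{\hat{k}_i} \,\middle|\, p_j,\ j \ne i\right]\right].$$

Since $\hat{k}_i$ depends only on the other p-values, and a null $p_i$ is uniform and independent of them, that conditional expectation is just

$$\frac{1}{\hat{k}_i} \, P\left(p_i \le \frac{\hat{k}_i\alpha}{n} \,\middle|\, p_j,\ j \ne i\right) = \frac{1}{\hat{k}_i} \cdot \frac{\hat{k}_i\alpha}{n} = \frac{\alpha}{n}.$$

This is a constant, and the expected value of that is just $\alpha/n$. When you add this over all the null terms, you get $\alpha \frac{n_0}{n}$, as desired.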

Here, we actually get a better bound than $\alpha$, which might seem like a good thing, but also means we’re not making as many discoveries as we could be. We can correct for this by estimating $n_0$, and then running BH with $\alpha \cdot n/n_0$ in place of $\alpha$, so that the FDR for this procedure will be approximately $\alpha$.

Although we used independence here, we can make do without it, albeit at the cost of a logarithmic factor.

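**Proposition 5.** *When running Benjamini-Hochberg, we have $\text{FDR} \le \alpha \frac{n_0}{n} H_n \le \alpha H_n$, where $H_n = 1 + \frac{1}{2} + \dots + \frac{1}{n} \approx \log n$.*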

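*Proof:* As before, we define a score $s_i$ for each $p_i$ that is 1 if $p_i \le \alpha/n$, $\frac{1}{2}$ if $\alpha/n < p_i \le 2\alpha/n$ and so on, giving a score of 0 if $p_i > \alpha$.

We proved above that for a null p-value, the expected score is

$$E[s_i] = \frac{\alpha}{n}\left(1 + \frac{1}{2} + \dots + \frac{1}{n}\right) = \frac{\alpha}{n} H_n.$$

Now, we write

$$\text{FDR} = \sum_{i \text{ null}} E\left[\frac{\mathbf{1}\{p_i \text{ rejected}\}}{\max(\hat{k}, 1)}\right],$$

where $\hat{k}$ is the number of rejections. If $p_i$ is rejected, then $p_i \le \hat{k}\alpha/n$, so its score is $\ge 1/\hat{k}$. Put another way,

$$\frac{\mathbf{1}\{p_i \text{ rejected}\}}{\max(\hat{k}, 1)} \le s_i.$$

Thus,

$$\text{FDR} \le \sum_{i \text{ null}} E[s_i] = n_0 \cdot \frac{\alpha}{n} H_n = \alpha \frac{n_0}{n} H_n.$$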