Multiple Testing Procedures

Here’s something problematic: let’s say you run a hypothesis test at a significance level of {\alpha=0.05}. Then, assuming you ran the test correctly and met assumptions, the probability of a type I error is only 5%. But if you instead run 100 tests at a significance level of {\alpha=0.05}, the probability of making at least one type I error soars to

\displaystyle 1-(1-0.05)^{100}\approx 99\%,

and this only gets worse as you run increasingly many tests. This is the multiple testing problem, and this blog post discusses a couple of solutions to it as well as why they work. Most of this is taken from my notes for UChicago’s STAT 278, which I’d highly recommend to anyone interested in these sorts of things.

The set up for the rest of this post is as follows. We have p-values {p_1,\cdots, p_n} that we’ve acquired from some hypothesis tests. Some of these come from true null hypotheses, and are distributed uniformly on {[0,1]}. The rest come from false null hypotheses, and we know nothing about their distribution.

We want to run a test that will have a significance level of {\alpha}, not for each test individually, but for all of them together. This will turn out to mean different things for different methods.

1. Bonferroni Correction

This is a method for testing the global null hypothesis, which is the hypothesis that all of the null hypotheses that generated our p-values are true (so we take the intersection of the relevant events). We do this in the most naive way possible, and it turns out to work.

We just test each of the {p_i} at a significance level of {\alpha/n}, and reject the global null if any of the {p_i} are rejected. That is, we reject the global null if {p_i\leq \alpha/n} for some {i} and fail to reject otherwise.

In what sense does this work?

Proposition 1. When testing the global null hypothesis using Bonferroni correction, the probability of making a type I error is {\leq\alpha}.

Proof: This is direct. The probability that we reject the global null is

\displaystyle \mathop{\mathbb P}\left(\min_{1\leq i\leq n} p_i\leq \frac{\alpha}{n}\right)=\mathop{\mathbb P}\left(\bigcup_{i=1}^n \left\{p_i\leq \frac{\alpha}{n}\right\}\right),

which by the union bound is

\displaystyle \leq \sum_{i=1}^n\mathop{\mathbb P}\left(p_i\leq \frac{\alpha}{n}\right)=n\cdot \frac{\alpha}{n}=\alpha,

so we have FWER control. \Box

So that’s nice. But it’s also extremely conservative, and we can afford to do things that are a bit more powerful.

2. Simes’ Test

Here we’re also testing the global null, but a bit more cleverly. The Simes’ test rejects the global null if there is 1 p-value that is {\leq \frac{\alpha}{n}}, or 2 p-values that are {\leq \frac{2\alpha}{n}}, or in general, if there are {k} p-values that are {\leq \frac{k\alpha}{n}} for any {1\leq k\leq n}. This will reject when Bonferroni rejects, but also in some other cases, so its more powerful. This power costs something, though: we need an assumption of independence to get a bound, and this bound isn’t as strong as for Bonferroni.

Proposition 2. If the p-values are independent, then when testing the global null hypothesis using Simes’ test, the probability of making a type I error is {\leq\alpha H_n\approx \alpha \log n}.

Proof: We assign a score to each {p}-value that measures what percent of the way it gets us to rejecting the global null. Let

\displaystyle s_i=\left\{\begin{aligned} 1&\text{ if }&& p_i\leq \frac{\alpha}{n}\\ \frac{1}{2}&\text{ if }&& \frac{\alpha}{n}\leq p_i\leq \frac{2\alpha}{n}\\ &\vdots \\ \frac{1}{n}&\text{ if }&& \frac{\alpha(n-1)}{n}\leq p_i\leq {\alpha}\\ \\ 0&\text{ if } &&p_i>\alpha\end{aligned}\right.

Then, Simes rejects if and only if, for some {k}, there are {k} {p_i}‘s that are {\leq \frac{k\alpha }{n}}, which means that the sum of the {s_i} is {\geq 1}. So now we can calculate the probability of that happening.

The expected value of any given score is

\displaystyle \mathop{\mathbb E}[s_i]=1\cdot \frac{\alpha}{n}+\cdots +\frac{1}{n}\cdot \frac{\alpha}{n}=\frac{\alpha}{n}H_n.


\displaystyle \mathop{\mathbb P}(\text{Simes rejects})\leq \mathop{\mathbb P}\left(\sum_{i=1}^m s_i\geq 1\right)\leq \mathop{\mathbb E}\left[\sum_{i=1}^n s_i\right] =n\cdot \frac{\alpha}{n}H_n =\alpha H_n.


Note that for large {n}, the {\log n} factor counts a lot, and the bound is sharp. The equality construction is a bit of a pain, but the idea is to choose the joint distribution of the {p_i} to make them negatively dependent.

3. Holm-Bonferroni

This test is different from the previous two in that it does not test the global null, but instead individually tests each of the {p_i} and then controls the family-wise error rate (FWER), which is the probability of making at least one type I error across all the p-values. As an added bonus, we can accept/reject individual p-values rather than accepting/rejecting them all at once, so we have a much better sense of how significant the results are, taken together.

The idea of Holm-Bonferroni is to run Bonferroni until acceptance. We first run Bonferroni on {p_1,\cdots, p_n}. If we do not reject the global null, we stop. If we do, then we reject the smallest p-value, and repeat this procedure with the other p-values, continuing until we do not reject the global null.

Equivalently, if we order the p-values as {p_{(1)},\cdots, p_{(n)}}, this is the same as checking if

\displaystyle p_{(k)}\leq \frac{\alpha}{n+k-1}

for each {k=1,\cdots, n} and rejecting all the p-values before the first {k} that fails this test.

Proposition 3. For Holm-Bonferroni, we have {\text{FWER}\leq \alpha}.

Proof: Let {n_0=\#\text{ of nulls}}. Then

\displaystyle \mathop{\mathbb P}\left(\min_{\text{nulls }i} \leq \frac{\alpha}{n_0}\right)\leq \alpha,

from a union bound. In this case, we might have false rejections, but this only happens {\alpha} of the time. So we show that the other {1-\alpha} of the time, there are no false rejections.

Suppose {p_i>\frac{\alpha}{n_0}} for all nulls {i}. Then, take the ordered {p}-values {p_{(1)},\cdots, p_{(n)}}. Let {p_{(k)}} be the first {p}-value that comes from a null. We know that this {p}-value can’t be too small, and we know it can’t happen super late in the list, because there are at most {n-n_0} non-nulls. This means that {k\leq n-n_0+1}. There are two cases now: the procedure stops before step {k} (which means no false rejections because no nulls before {k}). Or, it gets to step {k}, and we check if {p_{(k)}\leq \frac{\alpha}{n+1-k}}. But we know that {p_{(k)}>\frac{\alpha}{n_0}} and {k\leq n-n_0+1}, so the {\frac{\alpha}{n+1-k}\leq \frac{\alpha}{n_0}}. By our assumption on the null {p_i} then, we stop at step {k}, and there are no false rejections. \Box

As with the Bonferroni correction, we don’t need any assumptions of independence.

4. Benjamini-Hochberg

This is another multiple testing procedure, but instead of controlling the FWER, it controls

\displaystyle \text{FDR}=\mathop{\mathbb E}\left[\frac{\#\text{ null p-values rejected}}{\#\text{ p-values rejected}}\right],

the false discovery rate. The quantity we’re taking the expectation of is the false discovery proportion (FDP).

Here, we build on Simes’ test by running Simes’, and then rejecting those {k} p-values that are {\leq \alpha\frac{k}{n}}, for the maximum {k} we can do this for. Equivalently, we choose some threshold {\alpha\frac{k}{n}}, and then reject all the p-values below this threshold.

Proposition 4. If the p-values are independent, the Benjamin-Hochberg gives {\text{FDR}=\alpha\frac{n_0}{n}\leq \alpha}.

Proof: Suppose we make {\hat{k}} rejections, so our threshold is {\alpha\frac{\hat{k}}{n}}, meaning that {p_i} is rejected iff {p_i\leq \alpha\frac{\hat{k}}{n}}. The trouble is that here, the RHS also depends on {p_i}, so we have to go through some gymnastics to get around this.

Let {\hat{k}_i} be the largest {k} such that there are {k-1} many p-values {p_j} with {j\neq i} that are {\leq \alpha\frac{k}{n}}. Call this statement the {\text{BH}_i} statement, and the statement that there are {k} many p-values {\leq \alpha\frac{k}{n}} the BH statement.

Claim. If {p_i} is rejected, then {\hat{k}=\hat{k}_i}.

Call the first statement the BH statement, and the second one (with {k-1}) the BH{_\text{i}} statement.

Now, suppose {p_i} is rejected. Then {p_i \leq \alpha \frac{k}{n},} and BH holds for {k=\hat{k}}, so there are {\hat{k}-1} of the {p_j}, {j\neq i}, that are also {\leq \alpha \frac{\hat{k}}{n}}. Thus BH{_{\text{i}}} holds for {k=\hat{k}}, and so {\hat{k}_i\geq \hat{k}}.

On the other hand, {p_i\leq \alpha \frac{\hat{k}}{n}\leq \alpha \frac{\hat{k}_i}{n}}, so BH{_{\text{i}}} is true at {k=\hat{k}_i}, which means there are {\hat{k}_i} values that are {\leq \alpha \frac{\hat{k}_i}{n}}, so BH is true at {k=\hat{k}_i}, and thus {\hat{k}\geq \hat{k}_i}. It follows that {\hat{k}=\hat{k}_i}

With this claim, we can move onto the main result. We have

\displaystyle \text{FDR}=\mathop{\mathbb E}[\text{FDP}]=\mathop{\mathbb E}\left[\frac{\sum_{i\text{ null}} \mathbf{1}\{p_i\text{ is rejected}\}}{\max(1, \hat{k})}\right]=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ is rejected}}{\max(1,\hat{k})}\right].

We can replace the denominator with {\hat{k}_i} without changing the sum, because when they differ, the numerator is 0, so it doesn’t matter. Then, this is

\displaystyle \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\hat{k}_i}\right]\leq \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha \frac{\hat{k}_i}{n}\}}{\hat{k}_i}\right].

This is an inequality because if the first numerator is 1, the second is definitely 1, but if the first is 0, the second might not be.

Now, we use the tower law to write this as

\displaystyle \sum_{i\text{ null}} \mathop{\mathbb E}\left[\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha\cdot \hat{k}_i/n\}}{k_i}\Big| p_j, j\neq i\right]\right].

That conditional expectation is just

\displaystyle \frac{\mathop{\mathbb P}\{p_i\leq \alpha\hat{k}_i/n\mid p_j, j\neq i\}}{\hat{k}_i}=\frac{\alpha \hat{k}_i/n}{\hat{k}_i}=\alpha/n.

This is a constant, and the expected value of that is just {\alpha/n}. When you add this over all the null terms, you get {\alpha\frac{n_0}{n}}, as desired. \Box

Here, we actually get a better bound than {\alpha}, which might seem like a good thing, but also means we’re not making as many discoveries as we could be. We can correct for this by estimating {n_0}, and then running BH with {\alpha\frac{n}{n_0}}, so that the FDR for this procedure will be {\leq \alpha}.

Although we used independence here, we can make do without it, albeit at the cost of a logarithmic factor.

Proposition 5. When running Benjamini-Hochberg, we have {\text{FDR}\leq \alpha\frac{n_0}{n}H_n}.

Proof: As before, we define a score {s_i} for each {p_i} that is 1 if {0\leq p_i\leq \alpha \frac{1}{n}}, {\frac{1}{2}} if {\alpha\frac{1}{n}<p_i\leq \alpha \frac{2}{n},} and so on, giving a score of 0 if {p_i>\alpha}.

We proved above that for a null {p}-value, the expected score is

\displaystyle \frac{\alpha}{n}\left(1+\cdots+\frac{1}{n}\right).

Now, we write

\displaystyle \text{FDR}=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\right].

If {p_i} is rejected, then {p_i\leq \alpha \hat{k}/n}, so its score is {\geq 1/\hat{k}}. Put another way,

\displaystyle \frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\leq s_i.


\displaystyle \text{FDR}\leq \sum_{i\text{ null}} \mathop{\mathbb E}[s_i]=\alpha\frac{n_0}{n}\left(1+\cdots+\frac{1}{n}\right).



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s