Multiple testing

In statistics, "multiple testing" refers to the potential increase in the probability of a Type I error (a false positive) that occurs when statistical tests are applied repeatedly, for example when making multiple comparisons to test null hypotheses stating that the means of several disjoint populations are equal to each other (homogeneous).

Intuitively, even if a particular outcome of an experiment is very unlikely to happen, repeating the experiment many times increases the probability that the outcome appears at least once. As an example, if a coin is tossed 10 times and lands on tails all 10 times, this will usually be considered evidence that the coin is biased, because the probability of observing such a series is very low for a fair coin (2^(−10) ≈ 10^(−3)). However, if the same series of ten tails in a row appears as part of 10,000 tosses with the same coin, it is more likely to be seen as a random fluctuation in the long series of tosses.
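The coin example can be checked numerically. The following is a minimal sketch, assuming a fair coin and independent tosses; the function name prob_run_of_tails and its parameters are illustrative and do not come from any library. It computes the exact probability that a run of consecutive tails appears at least once, by tracking the length of the current streak of tails.

```python
def prob_run_of_tails(n_tosses, run_length, p_tail=0.5):
    """Probability of at least one run of `run_length` tails in `n_tosses` tosses."""
    # state[j] = probability that no full run has occurred so far and the
    # sequence currently ends in exactly j consecutive tails.
    state = [0.0] * run_length
    state[0] = 1.0
    for _ in range(n_tosses):
        new_state = [0.0] * run_length
        # A head resets the current streak to zero.
        new_state[0] = sum(state) * (1.0 - p_tail)
        # A tail extends the streak; a streak reaching run_length leaves the
        # table, which is how completed runs are accounted for.
        for j in range(run_length - 1):
            new_state[j + 1] = state[j] * p_tail
        state = new_state
    return 1.0 - sum(state)


if __name__ == "__main__":
    print(prob_run_of_tails(10, 10))      # ten tosses, all tails
    print(prob_run_of_tails(10_000, 10))  # same run somewhere in 10,000 tosses
```

Under these assumptions the first value equals 2^(−10), while the second is close to 1, which is why ten tails in a row within a long series is unremarkable.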

Experimentwise significance level
If the significance level for a single test is α, the experimentwise significance level (the probability of at least one false positive across all tests) grows as the number of tests increases and approaches 1. More precisely, assuming all tests are independent and every null hypothesis is true, if n tests are performed, the experimentwise significance level is given by 1 − (1 − α)^n.
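To illustrate how quickly this quantity grows, the sketch below evaluates the formula for a few values of n. It assumes independent tests, as in the text; the function name experimentwise_level is hypothetical.

```python
def experimentwise_level(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 50, 100):
        print(n, round(experimentwise_level(0.05, n), 4))
```

With α = 0.05, ten independent tests already give an experimentwise level of roughly 0.40.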

Thus, in order to retain the same overall rate of false positives in a series of multiple tests, the standard for each individual test must be more stringent. Intuitively, dividing the allowable error (alpha) for each comparison by the number of comparisons results in an overall alpha which does not exceed the desired limit, and this can be proved mathematically. For instance, to obtain the usual overall alpha of 0.05 with ten tests, requiring an alpha of 0.005 for each test guarantees an overall alpha which does not exceed 0.05.
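The adjustment described above is commonly known as the Bonferroni correction (a name not used in the text itself). The sketch below works through the ten-test example; the function names are illustrative, and the overall level is computed under the additional assumption that the tests are independent.

```python
def per_test_alpha(overall_alpha, n_tests):
    """Per-test threshold obtained by dividing alpha by the number of tests."""
    return overall_alpha / n_tests


def overall_level_if_independent(alpha_each, n_tests):
    """Overall false positive probability, assuming independent tests."""
    return 1.0 - (1.0 - alpha_each) ** n_tests


if __name__ == "__main__":
    alpha_each = per_test_alpha(0.05, 10)
    print(alpha_each)                                    # threshold per test
    print(overall_level_if_independent(alpha_each, 10))  # stays below 0.05
```

In the independent case the resulting overall level is slightly below 0.05 (about 0.049), consistent with the bound stated above.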

However, it can also be demonstrated that this technique may be conservative (depending on the correlation structure), i.e. it may in practice yield a true overall alpha substantially below 0.05. This raises the rate of false negatives, so that an unnecessarily high percentage of real differences in the data go undetected. This can have important real-world consequences; for instance, it may result in failure to approve a drug which is in fact superior to existing drugs, thereby both depriving the world of an improved therapy and causing the drug company to lose the substantial investment in research and development made up to that point. For this reason, a great deal of attention has been paid to developing better techniques for multiple testing, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily.
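The dependence on the correlation structure can be illustrated with a small simulation. This is a rough sketch under assumptions that are not in the text: every null hypothesis is true, each test statistic is standard normal, all pairs of statistics share the same correlation rho, and each test is two-sided at level alpha divided by the number of tests. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_overall_alpha(n_tests, alpha, rho, n_trials=20000, seed=0):
    """Estimate the true overall alpha when each test uses alpha / n_tests."""
    rng = random.Random(seed)
    # Two-sided critical value for the per-test level alpha / n_tests.
    critical = NormalDist().inv_cdf(1.0 - (alpha / n_tests) / 2.0)
    shared_weight = rho ** 0.5
    own_weight = (1.0 - rho) ** 0.5
    trials_with_false_positive = 0
    for _ in range(n_trials):
        # A shared component induces correlation rho between every pair of
        # test statistics while keeping each one standard normal.
        shared = rng.gauss(0.0, 1.0)
        if any(
            abs(shared_weight * shared + own_weight * rng.gauss(0.0, 1.0)) > critical
            for _test in range(n_tests)
        ):
            trials_with_false_positive += 1
    return trials_with_false_positive / n_trials


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, simulated_overall_alpha(10, 0.05, rho))
```

In this setup the estimated overall alpha falls further below 0.05 as rho increases, which is the conservativeness described above.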

Large-scale multiple testing
For large-scale multiple testing (as is very common, for example, in genomics when using technologies such as DNA microarrays), one can instead control the false discovery rate (FDR), defined as the expected proportion of false positives among all tests declared significant.
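One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure; the text above does not name a specific method, so it is shown here only as an illustration. The sketch assumes a list of p-values from independent tests and a target FDR level q; the function name and the example p-values are made up for demonstration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis whose p-value ranks at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    example_p_values = [0.9, 0.035, 0.001, 0.03]  # illustrative values only
    print(benjamini_hochberg(example_p_values, q=0.05))
```

Note that a p-value can be rejected even if it exceeds its own rank threshold, provided a larger p-value further down the ordering meets its threshold; in the example, 0.03 is rejected because 0.035 passes at rank 3.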