Multiple comparisons

In statistics, the multiple comparisons problem occurs when a number of independent observations are each subjected to the acceptance criterion that would be appropriate for a single event considered on its own.

Typically, an acceptance criterion for a single event takes the form of a requirement that the observed data be highly unlikely under a default assumption (the null hypothesis). As the number of independent applications of the criterion grows, the small chance of a false alarm in each individual test accumulates, and it becomes increasingly likely that some data will satisfy the criterion by chance alone (even if the default assumption is true in all cases). These errors are called false positives because they positively identify a set of observations as satisfying the acceptance criterion although the data in fact arise under the null hypothesis. Many mathematical techniques have been developed to control the false positive error rate associated with making multiple statistical comparisons.

Flipping coins
For example, one might declare a coin biased if it landed heads at least 9 times in 10 flips. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is 11/2^10 ≈ 0.0107. This is relatively unlikely, and under most statistical criteria (such as p-value < 0.05), one would declare that the null hypothesis should be rejected, i.e. that the coin is unfair.
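The figure 11/2^10 can be checked by summing the binomial probabilities for 9 and 10 heads. The following is a minimal sketch; the function name `binomial_upper_tail` is chosen here for illustration and does not come from the text.

```python
from math import comb


def binomial_upper_tail(n_flips: int, min_heads: int, p_heads: float = 0.5) -> float:
    """Probability of at least `min_heads` heads in `n_flips` independent flips."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(min_heads, n_flips + 1)
    )


# A fair coin: 10 ways to get exactly 9 heads plus 1 way to get 10, out of 2^10.
p_single = binomial_upper_tail(10, 9)
print(f"{p_single:.4f}")  # 0.0107
```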

A multiple comparisons problem arises if one wants to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Because the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, seeing a particular (i.e. pre-selected) coin do so would still be very unlikely, but seeing at least one of the 100 coins do so would be more likely than not. Precisely, the probability that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34, so the probability that at least one fair coin is flagged is about 0.66. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would more likely than not falsely identify a fair coin as unfair.
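The 100-coin argument can also be checked empirically. The sketch below simulates many batches of 100 fair coins and counts how often at least one coin is flagged; the function name, the number of trials and the seed are arbitrary choices made for this illustration.

```python
import random


def simulate_any_flagged(n_coins=100, n_flips=10, min_heads=9,
                         n_trials=10_000, seed=0):
    """Estimate the chance that at least one fair coin is flagged as biased."""
    rng = random.Random(seed)
    flagged_batches = 0
    for _ in range(n_trials):
        for _ in range(n_coins):
            # Each of the n_flips random bits is one fair flip; 1 means heads.
            heads = bin(rng.getrandbits(n_flips)).count("1")
            if heads >= min_heads:
                flagged_batches += 1
                break  # one flagged coin is enough for this batch
    return flagged_batches / n_trials


exact = 1 - (1 - 11 / 1024) ** 100
estimate = simulate_any_flagged()
print(f"exact={exact:.3f}")  # exact=0.660
print(f"simulated={estimate:.3f}")
```

The simulated value should land close to the exact one, and both are well above one half.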

Formalism
Technically, the problem of multiple comparisons (also known as the multiple testing problem) can be described as the potential increase in Type I error that occurs when statistical tests are used repeatedly: if n independent comparisons are performed, the experiment-wide significance level α (alpha) is given by
 * $$ \alpha = 1-\left( 1-\alpha_\mathrm{per\ comparison} \right)^{n} $$

and it increases as the number of comparisons increases.
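To see how quickly the experiment-wide level grows, the formula can be evaluated for a few values of n. This is a small sketch; the function name `experimentwise_alpha` and the per-comparison level of 0.05 are assumptions made for the example.

```python
def experimentwise_alpha(alpha_per_comparison: float, n_comparisons: int) -> float:
    """Chance of at least one false positive among n independent comparisons."""
    return 1 - (1 - alpha_per_comparison) ** n_comparisons


# Tabulate the growth for a per-comparison level of 0.05.
for n in (1, 5, 10, 20, 100):
    print(n, round(experimentwise_alpha(0.05, n), 3))
```

With ten independent comparisons at 0.05 each, the experiment-wide level is already about 0.40.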

Methods
In order to retain the same overall rate of false positives (rather than a higher rate) in a test involving more than one comparison, the standards for each comparison must be more stringent. Intuitively, dividing the allowable error (alpha) for each comparison by the number of comparisons yields an overall alpha that does not exceed the desired limit, and this can be proved mathematically; this adjustment is known as the Bonferroni correction. For instance, to keep the overall alpha at the usual 0.05 with ten comparisons, each comparison must be tested at an alpha of 0.005.
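The proof is usually given with the union bound (Boole's inequality). As a brief sketch, let A_i denote the event that comparison i yields a false positive (a symbol introduced here for illustration), with each comparison tested at level α/n. Then, whether or not the comparisons are independent,

$$ P\left( \bigcup_{i=1}^{n} A_i \right) \le \sum_{i=1}^{n} P(A_i) \le n \cdot \frac{\alpha}{n} = \alpha $$

so the probability of at least one false positive cannot exceed α.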

However, it can be demonstrated that this technique is conservative, and often overly so when the comparisons are positively correlated: the true overall alpha can then be well below 0.05, which raises the proportion of false negatives, so that an unnecessarily high percentage of real differences in the data go undetected.
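A quick calculation gives a sense of how conservative the divided threshold can be. The sketch below considers two extreme cases, both assumptions made for illustration: ten fully independent comparisons, and ten comparisons so strongly dependent that they always agree. The function names are hypothetical.

```python
def bonferroni_threshold(alpha_overall: float, n_comparisons: int) -> float:
    """Per-comparison level obtained by dividing the overall level by n."""
    return alpha_overall / n_comparisons


def overall_alpha_if_independent(alpha_per_comparison: float, n_comparisons: int) -> float:
    """True overall level when all comparisons are independent."""
    return 1 - (1 - alpha_per_comparison) ** n_comparisons


per_test = bonferroni_threshold(0.05, 10)
independent = overall_alpha_if_independent(per_test, 10)
# If all ten tests always reject or accept together, they behave like one test.
fully_dependent = per_test

print(f"{per_test:.4f}")         # 0.0050
print(f"{independent:.4f}")      # 0.0489
print(f"{fully_dependent:.4f}")  # 0.0050
```

Under independence the shortfall is small, but as the comparisons become more strongly correlated the true overall level drifts toward the per-comparison level, ten times stricter than intended.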

This can have important real-world consequences; for instance, it may result in failure to approve a drug which is in fact superior to existing drugs, thereby both depriving the world of an improved therapy and causing the drug company to lose the substantial investment in research and development made up to that point. Similarly, in fMRI the correction is extremely conservative because tests are performed over 100,000 voxels in the brain, which demands unrealistically low per-test significance thresholds. For this reason, a great deal of attention has been paid to developing better techniques for multiple comparisons, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:
 * Methods where total alpha can be proved never to exceed 0.05 (or another chosen value) under any conditions. These methods provide "strong" control against Type I error under all conditions, including when only some of the null hypotheses are true.
 * Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
 * Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA/Tukey range test before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
 * Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.

The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the last category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.
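As one illustration of exhaustive permutation resampling, the sketch below implements a single-step "maximum statistic" adjustment for two groups measured on several variables. This is only one well-known approach, not necessarily the one the text has in mind. The function name and the toy data are invented for the example, and raw mean differences are used for brevity where standardized statistics would normally be preferred.

```python
from itertools import combinations
from statistics import mean


def max_stat_adjusted_p(group_a, group_b):
    """Adjusted p-value per variable from an exhaustive relabelling of subjects.

    Each group is a list of subjects; each subject is a list with one
    measurement per variable.
    """
    pooled = group_a + group_b
    n_total, n_a, n_vars = len(pooled), len(group_a), len(pooled[0])

    def abs_mean_diffs(indices_a):
        in_a = set(indices_a)
        diffs = []
        for v in range(n_vars):
            a = [pooled[i][v] for i in range(n_total) if i in in_a]
            b = [pooled[i][v] for i in range(n_total) if i not in in_a]
            diffs.append(abs(mean(a) - mean(b)))
        return diffs

    observed = abs_mean_diffs(range(n_a))
    # Largest statistic across all variables, for every possible relabelling.
    max_stats = [max(abs_mean_diffs(c)) for c in combinations(range(n_total), n_a)]
    # Share of relabellings whose maximum reaches each observed statistic
    # (a small tolerance guards against floating-point ties).
    return [sum(m >= obs - 1e-12 for m in max_stats) / len(max_stats)
            for obs in observed]


# Made-up data: variable 0 separates the groups, variable 1 does not.
group_a = [[10, 5], [11, 6], [12, 5], [13, 6]]
group_b = [[1, 6], [2, 5], [3, 6], [4, 5]]
adjusted = max_stat_adjusted_p(group_a, group_b)
print([round(p, 4) for p in adjusted])  # [0.0286, 1.0]
```

Because every variable is compared against the distribution of the maximum over all variables, the correlation structure of the data is taken into account automatically.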

Post hoc testing of ANOVAs
Multiple comparison procedures are commonly used after obtaining a significant omnibus test, such as the ANOVA F-test. A significant ANOVA result suggests rejecting the global null hypothesis H0: "the means are the same". Multiple comparison procedures are then used to determine which means differ from each other.

Comparing K means involves K(K − 1)/2 pairwise comparisons.
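The count K(K − 1)/2 is simply the number of unordered pairs of groups, which can be enumerated directly. This is a minimal sketch; the group labels and the function name are placeholders chosen for the example.

```python
from itertools import combinations


def pairwise_comparisons(group_names):
    """All unordered pairs of groups, one per pairwise comparison."""
    return list(combinations(group_names, 2))


groups = ["A", "B", "C", "D"]  # K = 4 hypothetical group labels
pairs = pairwise_comparisons(groups)
print(len(pairs))  # 6, i.e. 4 * 3 / 2
print(pairs[0])    # ('A', 'B')
```

The number of pairs grows quadratically with K, which is why post hoc pairwise testing quickly becomes a multiple comparisons problem.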


 * The nonparametric Friedman test is useful when several treatments are compared on the same subjects or blocks.
 * The Nemenyi test is similar to the Tukey test used after ANOVA.
 * The Bonferroni-Dunn test allows comparisons with a control.