34,421 Pages

The power of a statistical test is the probability that the test will reject the null hypothesis when the alternative hypothesis is true (i.e. that it will not make a Type II error). As power increases, the chances of a Type II error decrease. The probability of a Type II error is referred to as the false negative rate (β). Therefore power is equal to 1 − β.

Power analysis can be used to calculate the minimum sample size required to accept the outcome of a statistical test with a particular level of confidence. It can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical tests: for example, between a parametric and a nonparametric test of the same hypothesis.

## Background

Statistical tests use data from samples to assess, or make inferences about, a population. In the concrete setting of a two-sample comparison, the goal is to assess whether the mean values of some attribute obtained for individuals in two sub-populations differ. For example, to test the null hypothesis that the mean scores of men and women on a test do not differ, samples of men and women are drawn, the test is administered to them, and the mean score of one group is compared to that of the other group using a statistical test such as the two-sample Z-test. The power of the test is the probability that the test will find a statistically significant difference between men and women, as a function of the size of the true difference between those two populations. Note that power is the probability of finding a difference that does exist, as opposed to the likelihood of declaring a difference that does not exist (which is known as a Type I error).

## Factors influencing power

Statistical power may depend on a number of factors. Some of these factors may be particular to a specific testing situation, but at a minimum, power nearly always depends on the following two factors:

A significance criterion is a statement of how unlikely a result must be, if the null hypothesis is true, to be considered significant. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of obtaining the observed effect when the null hypothesis is true must be less than 0.05, and so on. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion. This increases the chance of obtaining a statistically significant result (rejecting the null hypothesis) when the null hypothesis is false, that is, reduces the risk of a Type II error. But it also increases the risk of obtaining a statistically significant result when the null hypothesis is true; that is, it increases the risk of a Type I error.

The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be be a direct estimate of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means Y − X would be a direct measure of the effect size, whereas (Y − X)/σ where σ is the common standard deviation of the outcomes in the treated and control groups, would be a standardized effect size. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size will rarely be sufficient to determine the power, as it does not contain information about the variability in the measurements.

The precision with which the data are measured often influences the power. Power can often be improved by reducing the measurement error in the data. A related concept is to improve the "reliability" of the measure being assessed (as in psychometric reliability).

The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and Analysis of Variance, there is an extensive theory, and practical strategies, for improving the power based on optimally setting the values of the independent variables in the model.

## Interpretation

Although there are no formal standards for power, most researchers assess the power of their tests using 0.80 as a standard for adequacy.

There are times when the recommendations of power analysis regarding sample size will be inadequate. Power analysis is appropriate when the concern is with the correct acceptance or rejection of a null hypothesis. In many contexts, the issue is less about determining if there is or is not a difference but rather with getting a more refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of 20 will give us approximately 80% power (alpha = .05, two-tail) to reject the null hypothesis of zero correlation. However, in doing this study we are probably more interested in knowing whether the correlation is .30 or .60 or .50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. Techniques similar to those employed in a traditional power analysis can be used to determine the sample size required for the width of a confidence interval to be less than a given value.

Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities is a "nuisance parameter." In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory," there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.

Any statistical analysis involving multiple hypotheses is subject to inflation of the type I error rate if appropriate measures are not taken. Such measures typically involve applying a higher threshold of stringency to reject a hypothesis in order to compensate for the multiple comparisons being made (e.g. as in the Bonferroni method). In this situation, the power analysis should reflect the multiple testing approach to be used. Thus, for example, a given study may be well powered to detect a certain effect size when only one test is to be made, but the same effect size may have much lower power if several tests are to be performed.

## A priori vs. post hoc analysis

Power analysis can either be done before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data is collected. A priori power analysis is conducted prior to the research study, and is typically used to determine an appropriate sample size to achieve adequate power. Post-hoc power analysis is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, the usefulness of retrospective techniques is controversial .

## Application

Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis, for example to determine the minimum number of animal test subjects needed for an experiment. If a study is inadequately powered, then, in frequentist statistics, there is little point in completing the research, as it is unlikely to allow one to choose between hypotheses at the desired significance level. By contrast, in Bayesian statistics, any properly-conducted experiment is valuable, as the data is used in the context of all data collected, and allows one to update one's beliefs via Bayesian inference, regardless of how little is collected. However, even in Bayesian statistics, power is a useful measure of how much a given experiment size can be expected to refine one's beliefs.

## Example

Suppose we plan to compare research subjects in terms of a quantity that is measured before and after a treatment, analyzing the data using a paired t-test. Let Bi, Ai denote the pre-treatment and post-treatment measures on subject i. In the paired t-test, we let Di = Ai −Bi, then proceed by analyzing D as in a one-sample t-test. Begin by computing the sample variance of the Di, which estimates the corresponding population variance . The one-sided test for the alternative hypothesis ED >0 rejects the null hypothesis if where n is the sample size, is the average of the Di, and 1.64 is the approximate decision threshold for a level 0.05 test based on a normal approximation to the test statistic.

Now suppose that the alternative hypothesis is true and ED = τ. Then the power is Since approximately follows a standard normal distribution when the alternative hypothesis is true, the approximate power can be calculated as Note that according to this formula, as either n or τ increase, the power increases, whereas if σD (and hence its sample-based estimate) increase, the power will decrease.