Individual differences |
Methods | Statistics | Clinical | Educational | Industrial | Professional items | World psychology |
In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. A closely related concept is the E-value, which is the average number of times in multiple testing that one expects to obtain a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. When the tests are statistically independent the E-value is the product of the number of tests and the p-value.
The lower the p-value, the less likely the result is if the null hypothesis is true, and consequently the more "significant" the result is, in the sense of statistical significance. One often accepts the alternative hypothesis, (i.e. rejects a null hypothesis) if the p-value is less than 0.05 or 0.01, corresponding to a 5% or 1% chance respectively of rejecting the null hypothesis when it is true (Type I error).
Coin flipping example
For example, an experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased (> 50% chance of one of the outcomes).
Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips. The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips. The probability that 20 flips of a fair coin would result in 14 or more heads can be computed from binomial coefficients as
This probability is the (one-sided) p-value.
Because there is no way to know what percentage of coins in the world are unfair, the p-value does not tell us whether the coin is unfair. It measures the chance that a fair coin gives such result.
Generally, one rejects the null hypothesis if the p-value is smaller than or equal to the significance level, often represented by the Greek letter α (alpha). If the level is 0.05, then results that are only 5% likely or less, given that the null hypothesis is true, are deemed extraordinary.
When we ask whether a given coin is fair, often we are interested in the deviation of our result from the equality of numbers of heads and tails. In such a case, the deviation can be in either direction, favoring either heads or tails. Thus, in this example of 14 heads and 6 tails, we may want to calculate the probability of getting a result deviating by at least 4 from parity (two-sided test). This is the probability of getting at least 14 heads or at least 14 tails. As the binomial distribution is symmetrical for a fair coin, the two-sided p-value is simply twice the above calculated single-sided p-value; i.e., the two-sided p-value is 0.115.
In the above example we thus have:
- null hypothesis (H0): fair coin;
- observation O: 14 heads out of 20 flips; and
- p-value of observation O given H0 = Prob(≥ 14 heads or ≥ 14 tails) = 0.115.
The calculated p-value exceeds 0.05, so the observation is consistent with the null hypothesis — that the observed result of 14 heads out of 20 flips can be ascribed to chance alone — as it falls within the range of what would happen 95% of the time were this in fact the case. In our example, we fail to reject the null hypothesis at the 5% level. Although the coin did not fall evenly, the deviation from expected outcome is small enough to be reported as being "not statistically significant at the 5% level".
However, had a single extra head been obtained, the resulting p-value (two-tailed) would be 0.0414 (4.14%). This time the null hypothesis – that the observed result of 15 heads out of 20 flips can be ascribed to chance alone – is rejected. Such a finding would be described as being "statistically significant at the 5% level".
Critics of p-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05). Furthermore, it is necessary to use a reasonable null hypothesis to assess the result fairly, but the choice of a null hypothesis often entails assumptions.
The data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which however does not imply that the null hypothesis is true). A small p-value that indicates statistical significance does not indicate that an alternative hypothesis is ipso facto correct; there are additional tests which may be performed in order to make a more definitive statement about the validity of the null hypothesis, such as some "goodness of fit" tests.
Despite the ubiquity of p-value tests, this particular test for statistical significance has come under heavy criticism due both to its inherent shortcomings and the potential for misinterpretation.
- The p-value is not the probability that the null hypothesis is true. (This false conclusion is used to justify the "rule" of considering a result to be significant if its p-value is very small (near zero).)
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity. This is the Jeffreys–Lindley paradox.
- The p-value is not the probability that a finding is "merely a fluke." (Again, this conclusion arises from the "rule" that small p-values indicate significant differences.)
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is subtly different from the real meaning which is that the p-value is the chance that null hypothesis explains the result: the result might not be "merely a fluke," and be explicable by the null hypothesis with confidence equal to the p-value.
- The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.
- The p-value is not the probability that a replicating experiment would not yield the same conclusion.
- 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
- The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
- The p-value does not indicate the size or importance of the observed effect (compare with effect size).
- Dallal GE (2007) Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance
- Hubbard R, Armstrong JS (2005) Historical background on the widespread confusion of the p-value (PDF)
- Fisher's method for combining independent tests of significance using their p-values
- Dallal GE (2007) The Little Handbook of Statistical Practice (A good tutorial)
- National Institutes of Health definition of E-value
- Sterne JAC, Smith GD (2001). Sifting the evidence — what's wrong with significance tests?. BMJ 322 (7280): 226–231.
- Schervish MJ (1996). P Values: What They Are and What They Are Not. The American Statistician 50 (3): 203–206.
- Free online p-values calculators for various specific tests (chi-square, Fisher's F-test, etc).
- Understanding P-values, including a Java applet that illustrates how the numerical values of p-values can give quite misleading impressions about the truth or falsity of the hypothesis under test.
Survival function - Kaplan-Meier - Logrank test - Failure rate - Proportional hazards models
|This page uses Creative Commons Licensed content from Wikipedia (view authors).|