{{StatsPsy}}

In [[statistics]], a result is called '''statistically significant''' if it is unlikely to have occurred by [[chance]]. "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.
   
The '''significance level''' of a test is a concept from traditional [[frequentist]] [[statistical hypothesis testing]]. In simple cases, it is defined as the probability of making a decision to reject the null hypothesis when the [[null hypothesis]] is actually true (a decision known as a [[Type I error]], or "false positive determination"). The decision is often made using the [[p-value]]: if the p-value is less than the significance level, then the null hypothesis is rejected. The smaller the p-value, the more significant the result is said to be.
   
In more complicated, but practically important, cases, the significance level of a test is a probability such that the probability of making a decision to reject the null hypothesis when the [[null hypothesis]] is actually true is ''no more than'' the stated probability. This allows for applications in which the probability of deciding to reject may be much smaller than the significance level for some sets of assumptions encompassed within the null hypothesis.
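
As an informal illustration of this defining property (not part of the formal definition), one can simulate many experiments in which the null hypothesis is true and check that the long-run proportion of rejections is close to the chosen significance level. The sketch below uses Python with the numpy and scipy libraries and a one-sample t-test; the particular test and parameter values are arbitrary illustrative choices.

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05          # chosen significance level
n_trials = 10000      # number of simulated experiments
n = 30                # sample size per experiment

rejections = 0
for _ in range(n_trials):
    # Data generated with true mean 0, so the null hypothesis H0: mean = 0 is true.
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:          # decision rule: reject H0 when p < alpha
        rejections += 1

# The observed Type I error rate should be close to alpha (about 0.05).
print("observed Type I error rate:", rejections / n_trials)
</pre>
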
   
== Use in practice ==
 
   
The significance level is usually denoted by the Greek letter α (alpha). Popular levels of significance are 5%, 1% and 0.1%. If a [[Statistical hypothesis testing|''test of significance'']] gives a p-value lower than the α-level, the null hypothesis is rejected. Such results are informally referred to as 'statistically significant'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence," a 0.1% level of statistical significance is being implied. The lower the significance level, the stronger the evidence.
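
A minimal worked example of this decision rule, using Python with the scipy library and entirely hypothetical data (the choice of a one-sample t-test is only for illustration):

<pre>
import numpy as np
from scipy import stats

alpha = 0.05  # significance level chosen before looking at the data

# Hypothetical measurements; H0: the population mean equals 100.
data = np.array([102.3, 99.8, 101.5, 103.1, 100.9, 102.7, 101.2, 99.5, 102.0, 101.8])

t_stat, p_value = stats.ttest_1samp(data, popmean=100.0)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject H0: the result is statistically significant at the 5% level.")
else:
    print("Do not reject H0: the result is not statistically significant at the 5% level.")
</pre>
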
 
   
In some situations it is convenient to express the statistical significance as 1 − α. In general, when interpreting a stated significance, one must be careful to note what, precisely, is being tested statistically.
   
Different α-levels have different advantages and disadvantages. Smaller α-levels give greater confidence in the determination of significance, but run greater risks of failing to reject a false null hypothesis (a [[Type II error]], or "false negative determination"), and so have less [[statistical power]]. The selection of an α-level inevitably involves a compromise between significance and power, and consequently between the [[Type I error]] and the [[Type II error]].
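
The compromise can be made concrete by simulation. The following rough sketch (assuming Python with numpy and scipy, and an arbitrary true effect size and sample size) shows that lowering α from 5% to 1% reduces the rate of false positives under the null hypothesis but also reduces the power to detect a real effect:

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_trials = 25, 5000
true_effect = 0.5   # hypothetical true mean under the alternative hypothesis

def rejection_rate(true_mean, alpha):
    """Fraction of simulated experiments in which H0: mean = 0 is rejected at level alpha."""
    count = 0
    for _ in range(n_trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        count += p < alpha
    return count / n_trials

for alpha in (0.05, 0.01):
    print("alpha =", alpha,
          "| Type I error rate:", rejection_rate(0.0, alpha),
          "| power:", rejection_rate(true_effect, alpha))
</pre>
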
   
In some fields, for example nuclear and particle physics, it is common to express statistical significance in units of "&sigma;" (sigma), the [[standard deviation]] of a [[Gaussian distribution]]. A statistical significance of "<math>n\sigma</math>" can be converted into a value of &alpha; via use of the [[error function]]:
   
:<math>\alpha = 1 - \operatorname{erf}(n/\sqrt{2})</math>
   
The use of &sigma; is motivated by the ubiquitous emergence of the Gaussian distribution in measurement uncertainties. For example, if a theory predicts a parameter to have a value of, say, 100, and one measures the parameter to be 109 &plusmn; 3, then one might report the measurement as a "3&sigma; deviation" from the theoretical prediction. In terms of &alpha;, this statement is equivalent to saying that "assuming the theory is true, the likelihood of obtaining the experimental result by coincidence is 0.27%" (since 1&nbsp;&minus;&nbsp;erf(3/&radic;2) = 0.0027).
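
This conversion is straightforward to compute; for example, the following short Python sketch (using only the standard library) reproduces the 0.27% figure for 3σ:

<pre>
from math import erf, sqrt

def sigma_to_alpha(n_sigma):
    """Significance level alpha corresponding to an n-sigma deviation: 1 - erf(n / sqrt(2))."""
    return 1.0 - erf(n_sigma / sqrt(2.0))

for n in (1, 2, 3, 5):
    print(f"{n} sigma -> alpha = {sigma_to_alpha(n):.2e}")
# 3 sigma gives alpha = 2.70e-03, i.e. the 0.27% quoted above.
</pre>
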
 
   
Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, modern statistical advice is that, where the outcome of a test is essentially the final outcome of an experiment or other study, the p-value should be quoted explicitly. Importantly, it should be quoted whether or not it is judged to be significant, so that the maximum amount of information can be transferred from a summary of the study into [[meta-analyses]].
== Pitfalls ==
A common misconception is that a statistically significant result is always of practical significance, or demonstrates a large effect in the population. Unfortunately, this problem is commonly encountered in scientific writing. Given a sufficiently large sample, extremely small and non-notable differences can be found to be statistically significant, and statistical significance says nothing about the practical significance of a difference.
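
This can be illustrated by a simple simulation (a sketch only, with arbitrary numbers, assuming Python with numpy and scipy): a difference in means of 0.02 standard deviations is negligible for most practical purposes, yet with a sufficiently large sample it is reliably declared statistically significant.

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200_000                          # very large sample per group
group_a = rng.normal(0.00, 1.0, n)
group_b = rng.normal(0.02, 1.0, n)   # true difference of only 0.02 standard deviations

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("difference in sample means:", group_b.mean() - group_a.mean())
print("p-value:", p_value)           # typically far below 0.05 despite the tiny effect
</pre>
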
One of the more common problems in significance testing is the tendency for [[multiple comparisons]] to yield spurious significant differences even where the null hypothesis is true. For instance, in a study involving twenty comparisons, each tested at an α-level of 5%, on average one comparison will yield a spuriously significant result even if the null hypothesis is true for every comparison. In such cases, p-values are adjusted in order to control either the [[false discovery rate]] or the [[familywise error rate]].
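
The scale of the problem, and one simple remedy, can be sketched as follows (an illustrative simulation assuming Python with numpy and scipy; the Bonferroni adjustment shown is only one of several possible corrections):

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, m, n = 0.05, 20, 30        # 20 comparisons, all with a true null hypothesis

# Probability of at least one false positive if each test is run at alpha:
print("familywise error rate without correction:", 1 - (1 - alpha) ** m)   # about 0.64

p_values = []
for _ in range(m):
    sample = rng.normal(0.0, 1.0, n)          # H0 (mean 0) is true for every comparison
    p_values.append(stats.ttest_1samp(sample, popmean=0.0).pvalue)

print("significant at alpha:", sum(p < alpha for p in p_values))
# Bonferroni adjustment: test each comparison at alpha / m to control the familywise error rate.
print("significant after Bonferroni correction:", sum(p < alpha / m for p in p_values))
</pre>
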
An additional problem is that [[frequentist]] analyses of p-values are considered by some to overstate "statistical significance".<ref name=Goodman1999a>{{cite journal | author = Goodman S | title = Toward evidence-based medical statistics. 1: The P value fallacy. | journal = Ann Intern Med | volume = 130 | issue = 12 | pages = 995-1004 | year = 1999 | pmid = 10383371}}</ref><ref name=Goodman1999b>{{cite journal | author = Goodman S | title = Toward evidence-based medical statistics. 2: The Bayes factor. | journal = Ann Intern Med | volume = 130 | issue = 12 | pages = 1005-13 | year = 1999 | pmid = 10383350}}</ref> See [[Bayes factor]] for details.
Yet another common pitfall often happens when a researcher writes the ambiguous statement "we found no statistically significant difference," which is then misquoted by others as "they found that there was no difference." Actually, statistics cannot be used to prove that there is exactly zero difference between two populations. Failing to find evidence that there is a difference does not constitute evidence that there is no difference. This principle is sometimes described by the maxim "Absence of evidence is not evidence of absence."
According to [[J. Scott Armstrong]], attempts to educate researchers on how to avoid pitfalls of using statistical significance have had little success. In the papers "Significance Tests Harm Progress in Forecasting,"<ref>{{cite journal | author = Armstrong, J. Scott | title = Significance tests harm progress in forecasting | journal = International Journal of Forecasting | volume = 23 | pages = 321-327 | year = 2007 | doi = 10.1016/j.ijforecast.2007.03.004}} [http://dx.doi.org/10.1016/j.ijforecast.2007.03.004 Full Text]</ref> and "Statistical Significance Tests are Unnecessary Even When Properly Done,"<ref>{{cite journal | author = Armstrong, J. Scott | title = Statistical Significance Tests are Unnecessary Even When Properly Done | journal = International Journal of Forecasting | volume = 23 | pages = 335-336 | year = 2007 | doi = 10.1016/j.ijforecast.2007.01.010}} [http://dx.doi.org/10.1016/j.ijforecast.2007.01.010 Full Text]</ref> Armstrong makes the case that, even when done properly, statistical significance tests are of no value, noting that a number of attempts have failed to find empirical evidence supporting their use. He argues that tests of statistical significance harm the development of scientific knowledge because they distract researchers from the use of proper methods, and he suggests that authors avoid such tests altogether; instead, they should report [[effect size]]s, [[confidence intervals]], [[replication (statistics)|replication]]s/[[extensions]], and [[meta-analyses]].
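
As a rough sketch of the kind of reporting Armstrong advocates, the following Python example (hypothetical data, assuming numpy and scipy) computes an effect size (Cohen's d) and a confidence interval for a difference in means rather than only a p-value:

<pre>
import numpy as np
from scipy import stats

# Hypothetical measurements from two groups.
a = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1])
b = np.array([5.6, 5.4, 5.8, 5.5, 5.9, 5.3, 5.7, 5.6])

diff = b.mean() - a.mean()
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd                     # standardised effect size

# 95% confidence interval for the difference in means (equal-variance approximation).
se = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))
df = len(a) + len(b) - 2
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

print("difference in means:", diff)
print("Cohen's d:", cohens_d)
print("95% CI for the difference:", ci)
</pre>
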
== Signal–noise ratio conceptualisation of significance ==
Statistical significance can be thought of as the confidence one has in a given result. In a comparison study, it depends on the relative difference between the groups compared, the number of measurements, and the noise associated with those measurements. In other words, the confidence one has in a given result being non-random (i.e. not a consequence of [[chance]]) depends on the [[signal-to-noise ratio]] (SNR) and the sample size.
Expressed mathematically, the confidence that a result is not by random chance is given by the following formula by Sackett:<ref>Sackett DL. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!). CMAJ. 2001 Oct 30;165(9):1226-37. PMID 11706914. [http://www.cmaj.ca/cgi/content/full/165/9/1226 Free Full Text].</ref>
:<math>\mathrm{confidence} = \frac{\mathrm{signal}}{\mathrm{noise}} \times \sqrt{\mathrm{sample\ size}}.</math>
For clarity, the above formula is presented in tabular form below.
'''Dependence of confidence on noise, signal and sample size (tabular form)'''
{| class="wikitable"
!Parameter
!Parameter increases
!Parameter decreases
|-
|Noise
|Confidence decreases
|Confidence increases
|-
|Signal
|Confidence increases
|Confidence decreases
|-
|Sample size
|Confidence increases
|Confidence decreases
|}

In words, confidence is high if the noise is low and/or the sample size is large and/or the [[effect size]] (signal) is large. The confidence of a result (and its associated [[confidence interval]]) is ''not'' dependent on effect size alone: if the sample size is large and the noise is low, even a small effect size can be measured with great confidence. Whether a small effect size is considered important depends on the context of the events compared.
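
A direct numerical reading of Sackett's formula, with arbitrary illustrative numbers, shows these dependencies:

<pre>
from math import sqrt

def confidence(signal, noise, sample_size):
    """Sackett's rule of thumb: confidence = (signal / noise) * sqrt(sample size)."""
    return (signal / noise) * sqrt(sample_size)

# With signal and noise held fixed, quadrupling the sample size doubles the confidence;
# halving the noise has the same effect as quadrupling the sample size.
print(confidence(signal=2.0, noise=5.0, sample_size=25))    # 2.0
print(confidence(signal=2.0, noise=5.0, sample_size=100))   # 4.0
print(confidence(signal=2.0, noise=2.5, sample_size=25))    # 4.0
</pre>
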
In medicine, small effect sizes (reflected by small increases of risk) are often considered clinically relevant and are frequently used to guide treatment decisions (if there is great confidence in them). Whether a given treatment is considered a worthy endeavour is dependent on the risks, benefits and costs.
==See also==
* [[A/B testing]]
* [[ABX test]]
* [[Fisher's method]] for combining [[statistical independence|independent]] [[statistical hypothesis testing|test]]s of significance
* [[Burden of proof|Reasonable doubt]]
==References==
{{reflist}}
==External links==
* Raymond Hubbard, M.J. Bayarri, ''[http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf P Values are not Error Probabilities]''. A working paper that explains the difference between Fisher's evidential p-value and the Neyman-Pearson Type I error rate <math>\alpha</math>.
*[http://www.ericdigests.org/1995-1/testing.htm The Concept of Statistical Significance Testing] - Article by Bruce Thompson of the ERIC Clearinghouse on Assessment and Evaluation, Washington, D.C.
 
{{Statistics}}
[[Category:Hypothesis testing]]
<!--
[[de:Statistische Signifikanz]]
[[ko:유의수준]]
[[he:מובהקות סטטיסטית]]
[[lt:Reikšmingumo lygmuo]]
[[nl:Significantie]]
[[ja:有意]]
[[pl:Poziom istotności]]
[[pt:Significância estatística]]
[[su:Statistical significance]]
[[es:Significatividad estadística]]
[[sv:Signifikans]]
[[zh:显著性差异]]
[[ru:Статистическая значимость]]
-->
 
{{enWP|Statistical_significance}}
 