Assessment |
Biopsychology |
Comparative |
Cognitive |
Developmental |
Language |
Individual differences |
Personality |
Philosophy |
Social |
Methods |
Statistics |
Clinical |
Educational |
Industrial |
Professional items |
World psychology |
Statistics: Scientific method · Research methods · Experimental design · Undergraduate statistics courses · Statistical tests · Game theory · Decision theory
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau (τ) coefficient, is a statistic used to measure the association between two measured quantities. A tau test is a non-parametric hypothesis test which uses the coefficient to test for statistical dependence.
Specifically, it is a measure of rank correlation: that is, the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,^{[1]} though Gustav Fechner had proposed a similar measure in the context of time series in 1897.^{[2]}
DefinitionEdit
Let (x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{n}, y_{n}) be a set of joint observations from two random variables X and Y respectively, such that all the values of (x_{i}) and (y_{i}) are unique. Any pair of observations (x_{i}, y_{i}) and (x_{j}, y_{j}) are said to be concordant if the ranks for both elements agree: that is, if both x_{i} > x_{j} and y_{i} > y_{j} or if both x_{i} < x_{j} and y_{i} < y_{j}. They are said to be discordant, if x_{i} > x_{j} and y_{i} < y_{j} or if x_{i} < x_{j} and y_{i} > y_{j}. If x_{i} = x_{j} or y_{i} = y_{j}, the pair is neither concordant nor discordant.
The Kendall τ coefficient is defined as:
- $ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1) } . $^{[3]}
PropertiesEdit
The denominator is the total number of pairs, so the coefficient must be in the range −1 ≤ τ ≤ 1.
- If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
- If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
- If X and Y are independent, then we would expect the coefficient to be approximately zero.
Hypothesis testEdit
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y.
Under a null hypothesis of X and Y being independent, the sampling distribution of τ will have an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance
- $ \frac{2(2n+5)}{9n (n-1)} $.^{[4]}
Accounting for tiesEdit
This article needs additional citations for verification. Please help improve this article by adding reliable references. Unsourced material may be challenged and removed. (June 2010) |
A pair {(x_{i}, y_{i}), (x_{j}, y_{j})} is said to be tied if x_{i} = x_{j} or y_{i} = y_{j}; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [-1, 1]:
Tau-aEdit
Tau-a statistic tests the strength of association of the cross tabulations. Both variables have to be ordinal. Tau-a will not make any adjustment for ties.
Tau-bEdit
Tau-b statistic, unlike tau-a, makes adjustments for ties and is suitable for square tables. Values of tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
The Kendall tau-b coefficient is defined as:
- $ \tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}} $
where
- $ \begin{array}{ccl} n_0 & = & n(n-1)/2\\ n_1 & = & \sum_i t_i (t_i-1)/2 \\ n_2 & = & \sum_j u_j (u_j-1)/2 \\ t_i & = & \mbox{Number of tied values in the } i^{th} \mbox{ group of ties for the first quantity} \\ u_j & = & \mbox{Number of tied values in the } j^{th} \mbox{ group of ties for the second quantity} \end{array} $
Tau-cEdit
Tau-c differs from tau-b as in being more suitable for rectangular tables than for square tables.
Significance testsEdit
When two quantities are statistically independent, the distribution of $ \tau $ is not easily characterizable in terms of known distributions. However, for $ \tau_A $ the following statistic, $ z_A $, is approximately characterized by a standard normal distribution when the quantities are statistically independent:
- $ z_A = {3 (n_c - n_d) \over \sqrt{n(n-1)(2n+5)/2} } $
Thus, if you want to test whether two quantities are statistically dependent, compute $ z_A $, and find the cumulative probability for a standard normal distribution at $ -|z_A| $. For a 2-tailed test, multiply that number by two and this gives you the p-value. If the p-value is below your acceptance level (typically 5%), you can reject the null hypothesis that the quantities are statistically independent and accept the hypothesis that they are dependent.
Numerous adjustments should be added to $ z_A $ when accounting for ties. The following statistic, $ z_B $, provides an approximation coinciding with the $ \tau_B $ distribution and is again approximately characterized by a standard normal distribution when the quantities are statistically independent:
- $ z_B = {n_c - n_d \over \sqrt{ v } } $
where
- $ \begin{array}{ccl} v & = & (v_0 - v_t - v_u)/18 + v_1 + v_2 \\ v_0 & = & n (n-1) (2n+5) \\ v_t & = & \sum_i t_i (t_i-1) (2 t_i+5)\\ v_u & = & \sum_j u_j (u_j-1)(2 u_j+5) \\ v_1 & = & \sum_i t_i (t_i-1) \sum_j u_j (u_j-1) / (2n(n-1)) \\ v_2 & = & \sum_i t_i (t_i-1) (t_i-2) \sum_j u_j (u_j-1) (u_j-2) / (9 n (n-1) (n-2)) \end{array} $
AlgorithmsEdit
The direct computation of the numerator $ n_c - n_d $, involves two nested iterations, as characterized by the following pseudo-code:
numer := 0 for i:=1..N do for j:=1..(i-1) do numer := numer + sgn(x[i] - x[j]) * sgn(y[i] - y[j]) return numer
Although quick to implement, this algorithm is $ O(n^2) $ in complexity and becomes very slow on large samples. A more sophisticated algorithm^{[5]} built upon the Merge Sort algorithm can be used to compute the numerator in $ O(n \cdot \log{n}) $ time.
Begin by ordering your data points sorting by the first quantity, $ x $, and secondarily (among ties in $ x $) by the second quantity, $ y $. With this initial ordering, $ y $ is not sorted, and the core of the algorithm consists of computing how many steps a Bubble Sort would take to sort this initial $ y $. An enhanced Merge Sort algorithm, with $ O(n \log n) $ complexity, can be applied to compute the number of swaps, $ S(y) $, that would be required by a Bubble Sort to sort $ y_i $. Then the numerator for $ \tau $ is computed as:
- $ n_c-n_d = n_0 - n_1 - n_2 + n_3 - 2 S(y) $,
where $ n_3 $ is computed like $ n_1 $ and $ n_2 $, but with respect to the joint ties in $ x $ and $ y $.
A Merge Sort partitions the data to be sorted, $ y $ into two roughly equal halves, $ y_{left} $ and $ y_{right} $, then sorts each half recursive, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:
- $ S(y) = S(y_{left}) + S(y_{right}) + M(Y_{left},Y_{right}) $
where $ Y_{left} $ and $ Y_{right} $ are the sorted versions of $ y_{left} $ and $ y_{right} $, and $ M(\cdot,\cdot) $ characterizes the Bubble Sort swap-equivalent for a merge operation. $ M(\cdot,\cdot) $ is computed as depicted in the following pseudo-code:
function M(L[1..n], R[1..m]) n := n + m i := 1 j := 1 nSwaps := 0 while i + j <= n do if i > m or R[j] < L[i] then nSwaps := nSwaps + m - (i-1) j := j + 1 else i := i + 1 return nSwaps
A side effect of the above steps is that you end up with both a sorted version of $ x $ and a sorted version of $ y $. With these, the factors $ t_i $ and $ u_j $ used to compute $ \tau_B $ are easily obtained in a single linear-time pass through the sorted arrays.
A second algorithm with $ O(n \cdot \log{n}) $ time complexity, based on AVL trees, was devised by David Christensen.^{[6]}
See alsoEdit
- Correlation
- Gamma test (statistics)
- Kendall tau distance
- Kendall's W
- Rank correlation
- Spearman's rank correlation coefficient
- Theil–Sen estimator
ReferencesEdit
- ↑ Kendall, M. (1938). A New Measure of Rank Correlation. Biometrika 30 (1–2): 81–89.
- ↑ Kruskal, W.H. (1958). Ordinal Measures of Association. Journal of the American Statistical Association 53 (284): 814–861.
- ↑ Template:SpringerEOM
- ↑ Template:SpringerEOM
- ↑ Knight, W. (1966). A Computer Method for Calculating Kendall's Tau with Ungrouped Data. Journal of the American Statistical Association 61 (314): 436–439.
- ↑ Christensen, David (2005). Fast algorithms for the calculation of Kendall's τ. Computational Statistics 20 (1): 51–62.
- Abdi, H. (2007). Encyclopedia of Measurement and Statistics, Thousand Oaks (CA): Sage.
- Kendall, M. (1948) Rank Correlation Methods, Charles Griffin & Company Limited
External linksEdit
- Tied rank calculation
- Why Kendall tau?
- Software for computing Kendall's tau on very large datasets
- Online software: computes Kendall's tau rank correlation
- The CORR Procedure: Statistical Computations
[[[Category:Nonparametric statistical tests]]
This page uses Creative Commons Licensed content from Wikipedia (view authors). |