Regression towards the mean

In statistics, regression toward the mean is the principle that, given pairs of related measurements, if one selects those pairs in which the first measurement is either extremely high or extremely low, the expected value of the second measurement is closer to the mean than the observed value of the first.

Examples
Consider, for example, students who take a midterm and a final exam. Students who got an extremely high score on the midterm will probably get a good score on the final exam as well, but we expect their score to be closer to the average (i.e., fewer standard deviations above the average) than their midterm score was. The reason is that some luck was likely involved in getting the exceptional midterm score, and this luck cannot be counted on for the final. It is also true that among those who get exceptionally high final exam scores, the average midterm score will be fewer standard deviations above average than the final exam score, since some of those students got high scores on the final due to luck that they did not have on the midterm. Similarly, unusually low scores regress toward the mean.
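The exam example can be illustrated with a small simulation. The sketch below assumes a toy model that is not part of the original example: each score is a persistent "skill" plus independent "luck", with hypothetical means and standard deviations chosen only for illustration. Because both exams have the same spread under this model, comparing raw scores is equivalent to comparing standard deviations above the average.

```python
import random
import statistics


def simulate_exam_scores(n_students=20000, seed=0):
    """Return (midterm, final) score lists under a toy skill-plus-luck model."""
    rng = random.Random(seed)
    midterm, final = [], []
    for _ in range(n_students):
        skill = rng.gauss(70, 10)                 # persistent ability (assumed values)
        midterm.append(skill + rng.gauss(0, 10))  # independent luck on the midterm
        final.append(skill + rng.gauss(0, 10))    # fresh, independent luck on the final
    return midterm, final


def top_group_means(midterm, final, fraction=0.05):
    """Mean midterm and mean final score of the top `fraction` of midterm scorers."""
    k = max(1, int(len(midterm) * fraction))
    top = sorted(range(len(midterm)), key=lambda i: midterm[i], reverse=True)[:k]
    return (statistics.fmean(midterm[i] for i in top),
            statistics.fmean(final[i] for i in top))


midterm, final = simulate_exam_scores()
top_midterm, top_final = top_group_means(midterm, final)
overall_final = statistics.fmean(final)

print(f"top midterm scorers, midterm mean: {top_midterm:.1f}")
print(f"top midterm scorers, final mean:   {top_final:.1f}")
print(f"all students, final mean:          {overall_final:.1f}")
```

Under these assumptions the selected students still score above average on the final, but by a smaller margin than on the midterm. Swapping the roles of the two lists in `top_group_means` shows the same effect for students selected on their final scores.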

It is a commonplace observation that the mating of two championship athletes, or of two geniuses, usually results in a child who is above average but less talented than either parent.

History
The first regression line drawn on biological data was a plot of seed weights presented by Francis Galton at a Royal Institution lecture in 1877. Galton had seven sets of sweet pea seeds labelled K to Q, and within each packet the seeds were of the same weight. He chose sweet peas on the advice of his cousin Charles Darwin and the botanist Joseph Hooker, as sweet peas tend not to self-fertilise and the seed weight varies little with humidity. He distributed these packets to a group of friends throughout Great Britain, who planted them. At the end of the growing season the plants were uprooted and returned to Galton. The seeds were distributed because when Galton had tried this experiment himself at Kew Gardens in 1874, the crop had failed.

He found that the weights of the offspring seeds were normally distributed, like those of their parents, and that if he plotted the mean diameter of the offspring seeds against the mean diameter of their parents he could draw a straight line through the points: the first regression line. He also found on this plot that the mean size of the offspring seeds tended toward the overall mean size. He initially referred to the slope of this line as the "coefficient of reversion". Once he discovered that this effect was not a heritable property but the result of his manipulations of the data, he changed the name to the "coefficient of regression". This result was important because it appeared to conflict with the thinking of the time on evolution and natural selection. He went on to do extensive work in quantitative genetics, and in 1888 he coined the term "co-relation" and used the now familiar symbol "r" for this value.

In additional work he investigated geniuses in various fields and noted that their children, while typically gifted, were almost invariably closer to the average than their exceptional parents. He later described the same effect more quantitatively by comparing fathers' heights to their sons' heights. Again, the heights of sons of both unusually tall fathers and unusually short fathers were typically closer to the mean height than their fathers' heights.

Ubiquity
It is important to realize that regression toward the mean is a ubiquitous statistical phenomenon and has nothing to do with biological inheritance. It is also unrelated to the progression of time: the fathers of exceptionally tall sons also tend to be closer to the mean than their sons. The overall variability of height is the same among fathers as among sons.
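The symmetry described above can be checked numerically. The following sketch draws father and son heights from a bivariate normal distribution; the mean (175 cm), standard deviation (7 cm) and correlation (0.5) are hypothetical values chosen for illustration, not figures from Galton's data.

```python
import math
import random
import statistics


def simulate_heights(n=50000, r=0.5, mean=175.0, sd=7.0, seed=1):
    """Draw (father, son) heights from a bivariate normal with correlation r."""
    rng = random.Random(seed)
    fathers, sons = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        fathers.append(mean + sd * z1)
        # Mixing z1 and z2 this way gives sons the same mean and sd as fathers.
        sons.append(mean + sd * (r * z1 + math.sqrt(1 - r * r) * z2))
    return fathers, sons


def means_when_selected_is_tall(selected, other, threshold):
    """Mean of `selected` and of `other` over pairs where `selected` > threshold."""
    pairs = [(s, o) for s, o in zip(selected, other) if s > threshold]
    return (statistics.fmean(s for s, _ in pairs),
            statistics.fmean(o for _, o in pairs))


fathers, sons = simulate_heights()
tall = 175.0 + 2 * 7.0  # two standard deviations above the assumed mean

tall_father, their_son = means_when_selected_is_tall(fathers, sons, tall)
tall_son, their_father = means_when_selected_is_tall(sons, fathers, tall)

print(f"tall fathers average {tall_father:.1f} cm, their sons {their_son:.1f} cm")
print(f"tall sons average {tall_son:.1f} cm, their fathers {their_father:.1f} cm")
print(f"sd of fathers {statistics.stdev(fathers):.2f}, sd of sons {statistics.stdev(sons):.2f}")
```

In this model the regression appears in both directions, while the spread of heights in the two generations stays essentially equal, so nothing is "shrinking" over time.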

The original version of regression toward the mean assumes a single underlying trait measured twice, with the two correlated measurements having the same reliability. This assumption is not necessary, however, unless every pair of predicting and predicted variables is to be viewed as reflecting one and the same underlying trait. What is implicitly required is that the standard deviations of the predicting and the predicted variables be the same, or that the variables have been transformed or interpreted so as to be comparable.

One later version of regression toward the mean defines a predicting variable with measurement error, which weakens the predicting coefficient. This interpretation is not necessary either. For example, in the original case the measurement error of length could be ignored.

Mathematical derivation
Let X and Y be zero-mean, jointly Gaussian random variables with the same variance and correlation coefficient r. The Cauchy-Schwarz inequality shows that |r| <= 1. From Gaussianity, the expected value of Y conditioned on the value of X is linear in X; more precisely, E[Y|X]=rX. Hence, whenever |r| < 1, the estimated value of Y is closer to the mean 0 than the observed value of X. Similar results can be obtained for more general classes of distributions. For example, let (X,Y) be jointly normal as above, and define W=AX, Z=AY, where A is any absolutely integrable scalar random variable independent of X and Y. The variables W and Z have zero mean but are not Gaussian. Nevertheless, it is possible to prove that the linear regression property still holds: E[Z|W]=rW, and once again regression toward the mean is observed.
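One possible way to fill in the steps of this derivation is sketched below. The symbols \sigma^2 (the common variance of X and Y) and \varepsilon (an auxiliary residual) are introduced here for convenience and do not appear in the text above. The sketch relies on the fact that uncorrelated jointly Gaussian variables are independent, and it assumes that A is independent of the pair (X, Y) jointly and that all the conditional expectations exist.

```latex
\begin{align*}
\varepsilon &:= Y - rX, \\
\operatorname{Cov}(\varepsilon, X)
  &= \operatorname{Cov}(Y, X) - r \operatorname{Var}(X)
   = r\sigma^2 - r\sigma^2 = 0, \\
\mathbb{E}[Y \mid X]
  &= rX + \mathbb{E}[\varepsilon \mid X]
   = rX + \mathbb{E}[\varepsilon]
   = rX, \\
\mathbb{E}[Z \mid A, X]
  &= A \, \mathbb{E}[Y \mid A, X]
   = A \, \mathbb{E}[Y \mid X]
   = rAX
   = rW, \\
\mathbb{E}[Z \mid W]
  &= \mathbb{E}\bigl[ \mathbb{E}[Z \mid A, X] \bigm| W \bigr]
   = \mathbb{E}[rW \mid W]
   = rW.
\end{align*}
```

The third line uses the independence of \varepsilon and X, which follows from the second line. The last line uses the tower property of conditional expectation, which applies because W = AX is a function of (A, X).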

The example illustrates a general feature: regression toward the mean is more pronounced the less the two variables are correlated, i.e. the smaller |r| is.
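The dependence on r can be seen in a short numerical experiment. The sketch below uses standardized variables (zero mean, unit variance) and selects pairs whose X value exceeds an arbitrary cutoff of 1.5; the cutoff, sample size and the three values of r are illustrative assumptions. Since E[Y|X]=rX, the average selected Y divided by the average selected X should be close to r.

```python
import math
import random
import statistics


def shrinkage_ratio(r, n=100000, cutoff=1.5, seed=2):
    """Estimate mean(Y) / mean(X) over pairs with X > cutoff, for correlation r."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Y has unit variance and correlation r with X.
        y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        if x > cutoff:
            xs.append(x)
            ys.append(y)
    return statistics.fmean(ys) / statistics.fmean(xs)


ratios = {r: shrinkage_ratio(r) for r in (0.2, 0.5, 0.8)}
for r, ratio in ratios.items():
    print(f"r = {r}: selected Y averages {ratio:.2f} times the selected X")
```

A ratio near 0.2 means the selected group has moved most of the way back to the mean, while a ratio near 0.8 means it has moved only a little.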

The phenomenon of regression toward the mean is related to Stein's example.

Regression fallacies
Misunderstandings of the principle (known as "regression fallacies") have repeatedly led to mistaken claims in the scientific literature.

An extreme example is Horace Secrist's 1933 book The Triumph of Mediocrity in Business, in which the statistics professor collected mountains of data to prove that the profit rates of competitive businesses tend towards the average over time. In fact, there is no such effect; the variability of profit rates is almost constant over time. Secrist had only described the common regression toward the mean. One exasperated reviewer likened the book to "proving the multiplication table by arranging elephants in rows and columns, and then doing the same for numerous other kinds of animals".

A different regression fallacy occurs in the following example. We want to test whether a certain stress-reducing drug increases the reading skills of poor readers. Pupils are given a reading test. The lowest-scoring 10% are then given the drug and tested again, with a different test that also measures reading skill. We find that the average reading score of our group has improved significantly. This, however, does not show anything about the effectiveness of the drug: even without the drug, the principle of regression toward the mean would have predicted an improvement.
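To make the fallacy concrete, the sketch below simulates the two reading tests with no drug effect at all. The model (a stable reading ability plus independent test-day noise) and its numerical parameters are assumptions made purely for illustration.

```python
import random
import statistics


def simulate_two_tests(n_pupils=10000, seed=3):
    """Return (first, second) test scores; the drug is assumed to do nothing."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_pupils):
        ability = rng.gauss(100, 15)               # stable reading ability (assumed)
        first.append(ability + rng.gauss(0, 10))   # test-day noise on the first test
        second.append(ability + rng.gauss(0, 10))  # independent noise, no treatment term
    return first, second


first, second = simulate_two_tests()

# Select the lowest-scoring 10% on the first test, as in the example.
k = len(first) // 10
lowest = sorted(range(len(first)), key=lambda i: first[i])[:k]

before = statistics.fmean(first[i] for i in lowest)
after = statistics.fmean(second[i] for i in lowest)
overall = statistics.fmean(second)

print(f"selected group, first test:  {before:.1f}")
print(f"selected group, second test: {after:.1f}")
print(f"all pupils, second test:     {overall:.1f}")
```

The selected group improves on the second test even though no treatment was modelled, which is why a comparison group drawn from the same low scorers is needed before attributing any gain to the drug.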

The calculation and interpretation of "improvement scores" on standardized educational tests in Massachusetts probably provide another example of the regression fallacy. In 1999, schools were given improvement goals. For each school, the Department of Education tabulated the difference in the average score achieved by students in 1999 and in 2000. It was quickly noted that most of the worst-performing schools had met their goals, which the Department of Education took as confirmation of the soundness of its policies. However, it was also noted that many of the supposedly best schools in the Commonwealth, such as Brookline High School (with 18 National Merit Scholarship finalists), were declared to have failed. As in many cases involving statistics and public policy, the issue is debated, but "improvement scores" were not announced in subsequent years and the findings appear to be a case of regression to the mean.

In sports
Statistical analysts have long recognized the effect of regression to the mean in sports; they even have a special name for it: the "sophomore slump". For example, Carmelo Anthony of the NBA's Denver Nuggets had an outstanding rookie season in 2004. It was so outstanding, in fact, that he could not reasonably be expected to repeat it: in 2005, Anthony's numbers dropped slightly from those of his torrid rookie season. Explanations for the "sophomore slump" abound, as sports are all about adjustment and counter-adjustment, but luck-based excellence as a rookie is as good a reason as any.

Of course, it is not only "sophomores" who experience regression to the mean. Any athlete who posts a significant outlier, whether as a rookie (young players are generally not as good as those in their prime seasons) or, particularly, after their prime years (for most sports, the mid to late twenties), can be expected to perform more in line with their established standards of performance afterwards. The trick for sports executives, then, is to determine whether a player's performance in the previous season was indeed an outlier or whether the player has established a new level of play. This is not easy, however.

Melvin Mora of the Baltimore Orioles put up a season in 2003, at age 31, that was so far from his performance in prior seasons that analysts assumed it had to be an outlier, but in 2004 Mora was even better. Mora, then, had truly established a new level of production, though he will likely regress to his more reasonable 2003 numbers in 2005. Conversely, Kurt Thomas of the New York Knicks significantly ramped up his production in 2001, at an age (29) when players typically start to decline. Sure enough, in the following season Thomas was his old self again, having regressed to the mean of his established level of play.

John Hollinger has an alternative name for the law of regression to the mean: the "fluke rule". Whatever you call it, though, regression to the mean is a fact of life, and also of sports.