34,719 Pages

Cross-validation, sometimes called rotation estimation[1] [2] [3], is the statistical practice of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.

The initial subset of data is called the training set; the other subset(s) are called validation or testing sets.

The theory of cross-validation was inaugurated by Seymour Geisser. It is important in guarding against testing hypotheses suggested by the data ("Type III error"), especially where further samples are hazardous, costly or impossible (uncomfortable science) to collect.

Common types of cross-validation

Holdout validation

Holdout validation is not cross-validation in the common sense, because the data never are crossed over. Observations are chosen randomly from the initial sample to form the validation data, and the remaining observations are retained as the training data. Normally, less than a third of the initial sample is used for validation data.[4]

K-fold cross-validation

In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds then can be averaged (or otherwise combined) to produce a single estimation.

Leave-one-out cross-validation

As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as a K-fold cross-validation with K being equal to the number of observations in the original sample, though efficient algorithms exist in some cases, for example with kernel regression and with Tikhonov regularization.

Error estimation

The parameter estimation error can be computed. Common error metrics are the mean squared error (MSE) and the root mean squared error (RMSE), respectively the estimated variance and standard deviation of the cross validation.

References

1. Kohavi, Ron (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137–1143.(Morgan Kaufmann, San Mateo)
2. Chang, J., Luo, Y., and Su, K. 1992. GPSM: a Generalized Probabilistic Semantic Model for ambiguity resolution. In Proceedings of the 30th Annual Meeting on Association For Computational Linguistics (Newark, Delaware, June 28 - July 02, 1992). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 177-184
3. Devijver, P. A., and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, 1982
4. Tutorial 12. Decision Trees Interactive Tutorial and Resources. URL accessed on 2006-06-21.