
In statistics, regression analysis is a technique which examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). Regression analysis can be used as a descriptive method of data analysis (such as curve fitting) without relying on any assumptions about underlying processes generating the data.

When paired with assumptions in the form of a statistical model, regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. These uses of regression rely heavily on the model assumptions being satisfied. Regression analysis has been criticized as being misused for these purposes in many cases where the appropriate assumptions cannot be verified to hold. One factor contributing to the misuse of regression is that it can take considerably more skill to critique a model than to fit a model.

The key relationship in a regression is the regression equation. A regression equation contains regression parameters whose values are estimated using data. The estimated parameters measure the relationship between the dependent variable and each of the independent variables. When a regression model is used, the dependent variable is modeled as a random variable because of either uncertainty as to its value or inherent variability. The data are assumed to be a sample from a probability distribution, which is usually assumed to be a normal distribution.

## History of regression

The term "regression" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents and more like their more distant ancestors. Francis Galton, a cousin of Charles Darwin, studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it. For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context.

## Simple linear regression

The general form of a simple linear regression is

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

where $\alpha$ is the intercept, $\beta$ is the slope, and $\varepsilon_i$ is the error term, which picks up the unpredictable part of the response variable $y_i$. The error term is usually posited to be normally distributed. The $x_i$'s and $y_i$'s are the data quantities from the sample or population in question, and $\alpha$ and $\beta$ are the unknown parameters ("constants") to be estimated from the data. Estimates for the values of $\alpha$ and $\beta$ can be derived by the method of ordinary least squares. The method is called "least squares" because the estimates of $\alpha$ and $\beta$ minimize the sum of squared error estimates for the given data set. The estimates of $\alpha$ and $\beta$ are often denoted by $\hat{\alpha}$ and $\hat{\beta}$ or their corresponding Roman letters. It can be shown (see Draper and Smith, 1998 for details) that the least squares estimates are given by

$$\hat{\beta} = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}$$

and

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

where $\bar{x}$ is the mean (average) of the $x$ values and $\bar{y}$ is the mean of the $y$ values.
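As a minimal sketch of the closed-form least squares estimates, the snippet below fits a straight line to the height and weight figures listed in the Examples section of this article. The function name `ols_fit` and the variable names are chosen here for illustration only.

```python
from statistics import mean

# Average heights (in) and weights (lb) from the Examples section.
heights = list(range(58, 73))
weights = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


alpha_hat, beta_hat = ols_fit(heights, weights)
print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")
# prints: intercept = -87.52, slope = 3.45
```

A straight line is used here only to illustrate the formulas; the Examples section argues for a model that includes the cube of height.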

## Generalizing simple linear regression

The simple model above can be generalized in different ways.

• The number of predictors can be increased from one to several. See the main article: linear regression.
• The relationship between the knowns (the $x_i$s and $y_i$s) and the unknowns ($\alpha$ and the $\beta_i$s) can be nonlinear. See the main article: non-linear regression.
• The response variable may be non-continuous. For binary (zero or one) variables, there are the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is positive with low values and represents the repetition of the occurrence of an event, count models like the Poisson regression or the negative binomial model may be used.
• The form of the right-hand side can be determined from the data. See Nonparametric regression. These approaches require a large number of observations, as the data are used to build the model structure as well as to estimate the model parameters. They are usually computationally intensive.

## Regression diagnostics

Once a regression model has been constructed, it is important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include R-squared, analyses of the pattern of residuals and construction of an ANOVA table. Statistical significance is checked by an F-test of the overall fit, followed by t-tests of individual parameters. Interpretations of these diagnostics rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are meaningless unless the modeling assumptions are satisfied.
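The goodness-of-fit quantities mentioned above can be computed by hand for a straight-line fit. The following is a rough sketch using the height and weight figures from the Examples section; all variable names are illustrative, and no p-value is computed because that would need the F distribution from an external library.

```python
# Average heights (in) and weights (lb) from the Examples section.
heights = list(range(58, 73))
weights = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n

# Least squares line.
sxx = sum((x - x_bar) ** 2 for x in heights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals and the sums of squares that make up the ANOVA table.
fitted = [intercept + slope * x for x in heights]
residuals = [y - f for y, f in zip(weights, fitted)]
ss_total = sum((y - y_bar) ** 2 for y in weights)
ss_resid = sum(r ** 2 for r in residuals)
ss_model = ss_total - ss_resid

r_squared = 1 - ss_resid / ss_total

# F statistic of the overall fit: one slope parameter, n - 2 residual df.
df_model, df_resid = 1, n - 2
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

print(f"R-squared = {r_squared:.4f}, F = {f_stat:.1f}")
print("residuals:", [round(r, 2) for r in residuals])
```

For these data the residuals are positive at both ends of the height range and negative in the middle. That is the kind of residual pattern that can call a straight-line model into question even when R-squared is high.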

## Estimation of model parameters

The parameters of a regression model can be estimated in many ways, among them the method of least squares and the method of maximum likelihood.

For a model with normally distributed errors the method of least squares and the method of maximum likelihood coincide (see Gauss-Markov theorem).
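A short derivation shows why the two methods agree in the simplest case. It assumes a simple linear model $y_i = \alpha + \beta x_i + \varepsilon_i$ with independent errors $\varepsilon_i \sim N(0, \sigma^2)$; the symbols $\alpha$, $\beta$ and $\sigma^2$ are notation chosen here. The log-likelihood of $n$ observations is

$$\ell(\alpha, \beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

For any fixed $\sigma^2$, the first term does not depend on $\alpha$ or $\beta$, so maximizing $\ell$ over $(\alpha, \beta)$ is the same as minimizing $\sum_{i} (y_i - \alpha - \beta x_i)^2$, which is the least squares criterion.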

## Interpolation and extrapolation

Regression models predict a value of the $y$ variable given known values of the $x$ variables. If the prediction is made within the range of values of the $x$ variables used to construct the model, this is known as interpolation. Prediction outside the range of the data used to construct the model is known as extrapolation, and it is riskier because the data give no evidence that the fitted relationship holds there.
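The distinction can be made explicit in code by labelling each prediction according to whether the new value lies inside the observed range. This is a small sketch, assuming a straight-line fit to the height and weight figures from the Examples section; `fit_line` and `predict` are hypothetical helper names.

```python
# Average heights (in) and weights (lb) from the Examples section.
heights = list(range(58, 73))
weights = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]


def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


INTERCEPT, SLOPE = fit_line(heights, weights)


def predict(x_new):
    """Return (predicted weight, 'interpolation' or 'extrapolation')."""
    inside = min(heights) <= x_new <= max(heights)
    kind = "interpolation" if inside else "extrapolation"
    return INTERCEPT + SLOPE * x_new, kind


for h in (65, 80):
    value, kind = predict(h)
    print(f"height {h} in -> {value:.1f} lb ({kind})")
```

The prediction at 80 inches is flagged because no woman in the data set is taller than 72 inches, so nothing in the data supports the fitted line at that height.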

## Assumptions underpinning regression

Regression analysis depends on certain assumptions:

1. The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others. See multicollinearity.
2. The error terms must be normally distributed and independent.
3. The variance of the error terms must be constant.
4. The sample must be representative of the population for inference or prediction.
5. The distribution of the dependent variable must have approximately equal variability across values of the predictors, called the assumption of homoscedasticity.
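The first assumption can be checked numerically: the predictors are linearly independent exactly when the design matrix has full column rank. The sketch below uses NumPy's `numpy.linalg.matrix_rank`; the helper name `predictors_are_independent` and both example matrices are illustrative assumptions, not part of any standard procedure.

```python
import numpy as np


def predictors_are_independent(design):
    """True if no column of `design` is a linear combination of the others."""
    design = np.asarray(design, dtype=float)
    return bool(np.linalg.matrix_rank(design) == design.shape[1])


heights_in = np.arange(58, 73, dtype=float)

# Columns: constant, height, height cubed (linearly independent).
independent_design = np.column_stack(
    [np.ones_like(heights_in), heights_in, heights_in ** 3]
)

# Columns: constant, height in inches, height in centimetres.
# The third column is 2.54 times the second, so the rank drops to 2.
collinear_design = np.column_stack(
    [np.ones_like(heights_in), heights_in, 2.54 * heights_in]
)

print(predictors_are_independent(independent_design))
print(predictors_are_independent(collinear_design))
```

A rank check only detects exact (or numerically exact) dependence. Predictors that are strongly but not perfectly correlated pass it and can still make the estimates unstable.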

## Examples

To illustrate the various goals of regression, we give an example.

### Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

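| Height (in) | Weight (lb) |
|---|---|
| 58 | 115 |
| 59 | 117 |
| 60 | 120 |
| 61 | 123 |
| 62 | 126 |
| 63 | 129 |
| 64 | 132 |
| 65 | 135 |
| 66 | 139 |
| 67 | 142 |
| 68 | 146 |
| 69 | 150 |
| 70 | 154 |
| 71 | 159 |
| 72 | 164 |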

We would like to see how the weight of these women depends on their height. We are therefore looking for a function $\eta$ such that $Y = \eta(X) + \varepsilon$, where $Y$ is the weight of the women and $X$ their height. Intuitively, we can guess that if the women's proportions are constant and their density too, then the weight of the women must depend on the cube of their height.

$\vec{X}$ will denote the vector containing all the measured heights ($\vec{X} = (58, 59, 60, \ldots, 72)$) and $\vec{Y} = (115, 117, 120, \ldots, 164)$ is the vector containing all measured weights. We can suppose the errors $\varepsilon_i$ are uncorrelated with each other and have constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients $\theta_0$, $\theta_1$ and $\theta_2$ satisfying as well as possible (in the sense of the least-squares estimator) the equation:

$$\vec{Y} = \theta_0 + \theta_1 \vec{X} + \theta_2 \vec{X}^3 + \vec{\varepsilon}$$

Geometrically, what we will be doing is an orthogonal projection of $Y$ on the subspace generated by the variables $1$, $X$ and $X^3$. The matrix $X$ is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the $X$ in the model) and a third column with these values cubed ($X^3$). The realization of this matrix (i.e. for the data at hand) can be written:

$$X = \begin{pmatrix}
1 & 58 & 195112 \\
1 & 59 & 205379 \\
1 & 60 & 216000 \\
1 & 61 & 226981 \\
1 & 62 & 238328 \\
1 & 63 & 250047 \\
1 & 64 & 262144 \\
1 & 65 & 274625 \\
1 & 66 & 287496 \\
1 & 67 & 300763 \\
1 & 68 & 314432 \\
1 & 69 & 328509 \\
1 & 70 & 343000 \\
1 & 71 & 357911 \\
1 & 72 & 373248
\end{pmatrix}$$

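The matrix $(X^\top X)^{-1}$ (sometimes called "information matrix" or "dispersion matrix") is computed from the matrix $X$ above. The vector of estimated coefficients $\hat{\theta} = (\hat{\theta}_0, \hat{\theta}_1, \hat{\theta}_2)$, for the constant, $X$ and $X^3$ columns respectively, is therefore

$$\hat{\theta} = (X^\top X)^{-1} X^\top Y$$

where $Y$ is the vector of measured weights; hence the fitted function is $\hat{\theta}_0 + \hat{\theta}_1 X + \hat{\theta}_2 X^3$.

The confidence intervals are computed using:

$$\left[\hat{\theta}_j - \hat{\sigma}\sqrt{s_j}\;t_{n-p;\,1-\alpha/2},\;\; \hat{\theta}_j + \hat{\sigma}\sqrt{s_j}\;t_{n-p;\,1-\alpha/2}\right]$$

with $\hat{\sigma}$ the estimated standard deviation of the errors, $s_j$ the $j$-th diagonal element of $(X^\top X)^{-1}$, $n = 15$ observations, $p = 3$ estimated coefficients, and $t_{n-p;\,1-\alpha/2}$ the corresponding quantile of Student's t distribution. Taking $\alpha = 5\%$ gives the 95% confidence intervals.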
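The computation in this example can be reproduced with NumPy. The following is a sketch under stated assumptions, not the original author's calculation: it uses `numpy.linalg.lstsq` for the estimates and the pseudoinverse of the design matrix for the diagonal of the inverse of X-transpose-X, because the columns differ by several orders of magnitude and inverting that product directly is numerically delicate. The t quantile for 12 degrees of freedom is hard-coded from a standard table, and all variable names are chosen here.

```python
import numpy as np

heights = np.arange(58, 73, dtype=float)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, X and X^3, as described in the text.
X = np.column_stack([np.ones_like(heights), heights, heights ** 3])
n, p = X.shape

# Least squares estimates of (theta_0, theta_1, theta_2).
theta, _, _, _ = np.linalg.lstsq(X, weights, rcond=None)

# Residual standard deviation with n - p degrees of freedom.
residuals = weights - X @ theta
rss = float(residuals @ residuals)
sigma_hat = np.sqrt(rss / (n - p))

# Diagonal of (X^T X)^{-1}, obtained as the row sums of squares of pinv(X),
# since pinv(X) @ pinv(X).T equals (X^T X)^{-1} when X has full column rank.
X_pinv = np.linalg.pinv(X)
s = np.sum(X_pinv ** 2, axis=1)

# Two-sided 95% quantile of Student's t with 12 degrees of freedom
# (table value; scipy.stats.t.ppf(0.975, 12) could be used instead).
T_QUANTILE = 2.179

half_width = T_QUANTILE * sigma_hat * np.sqrt(s)
lower, upper = theta - half_width, theta + half_width

for j in range(p):
    print(f"theta_{j}: {theta[j]:.6g}  95% CI [{lower[j]:.6g}, {upper[j]:.6g}]")
```

The intervals produced this way are only as trustworthy as the model assumptions listed earlier in the article.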
