Statistical regression

Regression analysis is any statistical method where the mean of one or more random variables is predicted conditioned on other (measured) random variables. In particular, there is linear regression, logistic regression, Poisson regression, supervised learning, and unit-weighted regression. Regression analysis is more than curve fitting (choosing a curve that best fits given data points); it involves fitting a model with both deterministic and stochastic components. The deterministic component is called the predictor and the stochastic component is called the error term.

The simplest form of a regression model contains a dependent variable (also called "outcome variable," "endogenous variable," or "Y-variable") and a single independent variable (also called "factor," "exogenous variable," or "X-variable").

Typical examples are the dependence of the blood pressure Y on the age X of a person, or the dependence of the weight Y of certain animals on their daily ration of food X. This dependence is called the regression of Y on X.

See also: multivariate normal distribution, important publications in regression analysis.

Regression is usually posed as an optimization problem as we are attempting to find a solution where the error is at a  minimum. The most common error measure that is used is the least squares: this corresponds to a Gaussian likelihood of generating observed data given the (hidden) random variable. In a certain sense, least squares is an optimal estimator: see the Gauss-Markov theorem.

The optimization problem in regression is typically solved by algorithms such as the gradient descent algorithm, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm. Probabilistic algorithms such as RANSAC can be used to find a good fit for a sample set, given a parametrized model of the curve function.

Regression can be expressed as a maximum likelihood method of estimating the parameters of a model. However, for small amounts of data, this estimate can have high variance. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, which incorporates everything known about the parameters. (For example, if one parameter is known to be non-negative a non-negative distribution can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantages that they use all the information that is available and they are exact, not asymptotic, and thus work well for small data sets. Some practitioners use maximum a posteriori (MAP) methods, a simpler method than full Bayesian analysis, in which the parameters are chosen that maximize the posterior. MAP methods are related to Occam's Razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.

Purpose and formulation
The goal of regression is to describe a set of data as accurately as possible. To do this, we set the following mathematical context:

$$(\Omega,\mathcal{A}, P)$$ will denote a probability space and $$(\Gamma, S)$$ will be a measure space. $$\Theta\subseteq\Gamma$$ is a set of coefficients.

Very often, $$\Gamma = \mathbb{R}^n$$ and $$S=\mathcal{B}_n$$ with $$n\in\mathbb{N}^*$$.

The response variable (or vector of observations) Y is a random variable, i.e. a measurable function:

$$Y:(\Omega,\mathcal{A})\rightarrow(\Gamma, S)$$.

This variable will be "explained" using other random variables called factors. Some people say Y is a dependent variable (because it depends on the factors) and call the factors independent variables. However, the factors can very well be statistically dependent (for example if one takes X and $$X^2$$) and the reponse variables can be statistically independent. Therefore, the terminology "dependent" and "independent" can be confusing and should be avoided.

Let $$p\in\mathbb{N}^*$$. $$p$$ is called number of factors.

$$\forall i\in \{1,\cdots,p\}, X_i:(\Omega,\mathcal{A})\rightarrow(\Gamma, S)$$.

Let $$\eta:\left\{ \begin{matrix} \Gamma^p\times\Theta&\rightarrow&\Gamma\\ (X_1,\cdots,X_p;\theta)&\mapsto&\eta(X_1,\cdots,X_p,\theta) \end{matrix} \right.$$.

We finally define $$\varepsilon:=Y-\eta(X_1,\cdots,X_p;\theta)$$, which means that $$Y=\eta(X_1,\cdots,X_p;\theta)+\varepsilon$$ or more concisely:


 * $$Y=\eta(X;\theta)+\varepsilon$$ (E)

if we accept the convention that X is either a matrix with one factor per column or a single vector if $$p=1$$. For example, Y could be the number of correct answers to a test and X could be the age of the person undertaking the test. The last term, $$\varepsilon$$, is a random variable called error which is supposed to model the variability in the experiment (i.e., in exactly the same conditions, the output Y of the experiment might differ slightly from experiment to experiment). This term actually represents the part of Y not explained by the model $$\eta$$.

The general form of the function $$\eta$$ is known. In fact, the only element we don't know in the equation (E) is $$\theta$$. The aim of regression is, given a set of data, to find an estimate $$\widehat{\theta}$$ of $$\theta$$ satisfying some criterion.

Doing a regression takes three steps: (1) deciding what kind of function $$\eta$$ we will use, (2) choosing the criterion to optimize, (3) finding and computing an estimator for $$\theta$$.

Linear regression
For continuous variables, linear regression is the most common case in practice because it is the easiest to compute and gives good results. Note that by "linear", we mean "linear in $$\theta$$, not "linear in X". When we do a linear regression, we are implicitly supposing that $$\Gamma = \mathbb{R}^n$$ and $$S=\mathcal{B}_n$$ with $$n\in\mathbb{N}^*$$ and that given a set of factors $$X_1,\cdots,X_p$$, the best approximation of the response variable $$Y$$ we can find is a linear combination of theses factors $$X_1,\cdots,X_p$$. The aim of linear regression is to find the right coefficients of this linear combination.

We choose $$\eta$$ the following way:


 * $$\eta(X,\theta)=\theta^0 + \sum_{j=1}^p \theta^j X_j.$$

Logistic regression
If the variable y has only discrete values (for example, a Yes/No variable), logistic regression is preferred. It is equivalent to making a linear regression on the odds ratio. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. probability of getting "yes") varies with the factors.

In order to solve this problem efficiently, several methods exist. The most common one is the Gauss-Markov method, but it requires extra hypotheses.

Criterion to optimize
The criterion usually used is to minimize $$\mathbb{E}[\|\varepsilon\|^2]$$. The motivation behind this criterion is that the function $$(x,y)\mapsto\mathbb{E}[\|x-y\|^2]$$ defines a metric. Therefore, solving the regression problem in that case is equivalent to finding the function that lies the closest to Y. But why this particular metric? Simply because it lends itself very nicely to a geometrical interpretation, as we will later see.

Of course, this criterion is only one of the many criterions we can use. For example, we could modify this metric slightly to downweigh observations that are known to have a high variance because we consider them unreliable. This leads to weighted regression.

Choice of an estimator
Under assumptions which are met relatively often, there exists an optimal solution to the linear regression problem. These assumptions are called Gauss-Markov hypothesis. See also Gauss-Markov theorem.

The Gauss-Markov hypothesis
We suppose that $$\mathbb{E}\varepsilon=0$$ and that $$\mathbb{V}\varepsilon=\sigma^2 I)$$ (uncorrelated, but not necessarily independent) where $$\sigma^2<+\infty$$ and $$I$$ is the $$n\times n$$ identity matrix.

The linear regression problem is then equivalent to an orthogonal projection. It is as if we were considering a set of random variables containing Y and that we projected Y on a subspace of linear functions. This makes the computing of an estimator fairly straightforward. For a proof of this, please refer to least-squares estimation of linear regression coefficients.

Gauss-Markov least-squares estimation of the coefficients
If we suppose all $$p$$ factors are vectors of same length $$n$$, we can build a $$n\times p$$ matrix X:


 * $$X:=(X_1,\cdots,X_p)$$

Supposing this matrix is of full rank, it can be shown (for a proof of this, see least-squares estimation of linear regression coefficients) that a good estimator $$\widehat{\theta}$$ of the parameters $$\theta=(\theta^0,\cdots,\theta^p)$$ is:


 * $$\widehat{\theta}=(X^t X)^{-1}X^t y$$

where $$X^t$$ is the transpose of matrix X, and y is a realization of Y (Y is a random variable i.e. a function and y is the value that Y takes for the experiment under consideration). Based on this data, an estimation of the function $$\eta$$ we are looking for is:


 * $$\eta(X;\widehat{\theta}) = X\widehat{\theta}$$

Alternatives to Gauss-Markov
The Gauss-Markov estimator is extremely efficient: in fact, the Gauss-Markov theorem states that of all unbiased estimators, depending linearly on $$Y$$, of the linear regression coefficients, the least-square ones are the most efficient. Unfortunately, the Gauss-Markov hypothesis are fairly stringent and are often not met in practical cases: departure from the assumptions will corrupt the results quite significantly.

A rather naïve example of this is given on the figure below:



All points lie on a straight line, except one. The regression line is shown in red.

However, robust estimators are a bit fiddly (as one can see here for example, and people tend to overlook the Gauss-Markov hypothesis and brace themselves with the power of Gauss-Markov justifying it with the central limit theorem (for large values of $$n$$, the Gauss-Markov assumptions are often met).

If the Gauss-Markov hypotheses are not met, a variety of techniques are available.


 * If the error term is not normal but forms an exponential family one can use generalized linear models. Other techniques include the use of weighted least squares or transforming the dependent variable using the Box-Cox transformation.


 * If outliers are present the normal distribution can be replaced by a t-distribution or, alternatively, robust regression methods may be used.


 * If the predictor is not linear a nonparametric regression or semiparametric regression or nonlinear regression may be used.

Confidence interval for estimation assuming normality, homoscedasticiy, and uncorrelatedness
How much confidence can we have in the values of $$\beta$$ we estimated from the data? To answer, we unfortunately need to add hypothesis yet again. Suppose that:


 * $$Y\sim\mathcal{N}(X\theta,\sigma^2 I_n).\,$$

Then we can get the distribution of the least-square estimation of the parameters.

If $$\eta(X;\hat{\theta})=X\widehat{\theta}, \|u\|^2=u^t u$$ and $$\widehat{\sigma}^2:=\frac{1}{n-p}\|y-f\|^2$$ then


 * $$\widehat{\theta}\sim\mathcal{N}(\theta,\sigma^2(X^t X)^{-1}),$$


 * $$\frac{n-p}{\sigma^2}\widehat{\sigma}^2\sim\chi^2_{n-p},$$


 * $$\frac{1}{\sigma^2}\|f-\hat{f}\|^2\sim\chi_{p}^2.$$

For $$1\leq j\leq p$$, if we name $$s_j$$ the $$j$$-th diagonal element of the matrix $$(X^tX)^{-1}$$, a $$1-\alpha$$ confidence interval for $$\theta_j$$ is therefore:


 * $$[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}].$$

First example
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

We would like to see how the weight of these women depends on their height. We are therefore looking for a function $$f$$ such that $$y=f(x)$$, where y is the weight of the women and x their height. Intuitively, we can guess that if the women's proportions are constant and their density too, then the weight of the women must depend on the cube of their height. A plot of the data set confirms this supposition:



We are therefore looking for coefficients $$\theta^0, \theta^1$$ and $$\theta^2$$ satisfying as well as possible (in the sense of the Gauss-Markov hypothesis) the equation:


 * $$y=\theta^0 + \theta^1 x + \theta^2 x^3+\varepsilon$$

This means we want to project y on the subspace generated by the variables $$1, x$$ and $$x^3$$. The matrix X is constructed simply by putting a column of 1's (the constant term in the model) a column with the original values (the x in the model) and a column with these values cubed (x^3). It can be written:

The matrix $$(X^t X)^{-1}$$ (sometimes called "information matrix" or "dispertion matrix") is:

$$ \left[\begin{matrix} 1927.3&-44.6&3.5e-3\\ -44.6&1.03&-8.1e-5\\ 3.5e-3&-8.1e-5&6.4e-9 \end{matrix}\right]$$

Vector $$\widehat{\theta}$$ is therefore:

$$\widehat{\theta}=(X^tX)^{-1}X^{t}y= <146.6,-2.0,4.3e-4>$$

Therefore: $$\eta(x) = 147 - 1.98 x + 4.27*10^{-4} x^3$$

A plot of this function shows that it lies quite closely to the data set:

The confidence intervals are computed using:


 * $$[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]$$

with:


 * $$\widehat{\sigma}=0.52$$
 * $$s_1=1927.3, s_2=1.033, s_3=6.37*10^{-9}$$
 * $$\alpha=5%$$
 * $$t_{n-p;1-\frac{\alpha}{2}}=2.1788$$

Therefore, we can say that with a probability of 0.95,


 * $$\theta^0\in[112.0, 181.2]$$


 * $$\theta^1\in[-2.8, -1.2]$$


 * $$\theta^2\in[3.6e-4, 4.9e-4]$$

Second example
We are given a vector of x values and another vector of y values and we are attempting to find a function f such that $$f(x_{i}) = y_{i} $$.


 * let $$

\vec{x} = \begin{pmatrix} -2 \\ -1 \\ 0 \\ 1 \\ 2 \\ \end{pmatrix},

\vec{y} = \begin{pmatrix} 5 \\ 2 \\ 1 \\ 2 \\ 5 \\ \end{pmatrix} $$

Let's assume that our solution is in the family of functions defined by a 3rd degree Fourier expansion written in the form:



f(x) = a_{0}/2 + a_{1}\cos(x) + b_{1}\sin(x) + a_{2}\cos(2x) + b_{2}\sin(2x) + a_{3}\cos(3x) + b_{3}\sin(3x) $$

where $$ a_{i}, b_{i} $$ are real numbers. This problem can be represented in matrix notation as:



\begin{pmatrix} 1/2, & \cos(x), & \sin(x), & \cos(2x), & \sin(2x), & \cos(3x), & \sin(3x), \\ \end{pmatrix} \begin{pmatrix} a_{0} \\ a_{1} \\ b_{1} \\ a_{2} \\ b_{2} \\ a_{3} \\ b_{3} \\ \end{pmatrix} = \vec{y} $$

filling this form in with our given values yields a problem in the form Xw = y



\begin{pmatrix} 1/2 & \cos(-2) & \sin(-2) & \cos(-4) & \sin(-4) & \cos(-6) & \sin(-6)\\ 1/2 & \cos(-1) & \sin(-1) & \cos(-2) & \sin(-2) & \cos(-3) & \sin(-3)\\ 1/2 & 1 & 0 & 1 & 0 & 1 & 0\\ 1/2 & \cos(1) & \sin(1) & \cos(2) & \sin(2) & \cos(3) & \sin(3)\\ 1/2 & \cos(2) & \sin(2) & \cos(4) & \sin(4) & \cos(6) & \sin(6)\\ \end{pmatrix}. \begin{pmatrix} a_{0} \\ a_{1} \\ b_{1} \\ a_{2} \\ b_{2} \\ a_{3} \\ b_{3} \\ \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \\ 1 \\ 2 \\ 5 \\ \end{pmatrix} $$

This problem can now be posed as an optimization problem to find the minimum sum of squared errors.
 * $$\min_{\vec{w}} \sum_{i=1}^{n} (\vec{x_{i}}\vec{w} - y_{i})^2 $$
 * $$\min_{\vec{w}} \|X\vec{w} - \vec{y}\|^2. $$

solving this with least squares yields:



\vec{w} = \begin{pmatrix} 0 \\ 4.25 \\ 0 \\ -6.13 \\ 0 \\ 2.88 \\ 0 \\ \end{pmatrix} $$

thus the 3rd-degree Fourier function that fits the data best is given by:

f(x) = 4.25\cos(x) - 6.13\cos(2x) + 2.88\cos(3x). $$