Oja learning rule

A key problem in artificial neural networks is how neurons learn. The central hypothesis is that learning is based on changing the connections, or synaptic weights, between neurons according to specific learning rules. In unsupervised learning, the changes in the weights depend only on the inputs and the output of the neuron. A popular assumption is the Hebbian learning rule, according to which the change in a given synaptic weight is proportional to both the pre-synaptic input and the output activity of the post-synaptic neuron.

The Oja learning rule (Oja, 1982) is a mathematical formalization of this Hebbian learning rule, such that over time the neuron actually learns to compute a principal component of its input stream.

The simple neuron model
Consider a simplified neuron model that has $$n$$ inputs $$x_1, ..., x_n\ ,$$ each with a weight $$w_i\ .$$ The neuron first computes the weighted sum of the inputs,
 * $$\label{linear}

y = \sum_{i=1}^n w_i x_i $$

and then passes this sum to the next neurons in the network. The problem of learning is how to change the weights $$w_i$$ when a stream of input vectors $${\mathbf x} = (x_1, ..., x_n)$$ is given to this neuron, one vector at a time.
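As a minimal illustration of the weighted sum above, the following Python sketch evaluates the neuron output for a single input vector. The use of NumPy, the function name `neuron_output` and the numerical values are assumptions made for this example only.

```python
import numpy as np

def neuron_output(w, x):
    """Linear neuron: return the weighted sum y = sum_i w_i * x_i."""
    return float(np.dot(w, x))

# Hypothetical weights and one hypothetical input vector (n = 3).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

y = neuron_output(w, x)
print(y)  # 0.5*1.0 - 1.0*2.0 + 2.0*3.0 = 4.5
```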

From Hebbian learning to the Oja learning rule
Using the mathematical notation above, the Hebbian learning principle could be stated as
 * $$\label{Hebb}

\Delta w_i = \alpha x_i y, \; i = 1,...,n $$

where $$\Delta w_i$$ denotes the change in the value of the weight $$w_i\ ,$$ $$x_i$$ is the input coming through the weight $$w_i\ ,$$ and $$y$$ is the output of the neuron as given in equation (1). The coefficient $$\alpha$$ is called the learning rate and is typically small. Because of this, one input vector (whose $$i$$-th component is the term $$x_i$$) causes only a small instantaneous change in the weights, but as the small changes accumulate over time, the weights settle to some values.

Equation \eqref{Hebb} represents the Hebbian principle, because the weight change is the product of the input and the output. However, this learning rule has a severe problem: nothing stops the connections from growing all the time, which finally leads to very large values. There should be another term to balance this growth. In many neuron models, another term representing "forgetting" has been used: the value of the weight itself is subtracted from the right-hand side. The central idea in the Oja learning rule is to make this forgetting term proportional not only to the value of the weight, but also to the square of the output of the neuron. The Oja rule reads:
 * $$\label{Ojarule}

\Delta w_i = \alpha (x_i y - y^2 w_i), \; i= 1,...,n. $$

Now, the forgetting term balances the growth of the weight. The squared output $$y^2$$ guarantees that the larger the output of the neuron becomes, the stronger is this balancing effect.
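The difference between the two rules can be made concrete with a small numerical experiment. The following sketch applies the plain Hebbian rule and the Oja rule to the same stream of inputs and compares the lengths of the resulting weight vectors. The input distribution, the constant learning rate, the initial weights and the variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-mean 2-D inputs with variances 3 and 1 along the axes.
X = rng.normal(size=(2000, 2)) * np.sqrt([3.0, 1.0])

alpha = 0.01                     # small constant learning rate (assumed)
w_hebb = np.array([0.1, 0.1])    # weights updated by the plain Hebbian rule
w_oja = np.array([0.1, 0.1])     # weights updated by the Oja rule

for x in X:
    y = w_hebb @ x
    w_hebb = w_hebb + alpha * x * y                 # Delta w_i = alpha x_i y

    y = w_oja @ x
    w_oja = w_oja + alpha * (x * y - y**2 * w_oja)  # Delta w_i = alpha (x_i y - y^2 w_i)

hebb_norm = np.linalg.norm(w_hebb)
oja_norm = np.linalg.norm(w_oja)
print(hebb_norm, oja_norm)
```

In a run of this kind the Hebbian weight vector grows by many orders of magnitude, while the length of the Oja weight vector stays close to one.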

Oja learning rule and principal component analysis
A mathematical analysis of the Oja learning rule in \eqref{Ojarule} goes as follows (a much more thorough and rigorous analysis appears in the book (Oja, 1983)). First, change into vector notation, in which $$\mathbf x$$ is the column vector with elements $$x_i$$ and $$\mathbf w$$ is the column vector with elements  $$w_i\ .$$ They are called the input vector and the weight vector, respectively. In vector-matrix notation, equation (1) then reads
 * $$\label{linearVec}

y = {\mathbf w}^T{\mathbf x} = {\mathbf x}^T{\mathbf w} $$

where T means the transpose, changing a column vector into a row vector. This is the well-known inner product between two vectors, defined as the sum of products of their elements (see equation (1)).

Next, write equation \eqref{Ojarule} in vector notation:
 * $$\label{OjaruleVec}

\Delta {\mathbf w} = \alpha ({\mathbf x} y - y^2 {\mathbf w}). $$

Then, substitute  $$y$$  from equation \eqref{linearVec} into equation \eqref{OjaruleVec}:

$$\Delta {\mathbf w} = \alpha ({\mathbf x} {\mathbf x}^T{\mathbf w} - {\mathbf w}^T{\mathbf x}{\mathbf x}^T {\mathbf w}{\mathbf w}). $$

This is the incremental change for just one input vector $${\mathbf x}\ .$$ When the algorithm is run for a long time, changing the input vector at every step, one can look at the average behaviour. An especially interesting question is what the value of the weights is when the average change in the weights is zero. This is the point of convergence of the algorithm.

Averaging the right-hand side over $${\mathbf x}\ ,$$ conditional on $${\mathbf w}$$ staying constant, and setting the result to zero gives the following equation for the weight vector at the point of convergence:
 * $$\label{PCAsolution}

{\mathbf C}{\mathbf w} - {\mathbf w}^T{\mathbf C}{\mathbf w}{\mathbf w} = 0 $$

where the matrix $${\mathbf C}$$ is the average of $${\mathbf x}{\mathbf x}^T\ .$$ Assuming the input vectors have zero means, this is in fact the well-known covariance matrix of the inputs.

Considering that the quadratic form $${\mathbf w}^T{\mathbf C}{\mathbf w}$$ is a scalar, this equation clearly is the eigenvalue-eigenvector equation for the covariance matrix $${\mathbf C}\ .$$ This analysis shows that if the weights converge in the Oja learning rule, then the weight vector becomes one of the eigenvectors of the input covariance matrix, and the output of the neuron becomes the corresponding principal component. Principal components are defined as the inner products between the eigenvectors and the input vectors. For this reason, the simple neuron learning by the Oja rule becomes a principal component analyzer (PCA).
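The fixed-point equation can also be checked numerically. If $${\mathbf C}{\mathbf w} = \lambda {\mathbf w}$$ with $$\lambda = {\mathbf w}^T{\mathbf C}{\mathbf w}\ ,$$ then multiplying from the left by $${\mathbf w}^T$$ gives $$\lambda = \lambda \, {\mathbf w}^T{\mathbf w}\ ,$$ so for $$\lambda \neq 0$$ the solution must have unit norm. The sketch below evaluates the averaged right-hand side for unit-norm eigenvectors, for a rescaled eigenvector, and for a vector that is not an eigenvector. The covariance matrix and the helper name `average_oja_change` are assumptions made for this example.

```python
import numpy as np

def average_oja_change(C, w):
    """Averaged right-hand side of the Oja rule: C w - (w^T C w) w."""
    return C @ w - (w @ C @ w) * w

# Hypothetical symmetric covariance matrix of zero-mean inputs.
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# eigh returns eigenvalues in ascending order and unit-norm eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Unit-norm eigenvectors satisfy the fixed-point equation.
residuals = [np.linalg.norm(average_oja_change(C, eigenvectors[:, k]))
             for k in range(2)]
print(residuals)                          # both numerically zero

# An eigenvector with norm 2 is not a fixed point.
scaled_residual = np.linalg.norm(average_oja_change(C, 2.0 * eigenvectors[:, 1]))
print(scaled_residual)                    # clearly nonzero

# A unit vector that is not an eigenvector is not a fixed point either.
w_other = np.array([1.0, 0.0])
other_change = average_oja_change(C, w_other)
print(other_change)                       # C w = (2, 0.5), w^T C w = 2, so (0, 0.5)
```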

Although not shown here, it has been proven that it is the first principal component that the neuron will find, and that the norm of the weight vector tends to one. For details, see (Oja, 1983; 1992). This analysis is based on stochastic approximation theory (see e.g. Kushner and Clark, 1978) and depends on a set of mathematical assumptions. In particular, the learning rate $$\alpha$$ cannot be a constant but has to decrease over time. A typical decreasing sequence is

$$\alpha(t) = 1/t. $$
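The convergence to the first principal component can be illustrated with a short simulation that uses a decreasing learning rate. Because a step size of exactly $$1/t$$ is very large for the first few inputs, the sketch uses $$\alpha(t) = 1/(t_0 + t)$$ with an offset $$t_0 = 100\ ;$$ this offset, the input distribution, the number of inputs and the initial weights are assumptions made here for numerical stability and illustration, and are not taken from the cited analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical zero-mean inputs: variances 2.0 and 0.5 along axes rotated by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(20000, 2)) * np.sqrt([2.0, 0.5])) @ R.T

w = np.array([0.1, 0.0])    # small initial weight vector (assumed)
t0 = 100.0                  # assumed offset for numerical stability, not from the text

for t, x in enumerate(X, start=1):
    alpha = 1.0 / (t0 + t)              # decreasing learning rate
    y = w @ x
    w = w + alpha * (x * y - y**2 * w)  # Oja rule in vector form

# Reference: leading eigenvector of the sample covariance matrix.
C = X.T @ X / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending eigenvalues
e1 = eigenvectors[:, -1]

norm_w = np.linalg.norm(w)
alignment = abs(w @ e1) / norm_w    # |cosine| of the angle between w and e1
print(alignment, norm_w)
```

Under these assumptions both printed numbers should be close to one: the weight vector points along the leading eigenvector (up to sign) and has approximately unit length.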

Extensions of the Oja learning rule
This learning rule has been extended in several directions. Two extensions are briefly reviewed here: the Oja rule for several parallel neurons, and nonlinearities in the rule.

Oja rule for several neurons
It is possible to define this learning rule for a layer of parallel neurons, each receiving the same input vector $$\mathbf x\ .$$ Then, in order to prevent all the neurons from learning the same thing, parallel connections between them are needed. The result is that a subset or all of the principal components are learned. Such neural layers have been considered by Oja (1983, 1992), Sanger (1989) and Földiák (1989).
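As one concrete instance of such a layer, the sketch below uses the update commonly known as Sanger's rule (the generalized Hebbian algorithm), in which neuron $$k$$ subtracts the contributions of neurons $$1, ..., k$$ in its forgetting term. It is shown only as an example of the idea and is not necessarily the formulation used in each of the cited works. The input distribution, the constant learning rate, the number of neurons and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical zero-mean 3-D inputs with variances 4.0, 1.5 and 0.25 along the axes.
X = rng.normal(size=(30000, 3)) * np.sqrt([4.0, 1.5, 0.25])

m = 2                                    # number of parallel neurons
W = rng.normal(scale=0.1, size=(m, 3))   # row k holds the weights of neuron k
alpha = 0.002                            # small constant learning rate (assumed)

for x in X:
    y = W @ x                            # outputs of all neurons for the same input
    # Row k of tril(y y^T) W is y_k * sum_{j <= k} y_j w_j, so each neuron
    # forgets with respect to itself and the neurons before it.
    W = W + alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Reference: eigenvectors of the sample covariance, sorted by decreasing eigenvalue.
C = X.T @ X / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(C)
E = eigenvectors[:, ::-1][:, :m].T       # rows: the m leading eigenvectors

# |cosine| between each learned weight vector and the matching eigenvector.
alignments = np.abs(np.sum(W * E, axis=1)) / np.linalg.norm(W, axis=1)
print(alignments)
```

With a single neuron ($$m = 1$$) the update reduces to the Oja rule in vector form.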

Nonlinear Hebbian learning and independent component analysis
Independent component analysis (ICA) is a technique that is related to PCA, but is potentially much more powerful: instead of finding uncorrelated components as in PCA, it finds statistically independent components, if they exist in the original data. It turns out that quite small changes in the Oja rule can produce independent, instead of principal, components in such a case. What is needed is to change the linear output factor $$y$$ in the Hebbian term to a suitable nonlinearity, such as $$y^3\ .$$ The forgetting term must also be changed accordingly. The ensuing learning rule
 * $$\label{ICAruleVec}

\Delta {\mathbf w} = \alpha ({\mathbf x} y^3 - {\mathbf w}) $$

can be shown to give one of the independent hidden factors under suitable assumptions (Hyvärinen and Oja, 1998). The main requirement is that, prior to entering this algorithm, the input vectors have to be zero mean and whitened so that their covariance matrix $$\mathbf C$$ is equal to the identity matrix. This can be achieved with a simple linear transformation, or by a variant of the Oja rule (see also Hyvärinen et al., 2001).
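The whitening step can be sketched as follows. The example removes the mean and then applies a linear transformation built from the eigendecomposition of the sample covariance matrix, which is one standard way to obtain an identity covariance; it is not claimed to be the specific transformation used in the cited works. The mixing matrix `A`, the uniform source distribution and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical correlated inputs with a nonzero mean (rows are input vectors).
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
X = rng.uniform(-1.0, 1.0, size=(5000, 2)) @ A.T + np.array([3.0, -1.0])

# Step 1: remove the mean.
Xc = X - X.mean(axis=0)

# Step 2: eigendecomposition of the sample covariance, C = E diag(d) E^T.
C = Xc.T @ Xc / len(Xc)
d, E = np.linalg.eigh(C)

# Step 3: whitening matrix V = E diag(d^{-1/2}) E^T, applied to every input.
V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
Z = Xc @ V.T

# Covariance of the whitened inputs, close to the 2 x 2 identity matrix.
C_white = Z.T @ Z / len(Z)
print(np.round(C_white, 6))
```

The rows of `Z` are then zero mean with identity covariance, which is the form of input that the nonlinear rule above requires.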

Recommended reading

 * Diamantaras K.I. and Kung S.Y. (1996) Principal component neural networks: theory and applications. Wiley.