Backpropagation

Backpropagation is a common method of training artificial neural networks so as to minimize an objective function. Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969. It gained recognition only from 1974 onward, when it was applied to neural networks through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, and it led to a “renaissance” in the field of artificial neural network research.

It is a supervised learning method, and is a generalization of the delta rule. It requires a dataset of the desired output for many inputs, making up the training set. It is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop). The term is an abbreviation for "backward propagation of errors". Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.

Summary
For better understanding, the backpropagation learning algorithm can be divided into two phases: propagation and weight update.

Phase 1: Propagation
Each propagation involves the following steps:


 * 1) Forward propagation of a training pattern's input through the neural network in order to generate the propagation's output activations.
 * 2) Backward propagation of the propagation's output activations through the neural network using the training pattern's target in order to generate the deltas of all output and hidden neurons.

Phase 2: Weight update
For each weight-synapse, follow these steps:
 * 1) Multiply its output delta and input activation to get the gradient of the weight.
 * 2) Move the weight in the opposite direction of the gradient by subtracting a ratio of the gradient from the weight.

This ratio influences the speed and quality of learning; it is called the learning rate. The sign of the gradient of a weight indicates the direction in which the error increases, which is why the weight must be updated in the opposite direction.

Repeat phase 1 and 2 until the performance of the network is satisfactory.
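The weight update of phase 2 can be sketched for a single weight as follows. This is a minimal illustration that assumes a plain gradient-descent step; the function name `update_weight` and the numeric values are hypothetical and chosen only for the example.

```python
def update_weight(weight, output_delta, input_activation, learning_rate):
    """Return the weight after one gradient-descent step (illustrative helper)."""
    # Step 1: the gradient for this weight is the product of the delta at the
    # receiving neuron and the activation that feeds into the weight.
    gradient = output_delta * input_activation
    # Step 2: move against the gradient; learning_rate is the "ratio".
    return weight - learning_rate * gradient


# Hypothetical values, for illustration only.
new_weight = update_weight(weight=0.5, output_delta=0.2,
                           input_activation=1.0, learning_rate=0.1)
```

A larger learning rate takes bigger steps per update, which can speed up learning but may also overshoot a minimum.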

Modes of learning
There are two modes of learning to choose from: on-line (incremental) learning and batch learning. In on-line learning, each propagation is followed immediately by a weight update. In batch learning, many propagations occur before the weights are updated. Batch learning requires more memory capacity, but on-line learning requires more updates.
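The difference between the two modes can be sketched on a single linear neuron with one weight. This is an illustrative example only: the neuron, the half squared error, the function names and the data are assumptions made for the sketch, not part of the description above.

```python
def gradient(w, x, target):
    """Gradient of 0.5 * (target - w * x) ** 2 with respect to w."""
    return -(target - w * x) * x


def train_online(w, data, learning_rate):
    """One pass over the data, updating the weight after every example."""
    updates = 0
    for x, target in data:
        w -= learning_rate * gradient(w, x, target)
        updates += 1
    return w, updates


def train_batch(w, data, learning_rate):
    """One pass over the data, accumulating gradients and updating once."""
    total = 0.0
    for x, target in data:
        total += gradient(w, x, target)  # accumulated, not yet applied
    w -= learning_rate * total
    return w, 1


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up samples of target = 2 * x
w_online, n_online = train_online(0.0, data, learning_rate=0.05)
w_batch, n_batch = train_batch(0.0, data, learning_rate=0.05)
```

For the same pass over the data, the on-line version applies one update per example, while the batch version stores the accumulated gradient and applies a single update.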

Algorithm
Actual algorithm for a 3-layer network (only one hidden layer):


 * Initialize the weights in the network (often randomly)
 * Do
   * For each example e in the training set
     * O = neural-net-output(network, e) ; forward pass
     * T = teacher output for e
     * Calculate error (T - O) at the output units
     * Compute delta_wh for all weights from hidden layer to output layer ; backward pass
     * Compute delta_wi for all weights from input layer to hidden layer ; backward pass continued
     * Update the weights in the network
 * Until all examples classified correctly or stopping criterion satisfied
 * Return the network
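The steps above can be sketched in runnable form. This is one possible reading under stated assumptions, not a definitive implementation: it assumes logistic (sigmoid) units, a half squared error, on-line updates, a bias input fixed at 1.0, and a fixed number of passes as the stopping criterion. All names (`train`, `forward`, `total_error`, the keys `"wi"` and `"wh"`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(net, inputs):
    """Forward pass; returns the hidden and output activations."""
    x = inputs + [1.0]  # constant 1.0 acts as the bias input (assumption)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in net["wi"]]
    h = hidden + [1.0]
    output = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in net["wh"]]
    return hidden, output


def total_error(net, examples):
    """Sum of half squared errors over all examples and output units."""
    return sum(0.5 * (t - o) ** 2
               for inputs, targets in examples
               for t, o in zip(targets, forward(net, inputs)[1]))


def train(examples, n_in, n_hidden, n_out, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    # Initialize the weights randomly; the extra column holds the bias weight.
    net = {
        "wi": [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)],
        "wh": [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)],
    }
    for _ in range(epochs):  # stopping criterion: fixed number of passes
        for inputs, targets in examples:
            hidden, output = forward(net, inputs)  # forward pass
            # Output deltas: error (T - O) times the sigmoid derivative. With
            # this sign the delta is the negative gradient, so it is added below.
            delta_out = [(t - o) * o * (1 - o) for t, o in zip(targets, output)]
            # Backward pass: hidden deltas, computed before any weight changes.
            delta_hid = [
                h * (1 - h) * sum(delta_out[k] * net["wh"][k][j] for k in range(n_out))
                for j, h in enumerate(hidden)
            ]
            # Update hidden-to-output weights, then input-to-hidden weights.
            for k in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    net["wh"][k][j] += learning_rate * delta_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(inputs + [1.0]):
                    net["wi"][j][i] += learning_rate * delta_hid[j] * v
    return net


# Usage on the XOR problem (a made-up training set for illustration).
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
untrained = train(xor, 2, 3, 1, epochs=0)  # same seed, so same initial weights
trained = train(xor, 2, 3, 1)
```

Comparing `total_error(untrained, xor)` with `total_error(trained, xor)` shows how far training reduced the error; how many passes are needed depends on the initial weights and the learning rate.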

As the algorithm's name implies, the errors propagate backwards from the output nodes to the inner nodes. Technically speaking, backpropagation calculates the gradient of the error of the network with respect to the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error. Often the term "backpropagation" is used in a more general sense, to refer to the entire procedure encompassing both the calculation of the gradient and its use in stochastic gradient descent. Backpropagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited.
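The claim that backpropagation computes the gradient of the error with respect to the weights can be checked numerically. The sketch below assumes a deliberately tiny network (one input, one hidden unit, one output, logistic activations, no biases) and a half squared error; the function names and numeric values are hypothetical. It compares the chain-rule gradient with a finite-difference estimate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def error(w1, w2, x, target):
    """Half squared error of a 1-1-1 sigmoid network without biases."""
    hidden = sigmoid(w1 * x)
    output = sigmoid(w2 * hidden)
    return 0.5 * (target - output) ** 2


def backprop_gradient(w1, w2, x, target):
    """Gradient of the error with respect to (w1, w2) via the chain rule."""
    hidden = sigmoid(w1 * x)
    output = sigmoid(w2 * hidden)
    # Delta at the output unit: derivative of the error with respect to the
    # unit's net input (true gradient sign, so updates would subtract it).
    delta_out = (output - target) * output * (1 - output)
    # Delta at the hidden unit: the output delta propagated backwards.
    delta_hid = delta_out * w2 * hidden * (1 - hidden)
    # Gradient of each weight = delta of its target unit * its input activation.
    return delta_hid * x, delta_out * hidden


def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    g1 = (error(w1 + eps, w2, x, target) - error(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (error(w1, w2 + eps, x, target) - error(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


# Hypothetical weights, input and target.
analytic = backprop_gradient(0.3, -0.7, 1.5, 1.0)
numeric = numerical_gradient(0.3, -0.7, 1.5, 1.0)
```

The two results should agree to within the accuracy of the finite-difference approximation, which is a common way to test a hand-written backward pass.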

Backpropagation networks are necessarily multilayer perceptrons (usually with one input, one hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer networks must have non-linear activation functions for the multiple layers: a multilayer network using only linear activation functions is equivalent to some single-layer linear network. Non-linear activation functions that are commonly used include the logistic function, the softmax function, and the Gaussian function.
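The equivalence of a purely linear multilayer network to a single linear layer can be illustrated with a small example. The weights and input below are made up, and biases are omitted for brevity; the point is only that applying two weight matrices in sequence gives the same result as applying their product once.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output,
# with the identity (linear) activation everywhere.
w_hidden = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
w_output = [[1.0, -1.0, 0.5]]

x = [3.0, 4.0]
two_layer = matvec(w_output, matvec(w_hidden, x))

# The same mapping as a single layer whose weights are the matrix product.
w_single = matmul(w_output, w_hidden)
one_layer = matvec(w_single, x)
```

Inserting a non-linear function between the two multiplications breaks this equivalence, which is what lets the hidden layer contribute something a single layer cannot.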

The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode.

It is also closely related to the Gauss–Newton algorithm, and is also part of continuing research in neural backpropagation.

Multithreaded backpropagation
Backpropagation is an iterative process that can often take a great deal of time to complete. When multicore computers are used, multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.

The training data is broken up into equally large batches for each of the threads. Each thread executes the forward and backward propagations. The weight and threshold deltas are summed for each of the threads. At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network. This process continues for each iteration. This multithreaded approach to backpropagation is used by the Encog Neural Network Framework.
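The structure of this scheme can be sketched as follows. This is an illustration only and not the Encog implementation: it assumes a single linear neuron with one weight and a half squared error, and every name in it is hypothetical. In standard CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so the sketch shows the split, compute, sum and apply pattern rather than an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_gradient(w, batch):
    """Summed gradient of 0.5 * (target - w * x) ** 2 over one batch."""
    return sum(-(target - w * x) * x for x, target in batch)


def split(data, n_parts):
    """Split data into n_parts batches of (nearly) equal size."""
    return [data[i::n_parts] for i in range(n_parts)]


def parallel_step(w, data, learning_rate, n_threads):
    """One training iteration with the gradient computed by several threads."""
    batches = split(data, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each thread computes the summed gradient for its own batch;
        # list(...) collects the results once every batch is finished.
        partial = list(pool.map(lambda b: batch_gradient(w, b), batches))
    # All threads are done: sum the per-thread results and apply them once.
    return w - learning_rate * sum(partial)


data = [(float(x), 2.0 * x) for x in range(1, 9)]  # made-up samples of target = 2 * x
w_parallel = parallel_step(0.0, data, learning_rate=0.001, n_threads=4)
w_serial = 0.0 - 0.001 * batch_gradient(0.0, data)
```

Because the weights are read but not modified while the threads run, the summed result matches what a single thread would compute over the whole training set.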

Limitations

 * Convergence under backpropagation learning can be very slow.
 * Convergence is not guaranteed.
 * The result may converge to any local minimum on the error surface rather than the global minimum, since gradient descent only follows the local slope of a surface that is generally not convex.
 * Backpropagation learning usually needs input scaling or normalization to perform well.
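The input scaling mentioned in the last point can take several forms. The sketch below shows one common choice, min-max scaling of each input feature to the range [0, 1]; the function name and the data are hypothetical, and other schemes (such as standardizing to zero mean and unit variance) are equally possible.

```python
def min_max_scale(rows):
    """Rescale each column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Made-up inputs whose two features differ in scale by several orders of magnitude.
raw = [[1000.0, 0.1], [2000.0, 0.3], [3000.0, 0.2]]
scaled = min_max_scale(raw)
```

After scaling, both features span the same range, so neither dominates the weighted sums in the first layer purely because of its units.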