Neural Network Training

I recently wrote a post about how perceptron classifiers are used as building blocks for neural networks. Here, I'm going to pick up where that post left off and discuss how multi-layer perceptron models are structured and trained to recognize patterns in data.

The parameters of a neural network model encompass all of the weights at every layer, and training the network on a known dataset amounts to selecting values for those weights that minimize some pre-determined measurement of error (mean squared error, for example). As we update those weight values, we also update the definition of our hidden layers' features so as to improve the performance of the later perceptron layers. In this way, hidden layers can be interpreted as designing good features of the original input that can then be used by the output-layer perceptron classifier to make accurate predictions.
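
As a concrete illustration of the kind of error measurement being minimized, here is a minimal sketch of mean squared error in Python with NumPy; the example predictions and targets are made up for illustration:

```python
import numpy as np

def mean_squared_error(predictions, targets):
    # Average of the squared differences between predictions and known targets.
    return np.mean((predictions - targets) ** 2)

# Hypothetical predictions compared against their true target values.
predictions = np.array([0.9, 0.2, 0.8])
targets = np.array([1.0, 0.0, 1.0])
print(mean_squared_error(predictions, targets))  # 0.03
```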
In order to build a useful neural network, we need to do two things: (1) construct the model in a way that allows it to learn efficiently and (2) train it in a way that leads to correct predictions.

Network Architecture

As I've mentioned before, neural networks are built with an input layer, one or several hidden layers, and an output layer. In feed-forward neural networks (which I will be discussing here), information flows in one direction, from the input layer through the hidden layers until it reaches the output layer. A neural network may contain several hidden layers with varying numbers of hidden nodes, and the ideal architecture for a given model depends on the kind of input data being used and the patterns the model is aiming to detect. On one hand, neural networks with many hidden neurons are able to learn more complicated problems. On the other hand, having too many hidden neurons makes the model more prone to learning specific features of the training examples instead of general patterns that extrapolate beyond the original training dataset. A hidden-layer configuration that balances these concerns is hard to guess up front and is usually determined experimentally.
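
To make the layered structure concrete, here is a minimal sketch of a forward pass through a network with one hidden layer, written in Python with NumPy; the layer sizes, random weights, and logistic activation are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary architecture for illustration: 3 inputs, 4 hidden nodes, 1 output.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 4))  # input-to-hidden weights
W_output = rng.normal(size=(4, 1))  # hidden-to-output weights

def forward(x):
    # Information flows one way: input -> hidden -> output.
    hidden = sigmoid(x @ W_hidden)
    return sigmoid(hidden @ W_output)

print(forward(np.array([0.5, -1.2, 3.0])))
```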

Activation Functions

There are several flavors of activation functions that can be used to threshold perceptron classifications. While the same activation function is usually used throughout a given neural network, in some deep learning models the activation function changes from layer to layer.
One factor to consider when choosing an activation function is that strictly positive outputs can be awkward in multilayer models. If a layer's outputs are all positive while the next layer's transition point sits at zero, a lot of pressure falls on the bias term to compensate, which results in slow learning. For this reason, hyperbolic tangent functions (which saturate at -1 and 1) are sometimes more desirable than logistic functions (which saturate at 0 and 1).
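A quick sketch makes the difference in output ranges visible; NumPy provides tanh directly, and the logistic function is written out by hand here:

```python
import numpy as np

def logistic(x):
    # Saturates at 0 and 1: outputs are always positive.
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(logistic(xs))  # roughly [0.007, 0.269, 0.500, 0.731, 0.993]
print(np.tanh(xs))   # roughly [-1.000, -0.762, 0.000, 0.762, 1.000], centered on zero
```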
Changing the steepness of the activation function can also affect how a neural network learns: the training algorithm will converge faster if the steepness is set to a high value. While this is desirable in some cases, a steep activation function is not a wise choice for a neural network that needs to produce fractional output.
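One way to picture steepness is as a multiplier applied to the input before the activation function; the parameter name below is just for this sketch, not a standard library feature:

```python
import numpy as np

def logistic(x, steepness=1.0):
    # Larger steepness values make the transition around zero sharper,
    # pushing outputs toward hard 0/1 decisions.
    return 1.0 / (1.0 + np.exp(-steepness * x))

x = 0.3
print(logistic(x, steepness=1.0))   # ~0.574, a fractional output
print(logistic(x, steepness=10.0))  # ~0.953, close to a hard decision
```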

Backpropagation

Neural networks are trained using backpropagation, which is essentially the repeated application of the chain rule of calculus to adjust the weights within the model. Backpropagation determines those adjustments by propagating the error backwards from the output layer toward the input layer, updating the weights along the way.
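To show what applying the chain rule backwards looks like in practice, here is a minimal sketch of a single backpropagation step for a one-hidden-layer network with logistic activations and squared error; the layer sizes, learning rate, and omission of bias terms are simplifying assumptions for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 3))  # input-to-hidden weights
W2 = rng.normal(size=(3, 1))  # hidden-to-output weights
learning_rate = 0.5

def train_step(x, target):
    global W1, W2
    # Forward pass: input -> hidden -> output.
    hidden = sigmoid(x @ W1)
    output = sigmoid(hidden @ W2)

    # Error signal at the output layer: derivative of squared error
    # times the derivative of the logistic activation.
    output_delta = (output - target) * output * (1 - output)

    # Chain rule: propagate the error signal back to the hidden layer.
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # Adjust the weights along the way.
    W2 -= learning_rate * np.outer(hidden, output_delta)
    W1 -= learning_rate * np.outer(x, hidden_delta)

train_step(np.array([0.5, -0.2]), target=np.array([1.0]))
```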
The back-propagated error can be calculated incrementally for a single training pattern at a time, or as a sum of the errors over an entire training set (called a batch). The advantage of an incremental training algorithm is that, because the weights are adjusted many times during each training cycle, the training is less likely to get stuck in a local minimum, a state where small changes increase the error even though the model has not yet arrived at the optimal solution. Meanwhile, batch training algorithms are sometimes preferable because they compile global optimization information that is not visible at the scale of a single pattern. Most sources I've found recommend batch training algorithms over incremental ones unless you have a very large number of training patterns.
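The difference between the two approaches comes down to when the weight adjustment is applied. In the sketch below, a toy one-weight model stands in for the network so the example stays short; the learning rate and data are made up:

```python
import numpy as np

# Toy model: fit a single weight w so that w * x approximates y.
rng = np.random.default_rng(2)
xs = rng.normal(size=20)
ys = 3.0 * xs + rng.normal(scale=0.1, size=20)

def gradient(w, x, y):
    # Derivative of the squared error (w * x - y)^2 with respect to w.
    return 2.0 * (w * x - y) * x

# Incremental training: adjust the weight after every single pattern.
w = 0.0
for cycle in range(50):
    for x, y in zip(xs, ys):
        w -= 0.05 * gradient(w, x, y)
print("incremental:", w)

# Batch training: sum the error over the whole set, then adjust once per cycle.
w = 0.0
for cycle in range(50):
    total = sum(gradient(w, x, y) for x, y in zip(xs, ys))
    w -= 0.05 * total / len(xs)
print("batch:", w)
```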
Another consideration when deciding how to train a neural network is the appropriate number of training cycles to run. This decision is influenced by the same tension we saw when discussing network architecture (above): while more training cycles allow the model to learn more complex problems, too many cycles will specialize the model to the nuances of the training examples, and it won't be able to pick out more general patterns in unseen data. The best way to guard against this is to have one set of data used for training and a second set held out for verification after training is complete.
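
A minimal sketch of that hold-out idea, reusing the same toy one-weight model as above to keep the example self-contained; the split ratio and cycle count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = rng.normal(size=100)
ys = 3.0 * xs + rng.normal(scale=0.5, size=100)

# Hold out 20% of the data: train on one set, verify on the other.
split = int(0.8 * len(xs))
train_x, train_y = xs[:split], ys[:split]
valid_x, valid_y = xs[split:], ys[split:]

def mse(w, x, y):
    return np.mean((w * x - y) ** 2)

w = 0.0
for cycle in range(100):
    w -= 0.05 * np.mean(2.0 * (w * train_x - train_y) * train_x)
    if cycle % 25 == 0:
        # If the verification error starts rising while the training error
        # keeps falling, the model is specializing to the training examples.
        print(cycle, mse(w, train_x, train_y), mse(w, valid_x, valid_y))
```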