Chapter 1: Using neural nets to recognize handwritten digits
Contents
Perceptrons
A perceptron takes several binary inputs, x1, x2, …, and produces a single binary output; it can be thought of as making decisions by weighing up evidence.
To put it in more precise algebraic terms:
\(
output =
\begin{cases}
0 & \text{if}\ \sum_j w_j x_j \leq \text{threshold} \\
1 & \text{if}\ \sum_j w_j x_j > \text{threshold}
\end{cases}
\)
Using the bias instead of the threshold, the perceptron rule can be rewritten:
\(
output =
\begin{cases}
0 & \text{if}\ w \cdot x + b \leq 0 \\
1 & \text{if}\ w \cdot x + b > 0
\end{cases}
\)
The bias is a measure of how easy it is to get the perceptron to fire.
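As a minimal sketch of this rule in Python (the function name perceptron_output is just for this illustration, not from the book's code):

import numpy as np

def perceptron_output(w, x, b):
    # Fires (outputs 1) when the weighted input w . x + b is positive,
    # and outputs 0 otherwise.
    return 1 if np.dot(w, x) + b > 0 else 0

print(perceptron_output([2.0, 3.0], [1, 0], -1.0))   # 2*1 + 3*0 - 1 = 1 > 0, so 1
print(perceptron_output([2.0, 3.0], [0, 0], -1.0))   # 0 - 1 = -1 <= 0, so 0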
A perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer.
A perceptron implements NAND:
Only the input 11 produces output 0, since (−2) · 1 + (−2) · 1 + 3 = −1 is negative.
The NAND example shows that we can use perceptrons to compute any logical function, because the NAND gate is universal for computation.
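A small self-contained sketch of this NAND perceptron (weights −2, −2 and bias 3; the function name is illustrative):

def nand_perceptron(x1, x2):
    # Only the input (1, 1) gives (-2)*1 + (-2)*1 + 3 = -1 <= 0,
    # so only it produces output 0.
    z = (-2) * x1 + (-2) * x2 + 3
    return 1 if z > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, nand_perceptron(x1, x2))   # 1, 1, 1, 0 - the NAND truth table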
Sigmoid neurons
The network is learning if repeated small changes to the weights and biases produce better and better output.
However, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behavior of the rest of the network to completely change in some very complicated way.
That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behavior.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron.
The inputs of a sigmoid neuron can take on any value between 0 and 1.
The output is σ(z), where z = w ⋅ x + b is called the weighted input, and σ is called the sigmoid function, and is defined by:
\(\large \sigma (z) \equiv \frac{1}{1 + e^{-z}}\)
This shape is a smoothed out version of a step function:
If σ had in fact been a step function, then the sigmoid neuron would be a perceptron.
The smoothness of σ means that small changes Δw in the weights and Δb in the bias will produce a small change Δoutput in the output from the neuron.
Using σ will also simplify the algebra, because exponentials have lovely properties when differentiated.
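A short sketch of σ and one of those properties (the derivative can be written in terms of σ itself, σ′(z) = σ(z)(1 − σ(z))):

import numpy as np

def sigmoid(z):
    # Smooth squashing function: very negative z gives ~0, very positive z gives ~1.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1 - sigmoid(z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))        # approximately [0.0067, 0.5, 0.9933]
print(sigmoid_prime(z))  # largest at z = 0, small in the flat tails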
The architecture of neural networks
The leftmost layer is called the input layer, and the rightmost layer is called the output layer, while the middle layer is called the hidden layer.
Feedforward neural networks contain no loops: information is always fed forward, never fed back.
Recurrent neural networks contain loops, and their neurons fire for some limited duration of time before becoming quiescent. They are much closer in spirit to how our brains work than feedforward networks.
A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we’d have 4,096 = 64 × 64 input neurons, with the intensities scaled appropriately between 0 and 1.
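A minimal sketch of this encoding, assuming the raw pixels are 8-bit greyscale values (0–255):

import numpy as np

# Stand-in for a real 64 x 64 greyscale image with 8-bit pixel values.
image = np.random.randint(0, 256, size=(64, 64))

# Scale intensities into [0, 1] and flatten into a column of 4,096 input activations.
input_activations = image.reshape(4096, 1) / 255.0
print(input_activations.shape)   # (4096, 1)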
A simple network to classify handwritten digits
We’ll focus on writing a program to classify individual digits. We do this because the segmentation problem turns out not to be so difficult to solve once you have a good way of classifying individual digits. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation.
To recognize individual digits we will use a three-layer neural network:
The activation of each output neuron can loosely be thought of as a probability-like score, so we take the digit corresponding to the output neuron with the highest activation as the network’s guess.
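For example, with ten output neurons (one per digit 0–9), the guess is simply the index of the largest activation; a sketch with made-up activations:

import numpy as np

# Made-up output activations for the ten neurons corresponding to digits 0-9.
output_activations = np.array([0.02, 0.01, 0.05, 0.03, 0.01,
                               0.02, 0.88, 0.04, 0.02, 0.01])
print(np.argmax(output_activations))   # 6: the network's guess for this input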
A way to build intuition about what the hidden layer does:
Suppose that the first neuron in the hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs.
In a similar way, let’s suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:
So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that’s not the only sort of evidence, but it seems safe at least in this case.
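A rough sketch of that intuition (the shapes, weights, and bias below are illustrative, not taken from a trained network): a hidden neuron whose weights mirror an image fragment responds strongly when the input contains that fragment.

import numpy as np

# The fragment this hypothetical hidden neuron "looks for" in a 28 x 28 input.
template = np.zeros((28, 28))
template[5:10, 10:18] = 1.0

weights = template.flatten()   # heavy weights on overlapping pixels, zero elsewhere
bias = -20.0                   # enough of the fragment must be present to fire

def hidden_activation(image):
    z = np.dot(weights, image.flatten()) + bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

print(hidden_activation(template))            # close to 1: fragment present
print(hidden_activation(np.zeros((28, 28))))  # close to 0: fragment absent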
Learning with gradient descent
To quantify how well we’re achieving this goal we define a cost function:
\(C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2\)
Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input and y is the corresponding desired output, and the sum is over all training inputs, x.
So the aim of our training algorithm will be to minimize the cost C(w, b) as a function of the weights and biases.
(We cannot directly maximize the number of images correctly classified, because that number is not a smooth function of the weights and biases in the network.)
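A small sketch of the quadratic cost, with made-up outputs for n = 2 training inputs and 3 output neurons each:

import numpy as np

def quadratic_cost(outputs, desired):
    # C(w, b) = (1 / 2n) * sum over x of || y(x) - a ||^2,
    # where a is the network output and y(x) the desired output for input x.
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, desired)) / (2.0 * n)

outputs = [np.array([0.8, 0.1, 0.1]), np.array([0.2, 0.7, 0.1])]   # network outputs a
desired = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]   # desired outputs y(x)
print(quadratic_cost(outputs, desired))   # (0.06 + 0.14) / 4, approximately 0.05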
The gradient descent update rule in terms of components:
\(\large w_k \longrightarrow w_k - \eta \frac{\partial C}{\partial w_k}\)
\(\large b_l \longrightarrow b_l - \eta \frac{\partial C}{\partial b_l}\)
which comes from \(\Delta C \approx \nabla C \cdot \Delta v\)
Gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C. (Only locally, because it considers just the first-order derivatives.)
The learning rate η is often varied so that the above approximation remains good (not too big) while the algorithm isn’t too slow (not too small).
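A one-step sketch of the component-wise update, with placeholder gradients (in a real network the partial derivatives come from backpropagation):

import numpy as np

eta = 3.0                          # learning rate
weights = np.array([0.5, -0.3])
bias = 0.1
grad_w = np.array([0.02, -0.01])   # placeholder values for dC/dw_k
grad_b = 0.005                     # placeholder value for dC/db

weights = weights - eta * grad_w   # w_k -> w_k - eta * dC/dw_k
bias = bias - eta * grad_b         # b   -> b   - eta * dC/db
print(weights, bias)               # [0.44 -0.27] 0.085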
An idea called stochastic gradient descent can be used to speed up learning.
It works by picking out a small number m of randomly chosen training inputs, referred to as a mini-batch.
Provided the sample size m is large enough, we expect that the average value of the \(\nabla C_{x_j}\) will be roughly equal to the average over all \(\nabla C_x\).
We continue until we’ve exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
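A sketch of one epoch of mini-batch SGD, assuming some update_mini_batch(mini_batch, eta) routine that averages the gradients over the mini-batch and applies one update (the book's network.py structures its SGD method along these lines):

import random

def sgd_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    # Shuffle, slice into mini-batches, and apply one update per mini-batch.
    random.shuffle(training_data)
    mini_batches = [training_data[k:k + mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)
    # Every training input has now been used once: one epoch is complete.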
In online learning, a neural network learns from just one training input at a time (just as human beings do).
Implementing our network to classify digits
https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py
$ python2
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
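Here [784, 30, 10] creates a network with 784 input neurons, 30 hidden neurons, and 10 output neurons, and the SGD call trains for 30 epochs with mini-batches of size 10 and a learning rate η = 3.0, evaluating on test_data after each epoch.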
Toward deep learning
For the face-detection problem, a network breaks a very complicated question (does this image show a face or not?) down into sub-problems, then into …, and finally into very simple questions answerable at the level of single pixels, with later layers building up a hierarchy of ever more complex and abstract concepts.
Abstraction takes a different form in neural networks than it does in conventional programming, but it’s just as important.