Notes for Neural Networks and Deep Learning
Using neural nets to recognize handwritten digits
How the backpropagation algorithm works
The two assumptions we need about the cost function
The Hadamard product
$s \odot t$:
- the elementwise product of two vectors $s$ and $t$ of the same dimension: $(s \odot t)_j = s_j t_j$
- called Hadamard product or Schur product
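As a small illustration of the elementwise product, here is a minimal sketch in plain Python. The function name `hadamard` is a hypothetical helper chosen for this example, not something from the notes.

```python
def hadamard(s, t):
    """Elementwise (Hadamard) product of two equal-length vectors."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same length")
    return [s_j * t_j for s_j, t_j in zip(s, t)]


# Each component is multiplied with the component in the same position.
print(hadamard([1, 2], [3, 4]))  # [3, 8]
```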
The four fundamental equations behind backpropagation
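- $\delta^l_j$: a measure of the error in the $j$th neuron of layer $l$
- $\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$
- here $\delta$ can be read as the "inexact derivative", an incremental change at a hidden neuron in layer $l$; it is defined as the partial derivative of the cost function w.r.t. the weighted input (the logit) $z^l_j$ at that neuron.
- $\delta^l$: the vector of errors $\delta^l_j$ for layer $l$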
BP1 - error in the output layer
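- BP1: $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$
- $\frac{\partial C}{\partial a^L_j}$: how fast the cost is changing as a function of the $j$th output activation
- if the cost $C$ doesn't depend much on output neuron $j$, then $\delta^L_j$ will be small
- if using quadratic / squared error, $C = \frac{1}{2} \sum_j (y_j - a^L_j)^2$, and so $\frac{\partial C}{\partial a^L_j} = (a^L_j - y_j)$ (notice the reversal of terms since the derivative of the inner term is -1)
- $\sigma'(z^L_j)$: how fast the activation function $\sigma$ is changing at $z^L_j$
- $z^L_j$ is computed during the forward pass
- BP1 is a componentwise expression for $\delta^L$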
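- BP1a: $\delta^L = \nabla_a C \odot \sigma'(z^L)$
- $\nabla_a C$: a vector whose components are the partial derivatives $\frac{\partial C}{\partial a^L_j}$
- $\nabla_a C$ expresses the rate of change of $C$ w.r.t. the output activations
- equivalent to BP1
- with quadratic cost / squared error we have $\nabla_a C = (a^L - y)$
- so then BP1 becomes $\delta^L = (a^L - y) \odot \sigma'(z^L)$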
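The output-layer error for the quadratic cost can be sketched in a few lines of plain Python. This assumes a sigmoid activation; the helper names (`sigmoid`, `sigmoid_prime`, `output_error`) and the sample numbers are illustrative choices, not part of the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)


def output_error(z_L, y):
    """Output error for quadratic cost: (a^L - y) times sigma'(z^L), elementwise."""
    a_L = [sigmoid(z) for z in z_L]
    return [(a - t) * sigmoid_prime(z) for a, t, z in zip(a_L, y, z_L)]


# Hypothetical logits and targets for a two-neuron output layer.
delta_L = output_error([0.0, 2.0], [1.0, 0.0])
print(delta_L)
```

For the first neuron the activation is 0.5 and the target is 1, so the error is negative: increasing that logit would lower the cost.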
BP2 - An equation for the error in terms of next layer
- BP2: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
- $(w^{l+1})^T$: the transpose of the weight matrix for the next layer, $l+1$
- if we know the error $\delta^{l+1}$ of the next layer, then multiplying it by the transposed weight matrix moves the error backward through the network, giving a measure of the error at the output of layer $l$.
- By taking the Hadamard product $\odot \sigma'(z^l)$, we move the error backward through the activation function in layer $l$, giving us the error $\delta^l$ in the weighted input to layer $l$
- combining BP2 and BP1, we can use BP1 to compute $\delta^L$, then apply BP2 to compute $\delta^{L-1}$, then BP2 again to compute $\delta^{L-2}$, and so on back through the network
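One backward step of this kind can be sketched as follows. The sketch assumes a sigmoid activation and stores the next layer's weights as a list of rows, with `w_next[j][k]` the weight from neuron `k` in layer l to neuron `j` in layer l+1; the function name and the numbers are hypothetical.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backpropagate_error(w_next, delta_next, z):
    """Error at layer l from the error at layer l+1.

    w_next[j][k]: weight from neuron k (layer l) to neuron j (layer l+1).
    delta_next:   error vector of layer l+1.
    z:            logits of layer l.
    """
    # Transposed weight matrix times next-layer error:
    # for each neuron k in layer l, sum over the next-layer neurons j.
    moved_back = [
        sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
        for k in range(len(z))
    ]
    # Elementwise product with the activation derivative at layer l.
    return [m * sigmoid_prime(z_k) for m, z_k in zip(moved_back, z)]


w_next = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]   # 2 neurons in layer l+1, 3 neurons in layer l
delta_next = [0.1, -0.2]
z = [0.0, 0.0, 0.0]
delta = backpropagate_error(w_next, delta_next, z)
print(delta)
```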
BP3 - rate of change of cost w.r.t. any bias in network
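- BP3: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
- the error $\delta^l_j$ is exactly equal to the rate of change $\frac{\partial C}{\partial b^l_j}$.
- BP3-vec: $\frac{\partial C}{\partial b} = \delta$, where $b$ is evaluated at the same neuron as $\delta$.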
BP4 - change of cost w.r.t. any weight in network:
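- BP4: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$
- how to compute the partial derivatives of the cost w.r.t. a weight using $\delta^l$ and $a^{l-1}$, which are already known
- BP4-vec: $\frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out}$
- $a_{\rm in}$ is the activation of the neuron feeding into the weight $w$
- $\delta_{\rm out}$ is the error of the neuron that the weight $w$ feeds into
- the product of these is the partial deriv of the cost w.r.t. the weight
- the partial deriv of the cost w.r.t. the weight is called "the gradient term"
- when $a_{\rm in}$ is near zero, the gradient term is small and the weight learns slowly
- "the output neuron is saturated" when the output neuron has low activation ($\approx 0$) or high activation ($\approx 1$)
- then $\sigma'(z^L_j) \approx 0$, and a weight in the final layer learns slowly or has "stopped" learning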
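The bias and weight gradients for a single layer, and the effect of saturation, can be illustrated with a short sketch. It assumes a sigmoid activation; `layer_gradients` and the sample values are hypothetical names and numbers chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def layer_gradients(a_prev, delta):
    """Gradients for one layer from its error and the previous activations.

    grad_b[j]    = delta[j]
    grad_w[j][k] = a_prev[k] * delta[j]
    """
    grad_b = list(delta)
    grad_w = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    return grad_w, grad_b


grad_w, grad_b = layer_gradients(a_prev=[0.0, 0.5, 1.0], delta=[0.2, -0.4])
# The first input activation is 0, so every weight leaving that neuron
# gets a zero gradient and does not learn from this example.
print(grad_w)

# Saturation: the sigmoid derivative peaks at z = 0 and is tiny for large |z|.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z), sigmoid_prime(z))
```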
Summary
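Collected in one place, the four equations from the BP1 to BP4 sections read as follows. This is a restatement in the usual notation for this chapter, with $L$ the output layer, $z^l$ the logits and $a^l$ the activations of layer $l$.

```latex
\begin{align}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \tag{BP1} \\
\delta^l &= \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \tag{BP2} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j \tag{BP3} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j \tag{BP4}
\end{align}
```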
Proof of the four fundamental equations
- proofs for BP1 and BP2 are provided in terms of base definitions using the chain rule for derivatives
- proofs for BP3 and BP4 are left as an exercise to the reader
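As a sketch of how the chain-rule argument goes, assume the standard definitions $\delta^l_j \equiv \partial C / \partial z^l_j$, $a^l_j = \sigma(z^l_j)$ and $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. For BP1, the sum over output neurons collapses because $a^L_k$ depends on $z^L_j$ only when $k = j$. For BP3 and BP4, a bias $b^l_j$ or weight $w^l_{jk}$ affects the cost only through $z^l_j$.

```latex
\begin{align}
\delta^L_j &= \frac{\partial C}{\partial z^L_j}
            = \sum_k \frac{\partial C}{\partial a^L_k}
                     \frac{\partial a^L_k}{\partial z^L_j}
            = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j) \tag{BP1} \\
\frac{\partial C}{\partial b^l_j}
           &= \frac{\partial C}{\partial z^l_j}
              \frac{\partial z^l_j}{\partial b^l_j}
            = \delta^l_j \cdot 1 = \delta^l_j \tag{BP3} \\
\frac{\partial C}{\partial w^l_{jk}}
           &= \frac{\partial C}{\partial z^l_j}
              \frac{\partial z^l_j}{\partial w^l_{jk}}
            = \delta^l_j \, a^{l-1}_k \tag{BP4}
\end{align}
```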
The backpropagation algorithm
- Input $x$: set the corresponding activation $a^1$ for the input layer
- Feedforward: For each l = 2, 3, ..., L,
- compute the logit $z^l$ at the layer: $z^l = w^l a^{l-1} + b^l$
- compute the activation from the logit at layer l: $a^l = \sigma(z^l)$
- Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- compute the incremental error vector at output layer L
- $\nabla_a C$ is a vector whose components are the partial derivatives $\frac{\partial C}{\partial a^L_j}$
- $\nabla_a C$ expresses the rate of change of $C$ w.r.t. the output activations
- The nabla symbol $\nabla$ is used in vector calculus as part of the names of three distinct differential operators:
- the gradient ($\nabla$)
- the divergence ($\nabla \cdot$)
- the curl ($\nabla \times$)
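- $\odot$ is elementwise multiplication (the Hadamard product)
- $\sigma'(z^L)$ is the derivative of the activation function evaluated at the logit of layer L
- so the incremental error at each output unit equals the partial derivative of the cost w.r.t. that unit's activation multiplied by the derivative of the activation function at that unit's logit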
- the incremental error vector is the vectorized form of that
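- Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$, compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
- compute the incremental error backward, starting from L-1
- $(w^{l+1})^T \delta^{l+1}$ - the transposed weights of the next layer put in matrix multiplication with the errors of the next layer (the partial derivatives of the cost w.r.t. that layer's logits)
- $\odot$ is elementwise multiplication (the Hadamard product)
- $\sigma'(z^l)$ is the derivative of the activation function evaluated at the logit of layer l
- this takes the deriv of the activation function at layer l and puts it in elementwise multiplication with the matrix product of the transposed weights and the errors for the next layer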
- Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
- $w^l_{jk}$ is the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer
- the change of cost w.r.t. a single weight that is leading to the jth unit of layer l from the kth unit of layer l-1 is the activation of the kth unit of layer l-1 times the incremental change at the jth unit of the lth layer
- recall that $\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$, so the incremental change at the jth unit of layer l is defined as the partial derivative of the cost w.r.t. the logit at that unit
- we're computing change of cost w.r.t. a single weight that is leading to the jth unit of layer l from the kth unit of layer l-1
- it is the activation of the kth unit of layer l-1 times the incremental change at the jth unit of the lth layer,
- and the incremental change at the jth unit of the lth layer is the partial derivative of the cost w.r.t. the logit of the unit at layer l
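The whole procedure (input, feedforward, output error, backpropagation, gradient output) can be put together in a minimal plain-Python sketch for a single training example. It assumes a sigmoid activation and the quadratic cost; the function names, the 2-3-2 network shape and the random initial values are illustrative assumptions. Lists are 0-indexed, so `weights[i]` maps `activations[i]` to `zs[i]`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def feedforward(weights, biases, x):
    """Return logits and activations per layer (activations[0] is the input x)."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations


def backprop(weights, biases, x, y):
    """Gradients of C = 1/2 * sum_j (a_j - y_j)^2 for one example."""
    zs, activations = feedforward(weights, biases, x)
    # Output error: (a - y) times sigma'(z) at the last layer.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for i in reversed(range(len(weights))):
        # Bias gradient is the error; weight gradient is a_in * delta_out.
        grad_b[i] = list(delta)
        grad_w[i] = [[a_k * d_j for a_k in activations[i]] for d_j in delta]
        if i > 0:
            # Move the error one layer back through weights[i] and sigma'.
            delta = [
                sum(weights[i][j][k] * delta[j] for j in range(len(delta)))
                * sigmoid_prime(zs[i - 1][k])
                for k in range(len(zs[i - 1]))
            ]
    return grad_w, grad_b


def cost(weights, biases, x, y):
    _, activations = feedforward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


random.seed(0)
sizes = [2, 3, 2]  # hypothetical network: 2 inputs, 3 hidden, 2 outputs
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.5, -0.3], [1.0, 0.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Sanity check of one weight gradient against a central finite difference.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps  # restore the original weight
numeric = (c_plus - c_minus) / (2 * eps)
print(numeric, grad_w[0][1][0])
```

The finite-difference comparison at the end is only a sanity check; the two printed numbers should agree to several decimal places if the backward pass is consistent with the cost.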