Notes for Neural Networks and Deep Learning

Using neural nets to recognize handwritten digits

How the backpropagation algorithm works

The two assumptions we need about the cost function

The Hadamard product

$s \odot t$:

  • the elementwise product of two vectors $s$ and $t$: $(s \odot t)_j = s_j t_j$ (see the numpy check below)
  • called the Hadamard product or Schur product
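A quick sanity check of the Hadamard product in numpy (a minimal sketch; the vector values are arbitrary):

    import numpy as np

    # Hadamard (elementwise) product: (s ⊙ t)_j = s_j * t_j
    s = np.array([1.0, 2.0, 3.0])
    t = np.array([4.0, 5.0, 6.0])

    print(s * t)  # numpy's * on arrays is elementwise -> [ 4. 10. 18.]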

The four fundamental equations behind backpropagation

  • measure of error: $\delta^l_j \equiv \partial C / \partial z^l_j$
      • here $\delta^l_j$ is the "inexact derivative" or incremental amount changed in the $j$th neuron of layer $l$, and it's set to be the partial derivative of the cost function w.r.t. the logit (weighted input) $z^l_j$ at that neuron (written out below).
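Written out, the book's definitions restated in LaTeX for reference:

    % error of neuron j in layer l: sensitivity of the cost to that neuron's weighted input
    \delta^l_j \equiv \frac{\partial C}{\partial z^l_j},
    \qquad \text{where} \qquad
    z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j
    \quad\text{and}\quad
    a^l_j = \sigma(z^l_j).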

BP1 - error in the output layer

  • BP1: $\delta^L_j = \partial C/\partial a^L_j \cdot \sigma'(z^L_j)$
    • $\partial C/\partial a^L_j$:
      • how fast the cost is changing as a function of output activation $a^L_j$
      • if cost C doesn't depend much on output neuron j, then $\delta^L_j$ will be small
      • if using quadratic / squared error, $C = \frac{1}{2}\sum_j (y_j - a^L_j)^2$, and so $\partial C/\partial a^L_j = (a^L_j - y_j)$ (notice the reversal of terms, since the derivative of the inner term is $-1$)
    • $\sigma'(z^L_j)$: how fast the activation function $\sigma$ is changing at $z^L_j$
    • $z^L_j$ is computed during the forward pass
    • BP1 is a componentwise expression for $\delta^L$
  • BP1a: $\delta^L = \nabla_a C \odot \sigma'(z^L)$
    • $\nabla_a C$:
      • a vector whose components are the partial derivatives $\partial C/\partial a^L_j$
      • expresses the rate of change of C w.r.t. the output activations
    • equivalent to BP1, in matrix/vector form
    • with quadratic cost / squared error we have $\nabla_a C = (a^L - y)$
      • so then BP1 becomes $\delta^L = (a^L - y) \odot \sigma'(z^L)$ (see the sketch after this list)
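A minimal numpy sketch of BP1a for the quadratic cost (sigmoid activation assumed; z_L and y are made-up forward-pass values):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1.0 - sigmoid(z))

    # forward-pass values for a 3-neuron output layer (arbitrary numbers)
    z_L = np.array([0.5, -1.2, 2.0])   # weighted inputs (logits) at layer L
    a_L = sigmoid(z_L)                 # output activations
    y   = np.array([1.0, 0.0, 0.0])    # target vector

    # BP1a with quadratic cost: δ^L = (a^L - y) ⊙ σ'(z^L)
    delta_L = (a_L - y) * sigmoid_prime(z_L)
    print(delta_L)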

BP2 - an equation for the error in terms of the error in the next layer

  • BP2: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
    • $(w^{l+1})^T$: the transpose of the weight matrix for the next layer
    • if we know the error $\delta^{l+1}$ of the next layer, then when we multiply the transposed weight matrix by it, we are moving the error backward through the network, giving a measure of the error at the output of layer l.
    • By taking the Hadamard product with $\sigma'(z^l)$, we move the error backward through the activation function in layer l, giving us the error $\delta^l$ in the weighted input to layer l
    • combining BP1 and BP2, we can use BP1 to compute $\delta^L$, then apply BP2 to compute $\delta^{L-1}$, then BP2 again to compute $\delta^{L-2}$, and so on backward through the network (see the sketch after this list)
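A minimal sketch of one BP2 step (shapes and values are made up; w_next and delta_next stand for the next layer's weight matrix and error vector):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1.0 - sigmoid(z))

    # layer l has 4 neurons, layer l+1 has 3 neurons (arbitrary sizes)
    w_next     = np.random.randn(3, 4)   # w^{l+1}: rows index neurons in layer l+1
    delta_next = np.random.randn(3)      # δ^{l+1}, already computed
    z_l        = np.random.randn(4)      # weighted inputs at layer l from the forward pass

    # BP2: δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)
    delta_l = np.dot(w_next.T, delta_next) * sigmoid_prime(z_l)
    print(delta_l.shape)  # (4,)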

BP3 - rate of change of cost w.r.t. any bias in network

  • BP3: $\partial C/\partial b^l_j = \delta^l_j$
    • the error $\delta^l_j$ is exactly equal to the rate of change of the cost w.r.t. the bias, $\partial C/\partial b^l_j$ (derivation sketched after this list)
  • BP3-vec: $\partial C/\partial b = \delta$, where b is evaluated at the same neuron as $\delta$.
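The derivation is one application of the chain rule (the book leaves this proof as an exercise; sketched here):

    % z^l_j depends on b^l_j with coefficient 1, so the chain rule collapses
    \frac{\partial C}{\partial b^l_j}
      = \frac{\partial C}{\partial z^l_j} \, \frac{\partial z^l_j}{\partial b^l_j}
      = \delta^l_j \cdot 1
      = \delta^l_j,
    \qquad \text{since } z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j .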

BP4 - rate of change of cost w.r.t. any weight in network

  • BP4: $\partial C/\partial w^l_{jk} = a^{l-1}_k \delta^l_j$
    • how to compute the partial derivative of the cost w.r.t. a weight using $\delta^l$ and $a^{l-1}$, which are already known
  • BP4-vec: $\partial C/\partial w = a_{\mathrm{in}} \delta_{\mathrm{out}}$
    • $a_{\mathrm{in}}$ is the activation of the neuron input to the weight w
    • $\delta_{\mathrm{out}}$ is the error of the neuron output from the weight w
    • the product of these is the partial deriv of the cost w.r.t. the weight (see the sketch after this list)
      • the partial deriv of the cost w.r.t. the weight is called "the gradient term"
      • when $a_{\mathrm{in}}$ is near zero ($a_{\mathrm{in}} \approx 0$), the weight learns slowly
    • "the output neuron is saturated" when the output neuron has low activation ($a \approx 0$) or high activation ($a \approx 1$)
      • then $\sigma'(z)$ is near zero, so learning happens slowly or has "stopped"
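Given δ^l and the previous layer's activations, both the bias and weight gradients fall out directly (a sketch with made-up shapes):

    import numpy as np

    # δ^l for a 4-neuron layer, activations a^{l-1} from a 5-neuron layer (arbitrary)
    delta_l = np.random.randn(4)
    a_prev  = np.random.randn(5)

    # BP3: ∂C/∂b^l_j = δ^l_j  -- the bias gradient is the error itself
    nabla_b = delta_l

    # BP4: ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j  -- an outer product gives the full matrix
    nabla_w = np.outer(delta_l, a_prev)   # shape (4, 5), matching w^l

    print(nabla_b.shape, nabla_w.shape)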

Summary

backpropagation equations summary
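The four equations in one place, restated in LaTeX from the sections above:

    \begin{align}
      \delta^L &= \nabla_a C \odot \sigma'(z^L)                          && \text{(BP1)} \\
      \delta^l &= \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)  && \text{(BP2)} \\
      \frac{\partial C}{\partial b^l_j} &= \delta^l_j                    && \text{(BP3)} \\
      \frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j    && \text{(BP4)}
    \end{align}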

Proof of the four fundamental equations

  • proofs for BP1 and BP2 are provided in terms of the base definitions, using the chain rule for derivatives (the BP1 chain-rule step is sketched after this list)
  • proofs for BP3 and BP4 are left as an exercise to the reader
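For reference, the chain-rule argument behind BP1 (a sketch; the sum over k collapses because $a^L_k = \sigma(z^L_k)$ depends only on $z^L_k$):

    % start from the definition of the output-layer error and apply the chain rule
    \delta^L_j \equiv \frac{\partial C}{\partial z^L_j}
      = \sum_k \frac{\partial C}{\partial a^L_k} \, \frac{\partial a^L_k}{\partial z^L_j}
      = \frac{\partial C}{\partial a^L_j} \, \frac{\partial a^L_j}{\partial z^L_j}
      = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j).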

The backpropagation algorithm

  1. Input x: set the corresponding activation $a^1$ for the input layer
  2. Feedforward: For each l = 2, 3, ..., L,
    1. compute the logit z at the layer: $z^l = w^l a^{l-1} + b^l$
    2. compute the activation from the logit at layer l: $a^l = \sigma(z^l)$
  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
    • compute the incremental error vector at output layer L
    • $\nabla_a C$ is a vector whose components are the partial derivatives $\partial C/\partial a^L_j$
      • expresses the rate of change of C w.r.t. the output activations
      • the nabla $\nabla$ is used in vector calculus as part of the names of distinct differential operators; here $\nabla_a$ denotes the gradient taken w.r.t. the output activations
    • $\odot$ is elementwise (Hadamard) multiplication
    • $\sigma'(z^L)$ is the derivative of the activation function evaluated at the logit of layer L
    • so the incremental error at each output unit equals the gradient of the cost w.r.t. the output activation multiplied by the derivative of the activation function at that unit's logit
    • the incremental error vector $\delta^L$ is the vectorized form of that
  4. Backpropagate the error: For each l = L-1, L-2, ..., 2, compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
    • compute the incremental error backward, starting from layer L-1
    • $(w^{l+1})^T \delta^{l+1}$ - the transposed weights of the next layer in matrix multiplication with the errors of the next layer
    • $\odot$ is elementwise (Hadamard) multiplication
    • $\sigma'(z^l)$ is the derivative of the activation function at the logit of layer l
    • this takes the derivative of the activation function at layer l and puts it in elementwise multiplication with the matrix product of the next layer's transposed weights and errors
  5. Output: The gradient of the cost function is given by $\partial C/\partial w^l_{jk} = a^{l-1}_k \delta^l_j$ and $\partial C/\partial b^l_j = \delta^l_j$ (see the sketch after this list)
    • $w^l_{jk}$ is the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer: backprop notation image
    • the change of cost w.r.t. a single weight leading to the jth unit of layer l from the kth unit of layer l-1 is the activation of the kth unit of layer l-1 times the incremental error at the jth unit of layer l
    • recall that $\delta^l_j \equiv \partial C/\partial z^l_j$, so the incremental error at the jth unit of layer l is the partial derivative of the cost w.r.t. the logit at that unit
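A compact sketch of the five steps above in numpy (this is not the book's network.py code; sigmoid activation and quadratic cost are assumed, and weights/biases are hypothetical lists of per-layer arrays):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1.0 - sigmoid(z))

    def backprop(x, y, weights, biases):
        """Return (nabla_b, nabla_w): gradients of the quadratic cost for one example.

        weights[i], biases[i] hold w^l, b^l for layer l = i + 2, e.g. a network with
        sizes [784, 30, 10] has weight shapes (30, 784) and (10, 30).
        """
        # 1. Input: set the activation for the input layer
        activation = x
        activations = [x]   # a^l for every layer
        zs = []             # z^l for every layer

        # 2. Feedforward: z^l = w^l a^{l-1} + b^l, a^l = σ(z^l)
        for w, b in zip(weights, biases):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # 3. Output error (BP1, quadratic cost): δ^L = (a^L - y) ⊙ σ'(z^L)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_b = [np.zeros(b.shape) for b in biases]
        nabla_w = [np.zeros(w.shape) for w in weights]
        nabla_b[-1] = delta                                    # BP3
        nabla_w[-1] = np.outer(delta, activations[-2])         # BP4

        # 4. Backpropagate: δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)
        for l in range(2, len(weights) + 1):
            delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta                                # BP3
            nabla_w[-l] = np.outer(delta, activations[-l - 1]) # BP4

        # 5. Output: gradient lists for every layer's biases and weights
        return nabla_b, nabla_w

The returned nabla_b and nabla_w lists can then be plugged into a gradient-descent update of the biases and weights.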

The code for backpropagation

In what sense is backpropagation a fast algorithm?

Backpropagation: the big picture

Improving the way neural networks learn

A visual proof that neural nets can compute any function

Why are deep neural networks hard to train?

Deep learning
