Backpropagation in Convolutional (Neural) Network

Neural networks and deep learning, Chapter 6:

Backpropagation in a convolutional network: The core equations of backpropagation in a network with fully-connected layers are (BP1)-(BP4) (link). Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above. How are the equations of backpropagation modified?

In this post I do not follow Michael Nielsen’s notation from Neural networks and deep learning above. Instead I use the notation of Learning From Data – A Short Course: x^{(l)}_{j} = \theta(s^{(l)}_{j}), where \theta is the activation function, s^{(l)}_{j} = w^{T}x^{(l-1)} (roughly), and E is the cost function.

So we have 3 layers, not counting the input layer:

  • L_{0}: Input layer.
  • L_{1}: Convolutional layer.
  • L_{2}: Pooling layer.
  • L_{3}: Output layer.

We also have weights W_{1}, connecting L_{0} and L_{1}, and W_{3}, connecting L_{2} and L_{3}. There is no W_{2} because I believe the pooling layer has no parameters.
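To make the layer layout concrete, here is a minimal NumPy sketch of the forward pass. Everything in it is an assumption for illustration, not part of the original setup: a single 2D kernel with stride 1 and no padding, no biases, \theta = tanh, non-overlapping 2x2 max-pooling, and the hypothetical helper names conv2d_valid, max_pool and forward.

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide kernel w over image x (stride 1, no padding, no kernel flip)."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    s = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            s[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return s

def max_pool(x, size=2):
    """Non-overlapping size x size max-pooling; dimensions must divide evenly."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def forward(x0, W1, W3):
    s1 = conv2d_valid(x0, W1)   # L_1 signal
    x1 = np.tanh(s1)            # L_1 output, theta = tanh (assumed)
    s2 = max_pool(x1)           # L_2 signal: max over each pooling window
    x2 = s2                     # L_2 output: identity, no parameters
    s3 = W3.T @ x2.ravel()      # L_3 signal
    return np.tanh(s3)          # L_3 output

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 5))    # input layer L_0
W1 = rng.normal(size=(2, 2))    # kernel: 4x4 conv output, 2x2 after pooling
W3 = rng.normal(size=(4, 3))    # 4 pooled units to 3 output units
print(forward(x0, W1, W3).shape)  # (3,)
```

Note that W1 here is only the small kernel matrix, shared by every unit of the convolutional layer, while W3 is an ordinary fully-connected weight matrix.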

We have:

    \[ \frac{\partial E}{\partial s^{(2)}_{j}} = \frac{\mathrm{d} x^{(2)}_{j}}{\mathrm{d} s^{(2)}_{j}}\frac{\partial E}{\partial x^{(2)}_{j}} \]

It’s easy to compute  \frac{\partial E}{\partial x^{(2)}_{j}}. How about  \frac{\mathrm{d} x^{(2)}_{j}}{\mathrm{d} s^{(2)}_{j}}? If we define  x^{(2)}_{j} = \max(s^{(2)}_{j}) (max-pooling) or x^{(2)}_{j} =\left \| s^{(2)}_{j} \right \| = \sqrt{ \left \| s^{(2)}_{j} \right \|^{2}} (L2 pooling), then  s^{(2)}_{j} must be a vector. It is more convenient to define  x^{(2)}_{j} = I(s^{(2)}_{j}) (identity function) for max-pooling or  x^{(2)}_{j} = \sqrt{s^{(2)}_{j}} for L2 pooling, so that  s^{(2)}_{j} is still a number. There are no parameters to learn at the pooling layer, so we skip the gradient update step there.
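As a small worked step under the scalar definitions above (and assuming  s^{(2)}_{j} > 0 in the L2 case, so that the square root is differentiable), the derivative in question would be:

    \[ x^{(2)}_{j} = s^{(2)}_{j} \;\Rightarrow\; \frac{\mathrm{d} x^{(2)}_{j}}{\mathrm{d} s^{(2)}_{j}} = 1, \qquad x^{(2)}_{j} = \sqrt{s^{(2)}_{j}} \;\Rightarrow\; \frac{\mathrm{d} x^{(2)}_{j}}{\mathrm{d} s^{(2)}_{j}} = \frac{1}{2\sqrt{s^{(2)}_{j}}} = \frac{1}{2x^{(2)}_{j}} \]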

    \[ \frac{\partial E}{\partial s^{(1)}_{j}} = \frac{\mathrm{d} x^{(1)}_{j}}{\mathrm{d} s^{(1)}_{j}}\frac{\partial E}{\partial x^{(1)}_{j}} = \frac{\mathrm{d} x^{(1)}_{j}}{\mathrm{d} s^{(1)}_{j}}\sum_{i=1}^{\text{no.pooling layer's units}}\frac{\partial s^{(2)}_{i}}{\partial x^{(1)}_{j}}\frac{\partial E}{\partial s^{(2)}_{i}} \]

As hinted above, at this step we can define  s^{(2)}_{i} = \max(..., x^{(1)}_{j},...) for max-pooling and  s^{(2)}_{i} = ... + \left (x^{(1)}_{j}  \right )^{2} + ... for L2 pooling. For the derivative of the max function, see Derivative of the f(x,y)=\min(x,y).
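With these definitions,  \frac{\partial s^{(2)}_{i}}{\partial x^{(1)}_{j}} is 1 for the unit that attains the maximum of window i and 0 for the others (max-pooling), or  2x^{(1)}_{j} for every unit in window i (L2 pooling). A rough NumPy sketch of that sum, assuming 1D non-overlapping windows and no ties in the max; the function names are hypothetical:

```python
import numpy as np

def max_pool_backward(x1, grad_s2, size=2):
    """dE/dx1 from dE/ds2 when s2_i is the max of window i."""
    windows = x1.reshape(-1, size)
    grad = np.zeros_like(windows)
    rows = np.arange(windows.shape[0])
    # Only the unit that attained the max receives the gradient.
    grad[rows, windows.argmax(axis=1)] = grad_s2
    return grad.ravel()

def l2_pool_backward(x1, grad_s2, size=2):
    """dE/dx1 from dE/ds2 when s2_i is the sum of squares of window i."""
    windows = x1.reshape(-1, size)
    # d(sum of squares)/dx_j = 2 * x_j for each unit in the window.
    return (2.0 * windows * grad_s2[:, None]).ravel()

x1 = np.array([0.1, 0.7, -0.3, 0.2])   # conv layer outputs, two windows of 2
grad_s2 = np.array([1.0, -2.0])        # dE/ds2 for the two pooling units
print(max_pool_backward(x1, grad_s2))  # values 0, 1, 0, -2
print(l2_pool_backward(x1, grad_s2))   # values 0.2, 1.4, 1.2, -0.8
```

Because the windows do not overlap here, each  x^{(1)}_{j} feeds exactly one pooling unit and the sum over i collapses to a single term. Multiplying the result elementwise by  \frac{\mathrm{d} x^{(1)}_{j}}{\mathrm{d} s^{(1)}_{j}} would then give  \frac{\partial E}{\partial s^{(1)}_{j}}.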

However, what plays the vital role is how we compute  \frac{\partial E}{\partial w^{(1)}_{ij}}.

Dear future self, you should remember that at this time I suck at multivariate calculus (do not be surprised, I got this far on the power of single-variable calculus, special thanks to Herbert Gross), so DO NOT TRUST ME blindly from this step on. Thanks.

Well, by the chain rule, I guess this is how we calculate  \frac{\partial E}{\partial w^{(1)}_{ij}}:

    \[ \frac{\partial E}{\partial w^{(1)}_{ij}} = \sum_{k=1}^{\text{no.conv layer's units}}\frac{\partial s^{(1)}_{k}}{\partial w^{(1)}_{ij}}\frac{\partial E}{\partial s^{(1)}_{k}} \]

Here  w^{(1)}_{ij} is the value of the weight at position (i, j) of the convolutional layer’s kernel matrix.
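One way to sanity-check this formula is to compare it against finite differences. The sketch below assumes a single 2D kernel with stride 1 and no padding, so that for the conv unit at position (r, c) we have  \frac{\partial s^{(1)}_{rc}}{\partial w^{(1)}_{ij}} = x^{(0)}_{r+i,c+j}, and it uses a toy cost  E = \frac{1}{2}\sum (s^{(1)})^{2} chosen only for the check. The function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide kernel w over image x (stride 1, no padding, no kernel flip)."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    s = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            s[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return s

def kernel_grad(x0, delta1, kshape):
    """dE/dw[i, j] = sum over conv units (r, c) of x0[r+i, c+j] * dE/ds1[r, c]."""
    out_h, out_w = delta1.shape
    grad = np.empty(kshape)
    for i in range(kshape[0]):
        for j in range(kshape[1]):
            grad[i, j] = np.sum(x0[i:i + out_h, j:j + out_w] * delta1)
    return grad

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 5))
w = rng.normal(size=(2, 2))

def cost(w):
    # Toy cost (an assumption for this check): dE/ds1 is simply s1.
    return 0.5 * np.sum(conv2d_valid(x0, w) ** 2)

analytic = kernel_grad(x0, conv2d_valid(x0, w), w.shape)

# Central finite differences as an independent estimate of dE/dw.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        step = np.zeros_like(w)
        step[i, j] = eps
        numeric[i, j] = (cost(w + step) - cost(w - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```

The sum over all conv units appears because the same kernel weight is shared by every unit. In the full network,  \frac{\partial E}{\partial s^{(1)}_{k}} would come from the pooling step above rather than from this toy cost.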

That’s all, I think.

