Learning From Data – A Short Course: Exercise 7.9

Page 18

What can go wrong if you just initialize all the weights to exactly zero?



For l < L, \delta^{(l)} = \theta'(s^{(l)}) \otimes [W^{(l+1)}\delta^{(l+1)}]_{1}^{d^{(l)}}, so if W^{(l+1)} is zero then \delta^{(l)} is zero. For l = L, x^{(L)} = \theta((W^{(L)})^{T}x^{(L - 1)}) (with \theta = \tanh or \theta = I), so if W^{(L)} is zero then x^{(L)} is zero.


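The gradient G^{(l)} = x^{(l - 1)}(\delta^{(l)})^{T} then becomes zero for every l < L, and in G^{(L)} only the bias row can be non-zero, because the bias component of x^{(L - 1)} is 1 while its other components are \theta(0) = 0. So gradient descent can never move any other weight away from zero, and the algorithm blindly returns (apart from, at most, the output bias) w = 0 as the final hypothesis.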
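The argument above can be checked numerically. The following is a minimal pure-Python sketch, not code from the book: the function names `backprop` and `zeros`, the 2-3-1 architecture, and the single data point are all assumptions chosen for illustration. It runs one forward and backward pass with \theta = \tanh, squared error, and every weight set to exactly zero.

```python
import math

def backprop(weights, x, y):
    """One forward/backward pass for a tanh network with error (x^(L) - y)^2.

    weights[l][i][j] connects component i of x^(l) (i = 0 is the bias)
    to unit j of layer l + 1. Returns the output x^(L) and the gradients
    G^(l) = x^(l-1) (delta^(l))^T, one per weight matrix.
    """
    xs, ss = [[1.0] + list(x)], []
    for l, W in enumerate(weights):
        n_out = len(W[0])
        s = [sum(W[i][j] * xs[-1][i] for i in range(len(W))) for j in range(n_out)]
        ss.append(s)
        out = [math.tanh(v) for v in s]
        # Every layer except the output gets a bias component fixed at 1.
        xs.append(out if l == len(weights) - 1 else [1.0] + out)
    # delta^(L) = 2 (x^(L) - y) theta'(s^(L)), with theta'(s) = 1 - tanh(s)^2.
    delta = [2 * (xs[-1][j] - y) * (1 - xs[-1][j] ** 2) for j in range(len(xs[-1]))]
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = [[xi * dj for dj in delta] for xi in xs[l]]
        if l > 0:
            # delta^(l) = theta'(s^(l)) * [W^(l+1) delta^(l+1)], bias row excluded.
            W = weights[l]
            delta = [(1 - math.tanh(ss[l - 1][i]) ** 2)
                     * sum(W[i + 1][j] * delta[j] for j in range(len(delta)))
                     for i in range(len(ss[l - 1]))]
    return xs[-1], grads

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# 2 inputs -> 3 hidden units -> 1 output, every weight exactly zero.
weights = [zeros(3, 3), zeros(4, 1)]
output, grads = backprop(weights, x=[0.5, -1.2], y=1.0)
print(output)    # network output x^(L)
print(grads[0])  # G^(1): gradient for the input-to-hidden weights
print(grads[1])  # G^(2): gradient for the hidden-to-output weights
```

Under these assumptions, G^(1) is entirely zero, and in G^(2) only the bias entry (row 0) is non-zero, so no weight attached to a non-bias input ever receives an update.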

Note that this result may not hold for other choices of \theta. If \theta is the standard logistic function (side note: I really hate the confusion around how "sigmoid function" is used), then x^{(L)} = \theta((W^{(L)})^{T}x^{(L - 1)}) = \theta(0) = 0.5, hence it’s likely that G^{(l)} \neq 0, so W^{(l)} eventually becomes non-zero, which together with the other likely non-zero factors leads to \delta^{(l - 1)} \neq 0. The problem here, as suggested by Andrew Ng, is symmetry: all the weights directly connected to an output node share the same value after an update, and \theta'(s^{(l)}) takes the same value for every unit in a layer because their incoming weights are the same. That leads to the same \delta^{(l)}, l < L, for every unit in the same layer, so those units receive identical updates and keep computing the same function. This kind of redundant architecture arises not only when the weights are initialized to zero but also whenever all weights are initialized to the same value. So random initialization will likely break the symmetry.
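The symmetry argument can also be illustrated with a small experiment. This is only a sketch under stated assumptions: the function `train_constant_init`, the one-hidden-layer architecture, the learning rate, and the toy OR data set with 0/1 targets are all hypothetical choices, not taken from the book or from Andrew Ng's material. It runs batch gradient descent on a logistic network whose weights all start at the same constant.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_constant_init(data, hidden=3, init=0.0, eta=0.5, steps=50):
    """Batch gradient descent on a one-hidden-layer logistic network.

    Every weight starts at `init`. data is a list of (x, y) pairs.
    Returns W1 (inputs -> hidden, a list of rows) and W2 (hidden -> output),
    each with the bias weights at index 0.
    """
    d = len(data[0][0])
    W1 = [[float(init)] * hidden for _ in range(d + 1)]
    W2 = [float(init)] * (hidden + 1)
    for _ in range(steps):
        G1 = [[0.0] * hidden for _ in range(d + 1)]
        G2 = [0.0] * (hidden + 1)
        for x, y in data:
            x0 = [1.0] + list(x)
            h = [logistic(sum(W1[i][j] * x0[i] for i in range(d + 1)))
                 for j in range(hidden)]
            x1 = [1.0] + h
            out = logistic(sum(W2[i] * x1[i] for i in range(hidden + 1)))
            # delta^(2) = 2 (x^(2) - y) theta'(s^(2)), theta' = theta (1 - theta).
            delta2 = 2 * (out - y) * out * (1 - out)
            # delta^(1) = theta'(s^(1)) * [W^(2) delta^(2)], bias row excluded.
            delta1 = [h[j] * (1 - h[j]) * W2[j + 1] * delta2 for j in range(hidden)]
            for i in range(hidden + 1):
                G2[i] += x1[i] * delta2 / len(data)
            for i in range(d + 1):
                for j in range(hidden):
                    G1[i][j] += x0[i] * delta1[j] / len(data)
        W2 = [w - eta * g for w, g in zip(W2, G2)]
        W1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, G1)]
    return W1, W2

# OR function with 0/1 targets (a hypothetical toy data set).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
W1, W2 = train_constant_init(data, init=0.0)
for row in W1:
    print(row)  # each row repeats one value: hidden units share incoming weights
print(W2)       # the hidden-to-output weights W2[1:] are identical as well
```

In this sketch the weights do move away from zero, but the three hidden units remain copies of one another, so the network behaves as if it had a single hidden unit. Calling `train_constant_init(data, init=0.3)` shows the same behaviour for a non-zero constant.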

