For the sigmoidal perceptron, $h(\mathbf{x}) = \tanh(\mathbf{w}^{\mathsf T}\mathbf{x})$, let the in-sample error be $E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\bigl(\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n) - y_n\bigr)^2$. Show that:

$$\nabla E_{\text{in}}(\mathbf{w}) = \frac{2}{N}\sum_{n=1}^{N}\bigl(\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n) - y_n\bigr)\bigl(1 - \tanh^2(\mathbf{w}^{\mathsf T}\mathbf{x}_n)\bigr)\,\mathbf{x}_n.$$
If $\|\mathbf{w}\| \to \infty$, what happens to the gradient? How is this related to why it is hard to optimize the perceptron?
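The stated gradient can be obtained by the chain rule, using $\tanh'(s) = 1 - \tanh^2(s)$; a sketch of the derivation:

$$
\begin{aligned}
\nabla E_{\text{in}}(\mathbf{w})
&= \frac{1}{N}\sum_{n=1}^{N} 2\bigl(\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n) - y_n\bigr)\,\nabla_{\mathbf{w}}\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n)\\
&= \frac{2}{N}\sum_{n=1}^{N}\bigl(\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n) - y_n\bigr)\bigl(1 - \tanh^2(\mathbf{w}^{\mathsf T}\mathbf{x}_n)\bigr)\,\mathbf{x}_n.
\end{aligned}
$$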
We observe that as $\|\mathbf{w}\| \to \infty$, the signal $\mathbf{w}^{\mathsf T}\mathbf{x}_n$ grows without bound, so $\tanh(\mathbf{w}^{\mathsf T}\mathbf{x}_n) \to \pm 1$ for every $n$. Hence $1 - \tanh^2(\mathbf{w}^{\mathsf T}\mathbf{x}_n) \to 0$, and therefore

$$\nabla E_{\text{in}}(\mathbf{w}) \to \mathbf{0}.$$
That means that once $\|\mathbf{w}\|$ is large enough, the gradient descent algorithm will make almost no change to $\mathbf{w}$, regardless of how wrong the hypothesis is. Even worse, it may stall while $E_{\text{in}}$ is near its largest possible value ($E_{\text{in}} = 4$, reached when every example is confidently misclassified) and return that poor $\mathbf{w}$ as the final hypothesis. This is why the sigmoidal perceptron is hard to optimize: the flat regions of $\tanh$ give the error surface plateaus where the gradient carries no useful direction.
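The saturation effect is easy to check numerically. The sketch below (with a hypothetical two-point dataset chosen for illustration) evaluates the gradient formula at small and at very large weights; in the latter case both points are badly misclassified, yet the gradient is essentially zero:

```python
import numpy as np

def grad_Ein(w, X, y):
    """Gradient of the in-sample error for the tanh perceptron:
    (2/N) * sum_n (tanh(w.x_n) - y_n) * (1 - tanh^2(w.x_n)) * x_n
    """
    t = np.tanh(X @ w)                       # tanh of the signals w^T x_n
    return (2 / len(y)) * ((t - y) * (1 - t**2)) @ X

# Hypothetical toy data: both points end up misclassified for w > 0
X = np.array([[1.0, 2.0],
              [1.0, -1.5]])
y = np.array([-1.0, 1.0])

g_small = grad_Ein(np.array([0.1, 0.1]), X, y)      # unsaturated regime
g_large = grad_Ein(np.array([100.0, 100.0]), X, y)  # tanh fully saturated

print(np.linalg.norm(g_small))   # noticeably nonzero: descent can progress
print(np.linalg.norm(g_large))   # essentially zero: descent stalls
```

Even though $E_{\text{in}}$ at the large weights is close to its maximum, the $(1 - \tanh^2)$ factor kills the gradient, so gradient descent started (or wandering) into this region stops making progress.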