(a) Define an error for a single data point $(\mathbf{x}_n, y_n)$ to be $e_n(\mathbf{w}) = \max(0,\, -y_n \mathbf{w}^T \mathbf{x}_n)$.
Argue that PLA can be viewed as SGD on $e_n$ with learning rate $\eta = 1$.
When $y_n \mathbf{w}^T \mathbf{x}_n > 0$, $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$ agrees with $y_n$ (no error at that point): $e_n(\mathbf{w}) = 0$.
When $y_n \mathbf{w}^T \mathbf{x}_n < 0$, $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$ and $y_n$ disagree (that point is misclassified): $e_n(\mathbf{w}) = -y_n \mathbf{w}^T \mathbf{x}_n$.
When there is no error at the data point $(\mathbf{x}_n, y_n)$: $\nabla e_n(\mathbf{w}) = \mathbf{0}$, so the SGD step leaves $\mathbf{w}$ unchanged, just as PLA does.
When the data point is misclassified: $\nabla e_n(\mathbf{w}) = -y_n \mathbf{x}_n$, so the SGD update with $\eta = 1$ is $\mathbf{w} \leftarrow \mathbf{w} - \nabla e_n(\mathbf{w}) = \mathbf{w} + y_n \mathbf{x}_n$, which is exactly the PLA update.
Hence SGD with this error measure and $\eta = 1$ is no different from PLA.
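As a quick numerical sanity check (a sketch with made-up weights and a made-up point), the PLA update on a misclassified point can be compared directly with the SGD step on $e_n(\mathbf{w}) = \max(0, -y_n \mathbf{w}^T \mathbf{x}_n)$ at $\eta = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)   # current weights (hypothetical values)
x = rng.normal(size=3)   # a single data point (hypothetical values)
y = -np.sign(w @ x)      # choose the label so that the point is misclassified

# PLA update on a misclassified point: w <- w + y * x
w_pla = w + y * x

# SGD step on e(w) = max(0, -y * w.x) with eta = 1:
# the (sub)gradient is -y * x when -y * w.x > 0, and 0 otherwise
grad = -y * x if -(y * (w @ x)) > 0 else np.zeros_like(w)
w_sgd = w - 1.0 * grad

# the two updates coincide
assert np.allclose(w_pla, w_sgd)
```

On a correctly classified point the subgradient is zero, so both methods leave $\mathbf{w}$ unchanged as well.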
(b) For logistic regression with a very large $\|\mathbf{w}\|$, argue that minimizing $E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\!\big(1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n}\big)$ using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.
If $\|\mathbf{w}\|$ is large, then $y_n \mathbf{w}^T \mathbf{x}_n$ is large and positive when $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$ and $y_n$ agree (there is no error at data point $(\mathbf{x}_n, y_n)$), hence the SGD gradient $\nabla e_n(\mathbf{w}) = -\dfrac{y_n \mathbf{x}_n}{1 + e^{y_n \mathbf{w}^T \mathbf{x}_n}}$ will be very small (near zero) and the update is negligible, just as PLA makes no update. If $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$ and $y_n$ disagree (the data point is misclassified), then $e^{y_n \mathbf{w}^T \mathbf{x}_n}$ is very small, hence $\nabla e_n(\mathbf{w})$ will be near $-y_n \mathbf{x}_n$, so the SGD step is approximately $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_n \mathbf{x}_n$, the PLA update scaled by $\eta$.
So the above statement follows.
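A small sketch (with a hypothetical data point and weight direction) illustrates the two limiting cases of the logistic gradient at a large-norm $\mathbf{w}$:

```python
import numpy as np

x = np.array([1.0, 0.5, -0.3])           # hypothetical data point
w = 100.0 * np.array([0.2, -1.0, 0.4])   # weight vector with a large norm; w.x = -42

def logistic_grad(w, x, y):
    # gradient of e(w) = ln(1 + exp(-y * w.x)) for a single data point
    return -y * x / (1.0 + np.exp(y * (w @ x)))

# y = -1 agrees with sign(w.x): the denominator blows up, gradient ~ 0,
# so SGD makes essentially no update, as PLA would
g_agree = logistic_grad(w, x, y=-1.0)
assert np.allclose(g_agree, 0.0, atol=1e-12)

# y = +1 disagrees (misclassified): the denominator is ~ 1, gradient ~ -y * x,
# so the SGD step w <- w - eta * grad is ~ w + eta * y * x, the PLA update
g_disagree = logistic_grad(w, x, y=1.0)
assert np.allclose(g_disagree, -1.0 * x, atol=1e-12)
```

The larger $\|\mathbf{w}\|$ gets, the sharper this dichotomy becomes, which is the content of the argument above.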