Learning From Data – A Short Course: Exercise 3.10

Page 98:

(a) Define an error for a single data point (x_{n}, y_{n}) to be

    \[ e_{n}(w) = \max(0,-y_{n}w^{T}x_{n}) \]

Argue that PLA can be viewed as SGD on e_{n} with learning rate \eta = 1.

When -y_{n}w^{T}x_{n} < 0, the signal w^{T}x_{n} agrees in sign with y_{n} (no error at that point), so e_{n}(w) = 0 and \nabla e_{n}(w) = 0.

When w^{T}x_{n} and y_{n} disagree (that point is misclassified), e_{n}(w) = -y_{n}w^{T}x_{n} and \nabla e_{n}(w) = -y_{n}x_{n}.


The SGD update on a single point is w(t+1) = w(t) - \eta \nabla e_{n}(w(t)). With \eta = 1:

When there is no error at the data point x_{n}: w(t+1) = w(t) - 0 = w(t).

When the data point is misclassified: w(t+1) = w(t) + y_{n}x_{n}.

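This is the PLA update rule: the weights change only on a misclassified point, and then by exactly y_{n}x_{n}. The only difference is that SGD may pick a correctly classified point, in which case the step leaves w unchanged. So SGD with this error measure and \eta = 1 is no different from PLA.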
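The argument in (a) can be checked numerically. The sketch below is only an illustration: the helper names (dot, grad_e, sgd_step, pla_step) and the sample numbers are made up for this post, and it assumes that the boundary case y_{n}w^{T}x_{n} = 0, where e_{n} is not differentiable, is treated as misclassified.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def grad_e(w, x, y):
    """Gradient of e_n(w) = max(0, -y * w.x).

    Assumption: at the kink y * w.x == 0 we pick the subgradient -y * x,
    i.e. the point is treated as misclassified.
    """
    if y * dot(w, x) <= 0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)

def sgd_step(w, x, y, eta=1.0):
    """One SGD step: w <- w - eta * grad e_n(w)."""
    g = grad_e(w, x, y)
    return [wi - eta * gi for wi, gi in zip(w, g)]

def pla_step(w, x, y):
    """One PLA step: w <- w + y * x if (x, y) is misclassified, else no change."""
    if y * dot(w, x) <= 0:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return list(w)

# Hypothetical weights and a single input, tried with both labels.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
for y in (+1, -1):
    print(y, sgd_step(w, x, y), pla_step(w, x, y))
```

For both labels the two printed weight vectors coincide, which is the claim of part (a) for \eta = 1.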

(b) For logistic regression with a very large w, argue that minimizing E_{in} using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.

For logistic regression the pointwise error is e_{n}(w) = \ln(1+e^{-y_{n}w^{T}x_{n}}), so we have:

    \[ \nabla e_{n}(w) = \frac {-y_{n}x_{n}}{1+e^{y_{n}w^{T}x_{n}}} \]

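If \|w\| is very large, then |y_{n}w^{T}x_{n}| is large for every point that is not very close to the decision boundary. When y_{n} and w^{T}x_{n} agree (there is no error at data point x_{n}), e^{y_{n}w^{T}x_{n}} is very large, hence \frac {-y_{n}x_{n}}{1+e^{y_{n}w^{T}x_{n}}} is very small (near zero). When y_{n} and w^{T}x_{n} disagree (the data point x_{n} is misclassified), e^{y_{n}w^{T}x_{n}} is very small, hence \frac {-y_{n}x_{n}}{1+e^{y_{n}w^{T}x_{n}}} is close to -y_{n}x_{n}.

So an SGD step leaves w almost unchanged on a correctly classified point and moves it by approximately \eta y_{n}x_{n} on a misclassified one, which is the PLA update up to the learning rate \eta.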
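The limiting behaviour in (b) can also be seen numerically. This is a rough sketch under stated assumptions: the function names, the direction w0, the point x and the scale factors are all hypothetical values chosen for illustration. It evaluates the logistic gradient as w0 is scaled up, once for a misclassified label and once for a correctly classified one.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def logistic_grad(w, x, y):
    """Gradient of e_n(w) = ln(1 + exp(-y * w.x)), i.e. -y*x / (1 + exp(y * w.x)).

    The factor 1 / (1 + exp(s)) is computed in a form that avoids
    overflow in math.exp when |s| is large.
    """
    s = y * dot(w, x)
    if s >= 0:
        factor = math.exp(-s) / (1.0 + math.exp(-s))
    else:
        factor = 1.0 / (1.0 + math.exp(s))
    return [-y * xi * factor for xi in x]

# Hypothetical direction and data point; w0.x = -1, so y = +1 is misclassified.
w0 = [1.0, -2.0]
x = [1.0, 1.0]
for scale in (1.0, 10.0, 100.0):
    w = [scale * wi for wi in w0]
    print(scale, logistic_grad(w, x, +1), logistic_grad(w, x, -1))
```

As the scale grows, the gradient for the misclassified label approaches -y_{n}x_{n} and the gradient for the correctly classified label approaches zero, matching the two cases above.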

