Learning From Data – A Short Course: Exercise 3.4

Page 87:

Consider a noisy target y = w^{*T}x + \epsilon for generating the data, where \epsilon is a noise term with zero mean and \sigma^{2} variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus \sigma^{2}.

For the data D = { (x_{1}, y_{1}),...,(x_{N}, y_{N}) }, denote the noise in y_{n} as \epsilon_{n}, let \epsilon = [\epsilon_{1}, \epsilon_{2},..., \epsilon_{N} ]^{T}, and assume that X^{T}X is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to D is given by

    \[ \mathbb{E}[E_{in}(w_{lin})] = \sigma^{2}\left (1 - \frac{d+1}{N} \right ) \]
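Before going through the algebra, here is a small NumPy simulation (my own sanity check, not part of the exercise; the sizes N, d, and \sigma are hypothetical choices) that estimates the expected in-sample error over many noise realizations and compares it with \sigma^{2}(1 - \frac{d+1}{N}):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 100, 5, 0.5   # hypothetical sizes, not from the exercise
trials = 20000

# Fixed design matrix with a bias column, so X is N x (d+1)
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
w_star = rng.standard_normal(d + 1)     # target weights

errs = []
for _ in range(trials):
    eps = sigma * rng.standard_normal(N)
    y = X @ w_star + eps        # noisy targets
    y_hat = H @ y               # linear-regression predictions
    errs.append(np.mean((y_hat - y) ** 2))

expected = sigma**2 * (1 - (d + 1) / N)
print(f"simulated: {np.mean(errs):.4f}, theory: {expected:.4f}")
```

With these settings the simulated average lands very close to the theoretical value, which matches the claim that the in-sample error is deflated by a factor of (d+1)/N.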

The solutions of the first four parts of the exercise can be found here: (a) – (b) – (c) – (d). One note:

The distribution of the noise does not depend on which position a data point occupies in the data set. Hence \epsilon_{1}, \epsilon_{2}, ..., \epsilon_{N} are identically distributed, each following the distribution of the noise term \epsilon.

For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set D_{test} = { (x_{1}, y^{'}_{1}),...,(x_{N}, y^{'}_{N}) }, which shares the same input vectors x_{n} with D but with a different realization of the noise terms. Denote the noise in y^{'}_{n} as \epsilon^{'}_{n} and let \epsilon^{'} = [ \epsilon^{'}_{1}, \epsilon^{'}_{2}, ... ,\epsilon^{'}_{N} ]^{T}. Define E_{test}(w_{lin}) to be the average squared error on D_{test}.

(e) Prove that  \mathbb{E}_{D,\epsilon^{'}}[E_{test}(w_{lin})] = \sigma^{2}\left (1 + \frac{d+1}{N} \right ).

The special test error E_{test} is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in Problem 3.11.

Training data set: y = Xw^{*} + \epsilon. Hence, the final hypothesis: \widehat{y} = Hy = H(Xw^{*} + \epsilon) = Xw^{*} + H\epsilon, since HX = X.

Test data set: y^{'} = Xw^{*} + \epsilon^{'}.

    \[ \widehat{y} - y^{'} = H\epsilon - \epsilon^{'} \]

E_{test}(w_{lin}) = \frac{1}{N}\left \| \widehat{y} - y^{'} \right \|^{2}

\frac{1}{N}\left \| \widehat{y} - y^{'} \right \|^{2} = \frac{1}{N}(H\epsilon - \epsilon^{'})^{T}(H\epsilon - \epsilon^{'}) = \frac{1}{N}(\epsilon^{T}H^{T} - \epsilon^{'T})(H\epsilon - \epsilon^{'}) = \frac{1}{N}\left ( \epsilon^{T}H\epsilon - \epsilon^{T}H\epsilon^{'} - \epsilon^{'T}H\epsilon + \epsilon^{'T}\epsilon^{'} \right )

where we used H^{T}H = H, since H is symmetric and idempotent.
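As a quick aside (my own numerical check, not from the text), the properties of the hat matrix used in this expansion, symmetry, idempotence (so H^{T}H = H), and trace(H) = d + 1, can be verified for a random design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3    # hypothetical sizes
# Design matrix with a bias column, so X is N x (d+1)
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

assert np.allclose(H.T, H)             # H is symmetric
assert np.allclose(H @ H, H)           # H is idempotent, hence H^T H = H
assert np.isclose(np.trace(H), d + 1)  # trace(H) = d + 1
print("hat-matrix checks passed")
```

The trace identity is what produces the (d+1)/N factor in the expectations below.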

From the results of the first part, we have:

 \mathbb{E}_{D,\epsilon^{'}}\left [ \frac{1}{N}\epsilon^{T}H\epsilon \right ] = \sigma^{2}\frac{d+1}{N}

 \mathbb{E}_{D,\epsilon^{'}}\left [ \frac{1}{N}\epsilon^{'T}\epsilon^{'} \right ] = \sigma^{2}

(\epsilon and \epsilon^{'} are two different vectors, but all of their components are drawn from the same noise distribution, so the second expectation follows from the same argument as the first part.)

We also have:

 \mathbb{E}_{D,\epsilon^{'}}[\epsilon^{T}H\epsilon^{'}] = \sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}[H_{ij}]\,\mathbb{E}[\epsilon_{i}]\,\mathbb{E}[\epsilon^{'}_{j}] = 0

 \mathbb{E}_{D,\epsilon^{'}}[\epsilon^{'T}H\epsilon] = \sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}[H_{ij}]\,\mathbb{E}[\epsilon^{'}_{i}]\,\mathbb{E}[\epsilon_{j}] = 0

since X, \epsilon, and \epsilon^{'} are mutually independent and the noise terms have zero mean.

For further explanation, please refer to the solution of the first part.


Combining the four terms:

    \[  \mathbb{E}_{D,\epsilon^{'}}[E_{test}(w_{lin})] = \sigma^{2}\frac{d+1}{N} - 0 - 0 + \sigma^{2} = \sigma^{2}\left (1 + \frac{d+1}{N} \right ) \]
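To double-check the final result, here is a small simulation (again my own, with hypothetical sizes N, d, and \sigma) that draws independent training and test noise on the same inputs, exactly the setup of part (e), and compares the average test error with \sigma^{2}(1 + \frac{d+1}{N}):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 100, 5, 0.5   # hypothetical sizes, not from the exercise
trials = 20000

# Fixed design matrix (with bias column), shared by D and D_test
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
w_star = rng.standard_normal(d + 1)    # target weights

errs = []
for _ in range(trials):
    eps = sigma * rng.standard_normal(N)       # training noise
    eps_test = sigma * rng.standard_normal(N)  # independent test noise
    y_hat = X @ w_star + H @ eps               # \hat{y} = Xw* + H eps
    y_test = X @ w_star + eps_test             # y' = Xw* + eps'
    errs.append(np.mean((y_hat - y_test) ** 2))

expected = sigma**2 * (1 + (d + 1) / N)
print(f"simulated: {np.mean(errs):.4f}, theory: {expected:.4f}")
```

The simulated average sits right on the theoretical value, nicely illustrating the gap of 2\sigma^{2}(d+1)/N between the expected test and in-sample errors.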

