Learning From Data – A Short Course: Exercise 4.11

Page 152:

In this particular experiment, the black curve (E_{cv}) is sometimes below and sometimes above the red curve (E_{out}). If we repeated this experiment many times, and plotted the average black and red curves, would you expect the black curve to lie above or below the red curve?

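Page 147 states that “In this case, the cross validation estimate will on average be an upper estimate for the out-of-sample error”. The reason is that each leave-one-out error is computed from a hypothesis trained on only N-1 points, while the final hypothesis g is trained on all N points: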

    \[ \mathbb{E}_{D}(E_{cv}) =\bar{E}_{out}(N-1)  \]


    \[ \mathbb{E}_{D}(E_{out}(g)) =\bar{E}_{out}(N) \]

and (I am not sure whether this is proved or just a leap of faith; see also page 141: “The fact that more training data lead to a better final hypothesis has been extensively verified empirically, although it is challenging to prove theoretically”):

    \[ \bar{E}_{out}(N-1) \geq \bar{E}_{out}(N) \]

Hence, on average, the black curve (E_{cv}) will lie above the red curve (E_{out}).
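To make the argument concrete, here is a minimal simulation sketch. It is not the experiment from the book: it assumes a toy setup of my own (noisy linear target, Gaussian inputs, ordinary least squares without an intercept), and the function name `simulate` and all parameter values are hypothetical. Leave-one-out errors are computed with the standard hat-matrix identity for least squares, and E_{out} is computed in closed form, which is valid only under these assumptions.

```python
import numpy as np


def simulate(n_points=8, dim=2, noise_std=1.0, n_trials=20000, seed=0):
    """Average E_cv and E_out over many random data sets (toy setup, assumed).

    Assumptions: x ~ N(0, I), y = w_true . x + Gaussian noise,
    learner = ordinary least squares with no intercept.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(dim)

    # One data set per trial: X has shape (trials, N, d), y has shape (trials, N).
    X = rng.standard_normal((n_trials, n_points, dim))
    y = X @ w_true + noise_std * rng.standard_normal((n_trials, n_points))

    # Least-squares fit on all N points, one fit per trial.
    XtX_inv = np.linalg.inv(np.swapaxes(X, 1, 2) @ X)
    Xty = np.einsum("tnd,tn->td", X, y)
    w = np.einsum("tde,te->td", XtX_inv, Xty)

    # Leave-one-out residuals via the hat-matrix diagonal:
    # e_n = (y_n - yhat_n) / (1 - H_nn), exact for plain least squares.
    h = np.einsum("tnd,tde,tne->tn", X, XtX_inv, X)
    fitted = np.einsum("tnd,td->tn", X, w)
    loo_resid = (y - fitted) / (1.0 - h)
    e_cv = np.mean(loo_resid ** 2, axis=1)

    # With E[x x^T] = I, the out-of-sample squared error of g is
    # noise variance + ||w - w_true||^2.
    e_out = noise_std ** 2 + np.sum((w - w_true) ** 2, axis=1)

    return e_cv.mean(), e_out.mean()


if __name__ == "__main__":
    mean_cv, mean_out = simulate()
    print(f"average E_cv  = {mean_cv:.3f}")
    print(f"average E_out = {mean_out:.3f}")
```

In this toy setup the averaged E_{cv} should come out slightly above the averaged E_{out}, and the gap shrinks as N grows, because the difference between training on N-1 and N points becomes negligible.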



2 comments on “Learning From Data – A Short Course: Exercise 4.11”

    1. Thank you very much for your correction, I did make a terrible mistake.

      However, looking back at the exercise, I realize I took for granted that “average” means averaging over all the possible data sets generated by the target probability distribution, and now I am not sure I have understood the exercise correctly.

      BIG EDIT: I’m sorry for confusing you. It looks like I have understood the “average” correctly as the experiment was already stated as: “We have randomly selected 500 data points as the training data and the remaining are used as a test set for evaluation.”
