Learning From Data – A Short Course: Exercise 4.7

Page 139:

Fix g^{-} (learned from D_{train}) and define \sigma^{2}_{val} =^{def} Var_{D_{val}}[E_{val}(g^{-})]. We consider how \sigma^{2}_{val} depends on K. Let

    \[ \sigma^{2}(g^{-}) = Var_{x}[e(g^{-}(x), y)] \]

be the pointwise variance in the out-of-sample error of g^{-}.

(a) Show that \sigma^{2}_{val} = \frac{1}{K}\sigma^{2}(g^{-}).

We have:

 \sigma^{2}_{val} =^{def} Var_{D_{val}}[E_{val}(g^{-})] = Var_{D_{val}}\left [\frac{1}{K}\sum_{x_{n} \in D_{val}}e(g^{-}(x_{n}), y_{n})\right]  = \frac{1}{K^{2}}Var_{D_{val}} \left [\sum_{x_{n} \in D_{val}}e(g^{-}(x_{n}), y_{n}) \right]

Because each x_{n} is independent from each other, so:

 \frac{1}{K^{2}}Var_{D_{val}} \left [\sum_{x_{n} \in D_{val}}e(g^{-}(x_{n}), y_{n}) \right] = \frac{1}{K^{2}}K \times Var_{D_{val}} [e(g^{-}(x_{n}), y_{n})] = \frac{1}{K}Var_{x} [e(g^{-}(x_{n}), y_{n})] = \frac{1}{K}\sigma^{2}(g^{-})

Reference: Properties of variance and covariance.

(b) In a classification problem, where e(g^{-}(x), y) = [[g^{-}(x) \neq y]], express \sigma^{2}_{val} in terms of  \mathbb{P}[g^{-}(x) \neq y].

First we observe that:

E_{D_{val}}[e(g^{-}(x),y)] = \sum_{x \in D_{val}}[[g^{-}(x) \neq y]]\mathbb{P}(x) = \sum_{x \in D_{val}}\mathbb{P}(x \cap g^{-}(x) \neq y) = \mathbb{P}(g^{-}(x) \neq y)


 E_{D_{val}}[e(g^{-}(x),y)^{2}] = E_{D_{val}}[[[g^{-}(x) \neq y]]^{2}] = E_{D_{val}}[e(g^{-}(x),y)]


 \sigma^{2}_{val} = \frac{1}{K}Var_{x} [e(g^{-}(x_{n}), y_{n})] = \frac{1}{K}(E_{x}(e(g^{-}(x_{n}), y_{n})^{2}) - E_{x}(e(g^{-}(x_{n}), y_{n}))^{2}) = \frac{1}{K}(\mathbb{P}(g^{-}(x) \neq y) - (\mathbb{P}(g^{-}(x) \neq y))^{2})

(c) Show that for any g^{-} in a classification problem, \sigma^{2}_{val} \leq \frac{1}{4K}.

First we consider the function:  f(x) = x - x^{2}, x \in [0;1], we have:  f'(x) = 1 - 2x, x \in [0;1] and  f'(x) = 0 \Leftrightarrow x = 0.5, x \in [0;1]. So f(x) reaches maxima value at x = 0.5 and that is f(0.5) = 0.25, which also means f(x) \leq 0.25.

Similarly, we have:

    \[ \frac{1}{K}(\mathbb{P}(g^{-}(x) \neq y) - (\mathbb{P}(g^{-}(x) \neq y))^{2}) \leq \frac{1}{4K} \]

(d) Is there a uniform upper bound for Var[E_{val}(g^{-})] similar to (c) in the case of regression with squared error e(g^{-}(x), y) = (g^{-}(x) - y)^{2}?

Because the squared error is unbounded hence the variance of it cannot be bounded. However, the result \sigma^{2}_{val} = \frac{1}{K}\sigma^{2}(g^{-}) (a) suggests that large K may help reduce the variance.

(e) For regression with squared error, if we train using fewer points (smaller N - K) to get g^{-}, do you expect  \sigma^{2}(g^{-}) to be higher or lower?

We have:

 \sigma^{2}(g^{-}) = E[((g^{-}(x) - y)^{2} - E[(g^{-}(x) - y)^{2}])^{2}] = E[(g^{-}(x) - y)^{4}] - (E[(g^{-}(x) - y)^{2}])^{2}

As we use fewer points to train, g^{-} gets worse. As g^{-} gets worse,  E[(g^{-}(x) - y)^{4}] and  E[(g^{-}(x) - y)^{2}] gets higher value, hence  E[(g^{-}(x) - y)^{4}] - (E[(g^{-}(x) - y)^{2}])^{2} often gets higher, that also means  \sigma^{2}(g^{-}) is expected to be higher.

(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of E_{out}.

Check the answer for (d) and (e), this question is also discussed at the same page.


2 comments on “Learning From Data – A Short Course: Exercise 4.7”

Leave a Reply

Your email address will not be published. Required fields are marked *