Learning From Data – A Short Course: Exercise 1.10

Page 23, Exercise 1.10

Here is an experiment that illustrates the difference between a single bin and multiple bins. Run a computer simulation for flipping 1,000 fair coins. Flip each coin independently 10 times. Let’s focus on 3 coins as follows: c_{1} is the first coin flipped; c_{rand} is a coin you choose at random; c_{min} is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let \nu_{1}, \nu_{rand}, \nu_{min} be the fraction of heads you obtain for the respective three coins.

  (a) What is \mu for the three coins selected?
  (b) Repeat this entire experiment a large number of times (e.g., 100,000 runs of the entire experiment) to get several instances of \nu_{1}, \nu_{rand} and \nu_{min}, and plot the histograms of the distributions of \nu_{1}, \nu_{rand} and \nu_{min}. Notice that the coins that end up being c_{rand} and c_{min} may differ from one run to another.
  (c) Using (b), plot estimates for \mathbb{P} \left [ \left | \nu - \mu \right | > \epsilon \right ] as a function of \epsilon, together with the Hoeffding bound 2e^{-2\epsilon^{2}N} (on the same graph).
  (d) Which coins obey the Hoeffding bound, and which ones do not? Explain why.
  (e) Relate part (d) to the multiple bins in Figure 1.10.
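The experiment above can be sketched directly in Python. This is a minimal sketch (variable names are my own, and it uses 2,000 runs rather than the suggested 100,000 to keep it fast):

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_coins, n_flips, mu = 2000, 1000, 10, 0.5

nu_1 = np.empty(n_runs)
nu_rand = np.empty(n_runs)
nu_min = np.empty(n_runs)

for r in range(n_runs):
    flips = rng.integers(0, 2, size=(n_coins, n_flips))  # 1 = heads
    nu = flips.mean(axis=1)                              # fraction of heads per coin
    nu_1[r] = nu[0]                                      # the first coin
    nu_rand[r] = nu[rng.integers(n_coins)]               # a coin chosen at random
    nu_min[r] = nu.min()                                 # the coin with fewest heads

# Estimate P[|nu - mu| > eps] for each tracked coin and compare
# with the Hoeffding bound 2*exp(-2*eps^2*N) at one sample eps.
eps = 0.3
hoeffding = 2 * np.exp(-2 * eps**2 * n_flips)
for name, nu_x in [("nu_1", nu_1), ("nu_rand", nu_rand), ("nu_min", nu_min)]:
    print(name, np.mean(np.abs(nu_x - mu) > eps), "bound:", hoeffding)
```

Running this, \nu_{1} and \nu_{rand} stay under the bound, while \nu_{min} exceeds it by a wide margin.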

Let’s make a correspondence here:

  • \nu_{x} corresponds to the fraction of heads you obtain in a sample of 10 flips of coin x.
  • \mu_{x} corresponds to the fraction of heads in the (unknown) “bin” of coin x, which is the probability of getting heads when flipping coin x.

c_{1}, c_{rand} and c_{min} are all fair coins, so \mu_{1} = \mu_{rand} = \mu_{min} = 0.5.

Look at the plot here and read the discussions [1] [2]. Here is my guess: all 1,000 coins share the same \mu, any two flips (of the same coin or of different coins) are independent of each other, and the event that a coin is randomly selected is independent of its flip results, so c_{rand} can be treated as a specific coin. Hence \nu_{1} and \nu_{rand} have the same distribution, which is binomial. However, \nu_{min} has a different distribution, and it is not binomial.

For example:

    \[ \mathbb{P}(\nu_{min}=0) = 1 - \prod_{i=1}^{1000}\mathbb{P}(\nu_{i} \neq 0) = 1 - \prod_{i=1}^{1000}\left(1 - \mathbb{P}(\nu_{i} = 0)\right) = 1 - \left(1 - 0.5^{10}\right)^{1000} \approx 0.6235762 \quad (*) \]

since \mathbb{P}(\nu_{i} = 0) = 0.5^{10} for every i = 1, \dots, 1000.

You can see the correspondence between (*) and the example in Lecture 02’s video, and you can also generalize the result above. Also notice that if we apply the Hoeffding Inequality to c_{min}, we get  \mathbb{P}[|\nu_{min} - \mu_{min}| > 0.4] \leq 2e^{-2\times0.4^{2}\times10} \approx 0.08, which contradicts the fact that  \mathbb{P}[|\nu_{min} - \mu_{min}| > 0.4] \geq \mathbb{P}[\nu_{min}=0] \approx 0.6235762. So we can’t apply the Hoeffding Inequality to c_{min}: it is not a single, fixed coin, but a coin selected after looking at the flip results.
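The numbers above can be checked with a few lines of Python (a quick sanity check of the closed form in (*) and of the single-coin Hoeffding bound):

```python
import math

p_zero = 0.5 ** 10                       # P(a given fair coin shows 0 heads in 10 flips)
p_min_zero = 1 - (1 - p_zero) ** 1000    # (*): P(at least one of 1,000 coins shows 0 heads)
hoeffding = 2 * math.exp(-2 * 0.4**2 * 10)  # bound for a single, fixed coin at eps = 0.4

print(p_min_zero)   # ≈ 0.6236, matching (*)
print(hoeffding)    # ≈ 0.08
```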

We also have the following correspondence:

  • One coin flip corresponds to one marble.
  • A sample of 10 coin flips corresponds to a sample of 10 marbles.
  • Each coin has its own imaginary bin.
  • 100,000 runs of the entire experiment correspond to 100,000 rounds of drawing 10 marbles from each of the 1,000 bins.

Takeaway point: probability is based on the information we have. In the case of a hypothesis set, the hypothesis is chosen based on the training set and the learning algorithm. However, in our theoretical analysis so far, we do not use any specifics of the training set or of the learning algorithm, so the only information we have is that the final hypothesis is one of the hypotheses in the hypothesis set, hence the M factor in  2Me^{-2\epsilon^{2}N}.

What we have:

    \[ \mathbb{P} \left [ \left | E_{in}(h) - E_{out}(h) \right | > \epsilon \right ] \leq p \]

What we want to know:

    \[ \mathbb{P} \left [ \left | E_{in}(g) - E_{out}(g) \right | > \epsilon \,\middle|\, g = LearningAlgorithm(H) \right ] \]

And we know that g can be h_{1} or h_{2} or … or h_{n}, so by the union bound:

 \mathbb{P} \left [ \left | E_{in}(g) - E_{out}(g) \right | > \epsilon \,\middle|\, g = LearningAlgorithm(H) \right ] \leq \mathbb{P} \left [ \left | E_{in}(h_{1}) - E_{out}(h_{1}) \right | > \epsilon \right ] + ... + \mathbb{P} \left [ \left | E_{in}(h_{n}) - E_{out}(h_{n}) \right | > \epsilon \right ]
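This bound can be checked numerically with the coin experiment standing in for the hypothesis set (a sketch with arbitrary parameter choices of my own; picking the minimum coin plays the role of the learning algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs, M, N, eps, mu = 5000, 50, 10, 0.35, 0.5  # M "hypotheses" (coins), N flips each

bad_g = 0              # runs where the selected coin deviates by more than eps
bad_each = np.zeros(M)  # runs where each fixed coin deviates by more than eps

for _ in range(n_runs):
    nu = rng.integers(0, 2, size=(M, N)).mean(axis=1)  # fraction of heads per coin
    g = np.argmin(nu)                                  # "learning algorithm": pick the minimum
    bad_g += abs(nu[g] - mu) > eps
    bad_each += np.abs(nu - mu) > eps

lhs = bad_g / n_runs           # estimate of P[|E_in(g) - E_out(g)| > eps]
rhs = bad_each.sum() / n_runs  # sum of the M individual deviation probabilities
print(lhs, "<=", rhs)
```

The left side exceeds the single-hypothesis bound 2e^{-2\epsilon^{2}N}, but stays below the union-bound sum, as the inequality above says.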

