[Book Note] Learning From Data – A Short Course

This book should be read alongside its corresponding online course; watching the course alone, without the book, is not enough.


  • Page 19: Reference: Malik Magdon-Ismail.
  • Page 31: y here is a random variable. Reference: Malik Magdon-Ismail, The Elements of Statistical Learning page 28.
  • Page 32: “we will assume the target to be a probability distribution P(y|x), thus covering the general case”. Note the connection to page 53: “Each D is a point on that canvas. The probability of a point is determined by which x_{n}‘s in X happen to be in that particular D, and is calculated based on the distribution P over X“.
  • Page 49:
  • Page 53: The canvas is the space of all possible data sets of size N.
  • Page 46: “Unless d_{vc}(H) = \infty … be crushed by the \frac {1}{N} factor”; see the note for page 78 below.
  • Page 78: \ln N^{d} = \ln (N \times N \times \dots \times N) = \ln N + \ln N + \dots + \ln N = d\ln N.
  • Page 85: X^{T}X = A.
  • Page 88: Reference: Lecture 08 – Bias-Variance Tradeoff. The expected in-sample error is not the same as the expected out-of-sample error, because the final hypothesis varies with the particular data set the algorithm is trained on.
  • Page 93: Use Taylor series expansion for E_{in}(x) and then substitute the series into expression E_{in}(w(0)+\eta\hat{v})-E_{in}(w(0)). You will need more mathematics to understand that equality more clearly. Why is O(\eta^2) small? Reference.
  • Page 98: w changes at every iteration, so what is meant by: “‘on average’ the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out.”?
    • My possible interpretation:
      • “on average” here means: with the weights held fixed at a given iteration, the expected value of the single-point gradient, taken over the random choice of data point, equals the full batch gradient.
  • Page 99: What is meant by: “The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface.”?
  • Page 104:
    • Why don’t we transform the output vector y too?
    • Pascal triangle: \frac {Q(Q+3)}{2} = 2 + 3 + \dots + Q + (Q + 1).
  • Page 123: We have:  \epsilon = \frac {y_{n} - f(x_{n})}{\sigma} \Rightarrow \sigma\epsilon = y_{n} - f(x_{n}). Reference: Standard Normal Distribution.
  • Page 130:
    • Reference for w_{reg} solution: Lecture 12 Video.
    • “And just like the regular derivative, the gradient points in the direction of greatest increase”, reference: Vector Calculus: Understanding the Gradient.
    • Also, relate w(t + 1) = w(t) + \eta v_{t} from the Gradient Descent algorithm to the statement: “w cannot be optimal” (I am not sure about this understanding).
    • What does it mean by “This means that  \nabla E_{in}(w) has some non-zero component along the constraint surface”?
    • If w^{T}w is small and \lambda_{C} > 0 then \lambda_{C}w^{T}w is small.
  • Page 136: The noisier the target, the more potential final hypotheses the learning algorithm can produce across data sets (hypotheses that would have had no chance of being final had the target been less noisy).
  • Page 139: Are we assuming that the target function is free of stochastic noise? If not, how can “e(g^{-}(x_{n}), y_{n}) depends only on x_{n}“ hold? Doesn’t the error also depend on \mathbb{P}(y_{n} | x_{n})?
  • Page 140: Figure 4.8 does not contradict the figures on page 67. In Figure 4.8 the hypothesis changes as K increases, hence the difference in the generalization error’s behaviour.
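The page-93 equality can be written out with a first-order Taylor expansion (my own reconstruction of the step):

        \[ E_{in}(w(0)+\eta\hat{v}) = E_{in}(w(0)) + \eta\,\nabla E_{in}(w(0))^{T}\hat{v} + O(\eta^{2}) \]

so the change E_{in}(w(0)+\eta\hat{v}) - E_{in}(w(0)) = \eta\,\nabla E_{in}(w(0))^{T}\hat{v} + O(\eta^{2}) is dominated by the linear term when \eta is small, since the O(\eta^{2}) remainder shrinks quadratically in \eta while the first term shrinks only linearly.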
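The page-78 identity \ln N^{d} = d\ln N can be checked numerically (a throwaway sanity check; the values of N and d are arbitrary):

```python
import math

# Sanity check of ln(N^d) = d * ln(N), the identity that keeps the
# polynomial growth function "crushed" by the 1/N factor in the VC bound.
N, d = 1000, 5
lhs = math.log(N ** d)
rhs = d * math.log(N)
assert abs(lhs - rhs) < 1e-9
```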
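As a sketch of the page-85 notation, where A = X^{T}X appears in the one-step linear-regression solution w = A^{-1}X^{T}y (the data below is made up for illustration):

```python
import numpy as np

# Minimal sketch: solve the normal equations A w = X^T y with A = X^T X.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 data points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                          # noiseless targets, so w is recoverable

A = X.T @ X                             # the matrix A = X^T X
w_lin = np.linalg.solve(A, X.T @ y)     # w = A^{-1} X^T y
assert np.allclose(w_lin, w_true)
```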
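For the page-98 question, a small simulation (my own, with made-up data and squared error, not from the book) of the “on average” claim: at a fixed w, the mean of many single-point gradients approaches the batch gradient of E_{in}:

```python
import numpy as np

# At a fixed weight vector w, average many randomly picked per-point
# gradients of the squared error (w.x_n - y_n)^2 and compare with the
# batch gradient of E_in. The random fluctuations cancel out on average.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)

per_point = 2 * (X @ w - y)[:, None] * X   # gradient 2(w.x_n - y_n)x_n, shape (50, 3)
batch_grad = per_point.mean(axis=0)        # gradient of E_in at w

picks = rng.integers(0, 50, size=20000)    # 20000 random single-point picks
sgd_mean = per_point[picks].mean(axis=0)
assert np.allclose(sgd_mean, batch_grad, atol=0.2)
```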
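The page-104 Pascal-triangle count can be verified by enumerating the non-constant monomials x_{1}^{i}x_{2}^{j} with 1 \leq i + j \leq Q of the order-Q polynomial transform of a 2-D input:

```python
# Count monomials x1^i * x2^j with 1 <= i + j <= Q; there are k + 1
# monomials of total degree k, and summing k + 1 for k = 1..Q gives
# the formula Q(Q + 3)/2 from page 104.
def n_terms(Q):
    return sum(1 for i in range(Q + 1) for j in range(Q + 1)
               if 1 <= i + j <= Q)

for Q in range(1, 10):
    assert n_terms(Q) == Q * (Q + 3) // 2
```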
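A toy instance of the page-123 noise model y_{n} = f(x_{n}) + \sigma\epsilon with \epsilon \sim N(0, 1) (the choices of f and \sigma here are illustrative, not from the book):

```python
import math
import random

# Generate one noisy target value and recover the standard-normal noise
# via epsilon = (y_n - f(x_n)) / sigma, the identity in the page-123 note.
random.seed(0)
sigma = 0.5
f = lambda x: math.sin(math.pi * x)

x_n = random.uniform(-1, 1)
epsilon = random.gauss(0, 1)            # epsilon ~ N(0, 1)
y_n = f(x_n) + sigma * epsilon
assert abs((y_n - f(x_n)) / sigma - epsilon) < 1e-9
```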
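For the page-130 notes, a sketch of the regularized one-step solution w_{reg} = (X^{T}X + \lambda I)^{-1}X^{T}y, the standard ridge-regression form (data and \lambda values are illustrative):

```python
import numpy as np

# Solve the regularized normal equations and check that a larger lambda
# shrinks w^T w, as in the lambda_C * w^T w remark.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
d = X.shape[1]

def w_reg(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small_lam = w_reg(0.1)
w_big_lam = w_reg(100.0)
assert w_big_lam @ w_big_lam < w_small_lam @ w_small_lam
```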




  • Is the event | E_{in}(g) - E_{out}(g) | > \epsilon equivalent to the event  \exists h \in  H, | E_{in}(h) - E_{out}(h) | > \epsilon? If yes, then why not:

        \[ \mathbb{P}(| E_{in}(g) - E_{out}(g) | > \epsilon) = \mathbb{P}(\exists h \in H, | E_{in}(h) - E_{out}(h) | > \epsilon) = 1 - \prod_{h \in H}\mathbb{P}(| E_{in}(h) - E_{out}(h) | \leq \epsilon) \]

  • With the same 0 < \epsilon < 1, the VC bound (3.1) on page 78 (for linear classification) requires a larger N than the bound on page 87 (for linear regression). Why?
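One observation on the product question above (my own note, hedged): the equality with the product \prod_{h \in H} would require the events | E_{in}(h) - E_{out}(h) | \leq \epsilon to be independent across hypotheses, but every E_{in}(h) is computed on the same data set D, so the events are dependent. The VC analysis therefore settles for the union bound, which needs no independence:

\[ \mathbb{P}\big(\exists h \in H,\ | E_{in}(h) - E_{out}(h) | > \epsilon\big) \leq \sum_{h \in H} \mathbb{P}\big(| E_{in}(h) - E_{out}(h) | > \epsilon\big) \]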

