This book should be read together with its corresponding online course; do not rely on watching the online course alone.

PAGE NOTE

• Page 19: Reference: Malik Magdon-Ismail.
• Page 31: y here is a random variable. Reference: Malik Magdon-Ismail, The Elements of Statistical Learning page 28.
• Page 32: “we will assume the target to be a probability distribution, thus covering the general case”. Compare with page 53: “Each D is a point on that canvas. The probability of a point is determined by which ‘s in happen to be in that particular , and is calculated based on the distribution over “.
• Page 49:
• We have: , so that explains “each term in the sum is polynomial (of degree i…”.
• Proof of Growth Function when Points are not binary.
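The polynomial nature of that bound can be checked numerically. A minimal sketch, assuming the book's bound on the growth function for a break point k (the function name growth_bound is mine):

```python
from math import comb

# Bound on the growth function when k is a break point:
# m_H(N) <= sum of C(N, i) for i = 0 .. k-1.  Each term C(N, i) is a
# polynomial in N of degree i, so the whole sum has degree k - 1.
def growth_bound(N, k):
    return sum(comb(N, i) for i in range(k))

# With break point k = 3 the bound grows like N^2, far below 2^N:
for N in (5, 10, 20):
    print(N, growth_bound(N, 3), 2 ** N)
```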
• Page 53: The canvas is the space of all possible data sets of size .
• Page 46: “Unless … be crushed by the factor”: see the note for page 78.
• Page 78: .
• Page 85: .
• Page 88: Reference: Lecture 08 – Bias-Variance Tradeoff. The expected in-sample error is not the same as the expected out-of-sample error because the final hypothesis varies from one data set to another.
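This can be illustrated with a small simulation: the average in-sample error across many data sets sits below the average out-of-sample error, precisely because the hypothesis is fit to each data set. The target, noise level, and constant-hypothesis learner below are my own illustrative choices, not the book's setup:

```python
import random
random.seed(0)

# Illustrative setup (mine, not the book's): target f(x) = 0 with unit
# Gaussian noise, and the learner picks the constant hypothesis h that
# minimizes in-sample squared error, i.e. the sample mean.
n, trials, sigma = 2, 20_000, 1.0
ein_sum = eout_sum = 0.0
for _ in range(trials):
    ys = [random.gauss(0.0, sigma) for _ in range(n)]
    h = sum(ys) / n                        # learned constant hypothesis
    ein_sum += sum((y - h) ** 2 for y in ys) / n
    eout_sum += sigma ** 2 + h ** 2        # E[(h - y)^2] for a fresh y

# Expected E_in is about sigma^2 / 2 while expected E_out is about
# 3 * sigma^2 / 2: E_in is optimistic because h was fit to the same data.
print(ein_sum / trials, eout_sum / trials)
```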
• Page 93: Use a Taylor series expansion for and then substitute the series into expression . More mathematics is needed to see that equality clearly. Why is small? Reference.
• Page 98: changes at every iteration, so what is meant by: “‘on average’ the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out.”?
• My possible interpretation:
• “on average” here means that, with the weights held fixed at a given iteration, averaging the update over all possible chosen data points gives its expected value.
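That reading can be checked numerically: with the weights held fixed, the long-run average of single-point gradients matches the full gradient over all N points. The linear model, squared error, data set, and fixed weights below are illustrative assumptions, not the book's exact setup:

```python
import random
random.seed(1)

# Hypothetical linear model h(x) = w*x + b with squared error.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
w, b = 0.3, -0.2  # weights held fixed at one iteration

def grad_point(x, y):
    """Gradient of the single-point error (w*x + b - y)^2 w.r.t. (w, b)."""
    e = w * x + b - y
    return 2 * e * x, 2 * e

# Full gradient: average over all N points.
gw = sum(grad_point(x, y)[0] for x, y in data) / len(data)
gb = sum(grad_point(x, y)[1] for x, y in data) / len(data)

# SGD uses one uniformly chosen point per step; averaging many such
# single-point gradients recovers the full gradient, which is the sense
# in which the random fluctuations "cancel out" in the long run.
picks = [grad_point(*random.choice(data)) for _ in range(100_000)]
sw = sum(p[0] for p in picks) / len(picks)
sb = sum(p[1] for p in picks) / len(picks)
print(sw - gw, sb - gb)  # both differences are close to zero
```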
• Page 99: What is meant by: “The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface.”?
• Page 104:
• Why don’t we transform the output vector too?
• Pascal triangle: .
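The Pascal-triangle note presumably concerns counting the features produced by the polynomial transform. A sketch under that assumption: the number of monomials of degree at most Q in d variables is C(Q + d, d), and Pascal's identity C(n, k) = C(n−1, k−1) + C(n−1, k) lets the triangle generate these coefficients:

```python
from math import comb
from itertools import combinations_with_replacement

# Number of monomials of degree <= Q in d variables: C(Q + d, d).
def n_features(d, Q):
    return comb(Q + d, d)

# Cross-check by enumerating the monomials of each degree directly.
def n_features_enum(d, Q):
    return sum(len(list(combinations_with_replacement(range(d), q)))
               for q in range(Q + 1))

# Pascal's identity C(n, k) = C(n-1, k-1) + C(n-1, k) underlies the count.
assert comb(5, 2) == comb(4, 1) + comb(4, 2)
```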
• Page 123: We have: . Reference: Standard Normal Distribution.
• Page 130:
• Reference for solution: Lecture 12 Video.
• “And just like the regular derivative, the gradient points in the direction of greatest increase”, reference: Vector Calculus: Understanding the Gradient.
• Also make reference to in Gradient Descent Algorithm for the statement: “ cannot be optimal” (not sure about this understanding).
• What does it mean by “This means that has some non-zero component along the constraint surface”?
• If is small and then is small.
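The page-130 statements can be illustrated numerically: at a constrained minimum the gradient has no component along the constraint surface, since a non-zero component along the surface would mean the point cannot be optimal. The quadratic error and unit-circle constraint below are my own toy choices, not the book's:

```python
import math

# Toy constrained problem: minimize E(w) = (w1 - 2)^2 + (w2 - 1)^2
# subject to ||w|| = 1, via projected gradient descent
# (take a step, then project back onto the circle).
def grad(w):
    return (2 * (w[0] - 2.0), 2 * (w[1] - 1.0))

w = (1.0, 0.0)
for _ in range(2000):
    g = grad(w)
    v = (w[0] - 0.05 * g[0], w[1] - 0.05 * g[1])
    r = math.hypot(v[0], v[1])
    w = (v[0] / r, v[1] / r)   # projection back onto the constraint

# At the constrained minimum the gradient is normal to the constraint
# surface (parallel to w here), so its component ALONG the surface is
# zero.  If that component were non-zero, sliding along the surface
# would still decrease E, so w could not be optimal.
g = grad(w)
tangential = g[0] * (-w[1]) + g[1] * w[0]
print(w, tangential)
```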
• Page 136: The noisier the target, the more potential final hypotheses the learning algorithm produces across data sets (hypotheses that would have no chance of being final if the target were less noisy).
• Page 139: Are we assuming the target function is free of stochastic noise? If not, how can “ depends only on “ hold, and doesn’t it also depend on ?
• Page 140: Figure 4.8 does not contradict the figures on page 67. In Figure 4.8 the hypothesis changes as increases, hence the difference in the generalization error’s behaviour.

EXERCISE NOTE

• Exercise 1.2.
• Exercise 1.13: Note that both h and f are binary functions.
• Exercise 2.1: For Example 2.2.1, find a k such that .
• Exercise 2.2: For Example 2.2.1, verify that , where k = 2. Reference: Lecture 06 – Theory of Generalization.
• Exercise 2.4.
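The verification in Exercise 2.2 can be sketched in code, assuming the standard hypothesis sets of Example 2.2.1 (positive rays with break point k = 2, positive intervals with k = 3) and the usual bound on the growth function:

```python
from math import comb

# Bound: sum of C(N, i) for i = 0 .. k-1, where k is a break point.
def bound(N, k):
    return sum(comb(N, i) for i in range(k))

def m_positive_rays(N):       # growth function N + 1, break point k = 2
    return N + 1

def m_positive_intervals(N):  # growth function C(N+1, 2) + 1, break point k = 3
    return comb(N + 1, 2) + 1

# The growth functions never exceed the bound (here with equality).
for N in range(1, 30):
    assert m_positive_rays(N) <= bound(N, 2)
    assert m_positive_intervals(N) <= bound(N, 3)
```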

PROBLEM NOTE

QUESTION NOTE

• Is the event equivalent to the event ? If yes, then why not:
• With the same , the VC bound (3.1) on page 78 (for linear classification) will require more than the VC bound on page 87 (for linear regression). Why?