This book should be read alongside its corresponding online course; the online course should not be watched on its own.
- Page 19: Reference: Malik Magdon-Ismail.
- Page 31: y here is a random variable. Reference: Malik Magdon-Ismail, The Elements of Statistical Learning page 28.
- Page 32: “we will assume the target to be a probability distribution P(y | x), thus covering the general case” => note the related passage on page 53: “Each D is a point on that canvas. The probability of a point is determined by which x_n's in X happen to be in that particular D, and is calculated based on the distribution P over X”.
- Page 49:
- We have C(N, i) = N(N−1)⋯(N−i+1)/i! ≤ N^i/i!, which explains “each term in the sum is polynomial (of degree i…”.
- Proof of Growth Function when Points are not binary.
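To convince myself, a quick numerical sketch (the break point k = 3 and the helper `binomial_sum` are my own illustration, not from the book): each term C(N, i) is a polynomial of degree i in N, so the whole bound sum_{i=0}^{k-1} C(N, i) is a polynomial of degree k−1 and also respects the N^{k−1} + 1 corollary:

```python
from math import comb

def binomial_sum(N, k):
    """Sum_{i=0}^{k-1} C(N, i): the bound on the growth function m_H(N)."""
    return sum(comb(N, i) for i in range(k))

# Each term C(N, i) = N(N-1)...(N-i+1)/i! is a polynomial of degree i in N,
# so the whole sum is a polynomial of degree k-1 in N.
k = 3  # a hypothetical break point, for illustration
for N in [5, 10, 50, 100]:
    assert binomial_sum(N, k) <= N ** (k - 1) + 1  # the N^{k-1} + 1 corollary
    print(N, binomial_sum(N, k), N ** (k - 1) + 1)
```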
- Page 53: The canvas is the space of all possible data sets of size N.
- Page 46: “Unless … be crushed by the factor”, refer to the note on page 78.
- Page 78: .
- Page 85: .
- Page 88: Reference: Lecture 08 – Bias-Variance Tradeoff. The expected in-sample error is not the same as the expected out-of-sample error because the final hypothesis varies depending on the data set D.
- Page 93: Use Taylor series expansion for and then substitute the series into expression . You will need more mathematics to understand that equality more clearly. Why is small? Reference.
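A numerical sketch of the first-order Taylor argument (the toy surface E and the step sizes are my own assumptions, not the book's): the change E(w − η∇E) − E(w) matches −η‖∇E‖² up to an O(η²) remainder, so for small η the step decreases the error:

```python
import numpy as np

def E(w):
    # a toy smooth error surface (my own choice; the book works with E_in)
    return np.sin(w[0]) + w[1] ** 2

def grad_E(w):
    return np.array([np.cos(w[0]), 2 * w[1]])

w = np.array([1.0, 0.5])
g = grad_E(w)
for eta in [0.1, 0.01, 0.001]:
    actual = E(w - eta * g) - E(w)
    first_order = -eta * np.dot(g, g)  # Taylor: Delta E ~ -eta * ||grad||^2
    # the remainder |actual - first_order| shrinks roughly like eta^2
    print(eta, actual, first_order, abs(actual - first_order))
```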
- Page 98: The chosen data point changes at every iteration, so what is meant by: “‘on average’ the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out.”?
- My possible interpretation:
- “on average” here means: with the weights held fixed at an iteration, the expected value of the gradient over the randomly chosen data point equals the full (batch) gradient.
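A sketch of this interpretation on a toy least-squares problem (the data, weights, and squared pointwise error are my own assumptions): with the weights frozen, averaging the single-point gradients over all points, or over many random draws, recovers the batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # a toy data set (my own assumption)
y = rng.normal(size=100)
w = rng.normal(size=3)          # the weight vector frozen at one iteration

def point_grad(n):
    # gradient of the squared error on the single point (x_n, y_n)
    return 2 * (X[n] @ w - y[n]) * X[n]

# "on average": the expectation of the single-point gradient over a
# uniformly chosen index n equals the full batch gradient at the same weights
batch_grad = 2 * X.T @ (X @ w - y) / len(y)
expected_grad = np.mean([point_grad(n) for n in range(len(y))], axis=0)
print(np.allclose(batch_grad, expected_grad))

# "in the long run these fluctuations cancel out": a Monte Carlo average of
# randomly drawn single-point gradients converges to the batch gradient
draws = np.array([point_grad(rng.integers(len(y))) for _ in range(20000)])
print(np.max(np.abs(draws.mean(axis=0) - batch_grad)))  # small
```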
- Page 99: What is meant by: “The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface.”?
- Page 104:
- Why don’t we transform the output vector too?
- Pascal triangle: .
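Pascal's rule, C(N, i) = C(N−1, i) + C(N−1, i−1), is easy to check numerically (the range of N below is arbitrary):

```python
from math import comb

# Pascal's rule: C(N, i) = C(N-1, i) + C(N-1, i-1)
for N in range(1, 12):
    for i in range(1, N):
        assert comb(N, i) == comb(N - 1, i) + comb(N - 1, i - 1)
print("Pascal's rule verified for N < 12")
```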
- Page 123: We have: . Reference: Standard Normal Distribution.
- Page 130:
- Reference for solution: Lecture 12 Video.
- “And just like the regular derivative, the gradient points in the direction of greatest increase”, reference: Vector Calculus: Understanding the Gradient.
- Also make reference to in Gradient Descent Algorithm for the statement: “ cannot be optimal” (not sure about this understanding).
- What is meant by “This means that ∇E_in has some non-zero component along the constraint surface”?
- If is small and then is small.
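A numerical sketch of the “greatest increase” statement (the toy surface f and the sampling of directions are my own assumptions): among unit directions, the one maximizing the directional derivative aligns with the normalized gradient, and the maximal slope equals ‖∇f‖:

```python
import numpy as np

def f(p):
    return p[0] ** 2 + 3 * p[1] ** 2  # a toy surface (my own choice)

def grad_f(p):
    return np.array([2 * p[0], 6 * p[1]])

p = np.array([1.0, 1.0])
g = grad_f(p)
g_hat = g / np.linalg.norm(g)
eps = 1e-6

best_dir, best_slope = None, -np.inf
for theta in np.linspace(0, 2 * np.pi, 360, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])   # a unit direction
    slope = (f(p + eps * u) - f(p)) / eps          # directional derivative
    if slope > best_slope:
        best_dir, best_slope = u, slope

# the numerically best direction aligns with the gradient direction,
# and the best slope is close to the gradient's norm
print(best_dir, g_hat, np.dot(best_dir, g_hat))
```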
- Page 136: The noisier the target is, the more potential final hypotheses the learning algorithm produces across data sets (hypotheses that would not have had a chance to be final if the target were less noisy).
- Page 139: Are we assuming that the target function is free of stochastic noise? If not, then how “ depends only on “, doesn’t it also depend on ?
- Page 140: Figure 4.8 does not contradict the figures on page 67. In Figure 4.8 the hypothesis changes as increases, hence the difference in the generalization error’s behaviour.
- Exercise 1.2.
- Exercise 1.13: A note here is that both h and f are binary functions.
- Exercise 2.1: For Example 2.2.1, find a k such that m_H(k) < 2^k.
- Exercise 2.2: For Example 2.2.1, verify that m_H(N) ≤ sum_{i=0}^{k-1} C(N, i), here with k = 2. Reference: Lecture 06 – Theory of Generalization.
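Assuming Example 2.2.1 is the positive rays, where m_H(N) = N + 1, both exercises can be checked mechanically (the loop ranges below are arbitrary):

```python
from math import comb

def m_H(N):
    # growth function for positive rays (Example 2.2 of the book): m_H(N) = N + 1
    return N + 1

# Exercise 2.1: the smallest break point k, i.e. smallest k with m_H(k) < 2^k
k = 1
while m_H(k) >= 2 ** k:
    k += 1
print("break point k =", k)  # k = 2 for positive rays

# Exercise 2.2: check m_H(N) <= sum_{i=0}^{k-1} C(N, i) for this k
# (with k = 2 the bound is 1 + N, so it holds with equality)
for N in range(1, 50):
    assert m_H(N) <= sum(comb(N, i) for i in range(k))
```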
- Exercise 2.4.
- Is the event equivalent to the event ? If yes, then why not:
- With the same accuracy and confidence requirements, the VC bound (3.1) on page 78 (for linear classification) will require more examples than the VC bound on page 87 (for linear regression). Why?
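One way to make the comparison concrete: the book's sample-complexity recipe (Example 2.6) solves the implicit bound N ≥ (8/ε²) ln(4((2N)^{d_vc} + 1)/δ) by iteration; a sketch (the initial guess and iteration count are my own choices):

```python
import math

def vc_sample_complexity(d_vc, epsilon, delta):
    """Fixed-point iteration on N >= (8/eps^2) * ln(4*((2N)^d_vc + 1)/delta),
    the sample-complexity form of the VC bound (constants as in the book)."""
    N = 1000.0  # an arbitrary initial guess
    for _ in range(100):
        N = (8 / epsilon ** 2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return N

# on the order of 30,000 examples for d_vc = 3, epsilon = 0.1, delta = 0.1
print(round(vc_sample_complexity(d_vc=3, epsilon=0.1, delta=0.1)))
```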