# [Book Note] Learning From Data – A Short Course

This book should be read alongside its corresponding online course. However, you should *not* watch the online course alone.

**PAGE NOTE**

**Page 19:** Reference: Malik Magdon-Ismail.

**Page 31:** *y* here is a random variable. Reference: Malik Magdon-Ismail, *The Elements of Statistical Learning*, page 28.

**Page 32:** "we will assume the target to be a probability distribution, thus covering the general case" => see the note for page 53: "Each D is a point on that canvas. The probability of a point is determined by which 's in happen to be in that particular , and is calculated based on the distribution over ".

**Page 49:**

- We have: , which explains "each term in the sum is polynomial (of degree i…".
- Proof of the growth function when points are not binary.

**Page 53:** The canvas is the space of all possible data sets of size .

**Page 46:** "Unless … be crushed by the factor" — refer to the note for page 78.

**Page 78:** .

**Page 85:** .

**Page 88:** Reference: Lecture 08 – Bias-Variance Tradeoff. The expected in-sample error is not the same as the expected out-of-sample error because the final hypothesis varies depending on the data set .

**Page 93:** Use a Taylor series expansion for and then substitute the series into expression . You will need more mathematics to understand that equality clearly. Why is small? Reference.

**Page 98:** changes at every iteration, so what is meant by: "'on average' the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out."?

- My possible interpretation: "on average" here means that, with the same weights in an iteration, the average over the possible chosen data points is the expected value.
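A tiny sketch of that interpretation (a toy data set, not from the book): the average of the single-point gradient over a uniformly chosen data point is exactly the batch gradient, so each SGD step is right "on average" and the random per-step fluctuations cancel out in the long run.

```python
import random

# Toy 1-D squared-error setting (hypothetical data, not the book's):
# E(w) = (1/N) * sum_n (w * x_n - y_n)^2
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]

def point_gradient(w, x, y):
    # Gradient of the single-point error (w*x - y)^2 w.r.t. w
    return 2 * (w * x - y) * x

def batch_gradient(w):
    # Full in-sample gradient: the average of the per-point gradients
    return sum(point_gradient(w, x, y) for x, y in data) / len(data)

w = 0.0
# Averaging the SGD direction over all possible point choices
# recovers the batch direction exactly:
avg_sgd = sum(point_gradient(w, x, y) for x, y in data) / len(data)
assert abs(avg_sgd - batch_gradient(w)) < 1e-12

# A few SGD steps: each step is "wiggly", but on average it
# follows the batch gradient downhill.
random.seed(0)
eta = 0.01
for _ in range(1000):
    x, y = random.choice(data)
    w -= eta * point_gradient(w, x, y)
# w settles near the least-squares slope (~2.0), up to small fluctuations
```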
**Page 99:** What is meant by: "The *randomness* introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface."?

**Page 104:**

- Why don't we transform the output vector too?
- Pascal triangle: .
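One way to see where Pascal's triangle enters the Qth-order polynomial transform: the number of monomials of degree at most Q in d variables is C(Q + d, d), and Pascal's rule C(n, k) = C(n-1, k-1) + C(n-1, k) is the recursion behind that count. A small sketch checking the closed form against brute-force enumeration:

```python
from itertools import combinations_with_replacement
from math import comb

def n_poly_features(d, Q):
    # Count monomials x1^a1 * ... * xd^ad with a1 + ... + ad <= Q
    # by enumerating, for each total degree, the multisets of variables.
    return sum(1 for total in range(Q + 1)
               for _ in combinations_with_replacement(range(d), total))

# The closed form C(Q + d, d) matches the enumeration for every (d, Q):
for d in (1, 2, 3):
    for Q in (1, 2, 3):
        assert n_poly_features(d, Q) == comb(Q + d, d)

# d = 2, Q = 2 gives the 6 features: 1, x1, x2, x1^2, x1*x2, x2^2
assert n_poly_features(2, 2) == 6
```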

**Page 123:** We have: . Reference: Standard Normal Distribution.

**Page 130:**

- Reference for the solution: Lecture 12 Video.
- "And just like the regular derivative, the gradient points in the direction of greatest increase", reference: Vector Calculus: Understanding the Gradient.
- Also see in *Gradient Descent Algorithm* for the statement: " cannot be optimal" (not sure about this understanding).
- What is meant by "This means that has some non-zero component along the constraint surface"?
- If is small, then is small.
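A numerical sketch of "the gradient points in the direction of greatest increase" (on a hypothetical quadratic surface, not the book's): among all unit directions, the one that increases E fastest is numerically aligned with the gradient. This is also the intuition for the constrained-optimum statement: if the gradient had a non-zero component along the constraint surface, moving against that component would stay on the surface yet decrease E, so the point could not be optimal.

```python
import math

# Hypothetical smooth error surface (not from the book):
def E(w1, w2):
    return (w1 - 2) ** 2 + (w2 - 1) ** 2

def grad(w1, w2, h=1e-6):
    # Numerical gradient via central differences.
    return ((E(w1 + h, w2) - E(w1 - h, w2)) / (2 * h),
            (E(w1, w2 + h) - E(w1, w2 - h)) / (2 * h))

w = (0.0, 0.0)
g = grad(*w)
unit_g = (g[0] / math.hypot(*g), g[1] / math.hypot(*g))

# Scan unit directions at 1-degree resolution; the one that raises E
# the most after a small step is the (normalized) gradient direction.
best = max(
    ((math.cos(t), math.sin(t))
     for t in (i * 2 * math.pi / 360 for i in range(360))),
    key=lambda u: E(w[0] + 1e-3 * u[0], w[1] + 1e-3 * u[1]),
)
assert abs(best[0] - unit_g[0]) < 0.02 and abs(best[1] - unit_g[1]) < 0.02
```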

**Page 136:** The noisier the target is, the more potential final hypotheses are produced by the learning algorithm across data sets (hypotheses that would have no chance of being final if the target were less noisy).

**Page 139:** Are we assuming that the target function is free of stochastic noise? If not, then how does " depend only on "? Doesn't it also depend on ?

**Page 140:** Figure 4.8 does not contradict the figures on page 67. In Figure 4.8 the hypothesis changes as increases, hence the difference in the generalization error's behaviour.
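A quick simulation of the page-136 point (a hypothetical noisy linear target, not the book's example): the noisier the target, the more the final hypothesis varies from one data set to another, i.e. the wider the set of hypotheses that can end up being final.

```python
import random

def final_slope(sigma, rng):
    # Draw a small data set from a hypothetical noisy target
    # y = 2x + noise, then fit the least-squares slope through the origin.
    xs = [rng.uniform(-1, 1) for _ in range(10)]
    ys = [2 * x + rng.gauss(0, sigma) for x in xs]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def slope_variance(sigma, runs=2000, seed=0):
    # Variance of the final hypothesis (its slope) across many data sets.
    rng = random.Random(seed)
    slopes = [final_slope(sigma, rng) for _ in range(runs)]
    mean = sum(slopes) / len(slopes)
    return sum((s - mean) ** 2 for s in slopes) / len(slopes)

# More stochastic noise in the target => a wider spread of final
# hypotheses produced from different data sets.
assert slope_variance(0.1) < slope_variance(1.0)
```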

**EXERCISE NOTE**

- Exercise 1.2.
- Exercise 1.13: A note here is that both h and f are *binary* functions.
- Exercise 2.1: For Example 2.2.1, find a k such that .
- Exercise 2.2: For Example 2.2.1, verify that , where k = 2. Reference: Lecture 06 – Theory of Generalization.
- Exercise 2.4.

**PROBLEM NOTE**

**QUESTION NOTE**

- Is the event equivalent to the event ? If yes, then why not:
- With the same , the VC bound (3.1) on page 78 (for linear classification) requires more than the VC bound on page 87 (for linear regression). Why?
