[Book Note] Learning From Data – A Short Course
This book should be read alongside its corresponding online course; however, you should not watch the online course on its own without the book.
PAGE NOTE
- Page 19: Reference: Malik Magdon-Ismail.
- Page 31: $y$ here is a random variable. References: Malik Magdon-Ismail; The Elements of Statistical Learning, page 28.
- Page 32: “we will assume the target to be a probability distribution $P(y \mid \mathbf{x})$, thus covering the general case” => Senpai please notice on page 53: “Each $\mathcal{D}$ is a point on that canvas. The probability of a point is determined by which $\mathbf{x}_n$’s in $\mathcal{X}$ happen to be in that particular $\mathcal{D}$, and is calculated based on the distribution $P$ over $\mathcal{X}$.”
- Page 49:
  - We have $\binom{N}{i} = \frac{N(N-1)\cdots(N-i+1)}{i!}$, which explains “each term in the sum is polynomial (of degree $i$)…” (see the binomial expansion after this list).
- Proof of Growth Function when Points are not binary.
  - We have: …
- Page 53: The canvas is the space of all possible data sets of size $N$.
- Page 46: “Unless … be crushed by the $\frac{1}{N}$ factor”; please refer to the page note for page 78.
- Page 78: the VC generalization bound (3.1): $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{8}{N}\ln\frac{4\left((2N)^{d_{\mathrm{VC}}}+1\right)}{\delta}}$.
- Page 85: …
- Page 88: Reference: Lecture 08 – Bias-Variance Tradeoff. The expected in-sample error is not the same as expected out-of-sample error because the final hypothesis varies depending on the data set $\mathcal{D}$.
- Page 93: Use a Taylor series expansion for $E_{\text{in}}(\mathbf{w}(0) + \eta\hat{\mathbf{v}})$ and then substitute the series into the expression for $\Delta E_{\text{in}}$ (see the Taylor-expansion sketch after this list). You will need more mathematics to understand that equality more clearly. Why is … small? Reference.
- Page 98: $\mathbf{w}$ changes for every iteration, so what does it mean by: “‘on average’ the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out.”? (See the averaging identity after this list.)
- My possible interpretation:
  - “on average” here means that, with the weights held fixed within an iteration, averaging over the data points that could be chosen gives the expected value of the update.
- Page 99: What does it mean by: “The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface.”? (See the SGD code sketch after this list.)
- Page 104:
  - Why don’t we transform the output vector $\mathbf{y}$ too?
  - Pascal triangle: … (see the monomial-counting note after this list).
- Page 123: We have: … Reference: Standard Normal Distribution.
- Page 130 (see the constrained-minimization sketch after this list):
  - Reference for the … solution: Lecture 12 Video.
  - “And just like the regular derivative, the gradient points in the direction of greatest increase”, reference: Vector Calculus: Understanding the Gradient.
  - Also make reference to $-\nabla E_{\text{in}}(\mathbf{w})$ in the gradient descent algorithm for the statement that such a $\mathbf{w}$ “cannot be optimal” (not sure about this understanding).
  - What does it mean by “This means that $\nabla E_{\text{in}}(\mathbf{w})$ has some non-zero component along the constraint surface”?
  - If … is small and … then … is small.
- Page 136: The noisier the target is, the more potential final hypotheses the learning algorithm produces across data sets (hypotheses that would have no chance of being final if the target were less noisy).
- Page 139: Are we assuming that the target function is free of stochastic noise? If not, then how is it that “… depends only on …”? Doesn’t it also depend on …?
- Page 140: Figure 4.8 does not contradict the figures on page 67. In Figure 4.8 the hypothesis changes as … increases, hence the difference in the generalization error’s behaviour.
EXERCISE NOTE
- Exercise 1.2.
- Exercise 1.13: A note here is that both $h$ and $f$ are binary functions.
- Exercise 2.1: For Example 2.2.1, find a $k$ such that $m_{\mathcal H}(k) < 2^k$.
- Exercise 2.2: For Example 2.2.1, verify that $m_{\mathcal H}(N) \le \sum_{i=0}^{k-1}\binom{N}{i}$; here $k = 2$ (see the worked check after this list). Reference: Lecture 06 – Theory of Generalization.
- Exercise 2.4.
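For the Exercise 2.1/2.2 notes above, a worked check for the first hypothesis set of Example 2.2 (positive rays), which is presumably what “Example 2.2.1” and “$k = 2$” refer to. Positive rays have $m_{\mathcal H}(N) = N + 1$. Since $m_{\mathcal H}(2) = 3 < 2^2$, $k = 2$ is a break point (Exercise 2.1), and Theorem 2.4 then bounds the growth function by

$$m_{\mathcal H}(N) \;\le\; \sum_{i=0}^{k-1}\binom{N}{i} \;=\; \binom{N}{0} + \binom{N}{1} \;=\; N + 1,$$

so for this hypothesis set the bound of Exercise 2.2 holds with equality.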
PROBLEM NOTE
QUESTION NOTE
- Is the event … equivalent to the event …? If yes, then why not: …
- With the same …, the VC bound (3.1) on page 78 (for linear classification) will require more $N$ than the VC bound on page 87 (for linear regression). Why? (See the comparison sketch after this list.)
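For the last question, a sketch of why the two bounds scale so differently in $N$, under the assumption that they are the classification bound (3.1) and the expected generalization error of linear regression discussed around page 87. The classification bound gives a gap of order

$$O\!\left(\sqrt{\tfrac{d_{\mathrm{VC}}\ln N}{N}}\right),$$

while the regression analysis gives an expected generalization error of order $O\!\bigl(\tfrac{d}{N}\bigr)$ (the learning curve $\sigma^2\bigl(1 + \tfrac{d+1}{N}\bigr)$). A $\sqrt{\ln N / N}$ gap shrinks far more slowly than a $1/N$ gap, so reaching the same gap needs a much larger $N$ under (3.1). The VC bound is also a worst-case, distribution-free statement that holds with probability at least $1-\delta$, whereas the regression result is an expectation derived under specific assumptions (a linear target plus noise).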