122 changes: 47 additions & 75 deletions phases/01-math-foundations/04-calculus-for-ml/docs/en.md
@@ -9,7 +9,7 @@

## Learning Objectives

- Compute numerical and analytical derivatives for common ML functions (x^2, sigmoid, cross-entropy)
- Compute numerical and analytical derivatives for common ML functions ($x^2$, sigmoid, cross-entropy)
- Implement gradient descent from scratch to minimize a loss function in 1D and 2D
- Derive the gradient of a linear regression model and train it via manual weight updates
- Explain the Hessian matrix, Taylor series approximations, and their connection to optimization methods
@@ -24,11 +24,11 @@ Without calculus, training a neural network would mean trying random changes and

### What is a derivative?

A derivative measures the rate of change. For a function y = f(x), the derivative f'(x) tells you: if you nudge x by a tiny amount, how much does y change?
A derivative measures the rate of change. For a function $y = f(x)$, the derivative $f'(x)$ tells you: if you nudge $x$ by a tiny amount, how much does $y$ change?

Geometrically, the derivative is the slope of the tangent line at a point.

**f(x) = x^2:**
**$f(x) = x^2$:**

| x | f(x) | f'(x) (slope) |
|---|------|---------------|
@@ -41,38 +41,31 @@ At x=2, the slope is 4. If you move x a tiny bit to the right, y increases by ab

The formal definition:

```
f'(x) = lim   f(x + h) - f(x)
        h->0  -----------------
                      h
```
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

In code, you skip the limit and just use a very small h. That is the numerical derivative.
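
A minimal sketch of that idea, assuming a fixed step `h = 1e-5` and hypothetical helper names (`forward_difference`, `square`) that are not part of the lesson code:

```python
def forward_difference(f, x, h=1e-5):
    """Approximate f'(x) with the difference quotient for a small, fixed h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# The exact slope of x^2 at x = 2 is 4. For this function the estimate
# is off by about h, plus a little floating-point rounding.
estimate = forward_difference(square, 2.0)
print(estimate)
```

Making `h` smaller reduces the approximation error only up to a point. Eventually floating-point rounding in `f(x + h) - f(x)` dominates.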

### Partial derivatives: one variable at a time

Real functions have many inputs. A neural network loss depends on thousands of weights. A partial derivative holds all variables constant except one, then takes the derivative with respect to that one.

```
f(x, y) = x^2 + 3xy + y^2

df/dx = 2x + 3y (treat y as a constant)
df/dy = 3x + 2y (treat x as a constant)
```
$$f(x, y) = x^2 + 3xy + y^2$$

$$\frac{\partial f}{\partial x} = 2x + 3y \quad \text{(treat } y \text{ as a constant)}$$

$$\frac{\partial f}{\partial y} = 3x + 2y \quad \text{(treat } x \text{ as a constant)}$$

Each partial derivative answers: if I nudge just this one weight, how does the loss change?
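
One way to check the two formulas above is to nudge one input at a time and compare against the analytical result. The sketch below uses hypothetical helpers `partial_x` and `partial_y` and nudges in both directions, which happens to be very accurate for a quadratic:

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def partial_x(f, x, y, h=1e-5):
    # Nudge x only; y is held constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def partial_y(f, x, y, h=1e-5):
    # Nudge y only; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


x0, y0 = 1.0, 2.0
print(partial_x(f, x0, y0), 2 * x0 + 3 * y0)  # numerical vs. 2x + 3y
print(partial_y(f, x0, y0), 3 * x0 + 2 * y0)  # numerical vs. 3x + 2y
```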

### The gradient: vector of all partial derivatives

The gradient collects every partial derivative into one vector. For a function f(x, y, z), the gradient is:

```
grad f = [ df/dx, df/dy, df/dz ]
```
$$\nabla f = \left[\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right]$$

The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction.
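
As a rough illustration, the sketch below builds a gradient numerically and takes one step against it. The function name `numerical_gradient`, the starting point, and the step size 0.1 are all assumptions made for this example:

```python
def numerical_gradient(f, point, h=1e-5):
    """Return the list of partial derivatives of f at `point`."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def bowl(p):
    x, y = p
    return x ** 2 + y ** 2


point = [1.0, 2.0]
grad = numerical_gradient(bowl, point)  # close to the analytical [2x, 2y]

# Step in the opposite direction of the gradient.
step = 0.1
downhill = [p - step * g for p, g in zip(point, grad)]
print(bowl(point), bowl(downhill))  # the second value should be smaller
```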

**Contour plot of f(x,y) = x^2 + y^2:**
**Contour plot of $f(x,y) = x^2 + y^2$:**

The function forms a bowl shape with concentric circles as contour lines. The minimum is at (0, 0).

@@ -87,16 +80,11 @@ This is gradient descent in a picture. Compute the gradient, negate it, take a s

Training a neural network is optimization. You have a loss function L(w1, w2, ..., wn) that measures how wrong the model is. You want to minimize it.

```
Gradient descent update rule:

w_new = w_old - learning_rate * dL/dw
$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial w}$$

For every weight:
1. Compute the partial derivative of loss with respect to that weight
2. Subtract a small multiple of it from the weight
3. Repeat
```
1. Compute the partial derivative of loss with respect to that weight
2. Subtract a small multiple of it from the weight
3. Repeat

The learning rate $\alpha$ controls step size. Too big and you overshoot. Too small and you crawl.
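
To see all three regimes, here is a small sketch on a hypothetical one-dimensional loss $(w - 3)^2$. The loss, the 50-step budget, and the three learning rates are chosen only for illustration:

```python
def loss_gradient(w):
    # Gradient of the hypothetical loss (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


def gradient_descent(learning_rate, w=0.0, steps=50):
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w


# 0.001 crawls, 0.1 converges, 1.1 overshoots further on every step.
for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, gradient_descent(learning_rate))
```

For this particular loss each update multiplies the distance to the minimum by $1 - 2\alpha$, so any $\alpha$ above 1 makes the iterates grow instead of shrink.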

@@ -116,19 +104,13 @@ Gradient descent follows the slope downhill. It can get stuck in local minima, b

There are two ways to compute a derivative.

Analytical: apply calculus rules by hand. For f(x) = x^2, the derivative is f'(x) = 2x. Exact. Fast.
Analytical: apply calculus rules by hand. For $f(x) = x^2$, the derivative is $f'(x) = 2x$. Exact. Fast.

Numerical: approximate using the definition. Compute f(x+h) and f(x-h) for a tiny h, then use the difference.

```
Numerical (central difference):

f'(x) ~=  f(x + h) - f(x - h)
          -----------------------
                    2h

h = 0.0001 works well in practice
```
$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

$h = 0.0001$ works well in practice.

Numerical derivatives are slower but work for any function. Analytical derivatives are fast but require you to derive the formula. Neural network frameworks use a third approach: automatic differentiation, which computes exact derivatives mechanically. You will see that in Phase 3.
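
The two approaches can be compared directly. The sketch below does so for the sigmoid, using its standard derivative $f(x)(1 - f(x))$; the helper names are illustrative and may differ from the ones used later in the lesson:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    # Analytical: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-4):
    # Numerical: two function evaluations per derivative.
    return (f(x + h) - f(x - h)) / (2 * h)


x = 0.5
numerical = central_difference(sigmoid, x)
analytical = sigmoid_derivative(x)
print(numerical, analytical, abs(numerical - analytical))
```

A check like this is a common way to catch mistakes in a hand-derived formula: if the two numbers disagree by much, the derivation is probably wrong.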

@@ -148,7 +130,7 @@ f(x) = ln(x) f'(x) = 1/x Cross-entropy loss
f(x) = 1/(1+e^-x) f'(x) = f(x)(1-f(x)) Sigmoid activation
```

For f(x) = x^2:
For $f(x) = x^2$:

```
f(x) = x^2 f'(x) = 2x
@@ -161,7 +143,7 @@ f(x) = x^2 f'(x) = 2x
2 4 4 slope tilts right (increasing)
```

For f(w) = wx + b with x=3, b=1:
For $f(w) = wx + b$ with $x=3$, $b=1$:

```
f(w) = 3w + 1 f'(w) = 3
@@ -189,18 +171,13 @@ Neural networks are chains of functions: input -> linear -> activation -> linear

The gradient tells you the slope. The Hessian tells you the curvature.

The Hessian is the matrix of second-order partial derivatives. For a function f(x1, x2, ..., xn), entry (i, j) of the Hessian is:
The Hessian is the matrix of second-order partial derivatives. For a function $f(x_1, x_2, \ldots, x_n)$, entry $(i, j)$ of the Hessian is:

```
H[i][j] = d^2f / (dx_i * dx_j)
```
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$

For a 2-variable function f(x, y):

```
H = | d^2f/dx^2   d^2f/dxdy |
    | d^2f/dydx   d^2f/dy^2 |
```
$$H = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[8pt] \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}$$

**What the Hessian tells you at a critical point (where gradient = 0):**

@@ -210,7 +187,7 @@ H = | d^2f/dx^2 d^2f/dxdy |
| Negative definite (all eigenvalues < 0) | Local maximum | Bowl pointing down |
| Indefinite (mixed eigenvalues) | Saddle point | Horse saddle shape |

**Example:** f(x, y) = x^2 - y^2 (a saddle function)
**Example:** $f(x, y) = x^2 - y^2$ (a saddle function)

```
df/dx = 2x df/dy = -2y
@@ -223,7 +200,7 @@ Eigenvalues: 2 and -2 (one positive, one negative)
--> Saddle point at (0, 0)
```
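
The same classification can be sketched in code. The example below approximates a 2x2 Hessian with finite differences and reads the eigenvalues from the closed-form expression for a symmetric 2x2 matrix. All function names and the tolerance are assumptions for this illustration, and it is limited to two variables:

```python
import math


def hessian_2d(f, x, y, h=1e-4):
    """Approximate the 2x2 Hessian of f at (x, y) with central differences."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]


def eigenvalues_symmetric_2x2(H):
    # For [[a, b], [b, d]] the eigenvalues are mean -/+ radius.
    a, b, d = H[0][0], H[0][1], H[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify(H, tol=1e-6):
    smallest, largest = eigenvalues_symmetric_2x2(H)
    if smallest > tol:
        return "local minimum"
    if largest < -tol:
        return "local maximum"
    if smallest < -tol and largest > tol:
        return "saddle point"
    return "inconclusive"


def saddle(x, y):
    return x ** 2 - y ** 2


def bowl(x, y):
    return x ** 2 + y ** 2


print(classify(hessian_2d(saddle, 0.0, 0.0)))
print(classify(hessian_2d(bowl, 0.0, 0.0)))
```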

Compare with f(x, y) = x^2 + y^2 (a bowl):
Compare with $f(x, y) = x^2 + y^2$ (a bowl):

```
H = | 2 0 |
@@ -237,10 +214,10 @@ Eigenvalues: 2 and 2 (both positive)

Newton's method uses the Hessian to take better optimization steps than gradient descent. Instead of just following the slope, it accounts for curvature:

```
Newton's update: w_new = w_old - H^(-1) * gradient
Gradient descent: w_new = w_old - lr * gradient
```
| Method | Update rule |
|--------|-------------|
| Newton's method | $w_{\text{new}} = w_{\text{old}} - H^{-1} \nabla L$ |
| Gradient descent | $w_{\text{new}} = w_{\text{old}} - \alpha \nabla L$ |

Newton's method converges faster because the Hessian "rescales" the gradient -- directions with high curvature get smaller steps, directions with low curvature get larger steps.
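
A one-dimensional sketch makes the difference visible. In 1D the Hessian is just the second derivative, so Newton's update divides the gradient by the curvature. The test function $f(w) = e^w - 2w$ (minimum at $w = \ln 2$), the learning rate 0.1, and the five-step budget are assumptions chosen for this example:

```python
import math


def grad(w):
    # f(w) = e^w - 2w, so f'(w) = e^w - 2
    return math.exp(w) - 2.0


def curvature(w):
    # f''(w) = e^w, always positive for this function
    return math.exp(w)


def gradient_descent_steps(w, learning_rate=0.1, steps=5):
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def newton_steps(w, steps=5):
    for _ in range(steps):
        w = w - grad(w) / curvature(w)
    return w


minimum = math.log(2.0)
gd_error = abs(gradient_descent_steps(0.0) - minimum)
newton_error = abs(newton_steps(0.0) - minimum)
print(gd_error, newton_error)
```

After the same number of steps, the Newton iterate should be many orders of magnitude closer to the minimum. The price is that each step needs second-derivative information.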

@@ -249,7 +226,7 @@ The catch: for a neural network with N parameters, the Hessian is N x N. A model
| Method | What it uses | Cost | Convergence |
|--------|-------------|------|-------------|
| Gradient descent | First derivatives only | O(N) per step | Slow (linear) |
| Newton's method | Full Hessian | O(N^3) per step | Fast (quadratic) |
| Newton's method | Full Hessian | $O(N^3)$ per step | Fast (quadratic) |
| L-BFGS | Approximate Hessian from gradient history | O(N) per step | Medium (superlinear) |
| Adam | Per-parameter adaptive rates (diagonal Hessian approx) | O(N) per step | Medium |
| Natural gradient | Fisher information matrix (statistical Hessian) | O(N^2) per step | Fast |
@@ -260,17 +237,15 @@ In practice, Adam is the default optimizer for deep learning. It approximates se

Any smooth function can be approximated locally by a polynomial:

```
f(x + h) = f(x) + f'(x)*h + (1/2)*f''(x)*h^2 + (1/6)*f'''(x)*h^3 + ...
```
$$f(x + h) = f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2 + \frac{1}{6}f'''(x)\,h^3 + \cdots$$

The more terms you include, the better the approximation -- but only near the point x.
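
Both halves of that sentence can be checked numerically. The sketch below uses $e^x$ expanded around $x = 0$, chosen only because every derivative of $e^x$ is $e^x$ itself; `taylor_exp` is a hypothetical helper:

```python
import math


def taylor_exp(x, h, order):
    """Taylor polynomial of exp around x, evaluated at x + h.

    Every derivative of exp is exp itself, so term k is exp(x) * h**k / k!.
    """
    return sum(math.exp(x) * h ** k / math.factorial(k)
               for k in range(order + 1))


for order in (1, 2, 3):
    near_error = abs(taylor_exp(0.0, 0.5, order) - math.exp(0.5))
    far_error = abs(taylor_exp(0.0, 3.0, order) - math.exp(3.0))
    print(order, near_error, far_error)
```

The error at $h = 0.5$ shrinks as the order goes up, while at $h = 3$ even the third-order polynomial is still far off.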

**Why Taylor series matter for ML:**

- **First-order Taylor = gradient descent.** When you use f(x + h) ~ f(x) + f'(x)*h, you are making a linear approximation. Gradient descent minimizes this linear model to choose h = -lr * f'(x).
- **First-order Taylor = gradient descent.** When you use $f(x + h) \approx f(x) + f'(x)\,h$, you are making a linear approximation. A linear model has no minimum, so gradient descent moves in the direction that decreases it fastest and lets the learning rate cap the step: $h = -\alpha f'(x)$.

- **Second-order Taylor = Newton's method.** Using f(x + h) ~ f(x) + f'(x)*h + (1/2)*f''(x)*h^2, you get a quadratic model. Minimizing it gives h = -f'(x)/f''(x) -- Newton's step.
- **Second-order Taylor = Newton's method.** Using $f(x + h) \approx f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2$, you get a quadratic model. When $f''(x) > 0$, minimizing it gives $h = -f'(x)/f''(x)$, which is Newton's step.

- **Loss function design.** MSE and cross-entropy are smooth, which means their Taylor expansions are well-behaved. This is not an accident. Smooth losses make optimization predictable.

@@ -292,27 +267,24 @@ Derivatives tell you rates of change. Integrals compute accumulations -- area un
In ML, you rarely compute integrals by hand, but the concept is everywhere:

**Probability.** For a continuous random variable with density p(x):
```
P(a < X < b) = integral from a to b of p(x) dx
```
$$P(a < X < b) = \int_a^b p(x)\,dx$$
The area under the probability density curve between a and b is the probability of landing in that range.
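
As a sketch of what that area looks like in code, the example below integrates a standard normal density over $(-1, 1)$ with the trapezoid rule and compares it with the closed form from `math.erf`. The choice of distribution, interval, and 1000 slices is an assumption for illustration:

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def trapezoid(f, a, b, n=1000):
    """Approximate the integral of f from a to b using n trapezoids."""
    width = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * width)
    return total * width


prob = trapezoid(normal_pdf, -1.0, 1.0)
exact = math.erf(1.0 / math.sqrt(2.0))  # P(-1 < X < 1) for a standard normal
print(prob, exact)
```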

**Expected value.** The average outcome weighted by probability:
```
E[f(X)] = integral of f(x) * p(x) dx
```

$$\mathbb{E}[f(X)] = \int f(x)\, p(x)\,dx$$
The expected loss over a data distribution is an integral. Training minimizes an empirical approximation of this.

**KL divergence.** Measures how different two distributions are:
```
KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx
```

$$D_{\mathrm{KL}}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$

Used in VAEs, knowledge distillation, and Bayesian inference.
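
For a concrete number, the sketch below evaluates the KL integral numerically for two unit-variance Gaussians, where the known closed form is $(\mu_p - \mu_q)^2 / 2$. The integration range, the midpoint rule, and the function names are assumptions for this example:

```python
import math


def normal_log_pdf(x, mean, std):
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def kl_divergence(mean_p, std_p, mean_q, std_q, lo=-12.0, hi=12.0, n=20000):
    """Midpoint-rule estimate of the integral of p(x) * log(p(x) / q(x))."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * width
        log_p = normal_log_pdf(x, mean_p, std_p)
        log_q = normal_log_pdf(x, mean_q, std_q)
        total += math.exp(log_p) * (log_p - log_q)
    return total * width


kl = kl_divergence(0.0, 1.0, 1.0, 1.0)
print(kl)  # should be close to the closed-form value (0 - 1)^2 / 2
```

Working with log-densities avoids dividing two very small numbers in the tails. In higher dimensions a grid like this becomes impractical, so sampling-based estimates are used instead.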

**Normalization constants.** In Bayesian inference:
```
p(w | data) = p(data | w) * p(w) / integral of p(data | w) * p(w) dw
```

$$p(w \mid \text{data}) = \frac{p(\text{data} \mid w)\, p(w)}{\int p(\text{data} \mid w)\, p(w)\, dw}$$

The denominator is an integral over all possible parameter values. It is often intractable, which is why we use approximations like MCMC and variational inference.

| Integral concept | Where it appears in ML |
@@ -353,7 +325,7 @@ This is all backpropagation is: the chain rule applied systematically through a

When a function maps a vector to a vector (like a neural network layer), its derivative is a matrix. The Jacobian contains every partial derivative of every output with respect to every input.

For f: R^n -> R^m, the Jacobian J is an m x n matrix:
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J$ is an $m \times n$ matrix:

| | x1 | x2 | ... | xn |
|---|---|---|---|---|
@@ -596,9 +568,9 @@ You just built gradient descent from scratch. PyTorch automates the gradient com

## Exercises

1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of x^3 at x=2 is 12.
2. Use gradient descent to find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2. Start from (0, 0). The answer should converge to (3, -1).
3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on f(x) = x^4 - 3x^2.
1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of $x^3$ at $x=2$ is 12.
2. Use gradient descent to find the minimum of $f(x, y) = (x-3)^2 + (y+1)^2$. Start from $(0, 0)$. The answer should converge to $(3, -1)$.
3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on $f(x) = x^4 - 3x^2$.

## Key Terms
