From 7236530f10b99d7c740f96f9fb282263cb3d8dc2 Mon Sep 17 00:00:00 2001
From: sawgatio
Date: Sun, 24 May 2026 19:21:20 +0530
Subject: [PATCH] fix: convert plain-text math to laTex only for 04-calculus-for-ml docs

---
 .../04-calculus-for-ml/docs/en.md | 122 +++++++-----------
 1 file changed, 47 insertions(+), 75 deletions(-)

diff --git a/phases/01-math-foundations/04-calculus-for-ml/docs/en.md b/phases/01-math-foundations/04-calculus-for-ml/docs/en.md
index 41e08ee0e..a05dd4ec2 100644
--- a/phases/01-math-foundations/04-calculus-for-ml/docs/en.md
+++ b/phases/01-math-foundations/04-calculus-for-ml/docs/en.md
@@ -9,7 +9,7 @@
 
 ## Learning Objectives
 
-- Compute numerical and analytical derivatives for common ML functions (x^2, sigmoid, cross-entropy)
+- Compute numerical and analytical derivatives for common ML functions ($x^2$, sigmoid, cross-entropy)
 - Implement gradient descent from scratch to minimize a loss function in 1D and 2D
 - Derive the gradient of a linear regression model and train it via manual weight updates
 - Explain the Hessian matrix, Taylor series approximations, and their connection to optimization methods
@@ -24,11 +24,11 @@ Without calculus, training a neural network would mean trying random changes and
 
 ### What is a derivative?
 
-A derivative measures the rate of change. For a function y = f(x), the derivative f'(x) tells you: if you nudge x by a tiny amount, how much does y change?
+A derivative measures the rate of change. For a function $y = f(x)$, the derivative $f'(x)$ tells you: if you nudge $x$ by a tiny amount, how much does y change?
 
 Geometrically, the derivative is the slope of the tangent line at a point.
 
-**f(x) = x^2:**
+**$f(x) = x^2$:**
 
 | x | f(x) | f'(x) (slope) |
 |---|------|---------------|
@@ -41,11 +41,7 @@ At x=2, the slope is 4.
If you move x a tiny bit to the right, y increases by ab
 
 The formal definition:
 
-```
-f'(x) = lim f(x + h) - f(x)
- h->0 ----------------
- h
-```
+$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
 
 In code, you skip the limit and just use a very small h. That is the numerical derivative.
 
@@ -53,12 +49,11 @@ In code, you skip the limit and just use a very small h. That is the numerical d
 
 Real functions have many inputs. A neural network loss depends on thousands of weights. A partial derivative holds all variables constant except one, then takes the derivative with respect to that one.
 
-```
-f(x, y) = x^2 + 3xy + y^2
-
-df/dx = 2x + 3y (treat y as a constant)
-df/dy = 3x + 2y (treat x as a constant)
-```
+$$f(x, y) = x^2 + 3xy + y^2$$
+
+$$\frac{\partial f}{\partial x} = 2x + 3y \quad \text{(treat } y \text{ as a constant)}$$
+
+$$\frac{\partial f}{\partial y} = 3x + 2y \quad \text{(treat } x \text{ as a constant)}$$
 
 Each partial derivative answers: if I nudge just this one weight, how does the loss change?
 
@@ -66,13 +61,11 @@ Each partial derivative answers: if I nudge just this one weight, how does the l
 
 The gradient collects every partial derivative into one vector. For a function f(x, y, z), the gradient is:
 
-```
-grad f = [ df/dx, df/dy, df/dz ]
-```
+$$\nabla f = \left[\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right]$$
 
 The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction.
 
-**Contour plot of f(x,y) = x^2 + y^2:**
+**Contour plot of $f(x,y) = x^2 + y^2$:**
 
 The function forms a bowl shape with concentric circles as contour lines. The minimum is at (0, 0).
 
@@ -87,16 +80,11 @@ This is gradient descent in a picture. Compute the gradient, negate it, take a s
 
 Training a neural network is optimization. You have a loss function L(w1, w2, ..., wn) that measures how wrong the model is. You want to minimize it.
 
-```
-Gradient descent update rule:
-
- w_new = w_old - learning_rate * dL/dw
+$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial w}$$
 
-For every weight:
- 1. Compute the partial derivative of loss with respect to that weight
- 2. Subtract a small multiple of it from the weight
- 3. Repeat
-```
+1. Compute the partial derivative of loss with respect to that weight
+2. Subtract a small multiple of it from the weight
+3. Repeat
 
 The learning rate controls step size. Too big and you overshoot. Too small and you crawl.
 
@@ -116,19 +104,13 @@ Gradient descent follows the slope downhill. It can get stuck in local minima, b
 
 There are two ways to compute a derivative.
 
-Analytical: apply calculus rules by hand. For f(x) = x^2, the derivative is f'(x) = 2x. Exact. Fast.
+Analytical: apply calculus rules by hand. For $f(x) = x^2$, the derivative is $f'(x) = 2x$. Exact. Fast.
 
 Numerical: approximate using the definition. Compute f(x+h) and f(x-h) for a tiny h, then use the difference.
 
-```
-Numerical (central difference):
-
-f'(x) ~= f(x + h) - f(x - h)
- ----------------------
- 2h
-
-h = 0.0001 works well in practice
-```
+$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$
+
+$h = 0.0001$ works well in practice.
 
 Numerical derivatives are slower but work for any function. Analytical derivatives are fast but require you to derive the formula. Neural network frameworks use a third approach: automatic differentiation, which computes exact derivatives mechanically. You will see that in Phase 3.
 
@@ -148,7 +130,7 @@ f(x) = ln(x) f'(x) = 1/x Cross-entropy loss
 f(x) = 1/(1+e^-x) f'(x) = f(x)(1-f(x)) Sigmoid activation
 ```
 
-For f(x) = x^2:
+For $f(x) = x^2$:
 
 ```
 f(x) = x^2 f'(x) = 2x
@@ -161,7 +143,7 @@ f(x) = x^2 f'(x) = 2x
 2 4 4 slope tilts right (increasing)
 ```
 
-For f(w) = wx + b with x=3, b=1:
+For $f(w) = wx + b$ with $x=3$, $b=1$:
 
 ```
 f(w) = 3w + 1 f'(w) = 3
@@ -189,18 +171,13 @@ Neural networks are chains of functions: input -> linear -> activation -> linear
 
 The gradient tells you the slope. The Hessian tells you the curvature.
 
-The Hessian is the matrix of second-order partial derivatives. For a function f(x1, x2, ..., xn), entry (i, j) of the Hessian is:
+The Hessian is the matrix of second-order partial derivatives. For a function $f(x_1, x_2, \ldots, x_n)$, entry $(i, j)$ of the Hessian is:
 
-```
-H[i][j] = d^2f / (dx_i * dx_j)
-```
+$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
 
 For a 2-variable function f(x, y):
 
-```
-H = | d^2f/dx^2 d^2f/dxdy |
- | d^2f/dydx d^2f/dy^2 |
-```
+$$H = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[8pt] \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}$$
 
 **What the Hessian tells you at a critical point (where gradient = 0):**
 
@@ -210,7 +187,7 @@ H = | d^2f/dx^2 d^2f/dxdy |
 | Negative definite (all eigenvalues < 0) | Local maximum | Bowl pointing down |
 | Indefinite (mixed eigenvalues) | Saddle point | Horse saddle shape |
 
-**Example:** f(x, y) = x^2 - y^2 (a saddle function)
+**Example:** $f(x, y) = x^2 - y^2$ (a saddle function)
 
 ```
 df/dx = 2x df/dy = -2y
@@ -223,7 +200,7 @@ Eigenvalues: 2 and -2 (one positive, one negative)
 --> Saddle point at (0, 0)
 ```
 
-Compare with f(x, y) = x^2 + y^2 (a bowl):
+Compare with $f(x, y) = x^2 + y^2$ (a bowl):
 
 ```
 H = | 2 0 |
@@ -237,10 +214,10 @@ Eigenvalues: 2 and 2 (both positive)
 
 Newton's method uses the Hessian to take better optimization steps than gradient
descent. Instead of just following the slope, it accounts for curvature:
 
-```
-Newton's update: w_new = w_old - H^(-1) * gradient
-Gradient descent: w_new = w_old - lr * gradient
-```
+| Method | Update rule |
+|--------|-------------|
+| Newton's method | $w_{\text{new}} = w_{\text{old}} - H^{-1} \nabla L$ |
+| Gradient descent | $w_{\text{new}} = w_{\text{old}} - \alpha \nabla L$ |
 
 Newton's method converges faster because the Hessian "rescales" the gradient -- steep directions get smaller steps, flat directions get larger steps.
 
@@ -249,7 +226,7 @@ The catch: for a neural network with N parameters, the Hessian is N x N. A model
 | Method | What it uses | Cost | Convergence |
 |--------|-------------|------|-------------|
 | Gradient descent | First derivatives only | O(N) per step | Slow (linear) |
-| Newton's method | Full Hessian | O(N^3) per step | Fast (quadratic) |
+| Newton's method | Full Hessian | $O(N^3)$ per step | Fast (quadratic) |
 | L-BFGS | Approximate Hessian from gradient history | O(N) per step | Medium (superlinear) |
 | Adam | Per-parameter adaptive rates (diagonal Hessian approx) | O(N) per step | Medium |
 | Natural gradient | Fisher information matrix (statistical Hessian) | O(N^2) per step | Fast |
@@ -260,17 +237,15 @@ In practice, Adam is the default optimizer for deep learning. It approximates se
 
 Any smooth function can be approximated locally by a polynomial:
 
-```
-f(x + h) = f(x) + f'(x)*h + (1/2)*f''(x)*h^2 + (1/6)*f'''(x)*h^3 + ...
-```
+$$f(x + h) = f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2 + \frac{1}{6}f'''(x)\,h^3 + \cdots$$
 
 The more terms you include, the better the approximation -- but only near the point x.
 
 **Why Taylor series matter for ML:**
 
-- **First-order Taylor = gradient descent.** When you use f(x + h) ~ f(x) + f'(x)*h, you are making a linear approximation. Gradient descent minimizes this linear model to choose h = -lr * f'(x).
+- **First-order Taylor = gradient descent.** When you use $f(x + h) \approx f(x) + f'(x)\,h$, you are making a linear approximation. Gradient descent minimizes this linear model to choose $h = -\alpha f'(x)$.
 
-- **Second-order Taylor = Newton's method.** Using f(x + h) ~ f(x) + f'(x)*h + (1/2)*f''(x)*h^2, you get a quadratic model. Minimizing it gives h = -f'(x)/f''(x) -- Newton's step.
+- **Second-order Taylor = Newton's method.** Using $f(x + h) \approx f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2$, you get a quadratic model. Minimizing it gives $h = -f'(x)/f''(x)$ — Newton's step.
 
 - **Loss function design.** MSE and cross-entropy are smooth, which means their Taylor expansions are well-behaved. This is not an accident. Smooth losses make optimization predictable.
 
@@ -292,27 +267,24 @@ Derivatives tell you rates of change. Integrals compute accumulations -- area un
 In ML, you rarely compute integrals by hand, but the concept is everywhere:
 
 **Probability.** For a continuous random variable with density p(x):
-```
-P(a < X < b) = integral from a to b of p(x) dx
-```
+$$P(a < X < b) = \int_a^b p(x)\,dx$$
 The area under the probability density curve between a and b is the probability of landing in that range.
 
 **Expected value.** The average outcome weighted by probability:
-```
-E[f(X)] = integral of f(x) * p(x) dx
-```
+
+$$\mathbb{E}[f(X)] = \int f(x)\, p(x)\,dx$$
 The expected loss over a data distribution is an integral. Training minimizes an empirical approximation of this.
 
 **KL divergence.** Measures how different two distributions are:
-```
-KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx
-```
+
+$$D_{\mathrm{KL}}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$
+
 Used in VAEs, knowledge distillation, and Bayesian inference.
 
 **Normalization constants.** In Bayesian inference:
-```
-p(w | data) = p(data | w) * p(w) / integral of p(data | w) * p(w) dw
-```
+
+$$p(w \mid \text{data}) = \frac{p(\text{data} \mid w)\, p(w)}{\int p(\text{data} \mid w)\, p(w)\, dw}$$
+
 The denominator is an integral over all possible parameter values. It is often intractable, which is why we use approximations like MCMC and variational inference.
 
 | Integral concept | Where it appears in ML |
@@ -353,7 +325,7 @@ This is all backpropagation is: the chain rule applied systematically through a
 
 When a function maps a vector to a vector (like a neural network layer), its derivative is a matrix. The Jacobian contains every partial derivative of every output with respect to every input.
 
-For f: R^n -> R^m, the Jacobian J is an m x n matrix:
+For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J$ is an $m \times n$ matrix:
 
 | | x1 | x2 | ... | xn |
 |---|---|---|---|---|
@@ -596,9 +568,9 @@ You just built gradient descent from scratch. PyTorch automates the gradient com
 
 ## Exercises
 
-1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of x^3 at x=2 is 12.
-2. Use gradient descent to find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2. Start from (0, 0). The answer should converge to (3, -1).
-3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on f(x) = x^4 - 3x^2.
+1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of $x^3$ at $x=2$ is 12.
+2. Use gradient descent to find the minimum of $f(x, y) = (x-3)^2 + (y+1)^2$. Start from $(0, 0)$. The answer should converge to $(3, -1)$.
+3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on $f(x) = x^4 - 3x^2$.
 
## Key Terms