rohitg00 · sawgatio · May 24, 2026
diff --git a/phases/01-math-foundations/04-calculus-for-ml/docs/en.md b/phases/01-math-foundations/04-calculus-for-ml/docs/en.md
@@ -9,7 +9,7 @@
 
 ## Learning Objectives
 
-- Compute numerical and analytical derivatives for common ML functions (x^2, sigmoid, cross-entropy)
+- Compute numerical and analytical derivatives for common ML functions ($x^2$, sigmoid, cross-entropy)
 - Implement gradient descent from scratch to minimize a loss function in 1D and 2D
 - Derive the gradient of a linear regression model and train it via manual weight updates
 - Explain the Hessian matrix, Taylor series approximations, and their connection to optimization methods
@@ -24,11 +24,11 @@ Without calculus, training a neural network would mean trying random changes and
 
 ### What is a derivative?
 
-A derivative measures the rate of change. For a function y = f(x), the derivative f'(x) tells you: if you nudge x by a tiny amount, how much does y change?
+A derivative measures the rate of change. For a function $y = f(x)$, the derivative $f'(x)$ tells you: if you nudge $x$ by a tiny amount, how much does $y$ change?
 
 Geometrically, the derivative is the slope of the tangent line at a point.
 
-**f(x) = x^2:**
+**$f(x) = x^2$:**
 
 | x | f(x) | f'(x) (slope) |
 |---|------|---------------|
@@ -41,38 +41,31 @@ At x=2, the slope is 4. If you move x a tiny bit to the right, y increases by ab
 
 The formal definition:
 
-```
-f'(x) = lim   f(x + h) - f(x)
-        h->0  -----------------
-                     h
-```
+$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
 
 In code, you skip the limit and just use a very small h. That is the numerical derivative.
 
 ### Partial derivatives: one variable at a time
 
 Real functions have many inputs. A neural network loss depends on thousands of weights. A partial derivative holds all variables constant except one, then takes the derivative with respect to that one.
 
-```
-f(x, y) = x^2 + 3xy + y^2
-
-df/dx = 2x + 3y     (treat y as a constant)
-df/dy = 3x + 2y     (treat x as a constant)
-```
+$$f(x, y) = x^2 + 3xy + y^2$$
+
+$$\frac{\partial f}{\partial x} = 2x + 3y \quad \text{(treat } y \text{ as a constant)}$$
+
+$$\frac{\partial f}{\partial y} = 3x + 2y \quad \text{(treat } x \text{ as a constant)}$$
 
 Each partial derivative answers: if I nudge just this one weight, how does the loss change?
 
 ### The gradient: vector of all partial derivatives
 
 The gradient collects every partial derivative into one vector. For a function f(x, y, z), the gradient is:
 
-```
-grad f = [ df/dx, df/dy, df/dz ]
-```
+$$\nabla f = \left[\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right]$$
 
 The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction.
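
One way to sanity-check the partial derivatives above is to estimate them numerically and compare against the hand-derived formulas. The sketch below assumes the example function $f(x, y) = x^2 + 3xy + y^2$; the helper name `numerical_gradient` and the test point $(1, 2)$ are illustrative choices, not part of the lesson code.

```python
def f(x, y):
    # Example function from the text: f(x, y) = x^2 + 3xy + y^2
    return x ** 2 + 3 * x * y + y ** 2


def numerical_gradient(func, x, y, h=1e-5):
    """Estimate (df/dx, df/dy) by nudging one variable at a time."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # y held constant
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # x held constant
    return df_dx, df_dy


x0, y0 = 1.0, 2.0  # arbitrary illustrative point
approx = numerical_gradient(f, x0, y0)
exact = (2 * x0 + 3 * y0, 3 * x0 + 2 * y0)  # analytical gradient from the text

print("numerical:", approx)
print("analytical:", exact)
```

The two results should agree to several decimal places. Negating either vector gives the direction of steepest descent at that point.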
 
-**Contour plot of f(x,y) = x^2 + y^2:**
+**Contour plot of $f(x,y) = x^2 + y^2$:**
 
 The function forms a bowl shape with concentric circles as contour lines. The minimum is at (0, 0).
 
@@ -87,16 +80,11 @@ This is gradient descent in a picture. Compute the gradient, negate it, take a s
 
 Training a neural network is optimization. You have a loss function L(w1, w2, ..., wn) that measures how wrong the model is. You want to minimize it.
 
-```
-Gradient descent update rule:
-
-  w_new = w_old - learning_rate * dL/dw
+$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial w}$$
 
-For every weight:
-  1. Compute the partial derivative of loss with respect to that weight
-  2. Subtract a small multiple of it from the weight
-  3. Repeat
-```
+1. Compute the partial derivative of loss with respect to that weight
+2. Subtract a small multiple of it from the weight
+3. Repeat
 
 The learning rate controls step size. Too big and you overshoot. Too small and you crawl.
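
The effect of the learning rate can be seen in a few lines. This is a minimal sketch, assuming an illustrative 1D loss $L(w) = w^2$ whose derivative is $2w$; the function names, starting point, and the three learning rates are hypothetical values chosen for the demonstration.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly apply w_new = w_old - learning_rate * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def grad_loss(w):
    # Derivative of the illustrative loss L(w) = w**2 (minimum at w = 0)
    return 2 * w


w_good = gradient_descent(grad_loss, w0=5.0, learning_rate=0.1, steps=100)
w_slow = gradient_descent(grad_loss, w0=5.0, learning_rate=0.001, steps=100)
w_big = gradient_descent(grad_loss, w0=5.0, learning_rate=1.1, steps=100)

print("lr=0.1   ->", w_good)  # ends very close to the minimum
print("lr=0.001 ->", w_slow)  # has barely moved from the start
print("lr=1.1   ->", w_big)   # each step overshoots and the iterate grows
```

On this particular loss, any learning rate above 1.0 makes each step land farther from the minimum than the previous one, which is the overshoot the text describes.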
 
@@ -116,19 +104,13 @@ Gradient descent follows the slope downhill. It can get stuck in local minima, b
 
 There are two ways to compute a derivative.
 
-Analytical: apply calculus rules by hand. For f(x) = x^2, the derivative is f'(x) = 2x. Exact. Fast.
+Analytical: apply calculus rules by hand. For $f(x) = x^2$, the derivative is $f'(x) = 2x$. Exact. Fast.
 
 Numerical: approximate using the definition. Compute f(x+h) and f(x-h) for a tiny h, then use the difference.
 
-```
-Numerical (central difference):
-
-f'(x) ~= f(x + h) - f(x - h)
-          -----------------------
-                  2h
-
-h = 0.0001 works well in practice
-```
+$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$
+
+$h = 0.0001$ works well in practice.
 
 Numerical derivatives are slower but work for any function. Analytical derivatives are fast but require you to derive the formula. Neural network frameworks use a third approach: automatic differentiation, which computes exact derivatives mechanically. You will see that in Phase 3.
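
As a quick comparison of the two approaches, the sketch below applies the central difference formula to $x^2$ and to the sigmoid, then checks the results against the analytical derivatives. The helper name `central_difference` is a hypothetical stand-in and may differ from the function defined later in the lesson.

```python
import math


def central_difference(f, x, h=1e-4):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# f(x) = x^2 at x = 2: the analytical derivative is 2x = 4
approx_square = central_difference(lambda x: x ** 2, 2.0)

# sigmoid at x = 0: the analytical derivative is s(0) * (1 - s(0)) = 0.25
approx_sigmoid = central_difference(sigmoid, 0.0)

print(approx_square, approx_sigmoid)
```

Both estimates should match the analytical values to many decimal places, at the cost of two function evaluations per derivative.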
 
@@ -148,7 +130,7 @@ f(x) = ln(x)   f'(x) = 1/x     Cross-entropy loss
 f(x) = 1/(1+e^-x)  f'(x) = f(x)(1-f(x))   Sigmoid activation
 ```
 
-For f(x) = x^2:
+For $f(x) = x^2$:
 
 ```
 f(x) = x^2    f'(x) = 2x
@@ -161,7 +143,7 @@ f(x) = x^2    f'(x) = 2x
    2    4       4      slope tilts right (increasing)
 ```
 
-For f(w) = wx + b with x=3, b=1:
+For $f(w) = wx + b$ with $x=3$, $b=1$:
 
 ```
 f(w) = 3w + 1    f'(w) = 3
@@ -189,18 +171,13 @@ Neural networks are chains of functions: input -> linear -> activation -> linear
 
 The gradient tells you the slope. The Hessian tells you the curvature.
 
-The Hessian is the matrix of second-order partial derivatives. For a function f(x1, x2, ..., xn), entry (i, j) of the Hessian is:
+The Hessian is the matrix of second-order partial derivatives. For a function $f(x_1, x_2, \ldots, x_n)$, entry $(i, j)$ of the Hessian is:
 
-```
-H[i][j] = d^2f / (dx_i * dx_j)
-```
+$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
 
 For a 2-variable function f(x, y):
 
-```
-H = | d^2f/dx^2    d^2f/dxdy |
-    | d^2f/dydx    d^2f/dy^2 |
-```
+$$H = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[8pt] \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}$$
 
 **What the Hessian tells you at a critical point (where gradient = 0):**
 
@@ -210,7 +187,7 @@ H = | d^2f/dx^2    d^2f/dxdy |
 | Negative definite (all eigenvalues < 0) | Local maximum | Bowl pointing down |
 | Indefinite (mixed eigenvalues) | Saddle point | Horse saddle shape |
 
-**Example:** f(x, y) = x^2 - y^2 (a saddle function)
+**Example:** $f(x, y) = x^2 - y^2$ (a saddle function)
 
 ```
 df/dx = 2x       df/dy = -2y
@@ -223,7 +200,7 @@ Eigenvalues: 2 and -2 (one positive, one negative)
 --> Saddle point at (0, 0)
 ```
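
The eigenvalue test can be checked in code. This sketch uses the closed-form eigenvalues of a symmetric 2x2 matrix, so it needs no external libraries; the helper names are hypothetical, and the two Hessians are the constant ones for $x^2 - y^2$ and for the bowl $x^2 + y^2$ from the contour example.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify_critical_point(eigs):
    lo, hi = eigs
    if lo > 0:
        return "local minimum"   # positive definite
    if hi < 0:
        return "local maximum"   # negative definite
    if lo < 0 < hi:
        return "saddle point"    # indefinite
    return "inconclusive"        # at least one zero eigenvalue


saddle_eigs = eigenvalues_2x2_symmetric(2.0, 0.0, -2.0)  # Hessian of x^2 - y^2
bowl_eigs = eigenvalues_2x2_symmetric(2.0, 0.0, 2.0)     # Hessian of x^2 + y^2

print(saddle_eigs, classify_critical_point(saddle_eigs))
print(bowl_eigs, classify_critical_point(bowl_eigs))
```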
 
-Compare with f(x, y) = x^2 + y^2 (a bowl):
+Compare with $f(x, y) = x^2 + y^2$ (a bowl):
 
 ```
 H = | 2  0 |
@@ -237,10 +214,10 @@ Eigenvalues: 2 and 2 (both positive)
 
 Newton's method uses the Hessian to take better optimization steps than gradient descent. Instead of just following the slope, it accounts for curvature:
 
-```
-Newton's update:    w_new = w_old - H^(-1) * gradient
-Gradient descent:   w_new = w_old - lr * gradient
-```
+| Method | Update rule |
+|--------|-------------|
+| Newton's method | $w_{\text{new}} = w_{\text{old}} - H^{-1} \nabla L$ |
+| Gradient descent | $w_{\text{new}} = w_{\text{old}} - \alpha \nabla L$ |
 
 Newton's method converges faster because the Hessian "rescales" the gradient -- steep directions get smaller steps, flat directions get larger steps.
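
To make the rescaling concrete, here is a small sketch on an assumed quadratic $f(x, y) = 50x^2 + 0.5y^2$, which is steep along $x$ and flat along $y$. Its Hessian is diagonal with entries 100 and 1, so inverting it reduces to an element-wise division. The function, starting point, and learning rate are illustrative choices only.

```python
def grad(w):
    # Gradient of the illustrative loss f(x, y) = 50*x**2 + 0.5*y**2
    x, y = w
    return (100.0 * x, 1.0 * y)


# The Hessian of this loss is diag(100, 1), constant everywhere
HESSIAN_DIAGONAL = (100.0, 1.0)


def gradient_descent_step(w, learning_rate):
    g = grad(w)
    return tuple(wi - learning_rate * gi for wi, gi in zip(w, g))


def newton_step(w):
    # For a diagonal Hessian, H^{-1} * gradient is a per-coordinate division
    g = grad(w)
    return tuple(wi - gi / hi for wi, gi, hi in zip(w, g, HESSIAN_DIAGONAL))


start = (1.0, 1.0)
after_gd = gradient_descent_step(start, learning_rate=0.01)
after_newton = newton_step(start)

print("gradient descent:", after_gd)   # x is solved, y has barely moved
print("Newton:", after_newton)         # both coordinates reach the minimum
```

Gradient descent has to keep its learning rate small enough for the steep direction, which leaves the flat direction crawling. The Newton step divides each gradient component by its curvature and, on an exact quadratic like this one, lands on the minimum in a single step.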
 
@@ -249,7 +226,7 @@ The catch: for a neural network with N parameters, the Hessian is N x N. A model
 | Method | What it uses | Cost | Convergence |
 |--------|-------------|------|-------------|
 | Gradient descent | First derivatives only | O(N) per step | Slow (linear) |
-| Newton's method | Full Hessian | O(N^3) per step | Fast (quadratic) |
+| Newton's method | Full Hessian | $O(N^3)$ per step | Fast (quadratic) |
 | L-BFGS | Approximate Hessian from gradient history | O(N) per step | Medium (superlinear) |
 | Adam | Per-parameter adaptive rates (diagonal Hessian approx) | O(N) per step | Medium |
 | Natural gradient | Fisher information matrix (statistical Hessian) | O(N^2) per step | Fast |
@@ -260,17 +237,15 @@ In practice, Adam is the default optimizer for deep learning. It approximates se
 
 Any smooth function can be approximated locally by a polynomial:
 
-```
-f(x + h) = f(x) + f'(x)*h + (1/2)*f''(x)*h^2 + (1/6)*f'''(x)*h^3 + ...
-```
+$$f(x + h) = f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2 + \frac{1}{6}f'''(x)\,h^3 + \cdots$$
 
 The more terms you include, the better the approximation -- but only near the point x.
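
The sketch below illustrates both halves of that statement using $e^x$ expanded around $x = 0$, an example chosen here because every derivative of $e^x$ at 0 equals 1. The helper name `taylor_exp` and the step sizes are hypothetical.

```python
import math


def taylor_exp(h, order):
    """Taylor polynomial of e**x around x = 0, up to the given order."""
    return sum(h ** k / math.factorial(k) for k in range(order + 1))


# Near the expansion point: each extra term shrinks the error
h = 0.5
errors = [abs(math.exp(h) - taylor_exp(h, order)) for order in (1, 2, 3)]

# Far from the expansion point: the second-order model is badly off
far_error = abs(math.exp(3.0) - taylor_exp(3.0, 2))

print("errors at h=0.5 for orders 1, 2, 3:", errors)
print("second-order error at h=3.0:", far_error)
```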
 
 **Why Taylor series matter for ML:**
 
-- **First-order Taylor = gradient descent.** When you use f(x + h) ~ f(x) + f'(x)*h, you are making a linear approximation. Gradient descent minimizes this linear model to choose h = -lr * f'(x).
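+- **First-order Taylor = gradient descent.** When you use $f(x + h) \approx f(x) + f'(x)\,h$, you are making a linear approximation. The linear model decreases fastest when $h$ points opposite to $f'(x)$, so gradient descent chooses $h = -\alpha f'(x)$, with the learning rate $\alpha$ keeping the step small enough for the approximation to hold.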
 
-- **Second-order Taylor = Newton's method.** Using f(x + h) ~ f(x) + f'(x)*h + (1/2)*f''(x)*h^2, you get a quadratic model. Minimizing it gives h = -f'(x)/f''(x) -- Newton's step.
+- **Second-order Taylor = Newton's method.** Using $f(x + h) \approx f(x) + f'(x)\,h + \frac{1}{2}f''(x)\,h^2$, you get a quadratic model. Minimizing it gives $h = -f'(x)/f''(x)$ -- Newton's step.
 
 - **Loss function design.** MSE and cross-entropy are smooth, which means their Taylor expansions are well-behaved. This is not an accident. Smooth losses make optimization predictable.
 
@@ -292,27 +267,24 @@ Derivatives tell you rates of change. Integrals compute accumulations -- area un
 In ML, you rarely compute integrals by hand, but the concept is everywhere:
 
 **Probability.** For a continuous random variable with density p(x):
-```
-P(a < X < b) = integral from a to b of p(x) dx
-```
+$$P(a < X < b) = \int_a^b p(x)\,dx$$
 The area under the probability density curve between a and b is the probability of landing in that range.
 
 **Expected value.** The average outcome weighted by probability:
-```
-E[f(X)] = integral of f(x) * p(x) dx
-```
+
+$$\mathbb{E}[f(X)] = \int f(x)\, p(x)\,dx$$
 The expected loss over a data distribution is an integral. Training minimizes an empirical approximation of this.
 
 **KL divergence.** Measures how different two distributions are:
-```
-KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx
-```
+
+$$D_{\mathrm{KL}}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$
+
 Used in VAEs, knowledge distillation, and Bayesian inference.
 
 **Normalization constants.** In Bayesian inference:
-```
-p(w | data) = p(data | w) * p(w) / integral of p(data | w) * p(w) dw
-```
+
+$$p(w \mid \text{data}) = \frac{p(\text{data} \mid w)\, p(w)}{\int p(\text{data} \mid w)\, p(w)\, dw}$$
+
 The denominator is an integral over all possible parameter values. It is often intractable, which is why we use approximations like MCMC and variational inference.
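
When an integral is low-dimensional, it can be approximated directly by summing thin slices. The sketch below assumes a standard normal density and uses a simple trapezoid rule to estimate a probability and an expected value; the helper names, integration limits, and slice count are illustrative choices.

```python
import math


def normal_pdf(x):
    # Density of a standard normal distribution (mean 0, variance 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def trapezoid(f, a, b, n=10_000):
    """Approximate the integral of f from a to b using n trapezoids."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


# P(-1 < X < 1): area under the density between -1 and 1
prob_within_one_sigma = trapezoid(normal_pdf, -1.0, 1.0)

# E[X^2]: integral of x^2 * p(x), truncated to [-10, 10] where the tails are negligible
expected_square = trapezoid(lambda x: x * x * normal_pdf(x), -10.0, 10.0)

print(prob_within_one_sigma)  # roughly 0.68
print(expected_square)        # roughly 1.0, the variance
```

Grid-based integration like this stops being practical as the number of dimensions grows, which is one reason high-dimensional posteriors call for the approximations mentioned above.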
 
 | Integral concept | Where it appears in ML |
@@ -353,7 +325,7 @@ This is all backpropagation is: the chain rule applied systematically through a
 
 When a function maps a vector to a vector (like a neural network layer), its derivative is a matrix. The Jacobian contains every partial derivative of every output with respect to every input.
 
-For f: R^n -> R^m, the Jacobian J is an m x n matrix:
+For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J$ is an $m \times n$ matrix:
 
 | | x1 | x2 | ... | xn |
 |---|---|---|---|---|
@@ -596,9 +568,9 @@ You just built gradient descent from scratch. PyTorch automates the gradient com
 
 ## Exercises
 
-1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of x^3 at x=2 is 12.
-2. Use gradient descent to find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2. Start from (0, 0). The answer should converge to (3, -1).
-3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on f(x) = x^4 - 3x^2.
+1. Implement `numerical_second_derivative(f, x)` using `numerical_derivative` called twice. Verify that the second derivative of $x^3$ at $x=2$ is 12.
+2. Use gradient descent to find the minimum of $f(x, y) = (x-3)^2 + (y+1)^2$. Start from $(0, 0)$. The answer should converge to $(3, -1)$.
+3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on $f(x) = x^4 - 3x^2$.
 
 ## Key Terms