🧠 AI Computer Institute
Content is AI-generated for educational purposes. Verify critical information independently. A bharath.ai initiative.

Calculus for ML Cheat Sheet

Math · Grades 11-12 · 4 sections

Derivatives & Rules

// Derivative: Rate of change
df/dx = lim(h→0) [f(x+h) - f(x)] / h

// Basic derivatives
d/dx[c] = 0 (constant)
d/dx[x] = 1
d/dx[x^n] = n×x^(n-1) (power rule)
d/dx[e^x] = e^x
d/dx[ln(x)] = 1/x
d/dx[sin(x)] = cos(x)
d/dx[cos(x)] = -sin(x)
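Each rule above can be sanity-checked with a central difference (a small Python sketch; `numerical_derivative` is an ad-hoc helper, not a library function):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximates df/dx = lim(h→0) [f(x+h) - f(x-h)] / 2h
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx[x^3] = 3x^2, so at x = 2 the derivative is 12
print(numerical_derivative(lambda x: x**3, 2.0))   # ≈ 12.0
# d/dx[e^x] = e^x, so at x = 1 the derivative is e ≈ 2.71828
print(numerical_derivative(math.exp, 1.0))
```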

// Combination rules
Sum rule: (f + g)' = f' + g'
Product rule: (f×g)' = f'×g + f×g'
Quotient rule: (f/g)' = (f'×g - f×g') / g²

// Chain rule: d/dx[f(g(x))] = f'(g(x)) × g'(x)
Example: d/dx[sin(x²)] = cos(x²) × 2x
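The chain-rule example checks out numerically (sketch; the step size `h` is arbitrary):

```python
import math

def d_sin_x2(x):
    # Chain rule: d/dx[sin(x^2)] = cos(x^2) * 2x
    return math.cos(x**2) * 2 * x

x, h = 1.3, 1e-6
numeric = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
print(d_sin_x2(x), numeric)   # the two values agree
```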

// Second derivative: d²f/dx²
Rate of change of rate of change
Positive: Concave up (minimum at a critical point)
Negative: Concave down (maximum at a critical point)
Zero: Possible inflection point (concavity may change)

// Extrema
Local minimum: f'(x) = 0, f''(x) > 0
Local maximum: f'(x) = 0, f''(x) < 0
Inconclusive: f'(x) = 0, f''(x) = 0 (higher-order test needed; e.g. x³ has an inflection at 0, x⁴ a minimum)
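The second-derivative test, worked through on f(x) = x³ − 3x (a small sketch):

```python
def classify(second_deriv):
    # Second-derivative test at a critical point (where f'(x) = 0)
    if second_deriv > 0:
        return "local minimum"
    if second_deriv < 0:
        return "local maximum"
    return "inconclusive"

# f(x) = x^3 - 3x: f'(x) = 3x^2 - 3 = 0 at x = ±1, and f''(x) = 6x
for x in (-1.0, 1.0):
    print(x, classify(6 * x))   # x = -1: local maximum, x = 1: local minimum
```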

Partial Derivatives & Gradient

// Partial derivative: Derivative w.r.t. one variable
f(x, y) = x² + 3xy + y²

∂f/∂x = 2x + 3y (treat y as constant)
∂f/∂y = 3x + 2y (treat x as constant)
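The same partials can be checked numerically by holding the other variable fixed (sketch; `partial_x`/`partial_y` are ad-hoc helpers):

```python
def f(x, y):
    return x**2 + 3*x*y + y**2

def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)   # y held constant

def partial_y(f, x, y, h=1e-6):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)   # x held constant

# At (1, 2): ∂f/∂x = 2(1) + 3(2) = 8, ∂f/∂y = 3(1) + 2(2) = 7
print(partial_x(f, 1.0, 2.0), partial_y(f, 1.0, 2.0))
```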

// Gradient: Vector of all partial derivatives
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]

Points in the direction of steepest increase
Magnitude: the rate of increase in that direction

// Directional derivative
D_u f = ∇f · u  (u is unit vector)
Rate of change in direction u
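For the example f(x, y) = x² + 3xy + y² above, ∇f(1, 2) = (8, 7); the directional derivative along a chosen direction v = (3, 4) is then (sketch; v is an arbitrary example):

```python
import math

grad = (8.0, 7.0)               # ∇f at (1, 2) for f = x² + 3xy + y²
v = (3.0, 4.0)                  # example direction (not yet unit length)
norm = math.hypot(*v)
u = (v[0] / norm, v[1] / norm)  # normalize to a unit vector
D_u = grad[0] * u[0] + grad[1] * u[1]   # ∇f · u = (24 + 28) / 5
print(D_u)   # ≈ 10.4
```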

// Chain rule (multivariable)
If z = f(x,y), x = x(t), y = y(t)
dz/dt = ∂f/∂x × dx/dt + ∂f/∂y × dy/dt

Backpropagation in neural networks uses this!
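A concrete check of the multivariable chain rule, with hypothetical paths x(t) = t² and y(t) = 3t through the same f:

```python
def f(x, y): return x**2 + 3*x*y + y**2
def x_of(t): return t**2
def y_of(t): return 3*t

def dz_dt(t):
    x, y = x_of(t), y_of(t)
    dfdx, dfdy = 2*x + 3*y, 3*x + 2*y        # analytic partials
    return dfdx * (2*t) + dfdy * 3           # ∂f/∂x·dx/dt + ∂f/∂y·dy/dt

t, h = 0.7, 1e-6
numeric = (f(x_of(t + h), y_of(t + h)) - f(x_of(t - h), y_of(t - h))) / (2*h)
print(dz_dt(t), numeric)   # the two values agree
```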

// Jacobian: Matrix of all first partials
f: R^n → R^m
J = [∂f1/∂x1  ∂f1/∂x2  ...]
    [∂f2/∂x1  ∂f2/∂x2  ...]

Used for: Sensitivity analysis, chain rule
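A generic finite-difference Jacobian (sketch; `numerical_jacobian` is an ad-hoc helper), tested on f(x, y) = (xy, x + y), whose exact Jacobian is [[y, x], [1, 1]]:

```python
def numerical_jacobian(f, x, h=1e-6):
    # f maps a list of n floats to a list of m floats; returns an m×n matrix
    fx = f(x)
    J = []
    for i in range(len(fx)):
        row = []
        for j in range(len(x)):
            xp, xm = list(x), list(x)
            xp[j] += h
            xm[j] -= h
            row.append((f(xp)[i] - f(xm)[i]) / (2 * h))
        J.append(row)
    return J

f = lambda v: [v[0] * v[1], v[0] + v[1]]
print(numerical_jacobian(f, [2.0, 3.0]))   # ≈ [[3, 2], [1, 1]]
```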

// Hessian: Matrix of second partials
H = [∂²f/∂x²    ∂²f/∂x∂y]
    [∂²f/∂y∂x   ∂²f/∂y² ]

Used for: Optimization (Newton's method)
Second-order info about curvature
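The Hessian of the running example f(x, y) = x² + 3xy + y² is constant, [[2, 3], [3, 2]], which a central-difference sketch recovers:

```python
def hessian_2x2(f, x, y, h=1e-4):
    # Central second differences for a function of two variables
    fxx = (f(x + h, y) - 2*f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2*f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return [[fxx, fxy], [fxy, fyy]]

f = lambda x, y: x**2 + 3*x*y + y**2
print(hessian_2x2(f, 1.0, 2.0))   # ≈ [[2, 3], [3, 2]]
```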

Optimization

// Gradient descent: Iterative optimization
x_new = x_old - α × ∇f(x_old)
α: Learning rate (step size)

// Update rule
For each step, move opposite to gradient
Towards local minimum
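A minimal gradient-descent loop (sketch) on f(x) = (x − 3)², whose gradient is 2(x − 3) and whose minimum sits at x = 3:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # move opposite to the gradient
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)   # ≈ 3.0
```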

// Learning rate effects
Too small: Slow convergence
Too large: Overshoot, oscillate, diverge
Goldilocks: Fast, stable convergence

// Variants
Batch gradient descent: Use all data (stable, slow)
Stochastic GD: Use 1 sample (fast, noisy)
Mini-batch GD: Use batch of N samples (balance)
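A mini-batch variant on a toy 1-D regression, fitting the single weight w in ŷ = w·x to noise-free data y = 2x (the dataset and hyperparameters here are a hypothetical sketch):

```python
import random

def minibatch_sgd(data, lr=0.01, batch_size=4, epochs=300, seed=0):
    rng = random.Random(seed)
    w = 0.0                                   # model: y_hat = w * x
    for _ in range(epochs):
        rng.shuffle(data)                     # stochastic part: reshuffle
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean squared error over just this mini-batch
            g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * g
    return w

data = [(x, 2.0 * x) for x in [0.25 * i for i in range(20)]]
w = minibatch_sgd(data)
print(w)   # ≈ 2.0
```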

// Convergence
∇f = 0: Stationary point (min, max, or saddle)
∇f ≈ 0: Near convergence
Monitor loss: Should decrease over iterations

// Stopping criteria
Max iterations reached
∇f sufficiently small
Loss change < threshold
Validation loss increases (overfitting)
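The stopping criteria can be combined in one loop (sketch; the thresholds are illustrative). The test function f(x) = x² + cos x is convex with its minimum at x = 0:

```python
import math

def gradient_descent(f, grad, x0, lr=0.1, max_iters=10_000,
                     grad_tol=1e-8, loss_tol=1e-14):
    x, prev_loss = x0, f(x0)
    for it in range(max_iters):
        g = grad(x)
        if abs(g) < grad_tol:                 # gradient sufficiently small
            return x, it
        x -= lr * g
        loss = f(x)
        if abs(prev_loss - loss) < loss_tol:  # loss change below threshold
            return x, it
        prev_loss = loss
    return x, max_iters                       # max iterations reached

x_min, iters = gradient_descent(lambda x: x*x + math.cos(x),
                                lambda x: 2*x - math.sin(x), x0=2.0)
print(x_min, iters)   # x_min ≈ 0
```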

// Second-order methods
Newton's method: x_new = x - H^-1 × ∇f
Uses Hessian (2nd order info)
Faster convergence, more expensive

Quasi-Newton: Approximate Hessian (BFGS, L-BFGS)
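A 1-D Newton sketch on f(x) = eˣ − 2x (a hypothetical example), whose minimum lies at x = ln 2:

```python
import math

def newton_minimize(fp, fpp, x0, iters=20):
    # Second-order update: x_new = x - f'(x) / f''(x)
    x = x0
    for _ in range(iters):
        x -= fp(x) / fpp(x)
    return x

# f(x) = e^x - 2x: f'(x) = e^x - 2, f''(x) = e^x > 0 (convex)
x_star = newton_minimize(lambda x: math.exp(x) - 2,
                         lambda x: math.exp(x), x0=0.0)
print(x_star, math.log(2))   # converges to ln 2 ≈ 0.6931
```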

// Constrained optimization
Minimize f(x) subject to g(x) = 0
Lagrange multipliers: ∇f = λ∇g
Lagrangian: L = f(x) - λ×g(x)
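Worked example: minimize f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0. The condition ∇f = λ∇g gives x = y = λ/2, and the constraint then forces x = y = 1/2 with λ = 1 (this only checks the solution, it is not a general solver):

```python
x = y = 0.5          # candidate from solving 2x = λ, 2y = λ, x + y = 1
lam = 1.0
grad_f = (2 * x, 2 * y)
grad_g = (1.0, 1.0)

print(grad_f, (lam * grad_g[0], lam * grad_g[1]))   # ∇f = λ∇g holds
print(x + y - 1)                                    # constraint satisfied: 0.0
```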

Taylor Series & Approximation

// Taylor series: Approximate function with polynomial
f(x) ≈ f(a) + f'(a)(x-a) + f''(a)(x-a)²/2! + ...

// 1st order (linear approximation)
f(x) ≈ f(a) + f'(a)(x-a)
Gradient descent uses this locally

// 2nd order (quadratic approximation)
f(x) ≈ f(a) + f'(a)(x-a) + f''(a)(x-a)²/2
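Comparing the two approximations for f(x) = eˣ around a = 0 at x = 0.5 (every derivative of eˣ at 0 equals 1):

```python
import math

a, x = 0.0, 0.5
f0 = fp = fpp = math.exp(a)                 # e^a = 1 for every derivative

first  = f0 + fp * (x - a)                  # 1 + 0.5     = 1.5
second = first + fpp * (x - a)**2 / 2       # 1.5 + 0.125 = 1.625
exact  = math.exp(x)                        # ≈ 1.6487

print(first, second, exact)   # second order is the closer approximation
```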

// Newton's method uses 2nd order
Minimize: x_new = x - f'(x)/f''(x)

// Multivariate Taylor
f(x) ≈ f(a) + ∇f(a)·(x-a) + ½(x-a)^T H(a) (x-a) + ...
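For a quadratic like the running example f(x, y) = x² + 3xy + y², the second-order expansion around a = (0, 0) reproduces f exactly, since f(0,0) = 0, ∇f(0,0) = (0, 0), and H = [[2, 3], [3, 2]]:

```python
def f(x, y):
    return x**2 + 3*x*y + y**2

def taylor2(x, y):
    # ½ (x, y)^T H (x, y) with H = [[2, 3], [3, 2]]; lower-order terms vanish
    H = [[2.0, 3.0], [3.0, 2.0]]
    return 0.5 * (x * (H[0][0]*x + H[0][1]*y) + y * (H[1][0]*x + H[1][1]*y))

print(f(1.2, -0.7), taylor2(1.2, -0.7))   # identical, since f is quadratic
```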

// Application: Backpropagation
Neural network training uses gradient (1st order Taylor)
Approximates loss locally, moves opposite to gradient

// Convergence analysis
Curvature (2nd order) affects convergence
Sharp minimum: Hessian eigenvalues large
Flat region: Eigenvalues small
Saddle point: Mixed signs
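Eigenvalue signs of a symmetric 2×2 Hessian [[a, b], [b, c]] follow from the trace/determinant formula (sketch). Note that the running example f = x² + 3xy + y² has H = [[2, 3], [3, 2]] with mixed-sign eigenvalues, so its critical point at the origin is a saddle:

```python
import math

def eig_sym_2x2(a, b, c):
    # Eigenvalues of [[a, b], [b, c]]: (tr ± sqrt(tr² - 4·det)) / 2
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return ((tr - disc) / 2, (tr + disc) / 2)

print(eig_sym_2x2(2.0, 3.0, 2.0))   # (-1.0, 5.0): mixed signs, a saddle point
```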

// Convexity of the landscape
Convex function: Any local min is the global min
Non-convex: Gradient descent can get stuck in local minima or saddle points
Neural networks: Non-convex losses, yet training works well in practice
