2.2. Derivatives & Gradients
🪄 Step 1: Intuition & Motivation
Core Idea: Derivatives and gradients are the steering wheel of machine learning — they tell your model which direction to move to improve predictions and how fast to move there.
Every time your model adjusts its weights, it’s following the gradients — like hiking down a mountain blindfolded, guided only by the slope beneath your feet.
Simple Analogy: Imagine you’re on a hill in the dark and want to reach the lowest point (the loss minimum). You can’t see the valley, but if you feel which way the ground slopes downward, you can keep stepping that way. That “feeling” is your gradient.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
A derivative measures how a function changes when its input changes slightly.
For a single variable:
$$ f'(x) = \frac{df}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} $$
If $f'(x) > 0$ → function is increasing. If $f'(x) < 0$ → function is decreasing. If $f'(x) = 0$ → possible flat point (min or max).
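To make that limit concrete, here is a minimal Python sketch (the function and numbers are just an illustration, not from the text): it evaluates the difference quotient for $f(x) = x^3$ with a shrinking $\Delta x$ and watches the estimates settle toward the true slope.

```python
def f(x):
    return x ** 3          # example function; its true derivative is 3x^2

x0 = 2.0                   # point where we estimate the slope (true value: 12)
for dx in (1.0, 0.1, 0.01, 0.001):
    estimate = (f(x0 + dx) - f(x0)) / dx   # the difference quotient from the definition
    print(f"dx={dx:<6} -> f'(x0) ~ {estimate:.4f}")
# As dx shrinks, the estimates approach 12, the limit in the definition above.
```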
For multiple variables, we use the gradient:
$$ \nabla f(x_1, x_2, ..., x_n) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$
This gradient points in the direction of the steepest ascent — the direction where the function increases fastest. To minimize a loss, we move in the opposite direction.
That’s the essence of gradient descent.
Why It Works This Way
Imagine standing on a curved surface.
- The derivative tells how steep the ground is in one direction.
- The gradient tells the direction of maximum steepness in all directions combined.
When optimizing a model, we compute these slopes for every parameter (weight). The model then “steps downhill” by subtracting a fraction of that gradient (controlled by the learning rate).
This is what happens under the hood in every training iteration:
- Compute the loss.
- Compute gradients (via calculus).
- Update parameters: $w \leftarrow w - \eta \nabla L$.
That’s the entire brain of deep learning in three lines!
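Here is a minimal sketch of that three-step loop in plain Python with NumPy (the toy data and variable names are my own, not from any library): it fits a single weight $w$ so that $w \cdot x$ matches $y$, using squared error and its analytic gradient.

```python
import numpy as np

# Toy data: y = 3x, so the optimal weight is w = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0        # initial parameter
eta = 0.01     # learning rate

for step in range(200):
    pred = w * x
    loss = np.mean((pred - y) ** 2)          # 1. compute the loss
    grad = np.mean(2 * (pred - y) * x)       # 2. compute dL/dw via calculus
    w = w - eta * grad                       # 3. update: w <- w - eta * grad

print(w)  # close to 3.0: the loop walked downhill on the loss surface
```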
How It Fits in ML Thinking
- Derivatives = How a function changes locally.
- Gradients = The direction of maximum change in multi-dimensional space.
- Jacobian = Describes gradients for vector-valued functions (multi-output).
- Hessian = Describes curvature (second-order derivatives).
Together, these help models:
- Find optimal parameters (gradient descent).
- Understand curvature of loss surfaces (Newton methods).
- Backpropagate errors efficiently in deep networks.
📐 Step 3: Mathematical Foundation
Derivative (Single Variable)
Example: If $f(x) = x^2$, then $f'(x) = 2x$.
At $x=3$, the slope is $6$ → meaning a small increase in $x$ makes $f(x)$ increase by about 6 times that amount.
Gradient (Multi-variable)
For $f(x, y) = x^2 + y^2$:
$$ \nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix} $$
This gradient points directly away from the origin — the direction of steepest increase.
To minimize $f$, we move opposite the gradient:
$$ (x, y) \leftarrow (x, y) - \eta \nabla f $$
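As a quick sanity check, here is a small sketch (an illustrative setup, assuming nothing beyond plain Python) applying this exact update to $f(x, y) = x^2 + y^2$; the iterates slide toward the minimum at the origin.

```python
x, y = 4.0, -2.0   # arbitrary starting point
eta = 0.1          # learning rate

for _ in range(50):
    gx, gy = 2 * x, 2 * y                 # gradient of f(x, y) = x^2 + y^2
    x, y = x - eta * gx, y - eta * gy     # step opposite the gradient

print(x, y)   # both values end up very close to 0, the minimum
```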
Jacobian & Hessian (Higher-Order Insight)
Jacobian ($J$): Matrix of all first-order partial derivatives. For $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$,
$$ J_{ij} = \frac{\partial f_i}{\partial x_j} $$
Hessian ($H$): Matrix of second-order partial derivatives for a scalar function.
$$ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$
The Hessian reveals curvature — how the slope changes. Positive curvature → bowl shape (min). Negative → hill (max).
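To see that curvature claim concretely, here is a brief sketch (assuming NumPy; the example function is my choice): for $f(x, y) = x^2 + y^2$ the Hessian is constant with 2s on the diagonal, and its positive eigenvalues confirm the bowl shape.

```python
import numpy as np

# Hessian of f(x, y) = x^2 + y^2: the second partials are 2, 0, 0, 2
H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)   # [2. 2.] -- all positive
# All eigenvalues > 0  => positive curvature in every direction => a bowl (minimum).
# A negative eigenvalue would indicate a downhill direction (a hill or a saddle).
```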
Connection to Backpropagation
Backpropagation in neural networks is just the chain rule applied repeatedly across layers:
$$ \frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial z_i} \frac{\partial z_i}{\partial W_i} $$
Each layer passes gradients backward to update weights. The model “learns” by adjusting weights proportional to how much they contributed to the error.
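Here is a minimal sketch of that chain rule for a single neuron (all names and numbers are illustrative): with $z = wx$, $a = \sigma(z)$, $L = (a - y)^2$, the three local derivatives multiply to give $\partial L / \partial w$, and a finite-difference check agrees.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, w = 1.5, 1.0, 0.4      # input, target, weight

# Forward pass
z = w * x
a = sigmoid(z)
L = (a - y) ** 2

# Backward pass: the chain rule, one factor per step of the forward pass
dL_da = 2 * (a - y)           # dL/da
da_dz = a * (1 - a)           # d(sigmoid)/dz
dz_dw = x                     # dz/dw
dL_dw = dL_da * da_dz * dz_dw

# Finite-difference check of the same derivative
eps = 1e-6
L_eps = (sigmoid((w + eps) * x) - y) ** 2
print(dL_dw, (L_eps - L) / eps)   # the two values match closely
```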
🧠 Step 4: Key Ideas
- Derivative = rate of change in one direction.
- Gradient = vector of all partial derivatives — direction of steepest ascent.
- Jacobian = generalization for vector outputs.
- Hessian = measures curvature of the loss surface.
- Backpropagation uses the chain rule to compute gradients efficiently.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Foundation of optimization and model training.
- Provides clear geometric interpretation of learning direction.
- Enables gradient-based learning for all modern deep models.
- Gradients can vanish (approach zero) or explode (grow too large), making training unstable (a short numeric sketch follows this list).
- Second-order methods (using Hessians) are often too computationally heavy.
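To illustrate the vanishing case, here is a toy sketch (my own setup, not a real network): backpropagating through a stack of sigmoid "layers" multiplies in one derivative of at most 0.25 per layer, so the gradient shrinks by orders of magnitude as depth grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy chain: each "layer" computes a = sigmoid(w * a_prev) with w = 1.
# Backprop multiplies one local derivative, sigmoid'(z) * w <= 0.25, per layer.
a, grad = 0.5, 1.0
for depth in range(1, 31):
    a = sigmoid(1.0 * a)
    grad *= a * (1 - a) * 1.0     # accumulate the local derivative
    if depth % 10 == 0:
        print(depth, grad)        # the gradient shrinks by orders of magnitude
```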
🚧 Step 6: Common Misunderstandings
- Myth: Gradients always point to the global minimum. → Truth: They only point toward a local minimum, which may not be global.
- Myth: If gradients vanish, the model just needs more epochs. → Truth: No — you need architectural fixes like better activations or normalization.
- Myth: Backprop is a mysterious algorithm. → Truth: It’s just the chain rule applied systematically across layers.
🧩 Step 7: Mini Summary
🧠 What You Learned: Derivatives capture how a function changes locally; gradients extend that to multiple dimensions, guiding optimization.
⚙️ How It Works: The gradient tells the direction of steepest ascent; moving against it minimizes loss. Backpropagation efficiently computes these gradients for all parameters.
🎯 Why It Matters: Gradients are the language of learning — without them, no modern model could adjust itself, improve predictions, or converge.