2.4. Optimization & Convexity
🪄 Step 1: Intuition & Motivation
Core Idea: Once your model knows the direction of improvement (from gradients), the next question is: How do we move? Optimization is all about finding the sweet spot — the parameter values that minimize the loss function as efficiently, accurately, and stably as possible.
Simple Analogy: Imagine hiking down a mountain in fog. The gradient tells you which way is downhill, but optimization decides how far to step, how fast to move, and how to avoid zigzagging or getting stuck in pits.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Optimization in machine learning means adjusting model parameters to minimize a loss function — typically via iterative updates. At each step:
- Compute the gradient (the direction of steepest increase).
- Move opposite to it (to descend).
- Repeat until the slope flattens (near a minimum).
That’s gradient descent in essence.
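For a concrete single step, take the convex function $f(\theta) = \theta^2$ (an illustrative choice) with $\theta_0 = 2$ and learning rate $\eta = 0.1$. Since $f'(\theta) = 2\theta$:
$$ \theta_1 = \theta_0 - \eta f'(\theta_0) = 2 - 0.1 \cdot 4 = 1.6 $$
Each repetition multiplies $\theta$ by $0.8$, shrinking it geometrically toward the minimum at $\theta = 0$.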
If the function is convex (a perfect bowl shape), you’re guaranteed to reach the single lowest point — the global minimum. If it’s non-convex (bumpy landscape), you might get stuck in local minima or saddle points — a major challenge in deep learning.
Why It Works This Way
Gradient descent works because of the local linear approximation idea: Zoom in close enough to any smooth curve, and it looks like a line. So, by stepping downhill in small enough increments, we can follow the slope toward lower values.
The step size is controlled by the learning rate ($\eta$).
- Too small → slow convergence (you crawl).
- Too large → you overshoot or bounce around (you run past the valley).
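A minimal sketch of this trade-off in code; the quadratic objective, starting point, and step counts are illustrative assumptions, not from the text:

```python
# Sketch: effect of the learning rate on gradient descent over f(theta) = theta^2.
def gradient_descent(lr, steps=20, theta=2.0):
    for _ in range(steps):
        grad = 2 * theta        # f'(theta) = 2*theta
        theta -= lr * grad      # step opposite the gradient
    return theta

print(gradient_descent(lr=0.01))   # ~1.34: too small, still crawling toward 0
print(gradient_descent(lr=0.1))    # ~0.02: converges nicely
print(gradient_descent(lr=1.1))    # ~77: too large, each step overshoots and grows
```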
Different optimization algorithms (like SGD, Momentum, Adam) refine how we pick that step size or direction.
How It Fits in ML Thinking
Optimization is the engine that drives learning. Without it, gradients are just knowledge — not action.
Every training step in ML — whether it’s logistic regression or GPT — is just a variation of:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$
Optimization decides:
- How to pick $\eta$ (learning rate).
- Whether to use full data or batches (SGD).
- Whether to accelerate updates (Momentum, Adam).
Understanding this separates intuitive practitioners from mechanical tuners.
📐 Step 3: Mathematical Foundation
Gradient Descent
For parameters $\theta$ and loss $L(\theta)$:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$
- $\eta$: learning rate (controls step size)
- $\nabla_\theta L(\theta_t)$: gradient (direction of steepest ascent)
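As a hedged sketch, here is this exact update applied to a least-squares regression loss; the synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# Sketch: full-batch gradient descent on least-squares linear regression.
# Loss: L(theta) = (1/2n) ||X @ theta - y||^2, so grad = X.T @ (X @ theta - y) / n.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # synthetic features (illustrative)
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)  # noisy targets

theta, eta = np.zeros(3), 0.1
for t in range(500):
    grad = X.T @ (X @ theta - y) / len(y)        # gradient of the loss
    theta -= eta * grad                          # theta_{t+1} = theta_t - eta * grad

print(theta)  # recovers something close to true_theta
```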
Stochastic Gradient Descent (SGD)
Full gradient descent uses all data points to compute $\nabla L$. SGD instead samples a few randomly (a “mini-batch”) at each step:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L_i(\theta_t) $$
where $L_i$ is the loss on the randomly sampled example or mini-batch $i$. Advantages:
- Faster (uses less data each step)
- Adds a little “noise” that can help escape local minima.
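A mini-batch SGD sketch on the same kind of problem (data size, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

# Sketch: mini-batch SGD on least-squares regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, eta, batch = np.zeros(3), 0.05, 32
for t in range(2000):
    idx = rng.choice(len(y), size=batch, replace=False)  # sample a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch              # noisy gradient estimate
    theta -= eta * grad

print(theta)  # near [1.0, -2.0, 0.5], up to a little SGD noise
```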
Momentum Method
Momentum builds on SGD by remembering the previous update:
$$ v_t = \beta v_{t-1} + (1 - \beta)\nabla_\theta L(\theta_t) $$
$$ \theta_{t+1} = \theta_t - \eta v_t $$
Here, $\beta$ (typically 0.9–0.99) controls how much “past velocity” we keep.
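In code, the two equations above become a two-line update (the quadratic objective and constants are illustrative assumptions):

```python
# Sketch: momentum in its EMA form, matching the equations above, on f(theta) = theta^2.
beta, eta = 0.9, 0.1
theta, v = 2.0, 0.0
for t in range(100):
    grad = 2 * theta                    # f'(theta)
    v = beta * v + (1 - beta) * grad    # exponential moving average of gradients
    theta -= eta * v                    # step along the smoothed direction

print(theta)  # close to the minimum at 0
```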
Convexity
A function $f(x)$ is convex if the line segment between any two points on its graph lies on or above the curve itself:
$$ f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2), \quad \lambda \in [0, 1] $$
This ensures:
- Every local minimum is a global minimum.
- Gradient descent will find it reliably (no traps).
Deep learning loss surfaces, however, are non-convex — but in high dimensions, many local minima behave similarly, so optimization still works well.
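The defining inequality is easy to probe numerically; the test functions and endpoints below are illustrative assumptions:

```python
import numpy as np

# Sketch: checking f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) on sample points.
def jensen_gap(f, x1, x2, lam):
    # <= 0 when the chord lies on or above the curve between x1 and x2
    return f(lam * x1 + (1 - lam) * x2) - (lam * f(x1) + (1 - lam) * f(x2))

lams = np.linspace(0, 1, 101)
convex = lambda x: x**2               # convex bowl
bumpy = lambda x: np.sin(3 * x) + x   # non-convex wiggle

print(max(jensen_gap(convex, -1.0, 2.0, l) for l in lams))  # 0.0: inequality holds
print(max(jensen_gap(bumpy, -1.0, 2.0, l) for l in lams))   # > 0: violated somewhere
```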
Learning Rate Schedules
Rather than a fixed $\eta$, we often decay it over time:
$$ \eta_t = \frac{\eta_0}{1 + k t} $$
or use cyclic or cosine schedules. This helps us:
- Take large steps early (fast learning)
- Take small steps later (fine-tuning near minima)
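For instance, with the inverse-decay schedule above (constants chosen purely for illustration):

```python
# Sketch: inverse time decay, eta_t = eta0 / (1 + k*t).
eta0, k = 0.5, 0.01
for t in [0, 10, 100, 1000]:
    print(t, eta0 / (1 + k * t))
# 0    -> 0.500  (big early steps)
# 1000 -> 0.045  (small late steps near the minimum)
```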
Adam Optimizer (Briefly)
Adam combines momentum and adaptive learning rates:
$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla_\theta L(\theta_t) $$
$$ v_t = \beta_2 v_{t-1} + (1-\beta_2)\big(\nabla_\theta L(\theta_t)\big)^2 $$
(The full algorithm also bias-corrects $m_t$ and $v_t$ to offset their zero initialization; the sketch below includes this step.) Then adjust:
$$ \theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t}+\epsilon} $$
Adam converges quickly, but may fail to converge on certain convex problems because it over-adapts per-parameter learning rates, taking uneven steps.
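A minimal sketch of Adam on an illustrative quadratic, with the bias correction from the full algorithm included:

```python
import numpy as np

# Sketch: Adam on f(theta) = theta^2 (objective and constants are illustrative).
beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-8
theta, m, v = 2.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * theta
    m = beta1 * m + (1 - beta1) * grad       # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1**t)               # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # close to the minimum at 0
```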
🧠 Step 4: Key Ideas
- Optimization ≠ just math — it’s how models actually learn from data.
- Convexity gives mathematical guarantees; non-convexity requires heuristic faith.
- Momentum helps smooth noisy updates.
- The learning rate is often the single most important hyperparameter in training.
- Adam is fast but can overfit or fail to converge; SGD is slower but more reliable for large-scale training.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Gradient-based optimization scales to millions of parameters.
- Convex objectives guarantee convergence to the global minimum (given a suitable learning rate).
- Momentum and adaptive rates accelerate training dramatically.

Limitations:
- Non-convex surfaces can trap models in poor local minima or plateaus.
- Choosing hyperparameters (learning rate, momentum, decay) can be tricky.
- Adaptive optimizers (Adam, RMSProp) may not generalize as well.

Trade-offs:
- SGD → slower but robust, with better generalization.
- Adam/RMSProp → faster convergence, but may over-adapt.
- A balanced approach often starts with Adam for quick results, then fine-tunes with SGD (sketched below).
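One way that hand-off might look in PyTorch (a hedged sketch; the model, data, learning rates, and step counts are all illustrative assumptions):

```python
import torch

# Sketch: train with Adam first, then switch to SGD with momentum to fine-tune.
model = torch.nn.Linear(3, 1)                    # illustrative model
X, y = torch.randn(256, 3), torch.randn(256, 1)  # illustrative data
loss_fn = torch.nn.MSELoss()

def train(optimizer, steps):
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()      # compute gradients
        optimizer.step()     # apply the optimizer's update rule

train(torch.optim.Adam(model.parameters(), lr=1e-3), steps=500)               # fast start
train(torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9), steps=500)  # refine
```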
🚧 Step 6: Common Misunderstandings
- Myth: Adam always beats SGD. → Truth: Adam converges faster initially but can generalize worse; SGD often wins long-term.
- Myth: Bigger learning rate = faster learning. → Truth: Too large leads to divergence or oscillation.
- Myth: Convex functions are rare. → Truth: Many classical ML problems (logistic regression, SVMs) are convex; only deep nets break convexity.
🧩 Step 7: Mini Summary
🧠 What You Learned: Optimization translates gradient information into motion — the practical process of learning. Convexity ensures stability and guarantees; non-convexity makes learning powerful yet unpredictable.
⚙️ How It Works: Iterative updates like gradient descent adjust parameters to minimize loss. Variants like SGD, Momentum, and Adam refine the step size and direction for efficiency.
🎯 Why It Matters: Mastering optimization means mastering learning itself — it’s what turns “knowing the slope” into “reaching the summit.”