2.4. Optimization & Convexity

🪄 Step 1: Intuition & Motivation

  • Core Idea: Once your model knows the direction of improvement (from gradients), the next question is: How do we move? Optimization is all about finding the sweet spot — the parameter values that minimize the loss function as efficiently, accurately, and stably as possible.

  • Simple Analogy: Imagine hiking down a mountain in fog. The gradient tells you which way is downhill, but optimization decides how far to step, how fast to move, and how to avoid zigzagging or getting stuck in pits.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Optimization in machine learning means adjusting model parameters to minimize a loss function — typically via iterative updates. At each step:

  1. Compute the gradient (the direction of steepest increase).
  2. Move opposite to it (to descend).
  3. Repeat until the slope flattens (near a minimum).

That’s gradient descent in essence.

If the function is convex (a perfect bowl shape), you’re guaranteed to reach the single lowest point — the global minimum. If it’s non-convex (bumpy landscape), you might get stuck in local minima or saddle points — a major challenge in deep learning.
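To make this loop concrete, here is a minimal sketch in plain Python (the toy loss $f(x) = (x-3)^2$ and the starting point are illustrative choices, not from any particular model):

```python
# Gradient descent on a convex "bowl": f(x) = (x - 3)^2.
# The gradient is f'(x) = 2(x - 3); the global minimum sits at x = 3.
def grad(x):
    return 2.0 * (x - 3.0)

x = 10.0    # arbitrary starting point
eta = 0.1   # learning rate (step size)

for step in range(50):
    x = x - eta * grad(x)   # step opposite the gradient

print(f"x after 50 steps: {x:.4f}")   # converges toward 3.0
```

Because this loss is convex, any starting point slides into the same minimum; on a bumpy non-convex loss, the endpoint would depend on where you start.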


Why It Works This Way

Gradient descent works because of the local linear approximation idea: Zoom in close enough to any smooth curve, and it looks like a line. So, by stepping downhill in small enough increments, we can follow the slope toward lower values.

The step size is controlled by the learning rate ($\eta$).

  • Too small → slow convergence (you crawl).
  • Too large → you overshoot or bounce around (you run past the valley).

Different optimization algorithms (like SGD, Momentum, Adam) refine how we pick that step size or direction.
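Here is a quick numerical sketch of that learning-rate trade-off, reusing the same toy loss (the $\eta$ values are illustrative):

```python
# Effect of the learning rate on f(x) = (x - 3)^2, where f'(x) = 2(x - 3).
# Each update multiplies the distance to the minimum by (1 - 2 * eta),
# so convergence requires |1 - 2 * eta| < 1.
def run(eta, steps=25, x=10.0):
    for _ in range(steps):
        x = x - eta * 2.0 * (x - 3.0)
    return x

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta:<5} -> x after 25 steps: {run(eta):10.4f}")
# eta=0.01 crawls (still far from 3), eta=0.4 converges quickly,
# and eta=1.1 diverges: each step overshoots farther than the last.
```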


How It Fits in ML Thinking

Optimization is the engine that drives learning. Without it, gradients are just knowledge — not action.

Every training step in ML — whether it’s logistic regression or GPT — is just a variation of:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$

Optimization decides:

  • How to pick $\eta$ (learning rate).
  • Whether to use full data or batches (SGD).
  • Whether to accelerate updates (Momentum, Adam).

Understanding this separates intuitive practitioners from mechanical tuners.


📐 Step 3: Mathematical Foundation

Gradient Descent

For parameters $\theta$ and loss $L(\theta)$:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$
  • $\eta$: learning rate (controls step size)
  • $\nabla_\theta L(\theta_t)$: gradient (direction of steepest ascent)

Think of gradient descent as sliding down a valley: small steps ensure you don’t miss the bottom; large steps risk overshooting it.

Stochastic Gradient Descent (SGD)

Full gradient descent uses all data points to compute $\nabla L$. SGD instead estimates the gradient from a single randomly chosen sample, or a small random subset (a “mini-batch”), at each step, where $L_i$ is the loss on that sample:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L_i(\theta_t) $$

Advantages:

  • Faster (uses less data each step)
  • Adds a little “noise” that can help escape local minima.

SGD is like hiking down with a shaky flashlight — your steps wobble, but that randomness helps you discover deeper valleys.
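A minimal SGD sketch on a synthetic one-parameter regression problem (the data, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for y = 2x + noise; the loss is the mean squared error,
# and SGD estimates its gradient from a random mini-batch each step.
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)

w, eta, batch = 0.0, 0.05, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch)    # pick a random mini-batch
    xb, yb = X[idx], y[idx]
    grad_w = np.mean(2.0 * xb * (xb * w - yb))   # noisy gradient estimate
    w -= eta * grad_w

print(f"learned w: {w:.3f}")   # close to the true weight 2.0
```

Each step sees only 32 of the 1,000 points, so individual updates wobble, yet the average direction still points downhill.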

Momentum Method

Momentum builds on SGD by remembering the previous update:

$$ v_t = \beta v_{t-1} + (1 - \beta)\nabla_\theta L(\theta_t) $$

$$ \theta_{t+1} = \theta_t - \eta v_t $$

Here, $\beta$ (typically 0.9–0.99) controls how much “past velocity” we keep.

Imagine rolling a ball downhill. Momentum smooths out small bumps so you don’t get stuck — it keeps the ball rolling in roughly the same direction.
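A sketch of this update rule on the same toy loss (the parameter values are illustrative):

```python
# Momentum in the EMA form above: v is a running average of past gradients,
# and the parameter moves along v rather than along the raw gradient.
def grad(x):
    return 2.0 * (x - 3.0)   # gradient of f(x) = (x - 3)^2

x, v = 10.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + (1 - beta) * grad(x)   # accumulate "velocity"
    x = x - eta * v                       # step along the smoothed direction

print(f"x after 200 steps: {x:.4f}")   # converges toward 3.0
```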

Convexity

A function $f(x)$ is convex if the line segment between any two points on its graph lies on or above the curve itself:

$$ f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2) $$

This ensures:

  • A single global minimum value, with no spurious local minima (the minimizer is unique if $f$ is strictly convex).
  • Gradient descent will find it reliably (no traps), given a suitable learning rate.

Deep learning loss surfaces, however, are non-convex. Fortunately, in high dimensions many local minima turn out to have similarly low loss, so optimization still works well in practice.

Convex = perfect bowl. Non-convex = bumpy mountain range. Convex optimization is guaranteed peace; non-convex learning is controlled chaos.
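You can probe this definition numerically. A small sketch that checks the inequality on random points, for a convex bowl and a non-convex bumpy function (both functions are illustrative):

```python
import numpy as np

# Check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2) on random samples.
convex_f = lambda x: x**2                       # a perfect bowl
bumpy_g = lambda x: np.sin(3 * x) + x**2 / 10   # a bumpy landscape

rng = np.random.default_rng(1)
for name, fn in [("x^2", convex_f), ("sin(3x) + x^2/10", bumpy_g)]:
    holds = True
    for _ in range(10_000):
        x1, x2 = rng.uniform(-3, 3, size=2)
        lam = rng.uniform()
        lhs = fn(lam * x1 + (1 - lam) * x2)
        rhs = lam * fn(x1) + (1 - lam) * fn(x2)
        if lhs > rhs + 1e-12:
            holds = False
            break
    print(f"{name}: convexity inequality held on all samples? {holds}")
# x^2 passes every sample; the bumpy function fails quickly, and those
# violations are exactly where local minima can trap gradient descent.
```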

Learning Rate Schedules

Rather than a fixed $\eta$, we often decay it over time:

$$ \eta_t = \frac{\eta_0}{1 + k t} $$

or use cyclic or cosine schedules. This helps:

  • Take large steps early (fast learning)
  • Small steps later (fine-tuning near minima)

Start with bold moves when you’re far away; tiptoe carefully as you near the goal.
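A small sketch of two common schedules (the constants $\eta_0$, $k$, and the horizon $T$ are illustrative):

```python
import math

eta0, k, T = 0.1, 0.01, 1000   # initial rate, decay constant, total steps

def inverse_time(t):
    # The decay formula above: eta_t = eta0 / (1 + k * t)
    return eta0 / (1 + k * t)

def cosine(t):
    # Cosine schedule: smoothly anneals from eta0 down to 0 over T steps
    return 0.5 * eta0 * (1 + math.cos(math.pi * t / T))

for t in (0, 100, 500, 1000):
    print(f"t={t:>4}  inverse-time: {inverse_time(t):.4f}  cosine: {cosine(t):.4f}")
# Both start near eta0 (bold early steps) and shrink toward the end
# (careful steps near the minimum).
```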

Adam Optimizer (Briefly)

Adam combines momentum and adaptive learning rates:

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla_\theta L(\theta_t) $$

$$ v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\nabla_\theta L(\theta_t)\right)^2 $$

Then adjust:

$$ \theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t}+\epsilon} $$

(The full algorithm also applies bias corrections, dividing $m_t$ by $1-\beta_1^t$ and $v_t$ by $1-\beta_2^t$, so that early steps aren’t biased toward zero.)

Adam works fast, but may fail to converge on certain convex problems because it over-adapts learning rates — taking uneven steps.

Adam is like an overenthusiastic learner — very adaptive, but sometimes too jumpy to settle down.
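Putting the pieces together, here is a sketch of the Adam update on the toy loss, including the bias-correction terms mentioned above (the hyperparameters follow common defaults):

```python
# Adam on f(x) = (x - 3)^2: momentum-style first moment m, squared-gradient
# second moment v, bias corrections, then an adaptive step.
def grad(x):
    return 2.0 * (x - 3.0)

x, m, v = 10.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g          # first moment (direction memory)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (gradient scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    x -= eta * m_hat / (v_hat ** 0.5 + eps)  # adaptive step size per parameter

print(f"x after 500 steps: {x:.4f}")   # settles near 3.0
```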

🧠 Step 4: Key Ideas

  • Optimization ≠ just math — it’s how models actually learn from data.
  • Convexity gives mathematical guarantees; non-convexity requires heuristic faith.
  • Momentum helps smooth noisy updates.
  • Learning rate is the single most important hyperparameter in training.
  • Adam is fast but can overfit or fail to converge; SGD is slower but more reliable for large-scale training.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Gradient-based optimization scales to millions of parameters.
  • Convex objectives come with convergence guarantees (for suitable step sizes).
  • Momentum and adaptive rates accelerate training dramatically.

Limitations:

  • Non-convex surfaces can trap models in poor local minima or plateaus.
  • Choosing hyperparameters (learning rate, momentum, decay) can be tricky.
  • Adaptive optimizers (Adam, RMSprop) may not generalize as well.

Trade-offs:

  • SGD → slower but robust, with often better generalization.
  • Adam/RMSprop → faster convergence, but may over-adapt.
  • A balanced approach often starts with Adam for quick results, then fine-tunes with SGD.

🚧 Step 6: Common Misunderstandings

  • Myth: Adam always beats SGD. → Truth: Adam converges faster initially but can generalize worse; SGD often wins long-term.
  • Myth: Bigger learning rate = faster learning. → Truth: Too large leads to divergence or oscillation.
  • Myth: Convex functions are rare. → Truth: Many classical ML problems (logistic regression, SVMs) are convex; it’s mainly neural networks that give up convexity.

🧩 Step 7: Mini Summary

🧠 What You Learned: Optimization translates gradient information into motion — the practical process of learning. Convexity ensures stability and guarantees; non-convexity makes learning powerful yet unpredictable.

⚙️ How It Works: Iterative updates like gradient descent adjust parameters to minimize loss. Variants like SGD, Momentum, and Adam refine the step size and direction for efficiency.

🎯 Why It Matters: Mastering optimization means mastering learning itself — it’s what turns “knowing the slope” into “reaching the summit.”
