2.1. Limits, Continuity & Differentiability
🪄 Step 1: Intuition & Motivation
Core Idea: In data science and machine learning, training a model means minimizing a loss function, making it as small as possible. To do that, we rely on gradients: the rate at which the loss changes as we nudge the model's parameters. But for gradients to exist, the loss must be continuous and differentiable.
Simple Analogy: Think of driving on a road.
- If the road is continuous, there are no gaps — you can drive smoothly.
- If the road is differentiable, there are no sharp turns — your steering wheel moves smoothly.
- If the road has corners or jumps, you can still drive (the model still runs), but your optimization (gradient descent) will stutter or crash.
That is essentially what can happen when a model's loss surface has kinks, for example from non-smooth activation functions like ReLU.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
1️⃣ Limits
A limit describes what a function approaches as you get close to a point — even if the function isn’t defined exactly at that point.
For example:
$$ \lim_{x \to 2} (3x + 1) = 7 $$
The limit tells you the "trend" or "approach value": the outputs of $3x + 1$ head toward 7 as $x$ nears 2, and they would do so even if the function were not defined at $x = 2$.
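To see the "approach" numerically, here is a tiny sketch in plain Python (the step sizes are arbitrary illustrative choices): evaluate $3x + 1$ at points creeping toward $x = 2$ from both sides and watch the outputs close in on 7.

```python
def f(x):
    return 3 * x + 1

# Probe x = 2 from the left and from the right with shrinking steps.
for h in [0.1, 0.01, 0.001, 0.0001]:
    print(f"h={h:<7} f(2-h)={f(2 - h):.4f}   f(2+h)={f(2 + h):.4f}")

# Both columns approach 7, matching lim_{x -> 2} (3x + 1) = 7.
```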
2️⃣ Continuity
A function is continuous at a point $x = c$ if:
- The limit exists.
- The function is defined at that point.
- The limit equals the function’s value.
This means the graph can be drawn without lifting your pen.
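To make the three conditions concrete, here is a minimal numerical sketch (the piecewise function `g` and the tolerance are invented purely for illustration): estimate the left and right limits at a point and compare them with the function's value there.

```python
def g(x):
    # A piecewise function with a jump at x = 1.
    return x + 1 if x < 1 else x + 3

def one_sided_values(func, c, h=1e-6):
    """Approximate the left- and right-hand limits of func at c."""
    return func(c - h), func(c + h)

c = 1.0
left, right = one_sided_values(g, c)
value = g(c)
print(f"left ~ {left:.4f}, right ~ {right:.4f}, g({c}) = {value:.4f}")

# Continuity at c needs: left limit == right limit == g(c).
# Here the left side heads to 2 while the right side heads to 4,
# so g jumps at x = 1 and is not continuous there.
tol = 1e-3
print("continuous at c:", abs(left - right) < tol and abs(right - value) < tol)
```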
3️⃣ Differentiability
A function is differentiable if it has a well-defined slope everywhere — meaning its rate of change doesn’t suddenly “jump.”
If $f(x)$ is differentiable at $x=c$, it’s automatically continuous there (but not the other way around).
In machine learning, differentiability ensures gradients exist — and without gradients, gradient descent can’t find the minimum.
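A quick way to feel the "not the other way around" part is to compare one-sided slopes of $|x|$ at 0 in a short sketch (step sizes again arbitrary): the function is continuous there, yet the left and right slopes never agree, so no single derivative exists.

```python
c = 0.0
for h in [0.1, 0.01, 0.001]:
    left_slope = (abs(c) - abs(c - h)) / h    # slope coming from the left  -> -1
    right_slope = (abs(c + h) - abs(c)) / h   # slope coming from the right -> +1
    print(f"h={h}: left slope = {left_slope:+.1f}, right slope = {right_slope:+.1f}")

# |x| is continuous at 0, but the one-sided slopes disagree (-1 vs +1),
# so there is no well-defined gradient at the corner.
```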
Why It Works This Way
Let’s connect this to learning dynamics. When we update model weights using gradient descent, we compute:
$$ w_{t+1} = w_t - \eta \frac{\partial L}{\partial w} $$
If $L(w)$ (the loss function) isn't smooth, the gradient might not exist, or it might suddenly change direction. That's like your GPS telling you to "turn left sharply" without warning.
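As a concrete illustration, here is a minimal gradient-descent sketch applying exactly this update to a smooth toy loss $L(w) = (w - 3)^2$ (the loss, learning rate, and starting point are made-up choices, not taken from any real model):

```python
def loss(w):
    return (w - 3.0) ** 2        # smooth and differentiable everywhere

def grad(w):
    return 2.0 * (w - 3.0)       # dL/dw, available in closed form here

w, eta = 0.0, 0.1                # initial weight and learning rate
for _ in range(25):
    w = w - eta * grad(w)        # w_{t+1} = w_t - eta * dL/dw

print(f"w ~ {w:.3f}, loss ~ {loss(w):.6f}")
# Because the loss is smooth, the gradient always points downhill and w heads to 3.
```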
Such non-smoothness can make training oscillate or stall, especially with piecewise functions like ReLU:
$$ \text{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases} $$
At $x = 0$, ReLU isn't differentiable because the slope jumps from 0 (on the left) to 1 (on the right). We handle this with subgradients: at the corner, any slope between 0 and 1 is a valid "stand-in", and in practice frameworks simply pick one fixed value (commonly 0).
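A minimal sketch of that idea (the helper names are invented; the choice of slope 0 at the kink mirrors the convention used by common deep-learning frameworks):

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_subgradient(x):
    # Slope 1 for x > 0 and 0 for x < 0.
    # At x = 0 any value in [0, 1] is a valid subgradient;
    # we follow the common convention of returning 0 there.
    return 1.0 if x > 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  relu(x)={relu(x):.1f}  subgradient={relu_subgradient(x):.1f}")
```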
How It Fits in ML Thinking
- Continuity ensures the loss surface isn’t fragmented — optimization moves smoothly.
- Differentiability ensures gradients exist — so the optimizer knows which direction to move.
- Limits provide a bridge to reasoning about functions that behave oddly at boundaries (like activations or normalization layers).
That’s why models like neural networks carefully choose activation functions — balancing smoothness (good for optimization) with expressiveness (good for learning nonlinear patterns).
📐 Step 3: Mathematical Foundation
Limit Definition
$$ \lim_{x \to c} f(x) = L $$
means that as $x$ approaches $c$ from both sides, $f(x)$ gets arbitrarily close to $L$.
If the left-hand and right-hand limits differ → the limit does not exist → function is discontinuous there.
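The sketch below shows this with a simple step function (chosen just for illustration): the one-sided probes settle on different values, so the two-sided limit does not exist at 0.

```python
def step(x):
    # 0 for negative inputs, 1 otherwise: a jump at x = 0.
    return 0.0 if x < 0 else 1.0

for h in [0.1, 0.01, 0.001]:
    print(f"h={h}: step(0 - h) = {step(-h)}, step(0 + h) = {step(h)}")

# Left-hand limit is 0, right-hand limit is 1. They differ, so
# lim_{x -> 0} step(x) does not exist and step is discontinuous at 0.
```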
Continuity and Differentiability Relationship
- Differentiability ⇒ Continuity
- But Continuity ⇏ Differentiability
Example: ReLU is continuous at 0 (no gap), but not differentiable (sharp corner).
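Numerically, the same contrast shows up in a few lines (a sketch with an arbitrary step size): ReLU's one-sided values at 0 agree, which is continuity, while its one-sided slopes do not, which is the missing derivative.

```python
def relu(x):
    return max(0.0, x)

h = 1e-6
# Continuity: values from both sides approach relu(0) = 0.
print("values near 0:", relu(-h), relu(0.0), relu(h))

# Differentiability: one-sided difference quotients disagree.
left_slope = (relu(0.0) - relu(-h)) / h    # -> 0
right_slope = (relu(h) - relu(0.0)) / h    # -> 1
print("left slope:", left_slope, "right slope:", right_slope)
```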
🧠 Step 4: Key Ideas
- Limit: What a function approaches near a point.
- Continuity: No jumps — smooth values.
- Differentiability: No corners — smooth slopes.
- Differentiability ⇒ Continuity, but not vice versa.
- Smoothness ensures stable gradient updates in optimization.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Creates a foundation for understanding gradient-based optimization.
- Ensures model training behaves predictably.
- Forms the backbone for defining loss functions and activation behaviors.

Limitations:
- Real-world data and models often involve non-smooth behavior (e.g., ReLU, discontinuous distributions).
- Overemphasis on smoothness may simplify models too much, hurting expressiveness.
🚧 Step 6: Common Misunderstandings
- Myth: If a function is continuous, it’s automatically smooth. → Truth: Not necessarily — ReLU is continuous but not differentiable at 0.
- Myth: Discontinuities make a function unusable in ML. → Truth: Some discontinuities can be handled by approximations (e.g., subgradients).
- Myth: We always need perfect differentiability. → Truth: Optimization algorithms often tolerate small “kinks” surprisingly well.
🧩 Step 7: Mini Summary
🧠 What You Learned: Limits define how functions behave near points, continuity ensures no jumps, and differentiability ensures smooth slopes — together they make training surfaces navigable.
⚙️ How It Works: Models use smooth, differentiable loss functions to ensure gradients exist and optimization proceeds predictably.
🎯 Why It Matters: Understanding smoothness explains why some activations (like ReLU) work despite sharp edges — and why optimization sometimes fails due to “rough terrain.”