1.5. Scaling Laws & Model Capacity
🪄 Step 1: Intuition & Motivation
- Core Idea: In deep learning, a bigger model usually performs better — but only up to a point. After that, simply adding more parameters doesn’t help (and sometimes hurts). Scaling laws describe how model performance changes as we increase its size, the data it’s trained on, and the compute used.
They give us a map — a way to predict how much improvement we’ll get if we double the model, the dataset, or the GPUs.
- Simple Analogy: Imagine teaching 3 students:
- One has a tiny notebook (small model).
- One has a large notebook (big model).
- One has a huge notebook, but you only give them 2 pages of notes (undertrained giant).
No matter how big their notebook, if they don’t have enough notes (data) or time to study (compute), they won’t learn better. Scaling laws help you find the sweet spot — the right balance between model size, data, and compute.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Researchers discovered that the loss (L) of large language models follows a predictable power-law relationship with:
- N: Number of parameters
- D: Amount of training data
- C: Compute budget (total FLOPs used)
This relationship can be written as:
$$ L = A N^{\alpha} D^{\beta} C^{\gamma} $$

Here, $A$ is a constant, and $\alpha$, $\beta$, and $\gamma$ are negative exponents: as you increase $N$, $D$, or $C$, the loss decreases, but with diminishing returns.
So instead of guessing how big a model should be, scaling laws let researchers mathematically estimate the ideal configuration for a given budget.
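To make the diminishing returns concrete, here is a minimal numeric sketch of this formula. The constants $A$, $\alpha$, $\beta$, and $\gamma$ are illustrative placeholders rather than values from any published fit; the point is only that each doubling of $N$ shrinks the loss by the same factor, so the absolute gain gets smaller every time.

```python
# Minimal sketch of the combined power-law form L = A * N**alpha * D**beta * C**gamma.
# All constants are illustrative placeholders, not fitted values from any paper.
A, alpha, beta, gamma = 1000.0, -0.07, -0.09, -0.05

def predicted_loss(N, D, C):
    """Predicted cross-entropy loss for N parameters, D tokens, C FLOPs."""
    return A * N**alpha * D**beta * C**gamma

D, C = 3e11, 1e21                      # hold data and compute fixed
for N in (1e8, 2e8, 4e8, 8e8):         # keep doubling the parameter count
    print(f"N = {N:.0e}  ->  loss = {predicted_loss(N, D, C):.3f}")
# Each doubling multiplies the loss by 2**alpha ≈ 0.95, so every successive
# doubling of N buys a smaller absolute reduction in loss.
```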
Why It Works This Way
Each component affects learning differently:
- Parameters ($N$): Decide how much knowledge the model can store.
- Data ($D$): Provides the experience it learns from.
- Compute ($C$): Determines how deeply it can process that experience.
If any one of these is too small, it bottlenecks the rest — like building a supercomputer but giving it 5 minutes to train.
How It Fits in ML Thinking
Scaling laws turn model sizing from trial and error into a budgeting decision: given a fixed compute budget, they tell you, before training starts, how to split that budget between parameters and data.
📐 Step 3: Mathematical Foundation
Kaplan’s Scaling Law (2020)
Kaplan et al. found that language model loss scales predictably with model size, dataset size, and compute:
$$ L = A N^{\alpha} D^{\beta} C^{\gamma} $$

(shown here as a single combined form for simplicity; Kaplan et al. fit separate power laws in $N$, $D$, and $C$, each measured while the other factors are not the bottleneck), where:
- $L$ = cross-entropy loss
- $N$ = number of parameters
- $D$ = dataset size (in tokens)
- $C$ = compute (in FLOPs)
- $\alpha, \beta, \gamma < 0$ (showing loss decreases as scale increases)
Interpretation:
- Doubling the parameters ($N$) multiplies the loss by a fixed factor ($2^{\alpha}$), so each doubling yields a smaller absolute improvement than the last.
- The same diminishing returns apply to data and compute; the fitting sketch below shows how such an exponent can be estimated from a handful of training runs.
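Because a power law becomes a straight line in log-log space ($\log L = \log A + \alpha \log N$, with $D$ and $C$ held non-limiting), the exponent can be estimated by fitting a line over a few model sizes. The sketch below is a minimal illustration of that procedure; the constants and the synthetic data points are assumed for demonstration, not taken from any published fit.

```python
import numpy as np

# Synthetic (model size, loss) points generated from an assumed power law,
# used only to show how a scaling exponent would be recovered by a log-log fit.
true_A, true_alpha = 12.0, -0.076            # illustrative constants, not published values
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9])      # model sizes (parameters)
noise = 1 + 0.01 * np.random.default_rng(0).standard_normal(N.shape)
L = true_A * N**true_alpha * noise           # "measured" losses at each size

# Power law => straight line in log-log space: log L = log A + alpha * log N
alpha_hat, logA_hat = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"fitted alpha ≈ {alpha_hat:.3f}, fitted A ≈ {np.exp(logA_hat):.1f}")

# Extrapolate the fit to a model 10x larger than anything "trained" above.
N_big = 1e10
print(f"predicted loss at N = {N_big:.0e}: {np.exp(logA_hat) * N_big**alpha_hat:.3f}")
```

This mirrors how scaling-law fits are used in practice: measure a few affordable runs, then extrapolate to predict the loss of a run you cannot yet afford.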
Chinchilla Scaling (2022)
Models sized according to Kaplan’s recommendations turned out to be over-parameterized and under-trained: they had far more weights than their training data could support. In 2022, the Chinchilla paper (Hoffmann et al.) corrected this imbalance.
They found that for optimal performance at fixed compute:
- Data and parameters should scale in proportion: double the model, double the data (empirically, roughly 20 training tokens per parameter at the compute-optimal point).
Result: Smaller models trained longer on more data can outperform much larger models trained on limited data.
For example, Chinchilla (70B parameters, trained on roughly 1.4T tokens) outperformed Gopher (280B parameters, roughly 300B tokens) under a comparable compute budget, simply because it saw far more data.
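A back-of-the-envelope sketch of this compute-optimal allocation is shown below. It leans on two rules of thumb drawn from the Chinchilla analysis, both approximations rather than exact constants: training compute of roughly $C \approx 6ND$ FLOPs, and roughly 20 training tokens per parameter at the optimum.

```python
import math

def chinchilla_allocation(flops_budget, tokens_per_param=20.0):
    """
    Split a training-FLOP budget into a compute-optimal model size and token count,
    under two rule-of-thumb assumptions (approximate, not exact constants):
      * training compute       C ≈ 6 * N * D  FLOPs
      * compute-optimal ratio  D ≈ 20 * N     tokens
    Substituting D = 20 * N into C = 6 * N * D gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a budget of ~5.8e23 FLOPs (roughly the training compute reported for Chinchilla).
n, d = chinchilla_allocation(5.8e23)
print(f"model ≈ {n / 1e9:.0f}B parameters, data ≈ {d / 1e12:.1f}T tokens")
# -> about 70B parameters and ~1.4T tokens, matching Chinchilla's actual configuration.
```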
🧠 Step 4: Assumptions or Key Ideas
- The training regime is compute-limited (you have finite GPUs or FLOPs).
- The model has enough capacity to learn but not so much that it overfits.
- Training data is of sufficient quality and diversity.
- Scaling efficiency depends on the architecture, optimizer, and tokenizer.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Predicts performance gains before spending compute.
- Guides model–data–compute balance for large-scale training.
- Reduces wasted resources on suboptimal configurations.
⚠️ Limitations
- Empirical — holds well within observed scales but may break at extreme limits.
- Doesn’t account for architecture changes (like mixture-of-experts or retrieval-augmented systems).
- Assumes homogeneous data and training conditions.
⚖️ Trade-offs
- More parameters → higher inference latency and memory use.
- More data → longer training but better generalization.
- More compute → diminishing returns; doubling FLOPs rarely doubles performance (see the quick calculation below).
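As a quick illustration of that last point: with an assumed compute exponent of $\gamma = -0.05$ (an illustrative value, not a published constant), doubling $C$ multiplies the loss by $2^{-0.05} \approx 0.97$, i.e., roughly a 3% reduction in loss for twice the cost.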
🚧 Step 6: Common Misunderstandings
- “Bigger models are always better.” ❌ Scaling laws show diminishing returns.
- “Scaling is only about parameters.” ❌ It’s a triad: parameters, data, compute.
- “We can reuse Kaplan’s constants forever.” ❌ Constants differ by architecture, dataset, and tokenizer.
🧩 Step 7: Mini Summary
🧠 What You Learned: Scaling laws describe how model performance improves (and saturates) as we increase size, data, and compute.
⚙️ How It Works: The loss follows predictable power-law relationships with diminishing returns, optimized through balanced scaling (Chinchilla).
🎯 Why It Matters: Understanding scaling laws helps design efficient, cost-effective, and balanced large language models — crucial for training at massive scales.