3.1. Distributed Training — Dividing the Giant


🪄 Step 1: Intuition & Motivation

  • Core Idea: Large Language Models (LLMs) are too big for a single GPU’s memory or compute power. Training them requires breaking the workload across multiple GPUs, sometimes across entire clusters.

That’s where distributed training comes in — the art of dividing a single model’s work efficiently among many devices without losing synchronization or accuracy.

  • Simple Analogy: Imagine trying to move a mountain by yourself — impossible. Now, gather 1000 people. Each person carries one shovel-full at a time, but they must coordinate, share progress, and avoid duplicate effort. That’s distributed training in spirit — teamwork at scale.

🌱 Step 2: Core Concept

There are three main strategies for dividing the training workload — Data, Model, and Pipeline Parallelism — and in practice, they’re often combined.


Data Parallelism (DP) — Same Model, Different Data

Each GPU holds a complete copy of the model, but trains on a different mini-batch of data.

After computing local gradients, all GPUs synchronize by averaging their updates (via an AllReduce operation), ensuring the global model parameters stay consistent.

Example Workflow:

  1. Split training data across GPUs.
  2. Each GPU computes loss and gradients locally.
  3. Gradients are averaged (synchronized).
  4. Every GPU updates its model copy with the same global gradient.

Tools: PyTorch DDP (Distributed Data Parallel), Horovod, DeepSpeed.

Data Parallelism is simple, stable, and scales well for medium-sized models. It’s like having multiple students each study different chapters but share notes after each session.
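Here is a minimal sketch of that loop using PyTorch DistributedDataParallel (a toy linear model and random tensors stand in for a real LLM and dataset; it assumes a `torchrun` launch with one GPU per process):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()    # toy stand-in for a real model
    model = DDP(model, device_ids=[local_rank])   # AllReduce happens inside backward()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees its own mini-batch
        loss = model(x).pow(2).mean()
        loss.backward()                            # gradients averaged across all ranks here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=4 train_ddp.py`, this runs four replicas; in a real job a `DistributedSampler` would handle the data split from step 1 instead of random tensors.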

Model Parallelism (MP) — Split the Model Itself

When the model itself is too big to fit on one GPU, we split the model’s layers or matrices across devices.

For instance:

  • GPU 1 handles the first few Transformer layers.
  • GPU 2 handles the next few layers.

In Tensor Parallelism, even the weight matrices inside a single layer (such as the attention and MLP projections) are split across GPUs for finer-grained parallelism.

Example Frameworks: Megatron-LM, DeepSpeed (tensor parallelism support).
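A toy sketch of the layer-wise split on two GPUs (assumes two visible CUDA devices; real frameworks like Megatron-LM also shard the weight matrices inside each layer, i.e. tensor parallelism):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """First half of the layers lives on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.GELU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))  # activation transfer: this is the communication cost
        return x

model = TwoGPUModel()
out = model(torch.randn(8, 1024))
print(out.device)  # cuda:1
```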

GPT-3 (175B parameters) is far too large to fit on any single GPU. Its training relied on model parallelism, splitting the work both across layers and within individual matrix multiplies, to spread the model over many GPUs.

Challenges:

  • High communication overhead between GPUs (since outputs from one GPU become inputs for another).
  • Harder to debug and balance.

Pipeline Parallelism (PP) — Assembly Line Training

Pipeline Parallelism divides the model into stages, with each GPU responsible for one stage.

Instead of processing one batch at a time, batches are split into micro-batches, so all GPUs can stay busy simultaneously — like an assembly line where multiple cars are being built at once.

Example:

  1. Stage 1 starts processing micro-batch 1.
  2. As soon as it finishes, Stage 2 starts micro-batch 1 while Stage 1 moves to micro-batch 2.
  3. The “pipeline” fills up, keeping all GPUs active.

This technique keeps more GPUs busy at once, but it still introduces pipeline bubbles: idle slots while the pipeline fills at the start of a batch and drains at the end.

Tools: DeepSpeed Pipeline Engine, GPipe, FairScale.

Micro-batching is like multitasking — keeping all workers occupied instead of waiting for one to finish.
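A tiny schedule simulator (forward passes only, made-up sizes) makes the fill-and-drain pattern visible:

```python
def pipeline_schedule(n_stages: int, n_microbatches: int) -> None:
    """Print which micro-batch each stage works on at every time step.
    Stage s processes micro-batch b at step s + b; everything else is a bubble."""
    for t in range(n_stages + n_microbatches - 1):
        row = []
        for s in range(n_stages):
            b = t - s
            row.append(f"mb{b}" if 0 <= b < n_microbatches else "----")
        print(f"t={t}: " + "  ".join(row))

pipeline_schedule(n_stages=3, n_microbatches=4)
# The "----" slots at the start and end are the pipeline bubbles.
```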

Hybrid Parallelism — The Ultimate Combo

In massive-scale training (think GPT-4, PaLM), no single strategy is enough. So engineers combine them:

| Strategy | What It Handles | Benefit |
| --- | --- | --- |
| Data Parallelism | Many samples | Efficient throughput |
| Model Parallelism | Huge parameter size | Fit the model |
| Pipeline Parallelism | Sequential layers | Full GPU utilization |

This hybrid approach ensures all hardware resources are maximized — minimizing idle time, balancing memory, and improving speed.

Frameworks like DeepSpeed and Megatron-DeepSpeed automatically manage this complex orchestration.
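The three degrees multiply: a hybrid job uses data-parallel × pipeline-parallel × tensor-parallel GPUs in total, as in this illustrative (made-up) configuration:

```python
# Illustrative composition of the three degrees; the numbers are not from any real run.
tensor_parallel = 8     # each layer sharded across 8 GPUs, typically within one node
pipeline_parallel = 12  # the layer stack split into 12 stages
data_parallel = 16      # the whole TP x PP group replicated 16 times over the data

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(total_gpus)       # 1536 GPUs cooperating on a single model
```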


📐 Step 3: Mathematical Foundation

Gradient Synchronization (AllReduce)

In Data Parallelism, all GPUs must agree on the same model weights after each backward pass.

If GPU_i computes gradients $g_i$, we average them using an AllReduce operation:

$$ g = \frac{1}{N} \sum_{i=1}^{N} g_i $$

Then each GPU updates its local weights using:

$$ w_{t+1} = w_t - \eta g $$

where $\eta$ is the learning rate.

Insight: Synchronization keeps replicas in sync — crucial for reproducibility and stability.

It’s like multiple chefs cooking from the same recipe — after each batch, they compare notes and adjust their techniques together.
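Frameworks like DDP do this automatically, but the operation is simple to write by hand (a sketch that assumes a process group is already initialized, e.g. via `torchrun`):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Replicate DDP's gradient sync by hand: g = (1/N) * sum_i g_i."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over all ranks
            p.grad.div_(world_size)                        # divide by N to get the average
```

Calling `average_gradients(model)` after `loss.backward()` and before `optimizer.step()` reproduces the update above on every rank.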

Pipeline Scheduling (Micro-Batch Overlap)

If each GPU stage takes $t$ seconds per micro-batch, then pipeline efficiency improves roughly as:

$$ \text{Utilization} \approx \frac{m}{m + (n - 1)} $$

where $m$ = number of micro-batches, $n$ = number of pipeline stages.

Increasing $m$ (micro-batches) fills the pipeline, reducing idle “bubbles.”
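Plugging numbers into the formula for a hypothetical 4-stage pipeline shows why many small micro-batches help:

```python
def pipeline_utilization(m: int, n: int) -> float:
    """Approximate busy fraction of each stage: m / (m + n - 1)."""
    return m / (m + n - 1)

for m in (1, 4, 16, 64):
    print(m, round(pipeline_utilization(m, n=4), 3))
# m=1 -> 0.25, m=4 -> 0.571, m=16 -> 0.842, m=64 -> 0.955
```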


🧠 Step 4: Advanced Optimization — ZeRO (Zero Redundancy Optimizer)

What Is ZeRO?

DeepSpeed’s ZeRO (Zero Redundancy Optimizer) is a key innovation that eliminates memory duplication across GPUs.

Normally, every GPU stores the same optimizer states, gradients, and parameters — wasting memory.

ZeRO splits them into shards:

  • Stage 1: Shard the optimizer states.
  • Stage 2: Also shard the gradients.
  • Stage 3: Also shard the model parameters themselves.

The stages are cumulative: each one partitions everything the previous stage did, plus one more kind of state.

This allows models 10× larger to fit in memory, without increasing communication costs drastically.

Think of ZeRO as having roommates who each store different sections of your bookshelf — together, you save space without losing access to any book.
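In DeepSpeed, ZeRO is switched on through the training config. A hedged sketch (the keys follow DeepSpeed's documented config schema; the values are illustrative, not tuned, and the script would be run under the `deepspeed` or `torchrun` launcher):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real Transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # 1: optimizer states, 2: + gradients, 3: + parameters
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# The returned engine wraps forward/backward/step and manages the sharded states internally.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```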

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Enables trillion-parameter model training.
  • Scales to clusters of thousands of GPUs; combined with checkpointing, training can tolerate hardware failures.
  • Maximizes GPU utilization via hybrid methods.

⚠️ Limitations

  • High communication overhead (especially in Model Parallelism).
  • Pipeline stages must be balanced carefully to avoid idle GPUs.
  • Debugging and synchronization are non-trivial.

⚖️ Trade-offs

  • Data Parallelism = simple but memory-heavy.
  • Model Parallelism = memory-efficient but communication-heavy.
  • Pipeline Parallelism = efficient but complex scheduling.
  • Hybrid = best overall efficiency, hardest to implement.

🚧 Step 6: Common Misunderstandings

  • “More GPUs always mean faster training.” ❌ Communication costs grow — speedups often plateau.
  • “All GPUs do the same work.” ❌ In Model/Pipeline Parallelism, GPUs handle different pieces of the model.
  • “ZeRO is just gradient compression.” ❌ It’s a full memory partitioning strategy — much deeper than compression.

🧩 Step 7: Mini Summary

🧠 What You Learned: Distributed training splits the LLM workload across GPUs using data, model, and pipeline parallelism — or all combined in hybrid systems.

⚙️ How It Works: Synchronization, micro-batching, and memory sharding keep thousands of GPUs working together smoothly.

🎯 Why It Matters: This is the backbone of modern LLM training — enabling massive models like GPT-4 to exist at all.
