1.2. Latency vs. Throughput
🪄 Step 1: Intuition & Motivation
Core Idea (in 1 short paragraph): In machine learning systems, speed and capacity are like two sides of a seesaw. You can either respond quickly to one user (low latency) or handle many users at once (high throughput) — but doing both perfectly is almost impossible. Understanding how to balance these two is the heart of designing efficient ML systems.
Simple Analogy: Picture a coffee shop.
- Latency is how long one customer waits for their latte.
- Throughput is how many cups the café can serve in an hour.

Hire more baristas (parallelism), and you can serve more people (throughput). But if one espresso machine slows down, everyone waits (latency). That’s the trade-off ML engineers face too.
🌱 Step 2: Core Concept
Latency and throughput define how fast and how much your system can process. Let’s unpack them one layer at a time.
What’s Happening Under the Hood?
When an ML model serves predictions (say, a recommendation API), two key forces interact:
Latency measures response time — how long it takes from the moment a request arrives until a prediction is returned. Example: A user clicks “Show Recommendations,” and the model responds in 150 milliseconds.
Throughput measures volume over time — how many predictions the system can make per second or minute. Example: Your system handles 10,000 predictions per second (10K QPS).
However, improving one often hurts the other:
- Serving requests one at a time gives low latency but leaves the hardware underutilized.
- Processing requests in batches increases throughput but adds waiting time for batch formation — increasing latency.
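To see the trade-off in numbers, here is a toy Python simulation. The 5 ms per-call overhead and 1 ms per-item compute cost are made-up stand-ins for real kernel-launch and inference costs, and requests are assumed to arrive independently so the one-by-one case serves each caller immediately:

```python
import time

def fake_model(batch):
    # Stand-in for a real model: a fixed 5 ms overhead per call (kernel launch,
    # serialization) plus 1 ms of compute per item in the batch.
    time.sleep(0.005 + 0.001 * len(batch))
    return [x * 2 for x in batch]

requests = list(range(64))

# One request per call: each caller waits only ~6 ms, but the hardware spends
# most of its time on per-call overhead.
start = time.perf_counter()
for r in requests:
    fake_model([r])
seq_total = time.perf_counter() - start

# One batched call: far better throughput, but every request in the batch
# waits for the whole batch to finish (plus however long the batch took to fill).
start = time.perf_counter()
fake_model(requests)
batch_total = time.perf_counter() - start

print(f"one-by-one: {1000 * seq_total / len(requests):.1f} ms per request, "
      f"{len(requests) / seq_total:.0f} req/s")
print(f"batched:    {1000 * batch_total:.1f} ms per request, "
      f"{len(requests) / batch_total:.0f} req/s")
```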
Why It Works This Way
Both metrics compete for system resources like CPU/GPU time, memory, and network bandwidth. If we imagine your system as a busy restaurant kitchen:
- The chef (GPU) can make one dish very fast (low latency) if there’s only one order.
- But to feed 100 customers, the chef must prepare dishes in batches (higher throughput), which means each customer waits longer (higher latency).
This is why ML inference systems often use batching, queuing, and asynchronous processing to balance the trade-off dynamically.
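To make that concrete, here is a minimal dynamic-batching sketch using Python's asyncio. It assumes an in-memory queue and a synchronous model_fn that accepts a list of inputs; the batch_worker and predict names are illustrative, not any particular serving framework's API:

```python
import asyncio
import time

async def batch_worker(queue: asyncio.Queue, model_fn,
                       max_batch: int = 32, max_wait_s: float = 0.010):
    """Collect queued requests into batches: run as soon as max_batch requests
    are waiting, or once max_wait_s has passed since the first one arrived."""
    while True:
        first_item = await queue.get()                # block until work exists
        batch = [first_item]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.001)            # yield briefly, then re-check
        inputs = [x for x, _ in batch]
        outputs = model_fn(inputs)                    # one batched model call
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                       # unblock each caller

async def predict(queue: asyncio.Queue, x):
    """Per-request entry point: enqueue the input and await its own result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut
```

The max_wait_s knob caps how long any single request can sit waiting for the batch to fill, which puts an explicit bound on the latency cost of batching.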
How It Fits in ML Thinking
In ML system design:
- Latency matters for user-facing applications — voice assistants, search autocomplete, or fraud checks.
- Throughput matters for backend analytics — model retraining, embedding generation, or daily predictions.
Top engineers learn to tune both, not just pick one. That means keeping latency low on user-facing paths while sustaining high throughput under heavy load, a balance achieved through hardware choices, caching, and parallelism.
📐 Step 3: Mathematical Foundation
Latency–Throughput Relationship
We can model the relationship simply. For a system that serves requests one at a time:

$$\text{Throughput} = \frac{N}{T_\text{total}}$$

$$\text{Latency} = \frac{T_\text{total}}{N}$$

Where:
- $N$ = number of requests processed
- $T_\text{total}$ = total time to complete all requests
As $N$ grows (more requests per batch), per-call overhead is amortized and throughput improves. But each individual request's latency rises, because it must wait for the batch to fill and then for the entire batch to finish.
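A quick worked example with made-up numbers: suppose a batch of $N = 64$ requests finishes in $T_\text{total} = 0.2$ s.

$$\text{Throughput} = \frac{64}{0.2\ \text{s}} = 320\ \text{requests/s}, \qquad \frac{T_\text{total}}{N} = \frac{0.2\ \text{s}}{64} \approx 3.1\ \text{ms}$$

The 3.1 ms is time per request from the hardware's point of view; each individual request in the batch still experiences roughly the full 0.2 s of latency, plus whatever time it spent waiting for the batch to fill.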
P99 Latency (Tail Latency)
P99 latency is the response time that 99% of requests beat. This means 99% of your users get responses faster than that number. The slowest 1% (“the long tail”) often hides bottlenecks like slow queries, GC pauses, or overloaded nodes.
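A quick way to inspect your tail, assuming you have collected per-request latencies from a load test (the lognormal samples below are made-up data standing in for real measurements):

```python
import numpy as np

# Made-up per-request latencies in milliseconds, standing in for real load-test data.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
# The gap between p50 and p99 is where the long tail hides: a few slow queries,
# GC pauses, or overloaded nodes dominate the worst 1% of requests.
```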
🧠 Step 4: Assumptions or Key Ideas
- System performance depends on workload type (batch vs. online).
- Hardware limits (CPU, GPU, memory) impose natural ceilings on throughput.
- Batching improves throughput only up to a point — after that, latency penalties outweigh the gains.
- Caching (using Redis, FAISS) reduces latency by avoiding redundant computations.
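As a minimal sketch of that caching idea, here is an in-process memoized predictor; in production the cache is usually an external store such as Redis, and slow_model_predict below is a made-up placeholder for a real model call:

```python
from functools import lru_cache

def slow_model_predict(features: tuple[float, ...]) -> float:
    # Stand-in for an expensive call: feature lookup, network hop, GPU inference.
    return sum(features) * 0.42

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple[float, ...]) -> float:
    # First call for a given feature vector pays the full price; repeats are
    # answered from memory, skipping the redundant computation entirely.
    return slow_model_predict(features)

# Usage: identical feature vectors hit the cache.
print(cached_predict((0.1, 0.2, 0.3)))   # cache miss: computes
print(cached_predict((0.1, 0.2, 0.3)))   # cache hit: returned immediately
print(cached_predict.cache_info())       # hits=1, misses=1
```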
⚖️ Step 5: Strengths, Limitations & Trade-offs
Optimizing for Latency:
- Great for interactive systems.
- Boosts user satisfaction and retention.
- Feels “instantaneous” when well-tuned.
Optimizing for Throughput:
- Ideal for large-scale data processing.
- Efficiently uses hardware (batching and parallelization).
- Lowers operational cost per prediction.
Latency-First Systems:
- Hardware-expensive and energy-intensive.
- Difficult to scale under bursty loads.
Throughput-First Systems:
- Feels slow for individual requests.
- Higher delay during batching or queuing.
The art lies in balance:
- Small batches → low latency, moderate throughput.
- Large batches → high throughput, higher latency.

In practice, systems use adaptive batching, adjusting batch size based on traffic load (like a café hiring more baristas only during rush hours).
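A minimal sketch of such a policy, assuming a single-threaded server with an in-memory queue; choose_batch_size, serve_forever, and the thresholds are illustrative assumptions rather than a specific framework's API:

```python
import time
from collections import deque

def choose_batch_size(queue_depth: int, min_batch: int = 1, max_batch: int = 64) -> int:
    # Light traffic: keep batches tiny so each request is answered almost immediately.
    # Heavy traffic: grow toward max_batch so the accelerator stays fully utilized.
    return max(min_batch, min(max_batch, queue_depth))

def serve_forever(queue: deque, model_fn, max_wait_s: float = 0.005):
    while True:
        if not queue:
            time.sleep(0.001)            # idle: cheap poll for new work
            continue
        time.sleep(max_wait_s)           # brief window for more arrivals
        batch_size = choose_batch_size(len(queue))
        batch = [queue.popleft() for _ in range(batch_size)]
        model_fn(batch)                  # one batched inference call
```

Under light load the queue rarely holds more than a request or two, so batches stay small and latency stays low; under bursts the queue deepens and batches grow automatically.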
🚧 Step 6: Common Misunderstandings
- “Throughput and latency are independent.” → False; they compete for the same hardware and scheduling resources, so improving one usually costs the other.
- “Reducing latency always helps.” → Not if you waste hardware or increase cost disproportionately.
- “High throughput means the system is fast.” → Not necessarily; users might still wait longer per request.
🧩 Step 7: Mini Summary
🧠 What You Learned: Latency and throughput are two performance lenses for understanding ML serving systems.
⚙️ How It Works: Reducing response time (latency) often reduces overall capacity (throughput), and vice versa.
🎯 Why It Matters: Knowing how to trade between these two defines your ability to design scalable, responsive ML systems.