7.1. Scaling Training and Serving
🪄 Step 1: Intuition & Motivation
Core Idea: Training a single model on one machine may work for prototypes, but production ML is a massive orchestration of compute, data, and serving infrastructure. Scaling ensures that your models train faster, serve predictions instantly, and handle millions of requests — without exploding costs or latency.
Simple Analogy: Think of ML systems like a restaurant:
- Training = cooking the meal.
- Serving = delivering the food.
- Scaling = adding more chefs, better ovens, and faster delivery routes — without ruining the taste (accuracy).
🌱 Step 2: Core Concept
To scale ML systems, you need to optimize both sides of the equation:
- Training: Make learning faster using distributed compute.
- Serving: Make predictions faster using efficient architecture and autoscaling.
1️⃣ Distributed Training — Teaching in Parallel
When datasets and models become too large for a single GPU or machine, you split the workload across multiple nodes. There are two main strategies:
🧮 Data Parallelism
Each worker gets a different chunk of the data but the same model copy. After each batch, workers synchronize gradients to keep model weights consistent.
Example: With 4 GPUs and 1 million samples, each GPU trains on its own 250k-sample shard, processing one mini-batch per step. Gradients are averaged across GPUs, so the model updates once per global batch.
Tools:
- PyTorch Distributed Data Parallel (DDP)
- TensorFlow tf.distribute.MirroredStrategy
- Horovod (Uber’s distributed training library)
When to Use: When your model fits into one GPU’s memory but data is huge.
💡 Intuition: Imagine multiple chefs each cooking part of the meal with the same recipe — then combining results into one final dish.
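A minimal sketch of data parallelism with PyTorch DDP. The model, dataset, and hyperparameters below are placeholders, and it assumes a launch via torchrun with one process per GPU:

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=4 train_ddp.py
# The model and the random dataset are placeholders standing in for real workloads.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])       # handles gradient all-reduce

    dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)             # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()           # gradients averaged across ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```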
🧩 Model Parallelism
When the model itself is too big to fit on one GPU, you split the model’s layers across devices.
Example:
- GPU 1: Layers 1–4
- GPU 2: Layers 5–8
- GPU 3: Layers 9–12
Each GPU passes forward activations to the next, like an assembly line.
Tools:
- DeepSpeed, Megatron-LM (used for large LLMs)
- Tensor Parallelism and Pipeline Parallelism
When to Use: For extremely large models (e.g., GPT, BERT, Vision Transformers).
💡 Intuition: Think of model parallelism like a car factory — each station (GPU) assembles a different part before passing it along.
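A toy sketch of manual model parallelism in PyTorch, splitting two stages across two GPUs. The layer sizes and device IDs are assumptions; large-scale training would use DeepSpeed/Megatron-style tensor and pipeline parallelism rather than hand-placed layers:

```python
# Toy manual model parallelism: stages live on different GPUs and
# activations flow between them like an assembly line.
# Assumes at least two visible GPUs; layer sizes are placeholders.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))   # forward pass on GPU 0
        x = self.stage2(x.to("cuda:1"))   # activations handed off to GPU 1
        return x

model = TwoStageModel()
logits = model(torch.randn(32, 1024))
print(logits.device)                      # output lives on cuda:1
```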
2️⃣ Serving Architectures — The Speed of Delivery
Once your model is trained, the next challenge is how to deliver predictions efficiently.
🍱 Batch Inference (Offline Serving)
- Used for large-scale predictions on static datasets (e.g., scoring millions of users nightly).
- Typically triggered by schedulers like Airflow or Spark jobs.
- High throughput, no real-time constraints.
Example: Recomputing all customer credit risk scores every 24 hours.
💡 Analogy: Like delivering meals in bulk once a day — efficient but not instant.
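A minimal sketch of what a scheduler-triggered batch-scoring job might look like. The file paths, feature columns, and model artifact are all hypothetical:

```python
# Hypothetical nightly batch-scoring job; a scheduler such as Airflow would
# run this script once per day. Paths and feature columns are placeholders.
import joblib
import pandas as pd

def score_all_customers(input_path="customers.parquet",
                        output_path="risk_scores.parquet",
                        model_path="credit_risk_model.joblib"):
    model = joblib.load(model_path)                  # pre-trained model artifact
    df = pd.read_parquet(input_path)                 # static snapshot of all customers
    features = df[["income", "debt_ratio", "age"]]   # assumed feature columns
    df["risk_score"] = model.predict_proba(features)[:, 1]
    df[["customer_id", "risk_score"]].to_parquet(output_path)

if __name__ == "__main__":
    score_all_customers()
```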
⚡ Online Inference (Real-Time Serving)
- Used when predictions must be made in milliseconds (e.g., recommendation, fraud detection).
- Model is deployed behind an API endpoint (e.g., FastAPI, TensorFlow Serving, TorchServe).
- Prioritizes low latency, autoscaling, and caching.
Example: When a user clicks “Pay,” the fraud model predicts instantly if it’s suspicious.
💡 Analogy: Like a fast-food counter — every order is served instantly, one customer at a time.
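A minimal sketch of an online inference endpoint with FastAPI. The model file and feature schema are hypothetical:

```python
# Sketch of real-time serving with FastAPI.
# Run with: uvicorn serve:app --workers 4
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("fraud_model.joblib")   # loaded once at startup, kept in memory

class Transaction(BaseModel):
    amount: float
    merchant_id: int
    country_code: int

@app.post("/predict")
def predict(tx: Transaction):
    features = [[tx.amount, tx.merchant_id, tx.country_code]]
    score = float(model.predict_proba(features)[0][1])
    return {"fraud_probability": score, "flag": score > 0.9}
```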
🌊 Streaming Inference (Continuous Serving)
- Ideal for event-driven pipelines (e.g., IoT, logs, or real-time personalization).
- Models process data as it arrives — using tools like Kafka Streams, Flink, or Ray Serve.
- Requires asynchronous buffering, checkpointing, and backpressure management.
Example: Predicting energy consumption in real-time from thousands of IoT sensors.
💡 Analogy: Like a conveyor belt sushi restaurant — new items arrive continuously, and the system keeps up in real time.
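A simplified sketch of streaming inference using kafka-python. The topic names, broker address, and model are assumptions, and a production pipeline would add batching, checkpointing, and backpressure handling:

```python
# Sketch of streaming inference: consume sensor events as they arrive,
# score them, and publish predictions to another topic.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("energy_model.joblib")   # placeholder model artifact

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:                        # blocks, yielding events as they arrive
    reading = event.value                     # e.g., {"sensor_id": 7, "watts": 412.0}
    prediction = float(model.predict([[reading["watts"]]])[0])
    producer.send("energy-forecasts",
                  {"sensor_id": reading["sensor_id"], "predicted_kwh": prediction})
```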
3️⃣ Autoscaling — Matching Compute to Demand
Scaling isn’t just about adding more machines — it’s about adding them at the right time.
Kubernetes’ Horizontal Pod Autoscaler (HPA) automatically adjusts the number of serving pods based on demand.
🧠 How It Works:
- Define CPU or memory thresholds.
- When usage exceeds threshold, HPA spawns new pods.
- When demand falls, it scales down to save cost.
Example Config (YAML):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detector
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Metrics You Can Use for Scaling:
- CPU/GPU utilization
- Request latency
- Queue length
- Throughput (requests per second)
💡 Intuition: Autoscaling is like adding more delivery drivers during rush hour and letting them rest when the roads are quiet.
📐 Step 3: Mathematical Foundation
Let’s model the relationship between latency ($L$), service capacity ($\mu$), and load ($\lambda$).
Latency–Load Trade-Off
In queueing theory, average latency is approximated by:
$$ L = \frac{1}{\mu - \lambda} $$
Where:
- $L$ = latency
- $\lambda$ = incoming request rate
- $\mu$ = service rate (capacity per server)
As $\lambda$ approaches $\mu$, the denominator shrinks toward zero and latency blows up. Autoscaling increases $\mu$ dynamically, keeping $(\mu - \lambda)$ large and latency low.
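A quick numeric illustration of how sharply latency rises near saturation (the service and request rates below are made-up numbers):

```python
# Average latency L = 1 / (mu - lambda) for an M/M/1-style queue.
mu = 100.0                        # service rate: 100 requests/sec per server
for lam in (50, 80, 90, 95, 99):  # incoming request rates (requests/sec)
    latency = 1.0 / (mu - lam)    # seconds
    print(f"load={lam:>3} req/s -> avg latency ~ {latency * 1000:.0f} ms")
# load= 50 req/s -> avg latency ~   20 ms
# load= 99 req/s -> avg latency ~ 1000 ms
```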
🧠 Step 4: How to Reduce Inference Latency (Without Retraining)
Even if you don’t change the model, you can make it predict faster with infrastructure and optimization tricks.
🔧 Techniques:
- Model Quantization: Reduce precision (e.g., FP32 → INT8) to speed up computation with minimal accuracy loss (see the sketch after this list).
- Model Compilation: Use graph compilers like ONNX Runtime, TensorRT, or TorchScript to optimize computation graphs.
- Batching Requests: Combine multiple small inference requests into one larger batch (efficient for GPUs).
- Caching Frequent Results: Store predictions for recurring inputs (e.g., same user/session).
- Async I/O and Threading: Overlap CPU–GPU communication using asynchronous execution.
- Edge Serving: Move model inference closer to users (via CDNs or edge nodes).
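For example, post-training dynamic quantization in PyTorch takes only a couple of lines (a minimal sketch; the model here is a placeholder, and the gains are largest for Linear/LSTM-heavy models on CPU):

```python
# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize all Linear layers to INT8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, smaller and usually faster on CPU
```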
💡 Intuition: You can’t make your car smarter without retraining it, but you can make it faster by tuning the engine and removing unnecessary baggage.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Enables massive scalability and high availability.
- Handles millions of inferences efficiently.
- Reduces latency with autoscaling and optimization.
Limitations:
- Distributed systems add complexity and synchronization overhead.
- Debugging across nodes is harder.
- Cost can balloon without proper monitoring.
Trade-offs:
- Latency vs. cost: low latency requires always-on GPUs, but that’s expensive. Smart autoscaling and caching reduce cost without hurting performance.
🚧 Step 6: Common Misunderstandings
“Scaling means just adding more machines.” No — poor parallelization or data skew can still bottleneck your system.
“Model size determines serving speed.” Not entirely — serving speed depends more on hardware, batching, and architecture.
“Autoscaling handles everything automatically.” You still need to define thresholds, metrics, and cost controls to avoid runaway scaling.
🧩 Step 7: Mini Summary
🧠 What You Learned: Scaling in ML involves both distributed training for faster learning and optimized serving for instant predictions.
⚙️ How It Works: Data and model parallelism distribute workloads, while serving architectures (batch, online, streaming) and Kubernetes autoscaling balance performance and cost.
🎯 Why It Matters: Without scaling, models that work in notebooks collapse in production. Scaling turns them into real systems — fast, resilient, and efficient.