7.1. Scaling Training and Serving
🪄 Step 1: Intuition & Motivation
Core Idea: Training a single model on one machine may work for prototypes, but production ML is a massive orchestration of compute, data, and serving infrastructure. Scaling ensures that your models train faster, serve predictions instantly, and handle millions of requests — without exploding costs or latency.
Simple Analogy: Think of ML systems like a restaurant:
- Training = cooking the meal.
- Serving = delivering the food.
- Scaling = adding more chefs, better ovens, and faster delivery routes — without ruining the taste (accuracy).
🌱 Step 2: Core Concept
To scale ML systems, you need to optimize both sides of the equation:
- Training: Make learning faster using distributed compute.
- Serving: Make predictions faster using efficient architecture and autoscaling.
1️⃣ Distributed Training — Teaching in Parallel
When datasets and models become too large for a single GPU or machine, you split the workload across multiple nodes. There are two main strategies:
🧮 Data Parallelism
Each worker gets a different chunk of the data but the same model copy. After each batch, workers synchronize gradients to keep model weights consistent.
Example: With 4 GPUs and 1 million samples, each GPU trains on its own 250k-sample shard, processing one mini-batch per step. Gradients are averaged across GPUs, so the model updates once per global batch.
Tools:
- PyTorch Distributed Data Parallel (DDP)
- TensorFlow tf.distribute.MirroredStrategy
- Horovod (Uber’s distributed training library)
When to Use: When your model fits into one GPU’s memory but data is huge.
💡 Intuition: Imagine multiple chefs each cooking part of the meal with the same recipe — then combining results into one final dish.
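A minimal sketch of data parallelism with PyTorch DDP. The model, dataset, and hyperparameters below are placeholders, and it assumes a launch via torchrun with one process per GPU:

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=4 train_ddp.py
# The model and the random dataset are placeholders standing in for real workloads.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])       # handles gradient all-reduce

    dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)             # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()           # gradients averaged across ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```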
🧩 Model Parallelism
When the model itself is too big to fit on one GPU, you split the model’s layers across devices.
Example:
- GPU 1: Layers 1–4
- GPU 2: Layers 5–8
- GPU 3: Layers 9–12
Each GPU passes forward activations to the next, like an assembly line.
Tools:
- DeepSpeed, Megatron-LM (used for large LLMs)
- Tensor Parallelism and Pipeline Parallelism
When to Use: For extremely large models (e.g., GPT, BERT, Vision Transformers).
💡 Intuition: Think of model parallelism like a car factory — each station (GPU) assembles a different part before passing it along.
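A toy sketch of manual model parallelism in PyTorch, splitting two stages across two GPUs. The layer sizes and device IDs are assumptions; large-scale training would use DeepSpeed/Megatron-style tensor and pipeline parallelism rather than hand-placed layers:

```python
# Toy manual model parallelism: stages live on different GPUs and
# activations flow between them like an assembly line.
# Assumes at least two visible GPUs; layer sizes are placeholders.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))   # forward pass on GPU 0
        x = self.stage2(x.to("cuda:1"))   # activations handed off to GPU 1
        return x

model = TwoStageModel()
logits = model(torch.randn(32, 1024))
print(logits.device)                      # output lives on cuda:1
```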
2️⃣ Serving Architectures — The Speed of Delivery
Once your model is trained, the next challenge is how to deliver predictions efficiently.
🍱 Batch Inference (Offline Serving)
- Used for large-scale predictions on static datasets (e.g., scoring millions of users nightly).
- Typically triggered by schedulers like Airflow or Spark jobs.
- High throughput, no real-time constraints.
Example: Recomputing all customer credit risk scores every 24 hours.
💡 Analogy: Like delivering meals in bulk once a day — efficient but not instant.
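A minimal sketch of what a scheduler-triggered batch-scoring job might look like. The file paths, feature columns, and model artifact are all hypothetical:

```python
# Hypothetical nightly batch-scoring job; a scheduler such as Airflow would
# run this script once per day. Paths and feature columns are placeholders.
import joblib
import pandas as pd

def score_all_customers(input_path="customers.parquet",
                        output_path="risk_scores.parquet",
                        model_path="credit_risk_model.joblib"):
    model = joblib.load(model_path)                  # pre-trained model artifact
    df = pd.read_parquet(input_path)                 # static snapshot of all customers
    features = df[["income", "debt_ratio", "age"]]   # assumed feature columns
    df["risk_score"] = model.predict_proba(features)[:, 1]
    df[["customer_id", "risk_score"]].to_parquet(output_path)

if __name__ == "__main__":
    score_all_customers()
```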
⚡ Online Inference (Real-Time Serving)
- Used when predictions must be made in milliseconds (e.g., recommendation, fraud detection).
- Model is deployed behind an API endpoint (e.g., FastAPI, TensorFlow Serving, TorchServe).
- Prioritizes low latency, autoscaling, and caching.
Example: When a user clicks “Pay,” the fraud model predicts instantly if it’s suspicious.
💡 Analogy: Like a fast-food counter — every order is served instantly, one customer at a time.
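A minimal sketch of an online inference endpoint with FastAPI. The model file and feature schema are hypothetical:

```python
# Sketch of real-time serving with FastAPI.
# Run with: uvicorn serve:app --workers 4
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("fraud_model.joblib")   # loaded once at startup, kept in memory

class Transaction(BaseModel):
    amount: float
    merchant_id: int
    country_code: int

@app.post("/predict")
def predict(tx: Transaction):
    features = [[tx.amount, tx.merchant_id, tx.country_code]]
    score = float(model.predict_proba(features)[0][1])
    return {"fraud_probability": score, "flag": score > 0.9}
```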
🌊 Streaming Inference (Continuous Serving)
- Ideal for event-driven pipelines (e.g., IoT, logs, or real-time personalization).
- Models process data as it arrives — using tools like Kafka Streams, Flink, or Ray Serve.
- Requires asynchronous buffering, checkpointing, and backpressure management.
Example: Predicting energy consumption in real-time from thousands of IoT sensors.
💡 Analogy: Like a conveyor belt sushi restaurant — new items arrive continuously, and the system keeps up in real time.
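A simplified sketch of streaming inference using kafka-python. The topic names, broker address, and model are assumptions, and a production pipeline would add batching, checkpointing, and backpressure handling:

```python
# Sketch of streaming inference: consume sensor events as they arrive,
# score them, and publish predictions to another topic.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("energy_model.joblib")   # placeholder model artifact

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:                        # blocks, yielding events as they arrive
    reading = event.value                     # e.g., {"sensor_id": 7, "watts": 412.0}
    prediction = float(model.predict([[reading["watts"]]])[0])
    producer.send("energy-forecasts",
                  {"sensor_id": reading["sensor_id"], "predicted_kwh": prediction})
```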
3️⃣ Autoscaling — Matching Compute to Demand
Scaling isn’t just about adding more machines — it’s about adding them at the right time.
Kubernetes’ Horizontal Pod Autoscaler (HPA) automatically adjusts the number of serving pods based on demand.
🧠 How It Works:
- Define CPU or memory thresholds.
- When usage exceeds threshold, HPA spawns new pods.
- When demand falls, it scales down to save cost.
Example Config (YAML):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detector
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Metrics You Can Use for Scaling:
- CPU/GPU utilization
- Request latency
- Queue length
- Throughput (requests per second)
💡 Intuition: Autoscaling is like adding more delivery drivers during rush hour and letting them rest when the roads are quiet.
📐 Step 3: Mathematical Foundation
Let’s model the relationship between latency ($L$), service capacity ($\mu$), and load ($\lambda$).
Latency–Load Trade-Off
In queueing theory, average latency is approximated by:
$$ L = \frac{1}{\mu - \lambda} $$
Where:
- $L$ = latency
- $\lambda$ = incoming request rate
- $\mu$ = service rate (capacity per server)
As $\lambda$ approaches $\mu$, the denominator shrinks toward zero and latency blows up. Autoscaling increases $\mu$ dynamically, keeping $(\mu - \lambda)$ large and latency low.
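A quick numeric illustration of how sharply latency rises near saturation (the service and request rates below are made-up numbers):

```python
# Average latency L = 1 / (mu - lambda) for an M/M/1-style queue.
mu = 100.0                        # service rate: 100 requests/sec per server
for lam in (50, 80, 90, 95, 99):  # incoming request rates (requests/sec)
    latency = 1.0 / (mu - lam)    # seconds
    print(f"load={lam:>3} req/s -> avg latency ~ {latency * 1000:.0f} ms")
# load= 50 req/s -> avg latency ~   20 ms
# load= 99 req/s -> avg latency ~ 1000 ms
```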
🧠 Step 4: How to Reduce Inference Latency (Without Retraining)
Even if you don’t change the model, you can make it predict faster with infrastructure and optimization tricks.
🔧 Techniques:
- Model Quantization: Reduce precision (e.g., FP32 → INT8) to speed up computation with minimal accuracy loss (see the sketch after this list).
- Model Compilation: Use graph compilers like ONNX Runtime, TensorRT, or TorchScript to optimize computation graphs.
- Batching Requests: Combine multiple small inference requests into one larger batch (efficient for GPUs).
- Caching Frequent Results: Store predictions for recurring inputs (e.g., same user/session).
- Async I/O and Threading: Overlap CPU–GPU communication using asynchronous execution.
- Edge Serving: Move model inference closer to users (via CDNs or edge nodes).
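For example, post-training dynamic quantization in PyTorch takes only a couple of lines (a minimal sketch; the model here is a placeholder, and the gains are largest for Linear/LSTM-heavy models on CPU):

```python
# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize all Linear layers to INT8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, smaller and usually faster on CPU
```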
💡 Intuition: You can’t make your car smarter without retraining it, but you can make it faster by tuning the engine and removing unnecessary baggage.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Enables massive scalability and high availability.
- Handles millions of inferences efficiently.
- Reduces latency with autoscaling and optimization.
Limitations:
- Distributed systems add complexity and synchronization overhead.
- Debugging across nodes is harder.
- Cost can balloon without proper monitoring.
Trade-offs:
- Latency vs. cost: low latency requires always-on GPUs, but that’s expensive. Smart autoscaling and caching reduce cost without hurting performance.
🚧 Step 6: Common Misunderstandings
“Scaling means just adding more machines.” No — poor parallelization or data skew can still bottleneck your system.
“Model size determines serving speed.” Not entirely — serving speed depends more on hardware, batching, and architecture.
“Autoscaling handles everything automatically.” You still need to define thresholds, metrics, and cost controls to avoid runaway scaling.
🧩 Step 7: Mini Summary
🧠 What You Learned: Scaling in ML involves both distributed training for faster learning and optimized serving for instant predictions.
⚙️ How It Works: Data and model parallelism distribute workloads, while serving architectures (batch, online, streaming) and Kubernetes autoscaling balance performance and cost.
🎯 Why It Matters: Without scaling, models that work in notebooks collapse in production. Scaling turns them into real systems — fast, resilient, and efficient.