7.1. Scaling Training and Serving


🪄 Step 1: Intuition & Motivation

  • Core Idea: Training a single model on one machine may work for prototypes, but production ML is a massive orchestration of compute, data, and serving infrastructure. Scaling ensures that your models train faster, serve predictions instantly, and handle millions of requests — without exploding costs or latency.

  • Simple Analogy: Think of ML systems like a restaurant:

    • Training = cooking the meal.
    • Serving = delivering the food.
    • Scaling = adding more chefs, better ovens, and faster delivery routes — without ruining the taste (accuracy).

🌱 Step 2: Core Concept

To scale ML systems, you need to optimize both sides of the equation:

  • Training: Make learning faster using distributed compute.
  • Serving: Make predictions faster using efficient architecture and autoscaling.

1️⃣ Distributed Training — Teaching in Parallel

When datasets and models become too large for a single GPU or machine, you split the workload across multiple nodes. There are two main strategies:


🧮 Data Parallelism

Each worker gets a different chunk of the data but the same model copy. After each batch, workers synchronize gradients to keep model weights consistent.

Example: With 4 GPUs and 1 million samples, each GPU works through its own 250k-sample shard. On every step, each GPU computes gradients on its local mini-batch, the gradients are averaged across GPUs, and the model updates once per global batch.

Tools:

  • PyTorch Distributed Data Parallel (DDP)
  • TensorFlow tf.distribute.MirroredStrategy
  • Horovod (Uber’s distributed training library)

When to Use: When your model fits into one GPU’s memory but data is huge.

💡 Intuition: Imagine multiple chefs each cooking part of the meal with the same recipe — then combining results into one final dish.
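
Here is a minimal sketch of data parallelism with PyTorch DDP. It assumes you launch it with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE) and uses a toy linear model with random data purely for illustration.

```python
# Minimal PyTorch DDP sketch (illustrative; assumes launch via torchrun,
# which sets RANK, LOCAL_RANK, and WORLD_SIZE).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # toy model, fits on one GPU
    model = DDP(model, device_ids=[local_rank])       # wraps model; syncs gradients

    dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)             # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()           # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```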


🧩 Model Parallelism

When the model itself is too big to fit on one GPU, you split the model’s layers across devices.

Example:

  • GPU 1: Layers 1–4
  • GPU 2: Layers 5–8
  • GPU 3: Layers 9–12

Each GPU passes forward activations to the next, like an assembly line.

Tools & Techniques:

  • DeepSpeed, Megatron-LM (libraries used to train large LLMs)
  • Tensor parallelism and pipeline parallelism (the two common strategies for splitting a model across devices)

When to Use: For extremely large models (e.g., GPT, BERT, Vision Transformers).

💡 Intuition: Think of model parallelism like a car factory — each station (GPU) assembles a different part before passing it along.
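
To make the idea concrete, here is a toy model-parallel sketch in PyTorch that places two stages of a network on two different GPUs. It assumes cuda:0 and cuda:1 are available and uses made-up layer sizes.

```python
# Toy model-parallel sketch: layers split across two GPUs
# (assumes cuda:0 and cuda:1 exist; sizes are arbitrary).
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(512, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))  # activations hop to the next device

model = TwoStageNet()
out = model(torch.randn(32, 512))
print(out.shape)  # torch.Size([32, 10]), living on cuda:1
```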


2️⃣ Serving Architectures — The Speed of Delivery

Once your model is trained, the next challenge is how to deliver predictions efficiently.


🍱 Batch Inference (Offline Serving)

  • Used for large-scale predictions on static datasets (e.g., scoring millions of users nightly).
  • Typically triggered by schedulers like Airflow or Spark jobs.
  • High throughput, no real-time constraints.

Example: Recomputing all customer credit risk scores every 24 hours.

💡 Analogy: Like delivering meals in bulk once a day — efficient but not instant.
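
As a sketch, a nightly batch job might look like the following; the model artifact, file paths, and column names are hypothetical, and in practice the script would be triggered by an Airflow DAG or a Spark job.

```python
# Hypothetical nightly batch-scoring job (paths, columns, and the model
# artifact are placeholders for illustration).
import pandas as pd
import joblib

model = joblib.load("models/credit_risk.pkl")           # pre-trained model artifact

customers = pd.read_parquet("data/customers.parquet")    # static daily snapshot
features = customers.drop(columns=["customer_id"])
customers["risk_score"] = model.predict_proba(features)[:, 1]

customers[["customer_id", "risk_score"]].to_parquet("data/risk_scores.parquet")
```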


⚡ Online Inference (Real-Time Serving)

  • Used when predictions must be made in milliseconds (e.g., recommendation, fraud detection).
  • Model is deployed behind an API endpoint (e.g., FastAPI, TensorFlow Serving, TorchServe).
  • Prioritizes low latency, autoscaling, and caching.

Example: When a user clicks “Pay,” the fraud model predicts instantly if it’s suspicious.

💡 Analogy: Like a fast-food counter — every order is served instantly, one customer at a time.
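
A minimal real-time endpoint with FastAPI might look like this sketch; the fraud model, its feature schema, and the file path are placeholders.

```python
# Minimal FastAPI serving sketch (illustrative; the model and its
# feature schema are placeholders).
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("models/fraud_detector.pkl")   # loaded once at startup

class Transaction(BaseModel):
    amount: float
    merchant_id: int
    country_code: int

@app.post("/predict")
def predict(tx: Transaction):
    features = [[tx.amount, tx.merchant_id, tx.country_code]]
    score = float(model.predict_proba(features)[0, 1])
    return {"fraud_probability": score}

# Run with: uvicorn app:app --workers 4
```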


🌊 Streaming Inference (Continuous Serving)

  • Ideal for event-driven pipelines (e.g., IoT, logs, or real-time personalization).
  • Models process data as it arrives — using tools like Kafka Streams, Flink, or Ray Serve.
  • Requires asynchronous buffering, checkpointing, and backpressure management.

Example: Predicting energy consumption in real-time from thousands of IoT sensors.

💡 Analogy: Like a conveyor belt sushi restaurant — new items arrive continuously, and the system keeps up in real time.
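
Here is a hypothetical streaming loop using kafka-python; the topic names, broker address, and model are assumptions for illustration, and a production system would add batching, checkpointing, and backpressure handling on top.

```python
# Hypothetical streaming-inference loop with kafka-python (topics, broker,
# message schema, and model are made up for illustration).
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("models/energy_forecaster.pkl")

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:                      # blocks, processing events as they arrive
    reading = msg.value                   # e.g. {"sensor_id": 42, "watts": 113.5}
    prediction = float(model.predict([[reading["watts"]]])[0])
    producer.send("energy-forecasts",
                  {"sensor_id": reading["sensor_id"], "predicted_watts": prediction})
```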


3️⃣ Autoscaling — Matching Compute to Demand

Scaling isn’t just about adding more machines — it’s about adding them at the right time.

Kubernetes’ Horizontal Pod Autoscaler (HPA) automatically adjusts the number of serving pods based on demand.


🧠 How It Works:

  1. Define CPU or memory thresholds.
  2. When usage exceeds threshold, HPA spawns new pods.
  3. When demand falls, it scales down to save cost.

Example Config (YAML):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving
spec:
  scaleTargetRef:               # the Deployment that serves the model
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detector
  minReplicas: 2                # always keep at least 2 pods warm
  maxReplicas: 10               # hard ceiling to cap cost
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # add pods when average CPU exceeds 70%
```

Metrics You Can Use for Scaling:

  • CPU/GPU utilization
  • Request latency
  • Queue length
  • Throughput (requests per second)

💡 Intuition: Autoscaling is like adding more delivery drivers during rush hour and letting them rest when the roads are quiet.


📐 Step 3: Mathematical Foundation

Let’s model the relationship between latency (L), service capacity (μ), and load (λ).

Latency–Load Trade-Off

In queueing theory, average latency is approximated by:

$$ L = \frac{1}{\mu - \lambda} $$

Where:

  • $L$ = latency
  • $\lambda$ = incoming request rate
  • $\mu$ = service rate (capacity per server)

As $\lambda$ approaches $\mu$, the denominator shrinks toward zero and latency blows up. Autoscaling increases $\mu$ dynamically, keeping $(\mu - \lambda)$ large and latency low.

When servers are overloaded, even small increases in traffic cause big slowdowns. Autoscaling prevents that by increasing capacity before overload happens.
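
A quick back-of-the-envelope calculation shows how sharply latency rises as load approaches capacity, and how raising μ (scaling out) restores headroom:

```python
# Numeric illustration of L = 1 / (mu - lambda): as load approaches capacity,
# latency blows up; adding capacity restores headroom.
def latency(service_rate: float, arrival_rate: float) -> float:
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # requests/sec a single server can handle
for lam in (50, 90, 99):
    print(f"load={lam}/s  latency={latency(mu, lam) * 1000:.1f} ms")
# load=50/s  latency=20.0 ms
# load=90/s  latency=100.0 ms
# load=99/s  latency=1000.0 ms

print(f"after scaling out (mu=200): latency={latency(200.0, 99) * 1000:.1f} ms")
# after scaling out (mu=200): latency=9.9 ms
```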

🧠 Step 4: How to Reduce Inference Latency (Without Retraining)

Even if you don’t change the model, you can make it predict faster with infrastructure and optimization tricks.

🔧 Techniques:

  1. Model Quantization: Reduce precision (e.g., FP32 → INT8) to speed up computation with minimal accuracy loss.

  2. Model Compilation: Use graph compilers like ONNX Runtime, TensorRT, or TorchScript to optimize computation graphs.

  3. Batching Requests: Combine multiple small inference requests into one larger batch (efficient for GPUs).

  4. Caching Frequent Results: Store predictions for recurring inputs (e.g., same user/session).

  5. Async I/O and Threading: Overlap CPU–GPU communication using asynchronous execution.

  6. Edge Serving: Move model inference closer to users (via CDNs or edge nodes).

💡 Intuition: You can’t make your car smarter without retraining it, but you can make it faster by tuning the engine and removing unnecessary baggage.
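
As one concrete example of technique 1, here is a minimal sketch of post-training dynamic quantization in PyTorch; the model is a stand-in, not a real production network.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch
# (the model here is a placeholder, not a production network).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Quantize Linear layers' weights to INT8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape, smaller and faster Linear kernels
```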


⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Enables massive scalability and high availability.
  • Handles millions of inferences efficiently.
  • Reduces latency with autoscaling and optimization.

Limitations:

  • Distributed systems add complexity and synchronization overhead.
  • Debugging across nodes is harder.
  • Cost can balloon without proper monitoring.

Trade-off: Low latency requires always-on GPUs, but that's expensive. Smart autoscaling and caching reduce cost without hurting performance.

🚧 Step 6: Common Misunderstandings

  • “Scaling means just adding more machines.” No — poor parallelization or data skew can still bottleneck your system.

  • “Model size determines serving speed.” Not entirely — serving speed depends more on hardware, batching, and architecture.

  • “Autoscaling handles everything automatically.” You still need to define thresholds, metrics, and cost controls to avoid runaway scaling.


🧩 Step 7: Mini Summary

🧠 What You Learned: Scaling in ML involves both distributed training for faster learning and optimized serving for instant predictions.

⚙️ How It Works: Data and model parallelism distribute workloads, while serving architectures (batch, online, streaming) and Kubernetes autoscaling balance performance and cost.

🎯 Why It Matters: Without scaling, models that work in notebooks collapse in production. Scaling turns them into real systems — fast, resilient, and efficient.
