2.3. Online Inference Architecture


🪄 Step 1: Intuition & Motivation

  • Core Idea (in 1 short paragraph): Online inference is how your model answers questions right now. The architecture decides whether we reply instantly (synchronous), queue work and reply later (asynchronous), share servers across many models or teams (multi-tenant), and how we scale up to big traffic without falling over. The goal: fast, steady, and affordable predictions.

  • Simple Analogy (one only): Think of a busy restaurant: some orders are cooked while the customer waits (synchronous); others are picked up later as takeout (asynchronous). You can run one big kitchen for multiple menus (multi-tenant) or separate mini-kitchens for each chef (multi-model serving). Good hosts seat guests efficiently (load balancers) and call in extra staff when a line forms (autoscaling).


🌱 Step 2: Core Concept

We’ll break the idea into four big decisions that shape latency, throughput, and cost.

What’s Happening Under the Hood?

1) Synchronous vs. Asynchronous

  • Synchronous (request–response): Client opens a connection; server returns a prediction within a tight SLA (e.g., P99 ≤ 200ms). Best for search, autocomplete, fraud checks.
  • Asynchronous (queued work): Client submits a job, gets a job ID, and polls or receives a callback/webhook when done. Best for slow or heavy jobs (e.g., batch scoring, large LLM generations).
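
A minimal sketch of the two patterns, assuming a FastAPI app and a placeholder `model_predict` function (both illustrative, not a specific production stack):

```python
# Sketch only: /predict answers inline (synchronous); /jobs enqueues work and
# returns a job ID for later polling (asynchronous). The in-memory job store
# stands in for a real queue/DB.
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
jobs: dict[str, dict] = {}               # job_id -> status/result (illustrative)

def model_predict(features: list[float]) -> float:
    return sum(features)                 # placeholder for a real model call

@app.post("/predict")
def predict(features: list[float]):
    # Synchronous: the client waits for the answer within the SLA.
    return {"score": model_predict(features)}

@app.post("/jobs")
def submit(features: list[float], background: BackgroundTasks):
    # Asynchronous: return a job ID now, compute later, let the client poll.
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending"}
    background.add_task(run_job, job_id, features)
    return {"job_id": job_id}

def run_job(job_id: str, features: list[float]):
    jobs[job_id] = {"status": "done", "score": model_predict(features)}

@app.get("/jobs/{job_id}")
def poll(job_id: str):
    return jobs.get(job_id, {"status": "unknown"})
```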

2) Multi-Model Serving vs. Multi-Tenant Inference

  • Multi-Model Serving: One endpoint process hosts several models (e.g., “/predict?model_id=X”). Reduces cold starts and management overhead; watch memory headroom and isolation.
  • Multi-Tenant Inference: Many users/teams share the same hardware pool. Schedulers pack workloads onto GPUs/CPUs to raise utilization; enforce quotas and fairness to prevent “noisy neighbor” issues.
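
A rough sketch of one process serving several models behind a single endpoint, with a toy per-tenant in-flight quota; `MODEL_REGISTRY` and `TENANT_QUOTAS` are made-up names for illustration:

```python
# One warm process hosts several models; a simple quota guards against
# noisy-neighbor tenants. All values here are placeholders.
MODEL_REGISTRY = {                      # model_id -> loaded model (kept warm)
    "ranker_v3": lambda x: sorted(x),
    "fraud_v1": lambda x: sum(x) > 10,
}
TENANT_QUOTAS = {"team_search": 100, "team_risk": 20}   # max in-flight per tenant
in_flight = {tenant: 0 for tenant in TENANT_QUOTAS}

def handle(tenant: str, model_id: str, payload):
    if in_flight[tenant] >= TENANT_QUOTAS[tenant]:
        raise RuntimeError("quota exceeded")             # enforce fairness
    model = MODEL_REGISTRY[model_id]                     # no per-request cold start
    in_flight[tenant] += 1
    try:
        return model(payload)
    finally:
        in_flight[tenant] -= 1
```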

3) Load Balancing

  • Layer 4/7 LB distributes requests across replicas (round-robin, least-connections, latency-aware).
  • Sticky routing can keep a session or model on a warm replica (reduces cold-starts).
  • Edge caches and feature caches (e.g., Redis) cut repeated work.
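
A toy illustration of least-connections routing and a feature cache; a real deployment would rely on an L4/L7 load balancer and an external cache such as Redis rather than in-process dicts:

```python
# Least-connections picker plus a feature-cache lookup (illustrative only).
replicas = {"replica-a": 0, "replica-b": 0, "replica-c": 0}   # replica -> open connections
feature_cache: dict[str, list[float]] = {}

def pick_replica() -> str:
    # Route to the replica with the fewest in-flight requests.
    return min(replicas, key=replicas.get)

def get_features(entity_id: str, fetch_fn) -> list[float]:
    # A cache hit avoids recomputing or re-fetching features on every request.
    if entity_id not in feature_cache:
        feature_cache[entity_id] = fetch_fn(entity_id)
    return feature_cache[entity_id]
```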

4) Autoscaling & Cold Starts

  • HPA/KEDA: Scale by CPU/GPU utilization, queue depth, or request rate (QPS).
  • Warm pools & pre-loading weights: Keep a small set of ready pods/containers to absorb spikes.
  • Vectorized inference: Batch multiple requests together to exploit GPU parallelism, balancing throughput vs. added queueing delay.
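
A sketch of adaptive batching under the assumptions above (cap the wait, cap the batch size); `batched_infer` stands in for a vectorized model call such as one GPU forward pass:

```python
# Flush a batch when it is full OR the oldest request has waited max_wait_s,
# whichever comes first. Callers enqueue (inputs, reply_queue) tuples.
import queue
import time

request_q = queue.Queue()                # holds (inputs, reply_queue) tuples

def batching_loop(batched_infer, max_batch: int = 32, max_wait_s: float = 0.01):
    while True:
        batch = [request_q.get()]                        # block for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [x for x, _ in batch]
        outputs = batched_infer(inputs)                  # one vectorized call for the batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)                             # hand each caller its result
```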

Why It Works This Way

  • Synchronous flows prioritize latency; asynchronous flows prioritize throughput and cost.
  • Co-locating multiple models reduces duplicated overhead (weights, containers) but raises resource contention risk.
  • Autoscaling on the right signal (QPS, queue length, token/s generation rate for LLMs) catches surges early.
  • Vectorization on accelerators dramatically improves efficiency — the GPU prefers bigger bites.

How It Fits in ML Thinking

Online inference is where data engineering, systems, and modeling meet the user.

  • Your model choice shapes latency (size, quantization).
  • Your serving pattern shapes user experience (sync vs. async).
  • Your infra choices (LB, autoscaling, batching) shape cost and reliability.

Together, these define a production-ready, interview-ready design.

📐 Step 3: Mathematical Foundation

Use lightweight queueing and scaling intuition to reason about capacity and SLAs.

Little’s Law (Back-of-the-Envelope Sizing)
$$ L = \lambda \cdot W $$
  • $L$: average number of requests in the system (in-flight).
  • $\lambda$: arrival rate (QPS).
  • $W$: average time in system (latency, incl. queueing).

Interpretation: if you let $W$ rise, $L$ (the in-flight load) grows. To keep $W$ small at high $\lambda$ (and thus keep tail latency within the SLA, e.g., P99 ≤ 200ms), add capacity (more replicas) or raise per-replica throughput (vectorization, better hardware).

If 1000 req/s must finish in 0.1s on average, expect ~100 requests in-flight. Provision so those 100 don’t overwhelm a single node.
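
A quick back-of-the-envelope in code, using the same illustrative numbers (the per-replica concurrency is an assumption):

```python
# Little's Law sizing: L = lambda * W, then divide by what one replica can hold.
import math

qps = 1000                 # lambda: arrival rate (req/s)
latency_s = 0.1            # W: average time in system (s)
in_flight = qps * latency_s                      # L ~= 100 requests in flight
per_replica_concurrency = 8                      # assumed safe concurrency per replica
replicas_needed = math.ceil(in_flight / per_replica_concurrency)   # -> 13 replicas
print(in_flight, replicas_needed)
```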

Service Rate & Batching

Let $\mu$ be per-replica service rate (req/s). With batch size $B$ on a GPU:

$$ \mu \approx \frac{B}{t_\text{infer}(B)} $$
  • Larger $B$ often improves $\mu$ (better GPU utilization) but increases waiting time to fill a batch → higher latency.
  • Adaptive batching caps max wait (e.g., 5–10 ms) and max batch size (hardware sweet spot).
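
A small illustration of the throughput/latency tension, with an assumed timing model of fixed overhead plus per-item cost (not a measured profile):

```python
# mu = B / t_infer(B): throughput rises with batch size, but each item also
# waits longer for the batch to fill and to run. Numbers are illustrative.
def t_infer(batch_size: int) -> float:
    return 0.004 + 0.0005 * batch_size           # 4 ms overhead + 0.5 ms per item (assumed)

for B in (1, 8, 32, 128):
    mu = B / t_infer(B)                          # per-replica service rate (req/s)
    print(f"B={B:4d}  t_infer={t_infer(B) * 1e3:6.1f} ms  mu={mu:8.0f} req/s")
```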

🧠 Step 4: Assumptions or Key Ideas

  • Define SLA targets explicitly (e.g., P99 ≤ 200ms, error rate ≤ 0.1%).
  • Choose transport wisely: gRPC (binary, streaming, HTTP/2) usually beats REST+JSON for low-latency serving.
  • Use model signatures (input/output schemas) to validate requests at the edge.
  • Monitor capacity signals: QPS, queue depth, GPU util, tokens/s (for LLMs), and cache hit rates.
  • Plan isolation: resource quotas, per-tenant rate limits, and circuit breakers.
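
One possible way to enforce the per-tenant rate limits from the list above is a token bucket; the capacity and refill numbers below are placeholders:

```python
# Minimal per-tenant token-bucket rate limiter (sketch, not a full policy engine).
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float):
        self.rate, self.capacity = rate_per_s, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                     # reject (or queue): tenant is over its limit

limits = {"team_search": TokenBucket(rate_per_s=100, capacity=200)}
```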

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Synchronous: Best UX for interactive flows; simple client logic.
  • Asynchronous: Handles long/large jobs cheaply; resilient to spikes.
  • Multi-model serving: Fewer cold starts; unified management.
  • Multi-tenant: Higher hardware utilization; lower cost per request.

Limitations

  • Synchronous: Sensitive to tail latency and cold starts.
  • Asynchronous: More client complexity (callbacks, polling); eventual consistency.
  • Multi-model serving: Risk of interference; harder debugging if one model misbehaves.
  • Multi-tenant: “Noisy neighbor” problems; fairness and quota enforcement required.

Trade-offs

  • Latency vs. Throughput: Vectorize to raise throughput, cap queue wait to protect latency.
  • Isolation vs. Efficiency: Dedicated pools reduce interference but cost more.
  • Simplicity vs. Control: REST is simple; gRPC offers lower latency and streaming but adds tooling complexity.

🚧 Step 6: Common Misunderstandings (Optional)

  • “More pods always fix latency.” → If the bottleneck is weights load or a downstream DB, more pods won’t help.
  • “Batch size should be as big as possible.” → Past a point, added queueing delay hurts P99 more than throughput helps.
  • “JSON vs. gRPC doesn’t matter.” → Serialization and HTTP/1.1 overhead can meaningfully tax tail latency at scale.

🧩 Step 7: Mini Summary

🧠 What You Learned: Online inference architecture chooses between synchronous and asynchronous serving, dedicated vs. shared hardware, and how to load-balance and autoscale to meet SLAs.

⚙️ How It Works: Use gRPC for low-latency transport, vectorized (batched) inference for throughput, autoscaling on smart signals, and warm pools to defeat cold starts.

🎯 Why It Matters: These choices determine whether your system can handle real traffic spikes reliably and affordably.
