3.2. Caching & Precomputation
🪄 Step 1: Intuition & Motivation
Core Idea (in 1 short paragraph): Caching and precomputation make machine learning systems feel fast by reusing work already done. Instead of recalculating predictions or embeddings every single time, the system remembers — like a student who keeps notes instead of rereading the whole textbook before every test. But caching isn’t just about speed; it’s a careful dance between efficiency and freshness — saving time without falling out of date.
Simple Analogy (one only): Imagine a coffee shop that keeps your “usual order” ready before you even ask.
- Caching is remembering what you liked last time.
- Precomputation is preparing your coffee before you arrive.

Both save time, but if your taste changes (drift), that stored order might not match what you want today.
🌱 Step 2: Core Concept
Machine learning systems cache to save computation, latency, and money — but they must manage staleness (how outdated cached data becomes).
What’s Happening Under the Hood?
1️⃣ Prediction Caching
When requests repeat — like the same user asking for the same recommendation — you can store model outputs in a fast key-value store (e.g., Redis, Memcached).
- Key = input (e.g., user_id, query string).
- Value = model output (e.g., top 10 recommendations).

On the next request, you return the cached result instantly instead of running inference again.
✅ When to use: high-traffic systems with frequent duplicate or similar requests. ❌ When not to: highly dynamic inputs (rapidly changing user state).
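A minimal sketch of this pattern in Python, using an in-memory dictionary with a TTL as a stand-in for Redis or Memcached; the `PredictionCache` class, `recommend` helper, and TTL value are illustrative assumptions, not a specific library's API.

```python
import time

class PredictionCache:
    """Tiny in-memory stand-in for a key-value store such as Redis or Memcached."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None                      # cache miss
        value, expires_at = entry
        if time.time() > expires_at:
            del self._store[key]             # expired -> treat as a miss
            return None
        return value                         # cache hit

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)


def recommend(user_id, cache, model_predict):
    """Return cached recommendations if present, otherwise run the model."""
    key = f"recs:{user_id}"                  # key = input identity
    cached = cache.get(key)
    if cached is not None:
        return cached                        # served from cache, no inference
    result = model_predict(user_id)          # expensive model call
    cache.set(key, result)                   # value = model output
    return result


# Usage with a dummy model in place of real inference:
cache = PredictionCache(ttl_seconds=600)
recs = recommend("user_42", cache, model_predict=lambda uid: ["item_1", "item_2"])
```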
2️⃣ Embedding Precomputation
For systems like search or recommendation:
- Compute and store embeddings (numerical representations) for items or users ahead of time.
- At query time, you only compute the embedding for the query and find nearest neighbors among precomputed vectors (using FAISS, Milvus, or ScaNN).
✅ When to use: static or slow-moving catalogs (e.g., product database). ❌ When not to: when content updates every minute or embeddings drift fast.
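A hedged sketch of the offline/online split using plain NumPy and a placeholder `embed` function: item embeddings are computed once up front, and only the query embedding is computed at request time. A production system would swap the brute-force dot product below for an ANN index such as FAISS, Milvus, or ScaNN.

```python
import numpy as np

def embed(texts):
    """Placeholder embedding model: unit-norm random vectors.
    In a real system this would be a trained encoder."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 64)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Offline / batch step: precompute and store item embeddings once.
catalog = ["red running shoes", "espresso machine", "noise-cancelling headphones"]
item_vectors = embed(catalog)          # shape (num_items, dim), reused at query time

def search(query, k=2):
    """Online step: embed only the query, then rank the precomputed item vectors."""
    q = embed([query])[0]
    scores = item_vectors @ q          # cosine similarity (vectors are unit-norm)
    top = np.argsort(-scores)[:k]
    return [(catalog[i], float(scores[i])) for i in top]

print(search("coffee maker"))
```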
3️⃣ Freshness vs. Cost
You can’t keep recomputing embeddings forever — it’s expensive. So, you pick a refresh cadence:
- Hourly for fast-moving data.
- Daily/weekly for stable domains.
You control TTL (Time To Live) — how long cached data stays valid before expiring.
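A small illustrative sketch of refresh cadences expressed as per-domain TTLs; the domain names and intervals are assumptions for demonstration, not recommendations.

```python
from datetime import datetime, timedelta

# Example refresh cadences per data domain (illustrative values only).
TTL_POLICY = {
    "trending_feed": timedelta(hours=1),      # fast-moving data: hourly
    "product_embeddings": timedelta(days=1),  # stable catalog: daily
    "brand_profiles": timedelta(weeks=1),     # very stable domain: weekly
}

def is_stale(domain, last_refreshed_at, now=None):
    """Return True if the cached artifact for this domain has outlived its TTL."""
    now = now or datetime.utcnow()
    return now - last_refreshed_at > TTL_POLICY[domain]

# Trigger a recompute job only when the cached artifact has expired.
last_run = datetime.utcnow() - timedelta(hours=3)
if is_stale("trending_feed", last_run):
    print("refresh trending_feed embeddings")
```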
Why It Works This Way
Caching exploits temporal locality: the idea that recent requests are likely to repeat. Precomputation leverages amortization: expensive computations paid once, reused many times.
Together, they trade a bit of memory for a lot of speed. But they only work if your cached results are still relevant — once data drifts (e.g., user behavior or model weights change), stale caches can harm accuracy or trust.
How It Fits in ML Thinking
Caching sits between model serving and data engineering.
- From the serving side: it reduces inference latency and cost.
- From the data side: it balances freshness and consistency across evolving features.

It’s the invisible glue that lets massive ML systems — like search engines or personalized feeds — run in real time without overloading GPUs.
📐 Step 3: Mathematical Foundation
Cache Hit Ratio
$$ \text{Hit Ratio} = \frac{\text{cache hits}}{\text{cache hits} + \text{cache misses}} $$
- High hit ratio = effective caching.
- Low hit ratio = wasted cache space or too-volatile data.
You can tune cache size, TTL, and eviction policy (LRU, LFU, FIFO) to maximize hits while keeping data relevant.
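As an illustration, the snippet below replays a synthetic, skewed request stream through a small LRU cache and reports the hit ratio at different cache sizes; the traffic distribution and sizes are assumptions for demonstration, not benchmarks.

```python
import random
from collections import OrderedDict

def simulate_lru(requests, cache_size):
    """Replay a request stream through an LRU cache and return the hit ratio."""
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)           # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict the least recently used entry
    return hits / len(requests)

# Skewed traffic: a few "hot" users account for most requests (heavy-tail assumption).
random.seed(0)
requests = [f"user_{int(random.paretovariate(1.2))}" for _ in range(10_000)]
for size in (10, 100, 1000):
    print(f"cache size {size:>5}: hit ratio = {simulate_lru(requests, size):.2f}")
```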
Staleness vs. Refresh Cost Trade-off
Let $T_\text{refresh}$ = refresh interval, $C_\text{update}$ = cost to refresh, and $L_\text{stale}$ = loss from using stale data. The goal is to minimize total cost:
$$ \text{Total Cost} = C_\text{update} + L_\text{stale}(T_\text{refresh}) $$
- Shorter $T_\text{refresh}$ → lower staleness, higher compute cost.
- Longer $T_\text{refresh}$ → higher staleness, lower compute cost.
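A worked numerical sketch of this trade-off, assuming a fixed cost per refresh and a staleness loss that grows linearly with the age of the cached data; it amortizes the refresh cost over a 24-hour window, which is one possible reading of the formula above, and both cost models are illustrative assumptions.

```python
def total_cost(t_refresh_hours, c_update=10.0, stale_loss_per_hour=0.5):
    """Cost per day: refresh cost paid (24 / T) times, plus staleness loss that
    grows linearly with the average age (T / 2) of the cached data."""
    refresh_cost = (24 / t_refresh_hours) * c_update
    avg_staleness_loss = stale_loss_per_hour * (t_refresh_hours / 2) * 24
    return refresh_cost + avg_staleness_loss

# Sweep candidate refresh intervals and pick the cheapest one.
candidates = [1, 2, 4, 6, 12, 24]
for t in candidates:
    print(f"T_refresh = {t:>2} h -> total cost = {total_cost(t):6.1f}")
best = min(candidates, key=total_cost)
print(f"Best interval under these assumptions: {best} h")
```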
🧠 Step 4: Assumptions or Key Ideas
- User requests often repeat or are similar enough for caching to matter.
- Cached data must be refreshed periodically to stay relevant.
- Embedding drift happens when model weights or underlying data distribution changes.
- TTL and eviction policies are business-driven, not purely technical.
- Caches exist at multiple layers: edge (CDN), feature store, and inference response layer.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths
- Dramatically lowers inference latency and cost.
- Reduces pressure on compute infrastructure.
- Enables instant responses for recurring requests.
- Great for precomputable embeddings and static catalogs.

Limitations
- Stale predictions if underlying data or model changes.
- Cache invalidation is hard — deciding what to evict and when.
- Wastes memory if requests are highly unique.
- Must handle cache warm-up and misses gracefully.

Trade-offs
- Freshness vs. Cost: Frequent updates improve accuracy but increase compute cost.
- Size vs. Hit Ratio: Larger caches store more but consume more memory.
- TTL vs. Accuracy: Longer TTLs mean faster responses but risk drift.
🚧 Step 6: Common Misunderstandings
- “Caching is always safe.” → Wrong. Cached predictions can become wrong due to drift or retraining.
- “All caches should live forever.” → No — old embeddings degrade model trustworthiness.
- “Precomputation removes the need for inference.” → It only removes repetitive work; dynamic parts still need real-time computation.
🧩 Step 7: Mini Summary
🧠 What You Learned: Caching and precomputation store and reuse costly computations to speed up ML systems.
⚙️ How It Works: Use key-value stores for frequent requests, precompute embeddings where possible, and balance refresh frequency with cost.
🎯 Why It Matters: They make real-time AI feel instantaneous — but mismanaging staleness or drift can quietly break your system’s trust and performance.