3.2. Caching & Precomputation
🪄 Step 1: Intuition & Motivation
Core Idea (in 1 short paragraph): Caching and precomputation make machine learning systems feel fast by reusing work already done. Instead of recalculating predictions or embeddings every single time, the system remembers — like a student who keeps notes instead of rereading the whole textbook before every test. But caching isn’t just about speed; it’s a careful dance between efficiency and freshness — saving time without falling out of date.
Simple Analogy (one only): Imagine a coffee shop that keeps your “usual order” ready before you even ask.
- Caching is remembering what you liked last time.
- Precomputation is preparing your coffee before you arrive.

Both save time, but if your taste changes (drift), that stored order might not match what you want today.
🌱 Step 2: Core Concept
Machine learning systems cache to save computation, latency, and money — but they must manage staleness (how outdated cached data becomes).
What’s Happening Under the Hood?
1️⃣ Prediction Caching
When requests repeat — like the same user asking for the same recommendation — you can store model outputs in a fast key-value store (e.g., Redis, Memcached).
- Key = input (e.g., user_id, query string).
- Value = model output (e.g., top 10 recommendations).

On the next request, you return the cached result instantly instead of running inference again.
✅ When to use: high-traffic systems with frequent duplicate or similar requests. ❌ When not to: highly dynamic inputs (rapidly changing user state).
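A minimal sketch of this pattern in Python, using an in-memory dictionary with a TTL as a stand-in for Redis or Memcached; the `PredictionCache` class, `recommend` helper, and TTL value are illustrative assumptions, not a specific library's API.

```python
import time

class PredictionCache:
    """Tiny in-memory stand-in for a key-value store such as Redis or Memcached."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None                      # cache miss
        value, expires_at = entry
        if time.time() > expires_at:
            del self._store[key]             # expired -> treat as a miss
            return None
        return value                         # cache hit

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)


def recommend(user_id, cache, model_predict):
    """Return cached recommendations if present, otherwise run the model."""
    key = f"recs:{user_id}"                  # key = input identity
    cached = cache.get(key)
    if cached is not None:
        return cached                        # served from cache, no inference
    result = model_predict(user_id)          # expensive model call
    cache.set(key, result)                   # value = model output
    return result


# Usage with a dummy model in place of real inference:
cache = PredictionCache(ttl_seconds=600)
recs = recommend("user_42", cache, model_predict=lambda uid: ["item_1", "item_2"])
```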
2️⃣ Embedding Precomputation
For systems like search or recommendation:
- Compute and store embeddings (numerical representations) for items or users ahead of time.
- At query time, you only compute the embedding for the query and find nearest neighbors among precomputed vectors (using FAISS, Milvus, or ScaNN).
✅ When to use: static or slow-moving catalogs (e.g., product database). ❌ When not to: when content updates every minute or embeddings drift fast.
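A hedged sketch of the offline/online split using plain NumPy and a placeholder `embed` function: item embeddings are computed once up front, and only the query embedding is computed at request time. A production system would swap the brute-force dot product below for an ANN index such as FAISS, Milvus, or ScaNN.

```python
import numpy as np

def embed(texts):
    """Placeholder embedding model: unit-norm random vectors.
    In a real system this would be a trained encoder."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 64)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Offline / batch step: precompute and store item embeddings once.
catalog = ["red running shoes", "espresso machine", "noise-cancelling headphones"]
item_vectors = embed(catalog)          # shape (num_items, dim), reused at query time

def search(query, k=2):
    """Online step: embed only the query, then rank the precomputed item vectors."""
    q = embed([query])[0]
    scores = item_vectors @ q          # cosine similarity (vectors are unit-norm)
    top = np.argsort(-scores)[:k]
    return [(catalog[i], float(scores[i])) for i in top]

print(search("coffee maker"))
```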
3️⃣ Freshness vs. Cost
You can’t keep recomputing embeddings forever — it’s expensive. So, you pick a refresh cadence:
- Hourly for fast-moving data.
- Daily/weekly for stable domains.
You control TTL (Time To Live) — how long cached data stays valid before expiring.
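A small illustrative sketch of refresh cadences expressed as per-domain TTLs; the domain names and intervals are assumptions for demonstration, not recommendations.

```python
from datetime import datetime, timedelta

# Example refresh cadences per data domain (illustrative values only).
TTL_POLICY = {
    "trending_feed": timedelta(hours=1),      # fast-moving data: hourly
    "product_embeddings": timedelta(days=1),  # stable catalog: daily
    "brand_profiles": timedelta(weeks=1),     # very stable domain: weekly
}

def is_stale(domain, last_refreshed_at, now=None):
    """Return True if the cached artifact for this domain has outlived its TTL."""
    now = now or datetime.utcnow()
    return now - last_refreshed_at > TTL_POLICY[domain]

# Trigger a recompute job only when the cached artifact has expired.
last_run = datetime.utcnow() - timedelta(hours=3)
if is_stale("trending_feed", last_run):
    print("refresh trending_feed embeddings")
```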
Why It Works This Way
Caching exploits temporal locality: the idea that recent requests are likely to repeat. Precomputation leverages amortization: expensive computations paid once, reused many times.
Together, they trade a bit of memory for a lot of speed. But they only work if your cached results are still relevant — once data drifts (e.g., user behavior or model weights change), stale caches can harm accuracy or trust.
How It Fits in ML Thinking
Caching sits between model serving and data engineering.
- From the serving side: it reduces inference latency and cost.
- From the data side: it balances freshness and consistency across evolving features.

It’s the invisible glue that lets massive ML systems — like search engines or personalized feeds — run in real time without overloading GPUs.
📐 Step 3: Mathematical Foundation
Cache Hit Ratio
$$ \text{Hit Ratio} = \frac{\text{cache hits}}{\text{cache hits} + \text{cache misses}} $$
- High hit ratio = effective caching.
- Low hit ratio = wasted cache space or too-volatile data.
You can tune cache size, TTL, and eviction policy (LRU, LFU, FIFO) to maximize hits while keeping data relevant.
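As an illustration, the snippet below replays a synthetic, skewed request stream through a small LRU cache and reports the hit ratio at different cache sizes; the traffic distribution and sizes are assumptions for demonstration, not benchmarks.

```python
import random
from collections import OrderedDict

def simulate_lru(requests, cache_size):
    """Replay a request stream through an LRU cache and return the hit ratio."""
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)           # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict the least recently used entry
    return hits / len(requests)

# Skewed traffic: a few "hot" users account for most requests (heavy-tail assumption).
random.seed(0)
requests = [f"user_{int(random.paretovariate(1.2))}" for _ in range(10_000)]
for size in (10, 100, 1000):
    print(f"cache size {size:>5}: hit ratio = {simulate_lru(requests, size):.2f}")
```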
Staleness vs. Refresh Cost Trade-off
Let $T_\text{refresh}$ = refresh interval, $C_\text{update}$ = cost to refresh, and $L_\text{stale}$ = loss from using stale data. The goal is to minimize total cost:
$$ \text{Total Cost} = C_\text{update} + L_\text{stale}(T_\text{refresh}) $$
- Shorter $T_\text{refresh}$ → lower staleness, higher compute cost.
- Longer $T_\text{refresh}$ → higher staleness, lower compute cost.
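A worked numerical sketch of this trade-off, assuming a fixed cost per refresh and a staleness loss that grows linearly with the age of the cached data; it amortizes the refresh cost over a 24-hour window, which is one possible reading of the formula above, and both cost models are illustrative assumptions.

```python
def total_cost(t_refresh_hours, c_update=10.0, stale_loss_per_hour=0.5):
    """Cost per day: refresh cost paid (24 / T) times, plus staleness loss that
    grows linearly with the average age (T / 2) of the cached data."""
    refresh_cost = (24 / t_refresh_hours) * c_update
    avg_staleness_loss = stale_loss_per_hour * (t_refresh_hours / 2) * 24
    return refresh_cost + avg_staleness_loss

# Sweep candidate refresh intervals and pick the cheapest one.
candidates = [1, 2, 4, 6, 12, 24]
for t in candidates:
    print(f"T_refresh = {t:>2} h -> total cost = {total_cost(t):6.1f}")
best = min(candidates, key=total_cost)
print(f"Best interval under these assumptions: {best} h")
```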
🧠 Step 4: Assumptions or Key Ideas
- User requests often repeat or are similar enough for caching to matter.
- Cached data must be refreshed periodically to stay relevant.
- Embedding drift happens when model weights or underlying data distribution changes.
- TTL and eviction policies are business-driven, not purely technical.
- Caches exist at multiple layers: edge (CDN), feature store, and inference response layer.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths
- Dramatically lowers inference latency and cost.
- Reduces pressure on compute infrastructure.
- Enables instant responses for recurring requests.
- Great for precomputable embeddings and static catalogs.

Limitations
- Stale predictions if underlying data or model changes.
- Cache invalidation is hard — deciding what to evict and when.
- Wastes memory if requests are highly unique.
- Must handle cache warm-up and misses gracefully.

Trade-offs
- Freshness vs. Cost: Frequent updates improve accuracy but increase compute cost.
- Size vs. Hit Ratio: Larger caches store more but consume more memory.
- TTL vs. Accuracy: Longer TTLs mean faster responses but risk drift.
🚧 Step 6: Common Misunderstandings
- “Caching is always safe.” → Wrong. Cached predictions can become wrong due to drift or retraining.
- “All caches should live forever.” → No — old embeddings degrade model trustworthiness.
- “Precomputation removes the need for inference.” → It only removes repetitive work; dynamic parts still need real-time computation.
🧩 Step 7: Mini Summary
🧠 What You Learned: Caching and precomputation store and reuse costly computations to speed up ML systems.
⚙️ How It Works: Use key-value stores for frequent requests, precompute embeddings where possible, and balance refresh frequency with cost.
🎯 Why It Matters: They make real-time AI feel instantaneous — but mismanaging staleness or drift can quietly break your system’s trust and performance.