7.2. Cost Optimization
🪄 Step 1: Intuition & Motivation
Core Idea: In ML infrastructure, performance isn’t free. Every training run, GPU allocation, or model endpoint burns compute hours and storage costs. Cost optimization isn’t about cutting corners — it’s about making your ML systems financially intelligent, just like they’re algorithmically intelligent.
Simple Analogy: Think of your ML system like a power plant:
- Running at full power 24/7 ensures zero outages (great latency) — but your bill skyrockets.
- Running only when needed saves money — but users might experience flickers (latency spikes). Smart engineers design systems that scale power on demand, maintaining balance between speed and spend.
🌱 Step 2: Core Concept
Cost optimization in ML infrastructure focuses on reducing compute, storage, and network costs — without hurting performance or accuracy.
Let’s explore the three big levers to pull: compute savings, data efficiency, and execution smartness.
1️⃣ Spot Instances and Serverless Compute — The Ephemeral Advantage
Spot instances (AWS EC2 Spot, GCP Spot/Preemptible VMs, Azure Spot VMs) are spare compute capacity offered at up to 90% lower cost — but they can be interrupted at any time.
Serverless compute (AWS Lambda, Google Cloud Run) executes code only when triggered, perfect for short-lived, event-driven ML tasks.
🧠 When to Use:
- Training non-critical models (e.g., experiments, retraining jobs).
- Batch inference or scheduled ETL jobs.
- Hyperparameter sweeps that can checkpoint progress.
💡 How to Mitigate Interruptions:
- Use checkpointing to save training progress.
- Employ orchestrators (Airflow/Kubeflow) that reschedule failed jobs.
- For inference, use a hybrid setup: baseline on-demand servers + burstable spot instances.
💡 Intuition: Spot instances are like cheap train tickets — great deal, but you might have to get off mid-way. Checkpointing is your “save game” button in case you need to restart the journey.
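To make the checkpointing idea above concrete, here is a minimal sketch in PyTorch. It assumes a user-supplied `train_one_epoch` function and a local checkpoint path (both hypothetical names); if the spot instance is reclaimed, the next run resumes from the last saved epoch instead of starting over.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path; in practice, write to durable storage (e.g., S3)

def save_checkpoint(model, optimizer, epoch):
    # Persist everything needed to resume: weights, optimizer state, and progress.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start from epoch 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

def train(model, optimizer, num_epochs, train_one_epoch):
    start_epoch = load_checkpoint(model, optimizer)
    for epoch in range(start_epoch, num_epochs):
        train_one_epoch(model, optimizer)        # assumed user-supplied training step
        save_checkpoint(model, optimizer, epoch)  # checkpoint every epoch, so an interruption loses at most one epoch
```

An orchestrator such as Airflow or Kubeflow can simply relaunch `train` after an interruption; the resume logic makes the retry cheap.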
2️⃣ Caching Embeddings and Intermediate Computations — The Memory Trick
ML systems often recompute the same features, embeddings, or inference outputs multiple times — wasting compute cycles.
Solution: Cache once, reuse many times.
🧩 Common Caching Opportunities:
- Embeddings: Save vectorized text/image representations for re-use across models.
- Intermediate Features: Store aggregated or preprocessed features used by multiple pipelines.
- Inference Results: Cache results for popular queries (e.g., “top recommended items”).
Implementation Approaches:
- Redis / Memcached for fast lookups.
- Feature Store (e.g., Feast) for shared access across models.
- Layered caches (L1 in memory, L2 in disk/object store).
Cost Reduction Formula: If computing an embedding $E$ costs $C_E$ and that embedding is reused $r$ times, caching saves:
$$ \text{Savings} = (r - 1) \times C_E $$
💡 Intuition: Caching is like keeping leftovers in the fridge — why cook the same meal again when it’s already good to go?
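A minimal sketch of an embedding cache, assuming a hypothetical `embed()` function and an in-process dict as the L1 cache; the same pattern maps onto Redis or Memcached by swapping the dict for a client.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Placeholder for an expensive embedding call (e.g., a transformer forward pass).
    return [float(len(text))]

_cache: dict[str, list[float]] = {}  # L1 in-memory cache; swap for Redis/Memcached in production

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()  # stable key derived from the input
    if key not in _cache:
        _cache[key] = embed(text)  # pay the cost C_E once
    return _cache[key]             # the other r - 1 uses are (almost) free
```

As the formula above shows, the benefit grows linearly with the reuse count, so the hottest embeddings are the ones most worth caching.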
3️⃣ Lazy Evaluation — Don’t Do Work Until You Must
Lazy evaluation means deferring computations until their results are actually needed.
In ML systems, this drastically reduces redundant or premature work.
🧠 Where It Helps:
- Data pipelines: Only transform data that’s actually consumed downstream.
- Feature computation: Generate features on-demand, not preemptively.
- ETL frameworks: Use tools like Spark, Dask, or Ray — they naturally delay execution until results are requested.
Example: Instead of joining all tables upfront, join only the subset needed for the current batch.
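A small sketch of this idea using plain Python generators, with hypothetical `load_rows` and `transform` helpers; rows are transformed only as the downstream consumer pulls them, which is the same deferred-execution behavior Spark and Dask give you on DataFrames.

```python
def load_rows():
    # Placeholder data source; real code would stream rows from a table or file.
    for i in range(1_000_000):
        yield {"id": i, "value": i * 2}

def transform(row):
    # Pretend this is an expensive per-row feature computation.
    return {**row, "feature": row["value"] ** 2}

# Lazy pipeline: nothing is computed until the consumer iterates.
lazy_features = (transform(row) for row in load_rows())

# Only the first batch is ever materialized; the remaining rows are never transformed.
first_batch = [next(lazy_features) for _ in range(100)]
```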
Mathematical Analogy: If $f(x)$ is expensive to compute, and only $p$ fraction of results are needed, then expected cost:
$$ E[C] = p \times C_{full} $$
where $p < 1$, so lazy evaluation saves the remaining $(1 - p) \times C_{full}$.
💡 Intuition: Lazy evaluation is like doing chores only when guests actually arrive — not every day just in case.
📐 Step 3: Mathematical Foundation
Let’s formalize the cost–latency trade-off for ML inference.
Balancing Latency vs. Cost
Let:
- $C$ = cost per unit time (e.g., dollars/hour)
- $L$ = latency per request (milliseconds)
- $x$ = system state: number of active servers
Two regimes:
Always-On (latency-optimized): $C_{on} = C \times x_{max}$, $L_{on} = L_{min}$
On-Demand (cost-optimized): $C_{off} = C \times x_{avg}$, $L_{off} = L_{min} + L_{startup}$
The optimization goal:
$$ \min_{x} \left( \alpha\, C(x) + \beta\, L(x) \right) $$
where $\alpha$ and $\beta$ are weights for cost and latency priorities. You tune them based on business needs — for instance, a healthcare app ($\beta$ high) may prioritize latency, while batch ETL ($\alpha$ high) prioritizes cost.
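A small illustrative sketch of this weighted objective, using made-up cost and latency numbers for the two regimes; the only point is how shifting $\alpha$ and $\beta$ flips the decision.

```python
def objective(cost_per_hour: float, latency_ms: float, alpha: float, beta: float) -> float:
    # Weighted combination of cost and latency; lower is better.
    return alpha * cost_per_hour + beta * latency_ms

# Assumed numbers, purely for illustration.
always_on = {"cost": 10.0, "latency": 50.0}          # C * x_max, L_min
on_demand = {"cost": 2.5, "latency": 50.0 + 400.0}   # C * x_avg, L_min + L_startup (cold start)

for alpha, beta, label in [(1.0, 0.01, "cost-sensitive (batch ETL)"),
                           (0.1, 1.0, "latency-sensitive (healthcare app)")]:
    scores = {
        "always-on": objective(always_on["cost"], always_on["latency"], alpha, beta),
        "on-demand": objective(on_demand["cost"], on_demand["latency"], alpha, beta),
    }
    best = min(scores, key=scores.get)
    print(f"{label}: pick {best} (scores={scores})")
```

With these numbers the cost-sensitive weighting picks on-demand, while the latency-sensitive weighting picks always-on, which is exactly the trade-off the objective is meant to encode.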
🧠 Step 4: Advanced Cost Control Techniques
Model Distillation: Use smaller, cheaper student models for inference. → Example: Deploy distilled BERT instead of full-size BERT.
Asynchronous Batching: Aggregate concurrent requests dynamically for GPU efficiency.
Dynamic Routing: Route low-priority or high-latency-tolerant queries to cheaper hardware.
Mixed Precision Inference: Use FP16 or BF16 for faster GPU execution with negligible accuracy loss.
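A minimal sketch of mixed-precision inference in PyTorch, assuming a CUDA GPU and an already-loaded model; `autocast` runs eligible ops (matmuls, convolutions) in FP16 while keeping numerically sensitive ops in FP32.

```python
import torch

def predict(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    model.eval()
    with torch.inference_mode():
        # FP16 execution cuts memory and latency on the GPU with usually negligible
        # accuracy impact; use torch.bfloat16 on hardware that supports it.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            return model(batch.to("cuda"))
```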
Cloud Cost Monitoring:
- AWS: Cost Explorer, CloudWatch
- GCP: Cloud Billing reports, BigQuery billing export
- Azure: Cost Analysis
💡 Intuition: Cost optimization isn’t about being cheap — it’s about being cleverly efficient. Like turning off lights when no one’s in the room, not dimming them while you’re still reading.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Dramatically reduces operational costs without hurting quality.
- Enables scalable ML infrastructure for startups and enterprises alike.
- Makes retraining and serving pipelines sustainable long-term.
Limitations:
- May add complexity (checkpointing, caching logic).
- Serverless and spot instances may introduce cold-start delays.
- Requires careful monitoring to avoid under-provisioning.
Latency vs. Cost Trade-off:
Always-On GPUs:
- ✅ Near-zero latency
- ⚠️ High idle cost
On-Demand / Serverless GPUs:
- ✅ Pay only when used
- ⚠️ Cold start delays
The best systems mix both: Keep one instance warm (for latency), others cold (for cost savings).
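One way to encode that hybrid policy is a simple replica-count rule, sketched below with made-up capacity numbers: always keep one warm replica for latency, and add on-demand replicas only when traffic exceeds what the warm one can absorb.

```python
import math

def desired_replicas(requests_per_sec: float,
                     capacity_per_replica: float = 50.0,  # assumed throughput per replica
                     min_warm: int = 1) -> int:
    # Never scale below the warm floor (keeps latency low for the first request),
    # and add only enough extra replicas to cover current load (keeps idle cost low).
    needed = math.ceil(requests_per_sec / capacity_per_replica)
    return max(min_warm, needed)

print(desired_replicas(5))    # quiet traffic -> 1 (just the warm replica)
print(desired_replicas(220))  # burst -> 5 (scale out temporarily)
```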
🚧 Step 6: Common Misunderstandings
“Cost optimization means downgrading hardware.” Not true — it’s about smarter utilization, not weaker machines.
“Caching is optional.” In large-scale ML, it’s essential — recomputation costs can dwarf training costs.
“Serverless = instant.” No — serverless has cold starts; it’s great for bursty, not continuous workloads.
🧩 Step 7: Mini Summary
🧠 What You Learned: ML cost optimization means balancing performance and expense through spot compute, caching, and lazy execution.
⚙️ How It Works: Use spot or serverless resources for non-critical jobs, cache reusable outputs, and execute computations only when needed.
🎯 Why It Matters: Efficient ML systems scale sustainably — enabling innovation without runaway cloud bills.