5.1. Cost Optimization
🪄 Step 1: Intuition & Motivation
Core Idea: In machine learning systems, accuracy isn’t the only metric that matters — cost is the silent KPI. You can have the smartest model in the world, but if it’s draining money faster than it delivers value, it won’t survive. Cost optimization ensures your model is not just intelligent, but efficient, by balancing performance, resource use, and business ROI.
Simple Analogy: Think of your ML system like a power-hungry robot. Training feeds it knowledge, inference makes it act, and storage keeps its memories. Cost optimization is the art of teaching the robot to use its power wisely — turning off lights when it’s idle, running faster only when needed, and recycling old memories efficiently.
🌱 Step 2: Core Concept
ML cost optimization isn’t just about slashing expenses — it’s about engineering smarter trade-offs. Let’s look at where the money actually goes and how to control it.
What’s Happening Under the Hood?
1️⃣ Compute Costs — Brains are Expensive
This covers both training and inference.
Training cost drivers:
- Model size and architecture complexity (FLOPs).
- Training duration and checkpoint frequency.
- Hardware choice (CPU vs. GPU vs. TPU).
- Distributed setup overhead and data parallelism inefficiency.
Inference cost drivers:
- Request rate (QPS).
- Batch size and latency SLA.
- Model quantization and caching efficiency.
- GPU utilization and instance type.
✅ Optimization levers: Profile GPU/CPU usage, use mixed precision, adaptive batching, and on-demand scaling to match traffic patterns.
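To make the adaptive-batching lever concrete, here is a minimal Python sketch of a request micro-batcher. The `model_predict` stub, the `(payload, callback)` request shape, and the size/latency thresholds are illustrative assumptions, not any particular serving framework's API: requests are collected until either the batch fills up or a small latency budget expires, so the GPU runs fewer, fuller batches.

```python
import queue
import time

def model_predict(payloads):
    """Stand-in for a batched forward pass on the accelerator."""
    return [f"prediction for {p}" for p in payloads]

def micro_batch_loop(request_queue: queue.Queue, max_batch_size: int = 32, max_wait_ms: float = 10.0):
    """Collect requests until the batch is full or the latency budget expires,
    then run one batched call instead of many single-item calls."""
    while True:
        batch = []
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if not batch:
            continue
        payloads = [payload for payload, _ in batch]
        results = model_predict(payloads)              # one call amortized across the batch
        for (_, callback), result in zip(batch, results):
            callback(result)                           # hand each result back to its caller
```

Production serving stacks typically ship a tuned version of this idea as built-in "dynamic batching," but the cost logic is the same: amortize each accelerator call over as many requests as the latency SLA allows.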
2️⃣ Storage Costs — Memories Add Up
Everything from feature logs to intermediate datasets costs storage.
- Feature store snapshots and historical training data grow fast.
- Redundant feature tables and unused embeddings accumulate.
✅ Optimization levers: Compress old data (Parquet, Zstd), archive to cold storage (S3 Glacier), and version features incrementally instead of full dumps.
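As a sketch of the compression lever, the snippet below rewrites an aging CSV feature log as Zstandard-compressed Parquet using pandas (with pyarrow assumed to be installed); the file paths are placeholders.

```python
import pandas as pd

# Placeholder paths for an aging feature log.
raw_log = "feature_logs/2023-01.csv"
archived = "feature_logs/2023-01.parquet"

df = pd.read_csv(raw_log)

# Columnar layout plus Zstandard compression typically shrinks tabular logs
# several-fold versus raw CSV, which directly lowers the monthly storage bill.
df.to_parquet(archived, compression="zstd", index=False)
```

A storage lifecycle rule can then move the compressed files to cold storage (e.g., S3 Glacier) once they fall out of the active retraining window.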
3️⃣ Egress Costs — The Hidden Villain
Data transfer across regions, clouds, or APIs costs money.
- Model serving endpoints that pull large embeddings or call external APIs pay for every byte transferred.
- Even internal systems pay “egress” when data crosses VPC boundaries.
✅ Optimization levers: Co-locate compute and data, cache features locally, and batch I/O.
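A minimal illustration of the "cache features locally" lever, assuming a hypothetical remote `fetch_embedding_remote` call: an in-process LRU cache means repeated lookups for hot keys never leave the box, so they incur no egress.

```python
import hashlib
from functools import lru_cache

def fetch_embedding_remote(entity_id: str) -> list[float]:
    """Stand-in for a cross-region feature-store call; each real call pays egress per byte."""
    # Deterministic fake vector so the sketch runs without any remote service.
    digest = hashlib.sha256(entity_id.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

@lru_cache(maxsize=100_000)
def fetch_embedding(entity_id: str) -> tuple[float, ...]:
    # Hot keys are served from process memory: no bytes cross the VPC boundary.
    return tuple(fetch_embedding_remote(entity_id))
```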
4️⃣ Frequency & Batch Trade-offs
Model retraining and inference frequency affect costs non-linearly:
- More frequent retraining: better freshness, higher compute cost.
- Larger batch inference: higher throughput and lower per-request cost, but higher latency.
The sweet spot depends on business need and SLA tolerance.
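To see the non-linearity, here is a back-of-the-envelope sketch; the hourly GPU price and the per-batch latencies are invented numbers purely for illustration.

```python
# Illustrative numbers only: one GPU at $2.50/hr, with hypothetical per-batch
# latencies that grow sub-linearly as the batch size increases.
gpu_cost_per_hour = 2.50
batch_latency_s = {1: 0.010, 8: 0.025, 32: 0.060, 128: 0.180}

for batch_size, latency in batch_latency_s.items():
    requests_per_hour = batch_size / latency * 3600           # throughput at full utilization
    cost_per_1k_requests = gpu_cost_per_hour / requests_per_hour * 1000
    print(f"batch={batch_size:>3}  latency={latency * 1000:>4.0f} ms  "
          f"cost per 1k requests=${cost_per_1k_requests:.4f}")
```

Going from a batch of 1 to a batch of 128 cuts the cost per thousand requests by roughly 7x in this toy setup, but every request now waits for a 180 ms batch instead of a 10 ms one — exactly the SLA trade-off described above.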
Why It Works This Way
In ML pipelines, cost grows with data scale × model complexity × serving intensity. Most waste happens when:
- Models idle on GPUs waiting for requests.
- Systems over-provision capacity for peak loads that rarely occur.
- Logs and features pile up without retention policies.
Optimizing cost isn’t about cutting corners — it’s about engineering balance: right resources, right time, right data.
How It Fits in ML Thinking
Every decision — from feature refresh cadence to model quantization — affects the total cost of ownership (TCO). Cost optimization sits at the intersection of:
- Model design: lightweight architectures, distillation.
- Infrastructure: autoscaling, caching, hardware profiling.
- Operations: retention, retraining policy, and utilization monitoring.
Senior ML engineers are evaluated not just by performance metrics, but by cost-per-prediction and throughput efficiency.
📐 Step 3: Mathematical Foundation
Cost Decomposition Formula
The total operational cost per month ($C_\text{total}$) can be approximated as:
$$ C_\text{total} = C_\text{train} + C_\text{infer} + C_\text{store} + C_\text{egress} $$
Where:
$C_\text{train} = n_\text{epochs} \times h_\text{gpu} \times c_\text{gpu/hr}$
- $n_\text{epochs}$: number of training epochs
- $h_\text{gpu}$: GPU-hours consumed per epoch
- $c_\text{gpu/hr}$: hourly GPU price
$C_\text{infer} = \frac{R}{B} \times t_\text{latency} \times c_\text{gpu/hr}$
- $R$: requests per second
- $B$: batch size
- $t_\text{latency}$: average inference time per batch
- $\frac{R}{B} \times t_\text{latency}$ is the fraction of a GPU kept busy; multiply the whole term by the serving hours in the month to get a monthly figure.
$C_\text{store}$: data volume × storage rate
$C_\text{egress}$: data transferred × egress cost
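The decomposition maps directly onto a small estimator. The function below is a sketch using the same symbols as above; every rate and volume passed in is a placeholder, and the inference term is multiplied by serving hours so the result is a monthly figure, as noted above.

```python
def monthly_cost(
    # --- training ---
    n_epochs: float, gpu_hours_per_epoch: float, gpu_price_per_hour: float,
    # --- inference ---
    requests_per_s: float, batch_size: float, latency_s: float, serving_hours: float,
    # --- storage & egress ---
    storage_gb: float, storage_price_per_gb: float,
    egress_gb: float, egress_price_per_gb: float,
) -> dict:
    c_train = n_epochs * gpu_hours_per_epoch * gpu_price_per_hour
    # (R / B) * t_latency is the fraction of a GPU kept busy; scale by serving hours.
    gpu_busy_fraction = requests_per_s / batch_size * latency_s
    c_infer = gpu_busy_fraction * gpu_price_per_hour * serving_hours
    c_store = storage_gb * storage_price_per_gb
    c_egress = egress_gb * egress_price_per_gb
    return {
        "train": c_train, "infer": c_infer, "store": c_store,
        "egress": c_egress, "total": c_train + c_infer + c_store + c_egress,
    }

# Placeholder numbers purely for illustration.
print(monthly_cost(
    n_epochs=10, gpu_hours_per_epoch=4, gpu_price_per_hour=2.50,
    requests_per_s=50, batch_size=16, latency_s=0.05, serving_hours=730,
    storage_gb=2_000, storage_price_per_gb=0.023,
    egress_gb=500, egress_price_per_gb=0.09,
))
```

Even a rough estimator like this makes the dominant term obvious at a glance, which is usually the first step in deciding where to spend optimization effort.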
GPU Utilization Efficiency
GPU efficiency $\eta$ is defined as:
$$ \eta = \frac{\text{Active Compute Time}}{\text{Total Allocated Time}} $$
Low $\eta$ means you’re paying for idle GPUs. Improve it via autoscaling, vectorized inference, and cold-start avoidance.
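A toy sketch of estimating $\eta$ from your own serving logs; the per-request durations and the one-hour allocation window are invented, and overlapping requests are ignored for simplicity.

```python
# Hypothetical per-request GPU processing durations (seconds) logged during
# a one-hour GPU allocation.
request_durations_s = [0.4, 0.4, 0.5, 0.5, 0.6]
allocated_seconds = 3600.0

active_seconds = sum(request_durations_s)
eta = active_seconds / allocated_seconds    # assumes requests do not overlap
print(f"GPU utilization eta = {eta:.2%}")   # ~0.07%: paying almost entirely for idle time
```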
🧠 Step 4: Assumptions or Key Ideas
- Compute cost dominates for large models; storage/egress dominate for large-scale logging or distributed pipelines.
- GPU utilization rarely exceeds 60% in unoptimized systems — profiling fixes this.
- Lazy loading: load models into memory only when requests arrive (a minimal sketch follows this list).
- On-demand scaling: auto-provision GPU pods during peaks, deallocate when idle.
- Quantization, pruning, and caching reduce both compute and memory footprint.
- Use cost dashboards (e.g., AWS Cost Explorer, Kubecost) for visibility.
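As referenced in the lazy-loading bullet above, here is a minimal sketch; `load_model` and the model path are placeholders standing in for your real deserialization step (e.g., loading a serialized model file). The model is pulled into memory on the first request rather than at process start, so idle replicas hold no model in memory.

```python
import threading

_model = None
_model_lock = threading.Lock()

def load_model(path: str):
    """Stand-in for the real deserialization step."""
    print(f"loading model from {path} ...")
    return object()  # placeholder for the loaded model

def get_model(path: str = "models/example-model.bin"):
    """Load the model on first use only; later calls reuse the cached instance."""
    global _model
    if _model is None:
        with _model_lock:            # avoid double-loading under concurrent requests
            if _model is None:
                _model = load_model(path)
    return _model

def handle_request(payload):
    model = get_model()              # first request pays the load cost, the rest do not
    return {"model_id": id(model), "payload": payload}
```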
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Makes ML systems economically sustainable.
- Improves GPU utilization and throughput.
- Enables transparent trade-offs between accuracy and efficiency.
- Aligns technical performance with business goals.
Limitations:
- Over-optimization may hurt model freshness or accuracy.
- Dynamic scaling introduces cold-start latency.
- Monitoring and profiling add operational overhead.
Trade-offs:
- Speed vs. Cost: Smaller batches are fast but expensive; large batches save money but add latency.
- Freshness vs. Cost: Frequent retraining ensures relevance but increases compute bills.
- Compute vs. Storage: Retaining too much data boosts storage costs, but deleting too early risks losing reproducibility.
🚧 Step 6: Common Misunderstandings
- “We just need cheaper GPUs.” → The real issue is usually low utilization, not hardware price.
- “Batching always saves cost.” → Not if latency SLAs are strict; batching adds delay.
- “Deleting data = saving money.” → Without governance, you risk losing retraining reproducibility.
🧩 Step 7: Mini Summary
🧠 What You Learned: Cost optimization is about understanding and balancing the economics of ML systems — compute, storage, and data movement.
⚙️ How It Works: Profile GPU usage, tune batching and retraining frequency, apply lazy loading and autoscaling, and manage storage smartly.
🎯 Why It Matters: It transforms models from academic prototypes into scalable, financially responsible products that can run sustainably at scale.