ML System Design Patterns - Roadmap
⚙️ 1. Core System Design Trade-Offs
Note
The Top Tech Interview Angle: These trade-offs are the foundation of every ML system design discussion. Interviewers test your ability to reason under constraints — balancing latency, cost, and model accuracy. Success here signals that you can translate abstract ML theory into robust production architecture decisions.
1.1: Batch vs. Real-Time Processing
- Understand batch pipelines (ETL, feature stores, offline training) versus streaming pipelines (Kafka, Flink, Spark Structured Streaming).
- Learn when each is appropriate — e.g., batch for retraining models nightly; streaming for fraud detection.
- Implement a small example of both using Python (`pandas` for batch, Kafka + `FastAPI` for stream scoring); a minimal sketch follows this subsection.
Deeper Insight: Probing Question: “If your fraud detection model runs on a 30-minute delay, what’s the real business impact?” Discuss data freshness, throughput, and serving-cost trade-offs — and how you’d mitigate lag via micro-batching or feature snapshotting.
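As a starting point for the exercise above, here is a minimal sketch of the two modes, assuming a toy scikit-learn classifier and hypothetical `amount`/`velocity` features: the batch path scores an offline Parquet table with `pandas`, while the real-time path wraps the same model in a `FastAPI` endpoint (a Kafka consumer would call this endpoint or embed the model directly; that wiring is omitted).

```python
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.linear_model import LogisticRegression

# Toy classifier standing in for a trained fraud model (hypothetical features).
model = LogisticRegression().fit([[0.0, 1.0], [1.0, 0.0]], [0, 1])

# --- Batch path: score an entire offline table in one pass -------------------
def score_batch(parquet_path: str) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)  # e.g. last night's ETL output
    df["fraud_score"] = model.predict_proba(df[["amount", "velocity"]].values)[:, 1]
    return df

# --- Real-time path: score one event per request -----------------------------
app = FastAPI()

class Txn(BaseModel):
    amount: float
    velocity: float

@app.post("/score")
def score_event(txn: Txn) -> dict:
    proba = model.predict_proba([[txn.amount, txn.velocity]])[0, 1]
    return {"fraud_score": float(proba)}
```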
1.2: Latency vs. Throughput
- Learn system-level metrics: P99 latency, QPS, and throughput per node.
- Study caching layers (Redis, Faiss) and how model size or quantization affects inference latency.
- Measure and visualize these trade-offs experimentally by varying batch sizes during inference.
Deeper Insight: Probing Question: “Your model meets accuracy targets but adds 200ms latency — what would you do?” Explore model compression, batching trade-offs, and hardware-aware deployment (e.g., GPUs vs. CPUs vs. TPUs).
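One way to run the batch-size experiment from the bullets above, using a stand-in matrix multiply in place of a real model; the callable, feature count, and batch sizes are placeholders:

```python
import time
import numpy as np

def latency_throughput_sweep(predict, n_features=32, batch_sizes=(1, 8, 64, 256), trials=200):
    """Measure P99 latency and throughput for a predict() callable at several batch sizes."""
    results = []
    for bs in batch_sizes:
        latencies = []
        for _ in range(trials):
            x = np.random.rand(bs, n_features).astype(np.float32)
            start = time.perf_counter()
            predict(x)
            latencies.append(time.perf_counter() - start)
        results.append({
            "batch_size": bs,
            "p99_ms": float(np.percentile(latencies, 99)) * 1e3,   # tail latency per request batch
            "throughput_per_s": bs / float(np.mean(latencies)),    # samples served per second
        })
    return results

# Stand-in "model": a matrix multiply with roughly model-shaped cost.
weights = np.random.rand(32, 128).astype(np.float32)
for row in latency_throughput_sweep(lambda x: x @ weights):
    print(row)
```

Larger batches typically raise throughput but also raise tail latency, which is exactly the trade-off the probing question targets.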
1.3: Shadow vs. A/B Testing
- Learn how shadow deployment safely validates a model by mirroring production traffic without affecting users.
- Contrast with A/B testing, which splits real traffic to measure impact on live metrics.
- Study how to log predictions, compare metrics offline, and roll out with canary releases.
Deeper Insight: Probing Question: “How do you detect if shadow model predictions diverge from production in dangerous ways?” Discuss statistical significance, data drift detection, and guardrails for rollbacks.
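A sketch of the two patterns under simple assumptions: shadow scoring mirrors each request to the candidate model and only logs the result, while A/B routing splits users deterministically. `prod_model` and `shadow_model` are stand-ins for real model calls.

```python
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def prod_model(features: dict) -> float:    # stand-in for the live model
    return 0.42

def shadow_model(features: dict) -> float:  # stand-in for the candidate model
    return 0.40

def handle_request(features: dict) -> float:
    """Serve the production prediction; mirror traffic to the shadow model for offline comparison."""
    prod_pred = prod_model(features)
    try:
        shadow_pred = shadow_model(features)  # never influences the user-facing response
        log.info(json.dumps({"features": features, "prod": prod_pred, "shadow": shadow_pred}))
    except Exception:                         # shadow failures must stay invisible to users
        log.exception("shadow scoring failed")
    return prod_pred

def ab_bucket(user_id: int, treatment_share: float = 0.10) -> str:
    """A/B split: a stable hash keeps each user in the same bucket across sessions and restarts."""
    bucket = int(hashlib.md5(f"user:{user_id}".encode()).hexdigest(), 16) % 1000
    return "treatment" if bucket < treatment_share * 1000 else "control"

print(handle_request({"amount": 12.0}), ab_bucket(user_id=7))
```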
🧩 2. Data Flow & Architecture Patterns
Note
Why This Matters: Data flow design is where candidates often falter. Strong answers here prove you understand how to move, transform, and validate data efficiently while preserving reproducibility and versioning.
2.1: Feature Store Design
- Understand offline–online consistency, feature versioning, and time-travel queries.
- Implement a minimal feature store using Feast or a custom SQL + Parquet-based approach.
- Study caching, serving, and materialization intervals.
Probing Question: “What happens if your training and serving features get out of sync?” Discuss training-serving skew, schema drift, and mitigation strategies using feature registries and timestamp joins.
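For the custom SQL + Parquet approach suggested above, the core mechanism is a point-in-time (as-of) join; a minimal sketch with an illustrative `user_id`/`avg_spend` schema:

```python
import pandas as pd

# Hypothetical feature log: one row per (entity, event_timestamp) feature snapshot.
features = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "event_ts":  pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "avg_spend": [10.0, 12.5, 7.0],
})

# Training labels with their own timestamps; each label must only see features
# that were already available at label time (point-in-time correctness).
labels = pd.DataFrame({
    "user_id":  [1, 2],
    "label_ts": pd.to_datetime(["2024-01-03", "2024-01-10"]),
    "label":    [0, 1],
})

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """As-of join: for each label, take the latest feature row at or before label_ts."""
    return pd.merge_asof(
        labels.sort_values("label_ts"),
        features.sort_values("event_ts"),
        left_on="label_ts",
        right_on="event_ts",
        by="user_id",
        direction="backward",
    )

print(point_in_time_join(labels, features))
```

The backward as-of join is what prevents label-time leakage, one of the main sources of training-serving skew the probing question is after.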
2.2: Model Registry & Versioning
- Study how MLflow or the Vertex AI Model Registry stores models, metadata, and lineage.
- Learn tagging strategies for experiment tracking (e.g., `model:v3.2-prod`) and version rollback mechanisms.
- Build a lightweight registry using S3 + a JSON manifest to simulate this behavior; see the sketch after this subsection.
Deeper Insight: Be prepared to reason about reproducibility guarantees — why simply “saving model.pkl” is not enough, and how environment pinning (Docker + Conda YAMLs) ensures repeatability.
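A sketch of the lightweight-registry exercise, using a local directory as a stand-in for an S3 prefix (swapping in `boto3` uploads is straightforward); the manifest fields shown are one reasonable choice, not a standard:

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

REGISTRY_ROOT = Path("registry")  # local stand-in for an s3://models/ prefix

def register_model(model_path: str, name: str, version: str, metrics: dict) -> dict:
    """Copy the artifact into the registry and record lineage in a JSON manifest."""
    artifact = Path(model_path)
    dest_dir = REGISTRY_ROOT / name / version
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(artifact, dest_dir / artifact.name)
    manifest = {
        "name": name,
        "version": version,                                           # e.g. "v3.2-prod"
        "artifact": str(dest_dir / artifact.name),
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),  # integrity / lineage check
        "metrics": metrics,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    (dest_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

def rollback_target(name: str, current_version: str) -> Optional[str]:
    """Pick the latest registered version other than the one currently serving."""
    versions = sorted(p.name for p in (REGISTRY_ROOT / name).iterdir() if p.is_dir())
    candidates = [v for v in versions if v != current_version]
    return candidates[-1] if candidates else None
```

In line with the reproducibility point above, a real manifest would also pin the training environment (Docker image digest, Conda YAML hash) and the data snapshot, not just the serialized model.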
2.3: Online Inference Architecture
- Compare synchronous vs. asynchronous serving patterns.
- Study multi-model serving (one endpoint hosting multiple models) vs. multi-tenant inference (shared hardware).
- Design load balancers and autoscaling rules (Kubernetes HPA).
Probing Question: “How would you design an inference API that scales to 100K QPS?” Talk about gRPC, vectorized inference, autoscaling triggers, and cold start mitigation via warm containers.
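An illustrative asynchronous serving pattern: requests are queued and a background micro-batcher turns many concurrent requests into one vectorized model call. `predict_batch` is a stand-in for the real forward pass, and the batch size and wait budget are arbitrary choices:

```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

class Request(BaseModel):
    features: list[float]

def predict_batch(batch: list[list[float]]) -> list[float]:
    return [sum(x) for x in batch]  # stand-in for one batched forward pass

async def batcher(max_batch: int = 32, max_wait_s: float = 0.01) -> None:
    """Collect requests for up to max_wait_s, then run a single vectorized call."""
    while True:
        batch = [await queue.get()]                      # (features, future) pairs
        deadline = asyncio.get_running_loop().time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        preds = predict_batch([features for features, _ in batch])
        for (_, fut), pred in zip(batch, preds):
            fut.set_result(pred)

@app.on_event("startup")
async def start_batcher() -> None:
    asyncio.create_task(batcher())

@app.post("/predict")
async def predict(req: Request) -> dict:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((req.features, fut))
    return {"prediction": await fut}
```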
🧮 3. Scalability and Efficiency Patterns
Note
Why It’s Tested: These patterns separate junior from senior engineers. They show your understanding of hardware efficiency, cost trade-offs, and how to design for scalability under real-world constraints.
3.1: Model Sharding & Distributed Inference
- Study tensor parallelism, pipeline parallelism, and model partitioning strategies.
- Learn how systems like vLLM, DeepSpeed, and Ray Serve distribute large model weights across nodes.
Probing Question: “Your 40B parameter model doesn’t fit on a single GPU — what are your deployment options?” Discuss ZeRO partitioning, quantization, and offloading to CPU/SSD trade-offs.
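A toy pipeline-parallel sketch in PyTorch to make the partitioning idea concrete: two stages of a model live on two devices, and activations hop between them. It falls back to CPU when two GPUs aren't available; real systems such as vLLM, DeepSpeed, and Ray Serve add scheduling, communication overlap, and tensor parallelism on top of this idea.

```python
import torch
import torch.nn as nn

# Place each stage on its own device when two GPUs are available; fall back to CPU otherwise.
dev0 = torch.device("cuda:0") if torch.cuda.device_count() >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() >= 2 else torch.device("cpu")

class TwoStageModel(nn.Module):
    """Toy pipeline-parallel layout: half the weights per device."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to(dev0)  # first shard of the weights
        self.stage2 = nn.Linear(4096, 10).to(dev1)    # second shard of the weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(x.to(dev1))                # cross-device activation transfer

model = TwoStageModel().eval()
with torch.no_grad():
    print(model(torch.randn(8, 1024)).shape)          # torch.Size([8, 10])
```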
3.2: Caching and Precomputation
- Implement prediction caching: store frequent inference results in Redis.
- Learn to precompute embeddings for recommendation or search.
- Evaluate cost vs. freshness when caching: how long before embeddings drift?
Deeper Insight: Probing Question: “When does caching become dangerous?” Discuss stale predictions, data drift, and cache invalidation strategies (TTL, LRU eviction).
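A minimal Redis prediction cache, assuming a local Redis instance and using a TTL as the freshness guardrail; `predict` is a stand-in for the real model call:

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumes a local Redis instance
CACHE_TTL_S = 3600                                   # freshness budget for cached scores

def cache_key(features: dict) -> str:
    canonical = json.dumps(features, sort_keys=True)  # stable key for identical inputs
    return "pred:" + hashlib.sha256(canonical.encode()).hexdigest()

def predict(features: dict) -> float:
    return float(sum(features.values()))              # stand-in for the real model call

def cached_predict(features: dict) -> float:
    key = cache_key(features)
    hit = r.get(key)
    if hit is not None:
        return float(hit)                             # cache hit: skip model inference
    pred = predict(features)
    r.setex(key, CACHE_TTL_S, pred)                   # TTL bounds staleness
    return pred
```

The TTL is the simplest answer to "when does caching become dangerous": it caps how stale a prediction can get, at the cost of more recomputation.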
3.3: Model Compression & Distillation
- Master quantization (int8, fp16), pruning, and knowledge distillation.
- Quantify accuracy vs. latency/cost improvements using benchmarks.
Deeper Insight: Probing Question: “Your quantized model loses 4% accuracy — what do you do?” Discuss calibration data, mixed-precision, and post-training quantization improvements.
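A small post-training dynamic quantization benchmark with PyTorch; the model is a toy MLP, so treat the numbers as illustrative rather than representative:

```python
import time
import torch
import torch.nn as nn

# Small fully connected model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# Post-training dynamic quantization: Linear weights stored as int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def bench(m: nn.Module, runs: int = 200) -> float:
    x = torch.randn(64, 512)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs

print(f"fp32: {bench(model)*1e3:.2f} ms/batch, int8: {bench(quantized)*1e3:.2f} ms/batch")
```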
🧰 4. Reliability & Monitoring Patterns
Note
Why It’s Key: Production ML systems fail silently. Interviewers expect you to design robust observability, alerting, and recovery mechanisms.
4.1: Drift Detection
- Understand data drift, concept drift, and how to measure divergence (KL divergence, PSI).
- Build a drift detection service comparing real-time inputs to training distribution.
Probing Question: “If your model starts degrading silently, how will you detect it?” Explain performance monitoring loops and automated retraining triggers.
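A self-contained PSI calculation comparing a training sample to live traffic; the 0.2 threshold in the comment is a common rule of thumb, not a universal constant:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training sample and live traffic."""
    # Bin edges come from the training (expected) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)
    # Small floor avoids log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

train = np.random.normal(0.0, 1.0, 10_000)   # training-time feature distribution
live = np.random.normal(0.3, 1.0, 10_000)    # live traffic with a shifted mean
print(f"PSI = {psi(train, live):.3f}")       # > 0.2 is a common "investigate" threshold
```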
4.2: Model Monitoring and Alerting
- Learn metrics beyond accuracy — e.g., feature distributions, prediction confidence, and fairness metrics.
- Set up alerting thresholds and dashboards (Prometheus, Grafana).
Deeper Insight: Probing Question: “What would you log to debug a model drift incident?” Discuss input features, model version, latency, confidence, and output distributions.
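A sketch of exporting model metrics with `prometheus_client`; the metric names and the fixed confidence value are placeholders, and in production the feature gauge would track a rolling aggregate rather than the last value:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metrics beyond accuracy: volume per model version, latency, confidence, feature stats.
PREDICTIONS = Counter("predictions_total", "Total predictions served", ["model_version"])
LATENCY = Histogram("inference_latency_seconds", "Per-request inference latency")
CONFIDENCE = Histogram("prediction_confidence", "Model confidence per prediction")
FEATURE_MEAN = Gauge("feature_amount_mean", "Rolling mean of the 'amount' feature")

@LATENCY.time()
def serve(features: dict, model_version: str = "v3.2-prod") -> float:
    confidence = 0.9                               # stand-in for the model's score
    PREDICTIONS.labels(model_version=model_version).inc()
    CONFIDENCE.observe(confidence)
    FEATURE_MEAN.set(features.get("amount", 0.0))  # in practice, a rolling aggregate
    return confidence

if __name__ == "__main__":
    start_http_server(9100)                        # Prometheus scrapes /metrics on this port
    serve({"amount": 12.0})
```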
4.3: Safe Rollbacks & Canary Deployments
- Implement progressive rollouts — start with 1% traffic, observe metrics, and gradually increase.
- Design rollback plans with pre-validated baseline checkpoints.
Deeper Insight: Probing Question: “Your new model passes offline tests but crashes in production. What’s your rollback process?” Talk about blue-green deployments, statistical rollback triggers, and checkpoint pinning.
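A hedged sketch of the routing and rollback logic behind a progressive rollout; the stage percentages and the 10% relative-regression threshold are illustrative choices:

```python
import hashlib

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # share of traffic on the canary model

def routes_to_canary(request_key: str, canary_share: float) -> bool:
    """Stable hash-based split so the same key always hits the same model."""
    bucket = int(hashlib.md5(request_key.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_share * 10_000

def should_rollback(baseline_error_rate: float, canary_error_rate: float,
                    max_relative_regression: float = 0.10) -> bool:
    """Roll back if the canary's error rate regresses more than 10% relative to baseline."""
    return canary_error_rate > baseline_error_rate * (1 + max_relative_regression)

# Example: at the 5% stage, check live metrics before promoting to the next stage.
share = ROLLOUT_STAGES[1]
print(routes_to_canary("req-123", share), should_rollback(0.020, 0.026))
```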
🧭 5. Cost, Governance & Evolution Patterns
Note
Why It’s Crucial: Top companies test not just system correctness but engineering maturity — can you scale models while keeping them auditable, compliant, and cost-effective?
5.1: Cost Optimization
- Learn cost breakdown across compute (training/inference), storage (feature logs), and egress (data movement).
- Practice estimating cost impact of model refresh frequency and inference batch size.
Probing Question: “Your model’s inference cost tripled last quarter — where do you start investigating?” Talk about profiling GPU utilization, lazy loading, and on-demand scaling.
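A deliberately crude back-of-envelope model for the cost-estimation exercise above: it treats each GPU as serving one request at a time and uses Little's law to get concurrency. All numbers are hypothetical.

```python
def monthly_inference_cost(qps: float, latency_s: float, gpu_hourly_usd: float,
                           gpu_utilization: float = 0.5) -> float:
    """Rough GPU cost: concurrent requests -> GPUs needed -> GPU-hours per month."""
    concurrency = qps * latency_s                    # Little's law: requests in flight
    gpus_needed = max(1.0, concurrency / gpu_utilization)
    return gpus_needed * gpu_hourly_usd * 24 * 30

# Hypothetical numbers: 500 QPS at 80 ms on $2.50/hr GPUs, 50% useful utilization.
print(f"${monthly_inference_cost(qps=500, latency_s=0.08, gpu_hourly_usd=2.50):,.0f}/month")
```

Even a model this crude makes the levers obvious: cut latency (compression, batching), raise utilization (better packing, autoscaling), or cheapen the hardware.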
5.2: Governance & Explainability
- Study model lineage tracking, bias detection, and explainability tooling (SHAP, LIME).
- Understand audit trails for regulatory compliance (GDPR, AI Act).
Deeper Insight: Probing Question: “How do you balance explainability with model performance?” Discuss surrogate models, feature attribution caching, and trade-offs between transparency and model complexity.
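A small SHAP example on a toy random forest; persisting per-prediction attributions like these next to the prediction log is one way to build an audit trail. The data and feature names are synthetic:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 200 rows, 4 hypothetical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer gives per-feature attributions for each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # shape: (5 rows, 4 features)

for row in shap_values:
    print({f"feature_{i}": round(float(v), 3) for i, v in enumerate(row)})
```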
5.3: Continuous Learning & Feedback Loops
- Learn online learning pipelines: streaming feedback → retraining → deployment.
- Understand guardrails for preventing model collapse due to biased feedback loops.
Deeper Insight: Probing Question: “What could go wrong with continuous learning?” Discuss feedback loops, concept drift, and human-in-the-loop retraining strategies.
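A sketch of one incremental-update guardrail: each feedback batch is applied to a copy of the model and only promoted if accuracy on a frozen hold-out set stays above a threshold. The data, threshold, and feedback labels are all synthetic:

```python
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Frozen hold-out set: the guardrail every incremental update must clear.
X_holdout = rng.normal(size=(500, 5))
y_holdout = (X_holdout[:, 0] > 0).astype(int)

# Warm-start the online model on an initial batch.
X0 = rng.normal(size=(200, 5))
model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X0, (X0[:, 0] > 0).astype(int), classes=[0, 1])

def incremental_update(model, X_feedback, y_feedback, min_holdout_acc=0.85):
    """Apply a feedback batch only if the updated model still clears the hold-out guardrail."""
    candidate = copy.deepcopy(model)
    candidate.partial_fit(X_feedback, y_feedback)
    acc = accuracy_score(y_holdout, candidate.predict(X_holdout))
    return (candidate, acc) if acc >= min_holdout_acc else (model, acc)  # reject harmful updates

# Hypothetical feedback batch (in practice, possibly biased by the model's own past decisions).
X_fb = rng.normal(size=(50, 5))
y_fb = (X_fb[:, 0] > 0).astype(int)
model, acc = incremental_update(model, X_fb, y_fb)
print(f"hold-out accuracy after update check: {acc:.2f}")
```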