ML System Design Patterns - Roadmap
⚙️ 1. Core System Design Trade-Offs
Note
The Top Tech Interview Angle: These trade-offs are the foundation of every ML system design discussion. Interviewers test your ability to reason under constraints — balancing latency, cost, and model accuracy. Success here signals that you can translate abstract ML theory into robust production architecture decisions.
1.1: Batch vs. Real-Time Processing
- Understand batch pipelines (ETL, feature stores, offline training) versus streaming pipelines (Kafka, Flink, Spark Structured Streaming).
- Learn when each is appropriate — e.g., batch for retraining models nightly; streaming for fraud detection.
- Implement a small example of both using Python (`pandas` for batch, Kafka + `FastAPI` for stream scoring); a minimal sketch follows this subsection.
Deeper Insight: Probing Question: “If your fraud detection model runs on a 30-minute delay, what’s the real business impact?” Discuss data freshness, throughput, and serving-cost trade-offs — and how you’d mitigate lag via micro-batching or feature snapshotting.
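As a starting point for the exercise above, here is a minimal sketch of the two modes, assuming a toy scikit-learn classifier and hypothetical `amount`/`velocity` features: the batch path scores an offline Parquet table with `pandas`, while the real-time path wraps the same model in a `FastAPI` endpoint (a Kafka consumer would call this endpoint or embed the model directly; that wiring is omitted).

```python
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.linear_model import LogisticRegression

# Toy classifier standing in for a trained fraud model (hypothetical features).
model = LogisticRegression().fit([[0.0, 1.0], [1.0, 0.0]], [0, 1])

# --- Batch path: score an entire offline table in one pass -------------------
def score_batch(parquet_path: str) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)  # e.g. last night's ETL output
    df["fraud_score"] = model.predict_proba(df[["amount", "velocity"]].values)[:, 1]
    return df

# --- Real-time path: score one event per request -----------------------------
app = FastAPI()

class Txn(BaseModel):
    amount: float
    velocity: float

@app.post("/score")
def score_event(txn: Txn) -> dict:
    proba = model.predict_proba([[txn.amount, txn.velocity]])[0, 1]
    return {"fraud_score": float(proba)}
```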
1.2: Latency vs. Throughput
- Learn system-level metrics: P99 latency, QPS, and throughput per node.
- Study caching layers (Redis, Faiss) and how model size or quantization affects inference latency.
- Measure and visualize these trade-offs experimentally by varying batch sizes during inference.
Deeper Insight: Probing Question: “Your model meets accuracy targets but adds 200ms latency — what would you do?” Explore model compression, batching trade-offs, and hardware-aware deployment (e.g., GPUs vs. CPUs vs. TPUs).
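One way to run the batch-size experiment from the bullets above, using a stand-in matrix multiply in place of a real model; the callable, feature count, and batch sizes are placeholders:

```python
import time
import numpy as np

def latency_throughput_sweep(predict, n_features=32, batch_sizes=(1, 8, 64, 256), trials=200):
    """Measure P99 latency and throughput for a predict() callable at several batch sizes."""
    results = []
    for bs in batch_sizes:
        latencies = []
        for _ in range(trials):
            x = np.random.rand(bs, n_features).astype(np.float32)
            start = time.perf_counter()
            predict(x)
            latencies.append(time.perf_counter() - start)
        results.append({
            "batch_size": bs,
            "p99_ms": float(np.percentile(latencies, 99)) * 1e3,   # tail latency per request batch
            "throughput_per_s": bs / float(np.mean(latencies)),    # samples served per second
        })
    return results

# Stand-in "model": a matrix multiply with roughly model-shaped cost.
weights = np.random.rand(32, 128).astype(np.float32)
for row in latency_throughput_sweep(lambda x: x @ weights):
    print(row)
```

Larger batches typically raise throughput but also raise tail latency, which is exactly the trade-off the probing question targets.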
1.3: Shadow vs. A/B Testing
- Learn how shadow deployment safely validates a model by mirroring production traffic without affecting users.
- Contrast with A/B testing, which splits real traffic to measure impact on live metrics.
- Study how to log predictions, compare metrics offline, and roll out with canary releases.
Deeper Insight: Probing Question: “How do you detect if shadow model predictions diverge from production in dangerous ways?” Discuss statistical significance, data drift detection, and guardrails for rollbacks.
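A sketch of the two patterns under simple assumptions: shadow scoring mirrors each request to the candidate model and only logs the result, while A/B routing splits users deterministically. `prod_model` and `shadow_model` are stand-ins for real model calls.

```python
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def prod_model(features: dict) -> float:    # stand-in for the live model
    return 0.42

def shadow_model(features: dict) -> float:  # stand-in for the candidate model
    return 0.40

def handle_request(features: dict) -> float:
    """Serve the production prediction; mirror traffic to the shadow model for offline comparison."""
    prod_pred = prod_model(features)
    try:
        shadow_pred = shadow_model(features)  # never influences the user-facing response
        log.info(json.dumps({"features": features, "prod": prod_pred, "shadow": shadow_pred}))
    except Exception:                         # shadow failures must stay invisible to users
        log.exception("shadow scoring failed")
    return prod_pred

def ab_bucket(user_id: int, treatment_share: float = 0.10) -> str:
    """A/B split: a stable hash keeps each user in the same bucket across sessions and restarts."""
    bucket = int(hashlib.md5(f"user:{user_id}".encode()).hexdigest(), 16) % 1000
    return "treatment" if bucket < treatment_share * 1000 else "control"

print(handle_request({"amount": 12.0}), ab_bucket(user_id=7))
```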
🧩 2. Data Flow & Architecture Patterns
Note
Why This Matters: Data flow design is where candidates often falter. Strong answers here prove you understand how to move, transform, and validate data efficiently while preserving reproducibility and versioning.
2.1: Feature Store Design
- Understand offline–online consistency, feature versioning, and time-travel queries.
- Implement a minimal feature store using Feast or a custom SQL + Parquet-based approach.
- Study caching, serving, and materialization intervals.
Probing Question: “What happens if your training and serving features get out of sync?” Discuss training-serving skew, schema drift, and mitigation strategies using feature registries and timestamp joins.
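For the custom SQL + Parquet approach suggested above, the core mechanism is a point-in-time (as-of) join; a minimal sketch with an illustrative `user_id`/`avg_spend` schema:

```python
import pandas as pd

# Hypothetical feature log: one row per (entity, event_timestamp) feature snapshot.
features = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "event_ts":  pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "avg_spend": [10.0, 12.5, 7.0],
})

# Training labels with their own timestamps; each label must only see features
# that were already available at label time (point-in-time correctness).
labels = pd.DataFrame({
    "user_id":  [1, 2],
    "label_ts": pd.to_datetime(["2024-01-03", "2024-01-10"]),
    "label":    [0, 1],
})

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """As-of join: for each label, take the latest feature row at or before label_ts."""
    return pd.merge_asof(
        labels.sort_values("label_ts"),
        features.sort_values("event_ts"),
        left_on="label_ts",
        right_on="event_ts",
        by="user_id",
        direction="backward",
    )

print(point_in_time_join(labels, features))
```

The backward as-of join is what prevents label-time leakage, one of the main sources of training-serving skew the probing question is after.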
2.2: Model Registry & Versioning
- Study how MLflow or the Vertex AI Model Registry stores models, metadata, and lineage.
- Learn tagging strategies for experiment tracking (e.g., `model:v3.2-prod`) and version rollback mechanisms.
- Build a lightweight registry using S3 + a JSON manifest to simulate this behavior; see the sketch after this subsection.
Deeper Insight: Be prepared to reason about reproducibility guarantees — why simply “saving model.pkl” is not enough, and how environment pinning (Docker + Conda YAMLs) ensures repeatability.
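A sketch of the lightweight-registry exercise, using a local directory as a stand-in for an S3 prefix (swapping in `boto3` uploads is straightforward); the manifest fields shown are one reasonable choice, not a standard:

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

REGISTRY_ROOT = Path("registry")  # local stand-in for an s3://models/ prefix

def register_model(model_path: str, name: str, version: str, metrics: dict) -> dict:
    """Copy the artifact into the registry and record lineage in a JSON manifest."""
    artifact = Path(model_path)
    dest_dir = REGISTRY_ROOT / name / version
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(artifact, dest_dir / artifact.name)
    manifest = {
        "name": name,
        "version": version,                                           # e.g. "v3.2-prod"
        "artifact": str(dest_dir / artifact.name),
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),  # integrity / lineage check
        "metrics": metrics,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    (dest_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

def rollback_target(name: str, current_version: str) -> Optional[str]:
    """Pick the latest registered version other than the one currently serving."""
    versions = sorted(p.name for p in (REGISTRY_ROOT / name).iterdir() if p.is_dir())
    candidates = [v for v in versions if v != current_version]
    return candidates[-1] if candidates else None
```

In line with the reproducibility point above, a real manifest would also pin the training environment (Docker image digest, Conda YAML hash) and the data snapshot, not just the serialized model.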
2.3: Online Inference Architecture
- Compare synchronous vs. asynchronous serving patterns.
- Study multi-model serving (one endpoint hosting multiple models) vs. multi-tenant inference (shared hardware).
- Design load balancers and autoscaling rules (Kubernetes HPA).
Probing Question: “How would you design an inference API that scales to 100K QPS?” Talk about gRPC, vectorized inference, autoscaling triggers, and cold start mitigation via warm containers.
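An illustrative asynchronous serving pattern: requests are queued and a background micro-batcher turns many concurrent requests into one vectorized model call. `predict_batch` is a stand-in for the real forward pass, and the batch size and wait budget are arbitrary choices:

```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

class Request(BaseModel):
    features: list[float]

def predict_batch(batch: list[list[float]]) -> list[float]:
    return [sum(x) for x in batch]  # stand-in for one batched forward pass

async def batcher(max_batch: int = 32, max_wait_s: float = 0.01) -> None:
    """Collect requests for up to max_wait_s, then run a single vectorized call."""
    while True:
        batch = [await queue.get()]                      # (features, future) pairs
        deadline = asyncio.get_running_loop().time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        preds = predict_batch([features for features, _ in batch])
        for (_, fut), pred in zip(batch, preds):
            fut.set_result(pred)

@app.on_event("startup")
async def start_batcher() -> None:
    asyncio.create_task(batcher())

@app.post("/predict")
async def predict(req: Request) -> dict:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((req.features, fut))
    return {"prediction": await fut}
```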
🧮 3. Scalability and Efficiency Patterns
Note
Why It’s Tested: These patterns separate junior from senior engineers. They show your understanding of hardware efficiency, cost trade-offs, and how to design for scalability under real-world constraints.
3.1: Model Sharding & Distributed Inference
- Study tensor parallelism, pipeline parallelism, and model partitioning strategies.
- Learn how systems like vLLM, DeepSpeed, and Ray Serve distribute large model weights across nodes.
Probing Question: “Your 40B parameter model doesn’t fit on a single GPU — what are your deployment options?” Discuss ZeRO partitioning, quantization, and offloading to CPU/SSD trade-offs.
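A toy pipeline-parallel sketch in PyTorch to make the partitioning idea concrete: two stages of a model live on two devices, and activations hop between them. It falls back to CPU when two GPUs aren't available; real systems such as vLLM, DeepSpeed, and Ray Serve add scheduling, communication overlap, and tensor parallelism on top of this idea.

```python
import torch
import torch.nn as nn

# Place each stage on its own device when two GPUs are available; fall back to CPU otherwise.
dev0 = torch.device("cuda:0") if torch.cuda.device_count() >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() >= 2 else torch.device("cpu")

class TwoStageModel(nn.Module):
    """Toy pipeline-parallel layout: half the weights per device."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to(dev0)  # first shard of the weights
        self.stage2 = nn.Linear(4096, 10).to(dev1)    # second shard of the weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(x.to(dev1))                # cross-device activation transfer

model = TwoStageModel().eval()
with torch.no_grad():
    print(model(torch.randn(8, 1024)).shape)          # torch.Size([8, 10])
```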
3.2: Caching and Precomputation
- Implement prediction caching: store frequent inference results in Redis.
- Learn to precompute embeddings for recommendation or search.
- Evaluate cost vs. freshness when caching: how long before embeddings drift?
Deeper Insight: Probing Question: “When does caching become dangerous?” Discuss stale predictions, data drift, and cache invalidation strategies (TTL, LRU eviction).
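A minimal Redis prediction cache, assuming a local Redis instance and using a TTL as the freshness guardrail; `predict` is a stand-in for the real model call:

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumes a local Redis instance
CACHE_TTL_S = 3600                                   # freshness budget for cached scores

def cache_key(features: dict) -> str:
    canonical = json.dumps(features, sort_keys=True)  # stable key for identical inputs
    return "pred:" + hashlib.sha256(canonical.encode()).hexdigest()

def predict(features: dict) -> float:
    return float(sum(features.values()))              # stand-in for the real model call

def cached_predict(features: dict) -> float:
    key = cache_key(features)
    hit = r.get(key)
    if hit is not None:
        return float(hit)                             # cache hit: skip model inference
    pred = predict(features)
    r.setex(key, CACHE_TTL_S, pred)                   # TTL bounds staleness
    return pred
```

The TTL is the simplest answer to "when does caching become dangerous": it caps how stale a prediction can get, at the cost of more recomputation.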
3.3: Model Compression & Distillation
- Master quantization (int8, fp16), pruning, and knowledge distillation.
- Quantify accuracy vs. latency/cost improvements using benchmarks.
Deeper Insight: Probing Question: “Your quantized model loses 4% accuracy — what do you do?” Discuss calibration data, mixed-precision, and post-training quantization improvements.
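A small post-training dynamic quantization benchmark with PyTorch; the model is a toy MLP, so treat the numbers as illustrative rather than representative:

```python
import time
import torch
import torch.nn as nn

# Small fully connected model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# Post-training dynamic quantization: Linear weights stored as int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def bench(m: nn.Module, runs: int = 200) -> float:
    x = torch.randn(64, 512)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs

print(f"fp32: {bench(model)*1e3:.2f} ms/batch, int8: {bench(quantized)*1e3:.2f} ms/batch")
```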
🧰 4. Reliability & Monitoring Patterns
Note
Why It’s Key: Production ML systems fail silently. Interviewers expect you to design robust observability, alerting, and recovery mechanisms.
4.1: Drift Detection
- Understand data drift, concept drift, and how to measure divergence (KL divergence, PSI).
- Build a drift detection service comparing real-time inputs to training distribution.
Probing Question: “If your model starts degrading silently, how will you detect it?” Explain performance monitoring loops and automated retraining triggers.
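A self-contained PSI calculation comparing a training sample to live traffic; the 0.2 threshold in the comment is a common rule of thumb, not a universal constant:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training sample and live traffic."""
    # Bin edges come from the training (expected) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)
    # Small floor avoids log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

train = np.random.normal(0.0, 1.0, 10_000)   # training-time feature distribution
live = np.random.normal(0.3, 1.0, 10_000)    # live traffic with a shifted mean
print(f"PSI = {psi(train, live):.3f}")       # > 0.2 is a common "investigate" threshold
```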
4.2: Model Monitoring and Alerting
- Learn metrics beyond accuracy — e.g., feature distributions, prediction confidence, and fairness metrics.
- Set up alerting thresholds and dashboards (Prometheus, Grafana).
Deeper Insight: Probing Question: “What would you log to debug a model drift incident?” Discuss input features, model version, latency, confidence, and output distributions.
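A sketch of exporting model metrics with `prometheus_client`; the metric names and the fixed confidence value are placeholders, and in production the feature gauge would track a rolling aggregate rather than the last value:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metrics beyond accuracy: volume per model version, latency, confidence, feature stats.
PREDICTIONS = Counter("predictions_total", "Total predictions served", ["model_version"])
LATENCY = Histogram("inference_latency_seconds", "Per-request inference latency")
CONFIDENCE = Histogram("prediction_confidence", "Model confidence per prediction")
FEATURE_MEAN = Gauge("feature_amount_mean", "Rolling mean of the 'amount' feature")

@LATENCY.time()
def serve(features: dict, model_version: str = "v3.2-prod") -> float:
    confidence = 0.9                               # stand-in for the model's score
    PREDICTIONS.labels(model_version=model_version).inc()
    CONFIDENCE.observe(confidence)
    FEATURE_MEAN.set(features.get("amount", 0.0))  # in practice, a rolling aggregate
    return confidence

if __name__ == "__main__":
    start_http_server(9100)                        # Prometheus scrapes /metrics on this port
    serve({"amount": 12.0})
```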
4.3: Safe Rollbacks & Canary Deployments
- Implement progressive rollouts — start with 1% traffic, observe metrics, and gradually increase.
- Design rollback plans with pre-validated baseline checkpoints.
Deeper Insight: Probing Question: “Your new model passes offline tests but crashes in production. What’s your rollback process?” Talk about blue-green deployments, statistical rollback triggers, and checkpoint pinning.
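A hedged sketch of the routing and rollback logic behind a progressive rollout; the stage percentages and the 10% relative-regression threshold are illustrative choices:

```python
import hashlib

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # share of traffic on the canary model

def routes_to_canary(request_key: str, canary_share: float) -> bool:
    """Stable hash-based split so the same key always hits the same model."""
    bucket = int(hashlib.md5(request_key.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_share * 10_000

def should_rollback(baseline_error_rate: float, canary_error_rate: float,
                    max_relative_regression: float = 0.10) -> bool:
    """Roll back if the canary's error rate regresses more than 10% relative to baseline."""
    return canary_error_rate > baseline_error_rate * (1 + max_relative_regression)

# Example: at the 5% stage, check live metrics before promoting to the next stage.
share = ROLLOUT_STAGES[1]
print(routes_to_canary("req-123", share), should_rollback(0.020, 0.026))
```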
🧭 5. Cost, Governance & Evolution Patterns
Note
Why It’s Crucial: Top companies test not just system correctness but engineering maturity — can you scale models while keeping them auditable, compliant, and cost-effective?
5.1: Cost Optimization
- Learn cost breakdown across compute (training/inference), storage (feature logs), and egress (data movement).
- Practice estimating cost impact of model refresh frequency and inference batch size.
Probing Question: “Your model’s inference cost tripled last quarter — where do you start investigating?” Talk about profiling GPU utilization, lazy loading, and on-demand scaling.
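A deliberately crude back-of-envelope model for the cost-estimation exercise above: it treats each GPU as serving one request at a time and uses Little's law to get concurrency. All numbers are hypothetical.

```python
def monthly_inference_cost(qps: float, latency_s: float, gpu_hourly_usd: float,
                           gpu_utilization: float = 0.5) -> float:
    """Rough GPU cost: concurrent requests -> GPUs needed -> GPU-hours per month."""
    concurrency = qps * latency_s                    # Little's law: requests in flight
    gpus_needed = max(1.0, concurrency / gpu_utilization)
    return gpus_needed * gpu_hourly_usd * 24 * 30

# Hypothetical numbers: 500 QPS at 80 ms on $2.50/hr GPUs, 50% useful utilization.
print(f"${monthly_inference_cost(qps=500, latency_s=0.08, gpu_hourly_usd=2.50):,.0f}/month")
```

Even a model this crude makes the levers obvious: cut latency (compression, batching), raise utilization (better packing, autoscaling), or cheapen the hardware.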
5.2: Governance & Explainability
- Study model lineage tracking, bias detection, and explainability tooling (SHAP, LIME).
- Understand audit trails for regulatory compliance (GDPR, AI Act).
Deeper Insight: Probing Question: “How do you balance explainability with model performance?” Discuss surrogate models, feature attribution caching, and trade-offs between transparency and model complexity.
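A small SHAP example on a toy random forest; persisting per-prediction attributions like these next to the prediction log is one way to build an audit trail. The data and feature names are synthetic:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 200 rows, 4 hypothetical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer gives per-feature attributions for each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # shape: (5 rows, 4 features)

for row in shap_values:
    print({f"feature_{i}": round(float(v), 3) for i, v in enumerate(row)})
```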
5.3: Continuous Learning & Feedback Loops
- Learn online learning pipelines: streaming feedback → retraining → deployment.
- Understand guardrails for preventing model collapse due to biased feedback loops.
Deeper Insight: Probing Question: “What could go wrong with continuous learning?” Discuss feedback loops, concept drift, and human-in-the-loop retraining strategies.
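A sketch of one incremental-update guardrail: each feedback batch is applied to a copy of the model and only promoted if accuracy on a frozen hold-out set stays above a threshold. The data, threshold, and feedback labels are all synthetic:

```python
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Frozen hold-out set: the guardrail every incremental update must clear.
X_holdout = rng.normal(size=(500, 5))
y_holdout = (X_holdout[:, 0] > 0).astype(int)

# Warm-start the online model on an initial batch.
X0 = rng.normal(size=(200, 5))
model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X0, (X0[:, 0] > 0).astype(int), classes=[0, 1])

def incremental_update(model, X_feedback, y_feedback, min_holdout_acc=0.85):
    """Apply a feedback batch only if the updated model still clears the hold-out guardrail."""
    candidate = copy.deepcopy(model)
    candidate.partial_fit(X_feedback, y_feedback)
    acc = accuracy_score(y_holdout, candidate.predict(X_holdout))
    return (candidate, acc) if acc >= min_holdout_acc else (model, acc)  # reject harmful updates

# Hypothetical feedback batch (in practice, possibly biased by the model's own past decisions).
X_fb = rng.normal(size=(50, 5))
y_fb = (X_fb[:, 0] > 0).astype(int)
model, acc = incremental_update(model, X_fb, y_fb)
print(f"hold-out accuracy after update check: {acc:.2f}")
```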