ML System Design: Design Patterns - Roadmap

⚙️ 1. Core System Design Trade-Offs

Note

The Top Tech Interview Angle: These trade-offs are the foundation of every ML system design discussion. Interviewers test your ability to reason under constraints — balancing latency, cost, and model accuracy. Success here signals that you can translate abstract ML theory into robust production architecture decisions.

1.1: Batch vs. Real-Time Processing

  • Understand batch pipelines (ETL, feature stores, offline training) versus streaming pipelines (Kafka, Flink, Spark Structured Streaming).
  • Learn when each is appropriate — e.g., batch for retraining models nightly; streaming for fraud detection.
  • Implement a small example of both using Python (pandas for batch, Kafka + FastAPI for stream scoring); a minimal sketch of the two paths follows this list.
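
To make the contrast concrete, here is a minimal sketch of the same fraud-scoring logic exposed two ways: a pandas batch job over a day of transactions and a FastAPI endpoint for per-event scoring. The score_row function, the amount / n_txn_last_hour features, and the Parquet path are placeholder assumptions standing in for a real model and schema; the Kafka consumer that would feed the endpoint is omitted.

```python
# Minimal sketch: the same scoring logic as a nightly batch job and a real-time
# endpoint. `score_row` is a placeholder for a real model's predict call.
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

def score_row(amount: float, n_txn_last_hour: int) -> float:
    """Toy fraud score; swap in model.predict_proba in practice."""
    return min(1.0, 0.01 * n_txn_last_hour + amount / 10_000)

# --- Batch path: score yesterday's transactions offline with pandas ---
def batch_score(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)  # assumed columns: amount, n_txn_last_hour
    df["fraud_score"] = [
        score_row(a, n) for a, n in zip(df["amount"], df["n_txn_last_hour"])
    ]
    return df

# --- Streaming path: score one event at a time behind an API ---
# (A Kafka consumer would typically call this endpoint, or host the model itself.)
app = FastAPI()

class Txn(BaseModel):
    amount: float
    n_txn_last_hour: int

@app.post("/score")
def score(txn: Txn) -> dict:
    return {"fraud_score": score_row(txn.amount, txn.n_txn_last_hour)}
```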

Deeper Insight: Probing Question: “If your fraud detection model uses a 30-minute delay, what’s the real business impact?” Discuss data freshness, throughput, and serving-cost trade-offs, and how you’d mitigate lag via micro-batching or feature snapshotting.


1.2: Latency vs. Throughput

  • Learn system-level metrics: P99 latency, QPS, and throughput per node.
  • Study caching layers (Redis, Faiss) and how model size or quantization affects inference latency.
  • Measure and visualize these trade-offs experimentally by varying batch sizes during inference (see the benchmark sketch after this list).
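
A toy benchmark along these lines, with a fixed matrix multiply standing in for a real forward pass (the 512-dimensional input, batch sizes, and request count are arbitrary assumptions), shows how larger batches trade per-request latency for throughput:

```python
# Toy benchmark: vary the inference batch size and record per-batch P99 latency
# and overall throughput. A fixed matrix multiply stands in for a forward pass.
import time
import numpy as np

W = np.random.rand(512, 10)

def fake_model(batch: np.ndarray) -> np.ndarray:
    return batch @ W  # stand-in for model inference

def benchmark(batch_size: int, n_requests: int = 2048) -> tuple[float, float]:
    n_batches = n_requests // batch_size
    latencies = []
    start = time.perf_counter()
    for _ in range(n_batches):
        batch = np.random.rand(batch_size, 512)
        t0 = time.perf_counter()
        fake_model(batch)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    p99_ms = float(np.percentile(latencies, 99)) * 1000   # per-batch P99 in ms
    throughput = (n_batches * batch_size) / total          # requests per second
    return p99_ms, throughput

for bs in (1, 8, 32, 128):
    p99, qps = benchmark(bs)
    print(f"batch={bs:>3}  P99={p99:7.3f} ms/batch  throughput={qps:9.0f} req/s")
```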

Deeper Insight: Probing Question: “Your model meets accuracy targets but adds 200ms latency — what would you do?” Explore model compression, batching trade-offs, and hardware-aware deployment (e.g., GPUs vs. CPUs vs. TPUs).


1.3: Shadow vs. A/B Testing

  • Learn how shadow deployment safely validates a model by mirroring production traffic without affecting users.
  • Contrast with A/B testing, which splits real traffic to measure impact on live metrics.
  • Study how to log predictions, compare metrics offline, and roll out with canary releases (a shadow-logging sketch follows this list).
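
A minimal shadow-serving sketch, assuming stand-in primary and shadow models and a simple JSONL log file for offline comparison (the endpoint name and log format are illustrative):

```python
# Shadow deployment sketch: the primary model answers the request while a shadow
# model scores the same input; both are logged, only the primary is returned.
import json
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def primary_model(features: list[float]) -> float:
    return sum(features) / len(features)        # stand-in for the live model

def shadow_model(features: list[float]) -> float:
    return 0.9 * sum(features) / len(features)  # stand-in for the candidate

class Request(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: Request) -> dict:
    primary = primary_model(req.features)
    shadow = shadow_model(req.features)         # never shown to the user
    with open("shadow_log.jsonl", "a") as f:    # compared offline later
        f.write(json.dumps({"ts": time.time(),
                            "primary": primary,
                            "shadow": shadow}) + "\n")
    return {"prediction": primary}
```

In a real system the shadow call would run asynchronously (or on mirrored traffic) so it cannot add latency or errors to the user-facing path.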

Deeper Insight: Probing Question: “How do you detect if shadow model predictions diverge from production in dangerous ways?” Discuss statistical significance, data drift detection, and guardrails for rollbacks.


🧩 2. Data Flow & Architecture Patterns

Note

Why This Matters: Data flow design is where candidates often falter. Strong answers here prove you understand how to move, transform, and validate data efficiently while preserving reproducibility and versioning.

2.1: Feature Store Design

  • Understand offline–online consistency, feature versioning, and time-travel queries.
  • Implement a minimal feature store using Feast or a custom SQL + Parquet-based approach (a point-in-time join sketch follows this list).
  • Study caching, serving, and materialization intervals.
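
The core of offline–online consistency is the point-in-time ("time-travel") join. A minimal sketch using pandas merge_asof, with a made-up avg_spend feature and churn label, shows how each training row sees only feature values observed at or before its label timestamp:

```python
# Minimal "time-travel" join: for each training label, attach the latest feature
# value known *at or before* the label timestamp, avoiding future leakage.
import pandas as pd

features = pd.DataFrame({
    "user_id":   [1, 1, 2, 2],
    "event_ts":  pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02", "2024-01-06"]),
    "avg_spend": [10.0, 12.5, 30.0, 28.0],
}).sort_values("event_ts")

labels = pd.DataFrame({
    "user_id":  [1, 2],
    "label_ts": pd.to_datetime(["2024-01-04", "2024-01-07"]),
    "churned":  [0, 1],
}).sort_values("label_ts")

# Point-in-time correct join: only features observed before the label timestamp.
training_set = pd.merge_asof(
    labels, features,
    left_on="label_ts", right_on="event_ts",
    by="user_id", direction="backward",
)
print(training_set[["user_id", "label_ts", "avg_spend", "churned"]])
```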

Probing Question: “What happens if your training and serving features get out of sync?” Discuss training-serving skew, schema drift, and mitigation strategies using feature registries and timestamp joins.


2.2: Model Registry & Versioning

  • Study how MLflow or Vertex AI Model Registry store models, metadata, and lineage.
  • Learn tagging strategies for experiment tracking (e.g., model:v3.2-prod) and version rollback mechanisms.
  • Build a lightweight registry using S3 + a JSON manifest to simulate this behavior (sketched after this list).
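
A sketch of such a registry, using a local directory to stand in for an S3 bucket and a JSON manifest whose fields (metrics, checksum, timestamp) are illustrative assumptions:

```python
# Lightweight registry sketch: a directory (standing in for an S3 bucket) holds
# model artifacts plus a JSON manifest recording version, metrics, and a checksum.
import hashlib
import json
import shutil
import time
from pathlib import Path

REGISTRY = Path("model_registry")

def register_model(artifact_path: str, name: str, version: str, metrics: dict) -> dict:
    dest = REGISTRY / name / version
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy(artifact_path, dest / Path(artifact_path).name)

    manifest = {
        "name": name,
        "version": version,
        "artifact": Path(artifact_path).name,
        "sha256": hashlib.sha256(Path(artifact_path).read_bytes()).hexdigest(),
        "metrics": metrics,
        "registered_at": time.time(),
    }
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

def load_manifest(name: str, version: str) -> dict:
    return json.loads((REGISTRY / name / version / "manifest.json").read_text())
```

A production version would also pin the training environment (Docker image digest, Conda YAML) in the manifest, which is what makes a rollback truly reproducible rather than just “the old pickle file.”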

Deeper Insight: Be prepared to reason about reproducibility guarantees — why simply “saving model.pkl” is not enough, and how environment pinning (Docker + Conda YAMLs) ensures repeatability.


2.3: Online Inference Architecture

  • Compare synchronous vs. asynchronous serving patterns (contrasted in the sketch after this list).
  • Study multi-model serving (one endpoint hosting multiple models) vs. multi-tenant inference (shared hardware).
  • Design load balancers and autoscaling rules (Kubernetes HPA).
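
A sketch contrasting the two serving patterns with FastAPI, where an in-memory dict stands in for a real job queue and result store; the endpoint paths and the run_model stub are assumptions:

```python
# Sync vs. async serving sketch. Sync blocks until the prediction is ready;
# async returns a job id immediately and the caller polls for the result.
import uuid
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel

app = FastAPI()
results: dict[str, float] = {}  # stand-in for a real result store (Redis, DB)

class Request(BaseModel):
    features: list[float]

def run_model(features: list[float]) -> float:
    return sum(features)  # stand-in for an expensive forward pass

def _run_job(job_id: str, features: list[float]) -> None:
    results[job_id] = run_model(features)

@app.post("/predict/sync")
def predict_sync(req: Request) -> dict:
    return {"prediction": run_model(req.features)}

@app.post("/predict/async")
def predict_async(req: Request, background_tasks: BackgroundTasks) -> dict:
    job_id = str(uuid.uuid4())
    background_tasks.add_task(_run_job, job_id, req.features)
    return {"job_id": job_id}

@app.get("/predict/async/{job_id}")
def get_result(job_id: str) -> dict:
    if job_id not in results:
        return {"status": "pending"}
    return {"status": "done", "prediction": results[job_id]}
```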

Probing Question: “How would you design an inference API that scales to 100K QPS?” Talk about gRPC, vectorized inference, autoscaling triggers, and cold start mitigation via warm containers.


🧮 3. Scalability and Efficiency Patterns

Note

Why It’s Tested: These patterns separate junior from senior engineers. They show your understanding of hardware efficiency, cost trade-offs, and how to design for scalability under real-world constraints.

3.1: Model Sharding & Distributed Inference

  • Study tensor parallelism, pipeline parallelism, and model partitioning strategies.
  • Learn how systems like vLLM, DeepSpeed, and Ray Serve distribute large model weights across nodes.

Probing Question: “Your 40B parameter model doesn’t fit on a single GPU — what are your deployment options?” Discuss ZeRO partitioning, quantization, and the trade-offs of offloading weights to CPU or SSD.
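
A quick back-of-envelope check for this question, counting weights only (activations, KV cache, and optimizer state add more) and using an 80 GB GPU as the reference card:

```python
# Weights-only memory estimate for a 40B-parameter model at common precisions.
# Even when fp16 weights nominally fit an 80 GB card, activations and KV cache
# leave no headroom, which is why sharding or quantization enters the picture.
PARAMS = 40e9
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision:>9}: ~{gb:5.0f} GB of weights (vs. 80 GB on a single A100/H100)")
```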


3.2: Caching and Precomputation

  • Implement prediction caching: store frequent inference results in Redis (see the sketch after this list).
  • Learn to precompute embeddings for recommendation or search.
  • Evaluate cost vs. freshness when caching: how long before embeddings drift?
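
A minimal caching sketch with redis-py, assuming a local Redis on the default port; the key scheme, TTL value, and the predict stub are illustrative choices:

```python
# Prediction caching with a TTL. The cache key is a hash of the (sorted) input
# features; entries expire so stale predictions are re-scored, not served forever.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 3600  # freshness budget: tune to how quickly predictions go stale

def predict(features: dict) -> float:
    return float(sum(features.values()))  # stand-in for the real model call

def cached_predict(features: dict) -> float:
    key = "pred:" + hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return float(hit)                  # cache hit: skip the model entirely
    value = predict(features)
    r.setex(key, TTL_SECONDS, str(value))  # cache miss: score, then store with TTL
    return value
```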

Deeper Insight: Probing Question: “When does caching become dangerous?” Discuss stale predictions, data drift, and cache invalidation strategies (TTL, LRU eviction).


3.3: Model Compression & Distillation

  • Master quantization (int8, fp16), pruning, and knowledge distillation.
  • Quantify accuracy vs. latency/cost improvements using benchmarks (a quantization benchmark is sketched below).
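
A hedged example of post-training dynamic quantization in PyTorch on a toy model, comparing median CPU latency before and after; the layer sizes and batch shape are arbitrary, and real speedups and accuracy loss depend on the architecture and hardware:

```python
# Post-training dynamic quantization of the Linear layers of a toy PyTorch model,
# comparing median CPU latency before and after.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def median_latency_ms(m: nn.Module, runs: int = 200) -> float:
    x = torch.randn(32, 512)
    times = []
    with torch.no_grad():
        for _ in range(runs):
            t0 = time.perf_counter()
            m(x)
            times.append(time.perf_counter() - t0)
    times.sort()
    return 1000 * times[len(times) // 2]

print(f"fp32 Linear layers: {median_latency_ms(model):6.2f} ms / batch of 32")
print(f"int8 Linear layers: {median_latency_ms(quantized):6.2f} ms / batch of 32")
```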

Deeper Insight: Probing Question: “Your quantized model loses 4% accuracy — what do you do?” Discuss calibration data, mixed-precision, and post-training quantization improvements.


🧰 4. Reliability & Monitoring Patterns

Note

Why It’s Key: Production ML systems fail silently. Interviewers expect you to design robust observability, alerting, and recovery mechanisms.

4.1: Drift Detection

  • Understand data drift, concept drift, and how to measure divergence (KL divergence, PSI).
  • Build a drift detection service comparing real-time inputs to the training distribution (a PSI sketch follows this list).
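
A minimal PSI computation for a single numeric feature is a good starting point for such a service; the bin count and the simulated “live” distribution below are assumptions:

```python
# Population Stability Index (PSI) for one numeric feature: bin edges come from
# the training distribution; live values outside the training range are clipped
# into the edge bins. PSI > 0.25 is a common "major shift" rule of thumb.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)   # avoid log(0) in sparse bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

train = np.random.normal(0.0, 1.0, 50_000)   # training-time feature values
live = np.random.normal(0.3, 1.1, 5_000)     # simulated drifted live traffic
print(f"PSI = {psi(train, live):.3f}")       # > 0.25 would typically trigger an alert
```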

Probing Question: “If your model starts degrading silently, how will you detect it?” Explain performance monitoring loops and automated retraining triggers.


4.2: Model Monitoring and Alerting

  • Learn metrics beyond accuracy — e.g., feature distributions, prediction confidence, and fairness metrics.
  • Set up alerting thresholds and dashboards (Prometheus, Grafana); an instrumentation sketch follows this list.
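
A sketch of serving-side instrumentation with prometheus_client; the metric names, buckets, and the simulated model call are illustrative, and Prometheus/Grafana would scrape and alert on the exposed endpoint:

```python
# Serving-side metrics: request counts by model version, inference latency, and
# prediction confidence, exposed on /metrics for Prometheus to scrape.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")
CONFIDENCE = Histogram("prediction_confidence", "Top-class confidence",
                       buckets=[round(0.1 * i, 1) for i in range(1, 11)])

def serve_one(model_version: str = "v3.2") -> float:
    with LATENCY.time():                         # records how long inference took
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for a model call
        confidence = random.uniform(0.4, 1.0)
    PREDICTIONS.labels(model_version=model_version).inc()
    CONFIDENCE.observe(confidence)
    return confidence

if __name__ == "__main__":
    start_http_server(8000)  # exposes http://localhost:8000/metrics for scraping
    while True:
        serve_one()
```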

Deeper Insight: Probing Question: “What would you log to debug a model drift incident?” Discuss input features, model version, latency, confidence, and output distributions.


4.3: Safe Rollbacks & Canary Deployments

  • Implement progressive rollouts: start with 1% traffic, observe metrics, and gradually increase (see the routing sketch after this list).
  • Design rollback plans with pre-validated baseline checkpoints.
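
A bare-bones traffic router illustrating the idea; the model stubs and the 1% fraction are assumptions, and real routers usually hash on a stable key (such as user id) so a given user consistently sees the same model:

```python
# Bare-bones canary router: a configurable fraction of traffic goes to the new
# model; rollback is a config change (set the fraction to 0), not a redeploy.
import random

CANARY_FRACTION = 0.01  # start at 1%; raise only while guardrail metrics hold

def stable_model(features: list[float]) -> float:
    return sum(features)        # stand-in for the current production model

def canary_model(features: list[float]) -> float:
    return 1.1 * sum(features)  # stand-in for the new candidate model

def route(features: list[float]) -> tuple[str, float]:
    if random.random() < CANARY_FRACTION:
        return "canary", canary_model(features)
    return "stable", stable_model(features)

# Example: tally which model served each request during a rollout step.
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    arm, _score = route([1.0, 2.0, 3.0])
    counts[arm] += 1
print(counts)
```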

Deeper Insight: Probing Question: “Your new model passes offline tests but crashes in production. What’s your rollback process?” Talk about blue-green deployments, statistical rollback triggers, and checkpoint pinning.


🧭 5. Cost, Governance & Evolution Patterns

Note

Why It’s Crucial: Top companies test not just system correctness but engineering maturity — can you scale models while keeping them auditable, compliant, and cost-effective?

5.1: Cost Optimization

  • Learn cost breakdown across compute (training/inference), storage (feature logs), and egress (data movement).
  • Practice estimating the cost impact of model refresh frequency and inference batch size (a back-of-envelope sketch follows this list).
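
A back-of-envelope cost model makes this concrete. Every price and rate below is an assumed placeholder; the structure is the point: training cost scales with refresh frequency, serving cost scales inversely with batch size.

```python
# Toy cost model: how retraining frequency and inference batch size move the
# monthly bill. All prices and rates are illustrative assumptions.
GPU_HOURLY_COST = 2.50       # $/GPU-hour (assumed)
TRAIN_GPU_HOURS = 120        # GPU-hours per full retrain (assumed)
REQUESTS_PER_MONTH = 500e6
LATENCY_PER_BATCH_S = 0.05   # rough per-batch inference time (assumed)

def monthly_cost(retrains_per_month: int, batch_size: int) -> float:
    training = retrains_per_month * TRAIN_GPU_HOURS * GPU_HOURLY_COST
    gpu_hours_serving = (REQUESTS_PER_MONTH / batch_size) * LATENCY_PER_BATCH_S / 3600
    serving = gpu_hours_serving * GPU_HOURLY_COST
    return training + serving

for retrains, bs in [(4, 1), (4, 32), (30, 32)]:
    print(f"retrains/month={retrains:>2}, batch={bs:>3} -> "
          f"~${monthly_cost(retrains, bs):,.0f}/month")
```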

Probing Question: “Your model’s inference cost tripled last quarter — where do you start investigating?” Talk about profiling GPU utilization, lazy loading, and on-demand scaling.


5.2: Governance & Explainability

  • Study model lineage tracking, bias detection, and explainability tooling (SHAP, LIME).
  • Understand audit trails for regulatory compliance (GDPR, AI Act).

Deeper Insight: Probing Question: “How do you balance explainability with model performance?” Discuss surrogate models, feature attribution caching, and trade-offs between transparency and model complexity.


5.3: Continuous Learning & Feedback Loops

  • Learn online learning pipelines: streaming feedback → retraining → deployment (a minimal partial_fit loop is sketched after this list).
  • Understand guardrails for preventing model collapse due to biased feedback loops.
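
A minimal online-learning loop using scikit-learn's partial_fit; the simulated feedback stream, holdout set, and accuracy guardrail threshold are assumptions, and production loops add drift checks and human review before promoting an update:

```python
# Online-learning loop: each batch of labeled feedback incrementally updates the
# model, gated by a simple accuracy guardrail against a holdout set.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])
BASELINE_ACCURACY = 0.80  # last validated baseline (assumed)

def feedback_batches(n_batches: int = 5, batch_size: int = 512):
    """Simulated stream of labeled feedback (features + observed outcomes)."""
    for _ in range(n_batches):
        X = np.random.rand(batch_size, 10)
        y = (X[:, 0] > 0.5).astype(int)
        yield X, y

holdout_X, holdout_y = next(feedback_batches(1))  # stand-in for a fixed holdout set

for X, y in feedback_batches():
    model.partial_fit(X, y, classes=classes)      # incremental update
    acc = model.score(holdout_X, holdout_y)
    print(f"holdout accuracy after update: {acc:.3f}")
    if acc < BASELINE_ACCURACY:
        # Guardrail: stop promoting updates and roll back to the last checkpoint.
        print("accuracy below baseline; halting promotion")
        break
```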

Deeper Insight: Probing Question: “What could go wrong with continuous learning?” Discuss feedback loops, concept drift, and human-in-the-loop retraining strategies.

