🏗️ ML System Architecture Fundamentals

Note

The Top Tech Interview Angle: This topic assesses your ability to design large-scale ML systems that are robust, scalable, and maintainable. You're expected to demonstrate both algorithmic intuition and systems thinking: knowing how data flows from ingestion to inference, how latency budgets constrain models, and how retraining loops are architected.

1.1: Understand End-to-End ML System Anatomy

  • Learn the 5 core components of every ML system:

    1. Data Pipeline - Ingestion, cleaning, feature extraction
    2. Model Training Pipeline - Experimentation and retraining
    3. Model Registry - Versioning, validation, approval
    4. Model Serving Layer - Real-time or batch inference
    5. Monitoring & Feedback Loop - Drift, quality, and performance metrics
  • Study canonical architectures for fraud detection, recommendation, and ranking systems.

Deeper Insight: Be ready to whiteboard how features move from raw data → features → model → predictions → feedback. Interviewers often probe: "Where would you put feature engineering logic: in training, serving, or both?" The right answer emphasizes feature parity between offline and online components.
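
Feature parity is often enforced by routing both the offline pipeline and the online handler through one shared transformation function. The sketch below illustrates the idea; the event fields, compute_features, and the model interface are illustrative placeholders rather than any specific framework's API.

```python
import math

# Shared feature logic: a single function imported by both the offline
# training pipeline and the online serving handler. Event fields and the
# model interface below are illustrative placeholders.

def compute_features(raw_event: dict) -> dict:
    amount = float(raw_event.get("amount", 0.0))
    return {
        "log_amount": math.log1p(amount),
        "is_foreign": int(raw_event.get("country") != raw_event.get("home_country")),
    }

# Offline path: build training rows from historical events plus labels.
def build_training_rows(historical_events, labels):
    return [{**compute_features(event), "label": label}
            for event, label in zip(historical_events, labels)]

# Online path: the request handler calls the exact same function,
# so there is no separately re-implemented (and drifting) copy.
def handle_request(raw_event: dict, model) -> float:
    features = compute_features(raw_event)
    return model.predict_one(features)  # stand-in for whatever scorer you deploy
```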


1.2: Design Principles for Scalable ML Systems

  • Master system properties: Scalability, Availability, Consistency, and Latency.
  • Learn how data flow differs between online prediction and offline training.
  • Understand why ML systems prefer immutable data stores, append-only event logs, and versioned artifacts.
  • Study batch, streaming, and hybrid (lambda) architectures and where each fits best.

Deeper Insight: "What happens if feature generation is delayed by 10 minutes?" This question tests your ability to discuss event-time consistency and serving skew. The best answers reference feature stores or event backfilling strategies.
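
One way to make the delayed-feature discussion concrete is a point-in-time lookup over an append-only log that flags stale values and falls back to a default. The sketch below assumes a toy in-memory log and a hypothetical 10-minute staleness budget.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Point-in-time feature lookup over an append-only event log (illustrative).
# Each entity keeps a time-sorted list of (event_time, value); at serving time
# we take the latest value whose event_time <= request_time, and fall back to
# a default if the pipeline is lagging beyond the staleness budget.

FEATURE_LOG = {
    "user_42": [
        (datetime(2024, 1, 1, 12, 0), 0.8),
        (datetime(2024, 1, 1, 12, 5), 0.9),
    ]
}

MAX_STALENESS = timedelta(minutes=10)

def lookup(entity_id, request_time, default=0.0):
    history = FEATURE_LOG.get(entity_id, [])
    times = [t for t, _ in history]
    idx = bisect_right(times, request_time) - 1
    if idx < 0:
        return default, True                      # no value yet: serve a default
    event_time, value = history[idx]
    stale = (request_time - event_time) > MAX_STALENESS
    return (default if stale else value), stale

value, is_stale = lookup("user_42", datetime(2024, 1, 1, 12, 20))
print(value, is_stale)  # falls back to the default: the freshest feature is 15 min old
```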


1.3: Fault Tolerance, Redundancy, and Consistency Models

  • Study fault domains - what happens when a model server dies mid-inference or a data pipeline job fails mid-run.
  • Learn recovery patterns: checkpointing, retry queues, idempotent writes, and graceful degradation (e.g., fallback models).
  • Understand CAP theorem trade-offs in ML contexts (e.g., why online systems may favor availability over consistency).

Probing Question: "Suppose your model prediction service is down - how do you keep the system functional?" Discuss fallback heuristics, default scores, or last-known-good models to maintain user experience.
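
A typical graceful-degradation path tries the primary prediction service, then a last-known-good in-process model, then a constant default score. The sketch below is a minimal illustration; primary_client, fallback_model, and their predict methods are assumed stand-ins, not a real client library.

```python
import logging

# Graceful-degradation sketch: try the primary model service, fall back to a
# last-known-good local model, and finally to a safe default score. All
# components here are illustrative stand-ins, not a specific library API.

DEFAULT_SCORE = 0.5  # e.g., a population base rate, tuned per product

def score_with_fallback(features, primary_client, fallback_model, timeout_s=0.05):
    try:
        # Primary path: remote prediction service with a tight timeout.
        return primary_client.predict(features, timeout=timeout_s), "primary"
    except Exception as exc:                      # timeout, connection error, 5xx...
        logging.warning("primary scorer failed: %s", exc)
    try:
        # Fallback path: smaller last-known-good model loaded in-process.
        return fallback_model.predict(features), "fallback"
    except Exception as exc:
        logging.error("fallback scorer failed: %s", exc)
    # Last resort: a default score keeps the product flow alive.
    return DEFAULT_SCORE, "default"
```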


1.4: Data and Feature Management Layer

  • Master Feature Store concepts - how features are defined, stored, versioned, and served to both training and inference pipelines.
  • Learn about offline vs. online stores, TTL policies, materialization, and point-in-time correctness.
  • Explore how entity joins, feature freshness, and backfill errors affect model quality.

Probing Question: "How do you ensure the same feature computation during training and inference?" This tests whether you understand feature consistency - often solved by a unified store or feature transformation framework.
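
Point-in-time correctness is often demonstrated with an as-of join: each label row is matched only with feature values observed at or before its timestamp. Below is a minimal pandas sketch over made-up data; merge_asof performs the backward-looking match.

```python
import pandas as pd

# Point-in-time join sketch using pandas.merge_asof: each label row gets the
# most recent feature value observed at or before its label timestamp, which
# prevents leaking post-event information into training. Data is illustrative.

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "event_time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 10:30"]),
    "txn_count_7d": [3, 4, 1],
}).sort_values("event_time")

labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "label_time": pd.to_datetime(["2024-01-01 10:30", "2024-01-01 12:00"]),
    "label": [0, 1],
}).sort_values("label_time")

training_set = pd.merge_asof(
    labels, features,
    left_on="label_time", right_on="event_time",
    by="user_id", direction="backward",   # only features known by the label time
)
print(training_set[["user_id", "label_time", "txn_count_7d", "label"]])
```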


1.5: Real-Time vs. Batch System Trade-offs

  • Compare online scoring (real-time) vs batch scoring (offline) systems.
  • Quantify latency targets - e.g., <100ms for ranking systems vs. minutes for offline scoring.
  • Learn techniques for asynchronous inference, model caching, and pre-computed embeddings to reduce latency.

Deeper Insight: "You're designing an ad-ranking model that must respond in <50ms - what optimizations would you apply?" Top answers discuss feature prefetching, model quantization, and GPU batching vs CPU parallelism trade-offs.
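
A common latency trick from this list is precomputing item embeddings offline and caching per-user embeddings, so the request path reduces to a single matrix-vector product. The sketch below uses random embeddings and a hypothetical user tower purely for illustration.

```python
from functools import lru_cache
import numpy as np

# Latency-reduction sketch: item embeddings are precomputed offline and held
# in memory, so request-time scoring is a matrix-vector product instead of a
# full model forward pass. Shapes, seeds, and names are illustrative.

N_ITEMS, DIM = 10_000, 64
rng = np.random.default_rng(0)
ITEM_EMBEDDINGS = rng.normal(size=(N_ITEMS, DIM)).astype(np.float32)  # built offline

@lru_cache(maxsize=100_000)
def user_embedding(user_id: str) -> tuple:
    # Stand-in for an (expensive) user-tower forward pass; cached per user.
    seed = abs(hash(user_id)) % (2**32)
    return tuple(np.random.default_rng(seed).normal(size=DIM).astype(np.float32))

def top_k(user_id: str, k: int = 10) -> np.ndarray:
    u = np.asarray(user_embedding(user_id), dtype=np.float32)
    scores = ITEM_EMBEDDINGS @ u                  # one BLAS call over all items
    return np.argpartition(-scores, k)[:k]        # indices of the k best items (unordered)

print(top_k("user_42"))
```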


1.6: Model Versioning and Deployment Architecture

  • Study the model lifecycle: training → evaluation → registration → shadow testing → A/B rollout → monitoring.

  • Learn deployment patterns:

    • Canary Deployments
    • Shadow Mode Inference
    • Blue/Green Model Switching
  • Understand feature compatibility and schema versioning across model generations.

Probing Question: "If your new model performs better offline but worse online, what's your debugging approach?" Discuss data leakage, feedback loop bias, or stale features as likely culprits.
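
Canary and shadow deployments usually come down to stable traffic bucketing plus logging the candidate's predictions without returning them. The sketch below assumes generic stable_model / candidate_model objects and a 5% canary share; it is not tied to any particular serving platform.

```python
import hashlib
import logging

# Deployment-routing sketch: stable hash-based canary split plus shadow-mode
# scoring of the candidate model. The model objects and the 5% canary share
# are illustrative assumptions.

CANARY_FRACTION = 0.05

def in_canary(request_id: str) -> bool:
    # Stable bucketing: the same request/user id always lands in the same arm.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < CANARY_FRACTION * 10_000

def serve(request_id: str, features, stable_model, candidate_model, shadow=True):
    if in_canary(request_id):
        return candidate_model.predict(features)      # canary traffic
    prediction = stable_model.predict(features)       # everyone else
    if shadow:
        try:
            # Shadow mode: score with the candidate too, log it, never return it.
            logging.info("shadow=%s", candidate_model.predict(features))
        except Exception as exc:
            logging.warning("shadow scoring failed: %s", exc)
    return prediction
```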


1.7: Monitoring, Drift Detection, and Feedback Loops

  • Learn to instrument metrics: data quality, prediction drift, latency, error rate, user engagement.
  • Implement population stability index (PSI) and KL divergence for drift detection.
  • Understand closed-loop retraining - how fresh labels or user feedback re-enter the system.

Deeper Insight: "How do you detect that your recommendation system is decaying in quality?" The interviewer wants to hear about proxy metrics (CTR, engagement time), statistical drift, and alerting thresholds.
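
A minimal PSI implementation, with quantile bins taken from the reference (training-time) score distribution, is sketched below; the 0.2 alert threshold is a common rule of thumb rather than a universal constant, and the data is synthetic.

```python
import numpy as np

# Population Stability Index (PSI) sketch: bin the reference (training-time)
# score distribution, compare the live distribution against those bins, and
# alert above a threshold.

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10, eps: float = 1e-6) -> float:
    # Bin edges come from the reference distribution (quantile bins).
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range live values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, eps, None)          # avoid log(0)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_scores = rng.beta(2, 5, size=50_000)
live_scores = rng.beta(2.5, 5, size=50_000)          # mildly shifted distribution
print(f"PSI = {psi(train_scores, live_scores):.3f}  -> common rule of thumb: alert if > 0.2")
```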


1.8: Multi-Tenancy and Resource Management

  • Explore how large organizations serve multiple models per team or product.
  • Learn about model serving platforms (TensorFlow Serving, Triton, Ray Serve) that handle routing, scaling, and concurrency.
  • Study how autoscaling policies, GPU/CPU allocation, and container orchestration (Kubernetes) ensure reliable multi-tenant serving.

Probing Question: "How would you design a platform that can serve 100 models with different latency SLAs?" Be ready to discuss resource partitioning, load balancing, and dynamic batching.
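
One way to reason about many models with different SLAs is a simple capacity plan: per-model replica counts derived from measured throughput, plus a rule for when dynamic batching is safe. The numbers, the 2x headroom factor, and the batching rule in this sketch are illustrative assumptions.

```python
import math
from dataclasses import dataclass

# Capacity-planning sketch for a multi-tenant serving platform: given each
# model's latency SLA, expected QPS, and single-replica throughput, compute a
# replica count and decide whether dynamic batching is safe.

@dataclass
class ModelSpec:
    name: str
    sla_ms: float           # p99 latency budget
    expected_qps: float
    per_replica_qps: float  # measured single-replica throughput
    batch_window_ms: float = 5.0

def plan(models, headroom=2.0):
    placements = []
    for m in models:
        replicas = max(1, math.ceil(m.expected_qps * headroom / m.per_replica_qps))
        # Dynamic batching adds up to one batch window of queueing delay, so it
        # is only enabled when the SLA can absorb that extra wait.
        batching = m.batch_window_ms <= 0.2 * m.sla_ms
        placements.append({"model": m.name, "replicas": replicas, "dynamic_batching": batching})
    return placements

specs = [
    ModelSpec("ads_ranker", sla_ms=50, expected_qps=4000, per_replica_qps=600),
    ModelSpec("churn_batch", sla_ms=2000, expected_qps=50, per_replica_qps=40),
]
for placement in plan(specs):
    print(placement)
```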


1.9: Security, Privacy, and Governance

  • Understand data encryption, PII redaction, model access control, and audit trails.
  • Learn about model inversion and membership inference attacks.
  • Study governance frameworks: Model Cards, Lineage Tracking, and Explainability Reports.

Deeper Insight: "What's the difference between data-level privacy and model-level privacy?" Discuss Differential Privacy, Federated Learning, and Secure Aggregation mechanisms.
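
For data-level privacy, the Laplace mechanism is the standard starting point: add noise scaled to sensitivity divided by epsilon when releasing an aggregate. The sketch below releases a single noisy count; the epsilon value and the query are illustrative, and real systems also track a cumulative privacy budget across queries.

```python
import numpy as np

# Laplace-mechanism sketch for data-level differential privacy: release a
# count with noise calibrated to sensitivity / epsilon. Epsilon and the query
# are illustrative choices.

def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    # Adding or removing one person changes a count by at most 1 (the sensitivity),
    # so Laplace noise with scale sensitivity/epsilon gives epsilon-DP for this query.
    scale = sensitivity / epsilon
    return true_count + np.random.default_rng().laplace(0.0, scale)

print(dp_count(10_432, epsilon=0.5))  # noisier release under a stricter budget
```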


1.10: Putting It All Together β€” Designing End-to-End Systems

  • Combine concepts into case studies:

    • Fraud Detection - streaming inference, high recall, event-time joins
    • Recommendation Engine - user/item embeddings, retrieval + ranking stack
    • Ads Ranking System - real-time auction, latency-constrained scoring, multi-objective optimization
  • Practice designing with trade-offs:

    • Accuracy vs. Latency
    • Personalization vs. Scalability
    • Freshness vs. Stability

Probing Question: "If you had to re-architect your model to cut latency in half without losing much accuracy - what knobs can you turn?" The best candidates discuss model distillation, feature pruning, caching, and approximation algorithms.
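
Many of these knobs show up naturally in a two-stage retrieval + ranking stack, where a cheap candidate generator bounds how much work the expensive ranker does. The sketch below uses random embeddings and a placeholder heavy ranker purely to show the structure.

```python
import numpy as np

# Two-stage retrieval + ranking sketch (as in the recommendation/ads case
# studies above): a cheap embedding dot-product retrieves a few hundred
# candidates, then a heavier ranker scores only that short list. The "heavy
# ranker" here is a placeholder, not a real model.

rng = np.random.default_rng(0)
N_ITEMS, DIM = 100_000, 64
ITEM_EMB = rng.normal(size=(N_ITEMS, DIM)).astype(np.float32)

def retrieve(user_emb: np.ndarray, k: int = 300) -> np.ndarray:
    scores = ITEM_EMB @ user_emb                    # cheap stage over all items
    return np.argpartition(-scores, k)[:k]

def heavy_rank(user_emb: np.ndarray, candidate_ids: np.ndarray, k: int = 10) -> np.ndarray:
    # Placeholder for an expensive cross-feature model applied only to candidates.
    scores = (ITEM_EMB[candidate_ids] @ user_emb) + rng.normal(0, 0.1, len(candidate_ids))
    order = np.argsort(-scores)[:k]
    return candidate_ids[order]

user = rng.normal(size=DIM).astype(np.float32)
print(heavy_rank(user, retrieve(user)))             # final top-10 item ids
```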
