ML System Design Infrastructure - Roadmap
Machine learning infrastructure forms the backbone of reliable, scalable AI systems.
You’ll be evaluated on how well you connect data, models, and deployment pipelines, and how you design systems that scale gracefully while maintaining reproducibility.
⚙️ 1. Foundations of ML Infrastructure
Note
The Interview Angle: This tests your understanding of how ML systems differ from traditional software systems — specifically in terms of stateful data, non-deterministic training, and continuous learning loops. You’ll often be asked to design a scalable, maintainable ML workflow diagram from scratch.
1.1: Understand the End-to-End ML Lifecycle
- Study the iterative nature of ML systems — Data → Feature Engineering → Training → Deployment → Monitoring → Feedback.
- Understand the concept of drift — both data drift and concept drift — and how it impacts retraining frequency.
- Map components to infrastructure:
- Data Lake → Feature Store
- Experiment Tracker → Model Registry
- CI/CD → Model Deployment
- Monitoring → Model Governance
Deeper Insight: Expect a question like, “If your model accuracy drops after deployment, where do you start debugging?” You should discuss data versioning, feature consistency, and monitoring signals.
1.2: Learn the Infrastructure Stack Layers
- Compute layer (GPUs, Kubernetes, distributed training systems like Ray or SageMaker).
- Storage layer (object storage for datasets, model artifacts, logs).
- Workflow orchestration (Airflow, Kubeflow, MLflow, Metaflow).
- Observability layer (Prometheus, Grafana, Sentry, model performance dashboards).
Probing Question: “If you had to design a reproducible ML experiment pipeline, which layer is the most critical to get right first — and why?”
🧱 2. Model Registry and Experiment Tracking
Note
The Interview Angle: Used to assess your grasp of model lineage, governance, and reproducibility. Top interviewers want to see if you can explain how you’d recover or roll back to a previous model version and ensure consistent performance across environments.
2.1: Understand Model Versioning
- Learn how to version models, parameters, and metrics. Tools: MLflow, Weights & Biases, or custom registries.
- Understand semantic versioning (e.g., `1.2.0` → major/minor/patch updates).
- Discuss model lineage — what data, code, and hyperparameters produced each model.
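To make this concrete, here is a minimal sketch of logging parameters, metrics, a data-version tag, and the model artifact together so each run is traceable. It assumes MLflow and scikit-learn are installed; the experiment name, tag, and model are illustrative, and exact MLflow signatures vary slightly across releases.

```python
# Minimal experiment-tracking sketch with MLflow (names are illustrative).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=42)
params = {"C": 0.5, "max_iter": 200}

mlflow.set_experiment("churn-model")                 # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression(**params).fit(X, y)
    mlflow.log_params(params)                        # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.set_tag("data_version", "v3")             # lineage: which dataset produced the model
    mlflow.sklearn.log_model(model, artifact_path="model")  # versioned artifact
```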
Probing Question: “Your model performance regresses. How do you trace it back to the dataset version or hyperparameter that caused it?”
2.2: Build a Model Registry Conceptually
- Study the components:
- Model metadata store (schema: `model_name`, `version`, `metrics`, `artifact_path`).
- Approval workflows (staging → production).
- Rollback and deprecation mechanisms.
- Implement a simple registry with MLflow and SQLite for practice.
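A minimal sketch of that practice exercise, assuming MLflow with a local SQLite file as its backend store; the model name is illustrative and registry APIs differ slightly between MLflow versions.

```python
# Toy model registry: MLflow backed by a local SQLite metadata store.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")       # registry metadata lives in SQLite

X, y = make_classification(n_samples=200, random_state=0)
with mlflow.start_run() as run:
    model = LogisticRegression().fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register the logged artifact under a named model; versions auto-increment (1, 2, ...).
mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/model", name="churn-model")

client = MlflowClient()
print(client.get_registered_model("churn-model"))    # name, versions, and metadata
```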
Deeper Insight: Discuss trade-offs between centralized and distributed registries — a central registry provides consistency, while distributed registries give individual teams more autonomy in multi-team environments.
🔁 3. CI/CD for Machine Learning
Note
The Interview Angle: This evaluates your ability to design continuous integration and continuous deployment for non-deterministic workloads. You’ll be asked to ensure automation without losing control over data and reproducibility.
3.1: Understand the Differences Between ML CI/CD and Software CI/CD
- In traditional CI/CD: Code → Build → Test → Deploy.
- In ML CI/CD: Data + Code + Model → Train → Validate → Register → Deploy.
- Learn pipeline triggers:
- New data arrival
- Performance degradation
- Manual approval for promotion
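To make the degradation and staleness triggers concrete, here is a rough sketch of a retraining gate a pipeline might call before kicking off a run; the metric source, thresholds, and function names are hypothetical.

```python
# Hypothetical retraining trigger: gate on live accuracy and model age.
from dataclasses import dataclass

@dataclass
class TriggerConfig:
    min_accuracy: float = 0.85       # retrain when the live metric falls below this
    max_days_since_train: int = 30   # or when the current model is simply too old

def should_retrain(live_accuracy: float, days_since_train: int,
                   cfg: TriggerConfig = TriggerConfig()) -> bool:
    """Decide whether the pipeline should kick off a new training run."""
    return live_accuracy < cfg.min_accuracy or days_since_train > cfg.max_days_since_train

print(should_retrain(live_accuracy=0.82, days_since_train=12))  # True: metric degraded
```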
Probing Question: “Why can’t we deploy ML models with the same CI/CD pipeline used for backend services?”
3.2: Build an ML Deployment Pipeline
- Use GitHub Actions or Jenkins to automate:
- Data validation
- Model training
- Evaluation thresholds
- Deployment to staging and production
- Integrate MLflow or SageMaker for model promotion (see the sketch below).
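A sketch of the promotion step, assuming the MLflow registry from earlier; `transition_model_version_stage` is the classic staging API (newer MLflow releases favor model aliases instead), and the model name and version are illustrative.

```python
# Promote a registered model version from Staging to Production (illustrative names).
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="sqlite:///mlflow.db")
client.transition_model_version_stage(
    name="churn-model",
    version="3",                        # the version that passed evaluation thresholds
    stage="Production",
    archive_existing_versions=True,     # demote whatever was serving before
)
```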
Deeper Insight: Explain shadow deployment and canary release strategies for safe rollout. Be able to describe how you’d measure success before full rollout.
🧩 4. Feature Store and Data Consistency
Note
The Interview Angle: Evaluates whether you understand training-serving skew — one of the most common sources of real-world ML failures. Top candidates can explain how feature stores mitigate this.
4.1: Core Concepts of Feature Store
- Learn about offline store (training data) and online store (serving data).
- Understand how a feature store ensures consistent transformations between both.
- Study point-in-time correctness to prevent data leakage.
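A small pandas sketch of point-in-time correctness: for each label row, take the most recent feature value observed at or before the label timestamp and never after, which is what prevents leakage. Column names and values are illustrative.

```python
# Point-in-time join: only use feature values known at or before each label time.
import pandas as pd

features = pd.DataFrame({
    "entity_id": [1, 1, 1],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
    "avg_spend_7d": [10.0, 14.0, 22.0],
})
labels = pd.DataFrame({
    "entity_id": [1, 1],
    "label_time": pd.to_datetime(["2024-01-06", "2024-01-12"]),
    "churned": [0, 1],
})

# merge_asof picks, per label row, the latest feature row with event_time <= label_time.
training_set = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("event_time"),
    left_on="label_time",
    right_on="event_time",
    by="entity_id",
    direction="backward",
)
print(training_set[["entity_id", "label_time", "avg_spend_7d", "churned"]])
# A naive join on the latest feature value would leak the 2024-01-10 observation
# into the 2024-01-06 label row.
```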
Probing Question: “What is point-in-time correctness and how does it differ from a simple join on timestamp?”
4.2: Design a Minimal Feature Store
- Model a simple schema: `feature_name`, `entity_id`, `timestamp`, `value`.
- Implement a mini version using Feast or even PostgreSQL.
- Discuss feature versioning and backfill strategies.
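A tiny sketch of the offline side of that schema, using SQLite as a stand-in for PostgreSQL; table, feature, and entity names are illustrative.

```python
# Minimal offline feature table (SQLite standing in for PostgreSQL; names illustrative).
import sqlite3

conn = sqlite3.connect("features.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS feature_values (
        feature_name TEXT NOT NULL,
        entity_id    TEXT NOT NULL,
        timestamp    TEXT NOT NULL,   -- event time, used for point-in-time lookups
        value        REAL,
        PRIMARY KEY (feature_name, entity_id, timestamp)
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO feature_values VALUES (?, ?, ?, ?)",
    ("avg_spend_7d", "user_42", "2024-01-05T00:00:00", 14.0),
)
conn.commit()

# Read path: latest value for an entity as of a given timestamp.
row = conn.execute(
    """
    SELECT value FROM feature_values
    WHERE feature_name = ? AND entity_id = ? AND timestamp <= ?
    ORDER BY timestamp DESC LIMIT 1
    """,
    ("avg_spend_7d", "user_42", "2024-01-06T00:00:00"),
).fetchone()
print(row)  # (14.0,)
```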
Deeper Insight: Be prepared to compare data warehouse-based vs. real-time feature stores in terms of latency and cost.
🧠 5. Workflow Orchestration and Automation
Note
The Interview Angle: Used to test your understanding of pipeline orchestration, failure recovery, and DAG-based scheduling. Hiring panels care about how you handle retraining at scale and dependency management.
5.1: Orchestrate ML Pipelines
- Study Airflow, Kubeflow Pipelines, or Metaflow.
- Learn about DAGs, operators, sensors, and backfilling.
- Understand idempotency and checkpointing for resilient pipelines.
- Learn retry policies and data dependency handling.
Probing Question: “If your model training step fails mid-way, how do you ensure the pipeline can recover without retraining everything?”
5.2: Build a Simple DAG-Based Workflow
- Create an Airflow DAG:
- Task 1: Data Validation
- Task 2: Feature Engineering
- Task 3: Model Training
- Task 4: Model Evaluation and Registration
- Task 5: Deployment
- Integrate model metadata logging and artifact storage.
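A sketch of the five-task DAG above, assuming Airflow 2.x; the task bodies are placeholders, and scheduling parameter names differ slightly between Airflow releases.

```python
# Sketch of the five-task training DAG (Airflow 2.x; task bodies are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_data():
    print("check schema, nulls, and distributions")

def engineer_features():
    print("build features and write them to the offline store")

def train_model():
    print("train the model and checkpoint artifacts")

def evaluate_and_register():
    print("compare against thresholds; register the model if it passes")

def deploy_model():
    print("promote the registered version to staging/production")

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # named schedule_interval on older releases
    catchup=False,
    default_args={"retries": 2},     # simple retry policy for transient failures
) as dag:
    t1 = PythonOperator(task_id="data_validation", python_callable=validate_data)
    t2 = PythonOperator(task_id="feature_engineering", python_callable=engineer_features)
    t3 = PythonOperator(task_id="model_training", python_callable=train_model)
    t4 = PythonOperator(task_id="evaluation_and_registration", python_callable=evaluate_and_register)
    t5 = PythonOperator(task_id="deployment", python_callable=deploy_model)

    t1 >> t2 >> t3 >> t4 >> t5       # linear dependency chain
```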
Deeper Insight: Discuss trade-offs between push-based (Airflow) and pull-based (Kubeflow) orchestration models.
📊 6. Monitoring, Logging, and Model Governance
Note
The Interview Angle: This section tests if you think like a responsible engineer. Can you detect model drift, explain metrics, and trace issues across the data-model boundary?
6.1: Model Performance Monitoring
- Track model-level metrics (accuracy, precision, recall, F1).
- Track data-level metrics (distribution shifts, missing value ratio).
- Use monitoring tools like Evidently AI, Prometheus, or Grafana.
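A sketch of a label-free, data-level drift check: compare a feature's live distribution against its training distribution, here with SciPy's two-sample KS test; the data and threshold are illustrative.

```python
# Label-free drift check: compare a feature's live distribution to its training one.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # distribution at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # shifted distribution in production

stat, p_value = ks_2samp(train_feature, live_feature)
DRIFT_P_THRESHOLD = 0.01   # illustrative; tune per feature and traffic volume
if p_value < DRIFT_P_THRESHOLD:
    print(f"possible drift: KS statistic={stat:.3f}, p={p_value:.2e}")
```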
Probing Question: “How would you know your model silently degraded in production without access to ground truth?”
6.2: Build Governance Workflows
- Define governance policies — model approval, retention, audit logs.
- Understand model cards and data lineage tracking.
- Study compliance frameworks (GDPR, AI Act).
Deeper Insight: Explain how governance differs between regulated (finance, healthcare) and unregulated domains.
🚀 7. Scaling and Cost Optimization
Note
The Interview Angle: This tests your ability to handle system-level trade-offs — cost vs. latency vs. reliability. Top candidates can reason about parallelization, caching, and serving infrastructure.
7.1: Scaling Training and Serving
- Study distributed training (Data Parallelism, Model Parallelism).
- Learn serving architectures:
- Batch inference
- Online inference
- Streaming inference
- Implement autoscaling using Kubernetes Horizontal Pod Autoscaler.
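A compressed data-parallelism sketch in PyTorch, assuming a torchrun launch (e.g., `torchrun --nproc_per_node=2 train.py`); the model and data are toy placeholders.

```python
# Data-parallel training skeleton for a torchrun launch (toy model and data).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU nodes
    rank = int(os.environ["LOCAL_RANK"])      # set by torchrun

    model = DDP(torch.nn.Linear(16, 1))       # gradients are all-reduced across workers
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 16)               # each rank would read its own data shard
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                       # gradient sync happens here
        opt.step()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```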
Probing Question: “How would you reduce inference latency without retraining your model?”
7.2: Cost Optimization
- Use spot instances or serverless compute for ephemeral workloads.
- Cache embeddings or intermediate computations.
- Design data pipelines with lazy evaluation to reduce cost.
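As a toy illustration of caching embeddings, the sketch below memoizes an embedding call so repeated entities skip recomputation; `embed` is a stand-in for a real model or API call.

```python
# Cache expensive embedding computations for repeated inputs (embed() is a stand-in).
import hashlib
from functools import lru_cache

@lru_cache(maxsize=100_000)
def embed(text: str) -> tuple:
    # Stand-in for a real inference call: a deterministic fake vector from a hash.
    digest = hashlib.sha256(text.encode()).digest()
    return tuple(b / 255.0 for b in digest[:8])

v1 = embed("premium user, last login 2 days ago")
v2 = embed("premium user, last login 2 days ago")   # served from the cache
print(v1 == v2, embed.cache_info())
```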
Deeper Insight: Discuss trade-offs between low latency (e.g., always-on GPUs) and cost efficiency (e.g., on-demand compute with cold starts).
🧩 8. Infrastructure as Code (IaC) and Security
Note
The Interview Angle: Tests your ability to operationalize ML systems — provisioning reproducible infrastructure and ensuring data and model security.
8.1: Infrastructure as Code
- Learn Terraform or AWS CloudFormation basics.
- Automate resource provisioning (S3 buckets, EC2 clusters, EKS nodes).
- Maintain environment parity across staging and production.
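The roadmap names Terraform and CloudFormation; as a Python-flavored illustration of the same idea (declaring resources in code and reusing the definition across environments), here is a rough AWS CDK sketch. Stack and bucket names are illustrative, and the exact CDK API may differ between versions.

```python
# Rough IaC sketch with AWS CDK (v2): one stack definition reused for staging and production.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class MLDataStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "DatasetBucket",
            versioned=True,                       # keep dataset versions for reproducibility
            removal_policy=RemovalPolicy.RETAIN,  # don't delete data when the stack is destroyed
        )

app = App()
MLDataStack(app, "ml-data-staging")
MLDataStack(app, "ml-data-production")   # identical definition keeps environment parity
app.synth()
```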
Probing Question: “How do you ensure reproducibility when retraining models across environments?”
8.2: Security and Access Control
- Implement RBAC for model registries and feature stores.
- Encrypt data at rest and in transit.
- Audit API access for model endpoints.
Deeper Insight: Discuss least privilege principle and secret management in ML environments.