2.2. Build a Model Registry Conceptually

AI System Design Interview Guide (2025)


🪄 Step 1: Intuition & Motivation

Core Idea: Imagine you’re running a fleet of models — recommendation, fraud detection, pricing, demand forecasting — all trained by different teams. How do you know which model is live? Which version is better? Who deployed it? When should an old one be retired? That’s where a Model Registry comes in — it’s the “central nervous system” of your ML infrastructure that keeps your models organized, traceable, and safely deployable.
Simple Analogy: Think of the Model Registry as an app store for your ML models.
- Each app (model) has versions, release notes (metrics), and stages (beta, production).
- You can promote, roll back, or retire an app at any time, all with traceability. It’s how large organizations prevent chaos when hundreds of models are in motion.

🌱 Step 2: Core Concept

A Model Registry is a structured database plus a workflow system that manages the lifecycle of models — from creation to deprecation.

Let’s break it down layer by layer.

1️⃣ Model Metadata Store — The Brain

The metadata store is the core of a model registry. It tracks everything about your models in a structured way, much like a library catalog.

A typical schema might look like this:

| Field | Description |
|---|---|
| `model_name` | Logical name of the model (e.g., “fraud_detector”) |
| `version` | Semantic version number (e.g., 2.1.0) |
| `metrics` | Evaluation results (accuracy, F1, latency) |
| `artifact_path` | Where the model is stored (e.g., S3, GCS, or local path) |
| `data_version` | Dataset reference used for training |
| `created_by` | User or team that trained the model |
| `stage` | Current stage (staging, production, archived) |
| `timestamp` | When it was created or promoted |

This schema ensures every model entry is a snapshot of truth — linking artifacts, metadata, and governance info.
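As a rough illustration, the schema can be sketched as a Python dataclass plus a tiny in-memory store. The field names come from the table; `ModelEntry`, `MetadataStore`, the bucket path, and the dataset tag are hypothetical names invented for this example, and a real registry would sit on top of a database rather than a dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelEntry:
    """One registry record; field names mirror the schema table."""
    model_name: str
    version: str
    metrics: dict
    artifact_path: str
    data_version: str
    created_by: str
    stage: str = "staging"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class MetadataStore:
    """In-memory stand-in for the registry database (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # A (name, version) pair identifies exactly one immutable record.
        key = (entry.model_name, entry.version)
        if key in self._entries:
            raise ValueError(f"{key} is already registered")
        self._entries[key] = entry
        return entry

    def get(self, model_name, version):
        return self._entries[(model_name, version)]


store = MetadataStore()
store.register(ModelEntry(
    model_name="fraud_detector",
    version="2.1.0",
    metrics={"accuracy": 0.93, "f1": 0.88},  # illustrative numbers
    artifact_path="s3://example-bucket/fraud_detector/2.1.0/model.pkl",  # made-up path
    data_version="transactions-2025-01",  # made-up dataset tag
    created_by="risk-team",
))
print(store.get("fraud_detector", "2.1.0").stage)  # staging
```

Rejecting a duplicate `(model_name, version)` key is one simple way to keep each entry a fixed snapshot: a changed model gets a new version instead of overwriting an old record.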

💡 Intuition: The metadata store is your “model Wikipedia” — each entry tells you the who, what, when, and how behind a model.

2️⃣ Approval Workflows — The Gatekeeper

Models shouldn’t jump from research to production overnight. Approval workflows enforce quality control and governance.

Typical lifecycle stages:

Staging: The model has been trained and validated internally.
Production: The model has passed performance, fairness, and compliance checks.
Archived / Deprecated: The model is outdated or replaced by a newer version.

Promotion Rules Example:

Accuracy above 0.9 ✅
No performance regressions on key metrics ✅
Model card approved by reviewer ✅

Only then does the model move from “Staging” → “Production.”
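The three example rules can be sketched as a single gate function. This is a minimal sketch, not a prescribed implementation: `promotion_blockers` is a hypothetical helper, the metric values are invented, and it assumes every key metric is "higher is better" (a latency metric would need the comparison flipped).

```python
def promotion_blockers(candidate, production, model_card_approved,
                       min_accuracy=0.9, key_metrics=("accuracy", "f1")):
    """Return the failed rules; an empty list means the model may be promoted."""
    blockers = []
    # Rule 1: accuracy strictly above the threshold.
    if candidate["accuracy"] <= min_accuracy:
        blockers.append("accuracy not above threshold")
    # Rule 2: no regression against the current production model.
    for name in key_metrics:
        if candidate[name] < production[name]:
            blockers.append(f"regression on {name}")
    # Rule 3: a reviewer has signed off on the model card.
    if not model_card_approved:
        blockers.append("model card not approved")
    return blockers


production = {"accuracy": 0.91, "f1": 0.86}
candidate = {"accuracy": 0.93, "f1": 0.88}

print(promotion_blockers(candidate, production, model_card_approved=True))   # []
print(promotion_blockers(candidate, production, model_card_approved=False))  # ['model card not approved']
```

Returning the list of failed rules, rather than a bare boolean, gives reviewers a reason for each rejection.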

💡 Intuition: Think of promotion like a “passport control” for your models — no entry to production without proper checks.

3️⃣ Rollback and Deprecation — The Safety Net

Even after promotion, a model can fail unexpectedly — maybe data drift or unseen edge cases. Rollback mechanisms ensure you can revert to a stable version quickly.

Key Concepts:

Rollback: Instantly revert to a previous version if the new one misbehaves.
Deprecation: Officially retire models that are no longer valid or supported.
Audit Trails: Keep records of who changed what and when.

Example: If fraud_detector v2.1.0 starts flagging too many false positives, you can roll back to v2.0.1 (the last known good model) with a single command or approval click.
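The fraud_detector scenario might look like this in code. It is a sketch under stated assumptions: `ProductionPointer` is a hypothetical class, the user names are invented, and "last known good" is assumed to be simply the version that was live immediately before the current one.

```python
from datetime import datetime, timezone


class ProductionPointer:
    """Tracks which version serves traffic and keeps an append-only audit trail."""

    def __init__(self, model_name, initial_version):
        self.model_name = model_name
        self.history = [initial_version]  # versions that have been live, oldest first
        self.audit_log = []

    @property
    def live(self):
        return self.history[-1]

    def _record(self, action, user, from_version, to_version):
        # Audit trail: who changed what, and when.
        self.audit_log.append({
            "action": action, "user": user,
            "from": from_version, "to": to_version,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def promote(self, version, user):
        self._record("promote", user, self.live, version)
        self.history.append(version)

    def rollback(self, user):
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        bad = self.history.pop()  # drop the misbehaving version
        self._record("rollback", user, bad, self.live)
        return self.live


pointer = ProductionPointer("fraud_detector", "2.0.1")
pointer.promote("2.1.0", user="alice")
print(pointer.rollback(user="bob"))              # 2.0.1
print([e["action"] for e in pointer.audit_log])  # ['promote', 'rollback']
```

Note that the rollback only moves a pointer. The old artifact has to still exist in the registry for this to work, which is why archived versions are kept rather than deleted.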

💡 Intuition: Rollbacks are your “undo” button in ML — safety and trust depend on them.

📐 Step 3: Mathematical Foundation

While model registries are mostly architectural, there’s one elegant conceptual relation worth formalizing — model lifecycle transitions.

Model Lifecycle as a State Transition System

You can think of each model’s lifecycle as a finite state machine (FSM):

$$ S = \{ \text{Staging}, \text{Production}, \text{Archived} \} $$

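And transitions:

$$ T = \{ (\text{Staging} \rightarrow \text{Production}), (\text{Production} \rightarrow \text{Archived}), (\text{Archived} \rightarrow \text{Production}) \} $$

Rollback is not a state of its own: it is the last transition, in which a previously archived version is restored to Production while the misbehaving version is archived.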

Each transition $t \in T$ must satisfy certain guard conditions — for example:

$$ \text{Accuracy}_{\text{new}} > \text{Accuracy}_{\text{old}} - \epsilon $$

where $\epsilon$ is the tolerated performance drop (for example, 0.01).

These formal rules ensure models move through the system safely and predictably.
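A small sketch of the state machine with its guard condition follows. The function names are hypothetical, the accuracy values are invented, and rollback is modeled here as an assumption: an earlier, archived version is re-promoted to production.

```python
ALLOWED = {
    ("staging", "production"),   # promotion
    ("production", "archived"),  # retirement
    ("archived", "production"),  # rollback (assumed: re-promote an earlier version)
}


def accuracy_guard(acc_new, acc_old, epsilon=0.01):
    """Guard condition: Accuracy_new > Accuracy_old - epsilon."""
    return acc_new > acc_old - epsilon


def transition(state, target, guard_ok=True):
    """Return the new state, or raise if the move is illegal or the guard failed."""
    if (state, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {state} -> {target}")
    if not guard_ok:
        raise ValueError(f"guard failed for {state} -> {target}")
    return target


state = "staging"
# New model at 0.925 vs. old at 0.93: within the 0.01 tolerance, so the guard passes.
state = transition(state, "production", guard_ok=accuracy_guard(0.925, 0.93))
print(state)  # production

# A drop to 0.90 exceeds the tolerance, so the guard would block promotion.
print(accuracy_guard(0.90, 0.93))  # False
```

Keeping the allowed transitions in one explicit set means any move not listed, such as staging straight to archived, fails loudly instead of silently succeeding.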

A Model Registry is like a well-regulated airport: Models can only take off (deploy) or land (rollback) after passing safety checks — never jumping states arbitrarily.

🧠 Step 4: Key Ideas

Single Source of Truth: Every model version, artifact, and metric should exist in one consistent place.
Reproducibility: The registry should make it possible to re-train or reload any historical model.
Controlled Promotion: No model enters production without checks.
Auditability: Every change — training, promotion, or rollback — must be logged and attributable.
Interoperability: Registries should integrate easily with CI/CD, monitoring, and feature stores.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

Provides governance and accountability.
Enables collaboration and transparency across teams.
Simplifies debugging and rollback in production.

Limitations:

Setting up centralized governance can slow experimentation.
Requires consistent schema and discipline across teams.
Can become a bottleneck if access is not automated or well-managed.

Centralized Registry:

✅ Ensures consistency, traceability, compliance.
⚠️ Less flexible; teams depend on a central admin.

Distributed Registry:

✅ Enables team autonomy and faster iteration.
⚠️ Harder to maintain global visibility and cross-team reproducibility.

The ideal enterprise solution? → Hybrid: Central governance with local team registries synced to a global catalog.
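One way to picture the hybrid setup is a sync step that copies each team's local entries into a global catalog. This is only a sketch: `sync_to_global`, the team names, and the dict-based registries are all hypothetical, and a real system would also need to handle updates and conflicts.

```python
def sync_to_global(global_catalog, team, local_registry):
    """Copy a team's entries into the global catalog, namespaced by team."""
    added = 0
    for (model_name, version), entry in local_registry.items():
        key = (team, model_name, version)
        if key not in global_catalog:
            global_catalog[key] = dict(entry)  # copy so later local edits don't leak
            added += 1
    return added


global_catalog = {}
risk_team = {("fraud_detector", "2.1.0"): {"stage": "production"}}
pricing_team = {("price_model", "1.4.0"): {"stage": "staging"}}

print(sync_to_global(global_catalog, "risk", risk_team))        # 1
print(sync_to_global(global_catalog, "pricing", pricing_team))  # 1
print(sync_to_global(global_catalog, "risk", risk_team))        # 0 (already synced)
```

Teams keep iterating in their own registries, while the global catalog gives governance a single place to see what exists across the organization.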

🚧 Step 6: Common Misunderstandings


“A model registry is just a file store.” Wrong — it’s not only where models live but also how they’re governed, promoted, and tracked.
“Once a model is in production, we can delete the old ones.” Dangerous — old models are your fallback mechanism for rollback or audits.
“Manual updates are fine.” Not scalable. Mature ML platforms integrate registry updates into CI/CD so that logging and versioning happen automatically.

🧩 Step 7: Mini Summary

🧠 What You Learned: A model registry manages models like an app store — storing metadata, controlling promotion, and enabling rollbacks safely.

⚙️ How It Works: It combines a structured metadata store, approval workflow, and rollback mechanism — ensuring every model in production is traceable and reversible.

🎯 Why It Matters: Without a registry, model chaos ensues — teams lose track of versions, can’t reproduce results, and risk deploying untested models.
