8.2. Security and Access Control
🪄 Step 1: Intuition & Motivation
Core Idea: Machine learning systems don’t just make predictions; they also hold highly valuable assets: sensitive user information, business insights, and proprietary models. If that data leaks or a model is tampered with, it isn’t just an engineering bug; it’s a trust disaster.
Security and access control ensure that only the right people and services can touch the right resources, under strict rules.
Simple Analogy: Think of your ML system as a vault in a bank.
- RBAC decides who gets which key.
- Encryption ensures even if someone breaks in, they can’t read what’s inside.
- Auditing keeps track of every key turn and vault access.

Together, they create a trustworthy system where safety isn’t an afterthought — it’s a design principle.
🌱 Step 2: Core Concept
Security in ML isn’t about paranoia — it’s about discipline. You protect three assets:
- Data (features, datasets, labels)
- Models (weights, embeddings, metadata)
- Pipelines (training, serving, and monitoring systems)
Let’s explore how each of the three main pillars — RBAC, encryption, and auditing — fortify this foundation.
1️⃣ Role-Based Access Control (RBAC) — The Gatekeeper
RBAC defines who can do what on which resource. Instead of granting blanket admin access, you assign permissions by role — e.g., “Data Scientist,” “MLOps Engineer,” or “Auditor.”
🧠 Typical ML RBAC Hierarchy:
| Role | Permissions | Example Actions |
|---|---|---|
| Data Scientist | Read/write on datasets, train models | Upload training data, run experiments |
| MLOps Engineer | Deploy and monitor models | Push model to production, check logs |
| Compliance Officer | Read-only access for audits | View experiment lineage, export metrics |
| Service Account | API-level automation | Read model registry, invoke inference |
🧩 Implementation Tools:
- AWS IAM, GCP IAM, Azure AD → Cloud-level RBAC
- MLflow / Model Registry → Fine-grained access to model versions
- Feast / Feature Store → Restrict who can create or fetch features
🔒 Example: RBAC in a Feature Store (Feast)
```yaml
roles:
  data_scientist:
    permissions:
      - read: ["features/customer_*"]
      - write: ["features/new_*"]
  analyst:
    permissions:
      - read: ["features/customer_summary"]
```

💡 Intuition: RBAC is like airport security — passengers, pilots, and ground staff all have different access passes. Everyone’s important, but no one gets into the cockpit unless authorized.
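The wildcard-style permissions above can be sketched as a simple runtime check. This is a hypothetical helper, not Feast’s actual API; it just matches resource names against a role’s patterns:

```python
from fnmatch import fnmatch

# Hypothetical role -> permission mapping, mirroring the YAML example.
ROLE_PERMISSIONS = {
    "data_scientist": {
        "read": ["features/customer_*"],
        "write": ["features/new_*"],
    },
    "analyst": {
        "read": ["features/customer_summary"],
    },
}

def is_allowed(role: str, action: str, resource: str) -> bool:
    """Return True if the role's patterns grant `action` on `resource`."""
    patterns = ROLE_PERMISSIONS.get(role, {}).get(action, [])
    return any(fnmatch(resource, p) for p in patterns)

# Analysts can read the summary feature but cannot write anywhere.
print(is_allowed("analyst", "read", "features/customer_summary"))    # True
print(is_allowed("analyst", "write", "features/new_churn_model"))    # False
print(is_allowed("data_scientist", "read", "features/customer_age")) # True
```

In a real deployment this check lives in the platform (IAM, the feature store’s ACL layer), not in application code — the sketch only shows the decision logic.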
2️⃣ Encryption — The Lock and Key
Encryption ensures that even if your data or model files are exposed, they’re unreadable to unauthorized users.
There are two main layers of protection:
- At Rest: When stored in databases, S3 buckets, or model registries.
- In Transit: When moving between systems (e.g., during model deployment or inference calls).
🧱 Encryption at Rest:
Protects stored data and model weights.
Tools & Techniques:
- AES-256 encryption for storage (e.g., AWS KMS, GCP KMS).
- Encrypted block and object storage (e.g., EBS volumes, GCS buckets).
- Encrypt model binaries in artifact stores.
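At the application level, the same idea can be sketched with symmetric encryption of a model artifact, here using the `cryptography` package’s Fernet recipe (AES-based). The key generation inline is for illustration only; in production the key would come from a KMS or secret manager:

```python
from cryptography.fernet import Fernet

# In production, fetch this key from a KMS or secret manager;
# never generate or store it alongside the artifact.
key = Fernet.generate_key()
cipher = Fernet(key)

model_bytes = b"\x00\x01fake-model-weights\x02\x03"

# Encrypt before writing to the artifact store...
encrypted = cipher.encrypt(model_bytes)

# ...and decrypt only inside the serving environment.
decrypted = cipher.decrypt(encrypted)
assert decrypted == model_bytes
print("round-trip ok; ciphertext differs from plaintext:", encrypted != model_bytes)
```

Cloud-managed server-side encryption (as in the S3 example that follows) usually makes this transparent; explicit client-side encryption like the above adds a second layer when artifacts leave your trust boundary.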
Example: AWS S3 enables automatic server-side encryption:
```hcl
resource "aws_s3_bucket" "ml_models" {
  bucket = "ml-model-artifacts"
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}
```

⚡ Encryption in Transit:
Protects data during communication.
Practices:
- Always use HTTPS for REST endpoints.
- Use mutual TLS (mTLS) for service-to-service authentication.
- Rotate TLS certificates periodically.
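On the client side, these practices can be sketched with Python’s standard `ssl` module; the certificate paths in the mTLS line are placeholders:

```python
import ssl

# Default context: verifies the server's certificate chain and hostname,
# which is the baseline for any HTTPS call.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True

# For mutual TLS, the client also presents its own certificate.
# (Paths are placeholders; load real cert/key files in practice.)
# ctx.load_cert_chain(certfile="client.crt", keyfile="client.key")

# Disallow legacy protocol versions.
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
```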
💡 Intuition: Encryption is like speaking in code — even if someone hears the conversation, they can’t understand it.
3️⃣ API Access Auditing — The Watchtower
Every API request — whether to fetch data, deploy models, or make predictions — must be logged and auditable. This creates transparency and accountability across the ML ecosystem.
🧠 What to Log:
| Category | Example |
|---|---|
| Who | User or service identity |
| What | API endpoint or resource accessed |
| When | Timestamp |
| Where | IP or region |
| Outcome | Success, failure, or permission denied |
Tools:
- AWS CloudTrail, GCP Cloud Audit Logs, Elastic APM, Sentry
- MLflow / Kubeflow metadata tracking for model lineage and version access
🔍 Example Audit Log:
```json
{
  "timestamp": "2025-10-30T10:32:15Z",
  "user": "service-account:ml-deployer",
  "action": "POST /api/models/v2/deploy",
  "resource": "model:fraud_detector_v2",
  "status": "success",
  "ip": "34.112.45.89"
}
```

💡 Intuition: Auditing is like CCTV for your ML system — it doesn’t stop theft, but it ensures you always know who entered, when, and what they did.
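Emitting entries in that shape is straightforward with the standard library. A sketch — the field names follow the example log above, not any particular tool’s schema:

```python
import json
from datetime import datetime, timezone

def audit_event(user: str, action: str, resource: str,
                status: str, ip: str) -> str:
    """Serialize one audit entry as a JSON line for a log sink."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "user": user,
        "action": action,
        "resource": resource,
        "status": status,
        "ip": ip,
    }
    return json.dumps(entry)

line = audit_event(
    user="service-account:ml-deployer",
    action="POST /api/models/v2/deploy",
    resource="model:fraud_detector_v2",
    status="success",
    ip="34.112.45.89",
)
print(line)
```

In practice you would ship these lines to an append-only sink (CloudTrail, Cloud Audit Logs, an ELK stack) rather than printing them — append-only storage is what makes the log trustworthy as evidence.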
📐 Step 3: Mathematical Foundation
Let’s model the “least privilege principle” formally.
Least Privilege Principle — The Security Minimization Rule
A user $u$ has access rights $R(u)$ for a resource set $S$. The least privilege policy requires:
$$ R(u) = \min \{\, r_i \in S \mid r_i \text{ allows all required operations} \,\} $$

This ensures each user or service has only the permissions needed to perform its function — nothing more.
If $|R(u)|$ grows beyond this minimum, the system’s exposure area $E$, and with it the risk, grows in proportion:

$$ E \propto |R(u)| \times P(\text{vuln}) $$

where $P(\text{vuln})$ is the probability of a vulnerability being exploited.
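The minimization rule can be made concrete: among roles whose rights cover the required operations, pick the one with the smallest permission set. Role names and permissions here are illustrative:

```python
# Illustrative role -> permission sets.
ROLES = {
    "admin": {"read", "write", "deploy", "delete", "audit"},
    "mlops_engineer": {"read", "deploy", "audit"},
    "data_scientist": {"read", "write"},
    "auditor": {"read", "audit"},
}

def least_privilege_role(required: set) -> str:
    """Smallest role whose permissions cover all required operations."""
    candidates = [r for r, perms in ROLES.items() if required <= perms]
    if not candidates:
        raise ValueError(f"no role covers {required}")
    return min(candidates, key=lambda r: len(ROLES[r]))

# A deployment job needs read + deploy: mlops_engineer suffices,
# even though admin would also work.
print(least_privilege_role({"read", "deploy"}))  # mlops_engineer
```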
🧠 Step 4: Secret Management
Secrets = API keys, database credentials, or encryption tokens. Never hard-code them into scripts, notebooks, or configs.
🔐 Best Practices:
Use Secret Vaults:
- HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager.
- Access them dynamically via short-lived tokens.
Rotate Regularly:
- Automate secret rotation every 90 days.
Least Exposure:
- Store secrets in environment variables, not files.
Encrypt Secrets at Rest and Transit:
- Even secrets deserve encryption layers.
Access via IAM Roles (Not Keys):
- Prefer IAM roles with scoped policies over static credentials.
💡 Intuition: Secret management is like storing the vault’s key inside another vault — only those with explicit permission can unlock it.
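A minimal sketch of the “fetch dynamically, never hard-code” rule: read the secret at runtime and fail loudly if it is missing, rather than falling back to a baked-in default. The variable name is illustrative, and in practice a vault or secret-manager client call would replace the `os.environ` lookup:

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret at runtime; fail fast instead of falling back
    to a hard-coded default."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            f"secret {name!r} not provided; inject it via your "
            "secret manager, do not hard-code it"
        )
    return value

# Simulate the deployment environment injecting the secret.
os.environ["DB_PASSWORD"] = "s3cr3t-from-vault"
print(get_secret("DB_PASSWORD"))
```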
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Prevents unauthorized data/model access.
- Ensures compliance (GDPR, HIPAA, SOC 2).
- Provides forensic visibility into all system actions.

Limitations:
- Adds operational overhead (key rotation, IAM setup).
- Overly restrictive RBAC can block legitimate workflows.
- Encryption overhead may slightly increase latency.
Trade-off between security strictness and developer velocity:
- Too strict = slow iteration.
- Too loose = higher breach risk.

Mature systems automate security policies to minimize friction.
🚧 Step 6: Common Misunderstandings
- “RBAC is enough.” Wrong — without auditing and encryption, RBAC only controls access, not exposure.
- “Encryption slows everything down.” Modern hardware supports AES acceleration — overhead is minimal.
- “Secrets in environment variables are safe forever.” They must still be rotated and encrypted — environment leaks happen.
🧩 Step 7: Mini Summary
🧠 What You Learned: ML security combines RBAC, encryption, auditing, and secret management to protect data, models, and pipelines.
⚙️ How It Works: RBAC enforces access boundaries, encryption guards data at rest and in transit, and audit logs ensure full traceability.
🎯 Why It Matters: Security transforms ML systems from “functional” to “trustworthy.” Without it, even the best models are liabilities waiting to happen.