4.6. Model Selection & Serving Strategies


🪄 Step 1: Intuition & Motivation

Core Idea: Not every task deserves a supercomputer brain. 🧠💸

In real-world AI systems, choosing which model to use, where to host it, and how to route queries efficiently can make the difference between a $10k bill and a $100k one — without losing quality.

Model selection and serving strategy is about intelligent orchestration: matching the right model to the right task at the right cost.

This is where engineering maturity shows — not just building AI, but running it smartly.


Simple Analogy: Imagine running a restaurant. 🍴 You wouldn’t use your Michelin-star chef to make instant noodles, right?

Likewise:

  • Use GPT-4 (the chef) for complex reasoning or creativity.
  • Use Mistral or Llama 3 (junior chefs) for simpler dishes.
  • Let your router (the manager) decide who handles which order — automatically.

That’s the essence of efficient model serving.


🌱 Step 2: Core Concept

We’ll unpack this into four core dimensions:

  1️⃣ Model Comparison (OpenAI vs. Open Source)
  2️⃣ Deployment Trade-offs (Cloud vs. Self-Hosted)
  3️⃣ Model Routing (Dynamic Task Allocation)
  4️⃣ Lightweight Adaptations (LoRA & Quantization)


1️⃣ Model Comparison — The Intelligence Spectrum

Let’s compare the two broad families of LLMs:

| Criteria | OpenAI API Models (GPT-4, GPT-4o) | Open-Source Models (Llama 3, Mistral, Mixtral) |
| --- | --- | --- |
| Setup | Plug-and-play via API | Requires hosting & infra setup |
| Performance | Top-tier accuracy, broad generalization | Narrower domain, tunable |
| Control | Black-box (limited customization) | Full access, customizable weights |
| Cost | Pay-per-token, expensive at scale | One-time compute cost, cheaper long-term |
| Latency | Optimized globally | Depends on local infra |
| Security | Cloud-managed compliance | Must self-manage data and privacy |
| Use Cases | Enterprise apps, broad reasoning | Domain-specific or edge deployments |

When to Choose What:

  • GPT-4 / GPT-4o: when accuracy and robustness outweigh cost (e.g., reasoning-heavy customer support).
  • Llama 3 / Mistral / Mixtral: when customization, privacy, or budget control matter (e.g., internal knowledge systems).

Use APIs for speed of iteration. Switch to open-source for scale and control.

2️⃣ Deployment Trade-offs — Cloud vs. Self-Hosted

☁️ Cloud-Hosted (e.g., OpenAI, Anthropic, Gemini)

  • Pros:

    • Minimal setup time.
    • Optimized inference and uptime.
    • Scales automatically.
  • Cons:

    • Ongoing operational costs.
    • Data privacy concerns.
    • Limited observability and customization.

Best for: early-stage startups or workloads with unpredictable traffic.


🖥️ Self-Hosted (e.g., Llama 3 via Ollama, vLLM, or Hugging Face TGI)

  • Pros:

    • Full control of model weights and serving pipeline.
    • Can use GPUs efficiently for large-batch inference.
    • Lower long-term cost for high query volume.
  • Cons:

    • Infrastructure maintenance overhead.
    • Requires MLOps expertise.
    • Harder to scale on demand.

Best for: mature systems or privacy-sensitive deployments (e.g., enterprise RAG systems).

A hybrid approach works best: run critical logic on self-hosted open models and fall back to APIs for overflow or reasoning-heavy cases.

3️⃣ Model Routing — The Brain Dispatcher

Model routing is the logic that decides which model should handle which query.

Instead of relying on a single monolithic model, you orchestrate a portfolio of models, conceptually similar to a mixture of experts (MoE), with each model optimized for a different kind of task.


🔧 Routing Criteria:

| Category | Example Condition | Assigned Model |
| --- | --- | --- |
| Complexity | “Requires multi-step reasoning” | GPT-4 |
| Domain | “Contains legal terms” | Fine-tuned Llama 3 |
| Latency SLA | “Must respond under 500 ms” | Mistral |
| Cost Sensitivity | “Bulk Q&A, low precision” | Smaller distilled model |

🧩 Routing Workflow:

  1. A lightweight classifier (or a small LLM) analyzes incoming queries.
  2. It estimates complexity (via token length, structure, or keywords).
  3. Based on thresholds, it forwards the request to the appropriate model.

Example heuristic:

if "why" in query or len(query.split()) > 50:
    route_to("gpt-4")
else:
    route_to("llama-3-8b")

Advanced Systems:

  • Use scoring models to estimate uncertainty or expected token length.
  • Implement cascading routing (start with a small model and escalate if confidence falls below a threshold), as sketched below.

Log model routing decisions — you’ll discover 80% of queries don’t need your biggest model.
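A minimal sketch of the cascading pattern, assuming each model wrapper returns an answer plus a confidence score (self-reported, a classifier output, or a heuristic); the threshold value is illustrative:

```python
from typing import Callable, Tuple

def cascade(
    query: str,
    small: Callable[[str], Tuple[str, float]],
    large: Callable[[str], Tuple[str, float]],
    threshold: float = 0.7,
) -> str:
    # Cheap first pass on the small model.
    answer, confidence = small(query)
    if confidence >= threshold:
        return answer
    # Escalate only the queries the small model is unsure about.
    answer, _ = large(query)
    return answer
```

Here `small` and `large` would be thin wrappers around your serving stack, e.g., a local vLLM endpoint and a hosted API.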

4️⃣ Lightweight Adaptations — LoRA, Quantization & Distillation

When you can’t afford GPT-4 everywhere, the next best thing is to teach smaller models to behave like it.

🧩 LoRA (Low-Rank Adaptation)

  • Fine-tunes specific model layers with minimal additional parameters.
  • Allows specialization (e.g., legal reasoning, medical QA) at 1–2% of full fine-tuning cost.
  • Ideal for task-specific reasoning without retraining from scratch (see the sketch below).
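A minimal configuration sketch using the Hugging Face peft library; the base checkpoint and hyperparameters are illustrative, not prescriptive:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base model is a placeholder; any causal LM checkpoint works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```

Only the small adapter matrices are trained and stored, which is why a single base model can carry many task-specific LoRA adapters.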

🧠 Quantization

  • Compresses model weights (e.g., from FP16 → INT8/INT4).
  • Reduces memory footprint and speeds up inference by 2–4x.
  • Slight drop in accuracy, but major gains in latency.
  • Works great with vLLM or GGUF (Ollama) formats; a minimal loading sketch follows below.
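For example, a 4-bit load through transformers + bitsandbytes (checkpoint and settings are illustrative; the same idea applies to GGUF files served by Ollama or llama.cpp):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization: weights are stored in 4 bits, matmuls run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```

The quantized model drops into the same generation code as the full-precision one; you trade a small amount of accuracy for a much smaller memory footprint.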

🧪 Distillation Cascades

  • Use a large model (teacher) to train a smaller one (student).
  • The student learns to approximate the teacher’s reasoning distribution.
  • Example: distilling GPT-4 outputs into a Mistral-7B or Phi-3 model.

Result → 80–90% of GPT-level reasoning for 10% of the cost.
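When teacher logits are available, the core training signal is a softened cross-distribution loss; the PyTorch sketch below shows that piece only (with API-only teachers such as GPT-4, you typically fall back to supervised fine-tuning on the teacher’s text outputs instead, since logits are not exposed):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Soften both distributions so the student also learns the teacher's
    # relative preferences over "wrong" tokens, not just its top choice.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, rescaled by T^2 to keep
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```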

Distillation is the backbone of “tiered LLM architectures” — smaller models handle 90% of traffic; large models catch the rest.

📐 Step 3: Mathematical Foundation

Expected Cost in Multi-Model Routing

If:

  • $p_H$ = proportion of queries routed to high-cost model
  • $C_H$ = cost per query for high-cost model
  • $C_L$ = cost per query for low-cost model

Then total expected cost:

$$ E[C] = p_H C_H + (1 - p_H) C_L $$

Optimizing routing means minimizing $E[C]$ subject to the resulting quality $Q(p_H)$ staying above a threshold $Q_{\min}$:

$$ \min_{p_H} E[C] \quad \text{s.t.} \quad Q(p_H) \ge Q_{\min} $$

Your goal isn’t to use the smartest model always — it’s to reach the sweet spot where reasoning quality stays high but cost stays sane.
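A quick worked example with illustrative (not real) per-query prices:

```python
C_H, C_L = 0.03, 0.002   # illustrative $/query for the large and small model
p_H = 0.15               # fraction of traffic escalated to the large model

expected_cost = p_H * C_H + (1 - p_H) * C_L
print(f"{expected_cost:.4f} $/query")   # 0.0062 -> ~$6,200 per million queries
print(f"{C_H:.4f} $/query")             # 0.0300 -> $30,000 if everything used the large model
```

Routing 85% of traffic to the small model cuts the bill by roughly 80% in this toy scenario; the quality constraint $Q(p_H) \ge Q_{\min}$ is what stops you from pushing $p_H$ all the way to zero.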

🧠 Step 4: Key Ideas & Assumptions

  • Large models ≠ better decisions for all tasks.
  • Hybrid routing balances performance, latency, and cost.
  • LoRA and distillation let you tailor smaller models for specific reasoning styles.
  • Quantization is key for edge deployment and low-latency scenarios.
  • Always monitor routing outcomes — static rules often age poorly.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Greatly reduces operational cost.
  • Improves system scalability and flexibility.
  • Enables custom tuning for specialized tasks.

⚠️ Limitations:

  • Routing misclassification can degrade performance.
  • Maintaining multiple models increases complexity.
  • Quantized models may lose nuance in reasoning-heavy tasks.

⚖️ Trade-offs:

  • Cost vs. Accuracy: Smaller models are cheaper but less precise.
  • Control vs. Convenience: Self-hosting offers freedom but requires maintenance.
  • Latency vs. Consistency: Routing adds small delays but improves overall efficiency.

🚧 Step 6: Common Misunderstandings

  • “Using GPT-4 everywhere guarantees success.” → It’s overkill for most tasks.
  • “Quantization just reduces size.” → It can also affect reasoning fidelity.
  • “Routing means random assignment.” → It’s intelligent selection based on task complexity or uncertainty.

🧩 Step 7: Mini Summary

🧠 What You Learned: Model selection and serving strategies ensure you use the right brain for the right job — balancing reasoning depth, latency, and cost.

⚙️ How It Works: Compare hosted vs. open-source models, deploy via hybrid setups, and use routing plus lightweight adaptations like LoRA or distillation for cost-efficient reasoning.

🎯 Why It Matters: Smart model orchestration makes LLM systems sustainable — capable of scaling reasoning power without scaling bills.
