Large Language Model (LLM) Architecture - Roadmap
🤖 Core Foundations of Large Language Models
Note
The Top Companies Angle (LLM Foundations): These topics assess your mechanistic understanding of how large language models function — from text preprocessing to representation learning and attention dynamics. Expect interviewers to test whether you understand how and why the architecture works, not just what it does. Depth in these areas signals strong readiness for model interpretability, debugging, and scaling challenges in production.
1.1: LLM Architecture — The Blueprint of Intelligence
Begin with the Encoder–Decoder taxonomy:
- Encoder-only: BERT, RoBERTa (for understanding).
- Decoder-only: GPT family (for generation).
- Encoder-decoder: T5, FLAN-T5, BART (for translation, summarization).
Understand the Self-Attention Mechanism mathematically:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Learn how this enables global context aggregation and replaces recurrence.
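A minimal single-head sketch of this formula, assuming PyTorch; real implementations add learned Q/K/V projections, multiple heads, and masking:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # attention distribution over keys
    return weights @ V                               # context-weighted values

# Toy usage: batch of 1, sequence of 4 tokens, d_k = 8; Q = K = V gives self-attention
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 4, 8])
```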
Dive into Positional Encoding — why it’s necessary and how sinusoidal vs. learned embeddings differ.
Deeper Insight: A top interviewer may ask: “Why do Transformers scale better than RNNs?” Your response should connect parallelization advantages, sequence length dependence, and vanishing gradient avoidance. Also, expect probing on why attention is quadratic and the trade-offs of linear attention variants like Performer or FlashAttention.
1.2: Tokenization — Turning Language into Numbers
Study Subword tokenization algorithms:
- BPE (Byte Pair Encoding) — merges frequent pairs iteratively.
- WordPiece — chooses merges that maximize the likelihood of the training corpus.
- SentencePiece — operates directly on raw text (language-agnostic).
Understand the trade-off between vocabulary size and sequence length — smaller vocabularies mean longer token sequences.
Experiment with the tokenizers library in Python to encode and decode text and visualize token splits, for example:
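A quick sketch using the Hugging Face transformers wrapper around the tokenizers library (the gpt2 checkpoint is just an illustrative choice and is downloaded on first use):

```python
from transformers import AutoTokenizer  # assumes transformers/tokenizers are installed

# GPT-2 uses byte-level BPE; any pretrained checkpoint name works here.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization turns language into numbers."
print(tok.tokenize(text))   # subword pieces, e.g. ['Token', 'ization', ...]
ids = tok.encode(text)      # token ids fed to the model
print(ids)
print(tok.decode(ids))      # round-trip back to text
```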
Deeper Insight: Be prepared to reason about how tokenization errors propagate (e.g., rare words → fragmented tokens → unstable embeddings). Expect follow-ups like: “If you fine-tune a model on a domain with unseen words, how would you handle OOV (Out-of-Vocabulary) tokens?” Mention subword-level generalization or training domain-specific tokenizers.
1.3: Embeddings — The Language Geometry
Learn how embeddings map discrete tokens to continuous vectors in $\mathbb{R}^d$.
Contrast:
- Static embeddings (Word2Vec, GloVe) — same vector for all contexts.
- Contextual embeddings (BERT, GPT) — vector changes per context.
Explore cosine similarity and vector arithmetic (e.g., king - man + woman ≈ queen).
Visualize embedding clusters using PCA or t-SNE.
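A toy illustration of cosine similarity and vector arithmetic with NumPy; the 4-dimensional vectors below are made up purely to show the mechanics, not real embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = u.v / (||u|| * ||v||); 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" (real models use hundreds to thousands of dims)
king  = np.array([0.8, 0.3, 0.1, 0.9])
man   = np.array([0.7, 0.2, 0.1, 0.1])
woman = np.array([0.1, 0.2, 0.8, 0.1])
queen = np.array([0.2, 0.3, 0.8, 0.9])

analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # high if the "king - man + woman" geometry holds
```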
Deeper Insight: Probing may include: “Why do we normalize embeddings before computing similarity?” or “How does embedding dimensionality affect model capacity and memory footprint?” Be ready to explain the curse of dimensionality, parameter scaling, and embedding sharing between input/output layers.
1.4: Modeling Objectives — Teaching Language Understanding
- Understand Causal Language Modeling (CLM) — predicting next tokens, used in GPT:
$$P(w_t | w_{<t})$$
(see the short loss sketch after this list).
- Understand Masked Language Modeling (MLM) — predicting masked tokens, used in BERT: $$\mathcal{L}_{MLM} = -\sum_{m} \log P(w_m | w_{\setminus m})$$
- Study Denoising objectives in sequence-to-sequence models (T5).
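A minimal sketch of the CLM objective, assuming PyTorch and logits already produced by a decoder-only model; the key detail is the one-position shift between predictions and targets:

```python
import torch
import torch.nn.functional as F

# Causal LM loss: predict token t from tokens < t, so shift logits and labels by one.
# logits: (batch, seq_len, vocab) from any decoder-only model; token_ids: (batch, seq_len)
def causal_lm_loss(logits, token_ids):
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = token_ids[:, 1:]    # targets are the *next* tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy shapes: batch 2, sequence 5, vocabulary 100
logits = torch.randn(2, 5, 100)
tokens = torch.randint(0, 100, (2, 5))
print(causal_lm_loss(logits, tokens).item())
```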
Deeper Insight: Interviewers often test if you can articulate the information asymmetry between MLM and CLM and their implications for downstream fine-tuning. For instance, “Why can’t we use MLM for text generation directly?” or “What’s the trade-off between bidirectional context and autoregressive modeling?”
1.5: Scaling Laws & Model Capacity
- Study Kaplan et al. (2020) scaling laws, which show that loss falls as a power law in each resource when the others are not the bottleneck:
$$L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C}$$
where N = model parameters, D = dataset size, C = compute budget.
- Learn how over-parameterization improves training stability but risks undertraining if data/compute are insufficient.
- Explore Chinchilla scaling (Hoffmann et al., 2022): for a fixed compute budget, parameters and training tokens should scale roughly together (≈20 tokens per parameter), so adding data is often more efficient than adding parameters.
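A back-of-the-envelope sizing sketch assuming the commonly cited ~20-tokens-per-parameter Chinchilla rule of thumb (an approximation, not an exact law):

```python
# Rough compute-optimal data sizing under the ~20 tokens/parameter heuristic.
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n in (1e9, 7e9, 70e9):
    print(f"{n/1e9:>4.0f}B params -> ~{chinchilla_tokens(n)/1e12:.1f}T training tokens")
```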
Deeper Insight: Expect questions like: “How would you decide whether to scale model parameters or dataset size for better performance?” or “What limits model scaling at production?” — answer with memory bandwidth, training parallelism, and inference latency.
1.6: Optimization & Training Stability
- Review Adam, AdamW, and RMSProp — their role in adaptive learning rates.
- Learn about Gradient Clipping and Mixed Precision Training (FP16/BF16) for stability and efficiency.
- Study Learning Rate Warmup and Decay Schedules, essential for stabilizing large-model training.
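A sketch of linear warmup followed by cosine decay using PyTorch's LambdaLR; the model, step counts, and base learning rate are placeholder values:

```python
import math
import torch

model = torch.nn.Linear(10, 10)                      # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero: a common large-model schedule.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Inside the training loop: opt.step(); sched.step()
```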
Deeper Insight: In interviews, you may get: “Your model diverges after 10k steps — what would you check first?” Expect to discuss initialization, layer normalization, learning rate, gradient explosion, or data corruption.
1.7: Regularization & Generalization
- Study Dropout in Attention/Feedforward layers, Weight Decay, and Label Smoothing.
- Learn how Early Stopping and Validation Loss Tracking prevent overfitting.
- Explore Stochastic Depth and Data Augmentation in text (e.g., back-translation).
Deeper Insight: Questions often explore your ability to detect overfitting vs. underfitting signals. Example: “If validation perplexity worsens but training perplexity keeps improving, what’s your next step?”
🧠 Training & Fine-Tuning of Large Language Models
Note
The Top Companies Angle (LLM Training): These topics test your ability to reason about data–compute–objective alignment. Top interviews will evaluate not just if you know what fine-tuning is, but when, why, and how to adapt or compress large models efficiently. Expect deep dives into optimization trade-offs, transfer learning limits, and parameter efficiency — essential for production-scale LLM work.
2.1: Pretraining vs. Fine-tuning — The Two-Stage Evolution
Stage 1: Pretraining
- Train on massive, unlabeled corpora using self-supervised objectives (e.g., CLM, MLM).
- Goal: Learn general linguistic and world knowledge.
- Data scale: trillions of tokens; loss typically measured via perplexity.
Stage 2: Fine-tuning
- Adapt the pretrained model to specific downstream tasks (classification, summarization, etc.).
- Use supervised, instruction, or reinforcement signals.
Deeper Insight: Interviewers often ask: “Why not just train directly on your target task?” The answer: Transfer learning efficiency — pretraining captures universal structure; fine-tuning specializes behavior. Follow-ups may include: “How do catastrophic forgetting and overfitting manifest during fine-tuning?” Discuss mitigations: smaller LR, layer freezing, or gradual unfreezing (ULMFiT-style).
2.2: Supervised Fine-Tuning (SFT) — Controlled Adaptation
SFT involves training on labeled datasets where the model learns desired input-output mappings, typically (prompt, ideal completion) pairs. The typical setup is a cross-entropy loss over the target completion: loss = cross_entropy(model(prompt), target), then optimizer.step() (see the sketch after the list below). Key techniques:
- Layer freezing: Only update top layers for efficiency.
- Curriculum tuning: Start with simple examples → move to complex.
- Gradient checkpointing to save GPU memory.
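A minimal SFT step, assuming a Hugging Face-style causal LM whose output exposes .logits and a dataloader that masks prompt positions in labels with -100 so only the completion contributes to the loss:

```python
import torch
import torch.nn.functional as F

# `model`, `optimizer`, `input_ids`, and `labels` are assumed to come from your setup.
def sft_step(model, optimizer, input_ids, labels, max_grad_norm=1.0):
    logits = model(input_ids).logits                          # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,                                    # skip prompt positions
    )
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```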
Deeper Insight: Common question: “How do you prevent overfitting when fine-tuning a small dataset?” Responses should mention early stopping, dropout, data augmentation, or mixout regularization. Advanced probing: “Would you fine-tune all parameters or use adapters?” — this bridges directly into the next section on PEFT.
2.3: Instruction Tuning — Teaching Models to Follow Human Intent
- Unlike SFT, instruction tuning trains on instruction-response pairs across many domains. Example: “Explain photosynthesis.” → “Photosynthesis is the process by which…”
- The goal: Make the model align with human communication norms rather than task-specific datasets.
- Study FLAN, T0, and InstructGPT as reference architectures.
- Key to understand: Task generalization comes from diverse, high-quality instruction datasets.
Deeper Insight: Probing question: “Why does instruction tuning improve zero-shot performance?” Because it conditions the model to map arbitrary instructions to coherent outputs, turning implicit reasoning into explicit behavior. Follow-up: “What’s the failure mode if instructions conflict?” → Discuss dataset bias and contradictory alignment signals.
2.4: Parameter-Efficient Fine-Tuning (PEFT) — Do More with Less
PEFT methods let you adapt large pretrained models without updating all parameters.
Learn these main techniques:
- Adapters — small bottleneck layers inserted into Transformer blocks.
- LoRA (Low-Rank Adaptation) — learns rank-decomposed weight updates. $$W' = W + \Delta W = W + BA$$
- Prefix/Prompt Tuning — prepends learnable tokens to context.
Benefits: Memory efficiency, faster experimentation, and multi-domain flexibility.
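A minimal hand-rolled LoRA linear layer illustrating W' = W + BA; in practice you would likely use the peft library, and the rank, scaling, and dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: output = base(x) + x A^T B^T * (alpha / r), with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```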
Deeper Insight: Expect an interviewer to test whether you understand why LoRA works — hint: it restricts adaptation to a low-dimensional subspace, preserving pretrained knowledge. Question: “Why does LoRA outperform full fine-tuning in low-data regimes?” Answer: It reduces overfitting by limiting parameter freedom. Bonus: Be ready to compare PEFT vs. Adapter vs. Prompt-tuning trade-offs in latency, memory, and expressivity.
2.5: Quantization & Distillation — Making Giants Efficient
Quantization: Reducing weight precision (e.g., FP32 → INT8/INT4) to cut memory and compute cost.
- Post-training quantization vs. Quantization-aware training (QAT).
- Learn about zero-point scaling, per-channel quantization, and dynamic quantization.
Distillation: Training a smaller student model to mimic a large teacher model.
- Soft targets from the teacher’s logits guide student learning:
$$L = \alpha H(y, s) + (1 - \alpha) H(t, s)$$
where H = cross-entropy, y = ground-truth labels, t = teacher output, s = student output.
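A sketch of the distillation loss above, using the common temperature-softened KL form for the teacher term; the temperature T and mixing weight alpha are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Hard-label cross-entropy plus temperature-softened teacher term (KL form)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # conventional T^2 gradient scaling
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(4, 10)                         # toy logits: batch 4, 10 classes
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels).item())
```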
Deeper Insight: Expect questions like: “How do quantization and distillation affect model accuracy?” and “Would you quantize before or after fine-tuning?” Discuss trade-offs: post-training quantization may hurt performance; QAT retains accuracy but increases training cost. Mention tools: bitsandbytes (integrated with Hugging Face Transformers) and Intel’s neural-compressor.
2.6: Reinforcement Learning from Human Feedback (RLHF)
Three-stage pipeline:
- Supervised fine-tuning (SFT) to get a baseline policy.
- Reward model (RM) trained on human preference rankings.
- Policy optimization via PPO (Proximal Policy Optimization) to align model outputs with human preferences.
Core concept: Replace static loss with a dynamic reward signal reflecting human intent.
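Before PPO even enters the picture, the reward model is usually trained with a pairwise ranking loss over human preference pairs; a minimal sketch, with toy scalar rewards:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for a batch of 3 preference pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
print(preference_loss(r_chosen, r_rejected).item())
```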
Deeper Insight: Interviewers might ask: “Why is PPO used in RLHF instead of vanilla policy gradient?” Discuss stability—PPO constrains update steps, preventing catastrophic drift from the SFT baseline. Be ready to talk about reward hacking, KL penalties, and how OpenAI’s “helpful-honest-harmless” triad inspired modern alignment objectives.
2.7: Safety Alignment & Post-training Alignment
- Study Constitutional AI, Self-Reward Modeling, and Direct Preference Optimization (DPO).
- Learn how safety alignment shifts from human feedback to rule-based or LLM-as-judge supervision.
- Important: Understand bias detection, toxicity mitigation, and content moderation filtering pipelines.
Deeper Insight: Common probing: “How can you align models without human labels?” — discuss synthetic feedback and self-critique (Reflexion, RLAIF). “How do you evaluate alignment success?” — use reward model accuracy, preference agreement rate, or human eval benchmarks (e.g., MT-Bench).
2.8: Evaluation & Monitoring After Fine-tuning
- Learn metrics: Perplexity, BLEU, ROUGE, Exact Match (EM), Win Rate, and Preference Accuracy.
- Understand continual evaluation pipelines — automated test harnesses that track degradation over time.
- Familiarize with hallucination detection (e.g., self-consistency, retrieval grounding).
Deeper Insight: Expect questions on how you’d know if your fine-tuned model regressed. Top answers discuss offline evaluation + live shadow deployments to measure drift before rollout.
⚙️ Implementation & Scaling Deep Dive
Note
The Top Tech Company Interview Angle: This part evaluates whether you can scale your knowledge from theory to production. You’ll be assessed on compute efficiency, distributed training fundamentals, and inference trade-offs — the real engineering backbone of LLM deployment. Expect detailed discussions around memory bottlenecks, model sharding, latency optimization, and reproducibility in multi-node training setups.
3.1: Distributed Training — Dividing the Giant
Data Parallelism (DP)
- Each GPU holds a full model copy but processes different mini-batches.
- Gradients are averaged across GPUs (via AllReduce).
- Tools: PyTorch DDP, DeepSpeed, Horovod.
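A minimal DDP setup sketch, assuming a launch via torchrun so that LOCAL_RANK is set; the Linear model is a stand-in for a real network:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK/WORLD_SIZE; NCCL is the usual backend for GPUs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])            # gradients AllReduced automatically

    # ...build a DataLoader with DistributedSampler and run the usual training loop...

if __name__ == "__main__":
    main()
```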
Model Parallelism (MP)
- Split model layers across devices.
- Common strategy for extremely large models (e.g., GPT-3 with 175B params).
- Frameworks and techniques: Megatron-LM, tensor parallelism (sharding individual weight matrices across devices).
Pipeline Parallelism (PP)
- Break model into pipeline stages; each processes a different batch in sequence.
- Requires micro-batching to keep GPUs busy (fill-the-pipeline strategy).
Hybrid Parallelism (DP + MP + PP)
- Combine all three (“3D parallelism”) for maximum scaling, as used for frontier-scale models such as Megatron-Turing NLG.
Deeper Insight: Expect probing like: “Why does scaling efficiency drop with more GPUs?” Discuss communication overhead, gradient synchronization latency, and imbalance in pipeline stages. Bonus: If you mention ZeRO (Zero Redundancy Optimizer) stages (1–3) from DeepSpeed, it’s an instant signal of depth.
3.2: Memory Optimization — Training Without Melting GPUs
Gradient Checkpointing
- Saves memory by recomputing activations during the backward pass instead of storing all of them.
- Trades compute for memory — crucial for fitting large batches.
Mixed Precision Training (FP16 / BF16)
- Use half-precision for most ops while retaining FP32 master weights for stability.
- Learn how loss scaling avoids gradient underflow.
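A minimal mixed-precision training step with PyTorch AMP and dynamic loss scaling; it assumes a CUDA GPU, and the model and data are stand-ins:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                # dynamic loss scaling for FP16

for x, y in [(torch.randn(8, 1024).cuda(), torch.randn(8, 1024).cuda())]:
    with torch.cuda.amp.autocast():                 # ops run in reduced precision where safe
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                   # scale the loss to avoid gradient underflow
    scaler.step(opt)                                # unscales grads; skips step on inf/NaN
    scaler.update()
    opt.zero_grad()
```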
Activation Offloading
- Move activations to CPU or NVMe temporarily.
- Used in ZeRO-Offload and FSDP (Fully Sharded Data Parallelism).
Deeper Insight: A senior interviewer might ask: “You’re out of GPU memory at 80% utilization — what would you do?” Ideal answers mention:
- Reducing batch_size or using gradient accumulation.
- Employing mixed precision.
- Enabling checkpointing/offloading.
- Inspecting redundant buffers or unused caches.
3.3: Efficient Inference & Serving Pipelines
Quantization for Inference:
- Convert weights/activations to INT8/INT4 for lower latency.
- Dynamic quantization often yields 2–3× speedup with minimal loss.
Tensor Parallelism (TP):
- Shard attention or feed-forward layers across multiple GPUs at inference time.
- Used by libraries like Megatron-DeepSpeed.
KV Cache Optimization:
- During autoregressive decoding, store key/value tensors to avoid recomputing attention for previous tokens.
- Enables streaming inference for chat models.
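A sketch of greedy decoding with a reused KV cache via Hugging Face transformers (gpt2 is just a small example checkpoint); note that once a cache exists, only the newest token is fed to the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(5):
        # First step: full prompt. Later steps: only the newest token plus the cached K/V.
        out = model(ids if past is None else ids[:, -1:], past_key_values=past, use_cache=True)
        past = out.past_key_values                   # cached keys/values for all previous tokens
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```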
Batching & Speculative Decoding:
- Serve multiple user queries in parallel to maximize GPU throughput.
- Speculative decoding: use a smaller draft model to predict next tokens, validated by the main model — boosts throughput.
Deeper Insight: Common probes: “How would you serve a 70B parameter model with <200ms latency?” Mention tensor parallelism, quantization, efficient batching, and Triton or vLLM serving. Bonus: Discuss continuous batching (as implemented in vLLM and TGI) for massive concurrency.
3.4: Experiment Tracking & Reproducibility
- Use seed control across random, NumPy, and PyTorch.
- Log all hyperparameters, checkpoints, and dataset versions.
- Tools: Weights & Biases (W&B), MLflow, TensorBoard.
- Version datasets using DVC or Hugging Face Datasets with fixed splits.
Deeper Insight: Probing might include: “You rerun training and get a slightly different model — why?” Answer: non-deterministic GPU kernels, mixed precision noise, and differing RNG seeds. Discuss deterministic flags like torch.backends.cudnn.deterministic=True (a minimal seeding sketch follows).
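A minimal seeding helper covering the sources of randomness mentioned above; which determinism flags to enable is a judgment call, since they trade speed for reproducibility:

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in GPU kernels:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
```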
3.5: Monitoring, Drift Detection, and Maintenance
Drift Monitoring:
- Detect shifts in data distribution (covariate or label drift).
- Use KL divergence or PSI (Population Stability Index) on embeddings or feature vectors.
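A small PSI sketch over 1-D feature or embedding-projection values; the 0.1/0.25 thresholds are a common rule of thumb, not a standard:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a live sample of the same 1-D feature.
    Rough guide: <0.1 stable, 0.1-0.25 moderate shift, >0.25 significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)               # avoid log(0) / divide-by-zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

ref = np.random.normal(0.0, 1.0, 10_000)             # reference distribution
live = np.random.normal(0.3, 1.1, 10_000)            # shifted "production" distribution
print(population_stability_index(ref, live))
```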
Performance Tracking:
- Maintain live metrics: perplexity, BLEU, ROUGE, latency, and user satisfaction signals.
Feedback Loop:
- Collect failure cases and reintroduce into continual training or reinforcement pipelines.
Shadow Deployment:
- Run new fine-tuned models in parallel with production for silent A/B testing.
Deeper Insight: Common probe: “How do you ensure your LLM doesn’t degrade after deployment?” Discuss shadow evals, offline validation suites, live feedback metrics, and continual RLHF fine-tuning.
3.6: Scaling Infrastructure — From Lab to Production
Training Infrastructure
- Elastic clusters with job orchestration (Kubernetes, Ray, Slurm).
- Efficient I/O pipelines (TFRecord, WebDataset).
- Checkpoint sharding for resilience and restartability.
Serving Infrastructure
- Model sharding across GPUs via tensor parallelism.
- Use inference servers like Triton, vLLM, or Text Generation Inference (TGI).
- Integrate caching layers for frequent prompts (Redis, Memcached).
Deeper Insight: Expect real-world trade-off questions: “You can deploy one massive model or multiple smaller ones per region — which do you choose?” Discuss latency, cost, failure isolation, and regional adaptation. Elite answers highlight elastic scaling, A/B evaluation, and cross-region model drift.
3.7: Failure Recovery & Checkpoint Strategy
Checkpoint Intervals:
- Trade-off between fault tolerance and training overhead.
- Store at major epoch boundaries or every N steps depending on GPU failure likelihood.
Resumable Training:
- Log optimizer states (momentum, Adam moments).
- Save RNG seeds to continue seamlessly.
Deeper Insight: Interviewers often probe operational readiness: “If one node crashes mid-training, what happens?” Correct response: explain checkpoint reload, optimizer reinitialization, and gradient re-synchronization. Mention DeepSpeed ZeRO checkpoint partitioning for multi-node resilience.
🧩 Evaluation, Interpretability & Alignment Deep Dive
Note
The Top Tech Interview Angle: This section tests whether you understand what “good” means in generative AI. Interviewers look for engineers who can measure quality beyond loss values — using perplexity, semantic similarity, human preference modeling, and interpretability frameworks. Expect questions that blend mathematics, human feedback, and real-world deployment insight — exactly what senior ML engineers are expected to own in production-grade LLM systems.
4.1: Evaluation Metrics — Defining “Good” for LLMs
Evaluation in LLMs splits into:
- Intrinsic metrics — model-internal: loss, perplexity, log-likelihood.
- Extrinsic metrics — task-based: BLEU, ROUGE, accuracy, F1.
- Human preference metrics — subjective but essential for quality alignment.
Understand trade-offs: automatic metrics ≠ human satisfaction.
Learn win-rate and pairwise comparison as standard for dialogue evaluations.
Deeper Insight: Common question: “Why is perplexity not enough for dialogue models?” Because perplexity only measures token-level probability, not usefulness, truthfulness, or coherence. Mention multi-aspect evaluation — truthfulness, helpfulness, coherence, and toxicity as dimensions.
4.2: Perplexity — The Statistical Backbone
- Definition:
$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_{<i})\right)$$
(see the short computation sketch after this list).
- Used mainly during pretraining and fine-tuning for comparing checkpoints.
- Remember: it’s not normalized across vocabularies or domains, so direct comparisons can mislead.
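A quick numeric sketch: perplexity is just the exponential of the mean per-token negative log-likelihood; the log-probabilities below are toy values, whereas in practice they come from the model's output logits:

```python
import math

# Per-token log-probabilities the model assigned to the observed tokens (toy values).
token_logprobs = [-2.1, -0.4, -3.0, -1.2, -0.7]

avg_nll = -sum(token_logprobs) / len(token_logprobs)   # mean negative log-likelihood
perplexity = math.exp(avg_nll)
print(perplexity)                                       # ~4.4 for these values
```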
Deeper Insight: Probing example: “Model A has 25 perplexity, Model B has 20 — is B always better?” Answer: not necessarily — depends on dataset domain, tokenization granularity, and context length. Top candidates also mention perplexity saturation — when continued training stops reducing it but downstream quality still improves.
4.3: BLEU, ROUGE & Semantic Metrics — Evaluating Generations
BLEU (Bilingual Evaluation Understudy):
- Precision-based n-gram overlap metric.
- Common in translation.
- Formula: $$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$ where BP = brevity penalty.
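A quick BLEU computation with the sacrebleu package (assumes pip install sacrebleu; the strings are toy examples):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                                  # corpus-level BLEU on a 0-100 scale
```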
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- Measures recall of n-grams and longest common subsequences.
- Used in summarization.
BERTScore:
- Uses contextual embeddings instead of n-grams → aligns better with human judgment.
Deeper Insight: Probing: “When does BLEU fail?” Answer: BLEU fails when multiple valid answers exist (creative tasks). Interviewers value answers mentioning semantic alignment metrics like BERTScore, BLEURT, or COMET, which evaluate meaning rather than word overlap.
4.4: Human Evaluation & Preference Modeling
The gold standard for conversational and open-ended tasks.
Techniques:
- Pairwise Comparison: A vs. B judgments on output quality.
- Likert Scales: 1–5 ratings on helpfulness, truthfulness, coherence.
- Reward Modeling: Convert human preference data into a scalar reward function (used in RLHF).
Learn sampling strategies: interleaving, active selection, and reward aggregation.
Deeper Insight: Expect probing like: “How do you make human evaluation reliable?” Answer: diversity of annotators, clear rubrics, quality-control checks, and measuring inter-rater agreement (Cohen’s κ). Follow-up: “Can human feedback be biased?” → Yes, address annotator cultural bias and position bias.
4.5: Hallucination Detection & Calibration
Hallucination: Model generates fluent but factually incorrect information.
Detection strategies:
- Retrieval grounding: Compare output against a trusted knowledge base.
- Self-consistency: Sample the model several times; disagreement across samples signals likely hallucination.
- Verifier models: Train smaller LLMs to validate claims.
Calibration techniques:
- Temperature tuning: Reduces randomness.
- Logit scaling: Adjusts output confidence.
- External grounding: RAG (Retrieval-Augmented Generation) or tool-use.
Deeper Insight: Common probing: “Can a low temperature fully remove hallucinations?” No — temperature only affects randomness, not factual grounding. The deep answer includes: “You must improve the retrieval process or incorporate truthfulness constraints in post-training alignment.”
4.6: Explainability — Making LLMs Less of a Black Box
Techniques:
- Attention Visualization: Inspect which tokens contribute most to predictions.
- Input Perturbation: Change inputs slightly, observe output shifts.
- Feature Attribution: Use SHAP, Integrated Gradients for encoder models.
- Probing Classifiers: Train small models on hidden representations to test what knowledge is encoded.
Learn attention ≠ explanation — attention weights correlate but don’t guarantee causal influence.
Deeper Insight: Expect: “How do you interpret what your Transformer learned?” Discuss probing tasks, representational similarity analysis, and layer-wise semantic emergence. Senior-level answers mention “representation drift” — how semantic spaces evolve during fine-tuning.
4.7: Behavioral Evaluation & Safety Testing
Test for harmful or biased behavior using controlled benchmarks:
- RealToxicityPrompts, BiasBench, TruthfulQA, AdvBench.
Learn red-teaming methods — generate adversarial prompts to stress-test the model.
Apply constitution-based filtering (Anthropic’s Constitutional AI) to constrain outputs to ethical guidelines.
Deeper Insight: Interviewers love: “How would you detect toxicity at scale?” Mention fine-tuned toxicity classifiers or embeddings-based toxicity filters. Follow-up: “Can safety fine-tuning reduce helpfulness?” Yes — alignment may reduce diversity and creativity; balance via reward scaling.
4.8: Continuous Feedback & Deployment Alignment
- Human-in-the-loop (HITL): Real-time feedback integration loop for model refinement.
- Evaluation pipelines: Automate testing after every fine-tuning or model release.
- Long-term drift detection: Monitor user satisfaction, coherence scores, or safety regressions.
Deeper Insight: Expect: “How would you maintain evaluation quality after deployment?” Elite candidates mention continuous A/B testing, reward model recalibration, and online learning safety guards (e.g., gating uncertain outputs).
4.9: Ethics, Fairness, and Transparency
- Understand dataset bias propagation, representation fairness, and data lineage tracking.
- Discuss differential privacy, data redaction, and auditable model logs.
- Be aware of legal frameworks — GDPR, AI Act, model cards, and data consent.
Deeper Insight: A senior interviewer may test philosophy + practicality: “Should models be open-sourced?” Good responses weigh transparency vs. misuse, and mention parameter sharing with usage restrictions — as done by many frontier labs.