Large Language Models (LLMs)
🧠 Transformer & GPT Family
Note
The Top Tech Interview Angle: Transformers underpin every LLM. Interviewers assess your ability to break down attention math, scaling strategies, and the trade-offs between decoder-only, encoder-decoder, and hybrid architectures. You’ll also be expected to reason about why GPT-style models dominate generative tasks.
1.1: The GPT Lineage (GPT-1 → GPT-4)
- Study how GPT-1 established transfer learning for NLP (unsupervised → fine-tuned).
- Understand GPT-2’s scaled-up autoregressive objective: causal masking for next-token prediction and the zero-shot task transfer it enables.
- Dive into GPT-3 and scaling laws: 175B parameters, alternating dense and locally banded sparse attention, and in-context (prompt-based) learning.
- Examine GPT-4’s reported mixture-of-experts (MoE) design and the move to native multimodality with GPT-4o.
Deeper Insight: GPT’s success lies in causal masking + massive scale. The key differentiator is that it learns tasks implicitly from text patterns without explicit supervision.
Probing Question: “Why do decoder-only models like GPT generalize so well to unseen tasks?” Because the autoregressive training objective inherently captures conditional dependencies — every next-token prediction becomes a proxy for reasoning.
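A minimal sketch of the training objective behind the whole GPT lineage (plain PyTorch, not any particular model’s code): the targets are the input tokens shifted left by one, and a lower-triangular mask is what “causal” means in practice.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: position t may attend only to positions <= t.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    # Predict token t+1 from everything up to t: shift the targets left by one.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

# Toy usage with random logits standing in for a decoder's outputs.
batch, seq_len, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)
print(causal_mask(4))
print(next_token_loss(logits, tokens))
```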
1.2: Transformer Internals
- Dissect multi-head self-attention, feed-forward networks, residuals, and layer norms.
- Study attention parallelization and memory-efficient attention (FlashAttention).
- Explore rotary positional embeddings (RoPE) and their benefits over sinusoidal ones.
- Learn how modern GPT variants implement parallel transformer blocks for speed.
Deeper Insight: Real mastery means knowing how these micro-optimizations scale to 100B+ parameters while avoiding training divergence.
Probing Question: “Why does GPT use causal masks instead of bidirectional attention like BERT?” Because it must predict token t+1 given only tokens ≤ t, the mask enforces the strict left-to-right dependency that generation requires.
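To make the internals concrete, here is a hedged plain-PyTorch sketch of causally masked multi-head self-attention. It deliberately omits FlashAttention, RoPE, dropout, and the surrounding residual/LayerNorm plumbing, and the weight matrices are random stand-ins.

```python
import math
import torch

def multi_head_attention(x, w_qkv, w_out, n_heads, causal=True):
    # x: (batch, seq, d_model); w_qkv: (d_model, 3*d_model); w_out: (d_model, d_model)
    b, t, d = x.shape
    head_dim = d // n_heads
    q, k, v = (x @ w_qkv).split(d, dim=-1)
    # Reshape to (batch, heads, seq, head_dim) so every head attends independently.
    q, k, v = (z.view(b, t, n_heads, head_dim).transpose(1, 2) for z in (q, k, v))
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    if causal:
        mask = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
    attn = scores.softmax(dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(b, t, d)
    return out @ w_out

b, t, d, h = 2, 16, 64, 4
x = torch.randn(b, t, d)
print(multi_head_attention(x, torch.randn(d, 3 * d), torch.randn(d, d), h).shape)
# torch.Size([2, 16, 64])
```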
🧩 BERT Family & Bidirectional Transformers
Note
The Top Tech Interview Angle: BERT revolutionized representation learning and contextual embeddings. Interviews often test whether you can differentiate between encoder-only and decoder-only training, and reason about why BERT isn’t generative but essential for understanding.
2.1: BERT and Its Variants
- Study Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- Examine BERT’s bidirectional encoder stack — self-attention across both left and right context.
- Understand RoBERTa’s removal of NSP and dynamic masking improvements.
- Learn about DistilBERT (knowledge distillation) and ALBERT (parameter sharing plus factorized embeddings), two routes to parameter efficiency.
Deeper Insight: BERT builds context-rich embeddings, while GPT learns generative transitions. Understanding this duality is crucial for reasoning about hybrid architectures (e.g., T5).
Probing Question: “Why can’t BERT be directly used for open-ended text generation?” Because MLM predicts masked tokens from bidirectional context rather than modeling a left-to-right factorization of the sequence, so there is no natural autoregressive decoding procedure.
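A small sketch of BERT-style MLM corruption using the 80/10/10 rule from the original paper; the token ids and mask id below are arbitrary stand-ins, and -100 follows the common “ignore index” convention for the loss.

```python
import torch

def mlm_corrupt(tokens, mask_token_id, vocab_size, mask_prob=0.15):
    # tokens: (batch, seq) of token ids; returns corrupted inputs and MLM labels.
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob            # positions to predict
    labels[~selected] = -100                                   # ignored by the loss
    inputs = tokens.clone()
    roll = torch.rand(tokens.shape)
    inputs[selected & (roll < 0.8)] = mask_token_id            # 80%: replace with [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)             # 10%: a random token
    inputs[swap] = torch.randint(0, vocab_size, tokens.shape)[swap]
    # The remaining 10% of selected positions keep their original token.
    return inputs, labels

tokens = torch.randint(5, 1000, (2, 12))
inputs, labels = mlm_corrupt(tokens, mask_token_id=4, vocab_size=1000)
print(inputs)
print(labels)
```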
2.2: Sentence & Cross-Lingual Variants
- Learn Sentence-BERT (SBERT) for semantic similarity via contrastive objectives.
- Study mBERT and XLM-R, focusing on multilingual embedding alignment.
- Explore shared tokenizers and cross-lingual masked modeling strategies.
Probing Question: “How does SBERT differ from BERT in training objective?” SBERT fine-tunes a siamese BERT encoder with pairwise objectives (NLI classification, regression, or contrastive/triplet losses) so that semantically similar sentences land close together in embedding space, enabling retrieval and clustering.
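A hedged sketch of the in-batch contrastive (multiple-negatives ranking) objective commonly used to train SBERT-style encoders; the embeddings here are random tensors standing in for pooled BERT outputs.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    # anchor_emb, positive_emb: (batch, dim) embeddings of paired sentences.
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))         # the i-th positive matches the i-th anchor
    return F.cross_entropy(logits, targets)   # other in-batch sentences act as negatives

print(in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384)))
```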
🧬 Encoder–Decoder & Instruction Models (T5, FLAN, UL2)
Note
The Top Tech Interview Angle: Encoder–decoder models test your understanding of sequence-to-sequence learning, multi-task pretraining, and instruction alignment. Knowing how these architectures combine understanding + generation is key for reasoning about instruction-tuned LLMs.
3.1: The T5 Architecture
- Understand T5’s unified text-to-text framework (“everything is text in, text out”).
- Study span corruption (replacing spans, not tokens) as a pretraining task.
- Learn about the shared SentencePiece vocabulary, relative position biases, and tied input/output embeddings (untied in T5.1.1).
- Explore T5.1.1, Flan-T5 (instruction-tuned), and UL2 (mixture of denoising).
Deeper Insight: T5’s architecture inspired the instruction-tuning revolution — enabling LLMs to follow human-readable commands across diverse tasks.
Probing Question: “Why does span corruption outperform token-level masking?” Because predicting longer spans forces the model to reason about syntactic and semantic coherence, not just local context.
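A toy approximation of T5-style span corruption (not the paper’s exact sampling procedure): contiguous spans are replaced by sentinel tokens in the input, and the target reconstructs each span after its sentinel.

```python
import random

def span_corrupt(words, corruption_rate=0.15, max_span_len=3, seed=0):
    # Replace random word spans with sentinels; the target lists the removed spans.
    rng = random.Random(seed)
    budget = max(1, int(len(words) * corruption_rate))  # roughly corruption_rate of the words
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(words):
        if budget > 0 and rng.random() < corruption_rate:
            span = min(max_span_len, len(words) - i, budget)
            targets += [f"<extra_id_{sentinel}>"] + words[i:i + span]
            inputs.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(words[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src = "the quick brown fox jumps over the lazy dog".split()
print(span_corrupt(src, corruption_rate=0.3))
# e.g. ('the quick brown <extra_id_0> over the lazy dog', '<extra_id_0> fox jumps')
```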
3.2: Instruction Fine-Tuning & Mixture Objectives
- Study Flan, InstructGPT, and UL2 frameworks.
- Learn about multi-task fine-tuning and prompt prefixing for task conditioning.
- Examine the trade-off between multitask generalization and catastrophic forgetting.
Probing Question: “Why does instruction tuning improve zero-shot performance?” Because aligning on diverse task instructions teaches meta-learning — the ability to infer intent from natural language patterns.
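A minimal sketch of prompt prefixing for task conditioning: each task is rewritten as an instruction-plus-input string mapped to a target string, so one model can be fine-tuned on a mixture of tasks. The templates below are hypothetical, in the spirit of Flan, not drawn from any released dataset.

```python
# Hypothetical instruction templates (stand-ins, not the actual Flan collection).
TEMPLATES = {
    "nli": "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "summarize": "Summarize the following article:\n{text}",
    "translate_fr": "Translate to French:\n{text}",
}

def to_instruction_example(task, fields, target):
    # Every task becomes the same (instruction text -> target text) shape, so a single
    # seq2seq or decoder-only model can be fine-tuned on the whole mixture at once.
    return {"input": TEMPLATES[task].format(**fields), "target": target}

print(to_instruction_example("translate_fr", {"text": "The cat sleeps."}, "Le chat dort."))
```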
🧮 Long-Context Models (Claude, Gemini 1.5, Mistral)
Note
The Top Tech Interview Angle: Long-context reasoning is a frontier topic. You’re evaluated on your understanding of efficient attention, token compression, and retrieval augmentation for handling extended sequences.
4.1: Context Extension Strategies
- Learn sliding window attention and linear attention approximations (e.g., Performer).
- Study Attention with Linear Biases (ALiBi): a per-head penalty on attention scores that grows with token distance, allowing extrapolation to longer sequences.
- Explore Retrieval-Augmented Memory — externalizing long-term context via vector stores.
- Understand how models like Gemini 1.5 (context windows around a million tokens or more) and Claude 3.5 (hundreds of thousands of tokens) manage very long contexts.
Deeper Insight: True long-context modeling isn’t just attention scaling — it’s context compression, retrieval caching, and memory persistence.
Probing Question: “Why doesn’t linear attention fully solve long-context problems?” Because kernel and low-rank approximations blur the precise pairwise token interactions needed for reasoning over distant dependencies, so quality tends to drop on retrieval-heavy tasks.
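A hedged sketch of the ALiBi bias mentioned above: each head subtracts a slope-scaled distance penalty from its attention scores, which is what lets models extrapolate to sequences longer than those seen during training.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence, as in the ALiBi paper for
    # power-of-two head counts.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # Distance i - j between query i and key j; future keys are handled by the causal mask.
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()
    # Shape (heads, seq, seq): added to attention scores before softmax, so distant
    # keys are penalized linearly with distance.
    return -slopes[:, None, None] * distance

bias = alibi_bias(n_heads=8, seq_len=6)
print(bias.shape)   # torch.Size([8, 6, 6])
print(bias[0])      # head 0: zeros on the diagonal, increasingly negative with distance
```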
4.2: Sparse and Mixture-of-Experts Models
- Study Switch Transformers, Mixtral, and GLaM architectures.
- Understand sparse routing — only activating a subset of experts per token.
- Learn how load balancing, expert dropout, and gating improve efficiency.
Probing Question: “Why are MoE models compute-efficient despite their huge parameter counts?” Because only a few experts are activated per token, so the FLOPs per forward pass stay close to those of a much smaller dense model, even though all experts must still be held in memory.
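A simplified top-k routing sketch in plain PyTorch, loosely in the spirit of Switch/Mixtral-style MoE layers; it ignores load-balancing losses and expert capacity limits, and the expert MLPs are randomly initialized stand-ins.

```python
import torch
import torch.nn.functional as F

def topk_moe(x, gate_w, experts, k=2):
    # x: (tokens, d); gate_w: (d, n_experts); experts: list of per-expert MLPs.
    logits = x @ gate_w                              # router scores per token
    weights, idx = logits.topk(k, dim=-1)            # keep only the top-k experts per token
    weights = F.softmax(weights, dim=-1)             # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            routed = idx[:, slot] == e               # tokens sent to expert e in this slot
            if routed.any():
                out[routed] += weights[routed, slot].unsqueeze(-1) * expert(x[routed])
    return out

d, n_experts, n_tokens = 32, 4, 10
experts = [torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
           for _ in range(n_experts)]
print(topk_moe(torch.randn(n_tokens, d), torch.randn(d, n_experts), experts).shape)
# torch.Size([10, 32])
```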
🎨 Multimodal LLMs (GPT-4o, Gemini, MM-ReAct)
Note
The Top Tech Interview Angle: Multimodality tests whether you can reason about cross-modal embeddings and attention fusion. Expect to explain how LLMs integrate text, vision, and audio streams in a single architecture.
5.1: Cross-Modal Alignment
- Study CLIP (contrastive pretraining on text–image pairs).
- Learn Flamingo and PaLI — visual-text fusion using cross-attention layers.
- Explore GPT-4o, described as a single model trained end to end across text, vision, and audio.
- Understand token projection layers for encoding non-text modalities.
Deeper Insight: Vision-language models don’t merge pixels and tokens directly; they align embeddings in a shared latent space, then fuse via cross-attention.
Probing Question: “Why does CLIP use a contrastive objective instead of a fixed-label classifier?” Because contrastive training over image-text pairs learns an open-vocabulary alignment between the two modalities rather than committing to a closed set of classes.
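A minimal sketch of CLIP’s contrastive objective, which makes the point above concrete: it is a symmetric cross-entropy over in-batch image-text similarities, with the diagonal as the positive pairs. The embeddings are random stand-ins for the image and text encoders’ outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of paired images and captions.
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))           # the diagonal holds the true pairs
    # Symmetric: match each image to its caption and each caption to its image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))
```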
5.2: Multimodal Reasoning & ReAct Paradigm
- Learn MM-ReAct (reasoning + acting through modalities).
- Understand chain-of-thought + visual grounding workflows.
- Study tool use — how multimodal LLMs call vision, audio, and action APIs dynamically.
Probing Question: “How do multimodal LLMs decide when to reason vs. act?” Through internal planning layers (e.g., policy heads or action tokens) that model when to invoke tools or visual encoders.
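A hypothetical ReAct-style control loop (the llm callable, the prompt format, and the tool names are stand-ins, not any real API): the model alternates free-form reasoning, tool-calling actions, and a final answer, and tool results are fed back as observations.

```python
def react_loop(llm, tools, question, max_steps=5):
    # llm: callable taking the transcript so far and returning the next step string,
    # e.g. "Thought: ...", "Action: ocr(image_1)", or "Final: ...".
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition("(")
            observation = tools[name](arg.rstrip(")"))     # call the named tool
            transcript += f"Observation: {observation}\n"
    return transcript  # fall back to the raw transcript if no final answer appeared

# Toy run with canned model outputs and a fake OCR tool.
fake_steps = iter(["Thought: the receipt needs OCR.",
                   "Action: ocr(image_1)",
                   "Final: The total is $42."])
print(react_loop(lambda _t: next(fake_steps),
                 {"ocr": lambda arg: "reads 'total: $42'"},
                 "What is the total on the receipt?"))
```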
🧩 Open-Source LLM Ecosystem
Note
The Top Tech Interview Angle: Real-world work often requires adapting or benchmarking open models. You’ll be evaluated on your understanding of open-source model families, their architectures, and training trade-offs.
6.1: LLaMA, Mistral, and Falcon Families
- Study LLaMA 2/3: efficient dense models that adopt grouped-query attention (GQA) in their larger and later variants.
- Understand Mistral’s sliding window attention and Mixtral’s MoE setup.
- Examine Falcon: multi-query attention and FlashAttention for fast causal decoding.
Deeper Insight: Open models trade off scale for accessibility — understanding architectural optimizations shows practical engineering maturity.
Probing Question: “Why did Mistral 7B outperform larger LLaMA 2 models?” Because grouped-query and sliding-window attention improved throughput and long-context handling, combined with a stronger data and training recipe at the same parameter budget.
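A hedged sketch of grouped-query attention: several query heads share one cached key/value head, which is how LLaMA-style models shrink the KV cache without giving up multi-head queries. Shapes and tensors below are toy stand-ins.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    # Each group of query heads shares one KV head, shrinking the KV cache by
    # a factor of n_q_heads / n_kv_heads.
    b, n_q_heads, t, hd = q.shape
    group = n_q_heads // k.shape[1]
    k = k.repeat_interleave(group, dim=1)            # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / hd ** 0.5
    mask = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

b, t, hd = 1, 8, 16
q = torch.randn(b, 8, t, hd)                                # 8 query heads
k, v = torch.randn(b, 2, t, hd), torch.randn(b, 2, t, hd)   # only 2 KV heads are cached
print(grouped_query_attention(q, k, v).shape)               # torch.Size([1, 8, 8, 16])
```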
6.2: Fine-Tuning & Evaluation of Open Models
- Learn fine-tuning frameworks (PEFT, Axolotl) and inference/serving stacks (vLLM, llama.cpp).
- Study quantization-aware fine-tuning and model merging (delta weights).
- Benchmark models on MMLU, ARC, and TruthfulQA.
Probing Question: “Why does quantization often degrade reasoning before factual recall?” Plausibly because multi-step reasoning compounds small numerical errors across many layers and generated tokens, whereas recalling a memorized fact is closer to a single lookup that tolerates more noise.
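A small experiment sketch (symmetric per-tensor int8, not any particular library’s scheme) that shows the rounding error quantization introduces; the hedged intuition above is that these per-weight errors accumulate across many layers and generated tokens.

```python
import torch

def quantize_int8(w):
    # Symmetric per-tensor quantization: scale so the max magnitude maps to 127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"mean abs rounding error: {(w - w_hat).abs().mean():.6f}")
# Each weight is only slightly off, but the error compounds over dozens of layers
# and many decoding steps, which hurts long reasoning chains first.
```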