Large Language Models (LLMs)


🧠 Transformer & GPT Family

Note

The Top Tech Interview Angle: Transformers underpin every LLM. Interviewers assess your ability to break down attention math, scaling strategies, and the trade-offs between decoder-only, encoder-decoder, and hybrid architectures. You’ll also be expected to reason about why GPT-style models dominate generative tasks.

1.1: The GPT Lineage (GPT-1 → GPT-4)

  1. Study how GPT-1 established transfer learning for NLP (unsupervised pretraining, then supervised fine-tuning).
  2. Understand GPT-2’s autoregressive causal masking for next-token prediction and its zero-shot task transfer at scale.
  3. Dive into GPT-3’s scaling laws: billions of parameters, sparse attention, and in-context (few-shot) learning via prompting.
  4. Examine GPT-4’s reported mixture-of-experts (MoE) design and multimodality (GPT-4o).

Deeper Insight: GPT’s success lies in causal masking + massive scale. The key differentiator is that it learns tasks implicitly from text patterns without explicit supervision.

Probing Question: “Why do decoder-only models like GPT generalize so well to unseen tasks?” Because the autoregressive training objective inherently captures conditional dependencies — every next-token prediction becomes a proxy for reasoning.
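
A minimal sketch of that objective, assuming PyTorch and a toy single-head attention: the causal (lower-triangular) mask is what enforces the strict "only tokens ≤ t" conditioning behind every next-token prediction.

```python
import torch
import torch.nn.functional as F

def causal_attention_weights(q, k):
    """Scaled dot-product attention weights with a causal (lower-triangular) mask.

    q, k: (batch, seq_len, d) toy single-head tensors; position t can only
    attend to positions <= t, so the prediction for t+1 sees only the past.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5                     # (batch, T, T)
    T = scores.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))              # hide future tokens
    return F.softmax(scores, dim=-1)

x = torch.randn(1, 5, 16)
weights = causal_attention_weights(x, x)   # upper triangle of weights is exactly zero
```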


1.2: Transformer Internals

  1. Dissect multi-head self-attention, feed-forward networks, residuals, and layer norms.
  2. Study attention parallelization and memory-efficient attention (FlashAttention).
  3. Explore rotary positional embeddings (RoPE) and their benefits over sinusoidal ones.
  4. Learn how modern GPT variants implement parallel transformer blocks for speed.

Deeper Insight: Real mastery means knowing how these micro-optimizations scale to 100B+ parameters while avoiding training divergence.

Probing Question: “Why does GPT use a causal mask instead of BERT’s bidirectional attention?” Because it must predict token t+1 given only tokens ≤ t; the mask enforces the strict left-to-right dependency needed for generation.
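
A minimal sketch of rotary positional embeddings (RoPE) from item 3, assuming PyTorch, a single head, and an even head dimension: pairs of query/key channels are rotated by position-dependent angles, so relative offsets fall out of the attention dot product.

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq_len, head_dim).

    head_dim must be even: channels are split into two halves and each pair is
    rotated by an angle that grows with position and shrinks with channel depth.
    """
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]   # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 64)
q_rot = apply_rope(q)   # keys get the same rotation before the attention dot product
```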


🧩 BERT Family & Bidirectional Transformers

Note

The Top Tech Interview Angle: BERT revolutionized representation learning and contextual embeddings. Interviews often test whether you can differentiate between encoder-only and decoder-only training, and reason about why BERT isn’t generative but essential for understanding.

2.1: BERT and Its Variants

  1. Study Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
  2. Examine BERT’s bidirectional encoder stack — self-attention across both left and right context.
  3. Understand RoBERTa’s removal of NSP and dynamic masking improvements.
  4. Learn about DistilBERT (knowledge distillation) and ALBERT (cross-layer parameter sharing and factorized embeddings), two parameter-efficient variants.

Deeper Insight: BERT builds context-rich embeddings, while GPT learns generative transitions. Understanding this duality is crucial for reasoning about hybrid architectures (e.g., T5).

Probing Question: “Why can’t BERT be directly used for open-ended text generation?” Because its masked-LM objective predicts randomly masked tokens from both left and right context, so it never learns the left-to-right (autoregressive) factorization needed to sample coherent text token by token.
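
A minimal sketch of BERT-style dynamic masking, assuming PyTorch; the MASK_ID and VOCAB_SIZE constants are illustrative (they follow bert-base conventions). Roughly 15% of positions are selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged, and the loss is computed only on the selected positions.

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30522   # illustrative ids following bert-base conventions

def mlm_mask(input_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels) for masked language modeling.

    labels are -100 (ignored by cross-entropy) everywhere except selected positions.
    """
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100                                   # score only selected positions

    corrupted = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    corrupted[selected & (rand < 0.8)] = MASK_ID               # 80% -> [MASK]
    random_ids = torch.randint(0, VOCAB_SIZE, input_ids.shape)
    swap = selected & (rand >= 0.8) & (rand < 0.9)             # 10% -> random token
    corrupted[swap] = random_ids[swap]
    # the remaining 10% of selected tokens stay unchanged
    return corrupted, labels

ids = torch.randint(1000, 2000, (2, 12))
corrupted, labels = mlm_mask(ids)
```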


2.2: Sentence & Cross-Lingual Variants

  1. Learn Sentence-BERT (SBERT) for semantic similarity via contrastive objectives.
  2. Study mBERT and XLM-R, focusing on multilingual embedding alignment.
  3. Explore shared tokenizers and cross-lingual masked modeling strategies.

Probing Question: “How does SBERT differ from BERT in training objective?” SBERT uses contrastive learning — aligning semantically similar sentence pairs in embedding space, enabling retrieval and clustering tasks.
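
A minimal sketch, assuming PyTorch, of the in-batch contrastive objective used in SBERT-style training: each (anchor, positive) sentence-embedding pair treats every other positive in the batch as a negative.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Multiple-negatives-ranking-style loss over a batch of embedding pairs.

    anchors, positives: (batch, dim) sentence embeddings; row i of each is a
    semantically matching pair, and all other rows act as negatives.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature      # cosine similarities scaled by temperature
    targets = torch.arange(a.size(0))     # the correct match sits on the diagonal
    return F.cross_entropy(logits, targets)

loss = in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
```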


🧬 Encoder–Decoder & Instruction Models (T5, FLAN, UL2)

Note

The Top Tech Interview Angle: Encoder–decoder models test your understanding of sequence-to-sequence learning, multi-task pretraining, and instruction alignment. Knowing how these architectures combine understanding + generation is key for reasoning about instruction-tuned LLMs.

3.1: The T5 Architecture

  1. Understand T5’s unified text-to-text framework (“everything is text in, text out”).
  2. Study span corruption (replacing spans, not tokens) as a pretraining task.
  3. Learn about the shared vocabulary and relative position biases (reused across layers).
  4. Explore T5.1.1, Flan-T5 (instruction-tuned), and UL2 (mixture-of-denoisers objective).

Deeper Insight: T5’s architecture inspired the instruction-tuning revolution — enabling LLMs to follow human-readable commands across diverse tasks.

Probing Question: “Why does span corruption outperform token-level masking?” Because predicting longer spans forces the model to reason about syntactic and semantic coherence, not just local context.
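
A minimal sketch in plain Python of span corruption; the sentinel naming follows T5’s <extra_id_N> convention, and the whitespace tokenization is purely illustrative.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    tokens: list of token strings.
    spans:  list of (start, end) index pairs to corrupt, non-overlapping and sorted.
    """
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])
        inp.append(sentinel)                 # the whole span is replaced by one sentinel
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])        # the target spells the span out after it
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")   # closing sentinel marks the end of targets
    return " ".join(inp), " ".join(tgt)

tokens = "the quick brown fox jumps over the lazy dog".split()
print(span_corrupt(tokens, [(1, 3), (6, 8)]))
# ('the <extra_id_0> fox jumps over <extra_id_1> dog',
#  '<extra_id_0> quick brown <extra_id_1> the lazy <extra_id_2>')
```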


3.2: Instruction Fine-Tuning & Mixture Objectives

  1. Study Flan, InstructGPT, and UL2 frameworks.
  2. Learn about multi-task fine-tuning and prompt prefixing for task conditioning.
  3. Examine the trade-off between multitask generalization and catastrophic forgetting.

Probing Question: “Why does instruction tuning improve zero-shot performance?” Because aligning on diverse task instructions teaches meta-learning — the ability to infer intent from natural language patterns.
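
A minimal sketch, in plain Python, of multi-task instruction formatting; the templates and task names are illustrative assumptions, not taken from Flan or any specific paper.

```python
# Illustrative instruction templates; real mixtures (e.g., Flan) use many templates per task.
TEMPLATES = {
    "summarize": "Summarize the following article:\n{text}",
    "translate": "Translate the following sentence to French:\n{text}",
    "sentiment": "Is the sentiment of this review positive or negative?\n{text}",
}

def to_instruction_example(task, text, answer):
    """Render one supervised example as a (prompt, target) pair for instruction tuning."""
    return {"prompt": TEMPLATES[task].format(text=text), "target": answer}

batch = [
    to_instruction_example("sentiment", "The movie was a delight.", "positive"),
    to_instruction_example("translate", "Good morning.", "Bonjour."),
]
```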


🧮 Long-Context Models (Claude, Gemini 1.5, Mistral)

Note

The Top Tech Interview Angle: Long-context reasoning is a frontier topic. You’re evaluated on your understanding of efficient attention, token compression, and retrieval augmentation for handling extended sequences.

4.1: Context Extension Strategies

  1. Learn Sliding Window Attention and linear-attention approximations (e.g., Performer).
  2. Study Attention with Linear Biases (ALiBi): distance-proportional penalties on attention scores that decay attention with distance and allow length extrapolation.
  3. Explore Retrieval-Augmented Memory — externalizing long-term context via vector stores.
  4. Understand how models like Gemini 1.5 (1M+ tokens) and Claude 3.5 (200K tokens) manage very long contexts.

Deeper Insight: True long-context modeling isn’t just attention scaling — it’s context compression, retrieval caching, and memory persistence.

Probing Question: “Why doesn’t linear attention fully solve long-context problems?” Because the kernel approximation blurs exact pairwise token interactions, which are exactly what is needed for retrieving and reasoning over distant dependencies.
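
A minimal sketch of ALiBi, assuming PyTorch: each head adds a linear penalty proportional to query-key distance to the attention scores before the softmax, with per-head slopes on the geometric schedule from the paper.

```python
import torch

def alibi_bias(seq_len, num_heads):
    """Per-head linear distance penalties, shape (num_heads, seq_len, seq_len)."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1) / num_heads)   # (H,)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)                 # i - j for past keys
    return -slopes[:, None, None] * distance[None, :, :].float()

# Usage: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(T, H), then causal mask + softmax.
bias = alibi_bias(seq_len=6, num_heads=4)
```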


4.2: Sparse and Mixture-of-Experts Models

  1. Study Switch Transformers, Mixtral, and GLaM architectures.
  2. Understand sparse routing — only activating a subset of experts per token.
  3. Learn how load balancing, expert dropout, and gating improve efficiency.

Probing Question: “Why are MoE models more compute-efficient per token?” Because capacity is distributed across conditional experts and only a few are active per forward pass, so FLOPs scale with the active parameters rather than the total parameter count (total memory is actually higher, since every expert must be stored).
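
A minimal sketch of top-k sparse routing, assuming PyTorch; the expert sizes and the per-expert Python loop are for clarity only (production kernels batch tokens by expert).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to its top_k experts."""

    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)        # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        scores = self.gate(x)                              # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = idx[:, slot] == e                    # tokens whose slot-th pick is expert e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 64))                               # only 2 of 8 experts run per token
```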


🎨 Multimodal LLMs (GPT-4o, Gemini, MM-ReAct)

Note

The Top Tech Interview Angle: Multimodality tests whether you can reason about cross-modal embeddings and attention fusion. Expect to explain how LLMs integrate text, vision, and audio streams in a single architecture.

5.1: Cross-Modal Alignment

  1. Study CLIP (contrastive pretraining on text–image pairs).
  2. Learn Flamingo and PaLI — visual-text fusion using cross-attention layers.
  3. Explore GPT-4o, a single model trained end-to-end across text, vision, and audio.
  4. Understand token projection layers for encoding non-text modalities.

Deeper Insight: Vision-language models don’t merge pixels and tokens directly; they align embeddings in a shared latent space, then fuse via cross-attention.

Probing Question: “Why does CLIP use contrastive loss instead of cross-entropy?” Because it optimizes for alignment between paired modalities rather than direct classification.
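
A minimal sketch, assuming PyTorch, of the symmetric contrastive loss behind CLIP: normalized image and text embeddings form a batch-by-batch similarity matrix, and cross-entropy is applied in both directions with matched pairs on the diagonal.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))              # matched pair sits on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> which text?
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> which image?
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
```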


5.2: Multimodal Reasoning & ReAct Paradigm

  1. Learn MM-ReAct (reasoning + acting through modalities).
  2. Understand chain-of-thought + visual grounding workflows.
  3. Study tool use — how multimodal LLMs call vision, audio, and action APIs dynamically.

Probing Question: “How do multimodal LLMs decide when to reason vs. act?” Typically through learned planning behavior: the model emits structured action or tool-call tokens when its own chain of reasoning indicates that external perception or computation is needed, and plain text otherwise.
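
A minimal sketch, in plain Python, of a ReAct-style control loop; the tool registry, the `generate` callable, and the `Action:`/`Final:` line protocol are all illustrative assumptions rather than any framework’s actual API.

```python
# Hypothetical tool registry; real systems wire these to vision/audio/search backends.
TOOLS = {
    "describe_image": lambda arg: f"[caption for {arg}]",
    "search":         lambda arg: f"[top results for {arg}]",
}

def react_loop(generate, prompt, max_steps=5):
    """Alternate model reasoning with tool calls until a final answer is produced.

    generate: callable (str) -> str standing in for the multimodal LLM.
    """
    transcript = prompt
    for _ in range(max_steps):
        step = generate(transcript)
        transcript += "\n" + step
        if step.startswith("Final:"):                    # the model decided to stop acting
            return step.removeprefix("Final:").strip()
        if step.startswith("Action:"):                   # e.g. "Action: search(cat breeds)"
            name, _, arg = step.removeprefix("Action:").strip().partition("(")
            observation = TOOLS[name.strip()](arg.rstrip(")"))
            transcript += f"\nObservation: {observation}"
    return transcript

# Toy usage with a scripted "model" that calls one tool, then answers.
canned = iter(["Action: describe_image(photo.png)", "Final: It is a cat on a sofa."])
print(react_loop(lambda transcript: next(canned), "Describe the attached photo."))
```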


🧩 Open-Source LLM Ecosystem

Note

The Top Tech Interview Angle: Real-world work often requires adapting or benchmarking open models. You’ll be evaluated on your understanding of open-source model families, their architectures, and training trade-offs.

6.1: LLaMA, Mistral, and Falcon Families

  1. Study LLaMA 2/3 — efficient dense models with grouped-query attention (GQA).
  2. Understand Mistral’s sliding window attention and Mixtral’s MoE setup.
  3. Examine Falcon: multi-query attention and FlashAttention for efficient causal decoding.

Deeper Insight: Open models trade off scale for accessibility — understanding architectural optimizations shows practical engineering maturity.

Probing Question: “Why did Mistral outperform LLaMA at smaller sizes?” Because sliding-window plus grouped-query attention and a stronger training recipe improved quality per parameter, context handling, and inference throughput.
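
A minimal sketch of grouped-query attention (GQA), assuming PyTorch: key/value heads are fewer than query heads and each is shared across a group of query heads, which shrinks the KV cache during decoding.

```python
import torch

def grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    """q: (batch, T, num_q_heads, d); k, v: (batch, T, num_kv_heads, d).

    Each KV head serves num_q_heads // num_kv_heads query heads.
    """
    group = num_q_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=2)               # share each KV head across its query group
    v = v.repeat_interleave(group, dim=2)
    q, k, v = [t.transpose(1, 2) for t in (q, k, v)]    # -> (batch, heads, T, d)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    T = scores.size(-1)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return (scores.softmax(dim=-1) @ v).transpose(1, 2)  # back to (batch, T, heads, d)

out = grouped_query_attention(torch.randn(1, 16, 8, 64),
                              torch.randn(1, 16, 2, 64),
                              torch.randn(1, 16, 2, 64))
```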


6.2: Fine-Tuning & Evaluation of Open Models

  1. Learn the tooling stack: fine-tuning frameworks (PEFT, Axolotl) and inference runtimes (vLLM, llama.cpp).
  2. Study quantization-aware fine-tuning and model merging (delta weights).
  3. Benchmark models on MMLU, ARC, and TruthfulQA.

Probing Question: “Why does quantization often degrade reasoning before factual recall?” Likely because multi-step reasoning compounds small numerical errors across many layers and decoding steps, whereas factual recall tends to rely on strong, redundant weight patterns that survive lower precision.
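
A minimal sketch, assuming PyTorch, of the low-rank adapter idea behind PEFT-style LoRA fine-tuning (this is the concept, not the peft library’s actual classes): a frozen base linear layer gains a trainable update BA scaled by alpha/r, so only a tiny fraction of parameters is tuned.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)              # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # only A and B
```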

