🤖 Core ML Fundamentals
Note
The Top Tech Company Angle (Probability & Optimization): A strong foundation in probability, optimization, and linear algebra is non-negotiable. These basics underpin everything in LLMs, from attention mechanisms to likelihood-based training. Interviewers use this to check if you can reason about models without treating them as “black boxes.”
1.1: Probability & Information Theory
- Understand random variables, joint/marginal/conditional probabilities.
- Dive into KL divergence, cross-entropy, and entropy — crucial for loss functions.
- Be able to derive and explain why cross-entropy is the natural loss for classification.
Deeper Insight: Expect questions like, “Why do we prefer KL divergence over simple distance metrics?” or “How does cross-entropy behave when probabilities are highly imbalanced?”
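To make these definitions concrete, here is a minimal NumPy sketch (the distributions are made-up toy values) of the identity behind that question: cross-entropy decomposes into entropy plus KL divergence, so minimizing cross-entropy in $q$ is the same as minimizing $KL(p \| q)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # toy "true" distribution
q = np.array([0.5, 0.3, 0.2])  # toy model distribution

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # KL(p || q)

# H(p, q) = H(p) + KL(p || q): minimizing cross-entropy w.r.t. q
# is equivalent to minimizing KL(p || q), since H(p) is fixed.
assert np.isclose(cross_entropy, entropy + kl)
```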
1.2: Linear Algebra Foundations
- Matrix multiplication, eigenvalues/eigenvectors, and orthogonality.
- Master projections and vector spaces — essential for embeddings and transformations.
- Connect SVD/PCA intuition to dimensionality reduction in NLP.
Probing Question: “How does the rank of a weight matrix impact model capacity?” or “Why are low-rank approximations useful in large-scale models?”
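A quick NumPy sketch of the low-rank idea behind that question (the random matrix and the rank choice are illustrative): truncating the SVD gives the best rank-$k$ approximation, which is the intuition behind both PCA and low-rank weight factorizations.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a weight matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 8  # keep only the top-k singular values
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By Eckart-Young, W_k is the best rank-k approximation in Frobenius norm.
err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
print(f"relative error at rank {k}: {err:.3f}")
```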
1.3: Optimization & Gradient Descent
- Internalize gradient descent, SGD, momentum, and Adam.
- Be fluent with convex vs. non-convex landscapes and saddle points.
- Understand learning rate schedules (cosine, warmup, decay).
Note: At scale, unstable training can cost millions. Be ready to discuss trade-offs between optimizers and how you’d debug exploding vs. vanishing gradients.
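One schedule worth being able to write cold is cosine decay with linear warmup; a minimal sketch (all hyperparameter defaults here are illustrative):

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```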
🧠 Deep Learning Foundations
Note
The Top Tech Company Angle (Neural Nets): Deep learning is the substrate for LLMs. The ability to reason about activations, backpropagation, and regularization signals deep comprehension. Interviewers will probe whether you understand both the math and practical consequences.
2.1: Feedforward Networks & Backpropagation
- Write out forward and backward passes mathematically.
- Implement a tiny NN from scratch with NumPy (a sketch follows below).
- Explain chain rule applications in backprop.
Probing Question: “If gradients vanish in deep networks, what would you do?” Expect to discuss initialization strategies, normalization, or residual connections.
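A minimal version of that from-scratch exercise: a one-hidden-layer regression network with a hand-written backward pass (toy data and sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))  # toy inputs
y = rng.standard_normal((32, 1))  # toy regression targets

W1 = rng.standard_normal((4, 8)) * 0.1; b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.1; b2 = np.zeros(1)

lr = 0.1
for step in range(200):
    # Forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(0, h_pre)      # ReLU
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: chain rule applied layer by layer
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_yhat;  db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T
    dh_pre = dh * (h_pre > 0)     # ReLU derivative
    dW1 = X.T @ dh_pre;  db1 = dh_pre.sum(axis=0)

    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```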
2.2: Regularization & Generalization
- Understand dropout, weight decay, and batch norm.
- Learn the original “internal covariate shift” motivation for batch norm, and why later work attributes its benefit to a smoother loss landscape instead.
- Be able to connect regularization to bias-variance trade-offs.
Note: Be prepared to explain when dropout hurts performance (e.g., in attention-heavy architectures).
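A small PyTorch sketch of where these knobs actually live in practice (layer sizes and coefficients are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # active under model.train(), disabled under model.eval()
    nn.Linear(256, 10),
)

# Weight decay (L2-style regularization) enters through the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```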
2.3: Sequence Models (RNNs, LSTMs, GRUs)
- Grasp the recurrence equations and the vanishing-gradient problem.
- Connect LSTMs/GRUs to memory retention vs. forgetting.
- Be ready to compare sequence models to transformers.
Probing Question: “Why did transformers replace RNNs in NLP?” The interviewer is probing depth, not trivia.
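The recurrence itself is short; a NumPy sketch of a vanilla RNN step (illustrative sizes) makes the repeated multiplication through $W_h$, and hence the vanishing/exploding-gradient issue, concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 16, 32, 20

W_x = rng.standard_normal((d_in, d_h)) * 0.1
W_h = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)

xs = rng.standard_normal((T, d_in))  # toy input sequence
h = np.zeros(d_h)
for x_t in xs:
    # h_t = tanh(x_t W_x + h_{t-1} W_h + b). Backprop through T steps
    # multiplies by W_h's Jacobian T times, shrinking or blowing up gradients.
    h = np.tanh(x_t @ W_x + h @ W_h + b)
```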
🔍 Transformer Architecture & Attention
Note
The Top Tech Company Angle (Transformers): Transformers are the backbone of LLMs. You’ll be judged on your ability to reason about attention mathematically, explain design trade-offs, and discuss why alternatives failed.
3.1: Scaled Dot-Product Attention
- Derive $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
- Explain why scaling by $\sqrt{d_k}$ keeps the logits’ variance near 1, preventing softmax saturation and the vanishing gradients it causes.
- Visualize attention matrices on toy data.
Probing Question: “What happens if we remove scaling in attention?” Expect to argue with math and intuition.
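A direct NumPy transcription of the formula (toy shapes), handy both for visualizing attention matrices and for experimenting with what happens when the scale is removed:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # output, attention matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
# Drop the / np.sqrt(d_k) and increase d_k: the softmax saturates toward
# one-hot weights and gradients through it vanish.
```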
3.2: Multi-Head Attention
- Learn why multiple heads provide richer subspace representations.
- Implement a minimal multi-head attention in PyTorch (sketched below).
- Compare single vs. multi-head empirically.
Note: Common pitfall: confusing concatenation vs. averaging across heads.
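A minimal PyTorch multi-head attention (no masking or dropout; dimensions are illustrative). Note that heads are concatenated, not averaged, before the output projection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # -> (batch, n_heads, seq, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)  # concatenate heads
        return self.out(out)

mha = MultiHeadAttention(d_model=64, n_heads=8)
y = mha(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```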
3.3: Positional Encodings
- Understand sinusoidal encodings vs. learned embeddings.
- Explain why transformers need them: self-attention is permutation-invariant, so it has no inherent notion of token order.
- Derive the sine/cosine functions for encoding.
Probing Question: “If you use learned positional embeddings, what happens when you extrapolate to longer sequences?”
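The standard sinusoidal encodings fit in a few lines of NumPy (shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
```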
📚 Large Language Models (LLMs)
Note
The Top Tech Company Angle (LLMs): This is the main act. Expect to show mastery of scaling laws, training objectives, and inference-time trade-offs. Being able to navigate hallucinations, prompt engineering, and fine-tuning approaches demonstrates end-to-end fluency.
4.1: Pretraining Objectives
- Derive masked language modeling (MLM) vs. causal LM objectives.
- Explain next-token prediction and cross-entropy loss in depth.
- Compare BERT vs. GPT training schemes.
Probing Question: “Why does causal LM scale better for generation tasks?”
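Next-token prediction is cross-entropy against shifted targets; a PyTorch sketch with stand-in logits (toy vocabulary and batch sizes):

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 16, 1000                 # batch, sequence length, vocab size
tokens = torch.randint(0, V, (B, T))  # toy token ids
logits = torch.randn(B, T, V)         # stand-in for model outputs

# Causal LM: position t predicts token t+1, so shift by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),    # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),        # targets are the next tokens
)
```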
4.2: Scaling Laws
- Study empirical scaling laws (Kaplan et al.; Hoffmann et al.’s Chinchilla).
- Understand compute-optimal training and data vs. parameter trade-offs.
- Be able to sketch scaling trends.
Note: Expect a follow-up: “If you double parameters but not data, what happens?”
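As a reference for sketching: Kaplan et al. fit power laws of the form $L(N) \approx (N_c/N)^{\alpha_N}$ in parameters and $L(D) \approx (D_c/D)^{\alpha_D}$ in data, with exponents of roughly $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$. Their joint fit $L(N, D)$ saturates in whichever resource is scarce, which is exactly the trap in the follow-up above: doubling $N$ without more $D$ leaves the loss data-bottlenecked.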
4.3: Fine-Tuning Methods
- Full fine-tuning vs. adapters vs. LoRA.
- Understand parameter-efficient training for deployment.
- Implement LoRA in practice (a minimal sketch follows).
Probing Question: “Why might you prefer adapters in a multi-task environment?”
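A minimal LoRA sketch in PyTorch (rank and scaling are illustrative): freeze the pretrained weight and learn a low-rank update $BA$ added to its output.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T, i.e. a rank-r update B @ A
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
```

Zero-initializing B means training starts exactly at the pretrained model, one reason LoRA fine-tuning is stable.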
4.4: Alignment & RLHF
- Learn Reinforcement Learning from Human Feedback (policy, reward, PPO).
- Connect reward shaping to bias/hallucination control.
- Discuss alternatives: DPO, constitutional AI.
Note: Expect scale questions: “How would you make RLHF efficient on a trillion-parameter model?”
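One piece of the pipeline that is easy to whiteboard is the pairwise (Bradley-Terry) reward-model loss over chosen vs. rejected responses (toy scores below; PPO itself is a much larger sketch):

```python
import torch
import torch.nn.functional as F

# Toy scalar rewards a reward model assigned to preferred/rejected responses.
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8)

# Bradley-Terry pairwise loss: push r_chosen above r_rejected.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```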
⚙️ MLOps & Systems for LLMs
Note
The Top Tech Company Angle (Deployment & Scale): Knowing theory is not enough. Companies want engineers who can scale, monitor, and optimize models in production. Deployment pitfalls often make or break interviews.
5.1: Serving Large Models
- Explore model parallelism vs. pipeline parallelism.
- Discuss quantization (INT8, FP16) and memory trade-offs.
- Be ready to describe inference serving strategies.
Probing Question: “How would you serve a 100B+ parameter model on GPUs with 40GB memory?”
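The memory arithmetic behind that question: 100B parameters at FP16 is roughly 200 GB for weights alone, so a single 40 GB GPU cannot hold them and you must shard (tensor/pipeline parallelism) or compress. A back-of-the-envelope symmetric INT8 quantization sketch (per-tensor scale; real systems typically use per-channel scales and calibration):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)  # 4x smaller than FP32, 2x smaller than FP16
err = np.abs(dequantize(q, scale) - w).max()
```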
5.2: Monitoring & Evaluation
- Define perplexity, BLEU, and ROUGE, but also discuss their limits.
- Understand human eval and red-teaming.
- Learn drift detection in deployed systems.
Note: Follow-up: “Why does perplexity not always align with user satisfaction?”
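Perplexity is just the exponentiated average per-token negative log-likelihood; a sketch with stand-in logits:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 16, 1000
logits = torch.randn(B, T, V)          # stand-in model outputs
targets = torch.randint(0, V, (B, T))

nll = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
perplexity = torch.exp(nll)            # exp(mean per-token NLL)
```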
5.3: Safety & Hallucinations
- Analyze why LLMs hallucinate: they sample from an imperfectly learned distribution rather than a grounded source of facts.
- Discuss mitigation: retrieval augmentation, calibration, RLHF.
- Understand trade-offs between safety and creativity.
Probing Question: “If you add retrieval, how do you ensure latency doesn’t spike?”
🧩 Advanced Topics & Research Depth
Note
The Top Tech Company Angle (Beyond the Basics): Differentiation at the highest level comes from research-level insights. Candidates who can connect open problems to practical trade-offs stand out.
6.1: Retrieval-Augmented Generation (RAG)
- Architecture: retriever + generator.
- Sparse vs. dense retrieval (BM25 vs. DPR).
- Latency vs. accuracy trade-offs.
Probing Question: “Why might a hybrid retriever (sparse+dense) outperform either alone?”
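A sketch of hybrid scoring (the min-max normalization and weighting here are illustrative; real systems often use reciprocal rank fusion or a learned combiner):

```python
import numpy as np

def hybrid_scores(bm25: np.ndarray, dense: np.ndarray, alpha: float = 0.5):
    """Normalize each retriever's scores over the candidate set, then mix."""
    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * norm(dense) + (1 - alpha) * norm(bm25)

bm25 = np.array([12.1, 7.4, 3.3, 9.8])      # toy BM25 scores per candidate
dense = np.array([0.82, 0.10, 0.55, 0.70])  # toy cosine similarities
ranking = np.argsort(-hybrid_scores(bm25, dense))  # best candidates first
```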
6.2: Mixture of Experts (MoE)
- Understand gating networks, load balancing.
- Scaling benefits vs. routing challenges.
- Implement a toy MoE in PyTorch (see the sketch below).
Note: Pitfall: assuming MoE always reduces inference cost — interviewer may challenge this.
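A toy MoE layer in PyTorch with top-1 gating (no load-balancing loss, which real systems need to keep experts evenly used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)  # routing probabilities
        top1 = probs.argmax(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():  # each token runs through only its chosen expert
                out[mask] = probs[mask, i].unsqueeze(1) * expert(x[mask])
        return out

moe = ToyMoE(d_model=64, n_experts=4)
y = moe(torch.randn(32, 64))
```

Per-token compute is one expert’s worth, but all experts’ parameters must still live in memory, which is the usual catch behind the pitfall above.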
6.3: Interpretability
- Attribution methods: attention visualization, probing tasks.
- Mechanistic interpretability (circuits in transformers).
- Be able to discuss societal implications.
Probing Question: “If attention isn’t explanation, how else would you probe model internals?”
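One standard answer is linear probing: train a simple classifier on frozen hidden states and see what is linearly decodable from them. A sketch with made-up features (in practice X would be layer activations extracted from the model, and y a linguistic property of interest):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))  # stand-in for frozen hidden states
y = rng.integers(0, 2, 500)          # stand-in binary property labels

probe = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
acc = probe.score(X[400:], y[400:])  # high accuracy => property is encoded
```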