2.5. Quantization & Distillation — Making Giants Efficient
🪄 Step 1: Intuition & Motivation
- Core Idea: Modern LLMs are massive — think billions of parameters occupying tens to hundreds of gigabytes in full precision. That’s great for performance, but not so great for your GPU (or wallet).
Two powerful techniques help tame these giants:
- Quantization — Shrink the model by reducing number precision.
- Distillation — Shrink the model by training a smaller one to mimic the larger one.
Together, they make large models faster, lighter, and deployable on everyday hardware.
- Simple Analogy: Think of quantization as compressing a high-quality image to JPEG: a smaller file with a bit of quality loss. Distillation is like a wise professor teaching a talented student: a smaller brain that absorbs much of the same wisdom.
🌱 Step 2: Core Concept
Quantization — Squeezing Precision for Efficiency
Normally, model weights are stored in 32-bit floating-point format (FP32).
Quantization reduces these to lower-precision types like INT8, INT4, or even INT2.
This means each weight takes up less space, and computations become faster since lower-precision math is lighter on hardware.
Example: A model with 10 billion parameters in FP32 (4 bytes each) needs about 40 GB just for its weights. Convert to INT8 (1 byte each) and that drops to about 10 GB: 4× smaller, and faster to run on hardware with INT8 support.
But quantization is not free: reducing precision introduces numerical error, which can mean slight drops in accuracy or outright instability if done poorly.
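To sanity-check the arithmetic in the example above, here is a minimal back-of-the-envelope sketch in plain Python. The bytes-per-parameter table covers only weight storage and ignores activations, KV caches, and optimizer state.

```python
# Approximate weight storage at different precisions (1 GB = 1e9 bytes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in gigabytes for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

params = 10e9  # 10 billion parameters
print(weight_memory_gb(params, "fp32"))  # 40.0
print(weight_memory_gb(params, "int8"))  # 10.0 -> 4x smaller in memory
```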
Distillation — Passing Knowledge Downstream
Instead of compressing numbers, distillation compresses knowledge.
A large, well-trained teacher model generates soft targets — probability distributions over output tokens — which are then used to train a smaller student model.
This student model learns not only what the teacher predicts, but also how confidently the teacher predicts each outcome.
The loss function combines both the original ground truth and the teacher’s guidance:
$$ L = \alpha H(y, s) + (1 - \alpha) H(t, s) $$
where:
- $H$ = cross-entropy loss,
- $y$ = true labels,
- $t$ = teacher outputs (soft targets),
- $s$ = student outputs,
- $\alpha$ balances between real data and teacher imitation.
The result? A much smaller, faster model that retains much of the teacher’s accuracy and behavior.
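As a concrete illustration, here is a minimal PyTorch sketch of the blended loss above. The soft term is written as cross-entropy between the teacher’s probabilities and the student’s log-probabilities, and the value of `alpha` is just an example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """L = alpha * H(y, s) + (1 - alpha) * H(t, s)."""
    # Hard term: standard cross-entropy against the ground-truth labels y.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft term: cross-entropy between the teacher's distribution t and the
    # student's distribution s (teacher probabilities times student log-probs).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    soft_loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy usage: a batch of 4 examples with 10 output classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels, alpha=0.4)
loss.backward()
```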
How They Work Together
Quantization reduces storage and computation, while distillation reduces architecture size.
You can combine them:
- Distill first, quantize later: Train a smaller student model first, then quantize it for further compression.
- Quantize during distillation: Quantize the student model during training for even greater efficiency.
This combo creates compact models like DistilBERT or 4-bit quantized LLaMA variants, which keep near-teacher accuracy at a fraction of the cost.
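Here is a rough end-to-end sketch of the "distill first, quantize later" recipe. It assumes PyTorch, reuses the `distillation_loss` sketch above, and uses toy models and random data as stand-ins for a real teacher, student, and training set.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a large teacher and a smaller student on a 20-d, 10-class task.
teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
teacher.eval()

inputs = torch.randn(64, 20)                 # random "data" for illustration only
labels = torch.randint(0, 10, (64,))

# Stage 1: distill. Train the student against the frozen teacher's soft targets.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
for _ in range(20):
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    loss = distillation_loss(student(inputs), teacher_logits, labels, alpha=0.4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: quantize. Apply post-training dynamic INT8 quantization to the student.
student.eval()
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```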
📐 Step 3: Mathematical Foundation
Quantization Equation
In quantization, we map a floating-point value $x$ to a discrete integer $q$ using a scale and zero-point:
$$ q = \text{round}\left(\frac{x}{\text{scale}}\right) + \text{zero\_point} $$
To recover the approximate float value during inference:
$$ x' = \text{scale} \times (q - \text{zero\_point}) $$
- Scale: determines how much each integer step represents.
- Zero-point: aligns integer zero with the real zero in floating-point space.
This process can be per-tensor (one scale for all weights in a tensor) or per-channel (each output channel or neuron gets its own scale).
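Here is a minimal sketch of these two equations as asymmetric per-tensor INT8 quantization. The min/max-based scale and the [-128, 127] integer range are common conventions, not the only options.

```python
import torch

def quantize_per_tensor(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """q = round(x / scale) + zero_point, clamped to the integer range."""
    scale = (x.max() - x.min()) / (qmax - qmin)          # float value of one integer step
    zero_point = qmin - torch.round(x.min() / scale)     # integer that represents float 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8), scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    """x' = scale * (q - zero_point), an approximation of the original float."""
    return scale * (q.float() - zero_point)

w = torch.randn(4, 4)                     # pretend these are trained FP32 weights
q, scale, zp = quantize_per_tensor(w)
w_approx = dequantize(q, scale, zp)
print((w - w_approx).abs().max())         # the rounding error introduced by INT8
```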
Knowledge Distillation Loss
The core loss blends two signals:
- Hard loss: Match real labels ($y$).
- Soft loss: Mimic the teacher’s probability distribution ($t$).
- When $\alpha = 1$: pure supervised training.
- When $\alpha = 0$: pure imitation learning.
- In practice: $\alpha$ is around $0.3$–$0.5$ for balanced learning.
A temperature ($T$) parameter is sometimes used to soften the teacher’s logits:
$$ P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$
Higher $T$ → smoother distributions → more informative gradients.
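A quick numerical illustration of temperature softening, using arbitrary toy logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.5])         # a fairly confident teacher prediction

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)      # P_i = exp(z_i / T) / sum_j exp(z_j / T)
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])
# Higher T spreads probability mass across the classes, so the student also
# sees the teacher's relative preferences among the "wrong" answers.
```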
🧠 Step 4: Types of Quantization
🧱 Post-Training Quantization (PTQ)
- Apply quantization after full-precision training.
- Fast and simple — no retraining needed.
- Downside: Accuracy may drop due to rounding errors; Quantization-Aware Training (QAT) counters this by simulating quantization during training so the model learns to compensate.
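To make that downside concrete, here is a small sketch using PyTorch's dynamic PTQ helper (`torch.quantization.quantize_dynamic`) on an untrained toy model, measuring the numerical drift that quantization introduces in the outputs.

```python
import torch
import torch.nn as nn

# An (untrained) toy model standing in for a fully trained FP32 network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# PTQ: convert the weights after training, with no extra training steps.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 128)
with torch.no_grad():
    drift = (model(x) - quantized(x)).abs().max()
print(f"max output drift after INT8 PTQ: {drift.item():.6f}")  # small, but not zero
```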
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Reduces model size by 4×–16×.
- Speeds up inference significantly.
- Enables deployment on edge or consumer devices.
- Distillation preserves much of the teacher’s accuracy even in far smaller students.
⚠️ Limitations
- Quantization can introduce numerical instability.
- Not all hardware supports low-precision arithmetic efficiently.
- Distillation requires high-quality teacher outputs — garbage in, garbage out.
⚖️ Trade-offs
- Quantization: Great for latency, small hit to accuracy.
- Distillation: Great for preserving accuracy, at the cost of an extra training run and dependence on a good teacher.
- Together, they form a balanced compression strategy — smaller, faster, still smart.
🚧 Step 6: Common Misunderstandings
- “Quantization just divides numbers by 4.” ❌ It involves scaling, zero-points, and rounding carefully to preserve meaning.
- “Distillation only copies outputs.” ❌ It captures deeper knowledge — the teacher’s confidence distribution.
- “Post-training quantization is always fine.” ❌ Sometimes accuracy crashes without Quantization-Aware Training (QAT).
🧩 Step 7: Mini Summary
🧠 What You Learned: Quantization compresses models by reducing numeric precision, while Distillation transfers knowledge from a large teacher to a smaller student.
⚙️ How It Works: Quantization reduces compute and memory needs; Distillation uses the teacher’s probability distribution to guide efficient learning.
🎯 Why It Matters: Together, they make LLMs deployable on limited hardware without giving up much performance — the key to scaling AI responsibly and affordably.