2.6. Reinforcement Learning from Human Feedback (RLHF)


🪄 Step 1: Intuition & Motivation

  • Core Idea: Even after pretraining and fine-tuning, language models may produce technically correct but socially tone-deaf responses. They don’t truly “understand” what humans want.

Enter Reinforcement Learning from Human Feedback (RLHF) — a technique that teaches models not just to generate plausible text, but to behave helpfully, honestly, and harmlessly according to human judgment.

  • Simple Analogy: Imagine training a puppy (the model). You first show it the right tricks (supervised fine-tuning), then reward it for the tricks you like best (human feedback), and finally teach it to maximize good behavior over time (reinforcement learning).

That’s RLHF — a structured way of aligning machines with human intent.


🌱 Step 2: Core Concept

Stage 1: Supervised Fine-Tuning (SFT)

This is the starting point. We train the model using instruction-response pairs written or approved by humans.

Example:

| Prompt | Ideal Response |
| --- | --- |
| “Explain photosynthesis.” | “Photosynthesis is the process by which plants convert light into energy.” |

The resulting model, called the baseline policy, knows how to produce coherent answers but not necessarily which answers humans prefer.
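
To make this concrete, here is a minimal sketch of a single SFT step, assuming PyTorch and the Hugging Face transformers library, with "gpt2" as a stand-in checkpoint; the learning rate and the simple prompt-plus-response formatting are illustrative choices, not a prescribed recipe.

```python
# Minimal SFT sketch (assumes PyTorch + Hugging Face transformers; "gpt2" is a stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One instruction-response pair, like the photosynthesis example above.
prompt = "Explain photosynthesis."
response = "Photosynthesis is the process by which plants convert light into energy."
text = prompt + "\n" + response + tokenizer.eos_token

batch = tokenizer(text, return_tensors="pt")

# Standard causal-LM (cross-entropy) loss over the prompt + response sequence.
# Full pipelines usually mask the prompt tokens out of the loss; kept here for brevity.
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss

# One gradient step toward imitating the human-written response.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss.backward()
optimizer.step()
```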


Stage 2: Reward Modeling (RM)

Humans are asked to rank multiple model responses to the same prompt from best to worst.

Example:

| Prompt | Response A | Response B | Human Choice |
| --- | --- | --- | --- |
| “Explain quantum computing.” | “It’s magic with particles.” | “It uses quantum bits to perform complex computations.” | B > A |

These rankings train a reward model (RM) — a smaller network that learns to predict human preference.

Essentially, RM approximates:

$$ R(x, y) = \text{HumanPreferenceScore} $$

It transforms subjective judgment (“I like this answer better”) into a numeric reward the model can optimize for.
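
One common way to turn those rankings into a trainable loss is a pairwise (Bradley-Terry-style) objective that pushes the score of the preferred response above the rejected one. Below is a minimal sketch, assuming PyTorch; the tiny RewardModel head and the random feature tensors are placeholders for a real encoder over (prompt, response) pairs.

```python
# Pairwise reward-model training sketch (PyTorch). The "features" stand in for
# encoded (prompt, response) representations produced by a real backbone network.
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden_dim, 1)   # maps features -> scalar reward

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)       # shape: (batch,)

rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-4)

# Toy batch: features for the human-preferred ("chosen") and dispreferred ("rejected") responses.
chosen_feats = torch.randn(4, 768)
rejected_feats = torch.randn(4, 768)

# Pairwise loss: the chosen response should receive a higher reward than the rejected one.
loss = -F.logsigmoid(rm(chosen_feats) - rm(rejected_feats)).mean()
loss.backward()
optimizer.step()
```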


Stage 3: Reinforcement Learning (Policy Optimization)

Now, the model becomes an agent trying to maximize reward — i.e., produce outputs humans prefer.

We apply Proximal Policy Optimization (PPO), a reinforcement learning algorithm that:

  • Generates new responses (actions).
  • Evaluates them using the reward model (feedback).
  • Updates model weights — but with small, safe steps to avoid “forgetting” what it already knows.

Mathematically, PPO updates the policy $\pi_\theta$ to maximize:

$$ L(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right] $$

where $r_t(\theta)$ is the ratio of new-policy to old-policy probabilities and $A_t$ is the advantage estimate (how much better a given output was than expected). The clip keeps each update within a narrow band, preventing large jumps that destabilize training.
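
As a toy sketch, the clipped objective above can be computed in a few lines; this assumes PyTorch and made-up per-token log-probabilities and advantages, and shows only the surrogate objective, not a full PPO training loop.

```python
# Clipped PPO surrogate objective, mirroring L(theta) above (toy values, PyTorch).
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages                               # r_t * A_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
    return torch.min(unclipped, clipped).mean()                  # E_t[min(...)]

# Toy per-token log-probabilities and advantage estimates.
logp_old = torch.tensor([-1.2, -0.8, -2.0])
logp_new = torch.tensor([-1.0, -0.9, -1.5], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 1.2])

objective = ppo_clip_objective(logp_new, logp_old, advantages)
(-objective).backward()   # maximize the objective by minimizing its negative
```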


Why PPO Instead of Vanilla Policy Gradient?

Vanilla policy gradient methods can make aggressive, unstable updates that lead to wild behavior — like a student changing their entire personality after one test.

PPO fixes this by:

  • Constraining each policy update with a trust region (clipping).
  • Keeping the new model close to the baseline (using KL-divergence penalties).

Together, these ensure smooth, stable learning without catastrophic drift.

Why It Works: Dynamic Reward Alignment

Traditional fine-tuning uses a fixed loss function (like cross-entropy). But human preference isn’t fixed — it’s dynamic, subjective, and contextual.

RLHF replaces that static loss with a reward function learned from humans, allowing the model to adapt its behavior toward human values over time.

This is why RLHF outputs often feel more natural, polite, and nuanced.


📐 Step 3: Mathematical Foundation

The RLHF Training Loop
  1. Collect human feedback: Get rankings $r_i$ for model responses.

  2. Train Reward Model (RM): Fit a function $R_\phi(x, y)$ that predicts those rankings.

  3. Optimize Policy: Use PPO to maximize expected reward:

    $$ \max_\theta \, \mathbb{E}_{x, y \sim \pi_\theta}[R_\phi(x, y)] - \beta \, D_{KL}(\pi_\theta \,\|\, \pi_{SFT}) $$

    The KL term penalizes deviation from the baseline model — balancing alignment with stability.

The KL term acts like a leash — the model can explore new, preferred behaviors but not run too far from its pretrained base.
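
To tie the loop together, here is a toy sketch of the quantity being maximized in step 3, assuming PyTorch; the per-token log-ratio is a common approximation of the sequence-level KL term, and $\beta = 0.1$ is purely illustrative.

```python
# Toy RLHF objective for one sampled response: learned reward minus a KL-style
# penalty that tethers the policy to the SFT baseline.
import torch

def rlhf_objective(reward, logp_policy, logp_sft, beta=0.1):
    # Summed per-token log-ratio approximates D_KL(pi_theta || pi_SFT) for this response.
    kl_estimate = (logp_policy - logp_sft).sum()
    return reward - beta * kl_estimate

# Toy values for a single 4-token response.
reward = torch.tensor(1.7)                              # R_phi(x, y) from the reward model
logp_policy = torch.tensor([-1.0, -0.5, -2.1, -0.9])    # log-probs under pi_theta
logp_sft = torch.tensor([-1.1, -0.6, -1.8, -1.0])       # log-probs under pi_SFT

print(rlhf_objective(reward, logp_policy, logp_sft))    # higher is better for the policy
```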

🧠 Step 4: Key Concepts & Challenges

Reward Hacking

Sometimes, the model learns to “game” the reward model instead of improving its behavior. Example: using overly agreeable or verbose answers that humans rank higher — even when factually wrong.

Fixes include:

  • Regular retraining of reward models.
  • Incorporating factuality checks.
  • Penalizing overly long or repetitive responses (see the sketch below).
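
As a concrete illustration of the last fix, a hypothetical length penalty can be folded into the reward before the policy ever sees it; the target length and penalty coefficient below are arbitrary illustrative values.

```python
# Hypothetical length-penalized reward to discourage verbose, reward-hacked answers.
def shaped_reward(base_reward: float, num_tokens: int,
                  target_len: int = 200, alpha: float = 0.005) -> float:
    overflow = max(0, num_tokens - target_len)   # tokens beyond the target length
    return base_reward - alpha * overflow        # small cost per extra token

print(shaped_reward(base_reward=1.7, num_tokens=350))   # 1.7 - 0.005 * 150 = 0.95
```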

KL Penalty — Keeping the Model Tethered

The KL term discourages the new policy from drifting too far from the supervised baseline. Without it, the model might produce strange or unsafe outputs (catastrophic drift).

Helpful–Honest–Harmless (HHH) Objective

Many modern alignment systems use the “HHH” framework:

  • Helpful: Provides relevant, accurate, and complete answers.
  • Honest: Doesn’t fabricate information or mislead.
  • Harmless: Avoids offensive, biased, or unsafe content.

RLHF systems balance these three aspects during optimization to create behaviorally safe AI.


⚖️ Step 5: Strengths, Limitations & Trade-offs

✅ Strengths

  • Aligns models with human intent and social norms.
  • Produces polite, safe, and coherent responses.
  • Encourages generalization across conversational contexts.

⚠️ Limitations

  • Requires expensive human feedback collection.
  • Reward models can encode human biases.
  • Susceptible to “reward hacking” and preference drift.

⚖️ Trade-offs

  • More alignment ≈ more constraints — can reduce creativity or diversity.
  • Too strong a KL penalty → under-adaptation. Too weak → instability.
  • Striking the right balance is crucial for scalable, safe AI systems.

🚧 Step 6: Common Misunderstandings

  • “RLHF means the model understands morality.” ❌ It only optimizes for human-like preferences, not moral reasoning.
  • “Reward models are perfect.” ❌ They’re noisy proxies of human judgment, not absolute truth.
  • “More feedback = better alignment.” ❌ Quality and diversity of feedback matter far more than volume.

🧩 Step 7: Mini Summary

🧠 What You Learned: RLHF fine-tunes models using human feedback to align their behavior with human expectations.

⚙️ How It Works: A supervised model becomes a reinforcement learner, maximizing a learned reward function (human preference) using PPO.

🎯 Why It Matters: RLHF is what makes modern LLMs feel human-friendly, balancing intelligence with empathy, truthfulness, and safety.
