1.2. Tokenization — Turning Language into Numbers
🪄 Step 1: Intuition & Motivation
Core Idea: Before a machine can “understand” text, it must first convert words into something it can work with — numbers. But unlike humans, machines don’t see “words” as meaningful chunks — they only see sequences of numbers. Tokenization is the magical bridge between human language and machine-readable format.
Simple Analogy: Imagine feeding a novel to a calculator. The calculator stares blankly — it doesn’t speak English. Tokenization is like translating the entire novel into Lego blocks. Each block (token) has a numeric label. The model then learns how to assemble these blocks into meaningful sentences later.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When you input text like:
“I love Transformers!”
the tokenizer breaks it into smaller, meaningful chunks — tokens. Depending on the tokenizer, these chunks can be:
- Whole words: “I”, “love”, “Transformers”
- Subwords: “Trans”, “former”, “s”
- Characters: “T”, “r”, “a”, “n”, …
Each token is then assigned an integer ID — for instance,
"I" → 27, "love" → 1032, "Trans" → 8814.
These IDs become the model’s vocabulary. During training, the model learns how these token sequences behave statistically, like “love” often follows “I.”
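For concreteness, here is a minimal sketch using the Hugging Face transformers library with the pretrained GPT-2 tokenizer (an assumption for illustration only; any subword tokenizer would do, and the real IDs differ from the illustrative numbers above):

```python
# pip install transformers
from transformers import AutoTokenizer

# Pretrained byte-level BPE tokenizer (GPT-2's), chosen purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I love Transformers!"
tokens = tokenizer.tokenize(text)   # human-readable subword strings
ids = tokenizer.encode(text)        # the integer IDs the model actually consumes

print(tokens)  # subword pieces; 'Ġ' marks a token that begins with a space
print(ids)     # the corresponding vocabulary indices
```

The exact pieces and IDs depend entirely on the tokenizer's learned vocabulary.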
Why It Works This Way
Pure word-level tokenization creates problems:
- Languages have millions of unique words, including slang, typos, and inflections (“run”, “running”, “ran”).
- That would make the vocabulary enormous and the model impractically large.
To fix this, modern LLMs use subword tokenization — breaking words into smaller, reusable units that strike a balance between too fine-grained (characters) and too coarse (whole words). This ensures the model can handle unseen or rare words by composing them from known parts.
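A quick way to see this composition at work is to tokenize a few inflections and an invented word with the same illustrative GPT-2 tokenizer as above; none of them break the tokenizer, they simply split into more pieces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # same illustrative tokenizer as before

# Inflections and even an invented word still tokenize: unseen surface forms
# are composed from smaller pieces the vocabulary already contains.
for word in ["run", "running", "ran", "hyperquantization"]:
    print(word, "->", tokenizer.tokenize(word))
```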
How It Fits in ML Thinking
Tokenization defines the model's input space: a fixed vocabulary of discrete units over which it learns statistical patterns. Every downstream computation operates on token IDs rather than raw text, so tokenizer quality directly shapes what the model can learn.
📐 Step 3: Mathematical Foundation
Subword Tokenization: Byte Pair Encoding (BPE)
The basic principle:
- Start with a character-level vocabulary.
- Find the most frequent pair of adjacent symbols (like 'a' and 'b' → 'ab').
- Merge them into a new token.
- Repeat until you reach the target vocabulary size.
Formally,
$$ \text{pair} = \underset{(x,y)}{\arg\max} \ \text{freq}(x, y) $$
where freq(x, y) counts how often the symbols x and y appear next to each other in the corpus.
Each merge reduces sequence length but increases vocabulary size — it’s a controlled trade-off between compactness and expressivity.
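The merge loop can be sketched in a few lines of plain Python. This is a toy trainer for intuition only (word-internal merges over a whitespace-split corpus), not a production implementation:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters (character-level vocabulary).
corpus = "low low low lower lowest newer newer wider".split()
vocab = Counter(tuple(word) for word in corpus)

num_merges = 6  # stand-in for "repeat until the target vocabulary size"
for step in range(num_merges):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # pair = argmax_(x,y) freq(x, y)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Each printed merge adds one new token to the growing vocabulary, exactly the trade-off described above.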
SentencePiece: Language-Agnostic Tokenization
Unlike plain BPE or WordPiece pipelines, SentencePiece does not require pre-segmented text: it operates directly on the raw text stream, treating whitespace as just another symbol. It typically trains a Unigram Language Model (it also supports BPE) to learn the most probable token inventory. This makes it robust across languages (English, Chinese, Japanese, etc.) because it does not depend on spaces or alphabet boundaries to mark where words begin and end.
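A rough sketch of that training workflow, assuming the sentencepiece Python package; the toy corpus, file names, and vocab_size below are invented for illustration and may need adjusting for real data:

```python
# pip install sentencepiece
import sentencepiece as spm

# Tiny illustrative corpus written to disk, since the trainer reads from a file.
corpus = [
    "I love Transformers!",
    "Tokenization turns language into numbers.",
    "Subword units balance vocabulary size and sequence length.",
    "Rare words decompose into smaller, reusable pieces.",
    "The model learns statistics over token sequences.",
]
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))

# Train a Unigram LM tokenizer; vocab_size is kept tiny for this toy corpus
# (assumption: too large a value will fail on very small inputs).
spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",
    model_prefix="toy_unigram",
    model_type="unigram",
    vocab_size=60,
)

sp = spm.SentencePieceProcessor(model_file="toy_unigram.model")
print(sp.encode("I love tokenization!", out_type=str))  # pieces; '▁' marks a word start
print(sp.encode("I love tokenization!", out_type=int))  # the corresponding IDs
```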
🧠 Step 4: Assumptions or Key Ideas
- The same text will always tokenize the same way (determinism).
- Each token is treated as a distinct, learnable unit.
- Tokenization is language- and domain-sensitive — a tokenizer trained on news text might fail for medical jargon or code.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Handles unseen words gracefully via subword composition.
- Efficient balance between vocabulary size and sequence length.
- Enables cross-lingual compatibility (especially SentencePiece).
⚠️ Limitations
- Rare words may still fragment into long token chains (increasing sequence length).
- Token boundaries may not align with semantic meaning.
- Tokenization quality can degrade on noisy or mixed-language inputs.
🚧 Step 6: Common Misunderstandings (Optional)
- “Token = Word” — ❌ Not true. Tokens can be parts of words or even punctuation.
- “Tokenization doesn’t affect performance” — ❌ It does. Poor tokenization can cause inefficient learning and high loss.
- “All tokenizers are language-specific” — ❌ SentencePiece and Byte-level BPE work across languages.
🧩 Step 7: Mini Summary
🧠 What You Learned: Tokenization transforms messy, human-readable text into structured, numerical tokens.
⚙️ How It Works: It splits text into subwords using statistical patterns (like BPE or SentencePiece).
🎯 Why It Matters: Without effective tokenization, LLMs cannot meaningfully process or generalize across text.