1.2. Tokenization — Turning Language into Numbers


🪄 Step 1: Intuition & Motivation

  • Core Idea: Before a machine can “understand” text, it must first convert words into something it can work with: numbers. Unlike humans, machines don’t see “words” as meaningful chunks; they only see sequences of numbers. Tokenization is the magical bridge between human language and a machine-readable format.

  • Simple Analogy: Imagine feeding a novel to a calculator. The calculator stares blankly — it doesn’t speak English. Tokenization is like translating the entire novel into Lego blocks. Each block (token) has a numeric label. The model then learns how to assemble these blocks into meaningful sentences later.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When you input text like:

“I love Transformers!”

the tokenizer breaks it into smaller, meaningful chunks — tokens. Depending on the tokenizer, these chunks can be:

  • Whole words: “I”, “love”, “Transformers”
  • Subwords: “Trans”, “former”, “s”
  • Characters: “T”, “r”, “a”, “n”, …

Each token is then assigned an integer ID — for instance, "I" → 27, "love" → 1032, "Trans" → 8814.

This token-to-ID mapping is the model’s vocabulary. During training, the model learns how token sequences behave statistically, for example that “love” often follows “I.”
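To make this concrete, here is a minimal sketch of the lookup step in plain Python. The vocabulary, the IDs, and the greedy longest-match splitting rule are all invented for illustration; real tokenizers learn their vocabularies (and much subtler splitting rules) from data:

```python
# Toy vocabulary: every known token gets an integer ID.
# Real LLM vocabularies contain tens of thousands of learned subwords.
vocab = {"I": 27, "love": 1032, "Trans": 8814, "former": 921, "s": 82, "!": 5}

def tokenize(text):
    """Split on whitespace, then greedily match the longest known piece."""
    tokens = []
    for word in text.replace("!", " !").split():
        while word:
            for end in range(len(word), 0, -1):
                if word[:end] in vocab:
                    tokens.append(word[:end])
                    word = word[end:]
                    break
            else:                 # no known prefix: drop one character
                word = word[1:]
    return tokens

tokens = tokenize("I love Transformers!")
print(tokens)                     # ['I', 'love', 'Trans', 'former', 's', '!']
print([vocab[t] for t in tokens]) # [27, 1032, 8814, 921, 82, 5]
```

Note how “Transformers” isn’t in the toy vocabulary, yet it still gets encoded by composing the known pieces “Trans”, “former”, and “s”.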

Why It Works This Way

Pure word-level tokenization creates problems:

  • Languages have millions of unique words, including slang, typos, and inflections (“run”, “running”, “ran”).
  • That would make the vocabulary enormous and the model impractically large.

To fix this, modern LLMs use subword tokenization — breaking words into smaller, reusable units that strike a balance between too fine-grained (characters) and too coarse (whole words). This ensures the model can handle unseen or rare words by composing them from known parts.
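You can see this composition behavior with a real tokenizer. The snippet below uses the Hugging Face transformers library and GPT-2’s byte-level BPE tokenizer purely as an example (assuming the library is installed and the tokenizer files can be downloaded); the exact pieces you get depend on that tokenizer’s learned merges:

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer: a byte-level BPE vocabulary learned from web text.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Common words tend to stay whole; rare or invented words split into subwords.
print(tokenizer.tokenize("I love Transformers!"))
print(tokenizer.tokenize("antidisestablishmentarianism"))

# The integer IDs the model actually consumes.
print(tokenizer.encode("I love Transformers!"))
```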

How It Fits in ML Thinking

Tokenization defines how language becomes data. It’s the first step in every NLP pipeline — the “front door” to the model’s understanding. If tokenization fails, even the best neural network can’t reason correctly. Think of it as choosing the alphabet for your model’s inner language.

📐 Step 3: Mathematical Foundation

Subword Tokenization: Byte Pair Encoding (BPE)

The basic principle:

  1. Start with a character-level vocabulary.
  2. Find the most frequent pair of adjacent symbols (like 'a' and 'b').
  3. Merge them into a single new token ('ab').
  4. Repeat until you reach the target vocabulary size.

Formally,

$$ \text{pair} = \underset{(x,y)}{\arg\max} \ \text{freq}(x, y) $$

where freq(x, y) counts how often the symbols x and y appear next to each other in the training text.

Each merge reduces sequence length but increases vocabulary size — it’s a controlled trade-off between compactness and expressivity.

Think of it like compressing text by merging the most common building blocks. The model learns to reuse them efficiently, just like humans re-use syllables or roots when forming new words.
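Here is that loop as a toy sketch in plain Python (the four-word corpus and its counts are made up for illustration). Each iteration applies the argmax rule above: count adjacent pairs, merge the most frequent one, repeat:

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """freq(x, y): count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for x, y in zip(word, word[1:]):
            pairs[(x, y)] += freq
    return pairs.most_common(1)[0][0]          # argmax over (x, y)

def merge_pair(corpus, pair):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):                           # a few merges, for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")
```

On this tiny corpus the first merges fuse the frequent ending pieces (“es”, then “est”) into single tokens, which is exactly how common suffixes and roots end up as reusable subwords in a real vocabulary.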

SentencePiece: Language-Agnostic Tokenization

Unlike BPE or WordPiece, SentencePiece doesn’t require pre-segmented (whitespace-split) text: it works directly on the raw input string, treating spaces as just another symbol. By default it uses a Unigram Language Model to learn the most probable token inventory (it can also be trained with BPE). This approach makes it robust across languages (English, Chinese, Japanese, etc.) because it doesn’t depend on spaces or alphabet boundaries.

SentencePiece allows the same model architecture to be used across multilingual corpora — a critical factor for modern large-scale training.
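If you want to try it, a minimal sketch with the sentencepiece Python package looks roughly like this (assumptions: the package is installed, and corpus.txt is a placeholder name for whatever plain-text training file you supply):

```python
import sentencepiece as spm

# Train a unigram-LM tokenizer directly on raw text; no pre-segmentation needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder: any plain-text file
    model_prefix="spm_demo",   # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("I love Transformers!", out_type=str))  # subword pieces
print(sp.encode("I love Transformers!", out_type=int))  # their integer IDs
```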

🧠 Step 4: Assumptions or Key Ideas

  • The same text will always tokenize the same way (determinism).
  • Each token is treated as a distinct, learnable unit.
  • Tokenization is language- and domain-sensitive — a tokenizer trained on news text might fail for medical jargon or code.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Handles unseen words gracefully via subword composition.
  • Efficient balance between vocabulary size and sequence length.
  • Enables cross-lingual compatibility (especially SentencePiece).

⚠️ Limitations

  • Rare words may still fragment into long token chains (increasing sequence length).
  • Token boundaries may not align with semantic meaning.
  • Tokenization quality can degrade on noisy or mixed-language inputs.

⚖️ Trade-offs

Smaller vocabularies keep the embedding and output layers small but produce longer token sequences for the same text; larger vocabularies shorten sequences but inflate embedding tables, increasing memory usage.
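For a rough sense of the memory side of this trade-off (the numbers are chosen purely for illustration): a 50,000-token vocabulary with 4,096-dimensional embeddings stored in 16-bit floats already costs

$$ 50{,}000 \times 4{,}096 \times 2 \ \text{bytes} \approx 0.4 \ \text{GB} $$

for the embedding table alone, and doubling the vocabulary doubles it.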

🚧 Step 6: Common Misunderstandings (Optional)

  • “Token = Word” — ❌ Not true. Tokens can be parts of words or even punctuation.
  • “Tokenization doesn’t affect performance” — ❌ It does. Poor tokenization can cause inefficient learning and high loss.
  • “All tokenizers are language-specific” — ❌ SentencePiece and Byte-level BPE work across languages.

🧩 Step 7: Mini Summary

🧠 What You Learned: Tokenization transforms messy, human-readable text into structured, numerical tokens.

⚙️ How It Works: It splits text into subwords using statistical patterns (like BPE or SentencePiece).

🎯 Why It Matters: Without effective tokenization, LLMs cannot meaningfully process or generalize across text.
