Lecture 4 - Tokenization: From Text to Tokens

Welcome back!

Last time: Neural networks and deep learning - how models learn from data

Today: Tokenization, how text becomes numbers

Why it matters: How we split text affects everything: model behavior, cost, fairness across languages.

Ice breaker

Actually this time:

What can you do better than an LLM?

Agenda for today

  1. Bridging from last time: why tokenization matters
  2. Historical approaches: stemming and lemmatization
  3. Modern subword tokenization: BPE and WordPiece
  4. Hands-on: How ChatGPT sees text
  5. Tokenization and fairness
  6. Preview: Word embeddings (next week)

Part 1: Why Tokenization Matters

Remember the NLP pipeline

From Lecture 2:

1. Tokenization - Split text into pieces

2. Representation - Convert to numbers

3. Learning - Train a model

Today: Deep dive into step 1, because it affects everything else!

Why tokenization is foundational

Your tokenization choice determines:

What the model can "see"

Your vocabulary size (memory and speed)

How you handle new/rare words

Whether your model works across languages

The vocabulary explosion problem

English has:

  • ~170,000 words in current use
  • Countless proper nouns (names, places, brands)
  • New words constantly ("COVID-19", "ChatGPT", "6-7")
  • Typos and variations ("looooove", "alot", "independant")

If every unique word gets its own token:

  • Massive vocabulary
  • Rare words poorly represented
  • Can't handle new words
  • ~100,000+ possible output "labels"

Turn to your neighbor:

These words are clearly related, but to a computer they're completely different:

run, runs, running, ran, runner
happy, happier, happiest, happily, happiness
go, going, went, gone

How might we help a computer see the connection?

Part 2: Historical Approaches

Stemming: The crude solution

Idea: Chop off word endings to find the "stem"

Examples:

running  -> run
runs     -> run
runner   -> run
easily   -> easili
happiness -> happi
studies  -> studi

Problem 1: Creates nonsense stems ("easili" and "happi" aren't words)

Problem 2: Different words collide to the same stem:

  • "universal", "university", "universe" -> all become "univers"
  • "policy", "police" -> both become "polic"
  • "arm", "army" -> both become "arm"
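A crude stemmer really is just a few lines. Here's a toy sketch in Python - the suffix list and the doubled-consonant cleanup are made up for illustration, not the real Porter rule set:

```python
# A crude suffix-stripping stemmer, in the spirit of (but much simpler than)
# Porter stemming. The suffix list is illustrative only.
SUFFIXES = ["iness", "ness", "ing", "est", "ies", "ily", "er", "ed", "ly", "s"]

def crude_stem(word: str) -> str:
    """Strip the first matching suffix; no linguistic knowledge involved."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Cleanup: collapse a doubled final consonant ("runn" -> "run")
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print(crude_stem("running"))   # run
print(crude_stem("studies"))   # stud
```

Note that even this tiny version shows both failure modes: "stud" isn't a word, and unrelated words can collide onto the same stem.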

Lemmatization: The smarter solution

Idea: Use linguistic knowledge to find the dictionary form (lemma)

Examples:

running  -> run
ran      -> run
better   -> good
is       -> be
mice     -> mouse

Better! Uses dictionaries and morphological rules to find true word forms.

But: Slow, language-specific, still treats lemmas as atomic.

Why stemming and lemmatization aren't enough

Still one token per word (vocabulary explosion continues)

Language-specific (need new rules/dictionaries for each language)

Can't handle new words (not in the dictionary)

Loses information ("running" vs "ran" have different tenses!)

Part 3: Modern Subword Tokenization

Let's guess and check

Quick pair-share:

How would you split this sentence into pieces for a computer to process?

How many "words"/tokens do you think ChatGPT sees?

"I can't believe ChatGPT doesn't understand state-of-the-art LLM-training techniques like gobbledigook! 🤯"

The trick - Don't tokenize at word boundaries

Instead: Learn a vocabulary of subword units that can be combined

"unhappiness" -> ["un", "happiness"]
"ChatGPT"     -> ["Chat", "GPT"]
"supercal..." -> ["super", "cal", "if", "rag", "il", "ist", "ic"]

Benefits:

  • Fixed vocabulary size (50k subwords vs 170k+ words)
  • New words break into known pieces
  • Shared meaning ("un" = negation across many words)
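One way to see how subword pieces combine: greedy longest-prefix matching over a vocabulary (the strategy WordPiece uses at encoding time). The vocabulary below is made up for illustration:

```python
# Toy subword vocabulary (made up - real vocabularies have ~30k-50k entries)
VOCAB = {"un", "happiness", "run", "ning", "chat", "g", "p", "t"}

def greedy_tokenize(word: str) -> list[str]:
    """Repeatedly take the longest vocabulary entry that prefixes the rest."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])           # fall back to a single character
            i += 1
    return tokens

print(greedy_tokenize("unhappiness"))   # ['un', 'happiness']
print(greedy_tokenize("running"))       # ['run', 'ning']
```

Even a word the vocabulary has never seen whole gets covered by known pieces.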

Byte-Pair Encoding (BPE)

The dominant approach for modern LLMs

High-level idea:

  1. Start with character-level vocabulary
  2. Find the most frequent pair of adjacent tokens
  3. Merge them into a new token
  4. Repeat until vocabulary reaches target size

Result: Common words become single tokens, rare words split into pieces
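The four steps above fit in a few lines of Python. This is a toy trainer (real BPE implementations also mark word boundaries and weight words by frequency):

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (word-internal merges only)."""
    corpus = [list(w) for w in words]          # start at the character level
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair wins
        merges.append(best)
        # Apply the new merge everywhere
        new_corpus = []
        for seq in corpus:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges

text = "I like to run in my running shoes when I'm running late"
rules = bpe_train(text.lower().split(), num_merges=5)
print(rules)   # the first merges build 'r'+'u' -> 'ru', then 'ru'+'n' -> 'run'
```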

BPE example (board work)

Let's build a toy BPE vocabulary together on the board

Training text: "I like to run in my running shoes when I'm running late"

We'll merge the most frequent pairs step by step and watch how "run" emerges as a token!

BPE: Training vs. Encoding

Training (learning the vocabulary):

  • Scan corpus, count all adjacent token pairs
  • Greedily merge the most frequent pair to get a new token
  • Repeat until vocabulary reaches target size (e.g., 50k tokens)
  • Save the ordered list of merge rules

Encoding (tokenizing new text):

  • Apply the learned merge rules in priority order (order they were learned)
  • Don't re-count frequencies, just apply the rules deterministically
  • Same text always produces same tokens

Training: greedy, data-driven. Encoding: deterministic, fast.
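Encoding is just replaying the merge list in order. A sketch (the toy rules below are assumed for illustration, not learned from a real corpus):

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize by replaying learned merge rules in priority order."""
    seq = list(word)
    for a, b in merges:                       # earlier rules = higher priority
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Toy merge rules, as if learned during training
merges = [("r", "u"), ("ru", "n"), ("i", "n"), ("in", "g"), ("n", "ing")]
print(bpe_encode("running", merges))   # ['run', 'ning']
```

No counting happens here - the same word always comes out the same way.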

BPE: Preventing cross-word merges

Problem: Without boundaries, BPE might merge characters across word boundaries.

"faster lower" split naively: f a s t e r l o w e r

The pair r + l could merge across the two words!

Solution 1: End-of-word marker (original BPE, Sennrich et al. 2016)

Each word gets a </w> suffix before merging:

"faster" -> f a s t e r </w>
"lower"  -> l o w e r </w>

Merges like er</w> stay within each word. The boundary is never crossed.

Solution 2: Space prefix (GPT-2 and all GPT descendants)

Mark word starts with the preceding space:

"faster lower" -> ["faster", "Ġlower"]    (Ġ = space)

This is why "hello" and " hello" tokenize differently in the demo - the space is part of the next token, not the previous one.
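The space-prefix idea can be approximated with a one-line regex. This is a simplification - GPT-2's actual pre-tokenizer regex also special-cases contractions, digits, and punctuation:

```python
import re

# Simplified GPT-2-style pre-tokenization: each word keeps its leading space,
# so " hello" and "hello" become different pre-tokens before BPE runs.
def pre_tokenize(text: str) -> list[str]:
    return re.findall(r" ?\S+", text)

print(pre_tokenize("faster lower"))   # ['faster', ' lower']
```

In the printed vocabulary, that leading space is displayed as Ġ.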

BPE in practice

For real LLMs:

  • Train on billions of words
  • Create vocabulary of ~30k-50k subword tokens
  • Common words: one token ("the", "and", "ChatGPT")
  • Rare words: multiple tokens ("supercalifragilisticexpialidocious")

Tokenizer Variants (just FYI!)

Algorithm | Used By                    | Key Idea
BPE       | GPT-2/3/4/5, LLaMA, Claude | Greedy: merge most frequent pairs
WordPiece | BERT, DistilBERT           | Merge pairs that maximize likelihood ratio
Unigram   | T5, ALBERT, XLNet          | Start big, prune tokens that hurt least

WordPiece: Like BPE, but instead of merging by raw frequency, it scores each candidate pair by a likelihood ratio - roughly count(AB) / (count(A) × count(B)). It prefers merges where the combined token is more likely than you'd expect from the parts appearing independently.

Unigram: Opposite direction from BPE:

  1. Start with a large vocabulary (all common substrings)
  2. Compute how much each token contributes to likelihood
  3. Remove the least useful tokens until target vocabulary size

Why subword tokenization works

  • Balances vocab size and granularity
  • Shares info across related words
  • Handles new/rare words gracefully
  • Data-driven - no linguistic rules needed
  • Works across languages

This is why all modern LLMs use subword tokenization!

Special Tokens

Beyond regular text, LLMs use special tokens for control and structure:

End of text: <|endoftext|> - tells the model a document is complete

  • Important to think about when using structured output (e.g., generating JSON or other formats)

Beginning of text: <|startoftext|> - marks the start

Padding: <pad> - fills in when batching sequences of different lengths

Unknown: <unk> - rare fallback for truly unknown input (less common with BPE)

Chat-specific: <|user|>, <|assistant|>, <|system|> - structure conversations

Example chat template (simplified):

<|system|>You are a helpful assistant.<|endoftext|>
<|user|>What's the capital of France?<|endoftext|>
<|assistant|>Paris is the capital of France.<|endoftext|>

This is why "system prompts" work. They go in a special place the model treats as instructions.

Understanding special tokens helps you understand prompt injection - malicious input can insert fake tokens like <|system|> to override instructions. More on this later.
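A minimal sketch of how such a template might be assembled in code, using the simplified token strings from this slide (real models each define their own template format):

```python
# Render a list of chat messages into a single prompt string using the
# simplified <|role|> ... <|endoftext|> template above (illustrative only).
def render_chat(messages: list[dict]) -> str:
    return "".join(
        f"<|{m['role']}|>{m['content']}<|endoftext|>\n" for m in messages
    )

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
])
print(prompt)
```

The model never "sees" roles as anything special - structure comes entirely from these token conventions, which is exactly what injection attacks try to exploit.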

Part 4: Tokenization in Practice

Live demo: OpenAI tokenizer

Let's see how GPT actually tokenizes text

Go to: platform.openai.com/tokenizer

Try these examples and discuss:

  1. "running" vs "run"
  2. "ChatGPT"
  3. "supercalifragilisticexpialidocious"
  4. " hello" vs "hello"
  5. Code: "def main():"
  6. Math: "2+2=4"
  7. "🙂😀"
  8. "strawberry"

Why LLMs struggle with certain tasks

Question: Why do LLMs struggle to count letters in words or reverse words?

Turn to your neighbor and discuss

Why LLMs struggle with certain tasks

Answer: They don't see individual letters - common words are single tokens!

Example: "strawberry" = ["str", "awberry"]

The model can't count the "r"s - it doesn't see individual letters!

This is why prompting tricks sometimes work:

  • "Spell it out letter by letter first"
  • "Break the word into characters"

These force the model to generate character-level tokens

Fun fact: OpenAI's o1 was code-named "Strawberry"

Tokenization archaeology: "SolidGoldMagikarp"

Story: In 2023, researchers discovered "glitch tokens" - tokens that made ChatGPT behave bizarrely.

One example: the token "SolidGoldMagikarp" (a Reddit username). When asked to repeat it, ChatGPT would:

  • Claim it couldn't see the word
  • Refuse to say it
  • Output completely unrelated text
  • Behave erratically

What happened? The tokenizer's training data contained this Reddit username often enough to make it a single token. But the model itself rarely saw it during training - a mismatch between the tokenizer's data and the model's data.

Quick skim now, but great reading for later!

Tokenizers are frozen

Once a model is trained, its tokenizer is fixed. You can't easily change it.

  • In 2020, models tokenized "COVID-19" as roughly ["CO", "VID", "-", "19"].
  • Newer models trained after 2020 may have "COVID" as a single token.

Why newer models handle recent terms better: not just more data, updated tokenizers too.

Other tokenization effects

Arithmetic: Numbers tokenize inconsistently - sometimes digit-by-digit, sometimes as chunks

Code: Variable names split unpredictably

Rhymes: "cat" and "bat" might not share an "at" token

Tokenization shapes what LLMs find easy vs hard

Tokenizing Code vs Natural Language

Code and prose tokenize very differently:

Natural language: Words mostly stay intact

  • "The quick brown fox" = 5 tokens

Code: Variable names split unpredictably

  • print = 1 token (very common)
  • getUserDataFromDB = 5 tokens ["get", "User", "Data", "From", "DB"]
  • mySpecialFunction = 3 tokens ["my", "Special", "Function"]

Why this matters:

  • Longer sequences are harder for the model to understand
  • Uses up context faster

Rule of thumb: Assume ~ 10 tokens per line of code when you're asking AI to parse code files

Token Vocabularies Across Models

Different models make different tokenization choices:

Model  | Vocab Size | Notes
GPT-2  | ~50k       | Older, smaller vocabulary
GPT-4  | ~100k      | Larger, better multilingual
Claude | ~100k      | Similar to GPT-4
LLaMA  | ~32k       | Smaller but efficient
BERT   | ~30k       | WordPiece, not BPE

A prompt optimized for one model may be inefficient for another.

Why this matters for prompt engineering:

  • Context window limits (e.g., 128k tokens) are in TOKENS, not words
  • Few-shot examples eat into your token budget
  • Verbose prompts = fewer tokens for the actual task
  • Non-English prompts use more of your context window

Mental model: How big is a token?

Rules of thumb for English:

  • ~4 characters per token (on average)
  • ~0.75 words per token (or ~1.3 tokens per word)
  • A typical page of text ≈ 500-700 tokens
  • A typical email ≈ 200-400 tokens
  • 128K token context ≈ a 250-page book
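These rules of thumb make a serviceable back-of-envelope estimator - a heuristic only; the real count requires running the actual tokenizer:

```python
# Rough token estimate for English text, using the ~4 characters/token
# rule of thumb from this slide. Not a tokenizer - just arithmetic.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

page = "word " * 600                  # roughly a page of text
print(estimate_tokens(page))          # 750
```

Good enough for budgeting prompts; off by a lot for code, non-English text, or anything token-dense.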

The cost of tokens

Typical API pricing (as of early 2026):

Model         | Input               | Output
GPT-4         | ~$2.50 / 1M tokens  | ~$10 / 1M tokens
Claude Sonnet | ~$3 / 1M tokens     | ~$15 / 1M tokens
GPT-4o-mini   | ~$0.15 / 1M tokens  | ~$0.60 / 1M tokens

Quick cost estimates (Claude Sonnet at ~$3/1M input tokens):

  • 1 email (~300 tokens): ~$0.001
  • A novel (~100K tokens): ~$0.30

Tokens are cheap individually. Volume is where costs add up.
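The cost arithmetic is simple enough to sketch. The prices below are the ballpark figures from the table above, not current list prices:

```python
# Rough API cost estimator. Prices are illustrative (dollars per 1M tokens).
PRICES = {  # model: (input $/1M, output $/1M)
    "claude-sonnet": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Re-reading a 10k-token conversation and generating a 500-token reply:
print(f"${estimate_cost('claude-sonnet', 10_000, 500):.4f}")   # $0.0375
```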

How it adds up

  • Every time you send a message, the LLM RE-READS YOUR WHOLE CONVERSATION HISTORY as context
  • If you're doing development work with lots of code, each message could easily be 10k+ tokens (~$0.20)
  • If you set up a chatbot for many users / use LLMs to send spam emails...

Minification: Squeezing more into your context

You can strip characters to reduce token count before sending to an LLM.

Strategies:

Content Type | Technique
Code         | Remove comments, collapse whitespace
JSON         | Strip whitespace, shorten keys
Markdown     | Remove extra newlines, simplify formatting
Logs         | Deduplicate, truncate timestamps

Pros:

  • Fit more in context window
  • Reduce API costs

Cons:

  • Harder for the model to "read" - formatting aids comprehension
  • Harder for humans to read and debug without whitespace
  • Diminishing returns (saving 10% rarely matters)
  • Risk of removing important context

Rule of thumb: Minify data/logs aggressively. Keep code and instructions readable.
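For JSON, the standard library does the minification for you via the separators argument to json.dumps:

```python
import json

# JSON minification with the standard library: separators=(",", ":")
# drops all the whitespace between tokens.
data = {"user": {"name": "Ada", "roles": ["admin", "editor"]}}

pretty = json.dumps(data, indent=2)
minified = json.dumps(data, separators=(",", ":"))

print(len(pretty), len(minified))   # fewer characters -> fewer tokens
print(minified)
```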

Activity: Tokenization Scavenger Hunt

Select a tokenizer (or compare them):

Find examples of each:

  1. A real English word that splits into 4+ tokens
  2. What's the longest English word you can find that is just one token?
  3. Find a 4-digit number that's ONE token, and another 4-digit number that's TWO tokens. What's the pattern?
  4. Find a word where changing the capitalization changes the number of tokens
  5. Find a string where GPT's and Claude's tokenizers produce different numbers of tokens.
  6. COVID-19 was 4 tokens in GPT-3 but is now 3 tokens. Can you find other examples of token count changing over time?
  7. Translate "Hello, how are you today?" into at least 3 languages. Which language uses the MOST tokens?
  8. Find a non-English word that's a single token.
  9. If your name isn't common in English, how many tokens is it? Compare to a common English name.

Part 5: Tokenization and Fairness

Not all languages are created equal

BPE vocabularies are learned from training data.

If training data is mostly English:

  • English words - efficient (one token per word)
  • Other languages - split aggressively

This has real consequences

Token efficiency across languages

Same meaning, different token counts:

"Hello, how are you?" (English): 6 tokens

"你好,你好吗?" (Chinese): 11 tokens

Nǐ hǎo ma (Chinese, pinyin): 7 tokens

"مرحبا، كيف حالك؟" (Arabic): 14 tokens

Same semantic content, different token counts!

Why this matters

Cost: Many APIs charge per token

Context limits: Same token limit = fewer words in Chinese than English

Performance: More tokens = longer sequences = harder to learn

Fairness: English speakers get a better deal

Discussion: Is this a problem?

Turn to your neighbor:

  1. Is token inefficiency for non-English languages a fairness issue?
  2. Whose responsibility is it to address this?
  3. What could be done about it?

Some possible approaches:

Train on more balanced multilingual data

Language-specific tokenizers (but lose cross-lingual transfer)

Character-level models (no bias, but less efficient)

Larger vocabularies - more slots for non-Latin characters (GPT-4o went from 100k to 200k vocabulary, improving Chinese efficiency ~3x)

Adjust pricing by language (some APIs do this)

Part 6: Looking Ahead

What we've learned today

Tokenization is foundational - it determines what models can "see"

Historical approaches: stemming and lemmatization (word-level, limited)

Modern approach: subword tokenization (BPE, WordPiece)

Tokenization affects LLM behavior (letter counting, arithmetic, etc.)

Tokenization has fairness implications (language efficiency, cost)

Connecting the dots

Lecture 2: AI development + Classical NLP

Lecture 3 (Monday): Deep learning foundations

Lecture 4 (today): Tokenization

Lab/Reflection Due Friday (Feb 6)

  • Explore tokenization and/or neural network basics.

Monday: Sequence-to-sequence models and word embeddings

After that: Attention!