Lecture 4 - Tokenization: From Text to Tokens

Welcome back!

Last time: Neural networks and deep learning - how models learn from data

Today: Tokenization, how text becomes numbers

Why it matters: How we split text affects everything: model behavior, cost, fairness across languages.

Ice breaker

Actually this time:

What can you do better than an LLM?

Agenda for today

  1. Bridging from last time: why tokenization matters
  2. Historical approaches: stemming and lemmatization
  3. Modern subword tokenization: BPE and WordPiece
  4. Hands-on: How ChatGPT sees text
  5. Tokenization and fairness
  6. Preview: Word embeddings (next week)

Part 1: Why Tokenization Matters

Remember the NLP pipeline

From Lecture 2:

1. Tokenization - Split text into pieces

2. Representation - Convert to numbers

3. Learning - Train a model

Today: Deep dive into step 1, because it affects everything else!

Why tokenization is foundational

Your tokenization choice determines:

What the model can "see"

Your vocabulary size (memory and speed)

How you handle new/rare words

Whether your model works across languages

The vocabulary explosion problem

English has:

  • ~170,000 words in current use
  • Countless proper nouns (names, places, brands)
  • New words constantly ("COVID-19", "ChatGPT", "6-7")
  • Typos and variations ("looooove", "alot", "independant")

If every unique word gets its own token:

  • Massive vocabulary
  • Rare words poorly represented
  • Can't handle new words
  • ~100,000+ possible output "labels"

Turn to your neighbor:

These words are clearly related, but to a computer they're completely different:

run, runs, running, ran, runner
happy, happier, happiest, happily, happiness
go, going, went, gone

How might we help a computer see the connection?

Part 2: Historical Approaches

Stemming: The crude solution

Idea: Chop off word endings to find the "stem"

Examples:

running  -> run
runs     -> run
runner   -> run
easily   -> easili
happiness -> happi
studies  -> studi

Problem 1: Creates nonsense stems ("easili" and "happi" aren't words)

Problem 2: Different words collide to the same stem:

  • "universal", "university", "universe" -> all become "univers"
  • "policy", "police" -> both become "polic"
  • "arm", "army" -> both become "arm"
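A crude stemmer really is just a few lines. Here's a toy sketch in Python - the suffix list and the doubled-consonant cleanup are made up for illustration, not the real Porter rule set:

```python
# A crude suffix-stripping stemmer, in the spirit of (but much simpler than)
# Porter stemming. The suffix list is illustrative only.
SUFFIXES = ["iness", "ness", "ing", "est", "ies", "ily", "er", "ed", "ly", "s"]

def crude_stem(word: str) -> str:
    """Strip the first matching suffix; no linguistic knowledge involved."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Cleanup: collapse a doubled final consonant ("runn" -> "run")
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print(crude_stem("running"))   # run
print(crude_stem("studies"))   # stud
```

Note that even this tiny version shows both failure modes: "stud" isn't a word, and unrelated words can collide onto the same stem.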

Lemmatization: The smarter solution

Idea: Use linguistic knowledge to find the dictionary form (lemma)

Examples:

running  -> run
ran      -> run
better   -> good
is       -> be
mice     -> mouse

Better! Uses dictionaries and morphological rules to find true word forms.

But: Slow, language-specific, still treats lemmas as atomic.

Why stemming and lemmatization aren't enough

Still one token per word (vocabulary explosion continues)

Language-specific (need new rules/dictionaries for each language)

Can't handle new words (not in the dictionary)

Loses information ("running" vs "ran" have different tenses!)

Part 3: Modern Subword Tokenization

Let's guess and check

Quick pair-share:

How would you split this sentence into pieces for a computer to process?

How many "words"/tokens do you think ChatGPT sees?

"I can't believe ChatGPT doesn't understand state-of-the-art LLM-training techniques like gobbledigook! 🤯"

The trick - Don't tokenize at word boundaries

Instead: Learn a vocabulary of subword units that can be combined

"unhappiness" -> ["un", "happiness"]
"ChatGPT"     -> ["Chat", "GPT"]
"supercal..." -> ["super", "cal", "if", "rag", "il", "ist", "ic"]

Benefits:

  • Fixed vocabulary size (50k subwords vs 170k+ words)
  • New words break into known pieces
  • Shared meaning ("un" = negation across many words)
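One way to see how subword pieces combine: greedy longest-prefix matching over a vocabulary (the strategy WordPiece uses at encoding time). The vocabulary below is made up for illustration:

```python
# Toy subword vocabulary (made up - real vocabularies have ~30k-50k entries)
VOCAB = {"un", "happiness", "run", "ning", "chat", "g", "p", "t"}

def greedy_tokenize(word: str) -> list[str]:
    """Repeatedly take the longest vocabulary entry that prefixes the rest."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])           # fall back to a single character
            i += 1
    return tokens

print(greedy_tokenize("unhappiness"))   # ['un', 'happiness']
print(greedy_tokenize("running"))       # ['run', 'ning']
```

Even a word the vocabulary has never seen whole gets covered by known pieces.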

Byte-Pair Encoding (BPE)

The dominant approach for modern LLMs

High-level idea:

  1. Start with character-level vocabulary
  2. Find the most frequent pair of adjacent tokens
  3. Merge them into a new token
  4. Repeat until vocabulary reaches target size

Result: Common words become single tokens, rare words split into pieces
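The four steps above fit in a few lines of Python. This is a toy trainer (real BPE implementations also mark word boundaries and weight words by frequency):

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (word-internal merges only)."""
    corpus = [list(w) for w in words]          # start at the character level
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair wins
        merges.append(best)
        # Apply the new merge everywhere
        new_corpus = []
        for seq in corpus:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges

text = "I like to run in my running shoes when I'm running late"
rules = bpe_train(text.lower().split(), num_merges=5)
print(rules)   # the first merges build 'r'+'u' -> 'ru', then 'ru'+'n' -> 'run'
```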

BPE example (board work)

Let's build a toy BPE vocabulary together on the board

Training text: "I like to run in my running shoes when I'm running late"

We'll merge the most frequent pairs step by step and watch how "run" emerges as a token!

BPE: Training vs. Encoding

Training (learning the vocabulary):

  • Scan corpus, count all adjacent token pairs
  • Greedily merge the most frequent pair to get a new token
  • Repeat until vocabulary reaches target size (e.g., 50k tokens)
  • Save the ordered list of merge rules

Encoding (tokenizing new text):

  • Apply the learned merge rules in priority order (order they were learned)
  • Don't re-count frequencies, just apply the rules deterministically
  • Same text always produces same tokens

Training: greedy, data-driven. Encoding: deterministic, fast.
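Encoding is just replaying the merge list in order. A sketch (the toy rules below are assumed for illustration, not learned from a real corpus):

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize by replaying learned merge rules in priority order."""
    seq = list(word)
    for a, b in merges:                       # earlier rules = higher priority
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Toy merge rules, as if learned during training
merges = [("r", "u"), ("ru", "n"), ("i", "n"), ("in", "g"), ("n", "ing")]
print(bpe_encode("running", merges))   # ['run', 'ning']
```

No counting happens here - the same word always comes out the same way.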

BPE: Preventing cross-word merges

Problem: Without boundaries, BPE might merge characters across word boundaries.

"faster lower" split naively: f a s t e r l o w e r

The pair r + l could merge across the two words!

Solution 1: End-of-word marker (original BPE, Sennrich et al. 2016)

Each word gets a </w> suffix before merging:

"faster" -> f a s t e r </w>
"lower"  -> l o w e r </w>

Merges like er</w> stay within each word. The boundary is never crossed.

Solution 2: Space prefix (GPT-2 and all GPT descendants)

Mark word starts with the preceding space:

"faster lower" -> ["faster", "Ġlower"]    (Ġ = space)

This is why "hello" and " hello" tokenize differently in the demo - the space is part of the next token, not the previous one.
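The space-prefix idea can be approximated with a one-line regex. This is a simplification - GPT-2's actual pre-tokenizer regex also special-cases contractions, digits, and punctuation:

```python
import re

# Simplified GPT-2-style pre-tokenization: each word keeps its leading space,
# so " hello" and "hello" become different pre-tokens before BPE runs.
def pre_tokenize(text: str) -> list[str]:
    return re.findall(r" ?\S+", text)

print(pre_tokenize("faster lower"))   # ['faster', ' lower']
```

In the printed vocabulary, that leading space is displayed as Ġ.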

BPE in practice

For real LLMs:

  • Train on billions of words
  • Create vocabulary of ~30k-50k subword tokens
  • Common words: one token ("the", "and", "ChatGPT")
  • Rare words: multiple tokens ("supercalifragilisticexpialidocious")

Tokenizer Variants (just FYI!)

Algorithm | Used By                    | Key Idea
BPE       | GPT-2/3/4/5, LLaMA, Claude | Greedy: merge most frequent pairs
WordPiece | BERT, DistilBERT           | Merge pairs that maximize likelihood ratio
Unigram   | T5, ALBERT, XLNet          | Start big, prune tokens that hurt least

WordPiece: Like BPE, but instead of merging by raw frequency, it scores each candidate pair by a likelihood ratio - roughly count(AB) / (count(A) × count(B)). It prefers merges where the combined token is more likely than you'd expect from the parts appearing independently.

Unigram: Opposite direction from BPE:

  1. Start with a large vocabulary (all common substrings)
  2. Compute how much each token contributes to likelihood
  3. Remove the least useful tokens until target vocabulary size

Why subword tokenization works

  • Balances vocab size and granularity
  • Shares info across related words
  • Handles new/rare words gracefully
  • Data-driven - no linguistic rules needed
  • Works across languages

This is why all modern LLMs use subword tokenization!

Special Tokens

Beyond regular text, LLMs use special tokens for control and structure:

End of text: <|endoftext|> - tells the model a document is complete

  • Important to think about when using structured output (e.g., generating JSON or other formats)

Beginning of text: <|startoftext|> - marks the start

Padding: <pad> - fills in when batching sequences of different lengths

Unknown: <unk> - rare fallback for truly unknown input (less common with BPE)

Chat-specific: <|user|>, <|assistant|>, <|system|> - structure conversations

Example chat template (simplified):

<|system|>You are a helpful assistant.<|endoftext|>
<|user|>What's the capital of France?<|endoftext|>
<|assistant|>Paris is the capital of France.<|endoftext|>

This is why "system prompts" work. They go in a special place the model treats as instructions.

Understanding special tokens helps you understand prompt injection - malicious input can insert fake tokens like <|system|> to override instructions. More on this later.
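A minimal sketch of how such a template might be assembled in code, using the simplified token strings from this slide (real models each define their own template format):

```python
# Render a list of chat messages into a single prompt string using the
# simplified <|role|> ... <|endoftext|> template above (illustrative only).
def render_chat(messages: list[dict]) -> str:
    return "".join(
        f"<|{m['role']}|>{m['content']}<|endoftext|>\n" for m in messages
    )

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
])
print(prompt)
```

The model never "sees" roles as anything special - structure comes entirely from these token conventions, which is exactly what injection attacks try to exploit.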

Part 4: Tokenization in Practice

Live demo: OpenAI tokenizer

Let's see how GPT actually tokenizes text

Go to: platform.openai.com/tokenizer

Try these examples and discuss:

  1. "running" vs "run"
  2. "ChatGPT"
  3. "supercalifragilisticexpialidocious"
  4. " hello" vs "hello"
  5. Code: "def main():"
  6. Math: "2+2=4"
  7. "🙂😀"
  8. "strawberry"

Why LLMs struggle with certain tasks

Question: Why do LLMs struggle to count letters in words or reverse words?

Turn to your neighbor and discuss

Why LLMs struggle with certain tasks

Answer: They don't see individual letters - common words are single tokens!

Example: "strawberry" = ["str", "awberry"]

The model can't count the "r"s - it doesn't see individual letters!

This is why prompting tricks sometimes work:

  • "Spell it out letter by letter first"
  • "Break the word into characters"

These force the model to generate character-level tokens

Fun fact: OpenAI's o1 was code-named "Strawberry"

Tokenization archaeology: "SolidGoldMagikarp"

Story: In 2023, researchers discovered "glitch tokens" - tokens that made ChatGPT behave bizarrely.

One example: the token "SolidGoldMagikarp" (a Reddit username). When asked to repeat it, ChatGPT would:

  • Claim it couldn't see the word
  • Refuse to say it
  • Output completely unrelated text
  • Behave erratically

What happened? The tokenizer's training data contained this Reddit username often enough to make it a single token. But the model itself rarely saw it during training - a mismatch between the tokenizer's data and the model's data.

Quick skim now, but great reading for later!

Tokenizers are frozen

Once a model is trained, its tokenizer is fixed. You can't easily change it.

  • In 2020, models tokenized "COVID-19" as roughly ["CO", "VID", "-", "19"].
  • Newer models trained after 2020 may have "COVID" as a single token.

Why newer models handle recent terms better: not just more data, updated tokenizers too.

Other tokenization effects

Arithmetic: Numbers tokenize inconsistently - sometimes digit-by-digit, sometimes as chunks

Code: Variable names split unpredictably

Rhymes: "cat" and "bat" might not share an "at" token

Tokenization shapes what LLMs find easy vs hard

Tokenizing Code vs Natural Language

Code and prose tokenize very differently:

Natural language: Words mostly stay intact

  • "The quick brown fox" = 5 tokens

Code: Variable names split unpredictably

  • print = 1 token (very common)
  • getUserDataFromDB = 5 tokens ["get", "User", "Data", "From", "DB"]
  • mySpecialFunction = 3 tokens ["my", "Special", "Function"]

Why this matters:

  • Longer sequences are harder for the model to understand
  • Uses up context faster

Rule of thumb: Assume ~ 10 tokens per line of code when you're asking AI to parse code files

Token Vocabularies Across Models

Different models make different tokenization choices:

Model  | Vocab Size | Notes
GPT-2  | ~50k       | Older, smaller vocabulary
GPT-4  | ~100k      | Larger, better multilingual
Claude | ~100k      | Similar to GPT-4
LLaMA  | ~32k       | Smaller but efficient
BERT   | ~30k       | WordPiece, not BPE

A prompt optimized for one model may be inefficient for another.

Why this matters for prompt engineering:

  • Context window limits (e.g., 128k tokens) are in TOKENS, not words
  • Few-shot examples eat into your token budget
  • Verbose prompts = fewer tokens for the actual task
  • Non-English prompts use more of your context window

Mental model: How big is a token?

Rules of thumb for English:

  • ~4 characters per token (on average)
  • ~0.75 words per token (or ~1.3 tokens per word)
  • A typical page of text ≈ 500-700 tokens
  • A typical email ≈ 200-400 tokens
  • 128K token context ≈ a 250-page book
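These rules of thumb make a serviceable back-of-envelope estimator - a heuristic only; the real count requires running the actual tokenizer:

```python
# Rough token estimate for English text, using the ~4 characters/token
# rule of thumb from this slide. Not a tokenizer - just arithmetic.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

page = "word " * 600                  # roughly a page of text
print(estimate_tokens(page))          # 750
```

Good enough for budgeting prompts; off by a lot for code, non-English text, or anything token-dense.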

The cost of tokens

Typical API pricing (as of early 2026):

Model         | Input               | Output
GPT-4         | ~$2.50 / 1M tokens  | ~$10 / 1M tokens
Claude Sonnet | ~$3 / 1M tokens     | ~$15 / 1M tokens
GPT-4o-mini   | ~$0.15 / 1M tokens  | ~$0.60 / 1M tokens

Quick cost estimates (Claude Sonnet at ~$3/1M input tokens):

  • 1 email (~300 tokens): ~$0.001
  • A novel (~100K tokens): ~$0.30

Tokens are cheap individually. Volume is where costs add up.
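The cost arithmetic is simple enough to sketch. The prices below are the ballpark figures from the table above, not current list prices:

```python
# Rough API cost estimator. Prices are illustrative (dollars per 1M tokens).
PRICES = {  # model: (input $/1M, output $/1M)
    "claude-sonnet": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Re-reading a 10k-token conversation and generating a 500-token reply:
print(f"${estimate_cost('claude-sonnet', 10_000, 500):.4f}")   # $0.0375
```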

How it adds up

  • Every time you send a message, the LLM RE-READS YOUR WHOLE CONVERSATION HISTORY as context
  • If you're doing development work with lots of code, each message could easily be 10k+ tokens (~$0.20)
  • If you set up a chatbot for many users / use LLMs to send spam emails...

Minification: Squeezing more into your context

You can strip characters to reduce token count before sending to an LLM.

Strategies:

Content Type | Technique
Code         | Remove comments, collapse whitespace
JSON         | Strip whitespace, shorten keys
Markdown     | Remove extra newlines, simplify formatting
Logs         | Deduplicate, truncate timestamps

Pros:

  • Fit more in context window
  • Reduce API costs

Cons:

  • Harder for the model to "read" - formatting aids comprehension
  • Harder for humans to read and debug without whitespace
  • Diminishing returns (saving 10% rarely matters)
  • Risk of removing important context

Rule of thumb: Minify data/logs aggressively. Keep code and instructions readable.
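For JSON, the standard library does the minification for you via the separators argument to json.dumps:

```python
import json

# JSON minification with the standard library: separators=(",", ":")
# drops all the whitespace between tokens.
data = {"user": {"name": "Ada", "roles": ["admin", "editor"]}}

pretty = json.dumps(data, indent=2)
minified = json.dumps(data, separators=(",", ":"))

print(len(pretty), len(minified))   # fewer characters -> fewer tokens
print(minified)
```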

Activity: Tokenization Scavenger Hunt

Select a tokenizer (or compare them):

Find examples of each:

  1. A real English word that splits into 4+ tokens
  2. What's the longest English word you can find that is just one token?
  3. Find a 4-digit number that's ONE token, and another 4-digit number that's TWO tokens. What's the pattern?
  4. Find a word where changing the capitalization changes the number of tokens
  5. Find a string where GPT's and Claude's tokenizers produce different numbers of tokens.
  6. COVID-19 was 4 tokens in GPT-3 but is now 3 tokens. Can you find other examples of token count changing over time?
  7. Translate "Hello, how are you today?" into at least 3 languages. Which language uses the MOST tokens?
  8. Find a non-English word that's a single token.
  9. If your name isn't common in English, how many tokens is it? Compare to a common English name.

Part 5: Tokenization and Fairness

Not all languages are created equal

BPE vocabularies are learned from training data.

If training data is mostly English:

  • English words - efficient (one token per word)
  • Other languages - split aggressively

This has real consequences

Token efficiency across languages

Same meaning, different token counts:

"Hello, how are you?" (English): 6 tokens

"你好,你好吗?" (Chinese): 11 tokens

Nǐ hǎo ma (Chinese, pinyin): 7 tokens

"مرحبا، كيف حالك؟" (Arabic): 14 tokens

Same semantic content, different token counts!

Why this matters

Cost: Many APIs charge per token

Context limits: Same token limit = fewer words in Chinese than English

Performance: More tokens = longer sequences = harder to learn

Fairness: English speakers get a better deal

Discussion: Is this a problem?

Turn to your neighbor:

  1. Is token inefficiency for non-English languages a fairness issue?
  2. Whose responsibility is it to address this?
  3. What could be done about it?

Some possible approaches:

Train on more balanced multilingual data

Language-specific tokenizers (but lose cross-lingual transfer)

Character-level models (no bias, but less efficient)

Larger vocabularies - more slots for non-Latin characters (GPT-4o went from 100k to 200k vocabulary, improving Chinese efficiency ~3x)

Adjust pricing by language (some APIs do this)

Part 6: Looking Ahead

What we've learned today

Tokenization is foundational - it determines what models can "see"

Historical approaches: stemming and lemmatization (word-level, limited)

Modern approach: subword tokenization (BPE, WordPiece)

Tokenization affects LLM behavior (letter counting, arithmetic, etc.)

Tokenization has fairness implications (language efficiency, cost)

Connecting the dots

Lecture 2: AI development + Classical NLP

Lecture 3 (Monday): Deep learning foundations

Lecture 4 (today): Tokenization

Lab/Reflection Due Friday (Feb 6)

  • Explore tokenization and/or neural network basics.

Monday: Sequence-to-sequence models and word embeddings

After that: Attention!