AppCrib
Developer Tools

Why GPT, Claude, and Gemini Disagree on the Same Token Count

Domain knowledge·Published by AppCrib··
ToklenToken count, cost, and context window for GPT, Claude, and Gemini.

Take a sentence: "The quick brown fox jumps over the lazy dog 🦊."

Run it through OpenAI's tokenizer with GPT-4o, and you get a count in the low teens. Run it through the same library with GPT-4 Turbo, and the count is slightly higher. Send it to Anthropic's messages.count_tokens endpoint for Claude 3.5 Sonnet, and you'll get a number that's neither, somewhere close but not equal, depending on how the proprietary tokenizer handles the emoji. Hand the same string to Google's count_tokens for Gemini 1.5 Pro and you'll get something in the same ballpark, but again, slightly different.

Same sentence. Three or four numbers, depending on how many providers you ask.

This is a small example of a bigger pattern. Every model family ships its own tokenizer, and the tokenizers were designed for different jobs. The question of how a string becomes tokens is not standardized. The cost difference at scale is real, and the divergence shows up in places that matter: invoices, context windows, RAG budgets.

Where Subword Tokenization Came From

The technique most current LLM tokenizers use is descended from a compression algorithm Philip Gage published in 1994. Byte Pair Encoding, originally a way to find frequent byte sequences in a file and replace them with single bytes, sat unused in the NLP world for two decades. In 2016, Sennrich, Haddow, and Birch adapted it for neural machine translation in a paper titled "Neural Machine Translation of Rare Words with Subword Units." The motivation was practical. Word-level vocabularies blew up to hundreds of thousands of entries, and rare words ended up as <unk> tokens that the model couldn't reason about.

BPE solved that by finding the most common adjacent symbol pairs in a corpus and merging them, repeatedly, until a vocabulary of a fixed size was built. Common words ended up as single tokens. Rare words got broken into reusable pieces. "Tokenization" might split into "token" + "ization." A misspelling or a brand name gets reduced to two or three subword pieces instead of an unknown.

OpenAI adopted BPE for GPT-2 in 2019 with a vocabulary of about 50,000 entries. GPT-3 used the same family. With ChatGPT and GPT-3.5-turbo in late 2022, OpenAI shipped a refined vocabulary called cl100k_base containing 100,277 tokens, optimized for code and multilingual content in addition to English prose. GPT-4o brought another upgrade in May 2024: o200k_base, with roughly 200,000 tokens, that handles non-English languages substantially more compactly. The same English sentence costs about the same in both vocabularies. A Japanese or Arabic sentence costs noticeably less in o200k_base.

You can read the actual merge rules. They ship in the tiktoken library OpenAI publishes, and the file is a flat list of byte sequences and their merge order. There's no mystery there.

Anthropic Took a Different Path

Anthropic doesn't publish Claude's tokenizer. Earlier versions of the Anthropic Python SDK included a tokenizer module for Claude 1 and Claude 2, and the vocabulary was inspectable. Starting with Claude 3, that changed. The current way to get a real Claude token count is to call client.messages.count_tokens() against the API, which doesn't consume tokens but does require a server, an API key, and a network round trip.

The choice was deliberate, and it's worth understanding why. A published tokenizer makes it easier to clone the model's input behavior, which is a small but real piece of the moat for a frontier lab. It also means the tokenizer can change between Claude versions without breaking client code that pinned to an older vocabulary. A 1% efficiency gain in tokenization at the scale Anthropic operates at is worth the friction it imposes on developers.

The friction is real. If you want to size a prompt against Claude before you send it, your options are: call the count endpoint (server required), trust an approximation, or count manually using a heuristic. Most browser tools take option two. Most browser tools don't tell you they're doing it.

Gemini Uses SentencePiece

Google's tokenizers descend from a different lineage. SentencePiece, published by Kudo and Richardson at EMNLP 2018, was designed to be language-agnostic and to operate on raw text without language-specific preprocessing. Where BPE merges symbols based on frequency, SentencePiece's Unigram variant treats tokenization as a probabilistic problem. Build a large initial vocabulary, then prune it down to maximize the likelihood of the training corpus.

The result is a tokenizer that splits text in a meaningfully different way from BPE, especially for whitespace-heavy content and CJK languages. Gemini 1.5 Pro uses a SentencePiece-derived tokenizer with a vocabulary in the same general size class as o200k_base, and Google exposes a count_tokens method via the Vertex AI SDK and the public Gemini API for getting an exact count.

A short English sentence will tokenize to nearly the same number across BPE and SentencePiece. A long block of source code will not. Indentation, semicolons, and language-specific punctuation get split differently, and the differences accumulate quickly.

Where the Numbers Diverge

The differences are easiest to see in a table. Below are illustrative token counts for a few representative inputs across the four tokenizers, using public counts where available and noting where the count is an approximation.

InputGPT-4o (o200k_base)GPT-3.5/4 (cl100k_base)Claude 3.5 Sonnet (API count)Gemini 1.5 Pro (count_tokens)
"The quick brown fox jumps over the lazy dog."10101110
"🦊" (single emoji, alone)1532
"こんにちは世界" (Japanese "Hello world")48~64
1,000 lines of TypeScript~2,800~3,400~3,100~3,000
RFC 4180 plain text~6,200~6,800~6,400~6,500

Two patterns jump out. First, single-character inputs that fall outside the basic ASCII range get tokenized very differently, and the gap can be 5x or more. Second, the gap shrinks dramatically for long English text, which is what most production prompts actually contain. If you're sizing a 50,000-token English prompt, the four tokenizers will agree to within a few percent. If you're sizing a 5,000-token Japanese prompt, the gap matters.

The number you should care about is whichever tokenizer the model you're paying for actually uses. Estimating against the wrong one for a billing-sensitive workload is how a $40 prompt becomes a $60 prompt.

When the Difference Actually Costs You

There are three cases where the divergence is more than academic.

The first is cost estimation for non-English prompts. If you're translating, summarizing, or generating in CJK, Arabic, or other non-Latin scripts, a count from cl100k_base will run substantially higher than the actual o200k_base cost on GPT-4o, and a cl100k_base approximation for Claude will be wrong by a margin that matters. The math is simple: compute against the wrong tokenizer, miscalculate by 20-30%, and the surprise shows up on the invoice.

The second is context window sizing. Claude 3.5 Sonnet's 200,000-token window holds a different amount of content depending on what kind of content it is. A 200,000-token English novel is roughly 150,000 words. A 200,000-token codebase is closer to 8,000 lines, and the line count can vary by a factor of two depending on language and indentation style. Knowing the count for the model you're targeting, in the units that model uses, is the only reliable way to plan for "will this fit."

The third is RAG retrieval budgets. If you're packing retrieved chunks into a prompt up to a limit, an approximation that's off by 5% means you're either leaving usable budget on the table or overrunning the budget you set. Neither is catastrophic, but both compound across high-volume systems where you're running thousands of retrievals per day.

For drafts, comparisons, and rough sizing, an approximation is fine. For anything that lands on a bill or a hard limit, the right tokenizer is the one the model actually runs on.

Why Browser Tools Default to OpenAI's Tokenizer

The pattern across web token counters is to use OpenAI's cl100k_base tokenizer for every model, label it or not, and call it a day. The reason is structural. OpenAI's tokenizer is the only one that ships as a portable WASM-friendly library. Anthropic's tokenizer requires an API call, which means a server, which breaks the browser-only model. Google's SentencePiece tokenizer is open source but is heavier and harder to bundle for a browser context, and the model-specific vocabulary files for Gemini aren't published the way OpenAI publishes cl100k_base.tiktoken.

The result is that "your Claude token count" on most web tools is actually a cl100k_base count being relabeled. For English prose this is within a few percent of the true count. For non-English text, code with unusual characters, or anything emoji-heavy, the approximation drifts, sometimes by 20% or more.

The honest pattern is to label the approximation directly so a developer knows when to trust the number. If a tool shows you "1,247 tokens" with no asterisk, and the model you've selected is Claude, the number is a tiktoken proxy. Toklen labels it "Approximation via tiktoken" right under the count for Claude and Gemini, and gives you the exact cl100k_base or o200k_base count for OpenAI models. That difference is the whole point. If you want to see the divergence in practice, paste the same input across models in a tool that's honest about which count is exact and which is a proxy.

Toklen
Token count, cost, and context window for GPT, Claude, and Gemini.
Try Toklen