How LLM token counting works
OpenAI's tiktoken library ships the two vocabularies behind OpenAI's current models: cl100k_base and o200k_base. Anthropic and Google publish neither their vocabularies nor their merge rules. That gap shapes every number Toklen shows you.
Byte Pair Encoding and the Vocabulary Problem
Byte Pair Encoding starts from raw UTF-8 bytes (256 values) and trains a tokenizer by repeatedly finding the most frequent adjacent pair and merging it into a new symbol. After 100,021 merges you have cl100k_base. After 199,741 merges you have o200k_base. A tokenizer is fully specified by that merge table plus a regex pre-tokenizer that decides where one candidate token can end and the next can begin.
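The training loop below is a toy byte-level BPE in TypeScript. It is a sketch, not tiktoken's implementation:

- It skips the regex pre-tokenizer and special tokens.
- It numbers new symbols 256, 257, and so on in merge order.
- It trains on a made-up string.

Encoding repeatedly applies the lowest-rank (earliest-learned) merge present in the input.

```typescript
// Toy byte-level BPE: learn merges from a corpus, then encode new text.
// Sketch only: no regex pre-tokenizer, no special tokens.

type Pair = [number, number];

const pairKey = (a: number, b: number): string => `${a},${b}`;

// Replace each non-overlapping adjacent (a, b) in ids with newId, scanning left to right.
function mergePair(ids: number[], a: number, b: number, newId: number): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < ids.length) {
    if (i + 1 < ids.length && ids[i] === a && ids[i + 1] === b) {
      out.push(newId);
      i += 2;
    } else {
      out.push(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Learn up to numMerges merges; merge r creates symbol id 256 + r.
function trainBpe(corpus: string, numMerges: number): Pair[] {
  let ids = Array.from(new TextEncoder().encode(corpus)); // raw UTF-8 bytes, 0..255
  const merges: Pair[] = [];
  for (let r = 0; r < numMerges; r++) {
    const counts = new Map<string, number>();
    for (let i = 0; i + 1 < ids.length; i++) {
      const k = pairKey(ids[i], ids[i + 1]);
      counts.set(k, (counts.get(k) ?? 0) + 1);
    }
    // Most frequent pair wins; ties go to the pair seen first.
    let best: string | undefined;
    let bestCount = 1; // stop once no pair occurs more than once
    for (const [k, c] of counts) {
      if (c > bestCount) {
        best = k;
        bestCount = c;
      }
    }
    if (best === undefined) break;
    const [a, b] = best.split(",").map(Number);
    merges.push([a, b]);
    ids = mergePair(ids, a, b, 256 + r);
  }
  return merges;
}

// Encode by repeatedly applying the lowest-rank merge present in the sequence.
function encode(text: string, merges: Pair[]): number[] {
  const rank = new Map<string, number>();
  merges.forEach(([a, b], r) => rank.set(pairKey(a, b), r));
  let ids = Array.from(new TextEncoder().encode(text));
  while (ids.length >= 2) {
    let bestRank = Infinity;
    for (let i = 0; i + 1 < ids.length; i++) {
      const r = rank.get(pairKey(ids[i], ids[i + 1]));
      if (r !== undefined && r < bestRank) bestRank = r;
    }
    if (bestRank === Infinity) break; // no learned merge applies
    const [a, b] = merges[bestRank];
    ids = mergePair(ids, a, b, 256 + bestRank);
  }
  return ids;
}

const demoMerges = trainBpe("abababab", 10);
console.log(demoMerges); // [[97, 98], [256, 256]]
console.log(encode("abababab", demoMerges)); // [257, 257]
console.log(encode("abababab", demoMerges.slice(0, 1)).length); // 4
```

With only the first merge, the same string costs four tokens instead of two. That is a small-scale version of why a larger merge table yields lower counts.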
A wider vocabulary lets the encoder represent longer runs of common text as single tokens. In our test corpus of GitHub READMEs, o200k_base cuts 8 to 12 percent off the cl100k_base count on English text. On Korean the gap widens to roughly 30 percent; on Python with heavy __dunder__ usage it narrows to under 5 percent.
For Toklen we shipped the js-tiktoken port of the Rust library instead of calling a server. The vocabulary loads as a base64 blob, decodes once, and tokenizes in under 5ms for inputs under 10k characters. No prompt leaves the browser.
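The load-once pattern can be sketched as follows. It assumes the plain-text `.tiktoken` rank format that the Python tiktoken library downloads, with one `<base64 token> <rank>` pair per line. js-tiktoken bundles its ranks in its own format, so `parseRanks` and `lazyRanks` are hypothetical helpers, not its API.

```typescript
// Sketch: lazily decode a tiktoken-style rank file into a Map keyed by token bytes.
// The blob is decoded on first use only, then reused for every keystroke.

function base64ToBytes(b64: string): Uint8Array {
  const bin = atob(b64); // atob is available in browsers and Node 16+
  const out = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) out[i] = bin.charCodeAt(i);
  return out;
}

// Key byte sequences as comma-joined strings so they can live in a Map.
const bytesKey = (bytes: Uint8Array): string => Array.from(bytes).join(",");

function parseRanks(blob: string): Map<string, number> {
  const ranks = new Map<string, number>();
  for (const line of blob.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines, e.g. a trailing newline
    const [b64, rank] = trimmed.split(" ");
    ranks.set(bytesKey(base64ToBytes(b64)), Number(rank));
  }
  return ranks;
}

// Memoize the parse: the first call decodes, later calls return the same Map.
function lazyRanks(blob: string): () => Map<string, number> {
  let cache: Map<string, number> | undefined;
  return () => {
    if (cache === undefined) cache = parseRanks(blob);
    return cache;
  };
}

// "IQ==" is "!", "Ig==" is '"', "aGk=" is "hi".
const getRanks = lazyRanks("IQ== 0\nIg== 1\naGk= 2\n");
console.log(getRanks().get("104,105")); // 2
```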
Tokenizer Specifications by Provider
The table below summarizes the tokenizers relevant to Toklen's counts, with specs taken from provider documentation. For closed tokenizers, Toklen counts with the cl100k_base substitute instead.
| Provider | Tokenizer | Vocab Size | Algorithm | Released? | Used By |
|---|---|---|---|---|---|
| OpenAI | cl100k_base | 100,277 | BPE | Yes (tiktoken) | GPT-3.5-turbo, GPT-4, GPT-4-turbo |
| OpenAI | o200k_base | 199,997 | BPE | Yes (tiktoken) | GPT-4o, GPT-4o-mini, o1, o3 |
| Anthropic | Claude BPE | undisclosed | BPE (per docs) | No | Claude 3 Haiku, Sonnet, Opus; Claude 3.5 Sonnet |
| Google | SentencePiece | undisclosed (est. 256k) | Unigram LM | No (Gemini API exposes countTokens) | Gemini 1.5 Flash, Pro; Gemini 2.0 |
| Meta | Llama tokenizer | 128,000 | BPE | Yes (HuggingFace) | Llama 3, 3.1, 3.3 |
| Mistral | Mistral tokenizer | 32,000 | BPE | Yes (mistral-common) | Mistral 7B, Mixtral, Mistral Large |
The Anthropic row is the awkward one. The /v1/messages/count_tokens endpoint returns exact counts, but it is a network call that counts against your rate limit and defeats the point of a local counter. Toklen substitutes cl100k_base, the same fallback used by LangChain, LlamaIndex, and OpenRouter's UI.
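Per-model selection with a fallback can be expressed as a small lookup. This is a minimal sketch: the model keys are hypothetical identifiers, not necessarily what Toklen uses internally. The encoding assignments follow the table above, and the `exact` flag stands in for the "approximate" badge.

```typescript
// Hypothetical model -> tokenizer selection with a cl100k_base fallback.
type EncodingName = "o200k_base" | "cl100k_base";

interface TokenizerChoice {
  encoding: EncodingName;
  exact: boolean; // false means the UI should show the "approximate" badge
}

// OpenAI models run their own published encoding.
const OPENAI_ENCODINGS = new Map<string, EncodingName>([
  ["gpt-4o", "o200k_base"],
  ["gpt-4o-mini", "o200k_base"],
  ["gpt-4-turbo", "cl100k_base"],
  ["gpt-3.5-turbo", "cl100k_base"],
]);

function tokenizerFor(model: string): TokenizerChoice {
  const encoding = OPENAI_ENCODINGS.get(model);
  if (encoding !== undefined) return { encoding, exact: true };
  // Closed tokenizers (Claude, Gemini): count with cl100k_base and flag the result.
  return { encoding: "cl100k_base", exact: false };
}

console.log(tokenizerFor("gpt-4o"));            // { encoding: 'o200k_base', exact: true }
console.log(tokenizerFor("claude-3-5-sonnet")); // { encoding: 'cl100k_base', exact: false }
```

A `Map` is used rather than a plain object so that keys like `"constructor"` cannot accidentally match inherited properties.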
Context Window Limits across GPT, Claude, and Gemini
The context window is the maximum number of tokens a model accepts. Depending on how the provider accounts for it, that limit covers input plus output or input alone. As of March 2026:
| Model | Context Window (tokens) | Input Pricing | Output Pricing |
|---|---|---|---|
| GPT-4o | 128,000 | $2.50 / 1M | $10.00 / 1M |
| GPT-4o-mini | 128,000 | $0.15 / 1M | $0.60 / 1M |
| Claude 3.5 Sonnet | 200,000 | $3.00 / 1M | $15.00 / 1M |
| Claude 3 Opus | 200,000 | $15.00 / 1M | $75.00 / 1M |
| Gemini 1.5 Pro | 2,000,000 | $1.25 / 1M (<=128k) | $5.00 / 1M (<=128k) |
| Gemini 1.5 Flash | 1,000,000 | $0.075 / 1M | $0.30 / 1M |
Gemini's 1M and 2M windows are real, but Google tiers the pricing: per-token rates roughly double for prompts over 128k tokens. Toklen's context bar flips amber at 80% and red at 95%. The failure mode we actually see is not one oversized prompt but a chat history that creeps past the limit one turn at a time.
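The gauge logic can be sketched in a few lines. The 80% and 95% thresholds come from the text. The function names, the inclusive comparisons, and summing per-turn counts into a running total are assumptions for illustration, not Toklen's code.

```typescript
type GaugeColor = "green" | "amber" | "red";

// Thresholds from the text: amber at 80% of the window, red at 95%.
// Treating both thresholds as inclusive is an assumption.
function gaugeColor(usedTokens: number, contextWindow: number): GaugeColor {
  const fraction = usedTokens / contextWindow;
  if (fraction >= 0.95) return "red";
  if (fraction >= 0.8) return "amber";
  return "green";
}

// Color after each chat turn, based on the running token total,
// which is how a conversation creeps toward the limit one turn at a time.
function gaugeHistory(turnTokens: number[], contextWindow: number): GaugeColor[] {
  let total = 0;
  return turnTokens.map((t) => {
    total += t;
    return gaugeColor(total, contextWindow);
  });
}

// Four turns against a 128,000-token window (GPT-4o in the table above).
console.log(gaugeHistory([40_000, 40_000, 30_000, 15_000], 128_000));
// [ 'green', 'green', 'amber', 'red' ]
```

No single turn here is large; the fourth turn only tips into red because of the three before it.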
Where cl100k_base Diverges from Proprietary Tokenizers
Substituting cl100k_base for Claude or Gemini on English prose lands within 5 to 15% of the true count. Our calibration set turned up three specific divergences.

- On emoji and CJK text, cl100k_base undercounts relative to Anthropic's tokenizer by 10 to 20%. This suggests Claude's vocabulary allocates fewer merges to multi-byte sequences and so splits them into more tokens.
- On source code with long identifiers (Java, TypeScript), it overcounts by 5 to 10%, apparently because Claude merges camelCase and snake_case identifiers more aggressively.
- On whitespace-heavy text, it overcounts by 3 to 8% across every substitution we tried.
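Bands like these can be measured by comparing cl100k_base counts against exact counts from a provider's own counter. The sketch below computes the signed relative error per sample and summarizes it by category. The `Sample` shape and category labels are hypothetical. The numbers in the usage lines are made up for illustration and are not our calibration data.

```typescript
// Signed relative error: negative = the substitute undercounts, positive = it overcounts.
interface Sample {
  category: string; // e.g. "cjk", "code", "whitespace" (hypothetical labels)
  approx: number;   // cl100k_base count
  exact: number;    // provider-reported count
}

interface ErrorBand {
  min: number;
  max: number;
  mean: number;
}

function relativeError(s: Sample): number {
  return (s.approx - s.exact) / s.exact;
}

// Group samples by category and report min / max / mean relative error.
function errorBands(samples: Sample[]): Map<string, ErrorBand> {
  const groups = new Map<string, number[]>();
  for (const s of samples) {
    const errs = groups.get(s.category) ?? [];
    errs.push(relativeError(s));
    groups.set(s.category, errs);
  }
  const bands = new Map<string, ErrorBand>();
  for (const [category, errs] of groups) {
    const mean = errs.reduce((sum, e) => sum + e, 0) / errs.length;
    bands.set(category, { min: Math.min(...errs), max: Math.max(...errs), mean });
  }
  return bands;
}

// Illustrative numbers only.
const demoBands = errorBands([
  { category: "cjk", approx: 85, exact: 100 },
  { category: "cjk", approx: 90, exact: 100 },
  { category: "code", approx: 108, exact: 100 },
]);
console.log(demoBands.get("code")); // { min: 0.08, max: 0.08, mean: 0.08 }
```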
Gemini's SentencePiece is the harder case because the algorithm itself differs. Unigram language model tokenization scores every possible segmentation and picks the most likely one. BPE instead merges greedily, always applying the highest-priority learned merge present, which is a different operation. The error band also widens for inputs under 50 tokens.
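The toy unigram segmenter below shows what "scores every possible segmentation" means. It uses Viterbi dynamic programming over character positions. The vocabulary and log-probabilities are invented; a real SentencePiece model learns them from data. The sketch also slices by UTF-16 code units, so it only suits ASCII examples.

```typescript
// Toy unigram-LM segmentation: pick the split of `text` that maximizes the
// sum of per-piece log-probabilities (Viterbi over character positions).
function unigramSegment(text: string, logProb: Map<string, number>): string[] | null {
  const n = text.length;
  const best: number[] = new Array(n + 1).fill(-Infinity); // best score for text[0..i)
  const back: number[] = new Array(n + 1).fill(-1);        // start index of the last piece
  best[0] = 0;
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const lp = logProb.get(text.slice(start, end));
      if (lp === undefined || best[start] === -Infinity) continue;
      const score = best[start] + lp;
      if (score > best[end]) {
        best[end] = score;
        back[end] = start;
      }
    }
  }
  if (best[n] === -Infinity) return null; // no segmentation covers the text
  const pieces: string[] = [];
  for (let end = n; end > 0; end = back[end]) pieces.unshift(text.slice(back[end], end));
  return pieces;
}

// Invented vocabulary with log-probabilities (higher = more likely).
const toyVocab = new Map<string, number>([
  ["u", -5], ["n", -5], ["i", -5], ["g", -5], ["r", -5], ["a", -5], ["m", -5],
  ["un", -3], ["uni", -4], ["ig", -4], ["gram", -3], ["igram", -6], ["ram", -3],
]);
console.log(unigramSegment("unigram", toyVocab)); // [ 'uni', 'gram' ]
```

The search compares whole segmentations, for example un + igram (score -9) against uni + gram (score -7), and keeps the higher total.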
If you need exact counts, call the provider's own counter. OpenAI exposes the count on any completion's usage object. Anthropic publishes count_tokens. Google publishes models.countTokens. Toklen marks every non-OpenAI result with an “approximate” badge so the substitution is visible, not hidden.
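The builders below sketch the request shapes for the two exact counters as we understand the public API references. They only construct the requests and send nothing, since Toklen itself never makes these calls.

- **Anthropic:** POST `/v1/messages/count_tokens` with `x-api-key` and `anthropic-version` headers, returning `input_tokens`.
- **Gemini:** POST `models/{model}:countTokens`, returning `totalTokens`.

The helper names are hypothetical. Check each provider's current docs for required headers, since Anthropic's endpoint once needed a beta header. On the OpenAI side, chat completion responses report the count as `usage.prompt_tokens`.

```typescript
// Hypothetical request builders for provider token counters (no network I/O here).
interface HttpRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function anthropicCountTokensRequest(apiKey: string, model: string, text: string): HttpRequest {
  return {
    url: "https://api.anthropic.com/v1/messages/count_tokens",
    init: {
      method: "POST",
      headers: {
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({ model, messages: [{ role: "user", content: text }] }),
    },
  };
}

function geminiCountTokensRequest(apiKey: string, model: string, text: string): HttpRequest {
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${encodeURIComponent(model)}:countTokens`,
    init: {
      method: "POST",
      headers: { "x-goog-api-key": apiKey, "content-type": "application/json" },
      body: JSON.stringify({ contents: [{ parts: [{ text }] }] }),
    },
  };
}

// Usage (not run here; needs network access and an API key):
// const { url, init } = anthropicCountTokensRequest(key, "claude-3-5-sonnet-20241022", prompt);
// const { input_tokens } = await (await fetch(url, init)).json();
```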
FAQ
How do I count tokens for an LLM prompt?
Paste your text into Toklen, pick the model you're sending it to (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, or one of seven others), and the token count updates in real time as you type. The count is computed in your browser via js-tiktoken — no server round-trip and no sign-up. For OpenAI models the count is exact; for Claude and Gemini it's an approximation via cl100k_base, typically within 5-15% for English text.
How do I count tokens for a Claude prompt?
Pick Claude 3.5 Sonnet, Claude 3 Opus, or Claude 3 Haiku in Toklen's model selector and paste the prompt. Anthropic doesn't publish the tokenizer Claude uses in production, so Toklen approximates the count using OpenAI's cl100k_base encoding — usually within 5-15% of the true count for English text. The result shows an approximation badge so you know it's an estimate. For exact counts call Anthropic's count_tokens API; for prompt planning and cost estimation the approximation is normally close enough.
How do I count tokens for a GPT-4 or GPT-5 prompt?
Pick GPT-4o, GPT-4o Mini, GPT-4 Turbo, or GPT-3.5 Turbo in Toklen and paste your prompt. The count is exact because Toklen runs js-tiktoken, a JavaScript port of OpenAI's own tokenizer (tiktoken). Toklen auto-selects the right encoding per model: o200k_base for the GPT-4o family and cl100k_base for GPT-4 Turbo and GPT-3.5 Turbo. When OpenAI ships a GPT-5 family model, the same exact-count behavior will apply once its encoding is added.
How do I estimate the cost of an LLM API call?
Toklen multiplies the input token count by each model's published per-million-token input price and shows the dollar cost to six decimal places, updating live as you type. Pricing is current as of each release and is shown alongside the cost so you can verify. For OpenAI models the cost is exact on the input side; for Claude and Gemini it's an approximation because the input token count itself is approximated. Output cost is not included — Toklen only counts what you provide, not the model's response.
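The arithmetic behind the estimate is a single multiply and divide. Below is a minimal sketch, assuming prices are stored as USD per million input tokens.

The optional long-prompt tier is a hypothetical extension modeled on Gemini's tiered pricing. Two details are assumptions, not a description of Toklen's code:

- Once a prompt exceeds the threshold, the whole prompt is billed at the higher rate.
- The tier rate is exactly double the base rate.

```typescript
// Input-only cost estimate: tokens x (USD per 1M input tokens) / 1,000,000.
interface InputPrice {
  usdPerMillion: number;
  // Hypothetical long-prompt tier: above `aboveTokens`, bill the whole prompt at this rate.
  tier?: { aboveTokens: number; usdPerMillion: number };
}

function inputCostUsd(inputTokens: number, price: InputPrice): number {
  const rate =
    price.tier !== undefined && inputTokens > price.tier.aboveTokens
      ? price.tier.usdPerMillion
      : price.usdPerMillion;
  return (inputTokens * rate) / 1_000_000;
}

// Six decimal places, matching the display described above.
function formatUsd(usd: number): string {
  return `$${usd.toFixed(6)}`;
}

const gpt4o: InputPrice = { usdPerMillion: 2.5 };
// Base rate from the pricing table; the doubled tier rate is an assumption.
const gemini15Pro: InputPrice = { usdPerMillion: 1.25, tier: { aboveTokens: 128_000, usdPerMillion: 2.5 } };

console.log(formatUsd(inputCostUsd(1_000, gpt4o)));         // $0.002500
console.log(formatUsd(inputCostUsd(200_000, gemini15Pro))); // $0.500000
```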
How much of an LLM context window does my prompt use?
Toklen shows a visual context window bar under every prompt: it fills as a percentage of the selected model's context window (for example 128K tokens for GPT-4o, 200K for Claude 3.5 Sonnet, 2M for Gemini 1.5 Pro) and shifts color from green to amber to red as you approach the limit. You can see at a glance whether you have headroom for system prompts, retrieved documents, or a longer response — no math required.
Why do Claude and Gemini show an approximation badge?
Anthropic and Google don't publish the tokenizers their models use in production. Toklen uses OpenAI's cl100k_base as a proxy for Claude and Gemini because it's open source, well-tested, and produces token counts that are typically within 5-15% of the true count for English text. The approximation gets less accurate for code (which tokenizers handle differently), non-English languages, and very short inputs. If you need exact counts for Claude, use Anthropic's official count_tokens API endpoint; for Gemini, use Google's countTokens method. For prompt planning and cost estimation, the cl100k_base approximation is usually close enough.
Why does the same text produce different counts per model?
Each model family trains its own tokenizer with a different vocabulary size and merge strategy. GPT-4o uses o200k_base (~200K tokens in the vocabulary), GPT-4 Turbo and GPT-3.5 use cl100k_base (~100K tokens), and older models like Codex used p50k_base. A word like "tokenization" might be one token in o200k_base and three tokens in cl100k_base because the newer tokenizer learned that merge during training. Toklen auto-selects the correct tokenizer when you pick an OpenAI model and falls back to cl100k_base for Claude and Gemini.
How accurate are the cost estimates?
For OpenAI models, cost estimates are exact down to the token because Toklen uses the real tokenizer plus OpenAI's published per-million-token pricing for input. Output costs are harder to predict because you don't know how long the response will be in advance — Toklen only counts your input. For Claude and Gemini, the input token count is approximated via cl100k_base (typically 5-15% variance), so cost estimates are approximations too. Pricing tiers are updated when the providers change them; check the model's official pricing page if you're making budget decisions for a large run.
Does Toklen send my text to any server?
No. Tokenization runs entirely in your browser via js-tiktoken, a pure JavaScript port of OpenAI's tiktoken library. Your prompt, code, or document never leaves your device. There's no backend, no API call, no database. You can open dev tools and watch the network tab — the only requests are loading the page itself and optional consent-gated PostHog analytics, which never include your text content. This matters if you're working with proprietary prompts, customer data, or anything under NDA.
Which models are supported?
Ten models across three providers. OpenAI: GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo (exact counts via tiktoken). Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku (approximated via cl100k_base). Google: Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 1.0 Pro (approximated via cl100k_base). Each has its context window and pricing baked in, so switching models instantly updates the context gauge and cost estimate.
How is this different from OpenAI's tokenizer page?
OpenAI's tokenizer only counts OpenAI models and doesn't estimate cost or show context window usage. Toklen covers GPT, Claude, and Gemini in one interface, shows a color-coded context window bar so you can see at a glance how much of the model's capacity you're using, calculates per-model cost so you can compare what the same prompt would cost on each, and runs without a sign-in or usage cap. If you bounce between providers during development, it's one tab instead of three.
Your text stays on your device
No server, no database, no accounts. Tokenization runs in-browser via JavaScript. Optional consent-gated analytics via PostHog never include your text content.
Built by Infinite Orchard · support@infiniteorchard.com