Token counts are approximations based on each model's tokenizer. Actual usage and costs may differ. Pricing reflects published rates at time of last update.
Token counting code reference: count GPT, Claude, and Gemini tokens in any language
Toklen counts tokens for GPT, Claude, and Gemini in your browser. The reference below covers how to do the same thing programmatically: provider SDKs, language ports, raw HTTP calls, and the model and encoding details you need when wiring token accounting into a pipeline or test harness.
Python: count tokens per provider
| Task | Snippet |
|---|---|
| GPT (any), exact and local | import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o"); len(enc.encode(text)) |
| Claude, exact via API | client.messages.count_tokens(model="claude-sonnet-4", messages=[{"role":"user","content":text}]).input_tokens |
| Claude, approximate and local | import tiktoken; len(tiktoken.get_encoding("cl100k_base").encode(text)) |
| Gemini, exact via API | model = genai.GenerativeModel("gemini-2.5-pro"); model.count_tokens(text).total_tokens |
| Pick encoding by name | tiktoken.get_encoding("o200k_base") |
| Decode tokens back to text | enc.decode([1234, 5678]) |
| Inspect token boundaries | [enc.decode_single_token_bytes(t) for t in enc.encode(text)] |
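The tiktoken Python package tokenizes in a Rust core. It downloads each encoding's BPE file on first use and caches it on disk; set TIKTOKEN_CACHE_DIR to control where, or to pre-populate it for offline use. Anthropic's and Google's exact-count endpoints are network calls that count against your rate limits. For inputs over a few hundred tokens, cache their results keyed by a hash of the input instead of recounting the same text.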
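Hash-keyed caching of exact counts is small enough to sketch. The version below is a minimal in-memory sketch. `CountCache` is a hypothetical helper, and `count_fn` is a placeholder for your own wrapper around a provider call such as `client.messages.count_tokens(...).input_tokens`. No network access is assumed.

```python
import hashlib
import json
from typing import Callable, Dict


class CountCache:
    """Memoize exact token counts, keyed by a SHA-256 hash of (model, text).

    `count_fn` is a hypothetical stand-in for your own wrapper around a
    provider call, e.g. client.messages.count_tokens(...).input_tokens.
    """

    def __init__(self, count_fn: Callable[[str, str], int]) -> None:
        self._count_fn = count_fn
        self._cache: Dict[str, int] = {}

    @staticmethod
    def key(model: str, text: str) -> str:
        # The model is part of the key: the same text tokenizes differently per model.
        payload = json.dumps({"model": model, "text": text}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def count(self, model: str, text: str) -> int:
        k = self.key(model, text)
        if k not in self._cache:
            # Only cache misses reach the (rate-limited) endpoint.
            self._cache[k] = self._count_fn(model, text)
        return self._cache[k]
```

In a long-running service you would likely swap the dict for a shared store such as Redis. If the system prompt or tool definitions are part of the counted request, they belong in the hashed payload as well.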
JavaScript and TypeScript
Browser, Node, and edge runtimes. The browser path is what Toklen uses internally.
| Task | Snippet |
|---|---|
| Browser, js-tiktoken (no WASM) | import { encodingForModel } from "js-tiktoken"; encodingForModel("gpt-4o").encode(text).length |
| Browser or Node, @dqbd/tiktoken (WASM) | import { encoding_for_model } from "@dqbd/tiktoken"; const enc = encoding_for_model("gpt-4o"); enc.encode(text).length; enc.free(); |
| Edge runtimes, gpt-tokenizer | import { encode } from "gpt-tokenizer/model/gpt-4o"; encode(text).length |
| Claude via Anthropic SDK | const { input_tokens } = await client.messages.countTokens({ model: "claude-sonnet-4", messages: [{ role: "user", content: text }] }); |
| Gemini via Google SDK | const { totalTokens } = await ai.models.countTokens({ model: "gemini-2.5-pro", contents: text }); |
The WASM build (@dqbd/tiktoken) is faster on long inputs. It ships ~1.4MB of WASM and needs 'wasm-unsafe-eval' (or 'unsafe-eval' in older browsers) in your CSP. The pure-JS port (js-tiktoken) is ~700KB and works under a stricter CSP, which makes it the better choice for static sites.
Always call .free() on @dqbd/tiktoken encoders. The WASM heap leaks otherwise.
Go
import "github.com/pkoukk/tiktoken-go"
enc, err := tiktoken.EncodingForModel("gpt-4o")
if err != nil { return err } // unknown model, or the encoding file failed to download
n := len(enc.Encode(text, nil, nil))
For Claude, use anthropic-sdk-go's Messages.CountTokens. For Gemini, use Models.CountTokens from google.golang.org/genai.
The tiktoken-go package downloads encoding files on first call. In a container, pre-warm by pinning TIKTOKEN_CACHE_DIR and running a one-off encode at build time.
Rust
use tiktoken_rs::cl100k_base; let bpe = cl100k_base()?; let n = bpe.encode_with_special_tokens(text).len();
tiktoken-rs covers the OpenAI encodings. Anthropic and Google do not publish official Rust SDKs. For exact Claude or Gemini counts, call their HTTP endpoints (see the curl examples below) with reqwest.
Shell and curl
| Task | Snippet |
|---|---|
| Claude count via HTTP | curl https://api.anthropic.com/v1/messages/count_tokens -H "x-api-key: $KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -d '{"model":"claude-sonnet-4","messages":[{"role":"user","content":"..."}]}' |
| Gemini count via HTTP | curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:countTokens?key=$KEY" -H "content-type: application/json" -d '{"contents":[{"parts":[{"text":"..."}]}]}' |
| Tokenize stdin (Python one-liner) | python -c "import sys,tiktoken; print(len(tiktoken.encoding_for_model('gpt-4o').encode(sys.stdin.read())))" < prompt.txt |
| Wordcount-style pipe | cat prompt.txt \| python -m tiktoken count gpt-4o (requires pip install tiktoken-cli) |
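OpenAI returns token usage on every non-streaming completion response. Streamed Chat Completions include it only when you set stream_options={"include_usage": True}. For tracking spend in production, prefer response.usage.prompt_tokens over recomputing.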
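A small normalizer over the raw JSON response bodies lets you log reported usage from more than one provider in one place. This is a sketch. The `Usage` record and function names are hypothetical. The field names read from each body are those of OpenAI Chat Completions and Anthropic Messages responses; other endpoints, such as OpenAI's Responses API, use different names, so check the current API references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int         # all prompt tokens, cached or not
    output_tokens: int
    cached_input_tokens: int  # prompt tokens served from cache


def usage_from_openai(body: dict) -> Usage:
    """Chat Completions JSON: cached_tokens is already included in prompt_tokens."""
    u = body["usage"]
    details = u.get("prompt_tokens_details") or {}
    return Usage(u["prompt_tokens"], u["completion_tokens"], details.get("cached_tokens", 0))


def usage_from_anthropic(body: dict) -> Usage:
    """Messages API JSON: input_tokens excludes cache reads and writes, so add them back."""
    u = body["usage"]
    read = u.get("cache_read_input_tokens") or 0
    write = u.get("cache_creation_input_tokens") or 0
    return Usage(u["input_tokens"] + read + write, u["output_tokens"], read)
```

Collapsing cache writes into `input_tokens` is fine for token accounting. For cost, keep the write and read counts separate, because they bill at different rates.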
Encoding reference
The four main encodings tiktoken provides, plus what each is for.
| Encoding | Vocabulary | Pattern style | Used by |
|---|---|---|---|
| o200k_base | 200,019 | Unicode + word-segmented | GPT-4o, GPT-4o-mini, o1, o1-mini, o3, o3-mini |
| cl100k_base | 100,277 | Unicode + word-segmented | GPT-4, GPT-4-turbo, GPT-3.5-turbo, text-embedding-3-* |
| p50k_base | 50,281 | GPT-2 style | text-davinci-003, code-davinci-002 |
| r50k_base | 50,257 | GPT-2 style | GPT-3 (davinci, curie, babbage, ada) |
Pick o200k_base for new GPT work. Use cl100k_base only as a rough local approximation for Claude, whose current tokenizer is not published. Use the two older encodings only when targeting legacy OpenAI APIs.
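The table also explains a common bug: matching by model family rather than full model name. The hypothetical helper below shows a longest-prefix lookup built from the table rows. In real code, prefer `tiktoken.encoding_for_model()`, which maintains its own mapping. The point here is that checking "gpt-4" before "gpt-4o" would route GPT-4o to the wrong encoding.

```python
# Model-name prefixes from the encoding table above. Hypothetical helper:
# real code should call tiktoken.encoding_for_model(), which keeps its own mapping.
# Claude and Gemini are absent on purpose: they do not use tiktoken encodings.
PREFIX_TO_ENCODING = {
    "gpt-4o": "o200k_base",
    "o1": "o200k_base",
    "o3": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-3": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "code-davinci-002": "p50k_base",
    "davinci": "r50k_base",
    "curie": "r50k_base",
    "babbage": "r50k_base",
    "ada": "r50k_base",
}


def encoding_for(model: str) -> str:
    """Return the encoding for `model` using the longest matching prefix."""
    # Longest prefix first, so "gpt-4o-mini" matches "gpt-4o" before "gpt-4".
    for prefix in sorted(PREFIX_TO_ENCODING, key=len, reverse=True):
        if model.startswith(prefix):
            return PREFIX_TO_ENCODING[prefix]
    raise KeyError(f"no known encoding for model {model!r}")
```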
Model reference: tokenizer, context, pricing per 1M tokens
Prices as of May 2026. Providers can change prices with no change in SDK behavior, so verify against current pricing pages before building billing logic on these numbers.
| Model | Tokenizer | Context | Input | Output |
|---|---|---|---|---|
| GPT-4o | o200k_base | 128,000 | $2.50 | $10.00 |
| GPT-4o-mini | o200k_base | 128,000 | $0.15 | $0.60 |
| GPT-4-turbo | cl100k_base | 128,000 | $10.00 | $30.00 |
| GPT-4 | cl100k_base | 8,192 | $30.00 | $60.00 |
| GPT-3.5-turbo | cl100k_base | 16,385 | $0.50 | $1.50 |
| o1 | o200k_base | 200,000 | $15.00 | $60.00 |
| o1-mini | o200k_base | 128,000 | $1.10 | $4.40 |
| o3 | o200k_base | 200,000 | $10.00 | $40.00 |
| o3-mini | o200k_base | 200,000 | $1.10 | $4.40 |
| Claude Opus 4 | Claude BPE | 200,000 | $15.00 | $75.00 |
| Claude Sonnet 4 | Claude BPE | 200,000 | $3.00 | $15.00 |
| Claude 3.7 Sonnet | Claude BPE | 200,000 | $3.00 | $15.00 |
| Claude 3.5 Sonnet | Claude BPE | 200,000 | $3.00 | $15.00 |
| Claude 3.5 Haiku | Claude BPE | 200,000 | $0.80 | $4.00 |
| Claude 3 Haiku | Claude BPE | 200,000 | $0.25 | $1.25 |
| Gemini 2.5 Pro | SentencePiece | 1,000,000 | $1.25 / $2.50 | $10.00 / $15.00 |
| Gemini 2.5 Flash | SentencePiece | 1,000,000 | $0.30 | $2.50 |
| Gemini 2.0 Flash | SentencePiece | 1,000,000 | $0.10 | $0.40 |
| Gemini 1.5 Pro | SentencePiece | 2,000,000 | $1.25 / $2.50 | $5.00 / $10.00 |
| Gemini 1.5 Flash | SentencePiece | 1,000,000 | $0.075 | $0.30 |
Gemini's split pricing reflects a prompt-size tier. The second number applies to prompts above 200,000 tokens for Gemini 2.5 Pro and above 128,000 tokens for Gemini 1.5 Pro. Anthropic's batch API halves these prices.
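A two-price listing such as "$1.25 / $2.50" needs a tier check before the usual multiplication. The sketch below assumes the tier is chosen by prompt size and then applies to every input and output token in that request, as the listing suggests. Take the threshold from the provider's pricing page. The function name is hypothetical, and the example uses the Gemini 1.5 Pro row with its 128,000-token tier.

```python
from typing import Tuple


def tiered_cost_usd(
    input_tokens: int,
    output_tokens: int,
    threshold: int,
    input_prices: Tuple[float, float],
    output_prices: Tuple[float, float],
) -> float:
    """Cost for a two-tier price listing; prices are USD per 1M tokens.

    Assumes prompts above `threshold` pay the second price for every
    input and output token in the request.
    """
    tier = 1 if input_tokens > threshold else 0
    return (input_tokens * input_prices[tier] + output_tokens * output_prices[tier]) / 1_000_000


# Gemini 1.5 Pro row: $1.25 / $2.50 input, $5.00 / $10.00 output, 128k tier.
print(tiered_cost_usd(100_000, 2_000, 128_000, (1.25, 2.50), (5.00, 10.00)))  # 0.135
print(tiered_cost_usd(200_000, 2_000, 128_000, (1.25, 2.50), (5.00, 10.00)))  # 0.52
```

Crossing the threshold raises the output rate as well as the input rate. In this example, doubling the prompt makes the request almost four times as expensive.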
Cost math, in one place
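cost_usd = (input_tokens / 1_000_000) * input_price_per_million
         + (output_tokens / 1_000_000) * output_price_per_million
Anthropic and OpenAI both offer prompt caching. Cached input tokens bill at 10 to 50 percent of the normal rate, depending on provider and model. Anthropic bills cache writes (default 5-minute TTL) at 1.25x the base input price and cache reads at 0.10x. Its cache-aware formula is:
cost_usd = (cache_create * 1.25 * input_price
          + cache_read * 0.10 * input_price
          + uncached * input_price
          + output_tokens * output_price) / 1_000_000
Here cache_create, cache_read, and uncached come from the usage fields cache_creation_input_tokens, cache_read_input_tokens, and input_tokens.
OpenAI's automatic cache bills cached input at 50 percent of the normal rate (25 percent on some models), with no explicit cache_control block. The API reports these tokens in usage.prompt_tokens_details.cached_tokens, and they are included in prompt_tokens.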
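The formulas above translate directly into code. The sketch below assumes prices in USD per 1M tokens, as in the model table, and its function names are hypothetical. The Anthropic function reads the raw usage field names. The OpenAI function assumes cached_tokens is counted inside prompt_tokens. Each function computes the full request cost, output tokens included.

```python
from typing import Mapping


def cost_usd(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Uncached cost; prices are USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def anthropic_cost_usd(usage: Mapping[str, int], input_price: float, output_price: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cache-aware cost from an Anthropic Messages `usage` object.

    Default multipliers match the formula above; override them if your
    cache TTL or model is priced differently.
    """
    cache_create = usage.get("cache_creation_input_tokens") or 0
    cache_read = usage.get("cache_read_input_tokens") or 0
    uncached = usage["input_tokens"]  # excludes cache reads and writes
    return (cache_create * write_mult * input_price
            + cache_read * read_mult * input_price
            + uncached * input_price
            + usage["output_tokens"] * output_price) / 1_000_000


def openai_cost_usd(prompt_tokens: int, cached_tokens: int, completion_tokens: int,
                    input_price: float, output_price: float,
                    cached_mult: float = 0.50) -> float:
    """OpenAI cost where cached_tokens is a subset of prompt_tokens."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_price
            + cached_tokens * cached_mult * input_price
            + completion_tokens * output_price) / 1_000_000
```

As an example, take Claude Sonnet 4 ($3 / $15) with 1,000 uncached input tokens, 50,000 cache-read tokens, and 500 output tokens. `anthropic_cost_usd` gives about $0.0255, while `cost_usd(51_000, 500, 3.00, 15.00)` gives about $0.1605 for the same request with no cache.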
Common pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| Token count for the same string differs across libraries | Library defaults to an older encoding (e.g., p50k_base) for a GPT-4o model | Pin o200k_base explicitly with tiktoken.get_encoding("o200k_base") |
| Off by a few tokens per message vs API usage | Chat messages add per-message overhead (role, optional name, separator tokens) | Add ~3 tokens per message (+1 when a name is set; older models differ) and 3 tokens for assistant reply priming, or trust usage from the API response |
| Counts differ for visually identical text with accents or other combining marks | Text arrives in different Unicode normalization forms (NFC vs NFD) | Run text = unicodedata.normalize("NFC", text) before encoding and sending |
| Counts wildly off for code-heavy prompts | Wrong encoding family for the model | Look up encoding by model name, not family ("gpt-4" → cl100k_base, "gpt-4o" → o200k_base) |
| First Tiktoken.encode call is slow in the browser | @dqbd/tiktoken WASM cold start; vocabulary loads on first call | Pre-warm on idle, or use js-tiktoken, which skips the WASM init |
| Memory leak in long-running Node process | Not calling .free() on WASM encoders | Wrap encoder use in try/finally with enc.free() in the finally |
| Claude count_tokens returns 400 | anthropic-version header missing (a 401 instead means x-api-key is missing or invalid) | Send anthropic-version: 2023-06-01 on every request |
| Gemini countTokens returns 0 | Sending text in the wrong shape | Use contents: [{ parts: [{ text: "..." }] }], not bare {text} |
| Context bar shows >100% but request still succeeds | Provider counts input only; output reserved separately | Subtract max_tokens from context window before computing input percentage |
| Tool definitions inflate token count | Function/tool specs count against input budget | Include the rendered tool spec when measuring, not just user text |
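For OpenAI chat models, the per-message overhead can be estimated locally. This sketch follows the approach in OpenAI's cookbook notebook on counting tokens with tiktoken. The default constants (3 tokens per message, 1 per name, 3 for reply priming) are assumptions that vary by model. The injectable `count` function keeps the sketch independent of any tokenizer library. Tool definitions are not counted, even though they also consume input budget.

```python
from typing import Callable, Dict, List


def estimate_chat_tokens(
    messages: List[Dict[str, str]],
    count: Callable[[str], int],
    per_message: int = 3,
    per_name: int = 1,
    reply_priming: int = 3,
) -> int:
    """Estimate prompt tokens for a chat request.

    `count` returns the token count of one string. The overhead constants
    are model-dependent assumptions, not guarantees.
    """
    total = reply_priming  # every reply is primed with an assistant header
    for message in messages:
        total += per_message
        for key, value in message.items():
            total += count(value)  # role, content, and name values all count
            if key == "name":
                total += per_name
    return total
```

With tiktoken installed, pass `count=lambda s: len(enc.encode(s))`, where `enc = tiktoken.get_encoding("o200k_base")`. Compare the result against `usage.prompt_tokens` once to calibrate the constants for your model.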
Tokens per request: rough estimates by content type
Useful when you do not have the text yet but need to size a prompt.
| Content | Approximate tokens |
|---|---|
| English prose | 1 token per ~4 characters; ~1.3 tokens per word (1 token ≈ 0.75 words) |
| Code (Python, TypeScript) | 1 token per ~3.5 characters |
| Code (Java, C#, long camelCase) | 1 token per ~3 characters |
| Compact JSON | 1 token per ~2.5 characters |
| Pretty-printed JSON | 1 token per ~3.5 characters |
| Chinese, Japanese, Korean | 1.5 to 2 tokens per character |
| Emoji | 2 to 4 tokens per emoji |
| Base64 blob | 1 token per ~3 characters |
| Markdown table | 30 to 50 percent overhead vs the raw cell text |
These are calibration medians, not guarantees. For billing logic, count the real tokens.
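For sizing before the text exists, the ratios in the table can be wrapped in a lookup. The category names and the helper below are hypothetical. The CJK entry takes the upper end of the 1.5 to 2 tokens-per-character range, so it errs toward overestimating.

```python
import math

# Characters per token, from the calibration table above (medians, not guarantees).
CHARS_PER_TOKEN = {
    "prose": 4.0,
    "code": 3.5,          # Python, TypeScript
    "code_verbose": 3.0,  # Java, C#, long camelCase
    "json_compact": 2.5,
    "json_pretty": 3.5,
    "base64": 3.0,
    "cjk": 0.5,           # 2 tokens per character, the upper end of 1.5 to 2
}


def estimate_tokens(char_count: int, kind: str = "prose") -> int:
    """Rough token estimate for `char_count` characters of one content kind."""
    # Round up so the estimate never undersizes a budget by a fraction of a token.
    return math.ceil(char_count / CHARS_PER_TOKEN[kind])


print(estimate_tokens(4_000))                # 1000
print(estimate_tokens(1_000, "json_compact"))  # 400
```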
Related concepts
- Prompt caching: OpenAI's automatic prompt cache and Anthropic's cache_control blocks both reduce input cost on repeated prefixes. Anthropic charges more than the base input rate for the first cache write; OpenAI adds no write surcharge.
- Batch API: OpenAI and Anthropic offer 50 percent off for asynchronous batch processing, with results returned within 24 hours.
- Tool use overhead: tool and function definitions count against your input budget. A typical 5-tool spec adds 500 to 1,500 tokens before any user input.
- System prompts: Claude bills system prompts as input tokens. OpenAI bills system messages the same way.
- Max tokens vs context window: for most providers the context window covers input and output combined, so max_tokens is an output reservation that reduces input headroom.
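The headroom rule reduces to one subtraction. This sketch assumes a provider whose context window covers input and output combined. Providers with separate input and output limits need their own check. The function names are hypothetical, and the example uses the GPT-4 row (8,192 tokens).

```python
def input_budget(context_window: int, max_tokens: int) -> int:
    """Prompt tokens available once max_tokens is reserved for output."""
    if not 0 <= max_tokens <= context_window:
        raise ValueError("max_tokens must be between 0 and the context window")
    return context_window - max_tokens


def fits(prompt_tokens: int, context_window: int, max_tokens: int) -> bool:
    """True if the prompt plus the output reservation fits in the window."""
    return prompt_tokens <= input_budget(context_window, max_tokens)


print(input_budget(8_192, 1_024))  # 7168
print(fits(7_500, 8_192, 1_024))   # False
```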