Why OpenAI’s API Is More Expensive for Non-English Languages

Beyond words: How byte pair encoding and Unicode encoding factor into pricing disparities

Towards Data Science Archive

Published August 16, 2023

How can it be that the phrase “Hello world” has two tokens in English and 12 tokens in Hindi?

After publishing my recent article on how to estimate the cost of OpenAI’s API, I received an interesting comment: someone had noticed that the OpenAI API is much more expensive in languages other than English, such as those using Chinese, Japanese, or Korean (CJK) characters.

I wasn’t aware of this issue, but quickly realized that this is an active research field: At the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. [2] showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases.”

As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.

An example of tokenization

The difference in tokenization lengths is an issue because the OpenAI API is billed in units of 1,000 tokens. Thus, if a comparable text produces up to 15 times more tokens, it will also cost up to 15 times as much.

Experiment: Number of Tokens in Different Languages

Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transcribe it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):

Number of letters and tokens (cl100k_base) for the phrase “Hello world” in English, Japanese, and Hindi
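The counts in the graph above are easy to reproduce; here is a short sketch based on the code at the end of this article:

# pip install tiktoken

import tiktoken

# cl100k_base is the tokenizer used by OpenAI's GPT models
encoding = tiktoken.get_encoding("cl100k_base")

phrases = {
    "English": "Hello world",
    "Japanese": "こんにちは世界",
    "Hindi": "हैलो वर्ल्ड",
}

for language, phrase in phrases.items():
    tokens = encoding.encode(phrase)
    # len(phrase) counts Unicode code points, not letters
    print(f"{language}: {len(phrase)} code points, {len(tokens)} tokens")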

From the above graph, we can make two interesting observations:

  1. The number of letters for this phrase is highest in English and lowest in Hindi, but the number of resulting tokens is lowest in English and highest in Hindi.
  2. In Hindi, there are more tokens than there are letters.

How can that happen?

Fundamentals

To understand why we end up with more tokens for the same phrase in languages other than English, we need to review two fundamental concepts: byte pair encoding and Unicode.

Byte Pair Encoding

The Byte Pair Encoding (BPE) algorithm was originally invented as a compression algorithm by Gage [1] in 1994.

“The [BPE] algorithm compresses data by finding the most frequently occurring pairs of adjacent bytes in the data and replacing all instances of the pair with a byte that was not in the original data. The algorithm repeats this process until no further compression is possible, either because there are no more frequently occurring pairs or there are no more unused bytes to represent pairs.” [1]

Let’s go through the example from the original paper [1]. Let’s say you have the smallest corpus of text consisting of the string “ABABCABCD”.

  1. For every pair of adjacent bytes (in this example, characters), count its occurrences in the corpus, as shown below.
"ABABCABCD"  
  
pairs = {  
  'AB' : 3,  
  'BA' : 1,  
  'BC' : 2,  
  'CA' : 1,  
  'CD' : 1,  
}
  2. Take the pair of bytes with the highest number of occurrences and replace it with an unused character. In this case, we will replace the pair “AB” with “X”.
# Replace "AB" with "X" in "ABABCABCD":  
"XXCXCD"  
  
pairs = {  
  'XX' : 1,  
  'XC' : 2,  
  'CX' : 1,  
  'CD' : 1,   
}
  3. Repeat step 2 until no further compression is possible or no more unused bytes (in this example, characters) are available.
# Replace "XC" with "Y" in "XXCXCD":  
"XYYD"  
  
pairs = {  
  'XY' : 1,  
  'YY' : 1,  
  'YD' : 1,  
}
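
For illustration, here is a minimal Python sketch of this compression loop (the function name bpe_compress and the choice of replacement symbols are mine, not from the original paper):

from collections import Counter

def bpe_compress(text: str, unused_symbols: str = "XYZ") -> str:
    # Repeatedly replace the most frequent pair of adjacent characters
    # with an unused symbol, as described in steps 1-3 above
    for symbol in unused_symbols:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        best_pair, count = pairs.most_common(1)[0]
        if count < 2:
            # No pair occurs more than once: no further compression is possible
            break
        text = text.replace(best_pair, symbol)
    return text

print(bpe_compress("ABABCABCD"))  # XYYD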

Unicode

Unicode is an encoding standard that defines how different characters are represented in unique numbers called code points. In this article, we are not going to cover all details of Unicode. Here is an excellent StackOverflow answer if you need a refresher.

What you need to know for the following explanation is that if your text is encoded in UTF-8, characters from different languages will require different numbers of bytes.

Letters of the English language can be represented with ASCII characters and require only 1 byte. But, e.g., Greek characters require 2 bytes, and Japanese characters require 3 bytes.
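
A quick way to check these UTF-8 byte lengths in Python:

# UTF-8 byte lengths for a Latin, a Greek, and a Japanese character
for character in ["a", "λ", "世"]:
    print(character, len(character.encode("utf-8")), "byte(s)")
# a -> 1 byte, λ -> 2 bytes, 世 -> 3 bytes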

Looking Under The Hood

Now that we understand that characters from different languages require different numbers of bytes to be represented numerically, and that the tokenizer used by OpenAI’s GPT models is a BPE algorithm that tokenizes on the byte level, let’s take a closer look at our opening experiment.

English

First, let’s look at the vanilla example of tokenization in English:

Tokenizing the phrase “Hello world”

From the above visualization, we can make the following observations:

  • One letter equals one code point
  • One Unicode code point equals 1 byte
  • The BPE tokenizer turns the 5 bytes of “Hello” and the 6 bytes of ” world” into two separate tokens
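
We can inspect these token boundaries directly by decoding each token back into its raw bytes; a small sketch:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

tokens = encoding.encode("Hello world")
print(len(tokens))  # 2 tokens
print([encoding.decode_single_token_bytes(token) for token in tokens])
# [b'Hello', b' world'] -> 5 bytes + 6 bytes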

This observation matches the statement on OpenAI’s tokenizer site:

“A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text.”

Notice how it says “for common English text”? Let’s look at texts that are not English.

Japanese

Now, what happens in languages where a letter corresponds not to one byte but to multiple bytes? Let’s look at the phrase “Hello world” translated into Japanese, which uses CJK characters that are 3 bytes long in UTF-8 encoding:

Tokenizing the phrase “こんにちは世界”

From the above visualization, we can make the following observations:

  • One letter equals one code point
  • One Unicode code point equals 3 bytes
  • The BPE tokenizer merges the 15 bytes of こんにちは (Japanese for “Hello”) into a single token
  • The letter 界 is also tokenized into a single token
  • The letter 世, however, is split into two tokens
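
Again, a small sketch makes this visible by decoding each token back into its raw bytes:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for token in encoding.encode("こんにちは世界"):
    token_bytes = encoding.decode_single_token_bytes(token)
    print(token, token_bytes, len(token_bytes), "bytes")
# The 15 bytes of こんにちは come out as a single token, while the
# 3 bytes of 世 are spread across two tokens, as shown in the figure above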

Hindi

It becomes even crazier in languages where one letter doesn’t equal one code point but is made up of multiple code points. Let’s look at the phrase “Hello world” transcribed into Hindi. The Devanāgarī script used for Hindi has letters that are composed of multiple code points, with each code point requiring 3 bytes:

Tokenizing the phrase “हैलो वर्ल्ड”

From the above visualization, we can make the following observations:

  • One letter can be made up of multiple Unicode code points (e.g., the letter है is made from combining the code points ह and ै)
  • One Unicode code point equals 3 bytes
  • As with the Japanese letter 世, a single code point can be divided into two tokens
  • Some tokens span more than one but less than two letters (e.g., token id 31584)
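
The claim that है is really two code points (each taking 3 bytes in UTF-8) is easy to verify:

# है looks like one letter but consists of two Unicode code points
letter = "है"
print([hex(ord(code_point)) for code_point in letter])             # ['0x939', '0x948']
print([len(code_point.encode("utf-8")) for code_point in letter])  # [3, 3]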

Summary

This article explored how the same phrase, “Hello world”, translated into Japanese and transcribed into Hindi, is tokenized. First, we learned that the tokenizer used in OpenAI’s GPT models operates on the byte level. Additionally, we saw that Japanese and Devanāgarī characters require more than one byte per character, in contrast to English. Thus, the UTF-8 encoding and the BPE tokenizer play a big role in the resulting number of tokens and, in turn, in the API costs.

Of course, other factors, such as the fact that GPT models are not trained equally on multilingual texts, also influence tokenization. At the time of writing, this issue is an active research field, and I am curious to see what solutions emerge.

References

Web & Literature

[1] Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.

[2] Petrov, A., La Malfa, E., Torr, P. H., & Bibi, A. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. arXiv preprint arXiv:2305.15425.

Code

This is the code I used to calculate the number of tokens and decode the tokens for this article.

# pip install tiktoken

import tiktoken

# Define the encoding (gpt-3.5-turbo uses cl100k_base)
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Text to tokenize, e.g., one of the phrases from this article
text = "Hello world"

# Tokenize text and get token ids
tokens = encoding.encode(text)

# Decode token ids back into their raw bytes
decoded_text = [encoding.decode_single_token_bytes(token) for token in tokens]

This blog was originally published on Towards Data Science on Aug 16, 2023 and moved to this site on Feb 1, 2026.
