How many words are 128k tokens?

By Christian Prokopp on 2024-04-12

128k tokens are roughly 96k words in English for ChatGPT 3.5 and 4, based on an estimated ratio of 0.75 words per token. For other languages, the answer is less straightforward, but we can approximate it via English. Confused? I was. Let me explain.
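The conversion can be sketched in a couple of lines. This is a minimal illustration of the 0.75 words-per-token estimate; the function name is my own, not an official API:

```python
# Approximate English word count for a given token budget,
# assuming the commonly cited ~0.75 words-per-token ratio for English.

def tokens_to_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * words_per_token)

print(tokens_to_words(128_000))  # 128k tokens -> 96000 English words
```

The ratio is an average over typical English text; individual documents can deviate noticeably.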


Tokens are not Words

Large Language Models (LLMs) train on and predict tokens, which are frequently occurring character sequences. You can try it yourself on OpenAI's website. For example, fish is one token, and marriage is two tokens. The Portuguese word for fish, peixe, is two tokens, and the Japanese word is three tokens. The reasons are complex: a combination of how the encodings were trained, with a bias towards English, and how the data are encoded, which works efficiently for English and other Latin-alphabet languages but less well for the rest. I can recommend this excellent blog post by Anthony Shaw or this research paper for a deeper dive into the topic.

One approximation for how many tokens you need in other languages is to compare information density, i.e. how many more tokens it takes to say the same thing in German or Japanese versus English, which is, for the reasons stated above, usually the most efficient. Of course, the translations themselves may introduce bias or verbosity.
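That density comparison boils down to a simple ratio. A quick sketch, with an illustrative function name and made-up token counts rather than measured values:

```python
# Token-density ratio: how many more tokens a language needs than English
# to express the same content. The counts below are illustrative placeholders.

def density_ratio(tokens_lang: int, tokens_english: int) -> float:
    """Tokens needed in a language per token needed in English for the same text."""
    return tokens_lang / tokens_english

# e.g. if a passage takes 150 tokens in English and 260 in Japanese:
ratio = density_ratio(260, 150)
print(f"{ratio:.2f}x more tokens than English")
```

In practice you would average this ratio over a corpus of parallel translations to smooth out per-document quirks.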

Translating 128k ChatGPT tokens per Language

Our baseline is 96k words for 128k tokens in English. As mentioned above, other languages get fewer words and less meaning per token for technical and semantic reasons, so a one-to-one comparison is hard. Let us invent a metric, the English Word Equivalence (EWE): how many English words' worth of meaning can 128k tokens express in a given language?

English Word Equivalence per Language

Using the above-cited resources, we can make the following English Word Equivalence approximations:

  1. Spanish 73k
  2. Portuguese 72k
  3. German 69k
  4. Italian 69k
  5. French 66k
  6. Mandarin 54k
  7. Cantonese 46k
  8. Japanese 45k
  9. Korean 41k
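The factors quoted below fall out of dividing the 96k English baseline by each language's EWE. A quick sketch using the figures above (values in thousands of words):

```python
# English Word Equivalence (EWE) per language for a 128k-token budget,
# using the approximations listed above (values in thousands of words).
EWE = {
    "English": 96, "Spanish": 73, "Portuguese": 72, "German": 69,
    "Italian": 69, "French": 66, "Mandarin": 54, "Cantonese": 46,
    "Japanese": 45, "Korean": 41,
}

# Factor: how many more tokens a language needs relative to English
# to express the same English-equivalent content.
factors = {lang: round(EWE["English"] / ewe, 2) for lang, ewe in EWE.items()}

for lang, factor in factors.items():
    print(f"{lang}: {factor}x")
```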

This means that what we can express in 128k tokens in Spanish would take roughly 73k words in English, a factor of 1.32. At the bottom of the list, 128k tokens in Korean express as much as 41k English words, a factor of 2.34. Take this with a grain of salt, and remember it is not a judgment or a measure of a language's quality. Imagine if electronic computing had been built around Korean from its early days, and Korean were the predominant language for LLM encodings. Things would look very different.

Bias and Cost

Machine Learning and LLMs are rightfully under scrutiny for bias. Tokenization highlights how underlying technologies going back many decades, combined with recent biases, e.g. English-speaking companies training models on English-centric datasets, affect not only performance and outcomes but also cost profiles for everyone.

Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with cloud computing, data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him for inquiries.