What Is an LLM Token?

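An LLM can process only a limited number of tokens at once. This limit, often called the context window, is commonly written as 4k or 8k, meaning the model can handle roughly 4,000 or 8,000 tokens at most. The limit typically covers the prompt and the generated output together.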
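
As a minimal sketch of what such a limit means in practice, the Python snippet below checks whether a request fits within a token limit. The function name and the default of 4096 tokens for a "4k" model are assumptions made for illustration, as is the idea that the prompt and the generated output share the same limit; check the documentation of the model you use.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_limit: int = 4096) -> bool:
    """Return True if the prompt plus the requested output fits the limit.

    context_limit=4096 is an assumed value for a "4k" model; the real
    limit depends on the model.
    """
    return prompt_tokens + max_output_tokens <= context_limit


# Hypothetical request: a 3,000-token prompt that asks for up to 1,000 output tokens.
print(fits_in_context(3000, 1000))                       # fits a 4k limit
print(fits_in_context(3500, 1000))                       # exceeds a 4k limit
print(fits_in_context(3500, 1000, context_limit=8192))   # fits an 8k limit
```

If the check fails, the usual options are to shorten the prompt, request fewer output tokens, or use a model with a larger limit.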

How are tokens counted?

In an LLM, "tokens" are the smallest units into which a tokenizer breaks text. In English, a common short word is often a single token, for example, "hello" is one token and "world" is another, while longer or rarer words may be split into several subword tokens. In languages with larger character sets, such as Chinese, one character may correspond to one token, and sometimes to more than one, depending on the tokenizer.

There are also some special cases to consider. Punctuation marks such as periods and commas are typically treated as separate tokens. Special character combinations like URLs or email addresses are usually not a single token; tokenizers tend to split them into several pieces.
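
To make the idea of splitting text into units concrete, here is a toy splitter in Python. It is only an illustration, not the tokenizer of any real LLM: real models use learned subword vocabularies, and the function name `toy_tokenize` and its regular expression are assumptions made for this sketch. It treats each run of letters or digits as one token, each punctuation mark as its own token, and each Chinese character in the basic CJK range as its own token.

```python
import re

# Toy pattern (an assumption for illustration, not a real LLM tokenizer):
#   1. a single character in the basic CJK range, or
#   2. a run of word characters that are not in that range, or
#   3. a single punctuation mark or symbol (not a word character, not whitespace)
_TOY_PATTERN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+|[^\w\s]")


def toy_tokenize(text: str) -> list[str]:
    """Split text into words, punctuation marks, and single CJK characters."""
    return _TOY_PATTERN.findall(text)


print(toy_tokenize("Hello, world."))  # ['Hello', ',', 'world', '.']
print(toy_tokenize("你好"))            # ['你', '好']
```

A real tokenizer will usually produce different pieces than this sketch, so use the tokenizer that ships with your model when the exact split matters.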

For longer texts, the number of tokens can be quite substantial. The token count grows with the length of the text, but it is not the same as the character count. A commonly cited rule of thumb for English is about four characters, or roughly three quarters of a word, per token. The ratio varies by tokenizer and by language, and text in languages with larger character sets often needs more tokens per character, so use the model's own tokenizer when an exact count matters.
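
When no tokenizer is at hand, a rough estimate can be derived from the text length. The Python sketch below assumes a ratio of about four characters per token, a commonly quoted rule of thumb for English text. The function name `estimate_tokens` and the `chars_per_token` parameter are hypothetical, and the result is only an approximation that can be far off for other languages or for unusual text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count from the character count.

    chars_per_token=4.0 is an assumed average for English; adjust it
    for other languages or measure it with a real tokenizer.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


sample = "hello world"          # 11 characters
print(estimate_tokens(sample))  # 3, since ceil(11 / 4) = 3
```

An estimate like this is good enough for quick budgeting, but an exact count requires running the model's own tokenizer.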
