# LLM Tokenization
Andrej Karpathy recently published a new [lecture](https://youtu.be/zduSFxRajkE?si=Hq_93DBE72SQt73V) on large language model (LLM) tokenization. Tokenization is a key part of training LLMs, but tokenizers are trained separately from the model, using their own datasets and algorithms (e.g., [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)).
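To make BPE concrete, here is a minimal sketch of the training loop it is built on: start from the raw UTF-8 bytes of some text and repeatedly merge the most frequent adjacent pair of ids into a new token. The corpus string and merge budget below are arbitrary toy choices, not values from the lecture.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Training: start from raw UTF-8 bytes (ids 0-255) and repeatedly merge
# the most frequent adjacent pair into a brand-new token id.
text = "low lower lowest"            # toy corpus, chosen arbitrarily
ids = list(text.encode("utf-8"))
merges = {}
for new_id in range(256, 256 + 5):   # 5 merges, an arbitrary budget
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(ids)     # compressed sequence of token ids
print(merges)  # learned merge rules, i.e., the tokenizer's vocabulary
```

The learned merge rules *are* the tokenizer: encoding new text later simply replays these merges in order.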
In the lecture, Karpathy teaches how to implement a GPT tokenizer from scratch. He also discusses weird behaviors that trace back to tokenization.
!["LLM Tokenization"](../../img/research/tokenization.png)
*Figure Source: https://youtu.be/zduSFxRajkE?t=6711*
Here is the text version of the list shown in the figure (a short code illustration follows the list):
- Why can't LLM spell words? Tokenization.
- Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.
- Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.
- Why is LLM bad at simple arithmetic? Tokenization.
- Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.
- Why did my LLM abruptly halt when it sees the string "\<endoftext\>"? Tokenization.
- What is this weird warning I get about a "trailing whitespace"? Tokenization.
- Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.
- Why should I prefer to use YAML over JSON with LLMs? Tokenization.
- Why is LLM not actually end-to-end language modeling? Tokenization.
- What is the real root of suffering? Tokenization.
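Several of these quirks are easy to reproduce by inspecting token ids directly. A minimal sketch, assuming OpenAI's `tiktoken` package is installed (the lecture demonstrates the same effects interactively rather than with this code):

```python
import tiktoken  # assumption: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4

# Words are split into chunks the model sees as opaque ids, not letters,
# which is why spelling and string reversal are hard:
print([enc.decode([i]) for i in enc.encode("SolidGoldMagikarp")])

# Numbers are chunked inconsistently, which hurts digit-level arithmetic:
print([enc.decode([i]) for i in enc.encode("12345 + 67890")])

# Special strings such as the end-of-text marker are rejected by default,
# one way libraries guard against prompts colliding with control tokens:
try:
    enc.encode("some text <|endoftext|> more text")
except ValueError as err:
    print(err)
```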
To improve the reliability of LLMs, it's important to understand how to prompt these models, which also means understanding their limitations. While tokenizers receive little attention at inference time (beyond the `max_tokens` configuration), good prompt engineering involves understanding the constraints and limitations inherent in tokenization, just as it involves knowing how to structure or format your prompt. For instance, a prompt could underperform because an acronym or concept in it isn't processed or tokenized the way you expect. This is a very common problem that many LLM developers and researchers overlook.
A good tool for exploring tokenization is [Tiktokenizer](https://tiktokenizer.vercel.app/), which is what Karpathy actually uses in the lecture for demonstration purposes.
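The same inspection can be done programmatically. As a minimal sketch, again assuming the `tiktoken` package and the `cl100k_base` encoding (the prompt string is a made-up example), you can count tokens against your `max_tokens` budget and check how an acronym is split:

```python
import tiktoken  # assumption: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize RAG, RLHF, and DPO in one sentence each."
ids = enc.encode(prompt)

print(len(ids), "tokens")              # budget this against max_tokens
print([enc.decode([i]) for i in ids])  # see how each acronym is split
```

If an important term is shredded into unintuitive chunks, that is a hint to rephrase it, expand the acronym, or add surrounding context.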