Andrej Karpathy recently published a new [lecture](https://youtu.be/zduSFxRajkE?si=Hq_93DBE72SQt73V) on large language model (LLM) tokenization. Tokenization is a key part of training LLMs, but tokenizers are trained separately from the models themselves, using their own datasets and algorithms (e.g., [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)).
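To give a rough sense of the idea behind BPE (this is a minimal sketch, not Karpathy's exact implementation), the algorithm repeatedly finds the most frequent adjacent pair of token ids and merges it into a new token. The toy string and the number of merges below are purely illustrative:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Toy training loop: start from raw bytes, repeatedly merge the most frequent pair.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))
num_merges = 3  # illustrative; real tokenizers learn tens of thousands of merges
merges = {}
for new_id in range(256, 256 + num_merges):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(ids)     # compressed sequence of token ids
print(merges)  # learned merge rules, applied in the same order at encoding time
```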
In the lecture, Karpathy teaches how to implement a GPT tokenizer from scratch. He also discusses weird behaviors that trace back to tokenization.
To use LLMs reliably, it's important to understand how to prompt them, which also means understanding their limitations. Tokenizers don't get much attention at inference time (beyond the `max_tokens` configuration), but good prompt engineering involves understanding the constraints and limitations inherent in tokenization, just as it involves knowing how to structure and format a prompt. A prompt could underperform simply because, for instance, an acronym or concept in it isn't tokenized or processed well. This is a common problem that many LLM developers and researchers overlook.
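One practical way to check for this is to inspect how your own strings get split into tokens. Here is a small sketch using OpenAI's `tiktoken` library with the `cl100k_base` encoding; the example strings are arbitrary placeholders:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 models

for text in ["tokenization", "RLHF", "SomeObscureAcronym123"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```

A rare acronym or a string with unusual casing will often be split into several fragments, which can be a hint as to why a prompt built around it behaves unexpectedly.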
A good tool for exploring tokenization is [Tiktokenizer](https://tiktokenizer.vercel.app/), which is what Karpathy uses in the lecture for demonstration purposes.