diff --git a/img/gemma/control-tokens.png b/img/gemma/control-tokens.png
index 65e6527..4cddb78 100644
Binary files a/img/gemma/control-tokens.png and b/img/gemma/control-tokens.png differ
diff --git a/pages/research/llm-tokenization.en.mdx b/pages/research/llm-tokenization.en.mdx
index c3767ea..df40955 100644
--- a/pages/research/llm-tokenization.en.mdx
+++ b/pages/research/llm-tokenization.en.mdx
@@ -22,7 +22,7 @@ Here is the text version of the list above:
 - Why is LLM not actually end-to-end language modeling? Tokenization.
 - What is the real root of suffering? Tokenization.

-To improve the reliability of LLMs, it's important to understand how to prompt these models which will also involve understanding their limitations. While there isn't too much emphasis in tokenizers (beyond the `max_tokens` configuration) at inference time, good prompt engineering involves understanding the constraints and limitations inherent in tokenization similar to how to structure or format your prompt. You could have a scenario where your prompt is underperforming because it's failing to, for instance, understand an acronym or concept that's not properly processed or tokenized. That's a very common problem that a lot of LLM developers and researchers overlook.
+To improve the reliability of LLMs, it's important to understand how to prompt these models which will also involve understanding their limitations. While there isn't too much emphasis on tokenizers (beyond the `max_tokens` configuration) at inference time, good prompt engineering involves understanding the constraints and limitations inherent in tokenization similar to how to structure or format your prompt. You could have a scenario where your prompt is underperforming because it's failing to, for instance, understand an acronym or concept that's not properly processed or tokenized. That's a very common problem that a lot of LLM developers and researchers overlook.

A good tool for tokenization is the [Tiktokenizer](https://tiktokenizer.vercel.app/) and this is what's actually used in the lecture for demonstration purposes.