pull/412/head
Elvis Saravia 3 months ago
parent 6de9b13e8b
commit cf63f39bef

[Image file changed in this commit: 86 KiB before, 72 KiB after; binary diff not shown]

@@ -22,7 +22,7 @@ Here is the text version of the list above:
- Why is LLM not actually end-to-end language modeling? Tokenization.
- What is the real root of suffering? Tokenization.
To improve the reliability of LLMs, it's important to understand how to prompt these models, which also involves understanding their limitations. While tokenizers don't receive much attention at inference time (beyond the `max_tokens` configuration), good prompt engineering involves understanding the constraints and limitations inherent in tokenization, just as it involves understanding how to structure or format a prompt. For instance, a prompt could underperform because the model fails to understand an acronym or concept that isn't properly processed or tokenized. That's a very common problem that many LLM developers and researchers overlook.
A good tool for inspecting tokenization is [Tiktokenizer](https://tiktokenizer.vercel.app/), which is what's actually used in the lecture for demonstration purposes.
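
Tiktokenizer is a web UI; you can run the same kind of inspection programmatically. Below is a minimal sketch using OpenAI's `tiktoken` library (the library choice and the example strings are assumptions, not something the lecture prescribes) that shows how a common word may map to a single token while a rarer acronym or made-up term gets split into sub-word pieces:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 model family.
enc = tiktoken.get_encoding("cl100k_base")

# Illustrative strings; exact splits depend on the encoding you pick.
for text in ["language", "RLHF", "promptlessness"]:
    token_ids = enc.encode(text)
    # Map each token id back to the piece of the string it covers.
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```

If a term you care about splits into many pieces, the model sees it differently from how you read it, which is one reason a prompt can underperform on domain-specific acronyms or jargon.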
