From bb0dd8f82f79e0993690823cf438f7263ea14d82 Mon Sep 17 00:00:00 2001
From: Anthony Shaw
Date: Tue, 19 Mar 2024 15:28:17 +1100
Subject: [PATCH] docs: Embellish article on splitting by tokens with more examples and missing details (#18997)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

**Description**

This PR adds some missing details to the "Split by tokens" page in the documentation. Specifically:

- The `.from_tiktoken_encoder()` class methods for both `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` default to the old `gpt-2` encoding. I've added a comment to suggest specifying `model_name` or `encoding`.
- The docs didn't mention that the `from_tiktoken_encoder()` class method passes additional kwargs down to the constructor of the splitter. I only discovered this by reading the source code.
- Added an example of using the `.from_tiktoken_encoder()` class method with `RecursiveCharacterTextSplitter`, which is the recommended approach over `CharacterTextSplitter` for most scenarios.
- Added a warning that `TokenTextSplitter` can split characters which encode to multiple tokens (e.g. 猫 has 3 cl100k_base tokens) across chunks, which creates malformed Unicode strings, so it should not be used in these situations.

Side note: I think the default argument of `gpt2` for `.from_tiktoken_encoder()` should be updated?
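To illustrate the failure mode behind that warning, here is a minimal sketch using raw UTF-8 bytes rather than actual tiktoken tokens (an assumption for simplicity: cl100k_base is a byte-level BPE, so a token-counted chunk boundary can fall inside one character's byte sequence, which is what this simulates):

```python
# Sketch: why a chunk boundary inside a character's byte sequence
# corrupts text. We simulate a mid-character split with plain UTF-8
# bytes instead of real tiktoken tokens.

text = "猫"                # one character, three UTF-8 bytes
data = text.encode("utf-8")
print(len(data))           # 3

# Cut the byte stream after two bytes, as a token-level boundary can.
# Neither half decodes to valid text:
left = data[:2].decode("utf-8", errors="replace")
right = data[2:].decode("utf-8", errors="replace")
print(left, right)         # both contain U+FFFD replacement characters
```

Splitting via `RecursiveCharacterTextSplitter.from_tiktoken_encoder` avoids this, since it splits on character boundaries and only measures chunk length in tokens.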
**Twitter handle**
anthonypjshaw

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
---
 .../split_by_token.ipynb | 44 +++++++++++++++++--
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb b/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
index 0d975c14bc..adc8edc27a 100644
--- a/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
@@ -49,6 +49,14 @@
     "from langchain_text_splitters import CharacterTextSplitter"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a3ba1d8a",
+   "metadata": {},
+   "source": [
+    "The `.from_tiktoken_encoder()` method takes either an `encoding` argument (e.g. `cl100k_base`) or a `model_name` (e.g. `gpt-4`). All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -57,7 +65,7 @@
    "outputs": [],
    "source": [
     "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
-    "    chunk_size=100, chunk_overlap=0\n",
+    "    encoding=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
     ")\n",
     "texts = text_splitter.split_text(state_of_the_union)"
    ]
@@ -91,9 +99,31 @@
    "id": "de5b6a6e",
    "metadata": {},
    "source": [
-    "Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, text is only split by `CharacterTextSplitter` and `tiktoken` tokenizer is used to merge splits. It means that split can be larger than chunk size measured by `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size.\n",
+    "Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, the text is only split by `CharacterTextSplitter` and the `tiktoken` tokenizer is used to merge splits. This means a split can be larger than the chunk size as measured by the `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than the chunk size of tokens allowed by the language model; each split will be recursively re-split if it is larger:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0262a991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
     "\n",
-    "We can also load a tiktoken splitter directly, which ensure each split is smaller than chunk size."
+    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
+    "    model_name=\"gpt-4\",\n",
+    "    chunk_size=100,\n",
+    "    chunk_overlap=0,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "04457e3a",
+   "metadata": {},
+   "source": [
+    "We can also load a tiktoken splitter directly, which will ensure each split is smaller than the chunk size."
+   ]
+  },
    ]
   },
   {
@@ -111,6 +141,14 @@
     "print(texts[0])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "3bc155d0",
+   "metadata": {},
+   "source": [
+    "Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the `TokenTextSplitter` directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure chunks contain valid Unicode strings."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "55f95f06",