diff --git a/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb b/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
index 0d975c14bc..adc8edc27a 100644
--- a/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
@@ -49,6 +49,14 @@
     "from langchain_text_splitters import CharacterTextSplitter"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a3ba1d8a",
+   "metadata": {},
+   "source": [
+    "The `.from_tiktoken_encoder()` method takes either `encoding` (e.g. `cl100k_base`) or `model_name` (e.g. `gpt-4`) as an argument. All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -57,7 +65,7 @@
    "outputs": [],
    "source": [
     "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
-    "    chunk_size=100, chunk_overlap=0\n",
+    "    encoding=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
     ")\n",
     "texts = text_splitter.split_text(state_of_the_union)"
    ]
@@ -91,9 +99,31 @@
    "id": "de5b6a6e",
    "metadata": {},
    "source": [
-    "Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, text is only split by `CharacterTextSplitter` and `tiktoken` tokenizer is used to merge splits. It means that split can be larger than chunk size measured by `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size.\n",
+    "Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, the text is only split by `CharacterTextSplitter` and the `tiktoken` tokenizer is used to merge the splits. This means a split can be larger than the chunk size as measured by the `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than the chunk size of tokens allowed by the language model, where each split that is still too large will be recursively split:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0262a991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
     "\n",
-    "We can also load a tiktoken splitter directly, which ensure each split is smaller than chunk size."
+    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
+    "    model_name=\"gpt-4\",\n",
+    "    chunk_size=100,\n",
+    "    chunk_overlap=0,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "04457e3a",
+   "metadata": {},
+   "source": [
+    "We can also load a tiktoken splitter directly, which will ensure each split is smaller than the chunk size."
    ]
   },
   {
@@ -111,6 +141,14 @@
     "print(texts[0])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "3bc155d0",
+   "metadata": {},
+   "source": [
+    "Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using the `TokenTextSplitter` directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure chunks contain valid Unicode strings."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "55f95f06",
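The malformed-Unicode pitfall that the new markdown cell warns about can be illustrated without tiktoken: splitting any multi-byte encoding at fixed offsets that ignore symbol boundaries produces fragments that are not valid on their own. A minimal sketch using raw UTF-8 bytes in place of tokens (the sample text and chunk size are arbitrary choices for illustration, not from the notebook):

```python
# Analogy for the multi-token-character pitfall: split a UTF-8 byte
# stream at a fixed byte count, ignoring character boundaries.
text = "日本語のテキスト"  # each of these characters is 3 bytes in UTF-8

data = text.encode("utf-8")

# Naive fixed-size splitting, like a token splitter that cuts mid-character:
chunks = [data[i : i + 4] for i in range(0, len(data), 4)]

# The first chunk ends in the middle of a character, so it is not
# valid UTF-8 on its own and cannot be decoded:
try:
    chunks[0].decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
print(broken)  # True
```

The same failure mode occurs when a token-level splitter cuts between the tokens of one character; the `.from_tiktoken_encoder()` variants avoid it by splitting on text boundaries and only counting with the tokenizer.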