Mirror of https://github.com/hwchase17/langchain (synced 2024-11-18 09:25:54 +00:00)
docs: Embellish article on splitting by tokens with more examples and missing details (#18997)
**Description**

This PR adds some missing details to the "Split by tokens" page in the documentation. Specifically:

- The `.from_tiktoken_encoder()` class methods for both `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` default to the old `gpt-2` encoding. I've added a comment suggesting that `model_name` or `encoding` be specified explicitly.
- The docs didn't mention that the `from_tiktoken_encoder()` class method passes additional kwargs down to the constructor of the splitter. I only discovered this by reading the source code.
- Added an example of using the `.from_tiktoken_encoder()` class method with `RecursiveCharacterTextSplitter`, which is the recommended approach over `CharacterTextSplitter` for most scenarios.
- Added a warning that `TokenTextSplitter` can split characters that map to multiple tokens (e.g. 猫 has 3 cl100k_base tokens) across chunks, producing malformed Unicode strings, and should not be used in these situations.

Side note: I think the default argument of `gpt2` for `.from_tiktoken_encoder()` should be updated?

**Twitter handle** anthonypjshaw

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
This commit is contained in:
parent 7afecec280 · commit bb0dd8f82f
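As context for the encoding-default point in the description above (not part of this commit): tiktoken can report which encoding a given model actually uses, which is why relying on the old `gpt2` default can miscount tokens for newer models. A minimal check might look like this:

```python
import tiktoken

# Newer OpenAI models do not use the gpt2 encoding, so token counts differ.
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name)  # cl100k_base

# The same text can tokenize to different lengths under different encodings.
text = "LangChain splits text by tokens."
print(len(tiktoken.get_encoding("gpt2").encode(text)))
print(len(enc.encode(text)))
```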
@@ -49,6 +49,14 @@
 "from langchain_text_splitters import CharacterTextSplitter"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "a3ba1d8a",
+"metadata": {},
+"source": [
+"The `.from_tiktoken_encoder()` method takes either `encoding` as an argument (e.g. `cl100k_base`), or the `model_name` (e.g. `gpt-4`). All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:"
+]
+},
 {
 "cell_type": "code",
 "execution_count": 2,
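The markdown cell added above describes how extra keyword arguments flow through `.from_tiktoken_encoder()`. As a rough sketch of that behaviour (illustrative parameters, not taken from the notebook): `model_name` or the encoding selects the tiktoken encoding used to measure length, and the remaining kwargs are forwarded to the splitter's own constructor.

```python
from langchain_text_splitters import CharacterTextSplitter

# model_name picks the tiktoken encoding used as the length function;
# chunk_size, chunk_overlap and separator are forwarded to
# CharacterTextSplitter itself.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",   # or encoding_name="cl100k_base"
    chunk_size=100,
    chunk_overlap=0,
    separator="\n\n",     # passed through to the CharacterTextSplitter constructor
)
```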
@@ -57,7 +65,7 @@
 "outputs": [],
 "source": [
 "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
-" chunk_size=100, chunk_overlap=0\n",
+" encoding=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
 ")\n",
 "texts = text_splitter.split_text(state_of_the_union)"
 ]
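A hypothetical follow-up check (not in the notebook) that foreshadows the next change: because `CharacterTextSplitter` only splits on its separator and uses `tiktoken` to merge the pieces, an individual chunk can still exceed the requested size when measured in tokens. Assuming the `texts` produced by the cell above:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for i, chunk in enumerate(texts):
    n_tokens = len(enc.encode(chunk))
    if n_tokens > 100:
        # A single separator-delimited piece longer than 100 tokens cannot be
        # split further by CharacterTextSplitter, so it comes through oversized.
        print(f"chunk {i} has {n_tokens} tokens")
```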
@@ -91,9 +99,31 @@
 "id": "de5b6a6e",
 "metadata": {},
 "source": [
-"Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, text is only split by `CharacterTextSplitter` and `tiktoken` tokenizer is used to merge splits. It means that split can be larger than chunk size measured by `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size.\n",
+"Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, text is only split by `CharacterTextSplitter` and `tiktoken` tokenizer is used to merge splits. It means that split can be larger than chunk size measured by `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0262a991",
+"metadata": {},
+"outputs": [],
+"source": [
+"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
 "\n",
-"We can also load a tiktoken splitter directly, which ensure each split is smaller than chunk size."
+"text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
+" model_name=\"gpt-4\",\n",
+" chunk_size=100,\n",
+" chunk_overlap=0,\n",
+")"
+]
+},
+{
+"cell_type": "markdown",
+"id": "04457e3a",
+"metadata": {},
+"source": [
+"We can also load a tiktoken splitter directly, which will ensure each split is smaller than chunk size."
 ]
 },
 {
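For reference, the "load a tiktoken splitter directly" cell that follows in the existing notebook looks roughly like this (the parameters are illustrative; `TokenTextSplitter` measures and cuts purely on token boundaries, and `state_of_the_union` is the text loaded earlier in the notebook):

```python
from langchain_text_splitters import TokenTextSplitter

# Each chunk stays within chunk_size tokens of the chosen encoding.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```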
@@ -111,6 +141,14 @@
 "print(texts[0])"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "3bc155d0",
+"metadata": {},
+"source": [
+"Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the `TokenTextSplitter` directly can split the tokens for a character between two chunks causing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure chunks contain valid Unicode strings."
+]
+},
 {
 "cell_type": "markdown",
 "id": "55f95f06",
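The warning added in the last cell can be illustrated with a small, hedged example (assuming the cl100k_base encoding): a single character such as 猫 is encoded as several tokens, so cutting on a raw token boundary can land inside the character's byte sequence and leave replacement characters in the output.

```python
import tiktoken
from langchain_text_splitters import TokenTextSplitter

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("猫"))  # one character, several token ids

# Splitting one token at a time forces cuts inside the character, so the
# decoded chunks may contain U+FFFD replacement characters.
splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=1, chunk_overlap=0)
print(splitter.split_text("猫"))
```

Using `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` instead keeps whole characters together, as the new cell recommends.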