docs(text_splitter): update document of character splitter with tiktoken (#10001)
The current document does not mention that splits larger than the chunk size can occur. This change updates the related document to explain why that happens and how to solve it. Related issues: #1349, #3838, #2140.
parent: 565c021730
commit: 05664a6f20
```diff
@@ -91,7 +91,9 @@
    "id": "de5b6a6e",
    "metadata": {},
    "source": [
-    "We can also load a tiktoken splitter directly"
+    "Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, the text is only split by `CharacterTextSplitter` and the `tiktoken` tokenizer is used to merge splits. This means a split can be larger than the chunk size measured by the `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than the chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size.\n",
+    "\n",
+    "We can also load a tiktoken splitter directly, which ensures each split is smaller than the chunk size."
    ]
   },
   {
```
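The distinction the new paragraph draws is easy to demonstrate. Below is a minimal sketch, not part of the commit: the sample text, chunk sizes, and the use of `TokenTextSplitter` as the "tiktoken splitter loaded directly" are illustrative assumptions.

```python
# Minimal sketch comparing the three splitters discussed in the doc change.
# Requires `langchain` and `tiktoken` to be installed.
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

# A long run of text with no "\n\n" separator inside it (illustrative).
text = "foo " * 200

# Splits on the separator ("\n\n" by default) first, then merges pieces by
# tiktoken token count. A piece with no separator inside it is never broken
# up, so a chunk can still exceed chunk_size tokens.
char_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)

# Recursively re-splits any oversized piece on finer separators, so no
# chunk exceeds chunk_size tokens.
recursive_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)

# A tiktoken-based splitter used directly: splits on token boundaries, so
# each chunk is at most chunk_size tokens by construction.
token_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)

for splitter in (char_splitter, recursive_splitter, token_splitter):
    chunks = splitter.split_text(text)
    print(type(splitter).__name__, "->", len(chunks), "chunk(s)")
```

Under these assumptions, `CharacterTextSplitter` returns a single oversized chunk (the text contains no `"\n\n"` to split on), while the recursive and token-based splitters both respect the 100-token limit.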