mirror of https://github.com/hwchase17/langchain
docs: Embellish article on splitting by tokens with more examples and missing details (#18997)
**Description** This PR adds some missing details from the "Split by tokens" page in the documentation. Specifically: - The `.from_tiktoken_encoder()` class methods for both the `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` default to the old `gpt-2` encoding. I've added a comment to suggest specifying `model_name` or `encoding` - The docs didn't mention that the `from_tiktoken_encoder()` class method passes additional kwargs down to the constructor of the splitter. I only discovered this by reading the source code - Added an example of using the `.from_tiktoken_encoder()` class method with `RecursiveCharacterTextSplitter` which is the recommended approach for most scenarios above `CharacterTextSplitter` - Added a warning that `TokenTextSplitter` can split characters which have multiple tokens (e.g. 猫 has 3 cl100k_base tokens) between multiple chunks which creates malformed Unicode strings and should not be used in these situations. Side note: I think the default argument of `gpt2` for `.from_tiktoken_encoder()` should be updated? **Twitter handle** anthonypjshaw --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>pull/18645/head
parent
7afecec280
commit
bb0dd8f82f
Loading…
Reference in New Issue