mirror of https://github.com/hwchase17/langchain
refactor: extract token text splitter function (#5179)
# Token text splitter for sentence transformers

The current TokenTextSplitter only works with OpenAI models via the `tiktoken` package, which is not clear from the name `TokenTextSplitter`. This (first) PR adds a token-based text splitter for sentence-transformer models. In the future I think we should work towards injecting a tokenizer into the TokenTextSplitter to make it more flexible. Could perhaps be reviewed by @dev2049

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
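The "inject a tokenizer" idea from the PR description could look roughly like the sketch below. Note that `TokenSplitter`, its parameters, and the whitespace tokenizer are all hypothetical illustrations, not the langchain API; a real tokenizer (e.g. from `tiktoken` or `sentence-transformers`) would be passed in the same way.

```python
# Hypothetical sketch of a splitter with an injected tokenizer.
# None of these names are langchain APIs; str.split / " ".join stand in
# for a real model tokenizer's encode/decode pair.
from typing import Callable, List


class TokenSplitter:
    def __init__(
        self,
        tokenize: Callable[[str], List[str]],
        detokenize: Callable[[List[str]], str],
        tokens_per_chunk: int,
        chunk_overlap: int = 0,
    ) -> None:
        self.tokenize = tokenize
        self.detokenize = detokenize
        self.tokens_per_chunk = tokens_per_chunk
        self.chunk_overlap = chunk_overlap

    def split_text(self, text: str) -> List[str]:
        # Slide a fixed-size token window over the text, stepping by
        # (window size - overlap) so consecutive chunks share tokens.
        tokens = self.tokenize(text)
        step = self.tokens_per_chunk - self.chunk_overlap
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start : start + self.tokens_per_chunk]
            chunks.append(self.detokenize(window))
            if start + self.tokens_per_chunk >= len(tokens):
                break
        return chunks


splitter = TokenSplitter(
    tokenize=str.split,
    detokenize=" ".join,
    tokens_per_chunk=3,
    chunk_overlap=1,
)
print(splitter.split_text("a b c d e f g"))
# → ['a b c', 'c d e', 'e f g']
```

With a design like this, supporting a new model family means supplying its tokenizer pair rather than subclassing the splitter.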
parent
26ec845921
commit
8d9e9e013c
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "73dbcdb9",
   "metadata": {},
   "source": [
    "# SentenceTransformersTokenTextSplitter\n",
    "\n",
    "This notebook demonstrates how to use the `SentenceTransformersTokenTextSplitter` text splitter.\n",
    "\n",
    "Language models have a token limit that you should not exceed, so when you split your text into chunks it is a good idea to count the number of tokens. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the language model uses. \n",
    "\n",
    "The `SentenceTransformersTokenTextSplitter` is a specialized text splitter for use with sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9dd5419e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import SentenceTransformersTokenTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b43e5d54",
   "metadata": {},
   "outputs": [],
   "source": [
    "splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)\n",
    "text = \"Lorem \""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1df84cb4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    }
   ],
   "source": [
    "count_start_and_stop_tokens = 2\n",
    "text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens\n",
    "print(text_token_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d7ad2213",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tokens in text to split: 514\n"
     ]
    }
   ],
   "source": [
    "token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1\n",
    "\n",
    "# `text_to_split` does not fit in a single chunk\n",
    "text_to_split = text * token_multiplier\n",
    "\n",
    "print(f\"tokens in text to split: {splitter.count_tokens(text=text_to_split)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "818aea04",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lorem\n"
     ]
    }
   ],
   "source": [
    "text_chunks = splitter.split_text(text=text_to_split)\n",
    "\n",
    "print(text_chunks[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9ba4f23",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
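The `// text_token_count + 1` multiplier used in the notebook can be checked with plain arithmetic. The window size below is an assumed illustrative value, not the model's actual limit:

```python
# Plain-arithmetic check of the notebook's multiplier trick.
maximum_tokens_per_chunk = 510  # assumed chunk window, for illustration only
text_token_count = 2            # tokens contributed by one "Lorem " repetition

# Integer division rounds down, so adding 1 guarantees the repeated
# text carries strictly more tokens than fit in one chunk window,
# which is what forces split_text to produce multiple chunks.
token_multiplier = maximum_tokens_per_chunk // text_token_count + 1
total_tokens = token_multiplier * text_token_count

print(token_multiplier, total_tokens, total_tokens > maximum_tokens_per_chunk)
# → 256 512 True
```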