langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-31 15:20:26 +00:00

Kane Sweet ea331f3136 Fix token text splitter duplicates (#14848)	2023-12-18 17:15:57 -08:00

- Description:
  - Add a break case to `text_splitter.py::split_text_on_tokens()` to avoid unwanted item at the end of result.
  - Add a testcase to enforce the behavior.
- Issue:
  - #14649
  - #5897
- Dependencies: n/a

---

Quick illustration of change:

```
text = "foo bar baz 123"

tokenizer = Tokenizer(
    chunk_overlap=3,
    tokens_per_chunk=7
)

output = split_text_on_tokens(text=text, tokenizer=tokenizer)
```

output before change: `["foo bar", "bar baz", "baz 123", "123"]`

output after change: `["foo bar", "bar baz", "baz 123"]`

cli	cli[patch]: unicode issue (#14672)	2023-12-13 11:14:51 -08:00
community	community: replace deprecated davinci models (#14860)	2023-12-18 13:49:46 -08:00
core	docstrings `core` update (#14871)	2023-12-18 17:13:35 -08:00
experimental	create mypy cache dir if it doesn't exist (#14579)	2023-12-12 15:34:50 -08:00
langchain	Fix token text splitter duplicates (#14848)	2023-12-18 17:15:57 -08:00
partners	[Documentation] Updates to NVIDIA Playground/Foundation Model naming.… (#14770)	2023-12-15 12:21:59 -08:00
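
The break case described in the latest commit message can be sketched as a standalone function. This is a minimal illustration, not the code in `text_splitter.py`: the `Tokenizer` dataclass here is a stand-in with assumed `encode` and `decode` fields, the character-level tokenizer is an assumption chosen because it reproduces the outputs quoted in the commit message, and the `stop_at_end` flag is hypothetical, added only so the old and new behavior can be compared side by side.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Tokenizer:
    """Stand-in tokenizer config (assumed shape, for illustration only)."""

    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[List[int]], str]
    encode: Callable[[str], List[int]]


def split_text_on_tokens(
    *, text: str, tokenizer: Tokenizer, stop_at_end: bool = True
) -> List[str]:
    """Split text into overlapping token windows.

    `stop_at_end` is a hypothetical switch: True applies the break case,
    False mimics the behavior described as "before change".
    """
    stride = tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    if stride <= 0:
        raise ValueError("tokens_per_chunk must be larger than chunk_overlap")

    input_ids = tokenizer.encode(text)
    splits: List[str] = []
    start_idx = 0
    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        splits.append(tokenizer.decode(input_ids[start_idx:cur_idx]))
        # Break case: the window already reached the last token, so any
        # further window would only repeat the tail of this one.
        if stop_at_end and cur_idx == len(input_ids):
            break
        start_idx += stride
    return splits


# Assumption: one token per character, so chunk sizes are easy to check by eye.
char_tokenizer = Tokenizer(
    chunk_overlap=3,
    tokens_per_chunk=7,
    decode=lambda ids: "".join(chr(i) for i in ids),
    encode=lambda s: [ord(c) for c in s],
)

after = split_text_on_tokens(text="foo bar baz 123", tokenizer=char_tokenizer)
before = split_text_on_tokens(
    text="foo bar baz 123", tokenizer=char_tokenizer, stop_at_end=False
)
print(after)   # ['foo bar', 'bar baz', 'baz 123']
print(before)  # ['foo bar', 'bar baz', 'baz 123', '123']
```

With a stride of 4 tokens over 15 characters, the third window already ends at the last token, so without the break the loop emits a fourth window that contains only the overlapping tail `"123"`.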