langchain/libs
Kane Sweet ea331f3136
Fix token text splitter duplicates (#14848)
- **Description:** 
- Add a break case to `text_splitter.py::split_text_on_tokens()` to
avoid unwanted item at the end of result.
    - Add a testcase to enforce the behavior.
  - **Issue:** 
    - #14649 
    - #5897
  - **Dependencies:** n/a,
 
---

**Quick illustration of change:**

```
text = "foo bar baz 123"

tokenizer = Tokenizer(
        chunk_overlap=3,
        tokens_per_chunk=7
)

output = split_text_on_tokens(text=text, tokenizer=tokenizer)
```
output before change: `["foo bar", "bar baz", "baz 123", "123"]`
output after change: `["foo bar", "bar baz", "baz 123"]`
2023-12-18 17:15:57 -08:00
..
cli cli[patch]: unicode issue (#14672) 2023-12-13 11:14:51 -08:00
community community: replace deprecated davinci models (#14860) 2023-12-18 13:49:46 -08:00
core docstrings core update (#14871) 2023-12-18 17:13:35 -08:00
experimental create mypy cache dir if it doesn't exist (#14579) 2023-12-12 15:34:50 -08:00
langchain Fix token text splitter duplicates (#14848) 2023-12-18 17:15:57 -08:00
partners [Documentation] Updates to NVIDIA Playground/Foundation Model naming.… (#14770) 2023-12-15 12:21:59 -08:00