15 Commits (main)

Author SHA1 Message Date
Harrison Chase 9381005098
fix bug with length function (#1257) 1 year ago
Harrison Chase 28781a6213
Harrison/markdown splitter (#1169)
Co-authored-by: Michael Chen <flamingdescent@gmail.com>
Co-authored-by: Michael Chen <michaelchen@stripe.com>
1 year ago
Harrison Chase ba54d36787
Harrison/tiktoken spec (#964)
Co-authored-by: James Briggs <35938317+jamescalam@users.noreply.github.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
1 year ago
Harrison Chase 53d56d7650
Harrison/unstructured support (#903) 1 year ago
kahkeng 4a8f5cdf4b
Add alternative token-based text splitter (#816)
This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.

This is helpful when we have strict token length limits, must strictly
follow the specified chunk size, and can't use aggressive
separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting
them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal, but is hopefully
mitigated by a decent amount of overlap. This also results in
chunks that have exactly the desired number of tokens, instead of sometimes
overcounting when we concatenate shorter strings.

Potentially also helps with #528.
1 year ago
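
The token-space chunking described in #816 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `split_tokens` and its parameters are hypothetical, and the input is assumed to be a list of token ids already produced by some tokenizer.

```python
from typing import List, Sequence


def split_tokens(
    tokens: Sequence[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Cut a token sequence into windows of at most chunk_size tokens.

    Hypothetical helper: consecutive windows share chunk_overlap tokens,
    and no separator is involved.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

Every chunk except possibly the last holds exactly `chunk_size` tokens, so a long string with no separators cannot overflow the limit. In practice the token ids would be decoded back to text by the same tokenizer that produced them.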
Smit Shah a87a2aacaa
[Minor Fix] Fix spacy TextSplitter init (#606) 1 year ago
Harrison Chase 1511606799
Harrison/fix splitting (#563)
fix issue where text splitting could possibly create empty docs
1 year ago
Harrison Chase 1192cc0767
smart text splitter (#530)
smart text splitter that iteratively tries different separators until it
works!
1 year ago
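
One way to read "iteratively tries different separators until it works" in #530 is sketched below. This is an assumption about the general idea, not the code from the commit: the name `split_with_fallback`, the default separator list, and the character-level fallback are all hypothetical, and the sketch does not merge small pieces back together.

```python
from typing import List, Sequence


def split_with_fallback(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper: tries coarse separators first and falls back to
    finer ones only for pieces that are still too long.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []  # drop empty pieces

    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: cut at fixed character offsets.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            pieces: List[str] = []
            for part in text.split(sep):
                # Recurse with only the finer separators that remain.
                pieces.extend(split_with_fallback(part, chunk_size, separators[i + 1:]))
            return pieces

    # No separator applies and no "" fallback was given: return as-is.
    return [text]
```

For example, `split_with_fallback("aaa bbb\n\nccccccc", 4)` first splits on the blank line, then on the space, and finally cuts the remaining long run by character position.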
Harrison Chase c104d507bf
Harrison/improve data augmented generation docs (#390)
Co-authored-by: cameronccohen <cameron.c.cohen@gmail.com>
Co-authored-by: Cameron Cohen <cameron.cohen@quantco.com>
1 year ago
Harrison Chase e7b625fe03
fix text splitter (#375) 1 year ago
Harrison Chase 2dd895d98c
add openai tokenizer (#355) 1 year ago
Xupeng (Tony) Tong bb4bf9d6d0
chore: minor clean up / formatting (#233)
to get familiar with the project
1 year ago
Harrison Chase d87e73ddb1
huggingface tokenizer (#75) 2 years ago
Delip Rao 3ee6e332dd
Implements NLTK and Spacy-based TextSplitters (#103)
This PR is for Issue #88 

- [x] `make format`
- [x] `make lint`
- [x] `make tests`
2 years ago
Harrison Chase 160af4ba6b
Harrison/map reduce (#36) 2 years ago