21 Commits

Author SHA1 Message Date
Paul Garner
69698be3e6
consistently use getLogger(__name__), no root logger (#2989)
re
https://github.com/hwchase17/langchain/issues/439#issuecomment-1510442791

I think it's not polite for a library to use the root logger

both of these forms are also used:
```
logger = logging.getLogger(__name__)
logger = logging.getLogger(__file__)
```
I am not sure if there is any reason behind one vs the other? (...I am
guessing maybe just contributed by different people)

it seems to me it'd be better to consistently use
`logging.getLogger(__name__)`

this makes it easier for consumers of the library to set up log
handlers, e.g. for everything with `langchain.` prefix
2023-04-16 12:49:35 -07:00
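A minimal sketch of what #2989 enables on the consumer side, assuming the library's modules each call `logging.getLogger(__name__)`. The logger name `langchain.text_splitter` and the `ListHandler` class are illustrative stand-ins, not code from the commit. One handler on the `langchain` logger receives records from every child logger under that prefix:

```python
import logging

# Stand-in for a module inside the library. In real code this would be
# logging.getLogger(__name__) evaluated inside the module itself.
library_logger = logging.getLogger("langchain.text_splitter")


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps records in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.name}: {record.getMessage()}")


# Consumer side: a single handler on the package-level logger covers
# every logger whose name starts with "langchain.".
handler = ListHandler()
package_logger = logging.getLogger("langchain")
package_logger.addHandler(handler)
package_logger.setLevel(logging.WARNING)

library_logger.warning("chunk longer than requested size")

# A logger outside the "langchain." hierarchy never reaches this handler.
logging.getLogger("some_other_lib").warning("not captured")
```

Had the library logged through the root logger instead, the consumer could not filter or route its output separately from everything else in the process.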
Tim Asp
51894ddd98
allow tokentextsplitters to use model name to select encoder (#2963)
Fixes a bug I was seeing when the `TokenTextSplitter` was correctly
splitting text under the gpt3.5-turbo token limit, but when firing the
prompt off to OpenAI, it'd come back with an error that we were over
the context limit.

gpt3.5-turbo and gpt-4 use the `cl100k_base` tokenizer, and so the counts
are just always off with the default `gpt-2` encoder.

It's possible to pass along the encoding to the `TokenTextSplitter`, but
it's much simpler to pass the model name of the LLM. No more concern
about keeping the tokenizer and llm model in sync :)
2023-04-16 08:33:47 -07:00
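The idea in #2963 can be sketched as a lookup from model name to encoding name. This is an illustrative stand-in rather than the code from the commit: `MODEL_TO_ENCODING` and `encoding_name_for` are hypothetical names, and the table holds only the pairings stated in the message above. The tiktoken library exposes `encoding_for_model` for the real lookup; a plain dictionary is used here so the sketch has no dependencies.

```python
# Hypothetical lookup table, limited to the pairings named in the commit message.
MODEL_TO_ENCODING = {
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
}

# The commit message describes the gpt-2 encoder as the default.
DEFAULT_ENCODING = "gpt2"


def encoding_name_for(model_name=None):
    """Return the encoding name to count tokens with for a given model.

    Falls back to the default when no model name is given or the model
    is not in the table.
    """
    if model_name is None:
        return DEFAULT_ENCODING
    return MODEL_TO_ENCODING.get(model_name, DEFAULT_ENCODING)


print(encoding_name_for("gpt-4"))
print(encoding_name_for())
```

Passing the model name keeps the splitter's token counts aligned with the tokenizer the model actually uses, which is the mismatch the commit describes.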
vinoyang
8073bc849f
Minor: Remove duplicated word in error message (#2706)
Removed the duplicated word "it" from the error message.
From:
`Please it install it with xxx`
To:
`Please install it with xxx`.
2023-04-11 13:10:33 -07:00
Harrison Chase
96ebe98dc2
Harrison/latex splitter (#1738)
Co-authored-by: Aidan Holland <thehappydinoa@gmail.com>
Co-authored-by: Jan de Boer <44832123+Janldeboer@users.noreply.github.com>
2023-03-17 08:10:27 -07:00
Harrison Chase
f95d551f7a
Harrison/shallow metadata (#1599)
Co-authored-by: Jesse Zhang <jessetanzhang@gmail.com>
2023-03-11 09:18:25 -08:00
Harrison Chase
064741db58
Harrison/fix text splitter (#1511)
Co-authored-by: ajaysolanky <ajsolanky@gmail.com>
Co-authored-by: Ajay Solanky <ajaysolanky@saw-l14668307kd.myfiosgateway.com>
2023-03-07 15:42:28 -08:00
Harrison Chase
9381005098
fix bug with length function (#1257)
2023-02-23 16:00:15 -08:00
Harrison Chase
28781a6213
Harrison/markdown splitter (#1169)
Co-authored-by: Michael Chen <flamingdescent@gmail.com>
Co-authored-by: Michael Chen <michaelchen@stripe.com>
2023-02-19 21:31:58 -08:00
Harrison Chase
ba54d36787
Harrison/tiktoken spec (#964)
Co-authored-by: James Briggs <35938317+jamescalam@users.noreply.github.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
2023-02-09 23:30:18 -08:00
Harrison Chase
53d56d7650
Harrison/unstructured support (#903)
2023-02-05 23:02:07 -08:00
kahkeng
4a8f5cdf4b
Add alternative token-based text splitter (#816)
This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.

This is helpful if we have strict token length limits, need to strictly
follow the specified chunk size, and can't use aggressive separators
like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting
them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal but is hopefully
mitigated by having a decent overlap quantity. Also this results in
chunks which have the exact number of tokens desired, instead of sometimes
overcounting if we concatenate shorter strings.

Potentially also helps with #528.
2023-02-02 19:55:13 -08:00
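A minimal sketch of the separator-free chunking described in #816, under stated assumptions: `split_by_tokens` is a hypothetical name, and the toy tokenizer below treats each character as one token purely so the example runs without a tokenizer library. The sliding-window logic is the part being illustrated, and it is not the code from the commit.

```python
def split_by_tokens(text, encode, decode, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. No separator is
    involved, so no chunk can exceed chunk_size tokens.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")

    token_ids = encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(decode(token_ids[start:start + chunk_size]))
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(token_ids):
            break
    return chunks


# Toy tokenizer (assumption for illustration): one character per token.
def toy_encode(s):
    return [ord(c) for c in s]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


chunks = split_by_tokens("abcdefghij", toy_encode, toy_decode,
                         chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With a real subword tokenizer a window boundary can fall inside a word, which is the trade-off the message mentions and the reason for the overlap.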
Smit Shah
a87a2aacaa
[Minor Fix] Fix spacy TextSplitter init (#606)
2023-01-13 06:24:44 -08:00
Harrison Chase
1511606799
Harrison/fix splitting (#563)
fix issue where text splitting could possibly create empty docs
2023-01-08 19:19:32 -08:00
Harrison Chase
1192cc0767
smart text splitter (#530)
smart text splitter that iteratively tries different separators until it
works!
2023-01-08 15:11:10 -08:00
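The "try different separators until it works" idea from #530 can be sketched roughly as follows. This is an illustrative reading of the one-line description, not the code from the commit: `split_with_fallback`, its separator order, and measuring length in characters are all assumptions.

```python
def split_with_fallback(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most max_len characters.

    Tries the coarsest separator first and falls back to finer ones only
    for pieces that are still too long. The empty separator means
    "split into single characters" and always succeeds.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(text) <= max_len:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    pieces = [p for p in pieces if p]  # drop empties so no empty chunks appear

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        elif finer:
            # This piece is too long on its own: retry with finer separators.
            chunks.extend(split_with_fallback(piece, max_len, finer))
        else:
            chunks.append(piece)
    if current:
        chunks.append(current)
    return chunks


result = split_with_fallback("aaa bbb ccc\n\ndddddddd", max_len=7)
print(result)  # ['aaa bbb', 'ccc', 'ddddddd', 'd']
```

Paragraph breaks are preferred, spaces are used only where a paragraph is too long, and the character-level fallback handles the long unbroken string.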
Harrison Chase
c104d507bf
Harrison/improve data augmented generation docs (#390)
Co-authored-by: cameronccohen <cameron.c.cohen@gmail.com>
Co-authored-by: Cameron Cohen <cameron.cohen@quantco.com>
2022-12-20 22:24:08 -05:00
Harrison Chase
e7b625fe03
fix text splitter (#375)
2022-12-18 20:21:43 -05:00
Harrison Chase
2dd895d98c
add openai tokenizer (#355)
2022-12-15 22:35:42 -08:00
Xupeng (Tony) Tong
bb4bf9d6d0
chore: minor clean up / formatting (#233)
to get familiar with the project
2022-12-01 10:50:36 -08:00
Harrison Chase
d87e73ddb1
huggingface tokenizer (#75)
2022-11-13 09:37:44 -08:00
Delip Rao
3ee6e332dd
Implements NLTK and Spacy-based TextSplitters (#103)
This PR is for Issue #88 

- [x] `make format`
- [x] `make lint`
- [x] `make tests`
2022-11-09 20:45:30 -08:00
Harrison Chase
160af4ba6b
Harrison/map reduce (#36)
2022-10-31 20:17:22 -07:00