Re: https://github.com/hwchase17/langchain/issues/439#issuecomment-1510442791
I think it's not polite for a library to use the root logger. Both of these forms are also used:
```python
logger = logging.getLogger(__name__)
logger = logging.getLogger(__file__)
```
I'm not sure whether there's any reason behind one vs. the other (I'm guessing they were just contributed by different people).

It seems to me it would be better to consistently use `logging.getLogger(__name__)`. Unlike `__file__`, which names the logger after a file path, `__name__` produces dotted module names, so the loggers form a proper hierarchy. That makes it easier for consumers of the library to set up log handlers, e.g. for everything under the `langchain.` prefix.
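For example, an application could then do something like this (a minimal sketch; the handler and format string are just illustrative):

```python
import logging

# Attach a handler to the top-level "langchain" logger. With
# getLogger(__name__), every module logger (e.g. "langchain.llms.openai")
# propagates its records up to this one.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

lc_logger = logging.getLogger("langchain")
lc_logger.addHandler(handler)
lc_logger.setLevel(logging.DEBUG)
```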
Fixes a bug I was seeing where the `TokenTextSplitter` was ostensibly splitting text under the gpt-3.5-turbo token limit, but when the prompt was fired off to OpenAI, it came back with an error that we were over the context limit.
gpt-3.5-turbo and gpt-4 use the `cl100k_base` tokenizer, so the counts are just always off with the default `gpt2` encoding.
It's possible to pass the encoding along to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more worrying about keeping the tokenizer and the LLM model in sync :)
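A minimal sketch of the mismatch, using tiktoken directly (the text and counts are illustrative; `encoding_for_model` resolves gpt-3.5-turbo to `cl100k_base`):

```python
import tiktoken

text = "some long document ..."

# Default gpt2 encoding vs. the encoding gpt-3.5-turbo actually uses.
gpt2 = tiktoken.get_encoding("gpt2")
cl100k = tiktoken.encoding_for_model("gpt-3.5-turbo")  # -> cl100k_base

# The two counts generally differ, so chunks sized with the gpt2
# encoding can overflow the model's real context window.
print(len(gpt2.encode(text)), len(cl100k.encode(text)))
```

With this change the splitter can be constructed from the model name instead, e.g. `TokenTextSplitter(model_name="gpt-3.5-turbo", ...)` (assuming the new parameter is called `model_name`).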
This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.
This is helpful when we have strict token-length limits and need to follow the specified chunk size exactly, and we can't rely on aggressive separators like spaces to guarantee the absence of long unsplittable strings. `CharacterTextSplitter` will let these strings through without splitting them, which could cause overflow errors downstream.
Splitting at arbitrary token boundaries is not ideal, but that is hopefully mitigated by a decent overlap quantity. It also yields chunks with exactly the desired number of tokens, instead of sometimes overcounting when shorter strings are concatenated.
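Roughly the idea, as a standalone sketch using tiktoken (not the actual implementation; the function name and defaults here are hypothetical):

```python
import tiktoken


def split_on_tokens(text: str, chunk_size: int, chunk_overlap: int,
                    model_name: str = "gpt-3.5-turbo") -> list[str]:
    """Naively chunk text at token boundaries with a fixed overlap."""
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)

    # Advance by chunk_size - chunk_overlap so consecutive chunks share
    # chunk_overlap tokens; guard against a non-positive step.
    step = max(chunk_size - chunk_overlap, 1)

    chunks = []
    for start in range(0, len(tokens), step):
        # Every chunk is exactly chunk_size tokens (except possibly the
        # last), so there is no overcounting from concatenating shorter
        # strings.
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
    return chunks
```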
Potentially also helps with #528.