langchain

Commit Graph

Author	SHA1	Message	Date
0xJordan	c5a46e7435	feat: Add support for the Solidity language (#6054 ) <!-- Thank you for contributing to LangChain! Your PR will appear in our release under the title you set. Please make sure it highlights your valuable contribution. Replace this with a description of the change, the issue it fixes (if applicable), and relevant context. List any dependencies required for this change. After you're done, someone will review your PR. They may suggest improvements. If no one reviews your PR within a few days, feel free to @-mention the same people again, as notifications can get lost. Finally, we'd love to show appreciation for your contribution - if you'd like us to shout you out on Twitter, please also include your handle! --> ## Add Solidity programming language support for code splitter. Twitter: @0xjord4n_ <!-- If you're adding a new integration, please include: 1. a test for the integration - favor unit tests that does not rely on network access. 2. an example notebook showing its use See contribution guidelines for more information on how to write tests, lint etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #### Who can review? Tag maintainers/contributors who might be interested: @hwchase17 <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @hwchase17 VectorStores / Retrievers / Memory - @dev2049 -->	1 year ago
Lance Martin	b023f0c0f2	Text splitter for Markdown files by header (#5860 ) This creates a new kind of text splitter for markdown files. The user can supply a set of headers that they want to split the file on. We define a new text splitter class, `MarkdownHeaderTextSplitter`, that does a few things: (1) For each line, it determines the associated set of user-specified headers (2) It groups lines with common headers into splits See notebook for example usage and test cases.	1 year ago
Thomas B	ac3e6e3944	Fix IndexError in RecursiveCharacterTextSplitter (#5902 ) Fixes (not reported) an error that may occur in some cases in the RecursiveCharacterTextSplitter. An empty `new_separators` array ([]) would end up in the else path of the condition below and used in a function where it is expected to be non empty. ```python if new_separators is None: ... else: # _split_text() expects this array to be non-empty! other_info = self._split_text(s, new_separators) ``` resulting in an `IndexError` ```python def _split_text(self, text: str, separators: List[str]) -> List[str]: """Split incoming text and return chunks.""" final_chunks = [] # Get appropriate separator to use > separator = separators[-1] E IndexError: list index out of range langchain/text_splitter.py:425: IndexError ``` #### Who can review? @hwchase17 @eyurtsev --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	1 year ago
Jens Madsen	1250cd4630	fix: use model token limit not tokenizer ditto (#5939 ) This fixes a token limit bug in the SentenceTransformersTokenTextSplitter. Before the token limit was taken from tokenizer used by the model. However, for some models the token limit of the tokenizer (from `AutoTokenizer.from_pretrained`) does not equal the token limit of the model. This was a false assumption. Therefore, the token limit of the text splitter is now taken from the sentence transformers model token limit. Twitter: @plasmajens #### Before submitting #### Who can review? @hwchase17 and/or @dev2049 --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	1 year ago
felpigeon	2791a753bf	Add start index to metadata in TextSplitter (#5912 ) <!-- Thank you for contributing to LangChain! Your PR will appear in our release under the title you set. Please make sure it highlights your valuable contribution. Replace this with a description of the change, the issue it fixes (if applicable), and relevant context. List any dependencies required for this change. After you're done, someone will review your PR. They may suggest improvements. If no one reviews your PR within a few days, feel free to @-mention the same people again, as notifications can get lost. Finally, we'd love to show appreciation for your contribution - if you'd like us to shout you out on Twitter, please also include your handle! --> #### Add start index to metadata in TextSplitter - Modified method `create_documents` to track start position of each chunk - The `start_index` is included in the metadata if the `add_start_index` parameter in the class constructor is set to `True` This enables referencing back to the original document, particularly useful when a specific chunk is retrieved. <!-- If you're adding a new integration, please include: 1. a test for the integration - favor unit tests that does not rely on network access. 2. an example notebook showing its use See contribution guidelines for more information on how to write tests, lint etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #### Who can review? Tag maintainers/contributors who might be interested: @eyurtsev @agola11 <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @vowelparrot VectorStores / Retrievers / Memory - @dev2049 -->	1 year ago
ugfly1210	4d8cda1c3b	FIX: backslash escaped (#5815 ) <!-- Thank you for contributing to LangChain! Your PR will appear in our release under the title you set. Please make sure it highlights your valuable contribution. Replace this with a description of the change, the issue it fixes (if applicable), and relevant context. List any dependencies required for this change. After you're done, someone will review your PR. They may suggest improvements. If no one reviews your PR within a few days, feel free to @-mention the same people again, as notifications can get lost. Finally, we'd love to show appreciation for your contribution - if you'd like us to shout you out on Twitter, please also include your handle! --> <!-- Remove if not applicable --> LatexTextSplitter needs to use "\n\\\chapter" when separators are escaped, such as "\n\\\chapter", otherwise it will report an error: (re.error: bad escape \c at position 1 (line 2, column 1)) Fixes # (issue) #### Before submitting <!-- If you're adding a new integration, please include: 1. a test for the integration - favor unit tests that does not rely on network access. 2. an example notebook showing its use re.error: bad escape \c at position 1 (line 2, column 1) See contribution guidelines for more information on how to write tests, lint etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #### Who can review? @hwchase17 @dev2049 <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @vowelparrot VectorStores / Retrievers / Memory - @dev2049 --> Co-authored-by: Pang <ugfly@qq.com>	1 year ago
Yoann Poupart	65111eb2b3	Attribute support for html tags (#5782 ) # What does this PR do? Change the HTML tags so that a tag with attributes can be found. ## Before submitting - [x] Tests added - [x] CI/CD validated ### Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.	1 year ago
Ilya	d5b1608216	fix markdown text splitter horizontal lines (#5625 ) Fixes #5614 #### Issue The `**` combination produces an exception when used as a seperator in `re.split`. Instead `\\\` should be used for regex exprations. #### Who can review? @eyurtsev	1 year ago
Jens Madsen	8d9e9e013c	refactor: extract token text splitter function (#5179 ) # Token text splitter for sentence transformers The current TokenTextSplitter only works with OpenAi models via the `tiktoken` package. This is not clear from the name `TokenTextSplitter`. In this (first PR) a token based text splitter for sentence transformer models is added. In the future I think we should work towards injecting a tokenizer into the TokenTextSplitter to make ti more flexible. Could perhaps be reviewed by @dev2049 --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	1 year ago
Patrick Keane	47c2ec2d0b	Corrects inconsistently misspelled variable name. (#5559 ) Corrects a spelling error (of the word separator) in several variable names. Three cut/paste instances of this were corrected, amidst instances of it also being named properly, which would likely would lead to issues for someone in the future. Here is one such example: ``` seperators = self.get_separators_for_language(Language.PYTHON) super().__init__(separators=seperators, kwargs) ``` becomes ``` separators = self.get_separators_for_language(Language.PYTHON) super().__init__(separators=separators, kwargs) ``` Make test results below: ``` ============================== 708 passed, 52 skipped, 27 warnings in 11.70s ============================== ```	1 year ago
Harrison Chase	f0ea77b230	add more vars to text splitter (#5503 )	1 year ago
Harrison Chase	5ce74b5958	code splitter docs (#5480 ) Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>	1 year ago
Harrison Chase	f72bb966f8	Harrison/html splitter (#5468 ) Co-authored-by: David Revillas <26328973+r3v1@users.noreply.github.com>	1 year ago
ByronHsu	9d658aaa5a	Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) (#5171 ) As the title says, I added more code splitters. The implementation is trivial, so i don't add separate tests for each splitter. Let me know if any concerns. Fixes # (issue) https://github.com/hwchase17/langchain/issues/5170 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev @hwchase17 --------- Signed-off-by: byhsu <byhsu@linkedin.com> Co-authored-by: byhsu <byhsu@linkedin.com>	1 year ago
Harrison Chase	72f99ff953	Harrison/text splitter (#5417 ) adds support for keeping separators around when using recursive text splitter	1 year ago
Eugene Yurtsev	d56313acba	Improve effeciency of TextSplitter.split_documents, iterate once (#5111 ) # Improve TextSplitter.split_documents, collect page_content and metadata in one iteration ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev In the case where documents is a generator that can only be iterated once making this change is a huge help. Otherwise a silent issue happens where metadata is empty for all documents when documents is a generator. So we expand the argument from `List[Document]` to `Union[Iterable[Document], Sequence[Document]]` --------- Co-authored-by: Steven Tartakovsky <tartakovsky.developer@gmail.com>	1 year ago
Leonid Ganeline	c28cc0f1ac	changed ValueError to ImportError (#5103 ) # changed ValueError to ImportError Code cleaning. Fixed inconsistencies in ImportError handling. Sometimes it raises ImportError and sometime ValueError. I've changed all cases to the `raise ImportError` Also: - added installation instruction in the error message, where it missed; - fixed several installation instructions in the error message; - fixed several error handling in regards to the ImportError	1 year ago
Davis Chase	02ebb15c4a	Fix TextSplitter.from_tiktoken(#4361 ) Thanks to @danb27 for the fix! Minor update Fixes https://github.com/hwchase17/langchain/issues/4357 --------- Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>	1 year ago
Eugene Brodsky	a1001b29eb	Incorrect docstring for PythonCodeTextSplitter (#4296 ) Fixes a copy-paste error in the doctring	1 year ago
Davis Chase	46542dc774	Contextual compression retriever (#2915 ) Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	1 year ago
Davis Chase	10e4b32ecb	Add document transformer abstraction (#3182 ) Add DocumentTransformer abstraction so that in #2915 we don't have to wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses the query) in the contextual doc compression abstractions. with this change, doc filter (doc extractor, whatever we call it) would look something like ```python class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC): @abstractmethod def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]: ... def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]: if query is None: raise ValueError("Must pass in non-null query to DocumentFilter") return self.filter(documents, query) ```	1 year ago
Paul Garner	69698be3e6	consistently use getLogger(__name__), no root logger (#2989 ) re https://github.com/hwchase17/langchain/issues/439#issuecomment-1510442791 I think it's not polite for a library to use the root logger both of these forms are also used: ``` logger = logging.getLogger(__name__) logger = logging.getLogger(__file__) ``` I am not sure if there is any reason behind one vs the other? (...I am guessing maybe just contributed by different people) it seems to me it'd be better to consistently use `logging.getLogger(__name__)` this makes it easier for consumers of the library to set up log handlers, e.g. for everything with `langchain.` prefix	1 year ago
Tim Asp	51894ddd98	allow tokentextsplitters to use model name to select encoder (#2963 ) Fixes a bug I was seeing when the `TokenTextSplitter` was correctly splitting text under the gpt3.5-turbo token limit, but when firing the prompt off too openai, it'd come back with an error that we were over the context limit. gpt3.5-turbo and gpt-4 use `cl100k_base` tokenizer, and so the counts are just always off with the default `gpt-2` encoder. It's possible to pass along the encoding to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more concern about keeping the tokenizer and llm model in sync :)	1 year ago
vinoyang	8073bc849f	Minor: Remove duplicated word in error message (#2706 ) Removed the duplicated word "it" from the error message. From: `Please it install it with xxx` To: `Please install it with xxx`.	2 years ago
Harrison Chase	96ebe98dc2	Harrison/latex splitter (#1738 ) Co-authored-by: Aidan Holland <thehappydinoa@gmail.com> Co-authored-by: Jan de Boer <44832123+Janldeboer@users.noreply.github.com>	2 years ago
Harrison Chase	f95d551f7a	Harrison/shallow metadata (#1599 ) Co-authored-by: Jesse Zhang <jessetanzhang@gmail.com>	2 years ago
Harrison Chase	064741db58	Harrison/fix text splitter (#1511 ) Co-authored-by: ajaysolanky <ajsolanky@gmail.com> Co-authored-by: Ajay Solanky <ajaysolanky@saw-l14668307kd.myfiosgateway.com>	2 years ago
Harrison Chase	9381005098	fix bug with length function (#1257 )	2 years ago
Harrison Chase	28781a6213	Harrison/markdown splitter (#1169 ) Co-authored-by: Michael Chen <flamingdescent@gmail.com> Co-authored-by: Michael Chen <michaelchen@stripe.com>	2 years ago
Harrison Chase	ba54d36787	Harrison/tiktoken spec (#964 ) Co-authored-by: James Briggs <35938317+jamescalam@users.noreply.github.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>	2 years ago
Harrison Chase	53d56d7650	Harrison/unstructured support (#903 )	2 years ago
kahkeng	4a8f5cdf4b	Add alternative token-based text splitter (#816 ) This does not involve a separator, and will naively chunk input text at the appropriate boundaries in token space. This is helpful if we have strict token length limits that we need to strictly follow the specified chunk size, and we can't use aggressive separators like spaces to guarantee the absence of long strings. CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream. Splitting at arbitrary token boundaries is not ideal but is hopefully mitigated by having a decent overlap quantity. Also this results in chunks which has exact number of tokens desired, instead of sometimes overcounting if we concatenate shorter strings. Potentially also helps with #528.	2 years ago
Smit Shah	a87a2aacaa	[Minor Fix] Fix spacy TextSplitter init (#606 )	2 years ago
Harrison Chase	1511606799	Harrison/fix splitting (#563 ) fix issue where text splitting could possibly create empty docs	2 years ago
Harrison Chase	1192cc0767	smart text splitter (#530 ) smart text splitter that iteratively tries different separators until it works!	2 years ago
Harrison Chase	c104d507bf	Harrison/improve data augmented generation docs (#390 ) Co-authored-by: cameronccohen <cameron.c.cohen@gmail.com> Co-authored-by: Cameron Cohen <cameron.cohen@quantco.com>	2 years ago
Harrison Chase	e7b625fe03	fix text splitter (#375 )	2 years ago
Harrison Chase	2dd895d98c	add openai tokenizer (#355 )	2 years ago
Xupeng (Tony) Tong	bb4bf9d6d0	chore: minor clean up / formatting (#233 ) to get familiarize with the project	2 years ago
Harrison Chase	d87e73ddb1	huggingface tokenizer (#75 )	2 years ago
Delip Rao	3ee6e332dd	Implements NLTK and Spacy-based TextSplitters (#103 ) This PR is for Issue #88 - [x] `make format` - [x] `make lint` - [x] `make tests`	2 years ago
Harrison Chase	160af4ba6b	Harrison/map reduce (#36 )	2 years ago

42 Commits (57b5f42847c70347afe43d04f007f937c46b2699)