langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Author	SHA1	Message	Date
Sasmitha Manathunga	0c7a5cb206	Fix inconsistent behavior of `CharacterTextSplitter` when changing `keep_separator` (#7263 ) - Description: - When `keep_separator` is `True` the `_split_text_with_regex()` method in `text_splitter` uses regex to split, but when `keep_separator` is `False` it uses `str.split()`. This causes problems when the separator is a special regex character like `.` or `*`. This PR fixes that by using `re.split()` in both cases. - Issue: #7262 - Tag maintainer: @baskaryan	2023-07-06 09:30:03 -04:00
Lance Martin	30f7288082	MD header text splitter returns Documents (#6571 ) Return `Documents` from MD header text splitter to simplify UX. Updates the test as well as example notebooks.	2023-06-22 09:25:38 -07:00
0xJordan	c5a46e7435	feat: Add support for the Solidity language (#6054 ) <!-- Thank you for contributing to LangChain! Your PR will appear in our release under the title you set. Please make sure it highlights your valuable contribution. Replace this with a description of the change, the issue it fixes (if applicable), and relevant context. List any dependencies required for this change. After you're done, someone will review your PR. They may suggest improvements. If no one reviews your PR within a few days, feel free to @-mention the same people again, as notifications can get lost. Finally, we'd love to show appreciation for your contribution - if you'd like us to shout you out on Twitter, please also include your handle! --> ## Add Solidity programming language support for code splitter. Twitter: @0xjord4n_ <!-- If you're adding a new integration, please include: 1. a test for the integration - favor unit tests that does not rely on network access. 2. an example notebook showing its use See contribution guidelines for more information on how to write tests, lint etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #### Who can review? Tag maintainers/contributors who might be interested: @hwchase17 <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @hwchase17 VectorStores / Retrievers / Memory - @dev2049 -->	2023-06-14 14:25:02 -07:00
Lance Martin	ee3d0513ad	Add tests and update notebook for MarkdownHeaderTextSplitter (#6069 ) Add test and update notebook for `MarkdownHeaderTextSplitter`.	2023-06-13 09:07:52 -07:00
Thomas B	ac3e6e3944	Fix IndexError in RecursiveCharacterTextSplitter (#5902 ) Fixes (not reported) an error that may occur in some cases in the RecursiveCharacterTextSplitter. An empty `new_separators` array ([]) would end up in the else path of the condition below and used in a function where it is expected to be non empty. ```python if new_separators is None: ... else: # _split_text() expects this array to be non-empty! other_info = self._split_text(s, new_separators) ``` resulting in an `IndexError` ```python def _split_text(self, text: str, separators: List[str]) -> List[str]: """Split incoming text and return chunks.""" final_chunks = [] # Get appropriate separator to use > separator = separators[-1] E IndexError: list index out of range langchain/text_splitter.py:425: IndexError ``` #### Who can review? @hwchase17 @eyurtsev --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2023-06-10 16:48:53 -07:00
felpigeon	2791a753bf	Add start index to metadata in TextSplitter (#5912 ) <!-- Thank you for contributing to LangChain! Your PR will appear in our release under the title you set. Please make sure it highlights your valuable contribution. Replace this with a description of the change, the issue it fixes (if applicable), and relevant context. List any dependencies required for this change. After you're done, someone will review your PR. They may suggest improvements. If no one reviews your PR within a few days, feel free to @-mention the same people again, as notifications can get lost. Finally, we'd love to show appreciation for your contribution - if you'd like us to shout you out on Twitter, please also include your handle! --> #### Add start index to metadata in TextSplitter - Modified method `create_documents` to track start position of each chunk - The `start_index` is included in the metadata if the `add_start_index` parameter in the class constructor is set to `True` This enables referencing back to the original document, particularly useful when a specific chunk is retrieved. <!-- If you're adding a new integration, please include: 1. a test for the integration - favor unit tests that does not rely on network access. 2. an example notebook showing its use See contribution guidelines for more information on how to write tests, lint etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #### Who can review? Tag maintainers/contributors who might be interested: @eyurtsev @agola11 <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @vowelparrot VectorStores / Retrievers / Memory - @dev2049 -->	2023-06-08 23:09:32 -07:00
Yoann Poupart	65111eb2b3	Attribute support for html tags (#5782 ) # What does this PR do? Change the HTML tags so that a tag with attributes can be found. ## Before submitting - [x] Tests added - [x] CI/CD validated ### Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.	2023-06-06 09:27:37 -07:00
Ilya	d5b1608216	fix markdown text splitter horizontal lines (#5625 ) Fixes #5614 #### Issue The `**` combination produces an exception when used as a seperator in `re.split`. Instead `\\\` should be used for regex exprations. #### Who can review? @eyurtsev	2023-06-05 16:40:26 -07:00
Harrison Chase	5ce74b5958	code splitter docs (#5480 ) Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>	2023-05-31 07:11:53 -07:00
ByronHsu	9d658aaa5a	Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) (#5171 ) As the title says, I added more code splitters. The implementation is trivial, so i don't add separate tests for each splitter. Let me know if any concerns. Fixes # (issue) https://github.com/hwchase17/langchain/issues/5170 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev @hwchase17 --------- Signed-off-by: byhsu <byhsu@linkedin.com> Co-authored-by: byhsu <byhsu@linkedin.com>	2023-05-30 11:04:05 -04:00
Harrison Chase	72f99ff953	Harrison/text splitter (#5417 ) adds support for keeping separators around when using recursive text splitter	2023-05-29 16:56:31 -07:00
Eugene Yurtsev	d56313acba	Improve effeciency of TextSplitter.split_documents, iterate once (#5111 ) # Improve TextSplitter.split_documents, collect page_content and metadata in one iteration ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev In the case where documents is a generator that can only be iterated once making this change is a huge help. Otherwise a silent issue happens where metadata is empty for all documents when documents is a generator. So we expand the argument from `List[Document]` to `Union[Iterable[Document], Sequence[Document]]` --------- Co-authored-by: Steven Tartakovsky <tartakovsky.developer@gmail.com>	2023-05-22 23:00:24 -04:00
Roma	2b4e9a3efa	Add unit test for _merge_splits function (#3513 ) This commit adds a new unit test for the _merge_splits function in the text splitter. The new test verifies that the function merges text into chunks of the correct size and overlap, using a specified separator. The test passes on the current implementation of the function.	2023-04-25 10:02:59 -07:00
Harrison Chase	f95d551f7a	Harrison/shallow metadata (#1599 ) Co-authored-by: Jesse Zhang <jessetanzhang@gmail.com>	2023-03-11 09:18:25 -08:00
Harrison Chase	064741db58	Harrison/fix text splitter (#1511 ) Co-authored-by: ajaysolanky <ajsolanky@gmail.com> Co-authored-by: Ajay Solanky <ajaysolanky@saw-l14668307kd.myfiosgateway.com>	2023-03-07 15:42:28 -08:00
Harrison Chase	1511606799	Harrison/fix splitting (#563 ) fix issue where text splitting could possibly create empty docs	2023-01-08 19:19:32 -08:00
Harrison Chase	1192cc0767	smart text splitter (#530 ) smart text splitter that iteratively tries different separators until it works!	2023-01-08 15:11:10 -08:00
Harrison Chase	c104d507bf	Harrison/improve data augmented generation docs (#390 ) Co-authored-by: cameronccohen <cameron.c.cohen@gmail.com> Co-authored-by: Cameron Cohen <cameron.cohen@quantco.com>	2022-12-20 22:24:08 -05:00
Harrison Chase	e7b625fe03	fix text splitter (#375 )	2022-12-18 20:21:43 -05:00
Harrison Chase	160af4ba6b	Harrison/map reduce (#36 )	2022-10-31 20:17:22 -07:00

20 Commits