langchain

Author	SHA1	Message	Date
Mike McGarry	ddd595fe81	feature/4493 Improve Evernote Document Loader (#4577 ) # Improve Evernote Document Loader When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document. This change ensures that each note is loaded as an independent document and all available metadata on the note e.g. author, title, created, updated are added as metadata on each document. It also uses an existing optional dependency of `html2text` instead of `pypandoc` to remove the need to download the pandoc application via `download_pandoc()` to be able to use the `pypandoc` python bindings. Fixes #4493 Co-authored-by: Mike McGarry <mike.mcgarry@finbourne.com> Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>	2023-05-19 14:28:17 -07:00
Daniel Chalef	c8c2276ccb	Zep Retriever - Vector Search Over Chat History (#4533 ) # Zep Retriever - Vector Search Over Chat History with the Zep Long-term Memory Service More on Zep: https://github.com/getzep/zep Note: This PR is related to and relies on https://github.com/hwchase17/langchain/pull/4834. I did not want to modify the `pyproject.toml` file to add the `zep-python` dependency a second time. Co-authored-by: Daniel Chalef <daniel.chalef@private.org>	2023-05-18 16:27:18 -07:00
Leonid Ganeline	a9bb3147d7	docs: vectorstores, different updates and fixes (#4939 ) # docs: vectorstores, different updates and fixes Multiple updates: - added/improved descriptions - fixed header levels - added headers - fixed headers	2023-05-18 15:35:47 -07:00
Leonid Ganeline	c75c0775e1	docs supabase update (#4935 ) # docs: updated `Supabase` notebook - the title of the notebook was inconsistent (included redundant "Vectorstore"). Removed this "Vectorstore" - added `Postgres` to the title. It is important. The `Postgres` name is much more popular than `Supabase`. - added description for the `Postgres` - added more info to the `Supabase` description	2023-05-18 10:42:08 -07:00
Yuekai Zhang	1ed4228822	Fix bilibili (#4860 ) # Fix bilibili api import error bilibili-api package is deprecated and there is no sync module. Fixes #2673 #2724 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @vowelparrot @liaokongVFX	2023-05-18 09:56:51 -04:00
Eugene Yurtsev	e46202829f	feat #4479 : TextLoader auto detect encoding and improved exceptions (#4927 ) # TextLoader auto detect encoding and enhanced exception handling - Add an option to enable encoding detection on `TextLoader`. - The detection is done using `chardet` - The loading is done by trying all detected encodings by order of confidence or raise an exception otherwise. ### New Dependencies: - `chardet` Fixes #4479 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: - @eyurtsev --------- Co-authored-by: blob42 <spike@w530>	2023-05-18 09:55:14 -04:00
Eugene Yurtsev	c06a47a691	Load specific file types from Google Drive (issue #4878 ) (#4926 ) # Load specific file types from Google Drive (issue #4878) Add the possibility to define what file types you want to load from Google Drive. ``` loader = GoogleDriveLoader( folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5", file_types=["document", "pdf"], recursive=False ) ``` Fixes #4878 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: DataLoaders - @eyurtsev Twitter: [@UmerHAdil](https://twitter.com/@UmerHAdil) \| Discord: RicChilligerDude#7589 --------- Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>	2023-05-18 09:27:53 -04:00
Leonid Ganeline	c998569c8f	docs: text splitters improvements (#4490 ) #docs: text splitters improvements Changes are only in the Jupyter notebooks. - added links to the source packages and a short description of these packages - removed " Text Splitters" suffixes from the TOC elements (they made the list of the text splitters messy) - moved text splitters, based on the length function into a separate list. They can be mixed with any classes from the "Text Splitters", so it is a different classification. ## Who can review? @hwchase17 - project lead @eyurtsev @vowelparrot NOTE: please, check out the results of the `Python code` text splitter example (text_splitters/examples/python.ipynb). It looks suboptimal.	2023-05-17 21:33:34 -07:00
Davis Chase	df0c33a005	Faiss no avx2 (#4895 ) Co-authored-by: Ali Mirlou <alimirlou@gmail.com>	2023-05-17 19:18:57 -07:00
Leonid Ganeline	b96ab4b763	docs `retriever` improvements (#4430 ) # Docs: improvements in the `retrievers/examples/` notebooks Its primary purpose is to make the Jupyter notebook examples consistent and more suitable for first-time viewers. - add links to the integration source (if applicable) with a short description of this source; - removed `_retriever` suffix from the file names (where it existed) for consistency; - removed ` retriever` from the notebook title (where it existed) for consistency; - added code to install necessary Python package(s); - added code to set up the necessary API Key. - very small fixes in notebooks from other folders (for consistency): - docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb - docs/modules/indexes/vectorstores/examples/pinecone.ipynb - docs/modules/models/llms/integrations/cohere.ipynb - fixed misspelling in langchain/retrievers/time_weighted_retriever.py comment (sorry, about this change in a .py file ) ## Who can review @dev2049	2023-05-17 15:29:22 -07:00
Justin Levi Winter	0147f845f1	Update getting_started.ipynb (#4850 ) minor grammar issue	2023-05-17 13:19:14 -07:00
UmerHA	e257380deb	Typos (#4851 ) # Fixed typos (issues #4818 & #4668 & more typos) - At some places, it said `model = ChatOpenAI(model='gpt-3.5-turbo')` but should be `model = ChatOpenAI(model_name='gpt-3.5-turbo')` - Fixes some other typos Fixes #4818, #4668 ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @vowelparrot	2023-05-17 11:52:22 -04:00
Harrison Chase	720ac49f42	2markdown loader (#4796 ) Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2023-05-16 23:42:53 -07:00
Brendan Mannix	4e56d3119c	update qdrant docs to reflect the proper way to initialize Qdrant() constructor (#4596 ) # update qdrant docs to reflect the proper way to initialize Qdrant() constructor The [Qdrant docs](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/qdrant.html) still contain an old reference for passing an `embedding_function` into the constructor. This is no longer supported. This PR updates the docs to reflect the proper way to initialize `Qdrant()` Old: ![Screenshot 2023-05-12 at 3 06 33 PM](https://github.com/hwchase17/langchain/assets/1552962/dd4063d2-2a07-4340-91bb-e305f7215ddd) New: ![Screenshot 2023-05-12 at 3 21 09 PM](https://github.com/hwchase17/langchain/assets/1552962/aebc3f63-1a8b-4ca3-93c0-a2ce30dcd282)	2023-05-16 17:30:38 -07:00
Raduan Al-Shedivat	00c6ec8a2d	fix(document_loaders/telegram): fix pandas calls + add tests (#4806 ) # Fix Telegram API loader + add tests. I was testing this integration and it was broken with the following error: ```python message_threads = loader._get_message_threads(df) KeyError: False ``` Also, this particular loader didn't have any tests / related group in poetry, so I added those as well. @hwchase17 / @eyurtsev please take a look at this fix PR. --------- Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>	2023-05-16 14:35:25 -07:00
了空	f7e3d97b19	Remove unnecessary spaces from document object’s page_content of BiliBiliLoader (#4619 ) - Remove unnecessary spaces from document object’s page_content of BiliBiliLoader - Fix BiliBiliLoader document and test file	2023-05-16 13:13:57 -04:00
Eugene Yurtsev	f47ec5b4b6	Docugami docs: First cell should be a title cell (#4735 ) # Make first cell a title in docugami docs This makes the first cell a title cell in docugami notebook	2023-05-16 13:12:14 -04:00
shiyu22	21b9397342	Update the milvus example (#4706 ) # Fix issue when running example - add the query content - update the `user` parameter with Zilliz Signed-off-by: shiyu22 <shiyu.chen@zilliz.com>	2023-05-15 16:16:57 -07:00
Harrison Chase	dd95f0892d	Harrison/add top k (#4707 ) Co-authored-by: blc16 <benlc@umich.edu>	2023-05-15 09:09:22 -07:00
Eugene Yurtsev	3c490b5ba3	Docugami DataLoader (#4727 ) ### Adds a document loader for Docugami Specifically: 1. Adds a data loader that talks to the [Docugami](http://docugami.com) API to download processed documents as semantic XML 2. Parses the semantic XML into chunks, with additional metadata capturing chunk semantics 3. Adds a detailed notebook showing how you can use additional metadata returned by Docugami for techniques like the [self-querying retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html) 4. Adds an integration test, and related documentation Here is an example of a result that is not possible without the capabilities added by Docugami (from the notebook): <img width="1585" alt="image" src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b"> --------- Co-authored-by: Taqi Jaffri <tjaffri@docugami.com> Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>	2023-05-15 10:53:00 -04:00
Lester Yang	cd3f9865f3	Feature: pdfplumber PDF loader with BaseBlobParser (#4552 ) # Feature: pdfplumber PDF loader with BaseBlobParser * Adds pdfplumber as a PDF loader * Adds pdfplumber as a blob parser.	2023-05-15 09:47:02 -04:00
Harrison Chase	b6e3ac17c4	Harrison/sitemap local (#4704 ) Co-authored-by: Lukas Bauer <lukas.bauer@mayflower.de>	2023-05-14 22:04:38 -07:00
Harrison Chase	12b4ee1fc7	Harrison/telegram chat loader (#4698 ) Co-authored-by: Akinwande Komolafe <47945512+Sensei-akin@users.noreply.github.com> Co-authored-by: Akinwande Komolafe <akhinoz@gmail.com>	2023-05-14 22:04:27 -07:00
Harrison Chase	6f47ab17a4	Harrison/param notion db (#4689 ) Co-authored-by: Edward Park <ed.sh.park@gmail.com>	2023-05-14 18:26:25 -07:00
Harrison Chase	243886be93	Harrison/virtual time (#4658 ) Co-authored-by: ifsheldon <39153080+ifsheldon@users.noreply.github.com> Co-authored-by: maple.liang <maple.liang@gempoll.com>	2023-05-14 10:29:17 -07:00
Harrison Chase	44ae673388	Harrison/multithreading directory loader (#4650 ) Co-authored-by: PawelFaron <42373772+PawelFaron@users.noreply.github.com> Co-authored-by: Pawel Faron <ext-pawel.faron@vaisala.com>	2023-05-13 21:46:02 -07:00
Leonid Ganeline	3ce78ef6c4	docs: document_loaders classification (#4069 ) Problem statement: the [document_loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html#) section is too long and hard to comprehend. Proposal: group document_loaders by 3 classes: (see `Files changed` tab) UPDATE: I've completely reworked the document_loader classification. Now this PR changes only one file! FYI @eyurtsev @hwchase17	2023-05-13 19:17:32 -07:00
Tim Asp	ed0d557ede	docs: fix pdf docs hierarchy and formatting (#4593 ) # Fix pdf loader docs page ![image](https://github.com/hwchase17/langchain/assets/707699/4a11f379-00ed-4f7a-9870-71f74e0cadc6) Using h1's messes with hierarchy, this fixes that, and moves the PyPDFium2 loader out of the middle of PDFMiner docs	2023-05-12 15:03:01 -04:00
Davis Chase	a4a9d1f403	Improve vespa interface (#4546 ) ![Screenshot 2023-05-11 at 7 50 31 PM](https://github.com/hwchase17/langchain/assets/130488702/bc8ab4bb-8006-44fc-ba07-df54e84ee2c1)	2023-05-12 10:11:26 -07:00
Neil Ruaro	3a2855945b	added documentation on retrieving a PG vectorstore (#4578 ) This PR adds in documentation on querying an existing vectorstore in PG Fixes 3191 (issue)	2023-05-12 13:04:06 -04:00
Leonid Ganeline	e17d0319d5	Add `arxiv` retriever (#4538 )	2023-05-11 22:48:38 -07:00
Harrison Chase	3ce29cb4a6	Harrison/new search (#4359 ) Co-authored-by: Jiaping(JP) Zhang <vincentzhangv@gmail.com>	2023-05-10 17:09:16 -07:00
Davis Chase	9ec60ad832	Add azure cognitive search retriever (#4467 ) All credit to @UmerHA, made a couple small changes --------- Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>	2023-05-10 15:27:27 -07:00
Davis Chase	46b100ea63	Add DocArray vector stores (#4483 ) Thanks to @anna-charlotte and @jupyterjazz for the contribution! Made few small changes to get it across the finish line --------- Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> Signed-off-by: jupyterjazz <saba.sturua@jina.ai> Co-authored-by: anna-charlotte <charlotte.gerhaher@jina.ai> Co-authored-by: jupyterjazz <saba.sturua@jina.ai> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>	2023-05-10 15:22:16 -07:00
Matt Robinson	3637d6da6e	feat: add loader for open office odt files (#4405 ) # ODF File Loader Adds a data loader for handling Open Office ODT files. Requires `unstructured>=0.6.3`. ### Testing The following should work using the `fake.odt` example doc from the [`unstructured` repo](https://github.com/Unstructured-IO/unstructured). ```python from langchain.document_loaders import UnstructuredODTLoader loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements") loader.load() loader = UnstructuredODTLoader(file_path="fake.odt", mode="single") loader.load() ```	2023-05-10 01:37:17 -07:00
Leonid Ganeline	ce15ffae6a	added `Wikipedia` retriever (#4302 ) - added `Wikipedia` retriever. It is effectively a wrapper for `WikipediaAPIWrapper`. It wraps load() into get_relevant_documents() - sorted `__all__` in the `retrievers/__init__` - added integration tests for the WikipediaRetriever - added an example (as Jupyter notebook) for the WikipediaRetriever	2023-05-09 10:08:39 -07:00
BioErrorLog	04f765b838	Fix grammar in Text Splitters docs (#4373 ) # Fix grammar in Text Splitters docs Just a small fix of grammar in the documentation: "That means there two different axes" -> "That means there are two different axes"	2023-05-08 22:38:40 -04:00
Leonid Ganeline	9544b30821	added `Wikipedia` document loader (#4141 ) - Added the `Wikipedia` document loader. It is based on the existing `utilities/WikipediaAPIWrapper` - Added the respective unit tests and an example notebook - Sorted list of classes in __init__	2023-05-06 09:32:45 -07:00
Davis Chase	5ca13cc1f0	Dev2049/pypdfium2 (#4209 ) thanks @jerrytigerxu for the addition! --------- Co-authored-by: Jere Xu <jtxu2008@gmail.com> Co-authored-by: jerrytigerxu <jere.tiger.xu@gmailc.om>	2023-05-05 17:55:31 -07:00
Leonid Ganeline	59204a5033	docs: `document_loaders` improvements (#4200 ) - made notebooks consistent: titles, service/format descriptions. - corrected short names to full names, for example, `Word` -> `Microsoft Word` - added missed descriptions - renamed notebook files to make ToC correctly sorted	2023-05-05 17:44:54 -07:00
Aivin V. Solatorio	6567b73e1a	JSON loader (#4067 ) This implements a loader of text passages in JSON format. The `jq` syntax is used to define a schema for accessing the relevant contents from the JSON file. This requires dependency on the `jq` package: https://pypi.org/project/jq/. --------- Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>	2023-05-05 14:48:13 -07:00
Harrison Chase	26534457f5	simplify csv args (#4182 )	2023-05-05 09:22:08 -07:00
Davis Chase	d84bb02881	Add Chroma self query (#4149 ) Add internal query language -> chroma metadata filter translator	2023-05-05 08:43:08 -07:00
Harrison Chase	a9c2450330	Harrison/toml loader (#4090 ) Co-authored-by: Mika Ayenson <Mikaayenson@users.noreply.github.com>	2023-05-03 23:14:39 -07:00
Harrison Chase	fba6921b50	Harrison/one drive loader (#4081 ) Co-authored-by: José Ferraz Neto <netoferraz@gmail.com>	2023-05-03 22:55:34 -07:00
Davis Chase	7f8727bbcd	Router chains (#4019 ) Unpolished router examples to help flesh out abstractions and use cases ![Screenshot 2023-05-02 at 7 02 58 PM](https://user-images.githubusercontent.com/130488702/235820394-389e5584-db0b-415e-a260-2824b5555167.png) --------- Co-authored-by: Shreya Rajpal <shreya.rajpal@gmail.com>	2023-05-03 22:02:55 -07:00
Harrison Chase	5f30cc8713	Harrison/knn retriever (#4083 ) Co-authored-by: Yuichi Tateno (secon) <hotchpotch@users.noreply.github.com>	2023-05-03 21:21:58 -07:00
Harrison Chase	5a269d3175	Harrison/media wiki xml (#4072 ) Co-authored-by: Géraud de Drouas <gdedrouas@users.noreply.github.com>	2023-05-03 20:45:33 -07:00
Ivo Stranic	3b556eae44	Update deeplake example (#4055 )	2023-05-03 18:03:51 -07:00
Steve Kim	9b830f437c	Deleted importing Document from document_loaders.base because Documen… (#4068 ) Hi, - Modification: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/arxiv.html - Reason: In this example, the first line is unnecessary because the Document class does not exist in the base. - Resolves: Issue #4052 -------- P.S: This pull-request is my first time, so please let me know if I need to correct or write more explanation.	2023-05-03 17:54:30 -07:00

197 Commits