langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Author	SHA1	Message	Date
Pau Ramon Revilla	87802c86d9	Added a MHTML document loader (#6311 ) MHTML is a very interesting format since it's used both for emails but also for archived webpages. Some scraping projects want to store pages in disk to process them later, mhtml is perfect for that use case. This is heavily inspired from the beautifulsoup html loader, but extracting the html part from the mhtml file. --------- Co-authored-by: rlm <pexpresss31@gmail.com>	2023-06-25 13:12:08 -07:00
Matt Robinson	be68f6f8ce	feat: Add `UnstructuredRSTLoader` (#6594 ) ### Summary Adds an `UnstructuredRSTLoader` for loading [reStructuredText](https://en.wikipedia.org/wiki/ReStructuredText) file. ### Testing ```python from langchain.document_loaders import UnstructuredRSTLoader loader = UnstructuredRSTLoader( file_path="example_data/README.rst", mode="elements" ) docs = loader.load() print(docs[0]) ``` ### Reviewers - @hwchase17 - @rlancemartin - @eyurtsev	2023-06-25 12:41:57 -07:00
Augustine Theodore	afc292e58d	Fix WhatsAppChatLoader : Enable parsing additional formats (#6663 ) - Description: Updated regex to support a new format that was observed when whatsapp chat was exported. - Issue: #6654 - Dependencies: No new dependencies - Tag maintainer: @rlancemartin, @eyurtsev	2023-06-25 12:08:43 -07:00
Suri Chen	14b9418cc5	Fix whatsappchatloader - enable parsing new datetime format on WhatsApp chat (#6555 ) - Description: observed new format on WhatsApp exported chat - example: `[2023/5/4, 16:17:13] ~ Carolina: 🥺` - Dependencies: no additional dependencies required - Tag maintainer: @rlancemartin, @eyurtsev	2023-06-21 19:11:49 -07:00
qued	e4224a396b	feat: Add `UnstructuredXMLLoader` for `.xml` files (#5955 ) # Unstructured XML Loader Adds an `UnstructuredXMLLoader` class for .xml files. Works with unstructured>=0.6.7. A plain text representation of the text with the XML tags will be available under the `page_content` attribute in the doc. ### Testing ```python from langchain.document_loaders import UnstructuredXMLLoader loader = UnstructuredXMLLoader( "example_data/factbook.xml", ) docs = loader.load() ``` ## Who can review? @hwchase17 @eyurtsev	2023-06-10 16:24:42 -07:00
Matt Robinson	11fec7d4d1	feat: Add `UnstructuredCSVLoader` for CSV files (#5844 ) ### Summary Adds an `UnstructuredCSVLoader` for loading CSVs. One advantage of using `UnstructuredCSVLoader` relative to the standard `CSVLoader` is that if you use `UnstructuredCSVLoader` in `"elements"` mode, an HTML representation of the table will be available in the metadata. #### Who can review? @hwchase17 @eyurtsev	2023-06-07 19:18:01 -07:00
Matt Robinson	a97e4252e3	feat: add `UnstructuredExcelLoader` for `.xlsx` and `.xls` files (#5617 ) # Unstructured Excel Loader Adds an `UnstructuredExcelLoader` class for `.xlsx` and `.xls` files. Works with `unstructured>=0.6.7`. A plain text representation of the Excel file will be available under the `page_content` attribute in the doc. If you use the loader in `"elements"` mode, an HTML representation of the Excel file will be available under the `text_as_html` metadata key. Each sheet in the Excel document is its own document. ### Testing ```python from langchain.document_loaders import UnstructuredExcelLoader loader = UnstructuredExcelLoader( "example_data/stanley-cups.xlsx", mode="elements" ) docs = loader.load() ``` ## Who can review? @hwchase17 @eyurtsev	2023-06-03 12:44:12 -07:00
Raduan Al-Shedivat	00c6ec8a2d	fix(document_loaders/telegram): fix pandas calls + add tests (#4806 ) # Fix Telegram API loader + add tests. I was testing this integration and it was broken with next error: ```python message_threads = loader._get_message_threads(df) KeyError: False ``` Also, this particular loader didn't have any tests / related group in poetry, so I added those as well. @hwchase17 / @eyurtsev please take a look on this fix PR. --------- Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>	2023-05-16 14:35:25 -07:00
Harrison Chase	b6e3ac17c4	Harrison/sitemap local (#4704 ) Co-authored-by: Lukas Bauer <lukas.bauer@mayflower.de>	2023-05-14 22:04:38 -07:00
Matt Robinson	3637d6da6e	feat: add loader for open office odt files (#4405 ) # ODF File Loader Adds a data loader for handling Open Office ODT files. Requires `unstructured>=0.6.3`. ### Testing The following should work using the `fake.odt` example doc from the [`unstructured` repo](https://github.com/Unstructured-IO/unstructured). ```python from langchain.document_loaders import UnstructuredODTLoader loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements") loader.load() loader = UnstructuredODTLoader(file_path="fake.odt", mode="single") loader.load() ```	2023-05-10 01:37:17 -07:00
Rukmani	2b14036126	Update WhatsAppChatLoader to include the character ~ in the sender name (#4420 ) Fixes #4153 If the sender of a message in a group chat isn't in your contact list, they will appear with a ~ prefix in the exported chat. This PR adds support for parsing such lines.	2023-05-09 15:00:04 -07:00
Aivin V. Solatorio	6567b73e1a	JSON loader (#4067 ) This implements a loader of text passages in JSON format. The `jq` syntax is used to define a schema for accessing the relevant contents from the JSON file. This requires dependency on the `jq` package: https://pypi.org/project/jq/. --------- Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>	2023-05-05 14:48:13 -07:00
hp0404	2a3c5f8353	Update WhatsAppChatLoader regex to handle multiple date-time formats (#4186 ) This PR updates the `message_line_regex` used by `WhatsAppChatLoader` to support different date-time formats used in WhatsApp chat exports; resolves #4153. The new regex handles the following input formats: ```terminal [05.05.23, 15:48:11] James: Hi here [11/8/21, 9:41:32 AM] User name: Message 123 1/23/23, 3:19 AM - User 2: Bye! 1/23/23, 3:22_AM - User 1: And let me know if anything changes ``` Tests have been added to verify that the loader works correctly with all formats.	2023-05-05 13:13:05 -07:00
hp0404	374725a715	Refactor TelegramChatLoader and FacebookChatLoader classes and add tests (#3863 ) This PR includes two main changes: - Refactor the `TelegramChatLoader` and `FacebookChatLoader` classes by removing the dependency on pandas and simplifying the message filtering process. - Add test cases for the `TelegramChatLoader` and `FacebookChatLoader` classes. This test ensures that the class correctly loads and processes the example chat data, providing better test coverage for this functionality.	2023-05-03 15:59:19 -07:00
Paul Garner	aa9d5707e0	Add PythonLoader which auto-detects encoding of Python files (#3311 ) This PR contributes a `PythonLoader`, which inherits from `TextLoader` but detects and sets the encoding automatically.	2023-04-21 10:47:57 -07:00
vowelparrot	bf0887c486	Add Slack Directory Loader (#2841 ) Fixes linting issue from #2835 Adds a loader for Slack Exports which can be a very valuable source of knowledge to use for internal QA bots and other use cases. ```py # Export data from your Slack Workspace first. from langchain.document_loaders import SLackDirectoryLoader SLACK_WORKSPACE_URL = "https://awesome.slack.com" loader = ("Slack_Exports", SLACK_WORKSPACE_URL) docs = loader.load() ```	2023-04-13 21:31:59 -07:00
Harrison Chase	e90d007db3	Harrison/msg files (#2375 ) Co-authored-by: Sahil Masand <masand.sahil@gmail.com> Co-authored-by: Sahil Masand <masands@cbh.com.au>	2023-04-04 06:48:34 -07:00
Sam Cordner-Matthews	1ddd6dbf0b	Add ability to pass kwargs to loader classes in `DirectoryLoader`, add ability to modify encoding and BeautifulSoup behaviour in `BSHTMLLoader` (#2275 ) Solves #2247. Noted that the only test I added checks for the BeautifulSoup behaviour change. Happy to add a test for `DirectoryLoader` if deemed necessary.	2023-04-01 12:48:27 -07:00
Daniel Chalef	b157e0c1c3	Add HTML document_loader that includes page title metadata (#1720 ) This `BSHTMLLoader` document_loader loads an HTML document, extracts text and adds the page title to the returned Document's metadata. The loader uses the already installed bs4 package to extract both text content and the page title. Included in this PR is an example HTML file and an integration test that tests against this file. --------- Co-authored-by: Daniel Chalef <daniel.chalef@private.org>	2023-03-16 21:47:17 -07:00
Harrison Chase	357d808484	Harrison/remote paths pdf (#1544 ) Co-authored-by: Tim Asp <707699+timothyasp@users.noreply.github.com>	2023-03-08 20:53:37 -08:00
Harrison Chase	c64f98e2bb	Harrison/format agent instructions (#973 ) Co-authored-by: Andrew White <white.d.andrew@gmail.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net> Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>	2023-02-10 10:07:26 -08:00

21 Commits