langchain

Commit Graph

Author	SHA1	Message	Date
Tim Asp	d22651d82a	Add new iFixit document loader (#1333 ) iFixit is a wikipedia-like site that has a huge amount of open content on how to fix things, questions/answers for common troubleshooting and "things" related content that is more technical in nature. All content is licensed under CC-BY-SA-NC 3.0 Adding docs from iFixit as context for user questions like "I dropped my phone in water, what do I do?" or "My macbook pro is making a whining noise, what's wrong with it?" can yield significantly better responses than context free response from LLMs.	1 year ago
Matt Robinson	c46478d70e	feat: document loader for image files (#1330 ) ### Summary Adds a document loader for image files such as `.jpg` and `.png` files. ### Testing Run the following using the example document from the [`unstructured` repo](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs). ```python from langchain.document_loaders.image import UnstructuredImageLoader loader = UnstructuredImageLoader("layout-parser-paper-fast.jpg") loader.load() ```	1 year ago
Ingo Kleiber	fd9975dad7	add CoNLL-U document loader (#1297 ) I've added a simple [CoNLL-U](https://universaldependencies.org/format.html) document loader. CoNLL-U is a common format for NLP tasks and is used, for example, in the Universal Dependencies treebank corpora. The loader reads a single file in standard CoNLL-U format and returns a document.	1 year ago
Harrison Chase	0963096491	fix imports (#1288 )	1 year ago
Matt Robinson	2f15c11b87	feat: document loader for MS Word documents (#1282 ) ### Summary Adds a document loader for MS Word Documents. Works with both `.docx` and `.doc` files as longer as the user has installed `unstructured>=0.4.11`. ### Testing The follow workflow test the loader for both `.doc` and `.docx` files using example docs from the `unstructured` repo. #### `.docx` ```python from langchain.document_loaders import UnstructuredWordDocumentLoader filename = "../unstructured/example-docs/fake.docx" loader = UnstructuredWordDocumentLoader(filename) loader.load() ``` #### `.doc` ```python from langchain.document_loaders import UnstructuredWordDocumentLoader filename = "../unstructured/example-docs/fake.doc" loader = UnstructuredWordDocumentLoader(filename) loader.load() ```	1 year ago
Harrison Chase	96db6ed073	cleanup (#1274 )	1 year ago
Harrison Chase	42167a1e24	Harrison/fb loader (#1277 ) Co-authored-by: Vairo Di Pasquale <vairo.dp@gmail.com>	1 year ago
Klein Tahiraj	8a0751dadd	adding .ipynb loader and documentation Fixes #1248 (#1252 ) `NotebookLoader.load()` loads the `.ipynb` notebook file into a `Document` object. Parameters: * `include_outputs` (bool): whether to include cell outputs in the resulting document (default is False). * `max_output_length` (int): the maximum number of characters to include from each cell output (default is 10). * `remove_newline` (bool): whether to remove newline characters from the cell sources and outputs (default is False). * `traceback` (bool): whether to include full traceback (default is False).	1 year ago
Satoru Sakamoto	d480330fae	fix to specific language transcript (#1231 ) Currently youtube loader only seems to support English audio. Changed to load videos in the specified language.	1 year ago
Harrison Chase	5bdb8dd6fe	Harrison/unstructured io (#1200 )	1 year ago
Harrison Chase	d90a287d8f	Harrison/updating docs (#1196 )	1 year ago
Dennis Antela Martinez	23243ae69c	add gitbook document loader (#1180 ) Added a GitBook document loader. It lets you both, (1) fetch text from any single GitBook page, or (2) fetch all relative paths and return their respective content in Documents. I've modified the `scrape` method in the `WebBaseLoader` to accept custom web paths if given, but happy to remove it and move that logic into the `GitbookLoader` itself.	1 year ago
Zach Schillaci	159c560c95	Refactor some loops into list comprehensions (#1185 )	1 year ago
Harrison Chase	4766b20223	clean up loaders (#1178 )	1 year ago
Harrison Chase	37dd34bea5	fix path (#1168 )	1 year ago
Harrison Chase	65cc81c479	directory loader improvements (#1162 )	1 year ago
Harrison Chase	fb3c73d194	add srt loader (#1140 )	1 year ago
Harrison Chase	d5f3dfa1e1	Harrison/hn loader (#1130 ) Co-authored-by: William X <william.y.xuan@gmail.com>	1 year ago
Matt Robinson	2bee8d4941	feat: add support for `.ppt` files in `UnstructuredPowerPointLoader` (#1124 ) ### Summary Adds support for older `.ppt` file in the PowerPoint loader. ### Testing The following should work on `unstructured==0.4.11` using the example docs from the `unstructured` repo. ```python from langchain.document_loaders import UnstructuredPowerPointLoader filename = "../unstructured/example-docs/fake-power-point.pptx" loader = UnstructuredPowerPointLoader(filename) loader.load() filename = "../unstructured/example-docs/fake-power-point.ppt" loader = UnstructuredPowerPointLoader(filename) loader.load() ``` Now downgrade `unstructured` to version `0.4.10`. The following should work: ```python from langchain.document_loaders import UnstructuredPowerPointLoader filename = "../unstructured/example-docs/fake-power-point.pptx" loader = UnstructuredPowerPointLoader(filename) loader.load() ``` and the following should give you a `ValueError` and invite you to upgrade `unstructured`. ```python from langchain.document_loaders import UnstructuredPowerPointLoader filename = "../unstructured/example-docs/fake-power-point.ppt" loader = UnstructuredPowerPointLoader(filename) loader.load() ```	1 year ago
Harrison Chase	3f50feb280	fix telegram imports (#1110 )	1 year ago
Harrison Chase	c60954d0f8	Harrison/telegram loader (#1080 ) Co-authored-by: Maxime Vidal <max.vidal@hotmail.fr>	1 year ago
Akshay	d8ed286200	Update and rename everynote.py to evernote.py (#1060 ) Updating this base file as well as the .ipynb file of the example on the website: https://github.com/hwchase17/langchain/compare/master...akshayvkt:langchain:patch-1 https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/everynote.html	1 year ago
Matt Robinson	3ea1e5af1e	feat: added element metadata to unstructured loader (#1068 ) ### Summary Adds tracked metadata from `unstructured` elements to the document metadata when `UnstructuredFileLoader` is used in `"elements"` mode. Tracked metadata is available in `unstructured>=0.4.9`, but the code is written for backward compatibility with older `unstructured` versions. ### Testing Before running, make sure to upgrade to `unstructured==0.4.9`. In the code snippet below, you should see `page_number`, `filename`, and `category` in the metadata for each document. `doc[0]` should have `page_number: 1` and `doc[-1]` should have `page_number: 2`. The example document is `layout-parser-paper-fast.pdf` from the [`unstructured` sample docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs). ```python from langchain.document_loaders import UnstructuredFileLoader loader = UnstructuredFileLoader(file_path=f"layout-parser-paper-fast.pdf", mode="elements") docs = loader.load() ```	1 year ago
Harrison Chase	0998577dfe	Harrison/unstructured structured (#1004 )	1 year ago
Harrison Chase	bbb06ca4cf	pdfminer (#1003 )	1 year ago
Harrison Chase	e51fad1488	Harrison/0083 (#996 ) Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>	1 year ago
Harrison Chase	2e96704d59	Harrison/airbyte (#989 ) Co-authored-by: zanderchase <zanderchase@gmail.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MacBook-Pro.local>	1 year ago
zanderchase	c2d1d903fa	Zander/online pdf loader (#984 )	1 year ago
Matt Robinson	07a407d89a	feat: adds `UnstructuredURLLoader` for loading data from urls (#979 ) ### Summary Adds a `UnstructuredURLLoader` that supports loading data from a list of URLs. ### Testing ```python from langchain.document_loaders import UnstructuredURLLoader urls = [ "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023", "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023" ] loader = UnstructuredURLLoader(urls=urls) raw_documents = loader.load() ```	1 year ago
Harrison Chase	c64f98e2bb	Harrison/format agent instructions (#973 ) Co-authored-by: Andrew White <white.d.andrew@gmail.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net> Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>	1 year ago
Harrison Chase	5469d898a9	Harrison/everynote (#974 ) Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>	1 year ago
Harrison Chase	01fa2d8117	Harrison/youtube fixes (#955 ) Co-authored-by: Ji <jizhang.work@gmail.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>	1 year ago
zanderchase	8e126bc9bd	adding webpage loading logic (#942 )	1 year ago
Usama Navid	e85c53ce68	Update readthedocs.py (#943 ) Sometimes, the docs may be empty. For example for the text = soup.find_all("main", {"id": "main-content"}) was an empty list. To cater to these edge cases, the clean function needs to be checked if it is empty or not.	1 year ago
Harrison Chase	3e1901e1aa	gutenberg books (#946 ) Co-authored-by: zanderchase <zander@unfold.ag> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>	1 year ago
Harrison Chase	44ecec3896	Harrison/add roam loader (#939 )	1 year ago
Harrison Chase	637c0d6508	Harrison/obsidian (#920 )	1 year ago
Ankush Gola	6bd1529cb7	add GoogleDriveLoader (#914 ) only deal with docs files for now	1 year ago
Harrison Chase	2ec25ddd4c	add unstructured examples (#913 )	1 year ago
Harrison Chase	53d56d7650	Harrison/unstructured support (#903 )	1 year ago

40 Commits (main)