langchain/libs/community/langchain_community/document_loaders
Martin Triska 2df8ac402a
community[minor]: Added propagation of document metadata from O365BaseLoader (#20663)
**Description:**
- Added propagation of document metadata from `O365BaseLoader` to
`FileSystemBlobLoader` (`O365BaseLoader` uses `FileSystemBlobLoader` under the
hood).
- This is done by passing a dictionary, `metadata_dict`, whose keys are
filenames and whose values are dictionaries containing each document's
metadata.
- Modified `FileSystemBlobLoader` to accept `metadata_dict`, use the
`mimetype` from it (if available), and pass the metadata further into the blob
loader.
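
The lookup described above can be sketched roughly as follows. This is a minimal, standalone illustration, not the actual `FileSystemBlobLoader` code: `BlobSketch` and `make_blob` are hypothetical names, and the `"mime_type"` key is an assumption (the description spells it both `mimetype` and `mime_type`).

```python
import mimetypes
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class BlobSketch:
    """Hypothetical stand-in for a blob: a path plus mimetype and metadata."""

    path: str
    mimetype: Optional[str]
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_blob(
    path: str,
    metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
) -> BlobSketch:
    """Build a blob, preferring a mimetype supplied via metadata_dict."""
    # metadata_dict is keyed by filename, so look up the basename of the path.
    per_file = (metadata_dict or {}).get(Path(path).name, {})
    # Fallback: guess from the file extension (may be None or wrong).
    guessed, _ = mimetypes.guess_type(path)
    # Assumed key name "mime_type"; the supplied value wins when present.
    mimetype = per_file.get("mime_type") or guessed
    # Everything else is passed along as blob metadata.
    metadata = {k: v for k, v in per_file.items() if k != "mime_type"}
    return BlobSketch(path=path, mimetype=mimetype, metadata=metadata)
```

A file whose extension is misleading still gets the right type when the dictionary supplies one; a file with no entry falls back to the extension-based guess.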

**Issue:**
- Under the hood, `O365BaseLoader` downloads documents to a temp folder and
then runs `FileSystemBlobLoader` on it.
- However, metadata about the document in question is lost in this
process. In particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file
  extension, but that does not always work.
  - `web_url`: this is useful to keep around, since in a RAG LLM application
  we might want to provide a link to the source document. To work well with
  document parsers, we pass the `web_url` as `source` (`web_url` is
  ignored by parsers, `source` is preserved).
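
As a rough illustration of the renaming described above, the per-file metadata could be assembled like this before being handed to the blob loader. The function name `build_metadata_dict`, the plain-dict file records, and the `"name"` / `"mime_type"` keys are assumptions made for the sketch; the real loader works with O365 file objects.

```python
from typing import Any, Dict, Iterable, Mapping


def build_metadata_dict(
    files: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Map each filename to the metadata that should survive the download."""
    metadata_dict: Dict[str, Dict[str, Any]] = {}
    for f in files:
        metadata_dict[f["name"]] = {
            # Stored under "source" because parsers preserve that key,
            # whereas a "web_url" key would be ignored.
            "source": f["web_url"],
            # Reported type, used instead of guessing from the extension.
            "mime_type": f["mime_type"],
        }
    return metadata_dict
```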

**Dependencies:**
None

**Twitter handle:**
@martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
8 hours ago
blob_loaders community[patch]: Update doc-string in CloudBlobLoader (#22069) 8 hours ago
parsers community[patch]: Fix remaining __inits__ in community (#22037) 1 day ago
__init__.py infra: rm unused # noqa violations (#22049) 1 day ago
acreom.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
airbyte.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
airbyte_json.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
airtable.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
apify_dataset.py community[patch]: update apify integration to attribute API activity to langchain (#21909) 3 days ago
arcgis_loader.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
arxiv.py community[minor]: Implement lazy_load() for ArxivLoader (#18664) 3 months ago
assemblyai.py community[patch]: docstrings update (#20301) 1 month ago
astradb.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
async_html.py community[minor]: allow enabling proxy in aiohttp session in AsyncHTML (#19499) 1 day ago
athena.py community[minor]: import fix (#20995) 3 weeks ago
azlyrics.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
azure_ai_data.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
azure_blob_storage_container.py community[patch]: type ignore fixes (#18395) 3 months ago
azure_blob_storage_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
baiducloud_bos_directory.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
baiducloud_bos_file.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
base.py core: Move document loader interfaces to core (#17723) 3 months ago
base_o365.py community[minor]: Added propagation of document metadata from O365BaseLoader (#20663) 8 hours ago
bibtex.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
bigquery.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
bilibili.py community[patch]: docstrings update (#20301) 1 month ago
blackboard.py infra: rm unused # noqa violations (#22049) 1 day ago
blockchain.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
brave_search.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
browserbase.py community: updated Browserbase loader (#21757) 1 week ago
browserless.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
cassandra.py community[minor]: Add Cassandra ByteStore (#22064) 9 hours ago
chatgpt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
chm.py community[patch]: docstrings (#16810) 3 months ago
chromium.py community[patch]: Update comments for lazy_load method (#21063) 3 weeks ago
college_confidential.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
concurrent.py community[patch]: import flattening fix (#20110) 1 month ago
confluence.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
conllu.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
couchbase.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
csv_loader.py community: Fix CSVLoader columns is None (#20701) 1 day ago
cube_semantic.py community[patch]: Implement lazy_load() for CubeSemanticLoader (#18535) 3 months ago
datadog_logs.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
dataframe.py community[patch]: support modin document loader (#18866) 2 months ago
diffbot.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
directory.py community: fix `DirectoryLoader` progress bar (#19821) 1 month ago
discord.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
doc_intelligence.py docs: community docstring updates (#21040) 3 weeks ago
docugami.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
docusaurus.py docs: docstrings `langchain_community` update (#14889) 5 months ago
dropbox.py infra: add print rule to ruff (#16221) 3 months ago
duckdb_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
email.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
epub.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
etherscan.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
evernote.py infra: rm unused # noqa violations (#22049) 1 day ago
excel.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
facebook_chat.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
fauna.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
figma.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
firecrawl.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
gcs_directory.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
gcs_file.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
generic.py community[patch]: import flattening fix (#20110) 1 month ago
geodataframe.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
git.py Merge pull request #18539 3 months ago
gitbook.py community[minor]: Implement lazy_load() for GitbookLoader (#18670) 3 months ago
github.py community: Implement lazy_load() for GithubFileLoader (#18584) 3 months ago
glue_catalog.py community[minor]: Add glue catalog loader (#20220) 1 month ago
google_speech_to_text.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
googledrive.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 3 weeks ago
gutenberg.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
helpers.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
hn.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
html.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
html_bs.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
hugging_face_dataset.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
hugging_face_model.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
ifixit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
image.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
image_captions.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
imsdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
iugu.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
joplin.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
json_loader.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
kinetica_loader.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
lakefs.py docs: docstrings `langchain_community` update (#14889) 5 months ago
larksuite.py community[minor]: Add LarkSuite wiki document loader. (#21016) 3 weeks ago
llmsherpa.py community[minor]: add support for llmsherpa (#19741) 2 months ago
markdown.py corrected outdated link (#15053) 5 months ago
mastodon.py Merge pull request #18671 3 months ago
max_compute.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
mediawikidump.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
merge.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
mhtml.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
mintbase.py community[minor]: add mintbase loader to langchain (#20089) 3 weeks ago
modern_treasury.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
mongodb.py community[minor]: added a feature to filter documents in Mongoloader (#18253) 3 months ago
news.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
notebook.py community[patch]: add NotebookLoader unit test (#17721) 2 months ago
notion.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
notiondb.py community[patch]: Fix NotionDBLoader 400 Error by conditionally adding filter parameter (#19075) 2 months ago
nuclia.py infra: add print rule to ruff (#16221) 3 months ago
obs_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
obs_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
obsidian.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
odt.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
onedrive.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
onedrive_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
onenote.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
open_city_data.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
oracleadb_loader.py community[minor]: add oracle autonomous database doc loader integration (#19536) 2 months ago
oracleai.py community[minor]: Oraclevs integration (#21123) 3 weeks ago
org_mode.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
pdf.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
pebblo.py infra: rm unused # noqa violations (#22049) 1 day ago
polars_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
powerpoint.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
psychic.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
pubmed.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
pyspark_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
python.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
quip.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
readthedocs.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
recursive_url_loader.py community[patch]: Using the right encoding to parse the web page in RecursiveUrlLoader (#20632) 3 weeks ago
reddit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
roam.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
rocksetdb.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
rspace.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
rss.py multiple: Remove unnecessary Ruff suppression comments (#21050) 3 weeks ago
rst.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
rtf.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
s3_directory.py community[patch]: Skip nested directories when using S3DirectoryLoader (#17829) 3 months ago
s3_file.py community[patch]: support unstructured_kwargs for s3 loader (#15473) 2 months ago
scrapfly.py community[minor]: Add Scrapfly Loader community integration (#22036) 1 day ago
sharepoint.py community[minor]: Added propagation of document metadata from O365BaseLoader (#20663) 8 hours ago
sitemap.py community[minor]: Implement lazy_load() for SitemapLoader (#18667) 3 months ago
slack_directory.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
snowflake_loader.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
spider.py doc list not empty (#21208) 3 days ago
spreedly.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
sql_database.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
srt.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
stripe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
surrealdb.py community[patch]: SurrealDB fix for asyncio (#16092) 4 months ago
telegram.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
tencent_cos_directory.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
tencent_cos_file.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
tensorflow_datasets.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
text.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
tidb.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
tomarkdown.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
toml.py community: Use default load() implementation in doc loaders (#18385) 3 months ago
trello.py community: Implement lazy_load() for TrelloLoader (#18658) 3 months ago
tsv.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
twitter.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
unstructured.py community[minor]: import fix (#20995) 3 weeks ago
url.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
url_playwright.py docs: community docstring updates (#21040) 3 weeks ago
url_selenium.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
vsdx.py community[patch]: import flattening fix (#20110) 1 month ago
weather.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
web_base.py community[patch]: raise_for_status logic missing in async _fetch of WebBaseLoader (#21948) 2 days ago
whatsapp_chat.py community: Implement lazy_load() for WhatsAppChatLoader (#18677) 3 months ago
wikipedia.py community[patch]: upgrade to recent version of mypy (#21616) 1 week ago
word_document.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
xml.py community: better support of pathlib paths in document loaders (#18396) 2 months ago
xorbits.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 5 months ago
youtube.py community[patch]: docstrings (#16810) 3 months ago
yuque.py community[minor]: add Yuque document loader (#17924) 3 months ago