langchain/libs/community/langchain_community
Martin Triska 2df8ac402a
community[minor]: Added propagation of document metadata from O365BaseLoader (#20663)
**Description:**
- Added propagation of document metadata from `O365BaseLoader` to
`FileSystemBlobLoader` (`O365BaseLoader` uses `FileSystemBlobLoader` under the
hood).
- This is done by passing a dictionary, `metadata_dict`, whose keys are
filenames and whose values are dictionaries containing each document's metadata.
- Modified `FileSystemBlobLoader` to accept `metadata_dict`, use the
`mimetype` from it (if available), and pass the metadata further into the blob
loader.

**Issue:**
- Under the hood, `O365BaseLoader` downloads documents to a temp folder and
then runs `FileSystemBlobLoader` on it.
- However, metadata about each document is lost in this
process. In particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file
  extension, but that guess is not always correct.
  - `web_url`: this is worth keeping because a RAG application may
  want to provide a link to the source document. To work well with
  document parsers, we pass the `web_url` as `source` (`web_url` is
  ignored by parsers, whereas `source` is preserved).
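
The two cases above can be sketched in isolation. This is a minimal illustration and not the code from this change: `resolve_blob_metadata` is a hypothetical helper, and the assumption that `metadata_dict` entries carry `mime_type` and `web_url` keys is taken from the description rather than from the loader's actual signature. It uses only the standard library.

```python
import mimetypes
from pathlib import Path
from typing import Any, Dict, Optional


def resolve_blob_metadata(
    path: Path,
    metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: combine caller-supplied metadata with a MIME guess."""
    # Look the file up by name (key = filename) and copy the entry so the
    # caller's dictionary is not mutated.
    metadata = dict((metadata_dict or {}).get(path.name, {}))

    # Prefer the MIME type supplied by the caller; otherwise fall back to
    # guessing from the file extension, which may yield None.
    if not metadata.get("mime_type"):
        guessed, _encoding = mimetypes.guess_type(path.name)
        metadata["mime_type"] = guessed

    # Expose the web URL under `source`, the key that parsers preserve.
    if "web_url" in metadata:
        metadata["source"] = metadata.pop("web_url")

    return metadata
```

For a downloaded file with no extension, the guess would be `None`, so a MIME type supplied through `metadata_dict` is the only reliable value in that case.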

**Dependencies:**
None

**Twitter handle:**
@martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-05-23 11:42:19 -04:00
..
adapters community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
agent_toolkits community[patch]: Fix remaining __inits__ in community (#22037) 2024-05-22 17:42:17 +00:00
agents langchain, community: move OpenAIAssistantV2Runnable to community (#22044) 2024-05-22 21:22:50 +00:00
callbacks infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
chains langchain[minor]: Add PebbloRetrievalQA chain with Identity & Semantic Enforcement support (#20641) 2024-05-15 13:14:52 +00:00
chat_loaders infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
chat_message_histories community[minor]: Add async methods to CassandraChatMessageHistory (#21975) 2024-05-23 10:13:05 -04:00
chat_models infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
cross_encoders multiple: langchain 0.2 in master (#21191) 2024-05-08 16:46:52 -04:00
docstore community[patch]: Fix remaining __inits__ in community (#22037) 2024-05-22 17:42:17 +00:00
document_compressors langchain: add RankLLM Reranker (#21171) 2024-05-22 20:12:55 +00:00
document_loaders community[minor]: Added propagation of document metadata from O365BaseLoader (#20663) 2024-05-23 11:42:19 -04:00
document_transformers infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
embeddings infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
example_selectors
graphs infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
indexes community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
llms infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
memory
output_parsers infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
query_constructors multiple: langchain 0.2 in master (#21191) 2024-05-08 16:46:52 -04:00
retrievers infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
storage community[minor]: Add Cassandra ByteStore (#22064) 2024-05-23 10:46:23 -04:00
tools community[patch]: Adding HEADER to the list of supported locations (#21946) 2024-05-22 22:47:56 +00:00
utilities community[minor]: Add Cassandra ByteStore (#22064) 2024-05-23 10:46:23 -04:00
utils
vectorstores community[patch]: surrealdb provide functions for MMR (Maximal Marginal Relevance) (#21185) 2024-05-22 22:53:55 +00:00
__init__.py
cache.py community: init signature revision for Cassandra LLM cache classes + small maintenance (#17765) 2024-05-16 17:22:24 +00:00
py.typed