mirror of https://github.com/hwchase17/langchain (synced 2024-11-10 01:10:59 +00:00)
2df8ac402a

**Description:**

- Added propagation of document metadata from `O365BaseLoader` to `FileSystemBlobLoader` (`O365BaseLoader` uses `FileSystemBlobLoader` under the hood).
- This is done by passing a dictionary `metadata_dict`, where each key is a filename and each value is a dictionary containing that document's metadata.
- Modified `FileSystemBlobLoader` to accept `metadata_dict`, use the `mimetype` from it (if available), and pass the metadata on to the blob loader.

**Issue:**

- Under the hood, `O365BaseLoader` downloads documents to a temporary folder and then runs `FileSystemBlobLoader` on it.
- However, metadata about each document is lost in this process. In particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file extension, but that does not work 100% of the time.
  - `web_url`: this is useful to keep because in a RAG LLM application we might want to provide a link to the source document. To work well with document parsers, `web_url` is passed as `source` (`web_url` is ignored by parsers, `source` is preserved).

**Dependencies:** None

**Twitter handle:** @martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>

| | | |
---|---|---|
.. | ||
adapters | ||
agent_toolkits | ||
agents | ||
callbacks | ||
chains | ||
chat_loaders | ||
chat_message_histories | ||
chat_models | ||
cross_encoders | ||
docstore | ||
document_compressors | ||
document_loaders | ||
document_transformers | ||
embeddings | ||
example_selectors | ||
graphs | ||
indexes | ||
llms | ||
memory | ||
output_parsers | ||
query_constructors | ||
retrievers | ||
storage | ||
tools | ||
utilities | ||
utils | ||
vectorstores | ||
__init__.py | ||
cache.py | ||
py.typed |
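
The metadata propagation described in the commit message above can be illustrated with a small standalone sketch. It does not use the LangChain classes themselves: `resolve_blob_info` is a hypothetical helper, and the `mime_type` and `web_url` key names inside each per-file dictionary are assumptions based on the wording of the commit message. It only shows the general idea of preferring a supplied mimetype over an extension-based guess and exposing `web_url` as `source`.

```python
import mimetypes
from pathlib import Path
from typing import Any, Dict, Optional


def resolve_blob_info(
    path: str,
    metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: combine a file path with per-file metadata.

    `metadata_dict` maps a filename to a dictionary of that document's
    metadata, as described in the commit message. The key names used
    here ("mime_type", "web_url") are assumptions for illustration.
    """
    name = Path(path).name
    # Copy so the caller's dictionary is not mutated.
    file_metadata = dict((metadata_dict or {}).get(name, {}))

    # Prefer the mimetype recorded upstream; otherwise fall back to
    # guessing from the file extension, which can return None.
    mimetype = file_metadata.pop("mime_type", None) or mimetypes.guess_type(name)[0]

    # Expose the URL under `source`, the key the commit message says
    # parsers preserve.
    if "web_url" in file_metadata:
        file_metadata["source"] = file_metadata.pop("web_url")

    return {"path": path, "mimetype": mimetype, "metadata": file_metadata}


info = resolve_blob_info(
    "/tmp/report.bin",
    {"report.bin": {"mime_type": "application/pdf", "web_url": "https://example.com/report"}},
)
```

In this sketch the `.bin` extension alone would not identify the file as a PDF, so the supplied metadata decides the mimetype; a file with no entry in `metadata_dict` falls back to the extension-based guess.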