forked from Archives/langchain
feat: added element metadata to unstructured loader (#1068)
### Summary Adds tracked metadata from `unstructured` elements to the document metadata when `UnstructuredFileLoader` is used in `"elements"` mode. Tracked metadata is available in `unstructured>=0.4.9`, but the code is written for backward compatibility with older `unstructured` versions. ### Testing Before running, make sure to upgrade to `unstructured==0.4.9`. In the code snippet below, you should see `page_number`, `filename`, and `category` in the metadata for each document. `doc[0]` should have `page_number: 1` and `doc[-1]` should have `page_number: 2`. The example document is `layout-parser-paper-fast.pdf` from the [`unstructured` sample docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs). ```python from langchain.document_loaders import UnstructuredFileLoader loader = UnstructuredFileLoader(file_path=f"layout-parser-paper-fast.pdf", mode="elements") docs = loader.load() ```searx-api
parent
bac676c8e7
commit
3ea1e5af1e
Loading…
Reference in New Issue