mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
# langchain-unstructured

This package contains the LangChain integration with Unstructured.

## Installation

```bash
pip install -U langchain-unstructured
```

Then configure credentials by setting the following environment variable:

```bash
export UNSTRUCTURED_API_KEY="your-api-key"
```

## Loaders

Partition and load files either through the Unstructured API, using the
`unstructured-client` SDK, or locally, using the `unstructured` library.

API:
To partition via the Unstructured API, run `pip install unstructured-client`, set
`partition_via_api=True`, and pass your `api_key`. If you are running the Unstructured API
locally, you can change the API URL by passing `url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links under References to
learn more about the Unstructured API offerings and to get an API key.

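The choice between API and local partitioning described above can be sketched as a small helper that builds the loader's keyword arguments. This is only an illustration: `build_loader_kwargs` and `make_loader` are hypothetical names, the fallback to local partitioning when no key is found is an assumption of this sketch rather than documented loader behavior, and the URL in the usage comment is a placeholder for a self-hosted instance.

```python
import os
from typing import Any, Dict, Optional


def build_loader_kwargs(
    api_key: Optional[str] = None, url: Optional[str] = None
) -> Dict[str, Any]:
    """Return keyword arguments that select API or local partitioning."""
    # Prefer an explicit key, then the environment variable from the setup step.
    api_key = api_key or os.getenv("UNSTRUCTURED_API_KEY")
    if not api_key:
        # Assumption of this sketch: without a key, partition locally.
        return {"partition_via_api": False}
    kwargs: Dict[str, Any] = {"partition_via_api": True, "api_key": api_key}
    if url:
        # Only needed when pointing the loader at a self-hosted API instance.
        kwargs["url"] = url
    return kwargs


def make_loader(file_path, **options):
    """Build an UnstructuredLoader from the arguments chosen above."""
    # Imported lazily so build_loader_kwargs works without the package installed.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(file_path=file_path, **build_loader_kwargs(**options))


# Example (requires langchain-unstructured; the URL is a placeholder):
# loader = make_loader("example.pdf", url="http://localhost:8000")
```

Keeping the argument selection separate from the loader construction makes the decision easy to check without network access or the optional dependencies.
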
Local:
By default, the file loader uses the Unstructured `partition` function and
automatically detects the file type.

In addition to document-specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different Unstructured settings.

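As a rough sketch of passing chunking parameters through the loader for local partitioning: the snippet below assumes both `langchain-unstructured` and the `unstructured` library are installed, and that `max_characters` and `new_after_n_chars` are forwarded unchanged to Unstructured's chunking. The numeric values are arbitrary, and `CHUNKING_KWARGS` and `load_chunked` are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List

# Chunking options forwarded to Unstructured as extra keyword arguments.
# The numeric limits are arbitrary illustrations, not recommended values.
CHUNKING_KWARGS: Dict[str, Any] = {
    "chunking_strategy": "by_title",  # section-aware chunking keyed on titles
    "max_characters": 1000,  # hard upper bound on chunk length
    "new_after_n_chars": 800,  # soft limit: prefer to close a chunk here
}


def load_chunked(file_path: str) -> List[Any]:
    """Partition one local file and return the chunked LangChain documents."""
    # Imported lazily: this needs langchain-unstructured and unstructured.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(file_path=file_path, **CHUNKING_KWARGS)
    return loader.load()


# Example (requires the optional dependencies and a real file):
# docs = load_chunked("example.pdf")
```
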
Setup:

```bash
pip install -U langchain-unstructured
pip install -U unstructured-client
export UNSTRUCTURED_API_KEY="your-api-key"
```

Instantiate:

```python
import os

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=["example.pdf", "fake.pdf"],
    # Read the key exported in the setup step.
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
    chunking_strategy="by_title",
    strategy="fast",
)
```

Load:

```python
docs = loader.load()

print(docs[0].page_content[:100])
print(docs[0].metadata)
```

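When several files are loaded at once, it can help to separate the resulting documents by file. The sketch below assumes each document's metadata carries a `source` key holding the originating path; `group_by_source` is a hypothetical helper, and the sample objects are stand-ins so the snippet runs without calling the loader.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_by_source(docs: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group documents by the file they came from."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for doc in docs:
        # Assumption: the originating path is stored under "source".
        grouped[doc.metadata.get("source", "<unknown>")].append(doc)
    return dict(grouped)


# Stand-in objects with the same attributes as LangChain documents.
sample_docs = [
    SimpleNamespace(page_content="Intro", metadata={"source": "example.pdf"}),
    SimpleNamespace(page_content="Body", metadata={"source": "example.pdf"}),
    SimpleNamespace(page_content="Other", metadata={"source": "fake.pdf"}),
]

for source, group in group_by_source(sample_docs).items():
    print(source, len(group))
```

With real output from the loader, `group_by_source(docs)` would take the place of the stand-in list.
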
## References

- https://docs.unstructured.io/api-reference/api-services/sdk
- https://docs.unstructured.io/api-reference/api-services/overview
- https://docs.unstructured.io/open-source/core-functionality/partitioning
- https://docs.unstructured.io/open-source/core-functionality/chunking