Tiny code review and docs fix for Docugami DataLoader (#4877)

# Docs and code review fixes for Docugami DataLoader

1. I noticed a couple of hyperlinks that are not loading in the
langchain docs (I guess need explicit anchor tags). Added those.
2. In code review @eyurtsev had a
[suggestion](https://github.com/hwchase17/langchain/pull/4727#discussion_r1194069347)
to allow string paths. Turns out just updating the type works (I tested
locally with string paths).

# Pre-submission checks
I ran `make lint` and `make tests` successfully.

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
pull/4520/head^2
Taqi Jaffri 1 year ago committed by GitHub
parent d6e0b9a43d
commit ef8b5f64bc
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -8,10 +8,10 @@ Docugami converts business documents into a Document XML Knowledge Graph, genera
## Quick start ## Quick start
1. Create a Docugami workspace: http://www.docugami.com (free trials available) 1. Create a Docugami workspace: <a href="http://www.docugami.com">http://www.docugami.com</a> (free trials available)
2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. There is no fixed set of document types supported by the system, the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later. 2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. There is no fixed set of document types supported by the system, the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later.
3. Create an access token via the Developer Playground for your workspace. Detailed instructions: https://help.docugami.com/home/docugami-api 3. Create an access token via the Developer Playground for your workspace. Detailed instructions: https://help.docugami.com/home/docugami-api
4. Explore the Docugami API at https://api-docs.docugami.com/ to get a list of your processed docset IDs, or just the document IDs for a particular docset. 4. Explore the Docugami API at <a href="https://api-docs.docugami.com">https://api-docs.docugami.com</a> to get a list of your processed docset IDs, or just the document IDs for a particular docset.
6. Use the DocugamiLoader as detailed in [this notebook](../modules/indexes/document_loaders/examples/docugami.ipynb), to get rich semantic chunks for your documents. 6. Use the DocugamiLoader as detailed in [this notebook](../modules/indexes/document_loaders/examples/docugami.ipynb), to get rich semantic chunks for your documents.
7. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML with better tags based on your preferences, which are then added to the DocugamiLoader output as metadata. Use techniques like [self-querying retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html) to do high accuracy Document QA. 7. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML with better tags based on your preferences, which are then added to the DocugamiLoader output as metadata. Use techniques like [self-querying retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html) to do high accuracy Document QA.

@ -5,7 +5,7 @@ import logging
import os import os
import re import re
from pathlib import Path from pathlib import Path
from typing import Any, Dict, List, Mapping, Optional, Sequence from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
import requests import requests
from pydantic import BaseModel, root_validator from pydantic import BaseModel, root_validator
@ -39,7 +39,7 @@ class DocugamiLoader(BaseLoader, BaseModel):
access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY") access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY")
docset_id: Optional[str] docset_id: Optional[str]
document_ids: Optional[Sequence[str]] document_ids: Optional[Sequence[str]]
file_paths: Optional[Sequence[Path]] file_paths: Optional[Sequence[Union[Path, str]]]
min_chunk_size: int = 32 # appended to the next chunk to avoid over-chunking min_chunk_size: int = 32 # appended to the next chunk to avoid over-chunking
@root_validator @root_validator
@ -331,6 +331,7 @@ class DocugamiLoader(BaseLoader, BaseModel):
elif self.file_paths: elif self.file_paths:
# local mode (for integration testing, or pre-downloaded XML) # local mode (for integration testing, or pre-downloaded XML)
for path in self.file_paths: for path in self.file_paths:
path = Path(path)
with open(path, "rb") as file: with open(path, "rb") as file:
chunks += self._parse_dgml( chunks += self._parse_dgml(
{ {

Loading…
Cancel
Save