"## Docugami RAG over XML Knowledge Graphs (KG-RAG)\n",
"\n",
"Many documents contain a mixture of content types, including text and tables. \n",
"\n",
"Semi-structured data can be challenging for conventional RAG because text-only chunking techniques may lose semantics, e.g.: \n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search \n",
"\n",
"Docugami deconstructs documents into XML Knowledge Graphs consisting of hierarchical semantic chunks using the XML data model. This cookbook shows how to perform RAG using XML Knowledge Graphs as input (**KG-RAG**):\n",
"\n",
"* We will use [Docugami](http://docugami.com/) to segment out text and table chunks from documents (PDF \\[scanned or digital\\], DOC or DOCX) including semantic XML markup in the chunks.\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text (including semantic XML markup) along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
" 1. Use the [Docugami API](https://api-docs.docugami.com), specifically the [documents](https://api-docs.docugami.com/#tag/documents/operation/upload-document) endpoint. You can also use the [docugami Python library](https://pypi.org/project/docugami/) as a convenient wrapper.\n",
"Once your documents are in Docugami, they are processed and organized into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. Docugami is not limited to any particular types of documents, and the clusters created depend on your particular documents. You can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later if you wish. You can monitor file status in the simple Docugami webapp, or use a [webhook](https://api-docs.docugami.com/#tag/webhooks) to be informed when your documents are done processing.\n",
"\n",
"You can also use the [Docugami API](https://api-docs.docugami.com) or the [docugami](https://pypi.org/project/docugami/) Python library to do all the file processing without visiting the Docugami webapp, except to get the API key.\n",
"\n",
"> You can get an API key as documented here: https://help.docugami.com/home/docugami-api. The following code assumes you have set the `DOCUGAMI_API_TOKEN` environment variable.\n",
"\n",
"First, let's define two simple helper functions to upload files and wait for them to finish processing."
"If you are on the free Docugami tier, your files should be done in ~15 minutes or less depending on the number of pages uploaded and available resources (please contact Docugami for paid plans for faster processing). If your notebook is not running continuously, you can re-run the code above to continue waiting; it does not re-upload or reprocess your files."
]
},
{
"cell_type": "markdown",
"id": "7c24efa9-b6f6-4dc2-bfe3-70819ba3ef75",
"metadata": {},
"source": [
"### Partition PDF tables and text\n",
"\n",
"You can use the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) to easily get chunks for your documents, including semantic and structural metadata. This is the simpler and recommended approach for most use cases, but in this notebook let's use the `dgml-utils` library to examine the segmented output for this file in more detail by processing the XML we just downloaded above."
"<TakeoffAccident> <Analysis>The pilot reported that, as the tail lifted during takeoff, the airplane veered left. He attempted to correct with full right rudder and full brakes. However, the airplane subsequently nosed over resulting in substantial damage to the fuselage, lift struts, rudder, and vertical stabilizer. </Analysis></TakeoffAccident>\n",
"<AircraftCondition> The pilot reported that there were no preaccident mechanical malfunctions or anomalies with the airplane that would have precluded normal operation. </AircraftCondition>\n",
"<WindConditions> At about the time of the accident, wind was from <WindDirection>180</WindDirection>° at <WindConditions>5 </WindConditions>knots. The pilot decided to depart on runway <Runway>35 </Runway>due to the prevailing airport traffic. He stated that departing with “more favorable wind conditions” may have prevented the accident. </WindConditions>\n",
"<ProbableCause> The <ProbableCause>National Transportation Safety Board </ProbableCause>determines the probable cause(s) of this accident to be: </ProbableCause>\n",
"<AccidentCause> The pilot's loss of directional control during takeoff and subsequent excessive use of brakes which resulted in a nose-over. Contributing to the accident was his decision to takeoff downwind. </AccidentCause>\n",
" include_xml_tags=True, # Ensures Docugami XML semantic tags are included in the chunked output (set to False for text-only chunks and tables as Markdown)\n",
" max_text_length=1024 * 8, # 8k chars are ~2k tokens for OpenAI.\n",
"The file processed by Docugami in the example above was [this one](https://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/192541/pdf) from the NTSB, and you can view the PDF side by side with the XML chunks above to compare them. \n",
"\n",
"If you want text-based chunks instead, Docugami also supports those and renders tables as Markdown:"
"The pilot reported that, as the tail lifted during takeoff, the airplane veered left. He attempted to correct with full right rudder and full brakes. However, the airplane subsequently nosed over resulting in substantial damage to the fuselage, lift struts, rudder, and vertical stabilizer.\n",
"The pilot reported that there were no preaccident mechanical malfunctions or anomalies with the airplane that would have precluded normal operation.\n",
"At about the time of the accident, wind was from 180 ° at 5 knots. The pilot decided to depart on runway 35 due to the prevailing airport traffic. He stated that departing with “more favorable wind conditions” may have prevented the accident.\n",
"Probable Cause and Findings\n",
"The National Transportation Safety Board determines the probable cause(s) of this accident to be:\n",
"The pilot's loss of directional control during takeoff and subsequent excessive use of brakes which resulted in a nose-over. Contributing to the accident was his decision to takeoff downwind.\n",
"Page 1 of 5\n"
]
}
],
"source": [
"with open(dgml_path, \"r\") as file:\n",
" contents = file.read().encode(\"utf-8\")\n",
"\n",
" chunks = get_chunks_str(\n",
" contents,\n",
" include_xml_tags=False, # text-only chunks and tables as Markdown\n",
" max_text_length=1024\n",
" * 8, # 8k chars are ~2k tokens for OpenAI. Ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them\n",
" )\n",
"\n",
" print(f\"found {len(chunks)} chunks, here are the first few\")\n",
"## Docugami XML Deep Dive: Jane Doe NDA Example\n",
"\n",
"Let's explore the Docugami XML output for a different example PDF file (a long-form contract): [Jane Doe NDA](https://github.com/docugami/dgml-utils/blob/main/python/tests/test_data/article/Jane%20Doe%20NDA.pdf). We have provided the processed Docugami XML output for this PDF here: https://github.com/docugami/dgml-utils/blob/main/python/tests/test_data/article/Jane%20Doe.xml so you can follow along without processing your own documents."
"The Docugami XML contains extremely detailed semantics and visual bounding boxes for all elements. The `dgml-utils` library parses text and non-text elements into formats appropriate to pass into LLMs (chunked text with XML semantic labels)."
"<MUTUALNON-DISCLOSUREAGREEMENT> This Non-Disclosure Agreement (\"Agreement\") is entered into as of <EffectiveDate>November 4, 2023 </EffectiveDate>(\"Effective Date\"), by and between: </MUTUALNON-DISCLOSUREAGREEMENT>\n",
"Disclosing Party:\n",
"<DisclosingParty><PrincipalPlaceofBusiness>Widget Corp.</PrincipalPlaceofBusiness>, a <USState>Delaware </USState>corporation with its principal place of business at <PrincipalPlaceofBusiness><PrincipalPlaceofBusiness> <WidgetCorpAddress>123 </WidgetCorpAddress> <PrincipalPlaceofBusiness>Innovation Drive</PrincipalPlaceofBusiness> </PrincipalPlaceofBusiness> , <PrincipalPlaceofBusiness>Techville</PrincipalPlaceofBusiness>, <USState> Delaware</USState>, <PrincipalPlaceofBusiness>12345 </PrincipalPlaceofBusiness></PrincipalPlaceofBusiness> (\"<Org> <CompanyName>Widget </CompanyName> <CorporateName>Corp.</CorporateName> </Org>\") </DisclosingParty>\n",
"Receiving Party:\n",
"<RecipientName>Jane Doe</RecipientName>, an individual residing at <RecipientAddress><RecipientAddress> <RecipientAddress>456 </RecipientAddress> <RecipientAddress>Privacy Lane</RecipientAddress> </RecipientAddress> , <RecipientAddress>Safetown</RecipientAddress>, <USState> California</USState>, <RecipientAddress>67890 </RecipientAddress></RecipientAddress> (\"Recipient\")\n",
"(collectively referred to as the \"Parties\").\n",
"1. Definition of Confidential Information\n",
"<DefinitionofConfidentialInformation>For purposes of this Agreement, \"Confidential Information\" shall include all information or material that has or could have commercial value or other utility in the business in which Disclosing Party is engaged. If Confidential Information is in written form, the Disclosing Party shall label or stamp the materials with the word \"Confidential\" or some similar warning. If Confidential Information is transmitted orally, the Disclosing Party shall promptly provide writing indicating that such oral communication constituted Confidential Information . </DefinitionofConfidentialInformation>\n",
"2. Exclusions from Confidential Information\n",
"<ExclusionsFromConfidentialInformation>Recipient's obligations under this Agreement do not extend to information that is: (a) publicly known at the time of disclosure or subsequently becomes publicly known through no fault of the Recipient; (b) discovered or created by the Recipient before disclosure by Disclosing Party; (c) learned by the Recipient through legitimate means other than from the Disclosing Party or Disclosing Party's representatives; or (d) is disclosed by Recipient with Disclosing Party's prior written approval. </ExclusionsFromConfidentialInformation>\n",
"3. Obligations of Receiving Party\n",
"<ObligationsofReceivingParty>Recipient shall hold and maintain the Confidential Information in strictest confidence for the sole and exclusive benefit of the Disclosing Party. Recipient shall carefully restrict access to Confidential Information to employees, contractors, and third parties as is reasonably required and shall require those persons to sign nondisclosure restrictions at least as protective as those in this Agreement. </ObligationsofReceivingParty>\n",
"4. Time Periods\n",
"<TimePeriods>The nondisclosure provisions of this Agreement shall survive the termination of this Agreement and Recipient's duty to hold Confidential Information in confidence shall remain in effect until the Confidential Information no longer qualifies as a trade secret or until Disclosing Party sends Recipient written notice releasing Recipient from this Agreement, whichever occurs first. </TimePeriods>\n",
"5. Relationships\n",
"<Relationships>Nothing contained in this Agreement shall be deemed to constitute either party a partner, joint venture, or employee of the other party for any purpose. </Relationships>\n",
"6. Severability\n",
"<Severability>If a court finds any provision of this Agreement invalid or unenforceable, the remainder of this Agreement shall be interpreted so as best to effect the intent of the parties. </Severability>\n",
"The XML markup contains structural as well as semantic tags, which provide additional semantics to the LLM for improved retrieval and generation.\n",
"\n",
"If you prefer, you can set `include_xml_tags=False` in the `get_chunks_str` call above to omit the XML markup. The text-only Docugami chunks are still very good since they follow the structural and semantic contours of the document rather than relying on whitespace-only chunking. Tables are rendered as Markdown in this case, so that some structural context is maintained even without the XML markup."
"table_elements_as_text = [c for c in chunks_as_text if \"table\" in c.structure.split()]\n",
"\n",
"print(table_elements_as_text[0].text)"
]
},
{
"cell_type": "markdown",
"id": "731b3dfc-7ddf-4a11-9a30-9a79b7c66e16",
"metadata": {},
"source": [
"## Multi-vector retriever\n",
"\n",
"Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
"\n",
"Alongside each summary, we will also store the raw table element.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi-vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
"Let's run the same Llama2 paper example from the [Semi_Structured_RAG.ipynb](./Semi_Structured_RAG.ipynb) notebook to see if we get the same results, and to contrast the table chunk returned by Docugami with the ones returned from Unstructured."
"Docugami also provides a [LangChain template](https://github.com/docugami/langchain-template-docugami-kg-rag) that you can integrate into your LangChain projects.\n",