langchain/cookbook/docugami_xml_kg_rag.ipynb

{
"cells": [
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAChgAAAJTCAYAAADw2I76AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAFiUAABYlAUlSJPAAAABjaVRYdFNuaXBNZXRhZGF0YQAAAAAAeyJjbGlwUG9pbnRzIjpbeyJ4IjowLCJ5IjowfSx7IngiOjI1ODQsInkiOjB9LHsieCI6MjU4NCwieSI6NTk1fSx7IngiOjAsInkiOjU5NX1dfQfgBGEAAP82SURBVHhe7N0JtGZVeef/fatQERAVcUABAZF5nmSUeRIBURQVNQ6JZrLTaTO0nZV/7KTXSqfTMZ1EjXGeB1RkdECQGQQEmREQGRRQEAFBjUJV/euzL095fOveqntroKbfd6293umcPTx7n/Oe/ZzfefbYnLm0EEIIIYQQQgghhBBCCCGEEEIIIYQQQgghhAEzHnsNIYQQQgghhBBCCCGEEEIIIYQQQgghhBBCmEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPCAxDCCGEEEIIIYQQQgghhBBCCCGEEEIIIYQwHxEYhhBCCCGEEEIIIYQQQgghhBBCCCGEEEIIYT4iMAwhhBBCCCGEEEIIIYQQQgghhBBCCCGEEMJ8RGAYQgghhBBCCCGEEEIIIYQQQgghhBBCCCGE+YjAMIQQQgghhBBCCCGEEEIIIYQQQgghhBBCCPMRgWEIIYQQQgghhBBCCCGEEEIIIYQQQgghhBDmIwLDEEIIIYQQQgghhBBCCCGEEEIIIYQQQgghzEcEhiGEEEIIIYQQQgghhBBCCCGEEEIIIYQQQpiPsTlzeex9CCGEEEIIIYQQQgghhLBcw6W9ILf22NhYT5OxsP1nzFi6z+VPVP7C6rwgllR+E+WD6eS1JPIIi8dkfbAgqm9Wpj6ayA4ZhyGEEEIIIYSwaERgGEIIIYQQQgghhBBCCGGFgDv717/+dfvlL3/ZHn300TZ79uzHfhkXD82cObOtscYa7UlPetKEQiLb/+d//mf71a9+1R555JHHvv0NT3ziE9taa63VVltttce+WbLMmjWr1/8Xv/hFf48nPOEJbfXVV+91nq64UXvYQX7yBRvI68lPfnJ/PxXYlU3Yhl3qtgF7yGuqdbP/z3/+814veeiD2l87l7Z4c1WGvY0pfeD4qD6cDH2jP4x1/exVH60MGH9s4Jio41z7nBu0NeMwhBBCCCGEEKZHBIYhhBBCCCGEEEIIIYQQVggIqH70ox+12267rd1///3zRHUgGlpzzTXbC17wgrb++ut3IdEohHi33npru/vuu9vDDz88T6BYYqtnPOMZbdttt21PfepT+/dLEq54oqd77rmn3Xzzzb18ZT796U9vz33uc3udCQ2HqB+BFNGYthM/Dtul/Q888EC75ZZbul3wlKc8pT372c9uG2+8cd9+Ksj73nvvbT/84Q97/YgN2YQd1ltvvfb85z+/CxYnQ9vU1f7XX399399ndd1ggw3ahhtu2NZee+0pCx4fT7S9BKfaUSK0FQ3tML5vv/329v3vf78LTydjKC4sMapjx3h52tOeNm+cLW0hHnuX/dVdPaY6ZheE4+yOO+5od911V3vwwQd7OcbyZptt1tZdd93eZjYIIYQQQgghhDA1IjAMIYQQQgghhBBCCCGEsEJAiHTRRRe1008/vd1www3toYce6uIhEEM961nPascee2w75JBDuqBoVCB15513ti9+8Yvtggsu6II64iYQG4neRlz4x3/8x23TTTft3y9JCL7uu+++dtlll7XPfvaz7Qc/+EEvc8stt2z77rtvO/DAA9s666zz2NbjEBD+5Cc/ad/73vd6W3fZZZcuHoR2++7GG29sJ554YreLdhD07bbbbu3oo4/uwr6poJzrrruunXPOOe3CCy/sIkN5ET3uvffe7WUve1kXGk5GidvOPPPM9qEPfWieeFMfHHnkke3www/v+y+PEfLUmxhNm/XR1ltvPc/GKxKODcLZr371q+3LX/5yF9lNhr6tRPRJaEgAqt3G4
U477dTfjwpelzTGyM9+9rN+LHz3u99tL3zhC9sOO+zw2K+LjmObHc4777wuvjU+CY9f//rX9/yJKJe2eDKEEEIIIYQQViZmvmsuj70PIYQQQgghhBBCCCGEEJZbRJm76qqr2hlnnNG+/e1vd+GdqHmVRCsjsNtkk026uG0YMY94jJCJGO+b3/xmu+mmm+btR3goAqAIagcccMBSEZgROanfNddc0wVgV199dS+T6I6Qb4sttujR2wgHJXW1DcHft771rS6AI8ASga0gjhPN8Wtf+1o7++yzezu0UyRGQiqvU0HdfvrTn3axIpGgOiqfIFKdCC9FWiTKIkobRYRFZavDF77whR5B78c//nGvi33VRYS85UXUVYJI0f6Mo3PPPbdHlSTK0xdTtdvyhGODsI6o7pRTTul9MDw2RpP+ldjAGCKy9D27sI9joaI5TtTniwNhoWOBqPCSSy7p4tgrr7yyj22C28VFVE8iYra44oorettAOOn8oF0RGIYQQgghhBDC1MkMKoQQQgghhBBCCCGE
}
},
"cell_type": "markdown",
"id": "b6d466cc-aa8b-4baf-a80a-fef01921ca8d",
"metadata": {},
"source": [
"## Docugami RAG over XML Knowledge Graphs (KG-RAG)\n",
"\n",
"Many documents contain a mixture of content types, including text and tables. \n",
"\n",
"Semi-structured data can be challenging for conventional RAG for a few reasons since semantics may be lost by text-only chunking techniques, e.g.: \n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search \n",
"\n",
"Docugami deconstructs documents into XML Knowledge Graphs consisting of hierarchical semantic chunks using the XML data model. This cookbook shows how to perform RAG using XML Knowledge Graphs as input (**KG-RAG**):\n",
"\n",
"* We will use [Docugami](http://docugami.com/) to segment out text and table chunks from documents (PDF \\[scanned or digital\\], DOC or DOCX) including semantic XML markup in the chunks.\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text (including semantic XML markup) along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
"\n",
"The overall flow is here:\n",
"\n",
"![image.png](attachment:image.png)\n",
"\n",
"## Packages"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5740fc70-c513-4ff4-9d72-cfc098f85fef",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain docugami==0.0.8 dgml-utils==0.3.0 pydantic langchainhub langchain-chroma hnswlib --upgrade --quiet"
]
},
{
"cell_type": "markdown",
"id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
"metadata": {},
"source": [
"Docugami processes documents in the cloud, so you don't need to install any additional local dependencies. "
]
},
{
"cell_type": "markdown",
"id": "c6fb4903-f845-4907-ae14-df305891b0ff",
"metadata": {},
"source": [
"## Data Loading\n",
"\n",
"Let's use Docugami to process some documents. Here's what you need to get started:\n",
"\n",
"1. Create a [Docugami workspace](http://www.docugami.com) (free trials available)\n",
"1. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api).\n",
"1. Add your documents (PDF \\[scanned or digital\\], DOC or DOCX) to Docugami for processing. There are two ways to do this:\n",
" 1. Use the simple Docugami web experience. [Detailed instructions](https://help.docugami.com/home/adding-documents).\n",
" 1. Use the [Docugami API](https://api-docs.docugami.com), specifically the [documents](https://api-docs.docugami.com/#tag/documents/operation/upload-document) endpoint. You can also use the [docugami python library](https://pypi.org/project/docugami/) as a convenient wrapper.\n",
"\n",
"Once your documents are in Docugami, they are processed and organized into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. Docugami is not limited to any particular types of documents, and the clusters created depend on your particular documents. You can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later if you wish. You can monitor file status in the simple Docugami webapp, or use a [webhook](https://api-docs.docugami.com/#tag/webhooks) to be informed when your documents are done processing.\n",
"\n",
"You can also use the [Docugami API](https://api-docs.docugami.com) or the [docugami](https://pypi.org/project/docugami/) python library to do all the file processing without visiting the Docugami webapp except to get the API key.\n",
"\n",
"> You can get an API key as documented here: https://help.docugami.com/home/docugami-api. This following code assumes you have set the `DOCUGAMI_API_TOKEN` environment variable.\n",
"\n",
"First, let's define two simple helper methods to upload files and wait for them to finish processing."
]
},
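{
"cell_type": "markdown",
"id": "f3d1a2b4-9c8e-4f51-a6d2-7e0b1c2d3e4f",
"metadata": {},
"source": [
"If you haven't already exported `DOCUGAMI_API_TOKEN` in your shell, a minimal sketch like the following sets it for the notebook session (the value below is a placeholder, not a real key):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9b8c7d6-e5f4-4a3b-9c2d-1e0f9a8b7c6d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Placeholder value: replace with your real key from the Docugami Developer Playground.\n",
"if \"DOCUGAMI_API_TOKEN\" not in os.environ:\n",
"    os.environ[\"DOCUGAMI_API_TOKEN\"] = \"<your-docugami-api-token>\""
]
},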
{
"cell_type": "code",
"execution_count": 3,
"id": "ce0b2b21-7623-46e7-ae2c-3a9f67e8b9b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Report_CEN23LA277_192541.pdf': '/tmp/tmpa0c77x46',\n",
" 'Report_CEN23LA338_192753.pdf': '/tmp/tmpaftfld2w',\n",
" 'Report_CEN23LA363_192876.pdf': '/tmp/tmpn7gp6be2',\n",
" 'Report_CEN23LA394_192995.pdf': '/tmp/tmp9udymprf',\n",
" 'Report_ERA23LA114_106615.pdf': '/tmp/tmpxdjbh4r_',\n",
" 'Report_WPR23LA254_192532.pdf': '/tmp/tmpz6h75a0h'}\n"
]
}
],
"source": [
"from pprint import pprint\n",
"\n",
"from docugami import Docugami\n",
"from docugami.lib.upload import upload_to_named_docset, wait_for_dgml\n",
"\n",
"#### START DOCSET INFO (please change this values as needed)\n",
"DOCSET_NAME = \"NTSB Aviation Incident Reports\"\n",
"FILE_PATHS = [\n",
" \"/Users/tjaffri/ntsb/Report_CEN23LA277_192541.pdf\",\n",
" \"/Users/tjaffri/ntsb/Report_CEN23LA338_192753.pdf\",\n",
" \"/Users/tjaffri/ntsb/Report_CEN23LA363_192876.pdf\",\n",
" \"/Users/tjaffri/ntsb/Report_CEN23LA394_192995.pdf\",\n",
" \"/Users/tjaffri/ntsb/Report_ERA23LA114_106615.pdf\",\n",
" \"/Users/tjaffri/ntsb/Report_WPR23LA254_192532.pdf\",\n",
"]\n",
"\n",
"# Note: Please specify ~6 (or more!) similar files to process together as a document set\n",
"# This is currently a requirement for Docugami to automatically detect motifs\n",
"# across the document set to generate a semantic XML Knowledge Graph.\n",
"assert len(FILE_PATHS) > 5, \"Please provide at least 6 files\"\n",
"#### END DOCSET INFO\n",
"\n",
"dg_client = Docugami()\n",
"dg_docs = upload_to_named_docset(dg_client, FILE_PATHS, DOCSET_NAME)\n",
"dgml_paths = wait_for_dgml(dg_client, dg_docs)\n",
"\n",
"pprint(dgml_paths)"
]
},
{
"cell_type": "markdown",
"id": "01f035e5-c3f8-4d23-9d1b-8d2babdea8e9",
"metadata": {},
"source": [
"If you are on the free Docugami tier, your files should be done in ~15 minutes or less depending on the number of pages uploaded and available resources (please contact Docugami for paid plans for faster processing). You can re-run the code above without reprocessing your files to continue waiting if your notebook is not continuously running (it does not re-upload)."
]
},
{
"cell_type": "markdown",
"id": "7c24efa9-b6f6-4dc2-bfe3-70819ba3ef75",
"metadata": {},
"source": [
"### Partition PDF tables and text\n",
"\n",
"You can use the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) to very easily get chunks for your documents, including semantic and structural metadata. This is the simpler and recommended approach for most use cases but in this notebook let's explore using the `dgml-utils` library to explore the segmented output for this file in more detail by processing the XML we just downloaded above."
]
},
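{
"cell_type": "markdown",
"id": "c1d2e3f4-0a1b-4c2d-8e3f-4a5b6c7d8e9f",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the loader path (not run in this notebook). The `docset_id` value is a placeholder you can copy from your docset's URL in the Docugami webapp, and the loader picks up your API token from the environment as described in its docs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2e3f4a5-1b2c-4d3e-9f4a-5b6c7d8e9f0a",
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the simpler DocugamiLoader path; docset_id below is a placeholder.\n",
"from langchain_community.document_loaders import DocugamiLoader\n",
"\n",
"loader = DocugamiLoader(docset_id=\"<your-docset-id>\")\n",
"# docs = loader.load()  # LangChain Documents with semantic + structural metadata"
]
},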
{
"cell_type": "code",
"execution_count": 4,
"id": "05fcdd57-090f-44bf-a1fb-2c3609c80e34",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found 30 chunks, here are the first few\n",
"<AviationInvestigationFinalReport-section>Aviation </AviationInvestigationFinalReport-section>Investigation Final Report\n",
"<table><tbody><tr><td>Location: </td> <td><Location><TownName>Elbert</TownName>, <USState>Colorado </USState></Location></td> <td>Accident Number: </td> <td><AccidentNumber>CEN23LA277 </AccidentNumber></td></tr> <tr><td><LocationDateTime>Date &amp; Time: </LocationDateTime></td> <td><DateTime><EventDate>June 26, 2023</EventDate>, <EventTime>11:00 Local </EventTime></DateTime></td> <td><DateTimeAccidentNumber>Registration: </DateTimeAccidentNumber></td> <td><Registration>N23161 </Registration></td></tr> <tr><td><LocationAircraft>Aircraft: </LocationAircraft></td> <td><AircraftType>Piper <AircraftType>J3C-50 </AircraftType></AircraftType></td> <td><AircraftAccidentNumber>Aircraft Damage: </AircraftAccidentNumber></td> <td><AircraftDamage>Substantial </AircraftDamage></td></tr> <tr><td><LocationDefiningEvent>Defining Event: </LocationDefiningEvent></td> <td><DefiningEvent>Nose over/nose down </DefiningEvent></td> <td><DefiningEventAccidentNumber>Injuries: </DefiningEventAccidentNumber></td> <td><Injuries><Minor>1 </Minor>Minor </Injuries></td></tr> <tr><td><LocationFlightConductedUnder>Flight Conducted Under: </LocationFlightConductedUnder></td> <td><FlightConductedUnder><Part91-cell>Part <RegulationPart>91</RegulationPart>: General aviation - Personal </Part91-cell></FlightConductedUnder></td><td/><td><FlightConductedUnderCEN23LA277/></td></tr></tbody></table>\n",
"Analysis\n",
"<TakeoffAccident> <Analysis>The pilot reported that, as the tail lifted during takeoff, the airplane veered left. He attempted to correct with full right rudder and full brakes. However, the airplane subsequently nosed over resulting in substantial damage to the fuselage, lift struts, rudder, and vertical stabilizer. </Analysis></TakeoffAccident>\n",
"<AircraftCondition> The pilot reported that there were no preaccident mechanical malfunctions or anomalies with the airplane that would have precluded normal operation. </AircraftCondition>\n",
"<WindConditions> At about the time of the accident, wind was from <WindDirection>180</WindDirection>° at <WindConditions>5 </WindConditions>knots. The pilot decided to depart on runway <Runway>35 </Runway>due to the prevailing airport traffic. He stated that departing with “more favorable wind conditions” may have prevented the accident. </WindConditions>\n",
"<ProbableCauseAndFindings-section>Probable Cause and Findings </ProbableCauseAndFindings-section>\n",
"<ProbableCause> The <ProbableCause>National Transportation Safety Board </ProbableCause>determines the probable cause(s) of this accident to be: </ProbableCause>\n",
"<AccidentCause> The pilot's loss of directional control during takeoff and subsequent excessive use of brakes which resulted in a nose-over. Contributing to the accident was his decision to takeoff downwind. </AccidentCause>\n",
"Page 1 of <PageNumber>5 </PageNumber>\n"
]
}
],
"source": [
"from pathlib import Path\n",
"\n",
"from dgml_utils.segmentation import get_chunks_str\n",
"\n",
"# Here we just read the first file, you can do the same for others\n",
"dgml_path = dgml_paths[Path(FILE_PATHS[0]).name]\n",
"\n",
"with open(dgml_path, \"r\") as file:\n",
" contents = file.read().encode(\"utf-8\")\n",
"\n",
" chunks = get_chunks_str(\n",
" contents,\n",
" include_xml_tags=True, # Ensures Docugami XML semantic tags are included in the chunked output (set to False for text-only chunks and tables as Markdown)\n",
" max_text_length=1024 * 8, # 8k chars are ~2k tokens for OpenAI.\n",
" # Ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them\n",
" )\n",
"\n",
" print(f\"found {len(chunks)} chunks, here are the first few\")\n",
" for chunk in chunks[:10]:\n",
" print(chunk.text)"
]
},
{
"cell_type": "markdown",
"id": "bfc1f2c9-e6d4-4d98-a799-6bc30bc61661",
"metadata": {},
"source": [
"The file processed by Docugami in the example above was [this one](https://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/192541/pdf) from the NTSB and you can look at the PDF side by side to compare the XML chunks above. \n",
"\n",
"If you want text based chunks instead, Docugami also supports those and renders tables as markdown:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8a4b49e0-de78-4790-a930-ad7cf324697a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found 30 chunks, here are the first few\n",
"Aviation Investigation Final Report\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"| Location: | Elbert , Colorado | Accident Number: | CEN23LA277 |\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"| Date & Time: | June 26, 2023 , 11:00 Local | Registration: | N23161 |\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"| Aircraft: | Piper J3C-50 | Aircraft Damage : | Substantial |\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"| Defining Event: | Nose over/nose down | Injuries: | 1 Minor |\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"| Flight Conducted Under: | Part 91 : General aviation - Personal | | |\n",
"+-------------------------+---------------------------------------+-------------------+-------------+\n",
"Analysis\n",
"The pilot reported that, as the tail lifted during takeoff, the airplane veered left. He attempted to correct with full right rudder and full brakes. However, the airplane subsequently nosed over resulting in substantial damage to the fuselage, lift struts, rudder, and vertical stabilizer.\n",
"The pilot reported that there were no preaccident mechanical malfunctions or anomalies with the airplane that would have precluded normal operation.\n",
"At about the time of the accident, wind was from 180 ° at 5 knots. The pilot decided to depart on runway 35 due to the prevailing airport traffic. He stated that departing with “more favorable wind conditions” may have prevented the accident.\n",
"Probable Cause and Findings\n",
"The National Transportation Safety Board determines the probable cause(s) of this accident to be:\n",
"The pilot's loss of directional control during takeoff and subsequent excessive use of brakes which resulted in a nose-over. Contributing to the accident was his decision to takeoff downwind.\n",
"Page 1 of 5\n"
]
}
],
"source": [
"with open(dgml_path, \"r\") as file:\n",
" contents = file.read().encode(\"utf-8\")\n",
"\n",
" chunks = get_chunks_str(\n",
" contents,\n",
" include_xml_tags=False, # text-only chunks and tables as Markdown\n",
" max_text_length=1024\n",
" * 8, # 8k chars are ~2k tokens for OpenAI. Ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them\n",
" )\n",
"\n",
" print(f\"found {len(chunks)} chunks, here are the first few\")\n",
" for chunk in chunks[:10]:\n",
" print(chunk.text)"
]
},
{
"cell_type": "markdown",
"id": "1cfc06bc-67d2-46dd-b04d-95efa3619d0a",
"metadata": {},
"source": [
"## Docugami XML Deep Dive: Jane Doe NDA Example\n",
"\n",
"Let's explore the Docugami XML output for a different example PDF file (a long form contract): [Jane Doe NDA](https://github.com/docugami/dgml-utils/blob/main/python/tests/test_data/article/Jane%20Doe%20NDA.pdf). We have provided processed Docugami XML output for this PDF here: https://github.com/docugami/dgml-utils/blob/main/python/tests/test_data/article/Jane%20Doe.xml so you can follow along without processing your own documents."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7b697d30-1e94-47f0-87e8-f81d4b180da2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"39"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"\n",
"# Download XML from known URL\n",
"dgml = requests.get(\n",
" \"https://raw.githubusercontent.com/docugami/dgml-utils/main/python/tests/test_data/article/Jane%20Doe.xml\"\n",
").text\n",
"chunks = get_chunks_str(dgml, include_xml_tags=True)\n",
"len(chunks)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "14714576-6e1d-499b-bcc8-39140bb2fd78",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'h1': 9, 'div': 12, 'p': 3, 'lim h1': 9, 'lim': 1, 'table': 1, 'h1 div': 4}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count all the different structure categories\n",
"category_counts = {}\n",
"\n",
"for element in chunks:\n",
" category = element.structure\n",
" if category in category_counts:\n",
" category_counts[category] += 1\n",
" else:\n",
" category_counts[category] = 1\n",
"\n",
"category_counts"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5462f29e-fd59-4e0e-9493-ea3b560e523e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 1 tables\n",
"There are 38 text elements\n"
]
}
],
"source": [
"# Tables\n",
"table_elements = [c for c in chunks if \"table\" in c.structure.split()]\n",
"print(f\"There are {len(table_elements)} tables\")\n",
"\n",
"# Text\n",
"text_elements = [c for c in chunks if \"table\" not in c.structure.split()]\n",
"print(f\"There are {len(text_elements)} text elements\")"
]
},
{
"cell_type": "markdown",
"id": "dc09ba64-4973-4471-9501-54294c1143fc",
"metadata": {},
"source": [
"The Docugami XML contains extremely detailed semantics and visual bounding boxes for all elements. The `dgml-utils` library parses text and non-text elements into formats appropriate to pass into LLMs (chunked text with XML semantic labels)"
]
},
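{
"cell_type": "markdown",
"id": "e4f5a6b7-2c3d-4e5f-8a9b-0c1d2e3f4a5b",
"metadata": {},
"source": [
"As a quick illustration (a minimal sketch; exact tag and attribute names vary by document), you can also walk the raw DGML tree directly with `lxml` to see the markup behind these chunks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5a6b7c8-3d4e-4f5a-9b0c-1d2e3f4a5b6c",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the raw DGML downloaded above; `dgml` is the XML string from the earlier cell.\n",
"from lxml import etree\n",
"\n",
"root = etree.fromstring(dgml.encode(\"utf-8\"))\n",
"elements = [el for el in root.iter() if isinstance(el.tag, str)]  # skip comments/PIs\n",
"print(f\"total elements: {len(elements)}\")\n",
"for el in elements[:5]:\n",
"    # Local tag name plus any attributes (layout details, bounding boxes, etc.)\n",
"    print(etree.QName(el).localname, dict(el.attrib))"
]
},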
{
"cell_type": "code",
"execution_count": 9,
"id": "2b4ece00-2e43-4254-adc9-66dbb79139a6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NON-DISCLOSURE AGREEMENT\n",
"<MUTUALNON-DISCLOSUREAGREEMENT> This Non-Disclosure Agreement (\"Agreement\") is entered into as of <EffectiveDate>November 4, 2023 </EffectiveDate>(\"Effective Date\"), by and between: </MUTUALNON-DISCLOSUREAGREEMENT>\n",
"Disclosing Party:\n",
"<DisclosingParty><PrincipalPlaceofBusiness>Widget Corp.</PrincipalPlaceofBusiness>, a <USState>Delaware </USState>corporation with its principal place of business at <PrincipalPlaceofBusiness><PrincipalPlaceofBusiness> <WidgetCorpAddress>123 </WidgetCorpAddress> <PrincipalPlaceofBusiness>Innovation Drive</PrincipalPlaceofBusiness> </PrincipalPlaceofBusiness> , <PrincipalPlaceofBusiness>Techville</PrincipalPlaceofBusiness>, <USState> Delaware</USState>, <PrincipalPlaceofBusiness>12345 </PrincipalPlaceofBusiness></PrincipalPlaceofBusiness> (\"<Org> <CompanyName>Widget </CompanyName> <CorporateName>Corp.</CorporateName> </Org>\") </DisclosingParty>\n",
"Receiving Party:\n",
"<RecipientName>Jane Doe</RecipientName>, an individual residing at <RecipientAddress><RecipientAddress> <RecipientAddress>456 </RecipientAddress> <RecipientAddress>Privacy Lane</RecipientAddress> </RecipientAddress> , <RecipientAddress>Safetown</RecipientAddress>, <USState> California</USState>, <RecipientAddress>67890 </RecipientAddress></RecipientAddress> (\"Recipient\")\n",
"(collectively referred to as the \"Parties\").\n",
"1. Definition of Confidential Information\n",
"<DefinitionofConfidentialInformation>For purposes of this Agreement, \"Confidential Information\" shall include all information or material that has or could have commercial value or other utility in the business in which Disclosing Party is engaged. If Confidential Information is in written form, the Disclosing Party shall label or stamp the materials with the word \"Confidential\" or some similar warning. If Confidential Information is transmitted orally, the Disclosing Party shall promptly provide writing indicating that such oral communication constituted Confidential Information . </DefinitionofConfidentialInformation>\n",
"2. Exclusions from Confidential Information\n",
"<ExclusionsFromConfidentialInformation>Recipient's obligations under this Agreement do not extend to information that is: (a) publicly known at the time of disclosure or subsequently becomes publicly known through no fault of the Recipient; (b) discovered or created by the Recipient before disclosure by Disclosing Party; (c) learned by the Recipient through legitimate means other than from the Disclosing Party or Disclosing Party's representatives; or (d) is disclosed by Recipient with Disclosing Party's prior written approval. </ExclusionsFromConfidentialInformation>\n",
"3. Obligations of Receiving Party\n",
"<ObligationsofReceivingParty>Recipient shall hold and maintain the Confidential Information in strictest confidence for the sole and exclusive benefit of the Disclosing Party. Recipient shall carefully restrict access to Confidential Information to employees, contractors, and third parties as is reasonably required and shall require those persons to sign nondisclosure restrictions at least as protective as those in this Agreement. </ObligationsofReceivingParty>\n",
"4. Time Periods\n",
"<TimePeriods>The nondisclosure provisions of this Agreement shall survive the termination of this Agreement and Recipient's duty to hold Confidential Information in confidence shall remain in effect until the Confidential Information no longer qualifies as a trade secret or until Disclosing Party sends Recipient written notice releasing Recipient from this Agreement, whichever occurs first. </TimePeriods>\n",
"5. Relationships\n",
"<Relationships>Nothing contained in this Agreement shall be deemed to constitute either party a partner, joint venture, or employee of the other party for any purpose. </Relationships>\n",
"6. Severability\n",
"<Severability>If a court finds any provision of this Agreement invalid or unenforceable, the remainder of this Agreement shall be interpreted so as best to effect the intent of the parties. </Severability>\n",
"7. Integration\n"
]
}
],
"source": [
"for element in text_elements[:20]:\n",
" print(element.text)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "08350119-aa22-4ec1-8f65-b1316a0d4123",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<table> <tbody> <tr> <td> Authorized Individual </td> <td> Role </td> <td>Purpose of Disclosure </td> </tr> <tr> <td> <AuthorizedIndividualJohnSmith> <Name>John Smith </Name> </AuthorizedIndividualJohnSmith> </td> <td> <JohnSmithRole> <ProjectManagerName>Project Manager </ProjectManagerName> </JohnSmithRole> </td> <td> <JohnSmithPurposeofDisclosure> Oversee project to which the NDA relates </JohnSmithPurposeofDisclosure> </td> </tr> <tr> <td> <AuthorizedIndividualLisaWhite> <Author>Lisa White </Author> </AuthorizedIndividualLisaWhite> </td> <td> <LisaWhiteRole> Lead Developer </LisaWhiteRole> </td> <td> <LisaWhitePurposeofDisclosure>Software development and analysis </LisaWhitePurposeofDisclosure> </td> </tr> <tr> <td> <AuthorizedIndividualMichaelBrown> <Name>Michael Brown </Name> </AuthorizedIndividualMichaelBrown> </td> <td> <MichaelBrownRole> Financial <FinancialAnalyst> Analyst </FinancialAnalyst> </MichaelBrownRole> </td> <td> <MichaelBrownPurposeofDisclosure>Financial analysis and reporting </MichaelBrownPurposeofDisclosure> </td> </tr> </tbody> </table>\n"
]
}
],
"source": [
"print(table_elements[0].text)"
]
},
{
"cell_type": "markdown",
"id": "dca87b46-c0c2-4973-94ec-689c18075653",
"metadata": {},
"source": [
"The XML markup contains structural as well as semantic tags, which provide additional semantics to the LLM for improved retrieval and generation.\n",
"\n",
"If you prefer, you can set `include_xml_tags=False` in the `get_chunks_str` call above to not include XML markup. The text-only Docugami chunks are still very good since they follow the structural and semantic contours of the document rather than whitespace-only chunking. Tables are rendered as markdown in this case, so that some structural context is maintained even without the XML markup."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "bcac8294-c54a-4b6e-af9d-3911a69620b2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------+-------------------+------------------------------------------+\n",
"| Authorized Individual | Role | Purpose of Disclosure |\n",
"+-----------------------+-------------------+------------------------------------------+\n",
"| John Smith | Project Manager | Oversee project to which the NDA relates |\n",
"+-----------------------+-------------------+------------------------------------------+\n",
"| Lisa White | Lead Developer | Software development and analysis |\n",
"+-----------------------+-------------------+------------------------------------------+\n",
"| Michael Brown | Financial Analyst | Financial analysis and reporting |\n",
"+-----------------------+-------------------+------------------------------------------+\n"
]
}
],
"source": [
"chunks_as_text = get_chunks_str(dgml, include_xml_tags=False)\n",
"table_elements_as_text = [c for c in chunks_as_text if \"table\" in c.structure.split()]\n",
"\n",
"print(table_elements_as_text[0].text)"
]
},
{
"cell_type": "markdown",
"id": "731b3dfc-7ddf-4a11-9a30-9a79b7c66e16",
"metadata": {},
"source": [
"## Multi-vector retriever\n",
"\n",
"Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
"\n",
"With the summary, we will also store the raw table elements.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
"\n",
"### Summaries"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8e275736-3408-4d7a-990e-4362c88e81f8",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import (\n",
" ChatPromptTemplate,\n",
" HumanMessagePromptTemplate,\n",
" SystemMessagePromptTemplate,\n",
")\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "markdown",
"id": "37b65677-aeb4-44fd-b06d-4539341ede97",
"metadata": {},
"source": [
"We create a simple summarize chain for each element.\n",
"\n",
"You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
"\n",
"```\n",
"from langchain import hub\n",
"obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1b12536a-1303-41ad-9948-4eb5a5f32614",
"metadata": {},
"outputs": [],
"source": [
"# Prompt\n",
"prompt_text = \"\"\"You are an assistant tasked with summarizing tables and text. \\ \n",
"Give a concise summary of the table or text. Table or text chunk: {element} \"\"\"\n",
"prompt = ChatPromptTemplate.from_template(prompt_text)\n",
"\n",
"# Summary chain\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"summarize_chain = {\"element\": lambda x: x} | prompt | model | StrOutputParser()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "8d8b567c-b442-4bf0-b639-04bd89effc62",
"metadata": {},
"outputs": [],
"source": [
"# Apply summarizer to tables\n",
"tables = [i.text for i in table_elements]\n",
"table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})"
]
},
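{
"cell_type": "markdown",
"id": "a6b7c8d9-4e5f-4a6b-8c9d-2e3f4a5b6c7d",
"metadata": {},
"source": [
"It's worth spot-checking a generated summary before indexing, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7c8d9e0-5f6a-4b7c-9d0e-3f4a5b6c7d8e",
"metadata": {},
"outputs": [],
"source": [
"# Spot-check the generated summary for the first (and in this doc, only) table.\n",
"print(table_summaries[0])"
]
},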
{
"cell_type": "markdown",
"id": "60524010-754f-4924-ad75-78cb54ca7257",
"metadata": {},
"source": [
"### Add to vectorstore\n",
"\n",
"Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n",
"\n",
"* `InMemoryStore` stores the raw text, tables\n",
"* `vectorstore` stores the embedded summaries"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "346c3a02-8fea-4f75-a69e-fc9542b99dbc",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
"from langchain.storage import InMemoryStore\n",
"from langchain_chroma import Chroma\n",
"from langchain_core.documents import Document\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"\n",
"def build_retriever(text_elements, tables, table_summaries):\n",
" # The vectorstore to use to index the child chunks\n",
" vectorstore = Chroma(\n",
" collection_name=\"summaries\", embedding_function=OpenAIEmbeddings()\n",
" )\n",
"\n",
" # The storage layer for the parent documents\n",
" store = InMemoryStore()\n",
" id_key = \"doc_id\"\n",
"\n",
" # The retriever (empty to start)\n",
" retriever = MultiVectorRetriever(\n",
" vectorstore=vectorstore,\n",
" docstore=store,\n",
" id_key=id_key,\n",
" )\n",
"\n",
" # Add texts\n",
" texts = [i.text for i in text_elements]\n",
" doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
" retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
"\n",
" # Add tables and summaries\n",
" table_ids = [str(uuid.uuid4()) for _ in tables]\n",
" summary_tables = [\n",
" Document(page_content=s, metadata={id_key: table_ids[i]})\n",
" for i, s in enumerate(table_summaries)\n",
" ]\n",
" retriever.vectorstore.add_documents(summary_tables)\n",
" retriever.docstore.mset(list(zip(table_ids, tables)))\n",
" return retriever\n",
"\n",
"\n",
"retriever = build_retriever(text_elements, tables, table_summaries)"
]
},
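{
"cell_type": "markdown",
"id": "c8d9e0f1-6a7b-4c8d-0e1f-4a5b6c7d8e9f",
"metadata": {},
"source": [
"Before wiring the retriever into a chain, a quick sanity check (a sketch using the retriever's runnable `invoke` interface). Note that in `build_retriever` above only the table summaries are embedded (the text chunks live only in the docstore), so vector search will surface the table; and since we stored plain strings in the docstore, results come back as text rather than `Document` objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9e0f1a2-7b8c-4d9e-1f2a-5b6c7d8e9f0a",
"metadata": {},
"outputs": [],
"source": [
"# Retrieval sanity check: summaries are embedded, but the retriever returns the raw stored chunks.\n",
"retrieved = retriever.invoke(\"Who is authorized to receive confidential information?\")\n",
"print(f\"retrieved {len(retrieved)} chunks\")\n",
"for item in retrieved[:2]:\n",
"    print(str(item)[:200])"
]
},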
{
"cell_type": "markdown",
"id": "1d8bbbd9-009b-4b34-a206-5874a60adbda",
"metadata": {},
"source": [
"## RAG\n",
"\n",
"Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f2489de4-51e3-48b4-bbcd-ed9171deadf3",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"system_prompt = SystemMessagePromptTemplate.from_template(\n",
" \"You are a helpful assistant that answers questions based on provided context. Your provided context can include text or tables, \"\n",
" \"and may also contain semantic XML markup. Pay attention the semantic XML markup to understand more about the context semantics as \"\n",
" \"well as structure (e.g. lists and tabular layouts expressed with HTML-like tags)\"\n",
")\n",
"\n",
"human_prompt = HumanMessagePromptTemplate.from_template(\n",
" \"\"\"Context:\n",
"\n",
" {context}\n",
"\n",
" Question: {question}\"\"\"\n",
")\n",
"\n",
"\n",
"def build_chain(retriever, model):\n",
" prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])\n",
"\n",
" # LLM\n",
" model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"\n",
" # RAG pipeline\n",
" chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
" )\n",
"\n",
" return chain\n",
"\n",
"\n",
"chain = build_chain(retriever, model)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "636e992f-823b-496b-a082-8b4fcd479de5",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The people authorized to receive confidential information and their roles are:\n",
"\n",
"1. John Smith - Project Manager\n",
"2. Lisa White - Lead Developer\n",
"3. Michael Brown - Financial Analyst\n"
]
}
],
"source": [
"result = chain.invoke(\n",
" \"Name all the people authorized to receive confidential information, and their roles\"\n",
")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "37f46054-e239-4ba8-af81-22d0d6a9bc32",
"metadata": {},
"source": [
"We can check the [trace](https://smith.langchain.com/public/21b3aa16-4ef3-40c3-92f6-3f0ceab2aedb/r) to see what chunks were retrieved.\n",
"\n",
"This includes Table 1 in the doc, showing the disclosures table as XML markup (same one as above)"
]
},
{
"cell_type": "markdown",
"id": "86cad5db-81fe-4ae6-a20e-550b85fcbe96",
"metadata": {},
"source": [
"# RAG on Llama2 paper\n",
"\n",
"Let's run the same Llama2 paper example from the [Semi_Structured_RAG.ipynb](./Semi_Structured_RAG.ipynb) notebook to see if we get the same results, and to contrast the table chunk returned by Docugami with the ones returned from Unstructured."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "0e4a2f43-dd48-4ae3-8e27-7e87d169965f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"669"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dgml = requests.get(\n",
" \"https://raw.githubusercontent.com/docugami/dgml-utils/main/python/tests/test_data/arxiv/2307.09288.xml\"\n",
").text\n",
"llama2_chunks = get_chunks_str(dgml, include_xml_tags=True)\n",
"len(llama2_chunks)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "56b78fb3-603d-4343-ae72-be54a3c5dd72",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 33 tables\n",
"There are 636 text elements\n"
]
}
],
"source": [
"# Tables\n",
"llama2_table_elements = [c for c in llama2_chunks if \"table\" in c.structure.split()]\n",
"print(f\"There are {len(llama2_table_elements)} tables\")\n",
"\n",
"# Text\n",
"llama2_text_elements = [c for c in llama2_chunks if \"table\" not in c.structure.split()]\n",
"print(f\"There are {len(llama2_text_elements)} text elements\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "d3cc5ba9-8553-4eda-a5d1-b799751186af",
"metadata": {},
"outputs": [],
"source": [
"# Apply summarizer to tables\n",
"llama2_tables = [i.text for i in llama2_table_elements]\n",
"llama2_table_summaries = summarize_chain.batch(llama2_tables, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "d7c73faf-74cb-400d-8059-b69e2493de38",
"metadata": {},
"outputs": [],
"source": [
"llama2_retriever = build_retriever(\n",
" llama2_text_elements, llama2_tables, llama2_table_summaries\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "4c553722-be42-42ce-83b8-76a17f323f1c",
"metadata": {},
"outputs": [],
"source": [
"llama2_chain = build_chain(llama2_retriever, model)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "65dce40b-f1c3-494a-949e-69a9c9544ddb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The number of training tokens for LLaMA2 is 2.0T for all parameter sizes.'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llama2_chain.invoke(\"What is the number of training tokens for LLaMA2?\")"
]
},
{
"cell_type": "markdown",
"id": "59877edf-9a02-45db-95cb-b7f4234abfa3",
"metadata": {},
"source": [
"We can check the [trace](https://smith.langchain.com/public/5de100c3-bb40-4234-bf02-64bc708686a1/r) to see what chunks were retrieved.\n",
"\n",
"This includes Table 1 in the doc, showing the tokens used for training table as semantic XML markup:\n",
"\n",
"```xml\n",
"<table>\n",
" <tbody>\n",
" <tr>\n",
" <td />\n",
" <td>Training Data </td>\n",
" <td>Params </td>\n",
" <td>Context Length </td>\n",
" <td>\n",
" <Org>GQA </Org>\n",
" </td>\n",
" <td>Tokens </td>\n",
" <td>LR </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Llama <Number>1 </Number></td>\n",
" <td>\n",
" <Llama1TrainingData>See <Person>Touvron </Person>et al. (<Number>2023</Number>) </Llama1TrainingData>\n",
" </td>\n",
" <td>\n",
" <Llama1Params>\n",
" <Number>7B </Number>\n",
" <Number>13B </Number>\n",
" <Number>33B </Number>\n",
" <Number>65B </Number>\n",
" </Llama1Params>\n",
" </td>\n",
" <td>\n",
" <Llama1ContextLength>\n",
" <Number>2k </Number>\n",
" <Number>2k </Number>\n",
" <Number>2k </Number>\n",
" <Number>2k </Number>\n",
" </Llama1ContextLength>\n",
" </td>\n",
" <td>\n",
" <Llama1GQA>✗ ✗ ✗ ✗ </Llama1GQA>\n",
" </td>\n",
" <td>\n",
" <Llama1Tokens><Number>1.0</Number>T <Number>1.0</Number>T <Number>1.4</Number>T <Number>\n",
" 1.4</Number>T </Llama1Tokens>\n",
" </td>\n",
" <td>\n",
" <Llama1LR> 3.0 × <Number>104 </Number> 3.0 × <Number>104 </Number> 1.5 × <Number>\n",
" 104 </Number> 1.5 × <Number>104 </Number></Llama1LR>\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Llama <Number>2 </Number></td>\n",
" <td>\n",
" <Llama2TrainingData>A new mix of publicly available online data </Llama2TrainingData>\n",
" </td>\n",
" <td>\n",
" <Llama2Params><Number>7B </Number>13B <Number>34B </Number><Number>70B </Number></Llama2Params>\n",
" </td>\n",
" <td>\n",
" <Llama2ContextLength>\n",
" <Number>4k </Number>\n",
" <Number>4k </Number>\n",
" <Number>4k </Number>\n",
" <Number>4k </Number>\n",
" </Llama2ContextLength>\n",
" </td>\n",
" <td>\n",
" <Llama2GQA>✗ ✗ ✓ ✓ </Llama2GQA>\n",
" </td>\n",
" <td>\n",
" <Llama2Tokens><Number>2.0</Number>T <Number>2.0</Number>T <Number>2.0</Number>T <Number>\n",
" 2.0</Number>T </Llama2Tokens>\n",
" </td>\n",
" <td>\n",
" <Llama2LR> 3.0 × <Number>104 </Number> 3.0 × <Number>104 </Number> 1.5 × <Number>\n",
" 104 </Number> 1.5 × <Number>104 </Number></Llama2LR>\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "867f8e11-384c-4aa1-8b3e-c59fb8d5fd7d",
"metadata": {},
"source": [
"Finally, you can ask other questions that rely on more subtle parsing of the table, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "d38f1459-7d2b-40df-8dcd-e747f85eb144",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The learning rate for LLaMA2 was 3.0 × 104 for the 7B and 13B models, and 1.5 × 104 for the 34B and 70B models.'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llama2_chain.invoke(\"What was the learning rate for LLaMA2?\")"
]
},
{
"cell_type": "markdown",
"id": "94826165",
"metadata": {},
"source": [
"## Docugami KG-RAG Template\n",
"\n",
"Docugami also provides a [langchain template](https://github.com/docugami/langchain-template-docugami-kg-rag) that you can integrate into your langchain projects.\n",
"\n",
"Here's a walkthrough of how you can do this.\n",
"\n",
"[![Docugami KG-RAG Walkthrough](https://img.youtube.com/vi/xOHOmL1NFMg/0.jpg)](https://www.youtube.com/watch?v=xOHOmL1NFMg)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}