"from docugami.types import Document as DocugamiDocument\n",
"\n",
@ -166,6 +168,7 @@
{
"cell_type": "code",
"execution_count": 46,
"id": "ce0b2b21-7623-46e7-ae2c-3a9f67e8b9b9",
"metadata": {},
"outputs": [
{
@ -207,6 +210,7 @@
},
{
"cell_type": "markdown",
"id": "01f035e5-c3f8-4d23-9d1b-8d2babdea8e9",
"metadata": {},
"source": [
"If you are on the free Docugami tier, your files should be done in ~15 minutes or less depending on the number of pages uploaded and available resources (please contact Docugami for paid plans for faster processing). You can re-run the code above without reprocessing your files to continue waiting if your notebook is not continuously running (it does not re-upload)."
@ -225,6 +229,7 @@
{
"cell_type": "code",
"execution_count": 47,
"id": "05fcdd57-090f-44bf-a1fb-2c3609c80e34",
"metadata": {},
"outputs": [
{
@ -268,6 +273,7 @@
},
{
"cell_type": "markdown",
"id": "bfc1f2c9-e6d4-4d98-a799-6bc30bc61661",
"metadata": {},
"source": [
"The file processed by Docugami in the example above was [this one](https://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/192541/pdf) from the NTSB and you can look at the PDF side by side to compare the XML chunks above. \n",
@ -278,6 +284,7 @@
{
"cell_type": "code",
"execution_count": 48,
"id": "8a4b49e0-de78-4790-a930-ad7cf324697a",
"metadata": {},
"outputs": [
{
@ -326,6 +333,7 @@
},
{
"cell_type": "markdown",
"id": "1cfc06bc-67d2-46dd-b04d-95efa3619d0a",
"metadata": {},
"source": [
"## Docugami XML Deep Dive: Jane Doe NDA Example\n",
@ -336,6 +344,7 @@
{
"cell_type": "code",
"execution_count": 109,
"id": "7b697d30-1e94-47f0-87e8-f81d4b180da2",
"metadata": {},
"outputs": [
{
@ -361,6 +370,7 @@
{
"cell_type": "code",
"execution_count": 98,
"id": "14714576-6e1d-499b-bcc8-39140bb2fd78",
"metadata": {},
"outputs": [
{
@ -415,6 +425,7 @@
},
{
"cell_type": "markdown",
"id": "dc09ba64-4973-4471-9501-54294c1143fc",
"metadata": {},
"source": [
"The Docugami XML contains extremely detailed semantics and visual bounding boxes for all elements. The `dgml-utils` library parses text and non-text elements into formats appropriate to pass into LLMs (chunked text with XML semantic labels)"
@ -423,6 +434,7 @@
{
"cell_type": "code",
"execution_count": 100,
"id": "2b4ece00-2e43-4254-adc9-66dbb79139a6",
"metadata": {},
"outputs": [
{
@ -460,6 +472,7 @@
{
"cell_type": "code",
"execution_count": 101,
"id": "08350119-aa22-4ec1-8f65-b1316a0d4123",
"metadata": {},
"outputs": [
{
@ -476,6 +489,7 @@
},
{
"cell_type": "markdown",
"id": "dca87b46-c0c2-4973-94ec-689c18075653",
"metadata": {},
"source": [
"The XML markup contains structural as well as semantic tags, which provide additional semantics to the LLM for improved retrieval and generation.\n",
"_PROMPT_TEMPLATE = \"\"\"If someone asks you to perform a task, your job is to come up with a series of bash commands that will perform the task. There is no need to put \"#!/bin/bash\" in your answer. Make sure to reason step by step, using this format:\n",
"Question: \"copy the files in the directory named 'target' into a new directory at the same level as target called 'myNewDirectory'\"\n",