">[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform` service to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume. "
"Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.\n",
"\n",
"Learn more:\n",
"\n",
"- [Document AI overview](https://cloud.google.com/document-ai/docs/overview)\n",
"- [Document AI videos and labs](https://cloud.google.com/document-ai/docs/videos)\n",
"The module contains a `PDF` parser based on DocAI from Google Cloud.\n",
"The module contains a `PDF` parser based on DocAI from Google Cloud.\n",
"\n",
"\n",
"You need to install two libraries to use this parser:"
"You need to install two libraries to use this parser:\n"
]
]
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": null,
"execution_count": null,
"id": "34132fab-0069-4942-b68b-5b093ccfc92a",
"id": "c86b2f59",
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
"!pip install google-cloud-documentai\n",
"%pip install google-cloud-documentai\n",
"!pip install google-cloud-documentai-toolbox"
"%pip install google-cloud-documentai-toolbox\n"
]
]
},
},
{
{
@@ -42,8 +48,9 @@
"id": "51946817-798c-4d11-abd6-db2ae53a0270",
"id": "51946817-798c-4d11-abd6-db2ae53a0270",
"metadata": {},
"metadata": {},
"source": [
"source": [
"First, you need to set up a [`GCS` bucket and create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor) \n",
"First, you need to set up a Google Cloud Storage (GCS) bucket and create your own Optical Character Recognition (OCR) processor as described here: https://cloud.google.com/document-ai/docs/create-processor\n",
"The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console."
"\n",
"The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a `PROCESSOR_NAME` should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID` or `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console.\n"
"Let's go and parse an Alphabet's take from here: https://abc.xyz/assets/a7/5b/9e5ae0364b12b4c883f3cf748226/goog-exhibit-99-1-q1-2023-19.pdf. Copy it to your GCS bucket first, and adjust the path below."
"For this example, you can use an Alphabet earnings report that's uploaded to a public GCS bucket.\n",
"You can run end-to-end parsing of a blob one-by-one. If you have many documents, it might be a better approach to batch them together and maybe even detach parsing from handling the results of parsing."
"You can run end-to-end parsing of a blob one-by-one. If you have many documents, it might be a better approach to batch them together and maybe even detach parsing from handling the results of parsing.\n"
]
]
},
},
{
{
@@ -165,7 +167,7 @@
],
],
"source": [
"source": [
"operations = parser.docai_parse([blob])\n",
"operations = parser.docai_parse([blob])\n",
"print([op.operation.name for op in operations])"
"print([op.operation.name for op in operations])\n"
]
]
},
},
{
{
@@ -173,7 +175,7 @@
"id": "a2d24d63-c2c7-454c-9df3-2a9cf51309a6",
"id": "a2d24d63-c2c7-454c-9df3-2a9cf51309a6",
"metadata": {},
"metadata": {},
"source": [
"source": [
"You can check whether operations are finished:"
"You can check whether operations are finished:\n"
]
]
},
},
{
{
@@ -194,7 +196,7 @@
}
}
],
],
"source": [
"source": [
"parser.is_running(operations)"
"parser.is_running(operations)\n"
]
]
},
},
{
{
@@ -202,7 +204,7 @@
"id": "602ca0bc-080a-4a4e-a413-0e705aeab189",
"id": "602ca0bc-080a-4a4e-a413-0e705aeab189",
"metadata": {},
"metadata": {},
"source": [
"source": [
"And when they're finished, you can parse the results:"
"And when they're finished, you can parse the results:\n"
]
]
},
},
{
{
@@ -223,7 +225,7 @@
}
}
],
],
"source": [
"source": [
"parser.is_running(operations)"
"parser.is_running(operations)\n"
]
]
},
},
{
{
@@ -242,7 +244,7 @@
],
],
"source": [
"source": [
"results = parser.get_results(operations)\n",
"results = parser.get_results(operations)\n",
"print(results[0])"
"print(results[0])\n"
]
]
},
},
{
{
@@ -250,7 +252,7 @@
"id": "87e5b606-1679-46c7-9577-4cf9bc93a752",
"id": "87e5b606-1679-46c7-9577-4cf9bc93a752",
"metadata": {},
"metadata": {},
"source": [
"source": [
"And now we can finally generate Documents from parsed results:"
"And now we can finally generate Documents from parsed results:\n"