"# Data Extraction and Transformation in ELT Workflows using GPT-4o as an OCR Alternative\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lot of enterprise data is unstructured and locked up in difficult-to-use formats, e.g. PDFs, PPT, PNG, that are not optimized for use with LLMs or databases. As a result this type of data tends to be underutilized for analysis and product development, despite it being so valuable. The traditional way of extracting information from unstructured or non-ideal formats has been to use OCR, but OCR struggles with complex layouts and can have limited multilingual support. Moreover, manually applying transforms to data can be cumbersome and timeconsuming. \n",
"\n",
"The multi-modal capabilities of GPT-4o enable new ways to extract and transform data because of GPT-4o's ability to adapt to different types of documents and to use reasoning for interpreting the content of documents. Here are some reasons why you would choose GPT-4o for your extraction and transformation workflows over traditional methods. \n"
"| **Adaptable**: Handles complex document layouts better, reducing errors | **Schema Adaptability**: Easily transforms data to fit specific schemas for database ingestion |\n",
"| **Multilingual Support**: Seamlessly processes documents in multiple languages | **Dynamic Data Mapping**: Adapts to different data structures and formats, providing flexible transformation rules |\n",
"| **Contextual Understanding**: Extracts meaningful relationships and context, not just text | **Enhanced Insight Generation**: Applies reasoning to create more insightful transformations, enriching the dataset with derived metrics, metadata and relationships |\n",
"| **Multimodality**: Processes various document elements, including images and tables | |\n"
"1. How to extract data from multilingual PDFs \n",
"2. How to transform data according to a schema for loading into a database\n",
"3. How to load transformed data into a database for downstream analysis\n",
"\n",
"We're going to mimic a simple ELT workflow where data is first extracted from PDFs into JSON using GPT-4o, stored in an unstructured format somewhere like a data lake, transformed to fit a schema using GPT-4o, and then finally ingested into a relational database for querying. It's worth noting that you can do all of this with the BatchAPI if you're interested in lowering the cost of this workflow. \n",
"\n",
"![](../images/elt_workflow.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data we'll be using is a set of publicly available 2019 hotel invoices from Germany available on [Jens Walter's GitHub](https://github.com/JensWalter/my-receipts/tree/master/2019/de/hotel), (thank you Jens!). Though hotel invoices generally contain similar information (reservation details, charges, taxes etc.), you'll notice that the invoices present itemized information in different ways and are multilingual containing both German and English. Fortunately GPT-4o can adapt to a variety of different document styles without us having to specify formats and it can seamlessly handle a variety of languages, even in the same document. \n",
"Here is what one of the invoices looks like: \n",
"\n",
"![](../images/sample_hotel_invoice.png)\n",
"\n",
"## Part 1: Extracting data from PDFs using GPT-4o's vision capabilities\n",
"GPT-4o doesn't natively handle PDFs so before we extract any data we'll first need to convert each page into an image and then encode the images as base64. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import fitz # PyMuPDF\n",
"import io\n",
"import os\n",
"from PIL import Image\n",
"import base64\n",
"import json\n",
"\n",
"api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"client = OpenAI(api_key=api_key)\n",
"\n",
"\n",
"@staticmethod\n",
"def encode_image(image_path):\n",
" with open(image_path, \"rb\") as image_file:\n",
"We can then pass each base64 encoded image in a GPT-4o LLM call, specifying a high level of detail and JSON as the response format. We're not concerned about enforcing a schema at this step, we just want all of the data to be extracted regardless of type."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def extract_invoice_data(base64_image):\n",
" system_prompt = f\"\"\"\n",
" You are an OCR-like data extraction tool that extracts hotel invoice data from PDFs.\n",
" \n",
" 1. Please extract the data in this hotel invoice, grouping data according to theme/sub groups, and then output into JSON.\n",
"\n",
" 2. Please keep the keys and values of the JSON in the original language. \n",
"\n",
" 3. The type of data you might encounter in the invoice includes but is not limited to: hotel information, guest information, invoice information,\n",
" room charges, taxes, and total charges etc. \n",
"\n",
" 4. If the page contains no charge data, please output an empty JSON object and don't make up any data.\n",
"\n",
" 5. If there are blank data fields in the invoice, please include them as \"null\" values in the JSON object.\n",
" \n",
" 6. If there are tables in the invoice, capture all of the rows and columns in the JSON object. \n",
" Even if a column is blank, include it as a key in the JSON object with a null value.\n",
" \n",
" 7. If a row is blank denote missing fields with \"null\" values. \n",
" \n",
" 8. Don't interpolate or make up data.\n",
"\n",
" 9. Please maintain the table structure of the charges, i.e. capture all of the rows and columns in the JSON object.\n",
"Because invoice data can span multiple pages in a PDF, we're going to produce JSON objects for each page in the invoice and then append them together. The final invoice extraction will be a single JSON file."
"Each invoice JSON will have different keys depending on what data the original invoice contained, so at this point you can store the unschematized JSON files in a data lake that can handle unstructured data. For simplicity though, we're going to store the files in a folder. Here is what one of the extracted JSON files looks like, you'll notice that even though we didn't specify a schema, GPT-4o was able to understand German and group similar information together. Moreover, if there was a blank field in the invoice GPT-4o transcribed that as \"null\". "
" \"Beschreibung\": \"Premier Inn Frühstücksbuffet\",\n",
" \"MwSt.%\": 19.0,\n",
" \"Betrag\": 9.9,\n",
" \"Zahlung\": null\n",
" },\n",
" {\n",
" \"Datum\": \"25.09.19\",\n",
" \"Uhrzeit\": \"9:50\",\n",
" \"Beschreibung\": \"Premier Inn Frühstücksbuffet\",\n",
" \"MwSt.%\": 19.0,\n",
" \"Betrag\": 9.9,\n",
" \"Zahlung\": null\n",
" },\n",
" {\n",
" \"Datum\": \"26.09.19\",\n",
" \"Uhrzeit\": \"9:50\",\n",
" \"Beschreibung\": \"Premier Inn Frühstücksbuffet\",\n",
" \"MwSt.%\": 19.0,\n",
" \"Betrag\": 9.9,\n",
" \"Zahlung\": null\n",
" },\n",
" {\n",
" \"Datum\": \"27.09.19\",\n",
" \"Uhrzeit\": \"9:50\",\n",
" \"Beschreibung\": \"Premier Inn Frühstücksbuffet\",\n",
" \"MwSt.%\": 19.0,\n",
" \"Betrag\": 9.9,\n",
" \"Zahlung\": null\n",
" }\n",
" ],\n",
" \"Payment Information\": {\n",
" \"Zahlung\": \"550,60\",\n",
" \"Gesamt (Rechnungsbetrag)\": \"550,60\",\n",
" \"Offener Betrag\": \"0,00\",\n",
" \"Bezahlart\": \"Mastercard-Kreditkarte\"\n",
" },\n",
" \"Tax Information\": {\n",
" \"MwSt.%\": [\n",
" {\n",
" \"Rate\": 19.0,\n",
" \"Netto\": 33.28,\n",
" \"MwSt.\": 6.32,\n",
" \"Brutto\": 39.6\n",
" },\n",
" {\n",
" \"Rate\": 7.0,\n",
" \"Netto\": 477.57,\n",
" \"MwSt.\": 33.43,\n",
" \"Brutto\": 511.0\n",
" }\n",
" ]\n",
" }\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Transforming data according to a schema \n",
"\n",
"You've extracted data from PDFs and have likely loaded the unstructured extractions as JSON objects in a data lake. The next step in our ELT workflow is to use GPT-4o to transform the extractions according to our desired schema. This will enable us to ingest any resulting tables into a database. We've decided upon the following schema that broadly covers most of the information we would have seen across the different invoices. This schema will be used to process each raw JSON extraction into our desired schematized JSON and can specify particular formats such as \"date\": \"YYYY-MM-DD\". We're also going to translate the data into English at this step. \n"
" {\"type\": \"text\", \"text\": f\"Transform the following raw JSON data according to the provided schema. Ensure all data is in English and formatted as specified by values in the schema. Here is the raw JSON: {json_raw}\"}\n",
"## Part 3: Loading transformed data into a database \n",
"\n",
"Now that we've schematized all of our data, we can segment it into tables for ingesting into a relational database. In particular, we're going to create four tables: Hotels, Invoices, Charges and Taxes. All of the invoices pertained to one guest, so we won't create a guest table. "
"Now let's check that we've correctly ingested the data by running a sample SQL query to determine the most expensive hotel stay and the same of the hotel! \n",
"You can even automate the generation of SQL queries at this step by using function calling, check out our [cookbook on function calling with model generated arguments](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models#how-to-call-functions-with-model-generated-arguments) to learn how to do that. "
"To recap in this cookbook we showed you how to use GPT-4o for extracting and transforming data that would otherwise be inaccessible for data analysis. If you don't need these workflows to happen in real-time, you can take advantage of OpenAI's BatchAPI to run jobs asynchronously at a much lower cost! "