updated notebook to add setup instructions

pull/1077/head
Katia Gil Guzman 4 months ago
parent 81570a8360
commit 62a3403ce6

@ -17,9 +17,40 @@
"id": "6163ace6",
"metadata": {},
"source": [
"## Preparation\n",
"## Data preparation\n",
"\n",
"Using multiple python libraries, we can extract text from PDF files and convert each page into an image"
"In this section, we will process our input data to prepare it for retrieval.\n",
"\n",
"We will do this in 2 ways:\n",
"\n",
"1. Extracting text with pdfminer\n",
"2. Converting the PDF pages to images to analyze them with GPT-4V\n",
"\n",
"You can skip the 1st method if you want to only use the content deduced from the image analysis."
]
},
{
"cell_type": "markdown",
"id": "9322058a",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"We need to install a few libraries to convert the PDF to images and extract the text (optional).\n",
"\n",
"/!\\ You need to install `poppler` on your machine for the `pdf2image` library to work. You can follow the instructions to install it [here](https://pypi.org/project/pdf2image/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1744f6e",
"metadata": {},
"outputs": [],
"source": [
"%pip install pdf2image\n",
"%pip install pdfminer\n",
"%pip install openai"
]
},
{
@ -29,6 +60,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"from pdf2image import convert_from_path, convert_from_bytes\n",
"from pdf2image.exceptions import (\n",
" PDFInfoNotInstalledError,\n",
@ -40,6 +72,14 @@
"from io import BytesIO"
]
},
{
"cell_type": "markdown",
"id": "be1d52da",
"metadata": {},
"source": [
"### File processing"
]
},
{
"cell_type": "code",
"execution_count": 2,
@ -181,9 +221,9 @@
"id": "fee39ce0",
"metadata": {},
"source": [
"## File processing\n",
"### Image analysis with GPT-4V\n",
"\n",
"After extracting text content and images from a PDF file, we'll use GPT-4V to analyze the content based on the images."
"After converting a PDF file to multiple images, we'll use GPT-4V to analyze the content based on the images."
]
},
{
@ -327,7 +367,7 @@
},
{
"cell_type": "markdown",
"id": "ab77ebe1",
"id": "e62c481d",
"metadata": {},
"source": [
"#### Processing all documents"
@ -348,7 +388,7 @@
{
"cell_type": "code",
"execution_count": 12,
"id": "d43c70cc",
"id": "550db81d",
"metadata": {},
"outputs": [],
"source": [
@ -490,7 +530,7 @@
{
"cell_type": "code",
"execution_count": 16,
"id": "d55470d5",
"id": "eb42b0c3",
"metadata": {},
"outputs": [],
"source": [
@ -795,7 +835,7 @@
{
"cell_type": "code",
"execution_count": 27,
"id": "9632e60b",
"id": "e65487f6",
"metadata": {},
"outputs": [],
"source": [
@ -805,7 +845,7 @@
{
"cell_type": "code",
"execution_count": 28,
"id": "a149c188",
"id": "b7ec712b",
"metadata": {},
"outputs": [],
"source": [
@ -915,7 +955,7 @@
{
"cell_type": "code",
"execution_count": 36,
"id": "34722736",
"id": "c859adce",
"metadata": {},
"outputs": [],
"source": [
@ -2151,7 +2191,7 @@
},
{
"cell_type": "markdown",
"id": "b6b389e4",
"id": "f8af3d0d",
"metadata": {},
"source": [
"## Wrapping up\n",
@ -2174,7 +2214,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "4738ecd9",
"id": "4b8b9c89",
"metadata": {},
"outputs": [],
"source": []
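
The two preparation paths named in the "Data preparation" cell (text extraction with pdfminer, page-to-image conversion with pdf2image) could be sketched roughly as follows. This is a minimal sketch and not the notebook's actual code: the function names `pdf_to_images`, `pdf_to_text`, and `image_to_data_uri` are hypothetical, and the base64 data-URI step is only an assumption about how page images might be passed to GPT-4V, since that part is not shown in this diff. It also assumes `pdfminer.high_level.extract_text` is available, which is the case with the `pdfminer.six` distribution; `pdf2image` additionally needs `poppler` installed, as the Setup cell notes.

```python
import base64
from io import BytesIO


def pdf_to_images(path):
    """Convert each page of a PDF into a PIL image (requires poppler)."""
    # Imported lazily so the rest of the sketch loads without pdf2image installed.
    from pdf2image import convert_from_path
    return convert_from_path(path)


def pdf_to_text(path):
    """Extract the raw text of a PDF (the optional 1st method)."""
    # Assumption: extract_text comes from the pdfminer.six distribution.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def image_to_data_uri(img, fmt="PNG"):
    """Serialize an image object to a base64 data URI (hypothetical helper).

    `img` only needs a PIL-style `save(buffer, format=...)` method.
    """
    buffer = BytesIO()
    img.save(buffer, format=fmt)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{fmt.lower()};base64,{encoded}"
```

The library imports are kept inside the functions so that skipping the 1st method (text extraction) does not require pdfminer to be installed at all.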
