"Using multiple python libraries, we can extract text from PDF files and convert each page into an image"
"In this section, we will process our input data to prepare it for retrieval.\n",
"\n",
"We will do this in 2 ways:\n",
"\n",
"1. Extracting text with pdfminer\n",
"2. Converting the PDF pages to images to analyze them with GPT-4V\n",
"\n",
"You can skip the 1st method if you want to only use the content deducted from the image analysis."
]
},
{
"cell_type": "markdown",
"id": "9322058a",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"We need to install a few libraries to convert the PDF to images and extract the text (optional).\n",
"\n",
"/!\\ You need to install `poppler` on your machine for the `pdf2image` library to work. You can follow the instructions to install it [here](https://pypi.org/project/pdf2image/)."