From 62a3403ce6df10ad3acfe456b4bd4fd01c7607ca Mon Sep 17 00:00:00 2001
From: Katia Gil Guzman
Date: Wed, 28 Feb 2024 15:21:57 +0000
Subject: [PATCH] updated notebook to add setup instructions

---
 examples/Parse_PDF_docs_for_RAG.ipynb | 64 ++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/examples/Parse_PDF_docs_for_RAG.ipynb b/examples/Parse_PDF_docs_for_RAG.ipynb
index 3284b28..aed35ff 100644
--- a/examples/Parse_PDF_docs_for_RAG.ipynb
+++ b/examples/Parse_PDF_docs_for_RAG.ipynb
@@ -17,9 +17,40 @@
    "id": "6163ace6",
    "metadata": {},
    "source": [
-    "## Preparation\n",
+    "## Data preparation\n",
     "\n",
-    "Using multiple python libraries, we can extract text from PDF files and convert each page into an image"
+    "In this section, we will process our input data to prepare it for retrieval.\n",
+    "\n",
+    "We will do this in 2 ways:\n",
+    "\n",
+    "1. Extracting text with pdfminer\n",
+    "2. Converting the PDF pages to images to analyze them with GPT-4V\n",
+    "\n",
+    "You can skip the 1st method if you want to only use the content deducted from the image analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9322058a",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "We need to install a few libraries to convert the PDF to images and extract the text (optional).\n",
+    "\n",
+    "/!\\ You need to install `poppler` on your machine for the `pdf2image` library to work. You can follow the instructions to install it [here](https://pypi.org/project/pdf2image/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1744f6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install pdf2image\n",
+    "%pip install pdfminer\n",
+    "%pip install openai"
    ]
   },
   {
@@ -29,6 +60,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Imports\n",
     "from pdf2image import convert_from_path, convert_from_bytes\n",
     "from pdf2image.exceptions import (\n",
     " PDFInfoNotInstalledError,\n",
@@ -40,6 +72,14 @@
     "from io import BytesIO"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "be1d52da",
+   "metadata": {},
+   "source": [
+    "### File processing"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -181,9 +221,9 @@
    "id": "fee39ce0",
    "metadata": {},
    "source": [
-    "## File processing\n",
+    "### Image analysis with GPT-4V\n",
     "\n",
-    "After extracting text content and images from a PDF file, we'll use GPT-4V to analyze the content based on the images."
+    "After converting a PDF file to multiple images, we'll use GPT-4V to analyze the content based on the images."
    ]
   },
   {
@@ -327,7 +367,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ab77ebe1",
+   "id": "e62c481d",
    "metadata": {},
    "source": [
     "#### Processing all documents"
@@ -348,7 +388,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "d43c70cc",
+   "id": "550db81d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -490,7 +530,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "d55470d5",
+   "id": "eb42b0c3",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -795,7 +835,7 @@
   {
    "cell_type": "code",
    "execution_count": 27,
-   "id": "9632e60b",
+   "id": "e65487f6",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -805,7 +845,7 @@
   {
    "cell_type": "code",
    "execution_count": 28,
-   "id": "a149c188",
+   "id": "b7ec712b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -915,7 +955,7 @@
   {
    "cell_type": "code",
    "execution_count": 36,
-   "id": "34722736",
+   "id": "c859adce",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -2151,7 +2191,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b6b389e4",
+   "id": "f8af3d0d",
    "metadata": {},
    "source": [
     "## Wrapping up\n",
@@ -2174,7 +2214,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "4738ecd9",
+   "id": "4b8b9c89",
    "metadata": {},
    "outputs": [],
    "source": []