From 62a3403ce6df10ad3acfe456b4bd4fd01c7607ca Mon Sep 17 00:00:00 2001
From: Katia Gil Guzman <katia@openai.com>
Date: Wed, 28 Feb 2024 15:21:57 +0000
Subject: [PATCH] updated notebook to add setup instructions

---
 examples/Parse_PDF_docs_for_RAG.ipynb | 64 ++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 12 deletions(-)
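
Reviewer note (not part of the patch): the new Setup cell asks readers to install `poppler` before using `pdf2image`. Below is a minimal sketch of how a reader could check that requirement before running the notebook. It assumes poppler's `pdftoppm` and `pdfinfo` binaries must be discoverable on PATH; the helper name `missing_poppler_tools` is hypothetical and does not appear in the notebook.

```python
import shutil

# Command-line tools shipped with poppler that pdf2image relies on.
# Assumption: they are expected to be discoverable on PATH.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(which=shutil.which):
    """Return the poppler tools that cannot be located.

    `which` is a hypothetical injection point, used only so the
    lookup can be replaced in tests; it defaults to shutil.which.
    """
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]
```

Calling `missing_poppler_tools()` in a notebook cell returns an empty list when both tools are found, and otherwise lists the ones that still need to be installed.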

diff --git a/examples/Parse_PDF_docs_for_RAG.ipynb b/examples/Parse_PDF_docs_for_RAG.ipynb
index 3284b28..aed35ff 100644
--- a/examples/Parse_PDF_docs_for_RAG.ipynb
+++ b/examples/Parse_PDF_docs_for_RAG.ipynb
@@ -17,9 +17,40 @@
    "id": "6163ace6",
    "metadata": {},
    "source": [
-    "## Preparation\n",
+    "## Data preparation\n",
     "\n",
-    "Using multiple python libraries, we can extract text from PDF files and convert each page into an image"
+    "In this section, we will process our input data to prepare it for retrieval.\n",
+    "\n",
+    "We will do this in two ways:\n",
+    "\n",
+    "1. Extracting text with pdfminer\n",
+    "2. Converting the PDF pages to images to analyze them with GPT-4V\n",
+    "\n",
+    "You can skip the first method if you only want to use the content derived from the image analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9322058a",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "We need to install a few libraries to convert the PDF to images and, optionally, to extract the text.\n",
+    "\n",
+    "/!\\ The `pdf2image` library requires `poppler` to be installed on your machine. Installation instructions are available on the [pdf2image project page](https://pypi.org/project/pdf2image/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1744f6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install pdf2image\n",
+    "%pip install pdfminer\n",
+    "%pip install openai"
    ]
   },
   {
@@ -29,6 +60,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Imports\n",
     "from pdf2image import convert_from_path, convert_from_bytes\n",
     "from pdf2image.exceptions import (\n",
     "    PDFInfoNotInstalledError,\n",
@@ -40,6 +72,14 @@
     "from io import BytesIO"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "be1d52da",
+   "metadata": {},
+   "source": [
+    "### File processing"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -181,9 +221,9 @@
    "id": "fee39ce0",
    "metadata": {},
    "source": [
-    "## File processing\n",
+    "### Image analysis with GPT-4V\n",
     "\n",
-    "After extracting text content and images from a PDF file, we'll use GPT-4V to analyze the content based on the images."
+    "After converting the PDF pages to images, we'll use GPT-4V to analyze the content of each image."
    ]
   },
   {
@@ -327,7 +367,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ab77ebe1",
+   "id": "e62c481d",
    "metadata": {},
    "source": [
     "#### Processing all documents"
@@ -348,7 +388,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "d43c70cc",
+   "id": "550db81d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -490,7 +530,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "d55470d5",
+   "id": "eb42b0c3",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -795,7 +835,7 @@
   {
    "cell_type": "code",
    "execution_count": 27,
-   "id": "9632e60b",
+   "id": "e65487f6",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -805,7 +845,7 @@
   {
    "cell_type": "code",
    "execution_count": 28,
-   "id": "a149c188",
+   "id": "b7ec712b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -915,7 +955,7 @@
   {
    "cell_type": "code",
    "execution_count": 36,
-   "id": "34722736",
+   "id": "c859adce",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -2151,7 +2191,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b6b389e4",
+   "id": "f8af3d0d",
    "metadata": {},
    "source": [
     "## Wrapping up\n",
@@ -2174,7 +2214,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "4738ecd9",
+   "id": "4b8b9c89",
    "metadata": {},
    "outputs": [],
    "source": []