updated notebook to add setup instructions

4 months ago · 62a3403ce6
parent 81570a8360
commit 62a3403ce6
1 changed file with 52 additions and 12 deletions
--- a/examples/Parse_PDF_docs_for_RAG.ipynb
+++ b/examples/Parse_PDF_docs_for_RAG.ipynb
@@ -17,9 +17,40 @@
   "id": "6163ace6",
   "metadata": {},
   "source": [
-    "## Preparation\n",
+    "## Data preparation\n",
    "\n",
-    "Using multiple python libraries, we can extract text from PDF files and convert each page into an image"
+    "In this section, we will process our input data to prepare it for retrieval.\n",
+    "\n",
+    "We will do this in 2 ways:\n",
+    "\n",
+    "1. Extracting text with pdfminer\n",
+    "2. Converting the PDF pages to images to analyze them with GPT-4V\n",
+    "\n",
+    "You can skip the first method if you only want to use the content deduced from the image analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9322058a",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "We need to install a few libraries to convert the PDF to images and, optionally, to extract the text.\n",
+    "\n",
+    "/!\\ You need to install `poppler` on your machine for the `pdf2image` library to work. You can follow the instructions to install it [here](https://pypi.org/project/pdf2image/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1744f6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install pdf2image\n",
+    "%pip install pdfminer\n",
+    "%pip install openai"
   ]
  },
  {
@@ -29,6 +60,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "# Imports\n",
    "from pdf2image import convert_from_path, convert_from_bytes\n",
    "from pdf2image.exceptions import (\n",
    "    PDFInfoNotInstalledError,\n",
@@ -40,6 +72,14 @@
    "from io import BytesIO"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "be1d52da",
+   "metadata": {},
+   "source": [
+    "### File processing"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 2,
@@ -181,9 +221,9 @@
   "id": "fee39ce0",
   "metadata": {},
   "source": [
-    "## File processing\n",
+    "### Image analysis with GPT-4V\n",
    "\n",
-    "After extracting text content and images from a PDF file, we'll use GPT-4V to analyze the content based on the images."
+    "After converting a PDF file to multiple images, we'll use GPT-4V to analyze the content based on the images."
   ]
  },
  {
@@ -327,7 +367,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "ab77ebe1",
+   "id": "e62c481d",
   "metadata": {},
   "source": [
    "#### Processing all documents"
@@ -348,7 +388,7 @@
  {
   "cell_type": "code",
   "execution_count": 12,
-   "id": "d43c70cc",
+   "id": "550db81d",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -490,7 +530,7 @@
  {
   "cell_type": "code",
   "execution_count": 16,
-   "id": "d55470d5",
+   "id": "eb42b0c3",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -795,7 +835,7 @@
  {
   "cell_type": "code",
   "execution_count": 27,
-   "id": "9632e60b",
+   "id": "e65487f6",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -805,7 +845,7 @@
  {
   "cell_type": "code",
   "execution_count": 28,
-   "id": "a149c188",
+   "id": "b7ec712b",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -915,7 +955,7 @@
  {
   "cell_type": "code",
   "execution_count": 36,
-   "id": "34722736",
+   "id": "c859adce",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2151,7 +2191,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "b6b389e4",
+   "id": "f8af3d0d",
   "metadata": {},
   "source": [
    "## Wrapping up\n",
@@ -2174,7 +2214,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "4738ecd9",
+   "id": "4b8b9c89",
   "metadata": {},
   "outputs": [],
   "source": []
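
Note on the new Setup cell: it says `poppler` must be installed for `pdf2image` to work. A quick way to verify this before running the notebook could look like the sketch below. It is not part of this commit. The helper name `missing_poppler_tools` is hypothetical, and it assumes the poppler executables `pdftoppm` and `pdfinfo` are expected on `PATH`.

```python
import shutil

# Command-line tools shipped with poppler that pdf2image relies on.
# Assumption: they are expected to be discoverable on PATH.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("poppler tools found; pdf2image should be able to call them.")
```

If the tools are installed somewhere that is not on `PATH`, `convert_from_path` also accepts a `poppler_path` argument, so a non-empty result here does not necessarily mean the notebook cannot run.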