langchain/cookbook/Semi_structured_and_multi_modal_RAG.ipynb

{
 "cells": [
  {
   "attachments": {
    "9bbbcfe4-2b85-4e76-996a-ce8d1497d34e.png": {
     "image/png": "iVBORw0KGgoAAAANSUhEUgAABnkAAAMxCAYAAAAnrNaWAAAMQGlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCRBKjIGgYkcXFVy7iIANXRVR7IDYETuLYu+LBRVlXSzYlTcpoOu+8r35vrnz33/O/OfMuTP33gGAfpwnkeSimgDkiQukcaGBzNEpqUzSU0AEdEAFVkCLx8+XsGNiIgEsA+3fy7vrAJG3VxzlWv/s/69FSyDM5wOAxECcLsjn50G8HwC8mi+RFgBAlPMWkwskcgwr0JHCACFeIMeZSlwtx+lKvFthkxDHgbgVADUqjyfNBEDjEuSZhfxMqKHRC7GzWCASA0BnQuyXlzdRAHEaxLbQRgKxXJ+V/oNO5t800wc1ebzMQayci6KoBYnyJbm8qf9nOv53ycuVDfiwhpWaJQ2Lk88Z5u1mzsQIOaZC3CNOj4qGWBviDyKBwh5ilJIlC0tU2qNG/HwOzBnQg9hZwAuKgNgI4hBxblSkik/PEIVwIYYrBJ0iKuAmQKwP8QJhfnC8ymaDdGKcyhfakCHlsFX8WZ5U4Vfu674sJ5Gt0n+dJeSq9DGNoqyEZIgpEFsWipKiINaA2Ck/Jz5CZTOyKIsTNWAjlcXJ47eEOE4oDg1U6mOFGdKQOJV9aV7+wHyxDVkibpQK7y3ISghT5gdr5fMU8cO5YJeEYnbigI4wf3TkwFwEwqBg5dyxZ0JxYrxK54OkIDBOORanSHJjVPa4uTA3VM6bQ+yWXxivGosnFcAFqdTHMyQFMQnKOPGibF54jDIefCmIBBwQBJhABms6mAiygai9p7EH3il7QgAPSEEmEAJHFTMwIlnRI4bXeFAE/oRICPIHxwUqeoWgEPJfB1nl1RFkKHoLFSNywBOI80AEyIX3MsUo8aC3JPAYMqJ/eOfByofx5sIq7//3/AD7nWFDJlLFyAY8MukDlsRgYhAxjBhCtMMNcT/cB4+E1wBYXXAW7jUwj+/2hCeEDsJDwjVCJ+HWBFGx9KcoR4FOqB+iykX6j7nAraGmOx6I+0J1qIzr4YbAEXeDfti4P/TsDlmOKm55Vpg/af9tBj88DZUd2ZmMkoeQA8i2P4/UsNdwH1SR5/rH/ChjTR/MN2ew52f/nB+yL4BtxM+W2AJsH3YGO4Gdww5jjYCJHcOasDbsiBwPrq7HitU14C1OEU8O1BH9w9/Ak5VnMt+5zrnb+Yuyr0A4Rf6OBpyJkqlSUWZWAZMNvwhCJlfMdxrGdHF2cQVA/n1Rvr7exCq+G4he23du7h8A+B7r7+8/9J0LPwbAHk+4/Q9+52xZ8NOhDsDZg3yZtFDJ4fILAb4l6HCnGQATYAFs4XxcgAfwAQEgGISDaJAAUsB4GH0WXOdSMBlMB3NACSgDS8EqUAnWg01gG9gJ9oJGcBicAKfBBXAJXAN34OrpAi9AL3gHPiMIQkJoCAMxQEwRK8QBcUFYiB8SjEQicUgKkoZkImJEhkxH5iJlyHKkEtmI1CJ7kIPICeQc0oHcQh4g3chr5BOKoVRUBzVGrdHhKAtloxFoAjoOzUQnoUXoPHQxWoHWoDvQBvQEegG9hnaiL9A+DGDqmB5mhjliLIyDRWOpWAYmxWZipVg5VoPVY83wOV/BOrEe7CNOxBk4E3eEKzgMT8T5+CR8Jr4Ir8S34Q14K34Ff4D34t8INIIRwYHgTeASRhMyCZMJJYRywhbCAcIpuJe6CO+IRKIe0YboCfdiCjGbOI24iLiWuIt4nNhBfETsI5FIBiQHki8pmsQjFZBKSGtIO0jHSJdJXaQPaupqpmouaiFqqWpitWK1crXtakfVLqs9VftM1iRbkb3J0WQBeSp5CXkzuZl8kdxF/kzRothQfCkJlGzKHEoFpZ5yinKX8kZdXd1c3Us9Vl2kPlu9Qn23+ln1B+ofqdpUeyqHOpYqoy6mbqUep96ivqHRaNa0AFoqrYC2mFZLO0m7T/ugwdBw0uBqCDRmaVRpNGhc1nhJJ9Ot6Gz6eHoRvZy+j36R3qNJ1rTW5GjyNGdqVmke1Lyh2afF0BqhFa2Vp7VIa7vWOa1n2iRta+1gbYH2PO1N2ie1HzEwhgWDw+Az5jI2M04xunSIOjY6XJ1snTKdnTrtOr262rpuukm6U3SrdI/oduphetZ6XL1cvSV6e/Wu630aYjyEPUQ4ZOGQ+iGXh7zXH6ofoC/UL9XfpX9N/5MB0yDYIMdgmUGjwT1D3NDeMNZwsuE6w1OGPUN1hvoM5Q8tHbp36G0j1MjeKM5omtEmozajPmMT41BjifEa45PGPSZ6JgEm2SYrTY6adJsyTP1MRaYrTY+ZPmfqMtnMXGYFs5XZa2ZkFmYmM9to1m722dzGPNG82HyX+T0LigXLIsNipUWLRa+lqeUoy+mWdZa3rchWLKssq9VWZ6zeW9tYJ1vPt260fmajb8O1KbKps7lrS7P1t51kW2N71Y5ox7LLsVtrd8ketXe3z7Kvsr/ogDp4OIgc1jp0DCMM8xomHlYz7IYj1ZHtWOhY5/jASc8p0qnYqdHp5XDL4anDlw0/M/ybs7tzrvNm5zsjtEeEjyge0TzitYu9C9+lyuWqK801xHWWa5PrKzcHN6HbOreb7gz3Ue7z3Vvcv3p4ekg96j26PS090zyrPW+wdFgxrEWss14Er0CvWV6HvT56e3gXeO/1/svH0SfHZ7vPs5E2I4UjN4985Gvuy/Pd6Nvpx/RL89vg1+lv5s/zr/F/GGARIAjYEvCUbcfOZu9gvwx0DpQGHgh8z/HmzOAcD8KCQoNKg9qDtYMTgyuD74eYh2SG1IX0hrqHTgs9HkYIiwhbFnaDa8zlc2u5veGe4TPCWyOoEfERlREPI+0jpZHNo9BR4aNWjLobZRUljmqMBtHc6BXR92JsYibFHIolxsbEVsU+iRsRNz3uTDwjfkL89vh3CYEJSxLuJNomyhJbkuhJY5Nqk94nByUvT+4cPXz0jNEXUgxTRClNqaTUpNQtqX1jgsesGtM11n1sydjr42zGTRl3brzh+NzxRybQJ/Am7EsjpCWnbU/7wovm1fD60rnp1em9fA5/Nf+FIECwUtAt9BUuFz7N8M1YnvEs0zdzRWZ3ln9WeVaPiCOqFL3KDsten/0+Jzpna05/bnLurjy1vLS8g2JtcY64daLJxCkTOyQOkhJJ5yTvSasm9UojpFvykfxx+U0FOvBHvk1mK/tF9qDQr7Cq8MPkpMn7pmhNEU9pm2o/deHUp0UhRb9Nw6fxp7VMN5s+Z/qDGewZG2ciM9NntsyymDVvVtfs0Nnb5lDm5Mz5vdi5eHnx27nJc5vnGc+bPe/RL6G/1JVolEhLbsz3mb9+Ab5AtKB9oevCNQu/lQpKz5c5l5WXfVnEX3T+1xG/VvzavzhjcfsSjyXrlhKXipdeX+a/bNtyreVFyx+tGLWiYSVzZenKt6smrDpX7la+fjVltWx1Z0VkRdMayzVL13ypzKq8VhVYtavaqHph9fu1grWX1wWsq19vvL5s/acNog03N4ZubKixrinfRNxUuOnJ5qTNZ35j/Va7xXBL2ZavW8VbO7fFbWut9ayt3W60fUkdWier694xdselnUE7m+od6zfu0ttVthvslu1+vidtz/W9EXtb9rH21e+32l99gHGgtAFpmNrQ25jV2NmU0tRxMPxgS7NP84FDToe2HjY7XHVE98iSo5Sj8472Hys61ndccrznROaJRy0TWu6cHH3yamtsa/upiFNnT4ecPnmGfebYWd+zh895nzt4nnW+8YLHhYY297YDv7v/fqDdo73houfFpktel5o7RnYcvex/+cSVoCunr3KvXrgWda3jeuL1mzfG3ui8Kbj57FburVe3C29/vjP7LuFu6T3Ne+X3je7X/GH3x65Oj84jD4IetD2Mf3jnEf/Ri8f5j790zXtCe1L+1PRp7TOXZ4e7Q7ovPR/zvOuF5MXnnpI/tf6sfmn7cv9fAX+19Y7u7XolfdX/etEbgzdb37q9bemL6bv/Lu/d5/elHww+bPvI+njmU/Knp58nfyF9qfhq97X5
    }
   },
   "cell_type": "markdown",
   "id": "812a4dbc-fe04-4b84-bdf9-390045e30806",
   "metadata": {},
   "source": [
    "## Semi-structured and Multi-modal RAG\n",
    "\n",
    "Many documents contain a mixture of content types, including text, tables, and images. \n",
    "\n",
    "Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
    "\n",
    "* Text splitting may break up tables, corrupting the data in retrieval\n",
    "* Embedding tables may pose challenges for semantic similarity search\n",
    "\n",
    "And the information captured in images is typically lost.\n",
    "\n",
    "With the emergence of multimodal LLMs, like [GPT4-V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:\n",
    "\n",
    "`Option 1:` \n",
    "\n",
    "* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text\n",
    "* Retrieve both using similarity search\n",
    "* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
    "\n",
    "`Option 2:` \n",
    "\n",
    "* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
    "* Embed and retrieve text \n",
    "* Pass text chunks to an LLM for answer synthesis \n",
    "\n",
    "`Option 3:` \n",
    "\n",
    "* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
    "* Embed and retrieve image summaries with a reference to the raw image \n",
    "* Pass raw images and text chunks to a multimodal LLM for answer synthesis   \n",
    "\n",
    "This cookbook show how we might tackle this :\n",
    "\n",
    "* We will use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).\n",
    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text, (optionally) images along with their summaries for retrieval.\n",
    "* We will demonstrate `Option 2`, and will follow-up on the other approaches in future cookbooks.\n",
    "\n",
    "![ss_mm_rag.png](attachment:9bbbcfe4-2b85-4e76-996a-ce8d1497d34e.png)\n",
    "\n",
    "## Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "140580ef-5db0-43cc-a524-9c39e04d4df0",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install langchain langchain-chroma \"unstructured[all-docs]\" pydantic lxml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74b56bde-1ba0-4525-a11d-cab02c5659e4",
   "metadata": {},
   "source": [
    "## Data Loading\n",
    "\n",
    "### Partition PDF tables, text, and images\n",
    "  \n",
    "* `LLaVA` Paper: https://arxiv.org/pdf/2304.08485.pdf\n",
    "* Use [Unstructured](https://unstructured-io.github.io/unstructured/) to partition elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "61cbb874-ecc0-4d5d-9954-f0a41f65e0d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"/Users/rlm/Desktop/Papers/LLaVA/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e98bdeb7-eb77-42e6-a3a5-c3f27a1838d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any\n",
    "\n",
    "from pydantic import BaseModel\n",
    "from unstructured.partition.pdf import partition_pdf\n",
    "\n",
    "# Get elements\n",
    "raw_pdf_elements = partition_pdf(\n",
    "    filename=path + \"LLaVA.pdf\",\n",
    "    # Using pdf format to find embedded image blocks\n",
    "    extract_images_in_pdf=True,\n",
    "    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
    "    # Titles are any sub-section of the document\n",
    "    infer_table_structure=True,\n",
    "    # Post processing to aggregate text once we have the title\n",
    "    chunking_strategy=\"by_title\",\n",
    "    # Chunking params to aggregate text blocks\n",
    "    # Attempt to create a new chunk 3800 chars\n",
    "    # Attempt to keep chunks > 2000 chars\n",
    "    # Hard max on chunks\n",
    "    max_characters=4000,\n",
    "    new_after_n_chars=3800,\n",
    "    combine_text_under_n_chars=2000,\n",
    "    image_output_dir_path=path,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7cdba921-5419-4471-b234-d93af3859b6f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{\"<class 'unstructured.documents.elements.CompositeElement'>\": 31,\n",
       " \"<class 'unstructured.documents.elements.Table'>\": 3}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a dictionary to store counts of each type\n",
    "category_counts = {}\n",
    "\n",
    "for element in raw_pdf_elements:\n",
    "    category = str(type(element))\n",
    "    if category in category_counts:\n",
    "        category_counts[category] += 1\n",
    "    else:\n",
    "        category_counts[category] = 1\n",
    "\n",
    "# Unique_categories will have unique elements\n",
    "unique_categories = set(category_counts.keys())\n",
    "category_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5f660305-e165-4b6c-ada3-a67a422defb5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "31\n"
     ]
    }
   ],
   "source": [
    "class Element(BaseModel):\n",
    "    type: str\n",
    "    text: Any\n",
    "\n",
    "\n",
    "# Categorize by type\n",
    "categorized_elements = []\n",
    "for element in raw_pdf_elements:\n",
    "    if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
    "        categorized_elements.append(Element(type=\"table\", text=str(element)))\n",
    "    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
    "        categorized_elements.append(Element(type=\"text\", text=str(element)))\n",
    "\n",
    "# Tables\n",
    "table_elements = [e for e in categorized_elements if e.type == \"table\"]\n",
    "print(len(table_elements))\n",
    "\n",
    "# Text\n",
    "text_elements = [e for e in categorized_elements if e.type == \"text\"]\n",
    "print(len(text_elements))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0aa7f52f-bf5c-4ba4-af72-b2ccba59a4cf",
   "metadata": {},
   "source": [
    "## Multi-vector retriever\n",
    "\n",
    "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n",
    "\n",
    "Summaries are used to retrieve raw tables and / or raw chunks of text.\n",
    "\n",
    "### Text and Table summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "523e6ed2-2132-4748-bdb7-db765f20648d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_openai import ChatOpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "22c22e3f-42fb-4a4a-a87a-89f10ba8ab99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prompt\n",
    "prompt_text = \"\"\"You are an assistant tasked with summarizing tables and text. \\\n",
    "Give a concise summary of the table or text. Table or text chunk: {element} \"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(prompt_text)\n",
    "\n",
    "# Summary chain\n",
    "model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
    "summarize_chain = {\"element\": lambda x: x} | prompt | model | StrOutputParser()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f176b374-aef0-48f4-a104-fb26b1dd6922",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to text\n",
    "texts = [i.text for i in text_elements]\n",
    "text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61a6ac00-ebbe-4608-9ae5-40f81541e37f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to tables\n",
    "tables = [i.text for i in table_elements]\n",
    "table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b1feadda-8171-4aed-9a60-320a88dc9ee1",
   "metadata": {},
   "source": [
    "### Images\n",
    "\n",
    "We will implement `Option 2` discussed above: \n",
    "\n",
    "* Use a multimodal LLM ([LLaVA](https://llava.hliu.cc/)) to produce text summaries from images\n",
    "* Embed and retrieve text \n",
    "* Pass text chunks to an LLM for answer synthesis \n",
    "\n",
    "#### Image summaries \n",
    "\n",
    "We will use [LLaVA](https://github.com/haotian-liu/LLaVA/), an open source multimodal model.\n",
    " \n",
    "We will use [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) to run LLaVA locally (e.g., on a Mac laptop):\n",
    "\n",
    "* Clone [llama.cpp](https://github.com/ggerganov/llama.cpp)\n",
    "* Download the LLaVA model: `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from [LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)\n",
    "* Build\n",
    "```\n",
    "mkdir build && cd build && cmake ..\n",
    "cmake --build .\n",
    "```\n",
    "* Run inference across images:\n",
    "```\n",
    "/Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "440b20e4-a74d-4c75-b538-0ca24d581713",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Define the directory containing the images\n",
    "IMG_DIR=~/Desktop/Papers/LLaVA/\n",
    "\n",
    "# Loop through each image in the directory\n",
    "for img in \"${IMG_DIR}\"*.jpg; do\n",
    "    # Extract the base name of the image without extension\n",
    "    base_name=$(basename \"$img\" .jpg)\n",
    "\n",
    "    # Define the output file name based on the image name\n",
    "    output_file=\"${IMG_DIR}${base_name}.txt\"\n",
    "\n",
    "    # Execute the command and save the output to the defined output file\n",
    "    /Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
    "\n",
    "done\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a69dcd6b-0226-4173-a80d-36921824c824",
   "metadata": {},
   "source": [
    "Note: \n",
    "\n",
    "To run LLaVA with python bindings, we need a Python API to run the CLIP model. \n",
    "\n",
    "CLIP support is likely to be added to `llama.cpp` in the future.\n",
    "\n",
    "After running the above, we  fetch and clean image summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "54924f9e-0f81-4232-8efb-8485db1063c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import os\n",
    "\n",
    "# Get all .txt file summaries\n",
    "file_paths = glob.glob(os.path.expanduser(os.path.join(path, \"*.txt\")))\n",
    "\n",
    "# Read each file and store its content in a list\n",
    "img_summaries = []\n",
    "for file_path in file_paths:\n",
    "    with open(file_path, \"r\") as file:\n",
    "        img_summaries.append(file.read())\n",
    "\n",
    "# Remove any logging prior to summary\n",
    "logging_header = \"clip_model_load: total allocated memory: 201.27 MB\\n\\n\"\n",
    "cleaned_img_summary = [s.split(logging_header, 1)[1].strip() for s in img_summaries]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67b030d4-2ac5-41b6-9245-fc3ba5771d87",
   "metadata": {},
   "source": [
    "### Add to vectorstore\n",
    "\n",
    "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d643cc61-827d-4f3c-8242-7a7c8291ed8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "\n",
    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
    "from langchain.storage import InMemoryStore\n",
    "from langchain_chroma import Chroma\n",
    "from langchain_core.documents import Document\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "# The vectorstore to use to index the child chunks\n",
    "vectorstore = Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings())\n",
    "\n",
    "# The storage layer for the parent documents\n",
    "store = InMemoryStore()\n",
    "id_key = \"doc_id\"\n",
    "\n",
    "# The retriever (empty to start)\n",
    "retriever = MultiVectorRetriever(\n",
    "    vectorstore=vectorstore,\n",
    "    docstore=store,\n",
    "    id_key=id_key,\n",
    ")\n",
    "\n",
    "# Add texts\n",
    "doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
    "summary_texts = [\n",
    "    Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
    "    for i, s in enumerate(text_summaries)\n",
    "]\n",
    "retriever.vectorstore.add_documents(summary_texts)\n",
    "retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
    "\n",
    "# Add tables\n",
    "table_ids = [str(uuid.uuid4()) for _ in tables]\n",
    "summary_tables = [\n",
    "    Document(page_content=s, metadata={id_key: table_ids[i]})\n",
    "    for i, s in enumerate(table_summaries)\n",
    "]\n",
    "retriever.vectorstore.add_documents(summary_tables)\n",
    "retriever.docstore.mset(list(zip(table_ids, tables)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b90572a0-0377-4598-8d12-bba22a51b655",
   "metadata": {},
   "source": [
    "For `option 2` (above): \n",
    "\n",
    "* Store the image summary in the `docstore`, which we return to the LLM for answer generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "2e0f06f3-a5bc-4342-aee6-c3495d047e66",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add image summaries\n",
    "img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]\n",
    "summary_img = [\n",
    "    Document(page_content=s, metadata={id_key: img_ids[i]})\n",
    "    for i, s in enumerate(cleaned_img_summary)\n",
    "]\n",
    "retriever.vectorstore.add_documents(summary_img)\n",
    "retriever.docstore.mset(list(zip(img_ids, cleaned_img_summary)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d667e5c-5385-48c4-b878-51dcc03cc4d0",
   "metadata": {},
   "source": [
    "For `option 3` (above): \n",
    "\n",
    "* Store the images in the `docstore`.\n",
    "* Using the image in answer synthesis will require a multimodal LLM with Python API integration.\n",
    "* GPT4-V is expected soon, and - as mentioned above - CLIP support is likely to be added to `llama.cpp` in the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c75a7b3-04f3-41eb-97e5-61af49d92104",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add images\n",
    "img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]\n",
    "summary_img = [\n",
    "    Document(page_content=s, metadata={id_key: img_ids[i]})\n",
    "    for i, s in enumerate(cleaned_img_summary)\n",
    "]\n",
    "retriever.vectorstore.add_documents(summary_img)\n",
    "### Fetch images\n",
    "retriever.docstore.mset(\n",
    "    list(\n",
    "        zip(\n",
    "            img_ids,\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b45fb81-46b1-426e-aa2c-01aed4eac700",
   "metadata": {},
   "source": [
    "### Sanity Check retrieval\n",
    "\n",
    "The most complex table in the paper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a5f4dd59-005a-4ff8-ad51-ea2e50d79c10",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Subject Context Modality Grade Method NAT SOC LAN | TXT IMG NO | Gi6~ G7-12 | Average Representative & SoTA methods with numbers reported in the literature Human [30] 90.23 84.97 87.48 | 89.60 87.50 88.10 | 91.59 82.42 88.40 GPT-3.5 [30] 74.64 69.74 76.00 | 74.44 67.28 77.42 | 76.80 68.89 73.97 GPT-3.5 w/ CoT [30] 75.44 70.87 78.09 | 74.68 67.43 79.93 | 78.23 69.68 75.17 LLaMA-Adapter [55] 84.37 88.30 84.36 | 83.72 80.32 86.90 | 85.83 84.05 85.19 MM-CoT gase [57] 87.52 77.17 85.82 | 87.88 82.90 86.83 | 84.65 85.37 84.91 MM-CoT farge [57] 95.91 82.00 90.82 | 95.26 88.80 92.89 | 92.44 90.31 | 91.68 Results with our own experiment runs GPT-4 84.06 73.45 87.36 | 81.87 70.75 90.73 | 84.69 79.10 82.69 LLaVA 90.36 95.95 88.00 | 89.49 88.00 90.66 | 90.93 90.90 90.92 LLaVA+GPT-4 (complement) 90.36 95.50 88.55 | 89.05 87.80 91.08 | 92.22 88.73 90.97 LLaVA+GPT-4 (judge) 91.56 96.74 91.09 | 90.62 88.99 93.52 | 92.73 92.16 92.53'"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tables[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f68ef8b-0fec-4b2f-a0d3-c440c74ebaa1",
   "metadata": {},
   "source": [
    "Here is the summary, which is embedded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "9eb16ea9-d932-4062-9ace-e8f77dee530b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The table presents the performance of various methods in different subject contexts and modalities. The subjects are Natural Sciences (NAT), Social Sciences (SOC), and Language (LAN). The modalities are text (TXT), image (IMG), and no modality (NO). The methods include Human, GPT-3.5, GPT-3.5 with CoT, LLaMA-Adapter, MM-CoT gase, MM-CoT farge, GPT-4, LLaVA, LLaVA+GPT-4 (complement), and LLaVA+GPT-4 (judge). The performance is measured in grades from 6 to 12. The MM-CoT farge method had the highest performance in most categories, with LLaVA+GPT-4 (judge) showing the highest results in the experiment runs.'"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table_summaries[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc2bcc4c-c05d-4417-aaf9-78acd754dde6",
   "metadata": {},
   "source": [
    "Here is our retrieval of that table from the natural language query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "1bea75fe-85af-4955-a80c-6e0b44a8e215",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Subject Context Modality Grade Method NAT SOC LAN | TXT IMG NO | Gi6~ G7-12 | Average Representative & SoTA methods with numbers reported in the literature Human [30] 90.23 84.97 87.48 | 89.60 87.50 88.10 | 91.59 82.42 88.40 GPT-3.5 [30] 74.64 69.74 76.00 | 74.44 67.28 77.42 | 76.80 68.89 73.97 GPT-3.5 w/ CoT [30] 75.44 70.87 78.09 | 74.68 67.43 79.93 | 78.23 69.68 75.17 LLaMA-Adapter [55] 84.37 88.30 84.36 | 83.72 80.32 86.90 | 85.83 84.05 85.19 MM-CoT gase [57] 87.52 77.17 85.82 | 87.88 82.90 86.83 | 84.65 85.37 84.91 MM-CoT farge [57] 95.91 82.00 90.82 | 95.26 88.80 92.89 | 92.44 90.31 | 91.68 Results with our own experiment runs GPT-4 84.06 73.45 87.36 | 81.87 70.75 90.73 | 84.69 79.10 82.69 LLaVA 90.36 95.95 88.00 | 89.49 88.00 90.66 | 90.93 90.90 90.92 LLaVA+GPT-4 (complement) 90.36 95.50 88.55 | 89.05 87.80 91.08 | 92.22 88.73 90.97 LLaVA+GPT-4 (judge) 91.56 96.74 91.09 | 90.62 88.99 93.52 | 92.73 92.16 92.53'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can retrieve this table\n",
    "retriever.invoke(\"What are results for LLaMA across across domains / subjects?\")[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dbb23d5-ae66-444d-8f5f-b24107fb9c57",
   "metadata": {},
   "source": [
    "Image:"
   ]
  },
  {
   "attachments": {
    "5d505f36-17e1-4fe5-a405-f01f7a392716.jpg": {
     "image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAE4AQUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3qaVYYHlbO1FLHAycCuT0/wARa9f2Vrq8emWsul3W1kihnLXCxsRhiMbSRnJUHjnk4rq5zItvIYkDyBSVVjgE9hntXm8ssJhQaBpmqaV4geZS9rHFKsCtuG8vx5RTGfmHJ7c0AdzJ4j0eLUl06TUrVL0kKIDKA2T0GPU+lNfxLosd6tk+p2q3LOUERlG7cO2PWuPuSY9A1Tw8+m3cmq3NxMY3FsxjdnkLJN5mNoCgqSScjbj0p82ktJ4Y8SRSWLPJcaqXwYjmRd8eGHHIwDz7UAdfaeItHv4J57TUrWaK35mdJQQg65J7DHeqN3420K30e71OLUYLiG1HziKQFs9hz3OOPWub8ZaPeXmsTmyt5TGLGBpPKjDeYsdwGKAEbWO3OFPXp3qG5tptVtdZuLeXVL6X+y5IA89iLcMSQQgGxWdhg+wz70AdpH4gtHlMn2q0+xeTHIs3ncne5UZGMAEgAHPJyO1Rnxh4dFobo6zZeQH8sv5wwGxnH5c/Sue1qP8Ati7nnitJ5LaeHTwA8DLuAuyWBUjPA5IPatWHT0/4TzU7lrQeW+mQR+YU4Y75dy578bcj6UAat94h0jTFha+1G2txMMxmSQDcPUe3I56Ul74i0fTpIo7zUraB5RuQSSAbh6/T36V53p1re6V9nnvJ9RtI59JtYY/IsBcHKIQ0TAoxU5OcHAOfapRYNo9lFFEmq29w+nJDsnshdx3KAuRE4QfKy7scFRgjrigDvtZ1VtNt7SWNFkE93DbnJxgO4XP4Zp1/qq2GoQxzSW0du0MssjyS7WUJjJC45HPJyMcetY2qQXMvh3QkNkYpkvLJpIIwWEWHXcOOw9faqfja2lvdXt7a3TfNNpGopGo/iYiIAfnQB18t/awTQxSzokkwJjVmwWwMnH0HNY91410GDSr/AFCLUbe4jsozJIsUgJ9gOe5GB2rnNXvl8Q6lpIt9J1Ga2jguluFkt3h5aEjy8sByeRnp05qr5V3f6bqNlaJdXsY0eeCJ7uwME0DEALFuwofPsONo55oA7S916JPC82tWRS4jWEyphuGx2zU1z4h0iyv47C61K2hu3xtieQBjnp9M9vWszVZP7S8AXRtoZi0loVWNomV84xjaQDnPtWLdOtnZeINJu9Mu7u81CeZ4RHbM6XCuMJ8+Nq7RhTuIxtz6UAdbfeItH0y6S2vtStredwCqSyBTg8A+wzUUeuxrf6rHdNHBbWCxsZmbAwy7iT6YrlNOJ8PQapYa3p93f3l3sIkitmmW7XyUTZuAwCCrDDYHOe9UrfRNWs7z7ZcrLdQ6fBZm4sthIuGSLDOp/jdCAQOhI9SCAD0a51OysrE3t1cxw2wAJlkbaoB6cmse+8b6DZ6fb332+GW3nuFt1eOQcMSAc88Yzk98VH4qnc6XZTwRMY/tKO8wtTO9uu0kOI+pOcDocZzjiuRWG883U794tRuYBqOn3Pmy2mx5ERhvYRqoPAHpnA+lAHpF3qtjYQRzXl3DBHJnY8jhQcKWPJ9gT+FUW8R2dzYpdaZd2dxGbiOBmafaAWYDGcH5ueB3OPWsjxpNGJ/DE72sk8SamJDGiEtgQSnIXqSOuOvHrWTqKS6vq0+qWFpcrZPc6dEd8DoZXjuNzPtYA4VSBuI7HsKAOyPiTR11H+zm1O1F7nb5BkG7OM4x647daraN4v0jWklNtdx5S5a2ALjLMC2MY9QpI9hWBp09vaaamhXmi3dzfi8LOv2dtkjGXd5/m4246NnOeMYzVVRPaQyF7K7J0/xBJeTKsDkmFy+GTA+cYcEhckYNAHc3WtabZNKt1fQQmIKZPMkC7Q2dpPpnB/I1VfxXoMZtxJq1ohuVDRBpQNyk4B+hPFcbeB9c8Q3N0mnXRs5LjTNjTW7KJFSWQs2CM4Ge49+hFM1C0uLPVvEMN3NqarqMm6JLWwWdbiMxhQm8odpBBGGIAznuaAO7u/EOkWN7HZ3Wo28NzJjbE8gDcnA+mT0qze6jaabaNdXtzHbwLjdJK20DPTk159qsE2nrdW8CX73UttFG1pPZ/aYL8iMKAXUfI3G0ncAMZxXR6+JIjouoS2ksttZ3Be4hiQyMmY2UMFHLbSR0Gec9qAE1zxvpmn6FHqFleWdwZp0t4S0wCb2YA7iMkBc5PGeK6O1lM1rFIzIxdAxKHKnI7e1ef3sUup3c+o2Vjcx2k2padt3wMjSGOXLy7SAQMFRkgfc9MV6KMY4oAKD0NFB6GgBluSbaIk5JQfyopLb/AI9Yf9xf5UUASUYFRXEpht3kWN5CqkhExlvYZIFcVY+Mr28g8PXc1pMn28Tb7eJAxkKqCu3ngdeSR0OaAO6x7UYGK59fF9g0QxDcm7+0G1+x+X+98wLuIxnGNvzZzjHetDTNZt9UWcRpLFLbyeXNDMm142wDyOnIIIIJBzQBoYHpRisGHxbYTyRFYrkWs03kRXhT9075wADnPJGASME9DyKRfFtk08Km3vFt5rg20V00WInkyRgc55IIBIwfWgDewPSlwK5GfxuJ9Liv9M068mhe6igDvGqht0oRsZYEkcj0zj3plt4ukg1DXY7q1vJ47KdeIIQfIjMMbfNzycluBk/pQB2OB6UmB6Vz3ifXZrHwx/aGmq0rTNEsciKrbRIyjdhiAeG498dqRPFVraGe3uUvCbKJXu55I1CxAoGBYjjJHZQee1AHR4qoNMtBqZ1Hys3fl+UJGYnauckAHgZIGcdcDPSsk+L7KJHa7tL20xbPcoJ4seaiDLbcE8gc7Tg+1NXxlZtNBALG/wDOuY/NtYzCAZ0HVl5wMZGd23qPWgDo8D0oxWXaa/Y3eiy6qGeK2hEnneapVozGSHDD1BU1SbxhZQ2k1zd215axx2xugZowN8QxlhgnpkZBweelAHQ4owPSsKLxVZtcLDcQ3NqXiaeJp49olRRliuCeQMHBwcdqbB4usXYfaYLqxR7d7mN7qPYrxrgsRycYBBwcH2oA38D0pMD0rDtvFVlO8QmgurRJommhkuY9iyIoySOeDjnDYOO3Bp1l4ntLy5toDb3Vv9rUtavPHtWYAZ+XnIOOcNg4zxwaANvA9KTA9KyNQ8RQWF3JarZ3l1LFEJpRbxbhGhzgkkjJODwMnjpVT/hM7CWVo7K3u71lto7o/Z4wf3TglW5I64PHX0FAG3cWNtdy20s8Qd7aTzYTkja20rn8mI/GrGB6VgnxZYy/ZxZQ3N8Z7cXSi2jztiPRjkjGecDqcHjis+w1+5vfAur
    }
   },
   "cell_type": "markdown",
   "id": "329fd4ee-4a68-4f3b-b157-a676f13ba587",
   "metadata": {},
   "source": [
    "![figure-8-1.jpg](attachment:5d505f36-17e1-4fe5-a405-f01f7a392716.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fde6f17-d244-4270-b759-68e1858d399f",
   "metadata": {},
   "source": [
    "We can retrieve this image summary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "6f52ee1e-ed46-4a81-834a-3608a1cf90ce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries. The arrangement of the chicken pieces creates a visually appealing and playful representation of the world, making it an interesting and creative presentation.\\n\\nmain: image encoded in   865.20 ms by CLIP (    1.50 ms per image patch)'"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.invoke(\"Images / figures with playful and creative examples\")[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69060724-e390-4dda-8250-5f86025c874a",
   "metadata": {},
   "source": [
    "## RAG\n",
    "\n",
    "Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval).\n",
    "\n",
    "For `option 1` (above): \n",
    "\n",
    "* Simply pass retrieved text chunks to LLM, as usual.\n",
    "\n",
    "For `option 2a` (above): \n",
    "\n",
    "* We would pass retrieved image and images to the multi-modal LLM.\n",
    "* This should be possible soon, once [llama-cpp-python add multi-modal support](https://github.com/abetlen/llama-cpp-python/issues/813).\n",
    "* And, of course, this will be enabled by GPT4-V API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "771a47fa-1267-4db8-a6ae-5fde48bbc069",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "# Prompt template\n",
    "template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
    "{context}\n",
    "Question: {question}\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# Option 1: LLM\n",
    "model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
    "# Option 2: Multi-modal LLM\n",
    "# model = GPT4-V or LLaVA\n",
    "\n",
    "# RAG pipeline\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "ea8414a8-65ee-4e11-8154-029b454f46af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The performance of LLaMA across multiple image domains/subjects is as follows: In the Natural Science (NAT) subject, it scored 84.37. In the Social Science (SOC) subject, it scored 88.30. In the Language Science (LAN) subject, it scored 84.36. In the Text Context (TXT) subject, it scored 83.72. In the Image Context (IMG) subject, it scored 80.32. In the No Context (NO) subject, it scored 86.90. For grades 1-6 (G1-6), it scored 85.83 and for grades 7-12 (G7-12), it scored 84.05. The average score was 85.19.'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.invoke(\n",
    "    \"What is the performance of LLaVa across across multiple image domains / subjects?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ce57b80-fbd0-47f3-817f-6549a0409f51",
   "metadata": {},
   "source": [
    "We can check the [trace](https://smith.langchain.com/public/85a7180e-0dd1-44d9-996f-6cb9c6f53205/r) to see retrieval of tables and text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "e88f0bc7-81fb-4883-a021-58734a74411b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The text provides an example of a playful and creative image. The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries. The arrangement of the chicken pieces creates a visually appealing and playful representation of the world, making it an interesting and creative presentation.'"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.invoke(\"Explain images / figures with playful and creative examples.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}