"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search\n",
"\n",
"And the information captured in images is typically lost.\n",
"\n",
"With the emergence of multimodal LLMs, like [GPT4-V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:\n",
"\n",
"`Option 1:` \n",
"\n",
"* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text\n",
"* Retrieve both using similarity search\n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"\n",
"`Option 2:` \n",
"\n",
"* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve text \n",
"* Pass text chunks to an LLM for answer synthesis \n",
"* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve image summaries with a reference to the raw image \n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"* We will use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text, (optionally) images along with their summaries for retrieval.\n",
"* We will demonstrate `Option 2`, and will follow-up on the other approaches in future cookbooks.\n",
"* Download the LLaVA model: `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from [LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)\n",
"/Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
" # Extract the base name of the image without extension\n",
" base_name=$(basename \"$img\" .jpg)\n",
"\n",
" # Define the output file name based on the image name\n",
" output_file=\"${IMG_DIR}${base_name}.txt\"\n",
"\n",
" # Execute the command and save the output to the defined output file\n",
" /Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"objc[42078]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x31f870208) and /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x31fc9c208). One of the two will be used. Which one is undefined.\n"
]
},
{
"data": {
"text/plain": [
"'The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries. The arrangement of the chicken pieces creates a visually appealing and playful representation of the world, making it an interesting and creative presentation.\\n\\nmain: image encoded in 865.20 ms by CLIP ( 1.50 ms per image patch)'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"Images / figures with playful and creative examples\")[0]"
"\" Based on the provided context, LLaVA's performance across multiple image domains/subjects is not explicitly mentioned. However, we can infer some information about its performance based on the given text:\\n\\n1. LLaVA achieves an accuracy of 90.92% on the ScienceQA dataset, which is close to the current SoTA (91.68%).\\n2. When prompted with a 2-shot in-context learning task using GPT-4, it achieves an accuracy of 82.69%, indicating a 7.52% absolute gain compared to GPT-3.5.\\n3. For a substantial number of questions, GPT-4 fails due to insufficient context such as images or plots.\\n\\nBased on these points, we can infer that LLaVA performs well across multiple image domains/subjects, but its performance may be limited by the quality and availability of the input images. Additionally, its ability to recognize visual content and provide detailed responses is dependent on the specific task and dataset being used.\""
"We can check the [trace](https://smith.langchain.com/public/ab90fb1c-5949-4fc6-a002-56a6056adc6b/r) to review retrieval."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1ad375c5-8aef-4be3-9a12-8ad953fa2d14",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Sure, I\\'d be happy to help! Based on the provided context, here are some playful and creative explanations for the images/figures mentioned in the paper:\\n\\n1. \"The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries.\"\\n\\nPlayful explanation: \"Look, ma! The fried chicken is mapping out the world one piece at a time! Who needs Google Maps when you have crispy chicken wings to guide the way?\"\\n\\nCreative explanation: \"The arrangement of the fried chicken pieces creates a visual representation of the world that\\'s both appetizing and adventurous. It\\'s like a culinary globe-trotting experience!\"\\n\\n2. \"The image is a screenshot of a conversation between two people, likely discussing a painting.\"\\n\\nPlayful explanation: \"The painting is getting a double take - these two people are having a chat about it and we get to eavesdrop on their art-loving banter!\"\\n\\nCreative explanation: \"This image captures the dynamic exchange of ideas between two art enthusiasts. It\\'s like we\\'re peeking into their creative brainstorming session, where the painting is the catalyst for a lively discussion.\"\\n\\n3. \"The image features a text-based representation of a scene with a person holding onto a rope, possibly a woman, and a boat in the background.\"\\n\\nPlayful explanation: \"This image looks like a page from a choose-your-own-adventure book! Is our brave protagonist about to embark on a thrilling boat ride or hold tight for a wild journey?\"\\n\\nCreative explanation: \"The text-based representation of the scene creates an intriguing narrative that invites the viewer to fill in the blanks. It\\'s like we\\'re reading a visual storybook, where the person holding onto the rope is the hero of their own adventure.\"\\n\\n4. \"Figure 5: LLaVA recognizes the famous art work, Mona Lisa, by Leonardo da Vinci.\"\\n\\nPlayful explanation: \"Mona Lisa is getting a digital spotlight - look at her smile now that she\\'s part of this cool image recognition tech!\"\\n\\nCreative explanation: \"This playful recognition of the Mona Lisa painting highlights the advanced technology used in image analysis. It\\'s like LLaVA is giving the famous artwork a modern makeover, showcasing its timeless beauty and relevance in the digital age.\"\\n\\nOverall, these images/figures offer unique opportunities for creative and playful explanations that can capture the viewer\\'s attention while highlighting the technology and narratives presented in the paper.'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"Explain any images / figures in the paper with playful and creative examples.\")"
]
},
{
"cell_type": "markdown",
"id": "1da79644-4046-45b0-8c25-01aa73587b22",
"metadata": {},
"source": [
"We can check the [trace](https://smith.langchain.com/public/c6d3b7d5-0f40-4905-ab8f-3a2b77c39af4/r) to review retrieval."