{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "812a4dbc-fe04-4b84-bdf9-390045e30806",
   "metadata": {},
   "source": [
    "# Use Case\n",
    "\n",
    "Many documents contain a mixture of content types: \n",
    "\n",
    "* `Semi-structured data`: text and tables\n",
    "* `Image`: images, which often carry valuable information\n",
    "\n",
    "Here, we show how `Unstructured` can be used to partition all three element types (text, tables, and images) from a document.\n",
    "\n",
    "We generate a summary of each element, using an open-source multimodal model (LLaVA) for the image summaries.\n",
    "\n",
    "We embed the summaries and store them with the raw table and text chunks.\n",
    "\n",
    "For images, we embed and store only the summaries; future work could store the images themselves for final answer generation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ea70aa4-9a53-4e17-8f11-1d14a0ac9b43",
   "metadata": {},
   "source": [
    "Figure: `img_flow.png` (embedded image data omitted)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74b56bde-1ba0-4525-a11d-cab02c5659e4",
   "metadata": {},
   "source": [
    "## Data Loading\n",
    "\n",
    "### Partition PDF tables, text, and images\n",
    "  \n",
    "* `LLaVA` Paper: https://arxiv.org/pdf/2304.08485.pdf\n",
    "* Use `Unstructured` to partition elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4aa9055d-1243-4b5a-aca0-2c6f8fb34143",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']\n",
      "- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from lxml import html\n",
    "from pydantic import BaseModel\n",
    "from typing import Any, Optional\n",
    "from unstructured.partition.pdf import partition_pdf\n",
    "\n",
    "# Directory containing the PDF; extracted images are also saved here\n",
    "path = \"/Users/rlm/Desktop/Papers/LLaVA/\"\n",
    "\n",
    "# Get elements\n",
    "raw_pdf_elements = partition_pdf(\n",
    "    filename=path + \"LLaVA.pdf\",\n",
    "    # Extract the image blocks embedded in the PDF\n",
    "    extract_images_in_pdf=True,\n",
    "    # Use layout model (YOLO-X) to get bounding boxes (for tables) and find titles\n",
    "    # Titles are any sub-section of the document\n",
    "    infer_table_structure=True,\n",
    "    # Post-processing: aggregate text under each title\n",
    "    chunking_strategy=\"by_title\",\n",
    "    # Chunking params to aggregate text blocks\n",
    "    # Hard max on chunk size (chars)\n",
    "    max_characters=4000,\n",
    "    # Attempt to start a new chunk after 3800 chars\n",
    "    new_after_n_chars=3800,\n",
    "    # Attempt to keep chunks > 2000 chars by combining smaller text blocks\n",
    "    combine_text_under_n_chars=2000,\n",
    "    # Directory where extracted images are written\n",
    "    image_output_dir_path=path,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7cdba921-5419-4471-b234-d93af3859b6f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{\"<class 'unstructured.documents.elements.CompositeElement'>\": 31,\n",
       " \"<class 'unstructured.documents.elements.Table'>\": 3}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a dictionary to store counts of each type\n",
    "category_counts = {}\n",
    "\n",
    "for element in raw_pdf_elements:\n",
    "    category = str(type(element))\n",
    "    if category in category_counts:\n",
    "        category_counts[category] += 1\n",
    "    else:\n",
    "        category_counts[category] = 1\n",
    "\n",
    "# Unique element categories found in the document\n",
    "# (a TableChunk appears if a Table exceeds the max chars set above)\n",
    "unique_categories = set(category_counts.keys())\n",
    "category_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5f660305-e165-4b6c-ada3-a67a422defb5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "31\n"
     ]
    }
   ],
   "source": [
    "class Element(BaseModel):\n",
    "    type: str\n",
    "    text: Any\n",
    "\n",
    "# Categorize by type\n",
    "categorized_elements = []\n",
    "for element in raw_pdf_elements:\n",
    "    if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
    "        categorized_elements.append(Element(type=\"table\", text=str(element)))\n",
    "    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
    "        categorized_elements.append(Element(type=\"text\", text=str(element)))\n",
    "\n",
    "# Tables\n",
    "table_elements = [e for e in categorized_elements if e.type == \"table\"]\n",
    "print(len(table_elements))\n",
    "\n",
    "# Text\n",
    "text_elements = [e for e in categorized_elements if e.type == \"text\"]\n",
    "print(len(text_elements))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0aa7f52f-bf5c-4ba4-af72-b2ccba59a4cf",
   "metadata": {},
   "source": [
    "## Multi-vector retriever\n",
    "\n",
    "Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to index summaries for retrieval while storing the raw text and table chunks alongside them.\n",
    "\n",
    "### Text and Table summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "523e6ed2-2132-4748-bdb7-db765f20648d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOllama\n",
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain.schema.output_parser import StrOutputParser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "22c22e3f-42fb-4a4a-a87a-89f10ba8ab99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prompt\n",
    "prompt_text = \"\"\"You are an assistant tasked with summarizing tables and text. \\\n",
    "Give a concise summary of the table or text. Table or text chunk: {element} \"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(prompt_text)\n",
    "\n",
    "# Summary chain\n",
    "model = ChatOllama(model=\"llama2:13b-chat\")\n",
    "summarize_chain = {\"element\": lambda x: x} | prompt | model | StrOutputParser()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0e1ba7ba-d209-424a-8f05-6a95d6d32bb2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to text\n",
    "texts = [i.text for i in text_elements if i.text != \"\"]\n",
    "text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a419123a-6038-4264-9ee0-bfb2a2df7153",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to tables\n",
    "tables = [i.text for i in table_elements]\n",
    "table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d52641eb-762e-4460-80c7-3ac3ddd93621",
   "metadata": {},
   "source": [
    "### Image summaries \n",
    "\n",
    "Use [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436): \n",
    "\n",
    "* Download `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from [LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)\n",
    "* Clone `llama.cpp` repo\n",
    "* Build\n",
    "```\n",
    "mkdir build && cd build && cmake ..\n",
    "cmake --build .\n",
    "```\n",
    "\n",
    "For [better performance](https://github.com/ggerganov/llama.cpp/issues/3602):\n",
    "\n",
    "* It appears `7b` is currently better than `13b` in the above LLaVA repo.\n",
    "* Use `--temp 0.1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "646a6874-008e-46aa-809d-1d59df36858b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Define the directory containing the images\n",
    "IMG_DIR=~/Desktop/Papers/LLaVA/\n",
    "\n",
    "# Loop through each image in the directory\n",
    "for img in \"${IMG_DIR}\"*.jpg; do\n",
    "    # Skip the unexpanded pattern if no .jpg files are present\n",
    "    [ -e \"$img\" ] || continue\n",
    "    # Extract the base name of the image without extension\n",
    "    base_name=$(basename \"$img\" .jpg)\n",
    "\n",
    "    # Define the output file name based on the image name\n",
    "    output_file=\"${IMG_DIR}${base_name}.txt\"\n",
    "\n",
    "    # Execute the command and save the output to the defined output file\n",
    "    /Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
    "\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "da8a8c94-3df7-446f-9a69-703295f50f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, glob\n",
    "\n",
    "# Get all .txt files in the directory\n",
    "file_paths = glob.glob(os.path.expanduser(os.path.join(path, \"*.txt\")))\n",
    "\n",
    "# Read each file and store its content in a list\n",
    "img_summaries = []\n",
    "for file_path in file_paths:\n",
    "    with open(file_path, 'r') as file:\n",
    "        img_summaries.append(file.read())\n",
    "\n",
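    "# Drop the llava loader log that precedes the summary; keep the full text if the marker is absent\n",
    "marker = \"clip_model_load: total allocated memory: 201.27 MB\\n\\n\"\n",
    "cleaned_img_summary = [s.split(marker, 1)[-1].strip() for s in img_summaries]"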
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67b030d4-2ac5-41b6-9245-fc3ba5771d87",
   "metadata": {},
   "source": [
    "### Add to vectorstore\n",
    "\n",
    "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "64a5df0c-8193-407e-a83f-8fc17caff3e4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found model file at  /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "objc[42078]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x31f870208) and /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x31fc9c208). One of the two will be used. Which one is undefined.\n"
     ]
    }
   ],
   "source": [
    "import uuid\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.storage import InMemoryStore\n",
    "from langchain.schema.document import Document\n",
    "from langchain.embeddings import GPT4AllEmbeddings\n",
    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
    "\n",
    "# The vectorstore used to index the summaries\n",
    "vectorstore = Chroma(\n",
    "    collection_name=\"summaries\",\n",
    "    embedding_function=GPT4AllEmbeddings()\n",
    ")\n",
    "\n",
    "# The storage layer for the raw documents (text chunks, tables, image summaries)\n",
    "store = InMemoryStore()\n",
    "id_key = \"doc_id\"\n",
    "\n",
    "# The retriever (empty to start)\n",
    "retriever = MultiVectorRetriever(\n",
    "    vectorstore=vectorstore, \n",
    "    docstore=store, \n",
    "    id_key=id_key,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d643cc61-827d-4f3c-8242-7a7c8291ed8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add texts\n",
    "doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
    "summary_texts = [Document(page_content=s,metadata={id_key: doc_ids[i]}) for i, s in enumerate(text_summaries)]\n",
    "retriever.vectorstore.add_documents(summary_texts)\n",
    "retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
    "\n",
    "# Add tables\n",
    "table_ids = [str(uuid.uuid4()) for _ in tables]\n",
    "summary_tables = [Document(page_content=s,metadata={id_key: table_ids[i]}) for i, s in enumerate(table_summaries)]\n",
    "retriever.vectorstore.add_documents(summary_tables)\n",
    "retriever.docstore.mset(list(zip(table_ids, tables)))\n",
    "\n",
    "# Add images\n",
    "img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]\n",
    "summary_img = [Document(page_content=s,metadata={id_key: img_ids[i]}) for i, s in enumerate(cleaned_img_summary)]\n",
    "retriever.vectorstore.add_documents(summary_img)\n",
    "retriever.docstore.mset(list(zip(img_ids, cleaned_img_summary))) # Store the image summary as the raw document"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b45fb81-46b1-426e-aa2c-01aed4eac700",
   "metadata": {},
   "source": [
    "### Sanity Check"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dbb23d5-ae66-444d-8f5f-b24107fb9c57",
   "metadata": {},
   "source": [
    "Image:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "329fd4ee-4a68-4f3b-b157-a676f13ba587",
   "metadata": {},
   "source": [
    "Figure: `figure-8-1.jpg` (embedded image data omitted)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fde6f17-d244-4270-b759-68e1858d399f",
   "metadata": {},
   "source": [
    "We can retrieve this image summary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "6f52ee1e-ed46-4a81-834a-3608a1cf90ce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries. The arrangement of the chicken pieces creates a visually appealing and playful representation of the world, making it an interesting and creative presentation.\\n\\nmain: image encoded in   865.20 ms by CLIP (    1.50 ms per image patch)'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.get_relevant_documents(\"Images / figures with playful and creative examples\")[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69060724-e390-4dda-8250-5f86025c874a",
   "metadata": {},
   "source": [
    "## RAG\n",
    "\n",
    "Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "771a47fa-1267-4db8-a6ae-5fde48bbc069",
   "metadata": {},
   "outputs": [],
   "source": [
    "from operator import itemgetter\n",
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "\n",
    "# Prompt template\n",
    "template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
    "{context}\n",
    "Question: {question}\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# RAG pipeline\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()} \n",
    "    | prompt \n",
    "    | model \n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ea8414a8-65ee-4e11-8154-029b454f46af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\" Based on the provided context, LLaVA's performance across multiple image domains/subjects is not explicitly mentioned. However, we can infer some information about its performance based on the given text:\\n\\n1. LLaVA achieves an accuracy of 90.92% on the ScienceQA dataset, which is close to the current SoTA (91.68%).\\n2. When prompted with a 2-shot in-context learning task using GPT-4, it achieves an accuracy of 82.69%, indicating a 7.52% absolute gain compared to GPT-3.5.\\n3. For a substantial number of questions, GPT-4 fails due to insufficient context such as images or plots.\\n\\nBased on these points, we can infer that LLaVA performs well across multiple image domains/subjects, but its performance may be limited by the quality and availability of the input images. Additionally, its ability to recognize visual content and provide detailed responses is dependent on the specific task and dataset being used.\""
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.invoke(\"What is the performance of LLaVA across multiple image domains / subjects?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b7aeb57-2ab8-496c-b909-0734ccc5da5f",
   "metadata": {},
   "source": [
    "We can check the [trace](https://smith.langchain.com/public/ab90fb1c-5949-4fc6-a002-56a6056adc6b/r) to review retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "1ad375c5-8aef-4be3-9a12-8ad953fa2d14",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' Sure, I\\'d be happy to help! Based on the provided context, here are some playful and creative explanations for the images/figures mentioned in the paper:\\n\\n1. \"The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries.\"\\n\\nPlayful explanation: \"Look, ma! The fried chicken is mapping out the world one piece at a time! Who needs Google Maps when you have crispy chicken wings to guide the way?\"\\n\\nCreative explanation: \"The arrangement of the fried chicken pieces creates a visual representation of the world that\\'s both appetizing and adventurous. It\\'s like a culinary globe-trotting experience!\"\\n\\n2. \"The image is a screenshot of a conversation between two people, likely discussing a painting.\"\\n\\nPlayful explanation: \"The painting is getting a double take - these two people are having a chat about it and we get to eavesdrop on their art-loving banter!\"\\n\\nCreative explanation: \"This image captures the dynamic exchange of ideas between two art enthusiasts. It\\'s like we\\'re peeking into their creative brainstorming session, where the painting is the catalyst for a lively discussion.\"\\n\\n3. \"The image features a text-based representation of a scene with a person holding onto a rope, possibly a woman, and a boat in the background.\"\\n\\nPlayful explanation: \"This image looks like a page from a choose-your-own-adventure book! Is our brave protagonist about to embark on a thrilling boat ride or hold tight for a wild journey?\"\\n\\nCreative explanation: \"The text-based representation of the scene creates an intriguing narrative that invites the viewer to fill in the blanks. It\\'s like we\\'re reading a visual storybook, where the person holding onto the rope is the hero of their own adventure.\"\\n\\n4. 
\"Figure 5: LLaVA recognizes the famous art work, Mona Lisa, by Leonardo da Vinci.\"\\n\\nPlayful explanation: \"Mona Lisa is getting a digital spotlight - look at her smile now that she\\'s part of this cool image recognition tech!\"\\n\\nCreative explanation: \"This playful recognition of the Mona Lisa painting highlights the advanced technology used in image analysis. It\\'s like LLaVA is giving the famous artwork a modern makeover, showcasing its timeless beauty and relevance in the digital age.\"\\n\\nOverall, these images/figures offer unique opportunities for creative and playful explanations that can capture the viewer\\'s attention while highlighting the technology and narratives presented in the paper.'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.invoke(\"Explain any images / figures in the paper with playful and creative examples.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1da79644-4046-45b0-8c25-01aa73587b22",
   "metadata": {},
   "source": [
    "We can check the [trace](https://smith.langchain.com/public/c6d3b7d5-0f40-4905-ab8f-3a2b77c39af4/r) to review retrieval."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}