langchain/cookbook/nomic_multimodal_rag.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9fc3897d-176f-4729-8fd1-cfb4add53abd",
   "metadata": {},
   "source": [
    "## Nomic multi-modal RAG\n",
    "\n",
    "Many documents contain a mixture of content types, including text and images. \n",
    "\n",
    "Yet, information captured in images is lost in most RAG applications.\n",
    "\n",
    "With the emergence of multimodal LLMs, like [GPT-4V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:\n",
    "\n",
    "In this demo we\n",
    "\n",
    "* Use multimodal embeddings from Nomic Embed [Vision](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) and [Text](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) to embed images and text\n",
    "* Retrieve both using similarity search\n",
    "* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
    "\n",
    "## Signup\n",
    "\n",
    "Get your API token, then run:\n",
    "```\n",
    "! nomic login\n",
    "```\n",
    "\n",
    "Then run with your generated API token \n",
    "```\n",
    "! nomic login < token > \n",
    "```\n",
    "\n",
    "## Packages\n",
    "\n",
    "For `unstructured`, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) in your system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54926b9b-75c2-4cd4-8f14-b3882a0d370b",
   "metadata": {},
   "outputs": [],
   "source": [
    "! nomic login token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "febbc459-ebba-4c1a-a52b-fed7731593f8",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "! pip install -U langchain-nomic langchain_community tiktoken langchain-openai chromadb langchain # (newest versions required for multi-modal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acbdc603-39e2-4a5f-836c-2bbaecd46b0b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# lock to 0.10.19 due to a persistent bug in more recent versions\n",
    "! pip install \"unstructured[all-docs]==0.10.19\" pillow pydantic lxml pillow matplotlib tiktoken"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e94b3fb-8e3e-4736-be0a-ad881626c7bd",
   "metadata": {},
   "source": [
    "## Data Loading\n",
    "\n",
    "### Partition PDF text and images\n",
    "  \n",
    "Let's look at an example pdfs containing interesting images.\n",
    "\n",
    "1/ Art from the J Paul Getty museum:\n",
    "\n",
    " * Here is a [zip file](https://drive.google.com/file/d/18kRKbq2dqAhhJ3DfZRnYcTBEUfYxe1YR/view?usp=sharing) with the PDF and the already extracted images. \n",
    "* https://www.getty.edu/publications/resources/virtuallibrary/0892360224.pdf\n",
    "\n",
    "2/ Famous photographs from library of congress:\n",
    "\n",
    "* https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf\n",
    "* We'll use this as an example below\n",
    "\n",
    "We can use `partition_pdf` below from [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#key-concepts) to extract text and images.\n",
    "\n",
    "To supply this to extract the images:\n",
    "```\n",
    "extract_images_in_pdf=True\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "If using this zip file, then you can simply process the text only with:\n",
    "```\n",
    "extract_images_in_pdf=False\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9646b524-71a7-4b2a-bdc8-0b81f77e968f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Folder with pdf and extracted images\n",
    "from pathlib import Path\n",
    "\n",
    "# replace with actual path to images\n",
    "path = Path(\"../art\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77f096ab-a933-41d0-8f4e-1efc83998fc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "path.resolve()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc4839c0-8773-4a07-ba59-5364501269b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract images, tables, and chunk text\n",
    "from unstructured.partition.pdf import partition_pdf\n",
    "\n",
    "raw_pdf_elements = partition_pdf(\n",
    "    filename=str(path.resolve()) + \"/getty.pdf\",\n",
    "    extract_images_in_pdf=False,\n",
    "    infer_table_structure=True,\n",
    "    chunking_strategy=\"by_title\",\n",
    "    max_characters=4000,\n",
    "    new_after_n_chars=3800,\n",
    "    combine_text_under_n_chars=2000,\n",
    "    image_output_dir_path=path,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "969545ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Categorize text elements by type\n",
    "tables = []\n",
    "texts = []\n",
    "for element in raw_pdf_elements:\n",
    "    if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
    "        tables.append(str(element))\n",
    "    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
    "        texts.append(str(element))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d8e6349-1547-4cbf-9c6f-491d8610ec10",
   "metadata": {},
   "source": [
    "## Multi-modal embeddings with our document\n",
    "\n",
    "We will use [nomic-embed-vision-v1.5](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) embeddings. This model is aligned \n",
    "to [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) allowing for multimodal semantic search and Multimodal RAG!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4bc15842-cb95-4f84-9eb5-656b0282a800",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import uuid\n",
    "\n",
    "import chromadb\n",
    "import numpy as np\n",
    "from langchain_community.vectorstores import Chroma\n",
    "from langchain_nomic import NomicEmbeddings\n",
    "from PIL import Image as _PILImage\n",
    "\n",
    "# Create chroma\n",
    "text_vectorstore = Chroma(\n",
    "    collection_name=\"mm_rag_clip_photos_text\",\n",
    "    embedding_function=NomicEmbeddings(\n",
    "        vision_model=\"nomic-embed-vision-v1.5\", model=\"nomic-embed-text-v1.5\"\n",
    "    ),\n",
    ")\n",
    "image_vectorstore = Chroma(\n",
    "    collection_name=\"mm_rag_clip_photos_image\",\n",
    "    embedding_function=NomicEmbeddings(\n",
    "        vision_model=\"nomic-embed-vision-v1.5\", model=\"nomic-embed-text-v1.5\"\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Get image URIs with .jpg extension only\n",
    "image_uris = sorted(\n",
    "    [\n",
    "        os.path.join(path, image_name)\n",
    "        for image_name in os.listdir(path)\n",
    "        if image_name.endswith(\".jpg\")\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Add images\n",
    "image_vectorstore.add_images(uris=image_uris)\n",
    "\n",
    "# Add documents\n",
    "text_vectorstore.add_texts(texts=texts)\n",
    "\n",
    "# Make retriever\n",
    "image_retriever = image_vectorstore.as_retriever()\n",
    "text_retriever = text_vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a186d0-27e0-4820-8092-63b5349dd25d",
   "metadata": {},
   "source": [
    "## RAG\n",
    "\n",
    "`vectorstore.add_images` will store / retrieve images as base64 encoded strings.\n",
    "\n",
    "These can be passed to [GPT-4V](https://platform.openai.com/docs/guides/vision)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "344f56a8-0dc3-433e-851c-3f7600c7a72b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "import io\n",
    "from io import BytesIO\n",
    "\n",
    "import numpy as np\n",
    "from PIL import Image\n",
    "\n",
    "\n",
    "def resize_base64_image(base64_string, size=(128, 128)):\n",
    "    \"\"\"\n",
    "    Resize an image encoded as a Base64 string.\n",
    "\n",
    "    Args:\n",
    "    base64_string (str): Base64 string of the original image.\n",
    "    size (tuple): Desired size of the image as (width, height).\n",
    "\n",
    "    Returns:\n",
    "    str: Base64 string of the resized image.\n",
    "    \"\"\"\n",
    "    # Decode the Base64 string\n",
    "    img_data = base64.b64decode(base64_string)\n",
    "    img = Image.open(io.BytesIO(img_data))\n",
    "\n",
    "    # Resize the image\n",
    "    resized_img = img.resize(size, Image.LANCZOS)\n",
    "\n",
    "    # Save the resized image to a bytes buffer\n",
    "    buffered = io.BytesIO()\n",
    "    resized_img.save(buffered, format=img.format)\n",
    "\n",
    "    # Encode the resized image to Base64\n",
    "    return base64.b64encode(buffered.getvalue()).decode(\"utf-8\")\n",
    "\n",
    "\n",
    "def is_base64(s):\n",
    "    \"\"\"Check if a string is Base64 encoded\"\"\"\n",
    "    try:\n",
    "        return base64.b64encode(base64.b64decode(s)) == s.encode()\n",
    "    except Exception:\n",
    "        return False\n",
    "\n",
    "\n",
    "def split_image_text_types(docs):\n",
    "    \"\"\"Split numpy array images and texts\"\"\"\n",
    "    images = []\n",
    "    text = []\n",
    "    for doc in docs:\n",
    "        doc = doc.page_content  # Extract Document contents\n",
    "        if is_base64(doc):\n",
    "            # Resize image to avoid OAI server error\n",
    "            images.append(\n",
    "                resize_base64_image(doc, size=(250, 250))\n",
    "            )  # base64 encoded str\n",
    "        else:\n",
    "            text.append(doc)\n",
    "    return {\"images\": images, \"texts\": text}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23a2c1d8-fea6-4152-b184-3172dd46c735",
   "metadata": {},
   "source": [
    "Currently, we format the inputs using a `RunnableLambda` while we add image support to `ChatPromptTemplates`.\n",
    "\n",
    "Our runnable follows the classic RAG flow - \n",
    "\n",
    "* We first compute the context (both \"texts\" and \"images\" in this case) and the question (just a RunnablePassthrough here) \n",
    "* Then we pass this into our prompt template, which is a custom function that formats the message for the gpt-4-vision-preview model. \n",
    "* And finally we parse the output as a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d8919dc-c238-4746-86ba-45d940a7d260",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c93fab3-74c4-4f1d-958a-0bc4cdd0797e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from operator import itemgetter\n",
    "\n",
    "from langchain_core.messages import HumanMessage, SystemMessage\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "\n",
    "def prompt_func(data_dict):\n",
    "    # Joining the context texts into a single string\n",
    "    formatted_texts = \"\\n\".join(data_dict[\"text_context\"][\"texts\"])\n",
    "    messages = []\n",
    "\n",
    "    # Adding image(s) to the messages if present\n",
    "    if data_dict[\"image_context\"][\"images\"]:\n",
    "        image_message = {\n",
    "            \"type\": \"image_url\",\n",
    "            \"image_url\": {\n",
    "                \"url\": f\"data:image/jpeg;base64,{data_dict['image_context']['images'][0]}\"\n",
    "            },\n",
    "        }\n",
    "        messages.append(image_message)\n",
    "\n",
    "    # Adding the text message for analysis\n",
    "    text_message = {\n",
    "        \"type\": \"text\",\n",
    "        \"text\": (\n",
    "            \"As an expert art critic and historian, your task is to analyze and interpret images, \"\n",
    "            \"considering their historical and cultural significance. Alongside the images, you will be \"\n",
    "            \"provided with related text to offer context. Both will be retrieved from a vectorstore based \"\n",
    "            \"on user-input keywords. Please use your extensive knowledge and analytical skills to provide a \"\n",
    "            \"comprehensive summary that includes:\\n\"\n",
    "            \"- A detailed description of the visual elements in the image.\\n\"\n",
    "            \"- The historical and cultural context of the image.\\n\"\n",
    "            \"- An interpretation of the image's symbolism and meaning.\\n\"\n",
    "            \"- Connections between the image and the related text.\\n\\n\"\n",
    "            f\"User-provided keywords: {data_dict['question']}\\n\\n\"\n",
    "            \"Text and / or tables:\\n\"\n",
    "            f\"{formatted_texts}\"\n",
    "        ),\n",
    "    }\n",
    "    messages.append(text_message)\n",
    "\n",
    "    return [HumanMessage(content=messages)]\n",
    "\n",
    "\n",
    "model = ChatOpenAI(temperature=0, model=\"gpt-4-vision-preview\", max_tokens=1024)\n",
    "\n",
    "# RAG pipeline\n",
    "chain = (\n",
    "    {\n",
    "        \"text_context\": text_retriever | RunnableLambda(split_image_text_types),\n",
    "        \"image_context\": image_retriever | RunnableLambda(split_image_text_types),\n",
    "        \"question\": RunnablePassthrough(),\n",
    "    }\n",
    "    | RunnableLambda(prompt_func)\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1566096d-97c2-4ddc-ba4a-6ef88c525e4e",
   "metadata": {},
   "source": [
    "## Test retrieval and run RAG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90121e56-674b-473b-871d-6e4753fd0c45",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import HTML, display\n",
    "\n",
    "\n",
    "def plt_img_base64(img_base64):\n",
    "    # Create an HTML img tag with the base64 string as the source\n",
    "    image_html = f'<img src=\"data:image/jpeg;base64,{img_base64}\" />'\n",
    "\n",
    "    # Display the image by rendering the HTML\n",
    "    display(HTML(image_html))\n",
    "\n",
    "\n",
    "docs = text_retriever.invoke(\"Women with children\", k=5)\n",
    "for doc in docs:\n",
    "    if is_base64(doc.page_content):\n",
    "        plt_img_base64(doc.page_content)\n",
    "    else:\n",
    "        print(doc.page_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44eaa532-f035-4c04-b578-02339d42554c",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = image_retriever.invoke(\"Women with children\", k=5)\n",
    "for doc in docs:\n",
    "    if is_base64(doc.page_content):\n",
    "        plt_img_base64(doc.page_content)\n",
    "    else:\n",
    "        print(doc.page_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69fb15fd-76fc-49b4-806d-c4db2990027d",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain.invoke(\"Women with children\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "227f08b8-e732-4089-b65c-6eb6f9e48f15",
   "metadata": {},
   "source": [
    "We can see the images retrieved in the LangSmith trace:\n",
    "\n",
    "LangSmith [trace](https://smith.langchain.com/public/69c558a5-49dc-4c60-a49b-3adbb70f74c5/r/e872c2c8-528c-468f-aefd-8b5cd730a673)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}