{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "9fc3897d-176f-4729-8fd1-cfb4add53abd", "metadata": {}, "source": [ "## Nomic multi-modal RAG\n", "\n", "Many documents contain a mixture of content types, including text and images.\n", "\n", "Yet, information captured in images is lost in most RAG applications.\n", "\n", "With the emergence of multimodal LLMs, like [GPT-4V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG.\n", "\n", "In this demo we:\n", "\n", "* Use multimodal embeddings from Nomic Embed [Vision](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) and [Text](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) to embed images and text\n", "* Retrieve both using similarity search\n", "* Pass raw images and text chunks to a multimodal LLM for answer synthesis\n", "\n", "## Signup\n", "\n", "Get your API token, then run:\n", "```\n", "! nomic login\n", "```\n", "\n", "Then log in with your generated API token:\n", "```\n", "! nomic login < token > \n", "```\n", "\n", "## Packages\n", "\n", "For `unstructured`, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) on your system." ] }, { "cell_type": "code", "execution_count": null, "id": "54926b9b-75c2-4cd4-8f14-b3882a0d370b", "metadata": {}, "outputs": [], "source": [ "! nomic login token" ] }, { "cell_type": "code", "execution_count": null, "id": "febbc459-ebba-4c1a-a52b-fed7731593f8", "metadata": { "scrolled": true }, "outputs": [], "source": [ "! pip install -U langchain-nomic langchain_community tiktoken langchain-openai chromadb langchain # (newest versions required for multi-modal)" ] }, { "cell_type": "code", "execution_count": null, "id": "acbdc603-39e2-4a5f-836c-2bbaecd46b0b", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# lock to 0.10.19 due to a persistent bug in more recent versions\n", "! pip install \"unstructured[all-docs]==0.10.19\" pillow pydantic lxml matplotlib tiktoken" ] }, { "cell_type": "markdown", "id": "1e94b3fb-8e3e-4736-be0a-ad881626c7bd", "metadata": {}, "source": [ "## Data Loading\n", "\n", "### Partition PDF text and images\n", "\n", "Let's look at two example PDFs containing interesting images.\n", "\n", "1/ Art from the J. Paul Getty Museum:\n", "\n", " * Here is a [zip file](https://drive.google.com/file/d/18kRKbq2dqAhhJ3DfZRnYcTBEUfYxe1YR/view?usp=sharing) with the PDF and the already extracted images.\n",
"* https://www.getty.edu/publications/resources/virtuallibrary/0892360224.pdf\n", "* We'll use this as the example below.\n", "\n", "2/ Famous photographs from the Library of Congress:\n", "\n", "* https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf\n", "\n", "We can use `partition_pdf` from [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#key-concepts) below to extract text and images.\n", "\n", "To extract the images from the PDF, set:\n", "```\n", "extract_images_in_pdf=True\n", "```\n", "\n", "If you are using the zip file above, which already contains the extracted images, you can process the text only by setting:\n", "```\n", "extract_images_in_pdf=False\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "9646b524-71a7-4b2a-bdc8-0b81f77e968f", "metadata": {}, "outputs": [], "source": [ "# Folder with the PDF and extracted images\n", "from pathlib import Path\n", "\n", "# Replace with the actual path to your images\n", "path = Path(\"../art\")" ] }, { "cell_type": "code", "execution_count": null, "id": "77f096ab-a933-41d0-8f4e-1efc83998fc3", "metadata": {}, "outputs": [], "source": [ "path.resolve()" ] }, { "cell_type": "code", "execution_count": null, "id": "bc4839c0-8773-4a07-ba59-5364501269b2", "metadata": {}, "outputs": [], "source": [ "# Extract images, tables, and chunk text\n", "from unstructured.partition.pdf import partition_pdf\n", "\n", "raw_pdf_elements = partition_pdf(\n", "    filename=str(path.resolve()) + \"/getty.pdf\",\n", "    extract_images_in_pdf=False,\n", "    infer_table_structure=True,\n", "    chunking_strategy=\"by_title\",\n", "    max_characters=4000,\n", "    new_after_n_chars=3800,\n", "    combine_text_under_n_chars=2000,\n", "    image_output_dir_path=path,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "969545ad", "metadata": {}, "outputs": [], "source": [ "# Categorize text elements by type\n", "tables = []\n", "texts = []\n", "for element in raw_pdf_elements:\n", "    if \"unstructured.documents.elements.Table\" in str(type(element)):\n", "        tables.append(str(element))\n", "    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n", "        texts.append(str(element))" ] }, { "cell_type": "markdown", "id": "5d8e6349-1547-4cbf-9c6f-491d8610ec10", "metadata": {}, "source": [ "## Multi-modal embeddings with our document\n", "\n", "We will use [nomic-embed-vision-v1.5](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) embeddings. This model is aligned\n", "to [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5), allowing for multimodal semantic search and multimodal RAG!"
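, "\n", "\n", "Because the text and vision models share an embedding space, a caption and an image of the same subject should land near each other. As a quick sanity check, here is a minimal sketch that embeds a short caption and the first extracted image and compares them with cosine similarity (it assumes `NomicEmbeddings.embed_image` accepts a list of local file paths, the same interface `add_images` relies on below):\n", "\n", "```\n", "import os\n", "\n", "import numpy as np\n", "from langchain_nomic import NomicEmbeddings\n", "\n", "embeddings = NomicEmbeddings(\n", "    vision_model=\"nomic-embed-vision-v1.5\", model=\"nomic-embed-text-v1.5\"\n", ")\n", "\n", "# Embed a caption with the text model and the first extracted image with the vision model\n", "first_image = sorted(f for f in os.listdir(path) if f.endswith(\".jpg\"))[0]\n", "text_vec = np.array(embeddings.embed_query(\"a portrait of a woman with children\"))\n", "image_vec = np.array(embeddings.embed_image([os.path.join(path, first_image)])[0])\n", "\n", "# Cosine similarity: higher means the caption and image sit closer in the shared space\n", "cosine = float(text_vec @ image_vec / (np.linalg.norm(text_vec) * np.linalg.norm(image_vec)))\n", "print(f\"cosine similarity: {cosine:.3f}\")\n", "```"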
] }, { "cell_type": "code", "execution_count": null, "id": "4bc15842-cb95-4f84-9eb5-656b0282a800", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from langchain_community.vectorstores import Chroma\n", "from langchain_nomic import NomicEmbeddings\n", "\n", "# Create Chroma vectorstores for text and images\n", "text_vectorstore = Chroma(\n", "    collection_name=\"mm_rag_clip_photos_text\",\n", "    embedding_function=NomicEmbeddings(\n", "        vision_model=\"nomic-embed-vision-v1.5\", model=\"nomic-embed-text-v1.5\"\n", "    ),\n", ")\n", "image_vectorstore = Chroma(\n", "    collection_name=\"mm_rag_clip_photos_image\",\n", "    embedding_function=NomicEmbeddings(\n", "        vision_model=\"nomic-embed-vision-v1.5\", model=\"nomic-embed-text-v1.5\"\n", "    ),\n", ")\n", "\n", "# Get image URIs with .jpg extension only\n", "image_uris = sorted(\n", "    [\n", "        os.path.join(path, image_name)\n", "        for image_name in os.listdir(path)\n", "        if image_name.endswith(\".jpg\")\n", "    ]\n", ")\n", "\n", "# Add images\n", "image_vectorstore.add_images(uris=image_uris)\n", "\n", "# Add documents\n", "text_vectorstore.add_texts(texts=texts)\n", "\n", "# Make retrievers\n", "image_retriever = image_vectorstore.as_retriever()\n", "text_retriever = text_vectorstore.as_retriever()" ] }, { "cell_type": "markdown", "id": "02a186d0-27e0-4820-8092-63b5349dd25d", "metadata": {}, "source": [ "## RAG\n", "\n", "`vectorstore.add_images` will store / retrieve images as base64-encoded strings.\n", "\n", "These can be passed to [GPT-4V](https://platform.openai.com/docs/guides/vision)." ] }, { "cell_type": "code", "execution_count": null, "id": "344f56a8-0dc3-433e-851c-3f7600c7a72b", "metadata": {}, "outputs": [], "source": [ "import base64\n", "import io\n", "\n", "from PIL import Image\n", "\n", "\n", "def resize_base64_image(base64_string, size=(128, 128)):\n", "    \"\"\"\n", "    Resize an image encoded as a Base64 string.\n", "\n", "    Args:\n", "        base64_string (str): Base64 string of the original image.\n", "        size (tuple): Desired size of the image as (width, height).\n", "\n", "    Returns:\n", "        str: Base64 string of the resized image.\n", "    \"\"\"\n", "    # Decode the Base64 string\n", "    img_data = base64.b64decode(base64_string)\n", "    img = Image.open(io.BytesIO(img_data))\n", "\n", "    # Resize the image\n", "    resized_img = img.resize(size, Image.LANCZOS)\n", "\n", "    # Save the resized image to a bytes buffer\n", "    buffered = io.BytesIO()\n", "    resized_img.save(buffered, format=img.format)\n", "\n", "    # Encode the resized image to Base64\n", "    return base64.b64encode(buffered.getvalue()).decode(\"utf-8\")\n", "\n", "\n", "def is_base64(s):\n", "    \"\"\"Check if a string is Base64 encoded\"\"\"\n", "    try:\n", "        return base64.b64encode(base64.b64decode(s)) == s.encode()\n", "    except Exception:\n", "        return False\n", "\n", "\n", "def split_image_text_types(docs):\n", "    \"\"\"Split base64-encoded images and text strings\"\"\"\n", "    images = []\n", "    text = []\n", "    for doc in docs:\n", "        doc = doc.page_content  # Extract Document contents\n", "        if is_base64(doc):\n", "            # Resize image to avoid OAI server error\n", "            images.append(\n", "                resize_base64_image(doc, size=(250, 250))\n", "            )  # base64 encoded str\n", "        else:\n", "            text.append(doc)\n", "    return {\"images\": images, \"texts\": text}" ] }, { "cell_type": "markdown", "id": "23a2c1d8-fea6-4152-b184-3172dd46c735", "metadata": {}, "source": [
"Currently, we format the inputs using a `RunnableLambda` while we add image support to `ChatPromptTemplates`.\n", "\n", "Our runnable follows the classic RAG flow:\n", "\n", "* We first compute the context (both \"texts\" and \"images\" in this case) and the question (just a RunnablePassthrough here)\n", "* Then we pass this into our prompt template, which is a custom function that formats the message for the gpt-4-vision-preview model.\n", "* And finally we parse the output as a string." ] }, { "cell_type": "code", "execution_count": null, "id": "5d8919dc-c238-4746-86ba-45d940a7d260", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = \"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "4c93fab3-74c4-4f1d-958a-0bc4cdd0797e", "metadata": {}, "outputs": [], "source": [ "from langchain_core.messages import HumanMessage\n", "from langchain_core.output_parsers import StrOutputParser\n", "from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n", "from langchain_openai import ChatOpenAI\n", "\n", "\n", "def prompt_func(data_dict):\n", "    # Joining the context texts into a single string\n", "    formatted_texts = \"\\n\".join(data_dict[\"text_context\"][\"texts\"])\n", "    messages = []\n", "\n", "    # Adding image(s) to the messages if present\n", "    if data_dict[\"image_context\"][\"images\"]:\n", "        image_message = {\n", "            \"type\": \"image_url\",\n", "            \"image_url\": {\n", "                \"url\": f\"data:image/jpeg;base64,{data_dict['image_context']['images'][0]}\"\n", "            },\n", "        }\n", "        messages.append(image_message)\n", "\n", "    # Adding the text message for analysis\n", "    text_message = {\n", "        \"type\": \"text\",\n", "        \"text\": (\n", "            \"As an expert art critic and historian, your task is to analyze and interpret images, \"\n", "            \"considering their historical and cultural significance. Alongside the images, you will be \"\n", "            \"provided with related text to offer context. Both will be retrieved from a vectorstore based \"\n",
"            \"on user-input keywords. Please use your extensive knowledge and analytical skills to provide a \"\n", "            \"comprehensive summary that includes:\\n\"\n", "            \"- A detailed description of the visual elements in the image.\\n\"\n", "            \"- The historical and cultural context of the image.\\n\"\n", "            \"- An interpretation of the image's symbolism and meaning.\\n\"\n", "            \"- Connections between the image and the related text.\\n\\n\"\n", "            f\"User-provided keywords: {data_dict['question']}\\n\\n\"\n", "            \"Text and / or tables:\\n\"\n", "            f\"{formatted_texts}\"\n", "        ),\n", "    }\n", "    messages.append(text_message)\n", "\n", "    return [HumanMessage(content=messages)]\n", "\n", "\n", "model = ChatOpenAI(temperature=0, model=\"gpt-4-vision-preview\", max_tokens=1024)\n", "\n", "# RAG pipeline\n", "chain = (\n", "    {\n", "        \"text_context\": text_retriever | RunnableLambda(split_image_text_types),\n", "        \"image_context\": image_retriever | RunnableLambda(split_image_text_types),\n", "        \"question\": RunnablePassthrough(),\n", "    }\n", "    | RunnableLambda(prompt_func)\n", "    | model\n", "    | StrOutputParser()\n", ")" ] }, { "cell_type": "markdown", "id": "1566096d-97c2-4ddc-ba4a-6ef88c525e4e", "metadata": {}, "source": [ "## Test retrieval and run RAG" ] }, { "cell_type": "code", "execution_count": null, "id": "90121e56-674b-473b-871d-6e4753fd0c45", "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML, display\n", "\n", "\n", "def plt_img_base64(img_base64):\n", "    # Create an HTML img tag with the base64 string as the source\n", "    image_html = f'<img src=\"data:image/jpeg;base64,{img_base64}\" />'\n", "\n", "    # Display the image by rendering the HTML\n", "    display(HTML(image_html))\n", "\n", "\n", "docs = text_retriever.invoke(\"Women with children\", k=5)\n", "for doc in docs:\n", "    if is_base64(doc.page_content):\n", "        plt_img_base64(doc.page_content)\n", "    else:\n", "        print(doc.page_content)" ] }, { "cell_type": "code", "execution_count": null, "id": "44eaa532-f035-4c04-b578-02339d42554c", "metadata": {}, "outputs": [], "source": [ "docs = image_retriever.invoke(\"Women with children\", k=5)\n", "for doc in docs:\n", "    if is_base64(doc.page_content):\n", "        plt_img_base64(doc.page_content)\n", "    else:\n", "        print(doc.page_content)" ] }, { "cell_type": "code", "execution_count": null, "id": "69fb15fd-76fc-49b4-806d-c4db2990027d", "metadata": {}, "outputs": [], "source": [ "chain.invoke(\"Women with children\")" ] }, { "cell_type": "markdown", "id": "227f08b8-e732-4089-b65c-6eb6f9e48f15", "metadata": {}, "source": [ "We can see the images retrieved in the LangSmith trace:\n", "\n", "LangSmith [trace](https://smith.langchain.com/public/69c558a5-49dc-4c60-a49b-3adbb70f74c5/r/e872c2c8-528c-468f-aefd-8b5cd730a673)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }