"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search\n",
"\n",
"And the information captured in images is typically lost.\n",
"\n",
"With the emergence of multimodal LLMs, like [GPT4-V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:\n",
"\n",
"`Option 1:` \n",
"\n",
"* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text\n",
"* Retrieve both using similarity search\n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"\n",
"`Option 2:` \n",
"\n",
"* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve text \n",
"* Pass text chunks to an LLM for answer synthesis \n",
"* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve image summaries with a reference to the raw image \n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"* We will use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text, (optionally) images along with their summaries for retrieval.\n",
"* We will demonstrate `Option 2`, and will follow-up on the other approaches in future cookbooks.\n",
"* Download the LLaVA model: `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from [LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)\n",
"/Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
" # Extract the base name of the image without extension\n",
" base_name=$(basename \"$img\" .jpg)\n",
"\n",
" # Define the output file name based on the image name\n",
" output_file=\"${IMG_DIR}${base_name}.txt\"\n",
"\n",
" # Execute the command and save the output to the defined output file\n",
" /Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p \"Describe the image in detail. Be specific about graphs, such as bar plots.\" --image \"$img\" > \"$output_file\"\n",
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"objc[42078]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x31f870208) and /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x31fc9c208). One of the two will be used. Which one is undefined.\n"
]
},
{
"data": {
"text/plain": [
"'The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries. The arrangement of the chicken pieces creates a visually appealing and playful representation of the world, making it an interesting and creative presentation.\\n\\nmain: image encoded in 865.20 ms by CLIP ( 1.50 ms per image patch)'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"Images / figures with playful and creative examples\")[0]"
"\" Based on the provided context, LLaVA's performance across multiple image domains/subjects is not explicitly mentioned. However, we can infer some information about its performance based on the given text:\\n\\n1. LLaVA achieves an accuracy of 90.92% on the ScienceQA dataset, which is close to the current SoTA (91.68%).\\n2. When prompted with a 2-shot in-context learning task using GPT-4, it achieves an accuracy of 82.69%, indicating a 7.52% absolute gain compared to GPT-3.5.\\n3. For a substantial number of questions, GPT-4 fails due to insufficient context such as images or plots.\\n\\nBased on these points, we can infer that LLaVA performs well across multiple image domains/subjects, but its performance may be limited by the quality and availability of the input images. Additionally, its ability to recognize visual content and provide detailed responses is dependent on the specific task and dataset being used.\""
"We can check the [trace](https://smith.langchain.com/public/ab90fb1c-5949-4fc6-a002-56a6056adc6b/r) to review retrieval."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1ad375c5-8aef-4be3-9a12-8ad953fa2d14",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Sure, I\\'d be happy to help! Based on the provided context, here are some playful and creative explanations for the images/figures mentioned in the paper:\\n\\n1. \"The image features a close-up of a tray filled with various pieces of fried chicken. The chicken pieces are arranged in a way that resembles a map of the world, with some pieces placed in the shape of continents and others as countries.\"\\n\\nPlayful explanation: \"Look, ma! The fried chicken is mapping out the world one piece at a time! Who needs Google Maps when you have crispy chicken wings to guide the way?\"\\n\\nCreative explanation: \"The arrangement of the fried chicken pieces creates a visual representation of the world that\\'s both appetizing and adventurous. It\\'s like a culinary globe-trotting experience!\"\\n\\n2. \"The image is a screenshot of a conversation between two people, likely discussing a painting.\"\\n\\nPlayful explanation: \"The painting is getting a double take - these two people are having a chat about it and we get to eavesdrop on their art-loving banter!\"\\n\\nCreative explanation: \"This image captures the dynamic exchange of ideas between two art enthusiasts. It\\'s like we\\'re peeking into their creative brainstorming session, where the painting is the catalyst for a lively discussion.\"\\n\\n3. \"The image features a text-based representation of a scene with a person holding onto a rope, possibly a woman, and a boat in the background.\"\\n\\nPlayful explanation: \"This image looks like a page from a choose-your-own-adventure book! Is our brave protagonist about to embark on a thrilling boat ride or hold tight for a wild journey?\"\\n\\nCreative explanation: \"The text-based representation of the scene creates an intriguing narrative that invites the viewer to fill in the blanks. It\\'s like we\\'re reading a visual storybook, where the person holding onto the rope is the hero of their own adventure.\"\\n\\n4. \"Figure 5: LLaVA recognizes the famous art work, Mona Lisa, by Leonardo da Vinci.\"\\n\\nPlayful explanation: \"Mona Lisa is getting a digital spotlight - look at her smile now that she\\'s part of this cool image recognition tech!\"\\n\\nCreative explanation: \"This playful recognition of the Mona Lisa painting highlights the advanced technology used in image analysis. It\\'s like LLaVA is giving the famous artwork a modern makeover, showcasing its timeless beauty and relevance in the digital age.\"\\n\\nOverall, these images/figures offer unique opportunities for creative and playful explanations that can capture the viewer\\'s attention while highlighting the technology and narratives presented in the paper.'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"Explain any images / figures in the paper with playful and creative examples.\")"
]
},
{
"cell_type": "markdown",
"id": "1da79644-4046-45b0-8c25-01aa73587b22",
"metadata": {},
"source": [
"We can check the [trace](https://smith.langchain.com/public/c6d3b7d5-0f40-4905-ab8f-3a2b77c39af4/r) to review retrieval."