"Many documents contain a mixture of content types, including text and images. \n",
"\n",
"Yet, information captured in images is lost in most RAG applications.\n",
"\n",
"With the emergence of multimodal LLMs, like [GPT-4V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:\n",
"\n",
"`Option 1:` (Shown) \n",
"\n",
"* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text\n",
"* Retrieve both using similarity search\n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"\n",
"`Option 2:` \n",
"\n",
"* Use a multimodal LLM (such as [GPT-4V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve text \n",
"* Pass text chunks to an LLM for answer synthesis \n",
"\n",
"`Option 3` \n",
"\n",
"* Use a multimodal LLM (such as [GPT-4V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images\n",
"* Embed and retrieve image summaries with a reference to the raw image \n",
"* Pass raw images and text chunks to a multimodal LLM for answer synthesis \n",
"\n",
"This cookbook highlights `Option 1`: \n",
"\n",
"* We will use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).\n",
"* We will use Open Clip multi-modal embeddings.\n",
"* We will use [Chroma](https://www.trychroma.com/) with support for multi-modal.\n",
"\n",
"A seperate cookbook highlights `Options 2 and 3` [here](https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb).\n",
"For `unstructured`, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) in your system."
"Let's look at an example pdfs containing interesting images.\n",
"\n",
"1/ Art from the J Paul Getty museum:\n",
"\n",
" * Here is a [zip file](https://drive.google.com/file/d/18kRKbq2dqAhhJ3DfZRnYcTBEUfYxe1YR/view?usp=sharing) with the PDF and the already extracted images. \n",
"We can use `partition_pdf` below from [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#key-concepts) to extract text and images.\n",
"\n",
"To supply this to extract the images:\n",
"```\n",
"extract_images_in_pdf=True\n",
"```\n",
"\n",
"\n",
"\n",
"If using this zip file, then you can simply process the text only with:\n",
"The subject of the photo, Florence Owens Thompson, a Cherokee from Oklahoma, initially regretted that Lange ever made this photograph. “She was a very strong woman. She was a leader,” her daughter Katherine later said. “I think that's one of the reasons she resented the photo — because it didn't show her in that light.”\n",
"\n",
"DOROTHEA LANGE. “DESTITUTE PEA PICKERS IN CALIFORNIA. MOTHER OF SEVEN CHILDREN. AGE THIRTY-TWO. NIPOMO, CALIFORNIA.” MARCH 1936. NITRATE NEGATIVE. FARM SECURITY ADMINISTRATION-OFFICE OF WAR INFORMATION COLLECTION. PRINTS AND PHOTOGRAPHS DIVISION.\n",
"docs = retriever.get_relevant_documents(\"Woman with children\",k=10)\n",
"for doc in docs:\n",
" if is_base64(doc.page_content):\n",
" plt_img_base64(doc.page_content)\n",
" else:\n",
" print(doc.page_content)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "69fb15fd-76fc-49b4-806d-c4db2990027d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Visual Elements:\\nThe image is a black and white photograph depicting a woman with two children. The woman is positioned centrally and appears to be in her thirties. She has a look of concern or contemplation on her face, with her hand resting on her chin. Her gaze is directed away from the camera, suggesting introspection or worry. The children are turned away from the camera, with their heads leaning against the woman, seeking comfort or protection. The clothing of the subjects is simple and worn, indicating a lack of wealth. The background is out of focus, drawing attention to the expressions and posture of the subjects.\\n\\nHistorical and Cultural Context:\\nThe photograph was taken by Dorothea Lange in March 1936 and is titled \"Destitute pea pickers in California. Mother of seven children. Age thirty-two. Nipomo, California.\" It was taken during the Great Depression in the United States, a period of severe economic hardship. The woman in the photo, Florence Owens Thompson, was a Cherokee from Oklahoma. The image is part of the Farm Security Administration-Office of War Information Collection, which aimed to document and bring attention to the plight of impoverished farmers and workers during this era.\\n\\nInterpretation and Symbolism:\\nThe photograph, often referred to as \"Migrant Mother,\" has become an iconic symbol of the Great Depression. The woman\\'s expression and posture convey a sense of worry and determination, reflecting the resilience and strength required to endure such difficult times. The children\\'s reliance on their mother for comfort underscores the family\\'s vulnerability and the burdens placed upon the woman. Despite the hardship conveyed, the image also suggests a sense of dignity and maternal protectiveness.\\n\\nThe text provided indicates that Florence Owens Thompson was a strong and leading figure within her community, which contrasts with the vulnerability shown in the photograph. This dichotomy highlights the complexity of Thompson\\'s character and the circumstances of the time, where even the strongest individuals faced moments of hardship that could overshadow their usual demeanor.\\n\\nConnections Between Image and Text:\\nThe text complements the image by providing personal insight into the subject\\'s feelings about the photograph. It reveals that Thompson resented the photo because it did not reflect her strength and leadership qualities. This adds depth to our understanding of the image, as it suggests that the moment captured by Lange is not fully representative of Thompson\\'s character. The photograph, while powerful, is a snapshot that may not encompass the entirety of the subject\\'s identity and life experiences.\\n\\nThe final line of the text, \"They\\'re willing to have me entertain them during the day, but as soon as it starts getting dark, they all go off, and leave me!\" could be interpreted as a metaphor for the transient sympathy of society towards the impoverished during the Great Depression. People may have shown interest or concern during the crisis, but ultimately, those suffering, like Thompson and her family, were left to face their struggles alone when the attention faded. This line underscores the isolation and abandonment felt by many during this period, which is poignantly captured in the photograph\\'s portrayal of the mother and her children.'"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\n",
" \"Woman with children\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "227f08b8-e732-4089-b65c-6eb6f9e48f15",
"metadata": {},
"source": [
"We can see the images retrieved in the LangSmith trace:\n",