docs: add multi-modal-docs (#21734)

We don't really have any abstractions around multimodal inputs, so add a section explaining that we don't have any abstractions, followed by how-to guides for OpenAI and Anthropic (we probably need to add guides for more providers).

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: junefish <junefish@users.noreply.github.com>
Co-authored-by: William Fu-Hinthorn <13333726+hinthornw@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
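
For orientation, here is a rough sketch of the two content-block shapes that appear in this diff: the OpenAI-style `image_url` block used by the new guides, and the Anthropic-style `image` block from the removed notebook. The helper names `image_block` and `to_anthropic_block` are hypothetical and exist only for illustration; this is not the conversion logic that ships in any LangChain integration, and the image bytes are a stand-in rather than a real JPEG.

```python
import base64
import re


def image_block(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Build an OpenAI-style image content block from raw bytes (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}"},
    }


def to_anthropic_block(block: dict) -> dict:
    """Convert an OpenAI-style data-URL image block into an Anthropic-style block.

    Illustrative only: assumes the URL is a base64 data URL, not a remote URL.
    """
    match = re.fullmatch(
        r"data:(?P<media_type>[^;]+);base64,(?P<data>.+)",
        block["image_url"]["url"],
    )
    if match is None:
        raise ValueError("expected a base64 data URL")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": match["media_type"],
            "data": match["data"],
        },
    }


# Stand-in bytes; a real call would use the downloaded image content instead.
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"

# The list that would be passed as HumanMessage(content=...).
content = [
    {"type": "text", "text": "describe the weather in this image"},
    image_block(fake_image),
]
converted = to_anthropic_block(content[1])
```

In the guides, a list like `content` is what gets passed as the `content` of a `HumanMessage`; the sketch only builds the dictionaries and never calls a model.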
4 months ago · 170cc8aec3
parent fbfed65fb1
commit 170cc8aec3
6 changed files with 433 additions and 162 deletions
--- a/docs/docs/concepts.mdx
+++ b/docs/docs/concepts.mdx
@ -174,7 +174,7 @@ The `content` property describes the content of the message.
 This can be a few different things:

 - A string (most models deal this type of content)
- A List of dictionaries (this is used for multi-modal input, where the dictionary contains information about that input type and that input location)
+- A list of dictionaries (this is used for multimodal input, where each dictionary contains information about the input's type and location)

 #### HumanMessage

@ -476,6 +476,12 @@ If you are still using AgentExecutor, do not fear: we still have a guide on [how
 It is recommended, however, that you start to transition to LangGraph.
 In order to assist in this we have put together a [transition guide on how to do so](/docs/how_to/migrate_agent)

+### Multimodal
+
+Some models are multimodal, accepting images, audio, and even video as inputs. These are still less common, meaning model providers haven't standardized on the "best" way to define the API. Multimodal **outputs** are even less common. As such, we've kept our multimodal abstractions fairly lightweight and plan to further solidify the multimodal APIs and interaction patterns as the field matures.
+
+In LangChain, most chat models that support multimodal inputs also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini, which support video and other byte inputs, the APIs also support the native, model-specific representations.
+
 ### Callbacks

 LangChain provides a callbacks system that allows you to hook into the various stages of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
@ -642,3 +648,7 @@ Table columns:
 | Character  | [CharacterTextSplitter](/docs/how_to/character_text_splitter/)                                                                                                                | A user defined character                                                                                        |               | Splits text based on a user defined character. One of the simpler methods.                                                                                                                                                                                                   |
 | Semantic Chunker (Experimental) | [SemanticChunker](/docs/how_to/semantic-chunker/)                                                                                                                             | Sentences                                                                                                       |               | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
 | Integration: AI21 Semantic | [AI21SemanticTextSplitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter/)                                                                                                                    |    ✅           | Identifies distinct topics that form coherent pieces of text and splits along those.                                                                                                                                                                                         |
+
+
+
+
--- a/docs/docs/how_to/index.mdx
+++ b/docs/docs/how_to/index.mdx
@ -174,7 +174,12 @@ LangChain Tools contain a description of the tool (to pass to the language model
 - [How to: add ad-hoc tool calling capability to LLMs and chat models](/docs/how_to/tools_prompting)
 - [How to: add a human in the loop to tool usage](/docs/how_to/tools_human)
 - [How to: handle errors when calling tools](/docs/how_to/tools_error)
- [How to: call tools using multi-modal data](/docs/how_to/tool_calls_multi_modal)
+
+### Multimodal
+
+- [How to: pass multimodal data directly to models](/docs/how_to/multimodal_inputs/)
+- [How to: use multimodal prompts](/docs/how_to/multimodal_prompts/)
+

 ### Agents

--- a/docs/docs/how_to/multimodal_inputs.ipynb
+++ b/docs/docs/how_to/multimodal_inputs.ipynb
@ -0,0 +1,228 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
+   "metadata": {},
+   "source": [
+    "# How to pass multimodal data directly to models\n",
+    "\n",
+    "Here we demonstrate how to pass multimodal input directly to models. \n",
+    "We currently expect all input to be passed in the same format as [OpenAI expects](https://platform.openai.com/docs/guides/vision).\n",
+    "For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.\n",
+    "\n",
+    "In this example we will ask a model to describe an image."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "0d9fd81a-b7f0-445a-8e3d-cfc2d31fdd59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "fb896ce9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_core.messages import HumanMessage\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "model = ChatOpenAI(model=\"gpt-4o\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4fca4da7",
+   "metadata": {},
+   "source": [
+    "The most commonly supported way to pass in an image is as a base64-encoded string.\n",
+    "This should work for most model integrations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9ca1040c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import base64\n",
+    "\n",
+    "import httpx\n",
+    "\n",
+    "image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ec680b6b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The weather in the image appears to be clear and pleasant. The sky is mostly blue with scattered, light clouds, suggesting a sunny day with minimal cloud cover. There is no indication of rain or strong winds, and the overall scene looks bright and calm. The lush green grass and clear visibility further indicate good weather conditions.\n"
+     ]
+    }
+   ],
+   "source": [
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
+    "        {\n",
+    "            \"type\": \"image_url\",\n",
+    "            \"image_url\": {\"url\": f\"data:image/jpeg;base64,{image_data}\"},\n",
+    "        },\n",
+    "    ],\n",
+    ")\n",
+    "response = model.invoke([message])\n",
+    "print(response.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8656018e-c56d-47d2-b2be-71e87827f90a",
+   "metadata": {},
+   "source": [
+    "We can feed the image URL directly in a content block of type \"image_url\". Note that only some model providers support this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "a8819cf3-5ddc-44f0-889a-19ca7b7fe77e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The weather in the image appears to be clear and sunny. The sky is mostly blue with a few scattered clouds, suggesting good visibility and a likely pleasant temperature. The bright sunlight is casting distinct shadows on the grass and vegetation, indicating it is likely daytime, possibly late morning or early afternoon. The overall ambiance suggests a warm and inviting day, suitable for outdoor activities.\n"
+     ]
+    }
+   ],
+   "source": [
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
+    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
+    "    ],\n",
+    ")\n",
+    "response = model.invoke([message])\n",
+    "print(response.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c470309",
+   "metadata": {},
+   "source": [
+    "We can also pass in multiple images."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "325fb4ca",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Yes, the two images are the same. They both depict a wooden boardwalk extending through a grassy field under a blue sky with light clouds. The scenery, lighting, and composition are identical.\n"
+     ]
+    }
+   ],
+   "source": [
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"are these two images the same?\"},\n",
+    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
+    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
+    "    ],\n",
+    ")\n",
+    "response = model.invoke([message])\n",
+    "print(response.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71bd28cf-d76c-44e2-a55e-c5f265db986e",
+   "metadata": {},
+   "source": [
+    "## Tool calls\n",
+    "\n",
+    "Some multimodal models support [tool calling](/docs/concepts/#functiontool-calling) features as well. To call tools using such models, simply bind tools to them in the [usual way](/docs/how_to/tool_calling), and invoke the model using content blocks of the desired type (e.g., containing image data)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "cd22ea82-2f93-46f9-9f7a-6aaf479fcaa9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_BSX4oq4SKnLlp2WlzDhToHBr'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from typing import Literal\n",
+    "\n",
+    "from langchain_core.tools import tool\n",
+    "\n",
+    "\n",
+    "@tool\n",
+    "def weather_tool(weather: Literal[\"sunny\", \"cloudy\", \"rainy\"]) -> None:\n",
+    "    \"\"\"Describe the weather\"\"\"\n",
+    "    pass\n",
+    "\n",
+    "\n",
+    "model_with_tools = model.bind_tools([weather_tool])\n",
+    "\n",
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
+    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
+    "    ],\n",
+    ")\n",
+    "response = model_with_tools.invoke([message])\n",
+    "print(response.tool_calls)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/docs/how_to/multimodal_prompts.ipynb
+++ b/docs/docs/how_to/multimodal_prompts.ipynb
@ -0,0 +1,184 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
+   "metadata": {},
+   "source": [
+    "# How to use multimodal prompts\n",
+    "\n",
+    "Here we demonstrate how to use prompt templates to format multimodal inputs to models. \n",
+    "\n",
+    "In this example we will ask a model to describe an image."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "0d9fd81a-b7f0-445a-8e3d-cfc2d31fdd59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import base64\n",
+    "\n",
+    "import httpx\n",
+    "\n",
+    "image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
+    "image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "2671f995",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "model = ChatOpenAI(model=\"gpt-4o\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "4ee35e4f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompt = ChatPromptTemplate.from_messages(\n",
+    "    [\n",
+    "        (\"system\", \"Describe the image provided\"),\n",
+    "        (\n",
+    "            \"user\",\n",
+    "            [{\"type\": \"image_url\", \"image_url\": \"data:image/jpeg;base64,{image_data}\"}],\n",
+    "        ),\n",
+    "    ]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "089f75c2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = prompt | model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "02744b06",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The image depicts a sunny day with a beautiful blue sky filled with scattered white clouds. The sky has varying shades of blue, ranging from a deeper hue near the horizon to a lighter, almost pale blue higher up. The white clouds are fluffy and scattered across the expanse of the sky, creating a peaceful and serene atmosphere. The lighting and cloud patterns suggest pleasant weather conditions, likely during the daytime hours on a mild, sunny day in an outdoor natural setting.\n"
+     ]
+    }
+   ],
+   "source": [
+    "response = chain.invoke({\"image_data\": image_data})\n",
+    "print(response.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e9b9ebf6",
+   "metadata": {},
+   "source": [
+    "We can also pass in multiple images."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "02190ee3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompt = ChatPromptTemplate.from_messages(\n",
+    "    [\n",
+    "        (\"system\", \"compare the two pictures provided\"),\n",
+    "        (\n",
+    "            \"user\",\n",
+    "            [\n",
+    "                {\n",
+    "                    \"type\": \"image_url\",\n",
+    "                    \"image_url\": \"data:image/jpeg;base64,{image_data1}\",\n",
+    "                },\n",
+    "                {\n",
+    "                    \"type\": \"image_url\",\n",
+    "                    \"image_url\": \"data:image/jpeg;base64,{image_data2}\",\n",
+    "                },\n",
+    "            ],\n",
+    "        ),\n",
+    "    ]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "42af057b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = prompt | model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "513abe00",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The two images provided are identical. Both images feature a wooden boardwalk path extending through a lush green field under a bright blue sky with some clouds. The perspective, colors, and elements in both images are exactly the same.\n"
+     ]
+    }
+   ],
+   "source": [
+    "response = chain.invoke({\"image_data1\": image_data, \"image_data2\": image_data})\n",
+    "print(response.content)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea8152c3",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/docs/how_to/tool_calls_multi_modal.ipynb
+++ b/docs/docs/how_to/tool_calls_multi_modal.ipynb
@ -1,160 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
-   "metadata": {},
-   "source": [
-    "# How to call tools with multi-modal data\n",
-    "\n",
-    "Here we demonstrate how to call tools with multi-modal data, such as images.\n",
-    "\n",
-    "Some multi-modal models, such as those that can reason over images or audio, support [tool calling](/docs/concepts/#functiontool-calling) features as well.\n",
-    "\n",
-    "To call tools using such models, simply bind tools to them in the [usual way](/docs/how_to/tool_calling), and invoke the model using content blocks of the desired type (e.g., containing image data).\n",
-    "\n",
-    "Below, we demonstrate examples using [OpenAI](/docs/integrations/platforms/openai) and [Anthropic](/docs/integrations/platforms/anthropic). We will use the same image and tool in all cases. Let's first select an image, and build a placeholder tool that expects as input the string \"sunny\", \"cloudy\", or \"rainy\". We will ask the models to describe the weather in the image."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "0d9fd81a-b7f0-445a-8e3d-cfc2d31fdd59",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from typing import Literal\n",
-    "\n",
-    "from langchain_core.tools import tool\n",
-    "\n",
-    "image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
-    "\n",
-    "\n",
-    "@tool\n",
-    "def weather_tool(weather: Literal[\"sunny\", \"cloudy\", \"rainy\"]) -> None:\n",
-    "    \"\"\"Describe the weather\"\"\"\n",
-    "    pass"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8656018e-c56d-47d2-b2be-71e87827f90a",
-   "metadata": {},
-   "source": [
-    "## OpenAI\n",
-    "\n",
-    "For OpenAI, we can feed the image URL directly in a content block of type \"image_url\":"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "a8819cf3-5ddc-44f0-889a-19ca7b7fe77e",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_mRYL50MtHdeNuNIjSCm5UPmB'}]\n"
-     ]
-    }
-   ],
-   "source": [
-    "from langchain_core.messages import HumanMessage\n",
-    "from langchain_openai import ChatOpenAI\n",
-    "\n",
-    "model = ChatOpenAI(model=\"gpt-4o\").bind_tools([weather_tool])\n",
-    "\n",
-    "message = HumanMessage(\n",
-    "    content=[\n",
-    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
-    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
-    "    ],\n",
-    ")\n",
-    "response = model.invoke([message])\n",
-    "print(response.tool_calls)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e5738224-1109-4bf8-8976-ff1570dd1d46",
-   "metadata": {},
-   "source": [
-    "Note that we recover tool calls with parsed arguments in LangChain's [standard format](/docs/how_to/tool_calling) in the model response."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "0cee63ff-e09f-4dd8-8323-912edbde94f6",
-   "metadata": {},
-   "source": [
-    "## Anthropic\n",
-    "\n",
-    "For Anthropic, we can format a base64-encoded image into a content block of type \"image\", as below:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "d90c4590-71c8-42b1-99ff-03a9eca8082e",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'toolu_016m9KfknJqx5fVRYk4tkF6s'}]\n"
-     ]
-    }
-   ],
-   "source": [
-    "import base64\n",
-    "\n",
-    "import httpx\n",
-    "from langchain_anthropic import ChatAnthropic\n",
-    "\n",
-    "image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")\n",
-    "\n",
-    "model = ChatAnthropic(model=\"claude-3-sonnet-20240229\").bind_tools([weather_tool])\n",
-    "\n",
-    "message = HumanMessage(\n",
-    "    content=[\n",
-    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
-    "        {\n",
-    "            \"type\": \"image\",\n",
-    "            \"source\": {\n",
-    "                \"type\": \"base64\",\n",
-    "                \"media_type\": \"image/jpeg\",\n",
-    "                \"data\": image_data,\n",
-    "            },\n",
-    "        },\n",
-    "    ],\n",
-    ")\n",
-    "response = model.invoke([message])\n",
-    "print(response.tool_calls)"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
--- a/docs/vercel.json
+++ b/docs/vercel.json
@ -13,6 +13,10 @@
    }
  ],
  "redirects": [
+    {
+      "source": "/docs/how_to/tool_calls_multi_modal(/?)",
+      "destination": "/docs/how_to/multimodal_inputs/"
+    },
    {
      "source": "/v0.2/docs/langsmith(/?)",
      "destination": "https://docs.smith.langchain.com/"