docs: add hybrid retrieval how-to guide (#21613)

Updating v0.2 docs with https://github.com/langchain-ai/langchain/pull/21245
2 weeks ago · fe08421207
parent bcf53f93e1
commit fe08421207
2 changed files with 393 additions and 0 deletions
--- a/docs/docs/how_to/hybrid.ipynb
+++ b/docs/docs/how_to/hybrid.ipynb
@ -0,0 +1,392 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "14d3fd06",
+   "metadata": {
+    "id": "14d3fd06"
+   },
+   "source": [
+    "# Hybrid Search\n",
+    "\n",
+    "The standard search in LangChain is done by vector similarity. However, a number of vectorstores implementations (Astra DB, ElasticSearch, Neo4J, AzureSearch, ...) also support more advanced search combining vector similarity search and other search techniques (full-text, BM25, and so on). This is generally referred to as \"Hybrid\" search.\n",
+    "\n",
+    "**Step 1: Make sure the vectorstore you are using supports hybrid search**\n",
+    "\n",
+    "At the moment, there is no unified way to perform hybrid search in LangChain. Each vectorstore may have their own way to do it. This is generally exposed as a keyword argument that is passed in during `similarity_search`. By reading the documentation or source code, figure out whether the vectorstore you are using supports hybrid search, and, if so, how to use it.\n",
+    "\n",
+    "**Step 2: Add that parameter as a configurable field for the chain**\n",
+    "\n",
+    "This will let you easily call the chain and configure any relevant flags at runtime. See [this documentation](/docs/how_to/configure) for more information on configuration.\n",
+    "\n",
+    "**Step 3: Call the chain with that configurable field**\n",
+    "\n",
+    "Now, at runtime you can call this chain with configurable field.\n",
+    "\n",
+    "## Code Example\n",
+    "\n",
+    "Let's see a concrete example of what this looks like in code. We will use the Cassandra/CQL interface of Astra DB for this example.\n",
+    "\n",
+    "Install the following Python package:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c2efe35eea197769",
+   "metadata": {
+    "id": "c2efe35eea197769",
+    "outputId": "527275b4-076e-4b22-945c-e41a59188116"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install \"cassio>=0.1.7\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b4ef96d44341cd84",
+   "metadata": {
+    "collapsed": false,
+    "id": "b4ef96d44341cd84"
+   },
+   "source": [
+    "Get the [connection secrets](https://docs.datastax.com/en/astra/astra-db-vector/get-started/quickstart.html).\n",
+    "\n",
+    "Initialize cassio:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb2cef097277c32e",
+   "metadata": {
+    "id": "cb2cef097277c32e",
+    "outputId": "4c3d05a0-319a-44a0-8ec3-0a9c78453132"
+   },
+   "outputs": [],
+   "source": [
+    "import cassio\n",
+    "\n",
+    "cassio.init(\n",
+    "    database_id=\"Your database ID\",\n",
+    "    token=\"Your application token\",\n",
+    "    keyspace=\"Your key space\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e1e51444877f45eb",
+   "metadata": {
+    "collapsed": false,
+    "id": "e1e51444877f45eb"
+   },
+   "source": [
+    "Create the Cassandra VectorStore with a standard [index analyzer](https://docs.datastax.com/en/astra/astra-db-vector/cql/use-analyzers-with-cql.html). The index analyzer is needed to enable term matching."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7345de3c",
+   "metadata": {
+    "id": "7345de3c",
+    "outputId": "d38bcee0-0134-4ac6-8d35-afcce282481b"
+   },
+   "outputs": [],
+   "source": [
+    "from cassio.table.cql import STANDARD_ANALYZER\n",
+    "from langchain_community.vectorstores import Cassandra\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "vectorstore = Cassandra(\n",
+    "    embedding=embeddings,\n",
+    "    table_name=\"test_hybrid\",\n",
+    "    body_index_options=[STANDARD_ANALYZER],\n",
+    "    session=None,\n",
+    "    keyspace=None,\n",
+    ")\n",
+    "\n",
+    "vectorstore.add_texts(\n",
+    "    [\n",
+    "        \"In 2023, I visited Paris\",\n",
+    "        \"In 2022, I visited New York\",\n",
+    "        \"In 2021, I visited New Orleans\",\n",
+    "    ]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "73887f23bbab978c",
+   "metadata": {
+    "collapsed": false,
+    "id": "73887f23bbab978c"
+   },
+   "source": [
+    "If we do a standard similarity search, we get all the documents:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c2a39fa",
+   "metadata": {
+    "id": "3c2a39fa",
+    "outputId": "5290085b-896c-4c81-9b40-c315331b7009"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='In 2022, I visited New York'),\n",
+       "Document(page_content='In 2023, I visited Paris'),\n",
+       "Document(page_content='In 2021, I visited New Orleans')]"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vectorstore.as_retriever().invoke(\"What city did I visit last?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78d4c3c79e67d8c3",
+   "metadata": {
+    "collapsed": false,
+    "id": "78d4c3c79e67d8c3"
+   },
+   "source": [
+    "The Astra DB vectorstore `body_search` argument can be used to filter the search on the term `new`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "56393baa",
+   "metadata": {
+    "id": "56393baa",
+    "outputId": "d1c939f3-342f-4df4-94a3-d25429b5a25e"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='In 2022, I visited New York'),\n",
+       "Document(page_content='In 2021, I visited New Orleans')]"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vectorstore.as_retriever(search_kwargs={\"body_search\": \"new\"}).invoke(\n",
+    "    \"What city did I visit last?\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "88ae97ed",
+   "metadata": {
+    "id": "88ae97ed"
+   },
+   "source": [
+    "We can now create the chain that we will use to do question-answering over"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "62707b4f",
+   "metadata": {
+    "id": "62707b4f"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_core.output_parsers import StrOutputParser\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_core.runnables import (\n",
+    "    ConfigurableField,\n",
+    "    RunnablePassthrough,\n",
+    ")\n",
+    "from langchain_openai import ChatOpenAI"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6778ffa",
+   "metadata": {
+    "id": "b6778ffa"
+   },
+   "source": [
+    "This is basic question-answering chain set up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "44a865f6",
+   "metadata": {
+    "id": "44a865f6"
+   },
+   "outputs": [],
+   "source": [
+    "template = \"\"\"Answer the question based only on the following context:\n",
+    "{context}\n",
+    "Question: {question}\n",
+    "\"\"\"\n",
+    "prompt = ChatPromptTemplate.from_template(template)\n",
+    "\n",
+    "model = ChatOpenAI()\n",
+    "\n",
+    "retriever = vectorstore.as_retriever()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72125166",
+   "metadata": {
+    "id": "72125166"
+   },
+   "source": [
+    "Here we mark the retriever as having a configurable field. All vectorstore retrievers have `search_kwargs` as a field. This is just a dictionary, with vectorstore specific fields"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "babbadff",
+   "metadata": {
+    "id": "babbadff"
+   },
+   "outputs": [],
+   "source": [
+    "configurable_retriever = retriever.configurable_fields(\n",
+    "    search_kwargs=ConfigurableField(\n",
+    "        id=\"search_kwargs\",\n",
+    "        name=\"Search Kwargs\",\n",
+    "        description=\"The search kwargs to use\",\n",
+    "    )\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d481b70",
+   "metadata": {
+    "id": "2d481b70"
+   },
+   "source": [
+    "We can now create the chain using our configurable retriever"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "210b0446",
+   "metadata": {
+    "id": "210b0446"
+   },
+   "outputs": [],
+   "source": [
+    "chain = (\n",
+    "    {\"context\": configurable_retriever, \"question\": RunnablePassthrough()}\n",
+    "    | prompt\n",
+    "    | model\n",
+    "    | StrOutputParser()\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a38037b2",
+   "metadata": {
+    "id": "a38037b2",
+    "outputId": "1ea14996-5965-4a5e-9678-b9c35ce5c6de"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Paris"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain.invoke(\"What city did I visit last?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7f6458c3",
+   "metadata": {
+    "id": "7f6458c3"
+   },
+   "source": [
+    "We can now invoke the chain with configurable options. `search_kwargs` is the id of the configurable field. The value is the search kwargs to use for Astra DB."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9gYLqBTH8BFz",
+   "metadata": {
+    "id": "9gYLqBTH8BFz",
+    "outputId": "4358a2e6-f306-48f1-dd5c-781ac8a33e89"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "New York"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain.invoke(\n",
+    "    \"What city did I visit last?\",\n",
+    "    config={\"configurable\": {\"search_kwargs\": {\"body_search\": \"new\"}}},\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/docs/how_to/index.mdx
+++ b/docs/docs/how_to/index.mdx
@ -146,6 +146,7 @@ Retrievers are responsible for taking a query and returning relevant documents.
 - [How to retrieve the whole document for a chunk](/docs/how_to/parent_document_retriever)
 - [How to generate metadata filters](/docs/how_to/self_query)
 - [How to create a time-weighted retriever](/docs/how_to/time_weighted_vectorstore)
+- [How to use hybrid vector and keyword retrieval](/docs/how_to/hybrid)

 ### Indexing