docs: add hybrid retrieval how-to guide (#21613)

Updating v0.2 docs with
https://github.com/langchain-ai/langchain/pull/21245
pull/21146/head^2
ccurme 2 weeks ago committed by GitHub
parent bcf53f93e1
commit fe08421207
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -0,0 +1,392 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "14d3fd06",
"metadata": {
"id": "14d3fd06"
},
"source": [
"# Hybrid Search\n",
"\n",
"The standard search in LangChain is done by vector similarity. However, a number of vectorstores implementations (Astra DB, ElasticSearch, Neo4J, AzureSearch, ...) also support more advanced search combining vector similarity search and other search techniques (full-text, BM25, and so on). This is generally referred to as \"Hybrid\" search.\n",
"\n",
"**Step 1: Make sure the vectorstore you are using supports hybrid search**\n",
"\n",
"At the moment, there is no unified way to perform hybrid search in LangChain. Each vectorstore may have their own way to do it. This is generally exposed as a keyword argument that is passed in during `similarity_search`. By reading the documentation or source code, figure out whether the vectorstore you are using supports hybrid search, and, if so, how to use it.\n",
"\n",
"**Step 2: Add that parameter as a configurable field for the chain**\n",
"\n",
"This will let you easily call the chain and configure any relevant flags at runtime. See [this documentation](/docs/how_to/configure) for more information on configuration.\n",
"\n",
"**Step 3: Call the chain with that configurable field**\n",
"\n",
"Now, at runtime you can call this chain with configurable field.\n",
"\n",
"## Code Example\n",
"\n",
"Let's see a concrete example of what this looks like in code. We will use the Cassandra/CQL interface of Astra DB for this example.\n",
"\n",
"Install the following Python package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2efe35eea197769",
"metadata": {
"id": "c2efe35eea197769",
"outputId": "527275b4-076e-4b22-945c-e41a59188116"
},
"outputs": [],
"source": [
"!pip install \"cassio>=0.1.7\""
]
},
{
"cell_type": "markdown",
"id": "b4ef96d44341cd84",
"metadata": {
"collapsed": false,
"id": "b4ef96d44341cd84"
},
"source": [
"Get the [connection secrets](https://docs.datastax.com/en/astra/astra-db-vector/get-started/quickstart.html).\n",
"\n",
"Initialize cassio:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb2cef097277c32e",
"metadata": {
"id": "cb2cef097277c32e",
"outputId": "4c3d05a0-319a-44a0-8ec3-0a9c78453132"
},
"outputs": [],
"source": [
"import cassio\n",
"\n",
"cassio.init(\n",
" database_id=\"Your database ID\",\n",
" token=\"Your application token\",\n",
" keyspace=\"Your key space\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e1e51444877f45eb",
"metadata": {
"collapsed": false,
"id": "e1e51444877f45eb"
},
"source": [
"Create the Cassandra VectorStore with a standard [index analyzer](https://docs.datastax.com/en/astra/astra-db-vector/cql/use-analyzers-with-cql.html). The index analyzer is needed to enable term matching."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7345de3c",
"metadata": {
"id": "7345de3c",
"outputId": "d38bcee0-0134-4ac6-8d35-afcce282481b"
},
"outputs": [],
"source": [
"from cassio.table.cql import STANDARD_ANALYZER\n",
"from langchain_community.vectorstores import Cassandra\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = Cassandra(\n",
" embedding=embeddings,\n",
" table_name=\"test_hybrid\",\n",
" body_index_options=[STANDARD_ANALYZER],\n",
" session=None,\n",
" keyspace=None,\n",
")\n",
"\n",
"vectorstore.add_texts(\n",
" [\n",
" \"In 2023, I visited Paris\",\n",
" \"In 2022, I visited New York\",\n",
" \"In 2021, I visited New Orleans\",\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "73887f23bbab978c",
"metadata": {
"collapsed": false,
"id": "73887f23bbab978c"
},
"source": [
"If we do a standard similarity search, we get all the documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c2a39fa",
"metadata": {
"id": "3c2a39fa",
"outputId": "5290085b-896c-4c81-9b40-c315331b7009"
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='In 2022, I visited New York'),\n",
"Document(page_content='In 2023, I visited Paris'),\n",
"Document(page_content='In 2021, I visited New Orleans')]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.as_retriever().invoke(\"What city did I visit last?\")"
]
},
{
"cell_type": "markdown",
"id": "78d4c3c79e67d8c3",
"metadata": {
"collapsed": false,
"id": "78d4c3c79e67d8c3"
},
"source": [
"The Astra DB vectorstore `body_search` argument can be used to filter the search on the term `new`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56393baa",
"metadata": {
"id": "56393baa",
"outputId": "d1c939f3-342f-4df4-94a3-d25429b5a25e"
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='In 2022, I visited New York'),\n",
"Document(page_content='In 2021, I visited New Orleans')]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.as_retriever(search_kwargs={\"body_search\": \"new\"}).invoke(\n",
" \"What city did I visit last?\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "88ae97ed",
"metadata": {
"id": "88ae97ed"
},
"source": [
"We can now create the chain that we will use to do question-answering over"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62707b4f",
"metadata": {
"id": "62707b4f"
},
"outputs": [],
"source": [
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import (\n",
" ConfigurableField,\n",
" RunnablePassthrough,\n",
")\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "markdown",
"id": "b6778ffa",
"metadata": {
"id": "b6778ffa"
},
"source": [
"This is basic question-answering chain set up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44a865f6",
"metadata": {
"id": "44a865f6"
},
"outputs": [],
"source": [
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"model = ChatOpenAI()\n",
"\n",
"retriever = vectorstore.as_retriever()"
]
},
{
"cell_type": "markdown",
"id": "72125166",
"metadata": {
"id": "72125166"
},
"source": [
"Here we mark the retriever as having a configurable field. All vectorstore retrievers have `search_kwargs` as a field. This is just a dictionary, with vectorstore specific fields"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "babbadff",
"metadata": {
"id": "babbadff"
},
"outputs": [],
"source": [
"configurable_retriever = retriever.configurable_fields(\n",
" search_kwargs=ConfigurableField(\n",
" id=\"search_kwargs\",\n",
" name=\"Search Kwargs\",\n",
" description=\"The search kwargs to use\",\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2d481b70",
"metadata": {
"id": "2d481b70"
},
"source": [
"We can now create the chain using our configurable retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "210b0446",
"metadata": {
"id": "210b0446"
},
"outputs": [],
"source": [
"chain = (\n",
" {\"context\": configurable_retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a38037b2",
"metadata": {
"id": "a38037b2",
"outputId": "1ea14996-5965-4a5e-9678-b9c35ce5c6de"
},
"outputs": [
{
"data": {
"text/plain": [
"Paris"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"What city did I visit last?\")"
]
},
{
"cell_type": "markdown",
"id": "7f6458c3",
"metadata": {
"id": "7f6458c3"
},
"source": [
"We can now invoke the chain with configurable options. `search_kwargs` is the id of the configurable field. The value is the search kwargs to use for Astra DB."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9gYLqBTH8BFz",
"metadata": {
"id": "9gYLqBTH8BFz",
"outputId": "4358a2e6-f306-48f1-dd5c-781ac8a33e89"
},
"outputs": [
{
"data": {
"text/plain": [
"New York"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\n",
" \"What city did I visit last?\",\n",
" config={\"configurable\": {\"search_kwargs\": {\"body_search\": \"new\"}}},\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -146,6 +146,7 @@ Retrievers are responsible for taking a query and returning relevant documents.
- [How to retrieve the whole document for a chunk](/docs/how_to/parent_document_retriever)
- [How to generate metadata filters](/docs/how_to/self_query)
- [How to create a time-weighted retriever](/docs/how_to/time_weighted_vectorstore)
- [How to use hybrid vector and keyword retrieval](/docs/how_to/hybrid)
### Indexing

Loading…
Cancel
Save