docs: how-to on adding scores to retriever results (#21626)

pull/21666/head
ccurme 2 weeks ago committed by GitHub
parent 972d2071c6
commit 2463c8060c

@@ -0,0 +1,446 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9d59582a-6473-4b34-929b-3e94cb443c3d",
"metadata": {},
"source": [
"# How to add scores to retriever results\n",
"\n",
"Retrievers return sequences of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, which by default include no information about the process that retrieved them (e.g., a similarity score against a query). Here we demonstrate how to add retrieval scores to the `.metadata` of documents:\n",
"1. From [vectorstore retrievers](/docs/how_to/vectorstore_retriever);\n",
"2. From higher-order LangChain retrievers, such as [SelfQueryRetriever](/docs/how_to/self_query) or [MultiVectorRetriever](/docs/how_to/multi_vector).\n",
"\n",
"For (1), we will implement a short wrapper function around the corresponding vectorstore. For (2), we will subclass the corresponding retriever and override one of its methods.\n",
"\n",
"## Create vectorstore\n",
"\n",
"First we populate a vectorstore with some data. We will use a [PineconeVectorStore](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html), but this guide is compatible with any LangChain vectorstore that implements a `.similarity_search_with_score` method."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b8cfcb1b-64ee-4b91-8d82-ce7803834985",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.documents import Document\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_pinecone import PineconeVectorStore\n",
"\n",
"docs = [\n",
"    Document(\n",
"        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
"        metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
"        metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
"        metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
"        metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"Toys come alive and have a blast doing so\",\n",
"        metadata={\"year\": 1995, \"genre\": \"animated\"},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
"        metadata={\n",
"            \"year\": 1979,\n",
"            \"director\": \"Andrei Tarkovsky\",\n",
"            \"genre\": \"thriller\",\n",
"            \"rating\": 9.9,\n",
"        },\n",
"    ),\n",
"]\n",
"\n",
"vectorstore = PineconeVectorStore.from_documents(\n",
"    docs, index_name=\"sample\", embedding=OpenAIEmbeddings()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "22ac5ef6-ce18-427f-a91c-62b38a8b41e9",
"metadata": {},
"source": [
"## Retriever\n",
"\n",
"To obtain scores from a vectorstore retriever, we wrap the underlying vectorstore's `.similarity_search_with_score` method in a short function that packages scores into the associated document's metadata.\n",
"\n",
"We add a `@chain` decorator to the function to create a [Runnable](/docs/concepts/#langchain-expression-language) that can be used similarly to a typical retriever."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7e5677c3-f6ee-4974-ab5f-a0f50c199d45",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"\n",
"from langchain_core.documents import Document\n",
"from langchain_core.runnables import chain\n",
"\n",
"\n",
"@chain\n",
"def retriever(query: str) -> List[Document]:\n",
"    # Copy each score into its document's metadata; an empty result yields []\n",
"    docs = []\n",
"    for doc, score in vectorstore.similarity_search_with_score(query):\n",
"        doc.metadata[\"score\"] = score\n",
"        docs.append(doc)\n",
"\n",
"    return docs"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c9cad75e-b955-4012-989c-3c1820b49ba9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),\n",
" Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),\n",
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571})]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = retriever.invoke(\"dinosaur\")\n",
"result"
]
},
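{
"cell_type": "markdown",
"id": "6671308a-be8d-4c15-ae1f-5bd07b342560",
"metadata": {},
"source": [
"Note that similarity scores from the retrieval step are included in the metadata of the above documents. How to interpret a score depends on the vectorstore and its configured metric: some stores return a similarity (higher is closer), others a distance (lower is closer)."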
]
},
{
"cell_type": "markdown",
"id": "af2e73a0-46a1-47e2-8103-68aaa637642a",
"metadata": {},
"source": [
"## SelfQueryRetriever\n",
"\n",
"`SelfQueryRetriever` uses an LLM to generate a query that is potentially structured; for example, it can construct filters for the retrieval on top of the usual semantic-similarity-driven selection. See [this guide](/docs/how_to/self_query) for more detail.\n",
"\n",
"`SelfQueryRetriever` includes a short (1-2 line) method `_get_docs_with_query` that executes the vectorstore search. We can subclass `SelfQueryRetriever` and override this method to propagate similarity scores.\n",
"\n",
"First, following the [how-to guide](/docs/how_to/self_query), we describe the metadata fields on which to filter:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8280b829-2e81-4454-8adc-9a0930047fa2",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"metadata_field_info = [\n",
"    AttributeInfo(\n",
"        name=\"genre\",\n",
"        description=\"The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']\",\n",
"        type=\"string\",\n",
"    ),\n",
"    AttributeInfo(\n",
"        name=\"year\",\n",
"        description=\"The year the movie was released\",\n",
"        type=\"integer\",\n",
"    ),\n",
"    AttributeInfo(\n",
"        name=\"director\",\n",
"        description=\"The name of the movie director\",\n",
"        type=\"string\",\n",
"    ),\n",
"    AttributeInfo(\n",
"        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
"    ),\n",
"]\n",
"document_content_description = \"Brief summary of a movie\"\n",
"llm = ChatOpenAI(temperature=0)"
]
},
{
"cell_type": "markdown",
"id": "0a6c6fa8-1e2f-45ee-83e9-a6cbd82292d2",
"metadata": {},
"source": [
"We then override `_get_docs_with_query` to use the `similarity_search_with_score` method of the underlying vectorstore:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "62c8f3fa-8b64-4afb-87c4-ccbbf9a8bc54",
"metadata": {},
"outputs": [],
"source": [
"from typing import Any, Dict\n",
"\n",
"\n",
"class CustomSelfQueryRetriever(SelfQueryRetriever):\n",
"    def _get_docs_with_query(\n",
"        self, query: str, search_kwargs: Dict[str, Any]\n",
"    ) -> List[Document]:\n",
"        \"\"\"Get docs, adding score information.\"\"\"\n",
"        # Search the retriever's own vectorstore rather than the global one\n",
"        results = self.vectorstore.similarity_search_with_score(\n",
"            query, **search_kwargs\n",
"        )\n",
"        docs = []\n",
"        for doc, score in results:\n",
"            doc.metadata[\"score\"] = score\n",
"            docs.append(doc)\n",
"\n",
"        return docs"
]
},
{
"cell_type": "markdown",
"id": "56e40109-1db6-44c7-a6e6-6989175e267c",
"metadata": {},
"source": [
"Invoking this retriever will now include similarity scores in the document metadata. Note that the underlying structured-query capabilities of `SelfQueryRetriever` are retained."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3359a1ee-34ff-41b6-bded-64c05785b333",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127})]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = CustomSelfQueryRetriever.from_llm(\n",
"    llm,\n",
"    vectorstore,\n",
"    document_content_description,\n",
"    metadata_field_info,\n",
")\n",
"\n",
"\n",
"result = retriever.invoke(\"dinosaur movie with rating less than 8\")\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "689ab3ba-3494-448b-836e-05fbe1ffd51c",
"metadata": {},
"source": [
"## MultiVectorRetriever\n",
"\n",
"`MultiVectorRetriever` allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger \"parent\" document when invoking the retriever. [ParentDocumentRetriever](/docs/how_to/parent_document_retriever/), a subclass of `MultiVectorRetriever`, includes convenience methods for populating a vectorstore to support this. Further applications are detailed in this [how-to guide](/docs/how_to/multi_vector/).\n",
"\n",
"To propagate similarity scores through this retriever, we can again subclass `MultiVectorRetriever` and override a method. This time we will override `_get_relevant_documents`.\n",
"\n",
"First, we prepare some fake data. We generate fake \"whole documents\" and store them in a document store; here we will use a simple [InMemoryStore](https://api.python.langchain.com/en/latest/stores/langchain_core.stores.InMemoryBaseStore.html)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a112e545-7b53-4fcd-9c4a-7a42a5cc646d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.storage import InMemoryStore\n",
"\n",
"# The storage layer for the parent documents\n",
"docstore = InMemoryStore()\n",
"fake_whole_documents = [\n",
"    (\"fake_id_1\", Document(page_content=\"fake whole document 1\")),\n",
"    (\"fake_id_2\", Document(page_content=\"fake whole document 2\")),\n",
"]\n",
"docstore.mset(fake_whole_documents)"
]
},
{
"cell_type": "markdown",
"id": "453b7415-4a6d-45d4-a329-9c1d7271d1b2",
"metadata": {},
"source": [
"Next we will add some fake \"sub-documents\" to our vectorstore. We can link these sub-documents to the parent documents by populating the `\"doc_id\"` key in their metadata."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "314519c0-dde4-41ea-a1ab-d3cf1c17c63f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['62a85353-41ff-4346-bff7-be6c8ec2ed89',\n",
" '5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',\n",
" '8c1d9a56-120f-45e4-ba70-a19cd19a38f4']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = [\n",
"    Document(\n",
"        page_content=\"A snippet from a larger document discussing cats.\",\n",
"        metadata={\"doc_id\": \"fake_id_1\"},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"A snippet from a larger document discussing discourse.\",\n",
"        metadata={\"doc_id\": \"fake_id_1\"},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"A snippet from a larger document discussing chocolate.\",\n",
"        metadata={\"doc_id\": \"fake_id_2\"},\n",
"    ),\n",
"]\n",
"\n",
"vectorstore.add_documents(docs)"
]
},
{
"cell_type": "markdown",
"id": "e391f7f3-5a58-40fd-89fa-a0815c5146f7",
"metadata": {},
"source": [
"To propagate the scores, we subclass `MultiVectorRetriever` and override its `_get_relevant_documents` method. Here we will make two changes:\n",
"\n",
"1. We will add similarity scores to the metadata of the corresponding \"sub-documents\" using the `similarity_search_with_score` method of the underlying vectorstore as above;\n",
"2. We will include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces which snippets of text were identified by the retrieval, together with their corresponding similarity scores."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1de61de7-1b58-41d6-9dea-939fef7d741d",
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"from langchain.retrievers import MultiVectorRetriever\n",
"from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
"\n",
"\n",
"class CustomMultiVectorRetriever(MultiVectorRetriever):\n",
"    def _get_relevant_documents(\n",
"        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
"    ) -> List[Document]:\n",
"        \"\"\"Get documents relevant to a query.\n",
"        Args:\n",
"            query: String to find relevant documents for\n",
"            run_manager: The callbacks handler to use\n",
"        Returns:\n",
"            List of relevant documents\n",
"        \"\"\"\n",
"        results = self.vectorstore.similarity_search_with_score(\n",
"            query, **self.search_kwargs\n",
"        )\n",
"\n",
"        # Map doc_ids to list of sub-documents, adding scores to metadata\n",
"        id_to_doc = defaultdict(list)\n",
"        for doc, score in results:\n",
"            doc_id = doc.metadata.get(\"doc_id\")\n",
"            if doc_id:\n",
"                doc.metadata[\"score\"] = score\n",
"                id_to_doc[doc_id].append(doc)\n",
"\n",
"        # Fetch documents corresponding to doc_ids, retaining sub_docs in metadata\n",
"        docs = []\n",
"        for _id, sub_docs in id_to_doc.items():\n",
"            docstore_docs = self.docstore.mget([_id])\n",
"            if docstore_docs:\n",
"                if doc := docstore_docs[0]:\n",
"                    doc.metadata[\"sub_docs\"] = sub_docs\n",
"                    docs.append(doc)\n",
"\n",
"        return docs"
]
},
{
"cell_type": "markdown",
"id": "7af27b38-631c-463f-9d66-bcc985f06a4f",
"metadata": {},
"source": [
"Invoking this retriever, we can see that it identifies the correct parent document and includes, in its metadata, the matching sub-document snippet together with its similarity score."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "dc42a1be-22e1-4ade-b1bd-bafb85f2424f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = CustomMultiVectorRetriever(vectorstore=vectorstore, docstore=docstore)\n",
"\n",
"retriever.invoke(\"cat\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -143,6 +143,7 @@ Retrievers are responsible for taking a query and returning relevant documents.
- [How to: generate multiple queries to retrieve data for](/docs/how_to/MultiQueryRetriever)
- [How to: use contextual compression to compress the data retrieved](/docs/how_to/contextual_compression)
- [How to: write a custom retriever class](/docs/how_to/custom_retriever)
- [How to: add similarity scores to retriever results](/docs/how_to/add_scores_retriever)
- [How to: combine the results from multiple retrievers](/docs/how_to/ensemble_retriever)
- [How to: reorder retrieved results to put most relevant documents not in the middle](/docs/how_to/long_context_reorder)
- [How to: generate multiple embeddings per document](/docs/how_to/multi_vector)
