pull/21626/head
Chester Curme 2 weeks ago
parent f2bb56bcc6
commit 7feb1e51f2

@ -93,7 +93,7 @@
"\n",
"@chain\n",
"def retriever(query: str) -> List[Document]:\n",
" docs, scores = zip(*vectorstore.similarity_search_with_score(\"dinosaur\"))\n",
" docs, scores = zip(*vectorstore.similarity_search_with_score(query))\n",
" for doc, score in zip(docs, scores):\n",
" doc.metadata[\"score\"] = score\n",
"\n",
@ -121,7 +121,7 @@
}
],
"source": [
"result = retriever.invoke(\"cat\")\n",
"result = retriever.invoke(\"dinosaur\")\n",
"result"
]
},
@ -138,7 +138,13 @@
"id": "af2e73a0-46a1-47e2-8103-68aaa637642a",
"metadata": {},
"source": [
"## SelfQueryRetriever"
"## SelfQueryRetriever\n",
"\n",
"`SelfQueryRetriever` will use a LLM to generate a query that is potentially structured-- for example, it can construct filters for the retrieval on top of the usual semantic-similarity driven selection. See [this guide](/docs/how_to/self_query) for more detail.\n",
"\n",
"`SelfQueryRetriever` includes a short (1 - 2 line) method `_get_docs_with_query` that executes the vectorstore search. We can subclass `SelfQueryRetriever` and override this method to propagate similarity scores.\n",
"\n",
"First, following the [how-to guide](/docs/how_to/self_query), we will need to establish some metadata on which to filter:"
]
},
{
@ -176,6 +182,14 @@
"llm = ChatOpenAI(temperature=0)"
]
},
{
"cell_type": "markdown",
"id": "0a6c6fa8-1e2f-45ee-83e9-a6cbd82292d2",
"metadata": {},
"source": [
"We then override the `_get_docs_with_query` to use the `similarity_search_with_score` method of the underlying vectorstore: "
]
},
{
"cell_type": "code",
"execution_count": 6,
@ -191,7 +205,9 @@
" self, query: str, search_kwargs: Dict[str, Any]\n",
" ) -> List[Document]:\n",
" \"\"\"Get docs, adding score information.\"\"\"\n",
" docs, scores = zip(*vectorstore.similarity_search_with_score(query))\n",
" docs, scores = zip(\n",
" *vectorstore.similarity_search_with_score(query, **search_kwargs)\n",
" )\n",
" for doc, score in zip(docs, scores):\n",
" doc.metadata[\"score\"] = score\n",
"\n",
@ -199,42 +215,40 @@
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "682ee75a-cf33-44d8-914a-7836410fc2e0",
"cell_type": "markdown",
"id": "56e40109-1db6-44c7-a6e6-6989175e267c",
"metadata": {},
"outputs": [],
"source": [
"retriever = CustomSelfQueryRetriever.from_llm(\n",
" llm,\n",
" vectorstore,\n",
" document_content_description,\n",
" metadata_field_info,\n",
")"
"Invoking this retriever will now include similarity scores in the document metadata. Note that the underlying structured-query capabilities of `SelfQueryRetriever` are retained."
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"id": "3359a1ee-34ff-41b6-bded-64c05785b333",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),\n",
" Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),\n",
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571}))"
"(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),)"
]
},
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = retriever.invoke(\"dinosaur\")\n",
"retriever = CustomSelfQueryRetriever.from_llm(\n",
" llm,\n",
" vectorstore,\n",
" document_content_description,\n",
" metadata_field_info,\n",
")\n",
"\n",
"\n",
"result = retriever.invoke(\"dinosaur movie with rating less than 8\")\n",
"result"
]
},
@ -243,12 +257,18 @@
"id": "689ab3ba-3494-448b-836e-05fbe1ffd51c",
"metadata": {},
"source": [
"## MultiVectorRetriever"
"## MultiVectorRetriever\n",
"\n",
"`MultiVectorRetriever` allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger \"parent\" document when invoking the retriever. [ParentDocumentRetriever](/docs/how_to/parent_document_retriever/), a subclass of `MultiVectorRetriever`, includes convenience methods for populating a vectorstore to support this. Further applications are detailed in this [how-to guide](/docs/how_to/multi_vector/).\n",
"\n",
"To propagate similarity scores through this retriever, we can again subclass `MultiVectorRetriever` and override a method. This time we will override `_get_relevant_documents`.\n",
"\n",
"First, we prepare some fake data. We generate fake \"whole documents\" and store them in a document store; here we will use a simple [InMemoryStore](https://api.python.langchain.com/en/latest/stores/langchain_core.stores.InMemoryBaseStore.html)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"id": "a112e545-7b53-4fcd-9c4a-7a42a5cc646d",
"metadata": {},
"outputs": [],
@ -265,20 +285,29 @@
"docstore.mset(fake_whole_documents)"
]
},
{
"cell_type": "markdown",
"id": "453b7415-4a6d-45d4-a329-9c1d7271d1b2",
"metadata": {},
"source": [
"Next we will add some fake \"sub-documents\" to our vectorstore. We can link these sub-documents to the parent documents by populating the `\"doc_id\"` key in its metadata."
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 9,
"id": "314519c0-dde4-41ea-a1ab-d3cf1c17c63f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['41fdfac4-ecdc-4786-9867-e4227a490427',\n",
" 'a9096391-d88c-46d2-8f3a-350c8570387e']"
"['62a85353-41ff-4346-bff7-be6c8ec2ed89',\n",
" '5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',\n",
" '8c1d9a56-120f-45e4-ba70-a19cd19a38f4']"
]
},
"execution_count": 10,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@ -290,6 +319,10 @@
" metadata={\"doc_id\": \"fake_id_1\"},\n",
" ),\n",
" Document(\n",
" page_content=\"A snippet from a larger document discussing discourse.\",\n",
" metadata={\"doc_id\": \"fake_id_1\"},\n",
" ),\n",
" Document(\n",
" page_content=\"A snippet from a larger document discussing chocolate.\",\n",
" metadata={\"doc_id\": \"fake_id_2\"},\n",
" ),\n",
@ -298,9 +331,20 @@
"vectorstore.add_documents(docs)"
]
},
{
"cell_type": "markdown",
"id": "e391f7f3-5a58-40fd-89fa-a0815c5146f7",
"metadata": {},
"source": [
"To propagate the scores, we subclass `MultiVectorRetriever` and override its `_get_relevant_documents` method. Here we will make two changes:\n",
"\n",
"1. We will add similarity scores to the metadata of the corresponding \"sub-documents\" using the `similarity_search_with_score` method of the underlying vectorstore as above;\n",
"2. We will include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces what snippets of text were identified by the retrieval, together with their corresponding similarity scores."
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 10,
"id": "1de61de7-1b58-41d6-9dea-939fef7d741d",
"metadata": {},
"outputs": [],
@ -346,19 +390,27 @@
" return docs"
]
},
{
"cell_type": "markdown",
"id": "7af27b38-631c-463f-9d66-bcc985f06a4f",
"metadata": {},
"source": [
"Invoking this retriever, we can see that it identifies the correct parent document, including the relevant snippet from the sub-document with similarity score."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 11,
"id": "dc42a1be-22e1-4ade-b1bd-bafb85f2424f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.83121419})]})]"
"[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]"
]
},
"execution_count": 13,
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
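
Review note: the score-propagation pattern used in both retrievers above can be sketched outside LangChain as follows. This is a minimal illustration, not the notebook's code: the `Document` dataclass is a stand-in for LangChain's class, `attach_scores` and `group_by_parent` are hypothetical helper names, and the `hits` list is made-up data shaped like the `(document, score)` pairs that `similarity_search_with_score` returns. One difference from the diff is deliberate: looping over the pairs directly avoids `docs, scores = zip(*results)`, which raises `ValueError` when a search (for example, one narrowed by a filter) returns no results.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for LangChain's Document class (illustration only)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_scores(results: List[Tuple[Document, float]]) -> List[Document]:
    """Copy each score into its document's metadata; safe for empty results."""
    docs = []
    for doc, score in results:
        doc.metadata["score"] = score
        docs.append(doc)
    return docs


def group_by_parent(
    sub_docs: List[Document], id_key: str = "doc_id"
) -> Dict[str, List[Document]]:
    """Group scored sub-documents under the ID of their parent document."""
    grouped = defaultdict(list)
    for doc in sub_docs:
        grouped[doc.metadata[id_key]].append(doc)
    return dict(grouped)


# Hypothetical (document, score) pairs; the scores are invented for the example.
hits = [
    (Document("snippet about cats", {"doc_id": "fake_id_1"}), 0.83),
    (Document("snippet about discourse", {"doc_id": "fake_id_1"}), 0.75),
    (Document("snippet about chocolate", {"doc_id": "fake_id_2"}), 0.71),
]

scored = attach_scores(hits)
parents = group_by_parent(scored)

# An empty result set yields an empty list instead of raising ValueError.
empty = attach_scores([])
```

Returning a list here differs from the notebook's overrides, which return the tuple produced by `zip`; either satisfies a caller that only iterates over the documents.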
