update

2 weeks ago · 7feb1e51f2
parent f2bb56bcc6
commit 7feb1e51f2
1 changed file with 83 additions and 31 deletions
--- a/docs/docs/how_to/add_scores_retriever.ipynb
+++ b/docs/docs/how_to/add_scores_retriever.ipynb
@@ -93,7 +93,7 @@
    "\n",
    "@chain\n",
    "def retriever(query: str) -> List[Document]:\n",
-    "    docs, scores = zip(*vectorstore.similarity_search_with_score(\"dinosaur\"))\n",
+    "    docs, scores = zip(*vectorstore.similarity_search_with_score(query))\n",
    "    for doc, score in zip(docs, scores):\n",
    "        doc.metadata[\"score\"] = score\n",
    "\n",
@@ -121,7 +121,7 @@
    }
   ],
   "source": [
-    "result = retriever.invoke(\"cat\")\n",
+    "result = retriever.invoke(\"dinosaur\")\n",
    "result"
   ]
  },
@@ -138,7 +138,13 @@
   "id": "af2e73a0-46a1-47e2-8103-68aaa637642a",
   "metadata": {},
   "source": [
-    "## SelfQueryRetriever"
+    "## SelfQueryRetriever\n",
+    "\n",
+    "`SelfQueryRetriever` uses an LLM to generate a query that is potentially structured. For example, it can construct filters for the retrieval on top of the usual semantic-similarity-driven selection. See [this guide](/docs/how_to/self_query) for more detail.\n",
+    "\n",
+    "`SelfQueryRetriever` includes a short (1-2 line) method, `_get_docs_with_query`, that executes the vectorstore search. We can subclass `SelfQueryRetriever` and override this method to propagate similarity scores.\n",
+    "\n",
+    "First, following the [how-to guide](/docs/how_to/self_query), we will need to establish some metadata on which to filter:"
   ]
  },
  {
@@ -176,6 +182,14 @@
    "llm = ChatOpenAI(temperature=0)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "0a6c6fa8-1e2f-45ee-83e9-a6cbd82292d2",
+   "metadata": {},
+   "source": [
+    "We then override `_get_docs_with_query` to use the `similarity_search_with_score` method of the underlying vectorstore:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 6,
@@ -191,7 +205,9 @@
    "        self, query: str, search_kwargs: Dict[str, Any]\n",
    "    ) -> List[Document]:\n",
    "        \"\"\"Get docs, adding score information.\"\"\"\n",
-    "        docs, scores = zip(*vectorstore.similarity_search_with_score(query))\n",
+    "        docs, scores = zip(\n",
+    "            *vectorstore.similarity_search_with_score(query, **search_kwargs)\n",
+    "        )\n",
    "        for doc, score in zip(docs, scores):\n",
    "            doc.metadata[\"score\"] = score\n",
    "\n",
@@ -199,42 +215,40 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "id": "682ee75a-cf33-44d8-914a-7836410fc2e0",
+   "cell_type": "markdown",
+   "id": "56e40109-1db6-44c7-a6e6-6989175e267c",
   "metadata": {},
-   "outputs": [],
   "source": [
-    "retriever = CustomSelfQueryRetriever.from_llm(\n",
-    "    llm,\n",
-    "    vectorstore,\n",
-    "    document_content_description,\n",
-    "    metadata_field_info,\n",
-    ")"
+    "Invoking this retriever will now include similarity scores in the document metadata. Note that the underlying structured-query capabilities of `SelfQueryRetriever` are retained."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
   "id": "3359a1ee-34ff-41b6-bded-64c05785b333",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),\n",
-       " Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),\n",
-       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),\n",
-       " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571}))"
+       "(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),)"
      ]
     },
-     "execution_count": 8,
+     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "result = retriever.invoke(\"dinosaur\")\n",
+    "retriever = CustomSelfQueryRetriever.from_llm(\n",
+    "    llm,\n",
+    "    vectorstore,\n",
+    "    document_content_description,\n",
+    "    metadata_field_info,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "result = retriever.invoke(\"dinosaur movie with rating less than 8\")\n",
    "result"
   ]
  },
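Note on the `zip(*...)` unpacking used in `_get_docs_with_query`: the sketch below isolates that pattern in plain Python so it can be run without a vectorstore. `Doc` and `attach_scores` are hypothetical stand-ins introduced for illustration, not LangChain names. The empty-result guard is an assumption about how one might handle a search that returns no hits; the notebook code itself does not include it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_scores(results: List[Tuple[Doc, float]]) -> List[Doc]:
    """Copy each similarity score into the matching document's metadata."""
    if not results:
        # With no hits, `docs, scores = zip(*results)` would raise ValueError
        # because zip() of nothing yields nothing to unpack.
        return []
    docs, scores = zip(*results)
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score
    return list(docs)


hits = [(Doc("dinosaur movie"), 0.84), (Doc("toy movie"), 0.79)]
scored = attach_scores(hits)
```

A filtered query can legitimately match nothing, which is the case where the guard matters.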
@@ -243,12 +257,18 @@
   "id": "689ab3ba-3494-448b-836e-05fbe1ffd51c",
   "metadata": {},
   "source": [
-    "## MultiVectorRetriever"
+    "## MultiVectorRetriever\n",
+    "\n",
+    "`MultiVectorRetriever` allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger \"parent\" document when invoking the retriever. [ParentDocumentRetriever](/docs/how_to/parent_document_retriever/), a subclass of `MultiVectorRetriever`, includes convenience methods for populating a vectorstore to support this. Further applications are detailed in this [how-to guide](/docs/how_to/multi_vector/).\n",
+    "\n",
+    "To propagate similarity scores through this retriever, we can again subclass `MultiVectorRetriever` and override a method. This time we will override `_get_relevant_documents`.\n",
+    "\n",
+    "First, we prepare some fake data. We generate fake \"whole documents\" and store them in a document store; here we will use a simple [InMemoryStore](https://api.python.langchain.com/en/latest/stores/langchain_core.stores.InMemoryBaseStore.html)."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 8,
   "id": "a112e545-7b53-4fcd-9c4a-7a42a5cc646d",
   "metadata": {},
   "outputs": [],
@@ -265,20 +285,29 @@
    "docstore.mset(fake_whole_documents)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "453b7415-4a6d-45d4-a329-9c1d7271d1b2",
+   "metadata": {},
+   "source": [
+    "Next we will add some fake \"sub-documents\" to our vectorstore. We can link these sub-documents to the parent documents by populating the `\"doc_id\"` key in their metadata."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 9,
   "id": "314519c0-dde4-41ea-a1ab-d3cf1c17c63f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "['41fdfac4-ecdc-4786-9867-e4227a490427',\n",
-       " 'a9096391-d88c-46d2-8f3a-350c8570387e']"
+       "['62a85353-41ff-4346-bff7-be6c8ec2ed89',\n",
+       " '5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',\n",
+       " '8c1d9a56-120f-45e4-ba70-a19cd19a38f4']"
      ]
     },
-     "execution_count": 10,
+     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
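Note on linking sub-documents to parents through `"doc_id"`: the sketch below shows one way scored sub-documents could be grouped under their parent document, in plain Python. It is not the notebook's `_get_relevant_documents` override, which this diff does not show in full. `Doc`, `collect_parents`, and the dict-based `docstore` are hypothetical names assumed for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def collect_parents(
    results: List[Tuple[Doc, float]],
    docstore: Dict[str, Doc],
    id_key: str = "doc_id",
) -> List[Doc]:
    """Group scored sub-documents by parent ID and attach them to each parent."""
    sub_docs_by_id: Dict[str, List[Doc]] = defaultdict(list)
    for sub_doc, score in results:
        sub_doc.metadata["score"] = score
        sub_docs_by_id[sub_doc.metadata[id_key]].append(sub_doc)

    parents = []
    for doc_id, sub_docs in sub_docs_by_id.items():
        parent = docstore.get(doc_id)
        if parent is None:
            continue  # skip sub-documents whose parent is not in the store
        parent.metadata["sub_docs"] = sub_docs
        parents.append(parent)
    return parents


docstore = {
    "fake_id_1": Doc("fake whole document 1"),
    "fake_id_2": Doc("fake whole document 2"),
}
hits = [
    (Doc("snippet about cats", {"doc_id": "fake_id_1"}), 0.83),
    (Doc("snippet about discourse", {"doc_id": "fake_id_1"}), 0.70),
]
parents = collect_parents(hits, docstore)
```

Grouping by parent ID keeps each parent to a single entry in the result even when several of its sub-documents match.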
@@ -290,6 +319,10 @@
    "        metadata={\"doc_id\": \"fake_id_1\"},\n",
    "    ),\n",
    "    Document(\n",
+    "        page_content=\"A snippet from a larger document discussing discourse.\",\n",
+    "        metadata={\"doc_id\": \"fake_id_1\"},\n",
+    "    ),\n",
+    "    Document(\n",
    "        page_content=\"A snippet from a larger document discussing chocolate.\",\n",
    "        metadata={\"doc_id\": \"fake_id_2\"},\n",
    "    ),\n",
@@ -298,9 +331,20 @@
    "vectorstore.add_documents(docs)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "e391f7f3-5a58-40fd-89fa-a0815c5146f7",
+   "metadata": {},
+   "source": [
+    "To propagate the scores, we subclass `MultiVectorRetriever` and override its `_get_relevant_documents` method. Here we will make two changes:\n",
+    "\n",
+    "1. We will add similarity scores to the metadata of the corresponding \"sub-documents\" using the `similarity_search_with_score` method of the underlying vectorstore as above;\n",
+    "2. We will include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces which snippets of text were identified by the retrieval, together with their corresponding similarity scores."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 10,
   "id": "1de61de7-1b58-41d6-9dea-939fef7d741d",
   "metadata": {},
   "outputs": [],
@@ -346,19 +390,27 @@
    "        return docs"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "7af27b38-631c-463f-9d66-bcc985f06a4f",
+   "metadata": {},
+   "source": [
+    "Invoking this retriever, we can see that it identifies the correct parent document and includes the relevant sub-document snippet together with its similarity score."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 11,
   "id": "dc42a1be-22e1-4ade-b1bd-bafb85f2424f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.83121419})]})]"
+       "[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]"
      ]
     },
-     "execution_count": 13,
+     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }