Improve vespa interface (#4546)

![Screenshot 2023-05-11 at 7 50 31 PM](https://github.com/hwchase17/langchain/assets/130488702/bc8ab4bb-8006-44fc-ba07-df54e84ee2c1)
1 year ago · a4a9d1f403
parent 72f18fd08b
commit a4a9d1f403
2 changed files with 234 additions and 54 deletions
--- a/docs/modules/indexes/retrievers/examples/vespa_retriever.ipynb
+++ b/docs/modules/indexes/retrievers/examples/vespa_retriever.ipynb
@@ -11,6 +11,8 @@
    "Vespa.ai is a platform for highly efficient structured text and vector search.\n",
    "Please refer to [Vespa.ai](https://vespa.ai) for more information.\n",
    "\n",
    "In this example we'll work with the public [cord-19-search](https://github.com/vespa-cloud/cord-19-search) app which serves an index for the [CORD-19](https://allenai.org/data/cord-19) dataset containing Covid-19 research papers.\n",
    "\n",
    "In order to create a retriever, we use [pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to\n",
    "create a connection to a Vespa service."
   ]
@@ -18,34 +20,42 @@
  {
   "cell_type": "code",
   "execution_count": 1,
-   "id": "c10dd962",
+   "id": "101c8eb3",
   "metadata": {},
   "outputs": [],
   "source": [
-    "from vespa.application import Vespa\n",
+    "# Uncomment below if you haven't installed pyvespa\n",
    "\n",
-    "vespa_app = Vespa(url=\"https://doc-search.vespa.oath.cloud\")"
+    "# !pip install pyvespa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9f0406d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def _pretty_print(docs):\n",
    "    for doc in docs:\n",
    "        print(\"-\" * 80)\n",
    "        print(\"CONTENT: \" + doc.page_content + \"\\n\")\n",
    "        print(\"METADATA: \" + str(doc.metadata))\n",
    "        print(\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "3df4ce53",
+   "id": "3db3bfea",
   "metadata": {},
   "source": [
-    "This creates a connection to a Vespa service, here the Vespa documentation search service.\n",
+    "## Retrieving documents"
-    "Using pyvespa, you can also connect to a\n",
-    "[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n",
-    "or a local\n",
-    "[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n",
-    "\n",
-    "\n",
-    "After connecting to the service, you can set up the retriever:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
-   "id": "7ccca1f4",
+   "id": "d83331fa",
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
@@ -53,51 +63,143 @@
   },
   "outputs": [],
   "source": [
-    "from langchain.retrievers.vespa_retriever import VespaRetriever\n",
+    "from langchain.retrievers import VespaRetriever\n",
    "\n",
-    "vespa_query_body = {\n",
+    "# Retrieve the abstracts of the top 2 papers that best match the user query.\n",
-    "    \"yql\": \"select content from paragraph where userQuery()\",\n",
+    "retriever = VespaRetriever.from_params(\n",
-    "    \"hits\": 5,\n",
+    "    'https://api.cord19.vespa.ai', \n",
-    "    \"ranking\": \"documentation\",\n",
+    "    \"abstract\",\n",
-    "    \"locale\": \"en-us\"\n",
+    "    k=2,\n",
-    "}\n",
+    ")"
-    "vespa_content_field = \"content\"\n",
-    "retriever = VespaRetriever(vespa_app, vespa_query_body, vespa_content_field)"
   ]
  },
  {
-   "cell_type": "markdown",
+   "cell_type": "code",
-   "id": "1e7e34e1",
+   "execution_count": 4,
   "id": "f47a2bfe",
   "metadata": {
    "pycharm": {
-     "name": "#%% md\n"
+     "name": "#%%\n"
    }
   },
-   "source": [
+   "outputs": [
-    "This sets up a LangChain retriever that fetches documents from the Vespa application.\n",
+    {
-    "Here, up to 5 results are retrieved from the `content` field in the `paragraph` document type,\n",
+     "name": "stdout",
-    "using `doumentation` as the ranking method. The `userQuery()` is replaced with the actual query\n",
+     "output_type": "stream",
-    "passed from LangChain.\n",
+     "text": [
      "--------------------------------------------------------------------------------\n",
      "CONTENT: <sep />and peak hospitalizations by 4-96x, without contact tracing. Although contact tracing was highly <hi>effective</hi> at reducing spread, it was insufficient to stop outbreaks caused by <hi>travellers</hi> in even the best-case scenario, and the likelihood of exceeding contact tracing capacity was a concern in most scenarios. Quarantine compliance had only a small impact on <hi>COVID</hi> spread; <hi>travel</hi> volume and infection rate drove spread. Interpretation: NL's <hi>travel</hi> <hi>ban</hi> was likely a critically important intervention to prevent <hi>COVID</hi> spread. Even a small number<sep />\n",
      "\n",
-    "Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n",
+      "METADATA: {'id': 'index:content/1/544bbfee3466d2c126719d5f'}\n",
-    "for more information.\n",
+      "--------------------------------------------------------------------------------\n",
      "--------------------------------------------------------------------------------\n",
      "CONTENT: How <hi>effective</hi> are restrictions on mobility in limiting <hi>COVID</hi>-19 spread? Using zip code data across five U.S. cities, we estimate that total cases per capita decrease by 20% for every ten percentage point fall in mobility. Addressing endogeneity concerns, we instrument for <hi>travel</hi> by residential teleworkable and essential shares and find a 27% decline in cases per capita. Using panel data for NYC with week and zip code fixed effects, we estimate a decline of 17%. We find substantial spatial and temporal heterogeneity;east coast cities have stronger effects, with the largest for NYC<sep />\n",
      "\n",
-    "Now you can return the results and continue using the results in LangChain."
+      "METADATA: {'id': 'index:content/0/911dfc6986f1c8bc15fc3a26'}\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "docs = retriever.get_relevant_documents(\"How effective are covid travel bans?\")\n",
    "_pretty_print(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a158b8e",
   "metadata": {},
   "source": [
    "## Configuring the retriever\n",
    "We can further configure our results by specifying metadata fields to retrieve, specifying sources to pull from, adding filters and adding index-specific parameters."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 8,
-   "id": "f47a2bfe",
+   "id": "dc6be773",
-   "metadata": {
+   "metadata": {},
-    "pycharm": {
+   "outputs": [
-     "name": "#%%\n"
+    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n",
      "CONTENT: ...and peak hospitalizations by 4-96x, without contact tracing. Although contact tracing was highly effective at reducing spread, it was insufficient to stop outbreaks caused by travellers in even the best-case scenario, and the likelihood of exceeding contact tracing capacity was a concern in most scenarios. Quarantine compliance had only a small impact on COVID spread; travel volume and infection rate drove spread. Interpretation: NL's travel ban was likely a critically important intervention to prevent COVID spread. Even a small number...\n",
      "\n",
      "METADATA: {'matchfeatures': {'bm25': 35.5404665009022, 'colbert_maxsim': 78.48671418428421}, 'sddocname': 'doc', 'title': \"How effective was Newfoundland & Labrador's travel ban to prevent the spread of COVID-19? An agent-based analysis\", 'id': 'index:content/1/544bbfee3466d2c126719d5f', 'timestamp': 1612738800, 'license': 'medrxiv', 'doi': 'https://doi.org/10.1101/2021.02.05.21251157', 'authors': [{'first': ' D. M.', 'name': ' D. M. Aleman', 'last': 'Aleman'}, {'first': ' B. Z.', 'name': ' B. Z.  Tham', 'last': ' Tham'}, {'first': ' S. J.', 'name': ' S. J.  Wagner', 'last': ' Wagner'}, {'first': ' J.', 'name': ' J.  Semelhago', 'last': ' Semelhago'}, {'first': ' A.', 'name': ' A.  Mohammadi', 'last': ' Mohammadi'}, {'first': ' P.', 'name': ' P.  Price', 'last': ' Price'}, {'first': ' R.', 'name': ' R.  Giffen', 'last': ' Giffen'}, {'first': ' P.', 'name': ' P.  Rahman', 'last': ' Rahman'}], 'source': 'MedRxiv; WHO', 'cord_uid': '9b9kt4sp'}\n",
      "--------------------------------------------------------------------------------\n",
      "--------------------------------------------------------------------------------\n",
      "CONTENT: ...reduction in COVID-19 importation and a delay of the COVID-19 outbreak in Australia by approximately one month. Further projection of COVID-19 to May 2020 showed spread patterns depending on the basic reproduction number. CONCLUSION: Imposing the travel ban was effective in delaying widespread transmission of COVID-19. However, strengthening of the domestic control measures is needed to prevent Australia from becoming another epicentre. Implications for public health: This report has shown the importance of border closure to pandemic control.\n",
      "\n",
      "METADATA: {'matchfeatures': {'bm25': 32.398379319326295, 'colbert_maxsim': 73.91238763928413}, 'sddocname': 'doc', 'title': 'Delaying the COVID-19 epidemic in Australia: evaluating the effectiveness of international travel bans', 'id': 'index:content/1/decd6a8642418607b0d7dff9', 'timestamp': 0, 'license': 'unk', 'authors': [{'first': ' Adeshina', 'name': ' Adeshina Adekunle', 'last': 'Adekunle'}, {'first': ' Michael', 'name': ' Michael  Meehan', 'last': ' Meehan'}, {'first': ' Diana', 'name': ' Diana  Rojas-Alvarez', 'last': ' Rojas-Alvarez'}, {'first': ' James', 'name': ' James  Trauer', 'last': ' Trauer'}, {'first': ' Emma', 'name': ' Emma  McBryde', 'last': ' McBryde'}], 'source': 'WHO', 'cord_uid': 'jdh33itm', 'journal': 'Aust N Z J Public Health'}\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "retriever = VespaRetriever.from_params(\n",
    "    'https://api.cord19.vespa.ai', \n",
    "    \"abstract\",\n",
    "    k=2,\n",
    "    metadata_fields=\"*\",  # return all data fields and store as metadata\n",
    "    ranking=\"hybrid-colbert\",  # other valid values: colbert, bm25\n",
    "    bolding=False,\n",
    ")\n",
    "docs = retriever.get_relevant_documents(\"How effective are covid travel bans?\")\n",
    "_pretty_print(docs)"
   ]
  },
-   "outputs": [],
+  {
   "cell_type": "markdown",
   "id": "11242e84",
   "metadata": {},
   "source": [
    "# Querying with filtering conditions\n",
    "\n",
    "Vespa has powerful querying abilities, and lets you specify many different conditions in YQL. You can add these filtering conditions using the `get_relevant_documents_with_filter` function.\n",
    "\n",
    "Read more on the Vespa query language here: https://docs.vespa.ai/en/query-language.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "223aeaa9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n",
      "CONTENT: Importance: As countermeasures against the economic downturn caused by the coronavirus 2019 (COVID-19) pandemic, many countries have introduced or considering financial incentives for people to engage in economic activities such as travel and use restaurants. Japan has implemented a large-scale, nationwide government-funded program that subsidizes up to 50% of all travel expenses since July 2020 with the aim of reviving the travel industry. However, it remains unknown as to how such provision of government subsidies for travel impacted the COVID-19 pandemic...\n",
      "\n",
      "METADATA: {'matchfeatures': {'bm25': 22.54935242101209, 'colbert_maxsim': 55.04242363572121}, 'sddocname': 'doc', 'title': 'Association between Participation in Government Subsidy Program for Domestic Travel and Symptoms Indicative of COVID-19 Infection', 'journal': 'medRxiv : the preprint server for health sciences', 'id': 'index:content/0/d88422d1d176ab0a854caccc', 'timestamp': 1607036400, 'license': 'medrxiv', 'doi': 'https://doi.org/10.1101/2020.12.03.20243352', 'authors': [{'first': ' A.', 'name': ' A. Miyawaki', 'last': 'Miyawaki'}, {'first': ' T.', 'name': ' T.  Tabuchi', 'last': ' Tabuchi'}, {'first': ' Y.', 'name': ' Y.  Tomata', 'last': ' Tomata'}, {'first': ' Y.', 'name': ' Y.  Tsugawa', 'last': ' Tsugawa'}], 'source': 'MedRxiv; Medline; WHO', 'cord_uid': '0isi7yd4'}\n",
      "--------------------------------------------------------------------------------\n",
      "--------------------------------------------------------------------------------\n",
      "CONTENT: The Japanese government has declared a national emergency and travel entry ban since the coronavirus disease 2019 (COVID-19) pandemic began. As of June 19, 2020, there have been no confirmed cases of COVID-19 in Iwate, a prefecture of Japan. Here, we analyzed the excess deaths as well as the number of patients and medical earnings due to the pandemic from prefectural ...\n",
      "\n",
      "METADATA: {'matchfeatures': {'bm25': 19.348708049098548, 'colbert_maxsim': 58.35367426276207}, 'sddocname': 'doc', 'title': 'Affected medical services in Iwate prefecture in the absence of a COVID-19 outbreak', 'id': 'index:content/1/9f27176791532b37ef8e4a24', 'timestamp': 1592604000, 'license': 'medrxiv', 'doi': 'https://doi.org/10.1101/2020.06.19.20135269', 'authors': [{'first': ' N.', 'name': ' N. Sasaki', 'last': 'Sasaki'}, {'first': ' S. S.', 'name': ' S. S.  Nishizuka', 'last': ' Nishizuka'}], 'source': 'MedRxiv; WHO', 'cord_uid': '7egroqb1'}\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
-    "retriever.get_relevant_documents(\"what is vespa?\")"
+    "docs = retriever.get_relevant_documents_with_filter(\n",
    "    \"How effective are covid travel bans?\", \n",
    "    _filter='abstract contains \"Japan\" and license matches \"medrxiv\"'\n",
    ")\n",
    "_pretty_print(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13039caf",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
@@ -116,7 +218,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.11.3"
  }
 },
 "nbformat": 4,
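
The notebook outputs above show two metadata shapes: only `id` by default, and every returned field when `metadata_fields="*"`. The following is a minimal sketch of that mapping, modelled on the `_query` loop in this commit's `vespa_retriever.py` changes. The name `hits_to_docs`, the plain-dict stand-in for `Document`, and the sample hit are assumptions made for illustration, and unlike the commit's code the sketch copies `fields` rather than popping from the hit in place.

```python
from typing import Any, Dict, List, Sequence, Union


def hits_to_docs(
    hits: List[Dict[str, Any]],
    content_field: str,
    metadata_fields: Union[Sequence[str], str] = (),
) -> List[Dict[str, Any]]:
    """Map Vespa-style hits to dicts shaped like LangChain Documents (hypothetical helper)."""
    docs = []
    for hit in hits:
        fields = dict(hit["fields"])  # copy so the caller's hit is not mutated
        # The content field becomes page_content; a missing field yields "".
        page_content = fields.pop(content_field, "")
        if metadata_fields == "*":
            metadata = fields  # every remaining field becomes metadata
        else:
            metadata = {name: fields.get(name) for name in metadata_fields}
        metadata["id"] = hit["id"]  # the hit id is always included
        docs.append({"page_content": page_content, "metadata": metadata})
    return docs


# Made-up hit, loosely shaped like the notebook output above.
sample_hits = [
    {
        "id": "index:content/1/544bbfee3466d2c126719d5f",
        "fields": {
            "abstract": "...travel ban...",
            "license": "medrxiv",
            "source": "MedRxiv; WHO",
        },
    }
]

all_fields = hits_to_docs(sample_hits, "abstract", metadata_fields="*")
only_license = hits_to_docs(sample_hits, "abstract", metadata_fields=["license"])
id_only = hits_to_docs(sample_hits, "abstract")
```

With `metadata_fields="*"` the content field is left out of the metadata because it has already been moved into `page_content`.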
--- a/langchain/retrievers/vespa_retriever.py
+++ b/langchain/retrievers/vespa_retriever.py
@@ -1,9 +1,8 @@
 """Wrapper for retrieving documents from Vespa."""
 from __future__ import annotations
 import json
-from typing import TYPE_CHECKING, List
+from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union
 from langchain.schema import BaseRetriever, Document
@@ -12,14 +11,19 @@ if TYPE_CHECKING:
 class VespaRetriever(BaseRetriever):
-    def __init__(self, app: Vespa, body: dict, content_field: str):
+    def __init__(
        self,
        app: Vespa,
        body: Dict,
        content_field: str,
        metadata_fields: Optional[Sequence[str]] = None,
    ):
        self._application = app
        self._query_body = body
        self._content_field = content_field
        self._metadata_fields = metadata_fields or ()
-    def get_relevant_documents(self, query: str) -> List[Document]:
-        body = self._query_body.copy()
-        body["query"] = query
+    def _query(self, body: Dict) -> List[Document]:
        response = self._application.query(body)
        if not str(response.status_code).startswith("2"):
@@ -33,12 +37,86 @@ class VespaRetriever(BaseRetriever):
        if "errors" in root:
            raise RuntimeError(json.dumps(root["errors"]))
-        hits = []
+        docs = []
        for child in response.hits:
-            page_content = child["fields"][self._content_field]
-            metadata = {"id": child["id"]}
-            hits.append(Document(page_content=page_content, metadata=metadata))
-        return hits
+            page_content = child["fields"].pop(self._content_field, "")
+            if self._metadata_fields == "*":
+                metadata = child["fields"]
+            else:
+                metadata = {mf: child["fields"].get(mf) for mf in self._metadata_fields}
+            metadata["id"] = child["id"]
+            docs.append(Document(page_content=page_content, metadata=metadata))
+        return docs
    def get_relevant_documents(self, query: str) -> List[Document]:
        body = self._query_body.copy()
        body["query"] = query
        return self._query(body)
    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError
    def get_relevant_documents_with_filter(
        self, query: str, *, _filter: Optional[str] = None
    ) -> List[Document]:
        body = self._query_body.copy()
        _filter = f" and {_filter}" if _filter else ""
        body["yql"] = body["yql"] + _filter
        body["query"] = query
        return self._query(body)
    @classmethod
    def from_params(
        cls,
        url: str,
        content_field: str,
        *,
        k: Optional[int] = None,
        metadata_fields: Union[Sequence[str], Literal["*"]] = (),
        sources: Union[Sequence[str], Literal["*"], None] = None,
        _filter: Optional[str] = None,
        yql: Optional[str] = None,
        **kwargs: Any,
    ) -> VespaRetriever:
        """Instantiate retriever from params.
        Args:
            url (str): Vespa app URL.
            content_field (str): Field in results to return as Document page_content.
            k (Optional[int]): Number of Documents to return. Defaults to None.
            metadata_fields(Sequence[str] or "*"): Fields in results to include in
                document metadata. Defaults to empty tuple ().
            sources (Sequence[str] or "*" or None): Sources to retrieve
                from. Defaults to None.
            _filter (Optional[str]): Document filter condition expressed in YQL.
                Defaults to None.
            yql (Optional[str]): Full YQL query to be used. Should not be specified
                if _filter or sources are specified. Defaults to None.
            kwargs (Any): Keyword arguments added to query body.
        """
        try:
            from vespa.application import Vespa
        except ImportError:
            raise ImportError(
                "pyvespa is not installed, please install with `pip install pyvespa`"
            )
        app = Vespa(url)
        body = kwargs.copy()
        if yql and (sources or _filter):
            raise ValueError(
                "yql should only be specified if both sources and _filter are not "
                "specified."
            )
        else:
            if metadata_fields == "*":
                _fields = "*"
                body["summary"] = "short"
            else:
                _fields = ", ".join([content_field] + list(metadata_fields or []))
            _sources = ", ".join(sources) if isinstance(sources, Sequence) else "*"
            _filter = f" and {_filter}" if _filter else ""
            yql = f"select {_fields} from sources {_sources} where userQuery(){_filter}"
        body["yql"] = yql
        if k:
            body["hits"] = k
        return cls(app, body, content_field, metadata_fields=metadata_fields)
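
To make the query construction in `from_params` and `get_relevant_documents_with_filter` easier to follow, here is a minimal standalone sketch that builds the same kind of query body without contacting Vespa. The function names `build_query_body` and `with_filter` are assumptions made for illustration and are not part of the commit. The `yql` parameter is left out, since as the diff is written the `else` branch appears to rebuild `yql` whenever the `ValueError` is not raised. Source handling is also simplified to "a list of names, or everything".

```python
from typing import Any, Dict, Optional, Sequence, Union


def build_query_body(
    content_field: str,
    *,
    k: Optional[int] = None,
    metadata_fields: Union[Sequence[str], str] = (),
    sources: Optional[Sequence[str]] = None,
    _filter: Optional[str] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Assemble a Vespa query body the way from_params appears to (hypothetical helper)."""
    body = dict(kwargs)  # extra keyword arguments go straight into the body
    if metadata_fields == "*":
        fields = "*"
        body["summary"] = "short"
    else:
        fields = ", ".join([content_field] + list(metadata_fields or []))
    # Simplification: None or "*" selects all sources.
    if sources is None or sources == "*":
        src = "*"
    else:
        src = ", ".join(sources)
    extra = f" and {_filter}" if _filter else ""
    body["yql"] = f"select {fields} from sources {src} where userQuery(){extra}"
    if k:
        body["hits"] = k
    return body


def with_filter(
    body: Dict[str, Any], query: str, _filter: Optional[str] = None
) -> Dict[str, Any]:
    """Copy the body, append a YQL filter condition, and set the user query."""
    new_body = dict(body)
    if _filter:
        new_body["yql"] = new_body["yql"] + f" and {_filter}"
    new_body["query"] = query
    return new_body


body = build_query_body(
    "abstract", k=2, metadata_fields="*", ranking="hybrid-colbert", bolding=False
)
filtered = with_filter(
    body,
    "How effective are covid travel bans?",
    'abstract contains "Japan" and license matches "medrxiv"',
)
print(filtered["yql"])
```

Because the filter is appended as plain text after `where userQuery()`, it has to be a valid YQL condition on its own. The base body is copied, so one retriever can serve both filtered and unfiltered queries.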