Google Cloud Enterprise Search retriever (#7857)

Added a retriever that encapsulates Google Cloud Enterprise Search.

Co-authored-by: Bagatur <baskaryan@gmail.com>
1 year ago · f2ef3ff54a
parent 1152f4d48b
commit f2ef3ff54a
4 changed files with 467 additions and 0 deletions
--- a/docs/extras/modules/data_connection/retrievers/integrations/google_cloud_enterprise_search.ipynb
+++ b/docs/extras/modules/data_connection/retrievers/integrations/google_cloud_enterprise_search.ipynb
@@ -0,0 +1,246 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Google Cloud Enterprise Search\n",
+    "\n",
+    "\n",
+    "[Enterprise Search](https://cloud.google.com/enterprise-search) is a part of the Generative AI App Builder suite of tools offered by Google Cloud.\n",
+    "\n",
+    "Gen AI App Builder lets developers, even those with limited machine learning skills, quickly and easily tap into the power of Google’s foundation models, search expertise, and conversational AI technologies to create enterprise-grade generative AI applications. \n",
+    "\n",
+    "Enterprise Search lets organizations quickly build generative AI-powered search engines for customers and employees. Enterprise Search is underpinned by a variety of Google Search technologies, including semantic search, which helps deliver more relevant results than traditional keyword-based search techniques by using natural language processing and machine learning techniques to infer relationships within the content and intent from the user’s query input. Enterprise Search also benefits from Google’s expertise in understanding how users search and factors in content relevance to order displayed results. \n",
+    "\n",
+    "Google Cloud offers Enterprise Search via Gen App Builder in Google Cloud Console and via an API for enterprise workflow integration. \n",
+    "\n",
+    "This notebook demonstrates how to configure Enterprise Search and use the Enterprise Search retriever. The Enterprise Search retriever encapsulates the [Generative AI App Builder Python client library](https://cloud.google.com/generative-ai-app-builder/docs/libraries#client-libraries-install-python) and uses it to access the Enterprise Search [Search Service API](https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1beta.services.search_service)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install pre-requisites\n",
+    "\n",
+    "You need to install the `google-cloud-discoveryengine` package to use the Enterprise Search retriever."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install google-cloud-discoveryengine"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Configure access to Google Cloud and Google Cloud Enterprise Search\n",
+    "\n",
+    "Enterprise Search is generally available for the allowlist (which means customers need to be approved for access) as of June 6, 2023. Contact your Google Cloud sales team for access and pricing details. We are previewing additional features that are coming soon to the generally available offering as part of our [Trusted Tester](https://cloud.google.com/ai/earlyaccess/join?hl=en) program. Sign up for [Trusted Tester](https://cloud.google.com/ai/earlyaccess/join?hl=en) and contact your Google Cloud sales team for an expedited trial.\n",
+    "\n",
+    "Before you can run this notebook you need to:\n",
+    "- Set or create a Google Cloud project and turn on Gen App Builder\n",
+    "- Create and populate an unstructured data store\n",
+    "- Set credentials to access `Enterprise Search API`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Set or create a Google Cloud project and turn on Gen App Builder\n",
+    "\n",
+    "Follow the instructions in the [Enterprise Search Getting Started guide](https://cloud.google.com/generative-ai-app-builder/docs/before-you-begin) to set/create a GCP project and enable Gen App Builder.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create and populate an unstructured data store\n",
+    "\n",
+    "[Use Google Cloud Console to create an unstructured data store](https://cloud.google.com/generative-ai-app-builder/docs/create-engine-es#unstructured-data) and populate it with the example PDF documents from the  `gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs` Cloud Storage folder. Make sure to use the `Cloud Storage (without metadata)` option."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Set credentials to access Enterprise Search API\n",
+    "\n",
+    "The [Gen App Builder client libraries](https://cloud.google.com/generative-ai-app-builder/docs/libraries) used by the Enterprise Search retriever provide high-level language support for authenticating to Gen App Builder programmatically. Client libraries support [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/application-default-credentials); the libraries look for credentials in a set of defined locations and use those credentials to authenticate requests to the API. With ADC, you can make credentials available to your application in a variety of environments, such as local development or production, without needing to modify your application code.\n",
+    "\n",
+    "If running in [Google Colab](https://colab.google), authenticate with `google.colab.auth`; otherwise follow one of the [supported methods](https://cloud.google.com/docs/authentication/application-default-credentials) to make sure that your Application Default Credentials are properly set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "\n",
+    "if \"google.colab\" in sys.modules:\n",
+    "    from google.colab import auth as google_auth\n",
+    "\n",
+    "    google_auth.authenticate_user()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Configure and use the Enterprise Search retriever\n",
+    "\n",
+    "The Enterprise Search retriever is implemented in the `langchain.retrievers.GoogleCloudEnterpriseSearchRetriever` class. The `get_relevant_documents` method returns a list of `langchain.schema.Document` documents where the `page_content` field of each document is populated with either an `extractive segment` or an `extractive answer` that matches a query. The `metadata` field is populated with metadata (if any) of a document from which the segments or answers were extracted.\n",
+    "\n",
+    "An extractive answer is verbatim text that is returned with each search result. It is extracted directly from the original document. Extractive answers are typically displayed near the top of web pages to provide an end user with a brief answer that is contextually relevant to their query. Extractive answers are available for website and unstructured search.\n",
+    "\n",
+    "An extractive segment is verbatim text that is returned with each search result. An extractive segment is usually more verbose than an extractive answer. Extractive segments can be displayed as an answer to a query, and can be used to perform post-processing tasks and as input for large language models to generate answers or new text. Extractive segments are available for unstructured search.\n",
+    "\n",
+    "For more information about extractive segments and extractive answers refer to [product documentation](https://cloud.google.com/generative-ai-app-builder/docs/snippets).\n",
+    "\n",
+    "When creating an instance of the retriever you can specify a number of parameters that control which Enterprise Search data store to access and how a natural language query is processed, including configurations for extractive answers and segments.\n",
+    "\n",
+    "The mandatory parameters are:\n",
+    "\n",
+    "- `project_id` - Your Google Cloud PROJECT_ID\n",
+    "- `search_engine_id` - The ID of the data store you want to use. \n",
+    "\n",
+    "The `project_id` and `search_engine_id` parameters can be provided explicitly in the retriever's constructor or through the environment variables - `PROJECT_ID` and `SEARCH_ENGINE_ID`.\n",
+    "\n",
+    "You can also configure a number of optional parameters, including:\n",
+    "\n",
+    "- `max_documents` - The maximum number of documents used to provide extractive segments or extractive answers\n",
+    "- `get_extractive_answers` - By default, the retriever is configured to return extractive segments. Set this field to `True` to return extractive answers\n",
+    "- `max_extractive_answer_count` - The maximum number of extractive answers returned in each search result.\n",
+    "    At most 5 answers will be returned\n",
+    "- `max_extractive_segment_count` - The maximum number of extractive segments returned in each search result.\n",
+    "    Currently one segment will be returned\n",
+    "- `filter` - The filter expression that allows you to filter the search results based on the metadata associated with the documents in the searched data store. \n",
+    "- `query_expansion_condition` - Specification to determine under which conditions query expansion should occur.\n",
+    "    0 - Unspecified query expansion condition. In this case, server behavior defaults to disabled.\n",
+    "    1 - Disabled query expansion. Only the exact search query is used, even if SearchResponse.total_size is zero.\n",
+    "    2 - Automatic query expansion built by the Search API.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Configure and use the retriever with extractive segments"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.retrievers import GoogleCloudEnterpriseSearchRetriever\n",
+    "\n",
+    "PROJECT_ID = \"<YOUR PROJECT ID>\"  # Set to your Project ID\n",
+    "SEARCH_ENGINE_ID = \"<YOUR SEARCH ENGINE ID>\"  # Set to your data store ID"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "retriever = GoogleCloudEnterpriseSearchRetriever(\n",
+    "    project_id=PROJECT_ID,\n",
+    "    search_engine_id=SEARCH_ENGINE_ID,\n",
+    "    max_documents=3,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What are Alphabet's Other Bets?\"\n",
+    "\n",
+    "result = retriever.get_relevant_documents(query)\n",
+    "for doc in result:\n",
+    "    print(doc)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Configure and use the retriever with extractive answers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "retriever = GoogleCloudEnterpriseSearchRetriever(\n",
+    "    project_id=PROJECT_ID,\n",
+    "    search_engine_id=SEARCH_ENGINE_ID,\n",
+    "    max_documents=3,\n",
+    "    max_extractive_answer_count=3,\n",
+    "    get_extractive_answers=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What are Alphabet's Other Bets?\"\n",
+    "\n",
+    "result = retriever.get_relevant_documents(query)\n",
+    "for doc in result:\n",
+    "    print(doc)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.10"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
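The integer codes for `query_expansion_condition` listed in the notebook are easy to mistype. Below is a minimal sketch, assuming a local `IntEnum` and a small helper (the names `QueryExpansion` and `build_retriever_kwargs` are hypothetical and not part of LangChain), that gives the codes readable names and checks the limits declared on the retriever fields before the keyword arguments are passed on. The project and data store IDs are placeholders.

```python
from enum import IntEnum
from typing import Any, Dict


class QueryExpansion(IntEnum):
    """Hypothetical local names for the integer codes documented in the notebook."""

    UNSPECIFIED = 0  # server behavior defaults to disabled
    DISABLED = 1  # only the exact search query is used
    AUTO = 2  # automatic query expansion built by the Search API


def build_retriever_kwargs(
    project_id: str,
    search_engine_id: str,
    max_documents: int = 5,
    max_extractive_answer_count: int = 1,
    get_extractive_answers: bool = False,
    query_expansion: QueryExpansion = QueryExpansion.DISABLED,
) -> Dict[str, Any]:
    """Check the documented limits and return constructor keyword arguments."""
    if not 1 <= max_documents <= 100:
        raise ValueError("max_documents must be between 1 and 100")
    if not 1 <= max_extractive_answer_count <= 5:
        raise ValueError("max_extractive_answer_count must be between 1 and 5")
    return {
        "project_id": project_id,
        "search_engine_id": search_engine_id,
        "max_documents": max_documents,
        "max_extractive_answer_count": max_extractive_answer_count,
        "get_extractive_answers": get_extractive_answers,
        "query_expansion_condition": int(query_expansion),
    }


# Placeholder IDs; replace them with real values before use.
kwargs = build_retriever_kwargs(
    "my-project",
    "my-data-store",
    max_documents=3,
    query_expansion=QueryExpansion.AUTO,
)
print(kwargs["query_expansion_condition"])
```

The resulting dictionary could then be unpacked into the constructor, as in `GoogleCloudEnterpriseSearchRetriever(**kwargs)`. The retriever's own pydantic fields enforce the same bounds, so the helper only moves the failure earlier and makes the intent readable.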
--- a/langchain/retrievers/__init__.py
+++ b/langchain/retrievers/__init__.py
@@ -6,6 +6,9 @@ from langchain.retrievers.chatgpt_plugin_retriever import ChatGPTPluginRetriever
 from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
 from langchain.retrievers.docarray import DocArrayRetriever
 from langchain.retrievers.elastic_search_bm25 import ElasticSearchBM25Retriever
+from langchain.retrievers.google_cloud_enterprise_search import (
+    GoogleCloudEnterpriseSearchRetriever,
+)
 from langchain.retrievers.kendra import AmazonKendraRetriever
 from langchain.retrievers.knn import KNNRetriever
 from langchain.retrievers.llama_index import (
@@ -39,6 +42,7 @@ __all__ = [
    "ContextualCompressionRetriever",
    "ChaindeskRetriever",
    "ElasticSearchBM25Retriever",
+    "GoogleCloudEnterpriseSearchRetriever",
    "KNNRetriever",
    "LlamaIndexGraphRetriever",
    "LlamaIndexRetriever",
--- a/langchain/retrievers/google_cloud_enterprise_search.py
+++ b/langchain/retrievers/google_cloud_enterprise_search.py
@@ -0,0 +1,191 @@
+"""Retriever wrapper for Google Cloud Enterprise Search on Gen App Builder."""
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Sequence
+
+from pydantic import Extra, Field, root_validator
+
+from langchain.callbacks.manager import (
+    AsyncCallbackManagerForRetrieverRun,
+    CallbackManagerForRetrieverRun,
+)
+from langchain.schema import BaseRetriever, Document
+from langchain.utils import get_from_dict_or_env
+
+if TYPE_CHECKING:
+    from google.cloud.discoveryengine_v1beta import (
+        SearchRequest,
+        SearchResult,
+        SearchServiceClient,
+    )
+
+
+class GoogleCloudEnterpriseSearchRetriever(BaseRetriever):
+    """Wrapper around Google Cloud Enterprise Search Service API.
+    For the detailed explanation of the Enterprise Search concepts
+    and configuration parameters refer to the product documentation.
+
+    https://cloud.google.com/generative-ai-app-builder/docs/enterprise-search-introduction
+    """
+
+    project_id: str
+    """Google Cloud Project ID."""
+    search_engine_id: str
+    """Enterprise Search engine ID."""
+    serving_config_id: str = "default_config"
+    """Enterprise Search serving config ID."""
+    location_id: str = "global"
+    """Enterprise Search engine location."""
+    filter: Optional[str] = None
+    """Filter expression."""
+    get_extractive_answers: bool = False
+    """If True return Extractive Answers, otherwise return Extractive Segments."""
+    max_documents: int = Field(default=5, ge=1, le=100)
+    """The maximum number of documents to return."""
+    max_extractive_answer_count: int = Field(default=1, ge=1, le=5)
+    """The maximum number of extractive answers returned in each search result.
+    At most 5 answers will be returned for each SearchResult.
+    """
+    max_extractive_segment_count: int = Field(default=1, ge=1, le=1)
+    """The maximum number of extractive segments returned in each search result.
+    Currently one segment will be returned for each SearchResult.
+    """
+    query_expansion_condition: int = Field(default=1, ge=0, le=2)
+    """Specification to determine under which conditions query expansion should occur.
+    0 - Unspecified query expansion condition. In this case, server behavior defaults 
+        to disabled
+    1 - Disabled query expansion. Only the exact search query is used, even if 
+        SearchResponse.total_size is zero.
+    2 - Automatic query expansion built by the Search API.
+    """
+    credentials: Any = None
+    """The default custom credentials (google.auth.credentials.Credentials) to use
+    when making API calls. If not provided, credentials will be ascertained from
+    the environment."""
+
+    _client: SearchServiceClient
+    _serving_config: str
+
+    class Config:
+        """Configuration for this pydantic object."""
+
+        extra = Extra.forbid
+        arbitrary_types_allowed = True
+        underscore_attrs_are_private = True
+
+    @root_validator(pre=True)
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validates the environment."""
+        try:
+            from google.cloud import discoveryengine_v1beta  # noqa: F401
+        except ImportError as exc:
+            raise ImportError(
+                "google.cloud.discoveryengine is not installed. "
+                "Please install it with pip install google-cloud-discoveryengine"
+            ) from exc
+
+        values["project_id"] = get_from_dict_or_env(values, "project_id", "PROJECT_ID")
+        values["search_engine_id"] = get_from_dict_or_env(
+            values, "search_engine_id", "SEARCH_ENGINE_ID"
+        )
+
+        return values
+
+    def __init__(self, **data: Any) -> None:
+        """Initializes private fields."""
+        from google.cloud.discoveryengine_v1beta import SearchServiceClient
+
+        super().__init__(**data)
+        self._client = SearchServiceClient(credentials=self.credentials)
+        self._serving_config = self._client.serving_config_path(
+            project=self.project_id,
+            location=self.location_id,
+            data_store=self.search_engine_id,
+            serving_config=self.serving_config_id,
+        )
+
+    def _convert_search_response(
+        self, results: Sequence[SearchResult]
+    ) -> List[Document]:
+        """Converts a sequence of search results to a list of LangChain documents."""
+        from google.protobuf.json_format import MessageToDict
+
+        documents = []
+        for result in results:
+            document_dict = MessageToDict(result.document._pb)
+            derived_struct_data = document_dict.get("derivedStructData", None)
+            if derived_struct_data:
+                doc_metadata = document_dict.get("structData", {})
+                chunk_type = (
+                    "extractive_answers"
+                    if self.get_extractive_answers
+                    else "extractive_segments"
+                )
+                for chunk in derived_struct_data.get(chunk_type, []):
+                    if chunk_type == "extractive_answers":
+                        doc_metadata["source"] = (
+                            f"{derived_struct_data.get('link', '')}"
+                            f":{chunk.get('pageNumber', '')}"
+                        )
+                    else:
+                        doc_metadata[
+                            "source"
+                        ] = f"{derived_struct_data.get('link', '')}"
+                    doc_metadata["id"] = document_dict["id"]
+                    document = Document(
+                        page_content=chunk.get("content", ""), metadata=doc_metadata
+                    )
+                    documents.append(document)
+
+        return documents
+
+    def _create_search_request(self, query: str) -> SearchRequest:
+        """Prepares a SearchRequest object."""
+        from google.cloud.discoveryengine_v1beta import SearchRequest
+
+        query_expansion_spec = SearchRequest.QueryExpansionSpec(
+            condition=self.query_expansion_condition,
+        )
+
+        if self.get_extractive_answers:
+            extractive_content_spec = (
+                SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
+                    max_extractive_answer_count=self.max_extractive_answer_count,
+                )
+            )
+        else:
+            extractive_content_spec = (
+                SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
+                    max_extractive_segment_count=self.max_extractive_segment_count,
+                )
+            )
+
+        content_search_spec = SearchRequest.ContentSearchSpec(
+            extractive_content_spec=extractive_content_spec,
+        )
+
+        request = SearchRequest(
+            query=query,
+            filter=self.filter,
+            serving_config=self._serving_config,
+            page_size=self.max_documents,
+            content_search_spec=content_search_spec,
+            query_expansion_spec=query_expansion_spec,
+        )
+
+        return request
+
+    def _get_relevant_documents(
+        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
+    ) -> List[Document]:
+        """Get documents relevant for a query."""
+        search_request = self._create_search_request(query)
+        response = self._client.search(search_request)
+        documents = self._convert_search_response(response.results)
+
+        return documents
+
+    async def _aget_relevant_documents(
+        self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
+    ) -> List[Document]:
+        raise NotImplementedError
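The mapping performed by `_convert_search_response` can be followed without a Google Cloud project. The sketch below is a simplified, dependency-free approximation: `Doc` is a stand-in for `langchain.schema.Document`, `convert_result` is a hypothetical helper, and the sample dictionary is made-up data in the shape the method reads. One deliberate difference from the diff: the sketch copies the metadata for every chunk, whereas the method above reuses a single `doc_metadata` dict across the chunks of one result.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for langchain.schema.Document so the sketch has no dependencies."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def convert_result(document_dict: Dict[str, Any], use_answers: bool) -> List[Doc]:
    """Turn one search result (already converted to a dict) into Doc objects."""
    derived = document_dict.get("derivedStructData")
    if not derived:
        return []
    chunk_type = "extractive_answers" if use_answers else "extractive_segments"
    link = derived.get("link", "")
    docs = []
    for chunk in derived.get(chunk_type, []):
        # Copy per chunk so each Doc keeps its own "source" value.
        metadata = dict(document_dict.get("structData", {}))
        if use_answers:
            metadata["source"] = f"{link}:{chunk.get('pageNumber', '')}"
        else:
            metadata["source"] = link
        metadata["id"] = document_dict["id"]
        docs.append(Doc(page_content=chunk.get("content", ""), metadata=metadata))
    return docs


# Made-up sample data; real responses come from the Search Service API.
sample = {
    "id": "doc-1",
    "derivedStructData": {
        "link": "gs://bucket/report.pdf",
        "extractive_answers": [
            {"pageNumber": "3", "content": "First answer."},
            {"pageNumber": "7", "content": "Second answer."},
        ],
    },
}
docs = convert_result(sample, use_answers=True)
for doc in docs:
    print(doc.metadata["source"], "->", doc.page_content)
```

With answers enabled, each document's `source` combines the link and the page number; with segments, it is the link alone.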
--- a/tests/integration_tests/retrievers/test_google_cloud_enterprise_search.py
+++ b/tests/integration_tests/retrievers/test_google_cloud_enterprise_search.py
@@ -0,0 +1,26 @@
+"""Test Google Cloud Enterprise Search retriever.
+
+You need to create a Gen App Builder search app and populate it 
+with data to run the integration tests.
+Follow the instructions in the example notebook:
+google_cloud_enterprise_search.ipynb
+to set up the app and configure authentication.
+
+Set the following environment variables before the tests:
+PROJECT_ID - set to your Google Cloud project ID
+SEARCH_ENGINE_ID - the ID of the search engine to use for the test
+"""
+
+from langchain.retrievers.google_cloud_enterprise_search import (
+    GoogleCloudEnterpriseSearchRetriever,
+)
+from langchain.schema import Document
+
+
+def test_google_cloud_enterprise_search_get_relevant_documents() -> None:
+    """Test the get_relevant_documents() method."""
+    retriever = GoogleCloudEnterpriseSearchRetriever()
+    documents = retriever.get_relevant_documents("What are Alphabet's Other Bets?")
+    for doc in documents:
+        assert isinstance(doc, Document)
+        assert doc.page_content
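Because the test constructs the retriever without arguments, it depends entirely on `PROJECT_ID` and `SEARCH_ENGINE_ID` being set. A small pre-check along the following lines could report which variables are missing before the retriever is created; `missing_env_vars` is a hypothetical helper, not part of this commit.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


REQUIRED = ("PROJECT_ID", "SEARCH_ENGINE_ID")
missing = missing_env_vars(REQUIRED)
if missing:
    print("Skipping: set " + ", ".join(missing) + " first")
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real process environment.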