[searx-search] add docs, improved wrapper api, registered as tool

- Improved the search wrapper API to mirror the usage of the Google Search
  wrapper.
- Registered searx-search as a loadable tool
- Added documentation and example notebook
searx-api-pre
blob42 1 year ago
parent a21e9becd4
commit a62b134e99

@ -0,0 +1,35 @@
# SearxNG Search API
This page covers how to use the SearxNG search API within LangChain.
It is broken into two parts: installation and setup, and then references to the specific SearxNG API wrapper.
## Installation and Setup
- You can find a list of public SearxNG instances [here](https://searx.space/).
- It is recommended to use a self-hosted instance to avoid abuse of the public instances. Also note that public instances often have a limit on the number of requests.
- To run a self-hosted instance see [this page](https://searxng.github.io/searxng/admin/installation.html) for more information.
- To use the tool you need to provide the searx host URL in one of two ways (a short sketch of both follows below):
1. passing the named parameter `searx_host` when creating the instance.
2. exporting the environment variable `SEARX_HOST`.
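A minimal sketch of both options (the host URL below is a placeholder for your own instance):
```python
import os
from langchain.utilities import SearxSearchWrapper

# Option 1: pass the host explicitly when creating the wrapper.
search = SearxSearchWrapper(searx_host="https://searx.example.com")

# Option 2: export the host through the environment instead.
os.environ["SEARX_HOST"] = "https://searx.example.com"
search = SearxSearchWrapper()
```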
## Wrappers
### Utility
You can use the wrapper to get results from a SearxNG instance.
```python
from langchain.utilities import SearxSearchWrapper
```
### Tool
You can also easily load this wrapper as a Tool (to use with an Agent).
You can do this with:
```python
from langchain.agents import load_tools
tools = load_tools(["searx-search"], searx_host="https://searx.example.com")
```
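As a rough sketch of plugging the loaded tool into an agent (this assumes an OpenAI API key is available; the host URL is a placeholder):
```python
from langchain.agents import initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["searx-search"], searx_host="https://searx.example.com")

# The agent decides when to call the "Search" tool while answering.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
agent.run("What is the capital of France?")
```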
For more information on this, see [this page](../modules/agents/tools.md)

@ -119,3 +119,11 @@ Below is a list of all supported tools and relevant information:
- Requires LLM: No
- Extra Parameters: `google_api_key`, `google_cse_id`
- For more information on this, see [this page](../../ecosystem/google_search.md)
**searx-search**
- Tool Name: Search
- Tool Description: A wrapper around SearxNG meta search engine. Input should be a search query.
- Notes: SearxNG is easy to self-host and is a good privacy-friendly alternative to Google Search. It uses the SearxNG API. A short loading sketch follows this entry.
- Requires LLM: No
- Extra Parameters: `searx_host`
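As a quick illustration of the entry above (the host URL is a placeholder and the query is arbitrary), a minimal sketch showing that the loaded tool carries the registered name and description and runs without an LLM:
```python
from langchain.agents import load_tools

tools = load_tools(["searx-search"], searx_host="https://searx.example.com")
searx_tool = tools[0]

# Registered under the name "Search"; the wrapped function queries the SearxNG instance.
print(searx_tool.name)
print(searx_tool.description)
print(searx_tool.func("deep learning frameworks"))
```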

@ -0,0 +1,197 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "40c7223e",
"metadata": {},
"source": [
"# SearxNG Search API\n",
"\n",
"This notebook goes over how to use a self-hosted SearxNG search API to search the web."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "288f2aa4",
"metadata": {},
"outputs": [],
"source": [
"from langchain.searx_search import SearxSearchWrapper"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f4ce83fa",
"metadata": {},
"outputs": [],
"source": [
"search = SearxSearchWrapper(searx_host=\"http://127.0.0.1:8888\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ff6ef4e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In all, 45 individuals have served 46 presidencies spanning 58 full four-year terms. Joe Biden is the 46th and current president of the United States, having assumed office on January 20, 2021.'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"search.run(\"Who is the current president of the united states of america?\")"
]
},
{
"cell_type": "markdown",
"id": "bf728728",
"metadata": {},
"source": [
"For some engines, if a direct `answer` is available, the wrapper will return the answer instead of the full search results. You can use the `results` method of the wrapper if you want to obtain all the results."
]
},
{
"cell_type": "markdown",
"id": "cbac93d4",
"metadata": {},
"source": [
"\n",
"# Custom Parameters\n",
"\n",
"SearxNG supports up to [139 search engines](https://docs.searxng.org/admin/engines/configured_engines.html#configured-engines). You can also customize the Searx wrapper with arbitrary named parameters that will be passed to the Searx search API. In the example below we make more interesting use of the custom search parameters offered by the Searx search API."
]
},
{
"cell_type": "markdown",
"id": "7844deaa",
"metadata": {},
"source": [
"In this example we will be using the `engines` parameter to query Wikipedia."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "1517e24b",
"metadata": {},
"outputs": [],
"source": [
"search = SearxSearchWrapper(searx_host=\"http://127.0.0.1:8888\", k=5) # k is for max number of items"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "4ded48b0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Large language models (LLMs) represent a major advancement in AI, with the promise of transforming domains through learned knowledge. LLM sizes have been increasing 10X every year for the last few years, and as these models grow in complexity and size, so do their capabilities.\\n\\nGPT-3 can translate language, write essays, generate computer code, and more — all with limited to no supervision. In July 2020, OpenAI unveiled GPT-3, a language model that was easily the largest known at the time. Put simply, GPT-3 is trained to predict the next word in a sentence, much like how a text message autocomplete feature works.\\n\\nAll of todays well-known language models—e.g., GPT-3 from OpenAI, PaLM or LaMDA from Google, Galactica or OPT from Meta, Megatron-Turing from Nvidia/Microsoft, Jurassic-1 from AI21 Labs—are...\\n\\nLarge language models are computer programs that open new possibilities of text understanding and generation in software systems. Consider this: ...\\n\\nLarge language models (LLMs) such as GPT-3are increasingly being used to generate text. These tools should be used with care, since they can generate content that is biased, non-verifiable, constitutes original research, or violates copyrights.'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"search.run(\"large language model \", engines='wiki')"
]
},
{
"cell_type": "markdown",
"id": "259f5a5b",
"metadata": {},
"source": [
"## Obtaining results with metadata"
]
},
{
"cell_type": "markdown",
"id": "3c4cf1db",
"metadata": {},
"source": [
"In this example we will be looking for scientific papers using the `categories` parameter and limiting the results to a `time_range` (not all engines support the time range option).\n",
"\n",
"We would also like to obtain the results in a structured way, including metadata. For this we will be using the `results` method of the wrapper."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7cd5510b",
"metadata": {},
"outputs": [],
"source": [
"search = SearxSearchWrapper(searx_host=\"http://127.0.0.1:8888\")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "2ff1acd5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'snippet': '… on natural language instructions, large language models (… the prompt used to steer the model, and most effective prompts … to prompt engineering, we propose Automatic Prompt …',\n",
" 'title': 'Large language models are human-level prompt engineers',\n",
" 'link': 'https://arxiv.org/abs/2211.01910'},\n",
" {'snippet': '… Large language models (LLMs) have introduced new possibilities for prototyping with AI [18]. Pre-trained on a large amount of text data, models … language instructions called prompts. …',\n",
" 'title': 'Promptchainer: Chaining large language model prompts through visual programming',\n",
" 'link': 'https://dl.acm.org/doi/abs/10.1145/3491101.3519729'},\n",
" {'snippet': '… can introspect the large prompt model. We derive the view ϕ0(X) and the model h0 from T01. However, instead of fully fine-tuning T0 during co-training, we focus on soft prompt tuning, …',\n",
" 'title': 'Co-training improves prompt-based learning for large language models',\n",
" 'link': 'https://proceedings.mlr.press/v162/lang22a.html'},\n",
" {'snippet': '… With the success of large language models (LLMs) of code and their use as … prompt design process become important. In this work, we propose a framework called Repo-Level Prompt …',\n",
" 'title': 'Repository-level prompt generation for large language models of code',\n",
" 'link': 'https://arxiv.org/abs/2206.12839'},\n",
" {'snippet': '… Figure 2 | The benefits of different components of a prompt for the largest language model (Gopher), as estimated from hierarchical logistic regression. Each point estimates the unique …',\n",
" 'title': 'Can language models learn from explanations in context?',\n",
" 'link': 'https://arxiv.org/abs/2204.02329'}]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"search.results(\"Large Language Model prompt\", num_results=5, categories='science', time_range='year')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -15,6 +15,8 @@ The utilities listed here are all generic utilities.
`SerpAPI <./examples/serpapi.html>`_: How to use the SerpAPI wrapper to search the web.
`SearxNG Search API <./examples/searx_search.html>`_: How to use the SearxNG meta search wrapper to search the web.
`Bing Search <./examples/bing_search.html>`_: How to use the Bing search wrapper to search the web.
`Wolfram Alpha <./examples/wolfram_alpha.html>`_: How to use the Wolfram Alpha wrapper to interact with Wolfram Alpha.

@ -36,3 +36,8 @@ This uses the official Google Search API to look up information on the web.
## SerpAPI
This uses SerpAPI, a third party search API engine, to interact with Google Search.
## Searx Search
This uses the Searx (SearxNG fork) meta search engine API to look up information
on the web. It supports 139 search engines and is easy to self-host,
which makes it a good choice for privacy-conscious users.

@ -0,0 +1,6 @@
SearxNG Search
=============================
.. automodule:: langchain.searx_search
:members:
:undoc-members:

@ -13,6 +13,7 @@ These can largely be grouped into two categories: generic utilities, and then ut
modules/python
modules/serpapi
modules/searx_search
.. toctree::

@ -14,6 +14,7 @@ from langchain.serpapi import SerpAPIWrapper
from langchain.utilities.bash import BashProcess
from langchain.utilities.google_search import GoogleSearchAPIWrapper
from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper
from langchain.searx_search import SearxSearchWrapper
def _get_python_repl() -> Tool:
@ -139,15 +140,23 @@ def _get_serpapi(**kwargs: Any) -> Tool:
coroutine=SerpAPIWrapper(**kwargs).arun,
)
def _get_searx_search(**kwargs: Any) -> Tool:
return Tool(
"Search",
SearxSearchWrapper(**kwargs).run,
"A meta search engine. Useful for when you need to answer questions about current events. Input should be a search query."
)
_EXTRA_LLM_TOOLS = {
"news-api": (_get_news_api, ["news_api_key"]),
"tmdb-api": (_get_tmdb_api, ["tmdb_bearer_token"]),
}
_EXTRA_OPTIONAL_TOOLS = {
"wolfram-alpha": (_get_wolfram_alpha, ["wolfram_alpha_appid"]),
"google-search": (_get_google_search, ["google_api_key", "google_cse_id"]),
"serpapi": (_get_serpapi, ["serpapi_api_key", "aiosession"]),
"searx-search": (_get_searx_search, ["searx_host"]),
}
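For context, a rough, simplified sketch of what `load_tools(["searx-search"], ...)` effectively does with this registry entry: the `searx_host` kwarg is forwarded to the factory, which passes it on to the wrapper. The helper name and host URL below are illustrative only, not part of the library:
```python
from langchain.agents import Tool
from langchain.searx_search import SearxSearchWrapper

def build_searx_tool(searx_host: str) -> Tool:
    # Hypothetical helper mirroring _get_searx_search: wrap the search call in a Tool.
    return Tool(
        "Search",
        SearxSearchWrapper(searx_host=searx_host).run,
        "A meta search engine. Useful for when you need to answer questions "
        "about current events. Input should be a search query.",
    )

tool = build_searx_tool("https://searx.example.com")
print(tool.func("latest python release"))
```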

@ -1,18 +1,28 @@
"""Chain that calls SearxAPI.
"""Chain that calls Searx meta search API.
This is developed based on the SearxNG fork https://github.com/searxng/searxng
For Searx API refer to https://docs.searxng.org/index.html
SearxNG is a privacy-friendly free metasearch engine that aggregates results from multiple search engines
and databases.
For Searx search API refer to https://docs.searxng.org/dev/search_api.html
This is based on the SearxNG fork https://github.com/searxng/searxng which is
better maintained than the original Searx project and offers more features.
For a list of public SearxNG instances see https://searx.space/
NOTE: SearxNG instances often have a rate limit, so you might want to use a
self-hosted instance and disable the rate limiter, or use this PR: https://github.com/searxng/searxng/pull/2129 which adds whitelisting to the rate limiter.
"""
import requests
from pydantic import BaseModel, PrivateAttr, Extra, Field, validator, root_validator
from typing import Optional, List, Dict, Any
import json
from langchain.utils import get_from_dict_or_env
def _get_default_params() -> dict:
return {
# "engines": "google",
"lang": "en",
"format": "json"
}
@ -36,23 +46,50 @@ class SearxResults(dict):
# to silence mypy errors
@property
def results(self) -> Any:
return self.results
return self.get("results")
@property
def answers(self) -> Any:
return self.results
return self.get("answers")
class SearxSearchWrapper(BaseModel):
"""Wrapper for Searx API.
To use you need to provide the searx host by passing the named parameter
``searx_host`` or exporting the environment variable ``SEARX_HOST``.
In some situations you might want to disable SSL verification, for example
if you are running searx locally. You can do this by passing the named parameter
``unsecure``.
You can also pass the host url scheme as ``http`` to disable SSL.
Example:
.. code-block:: python
from langchain.searx_search import SearxSearchWrapper
searx = SearxSearchWrapper(searx_host="https://searx.example.com")
Example with SSL disabled:
.. code-block:: python
from langchain.searx_search import SearxSearchWrapper
# note the unsecure parameter is not needed if you pass the url scheme as http
searx = SearxSearchWrapper(searx_host="http://searx.example.com", unsecure=True)
"""
_result: SearxResults = PrivateAttr()
host: str = ""
searx_host: str = ""
unsecure: bool = False
params: dict = Field(default_factory=_get_default_params)
headers: Optional[dict] = None
k: int = 10
@validator("unsecure", pre=True)
@validator("unsecure")
def disable_ssl_warnings(cls, v: bool) -> bool:
if v:
# requests.urllib3.disable_warnings()
@ -71,16 +108,16 @@ class SearxSearchWrapper(BaseModel):
default = _get_default_params()
values["params"] = {**default, **user_params}
return values
searx_host = get_from_dict_or_env(values, "searx_host", "SEARX_HOST")
if not searx_host.startswith("http"):
print(f"Warning: `searx_host` is missing the url scheme, assuming secure https://{searx_host} ")
searx_host = "https://" + searx_host
elif searx_host.startswith("http://"):
values["unsecure"] = True
cls.disable_ssl_warnings(True)
values["searx_host"] = searx_host
@validator("host", pre=True, always=True)
def valid_host_url(cls, host: str) -> str:
if len(host) == 0:
raise ValueError("url can not be empty")
if not host.startswith("http"):
host = "http://" + host
return host
return values
class Config:
"""Configuration for this pydantic object."""
@ -88,19 +125,36 @@ class SearxSearchWrapper(BaseModel):
def _searx_api_query(self, params: dict) -> SearxResults:
"""Actual request to the searx API."""
raw_result = requests.get(self.host, headers=self.headers
, params=params,
verify=not self.unsecure).text
self._result = SearxResults(raw_result)
return self._result
raw_result = requests.get(self.searx_host, headers=self.headers,
params=params,
verify=not self.unsecure).text
res = SearxResults(raw_result)
self._result = res
return res
def run(self, query: str, **kwargs: Any) -> str:
"""Run query through Searx API and parse results.
You can pass any other params to the searx query API.
Args:
query: The query to search for.
**kwargs: any parameters to pass to the searx API.
Example:
This will make a query to the qwant engine:
.. code-block:: python
def run(self, query: str) -> str:
"""Run query through Searx API and parse results"""
_params = {
from langchain.searx_search import SearxSearchWrapper
searx = SearxSearchWrapper(searx_host="http://my.searx.host")
searx.run("what is the weather in France ?", engines="qwant")
"""
_params = {
"q": query,
}
params = {**self.params, **_params}
}
params = {**self.params, **_params, **kwargs}
res = self._searx_api_query(params)
if len(res.answers) > 0:
@ -108,13 +162,13 @@ class SearxSearchWrapper(BaseModel):
# only return the content of the results list
elif len(res.results) > 0:
toret = "\n\n".join([r['content'] for r in res.results[:self.k]])
toret = "\n\n".join([r.get('content', 'no result found') for r in res.results[:self.k]])
else:
toret = "No good search result found"
return toret
def results(self, query: str, num_results: int) -> List[Dict]:
def results(self, query: str, num_results: int, **kwargs: Any) -> List[Dict]:
"""Run query through Searx API and returns the results with metadata.
Args:
@ -131,7 +185,7 @@ class SearxSearchWrapper(BaseModel):
_params = {
"q": query,
}
params = {**self.params, **_params}
params = {**self.params, **_params, **kwargs}
results = self._searx_api_query(params).results[:num_results]
if len(results) == 0:
return [{"Result": "No good Search Result was found"}]
@ -144,8 +198,3 @@ class SearxSearchWrapper(BaseModel):
metadata_results.append(metadata_result)
return metadata_results
# if __name__ == "__main__":
# search = SearxSearchWrapper(host='search.c.gopher', unsecure=True)
# print(search.run("who is the current president of Bengladesh ?"))

@ -2,6 +2,7 @@
from langchain.python import PythonREPL
from langchain.requests import RequestsWrapper
from langchain.serpapi import SerpAPIWrapper
from langchain.searx_search import SearxSearchWrapper
from langchain.utilities.bash import BashProcess
from langchain.utilities.bing_search import BingSearchAPIWrapper
from langchain.utilities.google_search import GoogleSearchAPIWrapper
@ -14,5 +15,6 @@ __all__ = [
"GoogleSearchAPIWrapper",
"WolframAlphaAPIWrapper",
"SerpAPIWrapper",
"SearxSearchWrapper",
"BingSearchAPIWrapper",
]
