community[minor]: jina search tools integrating (jina reader) (#23339)

- **PR title**: "community: add Jina Search tool"
- **Description:** Added the Jina Search tool for querying the Jina search API. This includes the implementation of the JinaSearchAPIWrapper and the JinaSearch tool, along with a Jupyter notebook example demonstrating its usage.
- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter handle:** [Twitter handle](https://x.com/yashp3020?t=7wM0gQ7XjGciFoh9xaBtqA&s=09)

- [x] **Add tests and docs**: If you're adding a new integration, please include 1. an example notebook showing its use. It lives in `docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-11-10 01:10:59 +00:00 · 2024-09-03 03:22:14 +05:30 · 2024-09-03 03:22:14 +05:30 · 51dae57357
commit 51dae57357
parent 66828f4ecc
7 changed files with 407 additions and 0 deletions
--- a/docs/docs/integrations/tools/jina_search.ipynb
+++ b/docs/docs/integrations/tools/jina_search.ipynb
@@ -0,0 +1,284 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "id": "10238e62-3465-4973-9279-606cbb7ccf16",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "sidebar_label: Jina Search\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a6f91f20",
+   "metadata": {},
+   "source": [
+    "# Jina Search\n",
+    "\n",
+    "This notebook provides a quick overview for getting started with the Jina Search [tool](/docs/integrations/tools/). For detailed documentation of all Jina features and configurations, head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/tools/langchain_community.tools.jina_search.tool.JinaSearch.html).\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "### Integration details\n",
+    "\n",
+    "| Class | Package | Serializable | JS support |  Package latest |\n",
+    "| :--- | :--- | :---: | :---: | :---: |\n",
+    "| [JinaSearch](https://python.langchain.com/v0.2/api_reference/community/tools/langchain_community.tools.jina_search.tool.JinaSearch.html) | [langchain-community](https://python.langchain.com/v0.2/api_reference/community/) | ❌ | ❌ |  ![PyPI - Version](https://img.shields.io/pypi/v/langchain-community?style=flat-square&label=%20) |\n",
+    "\n",
+    "### Tool features\n",
+    "| [Returns artifact](/docs/how_to/tool_artifacts/) | Native async | Return data | Pricing |\n",
+    "| :---: | :---: | :---: | :---: |\n",
+    "| ❌ | ❌ | URL, Snippet, Title, Page Content | 1M response tokens free | \n",
+    "\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "The integration lives in the `langchain-community` package and was added in version `0.2.16`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f85b4089",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet -U \"langchain-community>=0.2.16\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b15e9266",
+   "metadata": {},
+   "source": [
+    "### Credentials"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "e0b178a2-8816-40ca-b57c-ccdd86dde9c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import getpass\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc5ab717-fd27-4c59-b912-bdd099541478",
+   "metadata": {},
+   "source": [
+    "It's also helpful (but not needed) to set up [LangSmith](https://smith.langchain.com/) for best-in-class observability:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "a6c2f136-6367-4f1f-825d-ae741e1bf281",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
+    "# os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c97218f-f366-479d-8bf7-fe9f2f6df73f",
+   "metadata": {},
+   "source": [
+    "## Instantiation\n",
+    "\n",
+    "Here we instantiate the Jina tool with its default settings:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "8b3ddfe9-ca79-494c-a7ab-1f56d9407a64",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.tools import JinaSearch\n",
+    "\n",
+    "tool = JinaSearch()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74147a1a",
+   "metadata": {},
+   "source": [
+    "## Invocation\n",
+    "\n",
+    "### [Invoke directly with args](/docs/concepts/#invoke-with-just-the-arguments)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "65310a8b-eb0c-4d9e-a618-4f4abe2414fc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{\"title\": \"LangGraph\", \"link\": \"https://www.langchain.com/langgraph\", \"snippet\": \"<strong>LangGraph</strong> helps teams of all sizes, across all industries, from ambitious startups to established enterprises. \\u201cLangChain is streets ahead with what they&#x27;ve put forward with <strong>LangGraph</strong>.\", \"content\": \"![Image 1](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667b080e4b3ca12dc5d5d439_Langgraph%20UI-2.webp)\\n\\nControllable cognitive architecture for any task\\n------------------------------------------------\\n\\nLangGraph's flexible API supports diverse control flows \\u2013 single agent, multi-agent, hierarchical, sequential \\u2013 and robustly handles realistic, complex scenarios.\\n\\nEnsure reliability with easy-to-add moderation and quality loops that prevent agents from veering off course.\\n\\n[See the docs](https://langchain-ai.github.io/langgraph/)\\n\\nDesigned for human-agent collaboration\\n--------------------------------------\\n\\nWith built-in stat\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(tool.invoke({\"query\": \"what is langgraph\"})[:1000])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d6e73897",
+   "metadata": {},
+   "source": [
+    "### [Invoke with ToolCall](/docs/concepts/#invoke-with-toolcall)\n",
+    "\n",
+    "We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "f90e33a7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{\"title\": \"LangGraph Tutorial: What Is LangGraph and How to Use It?\", \"link\": \"https://www.datacamp.com/tutorial/langgraph-tutorial\", \"snippet\": \"<strong>LangGraph</strong> <strong>is</strong> a library within the LangChain ecosystem that provides a framework for defining, coordinating, and executing multiple LLM agents (or chains) in a structured and efficient manner.\", \"content\": \"Imagine you're building a complex, multi-agent large language model (LLM) application. It's exciting, but it comes with challenges: managing the state of various agents, coordinating their interactions, and handling errors effectively. This is where LangGraph can help.\\n\\nLangGraph is a library within the LangChain ecosystem designed to tackle these challenges head-on. LangGraph provides a framework for defining, coordinating, and executing multiple LLM agents (or chains) in a structured manner.\\n\\nIt simplifies the development process by enabling the creation of cyclical graphs, which are essential for de\n"
+     ]
+    }
+   ],
+   "source": [
+    "# This is usually generated by a model, but we'll create a tool call directly for demo purposes.\n",
+    "model_generated_tool_call = {\n",
+    "    \"args\": {\"query\": \"what is langgraph\"},\n",
+    "    \"id\": \"1\",\n",
+    "    \"name\": tool.name,\n",
+    "    \"type\": \"tool_call\",\n",
+    "}\n",
+    "tool_msg = tool.invoke(model_generated_tool_call)\n",
+    "print(tool_msg.content[:1000])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "659f9fbd-6fcf-445f-aa8c-72d8e60154bd",
+   "metadata": {},
+   "source": [
+    "## Chaining\n",
+    "\n",
+    "We can use our tool in a chain by first binding it to a [tool-calling model](/docs/how_to/tool_calling/) and then calling it:\n",
+    "\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs customVarName=\"llm\" />\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "af3123ad-7a02-40e5-b58e-7d56e23e5830",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "# !pip install -qU langchain langchain-openai\n",
+    "from langchain.chat_models import init_chat_model\n",
+    "\n",
+    "llm = init_chat_model(model=\"gpt-4o\", model_provider=\"openai\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "fdbf35b5-3aaf-4947-9ec6-48c21533fb95",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AIMessage(content=\"LangGraph is a library designed for building stateful, multi-actor applications with language models (LLMs). It is particularly useful for creating agent and multi-agent workflows. Compared to other LLM frameworks, LangGraph offers unique benefits such as cycles, controllability, and persistence. Here are some key points:\\n\\n1. **Stateful and Multi-Actor Applications**: LangGraph allows for the definition of flows involving cycles, essential for most agentic architectures. This is a significant differentiation from Directed Acyclic Graph (DAG)-based solutions.\\n\\n2. **Controllability**: The framework offers fine-grained control over both the flow and state of applications, which is crucial for creating reliable agents.\\n\\n3. **Persistence**: Built-in persistence is available, enabling advanced features like human-in-the-loop workflows and memory.\\n\\n4. **Human-in-the-Loop**: LangGraph supports interrupting graph execution for human approval or editing of the agent's next planned action.\\n\\n5. **Streaming Support**: The library can stream outputs as they are produced by each node, including token streaming.\\n\\n6. **Integration with LangChain**: While it integrates seamlessly with LangChain and LangSmith, LangGraph can also be used independently.\\n\\n7. **Inspiration and Interface**: LangGraph is inspired by systems like Pregel and Apache Beam, with its public interface drawing inspiration from NetworkX.\\n\\nLangGraph is designed to handle more complex agent applications that require cycles and state management, making it an ideal choice for developers seeking to build sophisticated LLM-driven applications. For more detailed information, you can visit their [official documentation](https://langchain-ai.github.io/langgraph/).\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 338, 'prompt_tokens': 14774, 'total_tokens': 15112}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'stop', 'logprobs': None}, id='run-420d16ed-535c-41c6-8814-2186b42be0f8-0', usage_metadata={'input_tokens': 14774, 'output_tokens': 338, 'total_tokens': 15112})"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_core.runnables import RunnableConfig, chain\n",
+    "\n",
+    "prompt = ChatPromptTemplate(\n",
+    "    [\n",
+    "        (\"system\", \"You are a helpful assistant.\"),\n",
+    "        (\"human\", \"{user_input}\"),\n",
+    "        (\"placeholder\", \"{messages}\"),\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "\n",
+    "llm_with_tools = llm.bind_tools([tool])\n",
+    "llm_chain = prompt | llm_with_tools\n",
+    "\n",
+    "\n",
+    "@chain\n",
+    "def tool_chain(user_input: str, config: RunnableConfig):\n",
+    "    input_ = {\"user_input\": user_input}\n",
+    "    ai_msg = llm_chain.invoke(input_, config=config)\n",
+    "    tool_msgs = tool.batch(ai_msg.tool_calls, config=config)\n",
+    "    return llm_chain.invoke({**input_, \"messages\": [ai_msg, *tool_msgs]}, config=config)\n",
+    "\n",
+    "\n",
+    "tool_chain.invoke(\"what's langgraph\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ac8146c",
+   "metadata": {},
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all Jina features and configurations, head to the API reference: https://python.langchain.com/v0.2/api_reference/community/tools/langchain_community.tools.jina_search.tool.JinaSearch.html"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "poetry-venv-311",
+   "language": "python",
+   "name": "poetry-venv-311"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
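The notebook output above shows the tool returning a JSON string whose items carry `title`, `link`, `snippet`, and `content` keys. A minimal sketch of consuming that string follows. The sample payload is made up for illustration, and `summarize_results` is a hypothetical helper, not part of this commit.

```python
import json

# Stand-in for the string returned by tool.invoke(...); the values are invented.
raw = json.dumps(
    [
        {"title": "Example A", "link": "https://example.com/a",
         "snippet": "first", "content": "Body A"},
        # The wrapper fills fields with item.get(...), so a value may be None.
        {"title": "Example B", "link": "https://example.com/b",
         "snippet": "second", "content": None},
    ]
)


def summarize_results(raw_json: str, max_chars: int = 80) -> list[str]:
    """Return one 'title <link>: content-prefix' line per search result."""
    lines = []
    for item in json.loads(raw_json):
        content = item.get("content") or ""  # tolerate missing or null content
        lines.append(f"{item.get('title')} <{item.get('link')}>: {content[:max_chars]}")
    return lines


summaries = summarize_results(raw)
for line in summaries:
    print(line)
```

Parsing the string once and truncating per item avoids slicing the raw JSON mid-record, which is what the notebook's `[:1000]` does for display purposes.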
--- a/docs/scripts/tool_feat_table.py
+++ b/docs/scripts/tool_feat_table.py
@@ -62,6 +62,11 @@ SEARCH_TOOL_FEAT_TABLE = {
        "available_data": "Answer",
        "link": "/docs/integrations/tools/serpapi",
    },
+    "Jina Search": {
+        "pricing": "1M Response Tokens Free",
+        "available_data": "URL, Snippet, Title, Page Content",
+        "link": "/docs/integrations/tools/jina_search/",
+    },
 }

 CODE_INTERPRETER_TOOL_FEAT_TABLE = {
--- a/libs/community/langchain_community/tools/__init__.py
+++ b/libs/community/langchain_community/tools/__init__.py
@@ -166,6 +166,7 @@ if TYPE_CHECKING:
    from langchain_community.tools.interaction.tool import (
        StdInInquireTool,
    )
+    from langchain_community.tools.jina_search.tool import JinaSearch
    from langchain_community.tools.jira.tool import (
        JiraAction,
    )
@@ -419,6 +420,7 @@ __all__ = [
    "InfoSQLDatabaseTool",
    "InfoSparkSQLTool",
    "JiraAction",
+    "JinaSearch",
    "JsonGetValueTool",
    "JsonListKeysTool",
    "ListDirectoryTool",
@@ -570,6 +572,7 @@ _module_lookup = {
    "InfoSQLDatabaseTool": "langchain_community.tools.sql_database.tool",
    "InfoSparkSQLTool": "langchain_community.tools.spark_sql.tool",
    "JiraAction": "langchain_community.tools.jira.tool",
+    "JinaSearch": "langchain_community.tools.jina_search.tool",
    "JsonGetValueTool": "langchain_community.tools.json.tool",
    "JsonListKeysTool": "langchain_community.tools.json.tool",
    "ListDirectoryTool": "langchain_community.tools.file_management",
--- a/libs/community/langchain_community/tools/jina_search/__init__.py
+++ b/libs/community/langchain_community/tools/jina_search/__init__.py
@@ -0,0 +1,5 @@
+"""Jina AI toolkit"""
+
+from langchain_community.tools.jina_search.tool import JinaSearch
+
+__all__ = ["JinaSearch"]
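The `_module_lookup` entry added to the parent `tools` package maps the public name `JinaSearch` to its defining module, which suggests the name is resolved on first access instead of being imported eagerly. The hook that consumes the table is not part of this diff, so the sketch below is an assumption about the general pattern (a module-level `__getattr__`, as described in PEP 562). It uses stdlib modules as stand-ins, and `lazy_getattr` plus the table contents are hypothetical.

```python
import importlib
from typing import Any

# Hypothetical lookup table: public name -> module that defines it.
# Stdlib modules stand in for the real tool modules.
_module_lookup = {
    "OrderedDict": "collections",
    "dumps": "json",
}


def lazy_getattr(name: str) -> Any:
    """Import the defining module only when the name is first requested."""
    if name in _module_lookup:
        module = importlib.import_module(_module_lookup[name])
        return getattr(module, name)
    raise AttributeError(f"module has no attribute {name!r}")


# In a real package this function would be exposed as the module-level
# __getattr__ so that `from package import Name` triggers it.
dumps = lazy_getattr("dumps")
print(dumps([1, 2]))
```

Under this pattern, the `TYPE_CHECKING` import serves static analysis while the lookup table serves runtime access, which would explain why the commit touches both places.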
--- a/libs/community/langchain_community/tools/jina_search/tool.py
+++ b/libs/community/langchain_community/tools/jina_search/tool.py
@@ -0,0 +1,41 @@
+from __future__ import annotations
+
+from typing import Optional
+
+from langchain_core.callbacks import CallbackManagerForToolRun
+from langchain_core.pydantic_v1 import BaseModel, Field
+from langchain_core.tools import BaseTool
+
+from langchain_community.utilities.jina_search import JinaSearchAPIWrapper
+
+
+class JinaInput(BaseModel):
+    """Input for the Jina search tool."""
+
+    query: str = Field(description="search query to look up")
+
+
+class JinaSearch(BaseTool):
+    """Tool that queries the Jina Search API.
+
+    .. versionadded:: 0.2.16
+    """
+
+    name: str = "jina_search"
+    description: str = (
+        "Jina Reader allows you to ground your LLM with the latest information from "
+        "the web. "
+        "Jina Reader will search the web and return the top five results with their "
+        "URLs and contents, "
+        "each in clean, LLM-friendly text. This way, you can always keep your LLM "
+        "up-to-date, improve its factuality, and reduce hallucinations."
+    )
+    search_wrapper: JinaSearchAPIWrapper = Field(default_factory=JinaSearchAPIWrapper)
+
+    def _run(
+        self,
+        query: str,
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool."""
+        return self.search_wrapper.run(query)
--- a/libs/community/langchain_community/utilities/jina_search.py
+++ b/libs/community/langchain_community/utilities/jina_search.py
@@ -0,0 +1,68 @@
+import json
+from typing import List
+
+import requests
+from langchain_core.documents import Document
+from langchain_core.pydantic_v1 import BaseModel
+from yarl import URL
+
+
+class JinaSearchAPIWrapper(BaseModel):
+    """Wrapper around the Jina search engine."""
+
+    base_url: str = "https://s.jina.ai/"
+    """The base URL for the Jina search engine."""
+
+    def run(self, query: str) -> str:
+        """Query the Jina search engine and return the results as a JSON string.
+
+        Args:
+            query: The query to search for.
+
+        Returns: The results as a JSON string.
+
+        """
+        web_search_results = self._search_request(query=query)
+        final_results = [
+            {
+                "title": item.get("title"),
+                "link": item.get("url"),
+                "snippet": item.get("description"),
+                "content": item.get("content"),
+            }
+            for item in web_search_results
+        ]
+        return json.dumps(final_results)
+
+    def download_documents(self, query: str) -> List[Document]:
+        """Query the Jina search engine and return the results as a list of Documents.
+
+        Args:
+            query: The query to search for.
+
+        Returns: The results as a list of Documents.
+
+        """
+        results = self._search_request(query)
+        return [
+            Document(
+                page_content=item.get("content"),  # type: ignore[arg-type]
+                metadata={
+                    "title": item.get("title"),
+                    "link": item.get("url"),
+                    "description": item.get("description"),
+                },
+            )
+            for item in results
+        ]
+
+    def _search_request(self, query: str) -> List[dict]:
+        headers = {
+            "Accept": "application/json",
+        }
+        url = str(URL(self.base_url + query))
+        response = requests.get(url, headers=headers)
+        if not response.ok:
+            raise Exception(f"HTTP error {response.status_code}")
+
+        return response.json().get("data", [])
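`_search_request` builds the request URL by concatenating `base_url` and the raw query. Characters such as `?` and `#` are URL delimiters, so a query containing them may not reach the server as a single path segment. One possible hardening, sketched here with the standard library only, is to percent-encode the query first. `build_search_url` is a hypothetical helper, not part of the commit, and whether the Jina endpoint treats encoded and raw queries identically is an assumption to verify.

```python
from urllib.parse import quote


def build_search_url(base_url: str, query: str) -> str:
    """Join base_url and query, percent-encoding the query as one path segment."""
    # safe="" also encodes "/", so the whole query stays in a single segment.
    return base_url.rstrip("/") + "/" + quote(query, safe="")


url = build_search_url("https://s.jina.ai/", "what is langgraph?")
print(url)
```

A request made with this URL could also pass a `timeout` argument to `requests.get`, since the call in the diff has none and would otherwise wait indefinitely on a stalled connection.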
--- a/libs/community/tests/unit_tests/tools/test_imports.py
+++ b/libs/community/tests/unit_tests/tools/test_imports.py
@@ -74,6 +74,7 @@ EXPECTED_ALL = [
    "InfoPowerBITool",
    "InfoSQLDatabaseTool",
    "InfoSparkSQLTool",
+    "JinaSearch",
    "JiraAction",
    "JsonGetValueTool",
    "JsonListKeysTool",