Add HugeGraphQAChain to support gremlin generating chain (#7132)

[Apache HugeGraph](https://github.com/apache/incubator-hugegraph) is a convenient, efficient, and adaptable graph database, compatible with the Apache TinkerPop3 framework and the Gremlin query language. In this PR, `HugeGraph` and `HugeGraphQAChain` provide the same functionality as the existing Neo4j integration and enable query generation and question answering over a HugeGraph database. The difference is that the graph query language supported by HugeGraph is not Cypher but [Gremlin](https://tinkerpop.apache.org/gremlin.html), another very popular graph query language. A notebook example and a simple test case have also been added.

Co-authored-by: Bagatur <baskaryan@gmail.com>
1 year ago · 81eebc4070
parent 5585607654
commit 81eebc4070
9 changed files with 531 additions and 1 deletions
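
The flow described above (generate a Gremlin statement from the question, run it against the graph, then phrase an answer from the returned rows) can be sketched without any external services. This is a minimal illustration, not the chain added in this PR: `answer_question` and the three `fake_*` helpers are hypothetical stand-ins for the LLM calls and the database.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def answer_question(
    question: str,
    schema: str,
    generate_gremlin: Callable[[str, str], str],
    run_query: Callable[[str], Rows],
    summarize: Callable[[str, Rows], str],
) -> str:
    """Generate a Gremlin query, execute it, and phrase an answer from the rows."""
    gremlin = generate_gremlin(schema, question)  # step 1: question -> Gremlin
    context = run_query(gremlin)                  # step 2: Gremlin -> rows
    return summarize(question, context)           # step 3: rows -> answer


# Hypothetical stubs; a real setup would call an LLM and a HugeGraph server.
def fake_generate(schema: str, question: str) -> str:
    return "g.V().has('Movie', 'name', 'The Godfather').in('ActedIn').valueMap(true)"


def fake_run(gremlin: str) -> Rows:
    return [{"id": "1:Al Pacino", "label": "Person", "name": ["Al Pacino"]}]


def fake_summarize(question: str, context: Rows) -> str:
    names = ", ".join(row["name"][0] for row in context)
    return f"{names} played in The Godfather."


answer = answer_question(
    "Who played in The Godfather?",
    "Relationships: ['Person--ActedIn-->Movie']",
    fake_generate,
    fake_run,
    fake_summarize,
)
print(answer)
```

Passing the three steps in as callables keeps the sketch testable in isolation; the chain in the diff wires the same steps to two `LLMChain` instances and `HugeGraph.query`.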
--- a/docs/extras/modules/chains/additional/graph_hugegraph_qa.ipynb
+++ b/docs/extras/modules/chains/additional/graph_hugegraph_qa.ipynb
@@ -0,0 +1,302 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d2777010",
+   "metadata": {},
+   "source": [
+    "# HugeGraph QA Chain\n",
+    "\n",
+    "This notebook shows how to use LLMs to provide a natural language interface to a [HugeGraph](https://hugegraph.apache.org/cn/) database."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f26dcbe4",
+   "metadata": {},
+   "source": [
+    "You will need to have a running HugeGraph instance.\n",
+    "You can run a local Docker container by executing the following script:\n",
+    "\n",
+    "```\n",
+    "docker run \\\n",
+    "    --name=graph \\\n",
+    "    -itd \\\n",
+    "    -p 8080:8080 \\\n",
+    "    hugegraph/hugegraph\n",
+    "```\n",
+    "\n",
+    "To connect to HugeGraph from the application, we need to install the Python SDK:\n",
+    "\n",
+    "```\n",
+    "pip3 install hugegraph-python\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d64a29f1",
+   "metadata": {},
+   "source": [
+    "If you are using the Docker container, you need to wait a couple of seconds for the database to start. Then we need to create the schema and write graph data to the database."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "e53ab93e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from hugegraph.connection import PyHugeGraph\n",
+    "\n",
+    "client = PyHugeGraph(\"localhost\", \"8080\", user=\"admin\", pwd=\"admin\", graph=\"hugegraph\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7c3a50e",
+   "metadata": {},
+   "source": [
+    "First, we create the schema for a simple movie database:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ef5372a8",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'create EdgeLabel success, Detail: \"b\\'{\"id\":1,\"name\":\"ActedIn\",\"source_label\":\"Person\",\"target_label\":\"Movie\",\"frequency\":\"SINGLE\",\"sort_keys\":[],\"nullable_keys\":[],\"index_labels\":[],\"properties\":[],\"status\":\"CREATED\",\"ttl\":0,\"enable_label_index\":true,\"user_data\":{\"~create_time\":\"2023-07-04 10:48:47.908\"}}\\'\"'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "\"\"\"schema\"\"\"\n",
+    "schema = client.schema()\n",
+    "schema.propertyKey(\"name\").asText().ifNotExist().create()\n",
+    "schema.propertyKey(\"birthDate\").asText().ifNotExist().create()\n",
+    "schema.vertexLabel(\"Person\").properties(\"name\", \"birthDate\").usePrimaryKeyId().primaryKeys(\"name\").ifNotExist().create()\n",
+    "schema.vertexLabel(\"Movie\").properties(\"name\").usePrimaryKeyId().primaryKeys(\"name\").ifNotExist().create()\n",
+    "schema.edgeLabel(\"ActedIn\").sourceLabel(\"Person\").targetLabel(\"Movie\").ifNotExist().create()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "016f7989",
+   "metadata": {},
+   "source": [
+    "Then we can insert some data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "b7f4c370",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1:Robert De Niro--ActedIn-->2:The Godfather Part II"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "\"\"\"graph\"\"\"\n",
+    "g = client.graph()\n",
+    "g.addVertex(\"Person\", {\"name\": \"Al Pacino\", \"birthDate\": \"1940-04-25\"})\n",
+    "g.addVertex(\"Person\", {\"name\": \"Robert De Niro\", \"birthDate\": \"1943-08-17\"})\n",
+    "g.addVertex(\"Movie\", {\"name\": \"The Godfather\"})\n",
+    "g.addVertex(\"Movie\", {\"name\": \"The Godfather Part II\"})\n",
+    "g.addVertex(\"Movie\", {\"name\": \"The Godfather Coda The Death of Michael Corleone\"})\n",
+    "\n",
+    "g.addEdge(\"ActedIn\", \"1:Al Pacino\", \"2:The Godfather\", {})\n",
+    "g.addEdge(\"ActedIn\", \"1:Al Pacino\", \"2:The Godfather Part II\", {})\n",
+    "g.addEdge(\"ActedIn\", \"1:Al Pacino\", \"2:The Godfather Coda The Death of Michael Corleone\", {})\n",
+    "g.addEdge(\"ActedIn\", \"1:Robert De Niro\", \"2:The Godfather Part II\", {})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b8f7788",
+   "metadata": {},
+   "source": [
+    "## Creating `HugeGraphQAChain`\n",
+    "\n",
+    "We can now create the `HugeGraph` and `HugeGraphQAChain`. To create the `HugeGraph` we simply need to pass the connection parameters to the `HugeGraph` constructor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "f1f68fcf",
+   "metadata": {
+    "is_executing": true
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.chains import HugeGraphQAChain\n",
+    "from langchain.graphs import HugeGraph"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "b86ebfa7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "graph = HugeGraph(\n",
+    "    username=\"admin\",\n",
+    "    password=\"admin\",\n",
+    "    address=\"localhost\",\n",
+    "    port=8080,\n",
+    "    graph=\"hugegraph\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e262540b",
+   "metadata": {},
+   "source": [
+    "## Refresh graph schema information\n",
+    "\n",
+    "If the schema of the database changes, you can refresh the schema information needed to generate Gremlin statements."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "134dd8d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# graph.refresh_schema()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "id": "e78b8e72",
+   "metadata": {
+    "ExecuteTime": {}
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Node properties: [name: Person, primary_keys: ['name'], properties: ['name', 'birthDate'], name: Movie, primary_keys: ['name'], properties: ['name']]\n",
+      "Edge properties: [name: ActedIn, properties: []]\n",
+      "Relationships: ['Person--ActedIn-->Movie']\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(graph.get_schema)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5c27e813",
+   "metadata": {},
+   "source": [
+    "## Querying the graph\n",
+    "\n",
+    "We can now use the Gremlin QA chain to ask questions of the graph."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "3b23dead",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = HugeGraphQAChain.from_llm(\n",
+    "    ChatOpenAI(temperature=0), graph=graph, verbose=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "76aecc93",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "\u001b[1m> Entering new  chain...\u001b[0m\n",
+      "Generated gremlin:\n",
+      "\u001b[32;1m\u001b[1;3mg.V().has('Movie', 'name', 'The Godfather').in('ActedIn').valueMap(true)\u001b[0m\n",
+      "Full Context:\n",
+      "\u001b[32;1m\u001b[1;3m[{'id': '1:Al Pacino', 'label': 'Person', 'name': ['Al Pacino'], 'birthDate': ['1940-04-25']}]\u001b[0m\n",
+      "\n",
+      "\u001b[1m> Finished chain.\u001b[0m\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Al Pacino played in The Godfather.'"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain.run(\"Who played in The Godfather?\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "869f0258",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "venv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/langchain/chains/__init__.py
+++ b/langchain/chains/__init__.py
@@ -15,6 +15,7 @@ from langchain.chains.conversational_retrieval.base import (
 from langchain.chains.flare.base import FlareChain
 from langchain.chains.graph_qa.base import GraphQAChain
 from langchain.chains.graph_qa.cypher import GraphCypherQAChain
+from langchain.chains.graph_qa.hugegraph import HugeGraphQAChain
 from langchain.chains.graph_qa.kuzu import KuzuQAChain
 from langchain.chains.graph_qa.nebulagraph import NebulaGraphQAChain
 from langchain.chains.hyde.base import HypotheticalDocumentEmbedder
@@ -69,6 +70,7 @@ __all__ = [
    "GraphQAChain",
    "HypotheticalDocumentEmbedder",
    "KuzuQAChain",
+    "HugeGraphQAChain",
    "LLMBashChain",
    "LLMChain",
    "LLMCheckerChain",
--- a/langchain/chains/graph_qa/hugegraph.py
+++ b/langchain/chains/graph_qa/hugegraph.py
@@ -0,0 +1,94 @@
+"""Question answering over a graph."""
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+from pydantic import Field
+
+from langchain.base_language import BaseLanguageModel
+from langchain.callbacks.manager import CallbackManagerForChainRun
+from langchain.chains.base import Chain
+from langchain.chains.graph_qa.prompts import (
+    CYPHER_QA_PROMPT,
+    GREMLIN_GENERATION_PROMPT,
+)
+from langchain.chains.llm import LLMChain
+from langchain.graphs.hugegraph import HugeGraph
+from langchain.schema import BasePromptTemplate
+
+
+class HugeGraphQAChain(Chain):
+    """Chain for question-answering against a graph by generating gremlin statements."""
+
+    graph: HugeGraph = Field(exclude=True)
+    gremlin_generation_chain: LLMChain
+    qa_chain: LLMChain
+    input_key: str = "query"  #: :meta private:
+    output_key: str = "result"  #: :meta private:
+
+    @property
+    def input_keys(self) -> List[str]:
+        """Return the input keys.
+
+        :meta private:
+        """
+        return [self.input_key]
+
+    @property
+    def output_keys(self) -> List[str]:
+        """Return the output keys.
+
+        :meta private:
+        """
+        _output_keys = [self.output_key]
+        return _output_keys
+
+    @classmethod
+    def from_llm(
+        cls,
+        llm: BaseLanguageModel,
+        *,
+        qa_prompt: BasePromptTemplate = CYPHER_QA_PROMPT,
+        gremlin_prompt: BasePromptTemplate = GREMLIN_GENERATION_PROMPT,
+        **kwargs: Any,
+    ) -> HugeGraphQAChain:
+        """Initialize from LLM."""
+        qa_chain = LLMChain(llm=llm, prompt=qa_prompt)
+        gremlin_generation_chain = LLMChain(llm=llm, prompt=gremlin_prompt)
+
+        return cls(
+            qa_chain=qa_chain,
+            gremlin_generation_chain=gremlin_generation_chain,
+            **kwargs,
+        )
+
+    def _call(
+        self,
+        inputs: Dict[str, Any],
+        run_manager: Optional[CallbackManagerForChainRun] = None,
+    ) -> Dict[str, str]:
+        """Generate gremlin statement, use it to look up in db and answer question."""
+        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
+        callbacks = _run_manager.get_child()
+        question = inputs[self.input_key]
+
+        generated_gremlin = self.gremlin_generation_chain.run(
+            {"question": question, "schema": self.graph.get_schema}, callbacks=callbacks
+        )
+
+        _run_manager.on_text("Generated gremlin:", end="\n", verbose=self.verbose)
+        _run_manager.on_text(
+            generated_gremlin, color="green", end="\n", verbose=self.verbose
+        )
+        context = self.graph.query(generated_gremlin)
+
+        _run_manager.on_text("Full Context:", end="\n", verbose=self.verbose)
+        _run_manager.on_text(
+            str(context), color="green", end="\n", verbose=self.verbose
+        )
+
+        result = self.qa_chain(
+            {"question": question, "context": context},
+            callbacks=callbacks,
+        )
+        return {self.output_key: result[self.qa_chain.output_key]}
--- a/langchain/chains/graph_qa/prompts.py
+++ b/langchain/chains/graph_qa/prompts.py
@@ -90,6 +90,12 @@ KUZU_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=KUZU_GENERATION_TEMPLATE
 )

+GREMLIN_GENERATION_TEMPLATE = CYPHER_GENERATION_TEMPLATE.replace("Cypher", "Gremlin")
+
+GREMLIN_GENERATION_PROMPT = PromptTemplate(
+    input_variables=["schema", "question"], template=GREMLIN_GENERATION_TEMPLATE
+)
+
 CYPHER_QA_TEMPLATE = """You are an assistant that helps to form nice and human understandable answers.
 The information part contains the provided information that you must use to construct an answer.
 The provided information is authorative, you must never doubt it or try to use your internal knowledge to correct it.
--- a/langchain/graphs/__init__.py
+++ b/langchain/graphs/__init__.py
@@ -1,7 +1,8 @@
 """Graph implementations."""
+from langchain.graphs.hugegraph import HugeGraph
 from langchain.graphs.kuzu_graph import KuzuGraph
 from langchain.graphs.nebula_graph import NebulaGraph
 from langchain.graphs.neo4j_graph import Neo4jGraph
 from langchain.graphs.networkx_graph import NetworkxEntityGraph

-__all__ = ["NetworkxEntityGraph", "Neo4jGraph", "NebulaGraph", "KuzuGraph"]
+__all__ = ["NetworkxEntityGraph", "Neo4jGraph", "NebulaGraph", "KuzuGraph", "HugeGraph"]
--- a/langchain/graphs/hugegraph.py
+++ b/langchain/graphs/hugegraph.py
@@ -0,0 +1,62 @@
+from typing import Any, Dict, List
+
+
+class HugeGraph:
+    """HugeGraph wrapper for graph operations"""
+
+    def __init__(
+        self,
+        username: str = "default",
+        password: str = "default",
+        address: str = "127.0.0.1",
+        port: int = 8081,
+        graph: str = "hugegraph",
+    ) -> None:
+        """Create a new HugeGraph wrapper instance."""
+        try:
+            from hugegraph.connection import PyHugeGraph
+        except ImportError:
+            raise ValueError(
+                "Please install HugeGraph Python client first: "
+                "`pip3 install hugegraph-python`"
+            )
+
+        self.username = username
+        self.password = password
+        self.address = address
+        self.port = port
+        self.graph = graph
+        self.client = PyHugeGraph(
+            address, port, user=username, pwd=password, graph=graph
+        )
+        self.schema = ""
+        # Set schema
+        try:
+            self.refresh_schema()
+        except Exception as e:
+            raise ValueError(f"Could not refresh schema. Error: {e}")
+
+    @property
+    def get_schema(self) -> str:
+        """Returns the schema of the HugeGraph database"""
+        return self.schema
+
+    def refresh_schema(self) -> None:
+        """
+        Refreshes the HugeGraph schema information.
+        """
+        schema = self.client.schema()
+        vertex_schema = schema.getVertexLabels()
+        edge_schema = schema.getEdgeLabels()
+        relationships = schema.getRelations()
+
+        self.schema = (
+            f"Node properties: {vertex_schema}\n"
+            f"Edge properties: {edge_schema}\n"
+            f"Relationships: {relationships}\n"
+        )
+
+    def query(self, query: str) -> List[Dict[str, Any]]:
+        g = self.client.gremlin()
+        res = g.exec(query)
+        return res["data"]
--- a/poetry.lock
+++ b/poetry.lock
@@ -3653,6 +3653,23 @@ cli = ["click (>=8.0.0,<9.0.0)", "pygments (>=2.0.0,<3.0.0)", "rich (>=10,<14)"]
 http2 = ["h2 (>=3,<5)"]
 socks = ["socksio (>=1.0.0,<2.0.0)"]

+[[package]]
+name = "hugegraph-python"
+version = "1.0.0.12"
+description = "Python client for HugeGraph"
+optional = true
+python-versions = "*"
+files = [
+    {file = "hugegraph-python-1.0.0.12.tar.gz", hash = "sha256:06b2dded70c4f4570083f8b6e3a9edfebcf5ac4f07300727afad72389917ab85"},
+    {file = "hugegraph_python-1.0.0.12-py3-none-any.whl", hash = "sha256:69fe20edbe1a392d16afc74df5c94b3b96bc02c848e9ab5b5f18c112a9bc3ebe"},
+]
+
+[package.dependencies]
+decorator = "5.1.1"
+Requests = "2.31.0"
+setuptools = "67.6.1"
+urllib3 = "2.0.3"
+
 [[package]]
 name = "huggingface-hub"
 version = "0.15.1"
--- a/tests/integration_tests/graphs/__init__.py
+++ b/tests/integration_tests/graphs/__init__.py
--- a/tests/integration_tests/graphs/test_hugegraph.py
+++ b/tests/integration_tests/graphs/test_hugegraph.py
@@ -0,0 +1,46 @@
+import unittest
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+from langchain.graphs import HugeGraph
+
+
+class TestHugeGraph(unittest.TestCase):
+    def setUp(self) -> None:
+        self.username = "test_user"
+        self.password = "test_password"
+        self.address = "test_address"
+        self.graph = "test_hugegraph"
+        self.port = 1234
+        self.session_pool_size = 10
+
+    @patch("hugegraph.connection.PyHugeGraph")
+    def test_init(self, mock_client: Any) -> None:
+        mock_client.return_value = MagicMock()
+        huge_graph = HugeGraph(
+            self.username, self.password, self.address, self.port, self.graph
+        )
+        self.assertEqual(huge_graph.username, self.username)
+        self.assertEqual(huge_graph.password, self.password)
+        self.assertEqual(huge_graph.address, self.address)
+        self.assertEqual(huge_graph.port, self.port)
+        self.assertEqual(huge_graph.graph, self.graph)
+
+    @patch("hugegraph.connection.PyHugeGraph")
+    def test_execute(self, mock_client: Any) -> None:
+        mock_client.return_value = MagicMock()
+        huge_graph = HugeGraph(
+            self.username, self.password, self.address, self.port, self.graph
+        )
+        query = "g.V().limit(10)"
+        result = huge_graph.query(query)
+        self.assertIsInstance(result, MagicMock)
+
+    @patch("hugegraph.connection.PyHugeGraph")
+    def test_refresh_schema(self, mock_client: Any) -> None:
+        mock_client.return_value = MagicMock()
+        huge_graph = HugeGraph(
+            self.username, self.password, self.address, self.port, self.graph
+        )
+        huge_graph.refresh_schema()
+        self.assertNotEqual(huge_graph.get_schema, "")
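
The tests above patch `hugegraph.connection.PyHugeGraph`, so the `hugegraph-python` package must be importable for them to run. As a rough sketch of an alternative, the schema-formatting and query logic can be exercised against a hand-built fake client. `SchemaWrapperSketch` and `make_fake_client` are hypothetical names introduced here for illustration; the wrapper mirrors `refresh_schema` and `query` from the diff but is not the `HugeGraph` class itself.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


class SchemaWrapperSketch:
    """Hypothetical stand-in mirroring refresh_schema() and query() from the diff."""

    def __init__(self, client: Any) -> None:
        self.client = client
        self.schema = ""
        self.refresh_schema()

    def refresh_schema(self) -> None:
        # Same three lookups the diff performs on client.schema().
        schema = self.client.schema()
        self.schema = (
            f"Node properties: {schema.getVertexLabels()}\n"
            f"Edge properties: {schema.getEdgeLabels()}\n"
            f"Relationships: {schema.getRelations()}\n"
        )

    def query(self, query: str) -> List[Dict[str, Any]]:
        # The diff returns the "data" field of the Gremlin response.
        return self.client.gremlin().exec(query)["data"]


def make_fake_client() -> MagicMock:
    """Build a fake client with canned schema and query responses."""
    client = MagicMock()
    schema = client.schema.return_value
    schema.getVertexLabels.return_value = ["Person", "Movie"]
    schema.getEdgeLabels.return_value = ["ActedIn"]
    schema.getRelations.return_value = ["Person--ActedIn-->Movie"]
    client.gremlin.return_value.exec.return_value = {"data": [{"id": "1:Al Pacino"}]}
    return client


wrapper = SchemaWrapperSketch(make_fake_client())
schema_text = wrapper.schema
rows = wrapper.query("g.V().limit(1)")
```

Giving the mock concrete return values lets assertions check the formatted schema string and the unwrapped `data` list, rather than only confirming that a `MagicMock` came back.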