Diffbot Graph Transformer / Neo4j Graph document ingestion (#9979)

Co-authored-by: Bagatur <baskaryan@gmail.com>
pull/10149/head
Tomaz Bratanic 1 year ago committed by GitHub
parent ccb9e3ee2d
commit db73c9d5b5

@@ -0,0 +1,307 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7f0b0c06-ee70-468c-8bf5-b023f9e5e0a2",
"metadata": {},
"source": [
"# Diffbot Graph Transformer\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/more/graph/diffbot_transformer.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"Text data often contain rich relationships and insights that can be useful for various analytics, recommendation engines, or knowledge management applications.\n",
"\n",
"Diffbot's NLP API allows for the extraction of entities, relationships, and semantic meaning from unstructured text data.\n",
"\n",
"By coupling Diffbot's NLP API with Neo4j, a graph database, you can create powerful, dynamic graph structures based on the information extracted from text. These graph structures are fully queryable and can be integrated into various applications.\n",
"\n",
"This combination allows for use cases such as:\n",
"\n",
"* Building knowledge graphs from textual documents, websites, or social media feeds.\n",
"* Generating recommendations based on semantic relationships in the data.\n",
"* Creating advanced search features that understand the relationships between entities.\n",
"* Building analytics dashboards that allow users to explore the hidden relationships in data.\n",
"\n",
"## Overview\n",
"\n",
"LangChain provides tools to interact with Graph Databases:\n",
"\n",
"1. `Construct knowledge graphs from text` using graph transformer and store integrations \n",
"2. `Query a graph database` using chains for query creation and execution\n",
"3. `Interact with a graph database` using agents for robust and flexible querying \n",
"\n",
"## Quickstart\n",
"\n",
"First, get required packages and set environment variables:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "975648da-b24f-4164-a671-6772179e12df",
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain langchain-experimental openai neo4j wikipedia"
]
},
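{
"cell_type": "markdown",
"id": "c1b6e7a2-4f16-4b6e-9a1c-1f3a2d9e5b10",
"metadata": {},
"source": [
"The question-answering chain later in this notebook uses OpenAI chat models, so an OpenAI API key needs to be available as well. A minimal sketch, assuming you supply the key interactively (any other secret-management approach works too):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2c7f8b3-5a27-4c7f-8b2d-2e4b3eaf6c21",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API key: \")"
]
},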
{
"cell_type": "markdown",
"id": "77718977-629e-46c2-b091-f9191b9ec569",
"metadata": {},
"source": [
"## Diffbot NLP Service\n",
"\n",
"Diffbot's NLP service is a tool for extracting entities, relationships, and semantic context from unstructured text data.\n",
"This extracted information can be used to construct a knowledge graph.\n",
"To use their service, you'll need to obtain an API key from [Diffbot](https://www.diffbot.com/products/natural-language/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2cbf97d0-3682-439b-8750-b695ff726789",
"metadata": {},
"outputs": [],
"source": [
"from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer\n",
"\n",
"diffbot_api_key = \"DIFFBOT_API_KEY\"\n",
"diffbot_nlp = DiffbotGraphTransformer(diffbot_api_key=diffbot_api_key)"
]
},
{
"cell_type": "markdown",
"id": "5e3b894a-e3ee-46c7-8116-f8377f8f0159",
"metadata": {},
"source": [
"This code fetches Wikipedia articles about \"Baldur's Gate 3\" and then uses `DiffbotGraphTransformer` to extract entities and relationships.\n",
"The `DiffbotGraphTransformer` outputs a structured data `GraphDocument`, which can be used to populate a graph database.\n",
"Note that text chunking is avoided due to Diffbot's [character limit per API request](https://docs.diffbot.com/reference/introduction-to-natural-language-api)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "53f8df86-47a1-44a1-9a0f-6725b90703bc",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WikipediaLoader\n",
"\n",
"query = \"Warren Buffett\"\n",
"raw_documents = WikipediaLoader(query=query).load()\n",
"graph_documents = diffbot_nlp.convert_to_graph_documents(raw_documents)"
]
},
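{
"cell_type": "markdown",
"id": "a3d8e9c4-6b38-4d80-9c3e-3f5c4fb07d32",
"metadata": {},
"source": [
"To get a feel for the extracted structure before loading it into Neo4j, you can inspect the first `GraphDocument`. A quick sketch, assuming at least one Wikipedia page was returned:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4e9fad5-7c49-4e91-8d4f-4a6d5fc18e43",
"metadata": {},
"outputs": [],
"source": [
"first = graph_documents[0]\n",
"print(first.nodes[:3])\n",
"print(first.relationships[:3])"
]
},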
{
"cell_type": "markdown",
"id": "31bb851a-aab4-4b97-a6b7-fce397d32b47",
"metadata": {},
"source": [
"## Loading the data into a knowledge graph\n",
"\n",
"You will need to have a running Neo4j instance. One option is to create a [free Neo4j database instance in their Aura cloud service](https://neo4j.com/cloud/platform/aura-graph-database/). You can also run the database locally using the [Neo4j Desktop application](https://neo4j.com/download/), or running a docker container. You can run a local docker container by running the executing the following script:\n",
"```\n",
"docker run \\\n",
" --name neo4j \\\n",
" -p 7474:7474 -p 7687:7687 \\\n",
" -d \\\n",
" -e NEO4J_AUTH=neo4j/pleaseletmein \\\n",
" -e NEO4J_PLUGINS=\\[\\\"apoc\\\"\\] \\\n",
" neo4j:latest\n",
"``` \n",
"If you are using the docker container, you need to wait a couple of second for the database to start."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0b2b6641-5a5d-467c-b148-e6aad5e4baa7",
"metadata": {},
"outputs": [],
"source": [
"from langchain.graphs import Neo4jGraph\n",
"\n",
"url=\"bolt://localhost:7687\"\n",
"username=\"neo4j\"\n",
"password=\"pleaseletmein\"\n",
"\n",
"graph = Neo4jGraph(\n",
" url=url,\n",
" username=username, \n",
" password=password\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0b15e840-fe6f-45db-9193-1b4e2df5c12c",
"metadata": {},
"source": [
"The `GraphDocuments` can be loaded into a knowledge graph using the `add_graph_documents` method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1a67c4a8-955c-42a2-9c5d-de3ac0e640ec",
"metadata": {},
"outputs": [],
"source": [
"graph.add_graph_documents(graph_documents)"
]
},
{
"cell_type": "markdown",
"id": "ed411e05-2b03-460d-997e-938482774f40",
"metadata": {},
"source": [
"## Refresh graph schema information\n",
"If the schema of database changes, you can refresh the schema information needed to generate Cypher statements"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "904c9ee3-787c-403f-857d-459ce5ad5a1b",
"metadata": {},
"outputs": [],
"source": [
"graph.refresh_schema()"
]
},
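{
"cell_type": "markdown",
"id": "c5fa0be6-8d5a-4fa2-9e50-5b7e6fd29f54",
"metadata": {},
"source": [
"To see what the Cypher-generating model will be given, you can print the stored schema description. A sketch, assuming the `schema` attribute that `refresh_schema` populates on the graph object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d60b1cf7-9e6b-40b3-8f61-6c8f70d3a065",
"metadata": {},
"outputs": [],
"source": [
"print(graph.schema)"
]
},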
{
"cell_type": "markdown",
"id": "f19d1387-5899-4258-8c94-8ef5fa7db464",
"metadata": {},
"source": [
"## Querying the graph\n",
"We can now use the graph cypher QA chain to ask question of the graph. It is advisable to use **gpt-4** to construct Cypher queries to get the best experience."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9393b732-67c8-45c1-9ec2-089f49c62448",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import GraphCypherQAChain\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"chain = GraphCypherQAChain.from_llm(\n",
" cypher_llm=ChatOpenAI(temperature=0, model_name=\"gpt-4\"),\n",
" qa_llm=ChatOpenAI(temperature=0, model_name=\"gpt-3.5-turbo\"),\n",
" graph=graph, verbose=True,\n",
" \n",
")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1a9b3652-b436-404d-aa25-5fb576f23dc0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
"Generated Cypher:\n",
"\u001b[32;1m\u001b[1;3mMATCH (p:Person {name: \"Warren Buffett\"})-[:EDUCATED_AT]->(o:Organization)\n",
"RETURN o.name\u001b[0m\n",
"Full Context:\n",
"\u001b[32;1m\u001b[1;3m[{'o.name': 'New York Institute of Finance'}, {'o.name': 'Alice Deal Junior High School'}, {'o.name': 'Woodrow Wilson High School'}, {'o.name': 'University of Nebraska'}]\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'Warren Buffett attended the University of Nebraska.'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.run(\"Which university did Warren Buffett attend?\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "adc0ba0f-a62c-4875-89ce-da717f3ab148",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
"Generated Cypher:\n",
"\u001b[32;1m\u001b[1;3mMATCH (p:Person)-[r:EMPLOYEE_OR_MEMBER_OF]->(o:Organization) WHERE o.name = 'Berkshire Hathaway' RETURN p.name\u001b[0m\n",
"Full Context:\n",
"\u001b[32;1m\u001b[1;3m[{'p.name': 'Charlie Munger'}, {'p.name': 'Oliver Chace'}, {'p.name': 'Howard Buffett'}, {'p.name': 'Howard'}, {'p.name': 'Susan Buffett'}, {'p.name': 'Warren Buffett'}]\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'Charlie Munger, Oliver Chace, Howard Buffett, Susan Buffett, and Warren Buffett are or were working at Berkshire Hathaway.'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.run(\"Who is or was working at Berkshire Hathaway?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d636954b-d967-4e96-9489-92e11c74af35",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,5 @@
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

__all__ = [
    "DiffbotGraphTransformer",
]

@@ -0,0 +1,316 @@
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

import requests
from langchain.graphs.graph_document import GraphDocument, Node, Relationship
from langchain.schema import Document
from langchain.utils import get_from_env


def format_property_key(s: str) -> str:
    """Convert a whitespace-separated string to lowerCamelCase.

    For example, "Job title" becomes "jobTitle".
    """
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)


class NodesList:
    """
    Manages a list of nodes with associated properties.

    Attributes:
        nodes (Dict[Tuple, Any]): Stores nodes as keys and their properties as
            values. Each key is a tuple where the first element is the node ID
            and the second is the node type.
    """

    def __init__(self) -> None:
        self.nodes: Dict[Tuple[Union[str, int], str], Any] = dict()

    def add_node_property(
        self, node: Tuple[Union[str, int], str], properties: Dict[str, Any]
    ) -> None:
        """
        Adds or updates node properties.

        If the node does not exist in the list, it's added along with its
        properties. If the node already exists, its properties are updated
        with the new values.

        Args:
            node (Tuple): A tuple containing the node ID and node type.
            properties (Dict): A dictionary of properties to add or update
                for the node.
        """
        if node not in self.nodes:
            self.nodes[node] = properties
        else:
            self.nodes[node].update(properties)

    def return_node_list(self) -> List[Node]:
        """
        Returns the nodes as a list of Node objects.

        Each Node object will have its ID, type, and properties populated.

        Returns:
            List[Node]: A list of Node objects.
        """
        nodes = [
            Node(id=key[0], type=key[1], properties=self.nodes[key])
            for key in self.nodes
        ]
        return nodes


# Properties that should be treated as node properties instead of relationships
FACT_TO_PROPERTY_TYPE = [
    "Date",
    "Number",
    "Job title",
    "Cause of death",
    "Organization type",
    "Academic title",
]


schema_mapping = [
    ("HEADQUARTERS", "ORGANIZATION_LOCATIONS"),
    ("RESIDENCE", "PERSON_LOCATION"),
    ("ALL_PERSON_LOCATIONS", "PERSON_LOCATION"),
    ("CHILD", "HAS_CHILD"),
    ("PARENT", "HAS_PARENT"),
    ("CUSTOMERS", "HAS_CUSTOMER"),
    ("SKILLED_AT", "INTERESTED_IN"),
]


class SimplifiedSchema:
    """
    Provides functionality for working with a simplified schema mapping.

    Attributes:
        schema (Dict): A dictionary containing the mapping to simplified schema types.
    """

    def __init__(self) -> None:
        """Initializes the schema dictionary based on the predefined list."""
        self.schema = dict()
        for row in schema_mapping:
            self.schema[row[0]] = row[1]

    def get_type(self, type: str) -> str:
        """
        Retrieves the simplified schema type for a given original type.

        Args:
            type (str): The original schema type to find the simplified type for.

        Returns:
            str: The simplified schema type if it exists;
                otherwise, returns the original type.
        """
        try:
            return self.schema[type]
        except KeyError:
            return type


class DiffbotGraphTransformer:
    """Transforms documents into graph documents using Diffbot's NLP API.

    A graph document transformation system takes a sequence of Documents and
    returns a sequence of Graph Documents.

    Example:
        .. code-block:: python

            class DiffbotGraphTransformer(BaseGraphDocumentTransformer):

                def transform_documents(
                    self, documents: Sequence[Document], **kwargs: Any
                ) -> Sequence[GraphDocument]:
                    results = []
                    for document in documents:
                        raw_results = self.nlp_request(document.page_content)
                        graph_document = self.process_response(raw_results, document)
                        results.append(graph_document)
                    return results

                async def atransform_documents(
                    self, documents: Sequence[Document], **kwargs: Any
                ) -> Sequence[Document]:
                    raise NotImplementedError
    """

    def __init__(
        self,
        diffbot_api_key: Optional[str] = None,
        fact_confidence_threshold: float = 0.7,
        include_qualifiers: bool = True,
        include_evidence: bool = True,
        simplified_schema: bool = True,
    ) -> None:
        """
        Initialize the graph transformer with various options.

        Args:
            diffbot_api_key (str):
                The API key for Diffbot's NLP services.
            fact_confidence_threshold (float):
                Minimum confidence level for facts to be included.
            include_qualifiers (bool):
                Whether to include qualifiers in the relationships.
            include_evidence (bool):
                Whether to include evidence for the relationships.
            simplified_schema (bool):
                Whether to use a simplified schema for relationships.
        """
        self.diffbot_api_key = diffbot_api_key or get_from_env(
            "diffbot_api_key", "DIFFBOT_API_KEY"
        )
        self.fact_threshold_confidence = fact_confidence_threshold
        self.include_qualifiers = include_qualifiers
        self.include_evidence = include_evidence
        self.simplified_schema = None
        if simplified_schema:
            self.simplified_schema = SimplifiedSchema()

    def nlp_request(self, text: str) -> Dict[str, Any]:
        """
        Make an API request to the Diffbot NLP endpoint.

        Args:
            text (str): The text to be processed.

        Returns:
            Dict[str, Any]: The JSON response from the API.
        """
        # Relationship extraction only works for English
        payload = {
            "content": text,
            "lang": "en",
        }

        FIELDS = "facts"
        HOST = "nl.diffbot.com"
        url = (
            f"https://{HOST}/v1/?fields={FIELDS}&"
            f"token={self.diffbot_api_key}&language=en"
        )
        result = requests.post(url, data=payload)
        return result.json()

    def process_response(
        self, payload: Dict[str, Any], document: Document
    ) -> GraphDocument:
        """
        Transform the Diffbot NLP response into a GraphDocument.

        Args:
            payload (Dict[str, Any]): The JSON response from Diffbot's NLP API.
            document (Document): The original document.

        Returns:
            GraphDocument: The transformed document as a graph.
        """
        # Return empty result if there are no facts
        if "facts" not in payload or not payload["facts"]:
            return GraphDocument(nodes=[], relationships=[], source=document)

        # Nodes are a custom class because we need to deduplicate
        nodes_list = NodesList()
        # Relationships are a list because we don't deduplicate nor anything else
        relationships = list()
        for record in payload["facts"]:
            # Skip if the fact is below the threshold confidence
            if record["confidence"] < self.fact_threshold_confidence:
                continue

            # TODO: It should probably be treated as a node property
            if not record["value"]["allTypes"]:
                continue

            # Define source node
            source_id = (
                record["entity"]["allUris"][0]
                if record["entity"]["allUris"]
                else record["entity"]["name"]
            )
            source_label = record["entity"]["allTypes"][0]["name"].capitalize()
            source_name = record["entity"]["name"]
            source_node = Node(id=source_id, type=source_label)
            nodes_list.add_node_property(
                (source_id, source_label), {"name": source_name}
            )

            # Define target node
            target_id = (
                record["value"]["allUris"][0]
                if record["value"]["allUris"]
                else record["value"]["name"]
            )
            target_label = record["value"]["allTypes"][0]["name"].capitalize()
            target_name = record["value"]["name"]
            # Some facts are better suited as node properties
            if target_label in FACT_TO_PROPERTY_TYPE:
                nodes_list.add_node_property(
                    (source_id, source_label),
                    {format_property_key(record["property"]["name"]): target_name},
                )
            else:  # Define relationship
                # Define target node object
                target_node = Node(id=target_id, type=target_label)
                nodes_list.add_node_property(
                    (target_id, target_label), {"name": target_name}
                )
                # Define relationship type
                rel_type = record["property"]["name"].replace(" ", "_").upper()
                if self.simplified_schema:
                    rel_type = self.simplified_schema.get_type(rel_type)

                # Relationship qualifiers/properties
                rel_properties = dict()
                if self.include_evidence and record.get("evidence"):
                    # Use the first supporting passage as evidence
                    rel_properties.update(
                        {"evidence": record["evidence"][0]["passage"]}
                    )
                if self.include_qualifiers and record.get("qualifiers"):
                    for qualifier in record["qualifiers"]:
                        prop_key = format_property_key(qualifier["property"]["name"])
                        rel_properties[prop_key] = qualifier["value"]["name"]

                relationship = Relationship(
                    source=source_node,
                    target=target_node,
                    type=rel_type,
                    properties=rel_properties,
                )
                relationships.append(relationship)

        return GraphDocument(
            nodes=nodes_list.return_node_list(),
            relationships=relationships,
            source=document,
        )

    def convert_to_graph_documents(
        self, documents: Sequence[Document]
    ) -> List[GraphDocument]:
        """Convert a sequence of documents into graph documents.

        Args:
            documents (Sequence[Document]): The original documents.

        Returns:
            List[GraphDocument]: The transformed documents as graphs.
        """
        results = []
        for document in documents:
            raw_results = self.nlp_request(document.page_content)
            graph_document = self.process_response(raw_results, document)
            results.append(graph_document)
        return results

@@ -3752,6 +3752,31 @@ files = [
{file = "types_PyYAML-6.0.12.11-py3-none-any.whl", hash = "sha256:a461508f3096d1d5810ec5ab95d7eeecb651f3a15b71959999988942063bf01d"},
]
[[package]]
name = "types-requests"
version = "2.31.0.2"
description = "Typing stubs for requests"
optional = false
python-versions = "*"
files = [
{file = "types-requests-2.31.0.2.tar.gz", hash = "sha256:6aa3f7faf0ea52d728bb18c0a0d1522d9bfd8c72d26ff6f61bfc3d06a411cf40"},
{file = "types_requests-2.31.0.2-py3-none-any.whl", hash = "sha256:56d181c85b5925cbc59f4489a57e72a8b2166f18273fd8ba7b6fe0c0b986f12a"},
]
[package.dependencies]
types-urllib3 = "*"
[[package]]
name = "types-urllib3"
version = "1.26.25.14"
description = "Typing stubs for urllib3"
optional = false
python-versions = "*"
files = [
{file = "types-urllib3-1.26.25.14.tar.gz", hash = "sha256:229b7f577c951b8c1b92c1bc2b2fdb0b49847bd2af6d1cc2a2e3dd340f3bda8f"},
{file = "types_urllib3-1.26.25.14-py3-none-any.whl", hash = "sha256:9683bbb7fb72e32bfe9d2be6e04875fbe1b3eeec3cbb4ea231435aa7fd6b4f0e"},
]
[[package]]
name = "typing-extensions"
version = "4.7.1"
@@ -3995,4 +4020,4 @@ extended-testing = ["faker", "presidio-analyzer", "presidio-anonymizer"]
[metadata]
lock-version = "2.0"
python-versions = ">=3.8.1,<4.0"
content-hash = "66ac482bd05eb74414210ac28fc1e8dae1a9928a4a1314e1326fada3551aa8ad"
content-hash = "443e88f690572715cf58671e4480a006574c7141a1258dff0a0818b954184901"

@@ -23,6 +23,7 @@ black = "^23.1.0"
[tool.poetry.group.typing.dependencies]
mypy = "^0.991"
types-pyyaml = "^6.0.12.2"
types-requests = "^2.28.11.5"
[tool.poetry.group.dev.dependencies]
jupyter = "^1.0.0"

@@ -0,0 +1,51 @@
from __future__ import annotations

from typing import List, Union

from langchain.load.serializable import Serializable
from langchain.pydantic_v1 import Field
from langchain.schema import Document


class Node(Serializable):
    """Represents a node in a graph with associated properties.

    Attributes:
        id (Union[str, int]): A unique identifier for the node.
        type (str): The type or label of the node, default is "Node".
        properties (dict): Additional properties and metadata associated with the node.
    """

    id: Union[str, int]
    type: str = "Node"
    properties: dict = Field(default_factory=dict)


class Relationship(Serializable):
    """Represents a directed relationship between two nodes in a graph.

    Attributes:
        source (Node): The source node of the relationship.
        target (Node): The target node of the relationship.
        type (str): The type of the relationship.
        properties (dict): Additional properties associated with the relationship.
    """

    source: Node
    target: Node
    type: str
    properties: dict = Field(default_factory=dict)


class GraphDocument(Serializable):
    """Represents a graph document consisting of nodes and relationships.

    Attributes:
        nodes (List[Node]): A list of nodes in the graph.
        relationships (List[Relationship]): A list of relationships in the graph.
        source (Document): The document from which the graph information is derived.
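
    Example (a hypothetical, hand-built graph; in practice these objects are
    produced by a transformer such as ``DiffbotGraphTransformer``):
        .. code-block:: python

            alice = Node(id="alice", type="Person", properties={"name": "Alice"})
            acme = Node(id="acme", type="Organization", properties={"name": "Acme"})
            doc = GraphDocument(
                nodes=[alice, acme],
                relationships=[
                    Relationship(
                        source=alice, target=acme, type="EMPLOYEE_OR_MEMBER_OF"
                    )
                ],
                source=Document(page_content="Alice works at Acme."),
            )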
"""
nodes: List[Node]
relationships: List[Relationship]
source: Document

@@ -1,5 +1,7 @@
from typing import Any, Dict, List
from langchain.graphs.graph_document import GraphDocument
node_properties_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
@@ -99,3 +101,56 @@ class Neo4jGraph:
        The relationships are the following:
        {[el['output'] for el in relationships]}
        """

    def add_graph_documents(
        self, graph_documents: List[GraphDocument], include_source: bool = False
    ) -> None:
        """
        Take a list of GraphDocuments and use them to construct the graph.

        If include_source is True, each source document is also created as a
        Document node and linked to the nodes it mentions.
        """
        for document in graph_documents:
            include_docs_query = (
                "CREATE (d:Document) "
                "SET d.text = $document.page_content "
                "SET d += $document.metadata "
                "WITH d "
            )
            # Import nodes
            self.query(
                (
                    f"{include_docs_query if include_source else ''}"
                    "UNWIND $data AS row "
                    "CALL apoc.merge.node([row.type], {id: row.id}, "
                    "row.properties, {}) YIELD node "
                    f"{'MERGE (d)-[:MENTIONS]->(node) ' if include_source else ''}"
                    "RETURN distinct 'done' AS result"
                ),
                {
                    "data": [el.__dict__ for el in document.nodes],
                    "document": document.source.__dict__,
                },
            )
            # Import relationships
            self.query(
                "UNWIND $data AS row "
                "CALL apoc.merge.node([row.source_label], {id: row.source},"
                "{}, {}) YIELD node as source "
                "CALL apoc.merge.node([row.target_label], {id: row.target},"
                "{}, {}) YIELD node as target "
                "CALL apoc.merge.relationship(source, row.type, "
                "{}, row.properties, target) YIELD rel "
                "RETURN distinct 'done'",
                {
                    "data": [
                        {
                            "source": el.source.id,
                            "source_label": el.source.type,
                            "target": el.target.id,
                            "target_label": el.target.type,
                            "type": el.type.replace(" ", "_").upper(),
                            "properties": el.properties,
                        }
                        for el in document.relationships
                    ]
                },
            )
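
# A minimal usage sketch (hypothetical credentials), assuming a running Neo4j
# instance with the APOC plugin installed and graph documents produced by a
# graph transformer such as DiffbotGraphTransformer:
#
#     graph = Neo4jGraph(
#         url="bolt://localhost:7687", username="neo4j", password="pleaseletmein"
#     )
#     graph.add_graph_documents(graph_documents, include_source=True)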
