{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "5e3cb542-933d-4bf3-a82b-d9d6395a7832", "metadata": { "tags": [] }, "source": [ "# Wikibase Agent\n", "\n", "This notebook demonstrates a very simple wikibase agent that uses sparql generation. Although this code is intended to work against any\n", "wikibase instance, we use http://wikidata.org for testing.\n", "\n", "If you are interested in wikibases and sparql, please consider helping to improve this agent. Look [here](https://github.com/donaldziff/langchain-wikibase) for more details and open questions.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "07d42966-7e99-4157-90dc-6704977dcf1b", "metadata": { "tags": [] }, "source": [ "## Preliminaries" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9132f093-c61e-4b8d-abef-91ebef3fc85f", "metadata": { "tags": [] }, "source": [ "### API keys and other secrats\n", "\n", "We use an `.ini` file, like this: \n", "```\n", "[OPENAI]\n", "OPENAI_API_KEY=xyzzy\n", "[WIKIDATA]\n", "WIKIDATA_USER_AGENT_HEADER=argle-bargle\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "99567dfd-05a7-412f-abf0-9b9f4424acbd", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['./secrets.ini']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import configparser\n", "\n", "config = configparser.ConfigParser()\n", "config.read(\"./secrets.ini\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "332b6658-c978-41ca-a2be-4f8677fecaef", "metadata": { "tags": [] }, "source": [ "### OpenAI API Key\n", "\n", "An OpenAI API key is required unless you modify the code below to use another LLM provider." ] }, { "cell_type": "code", "execution_count": 2, "id": "dd328ee2-33cc-4e1e-aff7-cc0a2e05e2e6", "metadata": { "tags": [] }, "outputs": [], "source": [ "openai_api_key = config[\"OPENAI\"][\"OPENAI_API_KEY\"]\n", "import os\n", "\n", "os.environ.update({\"OPENAI_API_KEY\": openai_api_key})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "42a9311b-600d-42bc-b000-2692ef87a213", "metadata": { "tags": [] }, "source": [ "### Wikidata user-agent header\n", "\n", "Wikidata policy requires a user-agent header. See https://meta.wikimedia.org/wiki/User-Agent_policy. However, at present this policy is not strictly enforced." ] }, { "cell_type": "code", "execution_count": 3, "id": "17ba657e-789d-40e1-b4b7-4f29ba06fe79", "metadata": {}, "outputs": [], "source": [ "wikidata_user_agent_header = (\n", " None\n", " if not config.has_section(\"WIKIDATA\")\n", " else config[\"WIKIDATA\"][\"WIKIDAtA_USER_AGENT_HEADER\"]\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "db08d308-050a-4fc8-93c9-8de4ae977ac3", "metadata": {}, "source": [ "### Enable tracing if desired" ] }, { "cell_type": "code", "execution_count": 4, "id": "77d2da08-fccd-4676-b77e-c0e89bf343cb", "metadata": {}, "outputs": [], "source": [ "# import os\n", "# os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\"\n", "# os.environ[\"LANGCHAIN_SESSION\"] = \"default\" # Make sure this session actually exists." ] }, { "attachments": {}, "cell_type": "markdown", "id": "3dbc5bfc-48ce-4f90-873c-7336b21300c6", "metadata": {}, "source": [ "# Tools\n", "\n", "Three tools are provided for this simple agent:\n", "* `ItemLookup`: for finding the q-number of an item\n", "* `PropertyLookup`: for finding the p-number of a property\n", "* `SparqlQueryRunner`: for running a sparql query" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1f801b4e-6576-4914-aa4f-6f4c4e3c7924", "metadata": { "tags": [] }, "source": [ "## Item and Property lookup\n", "\n", "Item and Property lookup are implemented in a single method, using an elastic search endpoint. Not all wikibase instances have it, but wikidata does, and that's where we'll start." ] }, { "cell_type": "code", "execution_count": 5, "id": "42d23f0a-1c74-4c9c-85f2-d0e24204e96a", "metadata": {}, "outputs": [], "source": [ "def get_nested_value(o: dict, path: list) -> any:\n", " current = o\n", " for key in path:\n", " try:\n", " current = current[key]\n", " except:\n", " return None\n", " return current\n", "\n", "\n", "import requests\n", "\n", "from typing import Optional\n", "\n", "\n", "def vocab_lookup(\n", " search: str,\n", " entity_type: str = \"item\",\n", " url: str = \"https://www.wikidata.org/w/api.php\",\n", " user_agent_header: str = wikidata_user_agent_header,\n", " srqiprofile: str = None,\n", ") -> Optional[str]:\n", " headers = {\"Accept\": \"application/json\"}\n", " if wikidata_user_agent_header is not None:\n", " headers[\"User-Agent\"] = wikidata_user_agent_header\n", "\n", " if entity_type == \"item\":\n", " srnamespace = 0\n", " srqiprofile = \"classic_noboostlinks\" if srqiprofile is None else srqiprofile\n", " elif entity_type == \"property\":\n", " srnamespace = 120\n", " srqiprofile = \"classic\" if srqiprofile is None else srqiprofile\n", " else:\n", " raise ValueError(\"entity_type must be either 'property' or 'item'\")\n", "\n", " params = {\n", " \"action\": \"query\",\n", " \"list\": \"search\",\n", " \"srsearch\": search,\n", " \"srnamespace\": srnamespace,\n", " \"srlimit\": 1,\n", " \"srqiprofile\": srqiprofile,\n", " \"srwhat\": \"text\",\n", " \"format\": \"json\",\n", " }\n", "\n", " response = requests.get(url, headers=headers, params=params)\n", "\n", " if response.status_code == 200:\n", " title = get_nested_value(response.json(), [\"query\", \"search\", 0, \"title\"])\n", " if title is None:\n", " return f\"I couldn't find any {entity_type} for '{search}'. Please rephrase your request and try again\"\n", " # if there is a prefix, strip it off\n", " return title.split(\":\")[-1]\n", " else:\n", " return \"Sorry, I got an error. Please try again.\"" ] }, { "cell_type": "code", "execution_count": 6, "id": "e52060fa-3614-43fb-894e-54e9b75d1e9f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q4180017\n" ] } ], "source": [ "print(vocab_lookup(\"Malin 1\"))" ] }, { "cell_type": "code", "execution_count": 7, "id": "b23ab322-b2cf-404e-b36f-2bfc1d79b0d3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P31\n" ] } ], "source": [ "print(vocab_lookup(\"instance of\", entity_type=\"property\"))" ] }, { "cell_type": "code", "execution_count": 8, "id": "89020cc8-104e-42d0-ac32-885e590de515", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I couldn't find any item for 'Ceci n'est pas un q-item'. Please rephrase your request and try again\n" ] } ], "source": [ "print(vocab_lookup(\"Ceci n'est pas un q-item\"))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "78d66d8b-0e34-4d3f-a18d-c7284840ac76", "metadata": {}, "source": [ "## Sparql runner " ] }, { "attachments": {}, "cell_type": "markdown", "id": "c6f60069-fbe0-4015-87fb-0e487cd914e7", "metadata": {}, "source": [ "This tool runs sparql - by default, wikidata is used." ] }, { "cell_type": "code", "execution_count": 9, "id": "b5b97a4d-2a39-4993-88d9-e7818c0a2853", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from typing import List, Dict, Any\n", "import json\n", "\n", "\n", "def run_sparql(\n", " query: str,\n", " url=\"https://query.wikidata.org/sparql\",\n", " user_agent_header: str = wikidata_user_agent_header,\n", ") -> List[Dict[str, Any]]:\n", " headers = {\"Accept\": \"application/json\"}\n", " if wikidata_user_agent_header is not None:\n", " headers[\"User-Agent\"] = wikidata_user_agent_header\n", "\n", " response = requests.get(\n", " url, headers=headers, params={\"query\": query, \"format\": \"json\"}\n", " )\n", "\n", " if response.status_code != 200:\n", " return \"That query failed. Perhaps you could try a different one?\"\n", " results = get_nested_value(response.json(), [\"results\", \"bindings\"])\n", " return json.dumps(results)" ] }, { "cell_type": "code", "execution_count": 10, "id": "149722ec-8bc1-4d4f-892b-e4ddbe8444c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'[{\"count\": {\"datatype\": \"http://www.w3.org/2001/XMLSchema#integer\", \"type\": \"literal\", \"value\": \"20\"}}]'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "run_sparql(\"SELECT (COUNT(?children) as ?count) WHERE { wd:Q1339 wdt:P40 ?children . }\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9f0302fd-ba35-4acc-ba32-1d7c9295c898", "metadata": {}, "source": [ "# Agent" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3122a961-9673-4a52-b1cd-7d62fbdf8d96", "metadata": {}, "source": [ "## Wrap the tools" ] }, { "cell_type": "code", "execution_count": 11, "id": "cc41ae88-2e53-4363-9878-28b26430cb1e", "metadata": {}, "outputs": [], "source": [ "from langchain.agents import (\n", " Tool,\n", " AgentExecutor,\n", " LLMSingleActionAgent,\n", " AgentOutputParser,\n", ")\n", "from langchain.prompts import StringPromptTemplate\n", "from langchain import OpenAI, LLMChain\n", "from typing import List, Union\n", "from langchain.schema import AgentAction, AgentFinish\n", "import re" ] }, { "cell_type": "code", "execution_count": 12, "id": "2810a3ce-b9c6-47ee-8068-12ca967cd0ea", "metadata": {}, "outputs": [], "source": [ "# Define which tools the agent can use to answer user queries\n", "tools = [\n", " Tool(\n", " name=\"ItemLookup\",\n", " func=(lambda x: vocab_lookup(x, entity_type=\"item\")),\n", " description=\"useful for when you need to know the q-number for an item\",\n", " ),\n", " Tool(\n", " name=\"PropertyLookup\",\n", " func=(lambda x: vocab_lookup(x, entity_type=\"property\")),\n", " description=\"useful for when you need to know the p-number for a property\",\n", " ),\n", " Tool(\n", " name=\"SparqlQueryRunner\",\n", " func=run_sparql,\n", " description=\"useful for getting results from a wikibase\",\n", " ),\n", "]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ab0f2778-a195-4a4a-a5b4-c1e809e1fb7b", "metadata": {}, "source": [ "## Prompts" ] }, { "cell_type": "code", "execution_count": 13, "id": "7bd4ba4f-57d6-4ceb-b932-3cb0d0509a24", "metadata": {}, "outputs": [], "source": [ "# Set up the base template\n", "template = \"\"\"\n", "Answer the following questions by running a sparql query against a wikibase where the p and q items are \n", "completely unknown to you. You will need to discover the p and q items before you can generate the sparql.\n", "Do not assume you know the p and q items for any concepts. Always use tools to find all p and q items.\n", "After you generate the sparql, you should run it. The results will be returned in json. \n", "Summarize the json results in natural language.\n", "\n", "You may assume the following prefixes:\n", "PREFIX wd: \n", "PREFIX wdt: \n", "PREFIX p: \n", "PREFIX ps: \n", "\n", "When generating sparql:\n", "* Try to avoid \"count\" and \"filter\" queries if possible\n", "* Never enclose the sparql in back-quotes\n", "\n", "You have access to the following tools:\n", "\n", "{tools}\n", "\n", "Use the following format:\n", "\n", "Question: the input question for which you must provide a natural language answer\n", "Thought: you should always think about what to do\n", "Action: the action to take, should be one of [{tool_names}]\n", "Action Input: the input to the action\n", "Observation: the result of the action\n", "... (this Thought/Action/Action Input/Observation can repeat N times)\n", "Thought: I now know the final answer\n", "Final Answer: the final answer to the original input question\n", "\n", "Question: {input}\n", "{agent_scratchpad}\"\"\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "7e8d771a-64bb-4ec8-b472-6a9a40c6dd38", "metadata": {}, "outputs": [], "source": [ "# Set up a prompt template\n", "class CustomPromptTemplate(StringPromptTemplate):\n", " # The template to use\n", " template: str\n", " # The list of tools available\n", " tools: List[Tool]\n", "\n", " def format(self, **kwargs) -> str:\n", " # Get the intermediate steps (AgentAction, Observation tuples)\n", " # Format them in a particular way\n", " intermediate_steps = kwargs.pop(\"intermediate_steps\")\n", " thoughts = \"\"\n", " for action, observation in intermediate_steps:\n", " thoughts += action.log\n", " thoughts += f\"\\nObservation: {observation}\\nThought: \"\n", " # Set the agent_scratchpad variable to that value\n", " kwargs[\"agent_scratchpad\"] = thoughts\n", " # Create a tools variable from the list of tools provided\n", " kwargs[\"tools\"] = \"\\n\".join(\n", " [f\"{tool.name}: {tool.description}\" for tool in self.tools]\n", " )\n", " # Create a list of tool names for the tools provided\n", " kwargs[\"tool_names\"] = \", \".join([tool.name for tool in self.tools])\n", " return self.template.format(**kwargs)" ] }, { "cell_type": "code", "execution_count": 15, "id": "f97dca78-fdde-4a70-9137-e34a21d14e64", "metadata": {}, "outputs": [], "source": [ "prompt = CustomPromptTemplate(\n", " template=template,\n", " tools=tools,\n", " # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically\n", " # This includes the `intermediate_steps` variable because that is needed\n", " input_variables=[\"input\", \"intermediate_steps\"],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12c57d77-3c1e-4cde-9a83-7d2134392479", "metadata": {}, "source": [ "## Output parser \n", "This is unchanged from langchain docs" ] }, { "cell_type": "code", "execution_count": 16, "id": "42da05eb-c103-4649-9d20-7143a8880721", "metadata": {}, "outputs": [], "source": [ "class CustomOutputParser(AgentOutputParser):\n", " def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:\n", " # Check if agent should finish\n", " if \"Final Answer:\" in llm_output:\n", " return AgentFinish(\n", " # Return values is generally always a dictionary with a single `output` key\n", " # It is not recommended to try anything else at the moment :)\n", " return_values={\"output\": llm_output.split(\"Final Answer:\")[-1].strip()},\n", " log=llm_output,\n", " )\n", " # Parse out the action and action input\n", " regex = r\"Action: (.*?)[\\n]*Action Input:[\\s]*(.*)\"\n", " match = re.search(regex, llm_output, re.DOTALL)\n", " if not match:\n", " raise ValueError(f\"Could not parse LLM output: `{llm_output}`\")\n", " action = match.group(1).strip()\n", " action_input = match.group(2)\n", " # Return the action and action input\n", " return AgentAction(\n", " tool=action, tool_input=action_input.strip(\" \").strip('\"'), log=llm_output\n", " )" ] }, { "cell_type": "code", "execution_count": 17, "id": "d2b4d710-8cc9-4040-9269-59cf6c5c22be", "metadata": {}, "outputs": [], "source": [ "output_parser = CustomOutputParser()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "48a758cb-93a7-4555-b69a-896d2d43c6f0", "metadata": {}, "source": [ "## Specify the LLM model" ] }, { "cell_type": "code", "execution_count": 18, "id": "72988c79-8f60-4b0f-85ee-6af32e8de9c2", "metadata": {}, "outputs": [], "source": [ "from langchain.chat_models import ChatOpenAI\n", "\n", "llm = ChatOpenAI(model_name=\"gpt-4\", temperature=0)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "95685d14-647a-4e24-ae2c-a8dd1e364921", "metadata": {}, "source": [ "## Agent and agent executor" ] }, { "cell_type": "code", "execution_count": 19, "id": "13d55765-bfa1-43b3-b7cb-00f52ebe7747", "metadata": {}, "outputs": [], "source": [ "# LLM chain consisting of the LLM and a prompt\n", "llm_chain = LLMChain(llm=llm, prompt=prompt)" ] }, { "cell_type": "code", "execution_count": 20, "id": "b3f7ac3c-398e-49f9-baed-554f49a191c3", "metadata": {}, "outputs": [], "source": [ "tool_names = [tool.name for tool in tools]\n", "agent = LLMSingleActionAgent(\n", " llm_chain=llm_chain,\n", " output_parser=output_parser,\n", " stop=[\"\\nObservation:\"],\n", " allowed_tools=tool_names,\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "id": "65740577-272e-4853-8d47-b87784cfaba0", "metadata": {}, "outputs": [], "source": [ "agent_executor = AgentExecutor.from_agent_and_tools(\n", " agent=agent, tools=tools, verbose=True\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "66e3d13b-77cf-41d3-b541-b54535c14459", "metadata": {}, "source": [ "## Run it!" ] }, { "cell_type": "code", "execution_count": 22, "id": "6e97a07c-d7bf-4a35-9ab2-b59ae865c62c", "metadata": {}, "outputs": [], "source": [ "# If you prefer in-line tracing, uncomment this line\n", "# agent_executor.agent.llm_chain.verbose = True" ] }, { "cell_type": "code", "execution_count": 23, "id": "a11ca60d-f57b-4fe8-943e-a258e37463c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: I need to find the Q number for J.S. Bach.\n", "Action: ItemLookup\n", "Action Input: J.S. Bach\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mQ1339\u001b[0m\u001b[32;1m\u001b[1;3mI need to find the P number for children.\n", "Action: PropertyLookup\n", "Action Input: children\u001b[0m\n", "\n", "Observation:\u001b[33;1m\u001b[1;3mP1971\u001b[0m\u001b[32;1m\u001b[1;3mNow I can query the number of children J.S. Bach had.\n", "Action: SparqlQueryRunner\n", "Action Input: SELECT ?children WHERE { wd:Q1339 wdt:P1971 ?children }\u001b[0m\n", "\n", "Observation:\u001b[38;5;200m\u001b[1;3m[{\"children\": {\"datatype\": \"http://www.w3.org/2001/XMLSchema#decimal\", \"type\": \"literal\", \"value\": \"20\"}}]\u001b[0m\u001b[32;1m\u001b[1;3mI now know the final answer.\n", "Final Answer: J.S. Bach had 20 children.\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "'J.S. Bach had 20 children.'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\"How many children did J.S. Bach have?\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "d0b42a41-996b-4156-82e4-f0651a87ee34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: To find Hakeem Olajuwon's Basketball-Reference.com NBA player ID, I need to first find his Wikidata item (Q-number) and then query for the relevant property (P-number).\n", "Action: ItemLookup\n", "Action Input: Hakeem Olajuwon\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mQ273256\u001b[0m\u001b[32;1m\u001b[1;3mNow that I have Hakeem Olajuwon's Wikidata item (Q273256), I need to find the P-number for the Basketball-Reference.com NBA player ID property.\n", "Action: PropertyLookup\n", "Action Input: Basketball-Reference.com NBA player ID\u001b[0m\n", "\n", "Observation:\u001b[33;1m\u001b[1;3mP2685\u001b[0m\u001b[32;1m\u001b[1;3mNow that I have both the Q-number for Hakeem Olajuwon (Q273256) and the P-number for the Basketball-Reference.com NBA player ID property (P2685), I can run a SPARQL query to get the ID value.\n", "Action: SparqlQueryRunner\n", "Action Input: \n", "SELECT ?playerID WHERE {\n", " wd:Q273256 wdt:P2685 ?playerID .\n", "}\u001b[0m\n", "\n", "Observation:\u001b[38;5;200m\u001b[1;3m[{\"playerID\": {\"type\": \"literal\", \"value\": \"o/olajuha01\"}}]\u001b[0m\u001b[32;1m\u001b[1;3mI now know the final answer\n", "Final Answer: Hakeem Olajuwon's Basketball-Reference.com NBA player ID is \"o/olajuha01\".\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "'Hakeem Olajuwon\\'s Basketball-Reference.com NBA player ID is \"o/olajuha01\".'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\n", " \"What is the Basketball-Reference.com NBA player ID of Hakeem Olajuwon?\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "05fb3a3e-8a9f-482d-bd54-4c6e60ef60dd", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda210", "language": "python", "name": "conda210" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }