openai-cookbook/examples/Named_Entity_Recognition_to_enrich_text.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Named Entity Recognition (NER) to Enrich Text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Named Entity Recognition` (NER) is a `Natural Language Processing` task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.\n",
    "\n",
    "This notebook demonstrates how to carry out NER with [chat completion](https://platform.openai.com/docs/api-reference/chat) and [functions-calling](https://platform.openai.com/docs/guides/gpt/function-calling) to enrich a text with links to a knowledge base such as Wikipedia:\n",
    "\n",
    "**Text:**\n",
    "\n",
    "*In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.*\n",
    "\n",
    "**Text enriched with Wikipedia links:**\n",
    "\n",
    "*In [Germany](https://en.wikipedia.org/wiki/Germany), in 1440, goldsmith [Johannes Gutenberg]() invented the [movable-type printing press](https://en.wikipedia.org/wiki/Movable_Type). His work led to an [information revolution](https://en.wikipedia.org/wiki/Information_revolution) and the unprecedented mass-spread of literature throughout [Europe](https://en.wikipedia.org/wiki/Europe). Modelled on the design of the existing screw presses, a single [Renaissance](https://en.wikipedia.org/wiki/Renaissance) [movable-type printing press](https://en.wikipedia.org/wiki/Movable_Type) could produce up to 3,600 pages per workday.*\n",
    "\n",
    "**Inference Costs:** The notebook also illustrates how to estimate OpenAI API costs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.1 Install/Upgrade Python packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n",
      "Note: you may need to restart the kernel to use updated packages.\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install --upgrade openai --quiet\n",
    "%pip install --upgrade nlpia2-wikipedia --quiet\n",
    "%pip install --upgrade tenacity --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2 Load packages and OPENAI_API_KEY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can generate an API key in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook works with the latest OpeanAI models `gpt-3.5-turbo-0613` and `gpt-4-0613`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import logging\n",
    "import os\n",
    "\n",
    "import openai\n",
    "import wikipedia\n",
    "\n",
    "from typing import Optional\n",
    "from IPython.display import display, Markdown\n",
    "from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, format=' %(asctime)s - %(levelname)s - %(message)s')\n",
    "\n",
    "OPENAI_MODEL = 'gpt-3.5-turbo-0613'\n",
    "\n",
    "client = openai.OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Define the NER labels to be Identified"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a standard set of NER labels to showcase a wide range of use cases. However, for our specific task of enriching text with knowledge base links, only a subset is practically required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = [\n",
    "    \"person\",      # people, including fictional characters\n",
    "    \"fac\",         # buildings, airports, highways, bridges\n",
    "    \"org\",         # organizations, companies, agencies, institutions\n",
    "    \"gpe\",         # geopolitical entities like countries, cities, states\n",
    "    \"loc\",         # non-gpe locations\n",
    "    \"product\",     # vehicles, foods, appareal, appliances, software, toys \n",
    "    \"event\",       # named sports, scientific milestones, historical events\n",
    "    \"work_of_art\", # titles of books, songs, movies\n",
    "    \"law\",         # named laws, acts, or legislations\n",
    "    \"language\",    # any named language\n",
    "    \"date\",        # absolute or relative dates or periods\n",
    "    \"time\",        # time units smaller than a day\n",
    "    \"percent\",     # percentage (e.g., \"twenty percent\", \"18%\")\n",
    "    \"money\",       # monetary values, including unit\n",
    "    \"quantity\",    # measurements, e.g., weight or distance\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Prepare messages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [chat completions API](https://platform.openai.com/docs/guides/gpt/chat-completions-api) takes a list of messages as input and delivers a model-generated message as an output. While the chat format is primarily designed for facilitating multi-turn conversations, it is equally efficient for single-turn tasks without any preceding conversation. For our purposes, we will specify a message for the system, assistant, and user roles."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1 System Message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `system message` (prompt) sets the assistant's behavior by defining its desired persona and task. We also delineate the specific set of entity labels we aim to identify."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although one can instruct the model to format its response, it has to be noted that both `gpt-3.5-turbo-0613` and `gpt-4-0613` have been fine-tuned to discern when a function should be invoked, and to reply with `JSON` formatted according to the function's signature. This capability streamlines our prompt and enables us to receive structured data directly from the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def system_message(labels):\n",
    "    return f\"\"\"\n",
    "You are an expert in Natural Language Processing. Your task is to identify common Named Entities (NER) in a given text.\n",
    "The possible common Named Entities (NER) types are exclusively: ({\", \".join(labels)}).\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 Assistant Message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Assistant messages` usually store previous assistant responses. However, as in our scenario, they can also be crafted to provide examples of the desired behavior. While OpenAI is able to execute `zero-shot` Named Entity Recognition, we have found that a `one-shot` approach produces more precise results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def assisstant_message():\n",
    "    return f\"\"\"\n",
    "EXAMPLE:\n",
    "    Text: 'In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread / \n",
    "    of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.'\n",
    "    {{\n",
    "        \"gpe\": [\"Germany\", \"Europe\"],\n",
    "        \"date\": [\"1440\"],\n",
    "        \"person\": [\"Johannes Gutenberg\"],\n",
    "        \"product\": [\"movable-type printing press\"],\n",
    "        \"event\": [\"Renaissance\"],\n",
    "        \"quantity\": [\"3,600 pages\"],\n",
    "        \"time\": [\"workday\"]\n",
    "    }}\n",
    "--\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 User Message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `user message` provides the specific text for the assistant task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def user_message(text):\n",
    "    return f\"\"\"\n",
    "TASK:\n",
    "    Text: {text}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. OpenAI Functions (and Utils)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In an OpenAI API call, we can describe `functions` to `gpt-3.5-turbo-0613` and `gpt-4-0613` and have the model intelligently choose to output a `JSON` object containing arguments to call those `functions`. It's important to note that the [chat completions API](https://platform.openai.com/docs/guides/gpt/chat-completions-api) doesn't actually execute the `function`. Instead, it provides the `JSON` output, which can then be used to call the `function` in our code. For more details, refer to the [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our function, `enrich_entities(text, label_entities)` gets a block of text and a dictionary containing identified labels and entities as parameters. It then associates the recognized entities with their corresponding links to the Wikipedia articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))\n",
    "def find_link(entity: str) -> Optional[str]:\n",
    "    \"\"\"\n",
    "    Finds a Wikipedia link for a given entity.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        titles = wikipedia.search(entity)\n",
    "        if titles:\n",
    "            # naively consider the first result as the best\n",
    "            page = wikipedia.page(titles[0])\n",
    "            return page.url\n",
    "    except (wikipedia.exceptions.WikipediaException) as ex:\n",
    "        logging.error(f'Error occurred while searching for Wikipedia link for entity {entity}: {str(ex)}')\n",
    "\n",
    "    return None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_all_links(label_entities:dict) -> dict:\n",
    "    \"\"\" \n",
    "    Finds all Wikipedia links for the dictionary entities in the whitelist label list.\n",
    "    \"\"\"\n",
    "    whitelist = ['event', 'gpe', 'org', 'person', 'product', 'work_of_art']\n",
    "    \n",
    "    return {e: find_link(e) for label, entities in label_entities.items() \n",
    "                            for e in entities\n",
    "                            if label in whitelist}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def enrich_entities(text: str, label_entities: dict) -> str:\n",
    "    \"\"\"\n",
    "    Enriches text with knowledge base links.\n",
    "    \"\"\"\n",
    "    entity_link_dict = find_all_links(label_entities)\n",
    "    logging.info(f\"entity_link_dict: {entity_link_dict}\")\n",
    "    \n",
    "    for entity, link in entity_link_dict.items():\n",
    "        text = text.replace(entity, f\"[{entity}]({link})\")\n",
    "\n",
    "    return text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. ChatCompletion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As previously highlighted, `gpt-3.5-turbo-0613` and `gpt-4-0613` have been fine-tuned to detect when a `function` should to be called. Moreover, they can produce a `JSON` response that conforms to the `function` signature. Here's the sequence we follow:\n",
    "\n",
    "1. Define our `function` and its associated `JSON` Schema.\n",
    "2. Invoke the model using the `messages`, `tools` and `tool_choice` parameters.\n",
    "3. Convert the output into a `JSON` object, and then call the `function` with the `arguments` provided by the model.\n",
    "\n",
    "In practice, one might want to re-invoke the model again by appending the `function` response as a new message, and let the model summarize the results back to the user. Nevertheless, for our purposes, this step is not needed.\n",
    "\n",
    "*Note that in a real-case scenario it is strongly recommended to build in user confirmation flows before taking actions.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.1 Define our Function and JSON schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we want the model to output a dictionary of labels and recognized entities:\n",
    "\n",
    "```python\n",
    "{   \n",
    "    \"gpe\": [\"Germany\", \"Europe\"],   \n",
    "    \"date\": [\"1440\"],   \n",
    "    \"person\": [\"Johannes Gutenberg\"],   \n",
    "    \"product\": [\"movable-type printing press\"],   \n",
    "    \"event\": [\"Renaissance\"],   \n",
    "    \"quantity\": [\"3,600 pages\"],   \n",
    "    \"time\": [\"workday\"]   \n",
    "}   \n",
    "```\n",
    "we need to define the corresponding `JSON` schema to be passed to the `tools` parameter: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_functions(labels: dict) -> list:\n",
    "    return [\n",
    "        {   \n",
    "            \"type\": \"function\",\n",
    "            \"function\": {\n",
    "                \"name\": \"enrich_entities\",\n",
    "                \"description\": \"Enrich Text with Knowledge Base Links\",\n",
    "                \"parameters\": {\n",
    "                    \"type\": \"object\",\n",
    "                        \"properties\": {\n",
    "                            \"r'^(?:' + '|'.join({labels}) + ')$'\": \n",
    "                            {\n",
    "                                \"type\": \"array\",\n",
    "                                \"items\": {\n",
    "                                    \"type\": \"string\"\n",
    "                                }\n",
    "                            }\n",
    "                        },\n",
    "                        \"additionalProperties\": False\n",
    "                },\n",
    "            }\n",
    "        }\n",
    "    ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2 Chat Completion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we invoke the model. It's important to note that we direct the API to use a specific function by setting the `tool_choice` parameter to `{\"type\": \"function\", \"function\" : {\"name\": \"enrich_entities\"}}`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))\n",
    "def run_openai_task(labels, text):\n",
    "    messages = [\n",
    "          {\"role\": \"system\", \"content\": system_message(labels=labels)},\n",
    "          {\"role\": \"assistant\", \"content\": assisstant_message()},\n",
    "          {\"role\": \"user\", \"content\": user_message(text=text)}\n",
    "      ]\n",
    "\n",
    "    # TODO: functions and function_call are deprecated, need to be updated\n",
    "    # See: https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools\n",
    "    response = openai.chat.completions.create(\n",
    "        model=\"gpt-3.5-turbo-0613\",\n",
    "        messages=messages,\n",
    "        tools=generate_functions(labels),\n",
    "        tool_choice={\"type\": \"function\", \"function\" : {\"name\": \"enrich_entities\"}}, \n",
    "        temperature=0,\n",
    "        frequency_penalty=0,\n",
    "        presence_penalty=0,\n",
    "    )\n",
    "\n",
    "    response_message = response.choices[0].message\n",
    "    \n",
    "    available_functions = {\"enrich_entities\": enrich_entities}  \n",
    "    function_name = response_message.tool_calls[0].function.name\n",
    "    \n",
    "    function_to_call = available_functions[function_name]\n",
    "    logging.info(f\"function_to_call: {function_to_call}\")\n",
    "\n",
    "    function_args = json.loads(response_message.tool_calls[0].function.arguments)\n",
    "    logging.info(f\"function_args: {function_args}\")\n",
    "\n",
    "    function_response = function_to_call(text, function_args)\n",
    "\n",
    "    return {\"model_response\": response, \n",
    "            \"function_response\": function_response}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Let's Enrich a Text with Wikipedia links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.1 Run OpenAI Task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 2023-10-20 18:05:51,729 - INFO - function_to_call: <function enrich_entities at 0x0000021D30C462A0>\n",
      " 2023-10-20 18:05:51,730 - INFO - function_args: {'person': ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr'], 'org': ['The Beatles'], 'gpe': ['Liverpool'], 'date': ['1960']}\n",
      " 2023-10-20 18:06:09,858 - INFO - entity_link_dict: {'John Lennon': 'https://en.wikipedia.org/wiki/John_Lennon', 'Paul McCartney': 'https://en.wikipedia.org/wiki/Paul_McCartney', 'George Harrison': 'https://en.wikipedia.org/wiki/George_Harrison', 'Ringo Starr': 'https://en.wikipedia.org/wiki/Ringo_Starr', 'The Beatles': 'https://en.wikipedia.org/wiki/The_Beatles', 'Liverpool': 'https://en.wikipedia.org/wiki/Liverpool'}\n"
     ]
    }
   ],
   "source": [
    "text = \"\"\"The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison, and Ringo Starr.\"\"\"\n",
    "result = run_openai_task(labels, text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.2 Function Response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Text:** The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison, and Ringo Starr.   \n",
       "                     **Enriched_Text:** [The Beatles](https://en.wikipedia.org/wiki/The_Beatles) were an English rock band formed in [Liverpool](https://en.wikipedia.org/wiki/Liverpool) in 1960, comprising [John Lennon](https://en.wikipedia.org/wiki/John_Lennon), [Paul McCartney](https://en.wikipedia.org/wiki/Paul_McCartney), [George Harrison](https://en.wikipedia.org/wiki/George_Harrison), and [Ringo Starr](https://en.wikipedia.org/wiki/Ringo_Starr)."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(Markdown(f\"\"\"**Text:** {text}   \n",
    "                     **Enriched_Text:** {result['function_response']}\"\"\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.3 Token Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To estimate the inference costs, we can parse the response's \"usage\" field. Detailed token costs per model are available in the [OpenAI Pricing Guide](https://openai.com/pricing):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token Usage\n",
      "    Prompt: 331 tokens\n",
      "    Completion: 47 tokens\n",
      "    Cost estimation: $0.00059\n"
     ]
    }
   ],
   "source": [
    "# estimate inference cost assuming gpt-3.5-turbo (4K context)\n",
    "i_tokens  = result[\"model_response\"].usage.prompt_tokens \n",
    "o_tokens = result[\"model_response\"].usage.completion_tokens \n",
    "\n",
    "i_cost = (i_tokens / 1000) * 0.0015\n",
    "o_cost = (o_tokens / 1000) * 0.002\n",
    "\n",
    "print(f\"\"\"Token Usage\n",
    "    Prompt: {i_tokens} tokens\n",
    "    Completion: {o_tokens} tokens\n",
    "    Cost estimation: ${round(i_cost + o_cost, 5)}\"\"\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}