langchain/docs/extras/modules/chains/additional/extraction.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "6605e7f7",
      "metadata": {},
      "source": [
        "# Extraction\n",
        "\n",
        "The extraction chain uses the OpenAI `functions` parameter to specify a schema to extract entities from a document. This helps us make sure that the model outputs exactly the schema of entities and properties that we want, with their appropriate types.\n",
        "\n",
        "The extraction chain is to be used when we want to extract several entities with their properties from the same passage (i.e. what people were mentioned in this passage?)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "34f04daf",
      "metadata": {},
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/Users/harrisonchase/.pyenv/versions/3.9.1/envs/langchain/lib/python3.9/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.6.4) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.\n",
            "  warnings.warn(\n"
          ]
        }
      ],
      "source": [
        "from langchain.chat_models import ChatOpenAI\n",
        "from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic\n",
        "from langchain.prompts import ChatPromptTemplate"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "a2648974",
      "metadata": {},
      "outputs": [],
      "source": [
        "llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5ef034ce",
      "metadata": {},
      "source": [
        "## Extracting entities"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "78ff9df9",
      "metadata": {},
      "source": [
        "To extract entities, we need to create a schema where we specify all the properties we want to find and the type we expect them to have. We can also specify which of these properties are required and which are optional."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "4ac43eba",
      "metadata": {},
      "outputs": [],
      "source": [
        "schema = {\n",
        "    \"properties\": {\n",
        "        \"name\": {\"type\": \"string\"},\n",
        "        \"height\": {\"type\": \"integer\"},\n",
        "        \"hair_color\": {\"type\": \"string\"},\n",
        "    },\n",
        "    \"required\": [\"name\", \"height\"],\n",
        "}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "640bd005",
      "metadata": {},
      "outputs": [],
      "source": [
        "inp = \"\"\"\n",
        "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
        "        \"\"\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "64313214",
      "metadata": {},
      "outputs": [],
      "source": [
        "chain = create_extraction_chain(schema, llm)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "17c48adb",
      "metadata": {},
      "source": [
        "As we can see, we extracted the required entities and their properties in the required format (it even calculated Claudia's height before returning!)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "cc5436ed",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},\n",
              " {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chain.run(inp)"
      ]
    },
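    {
      "cell_type": "markdown",
      "id": "b7d1c2e4",
      "metadata": {},
      "source": [
        "The `required` list of the schema can also be used to sanity-check whatever the chain returns. The next cell is a minimal sketch in plain Python, not part of LangChain: `missing_required` is a hypothetical helper, `person_schema` repeats the schema defined above, and `results` is copied from the output above so that no extra API call is needed. An empty list for an entity means that all of its required properties are present."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c3a9e5f1",
      "metadata": {},
      "outputs": [],
      "source": [
        "# Same schema as above, repeated so that this cell is self-contained.\n",
        "person_schema = {\n",
        "    \"properties\": {\n",
        "        \"name\": {\"type\": \"string\"},\n",
        "        \"height\": {\"type\": \"integer\"},\n",
        "        \"hair_color\": {\"type\": \"string\"},\n",
        "    },\n",
        "    \"required\": [\"name\", \"height\"],\n",
        "}\n",
        "\n",
        "\n",
        "# Hypothetical helper (not a LangChain API): for each entity, list the required keys it lacks.\n",
        "def missing_required(entities, schema):\n",
        "    required = schema.get(\"required\", [])\n",
        "    return [[key for key in required if key not in entity] for entity in entities]\n",
        "\n",
        "\n",
        "# Copied from the output above.\n",
        "results = [\n",
        "    {\"name\": \"Alex\", \"height\": 5, \"hair_color\": \"blonde\"},\n",
        "    {\"name\": \"Claudia\", \"height\": 6, \"hair_color\": \"brunette\"},\n",
        "]\n",
        "\n",
        "missing_required(results, person_schema)"
      ]
    },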
    {
      "cell_type": "markdown",
      "id": "8d51fcdc",
      "metadata": {},
      "source": [
        "## Several entity types"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5813affe",
      "metadata": {},
      "source": [
        "Notice that we are using OpenAI functions under the hood and thus the model can only call one function per request (with one, unique schema)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "511b9838",
      "metadata": {},
      "source": [
        "If we want to extract more than one entity type, we need to introduce a little hack - we will define our properties with an included entity type. \n",
        "\n",
        "Following we have an example where we also want to extract dog attributes from the passage. Notice the 'person_' and 'dog_' prefixes we use for each property; this tells the model which entity type the property refers to. In this way, the model can return properties from several entity types in one single call."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "cf243a26",
      "metadata": {},
      "outputs": [],
      "source": [
        "schema = {\n",
        "    \"properties\": {\n",
        "        \"person_name\": {\"type\": \"string\"},\n",
        "        \"person_height\": {\"type\": \"integer\"},\n",
        "        \"person_hair_color\": {\"type\": \"string\"},\n",
        "        \"dog_name\": {\"type\": \"string\"},\n",
        "        \"dog_breed\": {\"type\": \"string\"},\n",
        "    },\n",
        "    \"required\": [\"person_name\", \"person_height\"],\n",
        "}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "52841fb3",
      "metadata": {},
      "outputs": [],
      "source": [
        "inp = \"\"\"\n",
        "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
        "Alex's dog Frosty is a labrador and likes to play hide and seek.\n",
        "        \"\"\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "93f904ab",
      "metadata": {},
      "outputs": [],
      "source": [
        "chain = create_extraction_chain(schema, llm)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "eb074f7b",
      "metadata": {},
      "source": [
        "People attributes and dog attributes were correctly extracted from the text in the same call"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "db3e9e17",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'person_name': 'Alex',\n",
              "  'person_height': 5,\n",
              "  'person_hair_color': 'blonde',\n",
              "  'dog_name': 'Frosty',\n",
              "  'dog_breed': 'labrador'},\n",
              " {'person_name': 'Claudia',\n",
              "  'person_height': 6,\n",
              "  'person_hair_color': 'brunette'}]"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chain.run(inp)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0273e0e2",
      "metadata": {},
      "source": [
        "## Unrelated entities"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c07b3480",
      "metadata": {},
      "source": [
        "What if our entities are unrelated? In that case, the model will return the unrelated entities in different dictionaries, allowing us to successfully extract several unrelated entity types in the same call."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "01d98af0",
      "metadata": {},
      "source": [
        "Notice that we use `required: []`: we need to allow the model to return **only** person attributes or **only** dog attributes for a single entity (person or dog)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 48,
      "id": "e584c993",
      "metadata": {},
      "outputs": [],
      "source": [
        "schema = {\n",
        "    \"properties\": {\n",
        "        \"person_name\": {\"type\": \"string\"},\n",
        "        \"person_height\": {\"type\": \"integer\"},\n",
        "        \"person_hair_color\": {\"type\": \"string\"},\n",
        "        \"dog_name\": {\"type\": \"string\"},\n",
        "        \"dog_breed\": {\"type\": \"string\"},\n",
        "    },\n",
        "    \"required\": [],\n",
        "}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 49,
      "id": "ad6b105f",
      "metadata": {},
      "outputs": [],
      "source": [
        "inp = \"\"\"\n",
        "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
        "\n",
        "Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.\n",
        "\"\"\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 50,
      "id": "6bfe5a33",
      "metadata": {},
      "outputs": [],
      "source": [
        "chain = create_extraction_chain(schema, llm)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "24fe09af",
      "metadata": {},
      "source": [
        "We have each entity in its own separate dictionary, with only the appropriate attributes being returned"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 51,
      "id": "f6e1fd89",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},\n",
              " {'person_name': 'Claudia',\n",
              "  'person_height': 6,\n",
              "  'person_hair_color': 'brunette'},\n",
              " {'dog_name': 'Willow', 'dog_breed': 'German Shepherd'},\n",
              " {'dog_name': 'Milo', 'dog_breed': 'border collie'}]"
            ]
          },
          "execution_count": 51,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chain.run(inp)"
      ]
    },
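    {
      "cell_type": "markdown",
      "id": "d4f6a8b0",
      "metadata": {},
      "source": [
        "Because each returned dictionary only contains the keys of one entity type, the property prefixes can be used afterwards to separate people from dogs. The next cell is a minimal sketch in plain Python: `split_by_prefix` is a hypothetical helper, not a LangChain function, and `results` is copied from the output above so that no extra API call is needed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e9b2c7d3",
      "metadata": {},
      "outputs": [],
      "source": [
        "from collections import defaultdict\n",
        "\n",
        "\n",
        "# Hypothetical helper (not a LangChain API): group entities by the prefix of their keys.\n",
        "def split_by_prefix(entities, prefixes=(\"person\", \"dog\")):\n",
        "    grouped = defaultdict(list)\n",
        "    for entity in entities:\n",
        "        for prefix in prefixes:\n",
        "            # Keep only the keys of this entity type and strip the prefix from them.\n",
        "            fields = {\n",
        "                key[len(prefix) + 1 :]: value\n",
        "                for key, value in entity.items()\n",
        "                if key.startswith(prefix + \"_\")\n",
        "            }\n",
        "            if fields:\n",
        "                grouped[prefix].append(fields)\n",
        "    return dict(grouped)\n",
        "\n",
        "\n",
        "# Copied from the output above.\n",
        "results = [\n",
        "    {\"person_name\": \"Alex\", \"person_height\": 5, \"person_hair_color\": \"blonde\"},\n",
        "    {\"person_name\": \"Claudia\", \"person_height\": 6, \"person_hair_color\": \"brunette\"},\n",
        "    {\"dog_name\": \"Willow\", \"dog_breed\": \"German Shepherd\"},\n",
        "    {\"dog_name\": \"Milo\", \"dog_breed\": \"border collie\"},\n",
        "]\n",
        "\n",
        "split_by_prefix(results)"
      ]
    },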
    {
      "cell_type": "markdown",
      "id": "0ac466d1",
      "metadata": {},
      "source": [
        "## Extra info for an entity"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d240ffc1",
      "metadata": {},
      "source": [
        "What if.. _we don't know what we want?_ More specifically, say we know a few properties we want to extract for a given entity but we also want to know if there's any extra information in the passage. Fortunately, we don't need to structure everything - we can have unstructured extraction as well. \n",
        "\n",
        "We can do this by introducing another hack, namely the *extra_info* attribute - let's see an example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 68,
      "id": "f19685f6",
      "metadata": {},
      "outputs": [],
      "source": [
        "schema = {\n",
        "    \"properties\": {\n",
        "        \"person_name\": {\"type\": \"string\"},\n",
        "        \"person_height\": {\"type\": \"integer\"},\n",
        "        \"person_hair_color\": {\"type\": \"string\"},\n",
        "        \"dog_name\": {\"type\": \"string\"},\n",
        "        \"dog_breed\": {\"type\": \"string\"},\n",
        "        \"dog_extra_info\": {\"type\": \"string\"},\n",
        "    },\n",
        "}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 81,
      "id": "200c3477",
      "metadata": {},
      "outputs": [],
      "source": [
        "inp = \"\"\"\n",
        "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
        "\n",
        "Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.\n",
        "\"\"\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 82,
      "id": "ddad7dc6",
      "metadata": {},
      "outputs": [],
      "source": [
        "chain = create_extraction_chain(schema, llm)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e5c0dbbc",
      "metadata": {},
      "source": [
        "It is nice to know more about Willow and Milo!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 83,
      "id": "c22cfd30",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},\n",
              " {'person_name': 'Claudia',\n",
              "  'person_height': 6,\n",
              "  'person_hair_color': 'brunette'},\n",
              " {'dog_name': 'Willow',\n",
              "  'dog_breed': 'German Shepherd',\n",
              "  'dog_extra_information': 'likes to play with other dogs'},\n",
              " {'dog_name': 'Milo',\n",
              "  'dog_breed': 'border collie',\n",
              "  'dog_extra_information': 'lives close by'}]"
            ]
          },
          "execution_count": 83,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chain.run(inp)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "698b4c4d",
      "metadata": {},
      "source": [
        "## Pydantic example"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6504a6d9",
      "metadata": {},
      "source": [
        "We can also use a Pydantic schema to choose the required properties and types and we will set as 'Optional' those that are not strictly required.\n",
        "\n",
        "By using the `create_extraction_chain_pydantic` function, we can send a Pydantic schema as input and the output will be an instantiated object that respects our desired schema. \n",
        "\n",
        "In this way, we can specify our schema in the same manner that we would a new class or function in Python - with purely Pythonic types."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "6792866b",
      "metadata": {},
      "outputs": [],
      "source": [
        "from typing import Optional, List\n",
        "from pydantic import BaseModel, Field"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "36a63761",
      "metadata": {},
      "outputs": [],
      "source": [
        "class Properties(BaseModel):\n",
        "    person_name: str\n",
        "    person_height: int\n",
        "    person_hair_color: str\n",
        "    dog_breed: Optional[str]\n",
        "    dog_name: Optional[str]"
      ]
    },
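    {
      "cell_type": "markdown",
      "id": "f1a2b3c4",
      "metadata": {},
      "source": [
        "Before wiring the model into a chain, it can help to see what a Pydantic schema itself enforces. The next cell is a minimal sketch that does not call the API: `PersonWithDog` is a hypothetical stand-in for `Properties`, written with explicit `None` defaults (an assumption made so that the optional fields behave the same under Pydantic v1 and v2). Omitted optional fields come back as `None`, and a value of the wrong type raises a `ValidationError`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a5b6c7d8",
      "metadata": {},
      "outputs": [],
      "source": [
        "from typing import Optional\n",
        "\n",
        "from pydantic import BaseModel, ValidationError\n",
        "\n",
        "\n",
        "# Hypothetical stand-in for the Properties class, used only for illustration.\n",
        "class PersonWithDog(BaseModel):\n",
        "    person_name: str\n",
        "    person_height: int\n",
        "    person_hair_color: str\n",
        "    dog_breed: Optional[str] = None\n",
        "    dog_name: Optional[str] = None\n",
        "\n",
        "\n",
        "# The optional dog fields can be left out; they default to None.\n",
        "claudia = PersonWithDog(person_name=\"Claudia\", person_height=6, person_hair_color=\"brunette\")\n",
        "print(claudia.dog_name)\n",
        "\n",
        "# A height that cannot be parsed as an integer is rejected.\n",
        "try:\n",
        "    PersonWithDog(person_name=\"Alex\", person_height=\"tall\", person_hair_color=\"blonde\")\n",
        "except ValidationError as err:\n",
        "    print(\"rejected:\", len(err.errors()), \"validation error(s)\")"
      ]
    },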
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "8ffd1e57",
      "metadata": {},
      "outputs": [],
      "source": [
        "chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "id": "24baa954",
      "metadata": {
        "scrolled": false
      },
      "outputs": [],
      "source": [
        "inp = \"\"\"\n",
        "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
        "Alex's dog Frosty is a labrador and likes to play hide and seek.\n",
        "        \"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "id": "84e0a241",
      "metadata": {},
      "source": [
        "As we can see, we extracted the required entities and their properties in the required format:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "id": "f771df58",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[Properties(person_name='Alex', person_height=5, person_hair_color='blonde', dog_breed='labrador', dog_name='Frosty'),\n",
              " Properties(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)]"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chain.run(inp)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.1"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}