langchain/docs/extras/use_cases/extraction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b84edb4e",
   "metadata": {},
   "source": [
    "# Extraction\n",
    "\n",
    "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/extraction/extraction.ipynb)\n",
    "\n",
    "## Use case\n",
    "\n",
    "Getting structured output from raw LLM generations is hard.\n",
    "\n",
    "For example, suppose you need the model output formatted with a specific schema for:\n",
    "\n",
    "- Extracting a structured row to insert into a database \n",
    "- Extracting API parameters\n",
    "- Extracting different parts of a user query (e.g., for semantic vs keyword search)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "178dbc59",
   "metadata": {},
   "source": [
    "![Image description](/img/extraction.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97f474d4",
   "metadata": {},
   "source": [
    "## Overview \n",
    "\n",
    "There are two primary approaches for this:\n",
    "\n",
    "- `Functions`: Some LLMs can call [functions](https://openai.com/blog/function-calling-and-other-api-updates) to extract arbitrary entities from LLM responses.\n",
    "\n",
    "- `Parsing`: [Output parsers](/docs/modules/model_io/output_parsers/) are classes that structure LLM responses. \n",
    "\n",
    "Only some LLMs support functions (e.g., OpenAI), and they are more general than parsers. \n",
    "\n",
    "Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).\n",
    "\n",
    "Functions can infer things beyond of a provided schema (e.g., attributes about a person that you did not ask for)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25d89f21",
   "metadata": {},
   "source": [
    "## Quickstart\n",
    "\n",
    "OpenAI funtions are one way to get started with extraction.\n",
    "\n",
    "Define a schema that specifies the properties we want to extract from the LLM output.\n",
    "\n",
    "Then, we can use `create_extraction_chain` to extract our desired schema using an OpenAI function call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f5ec7a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install langchain openai \n",
    "\n",
    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
    "# import dotenv\n",
    "# dotenv.load_env()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3e017ba0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},\n",
       " {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.chains import create_extraction_chain\n",
    "\n",
    "# Schema\n",
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"name\": {\"type\": \"string\"},\n",
    "        \"height\": {\"type\": \"integer\"},\n",
    "        \"hair_color\": {\"type\": \"string\"},\n",
    "    },\n",
    "    \"required\": [\"name\", \"height\"],\n",
    "}\n",
    "\n",
    "# Input \n",
    "inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\"\"\"\n",
    "\n",
    "# Run chain\n",
    "llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\")\n",
    "chain = create_extraction_chain(schema, llm)\n",
    "chain.run(inp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f7eb826",
   "metadata": {},
   "source": [
    "## Option 1: OpenAI funtions\n",
    "\n",
    "### Looking under the hood\n",
    "\n",
    "Let's dig into what is happening when we call `create_extraction_chain`.\n",
    "\n",
    "The [LangSmith trace](https://smith.langchain.com/public/72bc3205-7743-4ca6-929a-966a9d4c2a77/r) shows that we call the function `information_extraction` on the input string, `inp`.\n",
    "\n",
    "![Image description](/img/extraction_trace_function.png)\n",
    "\n",
    "This `information_extraction` function is defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/openai_functions/extraction.py) and returns a dict.\n",
    "\n",
    "We can see the `dict` in the model output:\n",
    "```\n",
    " {\n",
    "      \"info\": [\n",
    "        {\n",
    "          \"name\": \"Alex\",\n",
    "          \"height\": 5,\n",
    "          \"hair_color\": \"blonde\"\n",
    "        },\n",
    "        {\n",
    "          \"name\": \"Claudia\",\n",
    "          \"height\": 6,\n",
    "          \"hair_color\": \"brunette\"\n",
    "        }\n",
    "      ]\n",
    "    }\n",
    "```\n",
    "\n",
    "The `create_extraction_chain` then parses the raw LLM output for us using [`JsonKeyOutputFunctionsParser`](https://github.com/langchain-ai/langchain/blob/f81e613086d211327b67b0fb591fd4d5f9a85860/libs/langchain/langchain/chains/openai_functions/extraction.py#L62).\n",
    "\n",
    "This results in the list of JSON objects returned by the chain above:\n",
    "```\n",
    "[{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},\n",
    " {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]\n",
    " ```"
   ]
  },
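  {
   "cell_type": "markdown",
   "id": "5c1d7e2a",
   "metadata": {},
   "source": [
    "To make these two steps concrete, here is a minimal standard-library sketch. It assumes that the chain wraps the user schema in an `info` array inside a function definition named `information_extraction`, as the trace suggests, and that the parsing step amounts to decoding the JSON arguments and returning the value under one key. The helpers `build_function_spec` and `parse_key` are hypothetical stand-ins, not LangChain's implementation, and the arguments string is made up to mirror the output above.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def build_function_spec(entity_schema):\n",
    "    # Wrap the user schema in an OpenAI-style function definition (illustrative only).\n",
    "    return {\n",
    "        'name': 'information_extraction',\n",
    "        'parameters': {\n",
    "            'type': 'object',\n",
    "            'properties': {\n",
    "                'info': {'type': 'array', 'items': {'type': 'object', **entity_schema}},\n",
    "            },\n",
    "            'required': ['info'],\n",
    "        },\n",
    "    }\n",
    "\n",
    "\n",
    "def parse_key(arguments_json, key='info'):\n",
    "    # Decode the function-call arguments and return the value stored under `key`.\n",
    "    return json.loads(arguments_json)[key]\n",
    "\n",
    "\n",
    "schema = {\n",
    "    'properties': {'name': {'type': 'string'}, 'height': {'type': 'integer'}},\n",
    "    'required': ['name', 'height'],\n",
    "}\n",
    "spec = build_function_spec(schema)\n",
    "\n",
    "# Made-up arguments string, shaped like the model output shown above.\n",
    "arguments = json.dumps({'info': [{'name': 'Alex', 'height': 5}, {'name': 'Claudia', 'height': 6}]})\n",
    "rows = parse_key(arguments)\n",
    "print(rows)\n",
    "```\n",
    "\n",
    "Keeping the rows under a single `info` key is what lets one function call return any number of entities."
   ]
  },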
  {
   "cell_type": "markdown",
   "id": "dcb03138",
   "metadata": {},
   "source": [
    "### Multiple entity types\n",
    "\n",
    "We can extend this further.\n",
    "\n",
    "Let's say we want to differentiate between dogs and people.\n",
    "\n",
    "We can add `person_` and `dog_` prefixes for each property"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "01eae733",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'person_name': 'Alex',\n",
       "  'person_height': 5,\n",
       "  'person_hair_color': 'blonde',\n",
       "  'dog_name': 'Frosty',\n",
       "  'dog_breed': 'labrador'},\n",
       " {'person_name': 'Claudia',\n",
       "  'person_height': 6,\n",
       "  'person_hair_color': 'brunette'}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"person_name\": {\"type\": \"string\"},\n",
    "        \"person_height\": {\"type\": \"integer\"},\n",
    "        \"person_hair_color\": {\"type\": \"string\"},\n",
    "        \"dog_name\": {\"type\": \"string\"},\n",
    "        \"dog_breed\": {\"type\": \"string\"},\n",
    "    },\n",
    "    \"required\": [\"person_name\", \"person_height\"],\n",
    "}\n",
    "\n",
    "chain = create_extraction_chain(schema, llm)\n",
    "\n",
    "inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
    "Alex's dog Frosty is a labrador and likes to play hide and seek.\"\"\"\n",
    "\n",
    "chain.run(inp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f205905c",
   "metadata": {},
   "source": [
    "### Unrelated entities\n",
    "\n",
    "If we use `required: []`, we allow the model to return **only** person attributes or **only** dog attributes for a single entity (person or dog)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6ff4ac7e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},\n",
       " {'person_name': 'Claudia',\n",
       "  'person_height': 6,\n",
       "  'person_hair_color': 'brunette'},\n",
       " {'dog_name': 'Willow', 'dog_breed': 'German Shepherd'},\n",
       " {'dog_name': 'Milo', 'dog_breed': 'border collie'}]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"person_name\": {\"type\": \"string\"},\n",
    "        \"person_height\": {\"type\": \"integer\"},\n",
    "        \"person_hair_color\": {\"type\": \"string\"},\n",
    "        \"dog_name\": {\"type\": \"string\"},\n",
    "        \"dog_breed\": {\"type\": \"string\"},\n",
    "    },\n",
    "    \"required\": [],\n",
    "}\n",
    "\n",
    "chain = create_extraction_chain(schema, llm)\n",
    "\n",
    "inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
    "Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.\"\"\"\n",
    "\n",
    "chain.run(inp)"
   ]
  },
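  {
   "cell_type": "markdown",
   "id": "9b4f60ac",
   "metadata": {},
   "source": [
    "The effect of `required` can be sketched without calling a model. The check below uses a hypothetical helper, `missing_required`, that is not part of LangChain and only looks at key presence: for each extracted row it lists the required keys that are absent. The two sample rows are copied from the output above.\n",
    "\n",
    "```python\n",
    "def missing_required(rows, required):\n",
    "    # For each row, list the required keys that the row does not contain.\n",
    "    return [[key for key in required if key not in row] for row in rows]\n",
    "\n",
    "\n",
    "rows = [\n",
    "    {'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},\n",
    "    {'dog_name': 'Willow', 'dog_breed': 'German Shepherd'},\n",
    "]\n",
    "\n",
    "# With an empty `required` list, no row is missing anything.\n",
    "relaxed = missing_required(rows, [])\n",
    "# Requiring the person fields flags the dog-only row.\n",
    "strict = missing_required(rows, ['person_name', 'person_height'])\n",
    "print(relaxed)\n",
    "print(strict)\n",
    "```\n",
    "\n",
    "Under the stricter list, the dog-only row would not satisfy the schema, which is why an empty `required` list is used when the entities are unrelated."
   ]
  },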
  {
   "cell_type": "markdown",
   "id": "34f3b958",
   "metadata": {},
   "source": [
    "### Extra information\n",
    "\n",
    "The power of functions (relative to using parsers alone) lies in the ability to perform sematic extraction.\n",
    "\n",
    "In particular, `we can ask for things that are not explictly enumerated in the schema`.\n",
    "\n",
    "Suppose we want unspecified additional information about dogs. \n",
    "\n",
    "We can use add a placeholder for unstructured extraction, `dog_extra_info`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "40c7b26f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},\n",
       " {'person_name': 'Claudia',\n",
       "  'person_height': 6,\n",
       "  'person_hair_color': 'brunette'},\n",
       " {'dog_name': 'Willow',\n",
       "  'dog_breed': 'German Shepherd',\n",
       "  'dog_extra_info': 'likes to play with other dogs'},\n",
       " {'dog_name': 'Milo',\n",
       "  'dog_breed': 'border collie',\n",
       "  'dog_extra_info': 'lives close by'}]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"person_name\": {\"type\": \"string\"},\n",
    "        \"person_height\": {\"type\": \"integer\"},\n",
    "        \"person_hair_color\": {\"type\": \"string\"},\n",
    "        \"dog_name\": {\"type\": \"string\"},\n",
    "        \"dog_breed\": {\"type\": \"string\"},\n",
    "        \"dog_extra_info\": {\"type\": \"string\"},\n",
    "    },\n",
    "}\n",
    "\n",
    "chain = create_extraction_chain(schema, llm)\n",
    "chain.run(inp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a949c60",
   "metadata": {},
   "source": [
    "This gives us additional information about the dogs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf71ddce",
   "metadata": {},
   "source": [
    "### Pydantic \n",
    "\n",
    "Pydantic is a data validation and settings management library for Python. \n",
    "\n",
    "It allows you to create data classes with attributes that are automatically validated when you instantiate an object.\n",
    "\n",
    "Lets define a class with attributes annotated with types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d36a743b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Properties(person_name='Alex', person_height=5, person_hair_color='blonde', dog_breed=None, dog_name=None),\n",
       " Properties(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Optional, List\n",
    "from pydantic import BaseModel, Field\n",
    "from langchain.chains import create_extraction_chain_pydantic\n",
    "\n",
    "# Pydantic data class\n",
    "class Properties(BaseModel):\n",
    "    person_name: str\n",
    "    person_height: int\n",
    "    person_hair_color: str\n",
    "    dog_breed: Optional[str]\n",
    "    dog_name: Optional[str]\n",
    "        \n",
    "# Extraction\n",
    "chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm)\n",
    "\n",
    "# Run \n",
    "inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\"\"\"\n",
    "chain.run(inp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07a0351a",
   "metadata": {},
   "source": [
    "As we can see from the [trace](https://smith.langchain.com/public/fed50ae6-26bb-4235-a254-e0b7a229d10f/r), we use the function `information_extraction`, as above, with the Pydantic schema. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbd9f121",
   "metadata": {},
   "source": [
    "## Option 2: Parsing\n",
    "\n",
    "[Output parsers](/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. \n",
    "\n",
    "As shown above, they are used to parse the output of the OpenAI function calls in `create_extraction_chain`.\n",
    "\n",
    "But, they can be used independent of functions.\n",
    "\n",
    "### Pydantic\n",
    "\n",
    "Just as a above, let's parse a generation based on a Pydantic data class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "64650362",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blonde', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Sequence\n",
    "from langchain.prompts import (\n",
    "    PromptTemplate,\n",
    "    ChatPromptTemplate,\n",
    "    HumanMessagePromptTemplate,\n",
    ")\n",
    "from langchain.llms import OpenAI\n",
    "from pydantic import BaseModel, Field, validator\n",
    "from langchain.output_parsers import PydanticOutputParser\n",
    "\n",
    "class Person(BaseModel):\n",
    "    person_name: str\n",
    "    person_height: int\n",
    "    person_hair_color: str\n",
    "    dog_breed: Optional[str]\n",
    "    dog_name: Optional[str]\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "    people: Sequence[Person]\n",
    "\n",
    "        \n",
    "# Run \n",
    "query = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\"\"\"\n",
    "\n",
    "# Set up a parser + inject instructions into the prompt template.\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = PromptTemplate(\n",
    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
    "    input_variables=[\"query\"],\n",
    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
    ")\n",
    "\n",
    "# Run\n",
    "_input = prompt.format_prompt(query=query)\n",
    "model = OpenAI(temperature=0)\n",
    "output = model(_input.to_string())\n",
    "parser.parse(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "826899df",
   "metadata": {},
   "source": [
    "We can see from the [LangSmith trace](https://smith.langchain.com/public/8e3aa858-467e-46a5-aa49-5db65f0a2b9a/r) that we get the same output as above.\n",
    "\n",
    "![Image description](/img/extraction_trace_function_2.png)\n",
    "\n",
    "We can see that we provide a two-shot prompt in order to instruct the LLM to output in our desired format.\n",
    "\n",
    "And, we need to do a bit more work:\n",
    "\n",
    "* Define a class that holds multiple instances of `Person`\n",
    "* Explicty parse the output of the LLM to the Pydantic class\n",
    "\n",
    "We can see this for other cases, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "837c350e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.prompts import (\n",
    "    PromptTemplate,\n",
    "    ChatPromptTemplate,\n",
    "    HumanMessagePromptTemplate,\n",
    ")\n",
    "from langchain.llms import OpenAI\n",
    "from pydantic import BaseModel, Field, validator\n",
    "from langchain.output_parsers import PydanticOutputParser\n",
    "\n",
    "# Define your desired data structure.\n",
    "class Joke(BaseModel):\n",
    "    setup: str = Field(description=\"question to set up a joke\")\n",
    "    punchline: str = Field(description=\"answer to resolve the joke\")\n",
    "\n",
    "    # You can add custom validation logic easily with Pydantic.\n",
    "    @validator(\"setup\")\n",
    "    def question_ends_with_question_mark(cls, field):\n",
    "        if field[-1] != \"?\":\n",
    "            raise ValueError(\"Badly formed question!\")\n",
    "        return field\n",
    "\n",
    "# And a query intented to prompt a language model to populate the data structure.\n",
    "joke_query = \"Tell me a joke.\"\n",
    "\n",
    "# Set up a parser + inject instructions into the prompt template.\n",
    "parser = PydanticOutputParser(pydantic_object=Joke)\n",
    "\n",
    "# Prompt\n",
    "prompt = PromptTemplate(\n",
    "    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
    "    input_variables=[\"query\"],\n",
    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
    ")\n",
    "\n",
    "# Run\n",
    "_input = prompt.format_prompt(query=joke_query)\n",
    "model = OpenAI(temperature=0)\n",
    "output = model(_input.to_string())\n",
    "parser.parse(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3601bde",
   "metadata": {},
   "source": [
    "As we can see, we get an output of the `Joke` class, which respects our originally desired schema: 'setup' and 'punchline'.\n",
    "\n",
    "We can look at the [LangSmith trace](https://smith.langchain.com/public/69f11d41-41be-4319-93b0-6d0eda66e969/r) to see exactly what is going on under the hood.\n",
    "\n",
    "![Image description](/img/extraction_trace_joke.png)\n",
    "\n",
    "### Going deeper\n",
    "\n",
    "* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetimne, enum, etc).  \n",
    "* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n",
    "* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}