"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/extraction.ipynb)\n",
"Getting structured output from raw LLM generations is hard.\n",
"\n",
"For example, suppose you need the model output formatted with a specific schema for:\n",
"\n",
"- Extracting a structured row to insert into a database \n",
"- Extracting API parameters\n",
"- Extracting different parts of a user query (e.g., for semantic vs keyword search)\n"
]
},
{
"cell_type": "markdown",
"id": "178dbc59",
"metadata": {},
"source": [
"![Image description](/img/extraction.png)"
]
},
{
"cell_type": "markdown",
"id": "97f474d4",
"metadata": {},
"source": [
"## Overview \n",
"\n",
"There are two primary approaches for this:\n",
"\n",
"- `Functions`: Some LLMs can call [functions](https://openai.com/blog/function-calling-and-other-api-updates) to extract arbitrary entities from LLM responses.\n",
"\n",
"- `Parsing`: [Output parsers](/docs/modules/model_io/output_parsers/) are classes that structure LLM responses. \n",
"\n",
"Only some LLMs support functions (e.g., OpenAI), and they are more general than parsers. \n",
"\n",
"Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).\n",
"\n",
"Functions can infer things beyond of a provided schema (e.g., attributes about a person that you did not ask for)."
"Let's dig into what is happening when we call `create_extraction_chain`.\n",
"\n",
"The [LangSmith trace](https://smith.langchain.com/public/72bc3205-7743-4ca6-929a-966a9d4c2a77/r) shows that we call the function `information_extraction` on the input string, `inp`.\n",
"This `information_extraction` function is defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/openai_functions/extraction.py) and returns a dict.\n",
"\n",
"We can see the `dict` in the model output:\n",
"```\n",
" {\n",
" \"info\": [\n",
" {\n",
" \"name\": \"Alex\",\n",
" \"height\": 5,\n",
" \"hair_color\": \"blonde\"\n",
" },\n",
" {\n",
" \"name\": \"Claudia\",\n",
" \"height\": 6,\n",
" \"hair_color\": \"brunette\"\n",
" }\n",
" ]\n",
" }\n",
"```\n",
"\n",
"The `create_extraction_chain` then parses the raw LLM output for us using [`JsonKeyOutputFunctionsParser`](https://github.com/langchain-ai/langchain/blob/f81e613086d211327b67b0fb591fd4d5f9a85860/libs/langchain/langchain/chains/openai_functions/extraction.py#L62).\n",
"\n",
"This results in the list of JSON objects returned by the chain above:\n",
"inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
"Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.\"\"\"\n",
"\n",
"chain.run(inp)"
]
},
{
"cell_type": "markdown",
"id": "34f3b958",
"metadata": {},
"source": [
"### Extra information\n",
"\n",
"The power of functions (relative to using parsers alone) lies in the ability to perform sematic extraction.\n",
"\n",
"In particular, `we can ask for things that are not explictly enumerated in the schema`.\n",
"\n",
"Suppose we want unspecified additional information about dogs. \n",
"\n",
"We can use add a placeholder for unstructured extraction, `dog_extra_info`."
"inp = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\"\"\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "markdown",
"id": "07a0351a",
"metadata": {},
"source": [
"As we can see from the [trace](https://smith.langchain.com/public/fed50ae6-26bb-4235-a254-e0b7a229d10f/r), we use the function `information_extraction`, as above, with the Pydantic schema. "
]
},
{
"cell_type": "markdown",
"id": "cbd9f121",
"metadata": {},
"source": [
"## Option 2: Parsing\n",
"\n",
"[Output parsers](/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. \n",
"\n",
"As shown above, they are used to parse the output of the OpenAI function calls in `create_extraction_chain`.\n",
"\n",
"But, they can be used independent of functions.\n",
"\n",
"### Pydantic\n",
"\n",
"Just as a above, let's parse a generation based on a Pydantic data class."
"We can see from the [LangSmith trace](https://smith.langchain.com/public/8e3aa858-467e-46a5-aa49-5db65f0a2b9a/r) that we get the same output as above.\n",
"As we can see, we get an output of the `Joke` class, which respects our originally desired schema: 'setup' and 'punchline'.\n",
"\n",
"We can look at the [LangSmith trace](https://smith.langchain.com/public/69f11d41-41be-4319-93b0-6d0eda66e969/r) to see exactly what is going on under the hood.\n",
"* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetimne, enum, etc). \n",
"* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n",
"* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM."