Docs: Revamp Extraction Use Case (#18588)

Revamp the extraction use case documentation

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

@ -1,831 +0,0 @@
{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Extraction\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "b84edb4e",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/extraction.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"LLMs can be used to generate text that is structured according to a specific schema. This can be useful in a number of scenarios, including:\n",
"\n",
"- Extracting a structured row to insert into a database \n",
"- Extracting API parameters\n",
"- Extracting different parts of a user query (e.g., for semantic vs keyword search)\n"
]
},
{
"cell_type": "markdown",
"id": "178dbc59",
"metadata": {},
"source": [
"![Image description](../../static/img/extraction.png)"
]
},
{
"cell_type": "markdown",
"id": "97f474d4",
"metadata": {},
"source": [
"## Overview \n",
"\n",
"There are two broad approaches for this:\n",
"\n",
"- `Tools and JSON mode`: Some LLMs specifically support structured output generation in certain contexts. Examples include OpenAI's [function and tool calling](https://platform.openai.com/docs/guides/function-calling) or [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode).\n",
"\n",
"- `Parsing`: LLMs can often be instructed to output their response in a dseired format. [Output parsers](/docs/modules/model_io/output_parsers/) will parse text generations into a structured form.\n",
"\n",
"Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).\n",
"\n",
"Functions and tools can infer things beyond of a provided schema (e.g., attributes about a person that you did not ask for)."
]
},
{
"cell_type": "markdown",
"id": "fbea06b5-66b6-4958-936d-23212061e4c8",
"metadata": {},
"source": [
"## Option 1: Leveraging tools and JSON mode"
]
},
{
"cell_type": "markdown",
"id": "25d89f21",
"metadata": {},
"source": [
"### Quickstart\n",
"\n",
"`create_structured_output_runnable` will create Runnables to support structured data extraction via OpenAI tool use and JSON mode.\n",
"\n",
"The desired output schema can be expressed either via a Pydantic model or a Python dict representing valid [JsonSchema](https://json-schema.org/).\n",
"\n",
"This function supports three modes for structured data extraction:\n",
"- `\"openai-functions\"` will define OpenAI functions and bind them to the given LLM;\n",
"- `\"openai-tools\"` will define OpenAI tools and bind them to the given LLM;\n",
"- `\"openai-json\"` will bind `response_format={\"type\": \"json_object\"}` to the given LLM.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f5ec7a3",
"metadata": {},
"outputs": [],
"source": [
"pip install langchain langchain-openai \n",
"\n",
"# Set env var OPENAI_API_KEY or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4c2bc413-eacd-44bd-9fcb-bbbe1f97ca6c",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain.chains import create_structured_output_runnable\n",
"from langchain_core.pydantic_v1 import BaseModel\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" person_name: str\n",
" person_height: int\n",
" person_hair_color: str\n",
" dog_breed: Optional[str]\n",
" dog_name: Optional[str]\n",
"\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)\n",
"runnable = create_structured_output_runnable(Person, llm)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "de8c9d7b-bb7b-45bc-9794-a355ed0d1508",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Alex is 5 feet tall and has blond hair.\"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "02fd21ff-27a8-4890-bb18-fc852cafb18a",
"metadata": {},
"source": [
"### Specifying schemas"
]
},
{
"cell_type": "markdown",
"id": "a5a74f3e-92aa-4ac7-96f2-ea89b8740ba8",
"metadata": {},
"source": [
"A convenient way to express desired output schemas is via Pydantic. The above example specified the desired output schema via `Person`, a Pydantic model. Such schemas can be easily combined together to generate richer output formats:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c1c8fe71-0ae4-466a-b32f-001c59b62bb3",
"metadata": {},
"outputs": [],
"source": [
"from typing import Sequence\n",
"\n",
"\n",
"class People(BaseModel):\n",
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
"\n",
" people: Sequence[Person]\n",
"\n",
"\n",
"runnable = create_structured_output_runnable(People, llm)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c5aa9e43-9202-4b2d-a767-e596296b3a81",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry')])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"\"\"Alex is 5 feet tall and has blond hair.\n",
"Claudia is 1 feet taller Alex and jumps higher than him.\n",
"Claudia is a brunette and has a beagle named Harry.\"\"\"\n",
"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "53e316ea-b74a-4512-a9ab-c5d01ff583fe",
"metadata": {},
"source": [
"Note that `dog_breed` and `dog_name` are optional attributes, such that here they are extracted for Claudia and not for Alex.\n",
"\n",
"One can also specify the desired output format with a Python dict representing valid JsonSchema:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3e017ba0",
"metadata": {},
"outputs": [],
"source": [
"schema = {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"name\": {\"type\": \"string\"},\n",
" \"height\": {\"type\": \"integer\"},\n",
" \"hair_color\": {\"type\": \"string\"},\n",
" },\n",
" \"required\": [\"name\", \"height\"],\n",
"}\n",
"\n",
"runnable = create_structured_output_runnable(schema, llm)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fb525991-643d-4d47-9111-a3d4364c03d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Alex', 'height': 60}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Alex is 5 feet tall. I don't know his hair color.\"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a3d3f0d2-c9d4-4ab8-9a5a-1ddda62db6ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Alex', 'height': 60, 'hair_color': 'blond'}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Alex is 5 feet tall. He is blond.\"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "34f3b958",
"metadata": {},
"source": [
"#### Extra information\n",
"\n",
"Runnables constructed via `create_structured_output_runnable` generally are capable of semantic extraction, such that they can populate information that is not explicitly enumerated in the schema.\n",
"\n",
"Suppose we want unspecified additional information about dogs. \n",
"\n",
"We can use add a placeholder for unstructured extraction, `dog_extra_info`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0ed3b5e6-a7f3-453e-be61-d94fc665c16b",
"metadata": {},
"outputs": [],
"source": [
"inp = \"\"\"Alex is 5 feet tall and has blond hair.\n",
"Claudia is 1 feet taller Alex and jumps higher than him.\n",
"Claudia is a brunette and has a beagle named Harry.\n",
"Harry likes to play with other dogs and can always be found\n",
"playing with Milo, a border collie that lives close by.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "be07928a-8022-4963-a15e-eb3097beef9f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"People(people=[Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None), Person(person_name='Claudia', person_height=72, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry', dog_extra_info='likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.')])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class Person(BaseModel):\n",
" person_name: str\n",
" person_height: int\n",
" person_hair_color: str\n",
" dog_breed: Optional[str]\n",
" dog_name: Optional[str]\n",
" dog_extra_info: Optional[str]\n",
"\n",
"\n",
"class People(BaseModel):\n",
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
"\n",
" people: Sequence[Person]\n",
"\n",
"\n",
"runnable = create_structured_output_runnable(People, llm)\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "3a949c60",
"metadata": {},
"source": [
"This gives us additional information about the dogs."
]
},
{
"cell_type": "markdown",
"id": "97ed9f5e-33be-4667-aa82-af49cc874e1d",
"metadata": {},
"source": [
"### Specifying extraction mode\n",
"\n",
"`create_structured_output_runnable` supports varying implementations of the underlying extraction under the hood, which are configured via the `mode` parameter. This parameter can be one of `\"openai-functions\"`, `\"openai-tools\"`, or `\"openai-json\"`."
]
},
{
"cell_type": "markdown",
"id": "7c8e0b00-d6e6-432d-b9b0-8d0a3c0c6572",
"metadata": {},
"source": [
"#### OpenAI Functions and Tools"
]
},
{
"cell_type": "markdown",
"id": "07ccdbb1-cbe5-45af-87e4-dde42baee5eb",
"metadata": {},
"source": [
"Some LLMs are fine-tuned to support the invocation of functions or tools. If they are given an input schema for a tool and recognize an occasion to use it, they may emit JSON output conforming to that schema. We can leverage this to drive structured data extraction from natural language.\n",
"\n",
"OpenAI originally released this via a [`functions` parameter in its chat completions API](https://openai.com/blog/function-calling-and-other-api-updates). This has since been deprecated in favor of a [`tools` parameter](https://platform.openai.com/docs/guides/function-calling), which can include (multiple) functions."
]
},
{
"cell_type": "markdown",
"id": "e6b02442-2884-4b45-a5a0-4fdac729fdb3",
"metadata": {},
"source": [
"Using OpenAI Functions:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7b1c2266-b04b-4a23-83a9-da3cd2f88137",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable = create_structured_output_runnable(Person, llm, mode=\"openai-functions\")\n",
"\n",
"inp = \"Alex is 5 feet tall and has blond hair.\"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "1c07427b-a582-4489-a486-4c24a6c3165f",
"metadata": {},
"source": [
"Using OpenAI Tools:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "0b1ca93a-ffd9-4d37-8baa-377757405357",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='Alex', person_height=152, person_hair_color='blond', dog_breed=None, dog_name=None)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable = create_structured_output_runnable(Person, llm, mode=\"openai-tools\")\n",
"\n",
"runnable.invoke(inp)"
]
},
{
"cell_type": "markdown",
"id": "4018a8fc-1799-4c9d-b655-a66f618204b3",
"metadata": {},
"source": [
"The corresponding [LangSmith trace](https://smith.langchain.com/public/04cc37a7-7a1c-4bae-b972-1cb1a642568c/r) illustrates the tool call that generated our structured output.\n",
"\n",
"![Image description](../../static/img/extraction_trace_tool.png)"
]
},
{
"cell_type": "markdown",
"id": "fb2662d5-9492-4acc-935b-eb8fccebbe0f",
"metadata": {},
"source": [
"#### JSON Mode"
]
},
{
"cell_type": "markdown",
"id": "c0fd98ba-c887-4c30-8c9e-896ae90ac56a",
"metadata": {},
"source": [
"Some LLMs support generating JSON more generally. OpenAI implements this via a [`response_format` parameter](https://platform.openai.com/docs/guides/text-generation/json-mode) in its chat completions API.\n",
"\n",
"Note that this method may require explicit prompting (e.g., OpenAI requires that input messages contain the word \"json\" in some form when using this parameter)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6b3e4679-eadc-42c8-b882-92a600083f2f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"\n",
"system_prompt = \"\"\"You extract information in structured JSON formats.\n",
"\n",
"Extract a valid JSON blob from the user input that matches the following JSON Schema:\n",
"\n",
"{output_schema}\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system_prompt),\n",
" (\"human\", \"{input}\"),\n",
" ]\n",
")\n",
"runnable = create_structured_output_runnable(\n",
" Person,\n",
" llm,\n",
" mode=\"openai-json\",\n",
" prompt=prompt,\n",
" enforce_function_usage=False,\n",
")\n",
"\n",
"runnable.invoke({\"input\": inp})"
]
},
{
"cell_type": "markdown",
"id": "b22d8262-a9b8-415c-a142-d0ee4db7ec2b",
"metadata": {},
"source": [
"### Few-shot examples"
]
},
{
"cell_type": "markdown",
"id": "a01c75f6-99d7-4d7b-a58f-b0ea7e8f338a",
"metadata": {},
"source": [
"Suppose we want to tune the behavior of our extractor. There are a few options available. For example, if we want to redact names but retain other information, we could adjust the system prompt:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "c5d16ad6-824e-434a-906a-d94e78259d4f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='REDACTED', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"system_prompt = \"\"\"You extract information in structured JSON formats.\n",
"\n",
"Extract a valid JSON blob from the user input that matches the following JSON Schema:\n",
"\n",
"{output_schema}\n",
"\n",
"Redact all names.\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [(\"system\", system_prompt), (\"human\", \"{input}\")]\n",
")\n",
"runnable = create_structured_output_runnable(\n",
" Person,\n",
" llm,\n",
" mode=\"openai-json\",\n",
" prompt=prompt,\n",
" enforce_function_usage=False,\n",
")\n",
"\n",
"runnable.invoke({\"input\": inp})"
]
},
{
"cell_type": "markdown",
"id": "be611688-1224-4d5a-9e34-a158b3c04296",
"metadata": {},
"source": [
"Few-shot examples are another, effective way to illustrate intended behavior. For instance, if we want to redact names with a specific character string, a one-shot example will convey this. We can use a `FewShotChatMessagePromptTemplate` to easily accommodate both a fixed set of examples as well as the dynamic selection of examples based on the input."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "0aeee951-7f73-4e24-9033-c81a08af14dc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(person_name='#####', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.prompts import FewShotChatMessagePromptTemplate\n",
"\n",
"examples = [\n",
" {\n",
" \"input\": \"Samus is 6 ft tall and blonde.\",\n",
" \"output\": Person(\n",
" person_name=\"######\",\n",
" person_height=5,\n",
" person_hair_color=\"blonde\",\n",
" ).dict(),\n",
" }\n",
"]\n",
"\n",
"example_prompt = ChatPromptTemplate.from_messages(\n",
" [(\"human\", \"{input}\"), (\"ai\", \"{output}\")]\n",
")\n",
"few_shot_prompt = FewShotChatMessagePromptTemplate(\n",
" examples=examples,\n",
" example_prompt=example_prompt,\n",
")\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [(\"system\", system_prompt), few_shot_prompt, (\"human\", \"{input}\")]\n",
")\n",
"runnable = create_structured_output_runnable(\n",
" Person,\n",
" llm,\n",
" mode=\"openai-json\",\n",
" prompt=prompt,\n",
" enforce_function_usage=False,\n",
")\n",
"\n",
"runnable.invoke({\"input\": inp})"
]
},
{
"cell_type": "markdown",
"id": "51846211-e86b-4807-9348-eb263999f7f7",
"metadata": {},
"source": [
"Here, the [LangSmith trace](https://smith.langchain.com/public/6fe5e694-9c04-48f7-83ff-e541da764781/r) for the chat model call shows how the one-shot example is formatted into the prompt.\n",
"\n",
"![Image description](../../static/img/extraction_trace_few_shot.png)"
]
},
{
"cell_type": "markdown",
"id": "cbd9f121",
"metadata": {},
"source": [
"## Option 2: Parsing\n",
"\n",
"[Output parsers](/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. \n",
"\n",
"As shown above, they are used to parse the output of the runnable created by `create_structured_output_runnable`.\n",
"\n",
"They can also be used more generally, if a LLM is instructed to emit its output in a certain format. Parsers include convenience methods for generating formatting instructions for use in prompts.\n",
"\n",
"Below we implement an example."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "64650362",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional, Sequence\n",
"\n",
"from langchain.output_parsers import PydanticOutputParser\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" person_name: str\n",
" person_height: int\n",
" person_hair_color: str\n",
" dog_breed: Optional[str]\n",
" dog_name: Optional[str]\n",
"\n",
"\n",
"class People(BaseModel):\n",
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
"\n",
" people: Sequence[Person]\n",
"\n",
"\n",
"# Run\n",
"query = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blond.\"\"\"\n",
"\n",
"# Set up a parser + inject instructions into the prompt template.\n",
"parser = PydanticOutputParser(pydantic_object=People)\n",
"\n",
"# Prompt\n",
"prompt = PromptTemplate(\n",
" template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
" input_variables=[\"query\"],\n",
" partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"# Run\n",
"_input = prompt.format_prompt(query=query)\n",
"model = ChatOpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "727f3bf2-31b1-4b07-94f5-9568acf3ffdf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = model.invoke(_input.to_string())\n",
"\n",
"parser.parse(output.content)"
]
},
{
"cell_type": "markdown",
"id": "826899df",
"metadata": {},
"source": [
"We can see from the [LangSmith trace](https://smith.langchain.com/public/aec42dd3-d471-4d34-801b-20dd88444931/r) that we get the same output as above.\n",
"\n",
"![Image description](../../static/img/extraction_trace_parsing.png)\n",
"\n",
"We can see that we provide a two-shot prompt in order to instruct the LLM to output in our desired format."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "837c350e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Joke(setup=\"Why couldn't the bicycle find its way home?\", punchline='Because it lost its bearings!')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define your desired data structure.\n",
"class Joke(BaseModel):\n",
" setup: str = Field(description=\"question to set up a joke\")\n",
" punchline: str = Field(description=\"answer to resolve the joke\")\n",
"\n",
" # You can add custom validation logic easily with Pydantic.\n",
" @validator(\"setup\")\n",
" def question_ends_with_question_mark(cls, field):\n",
" if field[-1] != \"?\":\n",
" raise ValueError(\"Badly formed question!\")\n",
" return field\n",
"\n",
"\n",
"# And a query intended to prompt a language model to populate the data structure.\n",
"joke_query = \"Tell me a joke.\"\n",
"\n",
"# Set up a parser + inject instructions into the prompt template.\n",
"parser = PydanticOutputParser(pydantic_object=Joke)\n",
"\n",
"# Prompt\n",
"prompt = PromptTemplate(\n",
" template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
" input_variables=[\"query\"],\n",
" partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"# Run\n",
"_input = prompt.format_prompt(query=joke_query)\n",
"model = ChatOpenAI(temperature=0)\n",
"output = model.invoke(_input.to_string())\n",
"parser.parse(output.content)"
]
},
{
"cell_type": "markdown",
"id": "d3601bde",
"metadata": {},
"source": [
"As we can see, we get an output of the `Joke` class, which respects our originally desired schema: 'setup' and 'punchline'.\n",
"\n",
"We can look at the [LangSmith trace](https://smith.langchain.com/public/557ad630-af35-43e9-b043-93800539025f/r) to see exactly what is going on under the hood.\n",
"\n",
"### Going deeper\n",
"\n",
"* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetime, enum, etc). \n",
"* The experimental [Anthropic function calling](https://python.langchain.com/docs/integrations/chat/anthropic_functions) support provides similar functionality to Anthropic chat models.\n",
"* [LlamaCPP](https://python.langchain.com/docs/integrations/llms/llamacpp#grammars) natively supports constrained decoding using custom grammars, making it easy to output structured content using local LLMs \n",
"* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n",
"* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM."
]
},
{
"cell_type": "markdown",
"id": "aab95ecf",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,68 @@
{
"cells": [
{
"cell_type": "raw",
"id": "913dd5a2-24d1-4f8e-bc15-ab518483eef9",
"metadata": {},
"source": [
"---\n",
"title: Guidelines\n",
"sidebar_position: 5\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e",
"metadata": {},
"source": [
"The quality of extraction results depends on many factors. \n",
"\n",
"Here is a set of guidelines to help you squeeze out the best performance from your models:\n",
"\n",
"* Set the model temperature to `0`.\n",
"* Improve the prompt. The prompt should be precise and to the point.\n",
"* Document the schema: Make sure the schema is documented to provide more information to the LLM.\n",
"* Provide reference examples! Diverse examples can help, including examples where nothing should be extracted.\n",
"* If you have a lot of examples, use a retriever to retrieve the most relevant examples.\n",
"* Benchmark with the best available LLM/Chat Model (e.g., gpt-4, claude-3, etc) -- check with the model provider which one is the latest and greatest!\n",
"* If the schema is very large, try breaking it into multiple smaller schemas, run separate extractions and merge the results.\n",
"* Make sure that the schema allows the model to REJECT extracting information. If it doesn't, the model will be forced to make up information!\n",
"* Add verification/correction steps (ask an LLM to correct or verify the results of the extraction).\n",
"\n",
"## Benchmark\n",
"\n",
"* Create and benchmark data for your use case using [LangSmith 🦜️🛠️](https://docs.smith.langchain.com/).\n",
"* Is your LLM good enough? Use [langchain-benchmarks 🦜💯 ](https://github.com/langchain-ai/langchain-benchmarks) to test out your LLM using existing datasets.\n",
"\n",
"## Keep in mind! 😶‍🌫️\n",
"\n",
"* LLMs are great, but are not required for all cases! If youre extracting information from a single structured source (e.g., linkedin), using an LLM is not a good idea traditional web-scraping will be much cheaper and reliable.\n",
"\n",
"* **human in the loop** If you need **perfect quality**, you'll likely need to plan on having a human in the loop -- even the best LLMs will make mistakes when dealing with complex extraction tasks."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,2 @@
label: 'How-To Guides'
position: 1

@ -0,0 +1,464 @@
{
"cells": [
{
"cell_type": "raw",
"id": "a37d08e8-8d6d-4cf2-8215-2aafb6877fb5",
"metadata": {},
"source": [
"---\n",
"title: Use Reference Examples\n",
"sidebar_position: 1\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "70403d4f-50c1-43f8-a7ea-a211167649a5",
"metadata": {},
"source": [
"The quality of extractions can often be improved by providing reference examples to the LLM.\n",
"\n",
":::{.callout-tip}\n",
"While this tutorial focuses how to use examples with a tool calling model, this technique is generally applicable, and will work\n",
"also with JSON more or prompt based techniques.\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "89579144-bcb3-490a-8036-86a0a6bcd56b",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert extraction algorithm. \"\n",
" \"Only extract relevant information from the text. \"\n",
" \"If you do not know the value of an attribute asked \"\n",
" \"to extract, return null for the attribute's value.\",\n",
" ),\n",
" # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓\n",
" MessagesPlaceholder(\"examples\"), # <-- EXAMPLES!\n",
" # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2484008c-ba1a-42a5-87a1-628a900de7fd",
"metadata": {},
"source": [
"Test out the template:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "610c3025-ea63-4cd7-88bd-c8cbcb4d8a3f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ChatPromptValue(messages=[SystemMessage(content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"), HumanMessage(content='testing 1 2 3'), HumanMessage(content='this is some text')])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.messages import (\n",
" HumanMessage,\n",
")\n",
"\n",
"prompt.invoke(\n",
" {\"text\": \"this is some text\", \"examples\": [HumanMessage(content=\"testing 1 2 3\")]}\n",
")"
]
},
{
"cell_type": "markdown",
"id": "368abd80-0cf0-41a7-8224-acf90dd6830d",
"metadata": {},
"source": [
"## Define the schema\n",
"\n",
"Let's re-use the person schema from the quickstart."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d875a49a-d2cb-4b9e-b5bf-41073bc3905c",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(..., description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" ..., description=\"The color of the peron's eyes if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(..., description=\"Height in METERs\")\n",
"\n",
"\n",
"class Data(BaseModel):\n",
" \"\"\"Extracted data about people.\"\"\"\n",
"\n",
" # Creates a model so that we can extract multiple entities.\n",
" people: List[Person]"
]
},
{
"cell_type": "markdown",
"id": "96c42162-e4f6-4461-88fd-c76f5aab7e32",
"metadata": {},
"source": [
"## Define reference examples\n",
"\n",
"Examples can be defined as a list of input-output pairs. \n",
"\n",
"Each example contains an example `input` text and an example `output` showing what should be extracted from the text.\n",
"\n",
":::{.callout-important}\n",
"This is a bit in the weeds, so feel free to ignore if you don't get it!\n",
"\n",
"The format of the example needs to match the API used (e.g., tool calling or JSON mode etc.).\n",
"\n",
"Here, the formatted examples will match the format expected for the tool calling API since that's what we're using.\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "08356810-77ce-4e68-99d9-faa0326f2cee",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from typing import Dict, List, TypedDict\n",
"\n",
"from langchain_core.messages import (\n",
" AIMessage,\n",
" BaseMessage,\n",
" HumanMessage,\n",
" SystemMessage,\n",
" ToolMessage,\n",
")\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Example(TypedDict):\n",
" \"\"\"A representation of an example consisting of text input and expected tool calls.\n",
"\n",
" For extraction, the tool calls are represented as instances of pydantic model.\n",
" \"\"\"\n",
"\n",
" input: str # This is the example text\n",
" tool_calls: List[BaseModel] # Instances of pydantic model that should be extracted\n",
"\n",
"\n",
"def tool_example_to_messages(example: Example) -> List[BaseMessage]:\n",
" \"\"\"Convert an example into a list of messages that can be fed into an LLM.\n",
"\n",
" This code is an adapter that converts our example to a list of messages\n",
" that can be fed into a chat model.\n",
"\n",
" The list of messages per example corresponds to:\n",
"\n",
" 1) HumanMessage: contains the content from which content should be extracted.\n",
" 2) AIMessage: contains the extracted information from the model\n",
" 3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.\n",
"\n",
" The ToolMessage is required because some of the chat models are hyper-optimized for agents\n",
" rather than for an extraction use case.\n",
" \"\"\"\n",
" messages: List[BaseMessage] = [HumanMessage(content=example[\"input\"])]\n",
" openai_tool_calls = []\n",
" for tool_call in example[\"tool_calls\"]:\n",
" openai_tool_calls.append(\n",
" {\n",
" \"id\": str(uuid.uuid4()),\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" # The name of the function right now corresponds\n",
" # to the name of the pydantic model\n",
" # This is implicit in the API right now,\n",
" # and will be improved over time.\n",
" \"name\": tool_call.__class__.__name__,\n",
" \"arguments\": tool_call.json(),\n",
" },\n",
" }\n",
" )\n",
" messages.append(\n",
" AIMessage(content=\"\", additional_kwargs={\"tool_calls\": openai_tool_calls})\n",
" )\n",
" tool_outputs = example.get(\"tool_outputs\") or [\n",
" \"You have correctly called this tool.\"\n",
" ] * len(openai_tool_calls)\n",
" for output, tool_call in zip(tool_outputs, openai_tool_calls):\n",
" messages.append(ToolMessage(content=output, tool_call_id=tool_call[\"id\"]))\n",
" return messages"
]
},
{
"cell_type": "markdown",
"id": "463aa282-51c4-42bf-9463-6ca3b2c08de6",
"metadata": {},
"source": [
"Next let's define our examples and then convert them into message format."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7f59a745-5c81-4011-a4c5-a33ec1eca7ef",
"metadata": {},
"outputs": [],
"source": [
"examples = [\n",
" (\n",
" \"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\",\n",
" Person(name=None, height_in_meters=None, hair_color=None),\n",
" ),\n",
" (\n",
" \"Fiona traveled far from France to Spain.\",\n",
" Person(name=\"Fiona\", height_in_meters=None, hair_color=None),\n",
" ),\n",
"]\n",
"\n",
"\n",
"messages = []\n",
"\n",
"for text, tool_call in examples:\n",
" messages.extend(\n",
" tool_example_to_messages({\"input\": text, \"tool_calls\": [tool_call]})\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "6fdbda30-e7e3-46b5-a54a-1769c580af93",
"metadata": {},
"source": [
"Let's test out the prompt"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e61fa3a5-3d15-46a2-a23b-788f9a3ede52",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ChatPromptValue(messages=[SystemMessage(content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"), HumanMessage(content=\"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\"), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'c75e57cc-8212-4959-81e9-9477b0b79126', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{\"name\": null, \"hair_color\": null, \"height_in_meters\": null}'}}]}), ToolMessage(content='You have correctly called this tool.', tool_call_id='c75e57cc-8212-4959-81e9-9477b0b79126'), HumanMessage(content='Fiona traveled far from France to Spain.'), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '69da50b5-e427-44be-b396-1e56d821c6b0', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{\"name\": \"Fiona\", \"hair_color\": null, \"height_in_meters\": null}'}}]}), ToolMessage(content='You have correctly called this tool.', tool_call_id='69da50b5-e427-44be-b396-1e56d821c6b0'), HumanMessage(content='this is some text')])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt.invoke({\"text\": \"this is some text\", \"examples\": messages})"
]
},
{
"cell_type": "markdown",
"id": "47b0bbef-bc6b-4535-a8e2-5c84f09d5637",
"metadata": {},
"source": [
"## Create an extractor\n",
"Here, we'll create an extractor using **gpt-4**."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dbfea43d-769b-42e9-a76f-ce722f7d6f93",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/harrisonchase/workplace/langchain/libs/core/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.\n",
" warn_beta(\n"
]
}
],
"source": [
"# We will be using tool calling mode, which\n",
"# requires a tool calling capable model.\n",
"llm = ChatOpenAI(\n",
" # Consider benchmarking with a good model to get\n",
" # a sense of the best possible quality.\n",
" model=\"gpt-4-0125-preview\",\n",
" # Remember to set the temperature to 0 for extractions!\n",
" temperature=0,\n",
")\n",
"\n",
"\n",
"runnable = prompt | llm.with_structured_output(\n",
" schema=Data,\n",
" method=\"function_calling\",\n",
" include_raw=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "58a8139e-f201-4b8e-baf0-16a83e5fa987",
"metadata": {},
"source": [
"## Without examples 😿\n",
"\n",
"Notice that even though we're using gpt-4, it's failing with a **very simple** test case!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8b1d6273-5ec5-4970-af8a-0da1f1efa293",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people=[]\n",
"people=[Person(name='earth', hair_color=None, height_in_meters=None)]\n",
"people=[Person(name='earth', hair_color=None, height_in_meters=None)]\n",
"people=[]\n",
"people=[]\n"
]
}
],
"source": [
"for _ in range(5):\n",
" text = \"The solar system is large, but earth has only 1 moon.\"\n",
" print(runnable.invoke({\"text\": text, \"examples\": []}))"
]
},
{
"cell_type": "markdown",
"id": "09840f17-ab26-4ea2-8a39-c747103804ec",
"metadata": {},
"source": [
"## With examples 😻\n",
"\n",
"Reference examples helps to fix the failure!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "9bdfa49e-0005-4c06-9598-2adfd882b014",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people=[]\n",
"people=[]\n",
"people=[]\n",
"people=[]\n",
"people=[]\n"
]
}
],
"source": [
"for _ in range(5):\n",
" text = \"The solar system is large, but earth has only 1 moon.\"\n",
" print(runnable.invoke({\"text\": text, \"examples\": messages}))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "84413e17-608d-4f85-b70e-00b89b271927",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable.invoke(\n",
" {\n",
" \"text\": \"My name is Harrison. My hair is black.\",\n",
" \"examples\": messages,\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d18bb013",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "raw",
"id": "8371e5d6-eb65-4c97-aac2-05037356c2c1",
"metadata": {},
"source": [
"---\n",
"title: Handle Files\n",
"sidebar_position: 3\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0d5eea7c-bc69-4da2-b91d-d7c71f7085d0",
"metadata": {},
"source": [
"Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.\n",
"\n",
"You can use LangChain [document loaders](/modules/data_connection/document_loaders/) to parse files into a text format that can be fed into LLMs.\n",
"\n",
"LangChain features a large number of [document loader integrations](/docs/integrations/document_loaders).\n",
"\n",
"## MIME type based parsing\n",
"\n",
"For basic parsing exmaples take a look [at document loaders](/modules/data_connection/document_loaders/).\n",
"\n",
"Here, we'll be looking at mime-type based parsing which is often useful for extraction based applications if you're writing server code that accepts user uploaded files.\n",
"\n",
"In this case, it's best to assume that the file extension of the file provided by the user is wrong and instead infer the mimetype from the binary content of the file.\n",
"\n",
"Let's download some content. This will be an HTML file, but the code below will work with other file types."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "76d42bb2-090b-4a70-a656-d6e9af769eba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'<!DOCTYPE html>\\n<htm'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"\n",
"response = requests.get(\"https://en.wikipedia.org/wiki/Car\")\n",
"data = response.content\n",
"data[:20]"
]
},
{
"cell_type": "markdown",
"id": "389400a2-6f05-48da-810e-9438d626e64b",
"metadata": {},
"source": [
"Configure the parsers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "430806a7-7d8a-4111-9f5d-46840dab0dc0",
"metadata": {},
"outputs": [],
"source": [
"import magic\n",
"from langchain.document_loaders.parsers import BS4HTMLParser, PDFMinerParser\n",
"from langchain.document_loaders.parsers.generic import MimeTypeBasedParser\n",
"from langchain.document_loaders.parsers.txt import TextParser\n",
"from langchain_community.document_loaders import Blob\n",
"\n",
"# Configure the parsers that you want to use per mime-type!\n",
"HANDLERS = {\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"text/html\": BS4HTMLParser(),\n",
"}\n",
"\n",
"# Instantiate a mimetype based parser with the given parsers\n",
"MIMETYPE_BASED_PARSER = MimeTypeBasedParser(\n",
" handlers=HANDLERS,\n",
" fallback_parser=None,\n",
")\n",
"\n",
"mime = magic.Magic(mime=True)\n",
"mime_type = mime.from_buffer(data)\n",
"\n",
"# A blob represents binary data by either reference (path on file system)\n",
"# or value (bytes in memory).\n",
"blob = Blob.from_data(\n",
" data=data,\n",
" mime_type=mime_type,\n",
")\n",
"\n",
"parser = HANDLERS[mime_type]\n",
"documents = parser.parse(blob=blob)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fb618df7-d7be-4f34-8939-6f7b10dfc2b6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Car - Wikipedia\n"
]
}
],
"source": [
"print(documents[0].page_content[:30].strip())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,428 @@
{
"cells": [
{
"cell_type": "raw",
"id": "913dd5a2-24d1-4f8e-bc15-ab518483eef9",
"metadata": {},
"source": [
"---\n",
"title: Handle Long Text\n",
"sidebar_position: 2\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e",
"metadata": {},
"source": [
"When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:\n",
"\n",
"1. **Change LLM** Choose a different LLM that supports a larger context window.\n",
"2. **Brute Force** Chunk the document, and extract content from each chunk.\n",
"3. **RAG** Chunk the document, index the chunks, and only extract content from a subset of chunks that look \"relevant\".\n",
"\n",
"Keep in mind that these strategies have different trade off and the best strategy likely depends on the application that you're designing!"
]
},
{
"cell_type": "markdown",
"id": "57969139-ad0a-487e-97d8-cb30e2af9742",
"metadata": {},
"source": [
"## Set up\n",
"\n",
"We need some example data! Let's download an article about [cars from wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain `Document`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "571aad22-2cec-4b9b-b656-5e4b81a1ec6c",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"import requests\n",
"from langchain_community.document_loaders import BSHTMLLoader\n",
"\n",
"# Download the content\n",
"response = requests.get(\"https://en.wikipedia.org/wiki/Car\")\n",
"# Write it to a file\n",
"with open(\"car.html\", \"w\", encoding=\"utf-8\") as f:\n",
" f.write(response.text)\n",
"# Load it with an HTML parser\n",
"loader = BSHTMLLoader(\"car.html\")\n",
"document = loader.load()[0]\n",
"# Clean up code\n",
"# Replace consecutive new lines with a single new line\n",
"document.page_content = re.sub(\"\\n\\n+\", \"\\n\", document.page_content)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "85656454-6d5d-4ff6-93ca-690791ac1ec4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"78967\n"
]
}
],
"source": [
"print(len(document.page_content))"
]
},
{
"cell_type": "markdown",
"id": "af3ffb8d-587a-4370-886a-e56e617bcb9c",
"metadata": {},
"source": [
"## Define the schema\n",
"\n",
"Here, we'll define schema to extract key developments from the text."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3b288ed-87a6-4af0-aac8-20921dc370d4",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.\n",
" warn_beta(\n"
]
}
],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain.chains import create_structured_output_runnable\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"\n",
"class KeyDevelopment(BaseModel):\n",
" \"\"\"Information about a development in the history of cars.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
" # Note that all fields are required rather than optional!\n",
" year: int = Field(\n",
" ..., description=\"The year when there was an important historic development.\"\n",
" )\n",
" description: str = Field(\n",
" ..., description=\"What happened in this year? What was the development?\"\n",
" )\n",
" evidence: str = Field(\n",
" ...,\n",
" description=\"Repeat in verbatim the sentence(s) from which the year and description information were extracted\",\n",
" )\n",
"\n",
"\n",
"class ExtractionData(BaseModel):\n",
" \"\"\"Extracted information about key developments in the history of cars.\"\"\"\n",
"\n",
" key_developments: List[KeyDevelopment]\n",
"\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert at identifying key historic development in text. \"\n",
" \"Only extract important historic developments. Extract nothing if no important information can be found in the text.\",\n",
" ),\n",
" # MessagesPlaceholder('examples'), # Keep on reading through this use case to see how to use examples to improve performance\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")\n",
"\n",
"\n",
"# We will be using tool calling mode, which\n",
"# requires a tool calling capable model.\n",
"llm = ChatOpenAI(\n",
" # Consider benchmarking with a good model to get\n",
" # a sense of the best possible quality.\n",
" model=\"gpt-4-0125-preview\",\n",
" # Remember to set the temperature to 0 for extractions!\n",
" temperature=0,\n",
")\n",
"\n",
"extractor = prompt | llm.with_structured_output(\n",
" schema=ExtractionData,\n",
" method=\"function_calling\",\n",
" include_raw=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "13aebafb-26b5-42b2-ae8e-9c05cd56e5c5",
"metadata": {},
"source": [
"## Brute force approach\n",
"\n",
"Split the documents into chunks such that each chunk fits into the context window of the LLMs."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "27b8a373-14b3-45ea-8bf5-9749122ad927",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import TokenTextSplitter\n",
"\n",
"text_splitter = TokenTextSplitter(\n",
" # Controls the size of each chunk\n",
" chunk_size=2000,\n",
" # Controls overlap between chunks\n",
" chunk_overlap=20,\n",
")\n",
"\n",
"texts = text_splitter.split_text(document.page_content)"
]
},
{
"cell_type": "markdown",
"id": "5b43d7e0-3c85-4d97-86c7-e8c984b60b0a",
"metadata": {},
"source": [
"Use `.batch` functionality to run the extraction in **parallel** across each chunk! \n",
"\n",
":::{.callout-tip}\n",
"You can often use .batch() to parallelize the extractions! `batch` uses a threadpool under the hood to help you parallelize workloads.\n",
"\n",
"If your model is exposed via an API, this will likley speed up your extraction flow!\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6ba766b5-8d6c-48e6-8d69-f391a66b65d2",
"metadata": {},
"outputs": [],
"source": [
"# Limit just to the first 3 chunks\n",
"# so the code can be re-run quickly\n",
"first_few = texts[:3]\n",
"\n",
"extractions = extractor.batch(\n",
" [{\"text\": text} for text in first_few],\n",
" {\"max_concurrency\": 5}, # limit the concurrency by passing max concurrency!\n",
")"
]
},
{
"cell_type": "markdown",
"id": "67da8904-e927-406b-a439-2a16f6087ccf",
"metadata": {},
"source": [
"### Merge results\n",
"\n",
"After extracting data from across the chunks, we'll want to merge the extractions together."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "30b35897-4d94-44ad-80c6-446eff61b76b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[KeyDevelopment(year=1966, description=\"The Toyota Corolla began production, recognized as the world's best-selling automobile.\", evidence=\"The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile.\"),\n",
" KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),\n",
" KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),\n",
" KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),\n",
" KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),\n",
" KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),\n",
" KeyDevelopment(year=1888, description=\"Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.\", evidence=\"In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention.\"),\n",
" KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),\n",
" KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),\n",
" KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),\n",
" KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),\n",
" KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),\n",
" KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),\n",
" KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),\n",
" KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),\n",
" KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),\n",
" KeyDevelopment(year=1913, description=\"Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.\", evidence=\"This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant.\"),\n",
" KeyDevelopment(year=1914, description=\"Ford's assembly line worker could buy a Model T with four months' pay.\", evidence=\"In 1914, an assembly line worker could buy a Model T with four months' pay.\"),\n",
" KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"key_developments = []\n",
"\n",
"for extraction in extractions:\n",
" key_developments.extend(extraction.key_developments)\n",
"\n",
"key_developments[:20]"
]
},
{
"cell_type": "markdown",
"id": "48afd4a7-abcd-48b4-8ff1-6ca485f529e3",
"metadata": {},
"source": [
"## RAG based approach\n",
"\n",
"Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the the most relevant chunks.\n",
"\n",
":::{.callout-caution}\n",
"It can be difficult to identify which chunks are relevant.\n",
"\n",
"For example, in the `car` article we're using here, most of the article contains key development information. So by using\n",
"**RAG**, we'll likely be throwing out a lot of relevant information.\n",
"\n",
"We suggest experimenting with your use case and determining whether this approach works or not.\n",
":::\n",
"\n",
"Here's a simple example that relies on the `FAISS` vectorstore."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "aaf37c82-625b-4fa1-8e88-73303f08ac16",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import FAISS\n",
"from langchain_core.documents import Document\n",
"from langchain_core.runnables import RunnableLambda\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"texts = text_splitter.split_text(document.page_content)\n",
"vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())\n",
"\n",
"retriever = vectorstore.as_retriever(\n",
" search_kwargs={\"k\": 1}\n",
") # Only extract from first document"
]
},
{
"cell_type": "markdown",
"id": "013ecad9-f80f-477c-b954-494b46a02a07",
"metadata": {},
"source": [
"In this case the RAG extractor is only looking at the top document."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "47aad00b-7013-4f7f-a1b0-02ef269093bf",
"metadata": {},
"outputs": [],
"source": [
"rag_extractor = {\n",
" \"text\": retriever | (lambda docs: docs[0].page_content) # fetch content of top doc\n",
"} | extractor"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "68f2de01-0cd8-456e-a959-db236189d41b",
"metadata": {},
"outputs": [],
"source": [
"results = rag_extractor.invoke(\"Key developments associated with cars\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "56f434ea-1869-4192-914e-3ccf64e72f75",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"year=1924 description=\"Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market.\" evidence=\"Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5 per cent of the market.\"\n",
"year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='in 1925, Morris had 41 per cent of total British car production.'\n",
"year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence=\"Citroën did the same in France, coming to cars in 1919; between them and other cheap cars in reply such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925.\"\n",
"year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'\n"
]
}
],
"source": [
"for key_development in results.key_developments:\n",
" print(key_development)"
]
},
{
"cell_type": "markdown",
"id": "cf36e626-cf5d-4324-ba29-9bd602be9b97",
"metadata": {},
"source": [
"## Common issues\n",
"\n",
"Different methods have their own pros and cons related to cost, speed, and accuracy.\n",
"\n",
"Watch out for these issues:\n",
"\n",
"* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.\n",
"* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!\n",
"* LLMs can make up data. If looking for a single fact across a large text and using a brute force approach, you may end up getting more made up data."
]
},
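{
"cell_type": "markdown",
"id": "dedup-sketch-note",
"metadata": {},
"source": [
"To illustrate the de-duplication point above, here is a minimal sketch. It assumes the `key_developments` list built earlier and treats two extractions as duplicates when they share the same `year` and `description`; that key is an illustrative choice, so adapt it to whatever notion of sameness fits your schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dedup-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# A minimal de-duplication pass over the extracted results.\n",
"# The (year, description) key is an illustrative choice, not a requirement.\n",
"seen = set()\n",
"deduplicated = []\n",
"\n",
"for development in key_developments:\n",
"    key = (development.year, development.description)\n",
"    if key not in seen:\n",
"        seen.add(key)\n",
"        deduplicated.append(development)\n",
"\n",
"len(deduplicated)"
]
},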
{
"cell_type": "code",
"execution_count": null,
"id": "b5f9685f-9d68-4155-a78c-0cb50821e21f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,331 @@
{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Parsing\n",
"sidebar_position: 4\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "ea37db49-d389-4291-be73-885d06c1fb7e",
"metadata": {},
"source": [
"LLMs that are able to follow prompt instructions well can be tasked with outputting information in a given format.\n",
"\n",
"This approach relies on designing good prompts and then parsing the output of the LLMs to make them extract information well.\n",
"\n",
"Here, we'll use Claude which is great at following instructions! See [Anthropic models](https://www.anthropic.com/api)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d71b32de-a6b4-45ed-83a9-ba1925f9470c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_anthropic.chat_models import ChatAnthropic\n",
"\n",
"model = ChatAnthropic(model_name=\"claude-3-sonnet-20240229\")"
]
},
{
"cell_type": "markdown",
"id": "3e412374-3beb-4bbf-966b-400c1f66a258",
"metadata": {},
"source": [
":::{.callout-tip}\n",
"All the same considerations for extraction quality apply for parsing approach. Review the [guidelines](/docs/use_cases/extraction/guidelines) for extraction quality.\n",
"\n",
"This tutorial is meant to be simple, but generally should really include reference examples to squeeze out performance!\n",
":::"
]
},
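{
"cell_type": "markdown",
"id": "few-shot-sketch-note",
"metadata": {},
"source": [
"As a rough sketch of what reference examples can look like with a parsing approach, you can bake an example human/AI turn directly into the prompt. The prompt below is a hypothetical few-shot variant of the one built later in this guide; treat it as illustrative rather than a tuned recipe."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "few-shot-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"\n",
"# A hypothetical few-shot variant of the extraction prompt.\n",
"# The example human/AI turn shows the model the desired output format.\n",
"prompt_with_examples = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\n",
"            \"system\",\n",
"            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
"        ),\n",
"        # Reference example (curly braces are doubled to escape templating):\n",
"        (\"human\", \"Bob is 5 feet tall.\"),\n",
"        (\"ai\", '```json\\n{{\"people\": [{{\"name\": \"Bob\", \"height_in_meters\": 1.52}}]}}\\n```'),\n",
"        (\"human\", \"{query}\"),\n",
"    ]\n",
")"
]
},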
{
"cell_type": "markdown",
"id": "abc1a945-0f80-4953-add4-cd572b6f2a51",
"metadata": {},
"source": [
"## Using PydanticOutputParser\n",
"\n",
"The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "497eb023-c043-443d-ac62-2d4ea85fe1b0",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain.output_parsers import PydanticOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" name: str = Field(..., description=\"The name of the person\")\n",
" height_in_meters: float = Field(\n",
" ..., description=\"The height of the person expressed in meters.\"\n",
" )\n",
"\n",
"\n",
"class People(BaseModel):\n",
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
"\n",
" people: List[Person]\n",
"\n",
"\n",
"# Set up a parser\n",
"parser = PydanticOutputParser(pydantic_object=People)\n",
"\n",
"# Prompt\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
" ),\n",
" (\"human\", \"{query}\"),\n",
" ]\n",
").partial(format_instructions=parser.get_format_instructions())"
]
},
{
"cell_type": "markdown",
"id": "c31aa2c8-05a9-4a12-80c5-ea1250dea0ae",
"metadata": {},
"source": [
"Let's take a look at what information is sent to the model"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "20b99ffb-a114-49a9-a7be-154c525f8ada",
"metadata": {},
"outputs": [],
"source": [
"query = \"Anna is 23 years old and she is 6 feet tall\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4f3a66ce-de19-4571-9e54-67504ae3fba7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System: Answer the user query. Wrap the output in `json` tags\n",
"The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
"\n",
"As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
"the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
"\n",
"Here is the output schema:\n",
"```\n",
"{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
"```\n",
"Human: Anna is 23 years old and she is 6 feet tall\n"
]
}
],
"source": [
"print(prompt.format_prompt(query=query).to_string())"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3a46b5fd-9242-4b8c-a4e2-3f04fc19b3a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"People(people=[Person(name='Anna', height_in_meters=1.83)])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain = prompt | model | parser\n",
"chain.invoke({\"query\": query})"
]
},
{
"cell_type": "markdown",
"id": "815b3b87-3bc6-4b56-835e-c6b6703cef5d",
"metadata": {},
"source": [
"## Custom Parsing\n",
"\n",
"It's easy to create a custom prompt and parser with `LangChain` and `LCEL`.\n",
"\n",
"You can use a simple function to parse the output from the model!"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b1f11912-c1bb-4a2a-a482-79bf3996961f",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"from typing import List, Optional\n",
"\n",
"from langchain_anthropic.chat_models import ChatAnthropic\n",
"from langchain_core.messages import AIMessage\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" name: str = Field(..., description=\"The name of the person\")\n",
" height_in_meters: float = Field(\n",
" ..., description=\"The height of the person expressed in meters.\"\n",
" )\n",
"\n",
"\n",
"class People(BaseModel):\n",
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
"\n",
" people: List[Person]\n",
"\n",
"\n",
"# Prompt\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"Answer the user query. Output your answer as JSON that \"\n",
" \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
" \"Make sure to wrap the answer in ```json and ``` tags\",\n",
" ),\n",
" (\"human\", \"{query}\"),\n",
" ]\n",
").partial(schema=People.schema())\n",
"\n",
"\n",
"# Custom parser\n",
"def extract_json(message: AIMessage) -> List[dict]:\n",
" \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
"\n",
" Parameters:\n",
" text (str): The text containing the JSON content.\n",
"\n",
" Returns:\n",
" list: A list of extracted JSON strings.\n",
" \"\"\"\n",
" text = message.content\n",
" # Define the regular expression pattern to match JSON blocks\n",
" pattern = r\"```json(.*?)```\"\n",
"\n",
" # Find all non-overlapping matches of the pattern in the string\n",
" matches = re.findall(pattern, text, re.DOTALL)\n",
"\n",
" # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n",
" try:\n",
" return [json.loads(match.strip()) for match in matches]\n",
" except Exception:\n",
" raise ValueError(f\"Failed to parse: {message}\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "cda52ef5-a354-47a7-9c25-45153c2389e2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System: Answer the user query. Output your answer as JSON that matches the given schema: ```json\n",
"{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
"```. Make sure to wrap the answer in ```json and ``` tags\n",
"Human: Anna is 23 years old and she is 6 feet tall\n"
]
}
],
"source": [
"query = \"Anna is 23 years old and she is 6 feet tall\"\n",
"print(prompt.format_prompt(query=query).to_string())"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "993dc61a-229d-4795-a746-0d17df86b5c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain = prompt | model | extract_json\n",
"chain.invoke({\"query\": query})"
]
},
{
"cell_type": "markdown",
"id": "d3601bde",
"metadata": {},
"source": [
"## Other Libraries\n",
"\n",
"If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers and it\n",
"helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV) and expresses the schema in TypeScript. It seems to work pretty!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,105 @@
{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Extraction\n",
"sidebar_position: 3\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "5e397959-1622-4c1c-bdb6-4660a3c39e14",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"Large Language Models (LLMs) are emerging as an extremely capable technology for powering information extraction applications.\n",
"\n",
"Classical solutions to information extraction rely on a combination of people, (many) hand-crafted rules (e.g., regular expressions), and custom fine-tuned ML models.\n",
"\n",
"Such systems tend to get complex over time and become progressively more expensive to maintain and more difficult to enhance.\n",
"\n",
"LLMs can be adapted quickly for specific extraction tasks just by providing appropriate instructions to them and appropriate reference examples.\n",
"\n",
"This guide will show you how to use LLMs for extraction applications!\n",
"\n",
"## Approaches\n",
"\n",
"There are 3 broad approaches for information extraction using LLMs:\n",
"\n",
"- **Tool/Function Calling** Mode: Some LLMs support a *tool or function calling* mode. These LLMs can structure output according to a given **schema**. Generally, this approach is the easiest to work with and is expected to yield good results.\n",
"\n",
"- **JSON Mode**: Some LLMs are can be forced to output valid JSON. This is similar to **tool/function Calling** approach, except that the schema is provided as part of the prompt. Generally, our intuition is that this performs worse than a **tool/function calling** approach.\n",
"\n",
"- **Prompting Based**: LLMs that can follow instructions well can be instructed to generate text in a desired format. The generated text can be parsed downstream using existing [Output Parsers](/docs/modules/model_io/output_parsers/) or using [custom parsers](/docs/modules/model_io/output_parsers/custom) into a structured format like JSON. This approach can be used with LLMs that **do not support** JSON mode or tool/function calling modes. This approach is more broadly applicable, though may yield worse results than models that have been fine-tuned for extraction or function calling.\n",
"\n",
"## Quickstart\n",
"\n",
"Head to the [quickstart](/docs/use_cases/extraction/quickstart) to see how to extract information using LLMs using a basic end-to-end example.\n",
"\n",
"The quickstart focuses on information extraction using the **tool/function calling** approach.\n",
"\n",
"\n",
"## How-To Guides\n",
"\n",
"- [Use Reference Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n",
"- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
"- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n",
"- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt based approach to extract with models that do not support **tool/function calling**.\n",
"\n",
"## Guidelines\n",
"\n",
"Head to the [Guidelines](/docs/use_cases/extraction/guidelines) page to see a list of opinionated guidelines on how to get the best performance for extraction use cases.\n",
"\n",
"## Use Case Accelerant\n",
"\n",
"[langchain-extract](https://github.com/langchain-ai/langchain-extract) is a starter repo that implements a simple web server for information extraction from text and files using LLMs. It is build using **FastAPI**, **LangChain** and **Postgresql**. Feel free to adapt it to your own use cases.\n",
"\n",
"## Other Resources\n",
"\n",
"* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetime, enum, etc).\n",
"* LangChain [document loaders](/modules/data_connection/document_loaders/) to load content from files. Please see list of [integrations](/docs/integrations/document_loaders).\n",
"* The experimental [Anthropic function calling](https://python.langchain.com/docs/integrations/chat/anthropic_functions) support provides similar functionality to Anthropic chat models.\n",
"* [LlamaCPP](https://python.langchain.com/docs/integrations/llms/llamacpp#grammars) natively supports constrained decoding using custom grammars, making it easy to output structured content using local LLMs \n",
"* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n",
"* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM. Kor is optimized to work for a parsing approach.\n",
"* [OpenAI's function and tool calling](https://platform.openai.com/docs/guides/function-calling)\n",
"* For example, see [OpenAI's JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e171cab",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,352 @@
{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Quickstart\n",
"sidebar_position: 0\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "d28530a6-ddfd-49c0-85dc-b723551f6614",
"metadata": {},
"source": [
"In this quick start, we will use LLMs that are capable of **function/tool calling** to extract information from text.\n",
"\n",
":::{.callout-important}\n",
"Extraction using **function/tool calling** only works with [models that support **function/tool calling**](/docs/modules/model_io/chat/function_calling).\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "4412def2-38e3-4bd0-bbf0-fb09ff9e5985",
"metadata": {},
"source": [
"## Set up\n",
"\n",
"We will use the new [structured output](/docs/guides/structured_output) method available on LLMs that are capable of **function/tool calling**. \n",
"\n",
"Select a model, install the dependencies for it and set up API keys!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "380c0425-6062-4837-8630-c220240c83b9",
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain\n",
"\n",
"# Install a model capable of tool calling\n",
"# pip install langchain-openai\n",
"# pip install langchain-mistralai\n",
"# pip install langchain-fireworks\n",
"\n",
"# Set env vars for the relevant model or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_dotenv()"
]
},
{
"cell_type": "markdown",
"id": "54d6b970-2ea3-4192-951e-21237212b359",
"metadata": {},
"source": [
"## The Schema\n",
"\n",
"First, we need to describe what information we want to extract from the text.\n",
"\n",
"We'll use Pydantic to define an example schema to extract personal information."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c141084c-fb94-4093-8d6a-81175d688e40",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(..., description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" ..., description=\"The color of the peron's eyes if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(..., description=\"Height in METERs\")"
]
},
{
"cell_type": "markdown",
"id": "f248dd54-e36d-435a-b154-394ab4ed6792",
"metadata": {},
"source": [
"There are two best practices when defining schema:\n",
"\n",
"1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.\n",
"2. Do not force the LLM to make up information! Above we used `Optional` for the attributes allowing the LLM to output `None` if it doesn't know the answer.\n",
"\n",
":::{.callout-important}\n",
"For best performance, document the schema well and make sure the model isn't force to return results if there's no information to be extracted in the text.\n",
":::\n",
"\n",
"## The Extractor\n",
"\n",
"Let's create an information extractor using the schema we defined above."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a5e490f6-35ad-455e-8ae4-2bae021583ff",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain.chains import create_structured_output_runnable\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert extraction algorithm. \"\n",
" \"Only extract relevant information from the text. \"\n",
" \"If you do not know the value of an attribute asked to extract, \"\n",
" \"return null for the attribute's value.\",\n",
" ),\n",
" # Please see the how-to about improving performance with\n",
" # reference examples.\n",
" # MessagesPlaceholder('examples'),\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "832bf6a1-8e0c-4b6a-aa37-12fe9c42a6d9",
"metadata": {},
"source": [
"We need to use a model that supports function/tool calling.\n",
"\n",
"Please review [structured output](/docs/guides/structured_output) for list of some models that can be used with this API."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "04d846a6-d5cb-4009-ac19-61e3aac0177e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_mistralai import ChatMistralAI\n",
"\n",
"llm = ChatMistralAI(model=\"mistral-large-latest\")\n",
"\n",
"runnable = prompt | llm.with_structured_output(schema=Person)"
]
},
{
"cell_type": "markdown",
"id": "23582c0b-00ed-403f-a10e-3aeabf921f12",
"metadata": {},
"source": [
"Let's test it out"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "13165ac8-a1dc-44ce-a6ed-f52b577473e4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"Alan Smith is 6 feet tall and has blond hair.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "bd1c493d-f9dc-4236-8da9-50f6919f5710",
"metadata": {},
"source": [
":::{.callout-important} \n",
"\n",
"Extraction is Generative 🤯\n",
"\n",
"LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters\n",
"even though it was provided in feet!\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "28c5ef0c-b8d1-4e12-bd0e-e2528de87fcc",
"metadata": {},
"source": [
"## Multiple Entities\n",
"\n",
"In **most cases**, you should be extracting a list of entities rather than a single entity.\n",
"\n",
"This can be easily achieved using pydantic by nesting models inside one another."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "591a0c16-7a17-4883-91ee-0d6d2fdb265c",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(..., description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" ..., description=\"The color of the peron's eyes if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(..., description=\"Height in meters\")\n",
"\n",
"\n",
"class Data(BaseModel):\n",
" \"\"\"Extracted data about people.\"\"\"\n",
"\n",
" # Creates a model so that we can extract multiple entities.\n",
" people: List[Person]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "cf7062cc-1d1d-4a37-9122-509d1b87f0a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='2'), Person(name='Anna', hair_color=None, height_in_meters=None)])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable = prompt | llm.with_structured_output(schema=Data)\n",
"text = \"My name is Jeff and I am 2 meters. I have black hair. Anna has the same color hair as me.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "fba1d770-bf4d-4de4-9e4f-7384872ef0dc",
"metadata": {},
"source": [
":::{.callout-tip}\n",
"When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** if no relevant information\n",
"is in the text by providing an empty list. \n",
"\n",
"This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.\n",
":::"
]
},
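{
"cell_type": "markdown",
"id": "no-entities-check-note",
"metadata": {},
"source": [
"As a quick sanity check of the tip above, we can invoke the extractor on text that mentions no people; a well-behaved model should return an empty list, though actual behavior will vary by model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "no-entities-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Text with no people in it; ideally the model extracts an empty list.\n",
"# (Output depends on the model; this is a sanity check, not a guarantee.)\n",
"runnable.invoke({\"text\": \"The solar system is large, but earth has only 1 moon.\"})"
]
},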
{
"cell_type": "markdown",
"id": "f07a7455-7de6-4a6f-9772-0477ef65e3dc",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guide:\n",
"\n",
"- [Add Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n",
"- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
"- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n",
"- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt based approach to extract with models that do not support **tool/function calling**.\n",
"- [Guidelines](/docs/use_cases/extraction/guidelines): Guidelines for getting good performance on extraction tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "082fc1af",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}