From a4a6978224f7addcb7e255c8e6e7fe517dd9955f Mon Sep 17 00:00:00 2001
From: Eugene Yurtsev
Date: Wed, 6 Mar 2024 09:18:25 -0500
Subject: [PATCH] Docs: Revamp Extraction Use Case (#18588)

Revamp the extraction use case documentation

---------

Co-authored-by: Harrison Chase
---
 docs/docs/use_cases/extraction.ipynb          | 831 ------------------
 .../use_cases/extraction/guidelines.ipynb     |  68 ++
 .../extraction/how_to/_category_.yml          |   2 +
 .../extraction/how_to/examples.ipynb          | 464 ++++++++++
 .../extraction/how_to/handle_files.ipynb      | 150 ++++
 .../extraction/how_to/handle_long_text.ipynb  | 428 +++++++++
 .../use_cases/extraction/how_to/parse.ipynb   | 331 +++++++
 docs/docs/use_cases/extraction/index.ipynb    | 105 +++
 .../use_cases/extraction/quickstart.ipynb     | 352 ++++++++
 9 files changed, 1900 insertions(+), 831 deletions(-)
 delete mode 100644 docs/docs/use_cases/extraction.ipynb
 create mode 100644 docs/docs/use_cases/extraction/guidelines.ipynb
 create mode 100644 docs/docs/use_cases/extraction/how_to/_category_.yml
 create mode 100644 docs/docs/use_cases/extraction/how_to/examples.ipynb
 create mode 100644 docs/docs/use_cases/extraction/how_to/handle_files.ipynb
 create mode 100644 docs/docs/use_cases/extraction/how_to/handle_long_text.ipynb
 create mode 100644 docs/docs/use_cases/extraction/how_to/parse.ipynb
 create mode 100644 docs/docs/use_cases/extraction/index.ipynb
 create mode 100644 docs/docs/use_cases/extraction/quickstart.ipynb

diff --git a/docs/docs/use_cases/extraction.ipynb b/docs/docs/use_cases/extraction.ipynb
deleted file mode 100644
index f99e8c10c0..0000000000
--- a/docs/docs/use_cases/extraction.ipynb
+++ /dev/null
@@ -1,831 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "raw",
-   "id": "df29b30a-fd27-4e08-8269-870df5631f9e",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "title: Extraction\n",
-    "---"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b84edb4e",
-   "metadata": {},
-   "source": [
-    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/extraction.ipynb)\n",
-    "\n",
-    "## Use case\n",
-    "\n",
-    "LLMs can be used to generate text that is structured according to a specific schema. This can be useful in a number of scenarios, including:\n",
-    "\n",
-    "- Extracting a structured row to insert into a database \n",
-    "- Extracting API parameters\n",
-    "- Extracting different parts of a user query (e.g., for semantic vs keyword search)\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "178dbc59",
-   "metadata": {},
-   "source": [
-    "![Image description](../../static/img/extraction.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "97f474d4",
-   "metadata": {},
-   "source": [
-    "## Overview \n",
-    "\n",
-    "There are two broad approaches for this:\n",
-    "\n",
-    "- `Tools and JSON mode`: Some LLMs specifically support structured output generation in certain contexts. Examples include OpenAI's [function and tool calling](https://platform.openai.com/docs/guides/function-calling) or [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode).\n",
-    "\n",
-    "- `Parsing`: LLMs can often be instructed to output their response in a desired format. 
[Output parsers](/docs/modules/model_io/output_parsers/) will parse text generations into a structured form.\n",
-    "\n",
-    "Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).\n",
-    "\n",
-    "Functions and tools can infer things beyond a provided schema (e.g., attributes about a person that you did not ask for)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "fbea06b5-66b6-4958-936d-23212061e4c8",
-   "metadata": {},
-   "source": [
-    "## Option 1: Leveraging tools and JSON mode"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "25d89f21",
-   "metadata": {},
-   "source": [
-    "### Quickstart\n",
-    "\n",
-    "`create_structured_output_runnable` will create Runnables to support structured data extraction via OpenAI tool use and JSON mode.\n",
-    "\n",
-    "The desired output schema can be expressed either via a Pydantic model or a Python dict representing valid [JsonSchema](https://json-schema.org/).\n",
-    "\n",
-    "This function supports three modes for structured data extraction:\n",
-    "- `\"openai-functions\"` will define OpenAI functions and bind them to the given LLM;\n",
-    "- `\"openai-tools\"` will define OpenAI tools and bind them to the given LLM;\n",
-    "- `\"openai-json\"` will bind `response_format={\"type\": \"json_object\"}` to the given LLM.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3f5ec7a3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pip install langchain langchain-openai \n",
-    "\n",
-    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
-    "# import dotenv\n",
-    "# dotenv.load_dotenv()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "4c2bc413-eacd-44bd-9fcb-bbbe1f97ca6c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from typing import Optional\n",
-    "\n",
-    "from langchain.chains import create_structured_output_runnable\n",
-    "from langchain_core.pydantic_v1 import BaseModel\n",
-    "from langchain_openai import ChatOpenAI\n",
-    "\n",
-    "\n",
-    "class Person(BaseModel):\n",
-    "    person_name: str\n",
-    "    person_height: int\n",
-    "    person_hair_color: str\n",
-    "    dog_breed: Optional[str]\n",
-    "    dog_name: Optional[str]\n",
-    "\n",
-    "\n",
-    "llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)\n",
-    "runnable = create_structured_output_runnable(Person, llm)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "de8c9d7b-bb7b-45bc-9794-a355ed0d1508",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None)"
-      ]
-     },
-     "execution_count": 2,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "inp = \"Alex is 5 feet tall and has blond hair.\"\n",
-    "runnable.invoke(inp)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "02fd21ff-27a8-4890-bb18-fc852cafb18a",
-   "metadata": {},
-   "source": [
-    "### Specifying schemas"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a5a74f3e-92aa-4ac7-96f2-ea89b8740ba8",
-   "metadata": {},
-   "source": [
-    "A convenient way to express desired output schemas is via Pydantic. The above example specified the desired output schema via `Person`, a Pydantic model. 
Such schemas can be easily combined together to generate richer output formats:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "c1c8fe71-0ae4-466a-b32f-001c59b62bb3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from typing import Sequence\n",
-    "\n",
-    "\n",
-    "class People(BaseModel):\n",
-    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
-    "\n",
-    "    people: Sequence[Person]\n",
-    "\n",
-    "\n",
-    "runnable = create_structured_output_runnable(People, llm)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "c5aa9e43-9202-4b2d-a767-e596296b3a81",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry')])"
-      ]
-     },
-     "execution_count": 4,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "inp = \"\"\"Alex is 5 feet tall and has blond hair.\n",
-    "Claudia is 1 feet taller Alex and jumps higher than him.\n",
-    "Claudia is a brunette and has a beagle named Harry.\"\"\"\n",
-    "\n",
-    "runnable.invoke(inp)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "53e316ea-b74a-4512-a9ab-c5d01ff583fe",
-   "metadata": {},
-   "source": [
-    "Note that `dog_breed` and `dog_name` are optional attributes, so here they are extracted for Claudia but not for Alex.\n",
-    "\n",
-    "One can also specify the desired output format with a Python dict representing valid JsonSchema:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "3e017ba0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "schema = {\n",
-    "    \"type\": \"object\",\n",
-    "    \"properties\": {\n",
-    "        \"name\": {\"type\": \"string\"},\n",
-    "        \"height\": {\"type\": \"integer\"},\n",
-    "        \"hair_color\": {\"type\": \"string\"},\n",
-    "    },\n",
-    "    \"required\": [\"name\", \"height\"],\n",
-    "}\n",
-    "\n",
-    "runnable = create_structured_output_runnable(schema, llm)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "fb525991-643d-4d47-9111-a3d4364c03d7",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "{'name': 'Alex', 'height': 60}"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "inp = \"Alex is 5 feet tall. I don't know his hair color.\"\n",
-    "runnable.invoke(inp)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "id": "a3d3f0d2-c9d4-4ab8-9a5a-1ddda62db6ec",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "{'name': 'Alex', 'height': 60, 'hair_color': 'blond'}"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "inp = \"Alex is 5 feet tall. He is blond.\"\n",
-    "runnable.invoke(inp)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "34f3b958",
-   "metadata": {},
-   "source": [
-    "#### Extra information\n",
-    "\n",
-    "Runnables constructed via `create_structured_output_runnable` are generally capable of semantic extraction, such that they can populate information that is not explicitly enumerated in the schema.\n",
-    "\n",
-    "Suppose we want unspecified additional information about dogs. \n",
-    "\n",
-    "We can add a placeholder for unstructured extraction, `dog_extra_info`."
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "0ed3b5e6-a7f3-453e-be61-d94fc665c16b", - "metadata": {}, - "outputs": [], - "source": [ - "inp = \"\"\"Alex is 5 feet tall and has blond hair.\n", - "Claudia is 1 feet taller Alex and jumps higher than him.\n", - "Claudia is a brunette and has a beagle named Harry.\n", - "Harry likes to play with other dogs and can always be found\n", - "playing with Milo, a border collie that lives close by.\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "be07928a-8022-4963-a15e-eb3097beef9f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "People(people=[Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None), Person(person_name='Claudia', person_height=72, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry', dog_extra_info='likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.')])" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "class Person(BaseModel):\n", - " person_name: str\n", - " person_height: int\n", - " person_hair_color: str\n", - " dog_breed: Optional[str]\n", - " dog_name: Optional[str]\n", - " dog_extra_info: Optional[str]\n", - "\n", - "\n", - "class People(BaseModel):\n", - " \"\"\"Identifying information about all people in a text.\"\"\"\n", - "\n", - " people: Sequence[Person]\n", - "\n", - "\n", - "runnable = create_structured_output_runnable(People, llm)\n", - "runnable.invoke(inp)" - ] - }, - { - "cell_type": "markdown", - "id": "3a949c60", - "metadata": {}, - "source": [ - "This gives us additional information about the dogs." - ] - }, - { - "cell_type": "markdown", - "id": "97ed9f5e-33be-4667-aa82-af49cc874e1d", - "metadata": {}, - "source": [ - "### Specifying extraction mode\n", - "\n", - "`create_structured_output_runnable` supports varying implementations of the underlying extraction under the hood, which are configured via the `mode` parameter. This parameter can be one of `\"openai-functions\"`, `\"openai-tools\"`, or `\"openai-json\"`." - ] - }, - { - "cell_type": "markdown", - "id": "7c8e0b00-d6e6-432d-b9b0-8d0a3c0c6572", - "metadata": {}, - "source": [ - "#### OpenAI Functions and Tools" - ] - }, - { - "cell_type": "markdown", - "id": "07ccdbb1-cbe5-45af-87e4-dde42baee5eb", - "metadata": {}, - "source": [ - "Some LLMs are fine-tuned to support the invocation of functions or tools. If they are given an input schema for a tool and recognize an occasion to use it, they may emit JSON output conforming to that schema. We can leverage this to drive structured data extraction from natural language.\n", - "\n", - "OpenAI originally released this via a [`functions` parameter in its chat completions API](https://openai.com/blog/function-calling-and-other-api-updates). This has since been deprecated in favor of a [`tools` parameter](https://platform.openai.com/docs/guides/function-calling), which can include (multiple) functions." 
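To make the mechanics concrete, here is a minimal sketch of the payload involved. It assumes the `Person` model from earlier; `convert_to_openai_tool` renders a Pydantic model as the JSON Schema that OpenAI's `tools` parameter expects, which is roughly what `create_structured_output_runnable` binds for you in `"openai-tools"` mode.

```python
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool
from langchain_openai import ChatOpenAI


class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]


# Inspect the JSON Schema payload that will be sent via the `tools` parameter.
print(convert_to_openai_tool(Person))

# Bind the tool so that every invocation of the model can call it.
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
llm_with_tool = llm.bind(tools=[convert_to_openai_tool(Person)])
```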
- ] - }, - { - "cell_type": "markdown", - "id": "e6b02442-2884-4b45-a5a0-4fdac729fdb3", - "metadata": {}, - "source": [ - "Using OpenAI Functions:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "7b1c2266-b04b-4a23-83a9-da3cd2f88137", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "runnable = create_structured_output_runnable(Person, llm, mode=\"openai-functions\")\n", - "\n", - "inp = \"Alex is 5 feet tall and has blond hair.\"\n", - "runnable.invoke(inp)" - ] - }, - { - "cell_type": "markdown", - "id": "1c07427b-a582-4489-a486-4c24a6c3165f", - "metadata": {}, - "source": [ - "Using OpenAI Tools:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "0b1ca93a-ffd9-4d37-8baa-377757405357", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Person(person_name='Alex', person_height=152, person_hair_color='blond', dog_breed=None, dog_name=None)" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "runnable = create_structured_output_runnable(Person, llm, mode=\"openai-tools\")\n", - "\n", - "runnable.invoke(inp)" - ] - }, - { - "cell_type": "markdown", - "id": "4018a8fc-1799-4c9d-b655-a66f618204b3", - "metadata": {}, - "source": [ - "The corresponding [LangSmith trace](https://smith.langchain.com/public/04cc37a7-7a1c-4bae-b972-1cb1a642568c/r) illustrates the tool call that generated our structured output.\n", - "\n", - "![Image description](../../static/img/extraction_trace_tool.png)" - ] - }, - { - "cell_type": "markdown", - "id": "fb2662d5-9492-4acc-935b-eb8fccebbe0f", - "metadata": {}, - "source": [ - "#### JSON Mode" - ] - }, - { - "cell_type": "markdown", - "id": "c0fd98ba-c887-4c30-8c9e-896ae90ac56a", - "metadata": {}, - "source": [ - "Some LLMs support generating JSON more generally. OpenAI implements this via a [`response_format` parameter](https://platform.openai.com/docs/guides/text-generation/json-mode) in its chat completions API.\n", - "\n", - "Note that this method may require explicit prompting (e.g., OpenAI requires that input messages contain the word \"json\" in some form when using this parameter)." 
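Under the hood, `"openai-json"` mode binds this parameter directly to the chat model. A minimal sketch (the prompt contents here are illustrative):

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Bind OpenAI's JSON mode directly to the chat model.
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0).bind(
    response_format={"type": "json_object"}
)

# Note the word "JSON" in the system message -- OpenAI rejects the request
# without it when JSON mode is enabled.
messages = [
    SystemMessage(content="Return a JSON object with keys name and height."),
    HumanMessage(content="Alex is 5 feet tall and has blond hair."),
]
print(llm.invoke(messages).content)
```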
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "6b3e4679-eadc-42c8-b882-92a600083f2f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_core.prompts import ChatPromptTemplate\n", - "\n", - "system_prompt = \"\"\"You extract information in structured JSON formats.\n", - "\n", - "Extract a valid JSON blob from the user input that matches the following JSON Schema:\n", - "\n", - "{output_schema}\"\"\"\n", - "prompt = ChatPromptTemplate.from_messages(\n", - " [\n", - " (\"system\", system_prompt),\n", - " (\"human\", \"{input}\"),\n", - " ]\n", - ")\n", - "runnable = create_structured_output_runnable(\n", - " Person,\n", - " llm,\n", - " mode=\"openai-json\",\n", - " prompt=prompt,\n", - " enforce_function_usage=False,\n", - ")\n", - "\n", - "runnable.invoke({\"input\": inp})" - ] - }, - { - "cell_type": "markdown", - "id": "b22d8262-a9b8-415c-a142-d0ee4db7ec2b", - "metadata": {}, - "source": [ - "### Few-shot examples" - ] - }, - { - "cell_type": "markdown", - "id": "a01c75f6-99d7-4d7b-a58f-b0ea7e8f338a", - "metadata": {}, - "source": [ - "Suppose we want to tune the behavior of our extractor. There are a few options available. For example, if we want to redact names but retain other information, we could adjust the system prompt:" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "c5d16ad6-824e-434a-906a-d94e78259d4f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Person(person_name='REDACTED', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "system_prompt = \"\"\"You extract information in structured JSON formats.\n", - "\n", - "Extract a valid JSON blob from the user input that matches the following JSON Schema:\n", - "\n", - "{output_schema}\n", - "\n", - "Redact all names.\n", - "\"\"\"\n", - "prompt = ChatPromptTemplate.from_messages(\n", - " [(\"system\", system_prompt), (\"human\", \"{input}\")]\n", - ")\n", - "runnable = create_structured_output_runnable(\n", - " Person,\n", - " llm,\n", - " mode=\"openai-json\",\n", - " prompt=prompt,\n", - " enforce_function_usage=False,\n", - ")\n", - "\n", - "runnable.invoke({\"input\": inp})" - ] - }, - { - "cell_type": "markdown", - "id": "be611688-1224-4d5a-9e34-a158b3c04296", - "metadata": {}, - "source": [ - "Few-shot examples are another, effective way to illustrate intended behavior. For instance, if we want to redact names with a specific character string, a one-shot example will convey this. We can use a `FewShotChatMessagePromptTemplate` to easily accommodate both a fixed set of examples as well as the dynamic selection of examples based on the input." 
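The cell that follows demonstrates the fixed, one-shot case. For dynamic selection, a sketch along these lines should work, assuming OpenAI embeddings and `faiss-cpu` installed for the vector store (treat the exact setup as illustrative):

```python
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)
from langchain_openai import OpenAIEmbeddings

examples = [
    {"input": "Samus is 6 ft tall and blonde.", "output": "person_name: #####"},
    {"input": "Link wears a green tunic.", "output": "person_name: #####"},
]

# Embed the example inputs and retrieve the most similar one at runtime.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=1
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_selector=example_selector,
    example_prompt=ChatPromptTemplate.from_messages(
        [("human", "{input}"), ("ai", "{output}")]
    ),
    input_variables=["input"],
)

# Only the most similar example is pulled into the prompt.
print(few_shot_prompt.format(input="Zelda is 5 ft tall."))
```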
- ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "0aeee951-7f73-4e24-9033-c81a08af14dc", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Person(person_name='#####', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_core.prompts import FewShotChatMessagePromptTemplate\n", - "\n", - "examples = [\n", - " {\n", - " \"input\": \"Samus is 6 ft tall and blonde.\",\n", - " \"output\": Person(\n", - " person_name=\"######\",\n", - " person_height=5,\n", - " person_hair_color=\"blonde\",\n", - " ).dict(),\n", - " }\n", - "]\n", - "\n", - "example_prompt = ChatPromptTemplate.from_messages(\n", - " [(\"human\", \"{input}\"), (\"ai\", \"{output}\")]\n", - ")\n", - "few_shot_prompt = FewShotChatMessagePromptTemplate(\n", - " examples=examples,\n", - " example_prompt=example_prompt,\n", - ")\n", - "prompt = ChatPromptTemplate.from_messages(\n", - " [(\"system\", system_prompt), few_shot_prompt, (\"human\", \"{input}\")]\n", - ")\n", - "runnable = create_structured_output_runnable(\n", - " Person,\n", - " llm,\n", - " mode=\"openai-json\",\n", - " prompt=prompt,\n", - " enforce_function_usage=False,\n", - ")\n", - "\n", - "runnable.invoke({\"input\": inp})" - ] - }, - { - "cell_type": "markdown", - "id": "51846211-e86b-4807-9348-eb263999f7f7", - "metadata": {}, - "source": [ - "Here, the [LangSmith trace](https://smith.langchain.com/public/6fe5e694-9c04-48f7-83ff-e541da764781/r) for the chat model call shows how the one-shot example is formatted into the prompt.\n", - "\n", - "![Image description](../../static/img/extraction_trace_few_shot.png)" - ] - }, - { - "cell_type": "markdown", - "id": "cbd9f121", - "metadata": {}, - "source": [ - "## Option 2: Parsing\n", - "\n", - "[Output parsers](/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. \n", - "\n", - "As shown above, they are used to parse the output of the runnable created by `create_structured_output_runnable`.\n", - "\n", - "They can also be used more generally, if a LLM is instructed to emit its output in a certain format. Parsers include convenience methods for generating formatting instructions for use in prompts.\n", - "\n", - "Below we implement an example." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "64650362", - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Optional, Sequence\n", - "\n", - "from langchain.output_parsers import PydanticOutputParser\n", - "from langchain.prompts import PromptTemplate\n", - "from langchain_core.pydantic_v1 import BaseModel, Field, validator\n", - "from langchain_openai import ChatOpenAI\n", - "\n", - "\n", - "class Person(BaseModel):\n", - " person_name: str\n", - " person_height: int\n", - " person_hair_color: str\n", - " dog_breed: Optional[str]\n", - " dog_name: Optional[str]\n", - "\n", - "\n", - "class People(BaseModel):\n", - " \"\"\"Identifying information about all people in a text.\"\"\"\n", - "\n", - " people: Sequence[Person]\n", - "\n", - "\n", - "# Run\n", - "query = \"\"\"Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. 
Claudia is a brunette and Alex is blond.\"\"\"\n", - "\n", - "# Set up a parser + inject instructions into the prompt template.\n", - "parser = PydanticOutputParser(pydantic_object=People)\n", - "\n", - "# Prompt\n", - "prompt = PromptTemplate(\n", - " template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", - " input_variables=[\"query\"],\n", - " partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", - ")\n", - "\n", - "# Run\n", - "_input = prompt.format_prompt(query=query)\n", - "model = ChatOpenAI()" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "727f3bf2-31b1-4b07-94f5-9568acf3ffdf", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)])" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output = model.invoke(_input.to_string())\n", - "\n", - "parser.parse(output.content)" - ] - }, - { - "cell_type": "markdown", - "id": "826899df", - "metadata": {}, - "source": [ - "We can see from the [LangSmith trace](https://smith.langchain.com/public/aec42dd3-d471-4d34-801b-20dd88444931/r) that we get the same output as above.\n", - "\n", - "![Image description](../../static/img/extraction_trace_parsing.png)\n", - "\n", - "We can see that we provide a two-shot prompt in order to instruct the LLM to output in our desired format." - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "837c350e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Joke(setup=\"Why couldn't the bicycle find its way home?\", punchline='Because it lost its bearings!')" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Define your desired data structure.\n", - "class Joke(BaseModel):\n", - " setup: str = Field(description=\"question to set up a joke\")\n", - " punchline: str = Field(description=\"answer to resolve the joke\")\n", - "\n", - " # You can add custom validation logic easily with Pydantic.\n", - " @validator(\"setup\")\n", - " def question_ends_with_question_mark(cls, field):\n", - " if field[-1] != \"?\":\n", - " raise ValueError(\"Badly formed question!\")\n", - " return field\n", - "\n", - "\n", - "# And a query intended to prompt a language model to populate the data structure.\n", - "joke_query = \"Tell me a joke.\"\n", - "\n", - "# Set up a parser + inject instructions into the prompt template.\n", - "parser = PydanticOutputParser(pydantic_object=Joke)\n", - "\n", - "# Prompt\n", - "prompt = PromptTemplate(\n", - " template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", - " input_variables=[\"query\"],\n", - " partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", - ")\n", - "\n", - "# Run\n", - "_input = prompt.format_prompt(query=joke_query)\n", - "model = ChatOpenAI(temperature=0)\n", - "output = model.invoke(_input.to_string())\n", - "parser.parse(output.content)" - ] - }, - { - "cell_type": "markdown", - "id": "d3601bde", - "metadata": {}, - "source": [ - "As we can see, we get an output of the `Joke` class, which respects our originally desired schema: 'setup' and 'punchline'.\n", - "\n", - "We can look at the [LangSmith 
trace](https://smith.langchain.com/public/557ad630-af35-43e9-b043-93800539025f/r) to see exactly what is going on under the hood.\n", - "\n", - "### Going deeper\n", - "\n", - "* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetime, enum, etc). \n", - "* The experimental [Anthropic function calling](https://python.langchain.com/docs/integrations/chat/anthropic_functions) support provides similar functionality to Anthropic chat models.\n", - "* [LlamaCPP](https://python.langchain.com/docs/integrations/llms/llamacpp#grammars) natively supports constrained decoding using custom grammars, making it easy to output structured content using local LLMs \n", - "* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n", - "* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM." - ] - }, - { - "cell_type": "markdown", - "id": "aab95ecf", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/docs/use_cases/extraction/guidelines.ipynb b/docs/docs/use_cases/extraction/guidelines.ipynb new file mode 100644 index 0000000000..0639e8d2c1 --- /dev/null +++ b/docs/docs/use_cases/extraction/guidelines.ipynb @@ -0,0 +1,68 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "913dd5a2-24d1-4f8e-bc15-ab518483eef9", + "metadata": {}, + "source": [ + "---\n", + "title: Guidelines\n", + "sidebar_position: 5\n", + "---" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e", + "metadata": {}, + "source": [ + "The quality of extraction results depends on many factors. \n", + "\n", + "Here is a set of guidelines to help you squeeze out the best performance from your models:\n", + "\n", + "* Set the model temperature to `0`.\n", + "* Improve the prompt. The prompt should be precise and to the point.\n", + "* Document the schema: Make sure the schema is documented to provide more information to the LLM.\n", + "* Provide reference examples! Diverse examples can help, including examples where nothing should be extracted.\n", + "* If you have a lot of examples, use a retriever to retrieve the most relevant examples.\n", + "* Benchmark with the best available LLM/Chat Model (e.g., gpt-4, claude-3, etc) -- check with the model provider which one is the latest and greatest!\n", + "* If the schema is very large, try breaking it into multiple smaller schemas, run separate extractions and merge the results.\n", + "* Make sure that the schema allows the model to REJECT extracting information. If it doesn't, the model will be forced to make up information!\n", + "* Add verification/correction steps (ask an LLM to correct or verify the results of the extraction).\n", + "\n", + "## Benchmark\n", + "\n", + "* Create and benchmark data for your use case using [LangSmith 🦜️🛠️](https://docs.smith.langchain.com/).\n", + "* Is your LLM good enough? 
Use [langchain-benchmarks 🦜💯 ](https://github.com/langchain-ai/langchain-benchmarks) to test out your LLM using existing datasets.\n",
+    "\n",
+    "## Keep in mind! 😶‍🌫️\n",
+    "\n",
+    "* LLMs are great, but are not required for all cases! If you’re extracting information from a single structured source (e.g., LinkedIn), using an LLM is not a good idea – traditional web-scraping will be much cheaper and more reliable.\n",
+    "\n",
+    "* **Human in the loop**: If you need **perfect quality**, you'll likely need to plan on having a human in the loop -- even the best LLMs will make mistakes when dealing with complex extraction tasks."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/docs/use_cases/extraction/how_to/_category_.yml b/docs/docs/use_cases/extraction/how_to/_category_.yml
new file mode 100644
index 0000000000..3b30375ba6
--- /dev/null
+++ b/docs/docs/use_cases/extraction/how_to/_category_.yml
@@ -0,0 +1,2 @@
+label: 'How-To Guides'
+position: 1
\ No newline at end of file
diff --git a/docs/docs/use_cases/extraction/how_to/examples.ipynb b/docs/docs/use_cases/extraction/how_to/examples.ipynb
new file mode 100644
index 0000000000..e37511da19
--- /dev/null
+++ b/docs/docs/use_cases/extraction/how_to/examples.ipynb
@@ -0,0 +1,464 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "id": "a37d08e8-8d6d-4cf2-8215-2aafb6877fb5",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "title: Use Reference Examples\n",
+    "sidebar_position: 1\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70403d4f-50c1-43f8-a7ea-a211167649a5",
+   "metadata": {},
+   "source": [
+    "The quality of extractions can often be improved by providing reference examples to the LLM.\n",
+    "\n",
+    ":::{.callout-tip}\n",
+    "While this tutorial focuses on how to use examples with a tool calling model, this technique is generally applicable and will also work\n",
+    "with JSON mode or prompt-based techniques.\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "89579144-bcb3-490a-8036-86a0a6bcd56b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
+    "\n",
+    "# Define a custom prompt to provide instructions and any additional context.\n",
+    "# 1) You can add examples into the prompt template to improve extraction quality\n",
+    "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
+    "#    about the document from which the text was extracted.)\n",
+    "prompt = ChatPromptTemplate.from_messages(\n",
+    "    [\n",
+    "        (\n",
+    "            \"system\",\n",
+    "            \"You are an expert extraction algorithm. \"\n",
+    "            \"Only extract relevant information from the text. 
\"\n", + " \"If you do not know the value of an attribute asked \"\n", + " \"to extract, return null for the attribute's value.\",\n", + " ),\n", + " # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓\n", + " MessagesPlaceholder(\"examples\"), # <-- EXAMPLES!\n", + " # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑\n", + " (\"human\", \"{text}\"),\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2484008c-ba1a-42a5-87a1-628a900de7fd", + "metadata": {}, + "source": [ + "Test out the template:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "610c3025-ea63-4cd7-88bd-c8cbcb4d8a3f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ChatPromptValue(messages=[SystemMessage(content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"), HumanMessage(content='testing 1 2 3'), HumanMessage(content='this is some text')])" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_core.messages import (\n", + " HumanMessage,\n", + ")\n", + "\n", + "prompt.invoke(\n", + " {\"text\": \"this is some text\", \"examples\": [HumanMessage(content=\"testing 1 2 3\")]}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "368abd80-0cf0-41a7-8224-acf90dd6830d", + "metadata": {}, + "source": [ + "## Define the schema\n", + "\n", + "Let's re-use the person schema from the quickstart." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d875a49a-d2cb-4b9e-b5bf-41073bc3905c", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List, Optional\n", + "\n", + "from langchain_core.pydantic_v1 import BaseModel, Field\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "\n", + "class Person(BaseModel):\n", + " \"\"\"Information about a person.\"\"\"\n", + "\n", + " # ^ Doc-string for the entity Person.\n", + " # This doc-string is sent to the LLM as the description of the schema Person,\n", + " # and it can help to improve extraction results.\n", + "\n", + " # Note that:\n", + " # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n", + " # 2. Each field has a `description` -- this description is used by the LLM.\n", + " # Having a good description can help improve extraction results.\n", + " name: Optional[str] = Field(..., description=\"The name of the person\")\n", + " hair_color: Optional[str] = Field(\n", + " ..., description=\"The color of the peron's eyes if known\"\n", + " )\n", + " height_in_meters: Optional[str] = Field(..., description=\"Height in METERs\")\n", + "\n", + "\n", + "class Data(BaseModel):\n", + " \"\"\"Extracted data about people.\"\"\"\n", + "\n", + " # Creates a model so that we can extract multiple entities.\n", + " people: List[Person]" + ] + }, + { + "cell_type": "markdown", + "id": "96c42162-e4f6-4461-88fd-c76f5aab7e32", + "metadata": {}, + "source": [ + "## Define reference examples\n", + "\n", + "Examples can be defined as a list of input-output pairs. 
\n", + "\n", + "Each example contains an example `input` text and an example `output` showing what should be extracted from the text.\n", + "\n", + ":::{.callout-important}\n", + "This is a bit in the weeds, so feel free to ignore if you don't get it!\n", + "\n", + "The format of the example needs to match the API used (e.g., tool calling or JSON mode etc.).\n", + "\n", + "Here, the formatted examples will match the format expected for the tool calling API since that's what we're using.\n", + ":::" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "08356810-77ce-4e68-99d9-faa0326f2cee", + "metadata": {}, + "outputs": [], + "source": [ + "import uuid\n", + "from typing import Dict, List, TypedDict\n", + "\n", + "from langchain_core.messages import (\n", + " AIMessage,\n", + " BaseMessage,\n", + " HumanMessage,\n", + " SystemMessage,\n", + " ToolMessage,\n", + ")\n", + "from langchain_core.pydantic_v1 import BaseModel, Field\n", + "\n", + "\n", + "class Example(TypedDict):\n", + " \"\"\"A representation of an example consisting of text input and expected tool calls.\n", + "\n", + " For extraction, the tool calls are represented as instances of pydantic model.\n", + " \"\"\"\n", + "\n", + " input: str # This is the example text\n", + " tool_calls: List[BaseModel] # Instances of pydantic model that should be extracted\n", + "\n", + "\n", + "def tool_example_to_messages(example: Example) -> List[BaseMessage]:\n", + " \"\"\"Convert an example into a list of messages that can be fed into an LLM.\n", + "\n", + " This code is an adapter that converts our example to a list of messages\n", + " that can be fed into a chat model.\n", + "\n", + " The list of messages per example corresponds to:\n", + "\n", + " 1) HumanMessage: contains the content from which content should be extracted.\n", + " 2) AIMessage: contains the extracted information from the model\n", + " 3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.\n", + "\n", + " The ToolMessage is required because some of the chat models are hyper-optimized for agents\n", + " rather than for an extraction use case.\n", + " \"\"\"\n", + " messages: List[BaseMessage] = [HumanMessage(content=example[\"input\"])]\n", + " openai_tool_calls = []\n", + " for tool_call in example[\"tool_calls\"]:\n", + " openai_tool_calls.append(\n", + " {\n", + " \"id\": str(uuid.uuid4()),\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " # The name of the function right now corresponds\n", + " # to the name of the pydantic model\n", + " # This is implicit in the API right now,\n", + " # and will be improved over time.\n", + " \"name\": tool_call.__class__.__name__,\n", + " \"arguments\": tool_call.json(),\n", + " },\n", + " }\n", + " )\n", + " messages.append(\n", + " AIMessage(content=\"\", additional_kwargs={\"tool_calls\": openai_tool_calls})\n", + " )\n", + " tool_outputs = example.get(\"tool_outputs\") or [\n", + " \"You have correctly called this tool.\"\n", + " ] * len(openai_tool_calls)\n", + " for output, tool_call in zip(tool_outputs, openai_tool_calls):\n", + " messages.append(ToolMessage(content=output, tool_call_id=tool_call[\"id\"]))\n", + " return messages" + ] + }, + { + "cell_type": "markdown", + "id": "463aa282-51c4-42bf-9463-6ca3b2c08de6", + "metadata": {}, + "source": [ + "Next let's define our examples and then convert them into message format." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7f59a745-5c81-4011-a4c5-a33ec1eca7ef", + "metadata": {}, + "outputs": [], + "source": [ + "examples = [\n", + " (\n", + " \"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\",\n", + " Person(name=None, height_in_meters=None, hair_color=None),\n", + " ),\n", + " (\n", + " \"Fiona traveled far from France to Spain.\",\n", + " Person(name=\"Fiona\", height_in_meters=None, hair_color=None),\n", + " ),\n", + "]\n", + "\n", + "\n", + "messages = []\n", + "\n", + "for text, tool_call in examples:\n", + " messages.extend(\n", + " tool_example_to_messages({\"input\": text, \"tool_calls\": [tool_call]})\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "6fdbda30-e7e3-46b5-a54a-1769c580af93", + "metadata": {}, + "source": [ + "Let's test out the prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "e61fa3a5-3d15-46a2-a23b-788f9a3ede52", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ChatPromptValue(messages=[SystemMessage(content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"), HumanMessage(content=\"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\"), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'c75e57cc-8212-4959-81e9-9477b0b79126', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{\"name\": null, \"hair_color\": null, \"height_in_meters\": null}'}}]}), ToolMessage(content='You have correctly called this tool.', tool_call_id='c75e57cc-8212-4959-81e9-9477b0b79126'), HumanMessage(content='Fiona traveled far from France to Spain.'), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '69da50b5-e427-44be-b396-1e56d821c6b0', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{\"name\": \"Fiona\", \"hair_color\": null, \"height_in_meters\": null}'}}]}), ToolMessage(content='You have correctly called this tool.', tool_call_id='69da50b5-e427-44be-b396-1e56d821c6b0'), HumanMessage(content='this is some text')])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt.invoke({\"text\": \"this is some text\", \"examples\": messages})" + ] + }, + { + "cell_type": "markdown", + "id": "47b0bbef-bc6b-4535-a8e2-5c84f09d5637", + "metadata": {}, + "source": [ + "## Create an extractor\n", + "Here, we'll create an extractor using **gpt-4**." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "dbfea43d-769b-42e9-a76f-ce722f7d6f93", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/harrisonchase/workplace/langchain/libs/core/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. 
It is actively being worked on, so the API may change.\n",
+      "  warn_beta(\n"
+     ]
+    }
+   ],
+   "source": [
+    "# We will be using tool calling mode, which\n",
+    "# requires a tool calling capable model.\n",
+    "llm = ChatOpenAI(\n",
+    "    # Consider benchmarking with a good model to get\n",
+    "    # a sense of the best possible quality.\n",
+    "    model=\"gpt-4-0125-preview\",\n",
+    "    # Remember to set the temperature to 0 for extractions!\n",
+    "    temperature=0,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "runnable = prompt | llm.with_structured_output(\n",
+    "    schema=Data,\n",
+    "    method=\"function_calling\",\n",
+    "    include_raw=False,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58a8139e-f201-4b8e-baf0-16a83e5fa987",
+   "metadata": {},
+   "source": [
+    "## Without examples 😿\n",
+    "\n",
+    "Notice that even though we're using gpt-4, it's failing with a **very simple** test case!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "8b1d6273-5ec5-4970-af8a-0da1f1efa293",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "people=[]\n",
+      "people=[Person(name='earth', hair_color=None, height_in_meters=None)]\n",
+      "people=[Person(name='earth', hair_color=None, height_in_meters=None)]\n",
+      "people=[]\n",
+      "people=[]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for _ in range(5):\n",
+    "    text = \"The solar system is large, but earth has only 1 moon.\"\n",
+    "    print(runnable.invoke({\"text\": text, \"examples\": []}))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "09840f17-ab26-4ea2-8a39-c747103804ec",
+   "metadata": {},
+   "source": [
+    "## With examples 😻\n",
+    "\n",
+    "Reference examples help to fix the failure!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "9bdfa49e-0005-4c06-9598-2adfd882b014",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "people=[]\n",
+      "people=[]\n",
+      "people=[]\n",
+      "people=[]\n",
+      "people=[]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for _ in range(5):\n",
+    "    text = \"The solar system is large, but earth has only 1 moon.\"\n",
+    "    print(runnable.invoke({\"text\": text, \"examples\": messages}))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "84413e17-608d-4f85-b70e-00b89b271927",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "runnable.invoke(\n",
+    "    {\n",
+    "        \"text\": \"My name is Harrison. 
My hair is black.\",\n",
+    "        \"examples\": messages,\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d18bb013",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/docs/use_cases/extraction/how_to/handle_files.ipynb b/docs/docs/use_cases/extraction/how_to/handle_files.ipynb
new file mode 100644
index 0000000000..eed1eb16ac
--- /dev/null
+++ b/docs/docs/use_cases/extraction/how_to/handle_files.ipynb
@@ -0,0 +1,150 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "id": "8371e5d6-eb65-4c97-aac2-05037356c2c1",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "title: Handle Files\n",
+    "sidebar_position: 3\n",
+    "---"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "0d5eea7c-bc69-4da2-b91d-d7c71f7085d0",
+   "metadata": {},
+   "source": [
+    "Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.\n",
+    "\n",
+    "You can use LangChain [document loaders](/docs/modules/data_connection/document_loaders/) to parse files into a text format that can be fed into LLMs.\n",
+    "\n",
+    "LangChain features a large number of [document loader integrations](/docs/integrations/document_loaders).\n",
+    "\n",
+    "## MIME type based parsing\n",
+    "\n",
+    "For basic parsing examples take a look [at document loaders](/docs/modules/data_connection/document_loaders/).\n",
+    "\n",
+    "Here, we'll be looking at mime-type based parsing, which is often useful for extraction-based applications if you're writing server code that accepts user-uploaded files.\n",
+    "\n",
+    "In this case, it's best to assume that the file extension of the file provided by the user is wrong and instead infer the mimetype from the binary content of the file.\n",
+    "\n",
+    "Let's download some content. This will be an HTML file, but the code below will work with other file types."
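As a compact sketch of the overall pattern (assuming the `python-magic` package is installed for MIME detection; the handlers shown are illustrative and can be extended per MIME type):

```python
import magic
from langchain.document_loaders.parsers import BS4HTMLParser, PDFMinerParser
from langchain.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders import Blob

# Map each MIME type to a parser that can handle it.
HANDLERS = {
    "application/pdf": PDFMinerParser(),
    "text/plain": TextParser(),
    "text/html": BS4HTMLParser(),
}

parser = MimeTypeBasedParser(handlers=HANDLERS, fallback_parser=None)

content = b"<!doctype html><html><body><p>Hello!</p></body></html>"
# Infer the MIME type from the bytes, not from the (untrusted) file extension.
mime_type = magic.Magic(mime=True).from_buffer(content)  # 'text/html'
blob = Blob.from_data(data=content, mime_type=mime_type)
documents = parser.parse(blob)
```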
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "76d42bb2-090b-4a70-a656-d6e9af769eba", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "b'\\n List[dict]:\n", + " \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n", + "\n", + " Parameters:\n", + " text (str): The text containing the JSON content.\n", + "\n", + " Returns:\n", + " list: A list of extracted JSON strings.\n", + " \"\"\"\n", + " text = message.content\n", + " # Define the regular expression pattern to match JSON blocks\n", + " pattern = r\"```json(.*?)```\"\n", + "\n", + " # Find all non-overlapping matches of the pattern in the string\n", + " matches = re.findall(pattern, text, re.DOTALL)\n", + "\n", + " # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n", + " try:\n", + " return [json.loads(match.strip()) for match in matches]\n", + " except Exception:\n", + " raise ValueError(f\"Failed to parse: {message}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "cda52ef5-a354-47a7-9c25-45153c2389e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "System: Answer the user query. Output your answer as JSON that matches the given schema: ```json\n", + "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n", + "```. Make sure to wrap the answer in ```json and ``` tags\n", + "Human: Anna is 23 years old and she is 6 feet tall\n" + ] + } + ], + "source": [ + "query = \"Anna is 23 years old and she is 6 feet tall\"\n", + "print(prompt.format_prompt(query=query).to_string())" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "993dc61a-229d-4795-a746-0d17df86b5c0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chain = prompt | model | extract_json\n", + "chain.invoke({\"query\": query})" + ] + }, + { + "cell_type": "markdown", + "id": "d3601bde", + "metadata": {}, + "source": [ + "## Other Libraries\n", + "\n", + "If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers and it\n", + "helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV) and expresses the schema in TypeScript. It seems to work pretty!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/use_cases/extraction/index.ipynb b/docs/docs/use_cases/extraction/index.ipynb new file mode 100644 index 0000000000..5e6fc6b87a --- /dev/null +++ b/docs/docs/use_cases/extraction/index.ipynb @@ -0,0 +1,105 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "df29b30a-fd27-4e08-8269-870df5631f9e", + "metadata": {}, + "source": [ + "---\n", + "title: Extraction\n", + "sidebar_position: 3\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "5e397959-1622-4c1c-bdb6-4660a3c39e14", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "Large Language Models (LLMs) are emerging as an extremely capable technology for powering information extraction applications.\n", + "\n", + "Classical solutions to information extraction rely on a combination of people, (many) hand-crafted rules (e.g., regular expressions), and custom fine-tuned ML models.\n", + "\n", + "Such systems tend to get complex over time and become progressively more expensive to maintain and more difficult to enhance.\n", + "\n", + "LLMs can be adapted quickly for specific extraction tasks just by providing appropriate instructions to them and appropriate reference examples.\n", + "\n", + "This guide will show you how to use LLMs for extraction applications!\n", + "\n", + "## Approaches\n", + "\n", + "There are 3 broad approaches for information extraction using LLMs:\n", + "\n", + "- **Tool/Function Calling** Mode: Some LLMs support a *tool or function calling* mode. These LLMs can structure output according to a given **schema**. Generally, this approach is the easiest to work with and is expected to yield good results.\n", + "\n", + "- **JSON Mode**: Some LLMs are can be forced to output valid JSON. This is similar to **tool/function Calling** approach, except that the schema is provided as part of the prompt. Generally, our intuition is that this performs worse than a **tool/function calling** approach.\n", + "\n", + "- **Prompting Based**: LLMs that can follow instructions well can be instructed to generate text in a desired format. The generated text can be parsed downstream using existing [Output Parsers](/docs/modules/model_io/output_parsers/) or using [custom parsers](/docs/modules/model_io/output_parsers/custom) into a structured format like JSON. This approach can be used with LLMs that **do not support** JSON mode or tool/function calling modes. 
This approach is more broadly applicable, though it may yield worse results than models that have been fine-tuned for extraction or function calling.\n",
+    "\n",
+    "## Quickstart\n",
+    "\n",
+    "Head to the [quickstart](/docs/use_cases/extraction/quickstart) to see how to extract information using LLMs in a basic end-to-end example.\n",
+    "\n",
+    "The quickstart focuses on information extraction using the **tool/function calling** approach.\n",
+    "\n",
+    "\n",
+    "## How-To Guides\n",
+    "\n",
+    "- [Use Reference Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n",
+    "- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
+    "- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n",
+    "- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.\n",
+    "\n",
+    "## Guidelines\n",
+    "\n",
+    "Head to the [Guidelines](/docs/use_cases/extraction/guidelines) page to see a list of opinionated guidelines on how to get the best performance for extraction use cases.\n",
+    "\n",
+    "## Use Case Accelerant\n",
+    "\n",
+    "[langchain-extract](https://github.com/langchain-ai/langchain-extract) is a starter repo that implements a simple web server for information extraction from text and files using LLMs. It is built using **FastAPI**, **LangChain** and **Postgresql**. Feel free to adapt it to your own use cases.\n",
+    "\n",
+    "## Other Resources\n",
+    "\n",
+    "* The [output parser](/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetime, enum, etc).\n",
+    "* LangChain [document loaders](/docs/modules/data_connection/document_loaders/) to load content from files. Please see the list of [integrations](/docs/integrations/document_loaders).\n",
+    "* The experimental [Anthropic function calling](https://python.langchain.com/docs/integrations/chat/anthropic_functions) support provides similar functionality for Anthropic chat models.\n",
+    "* [LlamaCPP](https://python.langchain.com/docs/integrations/llms/llamacpp#grammars) natively supports constrained decoding using custom grammars, making it easy to output structured content using local LLMs.\n",
+    "* [JSONFormer](/docs/integrations/llms/jsonformer_experimental) offers another way for structured decoding of a subset of the JSON Schema.\n",
+    "* [Kor](https://eyurtsev.github.io/kor/) is another library for extraction where schema and examples can be provided to the LLM. Kor is optimized to work for a parsing approach.\n",
+    "* [OpenAI's function and tool calling](https://platform.openai.com/docs/guides/function-calling)\n",
+    "* [OpenAI's JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e171cab", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.1" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/use_cases/extraction/quickstart.ipynb b/docs/docs/use_cases/extraction/quickstart.ipynb new file mode 100644 index 0000000000..051bdcef2f --- /dev/null +++ b/docs/docs/use_cases/extraction/quickstart.ipynb @@ -0,0 +1,352 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "df29b30a-fd27-4e08-8269-870df5631f9e", + "metadata": {}, + "source": [ + "---\n", + "title: Quickstart\n", + "sidebar_position: 0\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "d28530a6-ddfd-49c0-85dc-b723551f6614", + "metadata": {}, + "source": [ + "In this quick start, we will use LLMs that are capable of **function/tool calling** to extract information from text.\n", + "\n", + ":::{.callout-important}\n", + "Extraction using **function/tool calling** only works with [models that support **function/tool calling**](/docs/modules/model_io/chat/function_calling).\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "id": "4412def2-38e3-4bd0-bbf0-fb09ff9e5985", + "metadata": {}, + "source": [ + "## Set up\n", + "\n", + "We will use the new [structured output](/docs/guides/structured_output) method available on LLMs that are capable of **function/tool calling**. \n", + "\n", + "Select a model, install the dependencies for it and set up API keys!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "380c0425-6062-4837-8630-c220240c83b9", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install langchain\n", + "\n", + "# Install a model capable of tool calling\n", + "# pip install langchain-openai\n", + "# pip install langchain-mistralai\n", + "# pip install langchain-fireworks\n", + "\n", + "# Set env vars for the relevant model or load from a .env file:\n", + "# import dotenv\n", + "# dotenv.load_dotenv()" + ] + }, + { + "cell_type": "markdown", + "id": "54d6b970-2ea3-4192-951e-21237212b359", + "metadata": {}, + "source": [ + "## The Schema\n", + "\n", + "First, we need to describe what information we want to extract from the text.\n", + "\n", + "We'll use Pydantic to define an example schema to extract personal information." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c141084c-fb94-4093-8d6a-81175d688e40", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Optional\n", + "\n", + "from langchain_core.pydantic_v1 import BaseModel, Field\n", + "\n", + "\n", + "class Person(BaseModel):\n", + " \"\"\"Information about a person.\"\"\"\n", + "\n", + " # ^ Doc-string for the entity Person.\n", + " # This doc-string is sent to the LLM as the description of the schema Person,\n", + " # and it can help to improve extraction results.\n", + "\n", + " # Note that:\n", + " # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n", + " # 2. 
Each field has a `description` -- this description is used by the LLM.\n", + " # Having a good description can help improve extraction results.\n", + " name: Optional[str] = Field(..., description=\"The name of the person\")\n", + " hair_color: Optional[str] = Field(\n", + " ..., description=\"The color of the person's hair if known\"\n", + " )\n", + " height_in_meters: Optional[str] = Field(..., description=\"Height in meters\")" + ] + }, + { + "cell_type": "markdown", + "id": "f248dd54-e36d-435a-b154-394ab4ed6792", + "metadata": {}, + "source": [ + "There are two best practices when defining a schema:\n", + "\n", + "1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.\n", + "2. Do not force the LLM to make up information! Above we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.\n", + "\n", + ":::{.callout-important}\n", + "For best performance, document the schema well and make sure the model isn't forced to return results if there's no information to extract from the text.\n", + ":::\n", + "\n", + "## The Extractor\n", + "\n", + "Let's create an information extractor using the schema we defined above." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a5e490f6-35ad-455e-8ae4-2bae021583ff", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", + "\n", + "# Define a custom prompt to provide instructions and any additional context.\n", + "# 1) You can add examples to the prompt template to improve extraction quality.\n", + "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n", + "# about the document from which the text was extracted).\n", + "prompt = ChatPromptTemplate.from_messages(\n", + " [\n", + " (\n", + " \"system\",\n", + " \"You are an expert extraction algorithm. \"\n", + " \"Only extract relevant information from the text. \"\n", + " \"If you do not know the value of an attribute asked to extract, \"\n", + " \"return null for the attribute's value.\",\n", + " ),\n", + " # Please see the how-to about improving performance with\n", + " # reference examples.\n", + " # MessagesPlaceholder('examples'),\n", + " (\"human\", \"{text}\"),\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "832bf6a1-8e0c-4b6a-aa37-12fe9c42a6d9", + "metadata": {}, + "source": [ + "We need to use a model that supports function/tool calling.\n", + "\n", + "Please review [structured output](/docs/guides/structured_output) for a list of models that can be used with this API."
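The next cell uses Mistral. Swapping in any other tool-calling model only changes the `llm` line; for example, a sketch using OpenAI instead might look like the following (assuming `langchain-openai` is installed and `OPENAI_API_KEY` is set; the model name is just one plausible choice):

```python
from langchain_openai import ChatOpenAI

# Any chat model that supports tool/function calling can back the extractor.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Reuses the `prompt` and `Person` schema defined in the cells above.
runnable = prompt | llm.with_structured_output(schema=Person)
```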
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "04d846a6-d5cb-4009-ac19-61e3aac0177e", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_mistralai import ChatMistralAI\n", + "\n", + "llm = ChatMistralAI(model=\"mistral-large-latest\")\n", + "\n", + "runnable = prompt | llm.with_structured_output(schema=Person)" + ] + }, + { + "cell_type": "markdown", + "id": "23582c0b-00ed-403f-a10e-3aeabf921f12", + "metadata": {}, + "source": [ + "Let's test it out!" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "13165ac8-a1dc-44ce-a6ed-f52b577473e4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = \"Alan Smith is 6 feet tall and has blond hair.\"\n", + "runnable.invoke({\"text\": text})" + ] + }, + { + "cell_type": "markdown", + "id": "bd1c493d-f9dc-4236-8da9-50f6919f5710", + "metadata": {}, + "source": [ + ":::{.callout-important} \n", + "\n", + "Extraction is Generative 🤯\n", + "\n", + "LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters\n", + "even though it was provided in feet!\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "id": "28c5ef0c-b8d1-4e12-bd0e-e2528de87fcc", + "metadata": {}, + "source": [ + "## Multiple Entities\n", + "\n", + "In **most cases**, you should be extracting a list of entities rather than a single entity.\n", + "\n", + "This can be easily achieved with Pydantic by nesting models inside one another." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "591a0c16-7a17-4883-91ee-0d6d2fdb265c", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List, Optional\n", + "\n", + "from langchain_core.pydantic_v1 import BaseModel, Field\n", + "\n", + "\n", + "class Person(BaseModel):\n", + " \"\"\"Information about a person.\"\"\"\n", + "\n", + " # ^ Doc-string for the entity Person.\n", + " # This doc-string is sent to the LLM as the description of the schema Person,\n", + " # and it can help to improve extraction results.\n", + "\n", + " # Note that:\n", + " # 1. Each field is `Optional` -- this allows the model to decline to extract it!\n", + " # 2. Each field has a `description` -- this description is used by the LLM.\n", + " # Having a good description can help improve extraction results.\n", + " name: Optional[str] = Field(..., description=\"The name of the person\")\n", + " hair_color: Optional[str] = Field(\n", + " ..., description=\"The color of the person's hair if known\"\n", + " )\n", + " height_in_meters: Optional[str] = Field(..., description=\"Height in meters\")\n", + "\n", + "\n", + "class Data(BaseModel):\n", + " \"\"\"Extracted data about people.\"\"\"\n", + "\n", + " # Creates a model so that we can extract multiple entities.\n", + " people: List[Person]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "cf7062cc-1d1d-4a37-9122-509d1b87f0a6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='2'), Person(name='Anna', hair_color=None, height_in_meters=None)])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "runnable = prompt | llm.with_structured_output(schema=Data)\n", + "text = \"My name is Jeff and I am 2 meters. 
I have black hair. Anna has the same color hair as me.\"\n", + "runnable.invoke({\"text\": text})" + ] + }, + { + "cell_type": "markdown", + "id": "fba1d770-bf4d-4de4-9e4f-7384872ef0dc", + "metadata": {}, + "source": [ + ":::{.callout-tip}\n", + "When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** by returning\n", + "an empty list when the text contains no relevant information (a quick sanity check is sketched at the end of this page).\n", + "\n", + "This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect that entity.\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "id": "f07a7455-7de6-4a6f-9772-0477ef65e3dc", + "metadata": {}, + "source": [ + "## Next steps\n", + "\n", + "Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guides:\n", + "\n", + "- [Add Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n", + "- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n", + "- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n", + "- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.\n", + "- [Guidelines](/docs/use_cases/extraction/guidelines): Guidelines for getting good performance on extraction tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "082fc1af", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.1" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}
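As promised in the tip above, one way to sanity-check the multiple-entity behavior is to invoke the extractor on text that mentions no people; with the `Data` schema the model can satisfy the request with an empty list instead of hallucinating an entity. The input sentence below is made up, and the expected output assumes the model behaves as intended:

```python
# Hypothetical sanity check: no people are mentioned, so the model
# can satisfy the schema by returning an empty list of entities.
# Reuses the `runnable` built above with `schema=Data`.
text = "The solar system is large, but earth has only 1 moon."
runnable.invoke({"text": text})
# Expected: Data(people=[])
```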