langchain/docs/extras/use_cases/more/data_generation.ipynb
PaperMoose 5d7c6d1bca
Synthetic Data generation (#9472)
---------

Co-authored-by: William Fu-Hinthorn <13333726+hinthornw@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-09-28 18:16:05 -07:00

621 lines
21 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "aa3571cc",
"metadata": {},
"source": [
"# Synthetic Data generation\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_generation.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations. \n",
"\n",
"Benefits of Synthetic Data:\n",
"\n",
"1. **Privacy and Security**: No real personal data at risk of breaches.\n",
"2. **Data Augmentation**: Expands datasets for machine learning.\n",
"3. **Flexibility**: Create specific or rare scenarios.\n",
"4. **Cost-effective**: Often cheaper than real-world data collection.\n",
"5. **Regulatory Compliance**: Helps navigate strict data protection laws.\n",
"6. **Model Robustness**: Can lead to better generalizing AI models.\n",
"7. **Rapid Prototyping**: Enables quick testing without real data.\n",
"8. **Controlled Experimentation**: Simulate specific conditions.\n",
"9. **Access to Data**: Alternative when real data isn't available.\n",
"\n",
"Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.\n",
"\n",
"## Quickstart\n",
"\n",
"In this notebook, we'll dive deep into generating synthetic medical billing records using the langchain library. This tool is particularly useful when you want to develop or test algorithms but don't want to use real patient data due to privacy concerns or data availability issues."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bca57012",
"metadata": {},
"source": [
"### Setup\n",
"First, you'll need to have the langchain library installed, along with its dependencies. Since we're using the OpenAI generator chain, we'll install that as well. Since this is an experimental lib, we'll need to include `langchain_experimental` in our installs. We'll then import the necessary modules."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0377478",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U langchain langchain_experimental openai\n",
"# Set env var OPENAI_API_KEY or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_dotenv()\n",
"\n",
"from langchain.prompts import FewShotPromptTemplate, PromptTemplate\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.pydantic_v1 import BaseModel\n",
"from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator\n",
"from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator, OPENAI_TEMPLATE\n",
"from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a5a0917b",
"metadata": {},
"source": [
"## 1. Define Your Data Model\n",
"Every dataset has a structure or a \"schema\". The MedicalBilling class below serves as our schema for the synthetic data. By defining this, we're informing our synthetic data generator about the shape and nature of data we expect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "291bad6e",
"metadata": {},
"outputs": [],
"source": [
"class MedicalBilling(BaseModel):\n",
" patient_id: int\n",
" patient_name: str\n",
" diagnosis_code: str\n",
" procedure_code: str\n",
" total_charge: float\n",
" insurance_claim_amount: float\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2059ca63",
"metadata": {},
"source": [
"For instance, every record will have a `patient_id` that's an integer, a `patient_name` that's a string, and so on.\n",
"\n",
"## 2. Sample Data\n",
"To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a \"seed\" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.\n",
"\n",
"Here are some fictional medical billing records:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b989b792",
"metadata": {},
"outputs": [],
"source": [
"examples = [\n",
" {\"example\": \"\"\"Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: \n",
" J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350\"\"\"},\n",
" {\"example\": \"\"\"Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis \n",
" Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120\"\"\"},\n",
" {\"example\": \"\"\"Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: \n",
" E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250\"\"\"},\n",
"]\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "57e28809",
"metadata": {},
"source": [
"## 3. Craft a Prompt Template\n",
"The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea6e042e",
"metadata": {},
"outputs": [],
"source": [
"OPENAI_TEMPLATE = PromptTemplate(input_variables=[\"example\"], template=\"{example}\")\n",
"\n",
"prompt_template = FewShotPromptTemplate(\n",
" prefix=SYNTHETIC_FEW_SHOT_PREFIX,\n",
" examples=examples,\n",
" suffix=SYNTHETIC_FEW_SHOT_SUFFIX,\n",
" input_variables=[\"subject\", \"extra\"],\n",
" example_prompt=OPENAI_TEMPLATE,\n",
")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fa6da3cb",
"metadata": {},
"source": [
"The `FewShotPromptTemplate` includes:\n",
"\n",
"- `prefix` and `suffix`: These likely contain guiding context or instructions.\n",
"- `examples`: The sample data we defined earlier.\n",
"- `input_variables`: These variables (\"subject\", \"extra\") are placeholders you can dynamically fill later. For instance, \"subject\" might be filled with \"medical_billing\" to guide the model further.\n",
"- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.\n",
"\n",
"## 4. Creating the Data Generator\n",
"With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b9ba911",
"metadata": {},
"outputs": [],
"source": [
"synthetic_data_generator = create_openai_data_generator(\n",
" output_schema=MedicalBilling,\n",
" llm=ChatOpenAI(temperature=1), # You'll need to replace with your actual Language Model instance\n",
" prompt=prompt_template,\n",
")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a4198bd6",
"metadata": {},
"source": [
"## 5. Generate Synthetic Data\n",
"Finally, let's get our synthetic data!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a424c890",
"metadata": {},
"outputs": [],
"source": [
"synthetic_results = synthetic_data_generator.generate(\n",
" subject=\"medical_billing\",\n",
" extra=\"the name must be chosen at random. Make it something you wouldn't normally choose.\",\n",
" runs=10,\n",
")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fa4402e9",
"metadata": {},
"source": [
"This command asks the generator to produce 10 synthetic medical billing records. The results are stored in `synthetic_results`. The output will be a list of the MedicalBilling pydantic models."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "53a4cbf9",
"metadata": {},
"source": [
"### Other implementations\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e715d94",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain_experimental.synthetic_data import create_data_generation_chain, DatasetGenerator"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "94fccedd",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# LLM\n",
"model = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0.7)\n",
"chain = create_data_generation_chain(model)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4314c3ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'fields': ['blue', 'yellow'],\n",
" 'preferences': {},\n",
" 'text': 'The vibrant blue sky contrasted beautifully with the bright yellow sun, creating a stunning display of colors that instantly lifted the spirits of all who gazed upon it.'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain({\"fields\": [\"blue\", \"yellow\"], \"preferences\": {}})"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b116c487",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'fields': {'colors': ['blue', 'yellow']},\n",
" 'preferences': {'style': 'Make it in a style of a weather forecast.'},\n",
" 'text': \"Good morning! Today's weather forecast brings a beautiful combination of colors to the sky, with hues of blue and yellow gently blending together like a mesmerizing painting.\"}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain({\"fields\": {\"colors\": [\"blue\", \"yellow\"]}, \"preferences\": {\"style\": \"Make it in a style of a weather forecast.\"}})"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ff823394",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},\n",
" 'preferences': None,\n",
" 'text': 'Tom Hanks, the renowned actor known for his incredible versatility and charm, has graced the silver screen in unforgettable movies such as \"Forrest Gump\" and \"Green Mile\".'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain({\"fields\": {\"actor\": \"Tom Hanks\", \"movies\": [\"Forrest Gump\", \"Green Mile\"]}, \"preferences\": None})"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1ea1ad5b",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},\n",
" {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],\n",
" 'preferences': {'minimum_length': 200, 'style': 'gossip'},\n",
" 'text': 'Did you know that Tom Hanks, the beloved Hollywood actor known for his roles in \"Forrest Gump\" and \"Green Mile\", has shared the screen with the talented Mads Mikkelsen, who gained international acclaim for his performances in \"Hannibal\" and \"Another round\"? These two incredible actors have brought their exceptional skills and captivating charisma to the big screen, delivering unforgettable performances that have enthralled audiences around the world. Whether it\\'s Hanks\\' endearing portrayal of Forrest Gump or Mikkelsen\\'s chilling depiction of Hannibal Lecter, these movies have solidified their places in cinematic history, leaving a lasting impact on viewers and cementing their status as true icons of the silver screen.'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain(\n",
" {\n",
" \"fields\": [\n",
" {\"actor\": \"Tom Hanks\", \"movies\": [\"Forrest Gump\", \"Green Mile\"]},\n",
" {\"actor\": \"Mads Mikkelsen\", \"movies\": [\"Hannibal\", \"Another round\"]}\n",
" ],\n",
" \"preferences\": {\"minimum_length\": 200, \"style\": \"gossip\"}\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"id": "93c7a4bb",
"metadata": {},
"source": [
"As we can see created examples are diversified and possess information we wanted them to have. Also, their style reflects the given preferences quite well."
]
},
{
"cell_type": "markdown",
"id": "75f7f55a",
"metadata": {},
"source": [
"## Generating exemplary dataset for extraction benchmarking purposes"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "94e98bc4",
"metadata": {},
"outputs": [],
"source": [
"inp = [\n",
" {\n",
" 'Actor': 'Tom Hanks',\n",
" 'Film': [\n",
" 'Forrest Gump',\n",
" 'Saving Private Ryan',\n",
" 'The Green Mile',\n",
" 'Toy Story',\n",
" 'Catch Me If You Can']\n",
" },\n",
" {\n",
" 'Actor': 'Tom Hardy',\n",
" 'Film': [\n",
" 'Inception',\n",
" 'The Dark Knight Rises',\n",
" 'Mad Max: Fury Road',\n",
" 'The Revenant',\n",
" 'Dunkirk'\n",
" ]\n",
" }\n",
"]\n",
"\n",
"generator = DatasetGenerator(model, {\"style\": \"informal\", \"minimal length\": 500})\n",
"dataset = generator(inp)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "478eaca4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'fields': {'Actor': 'Tom Hanks',\n",
" 'Film': ['Forrest Gump',\n",
" 'Saving Private Ryan',\n",
" 'The Green Mile',\n",
" 'Toy Story',\n",
" 'Catch Me If You Can']},\n",
" 'preferences': {'style': 'informal', 'minimal length': 500},\n",
" 'text': 'Tom Hanks, the versatile and charismatic actor, has graced the silver screen in numerous iconic films including the heartwarming and inspirational \"Forrest Gump,\" the intense and gripping war drama \"Saving Private Ryan,\" the emotionally charged and thought-provoking \"The Green Mile,\" the beloved animated classic \"Toy Story,\" and the thrilling and captivating true story adaptation \"Catch Me If You Can.\" With his impressive range and genuine talent, Hanks continues to captivate audiences worldwide, leaving an indelible mark on the world of cinema.'},\n",
" {'fields': {'Actor': 'Tom Hardy',\n",
" 'Film': ['Inception',\n",
" 'The Dark Knight Rises',\n",
" 'Mad Max: Fury Road',\n",
" 'The Revenant',\n",
" 'Dunkirk']},\n",
" 'preferences': {'style': 'informal', 'minimal length': 500},\n",
" 'text': 'Tom Hardy, the versatile actor known for his intense performances, has graced the silver screen in numerous iconic films, including \"Inception,\" \"The Dark Knight Rises,\" \"Mad Max: Fury Road,\" \"The Revenant,\" and \"Dunkirk.\" Whether he\\'s delving into the depths of the subconscious mind, donning the mask of the infamous Bane, or navigating the treacherous wasteland as the enigmatic Max Rockatansky, Hardy\\'s commitment to his craft is always evident. From his breathtaking portrayal of the ruthless Eames in \"Inception\" to his captivating transformation into the ferocious Max in \"Mad Max: Fury Road,\" Hardy\\'s dynamic range and magnetic presence captivate audiences and leave an indelible mark on the world of cinema. In his most physically demanding role to date, he endured the harsh conditions of the freezing wilderness as he portrayed the rugged frontiersman John Fitzgerald in \"The Revenant,\" earning him critical acclaim and an Academy Award nomination. In Christopher Nolan\\'s war epic \"Dunkirk,\" Hardy\\'s stoic and heroic portrayal of Royal Air Force pilot Farrier showcases his ability to convey deep emotion through nuanced performances. With his chameleon-like ability to inhabit a wide range of characters and his unwavering commitment to his craft, Tom Hardy has undoubtedly solidified his place as one of the most talented and sought-after actors of his generation.'}]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset"
]
},
{
"cell_type": "markdown",
"id": "293a7d64",
"metadata": {},
"source": [
"## Extraction from generated examples\n",
"Okay, let's see if we can now extract output from this generated data and how it compares with our case!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "03c6a375",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.output_parsers import PydanticOutputParser\n",
"from langchain.chains import create_extraction_chain_pydantic, SimpleSequentialChain\n",
"from pydantic import BaseModel, Field\n",
"from typing import List"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "9461d225",
"metadata": {},
"outputs": [],
"source": [
"class Actor(BaseModel):\n",
" Actor: str = Field(description=\"name of an actor\")\n",
" Film: List[str] = Field(description=\"list of names of films they starred in\")"
]
},
{
"cell_type": "markdown",
"id": "8390171d",
"metadata": {},
"source": [
"### Parsers"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "8a5528d2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm = OpenAI()\n",
"parser = PydanticOutputParser(pydantic_object=Actor)\n",
"\n",
"prompt = PromptTemplate(\n",
" template=\"Extract fields from a given text.\\n{format_instructions}\\n{text}\\n\",\n",
" input_variables=[\"text\"],\n",
" partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(text=dataset[0][\"text\"])\n",
"output = llm(_input.to_string())\n",
"\n",
"parsed = parser.parse(output)\n",
"parsed"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "926a7eed",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(parsed.Actor == inp[0][\"Actor\"]) & (parsed.Film == inp[0][\"Film\"])"
]
},
{
"cell_type": "markdown",
"id": "b00f0b87",
"metadata": {},
"source": [
"### Extractors"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "523bb584",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=model)\n",
"extracted = extractor.run(dataset[1][\"text\"])\n",
"extracted"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "f8451c2b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(extracted[0].Actor == inp[1][\"Actor\"]) & (extracted[0].Film == inp[1][\"Film\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b03de4d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}