pull/1147/merge
flippy 3 weeks ago committed by GitHub
commit ddabef816f
GPG Key ID: B5690EEEBB952194

@ -93,3 +93,8 @@ royziv11:
website: "https://www.linkedin.com/in/roy-ziv-a46001149/"
avatar: "https://media.licdn.com/dms/image/D5603AQHkaEOOGZWtbA/profile-displayphoto-shrink_400_400/0/1699500606122?e=1716422400&v=beta&t=wKEIx-vTEqm9wnqoC7-xr1WqJjghvcjjlMt034hXY_4"
FardinAhsan146:
name: "Fardin Ahsan"
website: "https://www.linkedin.com/in/fardin-ahsan/"
avatar: "https://raw.githubusercontent.com/FardinAhsan146/FardinAhsan146.github.io/main/img2.png"

@ -0,0 +1,326 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "ca3f397e-5fbb-4f4c-a191-4e6de2cd9d2d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json \n",
"import pandas as pd \n",
"from openai import OpenAI\n",
"pd.set_option('display.max_colwidth', None)\n",
"\n",
"try:\n",
" from dotenv import load_dotenv\n",
"except ImportError:\n",
" print(\"Installing python-dotenv package...\")\n",
" os.system('pip install python-dotenv')\n",
" from dotenv import load_dotenv\n",
"load_dotenv()\n",
"\n",
"client = OpenAI(api_key = os.environ['OPENAI_API_KEY'])"
]
},
{
"cell_type": "markdown",
"id": "a15e9742",
"metadata": {},
"source": [
"## Cleaning unformatted data using function calling \n",
"\n",
"User input data is often not formatted consistently. Data Analysts usually clean up the data using rule based parsing using string manipulation and regex. This process can get quite tedious as one has to account for all potential formats.\n",
"\n",
"Using a LLM to parse the data can significantly streamline data cleaning. \n",
"\n",
"In this notebook, we clean up some artificial user input data that includes order items from an imaginary ecommerce company and their respective dimensions. "
]
},
{
"cell_type": "markdown",
"id": "2e93edb2-e8f8-4d45-952d-62b7c136f0d9",
"metadata": {},
"source": [
"### The data \n",
"\n",
"The data does not adhere to any one format. To a human its parseable, but creating a set of rules that would be able to parse even 5 distinct formats is going to be difficult. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "12704164-228c-46d1-a89d-794421d55dcb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ticket_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>OI-842342, length = 73cm, width = 45cm, height = 55cm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>#Item-325364, 46x34x56 cm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>#OI-43253252, l-45cm,w-34cm,h-67cm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>#452453 34inx56cmx2ft</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>OrderItem#373578 96,56,23</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ticket_content\n",
"0 OI-842342, length = 73cm, width = 45cm, height = 55cm\n",
"1 #Item-325364, 46x34x56 cm\n",
"2 #OI-43253252, l-45cm,w-34cm,h-67cm\n",
"3 #452453 34inx56cmx2ft\n",
"4 OrderItem#373578 96,56,23 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('data/item_dimension_tickets.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "a14e80ce",
"metadata": {},
"source": [
"### The function call\n",
"\n",
"However, LLMs are smart enough to semantically parse messy text data. \n",
"\n",
"Function calling is a tool in the ChatCompletion endpoint. It allows you to specify the output format of the completion request as a JSON object. \n",
"\n",
"Lets define a python function that accepts a ticket and returns a dictionary of the contents from it that we want extracted. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ecddfe05-d9e8-4a75-ae4b-ef2fb2914fea",
"metadata": {},
"outputs": [],
"source": [
"def call_and_clean(text: str, model: str = \"gpt-4-turbo-preview\") -> dict:\n",
" \"\"\"\n",
" Use OpenAI function tool to extract order item id and dimensions with their units from user tickets.\n",
"\n",
" Args:\n",
" text (str): The text that you want to parse and clean.\n",
" model (str): The OpenAI model alias.\n",
"\n",
" Returns:\n",
" dict: The cleaned and parsed text output from the model.\n",
" \"\"\"\n",
"\n",
" # View this as a structured output format you want \n",
" tools = [{\"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"get_item_id_and_dimensions\",\n",
" \"description\": \"Gets the order item number and dimensions from a user ticket about an item.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"order_item_id\": {\n",
" \"type\": \"number\",\n",
" \"description\": \"The order item identifier, its only the number that usually follows a hashtag, or oi tag.\"\n",
" },\n",
" \"length\": { \"type\": \"number\" },\n",
" \"width\": { \"type\": \"number\" },\n",
" \"height\": { \"type\": \"number\" },\n",
" \"length_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] },\n",
" \"width_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] },\n",
" \"height_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] }\n",
" },\n",
" \"required\": [\"order_item_id\", \"length\", \"width\", \"height\", \"length_units\", \"width_units\", \"height_units\"]\n",
" }\n",
" }\n",
" }\n",
" ]\n",
" \n",
" system_prompt = \"\"\"\n",
" Given an item ticket, parse the ticket such that the get_item_id_and_dimensions function can be called for the contents of the ticket.\n",
"\n",
" If no dimensions are provided assume they are in cm. Sometimes the units might be spelled incorrectly, infer the unit in those cases.\n",
" \"\"\"\n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=[\n",
" {\"role\":\"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": f\"Item Ticket: {text}\"}\n",
" ],\n",
" tools=tools \n",
" )\n",
"\n",
" return json.loads(response.choices[0].message.tool_calls[0].function.arguments)"
]
},
{
"cell_type": "markdown",
"id": "d017fd74",
"metadata": {},
"source": [
"### Clean the data "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0d0adf8d",
"metadata": {},
"outputs": [],
"source": [
"df['parsed_ticket_content'] = df['ticket_content'].apply(call_and_clean)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "64c6899b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ticket_content</th>\n",
" <th>parsed_ticket_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>OI-842342, length = 73cm, width = 45cm, height = 55cm</td>\n",
" <td>{'height': 55, 'height_units': 'cm', 'length': 73, 'length_units': 'cm', 'order_item_id': 842342, 'width': 45, 'width_units': 'cm'}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>#Item-325364, 46x34x56 cm</td>\n",
" <td>{'height': 56, 'height_units': 'cm', 'length': 46, 'length_units': 'cm', 'order_item_id': 325364, 'width': 34, 'width_units': 'cm'}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>#OI-43253252, l-45cm,w-34cm,h-67cm</td>\n",
" <td>{'height': 67, 'height_units': 'cm', 'length': 45, 'length_units': 'cm', 'order_item_id': 43253252, 'width': 34, 'width_units': 'cm'}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>#452453 34inx56cmx2ft</td>\n",
" <td>{'height': 2, 'height_units': 'ft', 'length': 34, 'length_units': 'in', 'order_item_id': 452453, 'width': 56, 'width_units': 'cm'}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>OrderItem#373578 96,56,23</td>\n",
" <td>{'height': 23, 'height_units': 'cm', 'length': 96, 'length_units': 'cm', 'order_item_id': 373578, 'width': 56, 'width_units': 'cm'}</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ticket_content \\\n",
"0 OI-842342, length = 73cm, width = 45cm, height = 55cm \n",
"1 #Item-325364, 46x34x56 cm \n",
"2 #OI-43253252, l-45cm,w-34cm,h-67cm \n",
"3 #452453 34inx56cmx2ft \n",
"4 OrderItem#373578 96,56,23 \n",
"\n",
" parsed_ticket_content \n",
"0 {'height': 55, 'height_units': 'cm', 'length': 73, 'length_units': 'cm', 'order_item_id': 842342, 'width': 45, 'width_units': 'cm'} \n",
"1 {'height': 56, 'height_units': 'cm', 'length': 46, 'length_units': 'cm', 'order_item_id': 325364, 'width': 34, 'width_units': 'cm'} \n",
"2 {'height': 67, 'height_units': 'cm', 'length': 45, 'length_units': 'cm', 'order_item_id': 43253252, 'width': 34, 'width_units': 'cm'} \n",
"3 {'height': 2, 'height_units': 'ft', 'length': 34, 'length_units': 'in', 'order_item_id': 452453, 'width': 56, 'width_units': 'cm'} \n",
"4 {'height': 23, 'height_units': 'cm', 'length': 96, 'length_units': 'cm', 'order_item_id': 373578, 'width': 56, 'width_units': 'cm'} "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
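
The notebook output keeps each dimension in whatever unit the ticket used (row 3 mixes inches, centimetres, and feet). As a possible follow-up step, here is a minimal sketch that normalizes one parsed ticket to centimetres. It assumes the dictionary shape shown in the `parsed_ticket_content` column; `to_cm`, `CM_PER_UNIT`, and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of the notebook.

```python
# Hypothetical helper, not part of the notebook: normalize a parsed ticket to centimetres.
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "ft": 30.48}  # exact conversion factors

REQUIRED_KEYS = (
    "order_item_id", "length", "width", "height",
    "length_units", "width_units", "height_units",
)


def to_cm(parsed: dict) -> dict:
    """Return the order item id and its dimensions converted to centimetres."""
    # Fail loudly if the model omitted a field the schema marks as required.
    missing = [key for key in REQUIRED_KEYS if key not in parsed]
    if missing:
        raise ValueError(f"parsed ticket is missing keys: {missing}")

    result = {"order_item_id": int(parsed["order_item_id"])}
    for dim in ("length", "width", "height"):
        unit = parsed[f"{dim}_units"]
        if unit not in CM_PER_UNIT:
            raise ValueError(f"unexpected unit {unit!r} for {dim}")
        result[f"{dim}_cm"] = round(parsed[dim] * CM_PER_UNIT[unit], 2)
    return result


# Row 3 of the notebook output, which mixes inches, centimetres, and feet.
row = {"height": 2, "height_units": "ft", "length": 34, "length_units": "in",
       "order_item_id": 452453, "width": 56, "width_units": "cm"}
normalized = to_cm(row)
print(normalized)
```

Applied to the whole column, this could be used as `df['parsed_ticket_content'].apply(to_cm)`. Checking the keys and units before converting is a deliberate choice: the model's arguments are ordinary JSON, so nothing guarantees they match the schema.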

@ -0,0 +1,6 @@
ticket_content
"OI-842342, length = 73cm, width = 45cm, height = 55cm"
"#Item-325364, 46x34x56 cm"
"#OI-43253252, l-45cm,w-34cm,h-67cm"
#452453 34inx56cmx2ft
"OrderItem#373578 96,56,23 "

@ -1279,6 +1279,14 @@
- vision
- embeddings
- title: Cleaning inconsistently formatted user input data with function calling
path: examples/Clean_user_input_data_with_functions.ipynb
date: 2024-04-18
authors:
- FardinAhsan146
tags:
- functions
- title: Batch processing with the Batch API
path: examples/batch_processing.ipynb
date: 2024-04-24
@ -1286,4 +1294,5 @@
- katiagg
tags:
- batch
- completions
- completions
