cleaning data with function calling notebook

pull/1147/head
FardinAhsan146 2 months ago
parent 555bbc0d34
commit cfaf8500e8

@ -0,0 +1,185 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 6,
"id": "ca3f397e-5fbb-4f4c-a191-4e6de2cd9d2d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd \n",
"from openai import OpenAI\n",
"\n",
"try:\n",
" from dotenv import load_dotenv\n",
"except ImportError:\n",
" print(\"Installing python-dotenv package...\")\n",
" os.system('pip install python-dotenv')\n",
" from dotenv import load_dotenv\n",
"load_dotenv()\n",
"\n",
"client = OpenAI(api_key = os.environ['OPENAI_API_KEY'])"
]
},
{
"cell_type": "markdown",
"id": "a15e9742",
"metadata": {},
"source": [
"## Cleaning unformatted data using function calling \n",
"\n",
"User input data is often not formatted consistently. Data Analysts usually clean up the data using rule based parsing using string manipulation and regex. This process can get quite tedious as one has to account for all potential formats.\n",
"\n",
"Using a LLM to parse the data can significantly streamline data cleaning. \n",
"\n",
"In this notebook, we clean up some artificial user input data that includes order items and their respective dimensions. "
]
},
{
"cell_type": "markdown",
"id": "2e93edb2-e8f8-4d45-952d-62b7c136f0d9",
"metadata": {},
"source": [
"### The data \n",
"\n",
"The data does not adhere to any one format. To a human its parseable, but creating a set of rules that would be able to parse even 5 distinct formats is going to be difficult."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "12704164-228c-46d1-a89d-794421d55dcb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ticket_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>OI-842342, length = 73cm, width = 45cm, height...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>#Item-325364, 46x34x56 cm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>#OI-43253252, l-45cm,w-34cm,h-67cm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>#452453 34inx56cmx2ft</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>OrderItem#373578 96,56,23</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ticket_content\n",
"0 OI-842342, length = 73cm, width = 45cm, height...\n",
"1 #Item-325364, 46x34x56 cm\n",
"2 #OI-43253252, l-45cm,w-34cm,h-67cm\n",
"3 #452453 34inx56cmx2ft\n",
"4 OrderItem#373578 96,56,23 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('data/item_dimension_tickets.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "a14e80ce",
"metadata": {},
"source": [
"### The function call "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ecddfe05-d9e8-4a75-ae4b-ef2fb2914fea",
"metadata": {},
"outputs": [],
"source": [
"def call_and_clean(text: str, model: str) -> str:\n",
" \"\"\"\n",
" args-- \n",
" text: The text that you want to parse and clean \n",
" model: The OpenAi model alias \n",
" \"\"\"\n",
" completion = client.chat.completions.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\", \"content\": \"Hello!\"}\n",
" ]\n",
" )\n",
"\n",
" return completion.choices[0].message"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e9e1568-8295-4a8c-b543-46651f85ced3",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,6 @@
ticket_content
"OI-842342, length = 73cm, width = 45cm, height = 55cm"
"#Item-325364, 46x34x56 cm"
"#OI-43253252, l-45cm,w-34cm,h-67cm"
#452453 34inx56cmx2ft
"OrderItem#373578 96,56,23 "
1 ticket_content
2 OI-842342, length = 73cm, width = 45cm, height = 55cm
3 #Item-325364, 46x34x56 cm
4 #OI-43253252, l-45cm,w-34cm,h-67cm
5 #452453 34inx56cmx2ft
6 OrderItem#373578 96,56,23
Loading…
Cancel
Save