diff --git a/authors.yaml b/authors.yaml index 54c948aa..5fda8ad9 100644 --- a/authors.yaml +++ b/authors.yaml @@ -93,3 +93,8 @@ royziv11: website: "https://www.linkedin.com/in/roy-ziv-a46001149/" avatar: "https://media.licdn.com/dms/image/D5603AQHkaEOOGZWtbA/profile-displayphoto-shrink_400_400/0/1699500606122?e=1716422400&v=beta&t=wKEIx-vTEqm9wnqoC7-xr1WqJjghvcjjlMt034hXY_4" +FardinAhsan146: + name: "Fardin Ahsan" + website: "https://www.linkedin.com/in/fardin-ahsan/" + avatar: "https://raw.githubusercontent.com/FardinAhsan146/FardinAhsan146.github.io/main/img2.png" + \ No newline at end of file diff --git a/examples/Clean_user_input_data_with_functions.ipynb b/examples/Clean_user_input_data_with_functions.ipynb new file mode 100644 index 00000000..e4801385 --- /dev/null +++ b/examples/Clean_user_input_data_with_functions.ipynb @@ -0,0 +1,326 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "ca3f397e-5fbb-4f4c-a191-4e6de2cd9d2d", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json \n", + "import pandas as pd \n", + "from openai import OpenAI\n", + "pd.set_option('display.max_colwidth', None)\n", + "\n", + "try:\n", + " from dotenv import load_dotenv\n", + "except ImportError:\n", + " print(\"Installing python-dotenv package...\")\n", + " os.system('pip install python-dotenv')\n", + " from dotenv import load_dotenv\n", + "load_dotenv()\n", + "\n", + "client = OpenAI(api_key = os.environ['OPENAI_API_KEY'])" + ] + }, + { + "cell_type": "markdown", + "id": "a15e9742", + "metadata": {}, + "source": [ + "## Cleaning unformatted data using function calling \n", + "\n", + "User input data is often formatted inconsistently. Data analysts usually clean it with rule-based parsing built on string manipulation and regex. This process can get quite tedious, as one has to account for every potential format.\n", + "\n", + "Using an LLM to parse the data can significantly streamline data cleaning. 
\n", + "\n", + "In this notebook, we clean up some artificial user input data that includes order items from an imaginary e-commerce company and their respective dimensions. " + ] + }, + { + "cell_type": "markdown", + "id": "2e93edb2-e8f8-4d45-952d-62b7c136f0d9", + "metadata": {}, + "source": [ + "### The data \n", + "\n", + "The data does not adhere to any one format. To a human it's parseable, but creating a set of rules that can parse even 5 distinct formats would be difficult. " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "12704164-228c-46d1-a89d-794421d55dcb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " ticket_content\n", + "0 OI-842342, length = 73cm, width = 45cm, height = 55cm\n", + "1 #Item-325364, 46x34x56 cm\n", + "2 #OI-43253252, l-45cm,w-34cm,h-67cm\n", + "3 #452453 34inx56cmx2ft\n", + "4 OrderItem#373578 96,56,23 " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv('data/item_dimension_tickets.csv')\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "a14e80ce", + "metadata": {}, + "source": [ + "### The function call\n", + "\n", + "However, LLMs are smart enough to semantically parse messy text data. \n", + "\n", + "Function calling is a tool in the Chat Completions endpoint. It lets you specify the output format of the completion request as a JSON object. \n", + "\n", + "Let's define a Python function that accepts a ticket and returns a dictionary of the contents we want extracted from it. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ecddfe05-d9e8-4a75-ae4b-ef2fb2914fea", + "metadata": {}, + "outputs": [], + "source": [ + "def call_and_clean(text: str, model: str = \"gpt-4-turbo-preview\") -> dict:\n", + " \"\"\"\n", + " Use OpenAI function calling to extract the order item id and the dimensions, with their units, from user tickets.\n", + "\n", + " Args:\n", + " text (str): The text that you want to parse and clean.\n", + " model (str): The OpenAI model alias.\n", + "\n", + " Returns:\n", + " dict: The cleaned and parsed text output from the model.\n", + " \"\"\"\n", + "\n", + " # View this as a structured output format you want \n", + " tools = [{\"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_item_id_and_dimensions\",\n", + " \"description\": \"Gets the order item number and dimensions from a user ticket about an item.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"order_item_id\": {\n", + " \"type\": \"number\",\n", + " \"description\": \"The order 
item identifier; it's only the number that usually follows a hashtag or an OI tag.\"\n", + " },\n", + " \"length\": { \"type\": \"number\" },\n", + " \"width\": { \"type\": \"number\" },\n", + " \"height\": { \"type\": \"number\" },\n", + " \"length_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] },\n", + " \"width_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] },\n", + " \"height_units\": { \"type\": \"string\", \"enum\": [\"cm\", \"in\", \"ft\"] }\n", + " },\n", + " \"required\": [\"order_item_id\", \"length\", \"width\", \"height\", \"length_units\", \"width_units\", \"height_units\"]\n", + " }\n", + " }\n", + " }\n", + " ]\n", + " \n", + " system_prompt = \"\"\"\n", + " Given an item ticket, parse the ticket such that the get_item_id_and_dimensions function can be called for the contents of the ticket.\n", + "\n", + " If no units are provided, assume the dimensions are in cm. Sometimes the units might be misspelled; infer the unit in those cases.\n", + " \"\"\"\n", + " response = client.chat.completions.create(\n", + " model=model,\n", + " messages=[\n", + " {\"role\":\"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": f\"Item Ticket: {text}\"}\n", + " ],\n", + " tools=tools \n", + " )\n", + "\n", + " return json.loads(response.choices[0].message.tool_calls[0].function.arguments)" + ] + }, + { + "cell_type": "markdown", + "id": "d017fd74", + "metadata": {}, + "source": [ + "### Clean the data " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "0d0adf8d", + "metadata": {}, + "outputs": [], + "source": [ + "df['parsed_ticket_content'] = df['ticket_content'].apply(call_and_clean)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "64c6899b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ticket_content \\\n", + "0 OI-842342, length = 73cm, width = 45cm, height = 55cm \n", + "1 #Item-325364, 46x34x56 cm \n", + "2 #OI-43253252, l-45cm,w-34cm,h-67cm \n", + "3 #452453 34inx56cmx2ft \n", + "4 OrderItem#373578 96,56,23 \n", + "\n", + " parsed_ticket_content \n", + "0 {'height': 55, 'height_units': 'cm', 'length': 73, 'length_units': 'cm', 'order_item_id': 842342, 'width': 45, 'width_units': 'cm'} \n", + "1 {'height': 56, 'height_units': 'cm', 'length': 46, 'length_units': 'cm', 'order_item_id': 325364, 'width': 34, 'width_units': 'cm'} \n", + "2 {'height': 67, 'height_units': 'cm', 'length': 45, 'length_units': 'cm', 'order_item_id': 43253252, 'width': 34, 'width_units': 'cm'} \n", + "3 {'height': 2, 'height_units': 'ft', 'length': 34, 'length_units': 'in', 'order_item_id': 452453, 'width': 56, 'width_units': 'cm'} \n", + "4 {'height': 23, 'height_units': 'cm', 'length': 96, 'length_units': 'cm', 'order_item_id': 373578, 'width': 56, 'width_units': 'cm'} " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/data/item_dimension_tickets.csv b/examples/data/item_dimension_tickets.csv new file mode 100644 index 00000000..e4099300 --- /dev/null +++ b/examples/data/item_dimension_tickets.csv @@ -0,0 +1,6 @@ +ticket_content +"OI-842342, length = 73cm, width = 45cm, height = 55cm" +"#Item-325364, 46x34x56 cm" +"#OI-43253252, l-45cm,w-34cm,h-67cm" +#452453 34inx56cmx2ft +"OrderItem#373578 96,56,23 " diff --git 
a/registry.yaml b/registry.yaml index b7baa0a7..28f0f0ab 100644 --- a/registry.yaml +++ b/registry.yaml @@ -1279,6 +1279,14 @@ - vision - embeddings +- title: Cleaning inconsistently formatted user input data with function calling + path: examples/Clean_user_input_data_with_functions.ipynb + date: 2024-04-18 + authors: + - FardinAhsan146 + tags: + - functions + - title: Batch processing with the Batch API path: examples/batch_processing.ipynb date: 2024-04-24 @@ -1286,4 +1294,5 @@ - katiagg tags: - batch - - completions \ No newline at end of file + - completions +
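
A possible follow-up to the notebook added in this diff: the parsed tickets still mix `cm`, `in`, and `ft`, so a downstream step would probably want one unit. The sketch below is not part of the PR. `normalize_to_cm` and `CM_PER_UNIT` are hypothetical names, and it assumes each parsed dictionary has exactly the keys listed under `required` in the tool schema.

```python
# Conversion factors to centimeters (1 in = 2.54 cm and 1 ft = 30.48 cm by definition).
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "ft": 30.48}


def normalize_to_cm(parsed: dict) -> dict:
    """Hypothetical helper: convert one parsed ticket to centimeters.

    Assumes `parsed` has the keys required by the tool schema:
    order_item_id, length, width, height, and the three *_units fields.
    """
    result = {"order_item_id": int(parsed["order_item_id"])}
    for dim in ("length", "width", "height"):
        unit = parsed[f"{dim}_units"]
        if unit not in CM_PER_UNIT:
            # The schema restricts units to an enum, but model output is worth checking.
            raise ValueError(f"Unexpected unit {unit!r} for {dim}")
        result[f"{dim}_cm"] = round(parsed[dim] * CM_PER_UNIT[unit], 2)
    return result


# Row 3 of the notebook output, which mixes all three units.
row = {"height": 2, "height_units": "ft", "length": 34, "length_units": "in",
       "order_item_id": 452453, "width": 56, "width_units": "cm"}
print(normalize_to_cm(row))
```

In the notebook this could be applied with `df['parsed_ticket_content'].apply(normalize_to_cm)`. Raising on an unknown unit rather than guessing keeps a malformed model response from passing through silently.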