mirror of
https://github.com/openai/openai-cookbook
synced 2024-11-04 06:00:33 +00:00
2c441ab9a2
Co-authored-by: ayush rajgor <ayushrajgorar@gmail.com>
464 lines
83 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Clustering for Transaction Classification\n",
|
|
"\n",
|
|
"This notebook covers use cases where your data is unlabelled but has features that can be used to cluster them into meaningful categories. The challenge with clustering is making the features that make those clusters stand out human-readable, and that is where we'll look to use GPT-3 to generate meaningful cluster descriptions for us. We can then use these to apply labels to a previously unlabelled dataset.\n",
|
|
"\n",
|
|
"To feed the model we use embeddings created using the approach displayed in the notebook [Multiclass classification for transactions Notebook](Multiclass_classification_for_transactions.ipynb), applied to the full 359 transactions in the dataset to give us a bigger pool for learning"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"True"
|
|
]
|
|
},
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# optional env import\n",
|
|
"from dotenv import load_dotenv\n",
|
|
"load_dotenv()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# imports\n",
|
|
" \n",
|
|
"from openai import OpenAI\n",
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"from sklearn.cluster import KMeans\n",
|
|
"from sklearn.manifold import TSNE\n",
|
|
"import matplotlib\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import os\n",
|
|
"from ast import literal_eval\n",
|
|
"\n",
|
|
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n",
|
|
"COMPLETIONS_MODEL = \"gpt-3.5-turbo\"\n",
|
|
"\n",
|
|
"# This path leads to a file with data and precomputed embeddings\n",
|
|
"embedding_path = \"data/library_transactions_with_embeddings_359.csv\"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Clustering\n",
|
|
"\n",
|
|
"We'll reuse the approach from the [Clustering Notebook](Clustering.ipynb), using K-Means to cluster our dataset using the feature embeddings we created previously. We'll then use the Completions endpoint to generate cluster descriptions for us and judge their effectiveness"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Date</th>\n",
|
|
" <th>Supplier</th>\n",
|
|
" <th>Description</th>\n",
|
|
" <th>Transaction value (£)</th>\n",
|
|
" <th>combined</th>\n",
|
|
" <th>n_tokens</th>\n",
|
|
" <th>embedding</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>21/04/2016</td>\n",
|
|
" <td>M & J Ballantyne Ltd</td>\n",
|
|
" <td>George IV Bridge Work</td>\n",
|
|
" <td>35098.0</td>\n",
|
|
" <td>Supplier: M & J Ballantyne Ltd; Description: G...</td>\n",
|
|
" <td>118</td>\n",
|
|
" <td>[-0.013169967569410801, -0.004833734128624201,...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>26/04/2016</td>\n",
|
|
" <td>Private Sale</td>\n",
|
|
" <td>Literary & Archival Items</td>\n",
|
|
" <td>30000.0</td>\n",
|
|
" <td>Supplier: Private Sale; Description: Literary ...</td>\n",
|
|
" <td>114</td>\n",
|
|
" <td>[-0.019571533426642418, -0.010801066644489765,...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30/04/2016</td>\n",
|
|
" <td>City Of Edinburgh Council</td>\n",
|
|
" <td>Non Domestic Rates</td>\n",
|
|
" <td>40800.0</td>\n",
|
|
" <td>Supplier: City Of Edinburgh Council; Descripti...</td>\n",
|
|
" <td>114</td>\n",
|
|
" <td>[-0.0054041435942053795, -6.548957026097924e-0...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>09/05/2016</td>\n",
|
|
" <td>Computacenter Uk</td>\n",
|
|
" <td>Kelvin Hall</td>\n",
|
|
" <td>72835.0</td>\n",
|
|
" <td>Supplier: Computacenter Uk; Description: Kelvi...</td>\n",
|
|
" <td>113</td>\n",
|
|
" <td>[-0.004776035435497761, -0.005533686839044094,...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>09/05/2016</td>\n",
|
|
" <td>John Graham Construction Ltd</td>\n",
|
|
" <td>Causewayside Refurbishment</td>\n",
|
|
" <td>64361.0</td>\n",
|
|
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
|
|
" <td>117</td>\n",
|
|
" <td>[0.003290407592430711, -0.0073441751301288605,...</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Date Supplier Description \\\n",
|
|
"0 21/04/2016 M & J Ballantyne Ltd George IV Bridge Work \n",
|
|
"1 26/04/2016 Private Sale Literary & Archival Items \n",
|
|
"2 30/04/2016 City Of Edinburgh Council Non Domestic Rates \n",
|
|
"3 09/05/2016 Computacenter Uk Kelvin Hall \n",
|
|
"4 09/05/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
|
|
"\n",
|
|
" Transaction value (£) combined \\\n",
|
|
"0 35098.0 Supplier: M & J Ballantyne Ltd; Description: G... \n",
|
|
"1 30000.0 Supplier: Private Sale; Description: Literary ... \n",
|
|
"2 40800.0 Supplier: City Of Edinburgh Council; Descripti... \n",
|
|
"3 72835.0 Supplier: Computacenter Uk; Description: Kelvi... \n",
|
|
"4 64361.0 Supplier: John Graham Construction Ltd; Descri... \n",
|
|
"\n",
|
|
" n_tokens embedding \n",
|
|
"0 118 [-0.013169967569410801, -0.004833734128624201,... \n",
|
|
"1 114 [-0.019571533426642418, -0.010801066644489765,... \n",
|
|
"2 114 [-0.0054041435942053795, -6.548957026097924e-0... \n",
|
|
"3 113 [-0.004776035435497761, -0.005533686839044094,... \n",
|
|
"4 117 [0.003290407592430711, -0.0073441751301288605,... "
|
|
]
|
|
},
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df = pd.read_csv(embedding_path)\n",
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(359, 1536)"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"embedding_df = pd.read_csv(embedding_path)\n",
|
|
"embedding_df[\"embedding\"] = embedding_df.embedding.apply(literal_eval).apply(np.array)\n",
|
|
"matrix = np.vstack(embedding_df.embedding.values)\n",
|
|
"matrix.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"n_clusters = 5\n",
|
|
"\n",
|
|
"kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", random_state=42, n_init=10)\n",
|
|
"kmeans.fit(matrix)\n",
|
|
"labels = kmeans.labels_\n",
|
|
"embedding_df[\"Cluster\"] = labels"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
},
|
|
{
|
|
"data": {
|
|
"image/png": "",
|
|
"text/plain": [
|
|
"<Figure size 640x480 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"tsne = TSNE(\n",
|
|
" n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200\n",
|
|
")\n",
|
|
"vis_dims2 = tsne.fit_transform(matrix)\n",
|
|
"\n",
|
|
"x = [x for x, y in vis_dims2]\n",
|
|
"y = [y for x, y in vis_dims2]\n",
|
|
"\n",
|
|
"for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\",\"yellow\"]):\n",
|
|
" xs = np.array(x)[embedding_df.Cluster == category]\n",
|
|
" ys = np.array(y)[embedding_df.Cluster == category]\n",
|
|
" plt.scatter(xs, ys, color=color, alpha=0.3)\n",
|
|
"\n",
|
|
" avg_x = xs.mean()\n",
|
|
" avg_y = ys.mean()\n",
|
|
"\n",
|
|
" plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n",
|
|
"plt.title(\"Clusters identified visualized in language 2d using t-SNE\")\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 30,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Cluster 0 Theme:\n",
|
|
"\n",
|
|
"The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.\n",
|
|
"\n",
|
|
"\n",
|
|
"EDF ENERGY, Electricity Oct 2019 3 buildings\n",
|
|
"City Of Edinburgh Council, Non Domestic Rates \n",
|
|
"EDF, Electricity\n",
|
|
"EX LIBRIS, IT equipment\n",
|
|
"City Of Edinburgh Council, Non Domestic Rates \n",
|
|
"CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place\n",
|
|
"EDF Energy, Electricity\n",
|
|
"XMA Scotland Ltd, IT equipment\n",
|
|
"Computer Centre UK Ltd, Computer equipment\n",
|
|
"ARNOLD CLARK, Purchase of an electric van\n",
|
|
"----------------------------------------------------------------------------------------------------\n",
|
|
"\n",
|
|
"\n",
|
|
"Cluster 1 Theme:\n",
|
|
"\n",
|
|
"The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.\n",
|
|
"\n",
|
|
"\n",
|
|
"Institute of Conservation, This payment covers 2 invoices for student bursary costs\n",
|
|
"PRIVATE SALE, Collection of papers of an individual\n",
|
|
"LEE BOYD LIMITED, Architectural Works\n",
|
|
"ALDL, Legal Deposit Services\n",
|
|
"RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray\n",
|
|
"ADAM MATTHEW DIGITAL LTD, Resource - slavery abolution and social justice\n",
|
|
"PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items\n",
|
|
"LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20\n",
|
|
"ALDL, ALDL Charges\n",
|
|
"Private Sale, Literary & Archival Items\n",
|
|
"----------------------------------------------------------------------------------------------------\n",
|
|
"\n",
|
|
"\n",
|
|
"Cluster 2 Theme:\n",
|
|
"\n",
|
|
"The common theme among these transactions is that they all involve spending money at Kelvin Hall.\n",
|
|
"\n",
|
|
"\n",
|
|
"CBRE, Kelvin Hall\n",
|
|
"GLASGOW CITY COUNCIL, Kelvin Hall\n",
|
|
"University Of Glasgow, Kelvin Hall\n",
|
|
"GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall\n",
|
|
"Computacenter Uk, Kelvin Hall\n",
|
|
"XMA Scotland Ltd, Kelvin Hall\n",
|
|
"GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19\n",
|
|
"Glasgow Life, Kelvin Hall Service Charges\n",
|
|
"Glasgow City Council, Kelvin Hall\n",
|
|
"GLASGOW LIFE, Quarterly service charge KH\n",
|
|
"----------------------------------------------------------------------------------------------------\n",
|
|
"\n",
|
|
"\n",
|
|
"Cluster 3 Theme:\n",
|
|
"\n",
|
|
"The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.\n",
|
|
"\n",
|
|
"\n",
|
|
"ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees\n",
|
|
"ECG FACILITIES SERVICE, Facilities Management Charge\n",
|
|
"ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties\n",
|
|
"ECG Facilities Service, Facilities Management Charge\n",
|
|
"ECG FACILITIES SERVICE, Maintenance contract - October\n",
|
|
"ECG FACILITIES SERVICE, Electrical and mechanical works\n",
|
|
"ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees\n",
|
|
"ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling\n",
|
|
"ECG Facilities Service, Facilities Management Charge\n",
|
|
"ECG Facilities Service, Facilities Management Charge\n",
|
|
"----------------------------------------------------------------------------------------------------\n",
|
|
"\n",
|
|
"\n",
|
|
"Cluster 4 Theme:\n",
|
|
"\n",
|
|
"The common theme among these transactions is that they all involve construction or refurbishment work.\n",
|
|
"\n",
|
|
"\n",
|
|
"M & J Ballantyne Ltd, George IV Bridge Work\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"ARTHUR MCKAY BUILDING SERVICES, Causewayside Work\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"Morris & Spottiswood Ltd, George IV Bridge Work\n",
|
|
"ECG FACILITIES SERVICE, Causewayside IT Work\n",
|
|
"John Graham Construction Ltd, Causewayside Refurbishment\n",
|
|
"----------------------------------------------------------------------------------------------------\n",
|
|
"\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# We'll read 10 transactions per cluster as we're expecting some variation\n",
|
|
"transactions_per_cluster = 10\n",
|
|
"\n",
|
|
"for i in range(n_clusters):\n",
|
|
" print(f\"Cluster {i} Theme:\\n\")\n",
|
|
"\n",
|
|
" transactions = \"\\n\".join(\n",
|
|
" embedding_df[embedding_df.Cluster == i]\n",
|
|
" .combined.str.replace(\"Supplier: \", \"\")\n",
|
|
" .str.replace(\"Description: \", \": \")\n",
|
|
" .str.replace(\"Value: \", \": \")\n",
|
|
" .sample(transactions_per_cluster, random_state=42)\n",
|
|
" .values\n",
|
|
" )\n",
|
|
" response = client.chat.completions.create(\n",
|
|
" model=COMPLETIONS_MODEL,\n",
|
|
" # We'll include a prompt to instruct the model what sort of description we're looking for\n",
|
|
" messages=[\n",
|
|
" {\"role\": \"user\",\n",
|
|
" \"content\": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money. \n",
|
|
" What do the following transactions have in common?\\n\\nTransactions:\\n\"\"\"\\n{transactions}\\n\"\"\"\\n\\nTheme:'''}\n",
|
|
" ],\n",
|
|
" temperature=0,\n",
|
|
" max_tokens=100,\n",
|
|
" top_p=1,\n",
|
|
" frequency_penalty=0,\n",
|
|
" presence_penalty=0,\n",
|
|
" )\n",
|
|
" print(response.choices[0].message.content.replace(\"\\n\", \"\"))\n",
|
|
" print(\"\\n\")\n",
|
|
"\n",
|
|
" sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)\n",
|
|
" for j in range(transactions_per_cluster):\n",
|
|
" print(sample_cluster_rows.Supplier.values[j], end=\", \")\n",
|
|
" print(sample_cluster_rows.Description.values[j], end=\"\\n\")\n",
|
|
"\n",
|
|
" print(\"-\" * 100)\n",
|
|
" print(\"\\n\")\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Conclusion\n",
|
|
"\n",
|
|
"We now have five new clusters that we can use to describe our data. Looking at the visualisation some of our clusters have some overlap and we'll need some tuning to get to the right place, but already we can see that GPT-3 has made some effective inferences. In particular, it picked up that items including legal deposits were related to literature archival, which is true but the model was given no clues on. Very cool, and with some tuning we can create a base set of clusters that we can then use with a multiclass classifier to generalise to other transactional datasets we might use."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.6"
|
|
},
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|