{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering for Transaction Classification\n", "\n", "This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that distinguish those clusters human-readable, and that is where we'll use GPT-3 to generate meaningful cluster descriptions for us. We can then use these to apply labels to a previously unlabelled dataset.\n", "\n", "To feed the model we use embeddings created using the approach shown in the [Multiclass classification for transactions notebook](Multiclass_classification_for_transactions.ipynb), applied to the full 359 transactions in the dataset to give us a larger pool for learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# optional: load environment variables (e.g. OPENAI_API_KEY) from a .env file\n", "from dotenv import load_dotenv\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# imports\n", "\n", "from openai import OpenAI\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.cluster import KMeans\n", "from sklearn.manifold import TSNE\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import os\n", "from ast import literal_eval\n", "\n", "# The OpenAI client reads OPENAI_API_KEY from the environment\n", "client = OpenAI()\n", "COMPLETIONS_MODEL = \"gpt-3.5-turbo\"\n", "\n", "# This path leads to a file with the data and precomputed embeddings\n", "embedding_path = \"data/library_transactions_with_embeddings_359.csv\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering\n", "\n", "We'll reuse the approach from the [Clustering 
Notebook](Clustering.ipynb), using K-Means to cluster our dataset on the feature embeddings we created previously. We'll then use the Chat Completions endpoint to generate cluster descriptions for us and judge their effectiveness." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |\n", "|---|---|---|---|---|---|---|---|\n", "| 0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,... |\n", "| 1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,... |\n", "| 2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Descripti... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |\n", "| 3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvi... | 113 | [-0.004776035435497761, -0.005533686839044094,... |\n", "| 4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Descri... | 117 | [0.003290407592430711, -0.0073441751301288605,... |\n", "