openai-cookbook/examples/Multiclass_classification_for_transactions.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiclass Classification for Transactions\n",
"\n",
"For this notebook we will be looking to classify a public dataset of transactions into a number of categories that we have predefined. These approaches should be replicable to any multiclass classificaiton use case where we are trying to fit transactional data into predefined categories, and by the end of running through this you should have a few approaches for dealing with both labelled and unlabelled datasets.\n",
"\n",
"The different approaches we'll be taking in this notebook are:\n",
"- **Zero-shot Classification:** First we'll do zero shot classification to put transactions in one of five named buckets using only a prompt for guidance\n",
"- **Classification with Embeddings:** Following this we'll create embeddings on a labelled dataset, and then use a traditional classification model to test their effectiveness at identifying our categories\n",
"- **Fine-tuned Classification:** Lastly we'll produce a fine-tuned model trained on our labelled dataset to see how this compares to the zero-shot and few-shot classification approaches"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload "
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"import os\n",
"\n",
"openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"COMPLETIONS_MODEL = \"text-davinci-002\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load dataset\n",
"\n",
"We're using a public transaction dataset of transactions over £25k for the Library of Scotland. The dataset has three features that we'll be using:\n",
"- Supplier: The name of the supplier\n",
"- Description: A text description of the transaction\n",
"- Value: The value of the transaction in GBP\n",
"\n",
"**Source**:\n",
"\n",
"https://data.nls.uk/data/organisational-data/transactions-over-25k/"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"359"
]
},
"execution_count": 312,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transactions = pd.read_csv('./data/25000_spend_dataset_current.csv', encoding= 'unicode_escape')\n",
"len(transactions)"
]
},
{
"cell_type": "code",
"execution_count": 313,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>21/04/2016</td>\n",
" <td>M &amp; J Ballantyne Ltd</td>\n",
" <td>George IV Bridge Work</td>\n",
" <td>35098.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>26/04/2016</td>\n",
" <td>Private Sale</td>\n",
" <td>Literary &amp; Archival Items</td>\n",
" <td>30000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30/04/2016</td>\n",
" <td>City Of Edinburgh Council</td>\n",
" <td>Non Domestic Rates</td>\n",
" <td>40800.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>09/05/2016</td>\n",
" <td>Computacenter Uk</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>72835.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>09/05/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>64361.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"0 21/04/2016 M & J Ballantyne Ltd George IV Bridge Work \n",
"1 26/04/2016 Private Sale Literary & Archival Items \n",
"2 30/04/2016 City Of Edinburgh Council Non Domestic Rates \n",
"3 09/05/2016 Computacenter Uk Kelvin Hall \n",
"4 09/05/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"\n",
" Transaction value (£) \n",
"0 35098.0 \n",
"1 30000.0 \n",
"2 40800.0 \n",
"3 72835.0 \n",
"4 64361.0 "
]
},
"execution_count": 313,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transactions.head()"
]
},
{
"cell_type": "code",
"execution_count": 335,
"metadata": {},
"outputs": [],
"source": [
"def request_completion(prompt):\n",
" \n",
" completion_response = openai.Completion.create(\n",
" prompt=prompt,\n",
" temperature=0,\n",
" max_tokens=5,\n",
" top_p=1,\n",
" frequency_penalty=0,\n",
" presence_penalty=0,\n",
" model=COMPLETIONS_MODEL\n",
" )\n",
" \n",
" return completion_response\n",
"\n",
"def classify_transaction(transaction,prompt):\n",
" \n",
" prompt = prompt.replace('SUPPLIER_NAME',transaction['Supplier'])\n",
" prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])\n",
" prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))\n",
" \n",
" classification = request_completion(prompt)['choices'][0]['text'].replace('\\n','')\n",
" \n",
" return classification\n",
"\n",
"# This function takes your training and validation outputs from the prepare_data function of the Finetuning API, and\n",
"# confirms that each have the same number of classes.\n",
"# If they do not have the same number of classes the fine-tune will fail and return an error\n",
"\n",
"def check_finetune_classes(train_file,valid_file):\n",
"\n",
" train_classes = set()\n",
" valid_classes = set()\n",
" with open(train_file, 'r') as json_file:\n",
" json_list = list(json_file)\n",
" print(len(json_list))\n",
"\n",
" for json_str in json_list:\n",
" result = json.loads(json_str)\n",
" train_classes.add(result['completion'])\n",
" #print(f\"result: {result['completion']}\")\n",
" #print(isinstance(result, dict))\n",
"\n",
" with open(valid_file, 'r') as json_file:\n",
" json_list = list(json_file)\n",
" print(len(json_list))\n",
"\n",
" for json_str in json_list:\n",
" result = json.loads(json_str)\n",
" valid_classes.add(result['completion'])\n",
" #print(f\"result: {result['completion']}\")\n",
" #print(isinstance(result, dict))\n",
" \n",
" if len(train_classes) == len(valid_classes):\n",
" print('All good')\n",
" \n",
" else:\n",
" print('Classes do not match, please prepare data again')"
]
},
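{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the helper above, we can run `check_finetune_classes` on two tiny hand-written JSONL files in the same prompt/completion format (the file names and records below are made up purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write two small JSONL files in the prepare_data prompt/completion format\n",
"sample_train = [{'prompt': 'Supplier: A; Description: B; Value: 1\\n\\n###\\n\\n', 'completion': ' 0'},\n",
"                {'prompt': 'Supplier: C; Description: D; Value: 2\\n\\n###\\n\\n', 'completion': ' 1'}]\n",
"sample_valid = [{'prompt': 'Supplier: E; Description: F; Value: 3\\n\\n###\\n\\n', 'completion': ' 0'},\n",
"                {'prompt': 'Supplier: G; Description: H; Value: 4\\n\\n###\\n\\n', 'completion': ' 1'}]\n",
"\n",
"with open('sample_train.jsonl', 'w') as f:\n",
"    f.writelines(json.dumps(row) + '\\n' for row in sample_train)\n",
"with open('sample_valid.jsonl', 'w') as f:\n",
"    f.writelines(json.dumps(row) + '\\n' for row in sample_valid)\n",
"\n",
"check_finetune_classes('sample_train.jsonl', 'sample_valid.jsonl')"
]
},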
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zero-shot Classification\n",
"\n",
"We'll first assess the performance of the base models at classifying these transactions using a simple prompt. We'll provide the model with 5 categories and a catch-all of \"Could not classify\" for ones that it cannot place."
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [],
"source": [
"zero_shot_prompt = '''You are a data expert working for the National Library of Scotland. \n",
"You are analysing all transactions over £25,000 in value and classifying them into one of five categories.\n",
"The five categories are Building Improvement, Literature & Archive, Utility Bills, Professional Services and Software/IT.\n",
"If you can't tell what it is, say Could not classify\n",
" \n",
"Transaction:\n",
" \n",
"Supplier: SUPPLIER_NAME\n",
"Description: DESCRIPTION_TEXT\n",
"Value: TRANSACTION_VALUE\n",
" \n",
"The classification is:'''"
]
},
{
"cell_type": "code",
"execution_count": 315,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Building Improvement\n"
]
}
],
"source": [
"# Get a test transaction\n",
"transaction = transactions.iloc[0]\n",
"\n",
"# Interpolate the values into the prompt\n",
"prompt = zero_shot_prompt.replace('SUPPLIER_NAME',transaction['Supplier'])\n",
"prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])\n",
"prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))\n",
"\n",
"# Use our completion function to return a prediction\n",
"completion_response = request_completion(prompt)\n",
"print(completion_response['choices'][0]['text'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our first attempt is correct, M & J Ballantyne Ltd are a house builder and the work they performed is indeed Building Improvement.\n",
"\n",
"Lets expand the sample size to 25 and see how it performs, again with just a simple prompt to guide it"
]
},
{
"cell_type": "code",
"execution_count": 291,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" \n"
]
}
],
"source": [
"test_transactions = transactions.iloc[:25]\n",
"test_transactions['Classification'] = test_transactions.apply(lambda x: classify_transaction(x,zero_shot_prompt),axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 292,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Building Improvement 14\n",
" Could not classify 5\n",
" Literature & Archive 3\n",
" Software/IT 2\n",
" Utility Bills 1\n",
"Name: Classification, dtype: int64"
]
},
"execution_count": 292,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_transactions['Classification'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>21/04/2016</td>\n",
" <td>M &amp; J Ballantyne Ltd</td>\n",
" <td>George IV Bridge Work</td>\n",
" <td>35098.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>26/04/2016</td>\n",
" <td>Private Sale</td>\n",
" <td>Literary &amp; Archival Items</td>\n",
" <td>30000.0</td>\n",
" <td>Literature &amp; Archive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30/04/2016</td>\n",
" <td>City Of Edinburgh Council</td>\n",
" <td>Non Domestic Rates</td>\n",
" <td>40800.0</td>\n",
" <td>Utility Bills</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>09/05/2016</td>\n",
" <td>Computacenter Uk</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>72835.0</td>\n",
" <td>Software/IT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>09/05/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>64361.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>09/05/2016</td>\n",
" <td>A McGillivray</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>53690.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>16/05/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>365344.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>23/05/2016</td>\n",
" <td>Computacenter Uk</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26506.0</td>\n",
" <td>Software/IT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>23/05/2016</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>32777.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>23/05/2016</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>32777.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>30/05/2016</td>\n",
" <td>ALDL</td>\n",
" <td>ALDL Charges</td>\n",
" <td>32317.0</td>\n",
" <td>Could not classify</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>10/06/2016</td>\n",
" <td>Wavetek Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>87589.0</td>\n",
" <td>Could not classify</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>10/06/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>381803.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>28/06/2016</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>32832.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>30/06/2016</td>\n",
" <td>Glasgow City Council</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>1700000.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>11/07/2016</td>\n",
" <td>Wavetek Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>65692.0</td>\n",
" <td>Could not classify</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>11/07/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>139845.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>15/07/2016</td>\n",
" <td>Sotheby'S</td>\n",
" <td>Literary &amp; Archival Items</td>\n",
" <td>28500.0</td>\n",
" <td>Literature &amp; Archive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>18/07/2016</td>\n",
" <td>Christies</td>\n",
" <td>Literary &amp; Archival Items</td>\n",
" <td>33800.0</td>\n",
" <td>Literature &amp; Archive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>25/07/2016</td>\n",
" <td>A McGillivray</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>30113.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>31/07/2016</td>\n",
" <td>ALDL</td>\n",
" <td>ALDL Charges</td>\n",
" <td>32317.0</td>\n",
" <td>Could not classify</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>08/08/2016</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>32795.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866.0</td>\n",
" <td>Could not classify</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>15/08/2016</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>196807.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>24/08/2016</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>32795.0</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"0 21/04/2016 M & J Ballantyne Ltd George IV Bridge Work \n",
"1 26/04/2016 Private Sale Literary & Archival Items \n",
"2 30/04/2016 City Of Edinburgh Council Non Domestic Rates \n",
"3 09/05/2016 Computacenter Uk Kelvin Hall \n",
"4 09/05/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"5 09/05/2016 A McGillivray Causewayside Refurbishment \n",
"6 16/05/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"7 23/05/2016 Computacenter Uk Kelvin Hall \n",
"8 23/05/2016 ECG Facilities Service Facilities Management Charge \n",
"9 23/05/2016 ECG Facilities Service Facilities Management Charge \n",
"10 30/05/2016 ALDL ALDL Charges \n",
"11 10/06/2016 Wavetek Ltd Kelvin Hall \n",
"12 10/06/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"13 28/06/2016 ECG Facilities Service Facilities Management Charge \n",
"14 30/06/2016 Glasgow City Council Kelvin Hall \n",
"15 11/07/2016 Wavetek Ltd Kelvin Hall \n",
"16 11/07/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"17 15/07/2016 Sotheby'S Literary & Archival Items \n",
"18 18/07/2016 Christies Literary & Archival Items \n",
"19 25/07/2016 A McGillivray Causewayside Refurbishment \n",
"20 31/07/2016 ALDL ALDL Charges \n",
"21 08/08/2016 ECG Facilities Service Facilities Management Charge \n",
"22 15/08/2016 Creative Video Productions Ltd Kelvin Hall \n",
"23 15/08/2016 John Graham Construction Ltd Causewayside Refurbishment \n",
"24 24/08/2016 ECG Facilities Service Facilities Management Charge \n",
"\n",
" Transaction value (£) Classification \n",
"0 35098.0 Building Improvement \n",
"1 30000.0 Literature & Archive \n",
"2 40800.0 Utility Bills \n",
"3 72835.0 Software/IT \n",
"4 64361.0 Building Improvement \n",
"5 53690.0 Building Improvement \n",
"6 365344.0 Building Improvement \n",
"7 26506.0 Software/IT \n",
"8 32777.0 Building Improvement \n",
"9 32777.0 Building Improvement \n",
"10 32317.0 Could not classify \n",
"11 87589.0 Could not classify \n",
"12 381803.0 Building Improvement \n",
"13 32832.0 Building Improvement \n",
"14 1700000.0 Building Improvement \n",
"15 65692.0 Could not classify \n",
"16 139845.0 Building Improvement \n",
"17 28500.0 Literature & Archive \n",
"18 33800.0 Literature & Archive \n",
"19 30113.0 Building Improvement \n",
"20 32317.0 Could not classify \n",
"21 32795.0 Building Improvement \n",
"22 26866.0 Could not classify \n",
"23 196807.0 Building Improvement \n",
"24 32795.0 Building Improvement "
]
},
"execution_count": 293,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_transactions.head(25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initial results are pretty good even with no labelled examples! The ones that it could not classify were tougher cases with few clues as to their topic, but maybe if we clean up the labelled dataset to give more examples we can get better performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification with Embeddings\n",
"\n",
"Lets create embeddings from the small set that we've classified so far - we've made a set of labelled examples by running the zero-shot classifier on 101 transactions from our dataset and manually correcting the 15 **Could not classify** results that we got\n",
"\n",
"### Create embeddings\n",
"\n",
"This initial section reuses the approach from the [Obtain_dataset Notebook](Obtain_dataset.ipynb) to create embeddings from a combined field concatenating all of our features"
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866</td>\n",
" <td>Other</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>29/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>74806</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>29/05/2017</td>\n",
" <td>Morris &amp; Spottiswood Ltd</td>\n",
" <td>George IV Bridge Work</td>\n",
" <td>56448</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>31/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>164691</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>24/07/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>27926</td>\n",
" <td>Building Improvement</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"0 15/08/2016 Creative Video Productions Ltd Kelvin Hall \n",
"1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"2 29/05/2017 Morris & Spottiswood Ltd George IV Bridge Work \n",
"3 31/05/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"4 24/07/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"\n",
" Transaction value (£) Classification \n",
"0 26866 Other \n",
"1 74806 Building Improvement \n",
"2 56448 Building Improvement \n",
"3 164691 Building Improvement \n",
"4 27926 Building Improvement "
]
},
"execution_count": 317,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./data/labelled_transactions.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 318,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" <th>combined</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866</td>\n",
" <td>Other</td>\n",
" <td>Supplier: Creative Video Productions Ltd; Desc...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>29/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>74806</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"0 15/08/2016 Creative Video Productions Ltd Kelvin Hall \n",
"1 29/05/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"\n",
" Transaction value (£) Classification \\\n",
"0 26866 Other \n",
"1 74806 Building Improvement \n",
"\n",
" combined \n",
"0 Supplier: Creative Video Productions Ltd; Desc... \n",
"1 Supplier: John Graham Construction Ltd; Descri... "
]
},
"execution_count": 318,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['combined'] = \"Supplier: \" + df['Supplier'].str.strip() + \"; Description: \" + df['Description'].str.strip() + \"; Value: \" + str(df['Transaction value (£)']).strip()\n",
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 319,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"101"
]
},
"execution_count": 319,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from transformers import GPT2TokenizerFast\n",
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
"\n",
"df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 320,
"metadata": {},
"outputs": [],
"source": [
"embedding_path = './data/transactions_with_embeddings_100.csv'"
]
},
{
"cell_type": "code",
"execution_count": 321,
"metadata": {},
"outputs": [],
"source": [
"from openai.embeddings_utils import get_embedding\n",
"\n",
"df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n",
"df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n",
"df.to_csv(embedding_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use embeddings for classification\n",
"\n",
"Now that we have our embeddings, let see if classifying these into the categories we've named gives us any more success.\n",
"\n",
"For this we'll use a template from the [Classification_using_embeddings](Classification_using_embeddings.ipynb) notebook"
]
},
{
"cell_type": "code",
"execution_count": 309,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" <th>combined</th>\n",
" <th>n_tokens</th>\n",
" <th>babbage_similarity</th>\n",
" <th>babbage_search</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866</td>\n",
" <td>Other</td>\n",
" <td>Supplier: Creative Video Productions Ltd; Desc...</td>\n",
" <td>136</td>\n",
" <td>[-0.009802100248634815, 0.022551486268639565, ...</td>\n",
" <td>[-0.00232666521333158, 0.019198870286345482, 0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>29/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>74806</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>140</td>\n",
" <td>[-0.009065819904208183, 0.012094118632376194, ...</td>\n",
" <td>[0.005169447045773268, 0.00473341578617692, -0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>29/05/2017</td>\n",
" <td>Morris &amp; Spottiswood Ltd</td>\n",
" <td>George IV Bridge Work</td>\n",
" <td>56448</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: Morris &amp; Spottiswood Ltd; Descriptio...</td>\n",
" <td>141</td>\n",
" <td>[-0.009000026620924473, 0.02405017428100109, -...</td>\n",
" <td>[0.0028343256562948227, 0.021166473627090454, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>31/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>164691</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>140</td>\n",
" <td>[-0.009065819904208183, 0.012094118632376194, ...</td>\n",
" <td>[0.005169447045773268, 0.00473341578617692, -0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>24/07/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>27926</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>140</td>\n",
" <td>[-0.009065819904208183, 0.012094118632376194, ...</td>\n",
" <td>[0.005169447045773268, 0.00473341578617692, -0...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Date Supplier \\\n",
"0 0 15/08/2016 Creative Video Productions Ltd \n",
"1 1 29/05/2017 John Graham Construction Ltd \n",
"2 2 29/05/2017 Morris & Spottiswood Ltd \n",
"3 3 31/05/2017 John Graham Construction Ltd \n",
"4 4 24/07/2017 John Graham Construction Ltd \n",
"\n",
" Description Transaction value (£) Classification \\\n",
"0 Kelvin Hall 26866 Other \n",
"1 Causewayside Refurbishment 74806 Building Improvement \n",
"2 George IV Bridge Work 56448 Building Improvement \n",
"3 Causewayside Refurbishment 164691 Building Improvement \n",
"4 Causewayside Refurbishment 27926 Building Improvement \n",
"\n",
" combined n_tokens \\\n",
"0 Supplier: Creative Video Productions Ltd; Desc... 136 \n",
"1 Supplier: John Graham Construction Ltd; Descri... 140 \n",
"2 Supplier: Morris & Spottiswood Ltd; Descriptio... 141 \n",
"3 Supplier: John Graham Construction Ltd; Descri... 140 \n",
"4 Supplier: John Graham Construction Ltd; Descri... 140 \n",
"\n",
" babbage_similarity \\\n",
"0 [-0.009802100248634815, 0.022551486268639565, ... \n",
"1 [-0.009065819904208183, 0.012094118632376194, ... \n",
"2 [-0.009000026620924473, 0.02405017428100109, -... \n",
"3 [-0.009065819904208183, 0.012094118632376194, ... \n",
"4 [-0.009065819904208183, 0.012094118632376194, ... \n",
"\n",
" babbage_search \n",
"0 [-0.00232666521333158, 0.019198870286345482, 0... \n",
"1 [0.005169447045773268, 0.00473341578617692, -0... \n",
"2 [0.0028343256562948227, 0.021166473627090454, ... \n",
"3 [0.005169447045773268, 0.00473341578617692, -0... \n",
"4 [0.005169447045773268, 0.00473341578617692, -0... "
]
},
"execution_count": 309,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import classification_report, accuracy_score\n",
"\n",
"fs_df = pd.read_csv(embedding_path)\n",
"fs_df[\"babbage_similarity\"] = fs_df.babbage_similarity.apply(eval).apply(np.array)\n",
"fs_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 310,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
"Building Improvement 0.92 1.00 0.96 11\n",
"Literature & Archive 1.00 1.00 1.00 3\n",
" Other 0.00 0.00 0.00 1\n",
" Software/IT 1.00 1.00 1.00 1\n",
" Utility Bills 1.00 1.00 1.00 5\n",
"\n",
" accuracy 0.95 21\n",
" macro avg 0.78 0.80 0.79 21\n",
" weighted avg 0.91 0.95 0.93 21\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n",
"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n",
"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n"
]
}
],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" list(fs_df.babbage_similarity.values), fs_df.Classification, test_size=0.2, random_state=42\n",
")\n",
"\n",
"clf = RandomForestClassifier(n_estimators=100)\n",
"clf.fit(X_train, y_train)\n",
"preds = clf.predict(X_test)\n",
"probas = clf.predict_proba(X_test)\n",
"\n",
"report = classification_report(y_test, preds)\n",
"print(report)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performance for this model is pretty strong, so creating embeddings and using even a simpler classifier looks like an effective approach as well, with the zero-shot classifier helping us do the initial classification of the unlabelled dataset.\n",
"\n",
"Lets take it one step further and see if a fine-tuned model trained on this same labelled datasets gives us comparable results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tuned Transaction Classification\n",
"\n",
"For this use case we're going to try to improve on the few-shot classification from above by training a fine-tuned model on the same labelled set of 101 transactions and applying this fine-tuned model on group of unseen transactions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building Fine-tuned Classifier\n",
"\n",
"We'll need to do some data prep first to get our data ready. This will take the following steps:\n",
"- First we'll list out our classes and replace them with numeric identifiers. Making the model predict a single token rather than multiple consecutive ones like 'Building Improvement' should give us better results\n",
"- We also need to add a common prefix and suffix to each example to aid the model in making predictions - in our case our text is already started with 'Supplier' and we'll add a suffix of '\\n\\n###\\n\\n'\n",
"- Lastly we'll aid a leading whitespace onto each of our target classes for classification, again to aid the model"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"101"
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_prep_df = fs_df.copy()\n",
"len(ft_prep_df)"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" <th>combined</th>\n",
" <th>n_tokens</th>\n",
" <th>babbage_similarity</th>\n",
" <th>babbage_search</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866</td>\n",
" <td>Other</td>\n",
" <td>Supplier: Creative Video Productions Ltd; Desc...</td>\n",
" <td>12</td>\n",
" <td>[-0.009630300104618073, 0.009887108579277992, ...</td>\n",
" <td>[-0.008217384107410908, 0.025170527398586273, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>29/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>74806</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>16</td>\n",
" <td>[-0.006144719664007425, -0.0018709596479311585...</td>\n",
" <td>[-0.007424891460686922, 0.008475713431835175, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>29/05/2017</td>\n",
" <td>Morris &amp; Spottiswood Ltd</td>\n",
" <td>George IV Bridge Work</td>\n",
" <td>56448</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: Morris &amp; Spottiswood Ltd; Descriptio...</td>\n",
" <td>17</td>\n",
" <td>[-0.005225738976150751, 0.015156379900872707, ...</td>\n",
" <td>[-0.007611643522977829, 0.030322374776005745, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>31/05/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>164691</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>16</td>\n",
" <td>[-0.006144719664007425, -0.0018709596479311585...</td>\n",
" <td>[-0.007424891460686922, 0.008475713431835175, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>24/07/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>27926</td>\n",
" <td>Building Improvement</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>16</td>\n",
" <td>[-0.006144719664007425, -0.0018709596479311585...</td>\n",
" <td>[-0.007424891460686922, 0.008475713431835175, ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Date Supplier \\\n",
"0 0 15/08/2016 Creative Video Productions Ltd \n",
"1 1 29/05/2017 John Graham Construction Ltd \n",
"2 2 29/05/2017 Morris & Spottiswood Ltd \n",
"3 3 31/05/2017 John Graham Construction Ltd \n",
"4 4 24/07/2017 John Graham Construction Ltd \n",
"\n",
" Description Transaction value (£) Classification \\\n",
"0 Kelvin Hall 26866 Other \n",
"1 Causewayside Refurbishment 74806 Building Improvement \n",
"2 George IV Bridge Work 56448 Building Improvement \n",
"3 Causewayside Refurbishment 164691 Building Improvement \n",
"4 Causewayside Refurbishment 27926 Building Improvement \n",
"\n",
" combined n_tokens \\\n",
"0 Supplier: Creative Video Productions Ltd; Desc... 12 \n",
"1 Supplier: John Graham Construction Ltd; Descri... 16 \n",
"2 Supplier: Morris & Spottiswood Ltd; Descriptio... 17 \n",
"3 Supplier: John Graham Construction Ltd; Descri... 16 \n",
"4 Supplier: John Graham Construction Ltd; Descri... 16 \n",
"\n",
" babbage_similarity \\\n",
"0 [-0.009630300104618073, 0.009887108579277992, ... \n",
"1 [-0.006144719664007425, -0.0018709596479311585... \n",
"2 [-0.005225738976150751, 0.015156379900872707, ... \n",
"3 [-0.006144719664007425, -0.0018709596479311585... \n",
"4 [-0.006144719664007425, -0.0018709596479311585... \n",
"\n",
" babbage_search \n",
"0 [-0.008217384107410908, 0.025170527398586273, ... \n",
"1 [-0.007424891460686922, 0.008475713431835175, ... \n",
"2 [-0.007611643522977829, 0.030322374776005745, ... \n",
"3 [-0.007424891460686922, 0.008475713431835175, ... \n",
"4 [-0.007424891460686922, 0.008475713431835175, ... "
]
},
"execution_count": 211,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_prep_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"( class_id class\n",
" 0 0 Literature & Archive\n",
" 1 1 Utility Bills\n",
" 2 2 Building Improvement\n",
" 3 3 Software/IT\n",
" 4 4 Other,\n",
" 5)"
]
},
"execution_count": 212,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classes = list(set(ft_prep_df['Classification']))\n",
"class_df = pd.DataFrame(classes).reset_index()\n",
"class_df.columns = ['class_id','class']\n",
"class_df , len(class_df)"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>Classification</th>\n",
" <th>combined</th>\n",
" <th>n_tokens</th>\n",
" <th>babbage_similarity</th>\n",
" <th>babbage_search</th>\n",
" <th>class_id</th>\n",
" <th>prompt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>15/08/2016</td>\n",
" <td>Creative Video Productions Ltd</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>26866</td>\n",
" <td>Other</td>\n",
" <td>Supplier: Creative Video Productions Ltd; Desc...</td>\n",
" <td>12</td>\n",
" <td>[-0.009630300104618073, 0.009887108579277992, ...</td>\n",
" <td>[-0.008217384107410908, 0.025170527398586273, ...</td>\n",
" <td>4</td>\n",
" <td>Supplier: Creative Video Productions Ltd; Desc...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>51</td>\n",
" <td>31/03/2017</td>\n",
" <td>NLS Foundation</td>\n",
" <td>Grant Payment</td>\n",
" <td>177500</td>\n",
" <td>Other</td>\n",
" <td>Supplier: NLS Foundation; Description: Grant P...</td>\n",
" <td>11</td>\n",
" <td>[-0.022305507212877274, 0.008543581701815128, ...</td>\n",
" <td>[-0.020519884303212166, 0.01993306167423725, -...</td>\n",
" <td>4</td>\n",
" <td>Supplier: NLS Foundation; Description: Grant P...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>70</td>\n",
" <td>26/06/2017</td>\n",
" <td>British Library</td>\n",
" <td>Legal Deposit Services</td>\n",
" <td>50056</td>\n",
" <td>Other</td>\n",
" <td>Supplier: British Library; Description: Legal ...</td>\n",
" <td>11</td>\n",
" <td>[-0.01019938476383686, 0.015277703292667866, -...</td>\n",
" <td>[-0.01843327097594738, 0.03343546763062477, -0...</td>\n",
" <td>4</td>\n",
" <td>Supplier: British Library; Description: Legal ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>71</td>\n",
" <td>24/07/2017</td>\n",
" <td>ALDL</td>\n",
" <td>Legal Deposit Services</td>\n",
" <td>27067</td>\n",
" <td>Other</td>\n",
" <td>Supplier: ALDL; Description: Legal Deposit Ser...</td>\n",
" <td>11</td>\n",
" <td>[-0.008471488021314144, 0.004098685923963785, ...</td>\n",
" <td>[-0.012966590002179146, 0.01299362163990736, 0...</td>\n",
" <td>4</td>\n",
" <td>Supplier: ALDL; Description: Legal Deposit Ser...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>100</td>\n",
" <td>24/07/2017</td>\n",
" <td>AM Phillip</td>\n",
" <td>Vehicle Purchase</td>\n",
" <td>26604</td>\n",
" <td>Other</td>\n",
" <td>Supplier: AM Phillip; Description: Vehicle Pur...</td>\n",
" <td>10</td>\n",
" <td>[-0.003459023078903556, 0.004626389592885971, ...</td>\n",
" <td>[-0.0010945454705506563, 0.008626140654087067,...</td>\n",
" <td>4</td>\n",
" <td>Supplier: AM Phillip; Description: Vehicle Pur...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Date Supplier \\\n",
"0 0 15/08/2016 Creative Video Productions Ltd \n",
"1 51 31/03/2017 NLS Foundation \n",
"2 70 26/06/2017 British Library \n",
"3 71 24/07/2017 ALDL \n",
"4 100 24/07/2017 AM Phillip \n",
"\n",
" Description Transaction value (£) Classification \\\n",
"0 Kelvin Hall 26866 Other \n",
"1 Grant Payment 177500 Other \n",
"2 Legal Deposit Services 50056 Other \n",
"3 Legal Deposit Services 27067 Other \n",
"4 Vehicle Purchase 26604 Other \n",
"\n",
" combined n_tokens \\\n",
"0 Supplier: Creative Video Productions Ltd; Desc... 12 \n",
"1 Supplier: NLS Foundation; Description: Grant P... 11 \n",
"2 Supplier: British Library; Description: Legal ... 11 \n",
"3 Supplier: ALDL; Description: Legal Deposit Ser... 11 \n",
"4 Supplier: AM Phillip; Description: Vehicle Pur... 10 \n",
"\n",
" babbage_similarity \\\n",
"0 [-0.009630300104618073, 0.009887108579277992, ... \n",
"1 [-0.022305507212877274, 0.008543581701815128, ... \n",
"2 [-0.01019938476383686, 0.015277703292667866, -... \n",
"3 [-0.008471488021314144, 0.004098685923963785, ... \n",
"4 [-0.003459023078903556, 0.004626389592885971, ... \n",
"\n",
" babbage_search class_id \\\n",
"0 [-0.008217384107410908, 0.025170527398586273, ... 4 \n",
"1 [-0.020519884303212166, 0.01993306167423725, -... 4 \n",
"2 [-0.01843327097594738, 0.03343546763062477, -0... 4 \n",
"3 [-0.012966590002179146, 0.01299362163990736, 0... 4 \n",
"4 [-0.0010945454705506563, 0.008626140654087067,... 4 \n",
"\n",
" prompt \n",
"0 Supplier: Creative Video Productions Ltd; Desc... \n",
"1 Supplier: NLS Foundation; Description: Grant P... \n",
"2 Supplier: British Library; Description: Legal ... \n",
"3 Supplier: ALDL; Description: Legal Deposit Ser... \n",
"4 Supplier: AM Phillip; Description: Vehicle Pur... "
]
},
"execution_count": 215,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_df_with_class = ft_prep_df.merge(class_df,left_on='Classification',right_on='class',how='inner')\n",
"\n",
"# Adding a leading whitespace onto each completion to help the model\n",
"ft_df_with_class['class_id'] = ft_df_with_class.apply(lambda x: ' ' + str(x['class_id']),axis=1)\n",
"ft_df_with_class = ft_df_with_class.drop('class', axis=1)\n",
"\n",
"# Adding a common separator onto the end of each prompt so the model knows when a prompt is terminating\n",
"ft_df_with_class['prompt'] = ft_df_with_class.apply(lambda x: x['combined'] + '\\n\\n###\\n\\n',axis=1)\n",
"ft_df_with_class.head()"
]
},
{
"cell_type": "code",
"execution_count": 236,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>prompt</th>\n",
" <th>completion</th>\n",
" </tr>\n",
" <tr>\n",
" <th>ordering</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Supplier: Sothebys; Description: Literary &amp; Ar...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Supplier: Sotheby'S; Description: Literary &amp; A...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Supplier: City Of Edinburgh Council; Descripti...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" prompt completion\n",
"ordering \n",
"0 Supplier: Sothebys; Description: Literary & Ar... 0\n",
"1 Supplier: Sotheby'S; Description: Literary & A... 0\n",
"2 Supplier: City Of Edinburgh Council; Descripti... 1\n",
"2 Supplier: John Graham Construction Ltd; Descri... 2\n",
"3 Supplier: John Graham Construction Ltd; Descri... 2"
]
},
"execution_count": 236,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This step is unnecessary if you have a number of observations in each class\n",
"# In our case we don't, so we shuffle the data to give us a better chance of getting equal classes in our train and validation sets\n",
"# Our fine-tuned model will error if we have less classes in the validation set, so this is a necessary step\n",
"\n",
"import random \n",
"\n",
"labels = [x for x in ft_df_with_class['class_id']]\n",
"text = [x for x in ft_df_with_class['prompt']]\n",
"ft_df = pd.DataFrame(zip(text, labels), columns = ['prompt','class_id']) #[:300]\n",
"ft_df.columns = ['prompt','completion']\n",
"ft_df['ordering'] = ft_df.apply(lambda x: random.randint(0,len(ft_df)), axis = 1)\n",
"ft_df.set_index('ordering',inplace=True)\n",
"ft_df_sorted = ft_df.sort_index(ascending=True)\n",
"ft_df_sorted.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This step is to remove any existing files if we've already produced training/validation sets for this classifier\n",
"#!rm transactions_grouped*\n",
"\n",
"# We output our shuffled dataframe to a .jsonl file and run the prepare_data function to get us our input files\n",
"ft_df_sorted.to_json(\"transactions_grouped.jsonl\", orient='records', lines=True)\n",
"!openai tools fine_tunes.prepare_data -f transactions_grouped.jsonl -q"
]
},
{
"cell_type": "code",
"execution_count": 322,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"31\n",
"8\n",
"All good\n"
]
}
],
"source": [
"# This functions checks that your classes all appear in both prepared files\n",
"# If they don't, the fine-tuned model creation will fail\n",
"check_classes('transactions_grouped_prepared_train.jsonl','transactions_grouped_prepared_valid.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This step creates your model\n",
"!openai api fine_tunes.create -t \"transactions_grouped_prepared_train.jsonl\" -v \"transactions_grouped_prepared_valid.jsonl\" --compute_classification_metrics --classification_n_classes 5 -m curie"
]
},
{
"cell_type": "code",
"execution_count": 323,
"metadata": {},
"outputs": [],
"source": [
"# Congrats, you've got a fine-tuned model!\n",
"# Copy/paste the name provided into the variable below and we'll take it for a spin\n",
"fine_tuned_model = 'curie:ft-personal-2022-10-20-10-42-56'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applying Fine-tuned Classifier\n",
"\n",
"Now we'll apply our classifier to see how it performs. We only had 31 unique observations in our training set and 8 in our validation set, so lets see how the performance is"
]
},
{
"cell_type": "code",
"execution_count": 324,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>prompt</th>\n",
" <th>completion</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Supplier: Wavetek Ltd; Description: Kelvin Hal...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Supplier: ECG Facilities Service; Description:...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Supplier: M &amp; J Ballantyne Ltd; Description: G...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Supplier: Private Sale; Description: Literary ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Supplier: Ex Libris; Description: IT equipment...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" prompt completion\n",
"0 Supplier: Wavetek Ltd; Description: Kelvin Hal... 2\n",
"1 Supplier: ECG Facilities Service; Description:... 1\n",
"2 Supplier: M & J Ballantyne Ltd; Description: G... 2\n",
"3 Supplier: Private Sale; Description: Literary ... 0\n",
"4 Supplier: Ex Libris; Description: IT equipment... 3"
]
},
"execution_count": 324,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_set = pd.read_json('transactions_grouped_prepared_valid.jsonl', lines=True)\n",
"test_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 325,
"metadata": {},
"outputs": [],
"source": [
"test_set['predicted_class'] = test_set.apply(lambda x: openai.Completion.create(model=fine_tuned_model, prompt=x['prompt'], max_tokens=1, temperature=0, logprobs=5),axis=1)\n",
"test_set['pred'] = test_set.apply(lambda x : x['predicted_class']['choices'][0]['text'],axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 326,
"metadata": {},
"outputs": [],
"source": [
"test_set['result'] = test_set.apply(lambda x: str(x['pred']).strip() == str(x['completion']).strip(), axis = 1)"
]
},
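{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can reuse `classification_report` (imported earlier) for a per-class view of the fine-tuned model's validation predictions. This is a small sketch that assumes the `test_set` frame built above is still in scope; `zero_division=0` suppresses the warnings for classes the model never predicted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Per-class precision/recall for the fine-tuned model's validation predictions\n",
"# (strip whitespace so the ' 2'-style completions line up with the predictions)\n",
"print(classification_report(\n",
"    test_set['completion'].astype(str).str.strip(),\n",
"    test_set['pred'].astype(str).str.strip(),\n",
"    zero_division=0,\n",
"))"
]
},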
{
"cell_type": "code",
"execution_count": 327,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True 4\n",
"False 4\n",
"Name: result, dtype: int64"
]
},
"execution_count": 327,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_set['result'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performance is not great - unfortunately this is expected. With only a few examples of each class, the above approach with embeddings and a traditional classifier worked better.\n",
"\n",
"A fine-tuned model works best with a great number of labelled observations. If we had a few hundred or thousand we may get better results, but lets do one last test on a holdout set to confirm that it doesn't generalise well to a new set of observations"
]
},
{
"cell_type": "code",
"execution_count": 330,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>23/10/2017</td>\n",
" <td>City Building LLP</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>53147.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>30/10/2017</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>30/10/2017</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>06/11/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>134208.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>06/11/2017</td>\n",
" <td>ALDL</td>\n",
" <td>Legal Deposit Services</td>\n",
" <td>27067.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"101 23/10/2017 City Building LLP Causewayside Refurbishment \n",
"102 30/10/2017 ECG Facilities Service Facilities Management Charge \n",
"103 30/10/2017 ECG Facilities Service Facilities Management Charge \n",
"104 06/11/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"105 06/11/2017 ALDL Legal Deposit Services \n",
"\n",
" Transaction value (£) \n",
"101 53147.0 \n",
"102 35758.0 \n",
"103 35758.0 \n",
"104 134208.0 \n",
"105 27067.0 "
]
},
"execution_count": 330,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"holdout_df = transactions.copy().iloc[101:]\n",
"holdout_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 332,
"metadata": {},
"outputs": [],
"source": [
"holdout_df['combined'] = \"Supplier: \" + holdout_df['Supplier'].str.strip() + \"; Description: \" + holdout_df['Description'].str.strip() + '\\n\\n###\\n\\n' # + \"; Value: \" + str(df['Transaction value (£)']).strip()\n",
"holdout_df['prediction_result'] = holdout_df.apply(lambda x: openai.Completion.create(model=fine_tuned_model, prompt=x['combined'], max_tokens=1, temperature=0, logprobs=5),axis=1)\n",
"holdout_df['pred'] = holdout_df.apply(lambda x : x['prediction_result']['choices'][0]['text'],axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 333,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Supplier</th>\n",
" <th>Description</th>\n",
" <th>Transaction value (£)</th>\n",
" <th>combined</th>\n",
" <th>prediction_result</th>\n",
" <th>pred</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>23/10/2017</td>\n",
" <td>City Building LLP</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>53147.0</td>\n",
" <td>Supplier: City Building LLP; Description: Caus...</td>\n",
" <td>{'id': 'cmpl-63YDadbYLo8xKsGY2vReOFCMgTOvG', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>30/10/2017</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" <td>Supplier: ECG Facilities Service; Description:...</td>\n",
" <td>{'id': 'cmpl-63YDbNK1D7UikDc3xi5ATihg5kQEt', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>30/10/2017</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" <td>Supplier: ECG Facilities Service; Description:...</td>\n",
" <td>{'id': 'cmpl-63YDbwfiHjkjMWsfTKNt6naeqPzOe', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>06/11/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>134208.0</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>{'id': 'cmpl-63YDbWAndtsRqPTi2ZHZtPodZvOwr', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>06/11/2017</td>\n",
" <td>ALDL</td>\n",
" <td>Legal Deposit Services</td>\n",
" <td>27067.0</td>\n",
" <td>Supplier: ALDL; Description: Legal Deposit Ser...</td>\n",
" <td>{'id': 'cmpl-63YDbDu7WM3svYWsRAMdDUKtSFDBu', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>27/11/2017</td>\n",
" <td>Maggs Bros Ltd</td>\n",
" <td>Literary &amp; Archival Items</td>\n",
" <td>26500.0</td>\n",
" <td>Supplier: Maggs Bros Ltd; Description: Literar...</td>\n",
" <td>{'id': 'cmpl-63YDbxNNI8ZH5CJJNxQ0IF9Zf925C', '...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107</th>\n",
" <td>30/11/2017</td>\n",
" <td>Glasgow City Council</td>\n",
" <td>Kelvin Hall</td>\n",
" <td>42345.0</td>\n",
" <td>Supplier: Glasgow City Council; Description: K...</td>\n",
" <td>{'id': 'cmpl-63YDb8R1FWu4bjwM2xE775rouwneV', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>11/12/2017</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" <td>Supplier: ECG Facilities Service; Description:...</td>\n",
" <td>{'id': 'cmpl-63YDcAPsp37WhbPs9kwfUX0kBk7Hv', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109</th>\n",
" <td>11/12/2017</td>\n",
" <td>John Graham Construction Ltd</td>\n",
" <td>Causewayside Refurbishment</td>\n",
" <td>159275.0</td>\n",
" <td>Supplier: John Graham Construction Ltd; Descri...</td>\n",
" <td>{'id': 'cmpl-63YDcML2welrC3wF0nuKgcNmVu1oQ', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110</th>\n",
" <td>08/01/2018</td>\n",
" <td>ECG Facilities Service</td>\n",
" <td>Facilities Management Charge</td>\n",
" <td>35758.0</td>\n",
" <td>Supplier: ECG Facilities Service; Description:...</td>\n",
" <td>{'id': 'cmpl-63YDc95SSdOHnIliFB2cjMEEm7Z2u', '...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Supplier Description \\\n",
"101 23/10/2017 City Building LLP Causewayside Refurbishment \n",
"102 30/10/2017 ECG Facilities Service Facilities Management Charge \n",
"103 30/10/2017 ECG Facilities Service Facilities Management Charge \n",
"104 06/11/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"105 06/11/2017 ALDL Legal Deposit Services \n",
"106 27/11/2017 Maggs Bros Ltd Literary & Archival Items \n",
"107 30/11/2017 Glasgow City Council Kelvin Hall \n",
"108 11/12/2017 ECG Facilities Service Facilities Management Charge \n",
"109 11/12/2017 John Graham Construction Ltd Causewayside Refurbishment \n",
"110 08/01/2018 ECG Facilities Service Facilities Management Charge \n",
"\n",
" Transaction value (£) combined \\\n",
"101 53147.0 Supplier: City Building LLP; Description: Caus... \n",
"102 35758.0 Supplier: ECG Facilities Service; Description:... \n",
"103 35758.0 Supplier: ECG Facilities Service; Description:... \n",
"104 134208.0 Supplier: John Graham Construction Ltd; Descri... \n",
"105 27067.0 Supplier: ALDL; Description: Legal Deposit Ser... \n",
"106 26500.0 Supplier: Maggs Bros Ltd; Description: Literar... \n",
"107 42345.0 Supplier: Glasgow City Council; Description: K... \n",
"108 35758.0 Supplier: ECG Facilities Service; Description:... \n",
"109 159275.0 Supplier: John Graham Construction Ltd; Descri... \n",
"110 35758.0 Supplier: ECG Facilities Service; Description:... \n",
"\n",
" prediction_result pred \n",
"101 {'id': 'cmpl-63YDadbYLo8xKsGY2vReOFCMgTOvG', '... 2 \n",
"102 {'id': 'cmpl-63YDbNK1D7UikDc3xi5ATihg5kQEt', '... 2 \n",
"103 {'id': 'cmpl-63YDbwfiHjkjMWsfTKNt6naeqPzOe', '... 2 \n",
"104 {'id': 'cmpl-63YDbWAndtsRqPTi2ZHZtPodZvOwr', '... 2 \n",
"105 {'id': 'cmpl-63YDbDu7WM3svYWsRAMdDUKtSFDBu', '... 2 \n",
"106 {'id': 'cmpl-63YDbxNNI8ZH5CJJNxQ0IF9Zf925C', '... 0 \n",
"107 {'id': 'cmpl-63YDb8R1FWu4bjwM2xE775rouwneV', '... 2 \n",
"108 {'id': 'cmpl-63YDcAPsp37WhbPs9kwfUX0kBk7Hv', '... 2 \n",
"109 {'id': 'cmpl-63YDcML2welrC3wF0nuKgcNmVu1oQ', '... 2 \n",
"110 {'id': 'cmpl-63YDc95SSdOHnIliFB2cjMEEm7Z2u', '... 2 "
]
},
"execution_count": 333,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"holdout_df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 334,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 2 231\n",
" 0 27\n",
"Name: pred, dtype: int64"
]
},
"execution_count": 334,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"holdout_df['pred'].value_counts()"
]
},
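{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make those counts easier to read, we can map the numeric predictions back to their class names. This is a small sketch that assumes the `class_df` lookup table built earlier is still in scope."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Map numeric class ids back to readable class names using the class_df lookup\n",
"id_to_class = dict(zip(class_df['class_id'].astype(str), class_df['class']))\n",
"holdout_df['pred_class'] = holdout_df['pred'].str.strip().map(id_to_class)\n",
"holdout_df['pred_class'].value_counts()"
]
},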
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well those results were similarly underwhelming - so we've learned that with a dataset with a small number of labelled observations, either zero-shot classification or traditional classification with embeddings return better results than a fine-tuned model.\n",
"\n",
"A fine-tuned model is still a great tool, but is more effective when you have a larger number of labelled examples for each class that you're looking to classify"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}