openai-cookbook/examples/Clustering.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering\n",
    "\n",
    "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 2048)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ast import literal_eval\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\"  # for your convenience, we precomputed the embeddings\n",
    "df = pd.read_csv(datafile_path)\n",
    "# literal_eval parses the stored list strings without executing arbitrary code\n",
    "df[\"babbage_similarity\"] = df.babbage_similarity.apply(literal_eval).apply(np.array)\n",
    "matrix = np.vstack(df.babbage_similarity.values)\n",
    "matrix.shape\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Find the clusters using K-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
   ]
  },
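  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to compare candidate values is the silhouette score. The cell below is a minimal sketch rather than part of the original analysis: `silhouette_by_k` is a hypothetical helper, and the candidate range of 2 to 8 and `n_init=10` are assumptions chosen for illustration. It reuses the `matrix` of embeddings built above. Higher scores suggest better-separated clusters, but the score is only a guide and may not agree with the grouping that is most useful for your application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "\n",
    "# Hypothetical helper (not part of the original notebook):\n",
    "# fit K-means once per candidate k and record the silhouette score.\n",
    "def silhouette_by_k(embeddings, candidate_ks, random_state=42):\n",
    "    scores = {}\n",
    "    for k in candidate_ks:\n",
    "        km = KMeans(n_clusters=k, init=\"k-means++\", n_init=10, random_state=random_state)\n",
    "        candidate_labels = km.fit_predict(embeddings)\n",
    "        scores[k] = silhouette_score(embeddings, candidate_labels)\n",
    "    return scores\n",
    "\n",
    "\n",
    "# The range 2..8 is an assumption for illustration; widen it as needed.\n",
    "silhouette_by_k(matrix, candidate_ks=range(2, 9))\n"
   ]
  },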
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Cluster\n",
       "2    2.543478\n",
       "3    4.374046\n",
       "0    4.709402\n",
       "1    4.832099\n",
       "Name: Score, dtype: float64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "n_clusters = 4\n",
    "\n",
    "kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", random_state=42)\n",
    "kmeans.fit(matrix)\n",
    "labels = kmeans.labels_\n",
    "df[\"Cluster\"] = labels\n",
    "\n",
    "df.groupby(\"Cluster\").Score.mean().sort_values()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cluster 2 has a mean score of about 2.5, so it appears to group negative reviews, while clusters 0 and 1 (mean scores of about 4.7 and 4.8) appear to group positive reviews."
   ]
  },
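  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A mean score says nothing about how many reviews a cluster holds. As a small sketch that is not part of the original notebook, the cell below reports the size of each cluster next to its mean score, using the `Cluster` column assigned above. The choice of `mean` and `count` as the summary statistics is an assumption made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical summary (not in the original notebook):\n",
    "# mean score and number of reviews per cluster, sorted by mean score.\n",
    "df.groupby(\"Cluster\").Score.agg([\"mean\", \"count\"]).sort_values(\"mean\")\n"
   ]
  },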
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "tsne = TSNE(\n",
    "    n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200\n",
    ")\n",
    "vis_dims2 = tsne.fit_transform(matrix)\n",
    "\n",
    "x = [x for x, y in vis_dims2]\n",
    "y = [y for x, y in vis_dims2]\n",
    "\n",
    "for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\"]):\n",
    "    xs = np.array(x)[df.Cluster == category]\n",
    "    ys = np.array(y)[df.Cluster == category]\n",
    "    plt.scatter(xs, ys, color=color, alpha=0.3)\n",
    "\n",
    "    avg_x = xs.mean()\n",
    "    avg_y = ys.mean()\n",
    "\n",
    "    plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n",
    "plt.title(\"Clusters identified visualized in language 2d using t-SNE\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot shows the clusters in a 2D t-SNE projection. The red cluster (cluster 2) represents the negative reviews, and the blue cluster (cluster 3) sits apart from the others. Let's look at a few samples from each cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Text samples in the clusters & naming the clusters\n",
    "\n",
    "Let's show random samples from each cluster. We'll use davinci-instruct-beta-v3 to name the clusters, based on a random sample of 3 reviews from each cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cluster 0 Theme:  All of the customer reviews mention the great flavor of the product.\n",
      "5, French Vanilla Cappuccino:   Great price.  Really love the the flavor.  No need to add anything to \n",
      "5, great coffee:   A bit pricey once you add the S & H but this is one of the best flavor\n",
      "5, Love It:   First let me say I'm new to drinking tea. So you're not getting a well\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 1 Theme:  All three reviews mention the quality of the product.\n",
      "5, Beautiful:   I don't plan to grind these, have plenty other peppers for that.  I go\n",
      "5, Awesome:   I can't find this in the stores and thought I would like it.  So I bou\n",
      "5, Came as expected:   It was tasty and fresh. The other one I bought was old and tasted mold\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 2 Theme:  All reviews are about customer's disappointment.\n",
      "1, Disappointed...:   I should read the fine print, I guess.  I mostly went by the picture a\n",
      "5, Excellent but Price?:   I first heard about this on America's Test Kitchen where it won a blin\n",
      "1, Disappointed:   I received the offer from Amazon and had never tried this brand before\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 3 Theme: The reviews for these products have in common that the customers are happy with the product.\n",
      "5, My Dog's Favorite Snack!:   I was first introduced to this snack at my dog's training classes at p\n",
      "4, Fruitables Crunchy Dog Treats:   My lab goes wild for these and I am almost tempted to have a go at som\n",
      "5, Happy with the product:   My dog was suffering with itchy skin.  He had been eating Natural Choi\n",
      "----------------------------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "\n",
    "# Read a few sample reviews from each cluster.\n",
    "rev_per_cluster = 3\n",
    "\n",
    "for i in range(n_clusters):\n",
    "    print(f\"Cluster {i} Theme:\", end=\" \")\n",
    "\n",
    "    reviews = \"\\n\".join(\n",
    "        df[df.Cluster == i]\n",
    "        .combined.str.replace(\"Title: \", \"\")\n",
    "        .str.replace(\"\\n\\nContent: \", \":  \")\n",
    "        .sample(rev_per_cluster, random_state=42)\n",
    "        .values\n",
    "    )\n",
    "    response = openai.Completion.create(\n",
    "        engine=\"davinci-instruct-beta-v3\",\n",
    "        prompt=f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:',\n",
    "        temperature=0,\n",
    "        max_tokens=64,\n",
    "        top_p=1,\n",
    "        frequency_penalty=0,\n",
    "        presence_penalty=0,\n",
    "    )\n",
    "    print(response[\"choices\"][0][\"text\"].replace(\"\\n\", \"\"))\n",
    "\n",
    "    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)\n",
    "    for j in range(rev_per_cluster):\n",
    "        print(sample_cluster_rows.Score.values[j], end=\", \")\n",
    "        print(sample_cluster_rows.Summary.values[j], end=\":   \")\n",
    "        print(sample_cluster_rows.Text.str[:70].values[j])\n",
    "\n",
    "    print(\"-\" * 100)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the average rating per cluster, Cluster 2 contains mostly negative reviews. Clusters 0 and 1 contain mostly positive reviews, while Cluster 3 appears to contain reviews about dog products."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on the largest discrepancies in the data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.9 ('openai')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}