{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clustering\n",
"\n",
"We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1000, 2048)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ast\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# For convenience, the embeddings are precomputed and stored in this CSV.\n",
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\"\n",
"df = pd.read_csv(datafile_path)\n",
"# Embeddings are stored as list literals in strings; literal_eval parses them without running arbitrary code.\n",
"df[\"babbage_similarity\"] = df.babbage_similarity.apply(ast.literal_eval).apply(np.array)\n",
"# Stack the per-row vectors into one (n_reviews, embedding_dim) matrix.\n",
"matrix = np.vstack(df.babbage_similarity.values)\n",
"matrix.shape\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Find the clusters using K-means"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Cluster\n",
"2 2.543478\n",
"3 4.374046\n",
"0 4.709402\n",
"1 4.832099\n",
"Name: Score, dtype: float64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"\n",
"n_clusters = 4\n",
"\n",
"# n_init=10 is stated explicitly so results do not depend on the scikit-learn version's default.\n",
"kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", n_init=10, random_state=42)\n",
"kmeans.fit(matrix)\n",
"labels = kmeans.labels_\n",
"df[\"Cluster\"] = labels\n",
"\n",
"# Mean review score per cluster, from lowest to highest.\n",
"df.groupby(\"Cluster\").Score.mean().sort_values()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cluster 2 has by far the lowest mean score (about 2.5), so it appears to collect the negative reviews, while clusters 0 and 1 (mean scores above 4.7) collect positive reviews."
]
},
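{
"cell_type": "markdown",
"metadata": {},
"source": [
"A mean score is easier to interpret next to the cluster size, since a small cluster can have an extreme average. As a minimal sketch, the snippet below reports both with one `groupby`. The `demo_df` frame and its values are made up for illustration; only the `Score` and `Cluster` column names come from this notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical miniature of the notebook's DataFrame (values are invented):\n",
"# one row per review, with its star rating and its k-means cluster label.\n",
"demo_df = pd.DataFrame(\n",
"    {\n",
"        \"Score\": [5, 4, 5, 1, 2, 5, 1, 5],\n",
"        \"Cluster\": [0, 0, 1, 2, 2, 1, 2, 0],\n",
"    }\n",
")\n",
"\n",
"# Count and mean score per cluster, sorted from lowest to highest mean.\n",
"summary = demo_df.groupby(\"Cluster\").Score.agg([\"count\", \"mean\"]).sort_values(\"mean\")\n",
"print(summary)\n",
"```\n",
"\n",
"On the real data, replacing `demo_df` with `df` should give a table of the same shape, with one row per cluster."
]
},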
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from sklearn.manifold import TSNE\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Project the embeddings to 2D with t-SNE for plotting.\n",
"tsne = TSNE(\n",
"    n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200\n",
")\n",
"vis_dims2 = tsne.fit_transform(matrix)\n",
"\n",
"x = vis_dims2[:, 0]\n",
"y = vis_dims2[:, 1]\n",
"\n",
"# Plot each cluster in its own color and mark its mean position in the projection with an \"x\".\n",
"for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\"]):\n",
"    xs = x[df.Cluster == category]\n",
"    ys = y[df.Cluster == category]\n",
"    plt.scatter(xs, ys, color=color, alpha=0.3)\n",
"\n",
"    avg_x = xs.mean()\n",
"    avg_y = ys.mean()\n",
"\n",
"    plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n",
"plt.title(\"Clusters identified visualized in language 2d using t-SNE\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot shows the clusters in a 2D t-SNE projection, with an `x` marking each cluster's mean position. The red cluster (cluster 2) corresponds to the mostly negative reviews. The blue cluster (cluster 3) seems quite different from the others. Let's look at a few samples from each cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Text samples in the clusters & naming the clusters\n",
"\n",
"Let's show random samples from each cluster. We'll use davinci-instruct-beta-v3 to name the clusters, based on a random sample of 3 reviews from each cluster."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster 0 Theme: All of the customer reviews mention the great flavor of the product.\n",
"5, French Vanilla Cappuccino: Great price. Really love the the flavor. No need to add anything to \n",
"5, great coffee: A bit pricey once you add the S & H but this is one of the best flavor\n",
"5, Love It: First let me say I'm new to drinking tea. So you're not getting a well\n",
"----------------------------------------------------------------------------------------------------\n",
"Cluster 1 Theme: All three reviews mention the quality of the product.\n",
"5, Beautiful: I don't plan to grind these, have plenty other peppers for that. I go\n",
"5, Awesome: I can't find this in the stores and thought I would like it. So I bou\n",
"5, Came as expected: It was tasty and fresh. The other one I bought was old and tasted mold\n",
"----------------------------------------------------------------------------------------------------\n",
"Cluster 2 Theme: All reviews are about customer's disappointment.\n",
"1, Disappointed...: I should read the fine print, I guess. I mostly went by the picture a\n",
"5, Excellent but Price?: I first heard about this on America's Test Kitchen where it won a blin\n",
"1, Disappointed: I received the offer from Amazon and had never tried this brand before\n",
"----------------------------------------------------------------------------------------------------\n",
"Cluster 3 Theme: The reviews for these products have in common that the customers are happy with the product.\n",
"5, My Dog's Favorite Snack!: I was first introduced to this snack at my dog's training classes at p\n",
"4, Fruitables Crunchy Dog Treats: My lab goes wild for these and I am almost tempted to have a go at som\n",
"5, Happy with the product: My dog was suffering with itchy skin. He had been eating Natural Choi\n",
"----------------------------------------------------------------------------------------------------\n"
]
}
],
"source": [
"import openai\n",
"\n",
"# Number of sample reviews to read from each cluster.\n",
"rev_per_cluster = 3\n",
"\n",
"for i in range(n_clusters):\n",
"    print(f\"Cluster {i} Theme:\", end=\" \")\n",
"\n",
"    reviews = \"\\n\".join(\n",
"        df[df.Cluster == i]\n",
"        .combined.str.replace(\"Title: \", \"\")\n",
"        .str.replace(\"\\n\\nContent: \", \": \")\n",
"        .sample(rev_per_cluster, random_state=42)\n",
"        .values\n",
"    )\n",
"    # Legacy Completions call from the pre-1.0 openai Python package.\n",
"    response = openai.Completion.create(\n",
"        engine=\"davinci-instruct-beta-v3\",\n",
"        prompt=f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:',\n",
"        temperature=0,\n",
"        max_tokens=64,\n",
"        top_p=1,\n",
"        frequency_penalty=0,\n",
"        presence_penalty=0,\n",
"    )\n",
"    print(response[\"choices\"][0][\"text\"].replace(\"\\n\", \"\"))\n",
"\n",
"    # Print the score, summary, and first 70 characters of the same sampled reviews.\n",
"    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)\n",
"    for j in range(rev_per_cluster):\n",
"        print(sample_cluster_rows.Score.values[j], end=\", \")\n",
"        print(sample_cluster_rows.Summary.values[j], end=\": \")\n",
"        print(sample_cluster_rows.Text.str[:70].values[j])\n",
"\n",
"    print(\"-\" * 100)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the average rating per cluster, Cluster 2 contains mostly negative reviews. Clusters 0 and 1 contain mostly positive reviews, while Cluster 3 appears to contain reviews about dog products."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's important to note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on the largest discrepancies in the data."
]
}
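,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to compare candidate cluster counts, which this notebook does not otherwise use, is scikit-learn's silhouette score (higher values indicate tighter, better separated clusters). The sketch below is only an illustration under stated assumptions: `demo_matrix` is synthetic data from `make_blobs` standing in for the embedding matrix, and the range of `k` values is arbitrary.\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.metrics import silhouette_score\n",
"\n",
"# Synthetic stand-in for the embedding matrix (an assumption for illustration):\n",
"# 300 points in 16 dimensions drawn around 4 centers.\n",
"demo_matrix, _ = make_blobs(\n",
"    n_samples=300, n_features=16, centers=4, cluster_std=1.0, random_state=42\n",
")\n",
"\n",
"# Fit k-means for several cluster counts and record the silhouette score of each.\n",
"scores = {}\n",
"for k in range(2, 8):\n",
"    model = KMeans(n_clusters=k, init=\"k-means++\", n_init=10, random_state=42)\n",
"    demo_labels = model.fit_predict(demo_matrix)\n",
"    scores[k] = silhouette_score(demo_matrix, demo_labels)\n",
"\n",
"best_k = max(scores, key=scores.get)\n",
"for k, score in scores.items():\n",
"    print(f\"k={k}: silhouette={score:.3f}\")\n",
"print(\"best k by silhouette:\", best_k)\n",
"```\n",
"\n",
"To try this on the review embeddings, pass `matrix` in place of `demo_matrix`. The score is only a heuristic, so the number of clusters that best serves your use case may differ from the one it favors."
]
}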
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.9 ('openai')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}