mirror of
https://github.com/openai/openai-cookbook
synced 2024-11-04 06:00:33 +00:00
278 lines
71 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering\n",
    "\n",
    "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 1536)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# imports\n",
    "from ast import literal_eval\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# load data\n",
    "datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
    "\n",
    "df = pd.read_csv(datafile_path)\n",
    "# convert each embedding string to a numpy array; literal_eval parses literals only and does not execute code\n",
    "df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array)\n",
    "matrix = np.vstack(df.embedding.values)\n",
    "matrix.shape\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Find the clusters using K-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Cluster\n",
       "0    4.105691\n",
       "1    4.191176\n",
       "2    4.215613\n",
       "3    4.306590\n",
       "Name: Score, dtype: float64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "n_clusters = 4\n",
    "\n",
    "# n_init is set explicitly to 10 (the default when this was run) to avoid scikit-learn's FutureWarning\n",
    "kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", n_init=10, random_state=42)\n",
    "kmeans.fit(matrix)\n",
    "labels = kmeans.labels_\n",
    "df[\"Cluster\"] = labels\n",
    "\n",
    "# average review score per cluster, lowest first\n",
    "df.groupby(\"Cluster\").Score.mean().sort_values()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "tsne = TSNE(n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200)\n",
    "vis_dims2 = tsne.fit_transform(matrix)\n",
    "\n",
    "x = [x for x, y in vis_dims2]\n",
    "y = [y for x, y in vis_dims2]\n",
    "\n",
    "for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\"]):\n",
    "    xs = np.array(x)[df.Cluster == category]\n",
    "    ys = np.array(y)[df.Cluster == category]\n",
    "    plt.scatter(xs, ys, color=color, alpha=0.3)\n",
    "\n",
    "    avg_x = xs.mean()\n",
    "    avg_y = ys.mean()\n",
    "\n",
    "    plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n",
    "plt.title(\"Clusters identified visualized in language 2d using t-SNE\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualization of clusters in a 2D projection. In this run, the green cluster (#1) seems quite different from the others. Let's see a few samples from each cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Text samples in the clusters & naming the clusters\n",
    "\n",
    "Let's show random samples from each cluster. We'll use text-davinci-003 to name the clusters, based on a random sample of 5 reviews from each cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cluster 0 Theme: All of the reviews are positive and the customers are satisfied with the product they purchased.\n",
      "5, Loved these gluten free healthy bars, saved $$ ordering on Amazon: These Kind Bars are so good and healthy & gluten free. My daughter ca\n",
      "1, Should advertise coconut as an ingredient more prominently: First, these should be called Mac - Coconut bars, as Coconut is the #2\n",
      "5, very good!!: just like the runts<br />great flavor, def worth getting<br />I even o\n",
      "5, Excellent product: After scouring every store in town for orange peels and not finding an\n",
      "5, delicious: Gummi Frogs have been my favourite candy that I have ever tried. of co\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 1 Theme: All of the reviews are about pet food.\n",
      "2, Messy and apparently undelicious: My cat is not a huge fan. Sure, she'll lap up the gravy, but leaves th\n",
      "4, The cats like it: My 7 cats like this food but it is a little yucky for the human. Piece\n",
      "5, cant get enough of it!!!: Our lil shih tzu puppy cannot get enough of it. Everytime she sees the\n",
      "1, Food Caused Illness: I switched my cats over from the Blue Buffalo Wildnerness Food to this\n",
      "5, My furbabies LOVE these!: Shake the container and they come running. Even my boy cat, who isn't \n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 2 Theme: All of the reviews are positive and express satisfaction with the product.\n",
      "5, Fog Chaser Coffee: This coffee has a full body and a rich taste. The price is far below t\n",
      "5, Excellent taste: This is to me a great coffee, once you try it you will enjoy it, this \n",
      "4, Good, but not Wolfgang Puck good: Honestly, I have to admit that I expected a little better. That's not \n",
      "5, Just My Kind of Coffee: Coffee Masters Hazelnut coffee used to be carried in a local coffee/pa\n",
      "5, Rodeo Drive is Crazy Good Coffee!: Rodeo Drive is my absolute favorite and I'm ready to order more! That\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Cluster 3 Theme: All of the reviews are about food or drink products.\n",
      "5, Wonderful alternative to soda pop: This is a wonderful alternative to soda pop. It's carbonated for thos\n",
      "5, So convenient, for so little!: I needed two vanilla beans for the Love Goddess cake that my husbands \n",
      "2, bot very cheesy: Got this about a month ago.first of all it smells horrible...it tastes\n",
      "5, Delicious!: I am not a huge beer lover. I do enjoy an occasional Blue Moon (all o\n",
      "3, Just ok: I bought this brand because it was all they had at Ranch 99 near us. I\n",
      "----------------------------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "\n",
    "# Read a random sample of reviews from each cluster.\n",
    "rev_per_cluster = 5\n",
    "\n",
    "for i in range(n_clusters):\n",
    "    print(f\"Cluster {i} Theme:\", end=\" \")\n",
    "\n",
    "    reviews = \"\\n\".join(\n",
    "        df[df.Cluster == i]\n",
    "        .combined.str.replace(\"Title: \", \"\")\n",
    "        .str.replace(\"\\n\\nContent: \", \": \")\n",
    "        .sample(rev_per_cluster, random_state=42)\n",
    "        .values\n",
    "    )\n",
    "    response = openai.Completion.create(\n",
    "        engine=\"text-davinci-003\",\n",
    "        prompt=f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:',\n",
    "        temperature=0,\n",
    "        max_tokens=64,\n",
    "        top_p=1,\n",
    "        frequency_penalty=0,\n",
    "        presence_penalty=0,\n",
    "    )\n",
    "    print(response[\"choices\"][0][\"text\"].replace(\"\\n\", \"\"))\n",
    "\n",
    "    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)\n",
    "    for j in range(rev_per_cluster):\n",
    "        print(sample_cluster_rows.Score.values[j], end=\", \")\n",
    "        print(sample_cluster_rows.Summary.values[j], end=\": \")\n",
    "        print(sample_cluster_rows.Text.str[:70].values[j])\n",
    "\n",
    "    print(\"-\" * 100)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on the largest discrepancies in the data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}