openai-cookbook/examples/Regression_using_embeddings.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression using the embeddings\n",
    "\n",
    "Regression means predicting a number rather than a category. We will predict the score from the embedding of the review's text. We split the dataset into a training set and a test set so that we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
    "\n",
    "We're predicting the score of the review, which is a number between 1 and 5 (1-star being negative and 5-star positive)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Babbage similarity embedding performance on 1k Amazon reviews: mse=0.39, mae=0.38\n"
     ]
    }
   ],
   "source": [
    "from ast import literal_eval\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
    "\n",
    "# For convenience, the embeddings are precomputed and stored alongside the reviews.\n",
    "datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\"\n",
    "df = pd.read_csv(datafile_path)\n",
    "# Each embedding is stored as the string form of a list; parse it with literal_eval rather than eval.\n",
    "df[\"babbage_similarity\"] = df.babbage_similarity.apply(literal_eval).apply(np.array)\n",
    "\n",
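    "# Hold out 20% of the reviews for evaluation.\n",
    "X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
    "\n",
    "# Fit a random forest on the embedding vectors, then predict scores for the held-out reviews.\n",
    "rfr = RandomForestRegressor(n_estimators=100)\n",
    "rfr.fit(X_train, y_train)\n",
    "preds = rfr.predict(X_test)\n",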
    "\n",
    "mse = mean_squared_error(y_test, preds)\n",
    "mae = mean_absolute_error(y_test, preds)\n",
    "\n",
    "print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dummy mean prediction performance on Amazon reviews: mse=1.81, mae=1.08\n"
     ]
    }
   ],
   "source": [
    "# Baseline: always predict the mean score (computed here on the test labels themselves).\n",
    "bmse = mean_squared_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
    "bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
    "print(\n",
    "    f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the embeddings are able to predict the scores with a mean absolute error of 0.38 per score prediction, compared with 1.08 for the dummy baseline that always predicts the mean score. This is roughly equivalent to predicting 2 out of 3 reviews perfectly and being off by one star on the remaining 1 out of 3."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You could also train a classifier to predict the score as a label, or use the embeddings within an existing ML model to encode free-text features."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.9 ('openai')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
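
The notebook's closing interpretation (about 2 out of 3 reviews predicted perfectly, 1 out of 3 off by one star) can be sanity-checked with a little arithmetic. The following is a minimal sketch under an assumed, hypothetical error pattern; `abs_errors` is illustrative data, not the notebook's actual predictions.

```python
# Hypothetical error pattern: of every 3 predictions, 2 are exact and 1 is off by one star.
# This is illustrative data only, not the output of the random forest in the notebook.
abs_errors = [0, 0, 1] * 100  # 300 made-up reviews

# Mean absolute error: the average size of the error.
mae = sum(abs_errors) / len(abs_errors)

# Mean squared error: the average of the squared errors.
mse = sum(e ** 2 for e in abs_errors) / len(abs_errors)

print(f"mae={mae:.2f}, mse={mse:.2f}")  # prints "mae=0.33, mse=0.33"
```

Under this idealised pattern both metrics come out at about 0.33, which is in the same range as the reported mae=0.38 and mse=0.39, so the "2 out of 3" reading is a rough approximation rather than an exact description of the model's errors.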