mirror of
https://github.com/openai/openai-cookbook
synced 2024-11-04 06:00:33 +00:00
110 lines
3.5 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regression using the embeddings\n",
"\n",
"Regression means predicting a number rather than one of a fixed set of categories. Here we predict a review's score from the embedding of its text. We split the dataset into a training set and a test set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
"\n",
"The score is a number between 1 and 5 (1 star being negative and 5 stars positive)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Babbage similarity embedding performance on 1k Amazon reviews: mse=0.38, mae=0.39\n"
]
}
],
"source": [
"from ast import literal_eval\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"\n",
"# Load the reviews; each embedding is stored as the string form of a list of floats\n",
"df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
"# literal_eval parses the list without executing arbitrary code, unlike eval\n",
"df['babbage_similarity'] = df.babbage_similarity.apply(literal_eval).apply(np.array)\n",
"\n",
"# Hold out 20% of the reviews for testing\n",
"X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
"\n",
"# Fit a random forest on the embeddings and predict scores for the held-out reviews\n",
"rfr = RandomForestRegressor(n_estimators=100)\n",
"rfr.fit(X_train, y_train)\n",
"preds = rfr.predict(X_test)\n",
"\n",
"mse = mean_squared_error(y_test, preds)\n",
"mae = mean_absolute_error(y_test, preds)\n",
"\n",
"print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dummy mean prediction performance on Amazon reviews: mse=1.77, mae=1.04\n"
]
}
],
"source": [
"bmse = mean_squared_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
"bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
"print(f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embeddings predict the review score with a mean absolute error of 0.39, compared with 1.04 for the dummy baseline that always predicts the mean. An error of this size is roughly equivalent to predicting 2 out of 3 reviews exactly and being off by one star on the remaining 1 out of 3."
]
},
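{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sanity check of that comparison, the sketch below builds a small hypothetical set of true scores and predictions (`y_true` and `y_pred` are made-up illustrations, not taken from the dataset) in which 2 out of 3 predictions are exact and 1 out of 3 is off by one star, then computes the mean absolute error. It works out to 1/3, or about 0.33, which is in the same range as the 0.39 measured above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical example: six reviews, four predicted exactly, two off by one star\n",
"y_true = np.array([5, 4, 1, 3, 5, 2])\n",
"y_pred = np.array([5, 4, 1, 3, 4, 3])\n",
"\n",
"# Mean absolute error: the average size of the per-review miss, in stars\n",
"example_mae = np.mean(np.abs(y_true - y_pred))\n",
"print(f\"mae={example_mae:.2f}\")"
]
},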
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could also train a classifier to predict the label, or use the embeddings within an existing ML model to encode free-text features."
]
}
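,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible shape for the classifier variant is sketched below. It is only an illustration under stated assumptions: the inputs are random stand-in vectors and random integer scores rather than the real embeddings, and the names `X_demo`, `y_demo`, and `clf` are hypothetical. With the real data you would presumably pass `list(df.babbage_similarity.values)` and `df.Score` instead. Because the stand-in data is random, the printed accuracy is not meaningful; the point is the shape of the workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Stand-in data (assumption): random vectors in place of real embeddings,\n",
"# and random integer scores from 1 to 5 in place of real review scores\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(200, 16))\n",
"y_demo = rng.integers(1, 6, size=200)\n",
"\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)\n",
"\n",
"# Treat each star rating as a class label instead of a number to regress on\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"clf.fit(X_tr, y_tr)\n",
"demo_preds = clf.predict(X_te)\n",
"\n",
"print(f\"accuracy={accuracy_score(y_te, demo_preds):.2f}\")"
]
}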
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}