{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression using the embeddings\n", "\n", "Regression means predicting a number rather than a category. We will predict the score based on the embedding of the review's text. We split the dataset into a training set and a test set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n", "\n", "We're predicting the score of the review, which is a number between 1 and 5 (1 star being the most negative and 5 stars the most positive)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Babbage similarity embedding performance on 1k Amazon reviews: mse=0.38, mae=0.39\n" ] } ], "source": [ "from ast import literal_eval\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", "\n", "df = pd.read_csv('output/embedded_1k_reviews.csv')\n", "# Embeddings are stored as strings in the CSV; parse each one into a NumPy array\n", "df['babbage_similarity'] = df.babbage_similarity.apply(literal_eval).apply(np.array)\n", "\n", "# Hold out 20% of the reviews for evaluation\n", "X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n", "\n", "# Fit a random forest on the embedding vectors and predict the held-out scores\n", "rfr = RandomForestRegressor(n_estimators=100)\n", "rfr.fit(X_train, y_train)\n", "preds = rfr.predict(X_test)\n", "\n", "mse = mean_squared_error(y_test, preds)\n", "mae = mean_absolute_error(y_test, preds)\n", "\n", "print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy mean prediction performance on Amazon reviews: mse=1.77, mae=1.04\n" ] } ], "source": [ "# Baseline: always predict the mean of the test scores\n", "bmse = mean_squared_error(y_test, 
np.repeat(y_test.mean(), len(y_test)))\n", "bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n", "print(f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model trained on the embeddings predicts scores with a mean absolute error of 0.39, compared with 1.04 for the baseline that always predicts the mean score. This is roughly equivalent to predicting 2 out of 3 reviews exactly and being off by one star on the remaining 1 out of 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also train a classifier to predict the label, or use the embeddings within an existing ML model to encode free-text features." ] } ], "metadata": { "interpreter": { "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8" }, "kernelspec": { "display_name": "Python 3.7.3 64-bit ('base': conda)", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }