Add description and rename Obtain_dataset, and throw error when fine-tuned model not available (#727)

Branch: pull/1077/head
Authored by Cathy Chen 8 months ago · Committed by GitHub
Parent: 61de0c3b02
Commit: 487913a10b

.gitignore (vendored) · 3 changes

@@ -109,6 +109,9 @@ venv/
 ENV/
 env.bak/
 venv.bak/
+pyvenv.cfg
+share/
+bin/

 # Spyder project settings
 .spyderproject

@@ -9,7 +9,7 @@
 "\n",
 "There are many ways to classify text. This notebook shares an example of text classification using embeddings. For many text classification tasks, we've seen fine-tuned models do better than embeddings. See an example of fine-tuned models for classification in [Fine-tuned_classification.ipynb](Fine-tuned_classification.ipynb). We also recommend having more examples than embedding dimensions, which we don't quite achieve here.\n",
 "\n",
-"In this text classification task, we predict the score of a food review (1 to 5) based on the embedding of the review's text. We split the dataset into a training and a testing set for all the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n"
+"In this text classification task, we predict the score of a food review (1 to 5) based on the embedding of the review's text. We split the dataset into a training and a testing set for all the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).\n"
 ]
 },
 {
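For reference, a minimal sketch of the approach the classification notebook describes, assuming the CSV and column names produced by the renamed Get_embeddings_from_dataset notebook (the path and the `embedding`/`Score` columns are assumptions):

```python
# Sketch: predict the 1-5 review score from precomputed embeddings.
# Path and column names are assumed, not taken from this diff.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
X = np.array(df["embedding"].apply(eval).tolist())  # parse stringified lists
y = df["Score"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```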

@@ -7,7 +7,7 @@
 "source": [
 "## Clustering\n",
 "\n",
-"We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+"We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
 ]
 },
 {
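Likewise, a minimal k-means sketch under the same assumed CSV and columns:

```python
# Sketch: k-means over the review embeddings (path/columns assumed as above).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
matrix = np.array(df["embedding"].apply(eval).tolist())

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(matrix)
print(df.groupby("Cluster")["Score"].mean())  # mean review score per cluster
```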

@@ -4,6 +4,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"# Get embeddings from dataset\n",
+"\n",
+"This notebook gives an example of how to get embeddings from a large dataset.\n",
+"\n",
+"\n",
 "## 1. Load the dataset\n",
 "\n",
 "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of this dataset, consisting of the 1,000 most recent reviews, for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",

@@ -601,6 +601,9 @@
 "response = openai.FineTuningJob.retrieve(job_id)\n",
 "fine_tuned_model_id = response[\"fine_tuned_model\"]\n",
 "\n",
+"if fine_tuned_model_id is None:\n",
+"    raise RuntimeError(\"Fine-tuned model ID not found. Your job has likely not been completed yet.\")\n",
+"\n",
 "print(\"Fine-tuned model ID:\", fine_tuned_model_id)"
 ]
 },
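The new guard fails fast. If waiting is preferable, a hedged alternative is to poll the job until it reaches a terminal status (same pre-1.0 API as the cell above; `job_id` is assumed to come from the earlier job-creation call):

```python
# Sketch: poll the fine-tuning job instead of raising immediately.
import time

import openai

while True:
    response = openai.FineTuningJob.retrieve(job_id)
    fine_tuned_model_id = response["fine_tuned_model"]
    if fine_tuned_model_id is not None:
        break
    if response["status"] in ("failed", "cancelled"):
        raise RuntimeError(f"Fine-tuning job ended with status: {response['status']}")
    time.sleep(30)  # check again in 30 seconds

print("Fine-tuned model ID:", fine_tuned_model_id)
```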

@@ -686,7 +686,7 @@
 "\n",
 "### Create embeddings\n",
 "\n",
-"This initial section reuses the approach from the [Obtain_dataset Notebook](Obtain_dataset.ipynb) to create embeddings from a combined field concatenating all of our features"
+"This initial section reuses the approach from the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb) to create embeddings from a combined field concatenating all of our features"
 ]
 },
 {
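A minimal sketch of the nearest-neighbour step those combined-field embeddings feed into (function and variable names are illustrative, not from the notebook):

```python
# Sketch: recommend the k items whose embeddings are closest to a source item.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def recommendations(embeddings, source_index, k=5):
    source = embeddings[source_index]
    scores = np.array([cosine_similarity(source, e) for e in embeddings])
    ranked = np.argsort(scores)[::-1]  # most similar first
    return [i for i in ranked if i != source_index][:k]
```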

@@ -7,7 +7,7 @@
 "source": [
 "## Regression using the embeddings\n",
 "\n",
-"Regression means predicting a number, rather than one of the categories. We will predict the score based on the embedding of the review's text. We split the dataset into a training and a testing set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
+"Regression means predicting a number, rather than one of the categories. We will predict the score based on the embedding of the review's text. We split the dataset into a training and a testing set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).\n",
 "\n",
 "We're predicting the score of the review, which is a number between 1 and 5 (1-star being negative and 5-star positive)."
 ]
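A minimal regression sketch under the same assumed CSV and columns:

```python
# Sketch: regression on embeddings, evaluated by mean absolute error.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
X = np.array(df["embedding"].apply(eval).tolist())
y = df["Score"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, rfr.predict(X_test)))
```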

@@ -7,7 +7,7 @@
 "source": [
 "## Semantic text search using embeddings\n",
 "\n",
-"We can search through all our reviews semantically in a very efficient manner and at very low cost, by embedding our search query, and then finding the most similar reviews. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+"We can search through all our reviews semantically in a very efficient manner and at very low cost, by embedding our search query, and then finding the most similar reviews. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
 ]
 },
 {
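A minimal search sketch: embed the query, then rank by cosine similarity (path, columns, and model name are assumptions):

```python
# Sketch: semantic search over precomputed review embeddings.
import numpy as np
import openai
import pandas as pd

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
df["embedding"] = df["embedding"].apply(eval).apply(np.array)

def search_reviews(df, query, n=3, model="text-embedding-ada-002"):
    q = np.array(openai.Embedding.create(input=[query], model=model)["data"][0]["embedding"])
    sims = df["embedding"].apply(lambda e: np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q)))
    return df.loc[sims.nlargest(n).index, "combined"]

print(search_reviews(df, "delicious beans"))
```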

@@ -7,7 +7,7 @@
 "source": [
 "## User and product embeddings\n",
 "\n",
-"We calculate user and product embeddings based on the training set, and evaluate the results on the unseen test set. We will evaluate the results by plotting the user and product similarity versus the review score. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+"We calculate user and product embeddings based on the training set, and evaluate the results on the unseen test set. We will evaluate the results by plotting the user and product similarity versus the review score. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
 ]
 },
 {
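A minimal sketch of the averaging step, assuming the same CSV plus `UserId` and `ProductId` columns:

```python
# Sketch: a user's (or product's) embedding is the mean of the embeddings of
# their reviews in the training set.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
df["embedding"] = df["embedding"].apply(eval).apply(np.array)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

def mean_embedding(group):
    return np.mean(np.stack(group.to_list()), axis=0)

user_embeddings = train_df.groupby("UserId")["embedding"].apply(mean_embedding)
product_embeddings = train_df.groupby("ProductId")["embedding"].apply(mean_embedding)
```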

@@ -7,7 +7,7 @@
 "source": [
 "## Visualizing the embeddings in 2D\n",
 "\n",
-"We will use t-SNE to reduce the dimensionality of the embeddings from 1536 to 2. Once the embeddings are reduced to two dimensions, we can plot them in a 2D scatter plot. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+"We will use t-SNE to reduce the dimensionality of the embeddings from 1536 to 2. Once the embeddings are reduced to two dimensions, we can plot them in a 2D scatter plot. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
 ]
 },
 {
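A minimal t-SNE sketch under the same assumptions:

```python
# Sketch: reduce 1536-d embeddings to 2-D with t-SNE and colour by score.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
matrix = np.array(df["embedding"].apply(eval).tolist())

tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random")
vis = tsne.fit_transform(matrix)

plt.scatter(vis[:, 0], vis[:, 1], c=df["Score"], cmap="RdYlGn", alpha=0.5)
plt.colorbar(label="Score")
plt.show()
```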

@@ -7,7 +7,7 @@
 "source": [
 "## Visualizing the embeddings in W&B\n",
 "\n",
-"We will upload the data to [Weights & Biases](http://wandb.ai) and use an [Embedding Projector](https://docs.wandb.ai/ref/app/features/panels/weave/embedding-projector) to visualize the embeddings using common dimension reduction algorithms like PCA, UMAP, and t-SNE. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
+"We will upload the data to [Weights & Biases](http://wandb.ai) and use an [Embedding Projector](https://docs.wandb.ai/ref/app/features/panels/weave/embedding-projector) to visualize the embeddings using common dimension reduction algorithms like PCA, UMAP, and t-SNE. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
 ]
 },
 {
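A hedged sketch of the upload step, assuming a hypothetical project name and the same CSV as above; requires `wandb login` beforehand:

```python
# Sketch: log the embeddings as a W&B table; the Embedding Projector panel can
# then be pointed at the emb_* columns. Project name is hypothetical.
import pandas as pd
import wandb

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv")
embeddings = df["embedding"].apply(eval).to_list()

run = wandb.init(project="openai-embeddings")  # hypothetical project name
columns = ["Score"] + [f"emb_{i}" for i in range(len(embeddings[0]))]
rows = [[score] + emb for score, emb in zip(df["Score"], embeddings)]
run.log({"embeddings": wandb.Table(columns=columns, data=rows)})
run.finish()
```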

@@ -7,7 +7,7 @@
 "source": [
 "## Zero-shot classification with embeddings\n",
 "\n",
-"In this notebook we will classify the sentiment of reviews using embeddings and zero labeled data! The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
+"In this notebook we will classify the sentiment of reviews using embeddings and zero labeled data! The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).\n",
 "\n",
 "We'll define positive sentiment to be 4- and 5-star reviews, and negative sentiment to be 1- and 2-star reviews. 3-star reviews are considered neutral and we won't use them for this example.\n",
