{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get embeddings from dataset\n", "\n", "This notebook shows how to get embeddings from a large dataset.\n", "\n", "\n", "## 1. Load the dataset\n", "\n", "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of this dataset, consisting of the 1,000 most recent reviews, for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n", "\n", "We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "import tiktoken\n", "\n", "from utils.embeddings_utils import get_embedding\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# embedding model parameters\n", "embedding_model = \"text-embedding-ada-002\"\n", "embedding_encoding = \"cl100k_base\" # this is the encoding for text-embedding-ada-002\n", "max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Time | ProductId | UserId | Score | Summary | Text | combined |
|---|---|---|---|---|---|---|---|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... |