{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Vq31CdSRpgkI" }, "source": [ "# Customizing embeddings\n", "\n", "This notebook demonstrates one way to customize OpenAI embeddings to a particular task.\n", "\n", "The input is training data in the form of [text_1, text_2, label], where label is +1 if the pair is similar and -1 if the pair is dissimilar.\n", "\n", "The output is a matrix that you can use to multiply your embeddings. The product of this multiplication is a 'custom embedding' that will better emphasize aspects of the text relevant to your use case. In binary classification use cases, we've seen error rates drop by as much as 50%.\n", "\n", "In the following example, I use 1,000 sentence pairs picked from the SNLI corpus. Each pair of sentences is logically entailed (i.e., one implies the other). These pairs are our positives (label = 1). We generate synthetic negatives by combining sentences from different pairs, which are presumed to not be logically entailed (label = -1).\n", "\n", "For a clustering use case, you can generate positives by creating pairs from texts in the same clusters and generate negatives by creating pairs from sentences in different clusters.\n", "\n", "With other data sets, we have seen decent improvement with as few as ~100 training examples. Of course, performance will be better with more examples." ] }, { "cell_type": "markdown", "metadata": { "id": "arB38jFwpgkK" }, "source": [ "# 0. 
Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "ifvM7g4apgkK" }, "outputs": [], "source": [ "# imports\n", "from typing import List, Tuple # for type hints\n", "\n", "import numpy as np # for manipulating arrays\n", "import pandas as pd # for manipulating data in dataframes\n", "import pickle # for saving the embeddings cache\n", "import plotly.express as px # for plots\n", "import random # for generating run IDs\n", "from sklearn.model_selection import train_test_split # for splitting train & test data\n", "import torch # for matrix optimization\n", "\n", "from openai.embeddings_utils import get_embedding, cosine_similarity # for embeddings\n" ] }, { "cell_type": "markdown", "metadata": { "id": "DtBbryAapgkL" }, "source": [ "## 1. Inputs\n", "\n", "Most inputs are here. The key things to change are where to load your dataset from, where to save a cache of embeddings to, and which embedding engine you want to use.\n", "\n", "Depending on how your data is formatted, you'll want to rewrite the process_input_data function."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "UzxcWRCkpgkM" }, "outputs": [], "source": [ "# input parameters\n", "embedding_cache_path = \"data/snli_embedding_cache.pkl\" # embeddings will be saved/loaded here\n", "default_embedding_engine = \"babbage-similarity\" # text-embedding-ada-002 is recommended\n", "num_pairs_to_embed = 1000 # 1000 is arbitrary\n", "local_dataset_path = \"data/snli_1.0_train_2k.csv\" # download from: https://nlp.stanford.edu/projects/snli/\n", "\n", "\n", "def process_input_data(df: pd.DataFrame) -> pd.DataFrame:\n", " # you can customize this to preprocess your own dataset\n", " # output should be a dataframe with 3 columns: text_1, text_2, label (1 for similar, -1 for dissimilar)\n", " df[\"label\"] = df[\"gold_label\"]\n", " df = df[df[\"label\"].isin([\"entailment\"])]\n", " df[\"label\"] = df[\"label\"].apply(lambda x: {\"entailment\": 1, \"contradiction\": -1}[x])\n", " df = df.rename(columns={\"sentence1\": \"text_1\", \"sentence2\": \"text_2\"})\n", " df = df[[\"text_1\", \"text_2\", \"label\"]]\n", " df = df.head(num_pairs_to_embed)\n", " return df\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aBbH71hEpgkM" }, "source": [ "## 2. Load and process input data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "kAKLjYG6pgkN", "outputId": "dc178688-e97d-4ad0-b26c-dff67b858966" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/r4/x3kdvs816995fnnph2gdpwp40000gn/T/ipykernel_17509/1977422881.py:13: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df[\"label\"] = df[\"label\"].apply(lambda x: {\"entailment\": 1, \"contradiction\": -1}[x])\n" ] }, { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>text_1</th><th>text_2</th><th>label</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>2</th><td>A person on a horse jumps over a broken down a...</td><td>A person is outdoors, on a horse.</td><td>1</td></tr>\n", " <tr><th>4</th><td>Children smiling and waving at camera</td><td>There are children present</td><td>1</td></tr>\n", " <tr><th>7</th><td>A boy is jumping on skateboard in the middle o...</td><td>The boy does a skateboarding trick.</td><td>1</td></tr>\n", " <tr><th>14</th><td>Two blond women are hugging one another.</td><td>There are women showing affection.</td><td>1</td></tr>\n", " <tr><th>17</th><td>A few people in a restaurant setting, one of t...</td><td>The diners are at a restaurant.</td><td>1</td></tr>\n", " </tbody>\n", "</table>" ], "text/plain": [ " text_1 \\\n", "2 A person on a horse jumps over a broken down a... \n", "4 Children smiling and waving at camera \n", "7 A boy is jumping on skateboard in the middle o... \n", "14 Two blond women are hugging one another. \n", "17 A few people in a restaurant setting, one of t... \n", "\n", " text_2 label \n", "2 A person is outdoors, on a horse. 1 \n", "4 There are children present 1 \n", "7 The boy does a skateboarding trick. 1 \n", "14 There are women showing affection. 1 \n", "17 The diners are at a restaurant. 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load data\n", "df = pd.read_csv(local_dataset_path)\n", "\n", "# process input data\n", "df = process_input_data(df) # this demonstrates training data containing only positives\n", "\n", "# view data\n", "df.head()\n" ] }, { "cell_type": "markdown", "metadata": { "id": "z2F1cCoYpgkO" }, "source": [ "## 3. Split data into training and test sets\n", "\n", "Note that it's important to split data into training and test sets *before* generating synthetic negatives or positives. You don't want any text strings in the training data to show up in the test data. If there's contamination, the test metrics will look better than they'll actually be in production." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "50QmnH2qpgkO", "outputId": "6144029b-eb29-439e-9990-7aeb28168e56" }, "outputs": [], "source": [ "# split data into train and test sets\n", "test_fraction = 0.5 # 0.5 is fairly arbitrary\n", "random_seed = 123 # random seed is arbitrary, but is helpful for reproducibility\n", "train_df, test_df = train_test_split(\n", " df, test_size=test_fraction, stratify=df[\"label\"], random_state=random_seed\n", ")\n", "train_df.loc[:, \"dataset\"] = \"train\"\n", "test_df.loc[:, \"dataset\"] = \"test\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MzAFkA2opgkP" }, "source": [ "## 4. 
Generate synthetic negatives\n", "\n", "This is another piece of the code that you will need to modify to match your use case.\n", "\n", "If you have data with positives and negatives, you can skip this section.\n", "\n", "If you have data with only positives, you can mostly keep it as is, where it generates negatives only.\n", "\n", "If you have multiclass data, you will want to generate both positives and negatives. The positives can be pairs of text that share labels, and the negatives can be pairs of text that do not share labels.\n", "\n", "The final output should be a dataframe with text pairs, where each pair is labeled -1 or 1." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "rUYd9V0zpgkP" }, "outputs": [], "source": [ "# generate negatives\n", "def dataframe_of_negatives(dataframe_of_positives: pd.DataFrame) -> pd.DataFrame:\n", " \"\"\"Return dataframe of negative pairs made by combining elements of positive pairs.\"\"\"\n", " texts = set(dataframe_of_positives[\"text_1\"].values) | set(\n", " dataframe_of_positives[\"text_2\"].values\n", " )\n", " all_pairs = {(t1, t2) for t1 in texts for t2 in texts if t1 < t2}\n", " # sort each positive pair so its order matches all_pairs (t1 < t2);\n", " # otherwise a positive pair could reappear reversed among the negatives\n", " positive_pairs = set(\n", " tuple(sorted(text_pair))\n", " for text_pair in dataframe_of_positives[[\"text_1\", \"text_2\"]].values\n", " )\n", " negative_pairs = all_pairs - positive_pairs\n", " df_of_negatives = pd.DataFrame(list(negative_pairs), columns=[\"text_1\", \"text_2\"])\n", " df_of_negatives[\"label\"] = -1\n", " return df_of_negatives\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "Rkh8-J89pgkP" }, "outputs": [], "source": [ "negatives_per_positive = (\n", " 1 # it will work at higher values too, but more data will be slower\n", ")\n", "# generate negatives for training dataset\n", "train_df_negatives = dataframe_of_negatives(train_df)\n", "train_df_negatives[\"dataset\"] = \"train\"\n", "# generate negatives for test dataset\n", "test_df_negatives = dataframe_of_negatives(test_df)\n", "test_df_negatives[\"dataset\"] = \"test\"\n", "# sample negatives and combine with positives\n", "train_df = pd.concat(\n", " [\n", " train_df,\n", " train_df_negatives.sample(\n", " n=len(train_df) * negatives_per_positive, random_state=random_seed\n", " ),\n", " ]\n", ")\n", "test_df = pd.concat(\n", " [\n", " test_df,\n", " test_df_negatives.sample(\n", " n=len(test_df) * negatives_per_positive, random_state=random_seed\n", " ),\n", " ]\n", ")\n", "\n", "df = pd.concat([train_df, test_df])\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8MVSLMSrpgkQ" }, "source": [ "## 5. Calculate embeddings and cosine similarities\n", "\n", "Here, I create a cache to save the embeddings." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "walks in the water of a creek.\n", "Getting embedding for The toddler has milk around the corners of his mouth.\n", "Getting embedding for Community members are spending time in the park near a foundtain.\n", "Getting embedding for Some women are reading.\n", "Getting embedding for There is a windsurfer balancing on choppy water.\n", "Getting embedding for White small child wearing a brown and gray striped hoodie plays at park.\n", "Getting embedding for A man wearing a striped top and jeans does a skateboard trick on some steps while a man who is hunched over photographs him.\n", "Getting embedding for A good-looking firefighter sets up \"Do Not Cross\" tape in the city.\n", "Getting embedding for Children enjoy playing together.\n", "Getting embedding for a man with a white covering is walking up a flight of stairs.\n", "Getting embedding for The couple is outdoors.\n", "Getting embedding for Two women hug each other.\n", "Getting embedding for A man sits in front of a set up chess game.\n", "Getting embedding for Children's soccer game being played while the sun sets in the background.\n", "Getting embedding for A happy woman smiling\n", "Getting embedding for Two children, in colorful outfits, playing in a field with a big rock in the middle.\n", "Getting embedding for There are people watching another person hang up pictures.\n", "Getting embedding for a man is photographing a man skateboarding.\n", "Getting embedding for People are near water.\n", "Getting embedding for A young woman frolicking on the lawn in front of the us capitol building.\n", "Getting embedding for The boy is making a mess.\n", "Getting embedding for One person running next to their bike with the person riding their bike behind them.\n", "Getting embedding for The child was walking near the grass making a funny face.\n", "Getting embedding for People are working.\n", "Getting embedding for A woman in blue jeans and a dark jacket walks in 
front of a building.\n", "Getting embedding for There are two woman in this picture.\n", "Getting embedding for Two children, in colorful outfits, playing in a field with a big rock in the middle.\n", "Getting embedding for A middle aged oriental woman in a green headscarf and blue shirt is flashing a giant smile\n", "Getting embedding for People are outside.\n", "Getting embedding for A man is wakeboarding.\n", "Getting embedding for The person is interested in a water jet.\n", "Getting embedding for The red and black team are playing a game.\n", "Getting embedding for People with bikes.\n", "Getting embedding for The guys are playing a game.\n", "Getting embedding for a group of men and women converse\n", "Getting embedding for People wait on traffic.\n", "Getting embedding for The biker is jumping into a hole.\n", "Getting embedding for An elderly man is drinking orange juice at a cafe.\n", "Getting embedding for two people by a fountain\n", "Getting embedding for People are walking outdoors.\n", "Getting embedding for the woman is wearing a red shirt.\n", "Getting embedding for Firefighters are checking a car.\n", "Getting embedding for Students practicing yoga in a class setting.\n", "Getting embedding for The people stretched on yoga mats.\n", "Getting embedding for Woman in white in foreground and a man slightly behind walking with a sign for John's Pizza and Gyro in the background.\n", "Getting embedding for a woman is talking\n", "Getting embedding for The man plays guitar\n", "Getting embedding for Man in blue glasses walking pass a building\n", "Getting embedding for People are having a discussion.\n", "Getting embedding for Three men are grouped around the back of a car with its tailgate out, two of the men clothed in yellow uniforms and one in blue.\n", "Getting embedding for Outside by the trees, a woman wearing jeans and red jacket throws something for a German shepherd to chase.\n", "Getting embedding for Several people in an alleyway.\n", "Getting 
embedding for a bearded man pulls a rope\n", "Getting embedding for A person is a red hat and winter jacket is looking into the distance.\n", "Getting embedding for Two children play outside in a field.\n", "Getting embedding for lots of people are in the street\n", "Getting embedding for The silhouette of three people in front of a wall.\n", "Getting embedding for Two kids are playing with a big rock in the field\n", "Getting embedding for There is a soccer game.\n", "Getting embedding for Four guys are playing basketball.\n", "Getting embedding for The boy in the blue and yellow top is standing with arms outstretched.\n", "Getting embedding for The red and black team are playing a game.\n", "Getting embedding for The blond woman is searching for medical supplies in a suitcase.\n", "Getting embedding for The boy does a skateboarding trick.\n", "Getting embedding for Street performer in colorful shirt performing with small guitar.\n", "Getting embedding for lots of people are in the street\n", "Getting embedding for the dogs see each other\n", "Getting embedding for Man wearing black t-shirt sitting at a computer desk.\n", "Getting embedding for two people by a fountain\n", "Getting embedding for The girl blows a butterfly.\n", "Getting embedding for The people are by the wall.\n", "Getting embedding for A white horse is pulling a cart while a man stands and watches.\n", "Getting embedding for Schoolchildren together\n", "Getting embedding for They are outside wearing coats.\n", "Getting embedding for The dog is running.\n", "Getting embedding for Two people enjoying a water fountain display.\n", "Getting embedding for Two people enjoying a water fountain display.\n", "Getting embedding for a woman on a yellow shirt is on the floor.\n", "Getting embedding for People walking around in a big city.\n", "Getting embedding for The man is outside.\n", "Getting embedding for There are bicyclists stopped at a road.\n", "Getting embedding for The couple is outdoors.\n", 
"Getting embedding for The people are outside.\n", "Getting embedding for Two older men in coats are standing outside.\n", "Getting embedding for Two men are laughing and enjoying themselves.\n", "Getting embedding for A street performer is trying to earn extra money.\n", "Getting embedding for The dog is running.\n", "Getting embedding for There are people at work.\n", "Getting embedding for Two barefoot men are playing on a green lawn outside a building with other people in the background.\n", "Getting embedding for A man wearing a colorful and striped sweater plays music in the street.\n", "Getting embedding for a woman looking at her cellphone\n", "Getting embedding for The man is laying down to sleep\n", "Getting embedding for Woman in white in foreground and a man slightly behind walking with a sign for John's Pizza and Gyro in the background.\n", "Getting embedding for An Asian woman in a blue top and green headscarf smiling widely as another woman rows a boat in the background.\n", "Getting embedding for The dog is in the snow.\n", "Getting embedding for There are scultupres nearby.\n", "Getting embedding for The crowd looked on while the players prepared themselves.\n", "Getting embedding for A yellow uniformed skier is performing a trick across a railed object.\n", "Getting embedding for man ringing a bell\n", "Getting embedding for Two people are outside.\n", "Getting embedding for Two kids are playing with a big rock in the field\n", "Getting embedding for a person in orange\n", "Getting embedding for The parents of the younger male are posing for a picture in front of a water fountain.\n", "Getting embedding for two men serving preparing food.\n", "Getting embedding for many people relax in the yard.\n", "Getting embedding for Two children play outside in a field.\n", "Getting embedding for The man is drinking water.\n", "Getting embedding for the child is working with wood.\n", "Getting embedding for dogs attacking another dog\n", "Getting embedding 
for There are scultupres nearby.\n", "Getting embedding for The man is laying down to sleep\n", "Getting embedding for man ringing a bell\n", "Getting embedding for A woman wearing all white and eating, walks next to a man holding a briefcase.\n", "Getting embedding for A street performer is trying to earn extra money.\n", "Getting embedding for Child in red and blue shirt painting a log.\n", "Getting embedding for A group of people gathers on the grass in a backyard with tents, tables, and chairs set up.\n", "Getting embedding for The couple is outdoors.\n", "Getting embedding for A man is running behind a sled.\n", "Getting embedding for Three people stand proudly by a truck stocked with building supplies in the street.\n", "Getting embedding for a man with a white covering is walking up a flight of stairs.\n", "Getting embedding for Four guys in wheelchairs on a basketball court two are trying to grab a basketball in midair.\n", "Getting embedding for a man is photographing a man skateboarding.\n", "Getting embedding for A woman in colorful garb with her back to the camera and cloth on her hear.\n", "Getting embedding for Some women are talking.\n", "Getting embedding for The dog is running.\n", "Getting embedding for Two soccer teams are competing on a soccer field.\n", "Getting embedding for A person is dipping her foot into water.\n", "Getting embedding for People are near water.\n", "Getting embedding for A man in the distance is walking past a brick wall painted with words and graffiti.\n", "Getting embedding for Grafffiti on a brick wall.\n", "Getting embedding for Two people with bicycles, one in front running with a bike and one in back riding.\n", "Getting embedding for Two people enjoying a water fountain display.\n", "Getting embedding for The bikers are in the town.\n", "Getting embedding for A group of people sitting at some sort of gathering.\n", "Getting embedding for People are in the street.\n", "Getting embedding for Kids are playing 
outdoors.\n", "Getting embedding for An old man is standing by a building in downtown.\n", "Getting embedding for Workers are on break.\n", "Getting embedding for The child is happy.\n", "Getting embedding for a man with a white covering is walking up a flight of stairs.\n", "Getting embedding for Two kids are playing with a big rock in the field\n", "Getting embedding for lots of people are in the street\n", "Getting embedding for Two people with bicycles, one in front running with a bike and one in back riding.\n", "Getting embedding for A woman is at a machine.\n", "Getting embedding for Street performer in colorful shirt performing with small guitar.\n", "Getting embedding for People are about to eat.\n", "Getting embedding for Man sitting on a motorcycle on the sidewalk\n", "Getting embedding for Two blond women are hugging one another.\n", "Getting embedding for the child is working with wood.\n", "Getting embedding for Some women are reading.\n", "Getting embedding for A woman preparing to glaze\n", "Getting embedding for There is a windsurfer balancing on choppy water.\n", "Getting embedding for The man is able to grow a beard.\n", "Getting embedding for THe woman is sitting down\n", "Getting embedding for Two women hug each other.\n", "Getting embedding for There is a girl standing\n", "Getting embedding for A man sits in front of a set up chess game.\n", "Getting embedding for Men are playing soccer, the one in front is about to kick the ball.\n", "Getting embedding for The man is playing music on an instrument.\n", "Getting embedding for A white and brown dog is leaping through the air.\n", "Getting embedding for The racer is driving.\n", "Getting embedding for a bearded man pulls a rope\n", "Getting embedding for Six soccer players on field with player in red uniform in the air and ball airborne.\n", "Getting embedding for The boy is young.\n", "Getting embedding for Cheerleaders cheer on a field for an activity.\n", "Getting embedding for Two people 
are next to a fountain with a red bottom and arches of water.\n", "Getting embedding for A woman talking to four little children outside.\n", "Getting embedding for Two children in hats play in an open, rocky field.\n", "Getting embedding for a woman with a straw hat working on a strange machine with coconuts at her side.\n", "Getting embedding for The guys are playing a game.\n", "Getting embedding for The blond woman is searching for medical supplies in a suitcase.\n", "Getting embedding for The child is painting.\n", "Getting embedding for A woman holding a boombox.\n", "Getting embedding for A woman stand on a fountain and dips her toes in.\n", "Getting embedding for dogs attacking another dog\n", "Getting embedding for Four people near a body of water, one sitting and three standing, while two people walk on a nearby sidewalk.\n", "Getting embedding for People are working.\n", "Getting embedding for The girl is under the age of 88 years old.\n", "Getting embedding for Workers are on break.\n", "Getting embedding for three bikers stop in town.\n", "Getting embedding for Two kids are with a wagon.\n", "Getting embedding for Exhausted looking firemen are walking.\n", "Getting embedding for An older man dressed in blue historical clothing is ringing a bell in his right hand.\n", "Getting embedding for A woman walking outside.\n", "Getting embedding for A dog is running outdoors.\n", "Getting embedding for A crowded city during daytime.\n", "Getting embedding for A man wants a woman to look at his clipboard\n", "Getting embedding for There are scultupres nearby.\n", "Getting embedding for A woman holding a boombox.\n", "Getting embedding for man playing soccer\n", "Getting embedding for A woman in capri jeans crouches on the edge of a fountain with her left foot kicked out to touch the falling water.\n", "Getting embedding for lots of people are in the street\n", "Getting embedding for People are fishing and walking next to the water.\n", "Getting embedding for two 
men serving preparing food.\n", "Getting embedding for the bike is tied to a sign\n", "Getting embedding for Some women are talking.\n", "Getting embedding for The man is outside.\n", "Getting embedding for many people relax in the yard.\n", "Getting embedding for Some children are playing jump rope.\n", "Getting embedding for A middle aged oriental woman in a green headscarf and blue shirt is flashing a giant smile\n", "Getting embedding for People are in the street.\n", "Getting embedding for The man wearing lots of medals is watching the girl in the yellow bikini top.\n", "Getting embedding for a man wearing blue plays soccer.\n", "Getting embedding for The biker is jumping into a hole.\n", "Getting embedding for man playing soccer\n", "Getting embedding for There is a soccer game with a team in yellow.\n", "Getting embedding for Two children in hats play in an open, rocky field.\n", "Getting embedding for A man is on a dirt bike.\n", "Getting embedding for Someone sitting outside behind a chessboard.\n", "Getting embedding for The old man is painting a portrait.\n", "Getting embedding for There are some people outside.\n", "Getting embedding for many people relax in the yard.\n", "Getting embedding for A young girl sitting at a table with a bowl on her head\n", "Getting embedding for Three men are grouped around the back of a car with its tailgate out, two of the men clothed in yellow uniforms and one in blue.\n", "Getting embedding for A woman is like to touch the water in fountain\n", "Getting embedding for The boy is young.\n", "Getting embedding for Three construction workers posing with construction materials.\n", "Getting embedding for The man is putting up a poster.\n", "Getting embedding for a woman is talking\n", "Getting embedding for Small laughing child with blond-hair sitting at a table holding a green sippy cup.\n", "Getting embedding for A woman with a yellow to sits.\n" ] } ], "source": [ "# establish a cache of embeddings to avoid 
recomputing\n", "# cache is a dict of tuples (text, engine) -> embedding\n", "try:\n", "    with open(embedding_cache_path, \"rb\") as f:\n", "        embedding_cache = pickle.load(f)\n", "except FileNotFoundError:\n", "    precomputed_embedding_cache_path = \"https://cdn.openai.com/API/examples/data/snli_embedding_cache.pkl\"\n", "    embedding_cache = pd.read_pickle(precomputed_embedding_cache_path)\n", "\n", "\n", "# this function will get embeddings from the cache and save them there afterward\n", "def get_embedding_with_cache(\n", "    text: str,\n", "    engine: str = default_embedding_engine,\n", "    embedding_cache: dict = embedding_cache,\n", "    embedding_cache_path: str = embedding_cache_path,\n", ") -> list:\n", "    print(f\"Getting embedding for {text}\")\n", "    if (text, engine) not in embedding_cache.keys():\n", "        # if not in cache, call API to get embedding\n", "        embedding_cache[(text, engine)] = get_embedding(text, engine)\n", "        # save embeddings cache to disk after each update\n", "        with open(embedding_cache_path, \"wb\") as embedding_cache_file:\n", "            pickle.dump(embedding_cache, embedding_cache_file)\n", "    return embedding_cache[(text, engine)]\n", "\n", "\n", "# create column of embeddings\n", "for column in [\"text_1\", \"text_2\"]:\n", "    df[f\"{column}_embedding\"] = df[column].apply(get_embedding_with_cache)\n", "\n", "# create column of cosine similarity between embeddings\n", "df[\"cosine_similarity\"] = df.apply(\n", "    lambda row: cosine_similarity(row[\"text_1_embedding\"], row[\"text_2_embedding\"]),\n", "    axis=1,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "4pwn608LpgkQ" }, "source": [ "## 6. Plot distribution of cosine similarity\n", "\n", "Here we measure similarity of text using cosine similarity. In our experience, most distance functions (L1, L2, cosine similarity) all work about the same. 
Note that our embeddings are already normalized to length 1, so cosine similarity is equivalent to dot product.\n", "\n", "The graphs show how much overlap there is between the distribution of cosine similarities for similar and dissimilar pairs. If there is a high amount of overlap, that means there are some dissimilar pairs with greater cosine similarity than some similar pairs.\n", "\n", "The accuracy I compute is the accuracy of a simple rule that predicts 'similar (1)' if the cosine similarity is above some threshold X and otherwise predicts 'dissimilar (-1)'." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "SoeDF8vqpgkQ", "outputId": "17db817e-1702-4089-c4e8-8ca32d294930" }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "label=1
