{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "0wjP9mrldJsd" }, "source": [ "## Visualizing the embeddings in Kangas\n", "\n", "In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions." ] }, { "cell_type": "markdown", "metadata": { "id": "4tPKQqqldJsj" }, "source": [ "## What is Kangas?\n", "\n", "[Kangas](https://github.com/comet-ml/kangas/) is an open-source, mixed-media, dataframe-like tool for data scientists. It was developed by [Comet](https://comet.com/), a company that helps reduce the friction of moving models into production." ] }, { "cell_type": "markdown", "metadata": { "id": "6sNsB2iFdJsk" }, "source": [ "### 1. Setup\n", "\n", "To get started, we install Kangas with pip and import it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "N8gi529adL-f", "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad" }, "outputs": [], "source": [ "%pip install kangas --quiet" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "htxjXThodRxD" }, "outputs": [], "source": [ "import kangas as kg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Constructing a Kangas DataGrid\n", "\n", "We create a Kangas DataGrid with the original data and the embeddings. The data consists of rows of reviews, and each embedding consists of 1536 floating-point values. In this example, we get the data directly from GitHub, in case you aren't running this notebook inside OpenAI's repo.\n", "\n", "We use Kangas to read the CSV file into a DataGrid for further processing." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0SxWlRTrdVJq", "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "1001it [00:00, 2412.90it/s]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n" ] } ], "source": [ "data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can review the fields of the CSV file:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bzhQgoRGeMCp", "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataGrid (in memory)\n", " Name : fine_food_reviews_with_embeddings_1k\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 INTEGER \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 TEXT \n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And get a glimpse of the first and last rows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 349 }, "id": "Q95N832aeaBr", "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e" }, "outputs": [ { 
"data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
row-id Column 1 ProductId UserId Score Summary Text combined n_tokens embedding
1 0 B003XPF9BO A3R7JR3FMEBXQB 5 where does one Wanted to save Title: where do 52 [0.007018072064
2 297 B003VXHGPK A21VWSCGW7UUAR 4 Good, but not W Honestly, I hav Title: Good, bu 178 [-0.00314055196
3 296 B008JKTTUA A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118
4 295 B000LKTTTW A14MQ40CCU8B13 5 Best tomato sou I have a hard t Title: Best tom 111 [-0.00139322795
5 294 B001D09KAM A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118
...
996 623 B0000CFXYA A3GS4GWPIBV0NT 1 Strange inflamm Truthfully wasn Title: Strange 110 [0.000110913533
997 624 B0001BH5YM A1BZ3HMAKK0NC 5 My favorite and You've just got Title: My favor 80 [-0.02086931467
998 625 B0009ET7TC A2FSDQY5AI6TNX 5 My furbabies LO Shake the conta Title: My furba 47 [-0.00974910240
999 619 B007PA32L2 A15FF2P7RPKH6G 5 got this for th all i have hear Title: got this 50 [-0.00521062919
1000 999 B001EQ5GEO A3VYU0VO6DYV6I 5 I love Maui Cof My first experi Title: I love M 118 [-0.00605782261
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface
" ], "text/plain": [ " row-id Column 1 ProductId UserId Score Summary Text combined n_tokens embedding \n", " 1 0 B003XPF9BO A3R7JR3FMEBXQB 5 where does one Wanted to save Title: where do 52 [0.007018072064 \n", " 2 297 B003VXHGPK A21VWSCGW7UUAR 4 Good, but not W Honestly, I hav Title: Good, bu 178 [-0.00314055196 \n", " 3 296 B008JKTTUA A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", " 4 295 B000LKTTTW A14MQ40CCU8B13 5 Best tomato sou I have a hard t Title: Best tom 111 [-0.00139322795 \n", " 5 294 B001D09KAM A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", "...\n", " 996 623 B0000CFXYA A3GS4GWPIBV0NT 1 Strange inflamm Truthfully wasn Title: Strange 110 [0.000110913533 \n", " 997 624 B0001BH5YM A1BZ3HMAKK0NC 5 My favorite and You've just got Title: My favor 80 [-0.02086931467 \n", " 998 625 B0009ET7TC A2FSDQY5AI6TNX 5 My furbabies LO Shake the conta Title: My furba 47 [-0.00974910240 \n", " 999 619 B007PA32L2 A15FF2P7RPKH6G 5 got this for th all i have hear Title: got this 50 [-0.00521062919 \n", " 1000 999 B001EQ5GEO A3VYU0VO6DYV6I 5 I love Maui Cof My first experi Title: I love M 118 [-0.00605782261 \n", "\n", " [1000 rows x 9 columns] \n", "\n", "* Use DataGrid.save() to save to disk\n", "** Use DataGrid.show() to start user interface" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we create a new DataGrid, converting the numbers into an Embedding:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "Bu0erP68dvLU" }, "outputs": [], "source": [ "import ast # to convert string of a list of numbers into a list of numbers\n", "\n", "dg = kg.DataGrid(\n", " name=\"openai_embeddings\",\n", " columns=data.get_columns(),\n", " converters={\"Score\": str},\n", ")\n", "for row in data:\n", " embedding = ast.literal_eval(row[8])\n", " row[8] = 
kg.Embedding(\n", " embedding,\n", " name=str(row[3]),  # row[3] is the Score column\n", " text=\"%s - %.10s\" % (row[3], row[4]),  # Score plus the first 10 characters of Summary\n", " projection=\"umap\",\n", " )\n", " dg.append(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new DataGrid now has an embedding column with the proper datatype." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gd6Od4Bmhijy", "outputId": "9aa38221-0272-4a63-e393-706e0a0c5879" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataGrid (in memory)\n", " Name : openai_embeddings\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 TEXT \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 EMBEDDING-ASSET \n" ] } ], "source": [ "dg.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We simply save the DataGrid, and we're done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dg.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Render 2D Projections\n", "\n", "To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.\n", "\n", "Scroll to the far right to see the embedding projection for each row.\n", "\n", "The color of each point in projection space represents the Score." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 771 }, "id": "Z8j-GdpiijU0", "outputId": "20a0b1ca-3059-4384-cd8c-b32b1aa1c270" }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dg.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Group by \"Score\" to see the rows in each group." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dg.show(group=\"Score\", sort=\"Score\", rows=5, select=\"Score,embedding\")" ] }, { "cell_type": "markdown", "metadata": { "id": "vLIxfmK5dJsq" }, "source": [ "An example of this DataGrid is hosted here: https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid" ] } ], "metadata": { "accelerator": "TPU", "colab": { "gpuType": "V100", "machine_shape": "hm", "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "vscode": { "interpreter": { "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" } } }, "nbformat": 4, "nbformat_minor": 4 }