{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "0wjP9mrldJsd" }, "source": [ "## Visualizing the embeddings in Kangas\n", "\n", "In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and the projections of the embeddings into 2 dimensions." ] }, { "cell_type": "markdown", "metadata": { "id": "4tPKQqqldJsj" }, "source": [ "## What is Kangas?\n", "\n", "[Kangas](https://github.com/comet-ml/kangas/) is an open-source, mixed-media, dataframe-like tool for data scientists. It was developed by [Comet](https://comet.com/), a company that aims to reduce the friction of moving models into production." ] }, { "cell_type": "markdown", "metadata": { "id": "6sNsB2iFdJsk" }, "source": [ "### 1. Setup\n", "\n", "To get started, we pip install Kangas and import it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "N8gi529adL-f", "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad" }, "outputs": [], "source": [ "%pip install kangas --quiet" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "htxjXThodRxD" }, "outputs": [], "source": [ "import kangas as kg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Constructing a Kangas DataGrid\n", "\n", "We create a Kangas DataGrid with the original data and the embeddings. The data is composed of rows of reviews, and each embedding is composed of 1536 floating-point values. In this example, we get the data directly from GitHub, in case you aren't running this notebook inside OpenAI's repo.\n", "\n", "We use Kangas to read the CSV file into a DataGrid for further processing." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0SxWlRTrdVJq", "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "1001it [00:00, 2412.90it/s]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n" ] } ], "source": [ "data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can review the fields of the CSV file:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bzhQgoRGeMCp", "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataGrid (in memory)\n", " Name : fine_food_reviews_with_embeddings_1k\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 INTEGER \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 TEXT \n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And get a glimpse of the first and last rows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 349 }, "id": "Q95N832aeaBr", "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e" }, "outputs": [ { 
"data": { "text/html": [ "
row-id | Column 1 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one | Wanted to save | Title: where do | 52 | [0.007018072064 |
2 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not W | Honestly, I hav | Title: Good, bu | 178 | [-0.00314055196 |
3 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
4 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato sou | I have a hard t | Title: Best tom | 111 | [-0.00139322795 |
5 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
... | | | | | | | | | |
996 | 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflamm | Truthfully wasn | Title: Strange | 110 | [0.000110913533 |
997 | 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and | You've just got | Title: My favor | 80 | [-0.02086931467 |
998 | 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LO | Shake the conta | Title: My furba | 47 | [-0.00974910240 |
999 | 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for th | all i have hear | Title: got this | 50 | [-0.00521062919 |
1000 | 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Cof | My first experi | Title: I love M | 118 | [-0.00605782261 |
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface