{ "cells": [ { "cell_type": "markdown", "id": "983ef639-fbf4-4912-b593-9cf08aeb11cd", "metadata": {}, "source": [ "# Visualizing embeddings in 3D" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9c9ea9a8-675d-4e3a-a8f7-6f4563df84ad", "metadata": {}, "source": [ "This example uses [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the dimensionality of the embeddings from 1536 to 3, so that the data points can be visualized in a 3D plot. The small dataset `dbpedia_samples.jsonl` was created by randomly sampling 200 entries from the [DBpedia validation dataset](https://www.kaggle.com/danofer/dbpedia-classes?select=DBPEDIA_val.csv)." ] }, { "cell_type": "markdown", "id": "8df5f2c3-ddbb-4cc4-9205-4c0af1670562", "metadata": {}, "source": [ "### 1. Load the dataset and query embeddings" ] }, { "cell_type": "code", "execution_count": 1, "id": "133dfc2a-9dbd-4a5a-96fa-477272f7af5a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Categories of DBpedia samples: Artist 21\n", "Film 19\n", "Plant 19\n", "OfficeHolder 18\n", "Company 17\n", "NaturalPlace 16\n", "Athlete 16\n", "Village 12\n", "WrittenWork 11\n", "Building 11\n", "Album 11\n", "Animal 11\n", "EducationalInstitution 10\n", "MeanOfTransportation 8\n", "Name: category, dtype: int64\n" ] }, { "data": { "text/html": [ "
\n",
"|   | text | category |\n",
"|---|---|---|\n",
"| 0 | Morada Limited is a textile company based in ... | Company |\n",
"| 1 | The Armenian Mirror-Spectator is a newspaper ... | WrittenWork |\n",
"| 2 | Mt. Kinka (金華山 Kinka-zan) also known as Kinka... | NaturalPlace |\n",
"| 3 | Planning the Play of a Bridge Hand is a book ... | WrittenWork |\n",
"| 4 | Wang Yuanping (born 8 December 1976) is a ret... | Athlete |\n",
"