Pushing updated version responding to PR comments

2 years ago · 5ffd9e72ec
parent ed17c4c1b9
commit 5ffd9e72ec
1 changed file with 209 additions and 161 deletions
--- a/examples/vector_databases/Vector_db_introduction.ipynb
+++ b/examples/vector_databases/Vector_db_introduction.ipynb
@ -9,6 +9,16 @@
    "\n",
    "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
    "\n",
+    "### What is a Vector Database\n",
+    "\n",
+    "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
+    "\n",
+    "### Why use a Vector Database\n",
+    "\n",
+    "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
+    "\n",
+    "\n",
+    "### Demo Flow\n",
    "The demo flow is:\n",
    "- **Setup**: Import packages and set any required variables\n",
    "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
@ -21,7 +31,7 @@
    "    - *Index Data*: We'll create an index with __title__ search vectors in it\n",
    "    - *Search Data*: We'll run a few searches to confirm it works\n",
    "\n",
-    "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings"
+    "Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings."
   ]
  },
  {
@ -31,12 +41,12 @@
   "source": [
    "## Setup\n",
    "\n",
-    "Here we import the required libraries and set the embedding model that we'd like to use"
+    "Import the required libraries and set the embedding model that we'd like to use."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 98,
+   "execution_count": 1,
   "id": "5be94df6",
   "metadata": {},
   "outputs": [],
@ -60,7 +70,7 @@
    "import weaviate\n",
    "\n",
    "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
-    "MODEL = \"text-embedding-ada-002\"\n",
+    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
    "\n",
    "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
    "import warnings\n",
@ -76,14 +86,12 @@
   "source": [
    "## Load data\n",
    "\n",
-    "In this section we'll source the data for this task, embed it and format it for insertion into a vector database\n",
-    "\n",
-    "*Thanks to Ryan Greene for the template used for the batch ingestion"
+    "In this section we'll source the data for this task, embed it and format it for insertion into a vector database."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 116,
+   "execution_count": 6,
   "id": "bd99e08e",
   "metadata": {},
   "outputs": [],
@ -92,7 +100,7 @@
    "def get_embeddings(input: List):\n",
    "    response = openai.Embedding.create(\n",
    "        input=input,\n",
-    "        model=MODEL,\n",
+    "        model=EMBEDDING_MODEL,\n",
    "    )[\"data\"]\n",
    "    return [data[\"embedding\"] for data in response]\n",
    "\n",
@ -102,7 +110,6 @@
    "        yield iterable[ndx : min(ndx + n, l)]\n",
    "\n",
    "# Function for batching and parallel processing the embeddings\n",
-    "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
    "def embed_corpus(\n",
    "    corpus: List[str],\n",
    "    batch_size=64,\n",
@ -126,28 +133,21 @@
    "    # Embed the corpus\n",
    "    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
    "        \n",
-    "        try:\n",
-    "            futures = [\n",
-    "                executor.submit(get_embeddings, text_batch)\n",
-    "                for text_batch in batchify(encoded_corpus, batch_size)\n",
-    "            ]\n",
+    "        futures = [\n",
+    "            executor.submit(get_embeddings, text_batch)\n",
+    "            for text_batch in batchify(encoded_corpus, batch_size)\n",
+    "        ]\n",
+    "\n",
+    "        with tqdm(total=len(encoded_corpus)) as pbar:\n",
+    "            for _ in concurrent.futures.as_completed(futures):\n",
+    "                pbar.update(batch_size)\n",
    "\n",
-    "            with tqdm(total=len(encoded_corpus)) as pbar:\n",
-    "                for _ in concurrent.futures.as_completed(futures):\n",
-    "                    pbar.update(batch_size)\n",
+    "        embeddings = []\n",
+    "        for future in futures:\n",
+    "            data = future.result()\n",
+    "            embeddings.extend(data)\n",
    "\n",
-    "            embeddings = []\n",
-    "            for future in futures:\n",
-    "                data = future.result()\n",
-    "                embeddings.extend(data)\n",
-    "                \n",
-    "            return embeddings\n",
-    "                \n",
-    "        except Exception as e:\n",
-    "            print('Get embeddings failed, returning exception')\n",
-    "            \n",
-    "            return e\n",
-    "        "
+    "        return embeddings"
   ]
  },
  {
@ -159,13 +159,13 @@
   "source": [
    "# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
    "dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
-    "# Limited to 50k articles for demo purposes\n",
-    "dataset = dataset[:50_000]  "
+    "# Limited to 25k articles for demo purposes\n",
+    "dataset = dataset[:25_000]  "
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 118,
+   "execution_count": 15,
   "id": "e6ee90ce",
   "metadata": {},
   "outputs": [
@ -173,57 +173,67 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "num_articles=50000, num_tokens=18272526, est_embedding_cost=7.31 USD\n"
+      "num_articles=25000, num_tokens=12896881, est_embedding_cost=5.16 USD\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "50048it [02:30, 332.26it/s]                                                                                                                                                      \n"
+      "25024it [01:11, 348.92it/s]                                                                                                                                           "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "num_articles=50000, num_tokens=202363, est_embedding_cost=0.08 USD\n"
+      "CPU times: user 15.8 s, sys: 1.96 s, total: 17.8 s\n",
+      "Wall time: 1min 14s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "50048it [00:53, 942.94it/s]                                                                                                                                                      "
+      "\n"
     ]
-    },
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "# Embed the article text\n",
+    "dataset_embeddings = embed_corpus([article[\"text\"] for article in dataset])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "850c7215",
+   "metadata": {},
+   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "CPU times: user 48.7 s, sys: 1min 19s, total: 2min 7s\n",
-      "Wall time: 5min 53s\n"
+      "num_articles=25000, num_tokens=88300, est_embedding_cost=0.04 USD\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "\n"
+      "25024it [00:21, 1164.97it/s]                                                                                                                                          \n"
     ]
    }
   ],
   "source": [
-    "%%time\n",
-    "# Embed the article text\n",
-    "dataset_embeddings = embed_corpus([article[\"text\"] for article in dataset])\n",
    "# Embed the article titles separately\n",
    "title_embeddings = embed_corpus([article[\"title\"] for article in dataset])"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 119,
+   "execution_count": 17,
   "id": "1410daaa",
   "metadata": {},
   "outputs": [
@ -264,7 +274,7 @@
       "      <td>https://simple.wikipedia.org/wiki/April</td>\n",
       "      <td>April</td>\n",
       "      <td>April is the fourth month of the year in the J...</td>\n",
-       "      <td>[0.00107035250402987, -0.02077057771384716, -0...</td>\n",
+       "      <td>[0.0010547508718445897, -0.020757636055350304,...</td>\n",
       "      <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
@ -274,7 +284,7 @@
       "      <td>https://simple.wikipedia.org/wiki/August</td>\n",
       "      <td>August</td>\n",
       "      <td>August (Aug.) is the eighth month of the year ...</td>\n",
-       "      <td>[0.0010461278725415468, 0.0008924593566916883,...</td>\n",
+       "      <td>[0.0009623901569284499, 0.0008108559413813055,...</td>\n",
       "      <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
@ -284,7 +294,7 @@
       "      <td>https://simple.wikipedia.org/wiki/Art</td>\n",
       "      <td>Art</td>\n",
       "      <td>Art is a creative activity that expresses imag...</td>\n",
-       "      <td>[0.0033627033699303865, 0.006122018210589886, ...</td>\n",
+       "      <td>[0.0033528385683894157, 0.006173426751047373, ...</td>\n",
       "      <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
@ -294,7 +304,7 @@
       "      <td>https://simple.wikipedia.org/wiki/A</td>\n",
       "      <td>A</td>\n",
       "      <td>A or a is the first letter of the English alph...</td>\n",
-       "      <td>[0.015406121499836445, -0.013689860701560974, ...</td>\n",
+       "      <td>[0.015449387952685356, -0.013746200129389763, ...</td>\n",
       "      <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
@ -304,7 +314,7 @@
       "      <td>https://simple.wikipedia.org/wiki/Air</td>\n",
       "      <td>Air</td>\n",
       "      <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
-       "      <td>[0.022219523787498474, -0.020443666726350784, ...</td>\n",
+       "      <td>[0.0222249086946249, -0.020463958382606506, -0...</td>\n",
       "      <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
@ -328,11 +338,11 @@
       "4  Air refers to the Earth's atmosphere. Air is a...   \n",
       "\n",
       "                                        title_vector  \\\n",
-       "0  [0.00107035250402987, -0.02077057771384716, -0...   \n",
-       "1  [0.0010461278725415468, 0.0008924593566916883,...   \n",
-       "2  [0.0033627033699303865, 0.006122018210589886, ...   \n",
-       "3  [0.015406121499836445, -0.013689860701560974, ...   \n",
-       "4  [0.022219523787498474, -0.020443666726350784, ...   \n",
+       "0  [0.0010547508718445897, -0.020757636055350304,...   \n",
+       "1  [0.0009623901569284499, 0.0008108559413813055,...   \n",
+       "2  [0.0033528385683894157, 0.006173426751047373, ...   \n",
+       "3  [0.015449387952685356, -0.013746200129389763, ...   \n",
+       "4  [0.0222249086946249, -0.020463958382606506, -0...   \n",
       "\n",
       "                                      content_vector vector_id  \n",
       "0  [-0.011253940872848034, -0.013491976074874401,...         0  \n",
@ -342,7 +352,7 @@
       "4  [0.021524671465158463, 0.018522677943110466, -...         4  "
      ]
     },
-     "execution_count": 119,
+     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -376,7 +386,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 93,
+   "execution_count": 18,
   "id": "92e6152a",
   "metadata": {},
   "outputs": [],
@ -392,12 +402,14 @@
   "source": [
    "### Create Index\n",
    "\n",
-    "First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.)."
+    "First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult the [Pinecone documentation](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).\n",
+    "\n",
+    "To increase insertion speed by inserting batches into your index in parallel, see the Pinecone documentation on [batch inserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel)."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 94,
+   "execution_count": 19,
   "id": "0a71c575",
   "metadata": {},
   "outputs": [],
@ -429,7 +441,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 99,
+   "execution_count": 20,
   "id": "7ea9ad46",
   "metadata": {},
   "outputs": [
@ -439,7 +451,7 @@
       "['wikipedia-articles']"
      ]
     },
-     "execution_count": 99,
+     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -462,7 +474,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 100,
+   "execution_count": 22,
   "id": "5daeba00",
   "metadata": {},
   "outputs": [
@ -476,7 +488,6 @@
   ],
   "source": [
    "# Upsert content vectors in content namespace\n",
-    "# NOTE: Using a thread pool here can accelerate this upsert operation\n",
    "print(\"Uploading vectors to content namespace..\")\n",
    "for batch_df in df_batcher(article_df):\n",
    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')"
@ -484,7 +495,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 101,
+   "execution_count": 23,
   "id": "5fc1b083",
   "metadata": {},
   "outputs": [
@ -498,7 +509,6 @@
   ],
   "source": [
    "# Upsert title vectors in title namespace\n",
-    "# NOTE: Using a thread pool here can accelerate this upsert operation\n",
    "print(\"Uploading vectors to title namespace..\")\n",
    "for batch_df in df_batcher(article_df):\n",
    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')"
@ -506,7 +516,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 102,
+   "execution_count": 24,
   "id": "f90c7fba",
   "metadata": {},
   "outputs": [
@ -514,19 +524,19 @@
     "data": {
      "text/plain": [
       "{'dimension': 1536,\n",
-       " 'index_fullness': 0.2,\n",
-       " 'namespaces': {'content': {'vector_count': 50000},\n",
-       "                'title': {'vector_count': 50000}},\n",
-       " 'total_vector_count': 100000}"
+       " 'index_fullness': 0.1,\n",
+       " 'namespaces': {'content': {'vector_count': 25000},\n",
+       "                'title': {'vector_count': 25000}},\n",
+       " 'total_vector_count': 50000}"
      ]
     },
-     "execution_count": 102,
+     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "# Check index size for each namespace\n",
+    "# Check index size for each namespace to confirm all of our docs have loaded\n",
    "index.describe_index_stats()"
   ]
  },
@ -542,7 +552,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 103,
+   "execution_count": 25,
   "id": "d701b3c7",
   "metadata": {},
   "outputs": [],
@ -554,7 +564,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 104,
+   "execution_count": 28,
   "id": "3c8c2aa1",
   "metadata": {},
   "outputs": [],
@ -566,16 +576,16 @@
    "    # Create vector embeddings based on the title column\n",
    "    embedded_query = openai.Embedding.create(\n",
    "                                            input=query,\n",
-    "                                            model=MODEL,\n",
+    "                                            model=EMBEDDING_MODEL,\n",
    "                                            )[\"data\"][0]['embedding']\n",
    "\n",
    "    # Query namespace passed as parameter using title vector\n",
    "    query_result = index.query(embedded_query, \n",
-    "                               namespace=namespace, \n",
-    "                               top_k=top_k)\n",
+    "                                      namespace=namespace, \n",
+    "                                      top_k=top_k)\n",
    "\n",
    "    # Print query results \n",
-    "    print(f'\\nMost similar results querying {query} in \"{namespace}\" namespace:\\n')\n",
+    "    print(f'\\nMost similar results to {query} in \"{namespace}\" namespace:\\n')\n",
    "    if not query_result.matches:\n",
    "        print('no query result')\n",
    "    \n",
@ -591,7 +601,7 @@
    "    counter = 0\n",
    "    for k,v in df.iterrows():\n",
    "        counter += 1\n",
-    "        print(f'Result {counter} with a score of {v.score} is {v.title}')\n",
+    "        print(f'{v.title} (score = {v.score})')\n",
    "    \n",
    "    print('\\n')\n",
    "\n",
@ -600,7 +610,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 105,
+   "execution_count": 29,
   "id": "67b3584d",
   "metadata": {},
   "outputs": [
@ -609,13 +619,13 @@
     "output_type": "stream",
     "text": [
      "\n",
-      "Most similar results querying modern art in Europe in \"title\" namespace:\n",
+      "Most similar results to modern art in Europe in \"title\" namespace:\n",
      "\n",
-      "Result 1 with a score of 0.890994787 is Early modern Europe\n",
-      "Result 2 with a score of 0.875286043 is Museum of Modern Art\n",
-      "Result 3 with a score of 0.867404044 is Western Europe\n",
-      "Result 4 with a score of 0.864250064 is Renaissance art\n",
-      "Result 5 with a score of 0.860506058 is Pop art\n",
+      "Museum of Modern Art (score = 0.875286043)\n",
+      "Western Europe (score = 0.867383599)\n",
+      "Renaissance art (score = 0.864250064)\n",
+      "Pop art (score = 0.860506058)\n",
+      "Northern Europe (score = 0.854678154)\n",
      "\n",
      "\n"
     ]
@ -627,7 +637,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 106,
+   "execution_count": 30,
   "id": "3e7ac79b",
   "metadata": {},
   "outputs": [
@ -636,13 +646,13 @@
     "output_type": "stream",
     "text": [
      "\n",
-      "Most similar results querying Famous battles in Scottish history in \"content\" namespace:\n",
+      "Most similar results to Famous battles in Scottish history in \"content\" namespace:\n",
      "\n",
-      "Result 1 with a score of 0.869324744 is Battle of Bannockburn\n",
-      "Result 2 with a score of 0.861479 is Wars of Scottish Independence\n",
-      "Result 3 with a score of 0.852555931 is 1651\n",
-      "Result 4 with a score of 0.84969604 is First War of Scottish Independence\n",
-      "Result 5 with a score of 0.846192539 is Robert I of Scotland\n",
+      "Battle of Bannockburn (score = 0.869324744)\n",
+      "Wars of Scottish Independence (score = 0.861479)\n",
+      "1651 (score = 0.852555931)\n",
+      "First War of Scottish Independence (score = 0.84969604)\n",
+      "Robert I of Scotland (score = 0.846192539)\n",
      "\n",
      "\n"
     ]
@ -685,7 +695,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 107,
+   "execution_count": 33,
   "id": "b9ea472d",
   "metadata": {},
   "outputs": [],
@ -695,7 +705,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 108,
+   "execution_count": 34,
   "id": "13be220d",
   "metadata": {},
   "outputs": [
@ -705,7 +715,7 @@
       "{'classes': []}"
      ]
     },
-     "execution_count": 108,
+     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -717,7 +727,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 109,
+   "execution_count": 35,
   "id": "73d33184",
   "metadata": {},
   "outputs": [
@ -727,7 +737,7 @@
       "True"
      ]
     },
-     "execution_count": 109,
+     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -752,7 +762,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 110,
+   "execution_count": 36,
   "id": "e868d143",
   "metadata": {},
   "outputs": [
@ -794,7 +804,7 @@
       "   'vectorizer': 'none'}]}"
      ]
     },
-     "execution_count": 110,
+     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -802,7 +812,7 @@
   "source": [
    "class_obj = {\n",
    "    \"class\": \"Article\",\n",
-    "    \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model\n",
+    "    \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves\n",
    "    \"properties\": [{\n",
    "        \"name\": \"title\",\n",
    "        \"description\": \"Title of the article\",\n",
@ -824,7 +834,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 111,
+   "execution_count": 39,
   "id": "786d437f",
   "metadata": {},
   "outputs": [
@ -843,41 +853,79 @@
    "    data_objects.append((v['title'],v['text'],v['title_vector'],v['vector_id']))\n",
    "\n",
    "# Upsert into article schema\n",
-    "# NOTE: Using a thread pool here can accelerate this upsert operation\n",
    "print(\"Uploading vectors to article schema..\")\n",
+    "\n",
+    "# Store a list of UUIDs in case we want to refer back to the initial dataframe\n",
    "uuids = []\n",
-    "for articles in data_objects:\n",
-    "    uuid = client.data_object.create(\n",
+    "\n",
+    "# Reuse our batcher from the Pinecone ingestion\n",
+    "for batch_df in df_batcher(article_df):\n",
+    "    for k,v in batch_df.iterrows():\n",
+    "        #print(articles)\n",
+    "        uuid = client.data_object.create(\n",
    "                              {\n",
-    "                               \"title\": articles[0],\n",
-    "                               \"content\": articles[1]\n",
+    "                                  \"title\": v['title'],\n",
+    "                                  \"content\": v['text']\n",
    "                              },\n",
    "                              \"Article\",\n",
-    "                              vector=articles[2]\n",
+    "                              vector=v['title_vector']\n",
    "                            )\n",
-    "    uuids.append(uuid)"
+    "        uuids.append(uuid)"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 112,
+   "execution_count": 47,
   "id": "3658693c",
   "metadata": {},
   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Cave Story\n",
+      "is a freeware video game released in 2004 for PC. It was thought of and created over five years by Daisuke Amaya, known by his pseudonym, or art name, Pixel. The game is an action-adventure game, and is similar to the Castlevania and Metroid games. It was first made in Japanese, and was translated to English by the fan translating group, Aeon Genesis.\n",
+      "\n",
+      "References \n",
+      "\n",
+      "Notes\n",
+      "\n",
+      "2004 video games\n",
+      "Amiga games\n",
+      "Dreamcast games\n",
+      "Freeware games\n",
+      "Indie video games\n",
+      "Nintendo 3DS games\n",
+      "Nintendo Switch games\n",
+      "MacOS games\n",
+      "Platform games\n",
+      "Sega Genesis games\n",
+      "Video games developed in Japan\n",
+      "Wii games\n",
+      "Windows games\n"
+     ]
+    },
    {
     "data": {
      "text/plain": [
-       "{'content': 'Eddie Cantor (January 31, 1892 - October 10, 1964) was an American comedian, singer, actor, songwriter. Familiar to Broadway, radio and early television audiences, this \"Apostle of Pep\" was regarded almost as a family member by millions because his top-rated radio shows revealed intimate stories and amusing anecdotes about his wife Ida and five daughters. His eye-rolling song-and-dance routines eventually led to his nickname, Banjo Eyes, and in 1933, the artist Frederick J. Garner caricatured Cantor with large round and white eyes resembling the drum-like pot of a banjo. Cantor\\'s eyes became his trademark, often exaggerated in illustrations, and leading to his appearance on Broadway in the musical Banjo Eyes (1941). He was the original singer of 1929 hit song \"Makin\\' Whoopie\".\\n\\nReferences\\n\\nPresidents of the Screen Actors Guild\\nAmerican stage actors\\nComedians from New York City\\nAmerican Jews\\nActors from New York City\\nSingers from New York City\\nAmerican television actors\\nAmerican radio actors\\n1892 births\\n1964 deaths',\n",
-       " 'title': 'Eddie Cantor'}"
+       "{'Aggregate': {'Article': [{'meta': {'count': 25000}}]}}"
      ]
     },
-     "execution_count": 112,
+     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "client.data_object.get()['objects'][0]['properties']"
+    "# Test our insert has worked by checking one object\n",
+    "print(client.data_object.get()['objects'][0]['properties']['title'])\n",
+    "print(client.data_object.get()['objects'][0]['properties']['content'])\n",
+    "\n",
+    "# Test that all data has loaded\n",
+    "result = client.query.aggregate(\"Article\") \\\n",
+    "    .with_fields('meta { count }') \\\n",
+    "    .do()\n",
+    "result['data']"
   ]
  },
  {
@ -892,7 +940,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 113,
+   "execution_count": 48,
   "id": "5acd5437",
   "metadata": {},
   "outputs": [],
@ -902,7 +950,7 @@
    "    # Creates embedding vector from user query\n",
    "    embedded_query = openai.Embedding.create(\n",
    "                                                input=query,\n",
-    "                                                model=MODEL,\n",
+    "                                                model=EMBEDDING_MODEL,\n",
    "                                            )[\"data\"][0]['embedding']\n",
    "    \n",
    "    near_vector = {\"vector\": embedded_query}\n",
@ -918,7 +966,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 114,
+   "execution_count": 49,
   "id": "15def653",
   "metadata": {},
   "outputs": [
@ -926,26 +974,26 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "1. Title: Early modern Europe Certainty: 0.9454971551895142\n",
-      "2. Title: Museum of Modern Art Certainty: 0.9376430511474609\n",
-      "3. Title: Western Europe Certainty: 0.9337018430233002\n",
-      "4. Title: Renaissance art Certainty: 0.932124525308609\n",
-      "5. Title: Pop art Certainty: 0.9302527010440826\n",
-      "6. Title: Art exhibition Certainty: 0.9282020926475525\n",
-      "7. Title: History of Europe Certainty: 0.927833616733551\n",
-      "8. Title: Northern Europe Certainty: 0.9273514151573181\n",
-      "9. Title: Concert of Europe Certainty: 0.9268475472927094\n",
-      "10. Title: Hellenistic art Certainty: 0.9264959394931793\n",
-      "11. Title: Piet Mondrian Certainty: 0.9235787093639374\n",
-      "12. Title: Modernist literature Certainty: 0.9235587120056152\n",
-      "13. Title: European Capital of Culture Certainty: 0.9227772951126099\n",
-      "14. Title: Art film Certainty: 0.9217384457588196\n",
-      "15. Title: Europa Certainty: 0.9216940104961395\n",
-      "16. Title: Art rock Certainty: 0.9212885200977325\n",
-      "17. Title: Central Europe Certainty: 0.9212715923786163\n",
-      "18. Title: Art Certainty: 0.9207542240619659\n",
-      "19. Title: European Certainty: 0.9207191467285156\n",
-      "20. Title: Byzantine art Certainty: 0.9204496443271637\n"
+      "1. Museum of Modern Art (Score: 0.938)\n",
+      "2. Western Europe (Score: 0.934)\n",
+      "3. Renaissance art (Score: 0.932)\n",
+      "4. Pop art (Score: 0.93)\n",
+      "5. Northern Europe (Score: 0.927)\n",
+      "6. Hellenistic art (Score: 0.926)\n",
+      "7. Modernist literature (Score: 0.924)\n",
+      "8. Art film (Score: 0.922)\n",
+      "9. Central Europe (Score: 0.921)\n",
+      "10. Art (Score: 0.921)\n",
+      "11. European (Score: 0.921)\n",
+      "12. Byzantine art (Score: 0.92)\n",
+      "13. Postmodernism (Score: 0.92)\n",
+      "14. Eastern Europe (Score: 0.92)\n",
+      "15. Cubism (Score: 0.92)\n",
+      "16. Europe (Score: 0.919)\n",
+      "17. Impressionism (Score: 0.919)\n",
+      "18. Bauhaus (Score: 0.919)\n",
+      "19. Surrealism (Score: 0.919)\n",
+      "20. Expressionism (Score: 0.918)\n"
     ]
    }
   ],
@ -954,12 +1002,12 @@
    "counter = 0\n",
    "for article in query_result['data']['Get']['Article']:\n",
    "    counter += 1\n",
-    "    print(f\"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}\")"
+    "    print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'], 3)})\")"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 115,
+   "execution_count": 50,
   "id": "93c4a696",
   "metadata": {},
   "outputs": [
@ -967,26 +1015,26 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "1. Title: Historic Scotland Certainty: 0.9465253949165344\n",
-      "2. Title: First War of Scottish Independence Certainty: 0.9461104869842529\n",
-      "3. Title: Battle of Bannockburn Certainty: 0.9455604553222656\n",
-      "4. Title: Wars of Scottish Independence Certainty: 0.944368839263916\n",
-      "5. Title: Second War of Scottish Independence Certainty: 0.9394940435886383\n",
-      "6. Title: List of Scottish monarchs Certainty: 0.9366503059864044\n",
-      "7. Title: Kingdom of Scotland Certainty: 0.9353288412094116\n",
-      "8. Title: Scottish Borders Certainty: 0.9317235946655273\n",
-      "9. Title: List of rivers of Scotland Certainty: 0.9296278059482574\n",
-      "10. Title: Braveheart Certainty: 0.9294214248657227\n",
-      "11. Title: John of Scotland Certainty: 0.9292325675487518\n",
-      "12. Title: Duncan II of Scotland Certainty: 0.9291643798351288\n",
-      "13. Title: Bannockburn Certainty: 0.929103285074234\n",
-      "14. Title: The Scotsman Certainty: 0.9280981719493866\n",
-      "15. Title: Flag of Scotland Certainty: 0.9270428121089935\n",
-      "16. Title: Banff and Macduff Certainty: 0.9267247915267944\n",
-      "17. Title: Guardians of Scotland Certainty: 0.9260668158531189\n",
-      "18. Title: Scottish Parliament Certainty: 0.9251855313777924\n",
-      "19. Title: Holyrood Abbey Certainty: 0.925055593252182\n",
-      "20. Title: Scottish Certainty: 0.9249534606933594\n"
+      "1. Historic Scotland (Score: 0.947)\n",
+      "2. First War of Scottish Independence (Score: 0.946)\n",
+      "3. Battle of Bannockburn (Score: 0.946)\n",
+      "4. Wars of Scottish Independence (Score: 0.944)\n",
+      "5. Second War of Scottish Independence (Score: 0.94)\n",
+      "6. List of Scottish monarchs (Score: 0.937)\n",
+      "7. Scottish Borders (Score: 0.932)\n",
+      "8. Braveheart (Score: 0.929)\n",
+      "9. John of Scotland (Score: 0.929)\n",
+      "10. Guardians of Scotland (Score: 0.926)\n",
+      "11. Holyrood Abbey (Score: 0.925)\n",
+      "12. Scottish (Score: 0.925)\n",
+      "13. Scots (Score: 0.925)\n",
+      "14. Robert I of Scotland (Score: 0.924)\n",
+      "15. Scottish people (Score: 0.924)\n",
+      "16. Alexander I of Scotland (Score: 0.924)\n",
+      "17. Edinburgh Castle (Score: 0.924)\n",
+      "18. Robert Burns (Score: 0.923)\n",
+      "19. Battle of Bosworth Field (Score: 0.922)\n",
+      "20. David II of Scotland (Score: 0.922)\n"
     ]
    }
   ],
@ -995,7 +1043,7 @@
    "counter = 0\n",
    "for article in query_result['data']['Get']['Article']:\n",
    "    counter += 1\n",
-    "    print(f\"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}\")"
+    "    print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'], 3)})\")"
   ]
  },
  {
@ -1003,7 +1051,7 @@
   "id": "ad74202e",
   "metadata": {},
   "source": [
-    "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo"
+    "Thanks for following along. You're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
   ]
  }
 ],