{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cb1537e6",
   "metadata": {},
   "source": [
    "# Using Weaviate for Embeddings Search\n",
    "\n",
    "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
    "\n",
    "### What is a Vector Database\n",
    "\n",
    "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
    "\n",
    "### Why use a Vector Database\n",
    "\n",
    "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
    "\n",
    "\n",
    "### Demo Flow\n",
    "The demo flow is:\n",
    "- **Setup**: Import packages and set any required variables\n",
    "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
    "- **Weaviate**\n",
    "    - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n",
    "    - *Index Data*: We'll create an index with __title__ search vectors in it\n",
    "    - *Search Data*: We'll run a few searches to confirm it works\n",
    "\n",
    "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2b59250",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Import the required libraries and set the embedding model that we'd like to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d8810f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll need to install the Weaviate client\n",
    "!pip install weaviate-client\n",
    "\n",
    "#Install wget to pull zip file\n",
    "!pip install wget"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5be94df6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
    "from typing import List, Iterator\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "import wget\n",
    "from ast import literal_eval\n",
    "\n",
    "# Weaviate's client library for Python\n",
    "import weaviate\n",
    "\n",
    "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
    "EMBEDDING_MODEL = \"text-embedding-3-small\"\n",
    "\n",
    "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
    "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5d9d2e1",
   "metadata": {},
   "source": [
    "## Load data\n",
    "\n",
    "In this section we'll load embedded data that we've prepared previous to this session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5dff8b55",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
    "\n",
    "# The file is ~700 MB so this will take some time\n",
    "wget.download(embeddings_url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21097972",
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
    "    zip_ref.extractall(\"../data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "70bbd8ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1721e45d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>url</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "      <th>title_vector</th>\n",
       "      <th>content_vector</th>\n",
       "      <th>vector_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/April</td>\n",
       "      <td>April</td>\n",
       "      <td>April is the fourth month of the year in the J...</td>\n",
       "      <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
       "      <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/August</td>\n",
       "      <td>August</td>\n",
       "      <td>August (Aug.) is the eighth month of the year ...</td>\n",
       "      <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
       "      <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Art</td>\n",
       "      <td>Art</td>\n",
       "      <td>Art is a creative activity that expresses imag...</td>\n",
       "      <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
       "      <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/A</td>\n",
       "      <td>A</td>\n",
       "      <td>A or a is the first letter of the English alph...</td>\n",
       "      <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
       "      <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Air</td>\n",
       "      <td>Air</td>\n",
       "      <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
       "      <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
       "      <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id                                       url   title  \\\n",
       "0   1   https://simple.wikipedia.org/wiki/April   April   \n",
       "1   2  https://simple.wikipedia.org/wiki/August  August   \n",
       "2   6     https://simple.wikipedia.org/wiki/Art     Art   \n",
       "3   8       https://simple.wikipedia.org/wiki/A       A   \n",
       "4   9     https://simple.wikipedia.org/wiki/Air     Air   \n",
       "\n",
       "                                                text  \\\n",
       "0  April is the fourth month of the year in the J...   \n",
       "1  August (Aug.) is the eighth month of the year ...   \n",
       "2  Art is a creative activity that expresses imag...   \n",
       "3  A or a is the first letter of the English alph...   \n",
       "4  Air refers to the Earth's atmosphere. Air is a...   \n",
       "\n",
       "                                        title_vector  \\\n",
       "0  [0.001009464613161981, -0.020700545981526375, ...   \n",
       "1  [0.0009286514250561595, 0.000820168002974242, ...   \n",
       "2  [0.003393713850528002, 0.0061537534929811954, ...   \n",
       "3  [0.0153952119871974, -0.013759135268628597, 0....   \n",
       "4  [0.02224554680287838, -0.02044147066771984, -0...   \n",
       "\n",
       "                                      content_vector  vector_id  \n",
       "0  [-0.011253940872848034, -0.013491976074874401,...          0  \n",
       "1  [0.0003609954728744924, 0.007262262050062418, ...          1  \n",
       "2  [-0.004959689453244209, 0.015772193670272827, ...          2  \n",
       "3  [0.024894846603274345, -0.022186409682035446, ...          3  \n",
       "4  [0.021524671465158463, 0.018522677943110466, -...          4  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "article_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "960b82af",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read vectors from strings back into a list\n",
    "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n",
    "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n",
    "\n",
    "# Set vector_id to be a string\n",
    "article_df['vector_id'] = article_df['vector_id'].apply(str)"
   ]
  },
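  {
   "cell_type": "markdown",
   "id": "sketch-literal-eval",
   "metadata": {},
   "source": [
    "The CSV stores each embedding as a string, which is why the cell above parses the vector columns with `literal_eval`. As a minimal sketch of that step, assuming a made-up, shortened vector string (real vectors are far longer):\n",
    "\n",
    "```python\n",
    "from ast import literal_eval\n",
    "\n",
    "# Hypothetical, shortened stand-in for one CSV cell\n",
    "raw_vector = '[0.001, -0.0207, 0.5]'\n",
    "\n",
    "# literal_eval safely evaluates the string as a Python literal (here, a list of floats)\n",
    "parsed = literal_eval(raw_vector)\n",
    "\n",
    "print(type(raw_vector).__name__, type(parsed).__name__)\n",
    "print(len(parsed), parsed[0])\n",
    "```\n",
    "\n",
    "Unlike `eval`, `literal_eval` only accepts Python literals, which makes it a reasonable choice for data read from a file."
   ]
  },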
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a334ab8b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 25000 entries, 0 to 24999\n",
      "Data columns (total 7 columns):\n",
      " #   Column          Non-Null Count  Dtype \n",
      "---  ------          --------------  ----- \n",
      " 0   id              25000 non-null  int64 \n",
      " 1   url             25000 non-null  object\n",
      " 2   title           25000 non-null  object\n",
      " 3   text            25000 non-null  object\n",
      " 4   title_vector    25000 non-null  object\n",
      " 5   content_vector  25000 non-null  object\n",
      " 6   vector_id       25000 non-null  object\n",
      "dtypes: int64(1), object(6)\n",
      "memory usage: 1.3+ MB\n"
     ]
    }
   ],
   "source": [
    "article_df.info(show_counts=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d939342f",
   "metadata": {},
   "source": [
    "## Weaviate\n",
    "\n",
    "Another vector database option we'll explore is **Weaviate**, which offers both a managed, [SaaS](https://console.weaviate.io/) option, as well as a self-hosted [open source](https://github.com/weaviate/weaviate) option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
    "\n",
    "For this we will:\n",
    "- Set up a local deployment of Weaviate\n",
    "- Create indices in Weaviate\n",
    "- Store our data there\n",
    "- Fire some similarity search queries\n",
    "- Try a real use case\n",
    "\n",
    "\n",
    "### Bring your own vectors approach\n",
    "In this cookbook, we provide the data with already generated vectors. This is a good approach for scenarios, where your data is already vectorized.\n",
    "\n",
    "### Automated vectorization with OpenAI module\n",
    "For scenarios, where your data is not vectorized yet, you can delegate the vectorization task with OpenAI to Weaviate.\n",
    "Weaviate offers a built-in module [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the vectorization for you at:\n",
    "* import\n",
    "* for any CRUD operations\n",
    "* for semantic search\n",
    "\n",
    "Check out the [Getting Started with Weaviate and OpenAI module cookbook](./weaviate/getting-started-with-weaviate-and-openai.ipynb) to learn step by step how to import and vectorize data in one step."
   ]
  },
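  {
   "cell_type": "markdown",
   "id": "sketch-byo-vectors",
   "metadata": {},
   "source": [
    "To make the bring-your-own-vectors idea concrete, here is a rough sketch of how one row of `article_df` could be split into the properties and the precomputed vector that are sent to Weaviate. The helper name `to_object` and the sample row are hypothetical and only serve as an illustration:\n",
    "\n",
    "```python\n",
    "from typing import Dict, List, Tuple\n",
    "\n",
    "def to_object(row: Dict) -> Tuple[Dict[str, str], List[float]]:\n",
    "    # Properties are stored alongside the object; the vector is what similarity search runs against\n",
    "    properties = {'title': row['title'], 'content': row['text']}\n",
    "    return properties, row['title_vector']\n",
    "\n",
    "# Hypothetical row with a shortened vector\n",
    "sample_row = {\n",
    "    'title': 'April',\n",
    "    'text': 'April is the fourth month of the year',\n",
    "    'title_vector': [0.001, -0.0207],\n",
    "}\n",
    "\n",
    "properties, vector = to_object(sample_row)\n",
    "print(properties['title'], len(vector))\n",
    "```\n",
    "\n",
    "Because the vector is supplied explicitly, no embedding call is needed at import time."
   ]
  },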
  {
   "cell_type": "markdown",
   "id": "bfdfe260",
   "metadata": {},
   "source": [
    "### Setup\n",
    "\n",
    "To run Weaviate locally, you'll need [Docker](https://www.docker.com/). Following the instructions contained in the Weaviate documentation [here](https://weaviate.io/developers/weaviate/installation/docker-compose), we created an example docker-compose.yml file in this repo saved at [./weaviate/docker-compose.yml](./weaviate/docker-compose.yml).\n",
    "\n",
    "After starting Docker, you can start Weaviate locally by navigating to the `examples/vector_databases/weaviate/` directory and running `docker-compose up -d`.\n",
    "\n",
    "#### SaaS\n",
    "Alternatively you can use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
    "1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
    "2. create a `Weaviate Cluster` with the following settings:\n",
    "    * Sandbox: `Sandbox Free`\n",
    "    * Weaviate Version: Use default (latest)\n",
    "    * OIDC Authentication: `Disabled`\n",
    "3. your instance should be ready in a minute or two\n",
    "4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name-suffix.weaviate.network` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a78f95d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Option #1 - Self-hosted - Weaviate Open Source \n",
    "client = weaviate.Client(\n",
    "    url=\"http://localhost:8080\",\n",
    "    additional_headers={\n",
    "        \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e00b7d68",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Option #2 - SaaS - (Weaviate Cloud Service)\n",
    "client = weaviate.Client(\n",
    "    url=\"https://your-wcs-instance-name.weaviate.network\",\n",
    "    additional_headers={\n",
    "        \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d370afa",
   "metadata": {},
   "outputs": [],
   "source": [
    "client.is_ready()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03a926b9",
   "metadata": {},
   "source": [
    "### Index data\n",
    "\n",
    "In Weaviate you create __schemas__ to capture each of the entities you will be searching. \n",
    "\n",
    "In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n",
    "\n",
    "The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/quickstart).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0e6175a1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'classes': [{'class': 'Article',\n",
       "   'description': 'A collection of articles',\n",
       "   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},\n",
       "    'cleanupIntervalSeconds': 60,\n",
       "    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},\n",
       "   'moduleConfig': {'text2vec-openai': {'model': 'ada',\n",
       "     'modelVersion': '002',\n",
       "     'type': 'text',\n",
       "     'vectorizeClassName': True}},\n",
       "   'properties': [{'dataType': ['string'],\n",
       "     'description': 'Title of the article',\n",
       "     'moduleConfig': {'text2vec-openai': {'skip': False,\n",
       "       'vectorizePropertyName': False}},\n",
       "     'name': 'title',\n",
       "     'tokenization': 'word'},\n",
       "    {'dataType': ['text'],\n",
       "     'description': 'Contents of the article',\n",
       "     'moduleConfig': {'text2vec-openai': {'skip': True,\n",
       "       'vectorizePropertyName': False}},\n",
       "     'name': 'content',\n",
       "     'tokenization': 'word'}],\n",
       "   'replicationConfig': {'factor': 1},\n",
       "   'shardingConfig': {'virtualPerPhysical': 128,\n",
       "    'desiredCount': 1,\n",
       "    'actualCount': 1,\n",
       "    'desiredVirtualCount': 128,\n",
       "    'actualVirtualCount': 128,\n",
       "    'key': '_id',\n",
       "    'strategy': 'hash',\n",
       "    'function': 'murmur3'},\n",
       "   'vectorIndexConfig': {'skip': False,\n",
       "    'cleanupIntervalSeconds': 300,\n",
       "    'maxConnections': 64,\n",
       "    'efConstruction': 128,\n",
       "    'ef': -1,\n",
       "    'dynamicEfMin': 100,\n",
       "    'dynamicEfMax': 500,\n",
       "    'dynamicEfFactor': 8,\n",
       "    'vectorCacheMaxObjects': 1000000000000,\n",
       "    'flatSearchCutoff': 40000,\n",
       "    'distance': 'cosine'},\n",
       "   'vectorIndexType': 'hnsw',\n",
       "   'vectorizer': 'text2vec-openai'}]}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Clear up the schema, so that we can recreate it\n",
    "client.schema.delete_all()\n",
    "client.schema.get()\n",
    "\n",
    "# Define the Schema object to use `text-embedding-3-small` on `title` and `content`, but skip it for `url`\n",
    "article_schema = {\n",
    "    \"class\": \"Article\",\n",
    "    \"description\": \"A collection of articles\",\n",
    "    \"vectorizer\": \"text2vec-openai\",\n",
    "    \"moduleConfig\": {\n",
    "        \"text2vec-openai\": {\n",
    "          \"model\": \"ada\",\n",
    "          \"modelVersion\": \"002\",\n",
    "          \"type\": \"text\"\n",
    "        }\n",
    "    },\n",
    "    \"properties\": [{\n",
    "        \"name\": \"title\",\n",
    "        \"description\": \"Title of the article\",\n",
    "        \"dataType\": [\"string\"]\n",
    "    },\n",
    "    {\n",
    "        \"name\": \"content\",\n",
    "        \"description\": \"Contents of the article\",\n",
    "        \"dataType\": [\"text\"],\n",
    "        \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
    "    }]\n",
    "}\n",
    "\n",
    "# add the Article schema\n",
    "client.schema.create_class(article_schema)\n",
    "\n",
    "# get the schema to make sure it worked\n",
    "client.schema.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ea838e7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<weaviate.batch.crud_batch.Batch at 0x3f0ca0fa0>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Step 1 - configure Weaviate Batch, which optimizes CRUD operations in bulk\n",
    "# - starting batch size of 100\n",
    "# - dynamically increase/decrease based on performance\n",
    "# - add timeout retries if something goes wrong\n",
    "\n",
    "client.batch.configure(\n",
    "    batch_size=100,\n",
    "    dynamic=True,\n",
    "    timeout_retries=3,\n",
    ")"
   ]
  },
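  {
   "cell_type": "markdown",
   "id": "sketch-batching",
   "metadata": {},
   "source": [
    "Batching groups many objects into one request instead of sending them one at a time. The snippet below is only a conceptual sketch of fixed-size batching in plain Python, not Weaviate's implementation; the `chunked` helper and the sample records are hypothetical:\n",
    "\n",
    "```python\n",
    "from typing import Iterator, List\n",
    "\n",
    "def chunked(items: List[dict], batch_size: int) -> Iterator[List[dict]]:\n",
    "    # Yield consecutive slices of at most batch_size items\n",
    "    for start in range(0, len(items), batch_size):\n",
    "        yield items[start:start + batch_size]\n",
    "\n",
    "# Hypothetical records standing in for article rows\n",
    "records = [{'title': f'Article {i}'} for i in range(250)]\n",
    "\n",
    "batch_sizes = [len(batch) for batch in chunked(records, 100)]\n",
    "print(batch_sizes)\n",
    "```\n",
    "\n",
    "With `dynamic=True`, the client adjusts the batch size during the import rather than keeping it fixed as in this sketch."
   ]
  },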
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b4c967ec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading data with vectors to Article schema..\n",
      "Import 0 / 25000 \n",
      "Import 100 / 25000 \n",
      "Import 200 / 25000 \n",
      "Import 300 / 25000 \n",
      "Import 400 / 25000 \n",
      "Import 500 / 25000 \n",
      "Import 600 / 25000 \n",
      "Import 700 / 25000 \n",
      "Import 800 / 25000 \n",
      "Import 900 / 25000 \n",
      "Import 1000 / 25000 \n",
      "Import 1100 / 25000 \n",
      "Import 1200 / 25000 \n",
      "Import 1300 / 25000 \n",
      "Import 1400 / 25000 \n",
      "Import 1500 / 25000 \n",
      "Import 1600 / 25000 \n",
      "Import 1700 / 25000 \n",
      "Import 1800 / 25000 \n",
      "Import 1900 / 25000 \n",
      "Import 2000 / 25000 \n",
      "Import 2100 / 25000 \n",
      "Import 2200 / 25000 \n",
      "Import 2300 / 25000 \n",
      "Import 2400 / 25000 \n",
      "Import 2500 / 25000 \n",
      "Import 2600 / 25000 \n",
      "Import 2700 / 25000 \n",
      "Import 2800 / 25000 \n",
      "Import 2900 / 25000 \n",
      "Import 3000 / 25000 \n",
      "Import 3100 / 25000 \n",
      "Import 3200 / 25000 \n",
      "Import 3300 / 25000 \n",
      "Import 3400 / 25000 \n",
      "Import 3500 / 25000 \n",
      "Import 3600 / 25000 \n",
      "Import 3700 / 25000 \n",
      "Import 3800 / 25000 \n",
      "Import 3900 / 25000 \n",
      "Import 4000 / 25000 \n",
      "Import 4100 / 25000 \n",
      "Import 4200 / 25000 \n",
      "Import 4300 / 25000 \n",
      "Import 4400 / 25000 \n",
      "Import 4500 / 25000 \n",
      "Import 4600 / 25000 \n",
      "Import 4700 / 25000 \n",
      "Import 4800 / 25000 \n",
      "Import 4900 / 25000 \n",
      "Import 5000 / 25000 \n",
      "Import 5100 / 25000 \n",
      "Import 5200 / 25000 \n",
      "Import 5300 / 25000 \n",
      "Import 5400 / 25000 \n",
      "Import 5500 / 25000 \n",
      "Import 5600 / 25000 \n",
      "Import 5700 / 25000 \n",
      "Import 5800 / 25000 \n",
      "Import 5900 / 25000 \n",
      "Import 6000 / 25000 \n",
      "Import 6100 / 25000 \n",
      "Import 6200 / 25000 \n",
      "Import 6300 / 25000 \n",
      "Import 6400 / 25000 \n",
      "Import 6500 / 25000 \n",
      "Import 6600 / 25000 \n",
      "Import 6700 / 25000 \n",
      "Import 6800 / 25000 \n",
      "Import 6900 / 25000 \n",
      "Import 7000 / 25000 \n",
      "Import 7100 / 25000 \n",
      "Import 7200 / 25000 \n",
      "Import 7300 / 25000 \n",
      "Import 7400 / 25000 \n",
      "Import 7500 / 25000 \n",
      "Import 7600 / 25000 \n",
      "Import 7700 / 25000 \n",
      "Import 7800 / 25000 \n",
      "Import 7900 / 25000 \n",
      "Import 8000 / 25000 \n",
      "Import 8100 / 25000 \n",
      "Import 8200 / 25000 \n",
      "Import 8300 / 25000 \n",
      "Import 8400 / 25000 \n",
      "Import 8500 / 25000 \n",
      "Import 8600 / 25000 \n",
      "Import 8700 / 25000 \n",
      "Import 8800 / 25000 \n",
      "Import 8900 / 25000 \n",
      "Import 9000 / 25000 \n",
      "Import 9100 / 25000 \n",
      "Import 9200 / 25000 \n",
      "Import 9300 / 25000 \n",
      "Import 9400 / 25000 \n",
      "Import 9500 / 25000 \n",
      "Import 9600 / 25000 \n",
      "Import 9700 / 25000 \n",
      "Import 9800 / 25000 \n",
      "Import 9900 / 25000 \n",
      "Import 10000 / 25000 \n",
      "Import 10100 / 25000 \n",
      "Import 10200 / 25000 \n",
      "Import 10300 / 25000 \n",
      "Import 10400 / 25000 \n",
      "Import 10500 / 25000 \n",
      "Import 10600 / 25000 \n",
      "Import 10700 / 25000 \n",
      "Import 10800 / 25000 \n",
      "Import 10900 / 25000 \n",
      "Import 11000 / 25000 \n",
      "Import 11100 / 25000 \n",
      "Import 11200 / 25000 \n",
      "Import 11300 / 25000 \n",
      "Import 11400 / 25000 \n",
      "Import 11500 / 25000 \n",
      "Import 11600 / 25000 \n",
      "Import 11700 / 25000 \n",
      "Import 11800 / 25000 \n",
      "Import 11900 / 25000 \n",
      "Import 12000 / 25000 \n",
      "Import 12100 / 25000 \n",
      "Import 12200 / 25000 \n",
      "Import 12300 / 25000 \n",
      "Import 12400 / 25000 \n",
      "Import 12500 / 25000 \n",
      "Import 12600 / 25000 \n",
      "Import 12700 / 25000 \n",
      "Import 12800 / 25000 \n",
      "Import 12900 / 25000 \n",
      "Import 13000 / 25000 \n",
      "Import 13100 / 25000 \n",
      "Import 13200 / 25000 \n",
      "Import 13300 / 25000 \n",
      "Import 13400 / 25000 \n",
      "Import 13500 / 25000 \n",
      "Import 13600 / 25000 \n",
      "Import 13700 / 25000 \n",
      "Import 13800 / 25000 \n",
      "Import 13900 / 25000 \n",
      "Import 14000 / 25000 \n",
      "Import 14100 / 25000 \n",
      "Import 14200 / 25000 \n",
      "Import 14300 / 25000 \n",
      "Import 14400 / 25000 \n",
      "Import 14500 / 25000 \n",
      "Import 14600 / 25000 \n",
      "Import 14700 / 25000 \n",
      "Import 14800 / 25000 \n",
      "Import 14900 / 25000 \n",
      "Import 15000 / 25000 \n",
      "Import 15100 / 25000 \n",
      "Import 15200 / 25000 \n",
      "Import 15300 / 25000 \n",
      "Import 15400 / 25000 \n",
      "Import 15500 / 25000 \n",
      "Import 15600 / 25000 \n",
      "Import 15700 / 25000 \n",
      "Import 15800 / 25000 \n",
      "Import 15900 / 25000 \n",
      "Import 16000 / 25000 \n",
      "Import 16100 / 25000 \n",
      "Import 16200 / 25000 \n",
      "Import 16300 / 25000 \n",
      "Import 16400 / 25000 \n",
      "Import 16500 / 25000 \n",
      "Import 16600 / 25000 \n",
      "Import 16700 / 25000 \n",
      "Import 16800 / 25000 \n",
      "Import 16900 / 25000 \n",
      "Import 17000 / 25000 \n",
      "Import 17100 / 25000 \n",
      "Import 17200 / 25000 \n",
      "Import 17300 / 25000 \n",
      "Import 17400 / 25000 \n",
      "Import 17500 / 25000 \n",
      "Import 17600 / 25000 \n",
      "Import 17700 / 25000 \n",
      "Import 17800 / 25000 \n",
      "Import 17900 / 25000 \n",
      "Import 18000 / 25000 \n",
      "Import 18100 / 25000 \n",
      "Import 18200 / 25000 \n",
      "Import 18300 / 25000 \n",
      "Import 18400 / 25000 \n",
      "Import 18500 / 25000 \n",
      "Import 18600 / 25000 \n",
      "Import 18700 / 25000 \n",
      "Import 18800 / 25000 \n",
      "Import 18900 / 25000 \n",
      "Import 19000 / 25000 \n",
      "Import 19100 / 25000 \n",
      "Import 19200 / 25000 \n",
      "Import 19300 / 25000 \n",
      "Import 19400 / 25000 \n",
      "Import 19500 / 25000 \n",
      "Import 19600 / 25000 \n",
      "Import 19700 / 25000 \n",
      "Import 19800 / 25000 \n",
      "Import 19900 / 25000 \n",
      "Import 20000 / 25000 \n",
      "Import 20100 / 25000 \n",
      "Import 20200 / 25000 \n",
      "Import 20300 / 25000 \n",
      "Import 20400 / 25000 \n",
      "Import 20500 / 25000 \n",
      "Import 20600 / 25000 \n",
      "Import 20700 / 25000 \n",
      "Import 20800 / 25000 \n",
      "Import 20900 / 25000 \n",
      "Import 21000 / 25000 \n",
      "Import 21100 / 25000 \n",
      "Import 21200 / 25000 \n",
      "Import 21300 / 25000 \n",
      "Import 21400 / 25000 \n",
      "Import 21500 / 25000 \n",
      "Import 21600 / 25000 \n",
      "Import 21700 / 25000 \n",
      "Import 21800 / 25000 \n",
      "Import 21900 / 25000 \n",
      "Import 22000 / 25000 \n",
      "Import 22100 / 25000 \n",
      "Import 22200 / 25000 \n",
      "Import 22300 / 25000 \n",
      "Import 22400 / 25000 \n",
      "Import 22500 / 25000 \n",
      "Import 22600 / 25000 \n",
      "Import 22700 / 25000 \n",
      "Import 22800 / 25000 \n",
      "Import 22900 / 25000 \n",
      "Import 23000 / 25000 \n",
      "Import 23100 / 25000 \n",
      "Import 23200 / 25000 \n",
      "Import 23300 / 25000 \n",
      "Import 23400 / 25000 \n",
      "Import 23500 / 25000 \n",
      "Import 23600 / 25000 \n",
      "Import 23700 / 25000 \n",
      "Import 23800 / 25000 \n",
      "Import 23900 / 25000 \n",
      "Import 24000 / 25000 \n",
      "Import 24100 / 25000 \n",
      "Import 24200 / 25000 \n",
      "Import 24300 / 25000 \n",
      "Import 24400 / 25000 \n",
      "Import 24500 / 25000 \n",
      "Import 24600 / 25000 \n",
      "Import 24700 / 25000 \n",
      "Import 24800 / 25000 \n",
      "Import 24900 / 25000 \n",
      "Importing (25000) Articles complete\n"
     ]
    }
   ],
   "source": [
    "### Step 2 - import data\n",
    "\n",
    "print(\"Uploading data with vectors to Article schema..\")\n",
    "\n",
    "counter=0\n",
    "\n",
    "with client.batch as batch:\n",
    "    for k,v in article_df.iterrows():\n",
    "        \n",
    "        # print update message every 100 objects        \n",
    "        if (counter %100 == 0):\n",
    "            print(f\"Import {counter} / {len(article_df)} \")\n",
    "        \n",
    "        properties = {\n",
    "            \"title\": v[\"title\"],\n",
    "            \"content\": v[\"text\"]\n",
    "        }\n",
    "        \n",
    "        vector = v[\"title_vector\"]\n",
    "        \n",
    "        batch.add_data_object(properties, \"Article\", None, vector)\n",
    "        counter = counter+1\n",
    "\n",
    "print(f\"Importing ({len(article_df)}) Articles complete\")  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "f826e1ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Object count:  [{'meta': {'count': 25000}}]\n"
     ]
    }
   ],
   "source": [
    "# Test that all data has loaded – get object count\n",
    "result = (\n",
    "    client.query.aggregate(\"Article\")\n",
    "    .with_fields(\"meta { count }\")\n",
    "    .do()\n",
    ")\n",
    "print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5c09d483",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "000393f2-1182-4e3d-abcf-4217eda64be0\n",
      "Lago d'Origlio\n",
      "Lago d'Origlio is a lake in the municipality of Origlio, in Ticino, Switzerland.\n",
      "\n",
      "Lakes of Ticino\n"
     ]
    }
   ],
   "source": [
    "# Test one article has worked by checking one object\n",
    "test_article = (\n",
    "    client.query\n",
    "    .get(\"Article\", [\"title\", \"content\", \"_additional {id}\"])\n",
    "    .with_limit(1)\n",
    "    .do()\n",
    ")[\"data\"][\"Get\"][\"Article\"][0]\n",
    "\n",
    "print(test_article[\"_additional\"][\"id\"])\n",
    "print(test_article[\"title\"])\n",
    "print(test_article[\"content\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46050ca9",
   "metadata": {},
   "source": [
    "### Search data\n",
    "\n",
    "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "add222d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_weaviate(query, collection_name, top_k=20):\n",
    "    \"\"\"Embed `query` with OpenAI and return the raw Weaviate response for the `top_k` nearest objects.\"\"\"\n",
    "\n",
    "    # Create an embedding vector from the user query\n",
    "    embedded_query = openai.Embedding.create(\n",
    "        input=query,\n",
    "        model=EMBEDDING_MODEL,\n",
    "    )[\"data\"][0][\"embedding\"]\n",
    "\n",
    "    near_vector = {\"vector\": embedded_query}\n",
    "\n",
    "    # Search the collection for the objects nearest to the query vector\n",
    "    query_result = (\n",
    "        client.query\n",
    "        .get(collection_name, [\"title\", \"content\", \"_additional {certainty distance}\"])\n",
    "        .with_near_vector(near_vector)\n",
    "        .with_limit(top_k)\n",
    "        .do()\n",
    "    )\n",
    "\n",
    "    return query_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c888aa4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Museum of Modern Art (Certainty: 0.938) (Distance: 0.125)\n",
      "2. Western Europe (Certainty: 0.934) (Distance: 0.133)\n",
      "3. Renaissance art (Certainty: 0.932) (Distance: 0.136)\n",
      "4. Pop art (Certainty: 0.93) (Distance: 0.14)\n",
      "5. Northern Europe (Certainty: 0.927) (Distance: 0.145)\n",
      "6. Hellenistic art (Certainty: 0.926) (Distance: 0.147)\n",
      "7. Modernist literature (Certainty: 0.924) (Distance: 0.153)\n",
      "8. Art film (Certainty: 0.922) (Distance: 0.157)\n",
      "9. Central Europe (Certainty: 0.921) (Distance: 0.157)\n",
      "10. European (Certainty: 0.921) (Distance: 0.159)\n",
      "11. Art (Certainty: 0.921) (Distance: 0.159)\n",
      "12. Byzantine art (Certainty: 0.92) (Distance: 0.159)\n",
      "13. Postmodernism (Certainty: 0.92) (Distance: 0.16)\n",
      "14. Eastern Europe (Certainty: 0.92) (Distance: 0.161)\n",
      "15. Europe (Certainty: 0.919) (Distance: 0.161)\n",
      "16. Cubism (Certainty: 0.919) (Distance: 0.161)\n",
      "17. Impressionism (Certainty: 0.919) (Distance: 0.162)\n",
      "18. Bauhaus (Certainty: 0.919) (Distance: 0.162)\n",
      "19. Expressionism (Certainty: 0.918) (Distance: 0.163)\n",
      "20. Surrealism (Certainty: 0.918) (Distance: 0.163)\n"
     ]
    }
   ],
   "source": [
    "query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n",
    "counter = 0\n",
    "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c54cd8e9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Historic Scotland (Score: 0.946)\n",
      "2. First War of Scottish Independence (Score: 0.946)\n",
      "3. Battle of Bannockburn (Score: 0.946)\n",
      "4. Wars of Scottish Independence (Score: 0.944)\n",
      "5. Second War of Scottish Independence (Score: 0.94)\n",
      "6. List of Scottish monarchs (Score: 0.937)\n",
      "7. Scottish Borders (Score: 0.932)\n",
      "8. Braveheart (Score: 0.929)\n",
      "9. John of Scotland (Score: 0.929)\n",
      "10. Guardians of Scotland (Score: 0.926)\n",
      "11. Holyrood Abbey (Score: 0.925)\n",
      "12. Scottish (Score: 0.925)\n",
      "13. Scots (Score: 0.925)\n",
      "14. Robert I of Scotland (Score: 0.924)\n",
      "15. Scottish people (Score: 0.924)\n",
      "16. Edinburgh Castle (Score: 0.924)\n",
      "17. Alexander I of Scotland (Score: 0.924)\n",
      "18. Robert Burns (Score: 0.924)\n",
      "19. Battle of Bosworth Field (Score: 0.922)\n",
      "20. David II of Scotland (Score: 0.922)\n"
     ]
    }
   ],
   "source": [
    "query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n",
    "counter = 0\n",
    "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "220b3e11",
   "metadata": {},
   "source": [
    "### Let Weaviate handle vector embeddings\n",
    "\n",
    "Weaviate has a [built-in module for OpenAI](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the steps required to generate a vector embedding for your queries and any CRUD operations.\n",
    "\n",
    "This allows you to run a vector query with the `with_near_text` filter, which uses your `OPENAI_API_KEY` to embed the query text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "9425c882",
   "metadata": {},
   "outputs": [],
   "source": [
    "def near_text_weaviate(query, collection_name):\n",
    "    \"\"\"Search by raw text, letting Weaviate's OpenAI module embed the query.\"\"\"\n",
    "\n",
    "    # \"distance\" is the maximum distance allowed between the query and a result\n",
    "    near_text = {\n",
    "        \"concepts\": [query],\n",
    "        \"distance\": 0.7,\n",
    "    }\n",
    "\n",
    "    properties = [\n",
    "        \"title\", \"content\",\n",
    "        \"_additional {certainty distance}\"\n",
    "    ]\n",
    "\n",
    "    # Run the query and unwrap the list of matching objects\n",
    "    query_result = (\n",
    "        client.query\n",
    "        .get(collection_name, properties)\n",
    "        .with_near_text(near_text)\n",
    "        .with_limit(20)\n",
    "        .do()\n",
    "    )[\"data\"][\"Get\"][collection_name]\n",
    "\n",
    "    print(f\"Objects returned: {len(query_result)}\")\n",
    "\n",
    "    return query_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "501a16f7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Objects returned: 20\n",
      "1. Museum of Modern Art (Certainty: 0.938) (Distance: 0.125)\n",
      "2. Western Europe (Certainty: 0.934) (Distance: 0.133)\n",
      "3. Renaissance art (Certainty: 0.932) (Distance: 0.136)\n",
      "4. Pop art (Certainty: 0.93) (Distance: 0.14)\n",
      "5. Northern Europe (Certainty: 0.927) (Distance: 0.145)\n",
      "6. Hellenistic art (Certainty: 0.926) (Distance: 0.147)\n",
      "7. Modernist literature (Certainty: 0.923) (Distance: 0.153)\n",
      "8. Art film (Certainty: 0.922) (Distance: 0.157)\n",
      "9. Central Europe (Certainty: 0.921) (Distance: 0.157)\n",
      "10. European (Certainty: 0.921) (Distance: 0.159)\n",
      "11. Art (Certainty: 0.921) (Distance: 0.159)\n",
      "12. Byzantine art (Certainty: 0.92) (Distance: 0.159)\n",
      "13. Postmodernism (Certainty: 0.92) (Distance: 0.16)\n",
      "14. Eastern Europe (Certainty: 0.92) (Distance: 0.161)\n",
      "15. Europe (Certainty: 0.919) (Distance: 0.161)\n",
      "16. Cubism (Certainty: 0.919) (Distance: 0.161)\n",
      "17. Impressionism (Certainty: 0.919) (Distance: 0.162)\n",
      "18. Bauhaus (Certainty: 0.919) (Distance: 0.162)\n",
      "19. Surrealism (Certainty: 0.918) (Distance: 0.163)\n",
      "20. Expressionism (Certainty: 0.918) (Distance: 0.163)\n"
     ]
    }
   ],
   "source": [
    "query_result = near_text_weaviate(\"modern art in Europe\",\"Article\")\n",
    "counter = 0\n",
    "for article in query_result:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "839b26df",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Objects returned: 20\n",
      "1. Historic Scotland (Certainty: 0.946) (Distance: 0.107)\n",
      "2. First War of Scottish Independence (Certainty: 0.946) (Distance: 0.108)\n",
      "3. Battle of Bannockburn (Certainty: 0.946) (Distance: 0.109)\n",
      "4. Wars of Scottish Independence (Certainty: 0.944) (Distance: 0.111)\n",
      "5. Second War of Scottish Independence (Certainty: 0.94) (Distance: 0.121)\n",
      "6. List of Scottish monarchs (Certainty: 0.937) (Distance: 0.127)\n",
      "7. Scottish Borders (Certainty: 0.932) (Distance: 0.137)\n",
      "8. Braveheart (Certainty: 0.929) (Distance: 0.141)\n",
      "9. John of Scotland (Certainty: 0.929) (Distance: 0.142)\n",
      "10. Guardians of Scotland (Certainty: 0.926) (Distance: 0.148)\n",
      "11. Holyrood Abbey (Certainty: 0.925) (Distance: 0.15)\n",
      "12. Scottish (Certainty: 0.925) (Distance: 0.15)\n",
      "13. Scots (Certainty: 0.925) (Distance: 0.15)\n",
      "14. Robert I of Scotland (Certainty: 0.924) (Distance: 0.151)\n",
      "15. Scottish people (Certainty: 0.924) (Distance: 0.152)\n",
      "16. Edinburgh Castle (Certainty: 0.924) (Distance: 0.153)\n",
      "17. Alexander I of Scotland (Certainty: 0.924) (Distance: 0.153)\n",
      "18. Robert Burns (Certainty: 0.924) (Distance: 0.153)\n",
      "19. Battle of Bosworth Field (Certainty: 0.922) (Distance: 0.155)\n",
      "20. David II of Scotland (Certainty: 0.922) (Distance: 0.157)\n"
     ]
    }
   ],
   "source": [
    "query_result = near_text_weaviate(\"Famous battles in Scottish history\",\"Article\")\n",
    "counter = 0\n",
    "for article in query_result:\n",
    "    counter += 1\n",
    "    print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "vector_db_split",
   "language": "python",
   "name": "vector_db_split"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  },
  "vscode": {
   "interpreter": {
    "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}