openai-cookbook/examples/vector_databases/weaviate/getting-started-with-weaviate-and-openai.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cb1537e6",
   "metadata": {},
   "source": [
    "# Using Weaviate with OpenAI vectorize module for Embeddings Search\n",
    "\n",
    "This notebook is prepared for a scenario where:\n",
    "* Your data is not vectorized\n",
    "* You want to run Vector Search on your data\n",
    "* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
    "\n",
    "This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run semantic search.\n",
    "\n",
    "This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
    "\n",
    "## What is Weaviate\n",
    "\n",
    "Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
    "\n",
    "Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
    "\n",
    "Weaviate let you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
    "\n",
    "### Deployment options\n",
    "\n",
    "Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
    "* Self-hosted – you can deploy Weaviate with docker locally, or any server you want.\n",
    "* SaaS – you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
    "* Hybrid-SaaS – you can deploy Weaviate in your own private Cloud Service.\n",
    "\n",
    "### Programming languages\n",
    "\n",
    "Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
    "* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
    "* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
    "* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
    "* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
    "\n",
    "Additionally, Weaviate has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45956173",
   "metadata": {},
   "source": [
    "## Demo Flow\n",
    "The demo flow is:\n",
    "- **Prerequisites Setup**: Create a Weaviate instance and install the required libraries\n",
    "- **Connect**: Connect to your Weaviate instance \n",
    "- **Schema Configuration**: Configure the schema of your data\n",
    "    - *Note*: Here we can define which OpenAI Embedding Model to use\n",
    "    - *Note*: Here we can configure which properties to index\n",
    "- **Import data**: Load a demo dataset and import it into Weaviate\n",
    "    - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
    "    - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you\n",
    "- **Run Queries**: Query \n",
    "    - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you\n",
    "\n",
    "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a4a145e",
   "metadata": {},
   "source": [
    "## OpenAI Module in Weaviate\n",
    "All Weaviate instances come equipped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.\n",
    "\n",
    "This module is responsible for handling vectorization during import (or any CRUD operations) and when you run a query.\n",
    "\n",
    "### No need to manually vectorize data\n",
    "This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
    "\n",
    "All you need to do is:\n",
    "1. provide your OpenAI API Key – when you connected to the Weaviate Client\n",
    "2. define which OpenAI vectorizer to use in your Schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1a618c5",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before we start this project, we need setup the following:\n",
    "\n",
    "* create a `Weaviate` instance\n",
    "* install libraries\n",
    "    * `weaviate-client`\n",
    "    * `datasets`\n",
    "    * `apache-beam`\n",
    "* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
    "\n",
    "===========================================================\n",
    "### Create a Weaviate instance\n",
    "\n",
    "To create a Weaviate instance we have 2 options:\n",
    "\n",
    "1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
    "2. Install and run Weaviate locally with Docker.\n",
    "\n",
    "#### Option 1 – WCS Installation Steps\n",
    "\n",
    "Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
    "1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
    "2. create a `Weaviate Cluster` with the following settings:\n",
    "    * Sandbox: `Sandbox Free`\n",
    "    * Weaviate Version: Use default (latest)\n",
    "    * OIDC Authentication: `Disabled`\n",
    "3. your instance should be ready in a minute or two\n",
    "4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
    "\n",
    "#### Option 2 – local Weaviate instance with Docker\n",
    "\n",
    "Install and run Weaviate locally with Docker.\n",
    "1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
    "2. Then open your terminal, navigate to where your docker-compose.yml file is located, and start docker with: `docker-compose up -d`\n",
    "3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
    "\n",
    "Note. To shut down your docker instance you can call: `docker-compose down`\n",
    "\n",
    "##### Learn more\n",
    "To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9babafe",
   "metadata": {},
   "source": [
    "===========================================================    \n",
    "## Install required libraries\n",
    "\n",
    "Before running this project make sure to have the following libraries:\n",
    "\n",
    "### Weaviate Python client\n",
    "\n",
    "The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
    "\n",
    "### datasets & apache-beam\n",
    "\n",
    "To load sample data, you need the `datasets` library and its dependency `apache-beam`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b04113f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the Weaviate client for Python\n",
    "!pip install weaviate-client>=3.11.0\n",
    "\n",
    "# Install datasets and apache-beam to load the sample datasets\n",
    "!pip install datasets apache-beam"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36fe86f4",
   "metadata": {},
   "source": [
    "===========================================================\n",
    "## Prepare your OpenAI API key\n",
    "\n",
    "The `OpenAI API key` is used for vectorization of your data at import, and for running queries.\n",
    "\n",
    "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
    "\n",
    "Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43395339",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export OpenAI API Key\n",
    "!export OPENAI_API_KEY=\"your key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88be138c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test that your OpenAI API key is correctly set as an environment variable\n",
    "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
    "import os\n",
    "\n",
    "# Note. alternatively you can set a temporary env variable like this:\n",
    "# os.environ[\"OPENAI_API_KEY\"] = 'your-key-goes-here'\n",
    "\n",
    "if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
    "    print (\"OPENAI_API_KEY is ready\")\n",
    "else:\n",
    "    print (\"OPENAI_API_KEY environment variable not found\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91df4d5b",
   "metadata": {},
   "source": [
    "## Connect to your Weaviate instance\n",
    "\n",
    "In this section, we will:\n",
    "\n",
    "1. test env variable `OPENAI_API_KEY` – **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
    "2. connect to your Weaviate with your `OpenAI API Key`\n",
    "3. and test the client connection\n",
    "\n",
    "### The client \n",
    "\n",
    "After this step, the `client` object will be used to perform all Weaviate-related operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc662c1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import weaviate\n",
    "from datasets import load_dataset\n",
    "import os\n",
    "\n",
    "# Connect to your Weaviate instance\n",
    "client = weaviate.Client(\n",
    "    url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
    "    # url=\"http://localhost:8080/\",\n",
    "    auth_client_secret=weaviate.auth.AuthApiKey(api_key=\"<YOUR-WEAVIATE-API-KEY>\"), # comment out this line if you are not using authentication for your Weaviate instance (i.e. for locally deployed instances)\n",
    "    additional_headers={\n",
    "        \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
    "    }\n",
    ")\n",
    "\n",
    "# Check if your instance is live and ready\n",
    "# This should return `True`\n",
    "client.is_ready()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d3dac3c",
   "metadata": {},
   "source": [
    "# Schema\n",
    "\n",
    "In this section, we will:\n",
    "1. configure the data schema for your data\n",
    "2. select OpenAI module\n",
    "\n",
    "> This is the second and final step, which requires OpenAI specific configuration.\n",
    "> After this step, the rest of instructions wlll only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
    "\n",
    "\n",
    "## What is a schema\n",
    "\n",
    "In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
    "\n",
    "A schema is how you tell Weaviate:\n",
    "* what embedding model should be used to vectorize the data\n",
    "* what your data is made of (property names and types)\n",
    "* which properties should be vectorized and indexed\n",
    "\n",
    "In this cookbook we will use a dataset for `Articles`, which contains:\n",
    "* `title`\n",
    "* `content`\n",
    "* `url`\n",
    "\n",
    "We want to vectorize `title` and `content`, but not the `url`.\n",
    "\n",
    "To vectorize and query the data, we will use `text-embedding-ada-002`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f894b911",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Clear up the schema, so that we can recreate it\n",
    "client.schema.delete_all()\n",
    "client.schema.get()\n",
    "\n",
    "# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
    "article_schema = {\n",
    "    \"class\": \"Article\",\n",
    "    \"description\": \"A collection of articles\",\n",
    "    \"vectorizer\": \"text2vec-openai\",\n",
    "    \"moduleConfig\": {\n",
    "        \"text2vec-openai\": {\n",
    "          \"model\": \"ada\",\n",
    "          \"modelVersion\": \"002\",\n",
    "          \"type\": \"text\"\n",
    "        }\n",
    "    },\n",
    "    \"properties\": [{\n",
    "        \"name\": \"title\",\n",
    "        \"description\": \"Title of the article\",\n",
    "        \"dataType\": [\"string\"]\n",
    "    },\n",
    "    {\n",
    "        \"name\": \"content\",\n",
    "        \"description\": \"Contents of the article\",\n",
    "        \"dataType\": [\"text\"]\n",
    "    },\n",
    "    {\n",
    "        \"name\": \"url\",\n",
    "        \"description\": \"URL to the article\",\n",
    "        \"dataType\": [\"string\"],\n",
    "        \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
    "    }]\n",
    "}\n",
    "\n",
    "# add the Article schema\n",
    "client.schema.create_class(article_schema)\n",
    "\n",
    "# get the schema to make sure it worked\n",
    "client.schema.get()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5d9d2e1",
   "metadata": {},
   "source": [
    "## Import data\n",
    "\n",
    "In this section we will:\n",
    "1. load the Simple Wikipedia dataset\n",
    "2. configure Weaviate Batch import (to make the import more efficient)\n",
    "3. import the data into Weaviate\n",
    "\n",
    "> Note: <br/>\n",
    "> Like mentioned before. We don't need to manually vectorize the data.<br/>\n",
    "> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc3efadd",
   "metadata": {},
   "outputs": [],
   "source": [
    "### STEP 1 - load the dataset\n",
    "\n",
    "from datasets import load_dataset\n",
    "from typing import List, Iterator\n",
    "\n",
    "# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
    "dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
    "\n",
    "# For testing, limited to 2.5k articles for demo purposes\n",
    "dataset = dataset[:2_500]\n",
    "\n",
    "# Limited to 25k articles for larger demo purposes\n",
    "# dataset = dataset[:25_000]\n",
    "\n",
    "# for free OpenAI acounts, you can use 50 objects\n",
    "# dataset = dataset[:50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5044da96",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Step 2 - configure Weaviate Batch, with\n",
    "# - starting batch size of 100\n",
    "# - dynamically increase/decrease based on performance\n",
    "# - add timeout retries if something goes wrong\n",
    "\n",
    "client.batch.configure(\n",
    "    batch_size=10, \n",
    "    dynamic=True,\n",
    "    timeout_retries=3,\n",
    "#   callback=None,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15db8380",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Step 3 - import data\n",
    "\n",
    "print(\"Importing Articles\")\n",
    "\n",
    "counter=0\n",
    "\n",
    "with client.batch as batch:\n",
    "    for article in dataset:\n",
    "        if (counter %10 == 0):\n",
    "            print(f\"Import {counter} / {len(dataset)} \")\n",
    "\n",
    "        properties = {\n",
    "            \"title\": article[\"title\"],\n",
    "            \"content\": article[\"text\"],\n",
    "            \"url\": article[\"url\"]\n",
    "        }\n",
    "        \n",
    "        batch.add_data_object(properties, \"Article\")\n",
    "        counter = counter+1\n",
    "\n",
    "print(\"Importing Articles complete\")       "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3658693c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test that all data has loaded – get object count\n",
    "result = (\n",
    "    client.query.aggregate(\"Article\")\n",
    "    .with_fields(\"meta { count }\")\n",
    "    .do()\n",
    ")\n",
    "print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d791186",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test one article has worked by checking one object\n",
    "test_article = (\n",
    "    client.query\n",
    "    .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
    "    .with_limit(1)\n",
    "    .do()\n",
    ")[\"data\"][\"Get\"][\"Article\"][0]\n",
    "\n",
    "print(test_article['title'])\n",
    "print(test_article['url'])\n",
    "print(test_article['content'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46050ca9",
   "metadata": {},
   "source": [
    "### Search Data\n",
    "\n",
    "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b044aa93",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_weaviate(query, collection_name):\n",
    "    \n",
    "    nearText = {\n",
    "        \"concepts\": [query],\n",
    "        \"distance\": 0.7,\n",
    "    }\n",
    "\n",
    "    properties = [\n",
    "        \"title\", \"content\", \"url\",\n",
    "        \"_additional {certainty distance}\"\n",
    "    ]\n",
    "\n",
    "    result = (\n",
    "        client.query\n",
    "        .get(collection_name, properties)\n",
    "        .with_near_text(nearText)\n",
    "        .with_limit(10)\n",
    "        .do()\n",
    "    )\n",
    "    \n",
    "    # Check for errors\n",
    "    if (\"errors\" in result):\n",
    "        print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute – the limit is set at 60 per minute.\")\n",
    "        raise Exception(result[\"errors\"][0]['message'])\n",
    "    \n",
    "    return result[\"data\"][\"Get\"][collection_name]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e2025f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n",
    "\n",
    "for i, article in enumerate(query_result):\n",
    "    print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93c4a696",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n",
    "\n",
    "for i, article in enumerate(query_result):\n",
    "    print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2007be48",
   "metadata": {},
   "source": [
    "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}