{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Weaviate with OpenAI vectorize module for Embeddings Search\n",
"\n",
"This notebook is prepared for a scenario where:\n",
"* Your data is not vectorized\n",
"* You want to run Vector Search on your data\n",
"* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
"\n",
"This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run semantic search.\n",
"\n",
"This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"## What is Weaviate\n",
"\n",
"Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
"\n",
"Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
"\n",
"Weaviate let's you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"### Deployment options\n",
"\n",
"Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
"* Self-hosted – you can deploy Weaviate with docker locally, or any server you want.\n",
"* SaaS – you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
"* Hybrid-Saas – you can deploy Weaviate in your own private Cloud Service \n",
"\n",
"### Programming languages\n",
"\n",
"Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
"* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
"* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
"* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
"* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
"\n",
"Additionally, Weavaite has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
]
},
{
"cell_type": "markdown",
"id": "45956173",
"metadata": {},
"source": [
"## Demo Flow\n",
"The demo flow is:\n",
"- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n",
"- **Connect**: Connect to your Weaviate instance \n",
"- **Schema Configuration**: Configure the schema of your data\n",
" - *Note*: Here we can define which OpenAI Embedding Model to use\n",
" - *Note*: Here we can configure which properties to index on\n",
"- **Import data**: Load a demo dataset and import it into Weaviate\n",
" - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
" - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you.\n",
"- **Run Queries**: Query \n",
" - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "2a4a145e",
"metadata": {},
"source": [
"## OpenAI Module in Weaviate\n",
"All Weaviate instances come equiped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.\n",
"\n",
"This module is responsible handling vectorization at import (or any CRUD operations) and when you run a query.\n",
"\n",
"### No need to manually vectorize data\n",
"This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
"\n",
"All you need to do is:\n",
"1. provide your OpenAI API Key – when you connected to the Weaviate Client\n",
"2. define which OpenAI vectorizer to use in your Schema"
]
},
{
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* create a `Weaviate` instance\n",
"* install libraries\n",
" * `weaviate-client`\n",
" * `datasets`\n",
" * `apache-beam`\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"### Create a Weaviate instance\n",
"\n",
"To create a Weaviate instance we have 2 options:\n",
"\n",
"1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
"2. Install and run Weaviate locally with Docker.\n",
"\n",
"#### Option 1 – WCS Installation Steps\n",
"\n",
"Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
"\n",
"#### Option 2 – local Weaviate instance with Docker\n",
"\n",
"Install and run Weaviate locally with Docker.\n",
"1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
"2. Then open your terminal, navigate to where your docker-compose.yml folder, and start docker with: `docker-compose up -d`\n",
"3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
"\n",
"Note. To shut down your docker instance you can call: `docker-compose down`\n",
"\n",
"##### Learn more\n",
"To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
]
},
{
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"=========================================================== \n",
"## Install required libraries\n",
"\n",
"Before running this project make sure to have the following libraries:\n",
"\n",
"### Weaviate Python client\n",
"\n",
"The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
"\n",
"### datasets & apache-beam\n",
"\n",
"To load sample data, you need the `datasets` library and its' dependency `apache-beam`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"# Install the Weaviate client for Python\n",
"!pip install weaviate-client>=3.11.0\n",
"\n",
"# Install datasets and apache-beam to load the sample datasets\n",
"!pip install datasets apache-beam"
]
},
{
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88be138c",
"metadata": {},
"outputs": [],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ[\"OPENAI_API_KEY\"] = 'your-key-goes-here'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to your Weaviate instance\n",
"\n",
"In this section, we will:\n",
"\n",
"1. test env variable `OPENAI_API_KEY` – **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
"2. connect to your Weaviate your `OpenAI API Key`\n",
"3. and test the client connection\n",
"\n",
"### The client \n",
"\n",
"After this step, the `client` object will be used to perform all Weaviate-related operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc662c1b",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from datasets import load_dataset\n",
"import os\n",
"\n",
"# Connect to your Weaviate instance\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
"# url=\"http://localhost:8080/\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")\n",
"\n",
"# Check if your instance is live and ready\n",
"# This should return `True`\n",
"client.is_ready()"
]
},
{
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"# Schema\n",
"\n",
"In this section, we will:\n",
"1. configure the data schema for your data\n",
"2. select OpenAI module\n",
"\n",
"> This is the second and final step, which requires OpenAI specific configuration.\n",
"> After this step, the rest of instructions wlll only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
"\n",
"\n",
"## What is a schema\n",
"\n",
"In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
"\n",
"A schema is how you tell Weaviate:\n",
"* what embedding model should be used to vectorize the data\n",
"* what your data is made of (property names and types)\n",
"* which properties should be vectorized and indexed\n",
"\n",
"In this cookbook we will use a dataset for `Articles`, which contains:\n",
"* `title`\n",
"* `content`\n",
"* `url`\n",
"\n",
"We want to vectorize `title` and `content`, but not the `url`.\n",
"\n",
"To vectorize and query the data, we will use `text-embedding-ada-002`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"url\",\n",
" \"description\": \"URL to the article\",\n",
" \"dataType\": [\"string\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"In this section we will:\n",
"1. load the Simple Wikipedia dataset\n",
"2. configure Weaviate Batch import (to make the import more efficient)\n",
"3. import the data into Weaviate\n",
"\n",
"> Note:
\n",
"> Like mentioned before. We don't need to manually vectorize the data.
\n",
"> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc3efadd",
"metadata": {},
"outputs": [],
"source": [
"### STEP 1 - load the dataset\n",
"\n",
"from datasets import load_dataset\n",
"from typing import List, Iterator\n",
"\n",
"# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
"dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
"\n",
"# For testing, limited to 2.5k articles for demo purposes\n",
"dataset = dataset[:2_500]\n",
"\n",
"# Limited to 25k articles for larger demo purposes\n",
"# dataset = dataset[:25_000]\n",
"\n",
"# for free OpenAI acounts, you can use 50 objects\n",
"# dataset = dataset[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5044da96",
"metadata": {},
"outputs": [],
"source": [
"### Step 2 - configure Weaviate Batch, with\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=10, \n",
" dynamic=True,\n",
" timeout_retries=3,\n",
"# callback=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"### Step 3 - import data\n",
"\n",
"print(\"Importing Articles\")\n",
"\n",
"counter=0\n",
"\n",
"with client.batch as batch:\n",
" for article in dataset:\n",
" if (counter %10 == 0):\n",
" print(f\"Import {counter} / {len(dataset)} \")\n",
"\n",
" properties = {\n",
" \"title\": article[\"title\"],\n",
" \"content\": article[\"text\"],\n",
" \"url\": article[\"url\"]\n",
" }\n",
" \n",
" batch.add_data_object(properties, \"Article\")\n",
" counter = counter+1\n",
"\n",
"print(\"Importing Articles complete\") "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Test that all data has loaded – get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article['title'])\n",
"print(test_article['url'])\n",
"print(test_article['content'])"
]
},
{
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def query_weaviate(query, collection_name):\n",
" \n",
" nearText = {\n",
" \"concepts\": [query],\n",
" \"distance\": 0.7,\n",
" }\n",
"\n",
" properties = [\n",
" \"title\", \"content\", \"url\",\n",
" \"_additional {certainty distance}\"\n",
" ]\n",
"\n",
" result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_near_text(nearText)\n",
" .with_limit(10)\n",
" .do()\n",
" )\n",
" \n",
" # Check for errors\n",
" if (\"errors\" in result):\n",
" print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute – the limit is set at 60 per minute.\")\n",
" raise Exception(result[\"errors\"][0]['message'])\n",
" \n",
" return result[\"data\"][\"Get\"][collection_name]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e2025f6",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}