{ "cells": [ { "cell_type": "markdown", "id": "cb1537e6", "metadata": {}, "source": [ "# Using Weaviate with OpenAI vectorize module for Embeddings Search\n", "\n", "This notebook is prepared for a scenario where:\n", "* Your data is not vectorized\n", "* You want to run Vector Search on your data\n", "* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n", "\n", "This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run semantic search.\n", "\n", "This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", "\n", "## What is Weaviate\n", "\n", "Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n", "\n", "Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n", "\n", "Weaviate let's you use your favorite ML-models, and scale seamlessly into billions of data objects.\n", "\n", "### Deployment options\n", "\n", "Whatever your scenario or production setup, Weaviate has an option for you. 
You can deploy Weaviate in the following setups:\n", "* Self-hosted – you can deploy Weaviate with Docker locally, or on any server you want.\n", "* SaaS – you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n", "* Hybrid-SaaS – you can deploy Weaviate in your own private Cloud Service\n", "\n", "### Programming languages\n", "\n", "Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n", "* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n", "* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n", "* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n", "* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n", "\n", "Additionally, Weaviate has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects), so you can call Weaviate from any language that supports REST requests." 
] }, { "cell_type": "markdown", "id": "45956173", "metadata": {}, "source": [ "## Demo Flow\n", "The demo flow is:\n", "- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n", "- **Connect**: Connect to your Weaviate instance\n", "- **Schema Configuration**: Configure the schema of your data\n", "    - *Note*: Here we can define which OpenAI Embedding Model to use\n", "    - *Note*: Here we can configure which properties to index\n", "- **Import data**: Load a demo dataset and import it into Weaviate\n", "    - *Note*: The import process will automatically index your data, based on the configuration in the schema\n", "    - *Note*: You don't need to explicitly vectorize your data; Weaviate will communicate with OpenAI to do it for you\n", "- **Run Queries**: Query your data\n", "    - *Note*: You don't need to explicitly vectorize your queries; Weaviate will communicate with OpenAI to do it for you\n", "\n", "Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings." ] }, { "cell_type": "markdown", "id": "2a4a145e", "metadata": {}, "source": [ "## OpenAI Module in Weaviate\n", "All Weaviate instances come equipped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.\n", "\n", "This module is responsible for handling vectorization at import (or any CRUD operations) and when you run a query.\n", "\n", "### No need to manually vectorize data\n", "This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n", "\n", "All you need to do is:\n", "1. provide your OpenAI API Key – when you connect to the Weaviate Client\n", "2. 
define which OpenAI vectorizer to use in your Schema" ] }, { "cell_type": "markdown", "id": "f1a618c5", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Before we start this project, we need to set up the following:\n", "\n", "* create a `Weaviate` instance\n", "* install libraries\n", "    * `weaviate-client`\n", "    * `datasets`\n", "    * `apache-beam`\n", "* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n", "\n", "===========================================================\n", "### Create a Weaviate instance\n", "\n", "To create a Weaviate instance we have 2 options:\n", "\n", "1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n", "2. Install and run Weaviate locally with Docker.\n", "\n", "#### Option 1 – WCS Installation Steps\n", "\n", "Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n", "1. create a free account and/or log in to [WCS](https://console.weaviate.io/)\n", "2. create a `Weaviate Cluster` with the following settings:\n", "    * Sandbox: `Sandbox Free`\n", "    * Weaviate Version: Use default (latest)\n", "    * OIDC Authentication: `Disabled`\n", "3. your instance should be ready in a minute or two\n", "4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network`\n", "\n", "#### Option 2 – local Weaviate instance with Docker\n", "\n", "Install and run Weaviate locally with Docker.\n", "1. Download the [./docker-compose.yml](./docker-compose.yml) file\n", "2. Then open your terminal, navigate to the folder containing your docker-compose.yml file, and start Docker with: `docker-compose up -d`\n", "3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n", "\n", "Note: 
To shut down your Docker instance you can call: `docker-compose down`\n", "\n", "##### Learn more\n", "To learn more about using Weaviate with Docker, see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)." ] }, { "cell_type": "markdown", "id": "b9babafe", "metadata": {}, "source": [ "===========================================================\n", "## Install required libraries\n", "\n", "Before running this project make sure you have the following libraries installed:\n", "\n", "### Weaviate Python client\n", "\n", "The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n", "\n", "### datasets & apache-beam\n", "\n", "To load sample data, you need the `datasets` library and its dependency `apache-beam`." ] }, { "cell_type": "code", "execution_count": null, "id": "2b04113f", "metadata": {}, "outputs": [], "source": [ "# Install the Weaviate client for Python\n", "# (quoted so the shell does not treat `>` as an output redirect)\n", "!pip install \"weaviate-client>=3.11.0\"\n", "\n", "# Install datasets and apache-beam to load the sample datasets\n", "!pip install datasets apache-beam" ] }, { "cell_type": "markdown", "id": "36fe86f4", "metadata": {}, "source": [ "===========================================================\n", "## Prepare your OpenAI API key\n", "\n", "The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n", "\n", "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n", "\n", "Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`." ] }, { "cell_type": "code", "execution_count": null, "id": "88be138c", "metadata": {}, "outputs": [], "source": [ "# Test that your OpenAI API key is correctly set as an environment variable\n", "# Note. 
if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n", "import os\n", "\n", "# Note. alternatively you can set a temporary env variable like this:\n", "# os.environ[\"OPENAI_API_KEY\"] = 'your-key-goes-here'\n", "\n", "if os.getenv(\"OPENAI_API_KEY\") is not None:\n", "    print(\"OPENAI_API_KEY is ready\")\n", "else:\n", "    print(\"OPENAI_API_KEY environment variable not found\")" ] }, { "cell_type": "markdown", "id": "91df4d5b", "metadata": {}, "source": [ "## Connect to your Weaviate instance\n", "\n", "In this section, we will:\n", "\n", "1. test env variable `OPENAI_API_KEY` – **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n", "2. connect to your Weaviate instance with your `OpenAI API Key`\n", "3. and test the client connection\n", "\n", "### The client\n", "\n", "After this step, the `client` object will be used to perform all Weaviate-related operations." ] }, { "cell_type": "code", "execution_count": null, "id": "cc662c1b", "metadata": {}, "outputs": [], "source": [ "import weaviate\n", "from datasets import load_dataset\n", "import os\n", "\n", "# Connect to your Weaviate instance\n", "client = weaviate.Client(\n", "    url=\"https://your-wcs-instance-name.weaviate.network/\",\n", "#   url=\"http://localhost:8080/\",\n", "    additional_headers={\n", "        \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n", "    }\n", ")\n", "\n", "# Check if your instance is live and ready\n", "# This should return `True`\n", "client.is_ready()" ] }, { "cell_type": "markdown", "id": "7d3dac3c", "metadata": {}, "source": [ "# Schema\n", "\n", "In this section, we will:\n", "1. configure the data schema for your data\n", "2. 
select the OpenAI module\n", "\n", "> This is the second and final step that requires OpenAI-specific configuration.\n", "> After this step, the rest of the instructions will only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n", "\n", "\n", "## What is a schema\n", "\n", "In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n", "\n", "A schema is how you tell Weaviate:\n", "* what embedding model should be used to vectorize the data\n", "* what your data is made of (property names and types)\n", "* which properties should be vectorized and indexed\n", "\n", "In this cookbook we will use a dataset for `Articles`, which contains:\n", "* `title`\n", "* `content`\n", "* `url`\n", "\n", "We want to vectorize `title` and `content`, but not the `url`.\n", "\n", "To vectorize and query the data, we will use `text-embedding-ada-002`." ] }, { "cell_type": "code", "execution_count": null, "id": "f894b911", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Clear up the schema, so that we can recreate it\n", "client.schema.delete_all()\n", "client.schema.get()\n", "\n", "# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n", "article_schema = {\n", "    \"class\": \"Article\",\n", "    \"description\": \"A collection of articles\",\n", "    \"vectorizer\": \"text2vec-openai\",\n", "    \"moduleConfig\": {\n", "        \"text2vec-openai\": {\n", "            \"model\": \"ada\",\n", "            \"modelVersion\": \"002\",\n", "            \"type\": \"text\"\n", "        }\n", "    },\n", "    \"properties\": [{\n", "        \"name\": \"title\",\n", "        \"description\": \"Title of the article\",\n", "        \"dataType\": [\"string\"]\n", "    },\n", "    {\n", "        \"name\": \"content\",\n", "        \"description\": \"Contents of the article\",\n", "        \"dataType\": [\"text\"]\n", "    },\n", "    {\n", "        \"name\": \"url\",\n", "        \"description\": \"URL to the article\",\n", "        \"dataType\": [\"string\"],\n", "        \"moduleConfig\": { \"text2vec-openai\": { 
\"skip\": True } }\n", " }]\n", "}\n", "\n", "# add the Article schema\n", "client.schema.create_class(article_schema)\n", "\n", "# get the schema to make sure it worked\n", "client.schema.get()" ] }, { "cell_type": "markdown", "id": "e5d9d2e1", "metadata": {}, "source": [ "## Import data\n", "\n", "In this section we will:\n", "1. load the Simple Wikipedia dataset\n", "2. configure Weaviate Batch import (to make the import more efficient)\n", "3. import the data into Weaviate\n", "\n", "> Note:
\n", "> As mentioned before, we don't need to manually vectorize the data.
\n", "> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that." ] }, { "cell_type": "code", "execution_count": null, "id": "fc3efadd", "metadata": {}, "outputs": [], "source": [ "### STEP 1 - load the dataset\n", "\n", "from datasets import load_dataset\n", "from typing import List, Iterator\n", "\n", "# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n", "dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n", "\n", "# For testing, limited to 2.5k articles for demo purposes\n", "dataset = dataset[:2_500]\n", "\n", "# Limited to 25k articles for larger demo purposes\n", "# dataset = dataset[:25_000]\n", "\n", "# for free OpenAI accounts, you can use 50 objects\n", "# dataset = dataset[:50]" ] }, { "cell_type": "code", "execution_count": null, "id": "5044da96", "metadata": {}, "outputs": [], "source": [ "### Step 2 - configure Weaviate Batch, with\n", "# - starting batch size of 10\n", "# - dynamically increase/decrease based on performance\n", "# - add timeout retries if something goes wrong\n", "\n", "client.batch.configure(\n", "    batch_size=10,\n", "    dynamic=True,\n", "    timeout_retries=3,\n", "#   callback=None,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "15db8380", "metadata": {}, "outputs": [], "source": [ "### Step 3 - import data\n", "\n", "print(\"Importing Articles\")\n", "\n", "counter = 0\n", "\n", "with client.batch as batch:\n", "    for article in dataset:\n", "        # Report progress every 10 objects\n", "        if counter % 10 == 0:\n", "            print(f\"Import {counter} / {len(dataset)} \")\n", "\n", "        properties = {\n", "            \"title\": article[\"title\"],\n", "            \"content\": article[\"text\"],\n", "            \"url\": article[\"url\"]\n", "        }\n", "\n", "        batch.add_data_object(properties, \"Article\")\n", "        counter += 1\n", "\n", "print(\"Importing Articles complete\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3658693c", 
"metadata": {}, "outputs": [], "source": [ "# Test that all data has loaded – get object count\n", "result = (\n", " client.query.aggregate(\"Article\")\n", " .with_fields(\"meta { count }\")\n", " .do()\n", ")\n", "print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0d791186", "metadata": {}, "outputs": [], "source": [ "# Test one article has worked by checking one object\n", "test_article = (\n", " client.query\n", " .get(\"Article\", [\"title\", \"url\", \"content\"])\n", " .with_limit(1)\n", " .do()\n", ")[\"data\"][\"Get\"][\"Article\"][0]\n", "\n", "print(test_article['title'])\n", "print(test_article['url'])\n", "print(test_article['content'])" ] }, { "cell_type": "markdown", "id": "46050ca9", "metadata": {}, "source": [ "### Search Data\n", "\n", "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors" ] }, { "cell_type": "code", "execution_count": null, "id": "b044aa93", "metadata": {}, "outputs": [], "source": [ "def query_weaviate(query, collection_name):\n", " \n", " nearText = {\n", " \"concepts\": [query],\n", " \"distance\": 0.7,\n", " }\n", "\n", " properties = [\n", " \"title\", \"content\", \"url\",\n", " \"_additional {certainty distance}\"\n", " ]\n", "\n", " result = (\n", " client.query\n", " .get(collection_name, properties)\n", " .with_near_text(nearText)\n", " .with_limit(10)\n", " .do()\n", " )\n", " \n", " # Check for errors\n", " if (\"errors\" in result):\n", " print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute – the limit is set at 60 per minute.\")\n", " raise Exception(result[\"errors\"][0]['message'])\n", " \n", " return result[\"data\"][\"Get\"][collection_name]" ] }, { "cell_type": "code", "execution_count": null, "id": "7e2025f6", "metadata": {}, "outputs": [], "source": [ "query_result = query_weaviate(\"modern art in Europe\", 
\"Article\")\n", "\n", "for i, article in enumerate(query_result):\n", " print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")" ] }, { "cell_type": "code", "execution_count": null, "id": "93c4a696", "metadata": {}, "outputs": [], "source": [ "query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n", "\n", "for i, article in enumerate(query_result):\n", " print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")" ] }, { "cell_type": "markdown", "id": "2007be48", "metadata": {}, "source": [ "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }