You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
openai-cookbook/examples/vector_databases/weaviate/question-answering-with-wea...

584 lines
21 KiB
Plaintext

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Question Answering in Weaviate with OpenAI Q&A module\n",
"\n",
"This notebook is prepared for a scenario where:\n",
"* Your data is not vectorized\n",
"* You want to run Q&A ([learn more](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai)) on your data based on the [OpenAI completions](https://beta.openai.com/docs/api-reference/completions) endpoint.\n",
"* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
"\n",
"This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run question answering.\n",
"\n",
"## What is Weaviate\n",
"\n",
"Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
"\n",
"Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
"\n",
"Weaviate let you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"### Deployment options\n",
"\n",
"Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
"* Self-hosted you can deploy Weaviate with docker locally, or any server you want.\n",
"* SaaS you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
"* Hybrid-SaaS you can deploy Weaviate in your own private Cloud Service \n",
"\n",
"### Programming languages\n",
"\n",
"Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
"* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
"* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
"* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
"* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
"\n",
"Additionally, Weaviate has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
]
},
{
"cell_type": "markdown",
"id": "45956173",
"metadata": {},
"source": [
"## Demo Flow\n",
"The demo flow is:\n",
"- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n",
"- **Connect**: Connect to your Weaviate instance \n",
"- **Schema Configuration**: Configure the schema of your data\n",
" - *Note*: Here we can define which OpenAI Embedding Model to use\n",
" - *Note*: Here we can configure which properties to index\n",
"- **Import data**: Load a demo dataset and import it into Weaviate\n",
" - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
" - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you\n",
"- **Run Queries**: Query \n",
" - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you\n",
" - *Note*: The `qna-openai` module automatically communicates with the OpenAI completions endpoint\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases for question answering."
]
},
{
"cell_type": "markdown",
"id": "2a4a145e",
"metadata": {},
"source": [
"## OpenAI Module in Weaviate\n",
"All Weaviate instances come equipped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) and the [qna-openai](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai) modules.\n",
"\n",
"The first module is responsible for handling vectorization at import (or any CRUD operations) and when you run a search query. The second module communicates with the OpenAI completions endpoint.\n",
"\n",
"### No need to manually vectorize data\n",
"This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
"\n",
"All you need to do is:\n",
"1. provide your OpenAI API Key when you connected to the Weaviate Client\n",
"2. define which OpenAI vectorizer to use in your Schema"
]
},
{
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* create a `Weaviate` instance\n",
"* install libraries\n",
" * `weaviate-client`\n",
" * `datasets`\n",
" * `apache-beam`\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"### Create a Weaviate instance\n",
"\n",
"To create a Weaviate instance we have 2 options:\n",
"\n",
"1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
"2. Install and run Weaviate locally with Docker.\n",
"\n",
"#### Option 1 WCS Installation Steps\n",
"\n",
"Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
"\n",
"#### Option 2 local Weaviate instance with Docker\n",
"\n",
"Install and run Weaviate locally with Docker.\n",
"1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
"2. Then open your terminal, navigate to where your docker-compose.yml file is located, and start docker with: `docker-compose up -d`\n",
"3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
"\n",
"Note. To shut down your docker instance you can call: `docker-compose down`\n",
"\n",
"##### Learn more\n",
"To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
]
},
{
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"=========================================================== \n",
"## Install required libraries\n",
"\n",
"Before running this project make sure to have the following libraries:\n",
"\n",
"### Weaviate Python client\n",
"\n",
"The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
"\n",
"### datasets & apache-beam\n",
"\n",
"To load sample data, you need the `datasets` library and its' dependency `apache-beam`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"# Install the Weaviate client for Python\n",
"!pip install weaviate-client>3.11.0\n",
"\n",
"# Install datasets and apache-beam to load the sample datasets\n",
"!pip install datasets apache-beam"
]
},
{
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a2ded4b",
"metadata": {},
"outputs": [],
"source": [
"# Export OpenAI API Key\n",
"!export OPENAI_API_KEY=\"your key\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88be138c",
"metadata": {},
"outputs": [],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ['OPENAI_API_KEY'] = 'your-key-goes-here'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to your Weaviate instance\n",
"\n",
"In this section, we will:\n",
"\n",
"1. test env variable `OPENAI_API_KEY` **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
"2. connect to your Weaviate your `OpenAI API Key`\n",
"3. and test the client connection\n",
"\n",
"### The client \n",
"\n",
"After this step, the `client` object will be used to perform all Weaviate-related operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc662c1b",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from datasets import load_dataset\n",
"import os\n",
"\n",
"# Connect to your Weaviate instance\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
"# url=\"http://localhost:8080/\",\n",
" auth_client_secret=weaviate.auth.AuthApiKey(api_key=\"<YOUR-WEAVIATE-API-KEY>\"), # comment out this line if you are not using authentication for your Weaviate instance (i.e. for locally deployed instances)\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")\n",
"\n",
"# Check if your instance is live and ready\n",
"# This should return `True`\n",
"client.is_ready()"
]
},
{
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"# Schema\n",
"\n",
"In this section, we will:\n",
"1. configure the data schema for your data\n",
"2. select OpenAI module\n",
"\n",
"> This is the second and final step, which requires OpenAI specific configuration.\n",
"> After this step, the rest of instructions wlll only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
"\n",
"\n",
"## What is a schema\n",
"\n",
"In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
"\n",
"A schema is how you tell Weaviate:\n",
"* what embedding model should be used to vectorize the data\n",
"* what your data is made of (property names and types)\n",
"* which properties should be vectorized and indexed\n",
"\n",
"In this cookbook we will use a dataset for `Articles`, which contains:\n",
"* `title`\n",
"* `content`\n",
"* `url`\n",
"\n",
"We want to vectorize `title` and `content`, but not the `url`.\n",
"\n",
"To vectorize and query the data, we will use `text-embedding-3-small`. For Q&A we will use `gpt-3.5-turbo-instruct`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-3-small` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }, \n",
" \"qna-openai\": {\n",
" \"model\": \"gpt-3.5-turbo-instruct\",\n",
" \"maxTokens\": 16,\n",
" \"temperature\": 0.0,\n",
" \"topP\": 1,\n",
" \"frequencyPenalty\": 0.0,\n",
" \"presencePenalty\": 0.0\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"url\",\n",
" \"description\": \"URL to the article\",\n",
" \"dataType\": [\"string\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"In this section we will:\n",
"1. load the Simple Wikipedia dataset\n",
"2. configure Weaviate Batch import (to make the import more efficient)\n",
"3. import the data into Weaviate\n",
"\n",
"> Note: <br/>\n",
"> Like mentioned before. We don't need to manually vectorize the data.<br/>\n",
"> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc3efadd",
"metadata": {},
"outputs": [],
"source": [
"### STEP 1 - load the dataset\n",
"\n",
"from datasets import load_dataset\n",
"from typing import List, Iterator\n",
"\n",
"# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
"dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
"\n",
"# For testing, limited to 2.5k articles for demo purposes\n",
"dataset = dataset[:2_500]\n",
"\n",
"# Limited to 25k articles for larger demo purposes\n",
"# dataset = dataset[:25_000]\n",
"\n",
"# for free OpenAI acounts, you can use 50 objects\n",
"# dataset = dataset[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5044da96",
"metadata": {},
"outputs": [],
"source": [
"### Step 2 - configure Weaviate Batch, with\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=10, \n",
" dynamic=True,\n",
" timeout_retries=3,\n",
"# callback=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"### Step 3 - import data\n",
"\n",
"print(\"Importing Articles\")\n",
"\n",
"counter=0\n",
"\n",
"with client.batch as batch:\n",
" for article in dataset:\n",
" if (counter %10 == 0):\n",
" print(f\"Import {counter} / {len(dataset)} \")\n",
"\n",
" properties = {\n",
" \"title\": article[\"title\"],\n",
" \"content\": article[\"text\"],\n",
" \"url\": article[\"url\"]\n",
" }\n",
" \n",
" batch.add_data_object(properties, \"Article\")\n",
" counter = counter+1\n",
"\n",
"print(\"Importing Articles complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Test that all data has loaded get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article['title'])\n",
"print(test_article['url'])\n",
"print(test_article['content'])"
]
},
{
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"### Question Answering on the Data\n",
"\n",
"As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def qna(query, collection_name):\n",
" \n",
" properties = [\n",
" \"title\", \"content\", \"url\",\n",
" \"_additional { answer { hasAnswer property result startPosition endPosition } distance }\"\n",
" ]\n",
"\n",
" ask = {\n",
" \"question\": query,\n",
" \"properties\": [\"content\"]\n",
" }\n",
"\n",
" result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_ask(ask)\n",
" .with_limit(1)\n",
" .do()\n",
" )\n",
" \n",
" # Check for errors\n",
" if (\"errors\" in result):\n",
" print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute the limit is set at 60 per minute.\")\n",
" raise Exception(result[\"errors\"][0]['message'])\n",
" \n",
" return result[\"data\"][\"Get\"][collection_name]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e2025f6",
"metadata": {},
"outputs": [],
"source": [
"query_result = qna(\"Did Alanis Morissette win a Grammy?\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"metadata": {},
"outputs": [],
"source": [
"query_result = qna(\"What is the capital of China?\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" if article['_additional']['answer']['hasAnswer'] == False:\n",
" print('No answer found')\n",
" else:\n",
" print(f\"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}