Merging Weaviate notebooks to main (#122)

* updates Weaviate vector database Cookbook examples
pull/129/head
colin-openai 1 year ago committed by GitHub
parent 2992ca14cb
commit 264bcb03dd

@ -127,7 +127,7 @@
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"\n",
"# Warning, the file is pretty big so this will take some time\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
@ -463,7 +463,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "d701b3c7",
"id": "c8280363",
"metadata": {},
"outputs": [],
"source": [
@ -521,7 +521,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "67b3584d",
"id": "3402b1b1",
"metadata": {},
"outputs": [],
"source": [
@ -531,7 +531,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "3e7ac79b",
"id": "64a3f90a",
"metadata": {},
"outputs": [],
"source": [
@ -539,61 +539,95 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d939342f",
"metadata": {},
"source": [
"## Weaviate\n",
"\n",
"The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
"Another vector database option we'll explore is **Weaviate**, which offers both a managed, [SaaS](https://console.weaviate.io/) option, as well as a self-hosted [open source](https://github.com/weaviate/weaviate) option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
"\n",
"For this we will:\n",
"- Set up a local deployment of Weaviate\n",
"- Create indices in Weaviate\n",
"- Store our data there\n",
"- Fire some similarity search queries\n",
"- Try a real use case"
"- Try a real use case\n",
"\n",
"\n",
"### Bring your own vectors approach\n",
"In this cookbook, we provide the data with already generated vectors. This is a good approach for scenarios, where your data is already vectorized.\n",
"\n",
"### Automated vectorization with OpenAI module\n",
"For scenarios, where your data is not vectorized yet, you can delegate the vectorization task with OpenAI to Weaviate.\n",
"Weaviate offers a built-in module [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the vectorization for you at:\n",
"* import\n",
"* for any CRUD operations\n",
"* for semantic search\n",
"\n",
"Check out the [Getting Started with Weaviate and OpenAI module cookbook](./weaviate/getting-started-with-text2vec-openai.ipynb) to learn step by step how to import and vectorize data in one step."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bfdfe260",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"To get Weaviate running locally we will use Docker and follow the instructions contained in the Weaviate documentation here: https://weaviate.io/developers/weaviate/current/installation/docker-compose.html\n",
"To run Weaviate locally, you'll need [Docker](https://www.docker.com/pricing/). Following the instructions contained in the Weaviate documentation [here](https://weaviate.io/developers/weaviate/current/installation/docker-compose.html), we created an example docker-compose.yml file in this repo saved at [./weaviate/docker-compose.yml](./weaviate/docker-compose.yml).\n",
"\n",
"For an example docker-compose.yaml file please refer to `./weaviate/docker-compose.yaml` in this repo\n",
"After starting Docker, you can start Weaviate locally by navigating to the `examples/vector_databases/weaviate/` directory and running `docker-compose up -d`.\n",
"\n",
"You can start Weaviate up locally by navigating to this directory and running `docker-compose up -d `"
"#### SaaS\n",
"Alternatively you can use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9ea472d",
"id": "a78f95d1",
"metadata": {},
"outputs": [],
"source": [
"client = weaviate.Client(\"http://localhost:8080/\")"
"# Option #1 - Self-hosted - Weaviate Open Source \n",
"client = weaviate.Client(\n",
" url=\"http://localhost:8080\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13be220d",
"id": "e00b7d68",
"metadata": {},
"outputs": [],
"source": [
"client.schema.delete_all()\n",
"client.schema.get()"
"# Option #2 - SaaS - (Weaviate Cloud Service)\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73d33184",
"id": "1d370afa",
"metadata": {},
"outputs": [],
"source": [
@ -601,6 +635,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "03a926b9",
"metadata": {},
@ -611,87 +646,138 @@
"\n",
"In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n",
"\n",
"The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.htm)"
"The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/quickstart).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e868d143",
"id": "0e6175a1",
"metadata": {},
"outputs": [],
"source": [
"class_obj = {\n",
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"text\"]\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" \"dataType\": [\"text\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# Create the schema in Weaviate\n",
"client.schema.create_class(class_obj)\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# Check that we've created it as intended\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "786d437f",
"id": "ea838e7d",
"metadata": {},
"outputs": [],
"source": [
"### Step 1 - configure Weaviate Batch, which optimizes CRUD operations in bulk\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=100,\n",
" dynamic=True,\n",
" timeout_retries=3,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4c967ec",
"metadata": {},
"outputs": [],
"source": [
"# Convert DF into a list of tuples\n",
"data_objects = []\n",
"for k,v in article_df.iterrows():\n",
" data_objects.append((v['title'],v['text'],v['title_vector'],v['vector_id']))\n",
"### Step 2 - import data\n",
"\n",
"# Upsert into article schema\n",
"print(\"Uploading vectors to article schema..\")\n",
"print(\"Uploading data with vectors to Article schema..\")\n",
"\n",
"# Store a list of UUIDs in case we want to use to refer back to the initial dataframe\n",
"uuids = []\n",
"counter=0\n",
"\n",
"# Reuse our batcher from the Pinecone ingestion\n",
"for batch_df in df_batcher(article_df):\n",
" for k,v in batch_df.iterrows():\n",
" #print(articles)\n",
" uuid = client.data_object.create(\n",
" {\n",
" \"title\": v['title'],\n",
" \"content\": v['text']\n",
" },\n",
" \"Article\",\n",
" vector=v['title_vector']\n",
" )\n",
" uuids.append(uuid)"
"with client.batch as batch:\n",
" for k,v in article_df.iterrows():\n",
" \n",
" # print update message every 100 objects \n",
" if (counter %100 == 0):\n",
" print(f\"Import {counter} / {len(article_df)} \")\n",
" \n",
" properties = {\n",
" \"title\": v[\"title\"],\n",
" \"content\": v[\"text\"]\n",
" }\n",
" \n",
" vector = v[\"title_vector\"]\n",
" \n",
" batch.add_data_object(properties, \"Article\", None, vector)\n",
" counter = counter+1\n",
"\n",
"print(f\"Importing ({len(article_df)}) Articles complete\") "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"id": "f826e1ad",
"metadata": {},
"outputs": [],
"source": [
"# Test our insert has worked by checking one object\n",
"print(client.data_object.get()['objects'][0]['properties']['title'])\n",
"print(client.data_object.get()['objects'][0]['properties']['content'])\n",
"\n",
"# Test that all data has loaded\n",
"result = client.query.aggregate(\"Article\") \\\n",
" .with_fields('meta { count }') \\\n",
"# Test that all data has loaded get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
"result['data']"
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c09d483",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"content\", \"_additional {id}\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article[\"_additional\"][\"id\"])\n",
"print(test_article[\"title\"])\n",
"print(test_article[\"content\"])"
]
},
{
@ -707,25 +793,28 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5acd5437",
"id": "add222d7",
"metadata": {},
"outputs": [],
"source": [
"def query_weaviate(query, schema, top_k=20):\n",
"def query_weaviate(query, collection_name, top_k=20):\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(\n",
" input=query,\n",
" model=EMBEDDING_MODEL,\n",
" )[\"data\"][0]['embedding']\n",
" input=query,\n",
" model=EMBEDDING_MODEL,\n",
" )[\"data\"][0]['embedding']\n",
" \n",
" near_vector = {\"vector\": embedded_query}\n",
"\n",
" # Queries input schema with vectorised user query\n",
" query_result = client.query.get(schema,[\"title\",\"content\", \"_additional {certainty}\"]) \\\n",
" .with_near_vector(near_vector) \\\n",
" .with_limit(top_k) \\\n",
" .do()\n",
" query_result = (\n",
" client.query\n",
" .get(collection_name,[\"title\",\"content\", \"_additional {certainty distance}\"])\n",
" .with_near_vector(near_vector)\n",
" .with_limit(top_k)\n",
" .do()\n",
" )\n",
" \n",
" return query_result"
]
@ -733,31 +822,103 @@
{
"cell_type": "code",
"execution_count": null,
"id": "15def653",
"id": "c888aa4b",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate('modern art in Europe','Article')\n",
"query_result = query_weaviate(\"modern art in Europe\",\"Article\")\n",
"counter = 0\n",
"for article in query_result['data']['Get']['Article']:\n",
"for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n",
" counter += 1\n",
" print(f\"{counter}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
" print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"id": "c54cd8e9",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate('Famous battles in Scottish history','Article')\n",
"query_result = query_weaviate(\"Famous battles in Scottish history\",\"Article\")\n",
"counter = 0\n",
"for article in query_result['data']['Get']['Article']:\n",
"for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n",
" counter += 1\n",
" print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "220b3e11",
"metadata": {},
"source": [
"### Let Weaviate handle vector embeddings\n",
"\n",
"Weaviate has a [built-in module for OpenAI](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the steps required to generate a vector embedding for your queries and any CRUD operations.\n",
"\n",
"This allows you to run a vector query with the `with_near_text` filter, which uses your `OPEN_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9425c882",
"metadata": {},
"outputs": [],
"source": [
"def near_text_weaviate(query, collection_name):\n",
" \n",
" nearText = {\n",
" \"concepts\": [query],\n",
" \"distance\": 0.7,\n",
" }\n",
"\n",
" properties = [\n",
" \"title\", \"content\",\n",
" \"_additional {certainty distance}\"\n",
" ]\n",
"\n",
" query_result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_near_text(nearText)\n",
" .with_limit(20)\n",
" .do()\n",
" )[\"data\"][\"Get\"][collection_name]\n",
" \n",
" print (f\"Objects returned: {len(query_result)}\")\n",
" \n",
" return query_result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "501a16f7",
"metadata": {},
"outputs": [],
"source": [
"query_result = near_text_weaviate(\"modern art in Europe\",\"Article\")\n",
"counter = 0\n",
"for article in query_result:\n",
" counter += 1\n",
" print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "839b26df",
"metadata": {},
"outputs": [],
"source": [
"query_result = near_text_weaviate(\"Famous battles in Scottish history\",\"Article\")\n",
"counter = 0\n",
"for article in query_result:\n",
" counter += 1\n",
" print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "9cfaed9d",
@ -1002,9 +1163,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "vectordb",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "vectordb"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@ -1016,7 +1177,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.9.12"
}
},
"nbformat": 4,

@ -0,0 +1,20 @@
# Weaviate <> OpenAI
[Weaviate](https://weaviate.io) is an open-source vector search engine ([docs](https://weaviate.io/developers/weaviate) - [GitHub](https://github.com/weaviate/weaviate)) that can store and search through OpenAI embeddings _and_ data objects. The database allows you to do similarity search, hybrid search (combining multiple search techniques, such as keyword-based and vector search), and generative search (like Q&A). Weaviate also supports a wide variety of OpenAI-based modules (e.g., [`text2vec-openai`](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), [`qna-openai`](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai)), allowing you to vectorize and query data quickly and efficiently.
You can run Weaviate (including the OpenAI modules if desired) in three ways:
1. Open source inside a Docker container ([example](./docker-compose.yml))
2. Using the Weaviate Cloud Service ([get started](https://weaviate.io/developers/weaviate/quickstart/installation#weaviate-cloud-service))
3. In a Kubernetes cluster ([learn more](https://weaviate.io/developers/weaviate/installation/kubernetes))
### Examples
This folder contains a variety of Weaviate and OpenAI examples.
| Name | Description | Language | Google Colab |
| --- | --- | --- | --- |
| [Getting Started with Weaviate and OpenAI](./getting-started-with-weaviate-and-openai.ipynb) | A simple getting started for *semantic vector search* using the OpenAI vectorization module in Weaviate (`text2vec-openai`) | Python Notebook | [link](https://colab.research.google.com/drive/1RxpDE_ruCnoBB3TfwAZqdjYgHJhtdwhK) |
| [Hybrid Search with Weaviate and OpenAI](./hybrid-search-with-weaviate-and-openai.ipynb) | A simple getting started for *hybrid search* using the OpenAI vectorization module in Weaviate (`text2vec-openai`) | Python Notebook | [link](https://colab.research.google.com/drive/1E75BALWoKrOjvUhaznJKQO0A-B1QUPZ4) |
| [Question Answering with Weaviate and OpenAI](./question-answering-with-weaviate-and-openai.ipynb) | A simple getting started for *question answering (Q&A)* using the OpenAI Q&A module in Weaviate (`qna-openai`) | Python Notebook | [link](https://colab.research.google.com/drive/1pUerUZrJaknEboDxDxsuf3giCK0MJJgm) |
| [Docker-compose example](./docker-compose.yml) | A Docker-compose file with all OpenAI modules enabled | Docker | |
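
The notebooks above all define an `Article` class that uses the `text2vec-openai` module. As a rough illustration of that class definition, here is a minimal sketch that builds the dictionary in plain Python. The helper name `build_article_schema` and its `skip_properties` parameter are hypothetical and exist only for this example; the keys and values mirror the schema used in the notebooks in this folder, and other Weaviate versions may expect a different shape.

```python
from typing import Dict, Iterable, List


def build_article_schema(skip_properties: Iterable[str] = ("url",)) -> Dict:
    """Build an `Article` class definition vectorized by text2vec-openai.

    Hypothetical helper for illustration only; it is not part of the
    Weaviate client.
    """
    skip = set(skip_properties)
    # Property names and data types as used in the notebooks
    property_types = {"title": "string", "content": "text", "url": "string"}

    properties: List[Dict] = []
    for name, data_type in property_types.items():
        prop: Dict = {"name": name, "dataType": [data_type]}
        if name in skip:
            # Mark the property so the vectorizer module skips it
            prop["moduleConfig"] = {"text2vec-openai": {"skip": True}}
        properties.append(prop)

    return {
        "class": "Article",
        "description": "A collection of articles",
        "vectorizer": "text2vec-openai",
        "moduleConfig": {
            "text2vec-openai": {"model": "ada", "modelVersion": "002", "type": "text"}
        },
        "properties": properties,
    }


schema = build_article_schema()
# List the properties that are excluded from vectorization
print([p["name"] for p in schema["properties"] if "moduleConfig" in p])  # prints ['url']
```

With a connected client, the resulting dictionary could be passed to `client.schema.create_class(...)`, which is how the notebooks register the class.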

@ -1,20 +0,0 @@
version: '3.4'
services:
weaviate:
image: semitechnologies/weaviate:1.14.0
restart: on-failure:0
ports:
- "8080:8080"
environment:
QUERY_DEFAULTS_LIMIT: 20
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: "./data"
DEFAULT_VECTORIZER_MODULE: text2vec-transformers
ENABLE_MODULES: text2vec-transformers
TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
CLUSTER_HOSTNAME: 'node1'
t2v-transformers:
image: semitechnologies/transformers-inference:sentence-transformers-msmarco-distilroberta-base-v2
environment:
ENABLE_CUDA: 0 # set to 1 to enable
# NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA

@ -0,0 +1,33 @@
#################
#
# This is an example Docker Compose file for Weaviate with all OpenAI modules enabled
# You can, but don't have to, set `OPENAI_APIKEY`, because it can also be set at runtime
#
# Find the latest version here: https://weaviate.io/developers/weaviate/installation/docker-compose
#
#################
---
version: '3.4'
services:
weaviate:
command:
- --host
- 0.0.0.0
- --port
- '8080'
- --scheme
- http
image: semitechnologies/weaviate:1.17.2
ports:
- 8080:8080
restart: on-failure:0
environment:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
DEFAULT_VECTORIZER_MODULE: 'text2vec-openai'
ENABLE_MODULES: 'text2vec-openai,qna-openai'
CLUSTER_HOSTNAME: 'openai-weaviate-cluster'
# The following parameter (`OPENAI_APIKEY`) is optional, as you can also provide it at insert/query time
# OPENAI_APIKEY: sk-foobar
...

@ -0,0 +1,561 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Weaviate with OpenAI vectorize module for Embeddings Search\n",
"\n",
"This notebook is prepared for a scenario where:\n",
"* Your data is not vectorized\n",
"* You want to run Vector Search on your data\n",
"* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
"\n",
"This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run semantic search.\n",
"\n",
"This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"## What is Weaviate\n",
"\n",
"Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
"\n",
"Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
"\n",
"Weaviate let's you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"### Deployment options\n",
"\n",
"Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
"* Self-hosted you can deploy Weaviate with docker locally, or any server you want.\n",
"* SaaS you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
"* Hybrid-Saas you can deploy Weaviate in your own private Cloud Service \n",
"\n",
"### Programming languages\n",
"\n",
"Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
"* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
"* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
"* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
"* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
"\n",
"Additionally, Weavaite has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
]
},
{
"cell_type": "markdown",
"id": "45956173",
"metadata": {},
"source": [
"## Demo Flow\n",
"The demo flow is:\n",
"- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n",
"- **Connect**: Connect to your Weaviate instance \n",
"- **Schema Configuration**: Configure the schema of your data\n",
" - *Note*: Here we can define which OpenAI Embedding Model to use\n",
" - *Note*: Here we can configure which properties to index on\n",
"- **Import data**: Load a demo dataset and import it into Weaviate\n",
" - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
" - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you.\n",
"- **Run Queries**: Query \n",
" - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "2a4a145e",
"metadata": {},
"source": [
"## OpenAI Module in Weaviate\n",
"All Weaviate instances come equiped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.\n",
"\n",
"This module is responsible handling vectorization at import (or any CRUD operations) and when you run a query.\n",
"\n",
"### No need to manually vectorize data\n",
"This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
"\n",
"All you need to do is:\n",
"1. provide your OpenAI API Key when you connected to the Weaviate Client\n",
"2. define which OpenAI vectorizer to use in your Schema"
]
},
{
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* create a `Weaviate` instance\n",
"* install libraries\n",
" * `weaviate-client`\n",
" * `datasets`\n",
" * `apache-beam`\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"### Create a Weaviate instance\n",
"\n",
"To create a Weaviate instance we have 2 options:\n",
"\n",
"1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
"2. Install and run Weaviate locally with Docker.\n",
"\n",
"#### Option 1 WCS Installation Steps\n",
"\n",
"Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
"\n",
"#### Option 2 local Weaviate instance with Docker\n",
"\n",
"Install and run Weaviate locally with Docker.\n",
"1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
"2. Then open your terminal, navigate to where your docker-compose.yml folder, and start docker with: `docker-compose up -d`\n",
"3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
"\n",
"Note. To shut down your docker instance you can call: `docker-compose down`\n",
"\n",
"##### Learn more\n",
"To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
]
},
{
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"=========================================================== \n",
"## Install required libraries\n",
"\n",
"Before running this project make sure to have the following libraries:\n",
"\n",
"### Weaviate Python client\n",
"\n",
"The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
"\n",
"### datasets & apache-beam\n",
"\n",
"To load sample data, you need the `datasets` library and its' dependency `apache-beam`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"# Install the Weaviate client for Python\n",
"!pip install weaviate-client>=3.11.0\n",
"\n",
"# Install datasets and apache-beam to load the sample datasets\n",
"!pip install datasets apache-beam"
]
},
{
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88be138c",
"metadata": {},
"outputs": [],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ[\"OPENAI_API_KEY\"] = 'your-key-goes-here'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to your Weaviate instance\n",
"\n",
"In this section, we will:\n",
"\n",
"1. test env variable `OPENAI_API_KEY` **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
"2. connect to your Weaviate your `OpenAI API Key`\n",
"3. and test the client connection\n",
"\n",
"### The client \n",
"\n",
"After this step, the `client` object will be used to perform all Weaviate-related operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc662c1b",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from datasets import load_dataset\n",
"import os\n",
"\n",
"# Connect to your Weaviate instance\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
"# url=\"http://localhost:8080/\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")\n",
"\n",
"# Check if your instance is live and ready\n",
"# This should return `True`\n",
"client.is_ready()"
]
},
{
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"# Schema\n",
"\n",
"In this section, we will:\n",
"1. configure the data schema for your data\n",
"2. select OpenAI module\n",
"\n",
"> This is the second and final step that requires OpenAI-specific configuration.\n",
"> After this step, the rest of the instructions will only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
"\n",
"\n",
"## What is a schema\n",
"\n",
"In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
"\n",
"A schema is how you tell Weaviate:\n",
"* what embedding model should be used to vectorize the data\n",
"* what your data is made of (property names and types)\n",
"* which properties should be vectorized and indexed\n",
"\n",
"In this cookbook we will use a dataset for `Articles`, which contains:\n",
"* `title`\n",
"* `content`\n",
"* `url`\n",
"\n",
"We want to vectorize `title` and `content`, but not the `url`.\n",
"\n",
"To vectorize and query the data, we will use `text-embedding-ada-002`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"url\",\n",
" \"description\": \"URL to the article\",\n",
" \"dataType\": [\"string\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"In this section we will:\n",
"1. load the Simple Wikipedia dataset\n",
"2. configure Weaviate Batch import (to make the import more efficient)\n",
"3. import the data into Weaviate\n",
"\n",
"> Note: <br/>\n",
"> Like mentioned before. We don't need to manually vectorize the data.<br/>\n",
"> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc3efadd",
"metadata": {},
"outputs": [],
"source": [
"### STEP 1 - load the dataset\n",
"\n",
"from datasets import load_dataset\n",
"from typing import List, Iterator\n",
"\n",
"# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
"dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
"\n",
"# For testing, limited to 2.5k articles for demo purposes\n",
"dataset = dataset[:2_500]\n",
"\n",
"# Limited to 25k articles for larger demo purposes\n",
"# dataset = dataset[:25_000]\n",
"\n",
"# for free OpenAI acounts, you can use 50 objects\n",
"# dataset = dataset[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5044da96",
"metadata": {},
"outputs": [],
"source": [
"### Step 2 - configure Weaviate Batch, with\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=10, \n",
" dynamic=True,\n",
" timeout_retries=3,\n",
"# callback=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"### Step 3 - import data\n",
"\n",
"print(\"Importing Articles\")\n",
"\n",
"counter=0\n",
"\n",
"with client.batch as batch:\n",
" for article in dataset:\n",
" if (counter %10 == 0):\n",
" print(f\"Import {counter} / {len(dataset)} \")\n",
"\n",
" properties = {\n",
" \"title\": article[\"title\"],\n",
" \"content\": article[\"text\"],\n",
" \"url\": article[\"url\"]\n",
" }\n",
" \n",
" batch.add_data_object(properties, \"Article\")\n",
" counter = counter+1\n",
"\n",
"print(\"Importing Articles complete\") "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Test that all data has loaded get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article['title'])\n",
"print(test_article['url'])\n",
"print(test_article['content'])"
]
},
{
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def query_weaviate(query, collection_name):\n",
" \n",
" nearText = {\n",
" \"concepts\": [query],\n",
" \"distance\": 0.7,\n",
" }\n",
"\n",
" properties = [\n",
" \"title\", \"content\", \"url\",\n",
" \"_additional {certainty distance}\"\n",
" ]\n",
"\n",
" result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_near_text(nearText)\n",
" .with_limit(10)\n",
" .do()\n",
" )\n",
" \n",
" # Check for errors\n",
" if (\"errors\" in result):\n",
" print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute the limit is set at 60 per minute.\")\n",
" raise Exception(result[\"errors\"][0]['message'])\n",
" \n",
" return result[\"data\"][\"Get\"][collection_name]"
]
},
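{
"cell_type": "markdown",
"id": "9f3a1c20",
"metadata": {},
"source": [
"The query above filters on `distance`, while the cells that follow print `certainty`, so it helps to see how the two relate. The cell below is a minimal sketch that assumes the `Article` class uses cosine distance (Weaviate's default metric), where certainty is `1 - distance / 2`. The helper names are hypothetical and are not part of the Weaviate client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f3a1c21",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: the class uses cosine distance, so distance lies in 0..2\n",
"def distance_to_certainty(distance: float) -> float:\n",
"    \"\"\"Convert a cosine distance (0..2) to a certainty (0..1).\"\"\"\n",
"    return 1 - distance / 2\n",
"\n",
"\n",
"def certainty_to_distance(certainty: float) -> float:\n",
"    \"\"\"Inverse conversion: certainty (0..1) back to cosine distance (0..2).\"\"\"\n",
"    return 2 * (1 - certainty)\n",
"\n",
"\n",
"# The `distance: 0.7` threshold used in `query_weaviate`, expressed as a certainty\n",
"print(round(distance_to_certainty(0.7), 3))"
]
},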
{
"cell_type": "code",
"execution_count": null,
"id": "7e2025f6",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"metadata": {},
"outputs": [],
"source": [
"query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,563 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Weaviate with OpenAI vectorize module for Hybrid Search\n",
"\n",
"This notebook is prepared for a scenario where:\n",
"* Your data is not vectorized\n",
"* You want to run Hybrid Search ([learn more](https://weaviate.io/blog/hybrid-search-explained)) on your data\n",
"* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
"\n",
"This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run hybrid search (mixing of vector and BM25 search).\n",
"\n",
"This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"## What is Weaviate\n",
"\n",
"Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
"\n",
"Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
"\n",
"Weaviate let's you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"### Deployment options\n",
"\n",
"Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
"* Self-hosted you can deploy Weaviate with docker locally, or any server you want.\n",
"* SaaS you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
"* Hybrid-Saas you can deploy Weaviate in your own private Cloud Service \n",
"\n",
"### Programming languages\n",
"\n",
"Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
"* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
"* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
"* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
"* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
"\n",
"Additionally, Weavaite has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
]
},
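{
"cell_type": "markdown",
"id": "9f3a1c22",
"metadata": {},
"source": [
"The REST layer mentioned above can be exercised with nothing but Python's standard library. The cell below is a minimal sketch, not part of the original flow: it assumes a local instance at `http://localhost:8080` (the Docker option described later in this notebook), calls the `/v1/meta` endpoint, and uses a hypothetical helper name, `get_weaviate_meta`. If no instance is running yet, it only prints the connection error."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f3a1c23",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import urllib.error\n",
"import urllib.request\n",
"\n",
"\n",
"def get_weaviate_meta(base_url: str, timeout: float = 5.0) -> dict:\n",
"    \"\"\"Hypothetical helper: fetch instance metadata from `base_url` + /v1/meta.\"\"\"\n",
"    url = f\"{base_url.rstrip('/')}/v1/meta\"\n",
"    with urllib.request.urlopen(url, timeout=timeout) as response:\n",
"        return json.loads(response.read().decode(\"utf-8\"))\n",
"\n",
"\n",
"try:\n",
"    # Assumption: a local Docker instance; replace with your WCS URL if needed\n",
"    meta = get_weaviate_meta(\"http://localhost:8080\")\n",
"    print(\"Weaviate version:\", meta.get(\"version\"))\n",
"except (urllib.error.URLError, OSError) as error:\n",
"    # No instance may be running yet at this point in the notebook\n",
"    print(\"Could not reach Weaviate:\", error)"
]
},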
{
"cell_type": "markdown",
"id": "45956173",
"metadata": {},
"source": [
"## Demo Flow\n",
"The demo flow is:\n",
"- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n",
"- **Connect**: Connect to your Weaviate instance \n",
"- **Schema Configuration**: Configure the schema of your data\n",
" - *Note*: Here we can define which OpenAI Embedding Model to use\n",
" - *Note*: Here we can configure which properties to index on\n",
"- **Import data**: Load a demo dataset and import it into Weaviate\n",
" - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
" - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you.\n",
"- **Run Queries**: Query \n",
" - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "2a4a145e",
"metadata": {},
"source": [
"## OpenAI Module in Weaviate\n",
"All Weaviate instances come equiped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.\n",
"\n",
"This module is responsible handling vectorization at import (or any CRUD operations) and when you run a query.\n",
"\n",
"### No need to manually vectorize data\n",
"This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
"\n",
"All you need to do is:\n",
"1. provide your OpenAI API Key when you connected to the Weaviate Client\n",
"2. define which OpenAI vectorizer to use in your Schema"
]
},
{
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* create a `Weaviate` instance\n",
"* install libraries\n",
" * `weaviate-client`\n",
" * `datasets`\n",
" * `apache-beam`\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"### Create a Weaviate instance\n",
"\n",
"To create a Weaviate instance we have 2 options:\n",
"\n",
"1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
"2. Install and run Weaviate locally with Docker.\n",
"\n",
"#### Option 1 WCS Installation Steps\n",
"\n",
"Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
"\n",
"#### Option 2 local Weaviate instance with Docker\n",
"\n",
"Install and run Weaviate locally with Docker.\n",
"1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
"2. Then open your terminal, navigate to where your docker-compose.yml folder, and start docker with: `docker-compose up -d`\n",
"3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
"\n",
"Note. To shut down your docker instance you can call: `docker-compose down`\n",
"\n",
"##### Learn more\n",
"To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
]
},
{
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"=========================================================== \n",
"## Install required libraries\n",
"\n",
"Before running this project make sure to have the following libraries:\n",
"\n",
"### Weaviate Python client\n",
"\n",
"The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
"\n",
"### datasets & apache-beam\n",
"\n",
"To load sample data, you need the `datasets` library and its' dependency `apache-beam`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"# Install the Weaviate client for Python\n",
"!pip install weaviate-client>3.11.0\n",
"\n",
"# Install datasets and apache-beam to load the sample datasets\n",
"!pip install datasets apache-beam"
]
},
{
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88be138c",
"metadata": {},
"outputs": [],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ['OPENAI_API_KEY'] = 'your-key-goes-here'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to your Weaviate instance\n",
"\n",
"In this section, we will:\n",
"\n",
"1. test env variable `OPENAI_API_KEY` **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
"2. connect to your Weaviate your `OpenAI API Key`\n",
"3. and test the client connection\n",
"\n",
"### The client \n",
"\n",
"After this step, the `client` object will be used to perform all Weaviate-related operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc662c1b",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from datasets import load_dataset\n",
"import os\n",
"\n",
"# Connect to your Weaviate instance\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
"# url=\"http://localhost:8080/\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")\n",
"\n",
"# Check if your instance is live and ready\n",
"# This should return `True`\n",
"client.is_ready()"
]
},
{
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"# Schema\n",
"\n",
"In this section, we will:\n",
"1. configure the data schema for your data\n",
"2. select OpenAI module\n",
"\n",
"> This is the second and final step, which requires OpenAI specific configuration.\n",
"> After this step, the rest of instructions wlll only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
"\n",
"\n",
"## What is a schema\n",
"\n",
"In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
"\n",
"A schema is how you tell Weaviate:\n",
"* what embedding model should be used to vectorize the data\n",
"* what your data is made of (property names and types)\n",
"* which properties should be vectorized and indexed\n",
"\n",
"In this cookbook we will use a dataset for `Articles`, which contains:\n",
"* `title`\n",
"* `content`\n",
"* `url`\n",
"\n",
"We want to vectorize `title` and `content`, but not the `url`.\n",
"\n",
"To vectorize and query the data, we will use `text-embedding-ada-002`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"url\",\n",
" \"description\": \"URL to the article\",\n",
" \"dataType\": [\"string\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"In this section we will:\n",
"1. load the Simple Wikipedia dataset\n",
"2. configure Weaviate Batch import (to make the import more efficient)\n",
"3. import the data into Weaviate\n",
"\n",
"> Note: <br/>\n",
"> Like mentioned before. We don't need to manually vectorize the data.<br/>\n",
"> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc3efadd",
"metadata": {},
"outputs": [],
"source": [
"### STEP 1 - load the dataset\n",
"\n",
"from datasets import load_dataset\n",
"from typing import List, Iterator\n",
"\n",
"# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
"dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
"\n",
"# For testing, limited to 2.5k articles for demo purposes\n",
"dataset = dataset[:2_500]\n",
"\n",
"# Limited to 25k articles for larger demo purposes\n",
"# dataset = dataset[:25_000]\n",
"\n",
"# for free OpenAI acounts, you can use 50 objects\n",
"# dataset = dataset[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5044da96",
"metadata": {},
"outputs": [],
"source": [
"### Step 2 - configure Weaviate Batch, with\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=10, \n",
" dynamic=True,\n",
" timeout_retries=3,\n",
"# callback=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"### Step 3 - import data\n",
"\n",
"print(\"Importing Articles\")\n",
"\n",
"counter=0\n",
"\n",
"with client.batch as batch:\n",
" for article in dataset:\n",
" if (counter %10 == 0):\n",
" print(f\"Import {counter} / {len(dataset)} \")\n",
"\n",
" properties = {\n",
" \"title\": article[\"title\"],\n",
" \"content\": article[\"text\"],\n",
" \"url\": article[\"url\"]\n",
" }\n",
" \n",
" batch.add_data_object(properties, \"Article\")\n",
" counter = counter+1\n",
"\n",
"print(\"Importing Articles complete\") "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Test that all data has loaded get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article['title'])\n",
"print(test_article['url'])\n",
"print(test_article['content'])"
]
},
{
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors\n",
"\n",
"Learn more about the `alpha` setting [here](https://weaviate.io/developers/weaviate/api/graphql/vector-search-parameters#hybrid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def hybrid_query_weaviate(query, collection_name, alpha_val):\n",
" \n",
" nearText = {\n",
" \"concepts\": [query],\n",
" \"distance\": 0.7,\n",
" }\n",
"\n",
" properties = [\n",
" \"title\", \"content\", \"url\",\n",
" \"_additional { score }\"\n",
" ]\n",
"\n",
" result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_hybrid(nearText, alpha=alpha_val)\n",
" .with_limit(10)\n",
" .do()\n",
" )\n",
" \n",
" # Check for errors\n",
" if (\"errors\" in result):\n",
" print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute the limit is set at 60 per minute.\")\n",
" raise Exception(result[\"errors\"][0]['message'])\n",
" \n",
" return result[\"data\"][\"Get\"][collection_name]"
]
},
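{
"cell_type": "markdown",
"id": "9f3a1c24",
"metadata": {},
"source": [
"To build intuition for `alpha`, the cell below blends a keyword score and a vector score for a few made-up documents. This is only an illustration under stated assumptions: the scores are invented, both are assumed to be already normalized to the 0..1 range, and the weighted sum is a simplification rather than Weaviate's actual fusion algorithm. What does follow the Weaviate documentation is the meaning of the endpoints: `alpha=0` is pure keyword (BM25) search and `alpha=1` is pure vector search."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f3a1c25",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical, already-normalized scores for three made-up documents\n",
"keyword_scores = {\"doc_a\": 0.9, \"doc_b\": 0.2, \"doc_c\": 0.5}\n",
"vector_scores = {\"doc_a\": 0.1, \"doc_b\": 0.8, \"doc_c\": 0.6}\n",
"\n",
"\n",
"def blend_scores(keyword, vector, alpha):\n",
"    \"\"\"Weighted sum: alpha weights the vector score, (1 - alpha) the keyword score.\"\"\"\n",
"    return {\n",
"        doc: alpha * vector[doc] + (1 - alpha) * keyword[doc]\n",
"        for doc in keyword\n",
"    }\n",
"\n",
"\n",
"# Simplified illustration only; this is not how Weaviate fuses results internally\n",
"for alpha in (0.0, 0.5, 1.0):\n",
"    blended = blend_scores(keyword_scores, vector_scores, alpha)\n",
"    ranking = sorted(blended, key=blended.get, reverse=True)\n",
"    print(f\"alpha={alpha}: {ranking}\")"
]
},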
{
"cell_type": "code",
"execution_count": null,
"id": "7e2025f6",
"metadata": {},
"outputs": [],
"source": [
"query_result = hybrid_query_weaviate(\"modern art in Europe\", \"Article\", 0.5)\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {article['_additional']['score']})\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"metadata": {},
"outputs": [],
"source": [
"query_result = hybrid_query_weaviate(\"Famous battles in Scottish history\", \"Article\", 0.5)\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['title']} (Score: {article['_additional']['score']})\")"
]
},
{
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,571 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Question Answering in Weaviate with OpenAI Q&A module\n",
"\n",
"This notebook is prepared for a scenario where:\n",
"* Your data is not vectorized\n",
"* You want to run Q&A ([learn more](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai)) on your data based on the [OpenAI completions](https://beta.openai.com/docs/api-reference/completions) endpoint.\n",
"* You want to use Weaviate with the OpenAI module ([text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai)), to generate vector embeddings for you.\n",
"\n",
"This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run question answering.\n",
"\n",
"## What is Weaviate\n",
"\n",
"Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.\n",
"\n",
"Weaviate uses KNN algorithms to create an vector-optimized index, which allows your queries to run extremely fast. Learn more [here](https://weaviate.io/blog/why-is-vector-search-so-fast).\n",
"\n",
"Weaviate let's you use your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"### Deployment options\n",
"\n",
"Whatever your scenario or production setup, Weaviate has an option for you. You can deploy Weaviate in the following setups:\n",
"* Self-hosted you can deploy Weaviate with docker locally, or any server you want.\n",
"* SaaS you can use [Weaviate Cloud Service (WCS)](https://console.weaviate.io/) to host your Weaviate instances.\n",
"* Hybrid-Saas you can deploy Weaviate in your own private Cloud Service \n",
"\n",
"### Programming languages\n",
"\n",
"Weaviate offers four [client libraries](https://weaviate.io/developers/weaviate/client-libraries), which allow you to communicate from your apps:\n",
"* [Python](https://weaviate.io/developers/weaviate/client-libraries/python)\n",
"* [JavaScript](https://weaviate.io/developers/weaviate/client-libraries/javascript)\n",
"* [Java](https://weaviate.io/developers/weaviate/client-libraries/java)\n",
"* [Go](https://weaviate.io/developers/weaviate/client-libraries/go)\n",
"\n",
"Additionally, Weavaite has a [REST layer](https://weaviate.io/developers/weaviate/api/rest/objects). Basically you can call Weaviate from any language that supports REST requests."
]
},
{
"cell_type": "markdown",
"id": "45956173",
"metadata": {},
"source": [
"## Demo Flow\n",
"The demo flow is:\n",
"- **Prerequisites Setup**: Create a Weaviate instance and install required libraries\n",
"- **Connect**: Connect to your Weaviate instance \n",
"- **Schema Configuration**: Configure the schema of your data\n",
" - *Note*: Here we can define which OpenAI Embedding Model to use\n",
" - *Note*: Here we can configure which properties to index on\n",
"- **Import data**: Load a demo dataset and import it into Weaviate\n",
" - *Note*: The import process will automatically index your data - based on the configuration in the schema\n",
" - *Note*: You don't need to explicitly vectorize your data, Weaviate will communicate with OpenAI to do it for you.\n",
"- **Run Queries**: Query \n",
" - *Note*: You don't need to explicitly vectorize your queries, Weaviate will communicate with OpenAI to do it for you.\n",
" - *Note*: The `qna-openai` module automatically communicates with the OpenAI completions endpoint.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases for question answering."
]
},
{
"cell_type": "markdown",
"id": "2a4a145e",
"metadata": {},
"source": [
"## OpenAI Module in Weaviate\n",
"All Weaviate instances come equiped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) and the [qna-openai](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai) modules.\n",
"\n",
"The first module is responsible for handling vectorization at import (or any CRUD operations) and when you run a search query. The second module communicates with the OpenAI completions endpoint.\n",
"\n",
"### No need to manually vectorize data\n",
"This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.\n",
"\n",
"All you need to do is:\n",
"1. provide your OpenAI API Key when you connected to the Weaviate Client\n",
"2. define which OpenAI vectorizer to use in your Schema"
]
},
{
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* create a `Weaviate` instance\n",
"* install libraries\n",
" * `weaviate-client`\n",
" * `datasets`\n",
" * `apache-beam`\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"### Create a Weaviate instance\n",
"\n",
"To create a Weaviate instance we have 2 options:\n",
"\n",
"1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.\n",
"2. Install and run Weaviate locally with Docker.\n",
"\n",
"#### Option 1 WCS Installation Steps\n",
"\n",
"Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n",
"1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n",
"2. create a `Weaviate Cluster` with the following settings:\n",
" * Sandbox: `Sandbox Free`\n",
" * Weaviate Version: Use default (latest)\n",
" * OIDC Authentication: `Disabled`\n",
"3. your instance should be ready in a minute or two\n",
"4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` \n",
"\n",
"#### Option 2 local Weaviate instance with Docker\n",
"\n",
"Install and run Weaviate locally with Docker.\n",
"1. Download the [./docker-compose.yml](./docker-compose.yml) file\n",
"2. Then open your terminal, navigate to where your docker-compose.yml folder, and start docker with: `docker-compose up -d`\n",
"3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)\n",
"\n",
"Note. To shut down your docker instance you can call: `docker-compose down`\n",
"\n",
"##### Learn more\n",
"To learn more, about using Weaviate with Docker see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose)."
]
},
{
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"=========================================================== \n",
"## Install required libraries\n",
"\n",
"Before running this project make sure to have the following libraries:\n",
"\n",
"### Weaviate Python client\n",
"\n",
"The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.\n",
"\n",
"### datasets & apache-beam\n",
"\n",
"To load sample data, you need the `datasets` library and its' dependency `apache-beam`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"# Install the Weaviate client for Python\n",
"!pip install weaviate-client>3.11.0\n",
"\n",
"# Install datasets and apache-beam to load the sample datasets\n",
"!pip install datasets apache-beam"
]
},
{
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of your data at import, and for queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88be138c",
"metadata": {},
"outputs": [],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ['OPENAI_API_KEY'] = 'your-key-goes-here'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to your Weaviate instance\n",
"\n",
"In this section, we will:\n",
"\n",
"1. test env variable `OPENAI_API_KEY` **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)\n",
"2. connect to your Weaviate your `OpenAI API Key`\n",
"3. and test the client connection\n",
"\n",
"### The client \n",
"\n",
"After this step, the `client` object will be used to perform all Weaviate-related operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc662c1b",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from datasets import load_dataset\n",
"import os\n",
"\n",
"# Connect to your Weaviate instance\n",
"client = weaviate.Client(\n",
" url=\"https://your-wcs-instance-name.weaviate.network/\",\n",
"# url=\"http://localhost:8080/\",\n",
" additional_headers={\n",
" \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n",
" }\n",
")\n",
"\n",
"# Check if your instance is live and ready\n",
"# This should return `True`\n",
"client.is_ready()"
]
},
{
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"# Schema\n",
"\n",
"In this section, we will:\n",
"1. configure the data schema for your data\n",
"2. select OpenAI module\n",
"\n",
"> This is the second and final step, which requires OpenAI specific configuration.\n",
"> After this step, the rest of instructions wlll only touch on Weaviate, as the OpenAI tasks will be handled automatically.\n",
"\n",
"\n",
"## What is a schema\n",
"\n",
"In Weaviate you create __schemas__ to capture each of the entities you will be searching.\n",
"\n",
"A schema is how you tell Weaviate:\n",
"* what embedding model should be used to vectorize the data\n",
"* what your data is made of (property names and types)\n",
"* which properties should be vectorized and indexed\n",
"\n",
"In this cookbook we will use a dataset for `Articles`, which contains:\n",
"* `title`\n",
"* `content`\n",
"* `url`\n",
"\n",
"We want to vectorize `title` and `content`, but not the `url`.\n",
"\n",
"To vectorize and query the data, we will use `text-embedding-ada-002`. For Q&A we will use `text-davinci-002`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clear up the schema, so that we can recreate it\n",
"client.schema.delete_all()\n",
"client.schema.get()\n",
"\n",
"# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n",
"article_schema = {\n",
" \"class\": \"Article\",\n",
" \"description\": \"A collection of articles\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {\n",
" \"model\": \"ada\",\n",
" \"modelVersion\": \"002\",\n",
" \"type\": \"text\"\n",
" }, \n",
" \"qna-openai\": {\n",
" \"model\": \"text-davinci-002\",\n",
" \"maxTokens\": 16,\n",
" \"temperature\": 0.0,\n",
" \"topP\": 1,\n",
" \"frequencyPenalty\": 0.0,\n",
" \"presencePenalty\": 0.0\n",
" }\n",
" },\n",
" \"properties\": [{\n",
" \"name\": \"title\",\n",
" \"description\": \"Title of the article\",\n",
" \"dataType\": [\"string\"]\n",
" },\n",
" {\n",
" \"name\": \"content\",\n",
" \"description\": \"Contents of the article\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"url\",\n",
" \"description\": \"URL to the article\",\n",
" \"dataType\": [\"string\"],\n",
" \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n",
" }]\n",
"}\n",
"\n",
"# add the Article schema\n",
"client.schema.create_class(article_schema)\n",
"\n",
"# get the schema to make sure it worked\n",
"client.schema.get()"
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"In this section we will:\n",
"1. load the Simple Wikipedia dataset\n",
"2. configure Weaviate Batch import (to make the import more efficient)\n",
"3. import the data into Weaviate\n",
"\n",
"> Note: <br/>\n",
"> Like mentioned before. We don't need to manually vectorize the data.<br/>\n",
"> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc3efadd",
"metadata": {},
"outputs": [],
"source": [
"### STEP 1 - load the dataset\n",
"\n",
"from datasets import load_dataset\n",
"from typing import List, Iterator\n",
"\n",
"# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding\n",
"dataset = list(load_dataset(\"wikipedia\", \"20220301.simple\")[\"train\"])\n",
"\n",
"# For testing, limited to 2.5k articles for demo purposes\n",
"dataset = dataset[:2_500]\n",
"\n",
"# Limited to 25k articles for larger demo purposes\n",
"# dataset = dataset[:25_000]\n",
"\n",
"# for free OpenAI acounts, you can use 50 objects\n",
"# dataset = dataset[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5044da96",
"metadata": {},
"outputs": [],
"source": [
"### Step 2 - configure Weaviate Batch, with\n",
"# - starting batch size of 100\n",
"# - dynamically increase/decrease based on performance\n",
"# - add timeout retries if something goes wrong\n",
"\n",
"client.batch.configure(\n",
" batch_size=10, \n",
" dynamic=True,\n",
" timeout_retries=3,\n",
"# callback=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"### Step 3 - import data\n",
"\n",
"print(\"Importing Articles\")\n",
"\n",
"counter=0\n",
"\n",
"with client.batch as batch:\n",
" for article in dataset:\n",
" if (counter %10 == 0):\n",
" print(f\"Import {counter} / {len(dataset)} \")\n",
"\n",
" properties = {\n",
" \"title\": article[\"title\"],\n",
" \"content\": article[\"text\"],\n",
" \"url\": article[\"url\"]\n",
" }\n",
" \n",
" batch.add_data_object(properties, \"Article\")\n",
" counter = counter+1\n",
"\n",
"print(\"Importing Articles complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Test that all data has loaded get object count\n",
"result = (\n",
" client.query.aggregate(\"Article\")\n",
" .with_fields(\"meta { count }\")\n",
" .do()\n",
")\n",
"print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"], \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"# Test one article has worked by checking one object\n",
"test_article = (\n",
" client.query\n",
" .get(\"Article\", [\"title\", \"url\", \"content\"])\n",
" .with_limit(1)\n",
" .do()\n",
")[\"data\"][\"Get\"][\"Article\"][0]\n",
"\n",
"print(test_article['title'])\n",
"print(test_article['url'])\n",
"print(test_article['content'])"
]
},
{
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"### Question Answering on the Data\n",
"\n",
"As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def qna(query, collection_name):\n",
" \n",
" properties = [\n",
" \"title\", \"content\", \"url\",\n",
" \"_additional { answer { hasAnswer property result startPosition endPosition } distance }\"\n",
" ]\n",
"\n",
" ask = {\n",
" \"question\": query,\n",
" \"properties\": [\"content\"]\n",
" }\n",
"\n",
" result = (\n",
" client.query\n",
" .get(collection_name, properties)\n",
" .with_ask(ask)\n",
" .with_limit(1)\n",
" .do()\n",
" )\n",
" \n",
" # Check for errors\n",
" if (\"errors\" in result):\n",
" print (\"\\033[91mYou probably have run out of OpenAI API calls for the current minute the limit is set at 60 per minute.\")\n",
" raise Exception(result[\"errors\"][0]['message'])\n",
" \n",
" return result[\"data\"][\"Get\"][collection_name]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e2025f6",
"metadata": {},
"outputs": [],
"source": [
"query_result = qna(\"Did Alanis Morissette win a Grammy?\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" print(f\"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c4a696",
"metadata": {},
"outputs": [],
"source": [
"query_result = qna(\"What is the capital of China?\", \"Article\")\n",
"\n",
"for i, article in enumerate(query_result):\n",
" if article['_additional']['answer']['hasAnswer'] == False:\n",
" print('No answer found')\n",
" else:\n",
" print(f\"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })\")"
]
},
{
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}