Merge pull request #131 from Spartee/add-redis-example

Add Redis example notebooks and READMEs
colin-openai 2023-02-15 08:04:01 -08:00 committed by GitHub
commit 9b17d00cab
6 changed files with 1569 additions and 29 deletions


@ -1,6 +1,7 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
@ -34,6 +35,11 @@
" - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n",
" - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n",
" - *Search Data*: We'll run a few searches to confirm it works\n",
"- **Redis**\n",
" - *Setup*: Set up the Redis-Py client. For more details go [here](https://github.com/redis/redis-py)\n",
" - *Index Data*: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.\n",
" - *Search Data*: Run a few example queries with various goals in mind.\n",
"\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
@ -59,6 +65,7 @@
"!pip install pinecone-client\n",
"!pip install weaviate-client\n",
"!pip install qdrant-client\n",
"!pip install redis\n",
"\n",
"#Install wget to pull zip file\n",
"!pip install wget"
@ -66,18 +73,10 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"id": "5be94df6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()\n"
]
}
],
"outputs": [],
"source": [
"import openai\n",
"\n",
@ -89,6 +88,9 @@
"import wget\n",
"from ast import literal_eval\n",
"\n",
"# Redis client library for Python\n",
"import redis\n",
"\n",
"# Pinecone's client library for Python\n",
"import pinecone\n",
"\n",
@ -120,10 +122,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"id": "5dff8b55",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'vector_database_wikipedia_articles_embedded.zip'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"\n",
@ -133,7 +146,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "21097972",
"metadata": {},
"outputs": [],
@ -145,7 +158,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "70bbd8ba",
"metadata": {},
"outputs": [],
@ -155,7 +168,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"id": "1721e45d",
"metadata": {},
"outputs": [
@ -274,7 +287,7 @@
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 3,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@ -285,7 +298,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 7,
"id": "960b82af",
"metadata": {},
"outputs": [],
@ -300,23 +313,33 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 11,
"id": "a334ab8b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"34471"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 25000 entries, 0 to 24999\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 25000 non-null int64 \n",
" 1 url 25000 non-null object\n",
" 2 title 25000 non-null object\n",
" 3 text 25000 non-null object\n",
" 4 title_vector 25000 non-null object\n",
" 5 content_vector 25000 non-null object\n",
" 6 vector_id 25000 non-null object\n",
"dtypes: int64(1), object(6)\n",
"memory usage: 1.3+ MB\n"
]
}
],
"source": [
"len(article_df['title_vector'][0])"
"article_df.info(show_counts=True)"
]
},
{
@ -1152,6 +1175,470 @@
" print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "43bffd04",
"metadata": {},
"source": [
"# Redis\n",
"\n",
"The last vector database covered in this tutorial is **[Redis](https://redis.io)**. You most likely already know Redis. What you might not be aware of is the [RediSearch module](https://github.com/RediSearch/RediSearch). Enterprises have been using Redis with the RediSearch module for years now across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.\n",
"\n",
"Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples, but you can find more client libraries [here](https://redis.io/resources/clients/).\n",
"\n",
"| Project | Language | License | Author | Stars |\n",
"|----------|---------|--------|---------|-------|\n",
"| [jedis][jedis-url] | Java | MIT | [Redis][redis-url] | ![Stars][jedis-stars] |\n",
"| [redis-py][redis-py-url] | Python | MIT | [Redis][redis-url] | ![Stars][redis-py-stars] |\n",
"| [node-redis][node-redis-url] | Node.js | MIT | [Redis][redis-url] | ![Stars][node-redis-stars] |\n",
"| [nredisstack][nredisstack-url] | .NET | MIT | [Redis][redis-url] | ![Stars][nredisstack-stars] |\n",
"| [redisearch-go][redisearch-go-url] | Go | BSD | [Redis][redisearch-go-author] | [![redisearch-go-stars]][redisearch-go-url] |\n",
"| [redisearch-api-rs][redisearch-api-rs-url] | Rust | BSD | [Redis][redisearch-api-rs-author] | [![redisearch-api-rs-stars]][redisearch-api-rs-url] |\n",
"\n",
"[redis-url]: https://redis.com\n",
"\n",
"[redis-py-url]: https://github.com/redis/redis-py\n",
"[redis-py-stars]: https://img.shields.io/github/stars/redis/redis-py.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"[redis-py-package]: https://pypi.python.org/pypi/redis\n",
"\n",
"[jedis-url]: https://github.com/redis/jedis\n",
"[jedis-stars]: https://img.shields.io/github/stars/redis/jedis.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"[Jedis-package]: https://search.maven.org/artifact/redis.clients/jedis\n",
"\n",
"[nredisstack-url]: https://github.com/redis/nredisstack\n",
"[nredisstack-stars]: https://img.shields.io/github/stars/redis/nredisstack.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"[nredisstack-package]: https://www.nuget.org/packages/nredisstack/\n",
"\n",
"[node-redis-url]: https://github.com/redis/node-redis\n",
"[node-redis-stars]: https://img.shields.io/github/stars/redis/node-redis.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"[node-redis-package]: https://www.npmjs.com/package/redis\n",
"\n",
"[redis-om-python-url]: https://github.com/redis/redis-om-python\n",
"[redis-om-python-author]: https://redis.com\n",
"[redis-om-python-stars]: https://img.shields.io/github/stars/redis/redis-om-python.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"\n",
"[redisearch-go-url]: https://github.com/RediSearch/redisearch-go\n",
"[redisearch-go-author]: https://redis.com\n",
"[redisearch-go-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-go.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"\n",
"[redisearch-api-rs-url]: https://github.com/RediSearch/redisearch-api-rs\n",
"[redisearch-api-rs-author]: https://redis.com\n",
"[redisearch-api-rs-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-api-rs.svg?style=social&amp;label=Star&amp;maxAge=2592000\n",
"\n",
"\n",
"In the below cells, we will walk you through using Redis as a vector database. Since many of you are likely already used to the Redis API, this should be familiar to most."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "698e24f6",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are are many potential options for deployment. For other deployment options, see the [redis directory](./redis) in this repo.\n",
"\n",
"For this tutorial, we will use Redis Stack on Docker.\n",
"\n",
"Start a version of Redis with RediSearch (Redis Stack) by running the following docker command\n",
"\n",
"```bash\n",
"$ cd redis\n",
"$ docker compose up -d\n",
"```\n",
"This also includes the [RedisInsight](https://redis.com/redis-enterprise/redis-insight/) GUI for managing your Redis database which you can view at [http://localhost:8001](http://localhost:8001) once you start the docker container.\n",
"\n",
"You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created."
]
},
{
"cell_type": "code",
"execution_count": 134,
"id": "d2ce669a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import redis\n",
"from redis.commands.search.indexDefinition import (\n",
" IndexDefinition,\n",
" IndexType\n",
")\n",
"from redis.commands.search.query import Query\n",
"from redis.commands.search.field import (\n",
" TextField,\n",
" VectorField\n",
")\n",
"\n",
"REDIS_HOST = \"localhost\"\n",
"REDIS_PORT = 6379\n",
"REDIS_PASSWORD = \"\" # default for passwordless Redis\n",
"\n",
"# Connect to Redis\n",
"redis_client = redis.Redis(\n",
" host=REDIS_HOST,\n",
" port=REDIS_PORT,\n",
" password=REDIS_PASSWORD\n",
")\n",
"redis_client.ping()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3f6f0af9",
"metadata": {},
"source": [
"## Creating a Search Index\n",
"\n",
"The below cells will show how to specify and create a search index in Redis. We will\n",
"\n",
"1. Set some constants for defining our index like the distance metric and the index name\n",
"2. Define the index schema with RediSearch fields\n",
"3. Create the index\n"
]
},
{
"cell_type": "code",
"execution_count": 135,
"id": "a7c64cb9",
"metadata": {},
"outputs": [],
"source": [
"# Constants\n",
"VECTOR_DIM = len(article_df['title_vector'][0]) # length of the vectors\n",
"VECTOR_NUMBER = len(article_df) # initial number of vectors\n",
"INDEX_NAME = \"embeddings-index\" # name of the search index\n",
"PREFIX = \"doc\" # prefix for the document keys\n",
"DISTANCE_METRIC = \"COSINE\" # distance metric for the vectors (ex. COSINE, IP, L2)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"id": "d95fcd06",
"metadata": {},
"outputs": [],
"source": [
"# Define RediSearch fields for each of the columns in the dataset\n",
"title = TextField(name=\"title\")\n",
"url = TextField(name=\"url\")\n",
"text = TextField(name=\"text\")\n",
"title_embedding = VectorField(\"title_vector\",\n",
" \"FLAT\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER,\n",
" }\n",
")\n",
"text_embedding = VectorField(\"content_vector\",\n",
" \"FLAT\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER,\n",
" }\n",
")\n",
"fields = [title, url, text, title_embedding, text_embedding]"
]
},
{
"cell_type": "code",
"execution_count": 137,
"id": "7418480d",
"metadata": {},
"outputs": [],
"source": [
"# Check if index exists\n",
"try:\n",
" redis_client.ft(INDEX_NAME).info()\n",
" print(\"Index already exists\")\n",
"except:\n",
" # Create RediSearch Index\n",
" redis_client.ft(INDEX_NAME).create_index(\n",
" fields = fields,\n",
" definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f3563eec",
"metadata": {},
"source": [
"## Load Documents into the Index\n",
"\n",
"Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The below cells will show how to load documents into the index."
]
},
{
"cell_type": "code",
"execution_count": 138,
"id": "e98d63ad",
"metadata": {},
"outputs": [],
"source": [
"def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):\n",
" records = documents.to_dict(\"records\")\n",
" for doc in records:\n",
" key = f\"{prefix}:{str(doc['id'])}\"\n",
"\n",
" # create byte vectors for title and content\n",
" title_embedding = np.array(doc[\"title_vector\"], dtype=np.float32).tobytes()\n",
" content_embedding = np.array(doc[\"content_vector\"], dtype=np.float32).tobytes()\n",
"\n",
" # replace list of floats with byte vectors\n",
" doc[\"title_vector\"] = title_embedding\n",
" doc[\"content_vector\"] = content_embedding\n",
"\n",
" client.hset(key, mapping = doc)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"id": "098d3c5a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 25000 documents in Redis search index with name: embeddings-index\n"
]
}
],
"source": [
"index_documents(redis_client, PREFIX, article_df)\n",
"print(f\"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f646bff4",
"metadata": {},
"source": [
"## Running Search Queries\n",
"\n",
"Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Redis as a vector database. Each example will demonstrate specific features to keep in mind when developing your search application with Redis.\n",
"\n",
"1. **Return Fields**: You can specify which fields you want to return in the search results. This is useful if you only want to return a subset of the fields in your documents and doesn't require a separate call to retrieve documents. In the below example, we will only return the `title` field in the search results.\n",
"2. **Hybrid Search**: You can combine vector search with any of the other RediSearch fields for hybrid search such as full text search, tag, geo, and numeric. In the below example, we will combine vector search with full text search.\n"
]
},
{
"cell_type": "code",
"execution_count": 140,
"id": "508d1f89",
"metadata": {},
"outputs": [],
"source": [
"def search_redis(\n",
" redis_client: redis.Redis,\n",
" user_query: str,\n",
" index_name: str = \"embeddings-index\",\n",
" vector_field: str = \"title_vector\",\n",
" return_fields: list = [\"title\", \"url\", \"text\", \"vector_score\"],\n",
" hybrid_fields = \"*\",\n",
" k: int = 20,\n",
") -> List[dict]:\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(input=user_query,\n",
" model=EMBEDDING_MODEL,\n",
" )[\"data\"][0]['embedding']\n",
"\n",
" # Prepare the Query\n",
" base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n",
" query = (\n",
" Query(base_query)\n",
" .return_fields(*return_fields)\n",
" .sort_by(\"vector_score\")\n",
" .paging(0, k)\n",
" .dialect(2)\n",
" )\n",
" params_dict = {\"vector\": np.array(embedded_query).astype(dtype=np.float32).tobytes()}\n",
"\n",
" # perform vector search\n",
" results = redis_client.ft(index_name).search(query, params_dict)\n",
" for i, article in enumerate(results.docs):\n",
" score = 1 - float(article.vector_score)\n",
" print(f\"{i}. {article.title} (Score: {round(score ,3) })\")\n",
" return results.docs"
]
},
{
"cell_type": "code",
"execution_count": 142,
"id": "1f0eef07",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Museum of Modern Art (Score: 0.875)\n",
"1. Western Europe (Score: 0.867)\n",
"2. Renaissance art (Score: 0.864)\n",
"3. Pop art (Score: 0.86)\n",
"4. Northern Europe (Score: 0.855)\n",
"5. Hellenistic art (Score: 0.853)\n",
"6. Modernist literature (Score: 0.847)\n",
"7. Art film (Score: 0.843)\n",
"8. Central Europe (Score: 0.843)\n",
"9. European (Score: 0.841)\n"
]
}
],
"source": [
"# For using OpenAI to generate query embedding\n",
"openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n",
"results = search_redis(redis_client, 'modern art in Europe', k=10)"
]
},
{
"cell_type": "code",
"execution_count": 143,
"id": "7b805a81",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Battle of Bannockburn (Score: 0.869)\n",
"1. Wars of Scottish Independence (Score: 0.861)\n",
"2. 1651 (Score: 0.853)\n",
"3. First War of Scottish Independence (Score: 0.85)\n",
"4. Robert I of Scotland (Score: 0.846)\n",
"5. 841 (Score: 0.844)\n",
"6. 1716 (Score: 0.844)\n",
"7. 1314 (Score: 0.837)\n",
"8. 1263 (Score: 0.836)\n",
"9. William Wallace (Score: 0.835)\n"
]
}
],
"source": [
"results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0ed0b34e",
"metadata": {},
"source": [
"## Hybrid Queries with Redis\n",
"\n",
"The previous examples showed how run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the below example, we will combine vector search with full text search."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "c94d5cce",
"metadata": {},
"outputs": [],
"source": [
"def create_hybrid_field(field_name: str, value: str) -> str:\n",
" return f'@{field_name}:\"{value}\"'"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "bfcd31c2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. First War of Scottish Independence (Score: 0.892)\n",
"1. Wars of Scottish Independence (Score: 0.889)\n",
"2. Second War of Scottish Independence (Score: 0.879)\n",
"3. List of Scottish monarchs (Score: 0.873)\n",
"4. Scottish Borders (Score: 0.863)\n"
]
}
],
"source": [
"# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title\n",
"results = search_redis(redis_client,\n",
" \"Famous battles in Scottish history\",\n",
" vector_field=\"title_vector\",\n",
" k=5,\n",
" hybrid_fields=create_hybrid_field(\"title\", \"Scottish\")\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "28ab1e30",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Art (Score: 1.0)\n",
"1. Paint (Score: 0.896)\n",
"2. Renaissance art (Score: 0.88)\n",
"3. Painting (Score: 0.874)\n",
"4. Renaissance (Score: 0.846)\n"
]
},
{
"data": {
"text/plain": [
"'In Europe, after the Middle Ages, there was a \"Renaissance\" which means \"rebirth\". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# run a hybrid query for articles about Art in the title vector and only include results with the phrase \"Leonardo da Vinci\" in the text\n",
"results = search_redis(redis_client,\n",
" \"Art\",\n",
" vector_field=\"title_vector\",\n",
" k=5,\n",
" hybrid_fields=create_hybrid_field(\"text\", \"Leonardo da Vinci\")\n",
" )\n",
"\n",
"# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned\n",
"mention = [sentence for sentence in results[0].text.split(\"\\n\") if \"Leonardo da Vinci\" in sentence][0]\n",
"mention"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f94b5be2",
"metadata": {},
"source": [
"For more example with Redis as a vector database, see the README and examples within the ``vector_databases/redis`` directory of this repository"
]
},
{
"cell_type": "markdown",
"id": "55afccbf",
@ -1163,7 +1650,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "redisvl2",
"language": "python",
"name": "python3"
},
@ -1177,7 +1664,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316"
}
}
},
"nbformat": 4,


@ -0,0 +1,108 @@
# Redis
**[Redis](https://redis.io)** has first-class support for vector search through the [RediSearch module](https://github.com/RediSearch/RediSearch). RediSearch is a [Redis module](https://redis.io/modules) that provides querying, secondary indexing, full-text search and vector search for Redis. To use RediSearch, you first declare indexes on your Redis data. You can then use the RediSearch query language to query that data.
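
As a rough illustration of the query language, the sketch below builds the string for a K-nearest-neighbour (KNN) vector query, following the same pattern the accompanying notebook uses. The helper name `build_knn_query` and its parameters are hypothetical and not part of redis-py; only the query string format is taken from the notebook.

```python
def build_knn_query(
    vector_field: str,
    k: int,
    filter_expr: str = "*",
    score_alias: str = "vector_score",
) -> str:
    """Return a RediSearch KNN query string (hypothetical helper).

    filter_expr selects the candidate documents ("*" means all documents);
    the KNN clause then ranks them by distance to the $vector parameter.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return f"{filter_expr}=>[KNN {k} @{vector_field} $vector AS {score_alias}]"


# Pure vector search over every indexed document
print(build_knn_query("title_vector", 10))

# Hybrid search: restrict candidates to titles containing "Scottish"
print(build_knn_query("title_vector", 5, '@title:"Scottish"'))
```

With redis-py, a string like this would typically be wrapped in a `Query` object with `.dialect(2)` set, and the query vector supplied as the `vector` entry of the query parameters.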
### Features
RediSearch uses compressed, inverted indexes for fast indexing with a low memory footprint. RediSearch indexes enhance Redis by providing exact-phrase matching, fuzzy search, and numeric filtering, among many other features, including:
* Full-Text indexing of multiple fields in Redis hashes
* Incremental indexing without performance loss
* Vector similarity search
* Document ranking (using [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), with optional user-provided weights)
* Field weighting
* Complex boolean queries with AND, OR, and NOT operators
* Prefix matching, fuzzy matching, and exact-phrase queries
* Support for [double-metaphone phonetic matching](https://redis.io/docs/stack/search/reference/phonetic_matching/)
* Auto-complete suggestions (with fuzzy prefix suggestions)
* Stemming-based query expansion in [many languages](https://redis.io/docs/stack/search/reference/stemming/) (using [Snowball](http://snowballstem.org/))
* Support for Chinese-language tokenization and querying (using [Friso](https://github.com/lionsoul2014/friso))
* Numeric filters and ranges
* Geospatial searches using [Redis geospatial indexing](https://redis.io/commands/georadius)
* A powerful aggregations engine
* Support for all UTF-8 encoded text
* Retrieve full documents, selected fields, or only the document IDs
* Sorting results (for example, by creation date)
* JSON support through RedisJSON
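
To make the vector similarity search feature a little more concrete, here is a minimal sketch of two details the accompanying notebook relies on: vectors are stored as raw `FLOAT32` bytes, and with the `COSINE` metric the returned value is treated as a distance, so the notebook reports `1 - distance` as the similarity. The function names below are hypothetical, and the little-endian byte order is an assumption that matches what NumPy's `tobytes()` produces on common x86 and ARM machines.

```python
import math
import struct
from typing import List, Sequence


def to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian 32-bit values (4 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob: bytes) -> List[float]:
    """Inverse of to_float32_bytes."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
doc = [1.0, 1.0]

blob = to_float32_bytes(doc)
print(len(blob))  # 8 (2 dimensions * 4 bytes)

# Convert the distance back into a similarity score, as the notebook does
similarity = 1 - cosine_distance(query, doc)
print(round(similarity, 3))  # 0.707
```

In practice the notebook uses NumPy for the byte conversion; the standard-library version here only shows what that conversion amounts to.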
### Clients
Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples, but you can find more client libraries [here](https://redis.io/resources/clients/).

| Project | Language | License | Author | Stars |
|----------|---------|--------|---------|-------|
| [jedis][jedis-url] | Java | MIT | [Redis][redis-url] | ![Stars][jedis-stars] |
| [redis-py][redis-py-url] | Python | MIT | [Redis][redis-url] | ![Stars][redis-py-stars] |
| [node-redis][node-redis-url] | Node.js | MIT | [Redis][redis-url] | ![Stars][node-redis-stars] |
| [nredisstack][nredisstack-url] | .NET | MIT | [Redis][redis-url] | ![Stars][nredisstack-stars] |
| [redisearch-go][redisearch-go-url] | Go | BSD | [Redis][redisearch-go-author] | [![redisearch-go-stars]][redisearch-go-url] |
| [redisearch-api-rs][redisearch-api-rs-url] | Rust | BSD | [Redis][redisearch-api-rs-author] | [![redisearch-api-rs-stars]][redisearch-api-rs-url] |

[redis-url]: https://redis.com
[redis-py-url]: https://github.com/redis/redis-py
[redis-py-stars]: https://img.shields.io/github/stars/redis/redis-py.svg?style=social&amp;label=Star&amp;maxAge=2592000
[redis-py-package]: https://pypi.python.org/pypi/redis
[jedis-url]: https://github.com/redis/jedis
[jedis-stars]: https://img.shields.io/github/stars/redis/jedis.svg?style=social&amp;label=Star&amp;maxAge=2592000
[Jedis-package]: https://search.maven.org/artifact/redis.clients/jedis
[nredisstack-url]: https://github.com/redis/nredisstack
[nredisstack-stars]: https://img.shields.io/github/stars/redis/nredisstack.svg?style=social&amp;label=Star&amp;maxAge=2592000
[nredisstack-package]: https://www.nuget.org/packages/nredisstack/
[node-redis-url]: https://github.com/redis/node-redis
[node-redis-stars]: https://img.shields.io/github/stars/redis/node-redis.svg?style=social&amp;label=Star&amp;maxAge=2592000
[node-redis-package]: https://www.npmjs.com/package/redis
[redis-om-python-url]: https://github.com/redis/redis-om-python
[redis-om-python-author]: https://redis.com
[redis-om-python-stars]: https://img.shields.io/github/stars/redis/redis-om-python.svg?style=social&amp;label=Star&amp;maxAge=2592000
[redisearch-go-url]: https://github.com/RediSearch/redisearch-go
[redisearch-go-author]: https://redis.com
[redisearch-go-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-go.svg?style=social&amp;label=Star&amp;maxAge=2592000
[redisearch-api-rs-url]: https://github.com/RediSearch/redisearch-api-rs
[redisearch-api-rs-author]: https://redis.com
[redisearch-api-rs-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-api-rs.svg?style=social&amp;label=Star&amp;maxAge=2592000
### Deployment Options
There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are many other deployment options, such as:
- [Redis Cloud](https://redis.com/redis-enterprise-cloud/overview/)
- Cloud marketplaces: [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-e6y7ork67pjwg?sr=0-2&ref_=beagle&applicationId=AWSMPContessa), [Google Marketplace](https://console.cloud.google.com/marketplace/details/redislabs-public/redis-enterprise?pli=1), or [Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/garantiadata.redis_enterprise_1sp_public_preview?tab=Overview)
- On-premise: [Redis Enterprise Software](https://redis.com/redis-enterprise-software/overview/)
- Kubernetes: [Redis Enterprise Software on Kubernetes](https://docs.redis.com/latest/kubernetes/)
- [Docker (RediSearch)](https://hub.docker.com/r/redislabs/redisearch)
- [Docker (Redis Stack)](https://hub.docker.com/r/redis/redis-stack)
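
Whichever option you choose, the client side mostly needs a host, a port, and (for secured deployments) a password. The sketch below shows one way to assemble a connection URL from environment-style settings. The function name and the `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` variable names are assumptions that mirror the constants in the accompanying notebook; they are not defined by Redis itself.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def redis_url_from_env(
    env: Optional[Mapping[str, str]] = None,
    use_tls: bool = False,
) -> str:
    """Build a redis:// (or rediss:// for TLS) URL from environment-style settings."""
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    password = env.get("REDIS_PASSWORD", "")  # empty means passwordless
    scheme = "rediss" if use_tls else "redis"
    # Percent-encode the password so characters such as "@" or ":" stay unambiguous
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/0"


# Settings for the local Docker setup (passed explicitly here for illustration)
print(redis_url_from_env({"REDIS_HOST": "localhost", "REDIS_PORT": "6379"}))
```

A URL in this form can be handed to `redis.from_url(...)` in redis-py; whether TLS is required depends on the deployment, so treat `use_tls` as something to check against your provider's settings.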
### Cluster support
RediSearch has a distributed cluster version that scales to billions of documents across hundreds of servers. At the moment, distributed RediSearch is available as part of [Redis Enterprise Cloud](https://redis.com/redis-enterprise-cloud/overview/) and [Redis Enterprise Software](https://redis.com/redis-enterprise-software/overview/).
See [RediSearch on Redis Enterprise](https://redis.com/modules/redisearch/) for more information.
### Examples
- [Product Search](https://github.com/RedisVentures/redis-product-search) - eCommerce product search (with image and text)
- [Product Recommendations with DocArray / Jina](https://github.com/jina-ai/product-recommendation-redis-docarray) - Content-based product recommendations example with Redis and DocArray.
- [Redis VSS in RecSys](https://github.com/RedisVentures/Redis-Recsys) - 3 end-to-end Redis & NVIDIA Merlin Recommendation System Architectures.
- [Azure OpenAI Embeddings Q&A](https://github.com/ruoccofabrizio/azure-open-ai-embeddings-qna) - OpenAI and Redis as a Q&A service on Azure.
- [ArXiv Paper Search](https://github.com/RedisVentures/redis-arXiv-search) - Semantic search over arXiv scholarly papers
### More Resources
For more information on how to use Redis as a vector database, check out the following resources:
- [Redis Vector Similarity Docs](https://redis.io/docs/stack/search/reference/vectors/) - Redis official docs for Vector Search.
- [Redis-py Search Docs](https://redis.readthedocs.io/en/latest/redismodules.html#redisearch-commands) - Redis-py client library docs for RediSearch.
- [Vector Similarity Search: From Basics to Production](https://mlops.community/vector-similarity-search-from-basics-to-production/) - Introductory blog post to VSS and Redis as a VectorDB.
- [AI-Powered Document Search](https://datasciencedojo.com/blog/ai-powered-document-search/) - Blog post covering AI Powered Document Search Use Cases & Architectures.
- [Vector Database Benchmarks](https://jina.ai/news/benchmark-vector-search-databases-with-one-million-data/) - Jina AI VectorDB benchmarks comparing Redis against others.


@ -0,0 +1,22 @@
version: '3.7'
services:
  vector-db:
    image: redis/redis-stack:latest
    ports:
      - 6379:6379
      - 8001:8001
    environment:
      - REDISEARCH_ARGS=CONCURRENT_WRITE_MODE
    volumes:
      - vector-db:/var/lib/redis
      - ./redis.conf:/usr/local/etc/redis/redis.conf
    healthcheck:
      test: ["CMD", "redis-cli", "-h", "localhost", "-p", "6379", "ping"]
      interval: 2s
      timeout: 1m30s
      retries: 5
      start_period: 5s
volumes:
  vector-db:


@ -0,0 +1,867 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Redis as a Vector Database with OpenAI\n",
"\n",
"This notebook provides an introduction to using Redis as a vector database with OpenAI embeddings. Redis is a scalable, real-time database that can be used as a vector database when using the [RediSearch Module](https://oss.redislabs.com/redisearch/). The RediSearch module allows you to index and search for vectors in Redis. This notebook will show you how to use the RediSearch module to index and search for vectors created by using the OpenAI API and stored in Redis.\n",
"\n",
"### What is Redis?\n",
"\n",
"Most developers from a web services background are probably familiar with Redis. At it's core, Redis is an open-source key-value store that can be used as a cache, message broker, and database. Developers choice Redis because it is fast, has a large ecosystem of client libraries, and has been deployed by major enterprises for years.\n",
"\n",
"In addition to the traditional uses of Redis. Redis also provides [Redis Modules](https://redis.io/modules) which are a way to extend Redis with new data types and commands. Example modules include [RedisJSON](https://redis.io/docs/stack/json/), [RedisTimeSeries](https://redis.io/docs/stack/timeseries/), [RedisBloom](https://redis.io/docs/stack/bloom/) and [RediSearch](https://redis.io/docs/stack/search/).\n",
"\n",
"### What is RediSearch?\n",
"\n",
"RediSearch is a [Redis module](https://redis.io/modules) that provides querying, secondary indexing, full-text search and vector search for Redis. To use RediSearch, you first declare indexes on your Redis data. You can then use the RediSearch clients to query that data. For more information on the feature set of RediSearch, see the [README](./README.md) or the [RediSearch documentation](https://redis.io/docs/stack/search/).\n",
"\n",
"### Deployment options\n",
"\n",
"There are a number of ways to deploy Redis. For local development, the quickest method is to use the [Redis Stack docker container](https://hub.docker.com/r/redis/redis-stack) which we will use here. Redis Stack contains a number of Redis modules that can be used together to create a fast, multi-model data store and query engine.\n",
"\n",
"For production use cases, The easiest way to get started is to use the [Redis Cloud](https://redislabs.com/redis-enterprise-cloud/overview/) service. Redis Cloud is a fully managed Redis service. You can also deploy Redis on your own infrastructure using [Redis Enterprise](https://redislabs.com/redis-enterprise/overview/). Redis Enterprise is a fully managed Redis service that can be deployed in kubernetes, on-premises or in the cloud.\n",
"\n",
"Additionally, every major cloud provider ([AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-e6y7ork67pjwg?sr=0-2&ref_=beagle&applicationId=AWSMPContessa), [Google Marketplace](https://console.cloud.google.com/marketplace/details/redislabs-public/redis-enterprise?pli=1), or [Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/garantiadata.redis_enterprise_1sp_public_preview?tab=Overview)) offers Redis Enterprise in a marketplace offering.\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f1a618c5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Before we start this project, we need setup the following:\n",
"\n",
"* start a Redis database with RediSearch (redis-stack)\n",
"* install libraries\n",
" * [Redis-py](https://github.com/redis/redis-py)\n",
"* get your [OpenAI API key](https://beta.openai.com/account/api-keys)\n",
"\n",
"===========================================================\n",
"\n",
"### Start Redis\n",
"\n",
"To keep this example simple, we will use the Redis Stack docker container which we can start as follows\n",
"\n",
"```bash\n",
"$ docker compose up -d\n",
"```\n",
"\n",
"This also includes the [RedisInsight](https://redis.com/redis-enterprise/redis-insight/) GUI for managing your Redis database which you can view at [http://localhost:8001](http://localhost:8001) once you start the docker container.\n",
"\n",
"You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b9babafe",
"metadata": {},
"source": [
"## Install Requirements\n",
"\n",
"Redis-Py is the python client for communicating with Redis. We will use this to communicate with our Redis-stack database. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b04113f",
"metadata": {},
"outputs": [],
"source": [
"!pip install redis wget pandas openai"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "36fe86f4",
"metadata": {},
"source": [
"===========================================================\n",
"## Prepare your OpenAI API key\n",
"\n",
"The `OpenAI API key` is used for vectorization of query data.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "88be138c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OPENAI_API_KEY is ready\n"
]
}
],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"import os\n",
"import openai\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ[\"OPENAI_API_KEY\"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
" print (\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print (\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "97fefe4c",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we'll load embedded data that has already been converted into vectors. We'll use this data to create an index in Redis and then search for similar vectors."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9fbebe0d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File Downloaded\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>url</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>title_vector</th>\n",
" <th>content_vector</th>\n",
" <th>vector_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>https://simple.wikipedia.org/wiki/April</td>\n",
" <td>April</td>\n",
" <td>April is the fourth month of the year in the J...</td>\n",
" <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
" <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>https://simple.wikipedia.org/wiki/August</td>\n",
" <td>August</td>\n",
" <td>August (Aug.) is the eighth month of the year ...</td>\n",
" <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
" <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>https://simple.wikipedia.org/wiki/Art</td>\n",
" <td>Art</td>\n",
" <td>Art is a creative activity that expresses imag...</td>\n",
" <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
" <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8</td>\n",
" <td>https://simple.wikipedia.org/wiki/A</td>\n",
" <td>A</td>\n",
" <td>A or a is the first letter of the English alph...</td>\n",
" <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
" <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>https://simple.wikipedia.org/wiki/Air</td>\n",
" <td>Air</td>\n",
" <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
" <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
" <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id url title \\\n",
"0 1 https://simple.wikipedia.org/wiki/April April \n",
"1 2 https://simple.wikipedia.org/wiki/August August \n",
"2 6 https://simple.wikipedia.org/wiki/Art Art \n",
"3 8 https://simple.wikipedia.org/wiki/A A \n",
"4 9 https://simple.wikipedia.org/wiki/Air Air \n",
"\n",
" text \\\n",
"0 April is the fourth month of the year in the J... \n",
"1 August (Aug.) is the eighth month of the year ... \n",
"2 Art is a creative activity that expresses imag... \n",
"3 A or a is the first letter of the English alph... \n",
"4 Air refers to the Earth's atmosphere. Air is a... \n",
"\n",
" title_vector \\\n",
"0 [0.001009464613161981, -0.020700545981526375, ... \n",
"1 [0.0009286514250561595, 0.000820168002974242, ... \n",
"2 [0.003393713850528002, 0.0061537534929811954, ... \n",
"3 [0.0153952119871974, -0.013759135268628597, 0.... \n",
"4 [0.02224554680287838, -0.02044147066771984, -0... \n",
"\n",
" content_vector vector_id \n",
"0 [-0.011253940872848034, -0.013491976074874401,... 0 \n",
"1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n",
"2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n",
"3 [0.024894846603274345, -0.022186409682035446, ... 3 \n",
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"import numpy as np\n",
"import pandas as pd\n",
"from typing import List\n",
"\n",
"# use helper function in nbutils.py to download and read the data\n",
"# this should take from 5-10 min to run\n",
"if os.getcwd() not in sys.path:\n",
" sys.path.append(os.getcwd())\n",
"import nbutils\n",
"\n",
"nbutils.download_wikipedia_data()\n",
"data = nbutils.read_wikipedia_data()\n",
"\n",
"data.head()"
]
},
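  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a3f1c2d0",
   "metadata": {},
   "source": [
    "The helper in `nbutils.py` reads the embeddings back from a CSV file, where each vector is stored as a string. As a minimal sketch of that parsing step (the two-row frame below is made-up illustrative data, not the Wikipedia dataset), `literal_eval` turns each string back into a list of floats:\n",
    "\n",
    "```python\n",
    "from ast import literal_eval\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up stand-in for the CSV contents: vectors arrive as strings\n",
    "toy = pd.DataFrame({\n",
    "    'title': ['April', 'August'],\n",
    "    'title_vector': ['[0.1, -0.2, 0.3]', '[0.0, 0.5, -0.5]'],\n",
    "})\n",
    "\n",
    "# Parse each string into a Python list of floats\n",
    "toy['title_vector'] = toy['title_vector'].apply(literal_eval)\n",
    "\n",
    "# All vectors must share one dimension before they can be indexed\n",
    "dims = {len(v) for v in toy['title_vector']}\n",
    "assert len(dims) == 1\n",
    "vector_dim = dims.pop()\n",
    "\n",
    "# Stack into a float32 matrix of shape (n_rows, vector_dim)\n",
    "matrix = np.array(toy['title_vector'].tolist(), dtype=np.float32)\n",
    "print(matrix.shape, vector_dim)\n",
    "```\n",
    "\n",
    "Checking the dimension up front is a cheap way to catch a malformed row before it reaches the index."
   ]
  },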
{
"attachments": {},
"cell_type": "markdown",
"id": "91df4d5b",
"metadata": {},
"source": [
"## Connect to Redis\n",
"\n",
"Now that we have our Redis database running, we can connect to it using the Redis-py client. We will use the default host and port for the Redis database which is `localhost:6379`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cc662c1b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import redis\n",
"from redis.commands.search.indexDefinition import (\n",
" IndexDefinition,\n",
" IndexType\n",
")\n",
"from redis.commands.search.query import Query\n",
"from redis.commands.search.field import (\n",
" TextField,\n",
" VectorField\n",
")\n",
"\n",
"REDIS_HOST = \"localhost\"\n",
"REDIS_PORT = 6379\n",
"REDIS_PASSWORD = \"\" # default for passwordless Redis\n",
"\n",
"# Connect to Redis\n",
"redis_client = redis.Redis(\n",
" host=REDIS_HOST,\n",
" port=REDIS_PORT,\n",
" password=REDIS_PASSWORD\n",
")\n",
"redis_client.ping()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7d3dac3c",
"metadata": {},
"source": [
"## Creating a Search Index in Redis\n",
"\n",
"The below cells will show how to specify and create a search index in Redis. We will\n",
"\n",
"1. Set some constants for defining our index like the distance metric and the index name\n",
"2. Define the index schema with RediSearch fields\n",
"3. Create the index"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f894b911",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Constants\n",
"VECTOR_DIM = len(data['title_vector'][0]) # length of the vectors\n",
"VECTOR_NUMBER = len(data) # initial number of vectors\n",
"INDEX_NAME = \"embeddings-index\" # name of the search index\n",
"PREFIX = \"doc\" # prefix for the document keys\n",
"DISTANCE_METRIC = \"COSINE\" # distance metric for the vectors (ex. COSINE, IP, L2)"
]
},
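  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b7e4d915",
   "metadata": {},
   "source": [
    "`VECTOR_DIM` and `VECTOR_NUMBER` also determine roughly how much memory the raw vectors need. As a back-of-the-envelope sketch, assuming 1536-dimensional `FLOAT32` embeddings (the size produced by `text-embedding-ada-002`), the 25,000 documents in this dataset and two vector fields per document, and ignoring all index and key overhead (`raw_vector_bytes` is a hypothetical helper):\n",
    "\n",
    "```python\n",
    "BYTES_PER_FLOAT32 = 4\n",
    "\n",
    "def raw_vector_bytes(num_vectors: int, dim: int, num_fields: int = 1) -> int:\n",
    "    # Bytes for the vector payload only; Redis adds overhead on top\n",
    "    return num_vectors * dim * BYTES_PER_FLOAT32 * num_fields\n",
    "\n",
    "total = raw_vector_bytes(num_vectors=25_000, dim=1536, num_fields=2)\n",
    "print(f'{total / 1024 ** 2:.0f} MiB of raw vector data')\n",
    "```\n",
    "\n",
    "This is only a lower bound, but it helps when sizing a container or instance for a larger dataset."
   ]
  },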
{
"cell_type": "code",
"execution_count": 6,
"id": "15db8380",
"metadata": {},
"outputs": [],
"source": [
"# Define RediSearch fields for each of the columns in the dataset\n",
"title = TextField(name=\"title\")\n",
"url = TextField(name=\"url\")\n",
"text = TextField(name=\"text\")\n",
"title_embedding = VectorField(\"title_vector\",\n",
" \"FLAT\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER,\n",
" }\n",
")\n",
"text_embedding = VectorField(\"content_vector\",\n",
" \"FLAT\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER,\n",
" }\n",
")\n",
"fields = [title, url, text, title_embedding, text_embedding]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3658693c",
"metadata": {},
"outputs": [],
"source": [
"# Check if index exists\n",
"try:\n",
" redis_client.ft(INDEX_NAME).info()\n",
" print(\"Index already exists\")\n",
"except:\n",
" # Create RediSearch Index\n",
" redis_client.ft(INDEX_NAME).create_index(\n",
" fields = fields,\n",
" definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "775c15b4",
"metadata": {},
"source": [
"## Load Documents into the Index\n",
"\n",
"Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The below cells will show how to load documents into the index."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0d791186",
"metadata": {},
"outputs": [],
"source": [
"def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):\n",
" records = documents.to_dict(\"records\")\n",
" for doc in records:\n",
" key = f\"{prefix}:{str(doc['id'])}\"\n",
"\n",
" # create byte vectors for title and content\n",
" title_embedding = np.array(doc[\"title_vector\"], dtype=np.float32).tobytes()\n",
" content_embedding = np.array(doc[\"content_vector\"], dtype=np.float32).tobytes()\n",
"\n",
" # replace list of floats with byte vectors\n",
" doc[\"title_vector\"] = title_embedding\n",
" doc[\"content_vector\"] = content_embedding\n",
"\n",
" client.hset(key, mapping = doc)"
]
},
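  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c58a02be",
   "metadata": {},
   "source": [
    "The byte encoding used in `index_documents` has to match the `FLOAT32` type declared in the index schema. A small round-trip sketch with a made-up three-element vector shows what is stored in the hash field and how to read it back:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "vector = [0.25, -0.5, 1.0]\n",
    "\n",
    "# Encode the way index_documents does: float32 values packed as raw bytes\n",
    "blob = np.array(vector, dtype=np.float32).tobytes()\n",
    "\n",
    "# Each float32 takes 4 bytes, so the blob length is 4 * len(vector)\n",
    "assert len(blob) == 4 * len(vector)\n",
    "\n",
    "# Decode with the same dtype to recover the values\n",
    "decoded = np.frombuffer(blob, dtype=np.float32)\n",
    "print(decoded.tolist())\n",
    "```\n",
    "\n",
    "Encoding with NumPy's default `float64` instead would produce twice as many bytes, which would no longer correspond to the `DIM` declared for a `FLOAT32` field."
   ]
  },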
{
"cell_type": "code",
"execution_count": 9,
"id": "5bfaeafa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 25000 documents in Redis search index with name: embeddings-index\n"
]
}
],
"source": [
"index_documents(redis_client, PREFIX, data)\n",
"print(f\"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "46050ca9",
"metadata": {},
"source": [
"## Simple Vector Search Queries with OpenAI Query Embeddings\n",
"\n",
"Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Redis as a vector database."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b044aa93",
"metadata": {},
"outputs": [],
"source": [
"def search_redis(\n",
" redis_client: redis.Redis,\n",
" user_query: str,\n",
" index_name: str = \"embeddings-index\",\n",
" vector_field: str = \"title_vector\",\n",
" return_fields: list = [\"title\", \"url\", \"text\", \"vector_score\"],\n",
" hybrid_fields = \"*\",\n",
" k: int = 20,\n",
" print_results: bool = True,\n",
") -> List[dict]:\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(input=user_query,\n",
" model=\"text-embedding-ada-002\",\n",
" )[\"data\"][0]['embedding']\n",
"\n",
" # Prepare the Query\n",
" base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n",
" query = (\n",
" Query(base_query)\n",
" .return_fields(*return_fields)\n",
" .sort_by(\"vector_score\")\n",
" .paging(0, k)\n",
" .dialect(2)\n",
" )\n",
" params_dict = {\"vector\": np.array(embedded_query).astype(dtype=np.float32).tobytes()}\n",
"\n",
" # perform vector search\n",
" results = redis_client.ft(index_name).search(query, params_dict)\n",
" if print_results:\n",
" for i, article in enumerate(results.docs):\n",
" score = 1 - float(article.vector_score)\n",
" print(f\"{i}. {article.title} (Score: {round(score ,3) })\")\n",
" return results.docs"
]
},
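  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d9416f73",
   "metadata": {},
   "source": [
    "Two details of `search_redis` are easy to miss: how the query string is assembled, and why the printed score is `1 - vector_score`. The sketch below isolates both. `build_knn_query` and `distance_to_similarity` are hypothetical helper names, and the conversion assumes the index uses the `COSINE` metric, for which the returned value is a distance rather than a similarity.\n",
    "\n",
    "```python\n",
    "def build_knn_query(k: int, vector_field: str, filter_expr: str = '*') -> str:\n",
    "    # Same template search_redis uses: filter first, then the KNN clause\n",
    "    return f'{filter_expr}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n",
    "\n",
    "def distance_to_similarity(cosine_distance: float) -> float:\n",
    "    # A distance of 0 means the vectors point in the same direction\n",
    "    return 1.0 - cosine_distance\n",
    "\n",
    "print(build_knn_query(10, 'title_vector'))\n",
    "print(distance_to_similarity(0.125))\n",
    "```\n",
    "\n",
    "The `*` filter matches every document, so the default query is a plain nearest-neighbor search over the whole index."
   ]
  },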
{
"cell_type": "code",
"execution_count": 11,
"id": "7e2025f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Museum of Modern Art (Score: 0.875)\n",
"1. Western Europe (Score: 0.868)\n",
"2. Renaissance art (Score: 0.864)\n",
"3. Pop art (Score: 0.86)\n",
"4. Northern Europe (Score: 0.855)\n",
"5. Hellenistic art (Score: 0.853)\n",
"6. Modernist literature (Score: 0.847)\n",
"7. Art film (Score: 0.843)\n",
"8. Central Europe (Score: 0.843)\n",
"9. European (Score: 0.841)\n"
]
}
],
"source": [
"# For using OpenAI to generate query embedding\n",
"results = search_redis(redis_client, 'modern art in Europe', k=10)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "93c4a696",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Battle of Bannockburn (Score: 0.869)\n",
"1. Wars of Scottish Independence (Score: 0.861)\n",
"2. 1651 (Score: 0.853)\n",
"3. First War of Scottish Independence (Score: 0.85)\n",
"4. Robert I of Scotland (Score: 0.846)\n",
"5. 841 (Score: 0.844)\n",
"6. 1716 (Score: 0.844)\n",
"7. 1314 (Score: 0.837)\n",
"8. 1263 (Score: 0.836)\n",
"9. William Wallace (Score: 0.835)\n"
]
}
],
"source": [
"results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2007be48",
"metadata": {},
"source": [
"## Hybrid Queries with Redis\n",
"\n",
"The previous examples showed how run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the below example, we will combine vector search with full text search."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6c25ee8d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. First War of Scottish Independence (Score: 0.892)\n",
"1. Wars of Scottish Independence (Score: 0.889)\n",
"2. Second War of Scottish Independence (Score: 0.879)\n",
"3. List of Scottish monarchs (Score: 0.873)\n",
"4. Scottish Borders (Score: 0.863)\n"
]
}
],
"source": [
"def create_hybrid_field(field_name: str, value: str) -> str:\n",
" return f'@{field_name}:\"{value}\"'\n",
"\n",
"# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title\n",
"results = search_redis(redis_client,\n",
" \"Famous battles in Scottish history\",\n",
" vector_field=\"title_vector\",\n",
" k=5,\n",
" hybrid_fields=create_hybrid_field(\"title\", \"Scottish\")\n",
" )"
]
},
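  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e20b7a64",
   "metadata": {},
   "source": [
    "A hybrid filter is not limited to a single field. As a sketch under the assumption that the standard RediSearch query syntax applies (a space between terms intersects them, a pipe unions them), several field filters can be combined before the KNN clause. The helpers `field_filter`, `all_of` and `any_of` are hypothetical names introduced for illustration.\n",
    "\n",
    "```python\n",
    "def field_filter(field_name: str, value: str) -> str:\n",
    "    # Same form as create_hybrid_field: phrase match on one field\n",
    "    return f'@{field_name}:\"{value}\"'\n",
    "\n",
    "def all_of(*filters: str) -> str:\n",
    "    # Terms separated by a space are intersected (AND)\n",
    "    return '(' + ' '.join(filters) + ')'\n",
    "\n",
    "def any_of(*filters: str) -> str:\n",
    "    # A pipe between terms means union (OR)\n",
    "    return '(' + ' | '.join(filters) + ')'\n",
    "\n",
    "combined = all_of(field_filter('title', 'Scottish'), field_filter('text', 'battle'))\n",
    "print(combined)\n",
    "```\n",
    "\n",
    "The resulting string can be passed as the `hybrid_fields` argument of `search_redis` in place of a single filter."
   ]
  },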
{
"cell_type": "code",
"execution_count": 14,
"id": "2c0d11d8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Art (Score: 1.0)\n",
"1. Paint (Score: 0.896)\n",
"2. Renaissance art (Score: 0.88)\n",
"3. Painting (Score: 0.874)\n",
"4. Renaissance (Score: 0.846)\n"
]
},
{
"data": {
"text/plain": [
"'In Europe, after the Middle Ages, there was a \"Renaissance\" which means \"rebirth\". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# run a hybrid query for articles about Art in the title vector and only include results with the phrase \"Leonardo da Vinci\" in the text\n",
"results = search_redis(redis_client,\n",
" \"Art\",\n",
" vector_field=\"title_vector\",\n",
" k=5,\n",
" hybrid_fields=create_hybrid_field(\"text\", \"Leonardo da Vinci\")\n",
" )\n",
"\n",
"# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned\n",
"mention = [sentence for sentence in results[0].text.split(\"\\n\") if \"Leonardo da Vinci\" in sentence][0]\n",
"mention"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f8aebbe3",
"metadata": {},
"source": [
"## HNSW Index\n",
"\n",
"Up until now, we've been using the ``FLAT`` or \"brute-force\" index to run our queries. Redis also supports the ``HNSW`` index which is a fast, approximate index. The ``HNSW`` index is a graph-based index that uses a hierarchical navigable small world graph to store vectors. The ``HNSW`` index is a good choice for large datasets where you want to run approximate queries.\n",
"\n",
"``HNSW`` will take longer to build and consume more memory for most cases than ``FLAT`` but will be faster to run queries on, especially for large datasets.\n",
"\n",
"The following cells will show how to create an ``HNSW`` index and run queries with it using the same data as before."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "865c30f3",
"metadata": {},
"outputs": [],
"source": [
"# re-define RediSearch vector fields to use HNSW index\n",
"title_embedding = VectorField(\"title_vector\",\n",
" \"HNSW\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER\n",
" }\n",
")\n",
"text_embedding = VectorField(\"content_vector\",\n",
" \"HNSW\", {\n",
" \"TYPE\": \"FLOAT32\",\n",
" \"DIM\": VECTOR_DIM,\n",
" \"DISTANCE_METRIC\": DISTANCE_METRIC,\n",
" \"INITIAL_CAP\": VECTOR_NUMBER\n",
" }\n",
")\n",
"fields = [title, url, text, title_embedding, text_embedding]"
]
},
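  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f6c3a18d",
   "metadata": {},
   "source": [
    "The `HNSW` fields above rely on the default graph settings. RediSearch also accepts the optional attributes `M`, `EF_CONSTRUCTION` and `EF_RUNTIME` for `HNSW` fields, which trade build time and memory against search accuracy. The sketch below only builds the attribute dictionary; `hnsw_attributes` is a hypothetical helper and the numbers are illustrative values, not tuned recommendations, so check the RediSearch documentation before relying on them.\n",
    "\n",
    "```python\n",
    "def hnsw_attributes(dim: int, metric: str = 'COSINE', m: int = 16,\n",
    "                    ef_construction: int = 200, ef_runtime: int = 10) -> dict:\n",
    "    # Larger values generally improve accuracy at the cost of memory or latency\n",
    "    return {\n",
    "        'TYPE': 'FLOAT32',\n",
    "        'DIM': dim,\n",
    "        'DISTANCE_METRIC': metric,\n",
    "        'M': m,                              # links kept per graph node\n",
    "        'EF_CONSTRUCTION': ef_construction,  # candidates examined while building\n",
    "        'EF_RUNTIME': ef_runtime,            # candidates examined per query\n",
    "    }\n",
    "\n",
    "attrs = hnsw_attributes(dim=1536, ef_runtime=50)\n",
    "print(attrs)\n",
    "```\n",
    "\n",
    "The dictionary would take the place of the literal attribute mapping, as in `VectorField(\"title_vector\", \"HNSW\", attrs)`."
   ]
  },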
{
"cell_type": "code",
"execution_count": 16,
"id": "347e1e70",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"# Check if index exists\n",
"HNSW_INDEX_NAME = INDEX_NAME+ \"_HNSW\"\n",
"\n",
"try:\n",
" redis_client.ft(HNSW_INDEX_NAME).info()\n",
" print(\"Index already exists\")\n",
"except:\n",
" # Create RediSearch Index\n",
" redis_client.ft(HNSW_INDEX_NAME).create_index(\n",
" fields = fields,\n",
" definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n",
" )\n",
"\n",
"# since RediSearch creates the index in the background for existing documents, we will wait until\n",
"# indexing is complete before running our queries. Although this is not necessary for the first query,\n",
"# some queries may take longer to run if the index is not fully built. In general, Redis will perform\n",
"# best when adding new documents to existing indices rather than new indices on existing documents.\n",
"while redis_client.ft(HNSW_INDEX_NAME).info()[\"indexing\"] == \"1\":\n",
" time.sleep(5)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "8e474447",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Western Europe (Score: 0.868)\n",
"1. Northern Europe (Score: 0.855)\n",
"2. Central Europe (Score: 0.843)\n",
"3. European (Score: 0.841)\n",
"4. Eastern Europe (Score: 0.839)\n",
"5. Europe (Score: 0.839)\n",
"6. Western European Union (Score: 0.837)\n",
"7. Southern Europe (Score: 0.831)\n",
"8. Western civilization (Score: 0.83)\n",
"9. Council of Europe (Score: 0.827)\n"
]
}
],
"source": [
"results = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "cb799e69",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ----- Flat Index ----- \n",
"0. Museum of Modern Art (Score: 0.875)\n",
"1. Western Europe (Score: 0.867)\n",
"2. Renaissance art (Score: 0.864)\n",
"3. Pop art (Score: 0.861)\n",
"4. Northern Europe (Score: 0.855)\n",
"5. Hellenistic art (Score: 0.853)\n",
"6. Modernist literature (Score: 0.847)\n",
"7. Art film (Score: 0.843)\n",
"8. Central Europe (Score: 0.843)\n",
"9. Art (Score: 0.842)\n",
"Flat index query time: 0.263 seconds\n",
"\n",
" ----- HNSW Index ------ \n",
"0. Western Europe (Score: 0.867)\n",
"1. Northern Europe (Score: 0.855)\n",
"2. Central Europe (Score: 0.843)\n",
"3. European (Score: 0.841)\n",
"4. Eastern Europe (Score: 0.839)\n",
"5. Europe (Score: 0.839)\n",
"6. Western European Union (Score: 0.837)\n",
"7. Southern Europe (Score: 0.831)\n",
"8. Western civilization (Score: 0.83)\n",
"9. Council of Europe (Score: 0.827)\n",
"HNSW index query time: 0.129 seconds\n",
" ------------------------ \n"
]
}
],
"source": [
"# compare the results of the HNSW index to the FLAT index and time both queries\n",
"def time_queries(iterations: int = 10):\n",
" print(\" ----- Flat Index ----- \")\n",
" t0 = time.time()\n",
" for i in range(iterations):\n",
" results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=False)\n",
" t0 = (time.time() - t0) / iterations\n",
" results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=True)\n",
" print(f\"Flat index query time: {round(t0, 3)} seconds\\n\")\n",
" time.sleep(1)\n",
" print(\" ----- HNSW Index ------ \")\n",
" t1 = time.time()\n",
" for i in range(iterations):\n",
" results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=False)\n",
" t1 = (time.time() - t1) / iterations\n",
" results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=True)\n",
" print(f\"HNSW index query time: {round(t1, 3)} seconds\")\n",
" print(\" ------------------------ \")\n",
"time_queries()"
]
},
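  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0b8d5e2a",
   "metadata": {},
   "source": [
    "Because `HNSW` is approximate, it is worth measuring how much of the exact `FLAT` result it recovers, not only how fast it runs. One common measure is recall: the fraction of the exact top-k results that also appear in the approximate top-k. The sketch below applies a hypothetical `recall_at_k` helper to the titles printed in the output above.\n",
    "\n",
    "```python\n",
    "def recall_at_k(exact_titles, approx_titles) -> float:\n",
    "    # Fraction of the exact results that the approximate search also returned\n",
    "    exact = set(exact_titles)\n",
    "    if not exact:\n",
    "        return 0.0\n",
    "    return len(exact & set(approx_titles)) / len(exact)\n",
    "\n",
    "flat_top = ['Museum of Modern Art', 'Western Europe', 'Renaissance art', 'Pop art',\n",
    "            'Northern Europe', 'Hellenistic art', 'Modernist literature', 'Art film',\n",
    "            'Central Europe', 'Art']\n",
    "hnsw_top = ['Western Europe', 'Northern Europe', 'Central Europe', 'European',\n",
    "            'Eastern Europe', 'Europe', 'Western European Union', 'Southern Europe',\n",
    "            'Western civilization', 'Council of Europe']\n",
    "\n",
    "print(recall_at_k(flat_top, hnsw_top))\n",
    "```\n",
    "\n",
    "For this query only three of the ten exact results appear in the `HNSW` output, so the speedup comes with a visible loss of accuracy at the settings used here."
   ]
  },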
{
"cell_type": "code",
"execution_count": null,
"id": "69aa7a09",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "redisvl2",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,46 @@
import os
import wget
import zipfile
import numpy as np
import pandas as pd
from ast import literal_eval


def download_wikipedia_data(
    data_path: str = '../../data/',
    download_path: str = "./",
    file_name: str = "vector_database_wikipedia_articles_embedded") -> None:
    data_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
    csv_file_path = os.path.join(data_path, file_name + ".csv")
    zip_file_path = os.path.join(download_path, file_name + ".zip")
    if os.path.isfile(csv_file_path):
        print("File Downloaded")
    else:
        if os.path.isfile(zip_file_path):
            print("Zip downloaded but not unzipped, unzipping now...")
        else:
            print("File not found, downloading now...")
            # Download the data
            wget.download(data_url, out=download_path)
        # Unzip the data
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(data_path)
        # Remove the zip file from wherever it was downloaded
        os.remove(zip_file_path)
        print(f"File downloaded to {data_path}")


def read_wikipedia_data(data_path: str = '../../data/', file_name: str = "vector_database_wikipedia_articles_embedded") -> pd.DataFrame:
    csv_file_path = os.path.join(data_path, file_name + ".csv")
    data = pd.read_csv(csv_file_path)
    # Read vectors from strings back into a list
    data['title_vector'] = data.title_vector.apply(literal_eval)
    data['content_vector'] = data.content_vector.apply(literal_eval)
    # Set vector_id to be a string
    data['vector_id'] = data['vector_id'].apply(str)
    return data

View File

@ -0,0 +1,5 @@
port 6379
appendonly no
save ""
protected-mode no
io-threads 2