openai-cookbook/examples/vector_databases/redis/redisjson/redisjson.ipynb
2024-01-25 11:59:05 -06:00

414 lines
23 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Redis Vectors as JSON with OpenAI\n",
"This notebook expands on the other Redis OpenAI-cookbook examples with examples of how to use JSON with vectors.\n",
"[Storing Vectors in JSON](https://redis.io/docs/stack/search/reference/vectors/#storing-vectors-in-json)\n",
"\n",
"## Prerequisites\n",
"* Redis instance with the Redis Search and Redis JSON modules\n",
"* Redis-py client lib\n",
"* OpenAI API key\n",
"\n",
"## Installation\n",
"Install Python modules necessary for the examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"! pip install redis openai python-dotenv openai[datalib]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## OpenAI API Key\n",
"Create a .env file and add your OpenAI key to it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"OPENAI_API_KEY=your_key"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Text Vectors\n",
"Create embeddings (array of floats) of the news excerpts below."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"def get_vector(text, model=\"text-embedding-3-small\"):\n",
" text = text.replace(\"\\n\", \" \")\n",
" return openai.Embedding.create(input = [text], model = model)['data'][0]['embedding']\n",
"\n",
"text_1 = \"\"\"Japan narrowly escapes recession\n",
"\n",
"Japan's economy teetered on the brink of a technical recession in the three months to September, figures show.\n",
"\n",
"Revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. On an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.\n",
"The government was keen to play down the worrying implications of the data. \"I maintain the view that Japan's economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully,\" said economy minister Heizo Takenaka. But in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine. \"It's painting a picture of a recovery... much patchier than previously thought,\" said Paul Sheard, economist at Lehman Brothers in Tokyo. Improvements in the job market apparently have yet to feed through to domestic demand, with private consumption up just 0.2% in the third quarter.\n",
"\"\"\"\n",
"\n",
"text_2 = \"\"\"Dibaba breaks 5,000m world record\n",
"\n",
"Ethiopia's Tirunesh Dibaba set a new world record in winning the women's 5,000m at the Boston Indoor Games.\n",
"\n",
"Dibaba won in 14 minutes 32.93 seconds to erase the previous world indoor mark of 14:39.29 set by another Ethiopian, Berhane Adera, in Stuttgart last year. But compatriot Kenenisa Bekele's record hopes were dashed when he miscounted his laps in the men's 3,000m and staged his sprint finish a lap too soon. Ireland's Alistair Cragg won in 7:39.89 as Bekele battled to second in 7:41.42. \"I didn't want to sit back and get out-kicked,\" said Cragg. \"So I kept on the pace. The plan was to go with 500m to go no matter what, but when Bekele made the mistake that was it. The race was mine.\" Sweden's Carolina Kluft, the Olympic heptathlon champion, and Slovenia's Jolanda Ceplak had winning performances, too. Kluft took the long jump at 6.63m, while Ceplak easily won the women's 800m in 2:01.52. \n",
"\"\"\"\n",
"\n",
"\n",
"text_3 = \"\"\"Google's toolbar sparks concern\n",
"\n",
"Search engine firm Google has released a trial tool which is concerning some net users because it directs people to pre-selected commercial websites.\n",
"\n",
"The AutoLink feature comes with Google's latest toolbar and provides links in a webpage to Amazon.com if it finds a book's ISBN number on the site. It also links to Google's map service, if there is an address, or to car firm Carfax, if there is a licence plate. Google said the feature, available only in the US, \"adds useful links\". But some users are concerned that Google's dominant position in the search engine market place could mean it would be giving a competitive edge to firms like Amazon.\n",
"\n",
"AutoLink works by creating a link to a website based on information contained in a webpage - even if there is no link specified and whether or not the publisher of the page has given permission.\n",
"\n",
"If a user clicks the AutoLink feature in the Google toolbar then a webpage with a book's unique ISBN number would link directly to Amazon's website. It could mean online libraries that list ISBN book numbers find they are directing users to Amazon.com whether they like it or not. Websites which have paid for advertising on their pages may also be directing people to rival services. Dan Gillmor, founder of Grassroots Media, which supports citizen-based media, said the tool was a \"bad idea, and an unfortunate move by a company that is looking to continue its hypergrowth\". In a statement Google said the feature was still only in beta, ie trial, stage and that the company welcomed feedback from users. It said: \"The user can choose never to click on the AutoLink button, and web pages she views will never be modified. \"In addition, the user can choose to disable the AutoLink feature entirely at any time.\"\n",
"\n",
"The new tool has been compared to the Smart Tags feature from Microsoft by some users. It was widely criticised by net users and later dropped by Microsoft after concerns over trademark use were raised. Smart Tags allowed Microsoft to link any word on a web page to another site chosen by the company. Google said none of the companies which received AutoLinks had paid for the service. Some users said AutoLink would only be fair if websites had to sign up to allow the feature to work on their pages or if they received revenue for any \"click through\" to a commercial site. Cory Doctorow, European outreach coordinator for digital civil liberties group Electronic Fronter Foundation, said that Google should not be penalised for its market dominance. \"Of course Google should be allowed to direct people to whatever proxies it chooses. \"But as an end user I would want to know - 'Can I choose to use this service?, 'How much is Google being paid?', 'Can I substitute my own companies for the ones chosen by Google?'.\" Mr Doctorow said the only objection would be if users were forced into using AutoLink or \"tricked into using the service\".\n",
"\"\"\"\n",
"\n",
"doc_1 = {\"content\": text_1, \"vector\": get_vector(text_1)}\n",
"doc_2 = {\"content\": text_2, \"vector\": get_vector(text_2)}\n",
"doc_3 = {\"content\": text_3, \"vector\": get_vector(text_3)}\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start the Redis Stack Docker container"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1A\u001b[1B\u001b[0G\u001b[?25l[+] Running 0/0\n",
" ⠿ Container redisjson-redis-1 Starting \u001b[34m0.1s \u001b[0m\n",
"\u001b[?25h\u001b[1A\u001b[1A\u001b[0G\u001b[?25l[+] Running 0/1\n",
" ⠿ Container redisjson-redis-1 Starting \u001b[34m0.2s \u001b[0m\n",
"\u001b[?25h\u001b[1A\u001b[1A\u001b[0G\u001b[?25l[+] Running 0/1\n",
" ⠿ Container redisjson-redis-1 Starting \u001b[34m0.3s \u001b[0m\n",
"\u001b[?25h\u001b[1A\u001b[1A\u001b[0G\u001b[?25l[+] Running 0/1\n",
" ⠿ Container redisjson-redis-1 Starting \u001b[34m0.4s \u001b[0m\n",
"\u001b[?25h\u001b[1A\u001b[1A\u001b[0G\u001b[?25l\u001b[34m[+] Running 1/1\u001b[0m\n",
" \u001b[32m✔\u001b[0m Container redisjson-redis-1 \u001b[32mStarted\u001b[0m \u001b[34m0.4s \u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"! docker compose up -d"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect Redis client"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from redis import from_url\n",
"\n",
"REDIS_URL = 'redis://localhost:6379'\n",
"client = from_url(REDIS_URL)\n",
"client.ping()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Index\n",
"[FT.CREATE](https://redis.io/commands/ft.create/)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'OK'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from redis.commands.search.field import TextField, VectorField\n",
"from redis.commands.search.indexDefinition import IndexDefinition, IndexType\n",
"\n",
"schema = [ VectorField('$.vector', \n",
" \"FLAT\", \n",
" { \"TYPE\": 'FLOAT32', \n",
" \"DIM\": len(doc_1['vector']), \n",
" \"DISTANCE_METRIC\": \"COSINE\"\n",
" }, as_name='vector' ),\n",
" TextField('$.content', as_name='content')\n",
" ]\n",
"idx_def = IndexDefinition(index_type=IndexType.JSON, prefix=['doc:'])\n",
"try: \n",
" client.ft('idx').dropindex()\n",
"except:\n",
" pass\n",
"client.ft('idx').create_index(schema, definition=idx_def)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data into Redis as JSON objects\n",
"[Redis JSON](https://redis.io/docs/stack/json/)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client.json().set('doc:1', '$', doc_1)\n",
"client.json().set('doc:2', '$', doc_2)\n",
"client.json().set('doc:3', '$', doc_3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Semantic Search\n",
"Given a sports-related article, search Redis via Vector Similarity Search (VSS) for similar articles. \n",
"[KNN Search](https://redis.io/docs/stack/search/reference/vectors/#knn-search)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance:0.188 content:Dibaba breaks 5,000m world record\n",
"\n",
"Ethiopia's Tirunesh Dibaba set a new world record in winning the women's 5,000m at the Boston Indoor Games.\n",
"\n",
"Dibaba won in 14 minutes 32.93 seconds to erase the previous world indoor mark of 14:39.29 set by another Ethiopian, Berhane Adera, in Stuttgart last year. But compatriot Kenenisa Bekele's record hopes were dashed when he miscounted his laps in the men's 3,000m and staged his sprint finish a lap too soon. Ireland's Alistair Cragg won in 7:39.89 as Bekele battled to second in 7:41.42. \"I didn't want to sit back and get out-kicked,\" said Cragg. \"So I kept on the pace. The plan was to go with 500m to go no matter what, but when Bekele made the mistake that was it. The race was mine.\" Sweden's Carolina Kluft, the Olympic heptathlon champion, and Slovenia's Jolanda Ceplak had winning performances, too. Kluft took the long jump at 6.63m, while Ceplak easily won the women's 800m in 2:01.52. \n",
"\n",
"\n",
"distance:0.268 content:Japan narrowly escapes recession\n",
"\n",
"Japan's economy teetered on the brink of a technical recession in the three months to September, figures show.\n",
"\n",
"Revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. On an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.\n",
"The government was keen to play down the worrying implications of the data. \"I maintain the view that Japan's economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully,\" said economy minister Heizo Takenaka. But in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine. \"It's painting a picture of a recovery... much patchier than previously thought,\" said Paul Sheard, economist at Lehman Brothers in Tokyo. Improvements in the job market apparently have yet to feed through to domestic demand, with private consumption up just 0.2% in the third quarter.\n",
"\n",
"\n",
"distance:0.287 content:Google's toolbar sparks concern\n",
"\n",
"Search engine firm Google has released a trial tool which is concerning some net users because it directs people to pre-selected commercial websites.\n",
"\n",
"The AutoLink feature comes with Google's latest toolbar and provides links in a webpage to Amazon.com if it finds a book's ISBN number on the site. It also links to Google's map service, if there is an address, or to car firm Carfax, if there is a licence plate. Google said the feature, available only in the US, \"adds useful links\". But some users are concerned that Google's dominant position in the search engine market place could mean it would be giving a competitive edge to firms like Amazon.\n",
"\n",
"AutoLink works by creating a link to a website based on information contained in a webpage - even if there is no link specified and whether or not the publisher of the page has given permission.\n",
"\n",
"If a user clicks the AutoLink feature in the Google toolbar then a webpage with a book's unique ISBN number would link directly to Amazon's website. It could mean online libraries that list ISBN book numbers find they are directing users to Amazon.com whether they like it or not. Websites which have paid for advertising on their pages may also be directing people to rival services. Dan Gillmor, founder of Grassroots Media, which supports citizen-based media, said the tool was a \"bad idea, and an unfortunate move by a company that is looking to continue its hypergrowth\". In a statement Google said the feature was still only in beta, ie trial, stage and that the company welcomed feedback from users. It said: \"The user can choose never to click on the AutoLink button, and web pages she views will never be modified. \"In addition, the user can choose to disable the AutoLink feature entirely at any time.\"\n",
"\n",
"The new tool has been compared to the Smart Tags feature from Microsoft by some users. It was widely criticised by net users and later dropped by Microsoft after concerns over trademark use were raised. Smart Tags allowed Microsoft to link any word on a web page to another site chosen by the company. Google said none of the companies which received AutoLinks had paid for the service. Some users said AutoLink would only be fair if websites had to sign up to allow the feature to work on their pages or if they received revenue for any \"click through\" to a commercial site. Cory Doctorow, European outreach coordinator for digital civil liberties group Electronic Fronter Foundation, said that Google should not be penalised for its market dominance. \"Of course Google should be allowed to direct people to whatever proxies it chooses. \"But as an end user I would want to know - 'Can I choose to use this service?, 'How much is Google being paid?', 'Can I substitute my own companies for the ones chosen by Google?'.\" Mr Doctorow said the only objection would be if users were forced into using AutoLink or \"tricked into using the service\".\n",
"\n",
"\n"
]
}
],
"source": [
"from redis.commands.search.query import Query\n",
"import numpy as np\n",
"\n",
"text_4 = \"\"\"Radcliffe yet to answer GB call\n",
"\n",
"Paula Radcliffe has been granted extra time to decide whether to compete in the World Cross-Country Championships.\n",
"\n",
"The 31-year-old is concerned the event, which starts on 19 March in France, could upset her preparations for the London Marathon on 17 April. \"There is no question that Paula would be a huge asset to the GB team,\" said Zara Hyde Peters of UK Athletics. \"But she is working out whether she can accommodate the worlds without too much compromise in her marathon training.\" Radcliffe must make a decision by Tuesday - the deadline for team nominations. British team member Hayley Yelling said the team would understand if Radcliffe opted out of the event. \"It would be fantastic to have Paula in the team,\" said the European cross-country champion. \"But you have to remember that athletics is basically an individual sport and anything achieved for the team is a bonus. \"She is not messing us around. We all understand the problem.\" Radcliffe was world cross-country champion in 2001 and 2002 but missed last year's event because of injury. In her absence, the GB team won bronze in Brussels.\n",
"\"\"\"\n",
"\n",
"vec = np.array(get_vector(text_4), dtype=np.float32).tobytes()\n",
"q = Query('*=>[KNN 3 @vector $query_vec AS vector_score]')\\\n",
" .sort_by('vector_score')\\\n",
" .return_fields('vector_score', 'content')\\\n",
" .dialect(2) \n",
"params = {\"query_vec\": vec}\n",
"\n",
"results = client.ft('idx').search(q, query_params=params)\n",
"for doc in results.docs:\n",
" print(f\"distance:{round(float(doc['vector_score']),3)} content:{doc['content']}\\n\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hybrid Search\n",
"Use a combination of full text search and VSS to find a matching article. For this scenario, we filter on a full text search of the term 'recession' and then find the KNN articles. In this case, business-related. Reminder document #1 was about a recession in Japan.\n",
"[Hybrid Queries](https://redis.io/docs/stack/search/reference/vectors/#hybrid-queries)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance:0.241 content:Japan narrowly escapes recession\n",
"\n",
"Japan's economy teetered on the brink of a technical recession in the three months to September, figures show.\n",
"\n",
"Revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. On an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.\n",
"The government was keen to play down the worrying implications of the data. \"I maintain the view that Japan's economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully,\" said economy minister Heizo Takenaka. But in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine. \"It's painting a picture of a recovery... much patchier than previously thought,\" said Paul Sheard, economist at Lehman Brothers in Tokyo. Improvements in the job market apparently have yet to feed through to domestic demand, with private consumption up just 0.2% in the third quarter.\n",
"\n",
"\n"
]
}
],
"source": [
"text_5 = \"\"\"Ethiopia's crop production up 24%\n",
"\n",
"Ethiopia produced 14.27 million tonnes of crops in 2004, 24% higher than in 2003 and 21% more than the average of the past five years, a report says.\n",
"\n",
"In 2003, crop production totalled 11.49 million tonnes, the joint report from the Food and Agriculture Organisation and the World Food Programme said. Good rains, increased use of fertilizers and improved seeds contributed to the rise in production. Nevertheless, 2.2 million Ethiopians will still need emergency assistance.\n",
"\n",
"The report calculated emergency food requirements for 2005 to be 387,500 tonnes. On top of that, 89,000 tonnes of fortified blended food and vegetable oil for \"targeted supplementary food distributions for a survival programme for children under five and pregnant and lactating women\" will be needed.\n",
"\n",
"In eastern and southern Ethiopia, a prolonged drought has killed crops and drained wells. Last year, a total of 965,000 tonnes of food assistance was needed to help seven million Ethiopians. The Food and Agriculture Organisation (FAO) recommend that the food assistance is bought locally. \"Local purchase of cereals for food assistance programmes is recommended as far as possible, so as to assist domestic markets and farmers,\" said Henri Josserand, chief of FAO's Global Information and Early Warning System. Agriculture is the main economic activity in Ethiopia, representing 45% of gross domestic product. About 80% of Ethiopians depend directly or indirectly on agriculture.\n",
"\"\"\"\n",
"\n",
"vec = np.array(get_vector(text_5), dtype=np.float32).tobytes()\n",
"q = Query('@content:recession => [KNN 3 @vector $query_vec AS vector_score]')\\\n",
" .sort_by('vector_score')\\\n",
" .return_fields('vector_score', 'content')\\\n",
" .dialect(2) \n",
"params = {\"query_vec\": vec}\n",
"\n",
"results = client.ft('idx').search(q, query_params=params)\n",
"for doc in results.docs:\n",
" print(f\"distance:{round(float(doc['vector_score']),3)} content:{doc['content']}\\n\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}