{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Cognitive Search as a vector database for OpenAI embeddings"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook provides step-by-step instructions on using Azure Cognitive Search as a vector database with OpenAI embeddings. Azure Cognitive Search (formerly known as \"Azure Search\") is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.\n",
"\n",
"## Prerequisites:\n",
"For the purposes of this exercise you must have the following:\n",
"- [Azure Cognitive Search Service](https://learn.microsoft.com/azure/search/)\n",
"- [OpenAI Key](https://platform.openai.com/account/api-keys) or [Azure OpenAI credentials](https://learn.microsoft.com/azure/cognitive-services/openai/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install wget\n",
"! pip install azure-search-documents --pre "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import openai\n",
"import wget\n",
"import pandas as pd\n",
"import zipfile\n",
"from azure.core.credentials import AzureKeyCredential \n",
"from azure.search.documents import SearchClient \n",
"from azure.search.documents.indexes import SearchIndexClient \n",
"from azure.search.documents.models import Vector \n",
"from azure.search.documents import SearchIndexingBufferedSender\n",
"from azure.search.documents.indexes.models import ( \n",
" SearchIndex, \n",
" SearchField, \n",
" SearchFieldDataType, \n",
" SimpleField, \n",
" SearchableField, \n",
" SemanticConfiguration, \n",
" PrioritizedFields, \n",
" SemanticField, \n",
" SemanticSettings, \n",
" VectorSearch, \n",
" HnswVectorSearchAlgorithmConfiguration, \n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure OpenAI settings\n",
"\n",
"Configure your OpenAI or Azure OpenAI settings. For this example, we use Azure OpenAI."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"openai.api_type = \"azure\"\n",
"openai.api_base = \"YOUR_AZURE_OPENAI_ENDPOINT\"\n",
"openai.api_version = \"2023-05-15\"\n",
"openai.api_key = \"YOUR_AZURE_OPENAI_KEY\"\n",
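"# Query embeddings must come from the same model as the pre-computed dataset embeddings (text-embedding-ada-002)\n",
"model: str = \"text-embedding-ada-002\""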
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure Azure Cognitive Search Vector Store settings\n",
"You can find these values in the Azure portal or by using the [Search Management SDK](https://learn.microsoft.com/rest/api/searchmanagement/).\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"search_service_endpoint: str = \"YOUR_AZURE_SEARCH_ENDPOINT\"\n",
"search_service_api_key: str = \"YOUR_AZURE_SEARCH_ADMIN_KEY\"\n",
"index_name: str = \"azure-cognitive-search-vector-demo\"\n",
"credential = AzureKeyCredential(search_service_api_key)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'vector_database_wikipedia_articles_embedded.zip'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(\"../../data\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>url</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>title_vector</th>\n",
" <th>content_vector</th>\n",
" <th>vector_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>https://simple.wikipedia.org/wiki/April</td>\n",
" <td>April</td>\n",
" <td>April is the fourth month of the year in the J...</td>\n",
" <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
" <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>https://simple.wikipedia.org/wiki/August</td>\n",
" <td>August</td>\n",
" <td>August (Aug.) is the eighth month of the year ...</td>\n",
" <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
" <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>https://simple.wikipedia.org/wiki/Art</td>\n",
" <td>Art</td>\n",
" <td>Art is a creative activity that expresses imag...</td>\n",
" <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
" <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8</td>\n",
" <td>https://simple.wikipedia.org/wiki/A</td>\n",
" <td>A</td>\n",
" <td>A or a is the first letter of the English alph...</td>\n",
" <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
" <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>https://simple.wikipedia.org/wiki/Air</td>\n",
" <td>Air</td>\n",
" <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
" <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
" <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id url title \\\n",
"0 1 https://simple.wikipedia.org/wiki/April April \n",
"1 2 https://simple.wikipedia.org/wiki/August August \n",
"2 6 https://simple.wikipedia.org/wiki/Art Art \n",
"3 8 https://simple.wikipedia.org/wiki/A A \n",
"4 9 https://simple.wikipedia.org/wiki/Air Air \n",
"\n",
" text \\\n",
"0 April is the fourth month of the year in the J... \n",
"1 August (Aug.) is the eighth month of the year ... \n",
"2 Art is a creative activity that expresses imag... \n",
"3 A or a is the first letter of the English alph... \n",
"4 Air refers to the Earth's atmosphere. Air is a... \n",
"\n",
" title_vector \\\n",
"0 [0.001009464613161981, -0.020700545981526375, ... \n",
"1 [0.0009286514250561595, 0.000820168002974242, ... \n",
"2 [0.003393713850528002, 0.0061537534929811954, ... \n",
"3 [0.0153952119871974, -0.013759135268628597, 0.... \n",
"4 [0.02224554680287838, -0.02044147066771984, -0... \n",
"\n",
" content_vector vector_id \n",
"0 [-0.011253940872848034, -0.013491976074874401,... 0 \n",
"1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n",
"2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n",
"3 [0.024894846603274345, -0.022186409682035446, ... 3 \n",
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"article_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv') \n",
" \n",
"# Read vectors from strings back into a list using json.loads \n",
"article_df[\"title_vector\"] = article_df.title_vector.apply(json.loads) \n",
"article_df[\"content_vector\"] = article_df.content_vector.apply(json.loads) \n",
"article_df['vector_id'] = article_df['vector_id'].apply(str) \n",
"article_df.head() \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an index"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"azure-cognitive-search-vector-demo created\n"
]
}
],
"source": [
"# Configure a search index\n",
"index_client = SearchIndexClient(\n",
" endpoint=search_service_endpoint, credential=credential)\n",
"fields = [\n",
" SimpleField(name=\"id\", type=SearchFieldDataType.String),\n",
" SimpleField(name=\"vector_id\", type=SearchFieldDataType.String, key=True),\n",
" SimpleField(name=\"url\", type=SearchFieldDataType.String),\n",
" SearchableField(name=\"title\", type=SearchFieldDataType.String),\n",
" SearchableField(name=\"text\", type=SearchFieldDataType.String),\n",
" SearchField(name=\"title_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n",
" searchable=True, vector_search_dimensions=1536, vector_search_configuration=\"my-vector-config\"),\n",
" SearchField(name=\"content_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n",
" searchable=True, vector_search_dimensions=1536, vector_search_configuration=\"my-vector-config\"),\n",
"]\n",
"\n",
"# Configure the vector search configuration\n",
"vector_search = VectorSearch(\n",
" algorithm_configurations=[\n",
" HnswVectorSearchAlgorithmConfiguration(\n",
" name=\"my-vector-config\",\n",
" kind=\"hnsw\",\n",
" parameters={\n",
" \"m\": 4,\n",
" \"efConstruction\": 400,\n",
" \"efSearch\": 500,\n",
" \"metric\": \"cosine\"\n",
" }\n",
" )\n",
" ]\n",
")\n",
"\n",
"# Optional: configure semantic reranking by passing your title, keywords, and content fields\n",
"semantic_config = SemanticConfiguration(\n",
" name=\"my-semantic-config\",\n",
" prioritized_fields=PrioritizedFields(\n",
" title_field=SemanticField(field_name=\"title\"),\n",
" prioritized_keywords_fields=[SemanticField(field_name=\"url\")],\n",
" prioritized_content_fields=[SemanticField(field_name=\"text\")]\n",
" )\n",
")\n",
"# Create the semantic settings with the configuration\n",
"semantic_settings = SemanticSettings(configurations=[semantic_config])\n",
"\n",
"# Create the index \n",
"index = SearchIndex(name=index_name, fields=fields,\n",
" vector_search=vector_search, semantic_settings=semantic_settings)\n",
"result = index_client.create_or_update_index(index)\n",
"print(f'{result.name} created')\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Insert text and embeddings into vector store\n",
"This notebook uses the Wikipedia articles dataset provided by OpenAI, which ships with pre-computed embeddings. The code below converts the data frame into a list of dictionaries and uploads it to your Azure Search index.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uploaded 25000 documents in total\n"
]
}
],
"source": [
"# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field \n",
"article_df['id'] = article_df['id'].astype(str) \n",
"article_df['vector_id'] = article_df['vector_id'].astype(str) \n",
" \n",
"# Convert the DataFrame to a list of dictionaries \n",
"documents = article_df.to_dict(orient='records') \n",
" \n",
"# Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing \n",
"with SearchIndexingBufferedSender(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)) as batch_client: \n",
" # Add upload actions for all documents \n",
" batch_client.upload_documents(documents=documents) \n",
" \n",
"print(f\"Uploaded {len(documents)} documents in total\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your dataset doesn't already contain pre-computed embeddings, you can create them with the function below, which uses the `openai` Python library. The same call and model are used later to generate query embeddings for vector searches."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March\n",
"Content vector generated\n"
]
}
],
"source": [
"# Example function to generate document embedding \n",
"def generate_document_embeddings(text): \n",
" response = openai.Embedding.create( \n",
" input=text, engine=model) \n",
" embeddings = response['data'][0]['embedding'] \n",
" return embeddings \n",
" \n",
"# Sampling the first document content as an example \n",
"first_document_content = documents[0]['text'] \n",
"print(f\"Content: {first_document_content[:100]}\") \n",
" \n",
"# Generate the content vector using the `generate_document_embeddings` function \n",
"content_vector = generate_document_embeddings(first_document_content) \n",
"print(\"Content vector generated\") \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a vector similarity search"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: Documenta\n",
"Score: 0.8599451\n",
"URL: https://simple.wikipedia.org/wiki/Documenta\n",
"\n",
"Title: Museum of Modern Art\n",
"Score: 0.85260946\n",
"URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art\n",
"\n",
"Title: Expressionism\n",
"Score: 0.85235393\n",
"URL: https://simple.wikipedia.org/wiki/Expressionism\n",
"\n"
]
}
],
"source": [
"# Function to generate query embedding\n",
"def generate_embeddings(text):\n",
" response = openai.Embedding.create(\n",
" input=text, engine=model)\n",
" embeddings = response['data'][0]['embedding']\n",
" return embeddings\n",
"\n",
"# Pure Vector Search\n",
"query = \"modern art in Europe\"\n",
" \n",
"search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)) \n",
"vector = Vector(value=generate_embeddings(query), k=3, fields=\"content_vector\") \n",
" \n",
"results = search_client.search( \n",
" search_text=None, \n",
" vectors=[vector], \n",
" select=[\"title\", \"text\", \"url\"] \n",
")\n",
" \n",
"for result in results: \n",
" print(f\"Title: {result['title']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"URL: {result['url']}\\n\") "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a Hybrid Search"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: Wars of Scottish Independence\n",
"Score: 0.03306011110544205\n",
"URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence\n",
"\n",
"Title: Battle of Bannockburn\n",
"Score: 0.022253260016441345\n",
"URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn\n",
"\n",
"Title: Scottish\n",
"Score: 0.016393441706895828\n",
"URL: https://simple.wikipedia.org/wiki/Scottish\n",
"\n"
]
}
],
"source": [
"# Hybrid Search\n",
"query = \"Famous battles in Scottish history\" \n",
" \n",
"search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)) \n",
"vector = Vector(value=generate_embeddings(query), k=3, fields=\"content_vector\") \n",
" \n",
"results = search_client.search( \n",
" search_text=query, \n",
" vectors=[vector],\n",
" select=[\"title\", \"text\", \"url\"],\n",
" top=3\n",
") \n",
" \n",
"for result in results: \n",
" print(f\"Title: {result['title']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"URL: {result['url']}\\n\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a Hybrid Search with Reranking (powered by Bing)\n",
"[Semantic search](https://learn.microsoft.com/azure/search/semantic-ranking) allows you to leverage deep neural networks from Microsoft Bing to further increase your search accuracy. Additionally, you can get captions, answers, and highlights. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Answer: The<em> Battle of Bannockburn,</em> fought on 23 and 24 June 1314, was an important Scottish victory in the Wars of Scottish Independence. A smaller Scottish army defeated a much larger and better armed English army. Background When King Alexander III of Scotland died in 1286, his heir was his granddaughter Margaret, Maid of Norway.\n",
"Semantic Answer Score: 0.8857421875\n",
"\n",
"Title: Wars of Scottish Independence\n",
"URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence\n",
"Caption: Important Figures Scotland King David II King John Balliol King Robert I the Bruce William Wallace England King Edward I King Edward II King Edward III Battles Battle of Bannockburn The Battle of Bannockburn (23–24 June 1314) was an important Scottish victory. It was the decisive battle in the First War of Scottish Independence.\n",
"\n",
"Title: Battle of Bannockburn\n",
"URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn\n",
"Caption: The Battle of Bannockburn, fought on 23 and 24 June 1314, was an important<em> Scottish</em> victory in the Wars of<em> Scottish</em> Independence. A smaller Scottish army defeated a much larger and better armed English army. Background When King Alexander III of Scotland died in 1286, his heir was his granddaughter Margaret, Maid of Norway.\n",
"\n",
"Title: First War of Scottish Independence\n",
"URL: https://simple.wikipedia.org/wiki/First%20War%20of%20Scottish%20Independence\n",
"Caption: The First War of<em> Scottish Independence</em> lasted from the outbreak of the war in 1296 until the 1328. The Scots were defeated at Dunbar on 27 April 1296. John Balliol abdicated in Montrose castle on 10 July 1296.\n",
"\n"
]
}
],
"source": [
"# Semantic Hybrid Search\n",
"query = \"Famous battles in Scottish history\" \n",
"\n",
"search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)) \n",
"vector = Vector(value=generate_embeddings(query), k=3, fields=\"content_vector\") \n",
"\n",
"results = search_client.search( \n",
" search_text=query, \n",
" vectors=[vector], \n",
" select=[\"title\", \"text\", \"url\"],\n",
" query_type=\"semantic\", query_language=\"en-us\", semantic_configuration_name='my-semantic-config', query_caption=\"extractive\", query_answer=\"extractive\",\n",
" top=3\n",
")\n",
"\n",
"semantic_answers = results.get_answers()\n",
"for answer in semantic_answers:\n",
" if answer.highlights:\n",
" print(f\"Semantic Answer: {answer.highlights}\")\n",
" else:\n",
" print(f\"Semantic Answer: {answer.text}\")\n",
" print(f\"Semantic Answer Score: {answer.score}\\n\")\n",
"\n",
"for result in results:\n",
" print(f\"Title: {result['title']}\")\n",
" print(f\"URL: {result['url']}\")\n",
" captions = result[\"@search.captions\"]\n",
" if captions:\n",
" caption = captions[0]\n",
" if caption.highlights:\n",
" print(f\"Caption: {caption.highlights}\\n\")\n",
" else:\n",
" print(f\"Caption: {caption.text}\\n\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}