{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Azure Cognitive Search as a vector database for OpenAI embeddings" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides step-by-step instructions on using Azure Cognitive Search as a vector database with OpenAI embeddings. Azure Cognitive Search (formerly known as \"Azure Search\") is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.\n", "\n", "## Prerequisites\n", "To complete this exercise, you need the following:\n", "- [Azure Cognitive Search Service](https://learn.microsoft.com/azure/search/)\n", "- [OpenAI API key](https://platform.openai.com/account/api-keys) or [Azure OpenAI credentials](https://learn.microsoft.com/azure/cognitive-services/openai/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install wget\n", "! 
pip install azure-search-documents --pre" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Import required libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import json\n", "import zipfile\n", "\n", "import openai\n", "import pandas as pd\n", "import wget\n", "from azure.core.credentials import AzureKeyCredential\n", "from azure.search.documents import SearchClient\n", "from azure.search.documents import SearchIndexingBufferedSender\n", "from azure.search.documents.indexes import SearchIndexClient\n", "from azure.search.documents.models import Vector\n", "from azure.search.documents.indexes.models import (\n", "    SearchIndex,\n", "    SearchField,\n", "    SearchFieldDataType,\n", "    SimpleField,\n", "    SearchableField,\n", "    SemanticConfiguration,\n", "    PrioritizedFields,\n", "    SemanticField,\n", "    SemanticSettings,\n", "    VectorSearch,\n", "    HnswVectorSearchAlgorithmConfiguration,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Configure OpenAI settings\n", "\n", "Configure your OpenAI or Azure OpenAI settings. For this example, we use Azure OpenAI." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "openai.api_type = \"azure\"\n", "openai.api_base = \"YOUR_AZURE_OPENAI_ENDPOINT\"\n", "openai.api_version = \"2023-05-15\"\n", "openai.api_key = \"YOUR_AZURE_OPENAI_KEY\"\n", "model: str = \"text-embedding-ada-002\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Azure Cognitive Search Vector Store settings\n", "You can find these values in the Azure portal or by using the [Search Management SDK](https://learn.microsoft.com/rest/api/searchmanagement/).\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "search_service_endpoint: str = \"YOUR_AZURE_SEARCH_ENDPOINT\"\n", "search_service_api_key: str = \"YOUR_AZURE_SEARCH_ADMIN_KEY\"\n", "index_name: str = \"azure-cognitive-search-vector-demo\"\n", "credential = AzureKeyCredential(search_service_api_key)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'vector_database_wikipedia_articles_embedded.zip'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n", "\n", "# The file is ~700 MB, so the download will take some time\n", "wget.download(embeddings_url)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Extract the archive into the shared data directory\n", "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\", \"r\") as zip_ref:\n", "    zip_ref.extractall(\"../../data\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |