From 5f552669f7d3f0af8bb34690c305d06600808753 Mon Sep 17 00:00:00 2001 From: Max Reid <164893837+maxreid-openai@users.noreply.github.com> Date: Thu, 25 Jul 2024 15:12:35 -0400 Subject: [PATCH] initial commit for Azure RAG cookbook (#1272) Co-authored-by: juston <96567547+justonf@users.noreply.github.com> --- .../chatgpt/rag-quickstart/azure/.funcignore | 1 + .../chatgpt/rag-quickstart/azure/.gitignore | 50 + .../azure/.vscode/extensions.json | 5 + ...Functions_and_GPT_Actions_in_ChatGPT.ipynb | 1214 +++++++++++++++++ .../rag-quickstart/azure/function_app.py | 153 +++ .../chatgpt/rag-quickstart/azure/host.json | 15 + .../rag-quickstart/azure/requirements.txt | 17 + .../vector_similarity_search/function.json | 19 + examples/data/oai_docs/authentication.txt | 34 + examples/data/oai_docs/batch.txt | 333 +++++ examples/data/oai_docs/changelog.txt | 393 ++++++ .../oai_docs/crawl-website-embeddings.txt | 560 ++++++++ examples/data/oai_docs/curl-setup.txt | 118 ++ examples/data/oai_docs/data-retrieval.txt | 75 + examples/data/oai_docs/deprecations.txt | 162 +++ examples/data/oai_docs/embeddings.txt | 510 +++++++ examples/data/oai_docs/error-codes.txt | 244 ++++ examples/data/oai_docs/fine-tuning.txt | 827 +++++++++++ .../function-calling-run-example--polling.txt | 122 ++ ...unction-calling-run-example--streaming.txt | 124 ++ examples/data/oai_docs/function-calling.txt | 252 ++++ examples/data/oai_docs/getting-started.txt | 334 +++++ examples/data/oai_docs/gptbot.txt | 39 + examples/data/oai_docs/hackathons.txt | 62 + examples/data/oai_docs/how-it-works.txt | 577 ++++++++ examples/data/oai_docs/images-node-tips.txt | 95 ++ examples/data/oai_docs/images-python-tips.txt | 69 + examples/data/oai_docs/images.txt | 230 ++++ examples/data/oai_docs/index.txt | 71 + examples/data/oai_docs/introduction.txt | 75 + .../data/oai_docs/latency-optimization.txt | 452 ++++++ examples/data/oai_docs/libraries.txt | 198 +++ .../oai_docs/meeting-minutes-tutorial.txt | 261 ++++ examples/data/oai_docs/migration.txt | 332 +++++ examples/data/oai_docs/models.txt | 228 ++++ examples/data/oai_docs/moderation.txt | 114 ++ examples/data/oai_docs/node-setup.txt | 128 ++ .../data/oai_docs/optimizing-llm-accuracy.txt | 340 +++++ .../data/oai_docs/overview-with-streaming.txt | 85 ++ .../oai_docs/overview-without-streaming.txt | 78 ++ .../oai_docs/production-best-practices.txt | 155 +++ examples/data/oai_docs/production.txt | 34 + examples/data/oai_docs/prompt-engineering.txt | 578 ++++++++ examples/data/oai_docs/python-setup.txt | 205 +++ examples/data/oai_docs/release-notes.txt | 79 ++ .../data/oai_docs/safety-best-practices.txt | 84 ++ examples/data/oai_docs/speech-to-text.txt | 353 +++++ .../data/oai_docs/supported-countries.txt | 195 +++ examples/data/oai_docs/text-generation.txt | 565 ++++++++ examples/data/oai_docs/text-to-speech.txt | 157 +++ examples/data/oai_docs/tier-five.txt | 20 + examples/data/oai_docs/tier-four.txt | 18 + examples/data/oai_docs/tier-free.txt | 14 + examples/data/oai_docs/tier-one.txt | 18 + examples/data/oai_docs/tier-three.txt | 18 + examples/data/oai_docs/tier-two.txt | 18 + .../data/oai_docs/tool-code-interpreter.txt | 358 +++++ examples/data/oai_docs/tool-file-search.txt | 616 +++++++++ .../data/oai_docs/tool-function-calling.txt | 223 +++ examples/data/oai_docs/vision.txt | 446 ++++++ examples/data/oai_docs/whats-new.txt | 21 + images/azure-rag-architecture.png | Bin 0 -> 166839 bytes images/azure-rag-quickstart-gpt.png | Bin 0 -> 195141 bytes registry.yaml | 20 +- 64 files changed, 13182 
insertions(+), 9 deletions(-) create mode 100644 examples/chatgpt/rag-quickstart/azure/.funcignore create mode 100644 examples/chatgpt/rag-quickstart/azure/.gitignore create mode 100644 examples/chatgpt/rag-quickstart/azure/.vscode/extensions.json create mode 100644 examples/chatgpt/rag-quickstart/azure/Azure_AI_Search_with_Azure_Functions_and_GPT_Actions_in_ChatGPT.ipynb create mode 100644 examples/chatgpt/rag-quickstart/azure/function_app.py create mode 100644 examples/chatgpt/rag-quickstart/azure/host.json create mode 100644 examples/chatgpt/rag-quickstart/azure/requirements.txt create mode 100644 examples/chatgpt/rag-quickstart/azure/vector_similarity_search/function.json create mode 100644 examples/data/oai_docs/authentication.txt create mode 100644 examples/data/oai_docs/batch.txt create mode 100644 examples/data/oai_docs/changelog.txt create mode 100644 examples/data/oai_docs/crawl-website-embeddings.txt create mode 100644 examples/data/oai_docs/curl-setup.txt create mode 100644 examples/data/oai_docs/data-retrieval.txt create mode 100644 examples/data/oai_docs/deprecations.txt create mode 100644 examples/data/oai_docs/embeddings.txt create mode 100644 examples/data/oai_docs/error-codes.txt create mode 100644 examples/data/oai_docs/fine-tuning.txt create mode 100644 examples/data/oai_docs/function-calling-run-example--polling.txt create mode 100644 examples/data/oai_docs/function-calling-run-example--streaming.txt create mode 100644 examples/data/oai_docs/function-calling.txt create mode 100644 examples/data/oai_docs/getting-started.txt create mode 100644 examples/data/oai_docs/gptbot.txt create mode 100644 examples/data/oai_docs/hackathons.txt create mode 100644 examples/data/oai_docs/how-it-works.txt create mode 100644 examples/data/oai_docs/images-node-tips.txt create mode 100644 examples/data/oai_docs/images-python-tips.txt create mode 100644 examples/data/oai_docs/images.txt create mode 100644 examples/data/oai_docs/index.txt create mode 100644 examples/data/oai_docs/introduction.txt create mode 100644 examples/data/oai_docs/latency-optimization.txt create mode 100644 examples/data/oai_docs/libraries.txt create mode 100644 examples/data/oai_docs/meeting-minutes-tutorial.txt create mode 100644 examples/data/oai_docs/migration.txt create mode 100644 examples/data/oai_docs/models.txt create mode 100644 examples/data/oai_docs/moderation.txt create mode 100644 examples/data/oai_docs/node-setup.txt create mode 100644 examples/data/oai_docs/optimizing-llm-accuracy.txt create mode 100644 examples/data/oai_docs/overview-with-streaming.txt create mode 100644 examples/data/oai_docs/overview-without-streaming.txt create mode 100644 examples/data/oai_docs/production-best-practices.txt create mode 100644 examples/data/oai_docs/production.txt create mode 100644 examples/data/oai_docs/prompt-engineering.txt create mode 100644 examples/data/oai_docs/python-setup.txt create mode 100644 examples/data/oai_docs/release-notes.txt create mode 100644 examples/data/oai_docs/safety-best-practices.txt create mode 100644 examples/data/oai_docs/speech-to-text.txt create mode 100644 examples/data/oai_docs/supported-countries.txt create mode 100644 examples/data/oai_docs/text-generation.txt create mode 100644 examples/data/oai_docs/text-to-speech.txt create mode 100644 examples/data/oai_docs/tier-five.txt create mode 100644 examples/data/oai_docs/tier-four.txt create mode 100644 examples/data/oai_docs/tier-free.txt create mode 100644 examples/data/oai_docs/tier-one.txt create mode 100644 
examples/data/oai_docs/tier-three.txt create mode 100644 examples/data/oai_docs/tier-two.txt create mode 100644 examples/data/oai_docs/tool-code-interpreter.txt create mode 100644 examples/data/oai_docs/tool-file-search.txt create mode 100644 examples/data/oai_docs/tool-function-calling.txt create mode 100644 examples/data/oai_docs/vision.txt create mode 100644 examples/data/oai_docs/whats-new.txt create mode 100644 images/azure-rag-architecture.png create mode 100644 images/azure-rag-quickstart-gpt.png diff --git a/examples/chatgpt/rag-quickstart/azure/.funcignore b/examples/chatgpt/rag-quickstart/azure/.funcignore new file mode 100644 index 00000000..8b137891 --- /dev/null +++ b/examples/chatgpt/rag-quickstart/azure/.funcignore @@ -0,0 +1 @@ + diff --git a/examples/chatgpt/rag-quickstart/azure/.gitignore b/examples/chatgpt/rag-quickstart/azure/.gitignore new file mode 100644 index 00000000..f4e5ac24 --- /dev/null +++ b/examples/chatgpt/rag-quickstart/azure/.gitignore @@ -0,0 +1,50 @@ +bin +obj +csx +.vs +edge +Publish + +*.user +*.suo +*.cscfg +*.Cache +project.lock.json + +/packages +/TestResults + +/tools/NuGet.exe +/App_Data +/secrets +/data +.secrets +appsettings.json +local.settings.json + +node_modules +dist +vector_database_wikipedia_articles_embedded + +# Local python packages +.python_packages/ + +# Python Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class + +# Azurite artifacts +__blobstorage__ +__queuestorage__ +__azurite_db*__.json +vector_database_wikipedia_articles_embedded \ No newline at end of file diff --git a/examples/chatgpt/rag-quickstart/azure/.vscode/extensions.json b/examples/chatgpt/rag-quickstart/azure/.vscode/extensions.json new file mode 100644 index 00000000..dde673dc --- /dev/null +++ b/examples/chatgpt/rag-quickstart/azure/.vscode/extensions.json @@ -0,0 +1,5 @@ +{ + "recommendations": [ + "ms-azuretools.vscode-azurefunctions" + ] +} \ No newline at end of file diff --git a/examples/chatgpt/rag-quickstart/azure/Azure_AI_Search_with_Azure_Functions_and_GPT_Actions_in_ChatGPT.ipynb b/examples/chatgpt/rag-quickstart/azure/Azure_AI_Search_with_Azure_Functions_and_GPT_Actions_in_ChatGPT.ipynb new file mode 100644 index 00000000..dd3db639 --- /dev/null +++ b/examples/chatgpt/rag-quickstart/azure/Azure_AI_Search_with_Azure_Functions_and_GPT_Actions_in_ChatGPT.ipynb @@ -0,0 +1,1214 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Azure AI Search as a vector database + Azure Functions for GPT integration in ChatGPT" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook provides step-by-step instructions on using Azure AI Search (formerly Azure Cognitive Search) as a vector database with OpenAI embeddings, then creating an Azure Function on top to plug into a Custom GPT in ChatGPT. \n", + "\n", + "This can be a solution for customers looking to set up RAG infrastructure contained within Azure, and expose it as an endpoint to integrate with other platforms such as ChatGPT.\n", + "\n", + "Azure AI Search is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications. 
\n", + "\n", + "Azure Functions is a serverless compute service that runs event-driven code, automatically managing infrastructure, scaling, and integrating with other Azure services." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites:\n", + "For the purposes of this exercise you must have the following:\n", + "- Azure user with permission to create [Azure AI Search Service](https://learn.microsoft.com/azure/search/) and Azure Function Apps\n", + "- Azure subscription ID and a resource group.\n", + "- [OpenAI Key](https://platform.openai.com/account/api-keys) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Architecture\n", + "Below is a diagram of the architecture of this solution, which we'll walk through step-by-step.\n", + "\n", + "![azure-rag-architecture.png](../../../../images/azure-rag-architecture.png)\n", + "\n", + "\n", + "> Note: This architecture pattern of vector data store + serverless functions can be extrapolated to other vector data stores. For example, if you would want to use something like Postgres within Azure, you'd change the [Configure Azure AI Search Settings](#configure-azure-ai-search-settings) step to set-up the requirements for Postgres, you'd modify the [Create Azure AI Vector Search](#create-azure-ai-vector-search) to create the database and table in Postgres instead, and you'd update the `function_app.py` code in this repository to query Postgres instead of Azure AI Search. The data preparation and creation of the Azure Function would stay consistent. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Table of Contents:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. **[Setup of Environment](#set-up-environment)**\n", + " Setup environment by installing and importing the required libraries and configuring our Azure settings. Includes:\n", + " - [Install and Import Required Libraries](#install-and-import-required-libraries)\n", + " - [Configure OpenAI Settings](#configure-openai-settings)\n", + " - [Configure Azure AI Search Settings](#configure-azure-ai-search-settings)\n", + " \n", + "\n", + "2. **[Prepare Data](#prepare-data)** Prepare the data for uploading by embedding the documents, as well as capturing additional metadata. We will use a subset of OpenAI's docs as example data for this.\n", + " \n", + "3. **[Create Azure AI Vector Search](#create-azure-ai-vector-search)** Create an Azure AI Vector Search and upload the data we've prepared. Includes:\n", + " - [Create Index](#create-index): Steps to create an index in Azure AI Search.\n", + " - [Upload Data](#upload-data): Instructions to upload data to Azure AI Search.\n", + " - [Test Search](#test-search): Steps to test the search functionality.\n", + " \n", + "4. **[Create Azure Function](#create-azure-function)** Create an Azure Function to interact with the Azure AI Vector Search. Includes:\n", + " - [Create Storage Account](#create-storage-account): Steps to create a storage account for the Azure Function.\n", + " - [Create Function App](#create-function-app): Instructions to create a function app in Azure.\n", + " \n", + "5. **[Input in a Custom GPT in ChatGPT](#input-in-a-custom-gpt-in-chatgpt)** Integrate the Azure Function with a Custom GPT in ChatGPT. 
Includes:\n", + " - [Create OpenAPI Spec](#create-openapi-spec): Steps to create an OpenAPI specification for the Azure Function.\n", + " - [Create GPT Instructions](#create-gpt-instructions): Instructions to create GPT-specific instructions for the integration.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set up environment\n", + "We'll set up our environment by importing the required libraries and configuring our Azure settings." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install and import required libraries\n", + "We categorize these libraries into standard Python libraries, third-party libraries, and Azure-related libraries for readability." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! pip install -q wget\n", + "! pip install -q azure-search-documents \n", + "! pip install -q azure-identity\n", + "! pip install -q openai\n", + "! pip install -q azure-mgmt-search\n", + "! pip install -q pandas\n", + "! pip install -q azure-mgmt-resource \n", + "! pip install -q azure-mgmt-storage\n", + "! pip install -q pyperclip\n", + "! pip install -q PyPDF2\n", + "! pip install -q tiktoken" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Standard Libraries\n", + "import json \n", + "import os\n", + "import platform\n", + "import subprocess\n", + "import csv\n", + "from itertools import islice\n", + "import uuid\n", + "import shutil\n", + "import concurrent.futures\n", + "\n", + "# Third-Party Libraries\n", + "import pandas as pd\n", + "from PyPDF2 import PdfReader\n", + "import tiktoken\n", + "from dotenv import load_dotenv\n", + "import pyperclip\n", + "\n", + "# OpenAI Libraries (note we use OpenAI directly here, but you can replace with Azure OpenAI as needed)\n", + "from openai import OpenAI\n", + "\n", + "# Azure Identity and Credentials\n", + "from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n", + "from azure.core.credentials import AzureKeyCredential \n", + "from azure.core.exceptions import HttpResponseError\n", + "\n", + "# Azure Search Documents\n", + "from azure.search.documents import SearchClient, SearchIndexingBufferedSender \n", + "from azure.search.documents.indexes import SearchIndexClient \n", + "from azure.search.documents.models import (\n", + " VectorizedQuery\n", + ")\n", + "from azure.search.documents.indexes.models import (\n", + " HnswAlgorithmConfiguration,\n", + " HnswParameters,\n", + " SearchField,\n", + " SearchableField,\n", + " SearchFieldDataType,\n", + " SearchIndex,\n", + " SimpleField,\n", + " VectorSearch,\n", + " VectorSearchAlgorithmKind,\n", + " VectorSearchAlgorithmMetric,\n", + " VectorSearchProfile,\n", + ")\n", + "\n", + "# Azure Management Clients\n", + "from azure.mgmt.search import SearchManagementClient\n", + "from azure.mgmt.resource import ResourceManagementClient, SubscriptionClient\n", + "from azure.mgmt.storage import StorageManagementClient" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configure OpenAI settings\n", + "\n", + "Before going through this section, make sure you have your OpenAI API key.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "openai_api_key = os.environ.get(\"OPENAI_API_KEY\", \"\") # Saving this as a variable to reference in 
function app in later step\n", + "openai_client = OpenAI(api_key=openai_api_key)\n", + "embeddings_model = \"text-embedding-3-small\" # We'll use this by default, but you can change to your text-embedding-3-large if desired" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configure Azure AI Search Settings\n", + "You can locate your Azure AI Search service details in the Azure Portal or programmatically via the [Search Management SDK](https://learn.microsoft.com/rest/api/searchmanagement/).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Prerequisites:\n", + "- Subscription ID from Azure\n", + "- Resource Group name from Azure\n", + "- Region in Azure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Update the below with your values\n", + "subscription_id=\"\"\n", + "resource_group=\"\"\n", + "\n", + "## Make sure to choose a region that supports the proper products. We've defaulted to \"eastus\" below. https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/#products-by-region_tab5\n", + "region = \"eastus\"\n", + "credential = InteractiveBrowserCredential()\n", + "subscription_client = SubscriptionClient(credential)\n", + "subscription = next(subscription_client.subscriptions.list())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Create and Configure Azure AI Search Service\n", + "Below we'll generate a unique name for the search service, set up the service properties, and create the search service." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize the SearchManagementClient with the provided credentials and subscription ID\n", + "search_management_client = SearchManagementClient(\n", + " credential=credential,\n", + " subscription_id=subscription_id,\n", + ")\n", + "\n", + "# Generate a unique name for the search service using UUID, but you can change this if you'd like.\n", + "generated_uuid = str(uuid.uuid4())\n", + "search_service_name = \"search-service-gpt-demo\" + generated_uuid\n", + "## The below is the default endpoint structure that is created when you create a search service. This may differ based on your Azure settings.\n", + "search_service_endpoint = 'https://'+search_service_name+'.search.windows.net'\n", + "\n", + "# Create or update the search service with the specified parameters\n", + "response = search_management_client.services.begin_create_or_update(\n", + " resource_group_name=resource_group,\n", + " search_service_name=search_service_name,\n", + " service={\n", + " \"location\": region,\n", + " \"properties\": {\"hostingMode\": \"default\", \"partitionCount\": 1, \"replicaCount\": 1},\n", + " # We are using the free pricing tier for this demo. 
You are only allowed one free search service per subscription.\n", + " \"sku\": {\"name\": \"free\"},\n", + " \"tags\": {\"app-name\": \"Search service demo\"},\n", + " },\n", + ").result()\n", + "\n", + "# Convert the response to a dictionary and then to a pretty-printed JSON string\n", + "response_dict = response.as_dict()\n", + "response_json = json.dumps(response_dict, indent=4)\n", + "\n", + "print(response_json)\n", + "print(\"Search Service Name: \" + search_service_name)\n", + "print(\"Search Service Endpoint: \" + search_service_endpoint)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Get the Search Service API Key\n", + "Now that we have the search service up and running, we need the [Search Service API Key](https://learn.microsoft.com/en-us/azure/search/search-security-api-keys?tabs=rest-use,portal-find,portal-query), which we'll use to initiate the index creation, and later to execute the search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve the admin keys for the search service\n", + "try:\n", + " response = search_management_client.admin_keys.get(\n", + " resource_group_name=resource_group,\n", + " search_service_name=search_service_name,\n", + " )\n", + " # Extract the primary API key from the response and save as a variable to be used later\n", + " search_service_api_key = response.primary_key\n", + " print(\"Successfully retrieved the API key.\")\n", + "except Exception as e:\n", + " print(f\"Failed to retrieve the API key: {e}\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Prepare data\n", + "We're going to embed and store a few pages of the OpenAI docs in the oai_docs folder. We'll first embed each, add it to a CSV, and then use that CSV to upload to the index.\n", + "\n", + "In order to handle longer text files beyond the context of 8191 tokens, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).\n", + "\n", + "We will take a function from the Python itertools recipes that breaks up a sequence into chunks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def batched(iterable, n):\n", + " \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n", + " # batched('ABCDEFG', 3) --> ABC DEF G\n", + " if n < 1:\n", + " raise ValueError('n must be at least one')\n", + " it = iter(iterable)\n", + " while (batch := tuple(islice(it, n))):\n", + " yield batch\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we define a function that encodes a string into tokens and then breaks it up into chunks. We'll use tiktoken, a fast open-source tokenizer by OpenAI.\n", + "\n", + "To read more about counting tokens with tiktoken, check out [this cookbook](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken). \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def chunked_tokens(text, chunk_length, encoding_name='cl100k_base'):\n", + " # Get the encoding object for the specified encoding name. tiktoken supports several encodings; 'cl100k_base' is the one used by OpenAI's recent models, including the text-embedding-3 family used in this notebook.\n", + " encoding = tiktoken.get_encoding(encoding_name)\n", + " # Encode the input text into tokens\n", + " tokens = encoding.encode(text)\n", + " # Create an iterator that yields chunks of tokens of the specified length\n", + " chunks_iterator = batched(tokens, chunk_length)\n", + " # Yield each chunk from the iterator\n", + " yield from chunks_iterator" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. It returns the list of chunk embeddings along with the corresponding text chunks; if you want a single vector per document instead, you can combine the chunk embeddings, for example with a weighted average, as sketched right after the function below.\n", + "\n", + "> Note: there are other, more sophisticated techniques you can take here, including:\n", + "> - using GPT-4o to capture images/chart descriptions for embedding.\n", + "> - keeping text overlap between the chunks to minimize cutting off important context.\n", + "> - chunking based on paragraphs or sections.\n", + "> - adding more descriptive metadata about each article." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## Change the below based on your model. The below is for the latest embeddings models from OpenAI, so you can leave as is unless you are using a different embedding model.\n", + "EMBEDDING_CTX_LENGTH = 8191\n", + "EMBEDDING_ENCODING='cl100k_base'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_embeddings(text, model):\n", + " # Generate embeddings for the provided text using the specified model\n", + " embeddings_response = openai_client.embeddings.create(model=model, input=text)\n", + " # Extract the embedding data from the response\n", + " embedding = embeddings_response.data[0].embedding\n", + " return embedding\n", + "\n", + "def len_safe_get_embedding(text, model=embeddings_model, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING):\n", + " # Initialize lists to store embeddings and corresponding text chunks\n", + " chunk_embeddings = []\n", + " chunk_texts = []\n", + " # Iterate over chunks of tokens from the input text\n", + " for chunk in chunked_tokens(text, chunk_length=max_tokens, encoding_name=encoding_name):\n", + " # Generate embeddings for each chunk and append to the list\n", + " chunk_embeddings.append(generate_embeddings(chunk, model=model))\n", + " # Decode the chunk back to text and append to the list\n", + " chunk_texts.append(tiktoken.get_encoding(encoding_name).decode(chunk))\n", + " # Return the list of chunk embeddings and the corresponding text chunks\n", + " return chunk_embeddings, chunk_texts" + ] + },
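 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a quick illustration of the averaging option mentioned above, below is a minimal sketch of combining the per-chunk embeddings into a single vector with a length-weighted average. This helper is illustrative only and is not used by the rest of this notebook or by the deployed function; it assumes numpy is available (it is installed alongside pandas)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "def combined_embedding(text, model=embeddings_model):\n", + " # Embed each chunk, then weight each chunk embedding by the chunk's share of the text\n", + " chunk_embeddings, chunk_texts = len_safe_get_embedding(text, model=model)\n", + " weights = np.array([len(chunk) for chunk in chunk_texts], dtype=float)\n", + " average = np.average(np.array(chunk_embeddings), axis=0, weights=weights)\n", + " # Re-normalize to unit length so it behaves like the per-chunk embeddings under cosine similarity\n", + " return (average / np.linalg.norm(average)).tolist()" + ] + },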
You can change these as needed based on your use case.\n", + "categories = ['authentication','models','techniques','tools','setup','billing_limits','other']\n", + "\n", + "def categorize_text(text, categories):\n", + " # Create a prompt for categorization\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": f\"\"\"You are an expert in LLMs, and you will be given text that corresponds to an article in OpenAI's documentation.\n", + " Categorize the document into one of these categories: {', '.join(categories)}. Only respond with the category name and nothing else.\"\"\"},\n", + " {\"role\": \"user\", \"content\": text}\n", + " ]\n", + " try:\n", + " # Call the OpenAI API to categorize the text\n", + " response = openai_client.chat.completions.create(\n", + " model=\"gpt-4o\",\n", + " messages=messages\n", + " )\n", + " # Extract the category from the response\n", + " category = response.choices[0].message.content\n", + " return category\n", + " except Exception as e:\n", + " print(f\"Error categorizing text: {str(e)}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we can define some helper functions to process the .txt files in the oai_docs folder within the data folder. You can use this with your own data as well and supports both .txt and .pdf files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def extract_text_from_pdf(pdf_path):\n", + " # Initialize the PDF reader\n", + " reader = PdfReader(pdf_path)\n", + " text = \"\"\n", + " # Iterate through each page in the PDF and extract text\n", + " for page in reader.pages:\n", + " text += page.extract_text()\n", + " return text\n", + "\n", + "def process_file(file_path, idx, categories, embeddings_model):\n", + " file_name = os.path.basename(file_path)\n", + " print(f\"Processing file {idx + 1}: {file_name}\")\n", + " \n", + " # Read text content from .txt files\n", + " if file_name.endswith('.txt'):\n", + " with open(file_path, 'r', encoding='utf-8') as file:\n", + " text = file.read()\n", + " # Extract text content from .pdf files\n", + " elif file_name.endswith('.pdf'):\n", + " text = extract_text_from_pdf(file_path)\n", + " \n", + " title = file_name\n", + " # Generate embeddings for the title\n", + " title_vectors, title_text = len_safe_get_embedding(title, embeddings_model)\n", + " print(f\"Generated title embeddings for {file_name}\")\n", + " \n", + " # Generate embeddings for the content\n", + " content_vectors, content_text = len_safe_get_embedding(text, embeddings_model)\n", + " print(f\"Generated content embeddings for {file_name}\")\n", + " \n", + " category = categorize_text(' '.join(content_text), categories)\n", + " print(f\"Categorized {file_name} as {category}\")\n", + " \n", + " # Prepare the data to be appended\n", + " data = []\n", + " for i, content_vector in enumerate(content_vectors):\n", + " data.append({\n", + " \"id\": f\"{idx}_{i}\",\n", + " \"vector_id\": f\"{idx}_{i}\",\n", + " \"title\": title_text[0],\n", + " \"text\": content_text[i],\n", + " \"title_vector\": json.dumps(title_vectors[0]), # Assuming title is short and has only one chunk\n", + " \"content_vector\": json.dumps(content_vector),\n", + " \"category\": category\n", + " })\n", + " print(f\"Appended data for chunk {i + 1}/{len(content_vectors)} of {file_name}\")\n", + " \n", + " return data\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll now use this helper function to 
process our OpenAI documentation. Feel free to update this to use your own data by changing `folder_name` in the code below.\n", + "\n", + "Note that this will process the documents in the chosen folder concurrently, so this should take <30 seconds if using txt files, and slightly longer if using PDFs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## Customize the location below if you are using different data besides the OpenAI documentation. Note that if you are using a different dataset, you will need to update the categories list as well.\n", + "folder_name = \"../../../data/oai_docs\"\n", + "\n", + "files = [os.path.join(folder_name, f) for f in os.listdir(folder_name) if f.endswith('.txt') or f.endswith('.pdf')]\n", + "data = []\n", + "\n", + "# Process each file concurrently\n", + "with concurrent.futures.ThreadPoolExecutor() as executor:\n", + " futures = {executor.submit(process_file, file_path, idx, categories, embeddings_model): idx for idx, file_path in enumerate(files)}\n", + " for future in concurrent.futures.as_completed(futures):\n", + " try:\n", + " result = future.result()\n", + " data.extend(result)\n", + " except Exception as e:\n", + " print(f\"Error processing file: {str(e)}\")\n", + "\n", + "# Write the data to a CSV file\n", + "csv_file = os.path.join(\"..\", \"embedded_data.csv\")\n", + "with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:\n", + " fieldnames = [\"id\", \"vector_id\", \"title\", \"text\", \"title_vector\", \"content_vector\",\"category\"]\n", + " writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n", + " writer.writeheader()\n", + " for row in data:\n", + " writer.writerow(row)\n", + " print(f\"Wrote row with id {row['id']} to CSV\")\n", + "\n", + "# Convert the CSV file to a DataFrame\n", + "article_df = pd.read_csv(\"../embedded_data.csv\")\n", + "# Read vectors from strings back into a list using json.loads\n", + "article_df[\"title_vector\"] = article_df.title_vector.apply(json.loads)\n", + "article_df[\"content_vector\"] = article_df.content_vector.apply(json.loads)\n", + "article_df[\"vector_id\"] = article_df[\"vector_id\"].apply(str)\n", + "article_df[\"category\"] = article_df[\"category\"].apply(str)\n", + "article_df.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now have an `embedded_data.csv` file with seven columns that we can upload to our vector database! " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create Azure AI Vector Search" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create index\n", + "We'll define and create a search index using the `SearchIndexClient` from the Azure AI Search Python SDK. The index incorporates both vector search and hybrid search capabilities. For more details, visit Microsoft's documentation on how to [Create a Vector Index](https://learn.microsoft.com/azure/search/vector-search-how-to-create-index?tabs=config-2023-11-01%2Crest-2023-11-01%2Cpush%2Cportal-check-index)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "index_name = \"azure-ai-search-openai-cookbook-demo\"\n", + "# index_name = \"\" # Uncomment and set this to choose your own index name\n", + "\n", + "index_client = SearchIndexClient(\n", + " endpoint=search_service_endpoint, credential=AzureKeyCredential(search_service_api_key)\n", + ")\n", + "# Define the fields for the index. 
Update these based on your data.\n", + "# Each field represents a column in the search index\n", + "fields = [\n", + " SimpleField(name=\"id\", type=SearchFieldDataType.String), # Simple string field for document ID\n", + " SimpleField(name=\"vector_id\", type=SearchFieldDataType.String, key=True), # Key field for the index\n", + " # SimpleField(name=\"url\", type=SearchFieldDataType.String), # URL field (commented out)\n", + " SearchableField(name=\"title\", type=SearchFieldDataType.String), # Searchable field for document title\n", + " SearchableField(name=\"text\", type=SearchFieldDataType.String), # Searchable field for document text\n", + " SearchField(\n", + " name=\"title_vector\",\n", + " type=SearchFieldDataType.Collection(SearchFieldDataType.Single), # Collection of single values for title vector\n", + " vector_search_dimensions=1536, # Number of dimensions in the vector\n", + " vector_search_profile_name=\"my-vector-config\", # Profile name for vector search configuration\n", + " ),\n", + " SearchField(\n", + " name=\"content_vector\",\n", + " type=SearchFieldDataType.Collection(SearchFieldDataType.Single), # Collection of single values for content vector\n", + " vector_search_dimensions=1536, # Number of dimensions in the vector\n", + " vector_search_profile_name=\"my-vector-config\", # Profile name for vector search configuration\n", + " ),\n", + " SearchableField(name=\"category\", type=SearchFieldDataType.String, filterable=True), # Searchable field for document category\n", + "]\n", + "\n", + "# This configuration defines the algorithm and parameters for vector search\n", + "vector_search = VectorSearch(\n", + " algorithms=[\n", + " HnswAlgorithmConfiguration(\n", + " name=\"my-hnsw\", # Name of the HNSW algorithm configuration\n", + " kind=VectorSearchAlgorithmKind.HNSW, # Type of algorithm\n", + " parameters=HnswParameters(\n", + " m=4, # Number of bi-directional links created for every new element\n", + " ef_construction=400, # Size of the dynamic list for the nearest neighbors during construction\n", + " ef_search=500, # Size of the dynamic list for the nearest neighbors during search\n", + " metric=VectorSearchAlgorithmMetric.COSINE, # Distance metric used for the search\n", + " ),\n", + " )\n", + " ],\n", + " profiles=[\n", + " VectorSearchProfile(\n", + " name=\"my-vector-config\", # Name of the vector search profile\n", + " algorithm_configuration_name=\"my-hnsw\", # Reference to the algorithm configuration\n", + " )\n", + " ],\n", + ")\n", + "\n", + "# Create the search index with the vector search configuration\n", + "# This combines all the configurations into a single search index\n", + "index = SearchIndex(\n", + " name=index_name, # Name of the index\n", + " fields=fields, # Fields defined for the index\n", + " vector_search=vector_search # Vector search configuration\n", + "\n", + ")\n", + "\n", + "# Create or update the index\n", + "# This sends the index definition to the Azure Search service\n", + "result = index_client.create_index(index)\n", + "print(f\"{result.name} created\") # Output the name of the created index" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Upload Data\n", + "\n", + "Now we'll upload the articles from above that we've stored in `embedded_data.csv` from a pandas DataFrame to an Azure AI Search index. 
For a detailed guide on data import strategies and best practices, refer to [Data Import in Azure AI Search](https://learn.microsoft.com/azure/search/search-what-is-data-import).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field\n", + "article_df[\"id\"] = article_df[\"id\"].astype(str)\n", + "article_df[\"vector_id\"] = article_df[\"vector_id\"].astype(str)\n", + "\n", + "# Convert the DataFrame to a list of dictionaries\n", + "documents = article_df.to_dict(orient=\"records\")\n", + "\n", + "# Log the number of documents to be uploaded\n", + "print(f\"Number of documents to upload: {len(documents)}\")\n", + "\n", + "# Create a SearchIndexingBufferedSender\n", + "batch_client = SearchIndexingBufferedSender(\n", + " search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)\n", + ")\n", + "# Get the first document to check its schema\n", + "first_document = documents[0]\n", + "\n", + "# Get the index schema\n", + "index_schema = index_client.get_index(index_name)\n", + "\n", + "# Get the field names from the index schema\n", + "index_fields = {field.name: field.type for field in index_schema.fields}\n", + "\n", + "# Check each field in the first document\n", + "for field, value in first_document.items():\n", + " if field not in index_fields:\n", + " print(f\"Field '{field}' is not in the index schema.\")\n", + "\n", + "# Check for any fields in the index schema that are not in the documents\n", + "for field in index_fields:\n", + " if field not in first_document:\n", + " print(f\"Field '{field}' is in the index schema but not in the documents.\")\n", + "\n", + "try:\n", + " if documents:\n", + " # Add upload actions for all documents in a single call\n", + " upload_result = batch_client.upload_documents(documents=documents)\n", + "\n", + " # The buffered sender batches uploads automatically\n", + " # Manually flush to send any remaining documents in the buffer\n", + " batch_client.flush()\n", + " \n", + " print(f\"Uploaded {len(documents)} documents in total\")\n", + " else:\n", + " print(\"No documents to upload.\")\n", + "except HttpResponseError as e:\n", + " print(f\"An error occurred: {e}\")\n", + " raise # Re-raise the exception to ensure it errors out\n", + "finally:\n", + " # Clean up resources\n", + " batch_client.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test search\n", + "Now that the data is uploaded, we'll test both vector similarity search and hybrid search locally below to make sure it is working as expected." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can test both a pure vector search and hybrid search. Pure vector search passes `None` as the `search_text` below and searches only on vector similarity. Hybrid search combines traditional keyword-based search (by passing the query text `query` as the `search_text`) with vector-based similarity search to provide more relevant and contextual results. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "query = \"What model should I use to embed?\"\n", + "# Note: we'll have the GPT choose the category automatically once we put it in ChatGPT\n", + "category =\"models\"\n", + "\n", + "search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))\n", + "vector_query = VectorizedQuery(vector=generate_embeddings(query, embeddings_model), k_nearest_neighbors=3, fields=\"content_vector\")\n", + " \n", + "results = search_client.search( \n", + " search_text=None, # Pass in None if you want to use pure vector search, and `query` if you want to use hybrid search\n", + " vector_queries= [vector_query], \n", + " select=[\"title\", \"text\"],\n", + " filter=f\"category eq '{category}'\" \n", + ")\n", + "\n", + "for result in results: \n", + " print(result)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create Azure Function\n", + "\n", + "Azure Functions are an easy way to build an API on top of our new AI search. Our code (see the `function_app.py` file in this folder, or linked [here](https://github.com/openai/openai-cookbook/blob/main/examples/chatgpt/rag-quickstart/azure/function_app.py)) does the following:\n", + "\n", + "1. Takes in an input of the user's query, search index endpoint, the index name, the k_nearest_neighbors*, the search column to use (either content_vector or title_vector), and whether it should use a hybrid query\n", + "2. Takes the user's query and embeds it.\n", + "3. Conducts a vector search and retrieves relevant text chunks.\n", + "4. Returns those relevant text chunks as the response body. \n", + "\n", + "*In the context of vector search, k_nearest_neighbors specifies the number of \"closest\" vectors (in terms of cosine similarity) that the search should return. For example, if k_nearest_neighbors is set to 3, the search will return the 3 vectors in the index that are most similar to the query vector.\n", + "\n", + "> Note that this Azure Function _does not have any authentication_. However, you can set authentication on it following docs [here](https://learn.microsoft.com/en-us/azure/azure-functions/security-concepts?tabs=v4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create storage account" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can create a new storage account using the code below, but feel free to skip that block and modify the subsequent steps to use an existing storage account. This may take up to 30 seconds." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## Update below with a different name\n", + "storage_account_name = \"\"\n", + "\n", + "## Use the SKU below or any other SKU as per your requirements\n", + "sku = \"Standard_LRS\"\n", + "resource_client = ResourceManagementClient(credential, subscription_id)\n", + "storage_client = StorageManagementClient(credential, subscription_id)\n", + "\n", + "# Create resource group if it doesn't exist\n", + "rg_result = resource_client.resource_groups.create_or_update(resource_group, {\"location\": region})\n", + "\n", + "# Create storage account\n", + "storage_async_operation = storage_client.storage_accounts.begin_create(\n", + " resource_group,\n", + " storage_account_name,\n", + " {\n", + " \"sku\": {\"name\": sku},\n", + " \"kind\": \"StorageV2\",\n", + " \"location\": region,\n", + " },\n", + ")\n", + "storage_account = storage_async_operation.result()\n", + "\n", + "print(f\"Storage account {storage_account.name} created\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Function App\n", + "This Function App is where the Python code will execute once it is triggered via a GPT Action. To read more about Function Apps, see the docs [here](https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To deploy Function Apps, we'll need to use the Azure CLI and Azure Functions Core Tools. \n", + "\n", + "> The below will attempt to install and run them based on your platform type in your virtual environment, but if that does not work, read the Azure documentation to figure out how to install [Azure Function Core Tools](https://learn.microsoft.com/en-us/azure/azure-functions/create-first-function-cli-python?tabs=linux,bash,azure-cli,browser) and [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli). After doing that, navigate to this folder in your terminal and run the below `subprocess.run` commands there." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we'll make sure we have the relevant tools in the environment in order to run the necessary Azure commands. This may take a few minutes to install."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "os_type = platform.system()\n", + "\n", + "if os_type == \"Windows\":\n", + " # Install Azure Functions Core Tools on Windows\n", + " subprocess.run([\"npm\", \"install\", \"-g\", \"azure-functions-core-tools@4\", \"--unsafe-perm\", \"true\"], check=True)\n", + " # Install Azure CLI on Windows\n", + " subprocess.run([\"powershell\", \"-Command\", \"Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows -OutFile .\\\\AzureCLI.msi; Start-Process msiexec.exe -ArgumentList '/I AzureCLI.msi /quiet' -Wait\"], check=True)\n", + "elif os_type == \"Darwin\": # macOS\n", + " # Install Azure Functions Core Tools on macOS (the formula lives in the azure/functions tap)\n", + " subprocess.run([\"brew\", \"tap\", \"azure/functions\"], check=True)\n", + " if platform.machine() == 'arm64':\n", + " # For Apple Silicon Macs\n", + " subprocess.run([\"arch\", \"-arm64\", \"brew\", \"install\", \"azure-functions-core-tools@4\"], check=True)\n", + " else:\n", + " # For Intel Macs\n", + " subprocess.run([\"brew\", \"install\", \"azure-functions-core-tools@4\"], check=True)\n", + " # Install Azure CLI on macOS\n", + " subprocess.run([\"brew\", \"update\"], check=True)\n", + " subprocess.run([\"brew\", \"install\", \"azure-cli\"], check=True)\n", + "elif os_type == \"Linux\":\n", + " # Install Azure Functions Core Tools on Linux (commands with pipes/redirects need shell=True and a single string)\n", + " subprocess.run(\"curl https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg\", check=True, shell=True)\n", + " subprocess.run([\"sudo\", \"mv\", \"microsoft.gpg\", \"/etc/apt/trusted.gpg.d/microsoft.gpg\"], check=True)\n", + " subprocess.run(\"sudo sh -c 'echo \\\"deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-$(lsb_release -cs)-prod $(lsb_release -cs) main\\\" > /etc/apt/sources.list.d/dotnetdev.list'\", check=True, shell=True)\n", + " subprocess.run([\"sudo\", \"apt-get\", \"update\"], check=True)\n", + " subprocess.run([\"sudo\", \"apt-get\", \"install\", \"-y\", \"azure-functions-core-tools-4\"], check=True)\n", + " # Install Azure CLI on Linux\n", + " subprocess.run(\"curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash\", check=True, shell=True)\n", + "else:\n", + " # Raise an error if the operating system is not supported\n", + " raise OSError(\"Unsupported operating system\")\n", + "\n", + "# Verify the installation of Azure Functions Core Tools\n", + "subprocess.run([\"func\", \"--version\"], check=True)\n", + "# Verify the installation of Azure CLI\n", + "subprocess.run([\"az\", \"--version\"], check=True)\n", + "\n", + "subprocess.run([\n", + " \"az\", \"login\"\n", + "], check=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we need to create a `local.settings.json` file with our key environment variables for Azure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_settings_content = f\"\"\"\n", + "{{\n", + " \"IsEncrypted\": false,\n", + " \"Values\": {{\n", + " \"AzureWebJobsStorage\": \"UseDevelopmentStorage=true\",\n", + " \"FUNCTIONS_WORKER_RUNTIME\": \"python\",\n", + " \"OPENAI_API_KEY\": \"{openai_api_key}\",\n", + " \"EMBEDDINGS_MODEL\": \"{embeddings_model}\",\n", + " \"SEARCH_SERVICE_API_KEY\": \"{search_service_api_key}\"\n", + " }}\n", + "}}\n", + "\"\"\"\n", + "\n", + "with open(\"local.settings.json\", \"w\") as file:\n", + " file.write(local_settings_content)" + ] + },
 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check the `local.settings.json` file and make sure that the environment variables match what you expect. \n", + "\n", + "Now, give your app a name below, and you are ready to create your Function App and then publish your function. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace this with your own value. This name will appear in the URL of the API call: https://<app_name>.azurewebsites.net\n", + "app_name = \"\"\n", + "\n", + "subprocess.run([\n", + " \"az\", \"functionapp\", \"create\",\n", + " \"--resource-group\", resource_group,\n", + " \"--consumption-plan-location\", region,\n", + " \"--runtime\", \"python\",\n", + " \"--name\", app_name,\n", + " \"--storage-account\", storage_account_name,\n", + " \"--os-type\", \"Linux\",\n", + "], check=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once we've created the Function App, we want to add the configuration variables to the function app to use in the function. Specifically, we need the `OPENAI_API_KEY`, the `SEARCH_SERVICE_API_KEY`, and the `EMBEDDINGS_MODEL`, as these are all used in the `function_app.py` code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Collect the relevant environment variables \n", + "env_vars = {\n", + " \"OPENAI_API_KEY\": openai_api_key,\n", + " \"SEARCH_SERVICE_API_KEY\": search_service_api_key,\n", + " \"EMBEDDINGS_MODEL\": embeddings_model\n", + "}\n", + "\n", + "# Build the settings arguments for the az functionapp config appsettings set command\n", + "settings_args = []\n", + "for key, value in env_vars.items():\n", + " settings_args.append(f\"{key}={value}\")\n", + "\n", + "subprocess.run([\n", + " \"az\", \"functionapp\", \"config\", \"appsettings\", \"set\",\n", + " \"--name\", app_name,\n", + " \"--resource-group\", resource_group,\n", + " \"--settings\", *settings_args\n", + "], check=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are now ready to publish the function code `function_app.py` to the Azure Function. This may take up to 10 minutes to deploy. Once this is finished, we'll have an API endpoint using an Azure Function on top of Azure AI Search, which you can optionally sanity-check directly, as shown after the publish step below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subprocess.run([\n", + " \"func\", \"azure\", \"functionapp\", \"publish\", app_name\n", + "], check=True)" + ] + },
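 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before wiring the endpoint into ChatGPT, you can optionally sanity-check it with a direct request. The sketch below is an illustrative example rather than part of the deployment: it assumes the `requests` library is available in your environment and that the function was deployed with anonymous auth as configured above, and it mirrors the request body that the OpenAPI spec in the next section describes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "function_url = f\"https://{app_name}.azurewebsites.net/api/vector_similarity_search\"\n", + "payload = {\n", + " \"search_service_endpoint\": search_service_endpoint,\n", + " \"index_name\": index_name,\n", + " \"query\": \"What model should I use to embed?\",\n", + " \"k_nearest_neighbors\": 3,\n", + " \"search_column\": \"content_vector\",\n", + " \"use_hybrid_query\": True,\n", + " \"category\": \"models\"\n", + "}\n", + "\n", + "# POST the query to the deployed Azure Function and print a preview of each returned chunk\n", + "response = requests.post(function_url, json=payload)\n", + "response.raise_for_status()\n", + "for item in response.json():\n", + " print(item.get(\"title\"), \"->\", str(item.get(\"text\", \"\"))[:100])" + ] + },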
 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Input in a Custom GPT in ChatGPT\n", + "Now that we have an Azure Function that queries this Vector Search Index, let's plug it into a Custom GPT as a GPT Action!\n", + "\n", + "See documentation [here](https://openai.com/index/introducing-gpts/) on GPTs and [here](https://platform.openai.com/docs/actions) on GPT Actions. Use the below as the instructions for the GPT and as the OpenAPI spec for the GPT Action.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create OpenAPI Spec\n", + "Below is a sample OpenAPI spec. When we run the block below, a functional spec should be copied to the clipboard to paste in the GPT Action.\n", + "\n", + "Note that this does not have any authentication by default, but you can set up Azure Functions with OAuth by following the pattern in [this cookbook](https://cookbook.openai.com/examples/chatgpt/sharepoint_azure_function/using_azure_functions_and_microsoft_graph_to_query_sharepoint) in the Authentication section or looking at the documentation [here](https://learn.microsoft.com/en-us/azure/app-service/overview-authentication-authorization)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "spec = f\"\"\"\n", + "openapi: 3.1.0\n", + "info:\n", + " title: Vector Similarity Search API\n", + " description: API for performing vector similarity search.\n", + " version: 1.0.0\n", + "servers:\n", + " - url: https://{app_name}.azurewebsites.net/api\n", + " description: Main (production) server\n", + "paths:\n", + " /vector_similarity_search:\n", + " post:\n", + " operationId: vectorSimilaritySearch\n", + " summary: Perform a vector similarity search.\n", + " requestBody:\n", + " required: true\n", + " content:\n", + " application/json:\n", + " schema:\n", + " type: object\n", + " properties:\n", + " search_service_endpoint:\n", + " type: string\n", + " description: The endpoint of the search service.\n", + " index_name:\n", + " type: string\n", + " description: The name of the search index.\n", + " query:\n", + " type: string\n", + " description: The search query.\n", + " k_nearest_neighbors:\n", + " type: integer\n", + " description: The number of nearest neighbors to return.\n", + " search_column:\n", + " type: string\n", + " description: The name of the search column.\n", + " use_hybrid_query:\n", + " type: boolean\n", + " description: Whether to use a hybrid query.\n", + " category:\n", + " type: string\n", + " description: The category to filter on (optional).\n", + " required:\n", + " - search_service_endpoint\n", + " - index_name\n", + " - query\n", + " - k_nearest_neighbors\n", + " - search_column\n", + " - use_hybrid_query\n", + " responses:\n", + " '200':\n", + " description: A successful response with the search results, returned as an array of result documents.\n", + " content:\n", + " application/json:\n", + " schema:\n", + " type: array\n", + " items:\n", + " type: object\n", + " description: A result document containing its retrievable fields and relevance score.\n", + " '400':\n", + " description: Bad request due to missing or invalid parameters.\n", + " '500':\n", + " description: Internal server error.\n", + "\"\"\"\n", + "pyperclip.copy(spec)\n", + "print(\"OpenAPI spec copied to clipboard\")\n", + "print(spec)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create GPT Instructions\n", + "\n", + "Feel free to modify instructions as you see fit. Check out our docs [here](https://platform.openai.com/docs/guides/prompt-engineering) for some tips on prompt engineering." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "instructions = f'''\n", + "You are an OAI docs assistant. You have an action in your knowledge base where you can make a POST request to search for information. 
The POST request should always include: {{\n", + " \"search_service_endpoint\": \"{search_service_endpoint}\",\n", + " \"index_name\": {index_name},\n", + " \"query\": \"\",\n", + " \"k_nearest_neighbors\": 1,\n", + " \"search_column\": \"content_vector\",\n", + " \"use_hybrid_query\": true,\n", + " \"category\": \"\"\n", + "}}. Only the query and category change based on the user's request. Your goal is to assist users by performing searches using this POST request and providing them with relevant information based on the query.\n", + "\n", + "You must only include knowledge you get from your action in your response.\n", + "The category must be from the following list: {categories}, which you should determine based on the user's query. If you cannot determine, then do not include the category in the POST request.\n", + "'''\n", + "pyperclip.copy(instructions)\n", + "print(\"GPT Instructions copied to clipboard\")\n", + "print(instructions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now have a GPT that queries a vector database! " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Recap\n", + "We've now successfully integrated Azure AI Search with GPT Actions in ChatGPT by doing the following:\n", + "1. embedded them using OpenAI's embeddings, while adding some additional metadata using gpt-4o.\n", + "2. uploaded that data to Azure AI Search.\n", + "3. created an endpoint to query it using Azure Functions.\n", + "4. incorporated it into a Custom GPT. \n", + "\n", + "Our GPT can now retrieve information to help answer user queries, making it much more accurate and customized to our data. Here's the GPT in action:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ![azure-rag-quickstart-gpt.png](../../../../images/azure-rag-quickstart-gpt.png)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/chatgpt/rag-quickstart/azure/function_app.py b/examples/chatgpt/rag-quickstart/azure/function_app.py new file mode 100644 index 00000000..f6878ddc --- /dev/null +++ b/examples/chatgpt/rag-quickstart/azure/function_app.py @@ -0,0 +1,153 @@ +import azure.functions as func +import json +import logging +from azure.search.documents import SearchClient +from azure.search.documents.indexes import SearchIndexClient +from azure.core.credentials import AzureKeyCredential +from openai import OpenAI +import os +from azure.search.documents.models import ( + VectorizedQuery +) + +# Initialize the Azure Function App +app = func.FunctionApp() + +def generate_embeddings(text): + # Check if text is provided + if not text: + logging.error("No text provided in the query string.") + return func.HttpResponse( + "Please provide text in the query string.", + status_code=400 + ) + + try: + # Initialize OpenAI client + client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) + logging.info("OpenAI client initialized successfully.") + + # Generate embeddings using OpenAI API + response = client.embeddings.create( + input=text, + model=os.getenv("EMBEDDINGS_MODEL") + ) + logging.info("Embeddings created successfully.") + + # 
+
+
+@app.route(route="vector_similarity_search", auth_level=func.AuthLevel.ANONYMOUS)
+def vector_similarity_search(req: func.HttpRequest) -> func.HttpResponse:
+    logging.info("Received request for vector similarity search.")
+    try:
+        # Parse the request body as JSON
+        req_body = req.get_json()
+        logging.info("Request body parsed successfully.")
+    except ValueError:
+        logging.error("Invalid JSON in request body.")
+        return func.HttpResponse(
+            "Invalid JSON in request body.",
+            status_code=400
+        )
+
+    # Extract parameters from the request body
+    search_service_endpoint = req_body.get('search_service_endpoint')
+    index_name = req_body.get('index_name')
+    query = req_body.get('query')
+    k_nearest_neighbors = req_body.get('k_nearest_neighbors')
+    search_column = req_body.get('search_column')
+    use_hybrid_query = req_body.get('use_hybrid_query')
+
+    logging.info(f"Parsed request parameters: search_service_endpoint={search_service_endpoint}, index_name={index_name}, query={query}, k_nearest_neighbors={k_nearest_neighbors}, search_column={search_column}, use_hybrid_query={use_hybrid_query}")
+
+    # Validate required parameters before doing any work
+    if not (search_service_endpoint and index_name and query):
+        logging.error("Missing required parameters in request body.")
+        return func.HttpResponse(
+            "Please provide search_service_endpoint, index_name, and query in the request body.",
+            status_code=400
+        )
+
+    try:
+        # Generate embeddings for the query
+        embeddings = generate_embeddings(query)
+        logging.info("Generated embeddings for the query.")
+    except Exception as e:
+        logging.error(f"Error generating embeddings: {str(e)}")
+        return func.HttpResponse(
+            f"Error generating embeddings: {str(e)}",
+            status_code=500
+        )
+
+    try:
+        # Create a vectorized query (k_nearest_neighbors is an integer per the OpenAPI spec)
+        vector_query = VectorizedQuery(vector=embeddings, k_nearest_neighbors=int(k_nearest_neighbors), fields=search_column)
+        logging.info("Vector query generated successfully.")
+    except Exception as e:
+        logging.error(f"Error generating vector query: {str(e)}")
+        return func.HttpResponse(
+            f"Error generating vector query: {str(e)}",
+            status_code=500
+        )
+
+    try:
+        # Initialize the search client
+        search_client = SearchClient(
+            endpoint=search_service_endpoint,
+            index_name=index_name,
+            credential=AzureKeyCredential(os.getenv("SEARCH_SERVICE_API_KEY"))
+        )
+        logging.info("Search client created successfully.")
+
+        # Initialize the index client and get the index schema
+        index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=AzureKeyCredential(os.getenv("SEARCH_SERVICE_API_KEY")))
+        index_schema = index_client.get_index(index_name)
+        for field in index_schema.fields:
+            logging.info(f"Field: {field.name}, Type: {field.type}")
+        # Keep only the non-vector fields so the response stays readable
+        non_vector_fields = [field.name for field in index_schema.fields if field.type not in ["Edm.ComplexType", "Collection(Edm.ComplexType)", "Edm.Vector", "Collection(Edm.Single)"]]
+
+        logging.info(f"Non-vector fields in the index: {non_vector_fields}")
+    except Exception as e:
+        logging.error(f"Error creating search client: {str(e)}")
+        return func.HttpResponse(
+            f"Error creating search client: {str(e)}",
+            status_code=500
+        )
+
+    # Pass the raw query text through as well when a hybrid (keyword + vector) search is requested
+    search_text = query if use_hybrid_query else None
+
+    try:
+        # Perform the search
+        results = search_client.search(
+            search_text=search_text,
+            vector_queries=[vector_query],
+            select=non_vector_fields,
+            top=3
+        )
+        logging.info("Search performed successfully.")
+    except Exception as e:
+        logging.error(f"Error performing search: {str(e)}")
+        return func.HttpResponse(
+            f"Error performing search: {str(e)}",
+            status_code=500
+        )
+
+    try:
+        # Collect the results into a JSON-serializable list of dictionaries
+        response_data = [result for result in results]
+        response_data = json.dumps(response_data)
+        logging.info("Search results processed successfully.")
+    except Exception as e:
+        logging.error(f"Error processing search results: {str(e)}")
+        return func.HttpResponse(
+            f"Error processing search results: {str(e)}",
+            status_code=500
+        )
+
+    logging.info("Returning search results.")
+    return func.HttpResponse(response_data, mimetype="application/json")
diff --git a/examples/chatgpt/rag-quickstart/azure/host.json b/examples/chatgpt/rag-quickstart/azure/host.json
new file mode 100644
index 00000000..9df91361
--- /dev/null
+++ b/examples/chatgpt/rag-quickstart/azure/host.json
@@ -0,0 +1,15 @@
+{
+  "version": "2.0",
+  "logging": {
+    "applicationInsights": {
+      "samplingSettings": {
+        "isEnabled": true,
+        "excludedTypes": "Request"
+      }
+    }
+  },
+  "extensionBundle": {
+    "id": "Microsoft.Azure.Functions.ExtensionBundle",
+    "version": "[4.*, 5.0.0)"
+  }
+}
\ No newline at end of file
diff --git a/examples/chatgpt/rag-quickstart/azure/requirements.txt b/examples/chatgpt/rag-quickstart/azure/requirements.txt
new file mode 100644
index 00000000..47ac7f5b
--- /dev/null
+++ b/examples/chatgpt/rag-quickstart/azure/requirements.txt
@@ -0,0 +1,17 @@
+# Do not include azure-functions-worker in this file
+# The Python Worker is managed by the Azure Functions platform
+# Manually managing azure-functions-worker may cause unexpected issues
+
+azure-functions
+azure-search-documents
+azure-identity
+openai
+azure-mgmt-search
+pandas
+azure-mgmt-resource
+azure-mgmt-storage
+azure-mgmt-web
+python-dotenv
+pyperclip
+PyPDF2
+tiktoken
\ No newline at end of file
diff --git a/examples/chatgpt/rag-quickstart/azure/vector_similarity_search/function.json b/examples/chatgpt/rag-quickstart/azure/vector_similarity_search/function.json
new file mode 100644
index 00000000..149763d4
--- /dev/null
+++ b/examples/chatgpt/rag-quickstart/azure/vector_similarity_search/function.json
@@ -0,0 +1,19 @@
+{
+  "scriptFile": "__init__.py",
+  "bindings": [
+    {
+      "authLevel": "Anonymous",
+      "type": "httpTrigger",
+      "direction": "in",
+      "name": "req",
+      "methods": [
+        "post"
+      ]
+    },
+    {
+      "type": "http",
+      "direction": "out",
+      "name": "$return"
+    }
+  ]
+}
\ No newline at end of file
diff --git a/examples/data/oai_docs/authentication.txt b/examples/data/oai_docs/authentication.txt
new file mode 100644
index 00000000..44833699
--- /dev/null
+++ b/examples/data/oai_docs/authentication.txt
@@ -0,0 +1,34 @@
+
+# Action authentication
+
+Actions offer different authentication schemas to accommodate various use cases. To specify the authentication schema for your action, use the GPT editor and select "None", "API Key", or "OAuth".
+
+By default, the authentication method for all actions is set to "None", but you can change this and allow different actions to have different authentication methods.
+
+## No authentication
+
+We support flows without authentication for applications where users can send requests directly to your API without needing an API key or signing in with OAuth.
+ +Consider using no authentication for initial user interactions as you might experience a user drop off if they are forced to sign into an application. You can create a "signed out" experience and then move users to a "signed in" experience by enabling a separate action. + +## API key authentication + +Just like how a user might already be using your API, we allow API key authentication through the GPT editor UI. We encrypt the secret key when we store it in our database to keep your API key secure. + +This approach is useful if you have an API that takes slightly more consequential actions than the no authentication flow but does not require an individual user to sign in. Adding API key authentication can protect your API and give you more fine-grained access controls along with visibility into where requests are coming from. + +## OAuth + +Actions allow OAuth sign in for each user. This is the best way to provide personalized experiences and make the most powerful actions available to users. A simple example of the OAuth flow with actions will look like the following: + +- To start, select "Authentication" in the GPT editor UI, and select "OAuth". +- You will be prompted to enter the OAuth client ID, client secret, authorization URL, token URL, and scope. + - The client ID and secret can be simple text strings but should [follow OAuth best practices](https://www.oauth.com/oauth2-servers/client-registration/client-id-secret/). + - We store an encrypted version of the client secret, while the client ID is available to end users. +- OAuth requests will include the following information: `request={'grant_type': 'authorization_code', 'client_id': 'YOUR_CLIENT_ID', 'client_secret': 'YOUR_CLIENT_SECRET', 'code': 'abc123', 'redirect_uri': 'https://chatgpt.com/aip/g-some_gpt_id/oauth/callback'}` +- In order for someone to use an action with OAuth, they will need to send a message that invokes the action and then the user will be presented with a "Sign in to [domain]" button in the ChatGPT UI. +- The `authorization_url` endpoint should return a response that looks like: + `{ "access_token": "example_token", "token_type": "bearer", "refresh_token": "example_token", "expires_in": 59 }` +- During the user sign in process, ChatGPT makes a request to your `authorization_url` using the specified `authorization_content_type`, we expect to get back an access token and optionally a [refresh token](https://auth0.com/learn/refresh-tokens) which we use to periodically fetch a new access token. +- Each time a user makes a request to the action, the user’s token will be passed in the Authorization header: (“Authorization”: “[Bearer/Basic] [user’s token]”). +- We require that OAuth applications make use of the [state parameter](https://auth0.com/docs/secure/attack-protection/state-parameters#set-and-compare-state-parameter-values) for security reasons. diff --git a/examples/data/oai_docs/batch.txt b/examples/data/oai_docs/batch.txt new file mode 100644 index 00000000..636fe68e --- /dev/null +++ b/examples/data/oai_docs/batch.txt @@ -0,0 +1,333 @@ + +# Batch API + +Learn how to use OpenAI's Batch API to send asynchronous groups of requests with 50% lower costs, a separate pool of significantly higher rate limits, and a clear 24-hour turnaround time. The service is ideal for processing jobs that don't require immediate responses. You can also [explore the API reference directly here](/docs/api-reference/batch). 
+ +## Overview + +While some uses of the OpenAI Platform require you to send synchronous requests, there are many cases where requests do not need an immediate response or [rate limits](/docs/guides/rate-limits) prevent you from executing a large number of queries quickly. Batch processing jobs are often helpful in use cases like: + +1. running evaluations +2. classifying large datasets +3. embedding content repositories + +The Batch API offers a straightforward set of endpoints that allow you to collect a set of requests into a single file, kick off a batch processing job to execute these requests, query for the status of that batch while the underlying requests execute, and eventually retrieve the collected results when the batch is complete. + +Compared to using standard endpoints directly, Batch API has: + +1. **Better cost efficiency:** 50% cost discount compared to synchronous APIs +2. **Higher rate limits:** [Substantially more headroom](/settings/organization/limits) compared to the synchronous APIs +3. **Fast completion times:** Each batch completes within 24 hours (and often more quickly) + +## Getting Started + +### 1. Preparing Your Batch File + +Batches start with a `.jsonl` file where each line contains the details of an individual request to the API. For now, the available endpoints are `/v1/chat/completions` ([Chat Completions API](/docs/api-reference/chat)) and `/v1/embeddings` ([Embeddings API](/docs/api-reference/embeddings)). For a given input file, the parameters in each line's `body` field are the same as the parameters for the underlying endpoint. Each request must include a unique `custom_id` value, which you can use to reference results after completion. Here's an example of an input file with 2 requests. Note that each input file can only include requests to a single model. + +```jsonl +{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} +{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} +``` + +### 2. Uploading Your Batch Input File + +Similar to our [Fine-tuning API](/docs/guides/fine-tuning/), you must first upload your input file so that you can reference it correctly when kicking off batches. Upload your `.jsonl` file using the [Files API](/docs/api-reference/files). + + + +### 3. Creating the Batch + +Once you've successfully uploaded your input file, you can use the input File object's ID to create a batch. In this case, let's assume the file ID is `file-abc123`. For now, the completion window can only be set to `24h`. You can also provide custom metadata via an optional `metadata` parameter. 
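+Putting steps 2 and 3 together, a minimal sketch using the official `openai` Python SDK might look like the following. The input file name and metadata values are placeholders:
+
+```python
+from openai import OpenAI
+
+client = OpenAI()
+
+# Step 2: upload the .jsonl input file with purpose="batch"
+batch_input_file = client.files.create(
+    file=open("batchinput.jsonl", "rb"),
+    purpose="batch"
+)
+
+# Step 3: create the batch using the uploaded file's ID
+batch = client.batches.create(
+    input_file_id=batch_input_file.id,
+    endpoint="/v1/chat/completions",
+    completion_window="24h",
+    metadata={"description": "nightly eval job"}
+)
+print(batch.id, batch.status)
+```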
+
+This request will return a [Batch object](/docs/api-reference/batch/object) with metadata about your batch:
+
+```json
+{
+  "id": "batch_abc123",
+  "object": "batch",
+  "endpoint": "/v1/chat/completions",
+  "errors": null,
+  "input_file_id": "file-abc123",
+  "completion_window": "24h",
+  "status": "validating",
+  "output_file_id": null,
+  "error_file_id": null,
+  "created_at": 1714508499,
+  "in_progress_at": null,
+  "expires_at": 1714536634,
+  "completed_at": null,
+  "failed_at": null,
+  "expired_at": null,
+  "request_counts": {
+    "total": 0,
+    "completed": 0,
+    "failed": 0
+  },
+  "metadata": null
+}
+```
+
+### 4. Checking the Status of a Batch
+
+You can check the status of a batch at any time, which will also return a Batch object.
+
+The status of a given Batch object can be any of the following:
+
+| Status        | Description                                                                     |
+| ------------- | ------------------------------------------------------------------------------- |
+| `validating`  | the input file is being validated before the batch can begin                    |
+| `failed`      | the input file has failed the validation process                                |
+| `in_progress` | the input file was successfully validated and the batch is currently being run  |
+| `finalizing`  | the batch has completed and the results are being prepared                      |
+| `completed`   | the batch has been completed and the results are ready                          |
+| `expired`     | the batch was not able to be completed within the 24-hour time window           |
+| `cancelling`  | the batch is being cancelled (may take up to 10 minutes)                        |
+| `cancelled`   | the batch was cancelled                                                         |
+
+### 5. Retrieving the Results
+
+Once the batch is complete, you can download the output by making a request against the [Files API](/docs/api-reference/files) via the `output_file_id` field from the Batch object and writing it to a file on your machine, in this case `batch_output.jsonl`.
+
+```javascript
+import OpenAI from "openai";
+
+const openai = new OpenAI();
+
+async function main() {
+  const file = await openai.files.content("file-xyz123");
+
+  console.log(file);
+}
+
+main();
+```
+
+The output `.jsonl` file will have one response line for every successful request line in the input file. Any failed requests in the batch will have their error information written to an error file that can be found via the batch's `error_file_id`.
+
+Note that the output line order may not match the input line order. Instead of relying on order to process your results, use the `custom_id` field, which will be present in each line of your output file and allows you to map requests in your input to results in your output.
+
+```jsonl
+{"id": "batch_req_123", "custom_id": "request-2", "response": {"status_code": 200, "request_id": "req_123", "body": {"id": "chatcmpl-123", "object": "chat.completion", "created": 1711652795, "model": "gpt-3.5-turbo-0125", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello."}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 22, "completion_tokens": 2, "total_tokens": 24}, "system_fingerprint": "fp_123"}}, "error": null}
+{"id": "batch_req_456", "custom_id": "request-1", "response": {"status_code": 200, "request_id": "req_789", "body": {"id": "chatcmpl-abc", "object": "chat.completion", "created": 1711652789, "model": "gpt-3.5-turbo-0125", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello! How can I assist you today?"}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 20, "completion_tokens": 9, "total_tokens": 29}, "system_fingerprint": "fp_3ba"}}, "error": null}
+```
+
+### 6. Cancelling a Batch
+
+If necessary, you can cancel an ongoing batch. The batch's status will change to `cancelling` until in-flight requests are complete (up to 10 minutes), after which the status will change to `cancelled`.
+
+### 7. Getting a List of All Batches
+
+At any time, you can see all your batches. For users with many batches, you can use the `limit` and `after` parameters to paginate your results.
+
+## Model Availability
+
+The Batch API can currently be used to execute queries against the following models. The Batch API supports text and vision inputs in the same format as the endpoints for these models:
+
+- `gpt-4o`
+- `gpt-4-turbo`
+- `gpt-4`
+- `gpt-4-32k`
+- `gpt-3.5-turbo`
+- `gpt-3.5-turbo-16k`
+- `gpt-4-turbo-preview`
+- `gpt-4-vision-preview`
+- `gpt-4-turbo-2024-04-09`
+- `gpt-4-0314`
+- `gpt-4-32k-0314`
+- `gpt-4-32k-0613`
+- `gpt-3.5-turbo-0301`
+- `gpt-3.5-turbo-16k-0613`
+- `gpt-3.5-turbo-1106`
+- `gpt-3.5-turbo-0613`
+- `text-embedding-3-large`
+- `text-embedding-3-small`
+- `text-embedding-ada-002`
+
+The Batch API also supports [fine-tuned models](/docs/guides/fine-tuning/what-models-can-be-fine-tuned).
+
+## Rate Limits
+
+Batch API rate limits are separate from existing per-model rate limits. The Batch API has two new types of rate limits:
+
+1. **Per-batch limits:** A single batch may include up to 50,000 requests, and a batch input file can be up to 100 MB in size. Note that `/v1/embeddings` batches are also restricted to a maximum of 50,000 embedding inputs across all requests in the batch.
+2. **Enqueued prompt tokens per model:** Each model has a maximum number of enqueued prompt tokens allowed for batch processing. You can find these limits on the [Platform Settings page](/settings/organization/limits).
+
+There are no limits for output tokens or number of submitted requests for the Batch API today. Because Batch API rate limits are a new, separate pool, **using the Batch API will not consume tokens from your standard per-model rate limits**, thereby offering you a convenient way to increase the number of requests and processed tokens you can use when querying our API.
+
+## Batch Expiration
+
+Batches that do not complete in time eventually move to an `expired` state; unfinished requests within that batch are cancelled, and any responses to completed requests are made available via the batch's output file. You will be charged for tokens consumed from any completed requests.
+
+## Other Resources
+
+For more concrete examples, visit **[the OpenAI Cookbook](https://cookbook.openai.com/examples/batch_processing)**, which contains sample code for use cases like classification, sentiment analysis, and summary generation.
diff --git a/examples/data/oai_docs/changelog.txt b/examples/data/oai_docs/changelog.txt
new file mode 100644
index 00000000..b8943585
--- /dev/null
+++ b/examples/data/oai_docs/changelog.txt
@@ -0,0 +1,393 @@
+
+# Changelog
+
+Keep track of changes to the OpenAI API. You can also track changes via our [public OpenAPI specification](https://github.com/openai/openai-openapi), which is used to generate our SDKs, documentation, and more. This changelog is maintained in a best effort fashion and may not reflect all changes being made.
+
+### Jun 6th, 2024
+
+- Parallel function calling can be disabled in Chat Completions and the Assistants API by passing `parallel_tool_calls=false`.
+- .NET SDK launched in Beta.
+
+### Jun 3rd, 2024
+
+- Added support for file search customizations.
+
+### May 15th, 2024
+
+- Added support for archiving projects. Only organization owners can access this functionality.
+- Added support for setting cost limits on a per-project basis for pay as you go customers.
+
+### May 13th, 2024
+
+- Released GPT-4o in the API. GPT-4o is our fastest and most affordable flagship model.
+
+### May 9th, 2024
+
+- Added support for image inputs to the Assistants API.
+
+### May 7th, 2024
+
+- Added support for fine-tuned models to the Batch API.
+
+### May 6th, 2024
+
+- Added `stream_options: {"include_usage": true}` parameter to the Chat Completions and Completions APIs. Setting this gives developers access to usage stats when using streaming.
+
+### May 2nd, 2024
+
+- Added a new endpoint to delete a message from a thread in the Assistants API.
+
+### Apr 29th, 2024
+
+- Added a new function calling option `tool_choice: "required"` to the Chat Completions and Assistants APIs.
+- Added a guide for the Batch API and Batch API support for embeddings models.
+
+### Apr 17th, 2024
+
+- Introduced a series of updates to the Assistants API, including a new file search tool allowing up to 10,000 files per assistant, new token controls, and support for tool choice.
+
+### Apr 16th, 2024
+
+- Introduced project based hierarchy for organizing work by projects, including the ability to create API keys and manage rate and cost limits on a per-project basis (cost limits available only for Enterprise customers).
+
+### Apr 15th, 2024
+
+- Released Batch API.
+
+### Apr 9th, 2024
+
+- Released GPT-4 Turbo with Vision in general availability in the API.
+
+### Apr 4th, 2024
+
+- Added support for `seed` in the fine-tuning API.
+- Added support for checkpoints in the fine-tuning API.
+- Added support for adding Messages when creating a Run in the Assistants API.
+
+### Apr 1st, 2024
+
+- Added support for filtering Messages by `run_id` in the Assistants API.
+
+### Mar 29th, 2024
+
+- Added support for `temperature` and assistant message creation in the Assistants API.
+
+### Mar 14th, 2024
+
+- Added support for streaming in the Assistants API.
+
+### Feb 9th, 2024
+
+- Added `timestamp_granularities` parameter to the Audio API.
+
+### Feb 1st, 2024
+
+- Released `gpt-3.5-turbo-0125`, an updated GPT-3.5 Turbo model.
+
+### Jan 25th, 2024
+
+- Released embedding V3 models and an updated GPT-4 Turbo preview.
+- Added `dimensions` parameter to the Embeddings API.
+
+### Dec 20th, 2023
+
+- Added `additional_instructions` parameter to run creation in the Assistants API.
+
+### Dec 15th, 2023
+
+- Added `logprobs` and `top_logprobs` parameters to the Chat Completions API.
+
+### Dec 14th, 2023
+
+- Changed the function `parameters` argument on a tool call to be optional.
+
+### Nov 30th, 2023
+
+- Released OpenAI Deno SDK.
+
+### Nov 6th, 2023
+
+- Released GPT-4 Turbo Preview, updated GPT-3.5 Turbo, GPT-4 Turbo with Vision, Assistants API, DALL·E 3 in the API, and text-to-speech API.
+- Deprecated the Chat Completions `functions` parameter in favor of `tools`.
+- Released OpenAI Python SDK V1.0.
+
+### Oct 16th, 2023
+
+- Added `encoding_format` parameter to the Embeddings API.
+- Added `max_tokens` to the Moderation models.
+
+### Oct 6th, 2023
+
+- Added function calling support to the Fine-tuning API.
diff --git a/examples/data/oai_docs/crawl-website-embeddings.txt b/examples/data/oai_docs/crawl-website-embeddings.txt
new file mode 100644
index 00000000..7d8a5003
--- /dev/null
+++ b/examples/data/oai_docs/crawl-website-embeddings.txt
@@ -0,0 +1,560 @@
+
+# How to build an AI that can answer questions about your website
+
+This tutorial walks through a simple example of crawling a website (in this example, the OpenAI website), turning the crawled pages into embeddings using the [Embeddings API](/docs/guides/embeddings), and then creating a basic search functionality that allows a user to ask questions about the embedded information. This is intended to be a starting point for more sophisticated applications that make use of custom knowledge bases.
+
+# Getting started
+
+Some basic knowledge of Python and GitHub is helpful for this tutorial. Before diving in, make sure to [set up an OpenAI API key](/docs/api-reference/introduction) and walk through the [quickstart tutorial](/docs/quickstart). This will give a good intuition on how to use the API to its full potential.
+
+Python is used as the main programming language along with the OpenAI, Pandas, transformers, NumPy, and other popular packages.
+If you run into any issues working through this tutorial, please ask a question on the [OpenAI Community Forum](https://community.openai.com).
+
+To start with the code, clone the [full code for this tutorial on GitHub](https://github.com/openai/web-crawl-q-and-a-example). Alternatively, follow along and copy each section into a Jupyter notebook and run the code step by step, or just read along. A good way to avoid any issues is to set up a new virtual environment and install the required packages by running the following commands:
+
+```bash
+python -m venv env
+
+source env/bin/activate
+
+pip install -r requirements.txt
+```
+
+## Setting up a web crawler
+
+The primary focus of this tutorial is the OpenAI API, so if you prefer, you can skip the context on how to create a web crawler and just [download the source code](https://github.com/openai/web-crawl-q-and-a-example). Otherwise, expand the section below to work through the scraping mechanism implementation.
+
+[Image: DALL-E, "Coding a web crawling system pixel art"]
+
+Acquiring data in text form is the first step to using embeddings. This tutorial creates a new set of data by crawling the OpenAI website, a technique that you can also use for your own company or personal website.
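+
+If you want a feel for the overall shape before working through the full implementation, here is a deliberately minimal sketch of the same idea: a breadth-first crawl that stays on one domain and keeps each page's HTML in memory. This is not the tutorial's exact implementation (the full version also filters binary files and strips the HTML down to text), and the starting URL below is a placeholder:
+
+```python
+from collections import deque
+from html.parser import HTMLParser
+from urllib.parse import urljoin, urlparse
+
+import requests
+
+
+class LinkParser(HTMLParser):
+    """Collects the href value of every anchor tag on a page."""
+
+    def __init__(self):
+        super().__init__()
+        self.links = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag == "a":
+            for name, value in attrs:
+                if name == "href" and value:
+                    self.links.append(value)
+
+
+def crawl(start_url, max_pages=25):
+    domain = urlparse(start_url).netloc
+    queue = deque([start_url])
+    seen = {start_url}
+    pages = {}
+
+    while queue and len(pages) < max_pages:
+        url = queue.popleft()
+        try:
+            response = requests.get(url, timeout=10)
+        except requests.RequestException:
+            continue  # skip pages that fail to load
+        pages[url] = response.text
+
+        # Queue up same-domain links we have not visited yet
+        parser = LinkParser()
+        parser.feed(response.text)
+        for link in parser.links:
+            absolute = urljoin(url, link)
+            if urlparse(absolute).netloc == domain and absolute not in seen:
+                seen.add(absolute)
+                queue.append(absolute)
+
+    return pages
+
+
+if __name__ == "__main__":
+    pages = crawl("https://openai.com/")  # placeholder start URL
+    print(f"Crawled {len(pages)} pages")
+```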