From 8af80058661708a31ef5e9955f060bf2845cda1e Mon Sep 17 00:00:00 2001 From: simonpfish Date: Thu, 15 Jun 2023 21:53:12 -0700 Subject: [PATCH] address feedback --- ..._generation_and_embeddings_reranking.ipynb | 519 ++++++++++++++++++ examples/search_augmentation.ipynb | 472 ---------------- images/search_augmentation_embeddings.png | Bin 0 -> 124808 bytes 3 files changed, 519 insertions(+), 472 deletions(-) create mode 100644 examples/Search_augmented_by_query_generation_and_embeddings_reranking.ipynb delete mode 100644 examples/search_augmentation.ipynb create mode 100644 images/search_augmentation_embeddings.png diff --git a/examples/Search_augmented_by_query_generation_and_embeddings_reranking.ipynb b/examples/Search_augmented_by_query_generation_and_embeddings_reranking.ipynb new file mode 100644 index 00000000..e7fb6088 --- /dev/null +++ b/examples/Search_augmented_by_query_generation_and_embeddings_reranking.ipynb @@ -0,0 +1,519 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Search Augmented\n", + "\n", + "### by Query Generation and Embeddings Reranking\n", + "\n", + "Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don’t despair: GPTs can actually do a lot of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.\n", + "\n", + "There are two prominent approaches to using language models for information retrieval:\n", + "\n", + "1. **Mimicking Human Browsing:** The [model triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would.\n", + "2. 
**Retrieval with Embeddings:** Calculating [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content, and then using a metric like cosine distance between the user query and the embedded data to sort and [retrieve information](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb). This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.\n", + "\n", + "These approaches are both promising, but each has its shortcomings: the first can be slow due to its iterative nature, and the second requires embedding your entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.\n", + "\n", + "By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal ElasticSearch instance with private data**. Here’s how it works:\n", + "\n", + "![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_augmentation_embeddings.png)\n", + "\n", + "**Step 1: Search**\n", + "\n", + "1. User asks a question\n", + "2. Model generates a list of potential queries\n", + "3. Search queries are executed in parallel\n", + "\n", + "**Step 2: Re-rank**\n", + "\n", + "1. Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.\n", + "2. Results are ranked and filtered based on this similarity metric.\n", + "\n", + "**Step 3: Answer**\n", + "\n", + "1. 
Given the top search results, the model generates an answer to the user’s question, including references and links.\n", + "\n", + "This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.\n", + "\n", + "## Setup\n", + "\n", + "In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. You can get the API key [here](https://newsapi.org/).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%env NEWS_API_KEY = YOUR_API_KEY\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Dependencies\n", + "from datetime import date, timedelta # date handling for fetching recent news\n", + "from IPython import display # for pretty printing\n", + "import json # for parsing the JSON api responses and model outputs\n", + "from numpy import dot # for cosine similarity\n", + "import openai\n", + "import os # for loading environment variables\n", + "import requests # for making the API requests\n", + "from tqdm import tqdm # for printing progress bars\n", + "\n", + "# Load environment variables\n", + "news_api_key = os.getenv(\"NEWS_API_KEY\")\n", + "\n", + "\n", + "# Helper functions\n", + "def json_gpt(input: str):\n", + " completion = openai.ChatCompletion.create(\n", + " model=\"gpt-4\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n", + " {\"role\": \"user\", \"content\": input},\n", + " ],\n", + " temperature=1,\n", + " )\n", + "\n", + " text = completion.choices[0].message.content\n", + " parsed = json.loads(text)\n", + "\n", + " return parsed\n", + "\n", + "\n", + "def embeddings(input: list[str]) -> list[list[float]]:\n", + " response = openai.Embedding.create(\n", 
+ " model=\"text-embedding-ada-002\", input=input)\n", + " return [data.embedding for data in response.data]\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Search\n", + "\n", + "It all starts with a user question.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# User asks a question\n", + "USER_QUESTION = \"Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['NBA championship winner',\n", + " 'NBA finals MVP',\n", + " 'last NBA championship game',\n", + " 'recent NBA finals results',\n", + " 'NBA finals champions',\n", + " 'NBA finals MVP and winner',\n", + " 'latest NBA championship game details',\n", + " 'NBA championship winning team',\n", + " 'most recent NBA finals MVP',\n", + " 'last NBA finals game summary',\n", + " 'latest NBA finals champion and MVP',\n", + " 'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "QUERIES_INPUT = f\"\"\"\n", + "Generate an array of search queries that are relevant to this question. \n", + "Use a variation of related keywords for the queries, trying to be as general as possible.\n", + "Include as many queries as you can think of, including and excluding terms. \n", + "For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2']. \n", + "Be creative. 
The more queries you include, the more likely you are to find relevant results.\n", + "\n", + "User question: {USER_QUESTION}\n", + "\n", + "Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n", + "\"\"\"\n", + "\n", + "queries = json_gpt(QUERIES_INPUT)[\"queries\"]\n", + "\n", + "# Let's include the original question as well for good measure\n", + "queries.append(USER_QUESTION)\n", + "\n", + "queries\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The queries look good, so let's run the searches.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 12/12 [00:04<00:00, 2.69it/s]\n" + ] + } + ], + "source": [ + "def search_news(query: str):\n", + " # get date 1 week ago\n", + " one_week_ago = (date.today() - timedelta(weeks=1)).strftime(\"%Y-%m-%d\")\n", + "\n", + " response = requests.get(\n", + " \"https://newsapi.org/v2/everything\",\n", + " params={\n", + " \"q\": query,\n", + " \"apiKey\": news_api_key,\n", + " \"pageSize\": 50,\n", + " \"sortBy\": \"relevancy\",\n", + " \"from\": one_week_ago,\n", + " },\n", + " )\n", + "\n", + " return response.json()\n", + "\n", + "\n", + "articles = []\n", + "\n", + "for query in tqdm(queries):\n", + " result = search_news(query)\n", + " if result[\"status\"] == \"ok\":\n", + " articles = articles + result[\"articles\"]\n", + " else:\n", + " raise Exception(result[\"message\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total number of articles: 356\n", + "Top 5 articles of query 1: \n", + "\n", + "Title: Nascar takes on Le Mans as LeBron James gets centenary race under way\n", + "Description: The crowd chanted “U-S-A! 
U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…\n", + "Content: The crowd chanted U-S-A! U-S-A! as Nascar driver lineup for the 24 Hours of Le Mans passed through t...\n", + "\n", + "Title: Futura and Michelob ULTRA Toast to the NBA Finals With Abstract Artwork Crafted From the Brand’s 2023 Limited-Edition Championship Bottles\n", + "Description: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beermaker is back with its celebratory NBA Champ Bottles. This year, the self-proclaimed MVP of joy is dropping a limited-edition bottle made in collaboration with a…\n", + "Content: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beerma...\n", + "\n", + "Title: Signed and Delivered, Futura and Michelob ULTRA Will Gift Hand-Painted Bottles to This Year’s NBA Championship Team\n", + "Description: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basketball lovers and sports fans around the globe as the NBA 2022-2023 season comes to a nail-biting close. In collaboration with artist Futura, Michelob ULTRA will…\n", + "Content: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basket...\n", + "\n", + "Title: Alexis Ohanian and Serena Williams are building a mini-sports empire with a new golf team that's part of a league created by Tiger Woods and Rory McIlroy\n", + "Description: Ohanian and Williams are already co-owners of the National Women's Soccer League Los Angeles team, Angel City FC.\n", + "Content: Alexis Ohanian and Serena Williams attend The 2023 Met Gala.Cindy Ord/Getty Images\n", + "