address feedback

pull/519/head
simonpfish 11 months ago
parent 2a77e0d118
commit 8af8005866

@ -0,0 +1,519 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Search Augmented\n",
"\n",
"### by Query Generation and Embeddings Reranking\n",
"\n",
"Searching for relevant information can sometimes be like looking for a needle in a haystack, but don't despair: GPTs can actually do a lot of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.\n",
"\n",
"There are two prominent approaches to using language models for information retrieval:\n",
"\n",
"1. **Mimicking Human Browsing:** The [model triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would do.\n",
"2. **Retrieval with Embeddings:** Calculating [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content, and then using a metric like cosine distance between the user query and the embedded data to sort and [retrieve information](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb). This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.\n",
"\n",
"These approaches are both promising, but each has their shortcomings: the first one can be slow due to its iterative nature and the second one requires embedding your entire knowledge base in advance, continuously embedding new content and maintaining a vector database.\n",
"\n",
"By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal ElasticSearch instance with private data**. Here's how it works:\n",
"\n",
"![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_augmentation_embeddings.png)\n",
"\n",
"**Step 1: Search**\n",
"\n",
"1. User asks a question\n",
"2. Model generates a list of potential queries\n",
"3. Search queries are executed in parallel\n",
"\n",
"**Step 2: Re-rank**\n",
"\n",
"1. Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.\n",
"2. Results are ranked and filtered based on this similarity metric.\n",
"\n",
"**Step 3: Answer**\n",
"\n",
"1. Given the top search results, the model generates an answer to the user's question, including references and links.\n",
"\n",
"This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.\n",
"\n",
"## Setup\n",
"\n",
"In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. You can get the API key [here](https://newsapi.org/).\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%env NEWS_API_KEY = YOUR_API_KEY\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Dependencies\n",
"from datetime import date, timedelta # date handling for fetching recent news\n",
"from IPython import display # for pretty printing\n",
"import json # for parsing the JSON api responses and model outputs\n",
"from numpy import dot # for cosine similarity\n",
"import openai\n",
"import os # for loading environment variables\n",
"import requests # for making the API requests\n",
"from tqdm import tqdm # for printing progress bars\n",
"\n",
"# Load environment variables\n",
"news_api_key = os.getenv(\"NEWS_API_KEY\")\n",
"\n",
"\n",
"# Helper functions\n",
"def json_gpt(input: str):\n",
" completion = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n",
" {\"role\": \"user\", \"content\": input},\n",
" ],\n",
" temperature=1,\n",
" )\n",
"\n",
" text = completion.choices[0].message.content\n",
" parsed = json.loads(text)\n",
"\n",
" return parsed\n",
"\n",
"\n",
"def embeddings(input: list[str]) -> list[list[float]]:\n",
" response = openai.Embedding.create(\n",
" model=\"text-embedding-ada-002\", input=input)\n",
" return [data.embedding for data in response.data]\n"
]
},
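{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `json_gpt` calls `json.loads` directly, which raises a `json.JSONDecodeError` whenever the model emits anything other than valid JSON. A minimal retry sketch (hypothetical, not used below) that simply regenerates a few times before giving up:\n",
"\n",
"```python\n",
"import json\n",
"\n",
"def parse_json_with_retries(generate, max_attempts=3):\n",
"    # `generate` is any zero-argument callable returning a string,\n",
"    # e.g. a wrapper around the chat completion call above.\n",
"    for attempt in range(max_attempts):\n",
"        text = generate()\n",
"        try:\n",
"            return json.loads(text)\n",
"        except json.JSONDecodeError:\n",
"            if attempt == max_attempts - 1:\n",
"                raise\n",
"```\n"
]
},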
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Search\n",
"\n",
"It all starts with a user question.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# User asks a question\n",
"USER_QUESTION = \"Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NBA championship winner',\n",
" 'NBA finals MVP',\n",
" 'last NBA championship game',\n",
" 'recent NBA finals results',\n",
" 'NBA finals champions',\n",
" 'NBA finals MVP and winner',\n",
" 'latest NBA championship game details',\n",
" 'NBA championship winning team',\n",
" 'most recent NBA finals MVP',\n",
" 'last NBA finals game summary',\n",
" 'latest NBA finals champion and MVP',\n",
" 'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QUERIES_INPUT = f\"\"\"\n",
"Generate an array of search queries that are relevant to this question. \n",
"Use a variation of related keywords for the queries, trying to be as general as possible.\n",
"Include as many queries as you can think of, including and excluding terms. \n",
"For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2']. \n",
"Be creative. The more queries you include, the more likely you are to find relevant results.\n",
"\n",
"User question: {USER_QUESTION}\n",
"\n",
"Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n",
"\"\"\"\n",
"\n",
"queries = json_gpt(QUERIES_INPUT)[\"queries\"]\n",
"\n",
"# Let's include the original question as well for good measure\n",
"queries.append(USER_QUESTION)\n",
"\n",
"queries\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The queries look good, let's run the searches.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 12/12 [00:04<00:00, 2.69it/s]\n"
]
}
],
"source": [
"def search_news(query: str):\n",
" # get date 1 week ago\n",
" one_week_ago = (date.today() - timedelta(weeks=1)).strftime(\"%Y-%m-%d\")\n",
"\n",
" response = requests.get(\n",
" \"https://newsapi.org/v2/everything\",\n",
" params={\n",
" \"q\": query,\n",
" \"apiKey\": news_api_key,\n",
" \"pageSize\": 50,\n",
" \"sortBy\": \"relevancy\",\n",
" \"from\": one_week_ago,\n",
" },\n",
" )\n",
"\n",
" return response.json()\n",
"\n",
"\n",
"articles = []\n",
"\n",
"for query in tqdm(queries):\n",
" result = search_news(query)\n",
" if result[\"status\"] == \"ok\":\n",
" articles = articles + result[\"articles\"]\n",
" else:\n",
" raise Exception(result[\"message\"])"
]
},
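{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above actually runs the searches sequentially. To execute them in parallel, as described in the overview, a thread pool is one option, since `search_news` is I/O-bound (a sketch, assuming the order of results should match the order of queries):\n",
"\n",
"```python\n",
"from concurrent.futures import ThreadPoolExecutor\n",
"\n",
"def run_in_parallel(fn, inputs, max_workers=8):\n",
"    # Fan the queries out across threads; `pool.map` preserves input order.\n",
"    with ThreadPoolExecutor(max_workers=max_workers) as pool:\n",
"        return list(pool.map(fn, inputs))\n",
"\n",
"# e.g. results = run_in_parallel(search_news, queries)\n",
"```\n"
]
},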
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of articles: 356\n",
"Top 5 articles of query 1: \n",
"\n",
"Title: Nascar takes on Le Mans as LeBron James gets centenary race under way\n",
"Description: <ul><li>Nascar has presence at iconic race for first time since 1976</li><li>NBA superstar LeBron James waves flag as honorary starter</li></ul>The crowd chanted “U-S-A! U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…\n",
"Content: The crowd chanted U-S-A! U-S-A! as Nascar driver lineup for the 24 Hours of Le Mans passed through t...\n",
"\n",
"Title: Futura and Michelob ULTRA Toast to the NBA Finals With Abstract Artwork Crafted From the Brands 2023 Limited-Edition Championship Bottles\n",
"Description: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beermaker is back with its celebratory NBA Champ Bottles. This year, the self-proclaimed MVP of joy is dropping a limited-edition bottle made in collaboration with a…\n",
"Content: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beerma...\n",
"\n",
"Title: Signed and Delivered, Futura and Michelob ULTRA Will Gift Hand-Painted Bottles to This Years NBA Championship Team\n",
"Description: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basketball lovers and sports fans around the globe as the NBA 2022-2023 season comes to a nail-biting close. In collaboration with artist Futura, Michelob ULTRA will…\n",
"Content: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basket...\n",
"\n",
"Title: Alexis Ohanian and Serena Williams are building a mini-sports empire with a new golf team that's part of a league created by Tiger Woods and Rory McIlroy\n",
"Description: Ohanian and Williams are already co-owners of the National Women's Soccer League Los Angeles team, Angel City FC.\n",
"Content: Alexis Ohanian and Serena Williams attend The 2023 Met Gala.Cindy Ord/Getty Images\n",
"<ul>\n",
"<li>Alexis ...\n",
"\n",
"Title: Las Vegas wanted the NHL. And now the city has the Stanley Cup\n",
"Description: The Golden Knights won the championship on Tuesday. Its a testament to the decision to move hockey into uncharted territoryA month after Las Vegas was awarded an NHL team in the summer of 2016, a local paper held a series of, in the publications own words, …\n",
"Content: A month after Las Vegas was awarded an NHL team in the summer of 2016, a local paper held a series o...\n",
"\n"
]
}
],
"source": [
"print(\"Total number of articles:\", len(articles))\n",
"print(\"Top 5 articles of query 1:\", \"\\n\")\n",
"\n",
"for article in articles[0:5]:\n",
" print(\"Title:\", article[\"title\"])\n",
" print(\"Description:\", article[\"description\"])\n",
" print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the search queries often return a large number of results, many of which are not relevant to the original question asked by the user. In order to improve the quality of the final answer, we use embeddings to re-rank and filter the results.\n",
"\n",
"# 2. Re-rank\n",
"\n",
"Drawing inspiration from [HyDE (Gao et al.)](https://arxiv.org/abs/2212.10496), we first generate a hypothetical ideal answer to compare our results against during re-ranking. This ensures that we're prioritizing results that look like good answers, rather than those that merely resemble our question. Here's the prompt we use to generate our hypothetical answer.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Team NAME won the NBA championship, and PLAYER_NAME was awarded the MVP title. In the last game, NAME displayed an outstanding performance with X points, Y rebounds, and Z assists, leading their team to a thrilling victory at PLACE.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HA_INPUT = f\"\"\"\n",
"Generate a hypothetical answer to the user's question. This answer will be used to rank search results. \n",
"Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders\n",
"like NAME did something, or NAME said something at PLACE. \n",
"\n",
"User question: {USER_QUESTION}\n",
"\n",
"Format: {{\"hypotheticalAnswer\": \"hypothetical answer text\"}}\n",
"\"\"\"\n",
"\n",
"hypothetical_answer = json_gpt(HA_INPUT)[\"hypotheticalAnswer\"]\n",
"\n",
"hypothetical_answer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's generate embeddings for the search results and the hypothetical answer. We then calculate the cosine similarity between these embeddings, giving us a semantic similarity metric. Note that we can simply calculate the dot product in lieu of doing a full cosine similarity calculation, since the OpenAI API returns normalized embeddings.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.779471308147333,\n",
" 0.7800588417397301,\n",
" 0.7900892607044301,\n",
" 0.7704583513281005,\n",
" 0.7841046909560899,\n",
" 0.8246545759453099,\n",
" 0.8127694680991286,\n",
" 0.8235724294003601,\n",
" 0.7978980332478777,\n",
" 0.8273641985639677]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]\n",
"article_embeddings = embeddings(\n",
" [\n",
" f\"{article['title']} {article['description']} {article['content'][0:100]}\"\n",
" for article in articles\n",
" ]\n",
")\n",
"\n",
"# Calculate cosine similarity\n",
"cosine_similarities = []\n",
"for article_embedding in article_embeddings:\n",
" cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))\n",
"\n",
"cosine_similarities[0:10]"
]
},
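{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The dot-product shortcut works because the embeddings are unit-length: for unit vectors, the dot product and the full cosine similarity coincide. A toy check, with hand-picked unit vectors standing in for embeddings:\n",
"\n",
"```python\n",
"from numpy import dot\n",
"from numpy.linalg import norm\n",
"\n",
"a = [0.6, 0.8]  # both vectors have norm 1\n",
"b = [0.8, 0.6]\n",
"\n",
"full_cosine = dot(a, b) / (norm(a) * norm(b))\n",
"assert abs(dot(a, b) - full_cosine) < 1e-12\n",
"```\n"
]
},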
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we use these similarity scores to sort and filter the results.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top 5 articles: \n",
"\n",
"Title: Barack Obama and Magic Johnson lead Nuggets tributes after first-ever NBA title win\n",
"Description: Straight after the Nuggets' Game 5 win, several standout names and brands within the world of sports and elsewhere paid tribute to Nikola Jokic and his teammates for their historic season.\n",
"Content: The Denver Nuggetsclinched their first ever NBA championship on Monday night, following it's Game 5 ...\n",
"Score: 0.8440720976260312\n",
"\n",
"Title: Barack Obama and Magic Johnson lead Nuggets tributes after first-ever NBA title win\n",
"Description: Straight after the Nuggets' Game 5 win, several standout names and brands within the world of sports and elsewhere paid tribute to Nikola Jokic and his teammates for their historic season.\n",
"Content: The Denver Nuggetsclinched their first ever NBA championship on Monday night, following it's Game 5 ...\n",
"Score: 0.8440720976260312\n",
"\n",
"Title: Nikola Jokic wins NBA Finals MVP, leading Denver Nuggets to first championship\n",
"Description: In the 2023 NBA Finals, Nikola Jokic became the first player ever with a 30-point, 20-rebound triple-double in a Finals game.\n",
"Content: Nikola Jokic did something nice.\n",
"After leading the Denver Nuggets to their first championship in fr...\n",
"Score: 0.8358902345553455\n",
"\n",
"Title: Nikola Jokic named NBA Finals MVP after leading Denver Nuggets to first championship\n",
"Description: Already a two-time league MVP, Nikola Jokic added an NBA championship and Finals MVP award to his impressive resume.\n",
"Content: DENVER Nikola Jokic claimed he didnt care about winning a third consecutive regular-season MVP in 20...\n",
"Score: 0.8358544359823221\n",
"\n",
"Title: Nikola Jokic named NBA Finals MVP after leading Denver Nuggets to first championship\n",
"Description: Already a two-time league MVP, Nikola Jokic added an NBA championship and Finals MVP award to his impressive resume.\n",
"Content: DENVER Nikola Jokic claimed he didnt care about winning a third consecutive regular-season MVP in 20...\n",
"Score: 0.8358361081673931\n",
"\n"
]
}
],
"source": [
"scored_articles = zip(articles, cosine_similarities)\n",
"\n",
"# Sort articles by cosine similarity\n",
"sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)\n",
"\n",
"# Print top 5 articles\n",
"print(\"Top 5 articles:\", \"\\n\")\n",
"\n",
"for article, score in sorted_articles[0:5]:\n",
" print(\"Title:\", article[\"title\"])\n",
" print(\"Description:\", article[\"description\"])\n",
" print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
" print(\"Score:\", score)\n",
" print()"
]
},
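{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the generated queries overlap, the same article can appear several times in the ranked list (as the duplicates above show). A small sketch that keeps only the highest-scoring copy of each article, keyed on the `url` field of the News API response:\n",
"\n",
"```python\n",
"def dedupe_by_url(scored):\n",
"    # `scored` is a list of (article, score) pairs, sorted by score descending,\n",
"    # so the first copy of each URL we see is the highest-scoring one.\n",
"    seen = set()\n",
"    unique = []\n",
"    for article, score in scored:\n",
"        if article['url'] not in seen:\n",
"            seen.add(article['url'])\n",
"            unique.append((article, score))\n",
"    return unique\n",
"\n",
"# e.g. sorted_articles = dedupe_by_url(sorted_articles)\n",
"```\n"
]
},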
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Awesome! These results look a lot more relevant to our original query. Now, let's use the top 5 results to generate a final answer.\n",
"\n",
"## 3. Answer\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"The Denver Nuggets won their first-ever NBA championship, and Nikola Jokic was named the NBA Finals MVP. In the last game, Game 5 of the 2023 NBA Finals, Jokic became the first player ever with a 30-point, 20-rebound triple-double in a Finals game [^1^] [^3^]. The Nuggets' historic season was celebrated and recognized by standout names and brands within the world of sports and elsewhere, including tributes from Barack Obama and Magic Johnson [^1^] [^2^].\n",
"\n",
"[^1^]: [Denver Post](https://www.denverpost.com/2023/06/12/nikola-jokic-nba-finals-mvp-denver-nuggets/)\n",
"[^2^]: [Daily Mail](https://www.dailymail.co.uk/sport/nba/article-12189843/Barack-Obama-Magic-Johnson-lead-Nuggets-tributes-NBA-title-win.html)\n",
"[^3^]: [USA Today](https://www.usatoday.com/story/sports/nba/2023/06/12/nikola-jokic-is-nba-finals-mvp-after-nuggets-win-first-championship/70314309007/)"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"formatted_top_results = [\n",
" {\n",
" \"title\": article[\"title\"],\n",
" \"description\": article[\"description\"],\n",
" \"url\": article[\"url\"],\n",
" }\n",
" for article, _score in sorted_articles[0:5]\n",
"]\n",
"\n",
"ANSWER_INPUT = f\"\"\"\n",
"Generate an answer to the user's question based on the given search results. \n",
"TOP_RESULTS: {formatted_top_results}\n",
"USER_QUESTION: {USER_QUESTION}\n",
"\n",
"Include as much information as possible in the answer. Include references to the search results as markdown links.\n",
"\"\"\"\n",
"\n",
"completion = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": ANSWER_INPUT}],\n",
" temperature=1,\n",
" stream=True,\n",
")\n",
"\n",
"text = \"\"\n",
"for chunk in completion:\n",
" text += chunk.choices[0].delta.get(\"content\", \"\")\n",
" display.clear_output(wait=True)\n",
" display.display(display.Markdown(text))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,472 +0,0 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Search Augmentation\n",
"\n",
"### with Multiple Query Generation and Semantic Reranking\n",
"\n",
"Searching for information can be challenging. We can leverage the completions API and embeddings to help us sift through the noise. In this notebook, we will use the completions API to generate search queries given a user's question, and then rerank the results using semantic similarity to a hypothetical answer.\n",
"\n",
"We can break down this process into three steps:\n",
"\n",
"**1. Search**\n",
"\n",
"- User asks a question\n",
"- Model generates a list of queries\n",
"- Search queries are executed in parallel\n",
"\n",
"**2. Re-rank**\n",
"\n",
"- Model generates an ideal answer by hallucination\n",
"- Search results are ranked based on semantic similarity to the ideal answer\n",
"\n",
"**3. Answer**\n",
"\n",
"- Given the top search results, the model attempts to answer the user question, including references and links.\n",
"\n",
"Let's dive into it! We will use Twitter as an example domain to search over.\n",
"\n",
"## Setup\n",
"\n",
"Once you have your keys, you can set them as environment variables in your`.env` file in the same directory as this notebook. The `.env` file should look like this:\n",
"\n",
"```\n",
"NEWS_API_KEY=your_api_key\n",
"OPENAI_API_KEY=your_openai_api_key\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Dependencies\n",
"import openai\n",
"from tqdm import tqdm\n",
"import os\n",
"import dotenv\n",
"import requests\n",
"import json\n",
"from datetime import date, timedelta\n",
"from numpy import dot\n",
"from IPython import display\n",
"\n",
"\n",
"# Load environment variables\n",
"dotenv.load_dotenv()\n",
"\n",
"news_api_key = os.getenv(\"NEWS_API_KEY\")\n",
"\n",
"\n",
"# Helper functions\n",
"def json_gpt(prompt):\n",
" completion = openai.ChatCompletion.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n",
" {\"role\": \"user\", \"content\": prompt},\n",
" ],\n",
" temperature=1,\n",
" )\n",
"\n",
" text = completion.choices[0].message.content\n",
" parsed = json.loads(text)\n",
"\n",
" return parsed\n",
"\n",
"\n",
"def embedding(input):\n",
" response = openai.Embedding.create(\n",
" model=\"text-embedding-ada-002\", input=input)\n",
" return [data.embedding for data in response.data]\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Search\n",
"\n",
"Let's first generate a set of queries given a user question.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['San Francisco recent events',\n",
" 'news in San Francisco',\n",
" 'San Francisco happenings',\n",
" 'recent developments in San Francisco',\n",
" 'Happening in San Francisco today']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# User asks a question\n",
"USER_QUESTION = \"What has happened recently in San Francisco?\"\n",
"\n",
"# Model generates a list of queries\n",
"PROMPT = f\"\"\"\n",
"Generate an array of search queries that are relevant to this question. \n",
"Use a variation of related keywords for the queries, trying to be as general as possible.\n",
"Include as many queries as you can think of, including and excluding terms. \n",
"For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2']. \n",
"Be creative. The more queries you include, the more likely you are to find relevant results.\n",
"\n",
"User question: {USER_QUESTION}\n",
"\n",
"Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n",
"Maximum 5 queries\n",
"\"\"\"\n",
"\n",
"queries = json_gpt(PROMPT)[\"queries\"]\n",
"\n",
"queries\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The queries look good, let's run the search!\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 5/5 [00:02<00:00, 2.30it/s]\n"
]
}
],
"source": [
"def search_news(query: str):\n",
" # get date 1 week ago\n",
" one_week_ago = (date.today() - timedelta(weeks=1)).strftime(\"%Y-%m-%d\")\n",
"\n",
" response = requests.get(\n",
" \"https://newsapi.org/v2/everything\",\n",
" params={\n",
" \"q\": query,\n",
" \"apiKey\": news_api_key,\n",
" \"pageSize\": 50,\n",
" \"sortBy\": \"relevancy\",\n",
" \"from\": one_week_ago,\n",
" },\n",
" )\n",
"\n",
" return response.json()\n",
"\n",
"\n",
"articles = []\n",
"\n",
"for query in tqdm(queries):\n",
" articles = articles + search_news(query)[\"articles\"]\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of articles: 185\n",
"Top 5 articles: \n",
"\n",
"Title: Samsung confirms Galaxy Z Flip 5, Fold 5 launch details\n",
"Description: Samsung has announced when and where the next Galaxy Unpacked event will take place. It's here where the company will unveil its next foldables.\n",
"Content: <ul><li>Samsung has announced that the launch of its next foldable phones will take place in Korea.<...\n",
"\n",
"Title: AI jobs with mind-blowing paychecks of $375K a year\n",
"Description: When were talking AI and jobs, its easy to be nervous that the tech will make yours obsolete. Scan this list to see if AI might change the way you work.\n",
"Content: Theres no question that artificial intelligence is changing our lives. A bot that sounds almost huma...\n",
"\n",
"Title: US culture wars come to baseball as MLB celebrates Pride month\n",
"Description: The league has made efforts to welcome LGBTQ+ people into ballparks but recent pushback has shown that large parts of the sport remain conservativeWhen the Los Angeles Dodgers arranged their latest annual Pride Night the team probably did not envision facing …\n",
"Content: When the Los Angeles Dodgers arranged their latest annual Pride Night the team probably did not envi...\n",
"\n",
"Title: Samsung's next-generation foldables will be revealed at Galaxy Unpacked in July\n",
"Description: The event will take place in Seoul, South Korea, in late July.\n",
"Content: Samsung has made things official, sharing that its next Galaxy Unpacked event will take place in lat...\n",
"\n",
"Title: Why you might notice more religious groups at Pride celebrations this year\n",
"Description: Some people of faith are organizing a pushback against the wave of anti-LGBTQ rights legislation making its way through state houses this year. They're calling it Faith for Pride.\n",
"Content: A national initiative called Faith for Pride wants religious groups and houses of worship that suppo...\n",
"\n"
]
}
],
"source": [
"print(\"Number of articles:\", len(articles))\n",
"print(\"Top 5 articles:\", \"\\n\")\n",
"\n",
"for article in articles[0:5]:\n",
" print(\"Title:\", article[\"title\"])\n",
" print(\"Description:\", article[\"description\"])\n",
" print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
" print()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a lot of noise in these results, let's now use re-ranking and the completions model to synthesize a good final answer out of all these news articles.\n",
"\n",
"# 2. Re-rank\n",
"\n",
"Let's first generate a hypothetical answer using the completions API.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Recently in San Francisco, the city has implemented a new COVID-19 vaccine mandate for indoor activities and events. Additionally, the Golden State Warriors have started their NBA season, playing games at the Chase Center arena. There have also been ongoing discussions about the city's affordable housing crisis and the need for more solutions to address it.\""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HA_PROMPT = f\"\"\"\n",
"Generate a hypothetical answer to the user's question. This answer which will be used to rank search results. Pretend you have all the information you need to answer.\n",
"\n",
"User question: {USER_QUESTION}\n",
"\n",
"Format: {{\"hypotheticalAnswer\": \"hypothetical answer text\"}}\n",
"\"\"\"\n",
"\n",
"hypothetical_answer = json_gpt(HA_PROMPT)[\"hypotheticalAnswer\"]\n",
"\n",
"hypothetical_answer\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's generate embeddings for the search results and the hypothetical answer.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.7181421473937228,\n",
" 0.7047313247008025,\n",
" 0.7833773244357916,\n",
" 0.7229598385205838,\n",
" 0.7561981577536391,\n",
" 0.7978369364374425,\n",
" 0.765459048492688,\n",
" 0.768682248268667,\n",
" 0.7719928527938629,\n",
" 0.7750573135338864]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hypothetical_answer_embedding = embedding(hypothetical_answer)\n",
"article_embeddings = embedding(\n",
" [\n",
" f\"{article['title']} {article['description']} {article['content'][0:100]}\"\n",
" for article in articles\n",
" ]\n",
")\n",
"\n",
"# Calculate cosine similarity\n",
"cosine_similarities = []\n",
"for article_embedding in article_embeddings:\n",
" cosine_similarities.append(\n",
" dot(hypothetical_answer_embedding, article_embedding)[0])\n",
"\n",
"cosine_similarities[0:10]\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top 5 articles: \n",
"\n",
"Title: San Francisco mayor says city needs to take action on drug crisis, even if services not accepted - ABC7 News Bay Area\n",
"Description: <ol><li>San Francisco mayor says city needs to take action on drug crisis, even if services not accepted  ABC7 News Bay Area\n",
"</li><li>London Breed urges Biden to ramp up federal aid in fentanyl crisis  Axios\n",
"</li><li>San Francisco Mayor London Breed discuss…\n",
"Content: We use cookies and data to<ul><li>Deliver and maintain Google services</li><li>Track outages and pro...\n",
"Score: 0.8246179518912831\n",
"\n",
"Title: San Francisco man says he's witnessing 'collapse' of Western civilization a month after Newsom promised aid\n",
"Description: San Francisco has 'become a fourth world country within a first world country,' Gen Z activist says One month after California Gov. Gavin Newsom promised to crack down on San Francisco's open-air drug markets, a Gen Z activist says far-left politics have made…\n",
"Content: Skip to comments.\n",
"San Francisco man says he's witnessing 'collapse' of Western civilization a month...\n",
"Score: 0.8204558255857742\n",
"\n",
"Title: Dying San Francisco Sustains Another Massive Blow: A Luxury Hotel Chain is Leaving Town.\n",
"Description: Not too long ago, San Francisco was still a remarkably beautiful city, although even then it could change on a dime. One could be in the midst of a leisurely stroll down a gorgeous street filled with chic cafes, turn a corner, and find oneself without warning…\n",
"Content: Skip to comments.\n",
"Dying San Francisco Sustains Another Massive Blow: A Luxury Hotel Chain is Leavin...\n",
"Score: 0.8092623170975866\n",
"\n",
"Title: Park Hotels Gives Up on San Franciscos Doggedly Weak Demand\n",
"Description: It's ominous for San Francisco that a firm just abandoned 9 percent of the city's hotel rooms. It's crying uncle on a city struggling with weak hotel demand. Plus, more highlights from this week's news in hotel deals and development worldwide. -Sean O'Neill\n",
"Content: Here are some excerpts from Daily Lodging Report from the past week. If youre not a subscriber, you ...\n",
"Score: 0.7978839040241306\n",
"\n",
"Title: Park Hotels Gives Up on San Franciscos Doggedly Weak Demand\n",
"Description: It's ominous for San Francisco that a firm just abandoned 9 percent of the city's hotel rooms. It's crying uncle on a city struggling with weak hotel demand. Plus, more highlights from this week's news in hotel deals and development worldwide. -Sean O'Neill\n",
"Content: Here are some excerpts from Daily Lodging Report from the past week. If youre not a subscriber, you ...\n",
"Score: 0.7978369364374425\n",
"\n"
]
}
],
"source": [
"scored_articles = zip(articles, cosine_similarities)\n",
"\n",
"# Sort articles by cosine similarity\n",
"sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)\n",
"\n",
"# Print top 5 articles\n",
"print(\"Top 5 articles:\", \"\\n\")\n",
"\n",
"for article, score in sorted_articles[0:5]:\n",
" print(\"Title:\", article[\"title\"])\n",
" print(\"Description:\", article[\"description\"])\n",
" print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
" print(\"Score:\", score)\n",
" print()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Awesome! These results look a lot more relevant to our original query. Now, let's use the top results to generate a final answer.\n",
"\n",
"## 3. Answer\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Recently, several events have taken place in San Francisco. The city's mayor, London Breed, stated the need for action on the drug crisis, even if services aren't accepted [ABC7 News Bay Area](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1pHdmR6QkQ0ZnBZmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). A few mass shootings have occurred in San Francisco's Mission District, with one resulting in 9 injuries [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1NhNDdrRE9GQUs4mAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). The Park Hotels company has abandoned 9% of San Francisco's hotel rooms due to weak demand in the city [Skift](https://skift.com/2023/06/08/park-hotels-gives-up-on-san-franciscos-doggedly-weak-demand/). In response to the drug crisis, the San Francisco Sheriff has deployed deputies to arrest drug dealers [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC0VvcXlvaUlZSng4mAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). A gas leak and water main break led to evacuations in the city [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1M2UXotMU1qRTVZmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). Additionally, Hilton SF Union Square and Parc 55's owner has stopped payments on loans for these properties [ABC7 News Bay Area](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC3dpQk55U3I0SHBNmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1)."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"formatted_top_results = [\n",
" {\n",
" \"title\": article[\"title\"],\n",
" \"description\": article[\"description\"],\n",
" \"url\": article[\"url\"],\n",
" }\n",
" for article, _score in sorted_articles[0:50]\n",
"]\n",
"\n",
"ANSWER_PROMPT = f\"\"\"\n",
"Generate an answer to the user's question based on the given search results.\n",
"TOP_RESULTS: {formatted_top_results}\n",
"USER_QUESTION: {USER_QUESTION}\n",
"\n",
"Include as much information as possible in the answer. Include references to the search results as markdown links.\n",
"Format: {{\"answer\": \"answer text\"}}\n",
"\"\"\"\n",
"\n",
"completion = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": ANSWER_PROMPT}],\n",
" temperature=1,\n",
" stream=True,\n",
")\n",
"\n",
"text = \"\"\n",
"for chunk in completion:\n",
" text += chunk.choices[0].delta.get(\"content\", \"\")\n",
" display.clear_output(wait=True)\n",
" display.display(display.Markdown(text))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
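The rerank-and-answer flow in the cells above can be condensed into a small, API-free sketch. The toy `articles` list, the hand-written query embedding, and the pure-Python `cosine_similarity` helper below are illustrative stand-ins for the OpenAI embedding calls used in the notebook; only the score-sort-slice pattern mirrors the real pipeline.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings standing in for real embedding-API output.
query_embedding = [0.9, 0.1, 0.0]
articles = [
    {"title": "Hotel chain exits San Francisco", "embedding": [0.8, 0.2, 0.1]},
    {"title": "Local sports roundup", "embedding": [0.0, 0.1, 0.9]},
    {"title": "Drug crisis response in SF", "embedding": [0.7, 0.3, 0.0]},
]

# Score every article against the query, sort by similarity, keep the top
# results -- the same pattern as scored_articles / sorted_articles above.
scored = [(a, cosine_similarity(query_embedding, a["embedding"])) for a in articles]
scored.sort(key=lambda pair: pair[1], reverse=True)
top_results = [article["title"] for article, _score in scored[:2]]
print(top_results)  # the two hotel/drug-crisis articles outrank the sports one
```

In the notebook these top results are then formatted into the answer prompt and sent to the chat model; here the slice `scored[:2]` plays the role of `sorted_articles[0:50]`.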
