openai-cookbook/examples/Question_answering_using_a_...

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question answering using a search API and re-ranking\n",
    "\n",
    "Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don’t despair: GPTs can do a lot of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.\n",
    "\n",
    "Two ways of retrieving information for GPT are:\n",
    "\n",
    "1. **Mimicking Human Browsing:** [GPT triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would do.\n",
    "2. **Retrieval with Embeddings:** Calculate [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content and a user query, and then [retrieve the content](Question_answering_using_embeddings.ipynb) most related as measured by cosine similarity. This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.\n",
    "\n",
    "These approaches are both promising, but each has its shortcomings: the first can be slow due to its iterative nature, and the second requires embedding your entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.\n",
    "\n",
    "By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal ElasticSearch instance with private data**. Here’s how it works:\n",
    "\n",
    "![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_augmentation_embeddings.png)\n",
    "\n",
    "**Step 1: Search**\n",
    "\n",
    "1.  User asks a question.\n",
    "2.  GPT generates a list of potential queries.\n",
    "3.  Search queries are executed in parallel.\n",
    "\n",
    "**Step 2: Re-rank**\n",
    "\n",
    "1.  Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.\n",
    "2.  Results are ranked and filtered based on this similarity metric.\n",
    "\n",
    "**Step 3: Answer**\n",
    "\n",
    "1.  Given the top search results, the model generates an answer to the user’s question, including references and links.\n",
    "\n",
    "This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.\n",
    "\n",
    "## Setup\n",
    "\n",
    "In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. You can get an API key [here](https://newsapi.org/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "%env NEWS_API_KEY = YOUR_NEWS_API_KEY\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dependencies\n",
    "from datetime import date, timedelta  # date handling for fetching recent news\n",
    "from IPython import display  # for pretty printing\n",
    "import json  # for parsing the JSON api responses and model outputs\n",
    "from numpy import dot  # for cosine similarity\n",
    "import openai  # for using GPT and getting embeddings\n",
    "import os  # for loading environment variables\n",
    "import requests  # for making the API requests\n",
    "from tqdm import tqdm  # for printing progress bars\n",
    "\n",
    "# Load environment variables\n",
    "news_api_key = os.getenv(\"NEWS_API_KEY\")\n",
    "\n",
    "GPT_MODEL = \"gpt-3.5-turbo\"\n",
    "\n",
    "# Helper functions\n",
    "def json_gpt(input: str):\n",
    "    completion = openai.ChatCompletion.create(\n",
    "        model=GPT_MODEL,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n",
    "            {\"role\": \"user\", \"content\": input},\n",
    "        ],\n",
    "        temperature=0.5,\n",
    "    )\n",
    "\n",
    "    text = completion.choices[0].message.content\n",
    "    parsed = json.loads(text)\n",
    "\n",
    "    return parsed\n",
    "\n",
    "\n",
    "def embeddings(input: list[str]) -> list[list[float]]:\n",
    "    response = openai.Embedding.create(\n",
    "        model=\"text-embedding-ada-002\", input=input)\n",
    "    return [data.embedding for data in response.data]\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Search\n",
    "\n",
    "It all starts with a user question.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# User asks a question\n",
    "USER_QUESTION = \"Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QUERIES_INPUT = f\"\"\"\n",
    "You have access to a search API that returns recent news articles.\n",
    "Generate an array of search queries that are relevant to this question.\n",
    "Use a variation of related keywords for the queries, trying to be as general as possible.\n",
    "Include as many queries as you can think of, including and excluding terms.\n",
    "For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].\n",
    "Be creative. The more queries you include, the more likely you are to find relevant results.\n",
    "\n",
    "User question: {USER_QUESTION}\n",
    "\n",
    "Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n",
    "\"\"\"\n",
    "\n",
    "queries = json_gpt(QUERIES_INPUT)[\"queries\"]\n",
    "\n",
    "# Let's include the original question as well for good measure\n",
    "queries.append(USER_QUESTION)\n",
    "\n",
    "queries\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The queries look good, so let's run the searches.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def search_news(\n",
    "    query: str,\n",
    "    news_api_key: str = news_api_key,\n",
    "    num_articles: int = 50,\n",
    "    from_datetime: str = \"2023-06-01\",  # the 2023 NBA finals were played in June 2023\n",
    "    to_datetime: str = \"2023-06-30\",\n",
    ") -> dict:\n",
    "    response = requests.get(\n",
    "        \"https://newsapi.org/v2/everything\",\n",
    "        params={\n",
    "            \"q\": query,\n",
    "            \"apiKey\": news_api_key,\n",
    "            \"pageSize\": num_articles,\n",
    "            \"sortBy\": \"relevancy\",\n",
    "            \"from\": from_datetime,\n",
    "            \"to\": to_datetime,\n",
    "        },\n",
    "    )\n",
    "\n",
    "    return response.json()\n",
    "\n",
    "\n",
    "articles = []\n",
    "\n",
    "for query in tqdm(queries):\n",
    "    result = search_news(query)\n",
    "    if result[\"status\"] == \"ok\":\n",
    "        articles = articles + result[\"articles\"]\n",
    "    else:\n",
    "        raise Exception(result[\"message\"])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Total number of articles:\", len(articles))\n",
    "print(\"Top 5 articles of query 1:\", \"\\n\")\n",
    "\n",
    "for article in articles[0:5]:\n",
    "    print(\"Title:\", article[\"title\"])\n",
    "    print(\"Description:\", article[\"description\"])\n",
    "    print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
    "    print()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the search queries often return a large number of results, many of which are not relevant to the user's original question. To improve the quality of the final answer, we use embeddings to re-rank and filter the results.\n",
    "\n",
    "## 2. Re-rank\n",
    "\n",
    "Drawing inspiration from [HyDE (Gao et al.)](https://arxiv.org/abs/2212.10496), we first generate a hypothetical ideal answer to compare our results against. This helps prioritize results that look like good answers, rather than those that are merely similar to our question. Here’s the prompt we use to generate our hypothetical answer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "HA_INPUT = f\"\"\"\n",
    "Generate a hypothetical answer to the user's question. This answer will be used to rank search results.\n",
    "Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders\n",
    "like NAME did something, or NAME said something at PLACE. \n",
    "\n",
    "User question: {USER_QUESTION}\n",
    "\n",
    "Format: {{\"hypotheticalAnswer\": \"hypothetical answer text\"}}\n",
    "\"\"\"\n",
    "\n",
    "hypothetical_answer = json_gpt(HA_INPUT)[\"hypotheticalAnswer\"]\n",
    "\n",
    "hypothetical_answer"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's generate embeddings for the search results and the hypothetical answer. We then calculate the cosine similarity between these embeddings, giving us a semantic similarity metric. Note that we can simply calculate the dot product instead of doing a full cosine similarity calculation, since OpenAI embeddings are returned by the API already normalized to unit length.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]\n",
    "article_embeddings = embeddings(\n",
    "    [\n",
    "        f\"{article['title']} {article['description']} {article['content'][0:100]}\"\n",
    "        for article in articles\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Calculate cosine similarity\n",
    "cosine_similarities = []\n",
    "for article_embedding in article_embeddings:\n",
    "    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))\n",
    "\n",
    "cosine_similarities[0:10]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we use these similarity scores to sort and filter the results.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scored_articles = zip(articles, cosine_similarities)\n",
    "\n",
    "# Sort articles by cosine similarity\n",
    "sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)\n",
    "\n",
    "# Print top 5 articles\n",
    "print(\"Top 5 articles:\", \"\\n\")\n",
    "\n",
    "for article, score in sorted_articles[0:5]:\n",
    "    print(\"Title:\", article[\"title\"])\n",
    "    print(\"Description:\", article[\"description\"])\n",
    "    print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
    "    print(\"Score:\", score)\n",
    "    print()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome! These results look a lot more relevant to our original query. Now, let's use the top 5 results to generate a final answer.\n",
    "\n",
    "## 3. Answer\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "formatted_top_results = [\n",
    "    {\n",
    "        \"title\": article[\"title\"],\n",
    "        \"description\": article[\"description\"],\n",
    "        \"url\": article[\"url\"],\n",
    "    }\n",
    "    for article, _score in sorted_articles[0:5]\n",
    "]\n",
    "\n",
    "ANSWER_INPUT = f\"\"\"\n",
    "Generate an answer to the user's question based on the given search results. \n",
    "TOP_RESULTS: {formatted_top_results}\n",
    "USER_QUESTION: {USER_QUESTION}\n",
    "\n",
    "Include as much information as possible in the answer. Include references to the search results as markdown links.\n",
    "\"\"\"\n",
    "\n",
    "completion = openai.ChatCompletion.create(\n",
    "    model=GPT_MODEL,\n",
    "    messages=[{\"role\": \"user\", \"content\": ANSWER_INPUT}],\n",
    "    temperature=0.5,\n",
    "    stream=True,\n",
    ")\n",
    "\n",
    "text = \"\"\n",
    "for chunk in completion:\n",
    "    text += chunk.choices[0].delta.get(\"content\", \"\")\n",
    "    display.clear_output(wait=True)\n",
    "    display.display(display.Markdown(text))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
-												address feedback

											
										
										
											1 year ago
+								{
 								 "cells": [
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												cleaned up notebook

											
										
										
											1 year ago
+								    "# Question answering using a search API and re-ranking\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
-												be -> feel

											
										
										
											1 year ago
+								    "Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don’t despair, GPTs can actually do a lot of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
-												rewrites intro points to be more consistent with one another

											
										
										
											1 year ago
+								    "Two ways of retrieving information for GPT are:\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
-												the model -> GPT

											
										
										
											1 year ago
+								    "1. **Mimicking Human Browsing:** [GPT triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would do.\n",
-												rewrites intro points to be more consistent with one another

											
										
										
											1 year ago
+								    "2. **Retrieval with Embeddings:** Calculate [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content and a user query, and then [retrieve the content](Question_answering_using_embeddings.ipynb) most related as measured by cosine similarity. This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
 								    "These approaches are both promising, but each has their shortcomings: the first one can be slow due to its iterative nature and the second one requires embedding your entire knowledge base in advance, continuously embedding new content and maintaining a vector database.\n",
 								    "\n",
 								    "By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal ElasticSearch instance with private data**. Here’s how it works:\n",
 								    "\n",
 								    "![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_augmentation_embeddings.png)\n",
 								    "\n",
 								    "**Step 1: Search**\n",
 								    "\n",
-												adds periods to list to be consistent with others

											
										
										
											1 year ago
+								    "1.  User asks a question.\n",
 								    "2.  GPT generates a list of potential queries.\n",
 								    "3.  Search queries are executed in parallel.\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
 								    "**Step 2: Re-rank**\n",
 								    "\n",
 								    "1.  Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.\n",
 								    "2.  Results are ranked and filtered based on this similarity metric.\n",
 								    "\n",
 								    "**Step 3: Answer**\n",
 								    "\n",
 								    "1.  Given the top search results, the model generates an answer to the user’s question, including references and links.\n",
 								    "\n",
 								    "This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.\n",
 								    "\n",
 								    "## Setup\n",
 								    "\n",
-												the -> an

											
										
										
											1 year ago
+								    "In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. You can get an API key [here](https://newsapi.org/).\n"
-												address feedback

											
										
										
											1 year ago
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												cleaned up notebook

											
										
										
											1 year ago
+								    "%%capture\n",
-												renames dummy key

											
										
										
											1 year ago
+								    "%env NEWS_API_KEY = YOUR_NEWS_API_KEY\n"
-												address feedback

											
										
										
											1 year ago
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# Dependencies\n",
 								    "from datetime import date, timedelta  # date handling for fetching recent news\n",
 								    "from IPython import display  # for pretty printing\n",
 								    "import json  # for parsing the JSON api responses and model outputs\n",
 								    "from numpy import dot  # for cosine similarity\n",
-												adds comment for openai package

											
										
										
											1 year ago
+								    "import openai  # for using GPT and getting embeddings\n",
-												address feedback

											
										
										
											1 year ago
+								    "import os  # for loading environment variables\n",
 								    "import requests  # for making the API requests\n",
 								    "from tqdm import tqdm  # for printing progress bars\n",
 								    "\n",
 								    "# Load environment variables\n",
 								    "news_api_key = os.getenv(\"NEWS_API_KEY\")\n",
 								    "\n",
-												cleaned up notebook

											
										
										
											1 year ago
+								    "GPT_MODEL = \"gpt-3.5-turbo\"\n",
-												address feedback

											
										
										
											1 year ago
+								    "\n",
 								    "# Helper functions\n",
 								    "def json_gpt(input: str):\n",
 								    "    completion = openai.ChatCompletion.create(\n",
-												cleaned up notebook

											
										
										
											1 year ago
+								    "        model=GPT_MODEL,\n",
-												address feedback

											
										
										
											1 year ago
+								    "        messages=[\n",
 								    "            {\"role\": \"system\", \"content\": \"Output only valid JSON\"},\n",
 								    "            {\"role\": \"user\", \"content\": input},\n",
 								    "        ],\n",
-												cleaned up notebook

											
										
										
											1 year ago
+								    "        temperature=0.5,\n",
-												address feedback

											
										
										
											1 year ago
+								    "    )\n",
 								    "\n",
 								    "    text = completion.choices[0].message.content\n",
 								    "    parsed = json.loads(text)\n",
 								    "\n",
 								    "    return parsed\n",
 								    "\n",
 								    "\n",
 								    "def embeddings(input: list[str]) -> list[list[str]]:\n",
 								    "    response = openai.Embedding.create(\n",
 								    "        model=\"text-embedding-ada-002\", input=input)\n",
 								    "    return [data.embedding for data in response.data]\n"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "## 1. Search\n",
 								    "\n",
 								    "It all starts with a user question.\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# User asks a question\n",
 								    "USER_QUESTION = \"Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.\""
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
-												cleaned up notebook

											
										
										
											1 year ago
+								   "outputs": [],
-												address feedback

											
										
										
											1 year ago
+								   "source": [
 								    "QUERIES_INPUT = f\"\"\"\n",
-												cleaned up notebook

											
										
										
											1 year ago
+								    "You have access to a search API that returns recent news articles.\n",
-												cuts trailing spaces

											
										
										
											1 year ago
+								    "Generate an array of search queries that are relevant to this question.\n",
-												address feedback

											
										
										
											1 year ago
+								    "Use a variation of related keywords for the queries, trying to be as general as possible.\n",
-												cuts trailing spaces

											
										
										
											1 year ago
+								    "Include as many queries as you can think of, including and excluding terms.\n",
 								    "For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].\n",
-												address feedback

											
										
										
											1 year ago
+								    "Be creative. The more queries you include, the more likely you are to find relevant results.\n",
 								    "\n",
 								    "User question: {USER_QUESTION}\n",
 								    "\n",
 								    "Format: {{\"queries\": [\"query_1\", \"query_2\", \"query_3\"]}}\n",
 								    "\"\"\"\n",
 								    "\n",
 								    "queries = json_gpt(QUERIES_INPUT)[\"queries\"]\n",
 								    "\n",
-												adds missing apostrophe

											
										
										
											1 year ago
+								    "# Let's include the original question as well for good measure\n",
-												address feedback

											
										
										
											1 year ago
+								    "queries.append(USER_QUESTION)\n",
 								    "\n",
 								    "queries\n"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												joins run-on sentence

											
										
										
											1 year ago
+								    "The queries look good, so let's run the searches.\n"
-												address feedback

											
										
										
											1 year ago
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
-												cleaned up notebook

											
										
										
											1 year ago
+								   "outputs": [],
-												address feedback

											
										
										
											1 year ago
+								   "source": [
-												changes search date to June 2023

											
										
										
											1 year ago
+								    "def search_news(\n",
 								    "    query: str,\n",
 								    "    news_api_key: str = news_api_key,\n",
 								    "    num_articles: int = 50,\n",
 								    "    from_datetime: str = \"2023-06-01\",  # the 2023 NBA finals were played in June 2023\n",
 								    "    to_datetime: str = \"2023-06-30\",\n",
 								    ") -> dict:\n",
-												address feedback

											
										
										
											1 year ago
+								    "    response = requests.get(\n",
 								    "        \"https://newsapi.org/v2/everything\",\n",
 								    "        params={\n",
 								    "            \"q\": query,\n",
 								    "            \"apiKey\": news_api_key,\n",
-												changes search date to June 2023

											
										
										
											1 year ago
+								    "            \"pageSize\": num_articles,\n",
-												address feedback

											
										
										
											1 year ago
+								    "            \"sortBy\": \"relevancy\",\n",
-												changes search date to June 2023

											
										
										
											1 year ago
+								    "            \"from\": from_datetime,\n",
 								    "            \"to\": to_datetime,\n",
-												address feedback

											
										
										
											1 year ago
+								    "        },\n",
 								    "    )\n",
 								    "\n",
 								    "    return response.json()\n",
 								    "\n",
 								    "\n",
 								    "articles = []\n",
 								    "\n",
 								    "for query in tqdm(queries):\n",
 								    "    result = search_news(query)\n",
 								    "    if result[\"status\"] == \"ok\":\n",
 								    "        articles = articles + result[\"articles\"]\n",
 								    "    else:\n",
-												changes search date to June 2023

											
										
										
											1 year ago
+								    "        raise Exception(result[\"message\"])\n"
-												address feedback

											
										
										
											1 year ago
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
-												cleaned up notebook

											
										
										
											1 year ago
+								   "outputs": [],
-												address feedback

											
										
										
											1 year ago
+								   "source": [
 								    "print(\"Total number of articles:\", len(articles))\n",
 								    "print(\"Top 5 articles of query 1:\", \"\\n\")\n",
 								    "\n",
 								    "for article in articles[0:5]:\n",
 								    "    print(\"Title:\", article[\"title\"])\n",
 								    "    print(\"Description:\", article[\"description\"])\n",
 								    "    print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
 								    "    print()"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "As we can see, oftentimes, the search queries will return a large number of results, many of which are not relevant to the original question asked by the user. In order to improve the quality of the final answer, we use embeddings to re-rank and filter the results.\n",
 								    "\n",
 								    "# 2. Re-rank\n",
 								    "\n",
-												polishes text

											
										
										
											1 year ago
+								    "Drawing inspiration from [HyDE (Gao et al.)](https://arxiv.org/abs/2212.10496), we first generate a hypothetical ideal answer to rerank our compare our results against. This helps prioritize results that look like good answers, rather than those similar to our question. Here’s the prompt we use to generate our hypothetical answer.\n"
-												address feedback

											
										
										
											1 year ago
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												cleaned up notebook

											
										
										
											1 year ago
+								   "execution_count": null,
-												address feedback

											
										
										
											1 year ago
+								   "metadata": {},
-												cleaned up notebook

											
										
										
											1 year ago
+								   "outputs": [],
-												address feedback

											
										
										
											1 year ago
+								   "source": [
 								    "HA_INPUT = f\"\"\"\n",
 								    "Generate a hypothetical answer to the user's question. This answer which will be used to rank search results. \n",
 								    "Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders\n",
 								    "like NAME did something, or NAME said something at PLACE. \n",
 								    "\n",
 								    "User question: {USER_QUESTION}\n",
 								    "\n",
 								    "Format: {{\"hypotheticalAnswer\": \"hypothetical answer text\"}}\n",
 								    "\"\"\"\n",
 								    "\n",
 								    "hypothetical_answer = json_gpt(HA_INPUT)[\"hypotheticalAnswer\"]\n",
 								    "\n",
 								    "hypothetical_answer"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Now, let's generate embeddings for the search results and the hypothetical answer. We then calculate the cosine similarity between these embeddings, giving us a semantic similarity metric. Note that we can simply calculate the dot product instead of doing a full cosine similarity calculation, since the embeddings returned by the OpenAI API are already normalized.\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "hypothetical_answer_embedding = embeddings(hypothetical_answer)[0]\n",
 								    "article_embeddings = embeddings(\n",
 								    "    [\n",
 								    "        f\"{article['title']} {article['description']} {article['content'][0:100]}\"\n",
 								    "        for article in articles\n",
 								    "    ]\n",
 								    ")\n",
 								    "\n",
 								    "# Calculate cosine similarity\n",
 								    "cosine_similarities = []\n",
 								    "for article_embedding in article_embeddings:\n",
 								    "    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))\n",
 								    "\n",
 								    "cosine_similarities[0:10]"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Finally, we use these similarity scores to sort and filter the results.\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "scored_articles = zip(articles, cosine_similarities)\n",
 								    "\n",
 								    "# Sort articles by cosine similarity\n",
 								    "sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)\n",
 								    "\n",
 								    "# Print top 5 articles\n",
 								    "print(\"Top 5 articles:\", \"\\n\")\n",
 								    "\n",
 								    "for article, score in sorted_articles[0:5]:\n",
 								    "    print(\"Title:\", article[\"title\"])\n",
 								    "    print(\"Description:\", article[\"description\"])\n",
 								    "    print(\"Content:\", article[\"content\"][0:100] + \"...\")\n",
 								    "    print(\"Score:\", score)\n",
 								    "    print()"
 								   ]
 								  },
 								  {
 								   "attachments": {},
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Awesome! These results look a lot more relevant to our original query. Now, let's use the top 5 results to generate a final answer.\n",
 								    "\n",
 								    "# 3. Answer\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "formatted_top_results = [\n",
 								    "    {\n",
 								    "        \"title\": article[\"title\"],\n",
 								    "        \"description\": article[\"description\"],\n",
 								    "        \"url\": article[\"url\"],\n",
 								    "    }\n",
 								    "    for article, _score in sorted_articles[0:5]\n",
 								    "]\n",
 								    "\n",
 								    "ANSWER_INPUT = f\"\"\"\n",
 								    "Generate an answer to the user's question based on the given search results. \n",
 								    "TOP_RESULTS: {formatted_top_results}\n",
 								    "USER_QUESTION: {USER_QUESTION}\n",
 								    "\n",
 								    "Include as much information as possible in the answer. Include references to the search results as markdown links.\n",
 								    "\"\"\"\n",
 								    "\n",
 								    "completion = openai.ChatCompletion.create(\n",
 								    "    model=GPT_MODEL,\n",
 								    "    messages=[{\"role\": \"user\", \"content\": ANSWER_INPUT}],\n",
 								    "    temperature=0.5,\n",
 								    "    stream=True,\n",
 								    ")\n",
 								    "\n",
 								    "text = \"\"\n",
 								    "for chunk in completion:\n",
 								    "    text += chunk.choices[0].delta.get(\"content\", \"\")\n",
 								    "    display.clear_output(wait=True)\n",
 								    "    display.display(display.Markdown(text))\n"
 								   ]
 								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
 								   "version": "3.9.9"
 								  },
 								  "orig_nbformat": 4
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 2
 								}