{ "cells": [ { "cell_type": "markdown", "id": "c4ca8276-e829-4cff-8905-47534e4b4d4e", "metadata": {}, "source": [ "# Question Answering using Embeddings\n", "\n", "Many use cases require GPT-3 to respond to user questions with insightful answers. For example, a customer support chatbot may need to provide answers to common questions. The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.\n", "\n", "In this notebook we will demonstrate a method for enabling GPT-3 able to answer questions using a library of text as a reference, by using document embeddings and retrieval. We'll be using a dataset of Wikipedia articles about the 2020 Summer Olympic Games. Please see [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb) to follow the data gathering process." ] }, { "cell_type": "code", "execution_count": 1, "id": "9e3839a6-9146-4f60-b74b-19abbc24278d", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import openai\n", "import pandas as pd\n", "import pickle\n", "import tiktoken\n", "\n", "COMPLETIONS_MODEL = \"text-davinci-003\"\n", "EMBEDDING_MODEL = \"text-embedding-ada-002\"" ] }, { "cell_type": "markdown", "id": "9312f62f-e208-4030-a648-71ad97aee74f", "metadata": {}, "source": [ "By default, GPT-3 isn't an expert on the 2020 Olympics:" ] }, { "cell_type": "code", "execution_count": 2, "id": "a167516c-7c19-4bda-afa5-031aa0ae13bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Marcelo Chierighini of Brazil won the gold medal in the men's high jump at the 2020 Summer Olympics.\"" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prompt = \"Who won the 2020 Summer Olympics men's high jump?\"\n", "\n", "openai.Completion.create(\n", " prompt=prompt,\n", " temperature=0,\n", " max_tokens=300,\n", " model=COMPLETIONS_MODEL\n", ")[\"choices\"][0][\"text\"].strip(\" \\n\")" ] }, { "attachments": 
{}, "cell_type": "markdown", "id": "47204cce-a7d5-4c81-ab6e-53323026e08c", "metadata": {}, "source": [ "Marcelo is a gold medalist swimmer, and, we assume, not much of a high jumper! Evidently GPT-3 needs some assistance here. \n", "\n", "The first issue to tackle is that the model is hallucinating an answer rather than telling us \"I don't know\". This is bad because it makes it hard to trust the answer that the model gives us! \n", "\n", "# 0) Preventing hallucination with prompt engineering\n", "\n", "We can address this hallucination issue by being more explicit with our prompt:\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "a5451371-17fe-4ef3-aa02-affcf4edb0e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Sorry, I don't know.\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prompt = \"\"\"Answer the question as truthfully as possible, and if you're unsure of the answer, say \"Sorry, I don't know\".\n", "\n", "Q: Who won the 2020 Summer Olympics men's high jump?\n", "A:\"\"\"\n", "\n", "openai.Completion.create(\n", " prompt=prompt,\n", " temperature=0,\n", " max_tokens=300,\n", " model=COMPLETIONS_MODEL\n", ")[\"choices\"][0][\"text\"].strip(\" \\n\")" ] }, { "cell_type": "markdown", "id": "1af18d66-d47a-496d-ae5f-4c5d53caa434", "metadata": {}, "source": [ "To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly. For example we can use this information taken from Wikipedia. We update the initial prompt to tell the model to explicitly make use of the provided text." 
] }, { "cell_type": "code", "execution_count": 4, "id": "fceaf665-2602-4788-bc44-9eb256a6f955", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prompt = \"\"\"Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say \"I don't know\"\n", "\n", "Context:\n", "The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.\n", "33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places \n", "to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).\n", "Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following\n", "a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance\n", "where the athletes of different nations had agreed to share the same medal in the history of Olympics. \n", "Barshim in particular was heard to ask a competition official \"Can we have two golds?\" in response to being offered a \n", "'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and \n", "Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump\n", "for Qatar (all by Barshim). 
Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg\n", "of Sweden (1984 to 1992).\n", "\n", "Q: Who won the 2020 Summer Olympics men's high jump?\n", "A:\"\"\"\n", "\n", "openai.Completion.create(\n", " prompt=prompt,\n", " temperature=0,\n", " max_tokens=300,\n", " top_p=1,\n", " frequency_penalty=0,\n", " presence_penalty=0,\n", " model=COMPLETIONS_MODEL\n", ")[\"choices\"][0][\"text\"].strip(\" \\n\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ee85ee77-d8d2-4788-b57e-0785f2d7e2e3", "metadata": {}, "source": [ "Adding extra information into the prompt only works when the dataset of extra content that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information?\n", "\n", "**In the remainder of this notebook, we will demonstrate a method for augmenting GPT-3 with a large body of additional contextual information by using document embeddings and retrieval.** This method answers queries in two steps: first it retrieves the information relevant to the query, then it writes an answer tailored to the question based on the retrieved information. The first step uses the [Embeddings API](https://beta.openai.com/docs/guides/embeddings), the second step uses the [Completions API](https://beta.openai.com/docs/guides/completion/introduction).\n", " \n", "The steps are:\n", "* Preprocess the contextual information by splitting it into chunks and creating an embedding vector for each chunk.\n", "* On receiving a query, embed the query in the same vector space as the context chunks and find the context embeddings which are most similar to the query.\n", "* Prepend the text of the most relevant context chunks to the query prompt.\n", "* Submit the question along with the most relevant context to GPT, and receive an answer which makes use of the provided contextual information."
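] }, { "cell_type": "markdown", "id": "d3e5f6a7-retrieval-sketch-md", "metadata": {}, "source": [ "The retrieval step above can be sketched with plain NumPy. This is a minimal sketch over made-up 3-dimensional vectors standing in for real Embeddings API output, and the function names are illustrative, not part of any API:" ] }, { "cell_type": "code", "execution_count": null, "id": "d3e5f6a7-retrieval-sketch-code", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def vector_similarity(x, y):\n", "    # OpenAI embeddings are normalised to length 1, so the dot\n", "    # product equals cosine similarity.\n", "    return np.dot(np.array(x), np.array(y))\n", "\n", "def order_by_similarity(query_embedding, context_embeddings):\n", "    # Return (similarity, section_key) pairs, most similar first.\n", "    return sorted(\n", "        ((vector_similarity(query_embedding, emb), key)\n", "         for key, emb in context_embeddings.items()),\n", "        reverse=True,\n", "    )\n", "\n", "# Toy 3-d \"embeddings\" standing in for real API output:\n", "contexts = {\n", "    \"high jump\": [1.0, 0.0, 0.0],\n", "    \"swimming\": [0.0, 1.0, 0.0],\n", "}\n", "order_by_similarity([0.9, 0.1, 0.0], contexts)[0][1]"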
] }, { "cell_type": "markdown", "id": "0c9bfea5-a028-4191-b9f1-f210d76ec4e3", "metadata": {}, "source": [ "# 1) Preprocess the document library\n", "\n", "We plan to use document embeddings to fetch the most relevant part of parts of our document library and insert them into the prompt that we provide to GPT-3. We therefore need to break up the document library into \"sections\" of context, which can be searched and retrieved separately. \n", "\n", "Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into semantically related headers, so we will use these to define our sections. This preprocessing has already been done in [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them." ] }, { "cell_type": "code", "execution_count": 5, "id": "cc9c8d69-e234-48b4-87e3-935970e1523a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3964 rows in the data.\n" ] }, { "data": { "text/html": [ "
\n", " | \n", " | content | \n", "tokens | \n", "
---|---|---|---|
title | \n", "heading | \n", "\n", " | \n", " |
Jamaica at the 2020 Summer Olympics | \n", "Swimming | \n", "Jamaican swimmers further achieved qualifying ... | \n", "51 | \n", "
Archery at the 2020 Summer Olympics – Women's individual | \n", "Background | \n", "This is the 13th consecutive appearance of the... | \n", "136 | \n", "
Germany at the 2020 Summer Olympics | \n", "Sport climbing | \n", "Germany entered two sport climbers into the Ol... | \n", "98 | \n", "
Cycling at the 2020 Summer Olympics – Women's BMX racing | \n", "Competition format | \n", "The competition was a three-round tournament, ... | \n", "215 | \n", "
Volleyball at the 2020 Summer Olympics – Men's tournament | \n", "Format | \n", "The preliminary round was a competition betwee... | \n", "104 | \n", "