{
"cells": [
{
"cell_type": "markdown",
"id": "c4ca8276-e829-4cff-8905-47534e4b4d4e",
"metadata": {},
"source": [
"# Question Answering using Embeddings\n",
"\n",
"Many use cases require GPT-3 to respond to user questions with insightful answers. For example, a customer support chatbot may need to provide answers to common questions. The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.\n",
"\n",
"In this notebook we will demonstrate a method for enabling GPT-3 to answer questions using a library of text as a reference, by using document embeddings and retrieval. We'll be using a dataset of Wikipedia articles about the 2020 Summer Olympic Games. Please see [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb) to follow the data gathering process."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9e3839a6-9146-4f60-b74b-19abbc24278d",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import openai\n",
"import pandas as pd\n",
"import pickle\n",
"import tiktoken\n",
"\n",
"COMPLETIONS_MODEL = \"text-davinci-003\"\n",
"EMBEDDING_MODEL = \"text-embedding-ada-002\""
]
},
{
"cell_type": "markdown",
"id": "9312f62f-e208-4030-a648-71ad97aee74f",
"metadata": {},
"source": [
"By default, GPT-3 isn't an expert on the 2020 Olympics:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a167516c-7c19-4bda-afa5-031aa0ae13bb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Marcelo Chierighini of Brazil won the gold medal in the men's high jump at the 2020 Summer Olympics.\""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt = \"Who won the 2020 Summer Olympics men's high jump?\"\n",
"\n",
"openai.Completion.create(\n",
"    prompt=prompt,\n",
"    temperature=0,\n",
"    max_tokens=300,\n",
"    model=COMPLETIONS_MODEL\n",
")[\"choices\"][0][\"text\"].strip(\" \\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "47204cce-a7d5-4c81-ab6e-53323026e08c",
"metadata": {},
"source": [
"Marcelo is a gold medalist swimmer, and, we assume, not much of a high jumper! Evidently GPT-3 needs some assistance here. \n",
"\n",
"The first issue to tackle is that the model is hallucinating an answer rather than telling us \"I don't know\". This is bad because it makes it hard to trust the answer that the model gives us! \n",
"\n",
"# 0) Preventing hallucination with prompt engineering\n",
"\n",
"We can address this hallucination issue by being more explicit with our prompt:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a5451371-17fe-4ef3-aa02-affcf4edb0e0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Sorry, I don't know.\""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt = \"\"\"Answer the question as truthfully as possible, and if you're unsure of the answer, say \"Sorry, I don't know\".\n",
"\n",
"Q: Who won the 2020 Summer Olympics men's high jump?\n",
"A:\"\"\"\n",
"\n",
"openai.Completion.create(\n",
"    prompt=prompt,\n",
"    temperature=0,\n",
"    max_tokens=300,\n",
"    model=COMPLETIONS_MODEL\n",
")[\"choices\"][0][\"text\"].strip(\" \\n\")"
]
},
{
"cell_type": "markdown",
"id": "1af18d66-d47a-496d-ae5f-4c5d53caa434",
"metadata": {},
"source": [
"To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly. For example, we can use the following passage taken from Wikipedia. We also update the initial prompt to tell the model explicitly to make use of the provided text."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fceaf665-2602-4788-bc44-9eb256a6f955",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt = \"\"\"Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say \"I don't know\"\n",
"\n",
"Context:\n",
"The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.\n",
"33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places \n",
"to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).\n",
"Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following\n",
"a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance\n",
"where the athletes of different nations had agreed to share the same medal in the history of Olympics. \n",
"Barshim in particular was heard to ask a competition official \"Can we have two golds?\" in response to being offered a \n",
"'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and \n",
"Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump\n",
"for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg\n",
"of Sweden (1984 to 1992).\n",
"\n",
"Q: Who won the 2020 Summer Olympics men's high jump?\n",
"A:\"\"\"\n",
"\n",
"openai.Completion.create(\n",
"    prompt=prompt,\n",
"    temperature=0,\n",
"    max_tokens=300,\n",
"    top_p=1,\n",
"    frequency_penalty=0,\n",
"    presence_penalty=0,\n",
"    model=COMPLETIONS_MODEL\n",
")[\"choices\"][0][\"text\"].strip(\" \\n\")"
]
},
{
"cell_type": "markdown",
"id": "ee85ee77-d8d2-4788-b57e-0785f2d7e2e3",
"metadata": {},
"source": [
"Adding extra information into the prompt only works when the dataset of extra content that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information?\n",
"\n",
"**In the remainder of this notebook, we will demonstrate a method for augmenting GPT-3 with a large body of additional contextual information by using document embeddings and retrieval.** This method answers queries in two steps: first it retrieves the information relevant to the query, then it writes an answer tailored to the question based on the retrieved information. The first step uses the [Embedding API](https://beta.openai.com/docs/guides/embeddings), the second step uses the [Completions API](https://beta.openai.com/docs/guides/completion/introduction).\n",
" \n",
"The steps are:\n",
"* Preprocess the contextual information by splitting it into chunks and creating an embedding vector for each chunk.\n",
"* On receiving a query, embed the query in the same vector space as the context chunks and find the context embeddings which are most similar to the query.\n",
"* Prepend the text of the most relevant context chunks to the query prompt.\n",
"* Submit the question along with the most relevant context to GPT, and receive an answer which makes use of the provided contextual information."
]
},
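{
"cell_type": "markdown",
"id": "6b1e2f4a-0c3d-4e5f-8a7b-9c0d1e2f3a4b",
"metadata": {},
"source": [
"The steps listed above can be sketched end to end without any API calls. This is a minimal illustration under stated assumptions, not the method used in the rest of this notebook: `toy_embed` is a hypothetical stand-in that builds a normalized bag-of-words vector instead of calling the Embeddings API, and the two chunks are made-up sentences.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"# Toy stand-in for an embedding model: a normalized bag-of-words vector.\n",
"# A real pipeline would call the Embeddings API here instead.\n",
"def toy_embed(text):\n",
"    counts = Counter(text.lower().split())\n",
"    norm = math.sqrt(sum(c * c for c in counts.values()))\n",
"    return {word: c / norm for word, c in counts.items()}\n",
"\n",
"def similarity(a, b):\n",
"    # Dot product of two sparse unit vectors, i.e. their cosine similarity.\n",
"    return sum(weight * b.get(word, 0.0) for word, weight in a.items())\n",
"\n",
"# Step 1: split the library into chunks and embed each chunk.\n",
"chunks = [\n",
"    'the high jump final ended in a shared gold medal',\n",
"    'the swimming events took place in the pool',\n",
"]\n",
"chunk_vectors = [toy_embed(c) for c in chunks]\n",
"\n",
"# Step 2: embed the query and find the most similar chunk.\n",
"question = 'who won the high jump'\n",
"query_vector = toy_embed(question)\n",
"best = max(range(len(chunks)), key=lambda i: similarity(query_vector, chunk_vectors[i]))\n",
"\n",
"# Step 3: prepend the retrieved text to the question.\n",
"prompt = 'Context:\\n* ' + chunks[best] + '\\n\\nQ: ' + question + '\\nA:'\n",
"\n",
"# Step 4 would send `prompt` to the completions model.\n",
"print(prompt)\n",
"```\n",
"\n",
"With real embeddings only `toy_embed` changes; the ranking and prompt assembly keep the same shape."
]
},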
{
"cell_type": "markdown",
"id": "0c9bfea5-a028-4191-b9f1-f210d76ec4e3",
"metadata": {},
"source": [
"# 1) Preprocess the document library\n",
"\n",
"We plan to use document embeddings to fetch the most relevant part or parts of our document library and insert them into the prompt that we provide to GPT-3. We therefore need to break up the document library into \"sections\" of context, which can be searched and retrieved separately. \n",
"\n",
"Sections should be large enough to contain the information needed to answer a question, but small enough that one or several fit into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped under semantically related headings, so we will use these to define our sections. This preprocessing has already been done in [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them."
]
},
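{
"cell_type": "markdown",
"id": "2d9c7e10-5a4b-4f3e-b6c8-1a0f9e8d7c6b",
"metadata": {},
"source": [
"For a library that does not come with ready-made headings, one possible way to form paragraph-sized sections is sketched below. It is an illustration only and is not how the Olympics dataset was prepared: `split_into_sections` and `max_words` are hypothetical names, and word count is used as a rough proxy for token count (a tokenizer such as `tiktoken` would give exact numbers).\n",
"\n",
"```python\n",
"def split_into_sections(text, max_words=120):\n",
"    # Split on blank lines, then merge consecutive paragraphs until the\n",
"    # word budget is reached. Word count is a rough proxy for token count.\n",
"    paragraphs = [p.strip() for p in text.split('\\n\\n') if p.strip()]\n",
"    sections, current, current_len = [], [], 0\n",
"    for paragraph in paragraphs:\n",
"        n_words = len(paragraph.split())\n",
"        if current and current_len + n_words > max_words:\n",
"            sections.append(' '.join(current))\n",
"            current, current_len = [], 0\n",
"        current.append(paragraph)\n",
"        current_len += n_words\n",
"    if current:\n",
"        sections.append(' '.join(current))\n",
"    return sections\n",
"\n",
"article = 'First paragraph about the event.\\n\\nSecond paragraph with results.\\n\\nThird paragraph on records.'\n",
"for section in split_into_sections(article, max_words=10):\n",
"    print(len(section.split()), section)\n",
"```\n",
"\n",
"A paragraph longer than the budget is kept whole as its own section here; splitting it further is a design choice left to the use case."
]
},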
{
"cell_type": "code",
"execution_count": 5,
"id": "cc9c8d69-e234-48b4-87e3-935970e1523a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3964 rows in the data.\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>content</th>\n",
" <th>tokens</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th>heading</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Jamaica at the 2020 Summer Olympics</th>\n",
" <th>Swimming</th>\n",
" <td>Jamaican swimmers further achieved qualifying ...</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Archery at the 2020 Summer Olympics Women's individual</th>\n",
" <th>Background</th>\n",
" <td>This is the 13th consecutive appearance of the...</td>\n",
" <td>136</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Germany at the 2020 Summer Olympics</th>\n",
" <th>Sport climbing</th>\n",
" <td>Germany entered two sport climbers into the Ol...</td>\n",
" <td>98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cycling at the 2020 Summer Olympics Women's BMX racing</th>\n",
" <th>Competition format</th>\n",
" <td>The competition was a three-round tournament, ...</td>\n",
" <td>215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Volleyball at the 2020 Summer Olympics Men's tournament</th>\n",
" <th>Format</th>\n",
" <td>The preliminary round was a competition betwee...</td>\n",
" <td>104</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" content \\\n",
"title heading \n",
"Jamaica at the 2020 Summer Olympics Swimming Jamaican swimmers further achieved qualifying ... \n",
"Archery at the 2020 Summer Olympics Women's i... Background This is the 13th consecutive appearance of the... \n",
"Germany at the 2020 Summer Olympics Sport climbing Germany entered two sport climbers into the Ol... \n",
"Cycling at the 2020 Summer Olympics Women's B... Competition format The competition was a three-round tournament, ... \n",
"Volleyball at the 2020 Summer Olympics Men's ... Format The preliminary round was a competition betwee... \n",
"\n",
" tokens \n",
"title heading \n",
"Jamaica at the 2020 Summer Olympics Swimming 51 \n",
"Archery at the 2020 Summer Olympics Women's i... Background 136 \n",
"Germany at the 2020 Summer Olympics Sport climbing 98 \n",
"Cycling at the 2020 Summer Olympics Women's B... Competition format 215 \n",
"Volleyball at the 2020 Summer Olympics Men's ... Format 104 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We have hosted the processed dataset, so you can download it directly without having to recreate it.\n",
"# This dataset has already been split into sections, one row for each section of the Wikipedia page.\n",
"\n",
"df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')\n",
"df = df.set_index([\"title\", \"heading\"])\n",
"print(f\"{len(df)} rows in the data.\")\n",
"df.sample(5)"
]
},
{
"cell_type": "markdown",
"id": "a17b88b9-7ea2-491e-9727-12617c74a77d",
"metadata": {},
"source": [
"We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.openai.com/docs/guides/embeddings) for more information.\n",
"\n",
"This indexing stage can be executed offline and only runs once to precompute the indexes for the dataset so that each piece of content can be retrieved later. Since this is a small example, we will store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.\n",
"\n",
"For the purposes of this tutorial we use the `text-embedding-ada-002` model (the `EMBEDDING_MODEL` defined above), which returns 1536-dimensional embeddings, as the example entry printed later in this section shows. The same model is used to embed both the document sections and the queries (see the [documentation](https://beta.openai.com/docs/guides/embeddings))."
]
},
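{
"cell_type": "markdown",
"id": "9a8b7c6d-1e2f-4a3b-8c4d-5e6f7a8b9c0d",
"metadata": {},
"source": [
"To make \"closer\" concrete, the sketch below measures closeness with cosine similarity. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper written for this illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_similarity(x, y):\n",
"    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)\n",
"    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))\n",
"\n",
"# Made-up 3-dimensional vectors standing in for real embeddings.\n",
"high_jump = np.array([0.9, 0.1, 0.2])\n",
"pole_vault = np.array([0.8, 0.3, 0.1])\n",
"swimming = np.array([0.1, 0.9, 0.3])\n",
"\n",
"print(cosine_similarity(high_jump, pole_vault))  # closer to 1: more similar\n",
"print(cosine_similarity(high_jump, swimming))    # smaller: less similar\n",
"\n",
"# After scaling to unit length, the plain dot product gives the same number.\n",
"a = high_jump / np.linalg.norm(high_jump)\n",
"b = pole_vault / np.linalg.norm(pole_vault)\n",
"print(np.isclose(np.dot(a, b), cosine_similarity(high_jump, pole_vault)))\n",
"```\n",
"\n",
"The last check is why a dot product alone is enough whenever the embeddings are already normalized to length 1."
]
},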
{
"cell_type": "code",
"execution_count": 6,
"id": "ba475f30-ef7f-431c-b60d-d5970b62ad09",
"metadata": {},
"outputs": [],
"source": [
"def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:\n",
"    result = openai.Embedding.create(\n",
"        model=model,\n",
"        input=text\n",
"    )\n",
"    return result[\"data\"][0][\"embedding\"]\n",
"\n",
"def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:\n",
"    \"\"\"\n",
"    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.\n",
"    \n",
"    Return a dictionary that maps each row's index, a (title, heading) tuple, to its embedding vector.\n",
"    \"\"\"\n",
"    return {\n",
"        idx: get_embedding(r.content) for idx, r in df.iterrows()\n",
"    }"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "737266aa-cbe7-4691-87c1-fce8a31632f1",
"metadata": {},
"outputs": [],
"source": [
"def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:\n",
"    \"\"\"\n",
"    Read the document embeddings and their keys from a CSV.\n",
"    \n",
"    fname is the path to a CSV with exactly these named columns: \n",
"        \"title\", \"heading\", \"0\", \"1\", ... up to the length of the embedding vectors.\n",
"    \"\"\"\n",
"    \n",
"    df = pd.read_csv(fname, header=0)\n",
"    max_dim = max([int(c) for c in df.columns if c != \"title\" and c != \"heading\"])\n",
"    return {\n",
"        (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()\n",
"    }"
]
},
{
"cell_type": "markdown",
"id": "cfe9c723-f838-4c75-8ed8-286b2e491a60",
"metadata": {},
"source": [
"Again, we have hosted the embeddings for you so you don't have to re-calculate them from scratch."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ab50bfca-cb02-41c6-b338-4400abe1d86e",
"metadata": {},
"outputs": [],
"source": [
"document_embeddings = load_embeddings(\"https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv\")\n",
"\n",
"# ===== OR, uncomment the below line to recalculate the embeddings from scratch. ========\n",
"\n",
"# document_embeddings = compute_doc_embeddings(df)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b9a8c713-c8a9-47dc-85a4-871ee1395566",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('2020 Summer Olympics', 'Summary') : [0.0037565305829048, -0.0061981128528714, -0.0087078781798481, -0.0071364338509738, -0.0025227521546185]... (1536 entries)\n"
]
}
],
"source": [
"# An example embedding:\n",
"example_entry = list(document_embeddings.items())[0]\n",
"print(f\"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)\")"
]
},
{
"cell_type": "markdown",
"id": "aa32cf88-9edb-4dc6-b4cf-a16a8de7d304",
"metadata": {
"tags": []
},
"source": [
"So we have split our document library into sections, and encoded them by creating embedding vectors that represent each chunk. Next we will use these embeddings to answer our users' questions.\n",
"\n",
"# 2) Find the most similar document embeddings to the question embedding\n",
"\n",
"At the time of question-answering, to answer the user's query we compute the query embedding of the question and use it to find the most similar document sections. Since this is a small example, we store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "dcd680e9-f194-4180-b14f-fc357498eb92",
"metadata": {},
"outputs": [],
"source": [
"def vector_similarity(x: list[float], y: list[float]) -> float:\n",
"    \"\"\"\n",
"    Returns the similarity between two vectors.\n",
"    \n",
"    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.\n",
"    \"\"\"\n",
"    return np.dot(np.array(x), np.array(y))\n",
"\n",
"def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:\n",
"    \"\"\"\n",
"    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings\n",
"    to find the most relevant sections. \n",
"    \n",
"    Return a list of (similarity, section index) tuples, sorted by relevance in descending order.\n",
"    \"\"\"\n",
"    query_embedding = get_embedding(query)\n",
"    \n",
"    document_similarities = sorted([\n",
"        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()\n",
"    ], reverse=True)\n",
"    \n",
"    return document_similarities"
]
},
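{
"cell_type": "markdown",
"id": "4c3b2a19-8f7e-4d6c-a5b4-3c2d1e0f9a8b",
"metadata": {},
"source": [
"The function above computes one dot product per section in a Python loop. As an optional variation, the same ranking can be obtained with a single matrix-vector product. The sketch below is not part of the original notebook: `top_k_sections` and `toy_contexts` are hypothetical names, and the query embedding is passed in directly rather than fetched from the API.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def top_k_sections(query_embedding, contexts, k=5):\n",
"    # Stack all document embeddings into one (n_sections, dim) matrix.\n",
"    keys = list(contexts.keys())\n",
"    matrix = np.array([contexts[key] for key in keys], dtype=float)\n",
"    # One matrix-vector product yields every dot-product similarity at once.\n",
"    scores = matrix @ np.asarray(query_embedding, dtype=float)\n",
"    # argsort is ascending, so reverse it and keep the first k indices.\n",
"    order = np.argsort(scores)[::-1][:k]\n",
"    return [(float(scores[i]), keys[i]) for i in order]\n",
"\n",
"toy_contexts = {\n",
"    ('Doc A', 'Summary'): [1.0, 0.0],\n",
"    ('Doc B', 'Summary'): [0.6, 0.8],\n",
"    ('Doc C', 'Summary'): [0.0, 1.0],\n",
"}\n",
"print(top_k_sections([0.8, 0.6], toy_contexts, k=2))\n",
"```\n",
"\n",
"For a few thousand sections either approach is fast enough; the matrix form mainly helps as the library grows, before a dedicated vector search engine becomes worthwhile."
]
},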
{
"cell_type": "code",
"execution_count": 11,
"id": "e3a27d73-f47f-480d-b336-079414f749cb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(0.884864308450606,\n",
" (\"Athletics at the 2020 Summer Olympics Men's high jump\", 'Summary')),\n",
" (0.8633938355935518,\n",
" (\"Athletics at the 2020 Summer Olympics Men's pole vault\", 'Summary')),\n",
" (0.861639730583851,\n",
" (\"Athletics at the 2020 Summer Olympics Men's long jump\", 'Summary')),\n",
" (0.8560523857031264,\n",
" (\"Athletics at the 2020 Summer Olympics Men's triple jump\", 'Summary')),\n",
" (0.8469039130441247,\n",
" (\"Athletics at the 2020 Summer Olympics Men's 110 metres hurdles\",\n",
" 'Summary'))]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"order_document_sections_by_query_similarity(\"Who won the men's high jump?\", document_embeddings)[:5]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "729c2ce7-8540-4ab2-bb3a-76c4dfcb689c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(0.8726165220223294,\n",
" (\"Athletics at the 2020 Summer Olympics Women's long jump\", 'Summary')),\n",
" (0.8682196158313358,\n",
" (\"Athletics at the 2020 Summer Olympics Women's high jump\", 'Summary')),\n",
" (0.863191526370672,\n",
" (\"Athletics at the 2020 Summer Olympics Women's pole vault\", 'Summary')),\n",
" (0.8609374262115406,\n",
" (\"Athletics at the 2020 Summer Olympics Women's triple jump\", 'Summary')),\n",
" (0.8581515607285688,\n",
" (\"Athletics at the 2020 Summer Olympics Women's 100 metres hurdles\",\n",
" 'Summary'))]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"order_document_sections_by_query_similarity(\"Who won the women's high jump?\", document_embeddings)[:5]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3cf71fae-abb1-46b2-a483-c1b2f1a915c2",
"metadata": {},
"source": [
"We can see that the most relevant document sections for each question include the summaries for the Men's and Women's high jump competitions - which is exactly what we would expect."
]
},
{
"cell_type": "markdown",
"id": "a0efa0f6-4469-457a-89a4-a2f5736a01e0",
"metadata": {},
"source": [
"# 3) Add the most relevant document sections to the query prompt\n",
"\n",
"Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. It is helpful to use a query separator to help the model distinguish between separate pieces of text."
]
},
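{
"cell_type": "markdown",
"id": "7e6d5c4b-3a29-4b18-9f07-6e5d4c3b2a10",
"metadata": {},
"source": [
"The idea of filling a fixed token budget with ranked sections can be seen in isolation on toy data. In this sketch `pick_sections` is a hypothetical helper, and the section texts, token counts, and budget are made-up numbers chosen only for illustration.\n",
"\n",
"```python\n",
"def pick_sections(ranked_sections, max_tokens, separator_tokens):\n",
"    # ranked_sections: list of (text, token_count), most relevant first.\n",
"    chosen, used = [], 0\n",
"    for text, token_count in ranked_sections:\n",
"        used += token_count + separator_tokens\n",
"        if used > max_tokens:\n",
"            break  # stop at the first section that does not fit\n",
"        chosen.append(text)\n",
"    return chosen\n",
"\n",
"separator = '\\n* '\n",
"ranked = [('High jump summary', 300), ('Long jump summary', 150), ('Pole vault summary', 200)]\n",
"sections = pick_sections(ranked, max_tokens=500, separator_tokens=3)\n",
"question = 'Who won the high jump?'\n",
"prompt = 'Context:\\n' + ''.join(separator + s for s in sections) + '\\n\\n Q: ' + question + '\\n A:'\n",
"print(prompt)\n",
"```\n",
"\n",
"Stopping at the first section that does not fit keeps the selection strictly in relevance order. Skipping that section and trying smaller ones further down the ranking is an alternative that uses more of the budget at the cost of including less relevant text."
]
},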
{
"cell_type": "code",
"execution_count": 13,
"id": "b763ace2-1946-48e0-8ff1-91ba335d47a0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Context separator contains 3 tokens'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MAX_SECTION_LEN = 500\n",
"SEPARATOR = \"\\n* \"\n",
"ENCODING = \"cl100k_base\" # encoding for text-embedding-ada-002\n",
"\n",
"encoding = tiktoken.get_encoding(ENCODING)\n",
"separator_len = len(encoding.encode(SEPARATOR))\n",
"\n",
"f\"Context separator contains {separator_len} tokens\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0c5c0509-eeb9-4552-a5d4-6ace04ef73dd",
"metadata": {},
"outputs": [],
"source": [
"def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:\n",
"    \"\"\"\n",
"    Fetch the most relevant document sections for the question and assemble them into a prompt.\n",
"    \"\"\"\n",
"    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)\n",
"    \n",
"    chosen_sections = []\n",
"    chosen_sections_len = 0\n",
"    chosen_sections_indexes = []\n",
"    \n",
"    for _, section_index in most_relevant_document_sections:\n",
"        # Add contexts until we run out of space.\n",
"        document_section = df.loc[section_index]\n",
"        \n",
"        chosen_sections_len += document_section.tokens + separator_len\n",
"        if chosen_sections_len > MAX_SECTION_LEN:\n",
"            break\n",
"        \n",
"        chosen_sections.append(SEPARATOR + document_section.content.replace(\"\\n\", \" \"))\n",
"        chosen_sections_indexes.append(str(section_index))\n",
"    \n",
"    # Useful diagnostic information\n",
"    print(f\"Selected {len(chosen_sections)} document sections:\")\n",
"    print(\"\\n\".join(chosen_sections_indexes))\n",
"    \n",
"    header = \"\"\"Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say \"I don't know.\"\\n\\nContext:\\n\"\"\"\n",
"    \n",
"    return header + \"\".join(chosen_sections) + \"\\n\\n Q: \" + question + \"\\n A:\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f614045a-3917-4b28-9643-7e0c299ec1a7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"(\"Athletics at the 2020 Summer Olympics Men's high jump\", 'Summary')\n",
"(\"Athletics at the 2020 Summer Olympics Men's long jump\", 'Summary')\n",
"===\n",
" Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say \"I don't know.\"\n",
"\n",
"Context:\n",
"\n",
"* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official \"Can we have two golds?\" in response to being offered a 'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).\n",
"* The men's long jump event at the 2020 Summer Olympics took place between 31 July and 2 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (1 universality place was used in 2016). 31 athletes from 20 nations competed. Miltiadis Tentoglou won the gold medal, Greece's first medal in the men's long jump. Cuban athletes Juan Miguel Echevarría and Maykel Massó earned silver and bronze, respectively, the nation's first medals in the event since 2008.\n",
"\n",
" Q: Who won the 2020 Summer Olympics men's high jump?\n",
" A:\n"
]
}
],
"source": [
"prompt = construct_prompt(\n",
"    \"Who won the 2020 Summer Olympics men's high jump?\",\n",
"    document_embeddings,\n",
"    df\n",
")\n",
"\n",
"print(\"===\\n\", prompt)"
]
},
{
"cell_type": "markdown",
"id": "1b022fd4-0a3c-4ae1-bed1-4c80e4f0fb56",
"metadata": {
"tags": []
},
"source": [
"We have now obtained the document sections that are most relevant to the question. As a final step, let's put it all together to get an answer to the question.\n",
"\n",
"# 4) Answer the user's question based on the context.\n",
"\n",
"Now that we've retrieved the relevant context and constructed our prompt, we can finally use the Completions API to answer the user's query."
]
},
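{
"cell_type": "markdown",
"id": "5f4e3d2c-1b0a-4998-8877-665544332211",
"metadata": {},
"source": [
"The cells in this notebook call `openai.Completion.create`, which is the interface of the `openai` Python package before version 1.0. With version 1.0 or later of that package, requests go through a client object instead. The sketch below shows one possible way to send the constructed prompt through the Chat Completions endpoint. It is not part of the original notebook: `complete_with_chat_api` is a hypothetical helper, and the model name is an assumption that should be replaced with a model available to your account.\n",
"\n",
"```python\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI()  # reads the OPENAI_API_KEY environment variable\n",
"\n",
"def complete_with_chat_api(prompt, model='gpt-3.5-turbo'):\n",
"    # The model name is an assumption; substitute one available to your account.\n",
"    response = client.chat.completions.create(\n",
"        model=model,\n",
"        messages=[{'role': 'user', 'content': prompt}],\n",
"        temperature=0,\n",
"        max_tokens=300,\n",
"    )\n",
"    return response.choices[0].message.content.strip()\n",
"```\n",
"\n",
"The temperature and token limit mirror the values used with the Completions API in this notebook, so answers should remain short and as deterministic as the endpoint allows."
]
},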
{
"cell_type": "code",
"execution_count": 16,
"id": "b0edfec7-9243-4573-92e0-253d31c771ad",
"metadata": {},
"outputs": [],
"source": [
"COMPLETIONS_API_PARAMS = {\n",
"    # We use temperature of 0.0 because it gives the most predictable, factual answer.\n",
"    \"temperature\": 0.0,\n",
"    \"max_tokens\": 300,\n",
"    \"model\": COMPLETIONS_MODEL,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9c1c9a69-848e-4099-a90d-c8da36c153d5",
"metadata": {},
"outputs": [],
"source": [
"def answer_query_with_context(\n",
"    query: str,\n",
"    df: pd.DataFrame,\n",
"    document_embeddings: dict[tuple[str, str], list[float]],\n",
"    show_prompt: bool = False\n",
") -> str:\n",
"    prompt = construct_prompt(\n",
"        query,\n",
"        document_embeddings,\n",
"        df\n",
"    )\n",
"    \n",
"    if show_prompt:\n",
"        print(prompt)\n",
"\n",
"    response = openai.Completion.create(\n",
"        prompt=prompt,\n",
"        **COMPLETIONS_API_PARAMS\n",
"    )\n",
"\n",
"    return response[\"choices\"][0][\"text\"].strip(\" \\n\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c233e449-bf33-4c9e-b095-6a4dd278c8fd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"(\"Athletics at the 2020 Summer Olympics Men's high jump\", 'Summary')\n",
"(\"Athletics at the 2020 Summer Olympics Men's long jump\", 'Summary')\n"
]
},
{
"data": {
"text/plain": [
"'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal.'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer_query_with_context(\"Who won the 2020 Summer Olympics men's high jump?\", df, document_embeddings)"
]
},
{
"cell_type": "markdown",
"id": "7b48d155-d2d4-447c-ab8e-5a5b4722b07c",
"metadata": {},
"source": [
"Wow! By combining the Embeddings and Completions APIs, we have created a question-answering model which can answer questions using a large base of additional knowledge. It also understands when it doesn't know the answer! \n",
"\n",
"For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more. **We can't wait to see what you create with GPT-3!**\n",
"\n",
"# More Examples\n",
"\n",
"Let's have some fun and try some more examples."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "1127867b-2884-44bb-9439-0e8ae171c835",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 1 document sections:\n",
"('Concerns and controversies at the 2020 Summer Olympics', 'Summary')\n",
"\n",
"Q: Why was the 2020 Summer Olympics originally postponed?\n",
"A: The 2020 Summer Olympics were originally postponed due to the COVID-19 pandemic.\n"
]
}
],
"source": [
"query = \"Why was the 2020 Summer Olympics originally postponed?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "720d9e0b-b189-4101-91ee-babf736199e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"('2020 Summer Olympics medal table', 'Summary')\n",
"('List of 2020 Summer Olympics medal winners', 'Summary')\n",
"\n",
"Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?\n",
"A: The United States won the most medals overall, with 113, and the most gold medals, with 39.\n"
]
}
],
"source": [
"query = \"In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "4e8e51cc-e4eb-4557-9e09-2929d4df5b7f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"(\"Athletics at the 2020 Summer Olympics Men's shot put\", 'Summary')\n",
"(\"Athletics at the 2020 Summer Olympics Men's discus throw\", 'Summary')\n",
"\n",
"Q: What was unusual about the mens shotput competition?\n",
"A: The same three competitors received the same medals in back-to-back editions of the same individual event.\n"
]
}
],
"source": [
"query = \"What was unusual about the mens shotput competition?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "37c83519-e3c6-4c44-8b4a-98cbb3a5f5ba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"('Italy at the 2020 Summer Olympics', 'Summary')\n",
"('San Marino at the 2020 Summer Olympics', 'Summary')\n",
"\n",
"Q: In the 2020 Summer Olympics, how many silver medals did Italy win?\n",
"A: 10 silver medals.\n"
]
}
],
"source": [
"query = \"In the 2020 Summer Olympics, how many silver medals did Italy win?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "markdown",
"id": "177c945e-f5c4-4fa5-8331-44f328b25e44",
"metadata": {},
"source": [
"Our Q&A model is less prone to hallucinating answers, and has a better sense of what it does or doesn't know. This works when the information isn't contained in the context; when the question is nonsensical; or when the question is theoretically answerable but beyond GPT-3's powers!"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "26a1a9ef-e1ee-4f80-a1b1-6164ccfa5bac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 4 document sections:\n",
"('France at the 2020 Summer Olympics', 'Taekwondo')\n",
"('Taekwondo at the 2020 Summer Olympics Qualification', 'Qualification summary')\n",
"('2020 Summer Olympics medal table', 'Medal count')\n",
"(\"Taekwondo at the 2020 Summer Olympics Men's 80 kg\", 'Competition format')\n",
"\n",
"Q: What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?\n",
"A: I don't know.\n"
]
}
],
"source": [
"query = \"What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "9fba8a63-eb81-4661-ae17-59bb5e2933d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 3 document sections:\n",
"(\"Sport climbing at the 2020 Summer Olympics Men's combined\", 'Route-setting')\n",
"(\"Ski mountaineering at the 2020 Winter Youth Olympics Boys' individual\", 'Summary')\n",
"(\"Ski mountaineering at the 2020 Winter Youth Olympics Girls' individual\", 'Summary')\n",
"\n",
"Q: What is the tallest mountain in the world?\n",
"A: I don't know.\n"
]
}
],
"source": [
"query = \"What is the tallest mountain in the world?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "2d4c693b-cdb9-4f4c-bd1b-f77b29097a1f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected 2 document sections:\n",
"(\"Gymnastics at the 2020 Summer Olympics Women's trampoline\", 'Summary')\n",
"('Equestrian at the 2020 Summer Olympics Team jumping', 'Summary')\n",
"\n",
"Q: Who won the grimblesplatch competition at the 2020 Summer Olympic games?\n",
"A: I don't know.\n"
]
}
],
"source": [
"query = \"Who won the grimblesplatch competition at the 2020 Summer Olympic games?\"\n",
"answer = answer_query_with_context(query, df, document_embeddings)\n",
"\n",
"print(f\"\\nQ: {query}\\nA: {answer}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.9 ('openai')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}