"<span style=\"color:orange; font-weight:bold\">Note: To answer questions based on text documents, we recommend the procedure in <a href=\"https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb\">Question Answering using Embeddings</a>. Some of the code below may rely on <a href=\"https://github.com/openai/openai-cookbook/tree/main/transition_guides_for_deprecated_API_endpoints\">deprecated API endpoints</a>.</span>"
"We use [`davinci-instruct-beta-v3`](https://beta.openai.com/docs/engines/instruct-series-beta), a model specialized in following instructions, to create questions based on the given context. Then we also use [`davinci-instruct-beta-v3`](https://beta.openai.com/docs/engines/instruct-series-beta) to answer those questions, given the same context. \n",
"This is expensive, and will also take a long time, as we call the davinci engine for each section. You can simply download the final dataset instead.\n",
"\n",
"We're using the dataset created using the [previous notebook](olympics-1-collect-data.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1 Read in the data, and create a context\n",
"Create a context by concatenating the title, the heading and the content of that section"
"<span style=\"color:orange; font-weight:bold\">WARNING: This step will last a long time, and consume a lot of tokens, as it calls davinci-instruct for every section to generate a number of questions.</span>"
"The prompt is designed to generate a number of questions. Example questions above were generated based on the summary section of the 2020 Summer Olympics page.\n",
"\n",
"We can observe that the questions 3 and 5 above repeat. Sometimes the generated questions could be ambiguous without the context. We will show that even despite these limitations we can create a successful model."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.\n",
"Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 name for marketing and branding purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic. The Summer Paralympics were held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.The 2020 Games were the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter) and Nagano 1998 (Winter) games. Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games were the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea and preceding the 2022 Winter Olympics in Beijing, China.\n",
"New events were introduced in existing sports for 2020, including 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allowed the host organizing committee to add new sports to the Olympic program for just one Games. The disciplines added by the Japanese Olympic Committee were baseball and softball, karate, sport climbing, surfing and skateboarding, the last four of which made their Olympic debuts, and the last three of which will remain on the Olympic program.The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88). Host nation Japan finished third, setting a record for the most gold medals and total medals ever won by their delegation at an Olympic Games with 27 and 58. Great Britain finished fourth, with a total of 22 gold and 65 medals, becoming the first nation at the Summer Olympics to increase or equal their total medals won in the two Games subsequent to hosting them. The Russian delegation competing as the ROC (not to be confused with the Republic of China (Taiwan) which competed as Chinese Taipei, not ROC) finished fifth with 20 gold medals and third in the overall medal count, with 71 medals. Bermuda, the Philippines and Qatar won their first-ever Olympic gold medals. Burkina Faso, San Marino and Turkmenistan won their first-ever Olympic medals.\n"
]
}
],
"source": [
"print(df.content.values[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 Create answers based on the context\n",
"Use davinci-instruct to answer the questions given the relevant Wikipedia section contents\n",
"\n",
"Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get a higher diversity of questions.\n",
"\n",
"<span style=\"color:orange\">**WARNING: This step will last a long time, and consume a lot of tokens, as it calls davinci-instruct for every section to answer all the questions.**</span>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.\n",
"2. The 2020 Summer Olympics took place from 23 July to 8 August 2021.\n",
"3. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n",
"4. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n",
"5. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).\n"
"These are the answers to the questions above based on the context around the host city selection. \n",
"\n",
"We can see that answers 3-5 contain the correct answer, but instead of answering the question directly, the answer is a verbatim extraction. Despite these occasional lower quality answers, we will show that the model can learn the task reasonably well, given a high number of examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4 Save the Olympics Q&A dataset based on Wikipedia sections\n",
"We save the file for use in the [next notebook](olympics-3-train-qa.ipynb)"
"We create a search file ([API reference](https://beta.openai.com/docs/api-reference/files/list)), which can be used to retrieve the relevant context when a question is asked.\n",
"<span style=\"color:orange; font-weight:bold\">DEPRECATED: The /search endpoint is deprecated in favour of using embeddings. Embeddings are cheaper, faster and can support a better search experience. See <a href=\"https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb\">Question Answering Guide</a> for a search implementation using the embeddings</span>\n"
"## 2.6 Answer questions based on the context provided\n",
"\n",
"We will use a simple implementation of the answers endpoint. This works by simply using the [/search endpoint](https://beta.openai.com/docs/api-reference/searches), which searches over an indexed file to obtain the relevant sections which can be included in the context, following by a question and answering prompt given a specified model."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay\n",
"Summary\n",
"\n",
"The women's 4 × 100 metres relay event at the 2020 Summer Olympics took place on 5 and 6 August 2021 at the Japan National Stadium. There were 16 competing relay teams, with each team having 5 members from which 4 were selected in each round.\n",
"\n",
"###\n",
"\n",
"Athletics at the 2020 Summer Olympics – Men's 4 × 100 metres relay\n",
"Qualification\n",
"\n",
"National Olympic Committees (NOCs) could qualify one relay team in one of three following ways:\n",
"The top 8 NOCs at the 2019 World Athletics Championships qualified a relay team.\n",
"The top 8 NOCs at the 2021 World Athletics Relays qualified a relay team.\n",
"Where an NOC placed in the top 8 at both the 2019 World Championships and the 2021 World Relays, the quota place was allocated to the world top list as of 29 June 2021. In this case, 4 teams did so, so there are 4 places available through the world rankings.A total of five athletes may be entered for a relay team. Should a NOC have also entered individual athletes in the corresponding individual event (100 m), the entered individual athletes must be included in the total of five (5) athletes entered for the relay event. In addition of five, NOCs can nominate a maximum of one alternate athlete for each team.\n",
"The qualifying period was originally from 1 May 2019 to 29 June 2020. Due to the COVID-19 pandemic, the period was suspended from 6 April 2020 to 30 November 2020, with the end date extended to 29 June 2021. The qualifying time standards could be obtained in various meets during the given period that have the approval of the IAAF. Both indoor and outdoor meets are eligible. The most recent Area Championships may be counted in the ranking, even if not during the qualifying period.\n"
"print(create_context(\"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?\", olympics_search_fileid, max_len=400))"
"After we fine-tune the model for Q&A we'll be able to use it instead of [`davinci-instruct-beta-v3`](https://beta.openai.com/docs/engines/instruct-series-beta), to obtain better answers when the question can't be answered based on the context. We see a downside of [`davinci-instruct-beta-v3`](https://beta.openai.com/docs/engines/instruct-series-beta), which always attempts to answer the question, regardless of the relevant context being present or not. (Note the second question is asking about a future event, set in 2024.)"
" \"Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?\", max_len=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that davinci has a tendency to answer the question, even if the question can't be answered given the context provided. Note the question asked regarding 2048 Summer Olympics, which didn't happen yet, and the retrieved content has only returned results for 2020."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.7 (Optional) Investigation into how likely the search endpoint is to return the relevant context"
" cur_len += int(result['metadata']) + 4 # we add 4 tokens for the separator `\\n\\n###\\n\\n`\n",
" if cur_len > max_len:\n",
" break\n",
" returns.append(result['text'])\n",
" res = result['text'].split('\\n')\n",
" if res[0] == title and res[1] == heading:\n",
" index = len(returns) - 1\n",
" break\n",
" return index, cur_len\n",
" except Exception as e:\n",
" #print (e)\n",
" return []\n",
"print(check_context(\"Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay\", \"Summary\", \"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?\", max_len=10000))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We utilize the generated questions based on context to estimate how often we can retrieve the original context. These questions are noisy, so this is not a perfect estimate.\n",
"\n",
"Our questions and answers are prefixed with numbered bullet points, however due to the way they were generated, they are missing the first number, hence we add \"1.\" to the list of questions (and answers).\n",
"\n",
"We calculate the rank of the section retrieved using ada search, and the number of tokens in the context needed to retrieve the relevant section in full."
"print(f\"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7.4% of the time, this is due to the keyword search part of the search algorithm not retrieving the relevant context within the first 200 results.\n",
"18.3% of the time this is due to the semantic search not placing the relevant context within the first 2000 tokens."
"plt.title('Histogram of the number of minimum tokens needed')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can observe that the context is most likely to be returned as one of the first results, and most likely to be returned within the first 200-500 tokens."