{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Note: To answer questions based on text documents, we recommend the procedure in Question Answering using Embeddings. Some of the code below may rely on deprecated API endpoints." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Creating a synthetic Q&A dataset\n", "We use [`davinci-instruct-beta-v3`](https://beta.openai.com/docs/engines/instruct-series-beta), a model specialized in following instructions, to create questions based on the given context. We then use the same model to answer those questions, given the same context.\n", "\n", "This is expensive and slow, because we call the davinci engine once for each section. You can simply download the final dataset instead.\n", "\n", "We use the dataset created in the [previous notebook](olympics-1-collect-data.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 Read in the data, and create a context\n", "Create a context for each section by concatenating its title, heading and content." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | title | heading | content | tokens | context |
|---|---|---|---|---|---|
| 0 | 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 713 | 2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ... |
| 1 | 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 | 2020 Summer Olympics\\nHost city selection\\n\\nT... |
| 2 | 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 369 | 2020 Summer Olympics\\nImpact of the COVID-19 p... |
| 3 | 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 | 2020 Summer Olympics\\nQualifying event cancell... |
| 4 | 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 | 2020 Summer Olympics\\nEffect on doping tests\\n... |