Pushing updated version with grammar changes

pull/1077/head
colin-openai 2 years ago
parent 2e64a89e3e
commit 9f2915b92c

@ -6,23 +6,30 @@
"source": [
"# Long Document Content Extraction\n",
"\n",
"We often need GPT-3 to help us extract key figures, dates or other bits of important content from documents that are too big to fit into the context window. One approach for solving this is to chunk the document up and process each chunk separately, before combining into one list of answers. \n",
"GPT-3 can help us extract key figures, dates or other bits of important content from documents that are too big to fit into the context window. One approach for solving this is to chunk the document up and process each chunk separately, before combining into one list of answers. \n",
"\n",
"In this notebook we'll run through this approach:\n",
"- We'll load in a long PDF and pull the text out\n",
"- We'll create a prompt to be used to extract key bits of information\n",
"- We'll chunk up our document and process each chunk to pull any answers out\n",
"- We'll then combine them at the end\n",
"- This simple approach will then be extended to three more difficult questions"
"- Load in a long PDF and pull the text out\n",
"- Create a prompt to be used to extract key bits of information\n",
"- Chunk up our document and process each chunk to pull any answers out\n",
"- Combine them at the end\n",
"- Extend this simple approach to three more difficult questions\n",
"\n",
"## Approach\n",
"\n",
"- **Setup**: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content.\n",
"- **Simple Entity Extraction**: Extract key bits of information from chunks of a document by:\n",
" - Creating a template prompt with our questions and an example of the expected answer format\n",
" - Creating a function that takes a chunk of text as input, combines it with the prompt and gets a response\n",
" - Running a script to chunk the text, extract answers and output them for parsing\n",
"- **Complex Entity Extraction**: Ask some more difficult questions which require tougher reasoning to work out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We'll take our PDF, a Formula 1 Financial Regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content."
"## Setup"
]
},
{
@ -53,14 +60,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple Entity Extraction\n",
"\n",
"We'll start by extracting key bits of information from chunks of a document.\n",
"\n",
"We'll accomplish this by:\n",
"- Creating a template prompt with our questions and an example of the format it expects\n",
"- Create a function to take a chunk of text as input, combine with the prompt and get a response\n",
"- Run a script to chunk the text, extract answers and output them for parsing"
"## Simple Entity Extraction"
]
},
{
@ -91,10 +91,10 @@
"source": [
"# Example prompt - \n",
"document = '<document>'\n",
"template_prompt=f'''Extract key pieces of information from this regulation document. \n",
"If a particular piece of information is not present, output \\\"Not specified\\\". \n",
"template_prompt=f'''Extract key pieces of information from this regulation document.\n",
"If a particular piece of information is not present, output \\\"Not specified\\\".\n",
"When you extract a key piece of information, include the closest page number.\n",
"Use the following format:\\n0. Author\\n1. What is the amount of the power unit cost cap in USD, GBP and EUR\\n2. What is the value of External Manufacturing Costs in USD\\n3. What is the Capital Expenditure Limit in USD\\n\\nDocument: \\\"\\\"\\\"{document}\\\"\\\"\\\"\\n\\n0. Author: Tom Anderson (Page 1)\\n1.'''\n",
"Use the following format:\\n0. Who is the author\\n1. What is the amount of the power unit cost cap in USD, GBP and EUR\\n2. What is the value of External Manufacturing Costs in USD\\n3. What is the Capital Expenditure Limit in USD\\n\\nDocument: \\\"\\\"\\\"{document}\\\"\\\"\\\"\\n\\n0. Who is the author: Tom Anderson (Page 1)\\n1.'''\n",
"print(template_prompt)"
]
},
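Review note: the template in this hunk is built with `document = '<document>'`, so the f-string leaves a literal `<document>` placeholder in the prompt. The sketch below shows one way a chunk of text might be substituted into that placeholder before the prompt is sent to the model. `PLACEHOLDER` and `build_prompt` are hypothetical names used for illustration only; they do not appear in the notebook, and the actual function there may differ.

```python
# Sketch only: PLACEHOLDER and build_prompt are hypothetical names,
# not taken from the notebook under review.
PLACEHOLDER = "<document>"

# Same wording as the updated template in this hunk, placeholder left in place.
template_prompt = (
    "Extract key pieces of information from this regulation document.\n"
    'If a particular piece of information is not present, output "Not specified".\n'
    "When you extract a key piece of information, include the closest page number.\n"
    "Use the following format:\n"
    "0. Who is the author\n"
    "1. What is the amount of the power unit cost cap in USD, GBP and EUR\n"
    "2. What is the value of External Manufacturing Costs in USD\n"
    "3. What is the Capital Expenditure Limit in USD\n\n"
    f'Document: """{PLACEHOLDER}"""\n\n'
    "0. Who is the author: Tom Anderson (Page 1)\n"
    "1."
)


def build_prompt(template: str, chunk: str) -> str:
    """Return the template with the placeholder replaced by one chunk of text."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the document placeholder")
    return template.replace(PLACEHOLDER, chunk)


if __name__ == "__main__":
    print(build_prompt(template_prompt, "Page 1: sample regulation text"))
```

Ending the prompt on `1.` after the worked `0.` example nudges the model to continue the numbered list in the same format, which is what makes the output easy to parse afterwards.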
@ -234,11 +234,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complex Entity Extraction\n",
"\n",
"Now we'll try to answer more difficult questions that need tougher logic to work out.\n",
"\n",
"We'll update the prompt with three more questions and see whether our model can work them out."
"## Complex Entity Extraction"
]
},
{
@ -268,10 +264,10 @@
],
"source": [
"# Example prompt - \n",
"template_prompt=f'''Extract key pieces of information from this regulation document. \n",
"If a particular piece of information is not present, output \\\"Not specified\\\". \n",
"template_prompt=f'''Extract key pieces of information from this regulation document.\n",
"If a particular piece of information is not present, output \\\"Not specified\\\".\n",
"When you extract a key piece of information, include the closest page number.\n",
"Use the following format:\\n0. Author\\n1. How is a minor overspend breach calculated\\n2. How is a major overspend breach calculated\\n3.Which years do these financial regulations apply to\\n\\nDocument: \\\"\\\"\\\"{document}\\\"\\\"\\\"\\n\\n0. Author: Tom Anderson (Page 1)\\n1.'''\n",
"Use the following format:\\n0. Who is the author\\n1. How is a minor overspend breach calculated\\n2. How is a major overspend breach calculated\\n3. Which years do these financial regulations apply to\\n\\nDocument: \\\"\\\"\\\"{document}\\\"\\\"\\\"\\n\\n0. Who is the author: Tom Anderson (Page 1)\\n1.'''\n",
"print(template_prompt)"
]
},
@ -333,12 +329,18 @@
"\n",
"To tune this further you can consider experimenting with:\n",
"- A more descriptive or specific prompt\n",
"- Negative prompting, telling the model to ignore a particular thing, e.g. the footnote date on every page\n",
"- If you have sufficient training data, fine-tuning a model to find a particular set of outputs reliably\n",
"- The way you chunk your data - we have gone for 10,000 characters with no overlap, but more intelligent chunking that breaks info into sections, cuts by tokens or similar may get better results\n",
"\n",
"However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and have a reusable approach that we can apply to any long document requiring entity extraction. We look forward to seeing what you can do with this!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
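Review note: the closing cell mentions chunking at 10,000 characters with no overlap and suggests experimenting with other strategies. A minimal sketch of that chunk, extract and combine loop follows, with an optional character overlap as one possible variation. `chunk_text`, `extract_all` and the `extract_fn` callback are hypothetical names for illustration; the callback stands in for whatever function sends a prompt to the model, and no API call is made here.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int = 10_000, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def extract_all(
    text: str,
    extract_fn: Callable[[str], str],
    chunk_size: int = 10_000,
    overlap: int = 0,
) -> List[str]:
    """Run extract_fn on every chunk and collect the answers in document order."""
    return [extract_fn(chunk) for chunk in chunk_text(text, chunk_size, overlap)]


if __name__ == "__main__":
    sample = "a" * 25_000
    # Stand-in for a model call: report the size of each chunk instead.
    print(extract_all(sample, lambda chunk: f"{len(chunk)} characters"))
```

A non-zero overlap repeats the tail of each chunk at the start of the next, which may help when an answer straddles a chunk boundary, at the cost of extra tokens and possible duplicate answers that need de-duplicating when the results are combined.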
