Initial commit

pull/2/head
Boris Power 2 years ago committed by Ted Sanders
commit 535f545be7

129
.gitignore vendored

@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/

@ -0,0 +1,632 @@
# OpenAI Cookbook
This repository shares example code and example prompts for accomplishing common tasks with the [OpenAI API](https://openai.com/api/).
To try these examples yourself, you'll need an OpenAI account. [Create a free account to get started.](https://beta.openai.com/signup)
Most code examples are written in Python, though the concepts can be applied in any language.
In the same way that a cookbook's recipes don't span all possible meals or techniques, these examples don't span all possible use cases or methods. Use them as starting points upon which to elaborate, discover, and invent.
## Related resources
Beyond the code examples here, you can also learn about the [OpenAI API](https://openai.com/api/) from the following resources:
* Try out GPT-3 in the [OpenAI Playground](https://beta.openai.com/playground)
* Read about the API in the [OpenAI Documentation](https://beta.openai.com/docs/introduction)
* Discuss the API in the [OpenAI Community Forum](https://community.openai.com/top?period=monthly)
* Look for help in the [OpenAI Help Center](https://help.openai.com/en/)
* See example prompts in the [OpenAI Examples](https://beta.openai.com/examples)
## Examples, organized by capability
<table id="verticalalign">
<thead>
<tr>
<th></th>
<th>Text</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>Write</td>
<td>
<li><a href='#1-write-text'>Copywriting</a></li>
<li><a href='#1-write-text'>Blog posts</a></li>
<li><a href='#1-write-text'>Product descriptions</a></li>
<li><a href='#1-write-text'>Question generation</a></li>
</td>
<td>
<li><a href='#1-write-code'>Code completion (e.g., GitHub Copilot)</a></li>
<li><a href='#1-write-code'>Natural language software interfaces</a></li>
<li><a href='#1-write-code'>Text to code</a></li>
<li><a href='#1-write-code'>Unit tests</a></li>
</td>
</tr>
<tr>
<td>Explain</td>
<td>
<li><a href='#answering-questions-about-a-piece-of-text'>Q&A about a doc</a></li>
<li><a href='#entity-extraction'>Entity extraction</a></li>
<li><a href='#summarization'>Summarization</a></li>
<li><a href='#classification'>Classification</a></li>
</td>
<td>
<li><a href='#2-explain-code'>Code documentation</a></li>
<li><a href='#2-explain-code'>Code explanation</a></li>
<li><a href='#2-explain-code'>Docstrings</a></li>
</td>
</tr>
<tr>
<td>Edit</td>
<td>
<li><a href='#3-edit-text'>Editing</a></li>
<li><a href='#translation'>Translation</a></li>
</td>
<td>
<li><a href='#3-edit-code'>Conversion between languages or styles</a></li>
<li><a href='#3-edit-code'>Bug fixing</a></li>
</td>
</tr>
<tr>
<td>Compare</td>
<td>
<li><a href='#semantic-search'>Semantic search</a></li>
<li><a href='#recommendations'>Recommendations</a></li>
<li><a href='#4-compare-text'>Clustering</a></li>
<li><a href='#4-compare-text'>Near-duplicate detection</a></li>
</td>
<td>
<li><a href='#4-compare-code'>Code search</a></li>
<li><a href='#4-compare-code'>Code clustering</a></li>
</td>
</tr>
</tbody>
</table>
## How large language models work
[Large language models](https://openai.com/blog/better-language-models/) are functions that map text to text. Given an input string of text, a large language model tries to predict the text that will come next.
The magic of large language models is that by being trained to minimize this prediction error over vast quantities of text, the models end up learning concepts useful for these predictions. For example, they learn concepts like:
* how to spell
* how grammar works
* how to paraphrase
* how to answer questions
* how to hold a conversation
* how to write in many languages
* how to code
* etc.
None of these capabilities are explicitly programmed in - they all emerge as a result of training.
GPT-3's capabilities now power [hundreds of different software products](https://openai.com/blog/gpt-3-apps/), including productivity apps, education apps, games, and more.
## How to control a large language model
Of all the inputs to a large language model, by far the most influential is the text prompt.
Large language models can be prompted to produce output in a few ways:
* **Instruction**: Tell the model what you want
* **Completion**: Induce the model to complete the beginning of what you want
* **Demonstration**: Show the model what you want, with either:
  * A few examples in the prompt
  * Many hundreds or thousands of examples in a fine-tuning training dataset
An example of each is shown below.
### Instruction prompts
Instruction-following models (e.g., `text-davinci-002` or any model beginning with `text-`) are specially designed to follow instructions. Write your instruction at the top of the prompt (or at the bottom, or both), and the model will do its best to follow the instruction and then stop. Instructions can be detailed, so don't be afraid to write a paragraph explicitly detailing the output you want.
Example instruction prompt:
```text
Extract the name of the author from the quotation below.
“Some humans theorize that intelligent species go extinct before they can expand into outer space. If they're correct, then the hush of the night sky is the silence of the graveyard.”
― Ted Chiang, Exhalation
```
Output:
```text
Ted Chiang
```
### Completion prompt example
Completion-style prompts take advantage of how large language models try to write the text they think is most likely to come next. To steer the model, try beginning a pattern or sentence that will be completed by the output you want to see. Relative to direct instructions, this mode of steering large language models can take more care and experimentation. In addition, the models won't necessarily know where to stop, so you will often need stop sequences or post-processing to cut off text generated beyond the desired output.
Example completion prompt:
```text
“Some humans theorize that intelligent species go extinct before they can expand into outer space. If they're correct, then the hush of the night sky is the silence of the graveyard.”
― Ted Chiang, Exhalation
The author of this quote is
```
Output:
```text
Ted Chiang
```
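Because a completion-style prompt gives the model no natural stopping point, the raw output often runs past the answer. The sketch below shows one possible post-processing step: cutting the text at the earliest stop sequence. `truncate_at_stop` is a hypothetical helper (not part of the OpenAI library), and the raw completion is made up for illustration.

```python
from typing import List


def truncate_at_stop(text: str, stop_sequences: List[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        position = text.find(stop)
        if position != -1:
            cut = min(cut, position)
    # Drop the leading space that completions usually start with.
    return text[:cut].strip()


# Hypothetical raw completion that keeps going after the answer.
raw_completion = " Ted Chiang.\n\nThe author of the next quote is"
answer = truncate_at_stop(raw_completion, stop_sequences=[".", "\n"])
print(answer)
```

Passing the same strings as stop sequences in the API request should achieve a similar effect at generation time, which also avoids paying for the discarded tokens.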
### Demonstration prompt example (few-shot learning)
Similar to completion-style prompts, demonstrations can show the model what you want it to do. This approach is sometimes called few-shot learning, as the model learns from a few examples provided in the prompt.
Example demonstration prompt:
```text
Quote:
“When the reasoning mind is forced to confront the impossible again and again, it has no choice but to adapt.”
― N.K. Jemisin, The Fifth Season
Author: N.K. Jemisin
Quote:
“Some humans theorize that intelligent species go extinct before they can expand into outer space. If they're correct, then the hush of the night sky is the silence of the graveyard.”
― Ted Chiang, Exhalation
Author:
```
Output:
```text
Ted Chiang
```
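A demonstration prompt like the one above can be assembled programmatically from a list of solved examples. This is a minimal sketch; `build_few_shot_prompt` and the blank-line separator between examples are assumptions made for illustration, not a prescribed format.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], new_quote: str) -> str:
    """Format solved (quote, author) examples, then an unsolved quote ending on 'Author:'."""
    blocks = [f"Quote:\n{quote}\nAuthor: {author}" for quote, author in examples]
    # The final block leaves the author blank so the model fills it in.
    blocks.append(f"Quote:\n{new_quote}\nAuthor:")
    return "\n\n".join(blocks)


examples = [
    (
        "“When the reasoning mind is forced to confront the impossible again "
        "and again, it has no choice but to adapt.”\n― N.K. Jemisin, The Fifth Season",
        "N.K. Jemisin",
    ),
]
prompt = build_few_shot_prompt(
    examples,
    "“If they're correct, then the hush of the night sky is the silence of "
    "the graveyard.”\n― Ted Chiang, Exhalation",
)
print(prompt)
```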
### Fine-tuned prompt example
With enough training examples, you can [fine-tune](https://beta.openai.com/docs/guides/fine-tuning) a custom model. In this case, instructions become unnecessary, as the model can learn the task from the training data provided. However, it can be helpful to include separator sequences (e.g., `->` or `###` or any string that doesn't commonly appear in your inputs) to tell the model when the prompt has ended and the output should begin. Without separator sequences, there is a risk that the model continues elaborating on the input text rather than starting on the answer you want to see.
Example fine-tuned prompt (for a model that has been custom trained on similar prompt-completion pairs):
```text
“Some humans theorize that intelligent species go extinct before they can expand into outer space. If they're correct, then the hush of the night sky is the silence of the graveyard.”
― Ted Chiang, Exhalation
###
```
Output:
```text
Ted Chiang
```
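To make the separator idea concrete, here is a sketch of how prompt-completion pairs might be written out as training data. It assumes the JSONL layout (one `{"prompt": ..., "completion": ...}` object per line) described in the fine-tuning guide linked above; the `SEPARATOR` and `STOP` values, the leading space on the completion, and the `to_jsonl` helper are illustrative choices, not requirements stated in this document.

```python
import json
from typing import List, Tuple

SEPARATOR = "\n\n###\n\n"  # assumed separator; pick a string absent from your inputs
STOP = "\n"  # assumed end-of-completion marker, reused as a stop sequence at inference


def to_jsonl(pairs: List[Tuple[str, str]]) -> str:
    """Serialize (input text, desired output) pairs as one JSON object per line."""
    lines = []
    for text, output in pairs:
        record = {"prompt": text + SEPARATOR, "completion": " " + output + STOP}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


pairs = [
    (
        "“When the reasoning mind is forced to confront the impossible again "
        "and again, it has no choice but to adapt.” ― N.K. Jemisin, The Fifth Season",
        "N.K. Jemisin",
    ),
    (
        "“If they're correct, then the hush of the night sky is the silence of "
        "the graveyard.” ― Ted Chiang, Exhalation",
        "Ted Chiang",
    ),
]
training_data = to_jsonl(pairs)
print(training_data)
```

At inference time the prompt would end with the same separator, so the model sees exactly the pattern it was trained on.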
### More prompt advice
For more prompt examples, visit [OpenAI Examples](https://beta.openai.com/examples).
In general, the input prompt is the best lever for improving model outputs. You can try tricks like:
* **Give more explicit instructions.** E.g., if you want the output to be a comma-separated list, ask it to return a comma-separated list. If you want it to say "I don't know" when it doesn't know the answer, tell it 'Say "I don't know" if you do not know the answer.'
* **Supply better examples.** If you're demonstrating examples in your prompt, make sure that your examples are diverse and high quality.
* **Ask the model to answer as if it was an expert.** Explicitly asking the model to produce high quality output or output as if it was written by an expert can induce the model to give higher quality answers that it thinks an expert would write. E.g., "The following answer is correct, high-quality, and written by an expert."
* **Prompt the model to write down the series of steps explaining its reasoning.** E.g., prepend your answer with something like "[Let's think step by step](https://arxiv.org/pdf/2205.11916v1.pdf)." Prompting the model to give an explanation of its reasoning before its final answer can increase the likelihood that its final answer is consistent and correct.
## Text Capabilities
### 1. Write text
Large language models are excellent at writing text. They can assist with:
* Blog posts
* Email copy
* Ad copy
* Website copy
* Product descriptions
* Memos
* Storytelling
* Brainstorming
* Question generation
* etc.
An example prompt for an instruction-following model:
```text
Write an email to a colleague named Jill congratulating her on her promotion. The tone should be warm yet professional. Mention how you admire the work she's been putting in. Include a joke about how her pet lizard Max enjoys eating grasshoppers. Mention how you're looking forward to the team off-site next week.
```
Output:
```text
Dear Jill,
Congratulations on your promotion! I've been admiring the great work you've been putting in and I'm really happy for your success. Max the lizard must be thrilled too - I bet he's looking forward to feasting on even more grasshoppers!
I'm really looking forward to next week's team off-site. It's going to be great to catch up with everyone and brainstorm some new ideas.
Best,
[Your Name]
```
In general, writing can work with any style of prompt. Experiment to see what works best for your use case.
| | Advantages | Disadvantages |
| ---------------------------------------------------------- | ----------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| Instruction-following models<br>(e.g., `text-davinci-002`) | Easiest to use | Less creative; less diverse; harder to control tone, length, etc. |
| Base models<br>(e.g., `davinci`)                           | More creative                                                                 | More expensive (as including example demonstrations in prompt will cost tokens)  |
| Fine-tuned models | Can train off of many examples; cheaper than including examples in the prompt | Hard to gather training data; training makes iteration slower and more expensive |
### 2. Explain text
One capability of large language models is distilling information from a piece of text. This can include:
* Answering questions about a piece of text, e.g.:
  * Querying a knowledge base to help people look up things they don't know
  * Querying an unfamiliar document to understand what it contains
  * Querying a document with structured questions in order to extract tags, classes, entities, etc.
* Summarizing text, e.g.:
  * Summarizing long documents
  * Summarizing back-and-forth emails or message threads
  * Summarizing detailed meeting notes with key points and next steps
* Classifying text, e.g.:
  * Classifying customer feedback messages by topic or type
  * Classifying documents by topic or type
  * Classifying the tone or sentiment of text
* Extracting entities, e.g.:
  * Extracting contact information from a customer message
  * Extracting names of people or companies or products from a document
  * Extracting things mentioned in customer reviews or feedback
#### Answering questions about a piece of text
Example prompt for answering questions about a piece of text:
```text
Using the following text, answer the following question. If the answer is not contained within the text, say "I don't know."
Text:
"""
Oklo Mine (sometimes Oklo Reactor or Oklo Mines), located in Oklo, Gabon on the west coast of Central Africa, is believed to be the only natural nuclear fission reactor. Oklo consists of 16 sites at which self-sustaining nuclear fission reactions are thought to have taken place approximately 1.7 billion years ago, and ran for hundreds of thousands of years. It is estimated to have averaged under 100 kW of thermal power during that time.
"""
Question: How many natural fission reactors have ever been discovered?
Answer:
```
Output:
```text
One
```
If the text you wish to ask about is longer than the token limit (~4,000 tokens for `text-davinci-002` and ~2,000 tokens for earlier models), we recommend splitting the text into smaller pieces, ranking them by relevance, and then asking your question using only the most relevant pieces.
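The split-and-rank approach can be sketched as follows. This is a toy illustration under loud assumptions: chunk size is measured in whitespace-separated words rather than real tokens, and relevance is approximated by counting shared words with the question (in practice, embeddings would be a better relevance signal). Both helper functions are hypothetical.

```python
from typing import List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


def rank_by_overlap(chunks: List[str], question: str) -> List[str]:
    """Order chunks by how many distinct question words they contain (crude relevance)."""
    question_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(question_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)


document = (
    "Oklo is located in Gabon. "
    "It is believed to be the only natural nuclear fission reactor. "
    "The reactions ran for hundreds of thousands of years."
)
chunks = split_into_chunks(document, max_words=10)
best = rank_by_overlap(chunks, "Where is the natural fission reactor located?")[0]
print(best)  # the chunk that would be placed in the prompt
```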
#### Summarization
An example prompt for summarization:
```text
Summarize the following text.
Text:
"""
Two independent experiments reported their results this morning at CERN, Europe's high-energy physics laboratory near Geneva in Switzerland. Both show convincing evidence of a new boson particle weighing around 125 gigaelectronvolts, which so far fits predictions of the Higgs previously made by theoretical physicists.
"As a layman I would say: 'I think we have it'. Would you agree?" Rolf-Dieter Heuer, CERN's director-general, asked the packed auditorium. The physicists assembled there burst into applause.
"""
Summary:
```
Output:
```text
CERN has announced the discovery of a new particle, the Higgs boson. This particle has been predicted by theoretical physicists and is a major step forward in our understanding of the universe.
```
#### Classification
The best approach for classifying text depends on whether the classes are known in advance or not.
If your classes are known in advance, classification is best done with a fine-tuned model, as demonstrated in [Fine-tuned_classification.ipynb](examples/Fine-tuned_classification.ipynb).
If your classes are not known in advance (e.g., they are set by a user or generated on the fly), you can try zero-shot classification, either by giving an instruction containing the classes or by using embeddings to see which class label (or other classified text) is most similar to the text ([Zero-shot_classification.ipynb](examples/Zero-shot_classification_with_embeddings.ipynb)).
#### Entity extraction
An example prompt for entity extraction:
```text
From the text below, extract the following entities in the following format:
Companies: <comma-separated list of companies mentioned>
People & titles: <comma-separated list of people mentioned (with their titles or roles appended in parentheses)>
Text:
"""
In March 1981, United States v. AT&T came to trial under Assistant Attorney General William Baxter. AT&T chairman Charles L. Brown thought the company would be gutted. He realized that AT&T would lose and, in December 1981, resumed negotiations with the Justice Department. Reaching an agreement less than a month later, Brown agreed to divestiture—the best and only realistic alternative. AT&T's decision allowed it to retain its research and manufacturing arms. The decree, titled the Modification of Final Judgment, was an adjustment of the Consent Decree of 14 January 1956. Judge Harold H. Greene was given the authority over the modified decree....
In 1982, the U.S. government announced that AT&T would cease to exist as a monopolistic entity. On 1 January 1984, it was split into seven smaller regional companies, Bell South, Bell Atlantic, NYNEX, American Information Technologies, Southwestern Bell, US West, and Pacific Telesis, to handle regional phone services in the U.S. AT&T retains control of its long distance services, but was no longer protected from competition.
"""
```
Output:
```text
Companies: United States v. AT&T, AT&T, Justice Department, Bell South, Bell Atlantic, NYNEX, American Information Technologies, Southwestern Bell, US West, Pacific Telesis
People & titles: William Baxter (Assistant Attorney General), Charles L. Brown (AT&T chairman), Harold H. Greene (Judge)
```
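Asking for a fixed `Label: item, item` format makes the output easy to parse downstream. The sketch below is one possible parser, not part of the original example; `parse_entities` is a hypothetical helper, and it assumes that entity names themselves contain no commas or colons.

```python
from typing import Dict, List


def parse_entities(output: str) -> Dict[str, List[str]]:
    """Turn 'Label: a, b, c' lines into {label: [a, b, c]}."""
    parsed: Dict[str, List[str]] = {}
    for line in output.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        label, _, values = line.partition(":")
        parsed[label.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return parsed


# Abbreviated version of the model output shown above.
model_output = (
    "Companies: AT&T, Bell South, NYNEX\n"
    "People & titles: William Baxter (Assistant Attorney General), Harold H. Greene (Judge)"
)
entities = parse_entities(model_output)
print(entities["Companies"])
```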
### 3. Edit text
In addition to the [completion API endpoint](https://beta.openai.com/docs/api-reference/completions), OpenAI now offers an [edit API endpoint](https://beta.openai.com/docs/api-reference/edits) ([blog post](https://openai.com/blog/gpt-3-edit-insert/)). In contrast to completions, which only take a single text input, edits take two text inputs: the instruction and the text to be modified.
An example edit prompt:
Instruction input:
```text
Fix the OCR errors
```
Text input:
```text
Therewassomehostilityntheenergybehindthe researchreportedinPerceptrons....Part of ourdrivecame,aswequiteplainlyacknoweldgednourbook,fromhe facthatfundingndresearchnergywerebeingdissipatedon. . .misleadingttemptsouseconnectionistmethodsnpracticalappli-cations.
```
Output:
```text
There was some hostility in the energy behind the research reported in Perceptrons....Part of our drive came, as we quite plainly acknowledged in our book, from the fact that funding and research energy were being dissipated on...misleading attempts to use connectionist methods in practical applications.
```
#### Translation
Translation is another emergent capability of large language models. In 2021, [GPT-3 was used](https://arxiv.org/abs/2110.05448) to set a new state-of-the-art record in unsupervised translation on the WMT14 English-French benchmark.
Example translation prompt using the edits endpoint:
Instruction input:
```text
translation into French
```
Text input:
```text
That's life.
```
Output:
```text
C'est la vie.
```
Example translation prompt using the completions endpoint:
```text
Translate the following text from English to French.
English: That's life.
French:
```
Output:
```text
C'est la vie.
```
Tips for translation:
* Performance is best on the most common languages
* We've seen better performance when the instruction is given in the final language (so if translating into French, give the instruction `Traduire le texte de l'anglais au français.` rather than `Translate the following text from English to French.`)
* Backtranslation (as described [here](https://arxiv.org/abs/2110.05448)) can also increase performance
* Text with colons and heavy punctuation can trip up the instruction-following models, especially if the instruction is using colons (e.g., `English: {english text} French:`)
* The edits endpoint has been seen to sometimes repeat the text input alongside the translation
When it comes to translation, large language models particularly shine at combining other instructions alongside translation. For example, you can ask GPT-3 to translate Slovenian to English but keep all LaTeX typesetting commands unchanged. The following notebook details how we translated a Slovenian math book into English:
[Translation of a Slovenian math book into English](book_translation/translate_latex_book.ipynb)
### 4. Compare text
The [OpenAI API embeddings endpoint](https://beta.openai.com/docs/guides/embeddings) can be used to measure similarity between pieces of text ([blog post](https://openai.com/blog/introducing-text-and-code-embeddings/)). By leveraging GPT-3's understanding of text, these embeddings [achieved state-of-the-art results](https://arxiv.org/abs/2201.10005) on benchmarks in both unsupervised learning and transfer learning settings.
Embeddings can be used for semantic search, recommendations, cluster analysis, near-duplicate detection, and more.
#### Semantic search
Embeddings can be used for search either by themselves or as a feature in a larger system.
The simplest way to use embeddings for search is as follows:
* Before the search (precompute):
  * Split your text corpus into chunks smaller than the token limit (e.g., ~2,000 tokens)
  * Embed each chunk using a 'doc' model (e.g., `text-search-curie-doc-001`)
  * Store those embeddings in your own database or in a vector search provider like [pinecone.io](https://pinecone.io) or [weaviate](https://weaviate.io)
* At the time of the search (live compute):
  * Embed the search query using the corresponding 'query' model (e.g., `text-search-curie-query-001`)
  * Find the closest embeddings in your database
  * Return the top results, ranked by cosine similarity
An example of how to use embeddings for search is shown in [Semantic_search.ipynb](examples/Semantic_search.ipynb).
In more advanced search systems, the cosine similarity of embeddings can be used as one feature among many in ranking search results.
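The live-compute half of the search procedure above boils down to ranking stored vectors by cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors as stand-ins for real embeddings, so no API call is made; the `search` helper and the chunk IDs are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(
    query_embedding: List[float],
    doc_embeddings: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the `top_k` (doc id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for precomputed 'doc' embeddings.
doc_embeddings = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.6, 0.8, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
query_embedding = [0.8, 0.6, 0.0]  # stand-in for the 'query' model's output
results = search(query_embedding, doc_embeddings, top_k=2)
print(results)
```

A linear scan like this is fine for small corpora; a vector database becomes worthwhile once the number of chunks grows large.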
#### Recommendations
Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set. And instead of using pairs of doc-query models, you can use a single symmetric similarity model (e.g., `text-similarity-curie-001`).
An example of how to use embeddings for recommendations is shown in [Recommendations.ipynb](examples/Recommendations.ipynb).
Similar to search, these cosine similarity scores can either be used on their own to rank items or as features in larger ranking algorithms.
#### Customizing Embeddings
Although OpenAI's embedding model weights cannot be fine-tuned, you can still use training data to customize embeddings to your application.
In the following notebook, we provide an example method for customizing your embeddings using training data. The idea of the method is to train a custom matrix to multiply embedding vectors by in order to get new customized embeddings. With good training data, this custom matrix will highlight the features relevant to your training labels and suppress the rest. You can equivalently consider the matrix multiplication as (a) a modification of the embeddings or (b) a modification of the distance function used to measure the distances between embeddings.
* [Customizing_embeddings.ipynb](examples/Customizing_embeddings.ipynb)
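The equivalence between views (a) and (b) can be checked numerically. The sketch below is not taken from the notebook: it uses a hand-picked 2×2 matrix in place of a trained one and the dot product as the similarity measure, and shows that comparing transformed embeddings gives the same number as comparing the original embeddings under a modified inner product defined by MᵀM.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix `m` by column vector `v`."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def gram(m: Matrix) -> Matrix:
    """Return the product of the transpose of `m` with `m`."""
    cols = len(m[0])
    return [[sum(row[i] * row[j] for row in m) for j in range(cols)] for i in range(cols)]


# Hypothetical 'trained' matrix: keeps dimension 0, down-weights dimension 1.
M = [[1.0, 0.0], [0.0, 0.5]]
a = [0.6, 0.8]
b = [0.8, 0.6]

view_a = dot(matvec(M, a), matvec(M, b))  # (a) compare the customized embeddings
view_b = dot(a, matvec(gram(M), b))  # (b) compare originals with a modified similarity
print(view_a, view_b)
```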
## Code Capabilities
Large language models aren't only great at text - they can be great at code too. OpenAI's specialized code model is called [Codex](https://openai.com/blog/openai-codex/).
Codex powers [more than 70 products](https://openai.com/blog/codex-apps/), including:
* [GitHub Copilot](https://copilot.github.com/) (autocompletes code in VS Code and other IDEs)
* [Pygma](https://pygma.app/) (turns Figma designs into code)
* [Replit](https://replit.com/) (has an 'Explain code' button and other features)
* [Warp](https://www.warp.dev/) (a smart terminal with AI command search)
* [Machinet](https://machinet.net/) (writes Java unit test templates)
Note that unlike instruction-following text models (e.g., `text-davinci-002`), Codex is *not* trained to follow instructions. As a result, designing good prompts can take more care.
### 1. Write code
An example prompt for writing code with `code-davinci-002`:
````text
SQL tables (and columns):
* Customers(customer_id, signup_date)
* Streaming(customer_id, video_id, watch_date, watch_minutes)
A well-written SQL query that lists customers who signed up during March 2020 and watched more than 50 hours of video in their first 30 days:
```
````
Output:
```sql
SELECT c.customer_id
FROM Customers c
JOIN Streaming s
ON c.customer_id = s.customer_id
WHERE c.signup_date BETWEEN '2020-03-01' AND '2020-03-31'
AND s.watch_date BETWEEN c.signup_date AND DATE_ADD(c.signup_date, INTERVAL 30 DAY)
GROUP BY c.customer_id
HAVING SUM(s.watch_minutes) > 50 * 60
```
`code-davinci-002` is able to make inferences from variable names; for example, it infers that `watch_minutes` has units of minutes and therefore needs to be converted by a factor of 60 before being compared with 50 hours.
### 2. Explain code
Code explanation can be applied to many use cases:
* Generating in-code documentation (e.g., Python docstrings, git commit messages)
* Generating out-of-code documentation (e.g., man pages)
* In an interactive code exploration tool
* Communicating program results back to users via a natural language interface
An example prompt for explaining code with `code-davinci-002`:
````text
A SQL query:
```
SELECT c.customer_id
FROM Customers c
JOIN Streaming s
ON c.customer_id = s.customer_id
WHERE c.signup_date BETWEEN '2020-03-01' AND '2020-03-31'
AND s.watch_date BETWEEN c.signup_date AND DATE_ADD(c.signup_date, INTERVAL 30 DAY)
GROUP BY c.customer_id
HAVING SUM(s.watch_minutes) > 50 * 60
```
Questions:
1. What does the SQL query do?
2. Why might someone be interested in this time period?
3. Why might a company be interested in this SQL query?
Answers:
````
Output:
```text
1. The SQL query finds all customers who signed up in March 2020 and watched more than 50 hours of content in the first 30 days after signing up.
2. The time period is interesting because it is the first month of the COVID-19 pandemic.
3. A company might be interested in this SQL query because it can help them understand how the pandemic has affected their business.
```
### 3. Edit code
OpenAI's edit endpoint is particularly useful for editing code.
Example text input to `code-davinci-edit-001`:
```python
def tribonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    elif n == 2:
        return 1
    elif n == 3:
        return 2
    else:
        return tribonacci(n-1) + tribonacci(n-2) + tribonacci(n-3)
```
Example instruction inputs:
```text
Add a docstring
```
```text
Add typing
```
```text
Improve the runtime
```
```text
Add a test
```
```text
Translate to JavaScript (or Rust or Lisp or any language you like)
```
Example output after improving the runtime and translating to JavaScript:
```JavaScript
function tribonacci(n) {
  let a = 0;
  let b = 1;
  let c = 1;
  for (let i = 0; i < n; i++) {
    [a, b, c] = [b, c, a + b + c];
  }
  return a;
}
```
As you can see, `code-davinci-edit-001` was able to successfully reduce the function's runtime from exponential down to linear, as well as convert from Python to JavaScript.
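One way to sanity-check an edit like this is to port the optimized version back to the original language and compare the two on small inputs. The Python below is such a check, written for this walkthrough rather than produced by the model; the function names are hypothetical.

```python
def tribonacci_recursive(n: int) -> int:
    """Exponential-time version, equivalent to the original Python input."""
    if n == 0:
        return 0
    if n in (1, 2):
        return 1
    return (
        tribonacci_recursive(n - 1)
        + tribonacci_recursive(n - 2)
        + tribonacci_recursive(n - 3)
    )


def tribonacci_iterative(n: int) -> int:
    """Linear-time version, mirroring the JavaScript output above."""
    a, b, c = 0, 1, 1
    for _ in range(n):
        # Slide the three-value window forward by one position.
        a, b, c = b, c, a + b + c
    return a


values = [tribonacci_iterative(n) for n in range(8)]
print(values)  # [0, 1, 1, 2, 4, 7, 13, 24]
```

The iterative version keeps only the last three values, so it does one addition per step instead of re-deriving every earlier term.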
### 4. Compare code
The OpenAI API also features code search embeddings, which can measure the relevance of a section of code to a text query, or the similarity between two sections of code.
OpenAI code search embeddings significantly improved the state-of-the-art on the [CodeSearchNet](https://github.com/github/CodeSearchNet) evaluation suite, scoring 93.5% versus the previous record of 77.4%.
Read more about OpenAI's code embeddings in the [blog post announcement](https://openai.com/blog/introducing-text-and-code-embeddings/) or [documentation](https://beta.openai.com/docs/guides/embeddings).
Code embeddings can be useful for use cases such as:
* Code search
* Codebase clustering & analysis
An example of code search is shown in [Code_search.ipynb](examples/Code_search.ipynb).
We haven't written an example of code clustering, but the idea is the same as the text clustering in [Clustering.ipynb](examples/Clustering.ipynb).

@ -0,0 +1,189 @@
from typing import List, Union
from smokey import Smokey
import openai


def get_candidates(
    prompt: str,
    stop: List[str],
    temperature: float,
    priming_prefix: str,
    engine: str,
    n: int = 5,
) -> List[str]:
    """
    Generate N candidate completions based on the prompt, generated with a specific temperature.
    :param prompt: The prompt to start the conversation with.
    :param stop: A list of tokens that indicate the end of the generation.
    :param temperature: The temperature of the generation.
    :param priming_prefix: The prefix to use for the priming.
    :param engine: The engine to use for the generation.
    :param n: The number of completions to generate.
    :return: A list of completions.
    """
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        temperature=temperature,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=stop,
        n=n,
    )
    responses = [priming_prefix + choice.text for choice in response.choices]
    return responses


def rindex(lst: List, value: str) -> int:
    """
    Return the index of the last occurrence of a value in a list.
    :param lst: The list to search in.
    :param value: The value to search for.
    :return: The index of the last occurrence of the value.
    """
    try:
        return len(lst) - lst[::-1].index(value) - 1
    except ValueError:
        raise ValueError(f"Answer start token `{value}` not found in the eval template")


def eval_candidate(
    candidate_answer: str,
    original_instruction: str,
    eval_template: str,
    answer_start_token: str,
    engine: str,
) -> float:
    """
    Evaluate a candidate answer by calculating the average log probability
    of the original instruction, given the candidate answer with a specific
    evaluation template, aimed at reconstructing the original instruction.
    :param candidate_answer: The candidate answer to evaluate.
    :param original_instruction: The original instruction.
    :param eval_template: The template to use for the evaluation.
    :param answer_start_token: The token to use to indicate the start of the answer.
    :param engine: The engine to use for the evaluation.
    :return: The evaluation of the candidate answer.
    """
    response = openai.Completion.create(
        engine=engine,
        prompt=eval_template.format(candidate_answer, original_instruction),
        temperature=0,
        max_tokens=0,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        logprobs=1,
        echo=True,
    )
    answer_start = rindex(
        response["choices"][0]["logprobs"]["tokens"], answer_start_token
    )
    logprobs = response["choices"][0]["logprobs"]["token_logprobs"][answer_start + 1 :]
    return sum(logprobs) / len(logprobs)


def backtranslation(
    prompt_template: str,
    additional_info: str,
    instruction: str,
    eval_template: str,
    priming_prefix: str = "SELECT",
    stop1: List[str] = ["#", ";"],
    answer_start_token: str = "--",
    n: int = 5,
    temperature: float = 0.5,
    return_all_results: bool = False,
    engine: str = "davinci-codex",
) -> Union[str, List[tuple]]:
    """
    Generate a number of SQL queries given a natural language instruction,
    and pick the best one based on the average log probability of explaining the
    candidate SQL query with the exact original instruction, when prompted for
    a natural language explanation of the candidate SQL query.
    :param prompt_template: The template to use for the prompt to generate SQL.
    :param additional_info: Additional information to include in the prompt
        (SQL Tables, and their properties).
    :param instruction: The instruction in natural language.
    :param eval_template: The template to use for the evaluation.
    :param priming_prefix: The prefix to use for the priming of the SQL query.
    :param stop1: A list of tokens that indicate the end of the generation.
    :param answer_start_token: The token to use to indicate the start of the
        natural answer.
    :param n: The number of candidates to generate.
    :param temperature: The temperature of the generation.
    :param return_all_results: Whether to return all results or just the best one.
    :param engine: The engine to use for the generation and evaluation.
    :return: The best SQL query, or a list of all generated SQL queries as
        (query, score) tuples, sorted from best to worst.
    """
    prompt_template = prompt_template.format(
        additional_info, instruction, priming_prefix
    )
    candidates = []
    responses = get_candidates(
        prompt_template, stop1, temperature, priming_prefix, engine=engine, n=n
    )
    for i in range(n):
        quality = eval_candidate(
            responses[i],
            instruction,
            eval_template,
            answer_start_token,
            engine=engine,
        )
        candidates.append((responses[i], quality))
    candidates.sort(key=lambda x: x[1], reverse=True)
    if return_all_results:
        return candidates
    return candidates[0][0]


def main(
    nl_query: str = "Return the name of each department that had more than 10 employees in June 2021",
    eval_template: str = "{};\n-- Explanation of the above query in human readable format\n-- {}",
    table_definitions: str = "# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n",
    prompt_template: str = "### Postgres SQL tables, with their properties:\n#\n{}#\n### {}\n{}",
    n: int = 3,
    temperature: float = 0.3,
    engine: str = "davinci-codex",
):
    """
    Generate a number of SQL queries given a natural language instruction,
    and pick the best one based on the highest backtranslation score.

    :param nl_query: The natural language query.
    :param eval_template: The template to use for the evaluation.
    :param table_definitions: The definitions of the tables used in the query.
    :param prompt_template: The template to use for the prompt to generate SQL.
    :param n: The number of candidates to generate.
    :param temperature: The temperature of the generation.
    :param engine: The engine to use for the generation and evaluation.
    :return: None. The best SQL query is printed.
    """

    result = backtranslation(
        prompt_template,
        table_definitions,
        nl_query,
        eval_template,
        priming_prefix="SELECT",
        temperature=temperature,
        n=n,
        engine=engine,
    )
    print(result)


if __name__ == "__main__":
    Smokey(main)

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -0,0 +1,396 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code search\n",
"\n",
"We index the functions in our own openai-python code repository and show how they can be searched with embeddings. To do this, we implement a simple parser that extracts top-level functions from Python files."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of py files: 40\n",
"Total number of functions extracted: 64\n"
]
}
],
"source": [
"import os\n",
"from glob import glob\n",
"import pandas as pd\n",
"\n",
"def get_function_name(code):\n",
"    \"\"\"\n",
"    Extract function name from a line beginning with \"def \"\n",
"    \"\"\"\n",
"    assert code.startswith(\"def \")\n",
"    return code[len(\"def \"): code.index(\"(\")]\n",
"\n",
"def get_until_no_space(all_lines, i) -> str:\n",
"    \"\"\"\n",
"    Get all lines until a line outside the function definition is found.\n",
"    \"\"\"\n",
"    ret = [all_lines[i]]\n",
"    for line in all_lines[i + 1:]:\n",
"        # empty lines, indented lines and a closing \")\" still belong to the function\n",
"        if len(line) == 0 or line[0] in [\" \", \"\\t\", \")\"]:\n",
"            ret.append(line)\n",
"        else:\n",
"            break\n",
"    return \"\\n\".join(ret)\n",
"\n",
"def get_functions(filepath):\n",
"    \"\"\"\n",
"    Get all functions in a Python file.\n",
"    \"\"\"\n",
"    with open(filepath) as f:\n",
"        whole_code = f.read().replace(\"\\r\", \"\\n\")\n",
"    all_lines = whole_code.split(\"\\n\")\n",
"    for i, l in enumerate(all_lines):\n",
"        # only top-level definitions start at column 0\n",
"        if l.startswith(\"def \"):\n",
"            code = get_until_no_space(all_lines, i)\n",
"            function_name = get_function_name(code)\n",
"            yield {\"code\": code, \"function_name\": function_name, \"filepath\": filepath}\n",
"\n",
"\n",
"# get user root directory\n",
"root_dir = os.path.expanduser(\"~\")\n",
"\n",
"# path to code repository directory\n",
"code_root = root_dir + \"/openai-python\"\n",
"code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
"print(\"Total number of py files:\", len(code_files))\n",
"all_funcs = []\n",
"for code_file in code_files:\n",
"    all_funcs.extend(get_functions(code_file))\n",
"\n",
"print(\"Total number of functions extracted:\", len(all_funcs))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For code search, we use the `code-search-{model}-code` engine to obtain embeddings for code snippets and the `code-search-{model}-text` engine to embed natural language queries."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>code</th>\n",
" <th>function_name</th>\n",
" <th>filepath</th>\n",
" <th>code_embedding</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>def semantic_search(engine, query, documents):...</td>\n",
" <td>semantic_search</td>\n",
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
" <td>[-0.038976121693849564, -0.0031428150832653046...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>def main():\\n parser = argparse.ArgumentPar...</td>\n",
" <td>main</td>\n",
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
" <td>[-0.024289356544613838, -0.017748363316059113,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>def get_candidates(\\n prompt: str,\\n sto...</td>\n",
" <td>get_candidates</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.04161201789975166, -0.0169310811907053, 0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>def rindex(lst: List, value: str) -&gt; int:\\n ...</td>\n",
" <td>rindex</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.027255680412054062, -0.007931121625006199,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>def eval_candidate(\\n candidate_answer: str...</td>\n",
" <td>eval_candidate</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.00999179296195507, -0.01640152558684349, 0...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" code function_name \\\n",
"0 def semantic_search(engine, query, documents):... semantic_search \n",
"1 def main():\\n parser = argparse.ArgumentPar... main \n",
"2 def get_candidates(\\n prompt: str,\\n sto... get_candidates \n",
"3 def rindex(lst: List, value: str) -> int:\\n ... rindex \n",
"4 def eval_candidate(\\n candidate_answer: str... eval_candidate \n",
"\n",
" filepath \\\n",
"0 /examples/semanticsearch/semanticsearch.py \n",
"1 /examples/semanticsearch/semanticsearch.py \n",
"2 /examples/codex/backtranslation.py \n",
"3 /examples/codex/backtranslation.py \n",
"4 /examples/codex/backtranslation.py \n",
"\n",
" code_embedding \n",
"0 [-0.038976121693849564, -0.0031428150832653046... \n",
"1 [-0.024289356544613838, -0.017748363316059113,... \n",
"2 [-0.04161201789975166, -0.0169310811907053, 0.... \n",
"3 [-0.027255680412054062, -0.007931121625006199,... \n",
"4 [-0.00999179296195507, -0.01640152558684349, 0... "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from openai.embeddings_utils import get_embedding\n",
"\n",
"df = pd.DataFrame(all_funcs)\n",
"df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='code-search-babbage-code-001'))\n",
"df['filepath'] = df['filepath'].apply(lambda x: x.replace(code_root, \"\"))\n",
"df.to_csv(\"output/code_search_openai-python.csv\", index=False)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.681\n",
"def test_completions_multiple_prompts():\n",
" result = openai.Completion.create(\n",
" prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
" )\n",
" assert len(result.choices) == 10\n",
"\n",
"----------------------------------------------------------------------\n",
"/openai/tests/test_endpoints.py:test_completions score=0.675\n",
"def test_completions():\n",
" result = openai.Completion.create(prompt=\"This was a test\", n=5, engine=\"ada\")\n",
" assert len(result.choices) == 5\n",
"\n",
"\n",
"----------------------------------------------------------------------\n",
"/openai/tests/test_api_requestor.py:test_requestor_sets_request_id score=0.635\n",
"def test_requestor_sets_request_id(mocker: MockerFixture) -> None:\n",
" # Fake out 'requests' and confirm that the X-Request-Id header is set.\n",
"\n",
" got_headers = {}\n",
"\n",
" def fake_request(self, *args, **kwargs):\n",
" nonlocal got_headers\n",
"----------------------------------------------------------------------\n"
]
}
],
"source": [
"from openai.embeddings_utils import cosine_similarity\n",
"\n",
"def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n",
"    # embed the query with the text engine; the code was embedded with the matching code engine\n",
"    embedding = get_embedding(code_query, engine='code-search-babbage-text-001')\n",
"    df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n",
"\n",
"    # keep the n most similar functions\n",
"    res = df.sort_values('similarities', ascending=False).head(n)\n",
"    if pprint:\n",
"        for r in res.iterrows():\n",
"            print(r[1].filepath+\":\"+r[1].function_name + \" score=\" + str(round(r[1].similarities, 3)))\n",
"            print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n",
"            print('-'*70)\n",
"    return res\n",
"res = search_functions(df, 'Completions API tests', n=3)\n"
]
},
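{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what the ranking above relies on: cosine similarity is the dot product of two vectors divided by the product of their norms. The helper name `cosine_sim` and the toy vectors are made up for illustration, and the snippet assumes only NumPy; it is not meant to reproduce the library's implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_sim(a, b):\n",
"    # dot product divided by the product of the Euclidean norms\n",
"    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)\n",
"    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"# toy 3-dimensional vectors standing in for real embeddings\n",
"query = [1.0, 0.0, 1.0]\n",
"snippets = {'same_direction': [2.0, 0.0, 2.0], 'orthogonal': [0.0, 1.0, 0.0]}\n",
"\n",
"# rank snippet names from most to least similar to the query\n",
"ranked = sorted(snippets, key=lambda k: cosine_sim(query, snippets[k]), reverse=True)\n",
"print(ranked)\n",
"```\n",
"\n",
"Because the score depends only on direction, a vector and a scaled copy of it are treated as identical, while orthogonal vectors score zero."
]
},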
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/validators.py:format_inferrer_validator score=0.655\n",
"def format_inferrer_validator(df):\n",
" \"\"\"\n",
" This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
" It will also suggest to use ada and explain train/validation split benefits.\n",
" \"\"\"\n",
" ft_type = infer_task_type(df)\n",
" immediate_msg = None\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:long_examples_validator score=0.649\n",
"def long_examples_validator(df):\n",
" \"\"\"\n",
" This validator will suggest to the user to remove examples that are too long.\n",
" \"\"\"\n",
" immediate_msg = None\n",
" optional_msg = None\n",
" optional_fn = None\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:non_empty_completion_validator score=0.646\n",
"def non_empty_completion_validator(df):\n",
" \"\"\"\n",
" This validator will ensure that no completion is empty.\n",
" \"\"\"\n",
" necessary_msg = None\n",
" necessary_fn = None\n",
" immediate_msg = None\n",
"----------------------------------------------------------------------\n"
]
}
],
"source": [
"res = search_functions(df, 'fine-tuning input data validation logic', n=3)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/validators.py:common_completion_suffix_validator score=0.665\n",
"def common_completion_suffix_validator(df):\n",
" \"\"\"\n",
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
" \"\"\"\n",
" error_msg = None\n",
" immediate_msg = None\n",
" optional_msg = None\n",
" optional_fn = None\n",
"\n",
" ft_type = infer_task_type(df)\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:get_outfnames score=0.66\n",
"def get_outfnames(fname, split):\n",
" suffixes = [\"_train\", \"_valid\"] if split else [\"\"]\n",
" i = 0\n",
" while True:\n",
" index_suffix = f\" ({i})\" if i > 0 else \"\"\n",
" candidate_fnames = [\n",
" fname.split(\".\")[0] + \"_prepared\" + suffix + index_suffix + \".jsonl\"\n",
" for suffix in suffixes\n",
" ]\n",
" if not any(os.path.isfile(f) for f in candidate_fnames):\n",
"----------------------------------------------------------------------\n"
]
}
],
"source": [
"res = search_functions(df, 'find common suffix', n=2, n_lines=10)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/cli.py:tools_register score=0.651\n",
"def tools_register(parser):\n",
" subparsers = parser.add_subparsers(\n",
" title=\"Tools\", help=\"Convenience client side tools\"\n",
" )\n",
"\n",
" def help(args):\n",
" parser.print_help()\n",
"\n",
" parser.set_defaults(func=help)\n",
"\n",
" sub = subparsers.add_parser(\"fine_tunes.prepare_data\")\n",
" sub.add_argument(\n",
" \"-f\",\n",
" \"--file\",\n",
" required=True,\n",
" help=\"JSONL, JSON, CSV, TSV, TXT or XLSX file containing prompt-completion examples to be analyzed.\"\n",
" \"This should be the local file path.\",\n",
" )\n",
" sub.add_argument(\n",
" \"-q\",\n",
"----------------------------------------------------------------------\n"
]
}
],
"source": [
"res = search_functions(df, 'Command line interface for fine-tuning', n=1, n_lines=20)"
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

@ -0,0 +1,107 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get embeddings\n",
"\n",
"The function `get_embedding` will give us an embedding for an input text."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12288"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import openai\n",
"\n",
"embedding = openai.Embedding.create(input=\"Sample document text goes here\", engine=\"text-similarity-davinci-001\")['data'][0]['embedding']\n",
"len(embedding)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1024\n"
]
}
],
"source": [
"from typing import List\n",
"\n",
"import openai\n",
"from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
"\n",
"# retry failed calls with randomized exponential waits (capped at 20 seconds), up to 6 attempts\n",
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
"def get_embedding(text: str, engine=\"text-similarity-davinci-001\") -> List[float]:\n",
"\n",
"    # replace newlines, which can negatively affect performance.\n",
"    text = text.replace(\"\\n\", \" \")\n",
"\n",
"    return openai.Embedding.create(input=[text], engine=engine)[\"data\"][0][\"embedding\"]\n",
"\n",
"embedding = get_embedding(\"Sample query text goes here\", engine=\"text-search-ada-query-001\")\n",
"print(len(embedding))"
]
},
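{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `@retry` decorator above comes from tenacity. Roughly, it calls the function again after a randomized wait that grows with each failed attempt, and gives up after six attempts. The standard-library sketch below illustrates that idea only; `retry_with_backoff` and `flaky` are hypothetical names, and this is not how tenacity itself is implemented.\n",
"\n",
"```python\n",
"import random\n",
"import time\n",
"\n",
"def retry_with_backoff(fn, max_attempts=6, min_wait=1.0, max_wait=20.0, sleep=time.sleep):\n",
"    for attempt in range(1, max_attempts + 1):\n",
"        try:\n",
"            return fn()\n",
"        except Exception:\n",
"            if attempt == max_attempts:\n",
"                raise  # out of attempts: surface the last error\n",
"            # random wait drawn from a window that doubles each attempt, capped at max_wait\n",
"            window = min(max_wait, min_wait * 2 ** (attempt - 1))\n",
"            sleep(random.uniform(0, window))\n",
"\n",
"calls = {'n': 0}\n",
"\n",
"def flaky():\n",
"    # fails twice, then succeeds\n",
"    calls['n'] += 1\n",
"    if calls['n'] < 3:\n",
"        raise RuntimeError('transient failure')\n",
"    return 'ok'\n",
"\n",
"# pass a no-op sleep so the demo does not actually wait\n",
"print(retry_with_backoff(flaky, sleep=lambda seconds: None))\n",
"```\n",
"\n",
"Randomizing the wait spreads out retries from many clients, which helps when failures are caused by rate limits."
]
},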
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1024\n"
]
}
],
"source": [
"embedding = get_embedding(\"Sample document text goes here\", engine=\"text-search-ada-doc-001\")\n",
"print(len(embedding))"
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load the dataset\n",
"\n",
"The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. For illustration purposes, we will use a subset of this dataset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",
"\n",
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Time</th>\n",
" <th>ProductId</th>\n",
" <th>UserId</th>\n",
" <th>Score</th>\n",
" <th>Summary</th>\n",
" <th>Text</th>\n",
" <th>combined</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1303862400</td>\n",
" <td>B001E4KFG0</td>\n",
" <td>A3SGXH7AUHU8GW</td>\n",
" <td>5</td>\n",
" <td>Good Quality Dog Food</td>\n",
" <td>I have bought several of the Vitality canned d...</td>\n",
" <td>Title: Good Quality Dog Food; Content: I have ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1346976000</td>\n",
" <td>B00813GRG4</td>\n",
" <td>A1D87F6ZCVE5NK</td>\n",
" <td>1</td>\n",
" <td>Not as Advertised</td>\n",
" <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
" <td>Title: Not as Advertised; Content: Product arr...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Time ProductId UserId Score Summary \\\n",
"Id \n",
"1 1303862400 B001E4KFG0 A3SGXH7AUHU8GW 5 Good Quality Dog Food \n",
"2 1346976000 B00813GRG4 A1D87F6ZCVE5NK 1 Not as Advertised \n",
"\n",
" Text \\\n",
"Id \n",
"1 I have bought several of the Vitality canned d... \n",
"2 Product arrived labeled as Jumbo Salted Peanut... \n",
"\n",
" combined \n",
"Id \n",
"1 Title: Good Quality Dog Food; Content: I have ... \n",
"2 Title: Not as Advertised; Content: Product arr... "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('input/Reviews.csv', index_col=0)\n",
"df = df[['Time', 'ProductId', 'UserId', 'Score', 'Summary', 'Text']]\n",
"df = df.dropna()\n",
"df['combined'] = \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n",
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# take the 1,100 most recent reviews, leaving headroom for the length filter below\n",
"df = df.sort_values('Time').tail(1_100)\n",
"df.drop('Time', axis=1, inplace=True)\n",
"\n",
"from transformers import GPT2TokenizerFast\n",
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
"\n",
"# drop reviews of 2,000 tokens or more, then keep the 1,000 most recent\n",
"df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
"df = df[df.n_tokens<2000].tail(1_000)\n",
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Get embeddings and save them for future reuse"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from openai.embeddings_utils import get_embedding\n",
"\n",
"# This will take just under 10 minutes\n",
"df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n",
"df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n",
"df.to_csv('output/embedded_1k_reviews.csv')"
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because it is too large Load Diff

@ -0,0 +1,109 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regression using the embeddings\n",
"\n",
"Regression means predicting a number, rather than one of a fixed set of categories. We will predict the score based on the embedding of the review's text. We split the dataset into a training and a testing set for all of the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
"\n",
"We're predicting the score of the review, which is a number between 1 and 5 (1-star being negative and 5-star positive)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Babbage similarity embedding performance on 1k Amazon reviews: mse=0.38, mae=0.39\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"\n",
"df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
"df['babbage_similarity'] = df.babbage_similarity.apply(eval).apply(np.array)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size = 0.2, random_state=42)\n",
"\n",
"rfr = RandomForestRegressor(n_estimators=100)\n",
"rfr.fit(X_train, y_train)\n",
"preds = rfr.predict(X_test)\n",
"\n",
"\n",
"mse = mean_squared_error(y_test, preds)\n",
"mae = mean_absolute_error(y_test, preds)\n",
"\n",
"print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dummy mean prediction performance on Amazon reviews: mse=1.77, mae=1.04\n"
]
}
],
"source": [
"bmse = mean_squared_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
"bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
"print(f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the embeddings are able to predict the scores with a mean absolute error of 0.39 per prediction, compared with 1.04 for the dummy baseline that always predicts the mean. This is roughly equivalent to predicting 2 out of 3 reviews perfectly and being off by one star on the remaining 1 out of 3."
]
},
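{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the comparison above concrete, here is a small sketch with invented labels (not taken from the dataset): two of three predictions are exact and one is off by a single star. The mean absolute error in that scenario is 1/3, about 0.33, which is in the same range as the 0.39 measured above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# invented scores: two exact predictions and one that is off by one star\n",
"y_true = np.array([5, 4, 2])\n",
"y_pred = np.array([5, 4, 3])\n",
"\n",
"# mean absolute error = average of |true - predicted|\n",
"mae = np.abs(y_true - y_pred).mean()\n",
"print(round(float(mae), 2))\n",
"```\n",
"\n",
"The measured error is slightly higher than this idealized case, so some predictions are off by more than the scenario assumes."
]
},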
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could also train a classifier to predict the label, or use the embeddings within an existing ML model to encode free text features."
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,185 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Semantic text search using embeddings\n",
"\n",
"We can search through all our reviews semantically in a very efficient manner and at very low cost, by simply embedding our search query, and then finding the most similar reviews. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"\n",
"df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
"df['babbage_search'] = df.babbage_search.apply(eval).apply(np.array)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember to use the document embedding engine for documents (in this case reviews) and the query embedding engine for queries. Here we simply compute the cosine similarity between the query embedding and each document embedding, and show the top `n` best matches."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
"\n",
"Good Buy: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!\n",
"\n",
"Fantastic Instant Refried beans: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.\n",
"\n"
]
}
],
"source": [
"from openai.embeddings_utils import get_embedding, cosine_similarity\n",
"\n",
"# search through the reviews for a specific product\n",
"def search_reviews(df, product_description, n=3, pprint=True):\n",
"    # embed the query with the query engine; the reviews were embedded with the doc engine\n",
"    embedding = get_embedding(product_description, engine='text-search-babbage-query-001')\n",
"    df['similarities'] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))\n",
"\n",
"    # keep the n most similar reviews and tidy up the combined text for display\n",
"    res = df.sort_values('similarities', ascending=False).head(n).combined.str.replace('Title: ','').str.replace('; Content:', ': ')\n",
"    if pprint:\n",
"        for r in res:\n",
"            print(r[:200])\n",
"            print()\n",
"    return res\n",
"res = search_reviews(df, 'delicious beans', n=3)\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rustichella ROCKS!: Anything this company makes is worthwhile eating! My favorite is their Trenne.<br />Their whole wheat pasta is the best I have ever had.\n",
"\n",
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
"\n",
"Wonderful: Came quickly. Was plentiful and delicious and cheaper than in the store. You will enjoy it if you like thick pasta.\n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, 'whole wheat pasta', n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can search through these reviews easily. For larger collections, the computation can be sped up with an algorithm designed for fast nearest-neighbor search over embeddings."
]
},
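{
"cell_type": "markdown",
"metadata": {},
"source": [
"One simple way to reduce per-query cost is to stack the embeddings into a matrix, normalize the rows once, and score a query with a single matrix-vector product. The sketch below assumes only NumPy; it is a vectorized exact search, not an approximate nearest-neighbor index, and `top_n_by_cosine` and the random matrix are made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def top_n_by_cosine(doc_matrix, query_vec, n=3):\n",
"    # normalize rows once so cosine similarity becomes a plain dot product\n",
"    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)\n",
"    q = query_vec / np.linalg.norm(query_vec)\n",
"    scores = docs @ q\n",
"    # indices of the n highest scores, best first\n",
"    top = np.argsort(-scores)[:n]\n",
"    return top, scores[top]\n",
"\n",
"rng = np.random.default_rng(0)\n",
"doc_matrix = rng.normal(size=(1000, 16))  # random stand-ins for review embeddings\n",
"\n",
"# query with one of the rows; that row should come back as the best match\n",
"top, scores = top_n_by_cosine(doc_matrix, doc_matrix[42], n=3)\n",
"print(top, scores)\n",
"```\n",
"\n",
"In practice the normalized matrix would be computed once and reused across queries, so each search costs one matrix-vector product plus a sort."
]
},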
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"great product, poor delivery: The coffee is excellent and I am a repeat buyer. Problem this time was with the UPS delivery. They left the box in front of my garage door in the middle of the drivewa\n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, 'bad delivery', n=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, this can immediately deliver a lot of value. In this example, we quickly find reviews that describe delivery failures."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extremely dissapointed: Hi,<br />I am very disappointed with the past shipment I received of the ONE coconut water. 3 of the boxes were leaking and the coconut water was spoiled.<br /><br />Thanks.<b\n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, 'spoilt', n=1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
"\n",
"A great deal on Greenies: Paid only $22 with free shipping for 96 teenies compared to about $35 at the pet store. How can you go wrong with a deal like that? The dog begs for his daily Greenie. Got \n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, 'pet food', n=2)"
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because one or more lines are too long


@ -0,0 +1,200 @@
{"text": " Morada Limited is a textile company based in Altham Lancashire. Morada specializes in curtains.", "category": "Company"}
{"text": " The Armenian Mirror-Spectator is a newspaper published by the Baikar Association in Watertown Massachusetts.", "category": "WrittenWork"}
{"text": " Mt. Kinka (\u91d1\u83ef\u5c71 Kinka-zan) also known as Kinkazan is located in the heart of the city of Gifu Gifu Prefecture Japan and rises to a height of 329 m (1079 ft). Previously called Mt. Inaba (\u7a32\u8449\u5c71 Inaba-yama) it has long served as the representative symbol of Gifu. It stands along the Nagara River creating bountiful nature within the city. Though it is the most famous mountain in the city Mount Dodo to the north is the tallest.", "category": "NaturalPlace"}
{"text": " Planning the Play of a Bridge Hand is a book on contract bridge co-written by Canadian teacher and author Barbara Seagram and British author David Bird. It was published by Master Point Press in 2009.The book teaches novice bridge players some basic techniques of declarer play including suit establishment ruffing losers and the finesse.", "category": "WrittenWork"}
{"text": " Wang Yuanping (born 8 December 1976) is a retired Chinese athlete who specialised in the 800 metres. She won several medals at the regional level.Her personal bests in the event are 2:00.63 seconds outdoors (Jinzhou 2000) and 2:03.41 seconds indoors (Yokohama 2004).", "category": "Athlete"}
{"text": " The Incorporated VIllage of Westhampton Beach is an incorporated village in the Town of Southampton Suffolk County New York United States. As of the 2010 census the village population was 1721.", "category": "Village"}
{"text": " Andersons Creek is a creek in Warrandyte and Park Orchards east of Melbourne Victoria Australia. It is a tributary of the Yarra River.", "category": "NaturalPlace"}
{"text": " The Three Horseshoes is a public house in Drybridge Street in the Overmonnow area of Monmouth Wales. The pub has also been used as an Inn and also known as The Three Horse Shoes Inn. The building has been a Grade II Listed building since 15 August 1974. 19th century 2 storeys roughcast as stone with a hooded doorway", "category": "Building"}
{"text": " The Brewer's Art is a Baltimore Maryland brewpub and restaurant. Opened on Friday September 13 1996. In 2008 it was named by Esquire magazine as the #1 Best Bar in America.", "category": "Company"}
{"text": " The P\u00e2r\u00e2ul S\u0103r\u0103\u021bii is a tributary of the Cibin River in Romania.", "category": "NaturalPlace"}
{"text": " Jean-Fran\u00e7ois Imbernon (born October 17 1951 in Perpignan France is a retired French international rugby union player.He played as a Lock for USA Perpignan. He earned his first cap with the French national team on 7 February 1976 against Ireland at Parc des Princes.", "category": "Athlete"}
{"text": " Le Cadeau released in Italy as Il regalo is a 1982 French and Italian film. It stars Claudia Cardinale.", "category": "Film"}
{"text": " Mykola Kanevets (Ukrainian: \u041c\u0438\u043a\u043e\u043b\u0430 \u041a\u0430\u043d\u0456\u0432\u0435\u0446\u044c) is the Artistic Director and Ballet Master of the Cheremosh Ukrainian Dance Company in Edmonton Alberta Canada.A native of Kiev Ukraine Mykola attended the National University of Culture and Performing Arts in Kiev Ukraine where he graduated from the Faculty of Choreography with the distinction of Ballet Master and Choreographer.", "category": "Artist"}
{"text": " Jenna Rose Swerdlow (born September 28 1998) is an American teenage singer who gained media attention as a pre-teen with her single My Jeans. After the video went viral on YouTube and received 14 million views Swerdlow is considered a semi-viral star.", "category": "Artist"}
{"text": " The Spice of Life is a smooth jazz studio album by Earl Klugh released in April 2008. The album received a Grammy nomination for Best Pop Instrumental Album at the 51st Grammy Awards in 2009.", "category": "Album"}
{"text": " Lomatium macrocarpum is a perennial flowering plant in the carrot family known by the common names bigseed lomatium biscuit root or even bigseed biscuitroot. It is native to much of western North America where it can be found in various types of habitat including the grasslands of the Great Plains. It is spreading or erect perennial herb growing up to about half a meter long with hairy gray-green herbage.", "category": "Plant"}
{"text": " Physena is the sole genus of the flowering plant family Physenaceae. It contains two species of shrubs and small trees which are endemic to Madagascar. The APG II system of 2003 (unchanged from the APG system of 1998) does recognize this family and assigns it to the order Caryophyllales in the clade core eudicots.", "category": "Plant"}
{"text": " David John Weatherley (born 1 March 1939) is a New Zealander actor known for his roles as Spencer the butler and the voice of Benglo the Fearcat in Power Rangers Operation Overdrive and Barliman Butterbur in The Lord of the Rings: The Fellowship of the Ring.Weatherley was born in London England and moved to Canada for a military career. He eventually moved to New Zealand to engage in a theatre acting career.", "category": "Artist"}
{"text": " Draba incrassata is an uncommon species of flowering plant in the mustard family known by the common name Sweetwater Mountains draba. It is endemic to California where it is known mainly from the Sweetwater Mountains of Mono County. It grows in alpine rock fields on the barren high mountain peaks. Draba incrassata is a small perennial herb forming mats of thick oval-shaped leaves.", "category": "Plant"}
{"text": " Pimelea ferruginea is a small shrub native to southwest Western Australia. It was described by Labillardiere in 1805.", "category": "Plant"}
{"text": " Lindsay Ell is a country music singer songwriter and guitarist from Calgary Alberta.She performed at the South by Southwest music festival held in Austin Texas in March 2009 the welcome reception of the 2009 Juno Awards held in Vancouver British Columbia and was also a featured artist at the 2010 Winter Olympics.", "category": "Artist"}
{"text": " Scopula fuscata is a moth of the Geometridae family. It is found from south-western Saskatchewan west to British Columbia and south to California and Arizona. The habitat consists of montane areas including foothills.The wingspan is 24-28 mm. The wings and body are light tan sprinkled with darker yellow-brown or grey-brown scales. There is one generation per year with adults on wing in late June and early July in the northern part of the range.", "category": "Animal"}
{"text": " Oxmoor Center is a Louisville Kentucky shopping mall located at 7900 Shelbyville Road in eastern Louisville.", "category": "Building"}
{"text": " Ghostquake (also known as Haunted High) is a 2012 American made-for-television horror film produced by Syfy. The film was directed by Jeffrey Lando and written by Paul A. Birkett and Anthony C. Ferrante. The film stars Danny Trejo and MC Gainey. It follows a group of high school students trying to escape the wrath of a few ghastly spirits following an earthquake at their school Holloman High School.", "category": "Film"}
{"text": " The Masonic Temple in Great Falls Montana is a building from 1914. It was listed on the National Register of Historic Places in 2000.Address is 821 Central Avenue Great Falls Motana 59401 Phone number is 453-9080.Cascade No. 34 meets 2nd and 4th Tuesdays at 7:30pm Sept-June.Euclid No. 58 meets year-round 1st and 3rd Tuesdays at 7:30pm Sept-May 3rd Tuesdays at 7:30pm June-Aug. Delta No. 128 meets 2nd Wednesdays at 7:30pm Sept-June.", "category": "Building"}
{"text": " Harold Frederick Weaver Hawkins (1893-1977) was an English painter who specialized in ambitious sometimes mural-sized modernist allegories of morality for an age of atomic warfare and global over-population.", "category": "Artist"}
{"text": " Robert Murray Waddington (24 October 1927 \u2013 15 March 2007) was Dean of Manchester in the last quarter of the 20th century.Born in Bognor Regis on 24 October 1927 he was educated at Dulwich College Selwyn College Cambridge and Ely Theological College. Ordained in 1954 he began his career at St John\u2019s Bethnal Green. Afterwards he was Chaplain at Slade School in Warwick Queensland. He returned to England in 1959 to join the Oratory of the Good Shepherd an order of celibate priests.", "category": "OfficeHolder"}
{"text": " Jason Gary King (born 13 April 1985 in Maidstone England) is a speedway rider who was formerly the club captain of Newcastle Diamonds in the British Premier League. His brother Daniel is also a speedway rider.", "category": "Athlete"}
{"text": " The African Queen is a 1951 adventure film adapted from the 1935 novel of the same name by C. S. Forester. The film was directed by John Huston and produced by Sam Spiegel and John Woolf. The screenplay was adapted by James Agee John Huston John Collier and Peter Viertel. It was photographed in Technicolor by Jack Cardiff and had a music score by Allan Gray.", "category": "Film"}
{"text": " The Fiat Barchetta (Italian pronunciation: [\u02c8fiat bar\u02c8ketta]) (Type 183) is a roadster produced by the Italian manufacturer Fiat from 1995 to 2005 (though production was paused between May 2002 and 2004). Barchetta in Italian means 'little boat'.", "category": "MeanOfTransportation"}
{"text": " Sardar Vallabhbhai Patel National Memorial is a museum and exhibition centre dedicated to Sardar Vallabhbhai Patel at Moti Shahi Mahal located in Shahibaug Ahmedabad Gujarat. Moti Shahi Mahal was constructed by Mughal emperor Shahjahan between 1618 and 1622. It is surrounded by a garden.", "category": "Building"}
{"text": " Under Cover 2 is the 5th solo album of Joe Lynn Turner released in 1999. Just like Under Cover the album consists mainly of covers of Turner's favourite artists.", "category": "Album"}
{"text": " The Atakora River is a tributary of Lake Volta in Ghana it flows about 60 km east to the Lake Volta. Its entire course is in south Ghana.", "category": "NaturalPlace"}
{"text": " Death from Above is a 2011 horror film by director Bruce Koehler. The film features professional wrestling stars Kurt Angle Sid Eudy James Storm Matt Morgan Terry Gerin and Jessica Kresa.", "category": "Film"}
{"text": " Portraits of Cuba is an album by Cuban musician Paquito D'Rivera released through Chesky Records in 1996. In 1997 the album won D'Rivera the Grammy Award for Best Latin Jazz Performance.", "category": "Album"}
{"text": " Jimmy Cross (17 November 1938 - 8 October 1978) was an American radio producer and singer who attained a minor Billboard Hot 100 hit with the novelty song I Want My Baby Back in 1965. He was born in Dothan Alabama[citation needed] and became the producer of the syndicated radio series Country Concert.I Want My Baby Back was originally issued on the Tollie label and reached #92 on the Billboard Hot 100 in February 1965.", "category": "Artist"}
{"text": " Timothy Floyd Tim Burchett (born August 25 1964) is an American Republican politician currently the mayor of Knox County Tennessee. He previously served in Tennessee General Assembly first in the Tennessee House of Representatives and later in the Tennessee State Senate in which he represented Tennessee's District 7 part of Knox County. On August 5 2010 Burchett was elected mayor of Knox County replacing Mike Ragsdale.", "category": "OfficeHolder"}
{"text": " Daniel Lawrence Dan Whitney (born February 17 1963) best known by his stage name and character Larry the Cable Guy is an American stand-up comedian actor voice actor and former radio personality.", "category": "Artist"}
{"text": " Renealmia is a plant genus in the family Zingiberaceae. Species include: Renealmia alpinia Renealmia aurantifera Renealmia cernua Renealmia dolichocalyx Renealmia oligotricha Renealmia sessilifolia Renealmia thrysoidesE.g. Alpinia nutans was formerly placed herein too.", "category": "Plant"}
{"text": " Jeff Chapman (born July 17 1969 in Brunswick Georgia) is the bass singer for the Kingdom Heirs. He has been a member of the group since 2002. He has previously traveled with Bob Wills The Sound The Anchormen and The Blackwoods.He has twice been nominated as favorite bass in the Singing News fan awards.Chapman has a wife Angie two sons Justin and Sean and daughter Taylor.", "category": "Artist"}
{"text": " Arenaria ursina is a species of flowering plant in the pink family known by the common name Bear Valley sandwort.", "category": "Plant"}
{"text": " Living Fossil is a classic science fiction story on the concepts of human extinction and future evolution by L. Sprague de Camp. It was first published in the magazine Astounding Science-Fiction for February 1939. It first appeared in book form in the anthology A Treasury of Science Fiction (Crown Publishers 1948); it later appeared in the anthologies Gates to Tomorrow (Atheneum 1973) and The SFWA Grand Masters Volume 1 (Tor Books 1999).", "category": "WrittenWork"}
{"text": " Brachyglottis huntii commonly called rautini or Chatham Island Christmas tree is a species in the Asteraceae family and is found only on the Chatham Islands in New Zealand.", "category": "Plant"}
{"text": " Luktvatnet is a lake that lies in the northern part of the municipality of Vefsn in Nordland county Norway. The lake lies between the mountains Korgfjellet and Lukttinden about 5 kilometres (3.1 mi) south of Elsfjord. The European route E06 highway passes along the northern shore of the lake.", "category": "NaturalPlace"}
{"text": " The IAR 79 is a bi-engine bomber military reconnaissance aircraft with a wood and metal structure that saw service in World War II built under licence in Brasov Romania by Industria Aeronautic\u0103 Rom\u00e2n\u0103", "category": "MeanOfTransportation"}
{"text": " Enrico Perucconi (born 4 January 1925 in Morazzone Varese Italy) was an Italian athlete who competed mainly in the 100 metres.", "category": "Athlete"}
{"text": " Central National-Gottesman Inc. is one of the world's largest distributors of pulp paper paperboard and newsprint. The firm's products are sold in over 75 countries through a network of 43 offices located in the United States and abroad. With annual revenues exceeding $3 billion Forbes ranked Central National-Gottesman 137th in its annual list of The Largest Private Companies.", "category": "Company"}
{"text": " The Kout Food Group is a Kuwaiti-based conglomerate founded in 1982.In Kuwait it operates franchises of Burger King Pizza Hut and Taco Bell.Its UK arm Kout Food Group Restaurants UK it operates under brands such as Burger King KFC and Maison Blanc. In August 2013 it acquired the Little Chef chain for \u00a315 million.", "category": "Company"}
{"text": " Fab Five: The Texas Cheerleader Scandal is a Lifetime Television made-for-TV drama film starring Jenna Dewan Ashley Benson and Tatum O'Neal and directed by Tom McLoughlin. The film premiered on August 2 2008. It is based on a true story which occurred at McKinney North High School in McKinney Texas in 2006 in which five teenage cheerleaders became notorious for bullying truancies violations of the school dress code and general disrespect to the school community and authority.", "category": "Film"}
{"text": " Qadi Mahalleh (Persian: \u0642\u0627\u062f\u064a \u0645\u062d\u0644\u0647\u200e also Romanized as Q\u0101d\u012b Ma\u1e29alleh) is a village in Pazevar Rural District Rudbast District Babolsar County Mazandaran Province Iran. At the 2006 census its population was 228 in 59 families.", "category": "Village"}
{"text": " Eungella Dam is one of Queensland's more established freshwater fisheries. Eungella has made a name for producing extra oversized Sooty grunter and more recently Barramundi.Eungella Dam was constructed in 1969 to meet the requirements of a thermal power station at Collinsville and the town water requirement of Collinsville and Scottsville.", "category": "NaturalPlace"}
{"text": " The American Motor Car Company was a short-lived company in the automotive industry founded in 1906 lasting until 1913. It was based in Indianapolis Indiana United States. The American Motor Car Company pioneered the underslung design.", "category": "Company"}
{"text": " Hawkeye & Mockingbird was a comic book ongoing series published by Marvel Comics starring superheroes Hawkeye and Mockingbird.", "category": "WrittenWork"}
{"text": " Margaret Anderson Kelliher (born March 11 1968) is a Minnesota politician and a former member of the Minnesota House of Representatives. A member of the Minnesota Democratic\u2013Farmer\u2013Labor Party she represented District 60A which includes portions of the city of Minneapolis in Hennepin County located in the Twin Cities metropolitan area. First elected in 1999 she served until 2011 also serving as the Speaker from 2007 to 2011.", "category": "OfficeHolder"}
{"text": " John Whitlow Wyatt (September 27 1907 \u2013 July 16 1999) was a professional baseball pitcher. He played all or part of sixteen seasons in Major League Baseball for the Detroit Tigers (1929\u201333) Chicago White Sox (1933\u201336) Cleveland Indians (1937) Brooklyn Dodgers (1939\u201344) and Philadelphia Phillies (1945). While injuries sidetracked much of Wyatt's early career he is most famous for his performance in 1941 when his team (the Dodgers) won the National League pennant.", "category": "Athlete"}
{"text": " William Thomas Burton (31 January 1878 in Black Rock St Michael Barbados \u2013 22 August 1946 St Michael Barbados) was a coloured West Indian cricketer best known as a member of the 1900 and 1906 West Indian tourists to England. He is generally known as Tommie Burton.He was the son of a black mother and a white father. He was brought up in Barbados and served for some years there as a practice bowler and in trial matches.", "category": "Athlete"}
{"text": " Tulemalu Lake is a lake in Kivalliq Region Nunavut Canada.", "category": "NaturalPlace"}
{"text": " Sten Stjernqvist is a Swedish former footballer who played as a forward.", "category": "Athlete"}
{"text": " David Parlett (born 1939) is a games scholar from South London who has studied both card games and board games. His published works include many popular books on games and the more academic volumes The Oxford Guide to Card Games and The Oxford History of Board Games both now out of print. Parlett also invented a number of board games the most successful of which is Hare and Tortoise (1974). The German edition was awarded Spiel des Jahres (Game of the Year) in 1979.", "category": "Artist"}
{"text": " Karl Nabersberg (sometimes written as Carl Nabersberg) was a German youth leader.Nabersberg was the son of a Crefeld shopkeeper. In 1923 he joined the Jugendorganisation the forerunner of the Hitler Youth in his home town. On 28 December 1925 he was admitted as a member of the National Socialist German Workers' Party (member number 26269) and as a member of the Sturmabteilung.", "category": "OfficeHolder"}
{"text": " \u0160etonje is a village situated in Petrovac na Mlavi municipality in Serbia.", "category": "Village"}
{"text": " Dr. Joseph de Graft-Johnson (1933\u20131999) was an engineer academic and politician. He became the Vice-President of Ghana between 1979 and 1981.", "category": "OfficeHolder"}
{"text": " Patties Foods (previously Patties Bakery) is an Australian food manufacturing company that produces meat pies baked goods frozen fruits and pre-made desserts. Headquartered in Bairnsdale Victoria Australia Patties Foods is represented in the Australian market by the Four'N Twenty Patties Herbert Adams Creative Gourmet Nanna's and Chefs Pride brands. Patties is the largest meat pie producing company in Australia and the world.", "category": "Company"}
{"text": " Double Butte is the 2579-foot (786 m) mountain summit distinguished by two buttes (the other at abou 2480 feet or 756 metres) in Riverside County California. It is the western most summit of a mountain range north of Winchester California east of Perris Valley and west of the San Jacinto Valley. The eastern ridge is composed primarily of metamorphic rock of the Triassic - Jurassic French Valley formation.", "category": "NaturalPlace"}
{"text": " Mount Carmel \u2013 Blytheswood Public School is an elementary school in the north end of Leamington Ontario Canada. It is part of the Greater Essex County District School Board and serves students from JK to Grade 8 from the communities of Blytheswood and Mount Carmel and surrounding areas.", "category": "EducationalInstitution"}
{"text": " La combi asesina (The Killer Combination) is a 1982 Mexican film. It was directed by Gustavo Alatriste.", "category": "Film"}
{"text": " Halimium ocymoides (basil-leaved rock rose) syn. Cistus algarvensis is a species of flowering plant in the family Cistaceae native to southern Portugal and southern Spain. It is an erect evergreen shrub growing to 60 cm (24 in) tall by 100 cm (3 ft) wide with woolly grey-green leaves and bright yellow flowers in spring. The flowers may have a dark brown blotch at the base of each petal.In cultivation this plant requires a sandy soil and full sun.", "category": "Plant"}
{"text": " Kaala Patthar (English: Black Stone) is a 1979 Indian Bollywood action/drama film. It was produced and directed by Yash Chopra. The story was written by Salim-Javed. This film is the fourth collaboration between Amitabh Bachchan Shashi Kapoor and director Yash Chopra after the hugely successful Deewaar (1975) Kabhie Kabhie (1976) and Trishul (1978). However this film did average business at the box office. It was nominated for Filmfare awards.", "category": "Film"}
{"text": " Martin G.S. Mansergh (born 31 December 1946) is a former Irish Fianna F\u00e1il politician and historian. He was a Teachta D\u00e1la (TD) for the Tipperary South constituency from 2007 until 2011. He was previously a Senator from 2002 to 2007. He played a leading role in formulating Fianna F\u00e1il policy on Northern Ireland.", "category": "OfficeHolder"}
{"text": " Shriniwas Ganesh Sardesai (1907-1996) popularly known as S.G. Sardesai was an Indian freedom fighter from Maharashtra and one of the great communist leaders produced by the communist movement in India. He is author of the book Progress and conservatism in ancient India famous for his profound theoretical analysis. He was the Central Executive Committee of pre-split Communist Party of India during the Indo-china conflict.", "category": "OfficeHolder"}
{"text": " USS Tuluran (AG-46) \u2013 also known as USS Lake Superior (ID-2995) \u2013 was a commercial cargo ship acquired by the U.S. Navy for service during both World War I when she was known as USS Lake Superior and also during World War II when she was known as USS Tuluran.", "category": "MeanOfTransportation"}
{"text": " The American Journal of Gastroenterology is a peer-reviewed medical journal published for the American College of Gastroenterology by the Nature Publishing Group.", "category": "WrittenWork"}
{"text": " William Lindsay (September 4 1835 \u2013 October 15 1909) was a Democratic U.S. Senator from Kentucky from 1893 to 1901.Born near Lexington Virginia Lindsay attended the common schools and settled in Clinton Kentucky in 1854. There he taught school and studied law. He was admitted to the bar and commenced practice in Clinton in 1858.", "category": "OfficeHolder"}
{"text": " Brian Schroeder (a.k.a. Pushead) is an artist record label owner and writer within the hardcore punk and heavy metal field. He has created artwork for many bands artists and athletes including Metallica The Misfits Dr. Dre Travis Barker Craig Johnson and Kool Keith. He has designed many record covers T-shirts skateboards and a pair of Nike SB Dunks. His record label Pusmort Records has released albums by Negative Gain Poison Idea and Final Conflict.", "category": "Artist"}
{"text": " Panicum anceps is a species of grass known by the common name beaked panicgrass. It is native to the southeastern United States where it occurs as far north as New Jersey and as far west as Kansas and Texas.This species is a rhizomatous perennial grass with stems growing up to 1.3 meters tall. The leaves have erect blades up to half a meter tall. The inflorescence is a panicle up to 40 centimeters long bearing pale green or yellowish spikelets. The grass produces an abundance of seed.", "category": "Plant"}
{"text": " Shukan ST is a weekly newspaper published by The Japan Times for learners of English language. It is originally titled as Student Times but changed to Shukan ST since a significant portion of its readers are not students. It has articles on news movie lifestyle in English-speaking countries opinions and other kinds attracting learners of English and helping them with notes on terms.", "category": "Company"}
{"text": " The Tiger Hotel is a hotel in Columbia Missouri. Built as a hotel in 1928 the building later housed a retirement home and banquet center. In 2012 the building was fully restored and reopened as a boutique hotel. It was listed on the National Register of Historic Places in 1980.", "category": "Building"}
{"text": " Emi Motoi (\u672c\u4e95 \u3048\u307f Motoi Emi born October 11 in Kanagawa) is a Japanese voice actress.", "category": "Artist"}
{"text": " The Hudson River is a 49.5-mile-long (79.7 km) tributary of the Broad River in the U.S. state of Georgia. Via the Broad River it is part of the Savannah River watershed.The headwaters are in Banks County near the city of Homer. Grove Creek feeds into the Hudson near the Franklin County line. The river then constitutes most of the southern border of Franklin County separating it from Madison County.", "category": "NaturalPlace"}
{"text": " This article details Car Nos. 10\u201313 of the Manx Electric Railway on the Isle of Man.This was the third batch of motorcars delivered to the railway in 1895 at the same time as the cars for the new Snaefell Mountain Railway were delivered. They were constructed to a very similar design to those provided for the mountain line.", "category": "MeanOfTransportation"}
{"text": " Catharanthus roseus commonly known as the Madagascar rosy periwinkle is a species of Catharanthus native and endemic to Madagascar. Other English names occasionally used include Cape periwinkle rose periwinkle rosy periwinkle and old-maid.", "category": "Plant"}
{"text": " Thapanzeik is a village in Homalin Township Hkamti District in the Sagaing Region of northwestern Burma.", "category": "Village"}
{"text": " USS Spiegel Grove (LSD-32) was a Thomaston-class dock landing ship of the United States Navy. She was named for Spiegel Grove the home and estate in Fremont Ohio of Rutherford B. Hayes the 19th President of the United States.", "category": "MeanOfTransportation"}
{"text": " Acmella is a genus of thirty species of plants in the aster family Asteraceae. It is native to the Americas and has been introduced to Asia Africa the Pacific islands and Australia.One familiar species is Acmella oleracea which has been widely cultivated for centuries. It is used for food and medicine and as an insecticide and an ornamental plant.", "category": "Plant"}
{"text": " Mirbelia is a plant genus belonging to the Fabaceae family. It is endemic to Australia occurring in every mainland state except South Australia.", "category": "Plant"}
{"text": " Nigma puella is a species of spider belonging to the family Dictynidae. It is found in Europe Azores Madeira Canary Islands and parts of North Africa.Like most members of the family this is a small spider but the female is striking with a light green abdomen marked with a bold maroon blotch and a variable amount of barring in the same colour. The male is reddish-brown. This species makes a horizontal web over the top surface of a leaf.", "category": "Animal"}
{"text": " The Madrisa (or Madrisahorn) is a mountain in the R\u00e4tikon mountain range overlooking Klosters in the Swiss canton of Graub\u00fcnden. Its summit (2826 metres) is located near the Austrian border.The Madrisa is constituted by several secondary summits notably the Gargeller Madrisa (2770 metres) overlooking Gargellen in Austria.Ski lifts up to 2600 metres are located on the Klosters side.", "category": "NaturalPlace"}
{"text": " Temporary Temple is a live album by Psychic TV. The album was recorded on July 28 1984 in London and released on 12 vinyl. It was later coupled with another concert and released on CD as Temporary Temple & Atonal.", "category": "Album"}
{"text": " La hija de Juan Sim\u00f3n (Juan Sim\u00f3n's Daughter) is a musical play by Nemesio M. Sobrevila which has been made into two Spanish films. It is also the name of the title track and the song has been recorded by numerous artists such as Leonardo Favio.The first film directed by Jos\u00e9 Luis S\u00e1enz de Heredia was released in 1935 and starred Angelillo Pilar Mu\u00f1oz and Manuel Arb\u00f3. Luis Bu\u00f1uel was the executive producer for Film\u00f3fono and had a small role as an actor.", "category": "Film"}
{"text": " Book Of Matches is a poetry book written by Simon Armitage first published in 1993 by Faber and Faber. Several poems featured in the book are studied as part of the GCSE English Literature examination in the UK.The book is written in three sections the first (Book of Matches) containing 30 short sonnets. Each is meant to be read within 20 seconds the amount of time it would take for a match to be lit and burn out.", "category": "WrittenWork"}
{"text": " The Last Supper is the fourth album released by American stand-up comedian Jim Gaffigan. It focuses largely on his love of food.", "category": "Album"}
{"text": " The Miami Center is a skyscraper in downtown Miami Florida. Although Miami Center is not the city's tallest building it is a symbol of early downtown. Built in 1983 it is older compared with most of the taller buildings in Miami which have been built in the last decade. In addition the Miami Center is immediately adjacent to Bayfront Park and is unobstructed when looking at the skyline from Miami Beach to the east. The building is 484 ft (148 m) tall and has 34 floors.", "category": "Building"}
{"text": " Duboisia hopwoodii is a shrub native to the arid interior region of Australia. Common names include pituri pitchuri thornapple or pitcheri. It has an erect habit usually growing to between 1 and 3 metres in height and has long narrow leaves. Flowers are white and bell-shaped with violet-striped throats. These appear between June and November in the species native range followed by purple-black rounded berries which are 3 to 6 mm in diameter.", "category": "Plant"}
{"text": " Jelenin svet (Jelena's World) is a 2008 independent documentary film written and directed by Tanja Brzakovi\u0107 about former World No. 1 female tennis player Jelena Jankovi\u0107.", "category": "Film"}
{"text": " Jay Cashman Inc. is an American heavy-construction company based in Quincy Massachusetts with satellite offices in Boston Jupiter Florida and Staten Island New York. As of 2006 the company has about 1000 employees. The company was one of the major contractors on the Boston's Central Artery/Tunnel Project. In 2004 Jay Cashman Inc.", "category": "Company"}
{"text": " Hashemanli (Persian: \u0647\u0627\u0634\u0645\u0646\u0644\u064a\u200e also Romanized as H\u0101shemanl\u012b; also known as H\u0101shem El\u00e1) is a village in Jafarbay-ye Jonubi Rural District in the Central District of Torkaman County Golestan Province Iran. At the 2006 census its population was 135 in 27 families.", "category": "Village"}
{"text": " Rani Kasula Rangamma is a Telugu film starring Chiranjeevi.", "category": "Film"}
{"text": " The 20/20 Experience \u2013 The Complete Experience is a compilation album by American singer-songwriter Justin Timberlake. It was released on September 27 2013 by RCA Records.", "category": "Album"}
{"text": " R.C. Bigelow Inc better known as the Bigelow Tea Company is an American tea company based in Fairfield Connecticut. The company was founded by Ruth C. Bigelow in the late 1940s based on a recipe she marketed as Constant Comment tea. Bigelow is still a 100% family-owned business that markets over 50 varieties of tea including black and green as well as herbal teas all of which are still blended in Fairfield. They also own America's only tea plantation in Charleston South Carolina.", "category": "Company"}
{"text": " Thomas Eyre(fl. 1890s) was a footballer who made 65 appearances in the Football League playing for Lincoln City. He played at left back. Either side of Lincoln he played for Ashfield and Hamilton Academical in Scotland.", "category": "Athlete"}
{"text": " Malleable Iron Range Company was a company that existed from 1896 to 1985 and primarily produced kitchen ranges made of malleable iron but also produced a variety of other related products. The company's primary trademark was 'Monarch' and was colloquially often referred to as the Monarch Company or just Monarch.", "category": "Company"}
{"text": " The Chiltern School is a coeducational special school located over two sites in Dunstable and Houghton Regis in Bedfordshire England. The school accepts pupils from all over the Central Bedfordshire area.The school was formed in 2012 from the merger of Glenwood School in Dunstable and Hillcrest School in Houghton Regis.", "category": "EducationalInstitution"}
{"text": " Kim Dae-Eun (born September 17 1984) is a South Korean gymnast. He is the 2004 Olympic All-around silver medalist. He won the gold medal on the parallel bars at the 2007 World Artistic Gymnastics Championships.Kim was part of the South Korean team that won the bronze medal in the team event at the 2006 Asian Games.", "category": "Athlete"}
{"text": " Arayik Vladimirovich Harutyunyan (Armenian: \u0531\u0580\u0561\u0575\u056b\u056f \u0540\u0561\u0580\u0578\u0582\u0569\u0575\u0578\u0582\u0576\u0575\u0561\u0576 Russian: \u0410\u0440\u0430\u0438\u043a \u0410\u0440\u0443\u0442\u044e\u043d\u044f\u043d) (born 14 December 1973) is the current Prime Minister of the Nagorno-Karabakh Republic. He was suggested by the President of Nagorno-Karabakh Bako Sahakyan and was unanimously approved by the Parliament of Karabakh on 14 September 2007 by 32 out of 32 present parliamentarians.", "category": "OfficeHolder"}
{"text": " Shelton Hank Williams also known as Hank Williams III and Hank 3 (born December 12 1972) is an American musician singer and multi-instrumentalist including guitar bass drums banjo and vocals. In addition to his honky tonk recordings Williams' style alternates between country punk and metal.", "category": "Artist"}
{"text": " Helicella orzai is a species of air-breathing land snails terrestrial pulmonate gastropod mollusks in the family Hygromiidae the hairy snails and their allies. This species is endemic to Spain.", "category": "Animal"}
{"text": " Gro\u00dfe Schmalenau is a river of North Rhine-Westphalia Germany.", "category": "NaturalPlace"}
{"text": " The Tupolev ANT-29 (military designation DIP \u2013 Dvukhmotorny istrebitel pushechny twin-engined cannon fighter) was a 1930s twin-engined cannon-armed fighter designed by Alexander Arkhangelsky and built by Tupolev.Design work started in 1932 on a twin-engined aircraft capable of carrying two APK-100 cannons. The resulting design was the ANT-29 and it first flew in February 1935. A monoplane with a tall and narrow fuselage and powered by two Hispano-Suiza 12Ybrs engines.", "category": "MeanOfTransportation"}
{"text": " Charles Corm (1894-1963) was a Lebanese writer businessman and philanthropist. He is considered to be the leader of the Phoenicianism movement in Lebanon which ignited a surge of nationalism that led to Lebanon's independence. In a country torn by sectarian conflicts Corm's intention was to find a common root shared by all Lebanese beyond their religious beliefs (the Phoenicians were pagans).", "category": "Artist"}
{"text": " Joseph Hubert Ruetz (October 21 1916 \u2013 January 2 2003) was a professional football player in the All-America Football Conference for the Chicago Rockets in 1946 and 1948. Prior to that he played at the collegiate level while attending the University of Notre Dame. He played guard for the Irish with the exception of playing one season at quarterback. In 1938 he graduated from Notre Dame with cum laude honors.", "category": "Athlete"}
{"text": " The Reef House is a historic house located at 411 S. Poplar St. in Carbondale Illinois. William A. Reef built the house for his family circa 1892. The Queen Anne-style cottage may have been designed by local carpenter A. M. Etherton though records of its designer do not exist. The house features fishscale shingle siding on its second floor and clapboard siding on its first; the clapboard siding is adorned with stickwork.", "category": "Building"}
{"text": " MAKO Surgical Corp. (Stryker Medical) is a publicly traded medical device company based in Florida. On September 25 2013 the Board of Directors of Mako Surgical accepted a deal to merge with Stryker Medical for $1.65B subject to shareholder approval.", "category": "Company"}
{"text": " Pop Carn is a 2003 Indian Tamil film written and directed by actor-cum-director Nassar and starring Mohanlal and Simran Bagga in lead roles and introducing newcomers Kunal Shah and Jyothi Nawal. The film which had music scored by Yuvan Shankar Raja was released on 30 January 2003 but flopped at the box office. Nonetheless the film was dubbed into Malayalam and released in 2007 under the same name.", "category": "Film"}
{"text": " USNS Mount Baker (T-AE-34) is the seventh of eight Kilauea-class ammunition ships to serve with the Military Sealift Command. She is the second U.S. Navy ship to bear the name and is named for Mount Baker a 10781-foot volcano in the Cascade Range of Washington. Ammunition ships operated by Military Sealift Command provide logistic support to US Navy ships at sea.Mount Baker was built by Ingalls Shipbuilding Pascagoula Mississippi.", "category": "MeanOfTransportation"}
{"text": " Dansere is an album by Jan Garbarek. The album was recorded in November 1975 and features the Bobo Stenson Quartet.", "category": "Album"}
{"text": " Divraz (Persian: \u062f\u064a\u0648\u0631\u0632\u200e also Romanized as D\u012bvraz) is a village in Bala Khiyaban-e Litkuh Rural District in the Central District of Amol County Mazandaran Province Iran. At the 2006 census its population was 393 in 95 families.", "category": "Village"}
{"text": " The D\u0103ih\u0103\u021ba\u0219u River is a tributary of the Dumbr\u0103vanu River in Romania.", "category": "NaturalPlace"}
{"text": " Zeisters also known as Fat Guy Goes Nutzoid is a 1986 comedy film produced by Troma Entertainment. Troma was originally set to title the film Fat Boy Goes Nutzoid but at the request of the lawyers of the hip-hop group The Fat Boys it was changed to Fat Guy.", "category": "Film"}
{"text": " Paul Gobeil (born March 1 1942 in Saint-R\u00e9mi-de-Tingwick Quebec) is a former Canadian politician and businessman.From 1985 to 1989 Mr. Gobeil was a Liberal member of the National Assembly for the riding of Verdun and served as Minister assigned to Administration President of the Treasury Board and as Minister of International Affairs for the Government of Quebec.", "category": "OfficeHolder"}
{"text": " Ruff Ryders: Past Present Future is the fifth compilation album from American hip hop record label Ruff Ryders Entertainment released on November 21 2011.", "category": "Album"}
{"text": " Ridi Viharaya (Sinhala: \u0dbb\u0dd2\u0daf\u0dd3 \u0dc0\u0dd2\u0dc4\u0dcf\u0dbb\u0dba) or Silver Temple is a 2nd-century BCE Theravada Buddhist temple in the village of Ridigama Sri Lanka. Built during the reign of Dutthagamani of Anuradhapura the temple is considered as the place where the silver ore which provided silver to complete Ruwanwelisaya; one of the largest stupa in Sri Lanka was discovered.", "category": "Building"}
{"text": " Grand Canyon Preparatory Academy is a public charter college preparatory school in Tempe Arizona.", "category": "EducationalInstitution"}
{"text": " Aricoceras is an extinct genus of the Adrianitidae family. They are an extinct group of ammonoid which are shelled cephalopods related to squids belemnites octopuses and cuttlefish and more distantly to the nautiloids.", "category": "Animal"}
{"text": " Blackburn High School is a public secondary school for girls and boys in years 7 to 12 in Blackburn a suburb of Melbourne Victoria Australia. Blackburn High School is an outstanding secondary school for aspiring young men and women. It aims to educate tomorrow's minds today. Started in 1956 the school has a proud tradition of academic excellence and exceptional music achievement.The school is nationally recognised as a leading educational institution for music education.", "category": "EducationalInstitution"}
{"text": " Chris Nieratko (born February 19 1976) is an American humorist and author. Nieratko is a past editor of Big Brother Magazine and currently reviews pornographic films for Vice magazine as well as being the author of the related Skinema book. He also appeared on MTV's Jackass.", "category": "Artist"}
{"text": " Warlock is a 1959 film released by Twentieth Century Fox and shot in DeLuxe Color and CinemaScope. It is a Western adapted from the novel by Oakley Hall (screenplay written by Robert Alan Aurthur).", "category": "Film"}
{"text": " Sieniawa [\u0255e\u02c8\u0272ava] is a village in the administrative district of Gmina Raba Wy\u017cna within Nowy Targ County Lesser Poland Voivodeship in southern Poland. It lies approximately 5 kilometres (3 mi) south-east of Raba Wy\u017cna 10 km (6 mi) north-west of Nowy Targ and 59 km (37 mi) south of the regional capital Krak\u00f3w.The village has a population of 1900.", "category": "Village"}
{"text": " Michael Adam (born 9 December 1984) is a German politician. He has been District Administrator (Landrat) of Regen since 2011.", "category": "OfficeHolder"}
{"text": " Thunderbird High School is a public high school located in northwestern Phoenix Arizona. The school is a part of the Glendale Union High School District.", "category": "EducationalInstitution"}
{"text": " Nayef Al Khater (born May 10 1978) is a Qatari football player. He currently plays for Al Wakrah as a defender.", "category": "Athlete"}
{"text": " Black Cobra Woman (Italian: Eva nera) also known as Black Cobra is an Italian 1976 exploitation movie written and directed by Joe D'Amato.", "category": "Film"}
{"text": " Joe Cuba a.k.a Sonny (April 22 1931 \u2013 February 15 2009) was a musician of Puerto Rican descent who was known as the Father of Latin Boogaloo.", "category": "Artist"}
{"text": " Jacob LeBlanc (born February 2 1981 in Auburn California) is an American retired professional soccer player.", "category": "Athlete"}
{"text": " Kevin B. Kamenetz is the 12th and current County Executive of Baltimore County Maryland serving since 2010. He is a member of the Democratic Party. He previously served as a four-term County Councilman representing the Second District of Baltimore County.", "category": "OfficeHolder"}
{"text": " Thomas Frederick Fred Peart Baron Peart PC (30 April 1914 \u2013 26 August 1988) was a British Labour politician who served in the Labour governments of the 1960s and 1970s and was a candidate for Deputy Leader of the Party.", "category": "OfficeHolder"}
{"text": " Grand Lake is a large lake in the interior of Newfoundland of the Canadian province of Newfoundland and Labrador. It has an area of 534 km\u00b2 making it the largest lake on Newfoundland. Consequently it is one of if not the deepest.", "category": "NaturalPlace"}
{"text": " A Colossal Failure of Common Sense: The Inside Story of the Collapse of Lehman Brothers is a 2009 non-fiction book written by Lawrence G. McDonald and Patrick Robinson which chronicles the events surrounding the bankruptcy of Lehman Brothers in the context of the financial crisis of 2007\u20132010 and the subprime mortgage crisis.", "category": "WrittenWork"}
{"text": " Thatching (31 May 1975 \u2013 1999) was an Irish Thoroughbred racehorse and sire. The horse's early career was delayed and disrupted by injury and he did not show his best form until switched to sprinting distances in the spring of 1979 when he won the Duke of York Stakes. He improved further when equipped with blinkers that summer recording impressive victories in both the Cork and Orrery Stakes and the July Cup.", "category": "Animal"}
{"text": " Ya\u015far Kurt (Armenian: \u0545\u0561\u0577\u0561\u0580 \u053f\u0578\u0582\u0580\u0569 b.August 16 1968 in Istanbul Turkey) is a Turkish-Armenian rock artist.", "category": "Artist"}
{"text": " Then and Now is a historical novel by W. Somerset Maugham. Set in Florence Italy during the Renaissance the story focuses on three months in the life of Niccolo Machiavelli the Florentine politician diplomat philosopher and writer in the early years of the 16th century. The book was first published by Heinemann in 1946.", "category": "WrittenWork"}
{"text": " Abdollah Masud-e Sofla (Persian: \u0639\u0628\u062f\u0627\u0644\u0647 \u0645\u0633\u0639\u0648\u062f\u0633\u0641\u0644\u064a\u200e also Romanized as \u2018Abdoll\u0101h Mas\u2018\u016bd-e Sofl\u00e1; also known as Abdollah Mas\u2019ood \u2018Abdoll\u0101h Mas\u2018\u016bd and Abdull\u0101h Mas\u016bd) is a village in Hesar-e Valiyeasr Rural District Avaj District Buin Zahra County Qazvin Province Iran. At the 2006 census its population was 72 in 15 families.", "category": "Village"}
{"text": " Springhill High School (SHS) is a secondary school in Springhill Nova Scotia Canada. SHS is part of the Chignecto-Central Regional School Board and is the only high school in the town of Springhill. The school is home to many sports teams and clubs. These include: basketball soccer badminton track and field softball Students Against Destructive Decisions drama club homework club and book club.", "category": "EducationalInstitution"}
{"text": " Charniele L. Herring (/\u0283\u0251r\u02c8n\u025bl \u02c8h\u025br\u026a\u014b/ shar-NEL HERR-ing; born September 25 1969) is an American politician. She has served in the Virginia House of Delegates since 2009 representing the 46th district made up the city of Alexandria and part of Fairfax County near Washington D.C. Herring is a member of the Democratic Party. She has been the House minority whip since 2012 and in December 2012 she was the first African-American to be elected chair of the Democratic Party of Virginia.", "category": "OfficeHolder"}
{"text": " Symmoca dodecatella is a moth of the Symmocidae family. It is found in Portugal and Spain.The wingspan is about 18\u201319 mm. The forewings are grey sprinkled with black mainly along the margin. The hindwings are grey.", "category": "Animal"}
{"text": " Ali Abbasov Mammad oglu (Azerbaijani: \u018fli Abbasov M\u0259mm\u0259d o\u011flu) (born 1953 in Azerbaijan) is the current Minister of Communications and Information Technologies of the Republic of Azerbaijan.", "category": "OfficeHolder"}
{"text": " Worlds Beyond was an American digest magazine of science fiction and fantasy fiction in 1950 and 1951. The magazine only issued three monthly issues from December 1950 to February 1951 but is notable for having printed stories by Cyril M. Kornbluth Jack Vance Mack Reynolds Graham Greene John Christopher Lester del Rey Judith Merril and others.Worlds Beyond was published by Hillman Periodicals and was edited by Damon Knight.", "category": "Company"}
{"text": " The Daily News Journal commonly abbreviated to DNJ is a newspaper serving Murfreesboro Tennessee Rutherford County and surrounding communities. Published in Murfreesboro it serves as the primary local newspaper with competition from The Murfreesboro Post and other publications. The newspaper is not in competition with The Tennessean of Nashville as both are owned by Gannett Company.The roots the DNJ date back to 1849 and the founding of Murfreesboro News.", "category": "WrittenWork"}
{"text": " Echinocereus fendleri is a species of cactus known by the common names pinkflower hedgehog cactus and Fendler's hedgehog cactus. It grows in deserts and woodlands in the Southwestern United States and Northeastern Mexico. It is most common in New Mexico.The taxonomy of the species is uncertain with authors recognizing up to eight varieties.", "category": "Plant"}
{"text": " Michael F. Kitt Snr (13 September 1914 \u2013 24 December 1974) was an Irish Fianna F\u00e1il politician and long-serving Teachta D\u00e1la (TD).He was elected to D\u00e1il \u00c9ireann for the first time at the 1948 general election for the Galway North constituency but lost his seat at the 1951 general election and failed to be elected again at the 1954 general election.", "category": "OfficeHolder"}
{"text": " Epidendrum mancum is an epiphytic orchid that grows in the tropical low elfin cloud forests of Ecuador and Amazonas Peru at altitudes of 2\u20143 km .", "category": "Plant"}
{"text": " Salempur Masanda is a village in Jalandhar District near the Jalandhar Cantonment in Punjab India.", "category": "Village"}
{"text": " Yaleh Gonbad (Persian: \u064a\u0644\u0647 \u06af\u0646\u0628\u062f\u200e; also known as Em\u0101mz\u0101deh Imamzade-Ele-Geumbez and Im\u0101mz\u0101deh) is a village in Ilat-e Qaqazan-e Gharbi Rural District Kuhin District Qazvin County Qazvin Province Iran. At the 2006 census its population was 429 in 105 families.", "category": "Village"}
{"text": " Popeyes Louisiana Kitchen is an American chain of fried chicken fast food restaurants founded in 1972 in New Orleans Louisiana. Often referred to as Popeyes and sometimes as Popeyes Chicken & Biscuits or Popeyes Chicken & Seafood[citation needed] it was acquired by Sandy Springs Georgia-based AFC Enterprises originally America's Favorite Chicken Company in 1993.", "category": "Company"}
{"text": " The White Umfolozi River originates just south of Vryheid KwaZulu-Natal South Africa and joins the Black Umfolozi River at 28\u00b020\u203258\u2033S 31\u00b058\u203246\u2033E to form the Umfolozi River before it flows east towards the Indian Ocean.", "category": "NaturalPlace"}
{"text": " The Albatros L 74 was a two-seated German training biplane produced by Albatros Flugzeugwerke. Only two were produced.", "category": "MeanOfTransportation"}
{"text": " The University of Nevada School of Medicine is an academic division of the University of Nevada Reno and grants the Doctor of Medicine (MD) degree. The School of Medicine was founded in 1969 as the first medical school in the state of Nevada. More than 1500 MDs have graduated from the School of Medicine. The pre-clinical campus is located in Reno but the third and fourth years can be spent in hospitals and clinics throughout Nevada.", "category": "EducationalInstitution"}
{"text": " Leon Kroll (December 6 1884 \u2013 October 25 1974) was an American painter and lithographer. Known as a figurative artist Life Magazine described him as the dean of U.S. nude painters yet he was an exceptional landscape painter and also produced an exceptional body of still life compositions.Born into a musical family on lower Second Avenue in New York City Kroll's father was a violinist and his cousin was William Kroll.", "category": "Artist"}
{"text": " Michael E. DeBakey High School for Health Professions at Qatar (DHSHP@Q in short) is a private international middle and secondary school in Doha Qatar. The school is a branch campus of Michael E. DeBakey High School for Health Professions of Houston Texas United States. Charlesetta Deason is the CEO and President.Named after Michael E. DeBakey the school opened in September 2008 with grades 8 through 10 with 100 students per grade; the school will ultimately cover grades 7-12.", "category": "EducationalInstitution"}
{"text": " The Richleighs of Tantamount is a children\u2019s historical novel written by British author Barbara Willard. It was originally published in the United Kingdom in 1966 by the publishers Constable before being published in the United States by Harcourt Brace & World in June 1967. C. Walter Hodges drew the line illustrations and painted the cover portrait for the original edition.", "category": "WrittenWork"}
{"text": " Ennea is a genus of air-breathing land snails terrestrial pulmonate gastropod mollusks in the family Streptaxidae.Ennea is the type genus of the subfamily Enneinae.", "category": "Animal"}
{"text": " Come Live with Me is a 1941 American romantic comedy film produced and directed by Clarence Brown and starring James Stewart and Hedy Lamarr. Based on a story by Virginia Van Upp the film is about a beautiful Viennese refugee seeking United States citizenship who arranges a marriage of convenience with a struggling writer.", "category": "Film"}
{"text": " St. Thomas Episcopal Church is a parish church in the Episcopal Diocese of Iowa. The church is located in Sioux City Iowa United States at 1200 Douglas Street. The church building is listed on the National Register of Historic Places.", "category": "Building"}
{"text": " Nuno Daniel Costeira Valente (born 22 November 1991 in Ada\u00fafe - Braga ) is a Portuguese footballer who plays for Vizela on loan from S.C. Braga as a midfielder.", "category": "Athlete"}
{"text": " Jaagoo (Bengali: \u099c\u09be\u0997\u09cb) is a Bangladeshi sports based romantic Movies. Its writer and director Khijir Hayat Khan. Adnan Karim sponsored youth football game built this film was released in 2010. The film is produced by Sharjeel Karim and Adnan Karim and directed by Khijir Hayat Khan who has written the story screenplay and dialogues of the film. The film features Ferdous Ahmed and Afsana Ara Bindu in lead roles and with supporting Arefin Shuvo Tariq Anam Ronok Hasaan and many more.", "category": "Film"}
{"text": " John Edward Hatton AO (born 29 May 1933) is former Australian politician and an National Trust of Australia nominated Australian Living Treasure. He was the independent member of the Legislative Assembly of the New South Wales parliament for the seat of South Coast from 1973 to 1995. Notably the allegations about police corruption Hatton raised in Parliament resulted in the Wood Royal Commission. He is currently a social activist in his local community.", "category": "OfficeHolder"}
{"text": " Trichoptilus subtilis is a moth of the Pterophoridae family that is known from South Africa.", "category": "Animal"}
{"text": " Sin\u00e9ad Madden (born in Galway Ireland) is an Irish singer-songwriter and fiddle player best known as a member of the Moya Brennan band. She also teaches at Waltons New School of Music in Dublin.", "category": "Artist"}
{"text": " Philip Sprint is a German footballer who currently plays for Hertha BSC.Sprint made his professional debut for Hertha BSC on 12 August 2012 in a 2. Bundesliga match against FSV Frankfurt coming on in the 50th minute for Marvin Knoll after starting goalkeeper Sascha Burchert had been sent off.", "category": "Athlete"}
{"text": " River Roads Mall was an enclosed shopping mall located in the city of Jennings a suburb of St. Louis Missouri United States. Opened in 1962 as one of the nation's first shopping malls the mall declined in the 1990s becoming a dead mall and eventually being shuttered in 1995. Demolition of the long-vacant mall began in 2006.", "category": "Building"}
{"text": " The Brown-patched Kangaroo lizard (Otocryptis wiegmanni) also called Wiegmann's Agama or Sri Lankan Kangaroo Lizard is a small ground dwelling agamid lizard endemic to the wet zone forests and lower mountain forests (up to 1300 metres) of Sri Lanka. It is commonly seen in the leaf litter of shady rain forests. When perceiving danger it spurts away quickly on its large hind legs and might eventually climb up a sapling or tree.", "category": "Animal"}
{"text": " Shiho Kawaragi (\u6cb3\u539f\u6728 \u5fd7\u7a42 Kawaragi Shiho born April 29 1976 in Tokyo) is a Japanese voice actress who works for Kenyu-Office. When voicing adult games and hentai OVAs she is also known as Kaname Yuzuki (\u67da\u6728\u304b\u306a\u3081 Yuzuki Kaname) She is currently married since March 2012.", "category": "Artist"}
{"text": " Down in the Shacks Where the Satellite Dishes Grow is the second album by the Judybats released in 1992.", "category": "Album"}
{"text": " Turn of Faith is a 2001 film directed by Charles Jarrott. It stars Ray Mancini and Mia Sara.", "category": "Film"}
{"text": " Frederick William Seward (July 8 1830 \u2013 April 25 1915) was the Assistant Secretary of State during the American Civil War serving in Abraham Lincoln's administration as well as under Andrew Johnson during Reconstruction and for over two years under Rutherford B. Hayes.", "category": "OfficeHolder"}
{"text": " Ivoprop Corporation founded in 1984 by Ivo Zdarsky is an American manufacturer of composite propellers for homebuilt and ultralight aircraft as well as airboats. The company headquarters is located in Bellflower California.Zdarsky started the company after carving his own propeller for a homebuilt ultralight trike that he flew from Cold War Czechoslovakia over the Iron Curtain to Vienna in 1984.", "category": "Company"}
{"text": " Wave Broadband is a provider of residential business and enterprise class cable TV broadband internet and telephone services on the West Coast currently serving about 400000 customers within communities in western Washington state Oregon Sacramento California and the San Francisco Bay Area. Wave Broadband provides services via their fiber-optic network and uses Northwest Open Access Network as the backbone for most of their service areas in Washington.", "category": "Company"}
{"text": " Andie Tong is a comic book artist known for his work on books such as Spectacular Spider-Man UK The Batman Strikes! and Tangent: Superman's Reign.", "category": "Artist"}
{"text": " Merdani is a village in the municipality of Busova\u010da Bosnia and Herzegovina.", "category": "Village"}
{"text": " Kamam (Persian: \u0643\u0627\u0645\u0645\u200e also Romanized as K\u0101mam) is a village in Mangur-e Sharqi Rural District Khalifan District Mahabad County West Azerbaijan Province Iran. At the 2006 census its population was 98 in 16 families.", "category": "Village"}
{"text": " Ficus greiffiana is a species of plant in the Moraceae family. It is found in Argentina Brazil Colombia and Guyana.", "category": "Plant"}
{"text": " Toni Amboaje is a Spanish singer who currently works for metal band Sauze which formed on early 2008.", "category": "Artist"}
{"text": " Mount Whittier is a mountain in Carroll County New Hampshire in the northern Ossipee Mountains. Named after John Greenleaf Whittier the peak is not to be confused with nearby Nickerson Mountain which was once known as Mount Whittier.There are no hiking trails on Mount Whittier. There was once a CCC alpine ski trail on the northern face.", "category": "NaturalPlace"}
{"text": " El Rompe Discoteka: The Mix Album is a 2007 album by Hector El Father.", "category": "Album"}
{"text": " e-Spirit is a commercial software company that develops and markets the FirstSpirit CMS Web content management system. The company was founded in 1999 in Dortmund Germany and established a US presence in 2011. The company's FirstSpirit CMS is a Java-based offering now in its fifth major release.[citation needed]", "category": "Company"}
{"text": " The Valley is the first novel by Barry Pilton published in 2005 by Bloomsbury. It is a humorous account of the effect of outsiders on the rural status quo in a fictional mid-Wales valley during the 1980s and is being adapted for television.", "category": "WrittenWork"}
{"text": " Sema Group plc was an Anglo-French IT services company. It was listed on the London Stock Exchange and was a constituent of the FTSE 100 Index but was acquired by Schlumberger in 2001.", "category": "Company"}
{"text": " Bent Hansen (born 1954) is a retired Danish ice hockey forward. He played for 18 years in Denmark for the R\u00f8dovre SIK and KSF. He also competed for the Danish national team. His son Jannik Hansen also played for the R\u00f8dovre team and was drafted into the NHL by the Vancouver Canucks in 2004. During his hockey career Hansen also worked as a carpenter.", "category": "Athlete"}
{"text": " Behind the Sun is a 2004 album by Dive.", "category": "Album"}
{"text": " Mungaru Male (English: Pre Monsoon Rain) is a 2006 Kannada language movie directed by Yograj Bhat and produced by E Krishnappa. The film stars Ganesh Pooja Gandhi Anant Nag Padmaja Rao in lead roles.", "category": "Film"}
{"text": " Megachile perihirta commonly known as the Western leafcutting bee is a bee in the genus Megachile. The bee is native to western North America ranging from Nebraska to Texas and Mexico west to California and north to British Columbia and Alberta and often inhabits meadows and orchards. The bee is black with long whitish-yellow hair more so below the thorax and abdomen. The abdomen however is mostly bare although each segment has scattered whitish hair.", "category": "Animal"}
{"text": " Sukeban Deka The Movie (\u30b9\u30b1\u30d0\u30f3\u5211\u4e8b) is a live action Japanese film that was released in 1987. The movie closely follows a TV and manga series Sukeban Deka written and illustrated by Shinji Wada. The movie stars Yoko Minamino and Yui Asaka who were also in the TV series. The movie was followed by Sukeban Deka II in 1988.", "category": "Film"}
{"text": " The Maple School District is a public school district in Douglas County Wisconsin United States based in Maple Wisconsin.", "category": "EducationalInstitution"}
{"text": " Mount Waverley Secondary College is a public secondary school located in the Melbourne suburb of Mount Waverley. The school consists of roughly 1900 students and is one of the largest in the state.The school consists of two campuses (Junior & Senior) both situated on Stephensons Road in Mount Waverley. The Junior site holds years 7 and 8 with year levels 9 to 12 at the Senior Campus. The campuses are a short walking distance apart.", "category": "EducationalInstitution"}
{"text": " Jon-Paul Roger JP Pietersen (born 12 July 1986 in Stellenbosch South Africa) is a South African rugby union footballer. He generally plays fullback or wing for the Sharks (in the Super Rugby competition) and the Natal Sharks in the Currie Cup. He played in more than 50 tests for the Springboks.", "category": "Athlete"}
{"text": " Deltocolpodes is a genus of beetles in the family Carabidae containing the following species: Deltocolpodes brendelli Morvan 1992 Deltocolpodes championi Morvan 1992 Deltocolpodes duluchus Morvan 1992 Deltocolpodes heinigeri Morvan 1992 Deltocolpodes jalepensis Morvan 1992 Deltocolpodes kirschenhoferi Morvan 1992 Deltocolpodes nepalensis Morvan 1992 Deltocolpodes perreaui Deuve 1985 Deltocolpodes rectangulus Morvan 1992 Deltocolpodes rolex Morvan 1992 Deltocolpodes salpensis Deuve 1985 Deltocolpodes sikkimensis Morvan 1992\u2191", "category": "Animal"}
{"text": " Stanhopea martiana is a species of orchid endemic to southwestern Mexico.", "category": "Plant"}
{"text": " Yawarmayu (Quechua yawar blood mayu river blood river hispanicized spelling Yahuarmayo) is a river in Peru located in the Puno Region Carabaya Province Ayapata District. It originates near the border of the districts Ayapata and Coasa. Its direction is mainly to the northwest where it meets Inambari River as a right affluent. The confluence is north of the village Yawarmayu (Yahuarmayo).", "category": "NaturalPlace"}
{"text": " The Charles Granke House at 406 S. Seventh St. in Hamilton Montana is a historic house that was built in 1906. It includes Colonial Revival and Queen Anne architecture. It was listed on the National Register of Historic Places in 1988. The listing included two contributing buildings.It was built in approximately 1906 by the Anaconda Copper Mining Company as a worker cottage for workers at the sawmill that operated in Hamilton until 1915. Charles W.", "category": "Building"}
{"text": " Passiflora monadelpha is a species of plant in the Passifloraceae family. It is endemic to Ecuador.", "category": "Plant"}
{"text": " Mangifera persiciformis or Peach Mango is a species of plant in the Anacardiaceae family. It is endemic to China.", "category": "Plant"}

@ -0,0 +1,150 @@
import argparse
import openai


def create_context(
    question, search_file_id, max_len=1800, search_model="ada", max_rerank=10
):
    """
    Create a context for a question by finding the most similar context from the search file.
    :param question: The question
    :param search_file_id: The file id of the search file
    :param max_len: The maximum length of the returned context (in tokens)
    :param search_model: The search model to use
    :param max_rerank: The maximum number of documents to rerank
    :return: The context
    """
    results = openai.Engine(search_model).search(
        search_model=search_model,
        query=question,
        max_rerank=max_rerank,
        file=search_file_id,
        return_metadata=True,
    )
    returns = []
    cur_len = 0
    for result in results["data"]:
        # "metadata" is expected to hold the passage's token count; the extra 4
        # presumably budgets for the separator joined between passages.
        cur_len += int(result["metadata"]) + 4
        if cur_len > max_len:
            break
        returns.append(result["text"])
    return "\n\n###\n\n".join(returns)


def answer_question(
    search_file_id="<SEARCH_FILE_ID>",
    fine_tuned_qa_model="<FT_QA_MODEL_ID>",
    question="Which country won the European Football championship in 2021?",
    max_len=1800,
    search_model="ada",
    max_rerank=10,
    debug=False,
    stop_sequence=["\n", "."],
    max_tokens=100,
):
    """
    Answer a question based on the most similar context from the search file, using your fine-tuned model.
    :param question: The question
    :param fine_tuned_qa_model: The fine-tuned QA model
    :param search_file_id: The file id of the search file
    :param max_len: The maximum length of the returned context (in tokens)
    :param search_model: The search model to use
    :param max_rerank: The maximum number of documents to rerank
    :param debug: Whether to output debug information
    :param stop_sequence: The stop sequences for the Q&A model
    :param max_tokens: The maximum number of tokens to return
    :return: The answer
    """
    context = create_context(
        question,
        search_file_id,
        max_len=max_len,
        search_model=search_model,
        max_rerank=max_rerank,
    )
    if debug:
        print("Context:\n" + context)
        print("\n\n")
    try:
        # Fine-tuned models require the model parameter, whereas other models require the engine parameter
        model_param = (
            {"model": fine_tuned_qa_model}
            if ":" in fine_tuned_qa_model
            and fine_tuned_qa_model.split(":")[1].startswith("ft")
            else {"engine": fine_tuned_qa_model}
        )
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below\n\nText: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            **model_param,
        )
        return response["choices"][0]["text"]
    except Exception as e:
        # Best-effort behaviour: report the error and fall back to an empty answer
        print(e)
        return ""


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Rudimentary functionality of the answers endpoint with a fine-tuned Q&A model.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--search_file_id", help="Search file id", required=True, type=str
    )
    parser.add_argument(
        "--fine_tuned_qa_model", help="Fine-tuned QA model id", required=True, type=str
    )
    parser.add_argument(
        "--question", help="Question to answer", required=True, type=str
    )
    parser.add_argument(
        "--max_len",
        help="Maximum length of the returned context (in tokens)",
        default=1800,
        type=int,
    )
    parser.add_argument(
        "--search_model", help="Search model to use", default="ada", type=str
    )
    parser.add_argument(
        "--max_rerank",
        help="Maximum number of reranking for the search",
        default=10,
        type=int,
    )
    parser.add_argument(
        "--debug", help="Print debug information (context used)", action="store_true"
    )
    parser.add_argument(
        "--stop_sequence",
        help="Stop sequences for the Q&A model",
        default=["\n", "."],
        nargs="+",
        type=str,
    )
    parser.add_argument(
        "--max_tokens",
        help="Maximum number of tokens to return",
        default=100,
        type=int,
    )
    args = parser.parse_args()
    response = answer_question(
        search_file_id=args.search_file_id,
        fine_tuned_qa_model=args.fine_tuned_qa_model,
        question=args.question,
        max_len=args.max_len,
        search_model=args.search_model,
        max_rerank=args.max_rerank,
        debug=args.debug,
        stop_sequence=args.stop_sequence,
        max_tokens=args.max_tokens,
    )
    print(f"Answer:{response}")

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -0,0 +1,637 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Train a fine-tuning model specialized for Q&A\n",
"This notebook will utilize the dataset of context, question and answer pairs to additionally create adversarial questions and context pairs, where the question was not generated on that context. In those cases the model will be prompted to answer \"No sufficient context for answering the question\". We will also train a discriminator model, which predicts whether the question can be answered based on the context or not.\n",
"\n",
"We will add hard adversarial examples as well, which will be based either on semantically similar sections, or neighbouring sections, originating from the same article."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>heading</th>\n",
" <th>content</th>\n",
" <th>tokens</th>\n",
" <th>context</th>\n",
" <th>questions</th>\n",
" <th>answers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2020 Summer Olympics</td>\n",
" <td>Summary</td>\n",
" <td>The 2020 Summer Olympics (Japanese: 2020年夏季オリン...</td>\n",
" <td>713</td>\n",
" <td>2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ...</td>\n",
" <td>1. What is the 2020 Summer Olympics?\\n2. When ...</td>\n",
" <td>1. The 2020 Summer Olympics is an internationa...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2020 Summer Olympics</td>\n",
" <td>Host city selection</td>\n",
" <td>The International Olympic Committee (IOC) vote...</td>\n",
" <td>126</td>\n",
" <td>2020 Summer Olympics\\nHost city selection\\n\\nT...</td>\n",
" <td>1. \\n2. \\n3. \\n4.</td>\n",
" <td>1. What is the International Olympic Committee...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2020 Summer Olympics</td>\n",
" <td>Impact of the COVID-19 pandemic</td>\n",
" <td>In January 2020, concerns were raised about th...</td>\n",
" <td>369</td>\n",
" <td>2020 Summer Olympics\\nImpact of the COVID-19 p...</td>\n",
" <td>1. What was the COVID-19 pandemic?\\n2. How did...</td>\n",
" <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2020 Summer Olympics</td>\n",
" <td>Qualifying event cancellation and postponement</td>\n",
" <td>Concerns about the pandemic began to affect qu...</td>\n",
" <td>298</td>\n",
" <td>2020 Summer Olympics\\nQualifying event cancell...</td>\n",
" <td>1. What was the original location of the Asia ...</td>\n",
" <td>1. The original location of the Asia &amp; Oceania...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2020 Summer Olympics</td>\n",
" <td>Effect on doping tests</td>\n",
" <td>Mandatory doping tests were being severely res...</td>\n",
" <td>163</td>\n",
" <td>2020 Summer Olympics\\nEffect on doping tests\\n...</td>\n",
" <td>1. What was the COVID-19 pandemic?\\n2. What di...</td>\n",
" <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title heading \\\n",
"0 2020 Summer Olympics Summary \n",
"1 2020 Summer Olympics Host city selection \n",
"2 2020 Summer Olympics Impact of the COVID-19 pandemic \n",
"3 2020 Summer Olympics Qualifying event cancellation and postponement \n",
"4 2020 Summer Olympics Effect on doping tests \n",
"\n",
" content tokens \\\n",
"0 The 2020 Summer Olympics (Japanese: 2020年夏季オリン... 713 \n",
"1 The International Olympic Committee (IOC) vote... 126 \n",
"2 In January 2020, concerns were raised about th... 369 \n",
"3 Concerns about the pandemic began to affect qu... 298 \n",
"4 Mandatory doping tests were being severely res... 163 \n",
"\n",
" context \\\n",
"0 2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ... \n",
"1 2020 Summer Olympics\\nHost city selection\\n\\nT... \n",
"2 2020 Summer Olympics\\nImpact of the COVID-19 p... \n",
"3 2020 Summer Olympics\\nQualifying event cancell... \n",
"4 2020 Summer Olympics\\nEffect on doping tests\\n... \n",
"\n",
" questions \\\n",
"0 1. What is the 2020 Summer Olympics?\\n2. When ... \n",
"1 1. \\n2. \\n3. \\n4. \n",
"2 1. What was the COVID-19 pandemic?\\n2. How did... \n",
"3 1. What was the original location of the Asia ... \n",
"4 1. What was the COVID-19 pandemic?\\n2. What di... \n",
"\n",
" answers \n",
"0 1. The 2020 Summer Olympics is an internationa... \n",
"1 1. What is the International Olympic Committee... \n",
"2 1. The COVID-19 pandemic was a pandemic that o... \n",
"3 1. The original location of the Asia & Oceania... \n",
"4 1. The COVID-19 pandemic was a pandemic that o... "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import openai\n",
"import pandas as pd\n",
"df = pd.read_csv('olympics-data/olympics_qa.csv')\n",
"olympics_search_fileid = \"file-c3shd8wqF3vSCKaukW4Jr1TT\"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the sections into a training and testing set"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3014, 754)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
"len(train_df), len(test_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we check that he separator we intend to use isn't present within the contexts"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.context.str.contains('->').sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Create the fine-tuning datasets for Q&A and discriminator models\n",
"The fine-tuning dataset is created in the following way. For every corresponding question, answer and context pair we create:\n",
"- Positive example: correct question, answer, context pair\n",
"- Negative examples:\n",
" - random negative example, where the random context is paired with the question \n",
" - two hard negative examples\n",
" - one originating from the same wikipedia article\n",
" - another, which is most similar to the correct context\n",
"\n",
"This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the peformance too much.\n",
"\n",
"We apply the same process of dataset creation for both the discriminator, and the Q&A answering model. We apply the process separately for the training and testing set, to ensure that the examples from the traing set don't feature within the test set."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def get_random_similar_contexts(question, context, file_id=olympics_search_fileid, search_model='ada', max_rerank=10):\n",
" \"\"\"\n",
" Find similar contexts to the given context using the search file\n",
" \"\"\"\n",
" try:\n",
" results = openai.Engine(search_model).search(\n",
" search_model=search_model, \n",
" query=question, \n",
" max_rerank=max_rerank,\n",
" file=file_id\n",
" )\n",
" candidates = []\n",
" for result in results['data'][:3]:\n",
" if result['text'] == context:\n",
" continue\n",
" candidates.append(result['text'])\n",
" random_candidate = random.choice(candidates)\n",
" return random_candidate\n",
" except Exception as e:\n",
" print(e)\n",
" return \"\"\n",
"\n",
"def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):\n",
" \"\"\"\n",
" Create a dataset for fine tuning the OpenAI model; either for a discriminator model, \n",
" or a model specializing in Q&A, where it says if no relevant context is found.\n",
"\n",
" Parameters\n",
" ----------\n",
" df: pd.DataFrame\n",
" The dataframe containing the question, answer and context pairs\n",
" discriminator: bool\n",
" Whether to create a dataset for the discriminator\n",
" n_negative: int\n",
" The number of random negative samples to add (using a random context)\n",
" add_related: bool\n",
" Whether to add the related contexts to the correct context. These are hard negative examples\n",
"\n",
" Returns\n",
" -------\n",
" pd.DataFrame\n",
" The dataframe containing the prompts and completions, ready for fine-tuning\n",
" \"\"\"\n",
" rows = []\n",
" for i, row in df.iterrows():\n",
" for q, a in zip((\"1.\" + row.questions).split('\\n'), (\"1.\" + row.answers).split('\\n')):\n",
" if len(q) >10 and len(a) >10:\n",
" if discriminator:\n",
" rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" yes\"})\n",
" else:\n",
" rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" {a[2:].strip()}\"})\n",
"\n",
" for i, row in df.iterrows():\n",
" for q in (\"1.\" + row.questions).split('\\n'):\n",
" if len(q) >10:\n",
" for j in range(n_negative + (2 if add_related else 0)):\n",
" random_context = \"\"\n",
" if j == 0 and add_related:\n",
" # add the related contexts based on originating from the same wikipedia page\n",
" subset = df[(df.title == row.title) & (df.context != row.context)]\n",
" \n",
" if len(subset) < 1:\n",
" continue\n",
" random_context = subset.sample(1).iloc[0].context\n",
" if j == 1 and add_related:\n",
" # add the related contexts based on the most similar contexts according to the search\n",
" random_context = get_random_similar_contexts(q[2:].strip(), row.context, search_model='ada', max_rerank=10)\n",
" else:\n",
" while True:\n",
" # add random context, which isn't the correct context\n",
" random_context = df.sample(1).iloc[0].context\n",
" if random_context != row.context:\n",
" break\n",
" if discriminator:\n",
" rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" no\"})\n",
" else:\n",
" rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" No appropriate context found to answer the question.\"})\n",
"\n",
" return pd.DataFrame(rows) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We apply the same process of dataset creation for both the discriminator, and the Q&A answering model. We apply the process separately for the training and testing set, to ensure that the examples from the traing set don't feature within the test set."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": []
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for name, is_disc in [('discriminator', True), ('qa', False)]:\n",
" for train_test, dt in [('train', train_df), ('test', test_df)]:\n",
" ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)\n",
" ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We formatted the data according to the recommendations from the fine-tuning tool, which is available using\n",
"> openai tools fine_tunes.prepare_data -f qa_train.jsonl\n",
"\n",
"We highly recommend that you use this tool, which suggests improvements in your data formatting for fine-tuning.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Submit the datasets for fine-tuning"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": []
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"!openai api fine_tunes.create -t \"olympics-data/discriminator_train.jsonl\" -v \"olympics-data/discriminator_test.jsonl\" --batch_size 16 --compute_classification_metrics --classification_positive_class \" yes\" --model ada"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": []
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"!openai api fine_tunes.create -t \"olympics-data/qa_train.jsonl\" -v \"olympics-data/qa_test.jsonl\" --batch_size 16"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Using the fine-tuned models\n",
"\n",
"We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a `yes` vs `no` answer."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<OpenAIObject at 0x7fe812e602b0> JSON: {\n",
" \" no\": -10.819577,\n",
" \" yes\": -2.045765e-05\n",
" }]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_discriminator = \"curie:ft-openai-internal-2021-08-23-23-58-57\"\n",
"ft_qa = \"curie:ft-openai-internal-2021-08-23-17-54-10\"\n",
"\n",
"def apply_ft_discriminator(context, question, discriminator_model):\n",
" \"\"\"\n",
" Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.\n",
" \"\"\"\n",
" prompt = f\"{context}\\nQuestion: {question}\\n Related:\"\n",
" result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)\n",
" return result['choices'][0]['logprobs']['top_logprobs']\n",
"\n",
"apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
" 'What was the first human-made object in space?', ft_discriminator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the model can generalize well to different contexts and questions. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def apply_ft_qa_answer(context, question, answering_model):\n",
" \"\"\"\n",
" Apply the fine tuned discriminator to a question\n",
" \"\"\"\n",
" prompt = f\"{context}\\nQuestion: {question}\\nAnswer:\"\n",
" result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\\n'])\n",
" return result['choices'][0]['text']\n",
"\n",
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
" 'What was the first human-made object in space?', ft_qa)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the model can answer the question, when the context is appropriate."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' The Soviet Union was the first country to successfully launch a satellite into space'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
" 'What is impressive about the Soviet Union?', ft_qa)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' No appropriate context found to answer the question'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
" 'How many cars were produced in the Soviet Union in 1970?', ft_qa)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the model knows when to answer the question, and when to say that insufficient context is present to answer the question."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also combine a discriminator and a base model, or a fine-tuned Q&A model. Discriminator can essentially serve as a decision whether the question can be answered given the context or not."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Weather could cause a sport event to have no crowd'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):\n",
" logprobs = apply_ft_discriminator(context, question, discriminator_model)\n",
" yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100\n",
" no_logprob = logprobs[' no'] if ' no' in logprobs else -100\n",
" if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:\n",
" return \" No appropriate context found to answer the question based on the discriminator.\"\n",
" return apply_ft_qa_answer(context, question, answering_model)\n",
"answer_question_conditionally(ft_qa, ft_discriminator, \n",
" \"Crowdless games are a rare although not unheard-of occurrence in sports. \\\n",
" When they do occur, it is usually the result of events beyond the control \\\n",
" of the teams or fans, such as weather-related concerns, public health concerns, \\\n",
" or wider civil disturbances unrelated to the game. For instance, \\\n",
" the COVID-19 pandemic caused many sports leagues around the world \\\n",
" to be played behind closed doors.\",\n",
" \"Could weather cause a sport event to have no crowd?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above function illustrates how to potentially combine a discriminator and a fine-tuned Q&A model. This gives a more fine-grained control over how certain we want the model to be before it answers the question.\n",
"\n",
"We'll now take a look on how answers endpoint works - combining search to retrieve the relevant context from a knowledge base, and then using the fine-tuned Q&A model to answer the question."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 Answering the question based on a knowledge base\n",
"Finally we can use a logic similar to the [/answers](https://beta.openai.com/docs/api-reference/answers) endpoint, where we first search for the relevant context, and then ask a Q&A model to answer the question given that context. If you'd like to see the implementation details, check out the [`answers_with_ft.py`](answers_with_ft.py) file."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" Canada won the Women's football tournament at the 2020 Olympic games\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from answers_with_ft import answer_question\n",
"answer_question(olympics_search_fileid, ft_qa, \"Which country won the Women's football tournament at the 2020 Olympic games?\")"
]
}
],
"metadata": {
"interpreter": {
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,50 @@
# Deprecation of Answers, Classification, and Search

In 2021, OpenAI released specialized endpoints in beta for Answers, Classification, and Search.

While these specialized endpoints were convenient, they had two drawbacks:

1. These specialized endpoints were eclipsed by techniques that achieved better results.
2. These specialized endpoints were more difficult to customize and optimize for individual use cases.

As a result, **the Answers, Classifications, and Search endpoints are being deprecated.**

## Timeline of deprecation

For those who have not used these endpoints, nothing will change except that access will no longer be available.

**For existing users of these endpoints, access will continue until December 3, 2022.** Before that date, we strongly encourage developers to switch over to newer techniques which produce better results.

## How to transition

We've written guides and code examples for transitioning from the deprecated API endpoints to better methods.

### Answers
[Guide: How to transition off the Answers endpoint](https://help.openai.com/en/articles/6233728-answers-transition-guide)

* Option 1: transition to embeddings-based search **(recommended)**
  * Example code: [Semantic_text_search_using_embeddings.ipynb](../examples/Semantic_text_search_using_embeddings.ipynb)
* Option 2: reimplement Answers endpoint functionality
  * Example code: [answers_functionality_example.py](answers_functionality_example.py)
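
Option 2 boils down to retrieving candidate passages, keeping as many as fit in a token budget, and placing them ahead of the question in a completion prompt. The following is a minimal sketch of the selection and prompt-building step only, assuming the passages arrive already ranked by relevance. The helper names `count_tokens` and `build_prompt` are hypothetical, whitespace splitting stands in for a real tokenizer, and the prompt layout follows the one used in `answers_with_ft.py` from this commit.

```python
def count_tokens(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def build_prompt(question, ranked_passages, max_context_tokens=50):
    """Keep top-ranked passages while they fit the budget, then format a Q&A prompt."""
    selected = []
    used = 0
    for passage in ranked_passages:
        cost = count_tokens(passage)
        if used + cost > max_context_tokens:
            break  # stop at the first passage that would exceed the budget
        selected.append(passage)
        used += cost
    context = "\n\n###\n\n".join(selected)
    return (
        "Answer the question based on the context below\n\n"
        f"Text: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )


# Toy passages, assumed to be sorted from most to least relevant.
passages = [
    "Olympia is the capital of Washington.",
    "Salem is the capital of Oregon.",
    "Sacramento is the capital of California.",
]
prompt = build_prompt("What is the capital of Oregon?", passages, max_context_tokens=12)
print(prompt)
```

With a budget of 12 "tokens", the first two passages fit and the third is dropped. The resulting string would then be sent to a completion model; that call is omitted here.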
### Classification
[Guide: How to transition off the Classifications endpoint](https://help.openai.com/en/articles/6272941-classifications-transition-guide)

* Option 1: transition to fine-tuning **(recommended)**
  * Example code: [Classification.ipynb](../examples/Classification.ipynb)
* Option 2: transition to embeddings
  * Example code: [Semantic_text_search_using_embeddings.ipynb](../examples/Semantic_text_search_using_embeddings.ipynb)
* Option 3: reimplement Classifications endpoint functionality
  * Example code: [classification_functionality_example.py](classification_functionality_example.py)
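
One possible way to classify with embeddings (Option 2) is a nearest-neighbour vote over labeled examples. The sketch below is illustrative rather than the method used in the linked notebook: the vectors are hand-written toy values standing in for real embeddings, and `cosine_similarity` and `classify` are hypothetical helpers.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query_vec, labeled_examples, k=3):
    """Majority vote over the k labeled examples most similar to the query."""
    ranked = sorted(
        labeled_examples,
        key=lambda example: cosine_similarity(query_vec, example[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy (vector, label) pairs; in practice the vectors would come from an embeddings model.
examples = [
    ([0.9, 0.1, 0.0], "Plant"),
    ([0.8, 0.2, 0.1], "Plant"),
    ([0.1, 0.9, 0.2], "Building"),
    ([0.0, 0.8, 0.3], "Building"),
]
label = classify([0.85, 0.15, 0.05], examples, k=3)
print(label)
```

Choosing `k` trades robustness to noisy labels against sensitivity to small classes; an odd `k` avoids ties in the two-class case.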
### Search
[Guide: How to transition off the Search endpoint](https://help.openai.com/en/articles/6272952-search-transition-guide)

* Option 1: transition to embeddings-based search **(recommended)**
  * Example code: [Semantic_text_search_using_embeddings.ipynb](../examples/Semantic_text_search_using_embeddings.ipynb)
* Option 2: reimplement Search endpoint functionality
  * Example code: [search_functionality_example.py](search_functionality_example.py)
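
Embeddings-based search (Option 1) ranks documents by how close their vectors are to the query vector. A minimal sketch of that ranking step follows, assuming the vectors have already been computed elsewhere. The toy two-dimensional vectors and the `search` helper are hypothetical; the `document` and `score` keys in the result are chosen here to resemble the search results used in `answers_functionality_example.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, max_documents=None):
    """Return [{"document": index, "score": similarity}, ...] sorted by descending score."""
    results = [
        {"document": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(doc_vecs)
    ]
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:max_documents] if max_documents else results


# Toy document vectors; real ones would come from an embeddings model.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = search([0.0, 2.0], doc_vecs, max_documents=2)
print([hit["document"] for hit in hits])
```

Because cosine similarity ignores vector length, the query `[0.0, 2.0]` matches document 2 exactly, and document 1 comes second.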

@ -0,0 +1,304 @@
from transformers import GPT2TokenizerFast

import openai

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

MAX_TOKENS_LIMIT = 2048
ANSWERS_INSTRUCTION = "Please answer the question according to the above context.\n"
CONTEXT_TEMPLATE = "===\nContext: {context}\n===\n"


def extract_instruction(instruction):
    """
    Extract the `instruction` parameter and format it properly.
    If it does not exist, return an empty string.
    """
    if instruction is None:
        return ""

    return f"{instruction.strip()}\n\n"


def semantic_search(
    search_model, query_for_search, file_id=None, max_documents=None, examples=None
):
    """
    :param examples: A list of {"text":...} or {"text": ..., "label": ...}.
    :return:
        a list of semantic search result dict of documents sorted by "score":
        [
            {
                "document": ...,
                "object": "search_result",
                "score": ...,
                "text": ...,
            },
            ...
        ]
    """
    assert (examples is None) ^ (file_id is None)  # xor

    if file_id is not None:
        # This is where you'd do an elastic search call.  Since there isn't an example of this
        # we can query, we'll raise an error.
        # The return value from this would be a list of examples
        raise NotImplementedError()

    # This isn't quite accurate since Search is also being deprecated. See our search guide for more
    # information.
    search_result = openai.Search.create(
        model=search_model,
        documents=[x["text"] for x in examples],
        query=query_for_search,
    )

    info_dict = {d["document"]: d for d in search_result["data"]}
    sorted_doc_ids = sorted(
        info_dict.keys(), key=lambda x: info_dict[x]["score"], reverse=True
    )
    if max_documents:
        sorted_doc_ids = sorted_doc_ids[:max_documents]
    return [info_dict[i] for i in sorted_doc_ids]


def select_by_length(
    sorted_doc_infos,
    max_token_len,
    lambda_fn=None,
):
    """
    Given a list of search result dicts, select as many documents as possible
    as long as the total length does not go above `max_token_len`.

    :param sorted_doc_infos: A list of semantic search result dict of documents sorted by "score".
    :param max_token_len: The maximum token length for selected documents.
    :param lambda_fn: A function that takes in search results dict and output a formatted
        example for context stuffing.
    :return: A tuple of (
        A concatenation of selected documents used as context,
        A list of the search result dicts of the selected documents
    )
    """
    if not sorted_doc_infos:
        return "", []

    selected_indices = []
    total_doc_tokens = 0
    doc_dict = {}
    for i, doc_info in enumerate(sorted_doc_infos):
        doc = lambda_fn(doc_info) if lambda_fn else doc_info["text"]
        n_doc_tokens = len(tokenizer.encode(doc))
        if total_doc_tokens + n_doc_tokens < max_token_len:
            total_doc_tokens += n_doc_tokens
            selected_indices.append(i)
            doc_dict[i] = doc

    # The top ranked documents should go at the end.
    selected_indices = selected_indices[::-1]

    context = "".join([doc_dict[i] for i in selected_indices])
    selected_doc_infos = [sorted_doc_infos[i] for i in selected_indices]
    return context, selected_doc_infos


def answers(
    examples,
    question,
    model,
    examples_context,
    file_id=None,
    documents=None,
    logit_bias=None,
    max_rerank=200,
    max_tokens=16,
    alternative_question=None,
    search_model="ada",
    temperature=0.0,
    logprobs=0,
    stop=None,
    n=1,
):
    """
    Given a prompt, a question, a list of (question, answer) pairs as examples, and
    a list of documents for context, it tries to include all the QA examples and top
    relevant context documents.

    The constructed prompt for the final completion call:
    ```
    Please answer the question according to the above context.

    ===
    Context: {{ the context for example QA pairs. }}
    ===
    Q: example 1 question
    A: example 1 answer
    ---
    Q: example 2 question
    A: example 2 answer
    ===
    Context: {{ a list of relevant documents sorted via search(question, documents) }}
    ===
    Q: question
    A:
    ```

    The returned object has a structure like:
    {
      "answers": [
        "Beijing",
        "Beijing, China"
      ],
      "completion_id": "xxx-xxx",
      "object": "answer",
      "selected_documents": [
        {
            "document": ..., # document index, same as in search/ results.
            "object": "search_result",
            "text": ...,
        },
        ...
      ],
    }
    """
    examples = examples if examples else []

    example_prompts = [f"Q: {x}\nA: {y}" for x, y in examples]
    prompt = f"Q: {question}\nA:"

    # Append all the QA examples into the prompt.
    if examples_context:
        examples_context = CONTEXT_TEMPLATE.format(context=examples_context)
    instruction = (
        ANSWERS_INSTRUCTION + examples_context + "\n---\n".join(example_prompts) + "\n"
    )

    logit_bias = logit_bias if logit_bias is not None else {}

    if file_id is None and documents is None:
        raise Exception("Please submit at least one of `documents` or `file`.")
    if file_id is not None and documents is not None:
        raise Exception("Please submit only one of `documents` or `file`.")

    instruction = extract_instruction(instruction)

    n_instruction_tokens = len(tokenizer.encode(instruction))
    n_prompt_tokens = len(tokenizer.encode(prompt))
    n_query_tokens = len(tokenizer.encode(question))
    n_context_tokens = len(tokenizer.encode(CONTEXT_TEMPLATE.format(context="")))

    if documents is not None:
        documents = [doc.strip() + " " for doc in documents]
        n_docs_tokens = [len(tokenizer.encode(doc)) for doc in documents]

    # Excluding all the required content, how many tokens are left for context stuffing.
    leftover_token_len = MAX_TOKENS_LIMIT - (
        n_instruction_tokens + n_context_tokens + n_prompt_tokens + max_tokens
    )
    sorted_doc_infos = []

    question_for_search = (
        alternative_question if alternative_question is not None else question
    )
    if file_id is not None:
        # semantic_search returns a single list of search results, so there is nothing to unpack.
        sorted_doc_infos = semantic_search(
            search_model,
            question_for_search,
            file_id=file_id,
            max_documents=max_rerank,
        )

    elif len(documents) == 0:
        # If no context document is provided, do nothing.
        pass

    elif min(n_docs_tokens) >= leftover_token_len:
        # If there is no room for adding any context doc.
        pass

    elif (max_rerank is None or max_rerank >= len(documents)) and sum(
        n_docs_tokens
    ) < leftover_token_len:
        # If the total length of docs is short enough to be added all.
        selected_indices = list(range(len(documents)))

        sorted_doc_infos = [
            {"document": i, "text": documents[i]} for i in selected_indices
        ]

    elif n_query_tokens + max(n_docs_tokens) >= MAX_TOKENS_LIMIT:
        # If the prompt and the longest document together go above the limit.
        total_tokens = n_query_tokens + max(n_docs_tokens)
        raise Exception(
            f"The longest document and prompt pair together contains {total_tokens} "
            f"tokens, above the limit {MAX_TOKENS_LIMIT} for semantic search. Please consider "
            f"shortening the prompt or the longest document."
        )

    else:
        # If we can add some context documents but not all of them, we should
        # query search endpoint to rank docs by score.
        sorted_doc_infos = semantic_search(
            search_model,
            question_for_search,
            examples=[{"text": doc} for doc in documents],
            max_documents=max_rerank,
        )

    # Select documents w.r.t. the context length limitation.
    context, sorted_doc_infos = select_by_length(
        sorted_doc_infos,
        leftover_token_len,
        lambda_fn=lambda x: x["text"].strip() + " ",
    )

    # Add instruction before the context and the prompt after the context.
    if context:
        context = CONTEXT_TEMPLATE.format(context=context.strip())
    full_prompt = instruction + context + prompt

    completion_result = openai.Completion.create(
        engine=model,
        prompt=full_prompt,
        logit_bias=logit_bias,
        temperature=temperature,
        n=n,
        max_tokens=max_tokens,
        stop=stop,
        logprobs=logprobs,
    )

    completion_result["selected_documents"] = sorted_doc_infos

    result = dict(
        object="answer",
        selected_documents=completion_result.pop("selected_documents"),
        completion=completion_result["id"],
    )

    result["answers"] = [
        item["text"].replace("A:", "").split("Q:")[0].strip()
        for item in completion_result["choices"]
    ]

    return result
print(
    answers(
        examples=[
            ["What is the capital of Washington", "Olympia"],
            ["What is the capital of Oregon", "Salem"],
        ],
        question="What is the capital of China?",
        examples_context="I am a bot that names country capitals",
        documents=["I am a bot that names country capitals"],
        model="davinci",
        search_model="ada",
        alternative_question="different test",
        max_tokens=16,
        stop=["\n\n"],
    )
)
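
The context-stuffing step in `answers` can be tried without any API call. The sketch below is a simplified stand-in, not the code from this commit: it assumes a whitespace token counter (`count_tokens`, a hypothetical helper) in place of the GPT-2 tokenizer, and the helper names `leftover_budget` and `select_ranked_docs` are illustrative. It mirrors the budget arithmetic and the greedy selection, where the best-ranked document ends up last in the context.

```python
MAX_TOKENS_LIMIT = 2048


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for tokenizer.encode().
    return len(text.split())


def leftover_budget(instruction: str, prompt: str, max_tokens: int) -> int:
    # Tokens left for context after reserving room for the required parts.
    return MAX_TOKENS_LIMIT - (
        count_tokens(instruction) + count_tokens(prompt) + max_tokens
    )


def select_ranked_docs(ranked_docs, budget):
    # Greedily keep documents (best first) while the total stays below budget.
    kept, used = [], 0
    for doc in ranked_docs:
        n_tokens = count_tokens(doc)
        if used + n_tokens < budget:
            kept.append(doc)
            used += n_tokens
    # Reverse so the top-ranked document sits closest to the question.
    return " ".join(reversed(kept))


budget = leftover_budget("Answer the question.", "Q: capital of China?\nA:", 16)
context = select_ranked_docs(["Beijing is the capital.", "Salem is in Oregon."], budget)
print(budget, "|", context)
```

A document that does not fit is skipped rather than ending the loop, so a shorter, lower-ranked document can still be included.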

@ -0,0 +1,302 @@
import itertools
from collections import defaultdict
from transformers import GPT2TokenizerFast
import openai
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
MAX_TOKENS_LIMIT = 2048


def create_instruction(labels) -> str:
    """
    Construct an instruction for a classification task.
    """
    instruction = f"Please classify a piece of text into the following categories: {', '.join(labels)}."

    return f"{instruction.strip()}\n\n"


def semantic_search(
    search_model, query_for_search, file_id=None, max_documents=None, examples=None
):
    """
    :param examples: A list of {"text":...} or {"text": ..., "label": ...}.
    :return:
        a list of semantic search result dict of documents sorted by "score":
        [
            {
                "document": ...,
                "object": "search_result",
                "score": ...,
                "text": ...,
            },
            ...
        ]
    """
    assert (examples is None) ^ (file_id is None)  # xor

    if file_id is not None:
        # This is where you'd do an elastic search call. Since there isn't an example of this
        # we can query, we'll raise an error.
        # The return value from this would be a list of examples
        raise NotImplementedError()

    # This isn't quite accurate since Search is also being deprecated. See our search guide for more
    # information.

    search_result = openai.Search.create(
        model=search_model,
        documents=[x["text"] for x in examples],
        query=query_for_search,
    )

    info_dict = {d["document"]: d for d in search_result["data"]}
    sorted_doc_ids = sorted(
        info_dict.keys(), key=lambda x: info_dict[x]["score"], reverse=True
    )
    if max_documents:
        sorted_doc_ids = sorted_doc_ids[:max_documents]
    return [info_dict[i] for i in sorted_doc_ids]


def select_by_length(
    sorted_doc_infos,
    max_token_len,
    lambda_fn=None,
):
    """
    Given a list of search result dicts sorted by score, select as many documents
    as possible while the total token count stays below `max_token_len`.

    :param sorted_doc_infos: A list of semantic search result dict of documents sorted by "score".
    :param max_token_len: The maximum token length for selected documents.
    :param lambda_fn: A function that takes in a search result dict and outputs a formatted
        example for context stuffing.
    :return: A tuple of (
        A concatenation of selected documents used as context,
        A list of the selected search result dicts
    )
    """
    if not sorted_doc_infos:
        return "", []

    selected_indices = []
    total_doc_tokens = 0
    doc_dict = {}
    for i, doc_info in enumerate(sorted_doc_infos):
        doc = lambda_fn(doc_info) if lambda_fn else doc_info["text"]
        n_doc_tokens = len(tokenizer.encode(doc))
        if total_doc_tokens + n_doc_tokens < max_token_len:
            total_doc_tokens += n_doc_tokens
            selected_indices.append(i)
            doc_dict[i] = doc

    # The top ranked documents should go at the end.
    selected_indices = selected_indices[::-1]

    context = "".join([doc_dict[i] for i in selected_indices])
    selected_doc_infos = [sorted_doc_infos[i] for i in selected_indices]
    return context, selected_doc_infos


def format_example_fn(x: dict) -> str:
    return "Text: {text}\nCategory: {label}\n---\n".format(
        text=x["text"].replace("\n", " ").strip(),
        label=x["label"].replace("\n", " ").strip(),
    )


def classifications(
    query,
    model,
    search_model="ada",
    examples=None,
    file=None,
    labels=None,
    temperature=0.0,
    logprobs=None,
    max_examples=200,
    logit_bias=None,
    alternative_query=None,
    max_tokens=16,
) -> dict:
    """
    Given a query and a list of examples containing (text, label) pairs,
    it selects the most relevant examples to construct a prompt for few-shot classification.

    The constructed prompt for the final completion call:
    ```
    {{ an optional instruction }}

    Text: example 1 text
    Category: example 1 label
    ---
    Text: example 2 text
    Category: example 2 label
    ---
    Text: question
    Category:
    ```

    The returned object has a structure like:
    {
        "label": "Happy",
        "model": "ada",
        "object": "classification",
        "completion": ...,  # ID of the underlying completion.
        "selected_examples": [
            {
                "document": ...,  # document index, same as in search/ results.
                "text": ...,
                "label": ...,
            },
            ...
        ],
    }
    """

    query = query.replace("\n", " ").strip()
    logit_bias = logit_bias if logit_bias else {}
    labels = labels if labels else []

    if file is None and examples is None:
        raise Exception("Please submit at least one of `examples` or `file`.")
    if file is not None and examples is not None:
        raise Exception("Please submit only one of `examples` or `file`.")

    instruction = create_instruction(labels)

    query_for_search = alternative_query if alternative_query is not None else query

    # Extract examples and example labels first.
    if file is not None:
        sorted_doc_infos = semantic_search(
            search_model,
            query_for_search,
            file_id=file,
            max_documents=max_examples,
        )
    else:
        example_prompts = [
            format_example_fn(dict(text=x, label=y)) for x, y in examples
        ]
        n_examples_tokens = [len(tokenizer.encode(x)) for x in example_prompts]

    query_prompt = f"Text: {query}\nCategory:"
    n_instruction_tokens = len(tokenizer.encode(instruction))
    n_query_tokens = len(tokenizer.encode(query_prompt))

    # Excluding all the required content, how many tokens are left for context stuffing.
    leftover_token_len = MAX_TOKENS_LIMIT - (
        n_instruction_tokens + n_query_tokens + max_tokens
    )

    # Process when `examples` are provided but no `file` is provided.
    if examples:
        if (max_examples is None or max_examples >= len(examples)) and sum(
            n_examples_tokens
        ) < leftover_token_len:
            # If the total length of docs is short enough that we can add all examples, no search call.
            selected_indices = list(range(len(examples)))

            sorted_doc_infos = [
                {"document": i, "text": examples[i][0], "label": examples[i][1]}
                for i in selected_indices
            ]

        elif max(n_examples_tokens) + n_query_tokens >= MAX_TOKENS_LIMIT:
            # If the prompt and the longest example together go above the limit:
            total_tokens = max(n_examples_tokens) + n_query_tokens
            # Exception takes no keyword arguments, so pass the message positionally.
            raise Exception(
                f"The longest classification example, query and prompt together contain "
                f"{total_tokens} tokens, above the limit {MAX_TOKENS_LIMIT} for semantic search. "
                f"Please consider shortening your instruction, query or the longest example."
            )

        else:
            # If we can add some context documents but not all of them, we should
            # query the search endpoint to rank docs by score.
            sorted_doc_infos = semantic_search(
                search_model,
                query_for_search,
                examples=[{"text": x, "label": y} for x, y in examples],
                max_documents=max_examples,
            )

    # Per label, we have a list of doc ids sorted by relevance to the query.
    label_to_indices = defaultdict(list)
    for idx, d in enumerate(sorted_doc_infos):
        label_to_indices[d["label"]].append(idx)

    # Do a round robin for each of the different labels, taking the best match for each label.
    label_indices = [label_to_indices[label] for label in labels]
    mixed_indices = [
        i for x in itertools.zip_longest(*label_indices) for i in x if i is not None
    ]
    sorted_doc_infos = [sorted_doc_infos[i] for i in mixed_indices]

    # Try to select as many examples as needed to fit into the context
    context, sorted_doc_infos = select_by_length(
        sorted_doc_infos,
        leftover_token_len,
        lambda_fn=format_example_fn,
    )

    prompt = instruction + context + query_prompt

    completion_params = {
        "engine": model,
        "prompt": prompt,
        "temperature": temperature,
        "logprobs": logprobs,
        "logit_bias": logit_bias,
        "max_tokens": max_tokens,
        "stop": "\n",
        "n": 1,
    }

    completion_resp = openai.Completion.create(
        **completion_params,
    )

    label = completion_resp["choices"][0]["text"]
    # The completion is normalized to capitalized form (e.g. "happy" -> "Happy"),
    # so `labels` must use the same capitalization or the result falls back to "Unknown".
    label = label.split("\n")[0].strip().lower().capitalize()
    if label not in labels:
        label = "Unknown"

    result = dict(
        # TODO: Add id for object persistence.
        object="classification",
        model=completion_resp["model"],
        label=label,
        completion=completion_resp["id"],
    )

    result["selected_examples"] = sorted_doc_infos

    return result


print(
    classifications(
        query="this is my test",
        model="davinci",
        search_model="ada",
        examples=[
            ["this is my test", "davinci"],
            ["this is other test", "blahblah"],
        ],
        file=None,
        labels=["davinci", "blahblah"],
        temperature=0.1,
        logprobs=0,
        max_examples=200,
        logit_bias=None,
        alternative_query="different test",
        max_tokens=16,
    )
)
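
The round-robin step in `classifications` is easy to miss, so here is a minimal, API-free sketch of it. The function name `round_robin_by_label` and the sample data are hypothetical; only the `defaultdict` grouping and the `itertools.zip_longest` interleaving follow the code above.

```python
import itertools
from collections import defaultdict


def round_robin_by_label(ranked_examples, labels):
    # Group positions in the ranked list by label, preserving rank order.
    label_to_indices = defaultdict(list)
    for idx, example in enumerate(ranked_examples):
        label_to_indices[example["label"]].append(idx)

    # Take the best remaining example of each label in turn.
    per_label = [label_to_indices[label] for label in labels]
    mixed = [
        i
        for group in itertools.zip_longest(*per_label)
        for i in group
        if i is not None
    ]
    return [ranked_examples[i] for i in mixed]


# Hypothetical ranked examples, best match first.
ranked = [
    {"text": "great", "label": "Happy"},
    {"text": "lovely", "label": "Happy"},
    {"text": "awful", "label": "Sad"},
]
mixed_examples = round_robin_by_label(ranked, ["Happy", "Sad"])
print([e["text"] for e in mixed_examples])
```

Interleaving this way keeps one dominant label from filling the whole prompt. Examples whose label is not listed in `labels` are dropped, because only the listed labels are looked up.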

@ -0,0 +1,76 @@
from transformers import GPT2TokenizerFast
import openai
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
docs = ["test1", "asdklgjnasdv", "banana", "lord lollipop"]
query = "apple orang asdansbdausd"
print(openai.Search.create(model="davinci", query=query, documents=docs))


def construct_context(query, document):
    return "<|endoftext|>{document}\n\n---\n\nThe above passage is related to: {query}".format(
        document=document, query=query
    )


def get_score(context, query, log_probs, text_offsets) -> float:
    SCORE_MULTIPLIER = 100.0

    log_prob = 0
    count = 0
    # Character offset in `context` at which the query starts.
    cutoff = len(context) - len(query)

    # Walk backwards from the last token, accumulating log-probabilities
    # until the token that reaches the start of the query has been included.
    for i in range(len(text_offsets) - 1, 0, -1):
        log_prob += log_probs[i]
        count += 1

        if text_offsets[i] <= cutoff and text_offsets[i] != text_offsets[i - 1]:
            break

    # Average log-probability per query token, scaled.
    return log_prob / float(count) * SCORE_MULTIPLIER


def search(query, documents, engine):
    # The first prompt uses an empty document and serves as the baseline score.
    # Use the `documents` argument rather than the module-level `docs`.
    prompts = [construct_context(query, doc) for doc in [""] + documents]

    resps = openai.Completion.create(
        model=engine,
        prompt=prompts,
        temperature=1.0,
        top_p=1.0,
        max_tokens=0,
        logprobs=0,
        n=1,
        echo=True,
    )
    resps_by_index = {choice["index"]: choice for choice in resps["choices"]}

    scores = [
        get_score(
            prompts[i],
            query,
            resps_by_index[i]["logprobs"]["token_logprobs"],
            resps_by_index[i]["logprobs"]["text_offset"],
        )
        for i in range(len(prompts))
    ]

    # Process results: subtract the empty-document baseline, then drop it.
    scores = [score - scores[0] for score in scores][1:]

    return [
        {
            "object": "search_result",
            "document": document_idx,
            "score": round(score, 3),
        }
        for document_idx, score in enumerate(scores)
    ]


print(search(query=query, documents=docs, engine="davinci"))
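
To see how the score is formed without calling the API, the sketch below re-implements the averaging loop on made-up token data. Everything in it is an assumption for illustration: the name `mean_query_logprob`, the three-token split, and the log-probability values are hypothetical, and the baseline reuses the same context and offsets for brevity, whereas `search` builds a separate empty-document prompt.

```python
def mean_query_logprob(context, query, log_probs, text_offsets):
    # Walk backwards from the last token until the query start is covered.
    cutoff = len(context) - len(query)
    total, count = 0.0, 0
    for i in range(len(text_offsets) - 1, 0, -1):
        total += log_probs[i]
        count += 1
        if text_offsets[i] <= cutoff and text_offsets[i] != text_offsets[i - 1]:
            break
    return total / count * 100.0


# Hypothetical token data: "doc " + "apple pie", tokens at offsets 0, 4, 10.
context, query = "doc apple pie", "apple pie"
offsets = [0, 4, 10]
doc_logprobs = [None, -1.0, -2.0]  # the first token has no log-probability
baseline_logprobs = [None, -3.0, -4.0]

doc_score = mean_query_logprob(context, query, doc_logprobs, offsets)
baseline = mean_query_logprob(context, query, baseline_logprobs, offsets)
print(round(doc_score - baseline, 3))
```

A positive difference means the query is more likely after the document than after the baseline, which is what `search` reports as the document's score.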