langchain/docs/modules/prompts/examples/example_selectors.ipynb
Harrison Chase 23d5f64bda
Harrison/ngram example (#846)
Co-authored-by: Sean Spriggens <ssprigge@syr.edu>
2023-02-02 09:44:42 -08:00

{
"cells": [
{
"cell_type": "markdown",
"id": "bf038596",
"metadata": {},
"source": [
"# Example Selectors\n",
"If you have a large number of examples, you may need to select which ones to include in the prompt. An ExampleSelector is the class responsible for doing so. The base interface is defined below.\n",
"\n",
"```python\n",
"class BaseExampleSelector(ABC):\n",
"    \"\"\"Interface for selecting examples to include in prompts.\"\"\"\n",
"\n",
"    @abstractmethod\n",
"    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:\n",
"        \"\"\"Select which examples to use based on the inputs.\"\"\"\n",
"\n",
"```\n",
"\n",
"The only method it needs to expose is `select_examples`, which takes the input variables and returns a list of examples. How those examples are selected is up to each implementation. Let's take a look at some below."
]
},
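{
"cell_type": "markdown",
"id": "5e1c7a20",
"metadata": {},
"source": [
"As a minimal sketch of what implementing this interface can look like, the snippet below defines a hypothetical `FirstLetterExampleSelector` that keeps the examples whose `input` starts with the same letter as one of the input values. The class name and the selection rule are assumptions made purely for illustration and are not part of langchain. The interface is redefined locally so that the snippet runs on its own.\n",
"\n",
"```python\n",
"from abc import ABC, abstractmethod\n",
"from typing import Dict, List\n",
"\n",
"\n",
"class BaseExampleSelector(ABC):\n",
"    \"\"\"Local copy of the interface shown above.\"\"\"\n",
"\n",
"    @abstractmethod\n",
"    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:\n",
"        \"\"\"Select which examples to use based on the inputs.\"\"\"\n",
"\n",
"\n",
"class FirstLetterExampleSelector(BaseExampleSelector):\n",
"    \"\"\"Hypothetical selector, for illustration only.\"\"\"\n",
"\n",
"    def __init__(self, examples: List[dict]) -> None:\n",
"        self.examples = list(examples)\n",
"\n",
"    def add_example(self, example: dict) -> None:\n",
"        \"\"\"Make one more example available for selection.\"\"\"\n",
"        self.examples.append(example)\n",
"\n",
"    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:\n",
"        # Collect the lowercase first letter of every non-empty input value.\n",
"        letters = {value[0].lower() for value in input_variables.values() if value}\n",
"        # Keep the examples whose input starts with one of those letters.\n",
"        return [ex for ex in self.examples if ex[\"input\"][:1].lower() in letters]\n",
"```\n",
"\n",
"With the antonym examples used later in this notebook, calling `select_examples` with the adjective `soft` would return only the examples whose input begins with `s`."
]
},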
{
"cell_type": "code",
"execution_count": 1,
"id": "8244ff60",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import FewShotPromptTemplate"
]
},
{
"cell_type": "markdown",
"id": "861a4d1f",
"metadata": {},
"source": [
"## LengthBased ExampleSelector\n",
"\n",
"This ExampleSelector selects which examples to use based on length. This is useful when you are worried about constructing a prompt that exceeds the length of the context window. For longer inputs it selects fewer examples, and for shorter inputs it selects more.\n"
]
},
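{
"cell_type": "markdown",
"id": "7b3d9e41",
"metadata": {},
"source": [
"The idea can be sketched in a few lines. The hypothetical `select_by_length` helper below is not langchain's implementation. It assumes that examples are formatted as `Input: ...` and `Output: ...` lines, that length is a word count obtained by splitting on spaces and newlines, and that examples are added in order until the budget left over after the input is used up.\n",
"\n",
"```python\n",
"import re\n",
"from typing import Dict, List\n",
"\n",
"\n",
"def get_text_length(text: str) -> int:\n",
"    \"\"\"Count words by splitting on newlines and spaces.\"\"\"\n",
"    return len(re.split(\"\\n| \", text))\n",
"\n",
"\n",
"def select_by_length(\n",
"    examples: List[Dict[str, str]], query: str, max_length: int = 25\n",
") -> List[Dict[str, str]]:\n",
"    \"\"\"Hypothetical helper: keep examples, in order, while the word budget lasts.\"\"\"\n",
"    # The input itself consumes part of the budget.\n",
"    remaining = max_length - get_text_length(query)\n",
"    selected = []\n",
"    for example in examples:\n",
"        formatted = f\"Input: {example['input']}\\nOutput: {example['output']}\"\n",
"        cost = get_text_length(formatted)\n",
"        # Stop at the first example that no longer fits.\n",
"        if cost > remaining:\n",
"            break\n",
"        selected.append(example)\n",
"        remaining -= cost\n",
"    return selected\n",
"```\n",
"\n",
"Under these assumptions each formatted antonym example costs four words, so a one-word input with a budget of 25 leaves room for up to six examples, while a 21-word input leaves room for only one."
]
},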
{
"cell_type": "code",
"execution_count": 8,
"id": "7c469c95",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"from langchain.prompts.example_selector import LengthBasedExampleSelector"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0ec6d950",
"metadata": {},
"outputs": [],
"source": [
"# These are several examples for a made-up task of creating antonyms.\n",
"examples = [\n",
" {\"input\": \"happy\", \"output\": \"sad\"},\n",
" {\"input\": \"tall\", \"output\": \"short\"},\n",
" {\"input\": \"energetic\", \"output\": \"lethargic\"},\n",
" {\"input\": \"sunny\", \"output\": \"gloomy\"},\n",
" {\"input\": \"windy\", \"output\": \"calm\"},\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "207e55f7",
"metadata": {},
"outputs": [],
"source": [
"example_prompt = PromptTemplate(\n",
" input_variables=[\"input\", \"output\"],\n",
" template=\"Input: {input}\\nOutput: {output}\",\n",
")\n",
"example_selector = LengthBasedExampleSelector(\n",
" # These are the examples it has available to choose from.\n",
" examples=examples, \n",
" # This is the PromptTemplate being used to format the examples.\n",
" example_prompt=example_prompt, \n",
" # This is the maximum length that the formatted examples should be.\n",
" # Length is measured by the get_text_length function below.\n",
" max_length=25,\n",
" # This is the function used to get the length of a string, which is used\n",
" # to determine which examples to include. It is commented out because\n",
" # it is provided as a default value if none is specified.\n",
" # get_text_length: Callable[[str], int] = lambda x: len(re.split(\"\\n| \", x))\n",
")\n",
"dynamic_prompt = FewShotPromptTemplate(\n",
" # We provide an ExampleSelector instead of examples.\n",
" example_selector=example_selector,\n",
" example_prompt=example_prompt,\n",
" prefix=\"Give the antonym of every input\",\n",
" suffix=\"Input: {adjective}\\nOutput:\", \n",
" input_variables=[\"adjective\"],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d00b4385",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: tall\n",
"Output: short\n",
"\n",
"Input: energetic\n",
"Output: lethargic\n",
"\n",
"Input: sunny\n",
"Output: gloomy\n",
"\n",
"Input: windy\n",
"Output: calm\n",
"\n",
"Input: big\n",
"Output:\n"
]
}
],
"source": [
"# An example with small input, so it selects all examples.\n",
"print(dynamic_prompt.format(adjective=\"big\"))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "878bcde9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else\n",
"Output:\n"
]
}
],
"source": [
"# An example with long input, so it selects only one example.\n",
"long_string = \"big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else\"\n",
"print(dynamic_prompt.format(adjective=long_string))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "e4bebcd9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: tall\n",
"Output: short\n",
"\n",
"Input: energetic\n",
"Output: lethargic\n",
"\n",
"Input: sunny\n",
"Output: gloomy\n",
"\n",
"Input: windy\n",
"Output: calm\n",
"\n",
"Input: big\n",
"Output: small\n",
"\n",
"Input: enthusiastic\n",
"Output:\n"
]
}
],
"source": [
"# You can add an example to an example selector as well.\n",
"new_example = {\"input\": \"big\", \"output\": \"small\"}\n",
"dynamic_prompt.example_selector.add_example(new_example)\n",
"print(dynamic_prompt.format(adjective=\"enthusiastic\"))"
]
},
{
"cell_type": "markdown",
"id": "2d007b0a",
"metadata": {},
"source": [
"## Similarity ExampleSelector\n",
"\n",
"The SemanticSimilarityExampleSelector selects the examples that are most similar to the inputs. It does this by finding the examples whose embeddings have the greatest cosine similarity with the inputs.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "241bfe80",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts.example_selector import SemanticSimilarityExampleSelector\n",
"from langchain.vectorstores import FAISS\n",
"from langchain.embeddings import OpenAIEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "50d0a701",
"metadata": {},
"outputs": [],
"source": [
"example_selector = SemanticSimilarityExampleSelector.from_examples(\n",
" # This is the list of examples available to select from.\n",
" examples, \n",
" # This is the embedding class used to produce embeddings which are used to measure semantic similarity.\n",
" OpenAIEmbeddings(), \n",
" # This is the VectorStore class that is used to store the embeddings and do a similarity search over.\n",
" FAISS, \n",
" # This is the number of examples to produce.\n",
" k=1\n",
")\n",
"similar_prompt = FewShotPromptTemplate(\n",
" # We provide an ExampleSelector instead of examples.\n",
" example_selector=example_selector,\n",
" example_prompt=example_prompt,\n",
" prefix=\"Give the antonym of every input\",\n",
" suffix=\"Input: {adjective}\\nOutput:\", \n",
" input_variables=[\"adjective\"],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "4c8fdf45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: worried\n",
"Output:\n"
]
}
],
"source": [
"# Input is a feeling, so should select the happy/sad example\n",
"print(similar_prompt.format(adjective=\"worried\"))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "829af21a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: fat\n",
"Output:\n"
]
}
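],
"source": [
"# Input is a measurement, so ideally the tall/short example would be selected.\n",
"# In the recorded output for this cell, the happy/sad example was returned instead.\n",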
"print(similar_prompt.format(adjective=\"fat\"))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3c16fe23",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: joyful\n",
"Output:\n"
]
}
],
"source": [
"# You can add new examples to the SemanticSimilarityExampleSelector as well\n",
"similar_prompt.example_selector.add_example({\"input\": \"enthusiastic\", \"output\": \"apathetic\"})\n",
"print(similar_prompt.format(adjective=\"joyful\"))"
]
},
{
"cell_type": "markdown",
"id": "bc35afd0",
"metadata": {},
"source": [
"## Maximal Marginal Relevance ExampleSelector\n",
"\n",
"The MaxMarginalRelevanceExampleSelector selects examples that are similar to the inputs while also optimizing for diversity. It does this by finding the examples whose embeddings have the greatest cosine similarity with the inputs, then adding them one at a time while penalizing each candidate for closeness to the examples already selected.\n"
]
},
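{
"cell_type": "markdown",
"id": "9c2f4b63",
"metadata": {},
"source": [
"The greedy procedure described above can be sketched without any embedding model. The hypothetical `mmr_select` function below works on plain lists of floats standing in for embeddings. The `weight` parameter, which trades relevance against redundancy, is an assumption made for this sketch, and none of this is langchain's own code.\n",
"\n",
"```python\n",
"import math\n",
"from typing import List, Sequence\n",
"\n",
"\n",
"def cosine(a: Sequence[float], b: Sequence[float]) -> float:\n",
"    \"\"\"Cosine similarity of two non-zero vectors.\"\"\"\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"\n",
"def mmr_select(\n",
"    query: Sequence[float],\n",
"    candidates: List[Sequence[float]],\n",
"    k: int = 2,\n",
"    weight: float = 0.5,\n",
") -> List[int]:\n",
"    \"\"\"Return the indices of up to k candidates, picked greedily.\"\"\"\n",
"    selected: List[int] = []\n",
"    while len(selected) < min(k, len(candidates)):\n",
"        best_index, best_score = -1, -math.inf\n",
"        for i, candidate in enumerate(candidates):\n",
"            if i in selected:\n",
"                continue\n",
"            relevance = cosine(query, candidate)\n",
"            # Penalty: similarity to the closest already-selected candidate.\n",
"            redundancy = max(\n",
"                (cosine(candidate, candidates[j]) for j in selected), default=0.0\n",
"            )\n",
"            score = weight * relevance - (1 - weight) * redundancy\n",
"            if score > best_score:\n",
"                best_index, best_score = i, score\n",
"        selected.append(best_index)\n",
"    return selected\n",
"\n",
"\n",
"query = [1.0, 0.0]\n",
"candidates = [[1.0, 0.1], [1.0, 0.2], [0.3, 1.0]]\n",
"print(mmr_select(query, candidates, k=2, weight=0.3))  # [0, 2]\n",
"print(mmr_select(query, candidates, k=2, weight=1.0))  # [0, 1]\n",
"```\n",
"\n",
"With `weight=1.0` the penalty disappears and the result is a plain similarity ranking. A lower weight lets the dissimilar third vector displace the near-duplicate second one."
]
},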
{
"cell_type": "code",
"execution_count": 19,
"id": "ac95c968",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts.example_selector import MaxMarginalRelevanceExampleSelector"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "db579bea",
"metadata": {},
"outputs": [],
"source": [
"example_selector = MaxMarginalRelevanceExampleSelector.from_examples(\n",
" # This is the list of examples available to select from.\n",
" examples, \n",
" # This is the embedding class used to produce embeddings which are used to measure semantic similarity.\n",
" OpenAIEmbeddings(), \n",
" # This is the VectorStore class that is used to store the embeddings and do a similarity search over.\n",
" FAISS, \n",
" # This is the number of examples to produce.\n",
" k=2\n",
")\n",
"mmr_prompt = FewShotPromptTemplate(\n",
" # We provide an ExampleSelector instead of examples.\n",
" example_selector=example_selector,\n",
" example_prompt=example_prompt,\n",
" prefix=\"Give the antonym of every input\",\n",
" suffix=\"Input: {adjective}\\nOutput:\", \n",
" input_variables=[\"adjective\"],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "cd76e344",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: windy\n",
"Output: calm\n",
"\n",
"Input: worried\n",
"Output:\n"
]
}
],
"source": [
"# Input is a feeling, so should select the happy/sad example as the first one\n",
"print(mmr_prompt.format(adjective=\"worried\"))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "cf82956b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the antonym of every input\n",
"\n",
"Input: happy\n",
"Output: sad\n",
"\n",
"Input: enthusiastic\n",
"Output: apathetic\n",
"\n",
"Input: worried\n",
"Output:\n"
]
}
],
"source": [
"# Let's compare this to what we would just get if we went solely off of similarity\n",
"similar_prompt.example_selector.k = 2\n",
"print(similar_prompt.format(adjective=\"worried\"))"
]
},
{
"cell_type": "markdown",
"id": "4aaeed2f",
"metadata": {},
"source": [
"## NGram Overlap ExampleSelector\n",
"\n",
"The NGramOverlapExampleSelector selects and orders examples based on which examples are most similar to the input, according to an ngram overlap score. The ngram overlap score is a float between 0.0 and 1.0, inclusive.\n",
"\n",
"The selector allows a threshold score to be set. Examples with an ngram overlap score less than or equal to the threshold are excluded. The threshold defaults to -1.0, so by default no examples are excluded and they are only reordered. Setting the threshold to 0.0 excludes examples that have no ngram overlap with the input.\n"
]
},
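{
"cell_type": "markdown",
"id": "3d8a6f15",
"metadata": {},
"source": [
"To make the threshold rule concrete, here is a minimal sketch with a stand-in score. The hypothetical `overlap_score` below is simply the fraction of an example's words that also appear in the input, which is not the score the library computes. The `select_by_overlap` helper then applies the rule described above: drop every example whose score is less than or equal to the threshold, and sort the rest by descending score.\n",
"\n",
"```python\n",
"import re\n",
"from typing import Dict, List\n",
"\n",
"\n",
"def overlap_score(sentence: str, example_input: str) -> float:\n",
"    \"\"\"Stand-in score in [0.0, 1.0]: share of example words found in the sentence.\"\"\"\n",
"    sentence_words = set(re.findall(r\"\\w+\", sentence.lower()))\n",
"    example_words = set(re.findall(r\"\\w+\", example_input.lower()))\n",
"    if not example_words:\n",
"        return 0.0\n",
"    return len(example_words & sentence_words) / len(example_words)\n",
"\n",
"\n",
"def select_by_overlap(\n",
"    examples: List[Dict[str, str]], sentence: str, threshold: float = -1.0\n",
") -> List[Dict[str, str]]:\n",
"    \"\"\"Hypothetical helper: filter by threshold, then order by descending score.\"\"\"\n",
"    scored = [(overlap_score(sentence, ex[\"input\"]), ex) for ex in examples]\n",
"    # Scores less than or equal to the threshold are excluded.\n",
"    kept = [pair for pair in scored if pair[0] > threshold]\n",
"    # sorted() is stable, so ties keep their original order.\n",
"    kept = sorted(kept, key=lambda pair: pair[0], reverse=True)\n",
"    return [ex for _, ex in kept]\n",
"```\n",
"\n",
"Because every score lies between 0.0 and 1.0, a threshold of -1.0 keeps everything, a threshold of 0.0 drops examples with no shared words, and any threshold above 1.0 returns an empty list."
]
},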
{
"cell_type": "code",
"execution_count": 2,
"id": "9cbc0acc",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4f318f4b",
"metadata": {},
"outputs": [],
"source": [
"# These are examples of a fictional translation task.\n",
"examples = [\n",
" {\"input\": \"See Spot run.\", \"output\": \"Ver correr a Spot.\"},\n",
" {\"input\": \"My dog barks.\", \"output\": \"Mi perro ladra.\"},\n",
" {\"input\": \"Spot can run.\", \"output\": \"Spot puede correr.\"},\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "bf75e0fe",
"metadata": {},
"outputs": [],
"source": [
"example_prompt = PromptTemplate(\n",
" input_variables=[\"input\", \"output\"],\n",
" template=\"Input: {input}\\nOutput: {output}\",\n",
")\n",
"example_selector = NGramOverlapExampleSelector(\n",
" # These are the examples it has available to choose from.\n",
" examples=examples, \n",
" # This is the PromptTemplate being used to format the examples.\n",
" example_prompt=example_prompt, \n",
" # This is the threshold, at which selector stops.\n",
" # It is set to -1.0 by default.\n",
" threshold=-1.0,\n",
" # For negative threshold:\n",
" # Selector sorts examples by ngram overlap score, and excludes none.\n",
" # For threshold greater than 1.0:\n",
" # Selector excludes all examples, and returns an empty list.\n",
" # For threshold equal to 0.0:\n",
" # Selector sorts examples by ngram overlap score,\n",
" # and excludes those with no ngram overlap with input.\n",
")\n",
"dynamic_prompt = FewShotPromptTemplate(\n",
" # We provide an ExampleSelector instead of examples.\n",
" example_selector=example_selector,\n",
" example_prompt=example_prompt,\n",
" prefix=\"Give the Spanish translation of every input\",\n",
" suffix=\"Input: {sentence}\\nOutput:\", \n",
" input_variables=[\"sentence\"],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "83fb218a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the Spanish translation of every input\n",
"\n",
"Input: Spot can run.\n",
"Output: Spot puede correr.\n",
"\n",
"Input: See Spot run.\n",
"Output: Ver correr a Spot.\n",
"\n",
"Input: My dog barks.\n",
"Output: Mi perro ladra.\n",
"\n",
"Input: Spot can run fast.\n",
"Output:\n"
]
}
],
"source": [
"# An example input with large ngram overlap with \"Spot can run.\"\n",
"# and no overlap with \"My dog barks.\"\n",
"print(dynamic_prompt.format(sentence=\"Spot can run fast.\"))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "485f5307",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the Spanish translation of every input\n",
"\n",
"Input: Spot can run.\n",
"Output: Spot puede correr.\n",
"\n",
"Input: See Spot run.\n",
"Output: Ver correr a Spot.\n",
"\n",
"Input: Spot plays fetch.\n",
"Output: Spot juega a buscar.\n",
"\n",
"Input: My dog barks.\n",
"Output: Mi perro ladra.\n",
"\n",
"Input: Spot can run fast.\n",
"Output:\n"
]
}
],
"source": [
"# You can add examples to NGramOverlapExampleSelector as well.\n",
"new_example = {\"input\": \"Spot plays fetch.\", \"output\": \"Spot juega a buscar.\"}\n",
"\n",
"example_selector.add_example(new_example)\n",
"print(dynamic_prompt.format(sentence=\"Spot can run fast.\"))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "606ce697",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the Spanish translation of every input\n",
"\n",
"Input: Spot can run.\n",
"Output: Spot puede correr.\n",
"\n",
"Input: See Spot run.\n",
"Output: Ver correr a Spot.\n",
"\n",
"Input: Spot plays fetch.\n",
"Output: Spot juega a buscar.\n",
"\n",
"Input: Spot can run fast.\n",
"Output:\n"
]
}
],
"source": [
"# You can set a threshold at which examples are excluded.\n",
"# For example, setting threshold equal to 0.0\n",
"# excludes examples with no ngram overlaps with input.\n",
"# Since \"My dog barks.\" has no ngram overlaps with \"Spot can run fast.\"\n",
"# it is excluded.\n",
"example_selector.threshold=0.0\n",
"print(dynamic_prompt.format(sentence=\"Spot can run fast.\"))"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "7f8d72f7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the Spanish translation of every input\n",
"\n",
"Input: Spot can run.\n",
"Output: Spot puede correr.\n",
"\n",
"Input: Spot plays fetch.\n",
"Output: Spot juega a buscar.\n",
"\n",
"Input: Spot can play fetch.\n",
"Output:\n"
]
}
],
"source": [
"# Setting small nonzero threshold\n",
"example_selector.threshold=0.09\n",
"print(dynamic_prompt.format(sentence=\"Spot can play fetch.\"))"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "09633aa8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Give the Spanish translation of every input\n",
"\n",
"Input: Spot can play fetch.\n",
"Output:\n"
]
}
],
"source": [
"# Setting threshold greater than 1.0\n",
"example_selector.threshold=1.0+1e-9\n",
"print(dynamic_prompt.format(sentence=\"Spot can play fetch.\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39f30097",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}