{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2da95378",
   "metadata": {},
   "source": [
    "# Pairwise String Comparison\n",
    "\n",
    "Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The `StringComparison` evaluators facilitate this so you can answer questions like:\n",
    "\n",
    "- Which LLM or prompt produces a preferred output for a given question?\n",
    "- Which examples should I include for few-shot example selection?\n",
    "- Which output is better to include for fine-tuning?\n",
    "\n",
    "The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the `pairwise_string` evaluator.\n",
    "\n",
    "Check out the reference docs for the [PairwiseStringEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.html#langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f6790c46",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"pairwise_string\", requires_reference=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "49ad9139",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Response A provides an incorrect answer by stating there are three dogs in the park, while the reference answer indicates there are four. Response B, on the other hand, provides the correct answer, matching the reference answer. Although Response B is less detailed, it is accurate and directly answers the question. \\n\\nTherefore, the better response is [[B]].\\n',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
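  {
   "cell_type": "markdown",
   "id": "c3a8f2e1",
   "metadata": {},
   "source": [
    "The returned dictionary can also be consumed programmatically: `score` is 1 when the first prediction (A) is preferred and 0 when the second (B) is preferred, as in the output above. Below is a minimal, illustrative sketch of tallying preferences over a small hand-written dataset (the `examples` list here is made up for demonstration)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3a8f2e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: a tiny hand-written dataset of inputs, references,\n",
    "# and two candidate predictions per input.\n",
    "examples = [\n",
    "    {\n",
    "        \"input\": \"how many dogs are in the park?\",\n",
    "        \"reference\": \"four\",\n",
    "        \"prediction\": \"there are three dogs\",\n",
    "        \"prediction_b\": \"4\",\n",
    "    },\n",
    "]\n",
    "\n",
    "# 'score' is 1 when prediction (A) is preferred, 0 when prediction_b (B) is preferred.\n",
    "scores = [evaluator.evaluate_string_pairs(**pair)[\"score\"] for pair in examples]\n",
    "print(f\"Prediction A preferred in {sum(scores)} of {len(scores)} examples\")"
   ]
  },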
  {
   "cell_type": "markdown",
   "id": "ed353b93-be71-4479-b9c0-8c97814c2e58",
   "metadata": {},
   "source": [
    "## Without References\n",
    "\n",
    "When references aren't available, you can still predict the preferred response.\n",
    "The results will reflect the evaluation model's preference, which is less reliable and may result\n",
    "in preferences that are factually incorrect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "586320da",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"pairwise_string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7f56c76e-a39b-4509-8b8a-8a2afe6c3da1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': \"Response A is accurate but lacks depth and detail. It simply states that addition is a mathematical operation without explaining what it does or how it works. \\n\\nResponse B, on the other hand, provides a more detailed explanation. It not only identifies addition as a mathematical operation, but also explains that it involves adding two numbers to create a third number, the 'sum'. This response is more helpful and informative, providing a clearer understanding of what addition is.\\n\\nTherefore, the better response is B.\\n\",\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"Addition is a mathematical operation.\",\n",
    "    prediction_b=\"Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.\",\n",
    "    input=\"What is addition?\",\n",
    ")"
   ]
  },
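  {
   "cell_type": "markdown",
   "id": "d7e5f3a1",
   "metadata": {},
   "source": [
    "As a rough sketch of the \"which LLM produces a preferred output?\" question from the introduction, you can generate a completion from each of two candidate models and compare them directly, with no reference needed. The two models below are arbitrary placeholders; any two models (or two prompts for the same model) could be swapped in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7e5f3a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: generate a completion from each of two arbitrary candidate chat models\n",
    "# and ask the evaluator which one is preferred.\n",
    "from langchain.chat_models import ChatAnthropic, ChatOpenAI\n",
    "\n",
    "model_a = ChatOpenAI(temperature=0)\n",
    "model_b = ChatAnthropic(temperature=0)\n",
    "\n",
    "question = \"What is addition?\"\n",
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=model_a.predict(question),\n",
    "    prediction_b=model_b.predict(question),\n",
    "    input=question,\n",
    ")"
   ]
  },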
  {
   "cell_type": "markdown",
   "id": "a25b60b2-627c-408a-be4b-a2e5cbc10726",
   "metadata": {},
   "source": [
    "## Customize the LLM\n",
    "\n",
    "By default, the loader uses `gpt-4` in the evaluation chain. You can customize this when loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "de84a958-1330-482b-b950-68bcf23f9e35",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "llm = ChatAnthropic(temperature=0)\n",
    "\n",
    "evaluator = load_evaluator(\"pairwise_string\", llm=llm, requires_reference=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e162153f-d50a-4a7c-a033-019dabbc954c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Response A provides a specific number but is inaccurate based on the reference answer. Response B provides the correct number but lacks detail or explanation. Overall, Response B is more helpful and accurate in directly answering the question, despite lacking depth or creativity.\\n\\n[[B]]\\n',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e89c13-d0ad-4f87-8fcb-814399bafa2a",
   "metadata": {},
   "source": [
    "## Customize the Evaluation Prompt\n",
    "\n",
    "You can use your own custom evaluation prompt to add more task-specific instructions or to instruct the evaluator to score the output.\n",
    "\n",
    "*Note: If you use a prompt that generates a result in a unique format, you may also have to pass in a custom output parser (`output_parser=your_parser()`) instead of the default `PairwiseStringResultOutputParser`.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "fb817efa-3a4d-439d-af8c-773b89d97ec9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"\"\"Given the input context, which is most similar to the reference label: A or B?\n",
    "Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.\n",
    "\n",
    "DATA\n",
    "----\n",
    "input: {input}\n",
    "reference: {reference}\n",
    "A: {prediction}\n",
    "B: {prediction_b}\n",
    "---\n",
    "Reasoning:\n",
    "\n",
    "\"\"\"\n",
    ")\n",
    "evaluator = load_evaluator(\n",
    "    \"pairwise_string\", prompt=prompt_template, requires_reference=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d40aa4f0-cfd5-4cb4-83c8-8d2300a04c2f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input_variables=['input', 'prediction', 'prediction_b', 'reference'] output_parser=None partial_variables={} template='Given the input context, which is most similar to the reference label: A or B?\\nReason step by step and finally, respond with either [[A]] or [[B]] on its own line.\\n\\nDATA\\n----\\ninput: {input}\\nreference: {reference}\\nA: {prediction}\\nB: {prediction_b}\\n---\\nReasoning:\\n\\n' template_format='f-string' validate_template=True\n"
     ]
    }
   ],
   "source": [
    "# The prompt was assigned to the evaluator\n",
    "print(evaluator.prompt)"
   ]
  },
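  {
   "cell_type": "markdown",
   "id": "f1c9d0b1",
   "metadata": {},
   "source": [
    "If your prompt produces a verdict in a format the default `PairwiseStringResultOutputParser` does not handle, you can pass a custom parser as noted above. The class below is a rough, hypothetical sketch: it assumes the model's answer contains `[[A]]` or `[[B]]` and returns the same `reasoning` / `value` / `score` keys shown in the outputs above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1c9d0b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema import BaseOutputParser\n",
    "\n",
    "\n",
    "class SimpleVerdictParser(BaseOutputParser):\n",
    "    \"\"\"Hypothetical parser for prompts whose answers end with [[A]] or [[B]].\"\"\"\n",
    "\n",
    "    def parse(self, text: str) -> dict:\n",
    "        # Mirror the keys returned by the default parser: reasoning, value, score.\n",
    "        verdict = \"A\" if \"[[A]]\" in text else \"B\"\n",
    "        return {\n",
    "            \"reasoning\": text.strip(),\n",
    "            \"value\": verdict,\n",
    "            \"score\": 1 if verdict == \"A\" else 0,\n",
    "        }\n",
    "\n",
    "\n",
    "custom_parser_evaluator = load_evaluator(\n",
    "    \"pairwise_string\",\n",
    "    prompt=prompt_template,\n",
    "    output_parser=SimpleVerdictParser(),\n",
    "    requires_reference=True,\n",
    ")"
   ]
  },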
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "9467bb42-7a31-4071-8f66-9ed2c6f06dcd",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': \"Option A is most similar to the reference label. Both the reference label and option A state that the dog's name is Fido. Option B, on the other hand, gives a different name for the dog. Therefore, option A is the most similar to the reference label. \\n\",\n",
       " 'value': 'A',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"The dog that ate the ice cream was named fido.\",\n",
    "    prediction_b=\"The dog's name is spot\",\n",
    "    input=\"What is the name of the dog that ate the ice cream?\",\n",
    "    reference=\"The dog's name is fido\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}