langchain/docs/extras/guides/evaluation/comparison/pairwise_string.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2da95378",
   "metadata": {},
   "source": [
    "# Pairwise String Comparison\n",
    "\n",
    "Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The `StringComparison` evaluators facilitate this so you can answer questions like:\n",
    "\n",
    "- Which LLM or prompt produces a preferred output for a given question?\n",
    "- Which examples should I include for few-shot example selection?\n",
    "- Which output is better to include for fintetuning?\n",
    "\n",
    "The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the `pairwise_string` evaluator.\n",
    "\n",
    "Check out the reference docs for the [PairwiseStringEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.html#langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f6790c46",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"labeled_pairwise_string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "49ad9139",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Both responses are relevant to the question asked, as they both provide a numerical answer to the question about the number of dogs in the park. However, Response A is incorrect according to the reference answer, which states that there are four dogs. Response B, on the other hand, is correct as it matches the reference answer. Neither response demonstrates depth of thought, as they both simply provide a numerical answer without any additional information or context. \\n\\nBased on these criteria, Response B is the better response.\\n',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7491d2e6-4e77-4b17-be6b-7da966785c1d",
   "metadata": {},
   "source": [
    "## Methods\n",
    "\n",
    "\n",
    "The pairwise string evaluator can be called using [evaluate_string_pairs](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.html#langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.evaluate_string_pairs) (or async [aevaluate_string_pairs](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.html#langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.aevaluate_string_pairs)) methods, which accept:\n",
    "\n",
    "- prediction (str) – The predicted response of the first model, chain, or prompt.\n",
    "- prediction_b (str) – The predicted response of the second model, chain, or prompt.\n",
    "- input (str) – The input question, prompt, or other text.\n",
    "- reference (str) – (Only for the labeled_pairwise_string variant) The reference response.\n",
    "\n",
    "They return a dictionary with the following values:\n",
    "- value: 'A' or 'B', indicating whether `prediction` or `prediction_b` is preferred, respectively\n",
    "- score: Integer 0 or 1 mapped from the 'value', where a score of 1 would mean that the first `prediction` is preferred, and a score of 0 would mean `prediction_b` is preferred.\n",
    "- reasoning: String \"chain of thought reasoning\" from the LLM generated prior to creating the score"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed353b93-be71-4479-b9c0-8c97814c2e58",
   "metadata": {},
   "source": [
    "## Without References\n",
    "\n",
    "When references aren't available, you can still predict the preferred response.\n",
    "The results will reflect the evaluation model's preference, which is less reliable and may result\n",
    "in preferences that are factually incorrect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "586320da",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"pairwise_string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7f56c76e-a39b-4509-8b8a-8a2afe6c3da1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Both responses are correct and relevant to the question. However, Response B is more helpful and insightful as it provides a more detailed explanation of what addition is. Response A is correct but lacks depth as it does not explain what the operation of addition entails. \\n\\nFinal Decision: [[B]]',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"Addition is a mathematical operation.\",\n",
    "    prediction_b=\"Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.\",\n",
    "    input=\"What is addition?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a09b21d-9851-47e8-93d3-90044b2945b0",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Defining the Criteria\n",
    "\n",
    "By default, the LLM is instructed to select the 'preferred' response based on helpfulness, relevance, correctness, and depth of thought. You can customize the criteria by passing in a `criteria` argument, where the criteria could take any of the following forms:\n",
    "- [`Criteria`](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.Criteria.html#langchain.evaluation.criteria.eval_chain.Criteria) enum or its string value - to use one of the default criteria and their descriptions\n",
    "- [Constitutional principal](https://api.python.langchain.com/en/latest/chains/langchain.chains.constitutional_ai.models.ConstitutionalPrinciple.html#langchain.chains.constitutional_ai.models.ConstitutionalPrinciple) - use one any of the constitutional principles defined in langchain\n",
    "- Dictionary: a list of custom criteria, where the key is the name of the criteria, and the value is the description.\n",
    "- A list of criteria or constitutional principles - to combine multiple criteria in one.\n",
    "\n",
    "Below is an example for determining preferred writing responses based on a custom style."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8539e7d9-f7b0-4d32-9c45-593a7915c093",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "custom_criteria = {\n",
    "    \"simplicity\": \"Is the language straightforward and unpretentious?\",\n",
    "    \"clarity\": \"Are the sentences clear and easy to understand?\",\n",
    "    \"precision\": \"Is the writing precise, with no unnecessary words or details?\",\n",
    "    \"truthfulness\": \"Does the writing feel honest and sincere?\",\n",
    "    \"subtext\": \"Does the writing suggest deeper meanings or themes?\",\n",
    "}\n",
    "evaluator = load_evaluator(\"pairwise_string\", criteria=custom_criteria)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fec7bde8-fbdc-4730-8366-9d90d033c181",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Response A is simple, clear, and precise. It uses straightforward language to convey a deep and sincere message about families. The metaphor of joy and sorrow as music is effective and easy to understand.\\n\\nResponse B, on the other hand, is more complex and less clear. The language is more pretentious, with words like \"domicile,\" \"resounds,\" \"abode,\" \"dissonant,\" and \"elegy.\" While it conveys a similar message to Response A, it does so in a more convoluted way. The precision is also lacking due to the use of unnecessary words and details.\\n\\nBoth responses suggest deeper meanings or themes about the shared joy and unique sorrow in families. However, Response A does so in a more effective and accessible way.\\n\\nTherefore, the better response is [[A]].',\n",
       " 'value': 'A',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.\",\n",
    "    prediction_b=\"Where one finds a symphony of joy, every domicile of happiness resounds in harmonious,\"\n",
    "    \" identical notes; yet, every abode of despair conducts a dissonant orchestra, each\"\n",
    "    \" playing an elegy of grief that is peculiar and profound to its own existence.\",\n",
    "    input=\"Write some prose about families.\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a25b60b2-627c-408a-be4b-a2e5cbc10726",
   "metadata": {},
   "source": [
    "## Customize the LLM\n",
    "\n",
    "By default, the loader uses `gpt-4` in the evaluation chain. You can customize this when loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "de84a958-1330-482b-b950-68bcf23f9e35",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "llm = ChatAnthropic(temperature=0)\n",
    "\n",
    "evaluator = load_evaluator(\"labeled_pairwise_string\", llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e162153f-d50a-4a7c-a033-019dabbc954c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Here is my assessment:\\n\\nResponse B is more helpful, insightful, and accurate than Response A. Response B simply states \"4\", which directly answers the question by providing the exact number of dogs mentioned in the reference answer. In contrast, Response A states \"there are three dogs\", which is incorrect according to the reference answer. \\n\\nIn terms of helpfulness, Response B gives the precise number while Response A provides an inaccurate guess. For relevance, both refer to dogs in the park from the question. However, Response B is more correct and factual based on the reference answer. Response A shows some attempt at reasoning but is ultimately incorrect. Response B requires less depth of thought to simply state the factual number.\\n\\nIn summary, Response B is superior in terms of helpfulness, relevance, correctness, and depth. My final decision is: [[B]]\\n',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e89c13-d0ad-4f87-8fcb-814399bafa2a",
   "metadata": {},
   "source": [
    "## Customize the Evaluation Prompt\n",
    "\n",
    "You can use your own custom evaluation prompt to add more task-specific instructions or to instruct the evaluator to score the output.\n",
    "\n",
    "*Note: If you use a prompt that expects generates a result in a unique format, you may also have to pass in a custom output parser (`output_parser=your_parser()`) instead of the default `PairwiseStringResultOutputParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "fb817efa-3a4d-439d-af8c-773b89d97ec9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"\"\"Given the input context, which do you prefer: A or B?\n",
    "Evaluate based on the following criteria:\n",
    "{criteria}\n",
    "Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.\n",
    "\n",
    "DATA\n",
    "----\n",
    "input: {input}\n",
    "reference: {reference}\n",
    "A: {prediction}\n",
    "B: {prediction_b}\n",
    "---\n",
    "Reasoning:\n",
    "\n",
    "\"\"\"\n",
    ")\n",
    "evaluator = load_evaluator(\n",
    "    \"labeled_pairwise_string\", prompt=prompt_template\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d40aa4f0-cfd5-4cb4-83c8-8d2300a04c2f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input_variables=['prediction', 'reference', 'prediction_b', 'input'] output_parser=None partial_variables={'criteria': 'helpfulness: Is the submission helpful, insightful, and appropriate?\\nrelevance: Is the submission referring to a real quote from the text?\\ncorrectness: Is the submission correct, accurate, and factual?\\ndepth: Does the submission demonstrate depth of thought?'} template='Given the input context, which do you prefer: A or B?\\nEvaluate based on the following criteria:\\n{criteria}\\nReason step by step and finally, respond with either [[A]] or [[B]] on its own line.\\n\\nDATA\\n----\\ninput: {input}\\nreference: {reference}\\nA: {prediction}\\nB: {prediction_b}\\n---\\nReasoning:\\n\\n' template_format='f-string' validate_template=True\n"
     ]
    }
   ],
   "source": [
    "# The prompt was assigned to the evaluator\n",
    "print(evaluator.prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "9467bb42-7a31-4071-8f66-9ed2c6f06dcd",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Helpfulness: Both A and B are helpful as they provide a direct answer to the question.\\nRelevance: A is relevant as it refers to the correct name of the dog from the text. B is not relevant as it provides a different name.\\nCorrectness: A is correct as it accurately states the name of the dog. B is incorrect as it provides a different name.\\nDepth: Both A and B demonstrate a similar level of depth as they both provide a straightforward answer to the question.\\n\\nGiven these evaluations, the preferred response is:\\n',\n",
       " 'value': 'A',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"The dog that ate the ice cream was named fido.\",\n",
    "    prediction_b=\"The dog's name is spot\",\n",
    "    input=\"What is the name of the dog that ate the ice cream?\",\n",
    "    reference=\"The dog's name is fido\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}