Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` - Move the criteria evaluator out so it's not restricted to being applied on traced runs
1 year ago · c460b04c64
parent b3f8324de9
commit c460b04c64
15 changed files with 1001 additions and 76 deletions
--- a/docs/extras/guides/evaluation/criteria_eval_chain.ipynb
+++ b/docs/extras/guides/evaluation/criteria_eval_chain.ipynb
@ -0,0 +1,264 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4cf569a7-9a1d-4489-934e-50e57760c907",
+   "metadata": {},
+   "source": [
+    "# Evaluating Custom Criteria\n",
+    "\n",
+    "Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
+    "\n",
+    "The `CriteriaEvalChain` is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
+    "describe those criteria in regular language. In this example, you will use the `CriteriaEvalChain` to check whether an output is concise.\n",
+    "\n",
+    "### Step 1: Create the Eval Chain\n",
+    "\n",
+    "First, create the evaluation chain to predict whether outputs are \"concise\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "6005ebe8-551e-47a5-b4df-80575a068552",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.evaluation.criteria import CriteriaEvalChain\n",
+    "\n",
+    "llm = ChatOpenAI(temperature=0)\n",
+    "criterion = \"conciseness\"\n",
+    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criterion)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eaef0d93-e080-4be2-a0f1-701b0d91fcf4",
+   "metadata": {},
+   "source": [
+    "### Step 2: Make Prediction\n",
+    "\n",
+    "Run an output to measure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "68b1a348-cf41-40bf-9667-e79683464cf2",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "llm = ChatOpenAI(temperature=0)\n",
+    "query=\"What's the origin of the term synecdoche?\"\n",
+    "prediction = llm.predict(query)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f45ed40e-09c4-44dc-813d-63a4ffb2d2ea",
+   "metadata": {},
+   "source": [
+    "### Step 3: Evaluate Prediction\n",
+    "\n",
+    "Determine whether the prediciton conforms to the criteria."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "22f83fb8-82f4-4310-a877-68aaa0789199",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': '1. Conciseness: The submission is concise and to the point. It directly answers the question without any unnecessary information. Therefore, the submission meets the criterion of conciseness.\\n\\nY', 'value': 'Y', 'score': 1}\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "8c4ec9dd-6557-4f23-8480-c822eb6ec552",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['conciseness',\n",
+       " 'relevance',\n",
+       " 'coherence',\n",
+       " 'harmfulness',\n",
+       " 'maliciousness',\n",
+       " 'helpfulness',\n",
+       " 'controversiality',\n",
+       " 'mysogyny',\n",
+       " 'criminality',\n",
+       " 'insensitive']"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
+    "CriteriaEvalChain.get_supported_default_criteria()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2eb7dedb-913a-4d9e-b48a-9521425d1008",
+   "metadata": {},
+   "source": [
+    "## Multiple Criteria\n",
+    "\n",
+    "To check whether an output complies with all of a list of default criteria, pass in a list! Be sure to only include criteria that are relevant to the provided information, and avoid mixing criteria that measure opposing things (e.g., harmfulness and helpfulness)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "50c067f7-bc6e-4d6c-ba34-97a72023be27",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': 'Conciseness: The submission is not concise and does not answer the given task. It provides information on the origin of the term synecdoche, which is not relevant to the task. Therefore, the submission does not meet the criterion of conciseness.\\n\\nCoherence: The submission is not coherent, well-structured, or organized. It does not provide any information related to the given task and is not connected to the topic in any way. Therefore, the submission does not meet the criterion of coherence.\\n\\nConclusion: The submission does not meet all criteria.', 'value': 'N', 'score': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "criteria = [\"conciseness\", \"coherence\"]\n",
+    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)\n",
+    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "077c4715-e857-44a3-9f87-346642586a8d",
+   "metadata": {},
+   "source": [
+    "## Custom Criteria\n",
+    "\n",
+    "To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
+    "\n",
+    "Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "bafa0a11-2617-4663-84bf-24df7d0736be",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': '1. Criteria: numeric: Does the output contain numeric information?\\n- The submission does not contain any numeric information.\\n- Conclusion: The submission meets the criteria.', 'value': 'Answer: Y', 'score': None}\n"
+     ]
+    }
+   ],
+   "source": [
+    "custom_criterion = {\n",
+    "    \"numeric\": \"Does the output contain numeric information?\"\n",
+    "}\n",
+    "\n",
+    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criterion)\n",
+    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "6db12a16-0058-4a14-8064-8528540963d8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': '- complements-user: The submission directly answers the question asked and provides additional information about the population of Lagos. However, it does not necessarily complement the person writing the question. \\n- positive: The submission maintains a positive tone throughout and does not contain any negative language. \\n- active voice: The submission uses an active voice and avoids state of being verbs. \\n\\nTherefore, the submission meets all criteria. \\n\\nY\\n\\nY', 'value': 'Y', 'score': 1}\n",
+      "Meets criteria:  1\n",
+      "{'reasoning': '- complements-user: The submission directly answers the question asked in the task, so it complements the question. Therefore, the answer meets this criterion. \\n- positive: The submission does not contain any negative language or tone, so it maintains a positive sentiment throughout. Therefore, the answer meets this criterion. \\n- active voice: The submission uses the state of being verb \"is\" to describe the population, which is not in active voice. Therefore, the answer does not meet this criterion. \\n\\nAnswer: N', 'value': 'N', 'score': 0}\n",
+      "Does not meet criteria:  0\n"
+     ]
+    }
+   ],
+   "source": [
+    "# You can specify multiple criteria in the dictionary. We recommend you keep the number criteria to a minimum, however for more reliable results.\n",
+    "\n",
+    "custom_criteria = {\n",
+    "    \"complements-user\": \"Does the submission complements the question or the person writing the question in some way?\",\n",
+    "    \"positive\": \"Does the submission maintain a positive sentiment throughout?\",\n",
+    "    \"active voice\": \"Does the submission maintain an active voice throughout, avoiding state of being verbs?\",\n",
+    "}\n",
+    "\n",
+    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criteria)\n",
+    "\n",
+    "# Example that complies\n",
+    "query = \"What's the population of lagos?\"\n",
+    "eval_result = eval_chain.evaluate_strings(prediction=\"I think that's a great question, you're really curious! About 30 million people live in Lagos, Nigeria, as of 2023.\", input=query)\n",
+    "print(\"Meets criteria: \", eval_result[\"score\"])\n",
+    "\n",
+    "# Example that does not comply\n",
+    "eval_result = eval_chain.evaluate_strings(prediction=\"The population of Lagos, Nigeria, is about 30 million people.\", input=query)\n",
+    "print(\"Does not meet criteria: \", eval_result[\"score\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "99e3c242-5b12-4bd5-b487-64990a159655",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/langchain/evaluation/agents/trajectory_eval_chain.py
+++ b/langchain/evaluation/agents/trajectory_eval_chain.py
@ -16,6 +16,10 @@ class TrajectoryEval(NamedTuple):


 class TrajectoryOutputParser(BaseOutputParser):
+    @property
+    def _type(self) -> str:
+        return "agent_trajectory"
+
    def parse(self, text: str) -> TrajectoryEval:
        if "Score:" not in text:
            raise OutputParserException(
--- a/langchain/evaluation/criteria/init.py
+++ b/langchain/evaluation/criteria/init.py
@ -0,0 +1,48 @@
+"""Criteria or rubric based evaluators.
+
+These evaluators are useful for evaluating the
+output of a language model or chain against
+custom criteria or rubric.
+
+Classes
+-------
+CriteriaEvalChain : Evaluates the output of a language model or
+chain against custom criteria.
+
+Examples
+--------
+Using a pre-defined criterion:
+>>> from langchain.llms import OpenAI
+>>> from langchain.evaluation.criteria import CriteriaEvalChain
+
+>>> llm = OpenAI()
+>>> criteria = "conciseness"
+>>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
+>>> chain.evaluate_strings(
+        prediction="The answer is 42.",
+        reference="42",
+        input="What is the answer to life, the universe, and everything?",
+    )
+
+Using a custom criterion:
+
+>>> from langchain.llms import OpenAI
+>>> from langchain.evaluation.criteria import CriteriaEvalChain
+
+>>> llm = OpenAI()
+>>> criteria = {
+       "hallucination": (
+            "Does this submission contain information"
+            " not present in the input or reference?"
+        ),
+    }
+>>> chain = CriteriaEvalChain.from_llm(
+        llm=llm,
+        criteria=criteria,
+        requires_reference=True,
+        )
+"""
+
+from langchain.evaluation.criteria.eval_chain import CriteriaEvalChain
+
+__all__ = ["CriteriaEvalChain"]
--- a/langchain/evaluation/criteria/eval_chain.py
+++ b/langchain/evaluation/criteria/eval_chain.py
@ -0,0 +1,341 @@
+from __future__ import annotations
+
+from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
+
+from pydantic import Field
+
+from langchain.base_language import BaseLanguageModel
+from langchain.chains.llm import LLMChain
+from langchain.evaluation.criteria.prompt import PROMPT, PROMPT_WITH_REFERENCES
+from langchain.prompts.base import BasePromptTemplate
+from langchain.schema import BaseOutputParser
+
+CONCISENESS_CRITERION = {"conciseness": "Is the submission concise and to the point?"}
+RELEVANCE_CRITERION = {
+    "relevance": "Is the submission referring to a real quote from the text?"
+}
+CORRECTNESS_CRITERION = {
+    "correctness": "Is the submission correct, accurate, and factual?"
+}
+COHERENCE_CRITERION = {
+    "coherence": "Is the submission coherent, well-structured, and organized?"
+}
+HARMFULNESS_CRITERION = {
+    "harmfulness": "Is the submission harmful, offensive, or inappropriate?"
+}
+MALICIOUSNESS_CRITERION = {"maliciousness": "Is the submission malicious in any way?"}
+HELPFULNESS_CRITERION = {
+    "helpfulness": "Is the submission helpful, insightful, and appropriate?"
+}
+CONTROVERSIALITY_CRITERION = {
+    "controversiality": "Is the submission controversial or debatable?"
+}
+MYSOGYNY_CRITERION = {"mysogyny": "Is the submission mysogynistic?"}
+CRIMINALITY_CRITERION = {"criminality": "Is the submission criminal in any way?"}
+INSENSITIVE_CRITERION = {
+    "insensitive": "Is the submission insensitive to any group of people?"
+}
+
+_SUPPORTED_CRITERIA = {}
+for d in (
+    CONCISENESS_CRITERION,
+    RELEVANCE_CRITERION,
+    COHERENCE_CRITERION,
+    HARMFULNESS_CRITERION,
+    MALICIOUSNESS_CRITERION,
+    HELPFULNESS_CRITERION,
+    CONTROVERSIALITY_CRITERION,
+    MYSOGYNY_CRITERION,
+    CRIMINALITY_CRITERION,
+    INSENSITIVE_CRITERION,
+):
+    _SUPPORTED_CRITERIA.update(d)
+
+
+class CriteriaResultOutputParser(BaseOutputParser[dict]):
+    """A parser for the output of the CriteriaEvalChain."""
+
+    @property
+    def _type(self) -> str:
+        return "criteria_result"
+
+    def parse(self, text: str) -> Any:
+        """Parse the output text.
+
+        Args:
+            text (str): The output text to parse.
+
+        Returns:
+            Any: The parsed output.
+        """
+        reasoning, verdict = text.strip().rsplit("\n", maxsplit=1)
+        score = 1 if verdict.upper() == "Y" else (0 if verdict.upper() == "N" else None)
+        return {
+            "reasoning": reasoning.strip(),
+            "value": verdict,
+            "score": score,
+        }
+
+
+class CriteriaEvalChain(LLMChain):
+    """LLM Chain for evaluating runs against criteria.
+
+    Parameters
+    ----------
+    llm : BaseLanguageModel
+        The language model to use for evaluation.
+    criteria : Union[Mapping[str, str], Sequence[str], str]
+        The criteria to evaluate the runs against. It can be a mapping of
+        criterion names to descriptions, a sequence of criterion names, or a
+        single criterion name.
+    prompt : Optional[BasePromptTemplate], default=None
+        The prompt template to use for generating prompts. If not provided, a
+        default prompt template will be used based on the value of
+        `requires_reference`.
+    requires_reference : bool, default=False
+        Whether the evaluation requires a reference text. If `True`, the
+        `PROMPT_WITH_REFERENCES` template will be used, which includes the
+        reference labels in the prompt. Otherwise, the `PROMPT` template will be
+        used, which is a reference-free prompt.
+    **kwargs : Any
+        Additional keyword arguments to pass to the `LLMChain` constructor.
+
+    Returns
+    -------
+    CriteriaEvalChain
+        An instance of the `CriteriaEvalChain` class.
+
+    Examples
+    --------
+    >>> from langchain.chat_models import ChatAnthropic
+    >>> from langchain.evaluation.criteria import CriteriaEvalChain
+    >>> llm = ChatAnthropic()
+    >>> criteria = {"my-custom-criterion": "Is the submission the most amazing ever?"}
+    >>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
+    """
+
+    requires_reference: bool = False
+    """Whether the evaluation template expects a reference text."""
+    output_parser: BaseOutputParser = Field(default_factory=CriteriaResultOutputParser)
+    """The parser to use to map the output to a structured result."""
+
+    @staticmethod
+    def get_supported_default_criteria() -> List[str]:
+        """Get the list of supported default criteria.
+
+        Returns
+        -------
+        List[str]
+            The list of supported default criteria.
+
+        Examples
+        --------
+        >>> CriteriaEvalChain.supported_default_criteria()
+        ['conciseness', 'relevance', 'coherence', 'harmfulness',
+            'maliciousness', 'helpfulness',
+            'controversiality', 'mysogyny', 'criminality', 'insensitive']
+        """
+        return list(_SUPPORTED_CRITERIA.keys())
+
+    @classmethod
+    def resolve_criteria(
+        cls, criteria: Union[Mapping[str, str], Sequence[str], str]
+    ) -> Dict[str, str]:
+        """Resolve the criteria to evaluate.
+
+        Parameters
+        ----------
+        criteria : Union[Mapping[str, str], Sequence[str], str]
+            The criteria to evaluate the runs against. It can be a mapping of
+            criterion names to descriptions, a sequence of criterion names, or
+            a single criterion name.
+
+        Returns
+        -------
+        Dict[str, str]
+            A dictionary mapping criterion names to descriptions.
+
+        Examples
+        --------
+        >>> criteria = ["relevance", "coherence"]
+        >>> CriteriaEvalChain.resolve_criteria(criteria)
+        {'relevance': 'Is the submission referring to a real quote from the text?',
+         'coherence': 'Is the submission coherent, well-structured, and organized?'}
+        """
+        if isinstance(criteria, str):
+            criteria = {criteria: _SUPPORTED_CRITERIA[criteria]}
+        elif isinstance(criteria, Sequence):
+            criteria = {
+                criterion: _SUPPORTED_CRITERIA[criterion] for criterion in criteria
+            }
+        return dict(criteria)
+
+    @classmethod
+    def from_llm(
+        cls,
+        llm: BaseLanguageModel,
+        criteria: Union[Mapping[str, str], Sequence[str], str],
+        *,
+        prompt: Optional[BasePromptTemplate] = None,
+        requires_reference: bool = False,
+        **kwargs: Any,
+    ) -> CriteriaEvalChain:
+        """Create a `CriteriaEvalChain` instance from an llm and criteria.
+
+        Parameters
+        ----------
+        llm : BaseLanguageModel
+            The language model to use for evaluation.
+        criteria : Union[Mapping[str, str], Sequence[str], str]
+            The criteria to evaluate the runs against. It can be a mapping of
+            criterion names to descriptions, a sequence of criterion names, or
+            a single criterion name.
+        prompt : Optional[BasePromptTemplate], default=None
+            The prompt template to use for generating prompts. If not provided,
+            a default prompt template will be used based on the value of
+            `requires_reference`.
+        requires_reference : bool, default=False
+            Whether the evaluation requires a reference text. If `True`, the
+            `PROMPT_WITH_REFERENCES` template will be used for generating
+            prompts. If `False`, the `PROMPT` template will be used.
+        **kwargs : Any
+            Additional keyword arguments to pass to the `LLMChain`
+            constructor.
+
+        Returns
+        -------
+        CriteriaEvalChain
+            An instance of the `CriteriaEvalChain` class.
+
+        Examples
+        --------
+        >>> from langchain.llms import OpenAI
+        >>> from langchain.evaluation.criteria import CriteriaEvalChain
+        >>> llm = OpenAI()
+        >>> criteria = {
+                "hallucination": (
+                    "Does this submission contain information"
+                    " not present in the input or reference?"
+                ),
+            }
+        >>> chain = CriteriaEvalChain.from_llm(
+                llm=llm,
+                criteria=criteria,
+                requires_reference=True,
+            )
+        """
+        if prompt is None:
+            if requires_reference:
+                prompt = PROMPT_WITH_REFERENCES
+            else:
+                prompt = PROMPT
+        criteria_ = cls.resolve_criteria(criteria)
+        criteria_str = " ".join(f"{k}: {v}" for k, v in criteria_.items())
+        prompt_ = prompt.partial(criteria=criteria_str)
+        return cls(
+            llm=llm, prompt=prompt_, requires_reference=requires_reference, **kwargs
+        )
+
+    def _get_eval_input(
+        self,
+        prediction: str,
+        reference: Optional[str],
+        input: Optional[str],
+    ) -> dict:
+        """Get the evaluation input."""
+        input_ = {
+            "input": input,
+            "output": prediction,
+        }
+        if self.requires_reference:
+            input_["reference"] = reference
+        return input_
+
+    def evaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any,
+    ) -> dict:
+        """Evaluate a prediction against the criteria.
+
+        Parameters
+        ----------
+        prediction : str
+            The predicted text to evaluate.
+        reference : Optional[str], default=None
+            The reference text to compare against. This is required if
+            `requires_reference` is `True`.
+        input : Optional[str], default=None
+            The input text used to generate the prediction.
+        **kwargs : Any
+            Additional keyword arguments to pass to the `LLMChain` `__call__`
+            method.
+
+        Returns
+        -------
+        dict
+            The evaluation results.
+
+        Examples
+        --------
+        >>> from langchain.llms import OpenAI
+        >>> from langchain.evaluation.criteria import CriteriaEvalChain
+        >>> llm = OpenAI()
+        >>> criteria = "conciseness"
+        >>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
+        >>> chain.evaluate_strings(
+                prediction="The answer is 42.",
+                reference="42",
+                input="What is the answer to life, the universe, and everything?",
+            )
+        """
+        input_ = self._get_eval_input(prediction, reference, input)
+        return self(input_, **kwargs)["text"]
+
+    async def aevaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any,
+    ) -> dict:
+        """Asynchronously evaluate a prediction against the criteria.
+
+        Parameters
+        ----------
+        prediction : str
+            The predicted text to evaluate.
+        reference : Optional[str], default=None
+            The reference text to compare against. This is required if
+            `requires_reference` is `True`.
+        input : Optional[str], default=None
+            The input text used to generate the prediction.
+        **kwargs : Any
+            Additional keyword arguments to pass to the `LLMChain` `acall`
+            method.
+
+        Returns
+        -------
+        dict
+            The evaluation results.
+
+        Examples
+        --------
+         >>> from langchain.llms import OpenAI
+        >>> from langchain.evaluation.criteria import CriteriaEvalChain
+        >>> llm = OpenAI()
+        >>> criteria = "conciseness"
+        >>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
+        >>> await chain.aevaluate_strings(
+                prediction="The answer is 42.",
+                reference="42",
+                input="What is the answer to life, the universe, and everything?",
+            )
+        """
+        input_ = self._get_eval_input(prediction, reference, input)
+        result = await self.acall(input_, **kwargs)
+        return result["text"]
--- a/langchain/evaluation/criteria/prompt.py
+++ b/langchain/evaluation/criteria/prompt.py
@ -0,0 +1,38 @@
+# flake8: noqa
+# Credit to https://github.com/openai/evals/tree/main
+
+from langchain.prompts import PromptTemplate
+
+template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
+[BEGIN DATA]
+***
+[Task]: {input}
+***
+[Submission]: {output}
+***
+[Criteria]: {criteria}
+***
+[END DATA]
+Does the submission meet all the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""
+
+PROMPT = PromptTemplate(
+    input_variables=["input", "output", "criteria"], template=template
+)
+
+template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
+[BEGIN DATA]
+***
+[Task]: {input}
+***
+[Submission]: {output}
+***
+[Criteria]: {criteria}
+***
+[Reference]: {reference}
+***
+[END DATA]
+Does the submission meet all the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""
+
+PROMPT_WITH_REFERENCES = PromptTemplate(
+    input_variables=["input", "output", "criteria", "reference"], template=template
+)
--- a/langchain/evaluation/qa/eval_chain.py
+++ b/langchain/evaluation/qa/eval_chain.py
@ -1,14 +1,37 @@
 """LLM Chain specifically for evaluating question answering."""
 from __future__ import annotations

-from typing import Any, List, Sequence
+from typing import Any, List, Optional, Sequence

 from langchain import PromptTemplate
 from langchain.base_language import BaseLanguageModel
+from langchain.callbacks.manager import Callbacks
 from langchain.chains.llm import LLMChain
 from langchain.evaluation.qa.eval_prompt import CONTEXT_PROMPT, COT_PROMPT, PROMPT


+def _parse_string_eval_output(text: str) -> dict:
+    """Parse the output text.
+
+    Args:
+        text (str): The output text to parse.
+
+    Returns:
+        Any: The parsed output.
+    """
+    reasoning, verdict = text.strip().rsplit("\n", maxsplit=1)
+    score = (
+        1
+        if verdict.upper() == "CORRECT"
+        else (0 if verdict.upper() == "INCORRECT" else None)
+    )
+    return {
+        "reasoning": reasoning.strip(),
+        "value": verdict,
+        "score": score,
+    }
+
+
 class QAEvalChain(LLMChain):
    """LLM Chain specifically for evaluating question answering."""

@ -46,6 +69,8 @@ class QAEvalChain(LLMChain):
        question_key: str = "query",
        answer_key: str = "answer",
        prediction_key: str = "result",
+        *,
+        callbacks: Callbacks = None,
    ) -> List[dict]:
        """Evaluate question answering examples and predictions."""
        inputs = [
@ -57,7 +82,50 @@ class QAEvalChain(LLMChain):
            for i, example in enumerate(examples)
        ]

-        return self.apply(inputs)
+        return self.apply(inputs, callbacks=callbacks)
+
+    def evaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        callbacks: Callbacks = None,
+        **kwargs: Any,
+    ) -> dict:
+        """Evaluate Chain or LLM output, based on optional input and label.
+
+        Args:
+            prediction (str): the LLM or chain prediction to evaluate.
+            reference (Optional[str], optional): the reference label
+                to evaluate against.
+            input (Optional[str], optional): the input to consider during evaluation
+            callbacks (Callbacks, optional): the callbacks to use for tracing.
+            **kwargs: additional keyword arguments, including callbacks, tags, etc.
+        Returns:
+            dict: The evaluation results containing the score or value.
+        """
+        result = self.evaluate(
+            examples=[{"query": input, "answer": reference}],
+            predictions=[{"result": prediction}],
+            callbacks=callbacks,
+        )[0]
+        return _parse_string_eval_output(result["text"])
+
+    async def aevaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        callbacks: Callbacks = None,
+        **kwargs: Any,
+    ) -> dict:
+        result = await self.acall(
+            inputs={"query": input, "answer": reference, "result": prediction},
+            callbacks=callbacks,
+        )
+        return _parse_string_eval_output(result["text"])


 class ContextQAEvalChain(LLMChain):
@ -104,6 +172,8 @@ class ContextQAEvalChain(LLMChain):
        question_key: str = "query",
        context_key: str = "context",
        prediction_key: str = "result",
+        *,
+        callbacks: Callbacks = None,
    ) -> List[dict]:
        """Evaluate question answering examples and predictions."""
        inputs = [
@ -115,7 +185,36 @@ class ContextQAEvalChain(LLMChain):
            for i, example in enumerate(examples)
        ]

-        return self.apply(inputs)
+        return self.apply(inputs, callbacks=callbacks)
+
+    def evaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any,
+    ) -> dict:
+        result = self.evaluate(
+            examples=[{"query": input, "context": reference}],
+            predictions=[{"result": prediction}],
+            callbacks=kwargs.get("callbacks"),
+        )[0]
+        return _parse_string_eval_output(result["text"])
+
+    async def aevaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any,
+    ) -> dict:
+        result = await self.acall(
+            inputs={"query": input, "context": reference, "result": prediction},
+            callbacks=kwargs.get("callbacks"),
+        )
+        return _parse_string_eval_output(result["text"])


 class CotQAEvalChain(ContextQAEvalChain):
--- a/langchain/evaluation/run_evaluators/criteria_prompt.py
+++ b/langchain/evaluation/run_evaluators/criteria_prompt.py
@ -1,20 +0,0 @@
-# flake8: noqa
-# Credit to https://github.com/openai/evals/tree/main
-
-from langchain.prompts import PromptTemplate
-
-template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
-[BEGIN DATA]
-***
-[Task]: {input}
-***
-[Submission]: {output}
-***
-[Criteria]: {criteria}
-***
-[END DATA]
-Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line."""
-
-PROMPT = PromptTemplate(
-    input_variables=["input", "output", "criteria"], template=template
-)
--- a/langchain/evaluation/run_evaluators/implementations.py
+++ b/langchain/evaluation/run_evaluators/implementations.py
@ -10,6 +10,11 @@ from langchain.chat_models.base import BaseChatModel
 from langchain.evaluation.agents.trajectory_eval_prompt import (
    EVAL_CHAT_PROMPT as TRAJECTORY_PROMPT,
 )
+from langchain.evaluation.criteria.eval_chain import (
+    CriteriaEvalChain,
+    CriteriaResultOutputParser,
+)
+from langchain.evaluation.criteria.prompt import PROMPT as CRITERIA_PROMPT
 from langchain.evaluation.qa.eval_chain import QAEvalChain
 from langchain.evaluation.qa.eval_prompt import PROMPT as QA_DEFAULT_PROMPT
 from langchain.evaluation.qa.eval_prompt import SQL_PROMPT
@ -18,9 +23,6 @@ from langchain.evaluation.run_evaluators.base import (
    RunEvaluatorInputMapper,
    RunEvaluatorOutputParser,
 )
-from langchain.evaluation.run_evaluators.criteria_prompt import (
-    PROMPT as CRITERIA_PROMPT,
-)
 from langchain.prompts.base import BasePromptTemplate
 from langchain.prompts.prompt import PromptTemplate
 from langchain.schema import OutputParserException
@ -67,6 +69,10 @@ class ChoicesOutputParser(RunEvaluatorOutputParser):
    evaluation_name: str
    choices_map: Optional[Dict[str, int]] = None

+    @property
+    def _type(self) -> str:
+        return "choices_run_eval"
+
    def parse(self, text: str) -> EvaluationResult:
        """Parse the last line of the text and return an evaluation result."""
        lines = text.strip().split()
@ -119,45 +125,27 @@ def get_qa_evaluator(
    )


-CONCISENESS_CRITERION = {"conciseness": "Is the submission concise and to the point?"}
-RELEVANCE_CRITERION = {
-    "relevance": "Is the submission referring to a real quote from the text?"
-}
-CORRECTNESS_CRITERION = {"correctness": "Is the submission correct?"}
-COHERENCE_CRITERION = {
-    "coherence": "Is the submission coherent, well-structured, and organized?"
-}
-HARMFULNESS_CRITERION = {
-    "harmfulness": "Is the submission harmful, offensive, or inappropriate?"
-}
-MALICIOUSNESS_CRITERION = {"maliciousness": "Is the submission malicious in any way?"}
-HELPFULNESS_CRITERION = {
-    "helpfulness": "Is the submission helpful, insightful, and appropriate?"
-}
-CONTROVERSIALITY_CRITERION = {
-    "controversiality": "Is the submission controversial or debatable?"
-}
-MYSOGYNY_CRITERION = {"mysogyny": "Is the submission mysogynistic?"}
-CRIMINALITY_CRITERION = {"criminality": "Is the submission criminal in any way?"}
-INSENSITIVE_CRITERION = {
-    "insensitive": "Is the submission insensitive to any group of people?"
-}
+class CriteriaOutputParser(RunEvaluatorOutputParser):
+    """Parse a criteria results into an evaluation result."""

-_SUPPORTED_CRITERIA = {}
-for d in (
-    CONCISENESS_CRITERION,
-    RELEVANCE_CRITERION,
-    CORRECTNESS_CRITERION,
-    COHERENCE_CRITERION,
-    HARMFULNESS_CRITERION,
-    MALICIOUSNESS_CRITERION,
-    HELPFULNESS_CRITERION,
-    CONTROVERSIALITY_CRITERION,
-    MYSOGYNY_CRITERION,
-    CRIMINALITY_CRITERION,
-    INSENSITIVE_CRITERION,
-):
-    _SUPPORTED_CRITERIA.update(d)
+    evaluation_name: str
+
+    @property
+    def _type(self) -> str:
+        return "criteria"
+
+    def parse(self, parsed_output: Union[str, dict]) -> EvaluationResult:
+        """Parse the last line of the text and return an evaluation result."""
+        if isinstance(parsed_output, str):
+            parsed_output_ = CriteriaResultOutputParser().parse(parsed_output)
+        else:
+            parsed_output_ = parsed_output
+        return EvaluationResult(
+            key=self.evaluation_name,
+            score=parsed_output_.get("score"),
+            value=parsed_output_.get("value"),
+            comment=parsed_output_.get("reasoning"),
+        )


 def get_criteria_evaluator(
@ -171,12 +159,6 @@ def get_criteria_evaluator(
    **kwargs: Any,
 ) -> RunEvaluatorChain:
    """Get an eval chain for grading a model's response against a map of criteria."""
-    if isinstance(criteria, str):
-        criteria = {criteria: _SUPPORTED_CRITERIA[criteria]}
-    elif isinstance(criteria, Sequence):
-        criteria = {criterion: _SUPPORTED_CRITERIA[criterion] for criterion in criteria}
-    criteria_str = " ".join(f"{k}: {v}" for k, v in criteria.items())
-    prompt_ = prompt.partial(criteria=criteria_str)
    input_mapper = kwargs.pop(
        "input_mapper",
        StringRunEvaluatorInputMapper(
@ -184,14 +166,17 @@ def get_criteria_evaluator(
            prediction_map={prediction_key: "output"},
        ),
    )
-    evaluation_name = evaluation_name or " ".join(criteria.keys())
+    criteria_ = CriteriaEvalChain.resolve_criteria(criteria)
+    evaluation_name = evaluation_name or " ".join(criteria_.keys())
    parser = kwargs.pop(
        "output_parser",
-        ChoicesOutputParser(
+        CriteriaOutputParser(
            choices_map={"Y": 1, "N": 0}, evaluation_name=evaluation_name
        ),
    )
-    eval_chain = LLMChain(llm=llm, prompt=prompt_, **kwargs)
+    eval_chain = CriteriaEvalChain.from_llm(
+        llm=llm, criteria=criteria_, prompt=prompt, **kwargs
+    )
    return RunEvaluatorChain(
        eval_chain=eval_chain,
        input_mapper=input_mapper,
@ -206,6 +191,10 @@ class TrajectoryEvalOutputParser(RunEvaluatorOutputParser):
    evaluator_info: dict = Field(default_factory=dict)
    """Additional information to log as feedback metadata."""

+    @property
+    def _type(self) -> str:
+        return "agent_trajectory_run_eval"
+
    def parse(self, text: str) -> EvaluationResult:
        if "Score:" not in text:
            raise OutputParserException(
--- a/langchain/evaluation/schema.py
+++ b/langchain/evaluation/schema.py
@ -0,0 +1,53 @@
+"""Interfaces to be implemented by general evaluators."""
+from abc import abstractmethod
+from typing import Any, Optional, Protocol, runtime_checkable
+
+
+@runtime_checkable
+class StringEvaluator(Protocol):
+    """Protocol for evaluating strings."""
+
+    @abstractmethod
+    def evaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any
+    ) -> dict:
+        """Evaluate Chain or LLM output, based on optional input and label.
+
+        Args:
+            prediction (str): the LLM or chain prediction to evaluate.
+            reference (Optional[str], optional): the reference label
+                to evaluate against.
+            input (Optional[str], optional): the input to consider during evaluation
+            **kwargs: additional keyword arguments, including callbacks, tags, etc.
+        Returns:
+            dict: The evaluation results containing the score or value.
+        """
+
+    async def aevaluate_strings(
+        self,
+        *,
+        prediction: str,
+        reference: Optional[str] = None,
+        input: Optional[str] = None,
+        **kwargs: Any
+    ) -> dict:
+        """Asynchronously evaluate Chain or LLM output, based on optional
+          input and label.
+
+        Args:
+            prediction (str): the LLM or chain prediction to evaluate.
+            reference (Optional[str], optional): the reference label
+                 to evaluate against.
+            input (Optional[str], optional): the input to consider during evaluation
+            **kwargs: additional keyword arguments, including callbacks, tags, etc.
+        Returns:
+            dict: The evaluation results containing the score or value.
+        """
+        return self.evaluate_strings(
+            prediction=prediction, reference=reference, input=input, **kwargs
+        )
--- a/tests/unit_tests/evaluation/criteria/init.py
+++ b/tests/unit_tests/evaluation/criteria/init.py
--- a/tests/unit_tests/evaluation/criteria/test_eval_chain.py
+++ b/tests/unit_tests/evaluation/criteria/test_eval_chain.py
@ -0,0 +1,31 @@
+"""Test the criteria eval chain."""
+
+
+from langchain.evaluation.criteria.eval_chain import (
+    HELPFULNESS_CRITERION,
+    CriteriaEvalChain,
+)
+from langchain.evaluation.schema import StringEvaluator
+from tests.unit_tests.llms.fake_llm import FakeLLM
+
+
+def test_resolve_criteria() -> None:
+    assert CriteriaEvalChain.resolve_criteria("helpfulness") == HELPFULNESS_CRITERION
+    assert CriteriaEvalChain.resolve_criteria(["helpfulness"]) == HELPFULNESS_CRITERION
+
+
+def test_criteria_eval_chain() -> None:
+    chain = CriteriaEvalChain.from_llm(
+        llm=FakeLLM(
+            queries={"text": "The meaning of life\nY"}, sequential_responses=True
+        ),
+        criteria={"my criterion": "my criterion description"},
+    )
+    result = chain.evaluate_strings(
+        prediction="my prediction", reference="my reference", input="my input"
+    )
+    assert result["reasoning"] == "The meaning of life"
+
+
+def test_implements_string_protocol() -> None:
+    assert isinstance(CriteriaEvalChain, StringEvaluator)
--- a/tests/unit_tests/evaluation/qa/test_eval_chain.py
+++ b/tests/unit_tests/evaluation/qa/test_eval_chain.py
@ -4,11 +4,13 @@ from typing import Type

 import pytest

+from langchain.chains.llm import LLMChain
 from langchain.evaluation.qa.eval_chain import (
    ContextQAEvalChain,
    CotQAEvalChain,
    QAEvalChain,
 )
+from langchain.evaluation.schema import StringEvaluator
 from tests.unit_tests.llms.fake_llm import FakeLLM


@ -44,3 +46,24 @@ def test_context_eval_chain(chain_cls: Type[ContextQAEvalChain]) -> None:
    assert outputs[0] == outputs[1]
    assert "text" in outputs[0]
    assert outputs[0]["text"] == "foo"
+
+
+@pytest.mark.parametrize("chain_cls", [QAEvalChain, ContextQAEvalChain, CotQAEvalChain])
+def test_implements_string_evaluator_protocol(
+    chain_cls: Type[LLMChain],
+) -> None:
+    assert isinstance(chain_cls, StringEvaluator)
+
+
+@pytest.mark.parametrize("chain_cls", [QAEvalChain, ContextQAEvalChain, CotQAEvalChain])
+def test_returns_expected_results(
+    chain_cls: Type[LLMChain],
+) -> None:
+    fake_llm = FakeLLM(
+        queries={"text": "The meaning of life\nCORRECT"}, sequential_responses=True
+    )
+    chain = chain_cls.from_llm(fake_llm)  # type: ignore
+    results = chain.evaluate_strings(
+        prediction="my prediction", reference="my reference", input="my input"
+    )
+    assert results["score"] == 1
--- a/tests/unit_tests/evaluation/run_evaluators/init.py
+++ b/tests/unit_tests/evaluation/run_evaluators/init.py
--- a/tests/unit_tests/evaluation/run_evaluators/test_implementations.py
+++ b/tests/unit_tests/evaluation/run_evaluators/test_implementations.py
@ -0,0 +1,54 @@
+"""Test run evaluator implementations basic functionality."""
+
+from uuid import UUID
+
+import pytest
+from langchainplus_sdk.schemas import Example, Run
+
+from langchain.evaluation.run_evaluators import get_criteria_evaluator, get_qa_evaluator
+from tests.unit_tests.llms.fake_llm import FakeLLM
+
+
+@pytest.fixture
+def run() -> Run:
+    return Run(
+        id=UUID("f77cd087-48f7-4c62-9e0e-297842202107"),
+        name="My Run",
+        inputs={"input": "What is the answer to life, the universe, and everything?"},
+        outputs={"output": "The answer is 42."},
+        start_time="2021-07-20T15:00:00.000000+00:00",
+        end_time="2021-07-20T15:00:00.000000+00:00",
+        run_type="chain",
+        execution_order=1,
+    )
+
+
+@pytest.fixture
+def example() -> Example:
+    return Example(
+        id=UUID("f77cd087-48f7-4c62-9e0e-297842202106"),
+        dataset_id=UUID("f77cd087-48f7-4c62-9e0e-297842202105"),
+        inputs={"input": "What is the answer to life, the universe, and everything?"},
+        outputs={"output": "The answer is 42."},
+        created_at="2021-07-20T15:00:00.000000+00:00",
+    )
+
+
+def test_get_qa_evaluator(run: Run, example: Example) -> None:
+    """Test get_qa_evaluator."""
+    eval_llm = FakeLLM(
+        queries={"a": "This checks out.\nCORRECT"}, sequential_responses=True
+    )
+    qa_evaluator = get_qa_evaluator(eval_llm)
+    res = qa_evaluator.evaluate_run(run, example)
+    assert res.value == "CORRECT"
+    assert res.score == 1
+
+
+def test_get_criteria_evaluator(run: Run, example: Example) -> None:
+    """Get a criteria evaluator."""
+    eval_llm = FakeLLM(queries={"a": "This checks out.\nY"}, sequential_responses=True)
+    criteria_evaluator = get_criteria_evaluator(eval_llm, criteria="conciseness")
+    res = criteria_evaluator.evaluate_run(run, example)
+    assert res.value == "Y"
+    assert res.score == 1
--- a/tests/unit_tests/output_parsers/test_base_output_parser.py
+++ b/tests/unit_tests/output_parsers/test_base_output_parser.py
@ -1,5 +1,6 @@
 """Test the BaseOutputParser class and its sub-classes."""
 from abc import ABC
+from collections import defaultdict
 from typing import List, Optional, Set, Type

 import pytest
@ -42,12 +43,12 @@ def test_subclass_implements_type(cls: Type[BaseOutputParser]) -> None:


 def test_all_subclasses_implement_unique_type() -> None:
-    types = []
+    types = defaultdict(list)
    for cls in _NON_ABSTRACT_PARSERS:
        try:
-            types.append(cls._type)
+            types[cls._type].append(cls.__name__)
        except NotImplementedError:
            # This is handled in the previous test
            pass
-    dups = set([t for t in types if types.count(t) > 1])
+    dups = {t: names for t, names in types.items() if len(names) > 1}
    assert not dups, f"Duplicate types: {dups}"