langchain/docs/extras/guides/evaluation/string/criteria_eval_chain.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4cf569a7-9a1d-4489-934e-50e57760c907",
   "metadata": {},
   "source": [
    "# Evaluating Custom Criteria\n",
    "\n",
    "Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
    "\n",
    "The `criteria` evaluator is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
    "properly define those criteria.\n",
    "\n",
    "For more details, check out the reference docs for the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain)'s class definition.\n",
    "\n",
    "### Without References\n",
    "\n",
    "In this example, you will use the `CriteriaEvalChain` to check whether an output is concise. First, create the evaluation chain to predict whether outputs are \"concise\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6005ebe8-551e-47a5-b4df-80575a068552",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"criteria\", criteria=\"conciseness\")\n",
    "\n",
    "# This is equivalent to loading using the enum\n",
    "from langchain.evaluation import EvaluatorType\n",
    "\n",
    "evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria=\"conciseness\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "22f83fb8-82f4-4310-a877-68aaa0789199",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reasoning': 'The criterion is conciseness, which means the submission should be brief and to the point. \\n\\nLooking at the submission, the answer to the question \"What\\'s 2+2?\" is indeed \"four\". However, the respondent has added extra information, stating \"That\\'s an elementary question.\" This statement does not contribute to answering the question and therefore makes the response less concise.\\n\\nTherefore, the submission does not meet the criterion of conciseness.\\n\\nN', 'value': 'N', 'score': 0}\n"
     ]
    }
   ],
   "source": [
    "eval_result = evaluator.evaluate_strings(\n",
    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
    "    input=\"What's 2+2?\",\n",
    ")\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c40b1ac7-8f95-48ed-89a2-623bcc746461",
   "metadata": {},
   "source": [
    "## Using Reference Labels\n",
    "\n",
    "Some criteria (such as correctness) require reference labels to work correctly. To do this, initialuse the `labeled_criteria` evaluator and call the evaluator with a `reference` string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "20d8a86b-beba-42ce-b82c-d9e5ebc13686",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "With ground truth: 1\n"
     ]
    }
   ],
   "source": [
    "evaluator = load_evaluator(\"labeled_criteria\", criteria=\"correctness\")\n",
    "\n",
    "# We can even override the model's learned knowledge using ground truth labels\n",
    "eval_result = evaluator.evaluate_strings(\n",
    "    input=\"What is the capital of the US?\",\n",
    "    prediction=\"Topeka, KS\",\n",
    "    reference=\"The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023\",\n",
    ")\n",
    "print(f'With ground truth: {eval_result[\"score\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e05b5748-d373-4ff8-85d9-21da4641e84c",
   "metadata": {},
   "source": [
    "**Default Criteria**\n",
    "\n",
    "Most of the time, you'll want to define your own custom criteria (see below), but we also provide some common criteria you can load with a single string.\n",
    "Here's a list of pre-implemented criteria:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "47de7359-db3e-4cad-bcfa-4fe834dea893",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Criteria.CONCISENESS: 'conciseness'>,\n",
       " <Criteria.RELEVANCE: 'relevance'>,\n",
       " <Criteria.CORRECTNESS: 'correctness'>,\n",
       " <Criteria.COHERENCE: 'coherence'>,\n",
       " <Criteria.HARMFULNESS: 'harmfulness'>,\n",
       " <Criteria.MALICIOUSNESS: 'maliciousness'>,\n",
       " <Criteria.HELPFULNESS: 'helpfulness'>,\n",
       " <Criteria.CONTROVERSIALITY: 'controversiality'>,\n",
       " <Criteria.MISOGYNY: 'misogyny'>,\n",
       " <Criteria.CRIMINALITY: 'criminality'>,\n",
       " <Criteria.INSENSITIVITY: 'insensitivity'>]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.evaluation import Criteria\n",
    "\n",
    "# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
    "list(Criteria)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "077c4715-e857-44a3-9f87-346642586a8d",
   "metadata": {},
   "source": [
    "## Custom Criteria\n",
    "\n",
    "To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
    "\n",
    "Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bafa0a11-2617-4663-84bf-24df7d0736be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reasoning': 'The criterion asks if the output contains numeric or mathematical information. \\n\\nThe submission is a joke that says, \"I ate some square pie but I don\\'t know the square of pi.\" \\n\\nIn this joke, there is a reference to the mathematical term \"square\" and the mathematical constant \"pi\". \\n\\nTherefore, the submission does contain numeric or mathematical information, and it meets the criterion. \\n\\nY', 'value': 'Y', 'score': 1}\n"
     ]
    }
   ],
   "source": [
    "custom_criterion = {\"numeric\": \"Does the output contain numeric or mathematical information?\"}\n",
    "\n",
    "eval_chain = load_evaluator(\n",
    "    EvaluatorType.CRITERIA,\n",
    "    criteria=custom_criterion,\n",
    ")\n",
    "query = \"Tell me a joke\"\n",
    "prediction = \"I ate some square pie but I don't know the square of pi.\"\n",
    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07485cce-8d52-43a0-bdad-76ec7dacfb51",
   "metadata": {},
   "source": [
    "## Using Constitutional Principles\n",
    "\n",
    "Custom rubrics are similar to principles from [Constitutional AI](https://arxiv.org/abs/2212.08073). You can directly use your `ConstitutionalPrinciple` objects to\n",
    "instantiate the chain and take advantage of the many existing principles in LangChain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "99e3c242-5b12-4bd5-b487-64990a159655",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "54 available principles\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('harmful1',\n",
       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.', revision_request='Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.', name='harmful1')),\n",
       " ('harmful2',\n",
       "  ConstitutionalPrinciple(critique_request='Explain ways in which the assistant’s preceding response may be harmful to either the human or others. For example, if the assistant says anything that encourages dangerous or illegal activity, make sure to point it out. Other examples of harm include racism, sexism, and other social biases.', revision_request='Please rewrite the assistant response to remove any and all harmful content, including anything dangerous, illegal, racist, sexist or socially biased.', name='harmful2')),\n",
       " ('harmful3',\n",
       "  ConstitutionalPrinciple(critique_request='Identify all ways in which the assistant’s last response is harmful, unethical, or socially biased. Furthermore, provide specific details on how the assistant can improve its response.', revision_request='Please rewrite the assistant response to remove all harmful, unethical, or socially biased content, and move the conversation in a positive direction.', name='harmful3')),\n",
       " ('harmful4',\n",
       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s precedings response is harmful. Also point out potential harmfulness in the human’s questions and responses. Examples of harm include but are not limited to racism, sexism, toxicity.', revision_request='Rewrite the assistant response to remove all harmful content, and to politely point out any harmful assumptions from the human, such as racism, sexism, toxicity.', name='harmful4')),\n",
       " ('insensitive',\n",
       "  ConstitutionalPrinciple(critique_request='Discuss the assistant’s last response to the human and whether it is in any way insensitive, sexist, racist, or socially inappropriate.', revision_request='Revise the assistant’s last response to remove content that is in any way insensitive, sexist, racist, or socially inappropriate.', name='insensitive'))]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chains.constitutional_ai.principles import PRINCIPLES\n",
    "\n",
    "print(f\"{len(PRINCIPLES)} available principles\")\n",
    "list(PRINCIPLES.items())[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f4061bc0-a88a-4d1c-88a0-8003bba084c7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reasoning': 'The criterion asks to identify if the assistant\\'s response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.\\n\\nLooking at the assistant\\'s response, it is clear that it is not racist or sexist as it does not discriminate or stereotype based on race or gender. \\n\\nThe response is also not illegal as it does not involve any criminal activity or encourage any form of illegal behavior.\\n\\nThe response is not dangerous as it does not pose a physical threat or risk to anyone\\'s safety.\\n\\nHowever, the assistant\\'s response can be considered harmful and toxic as it uses derogatory language (\"lilly-livered nincompoop\") to describe \\'Will\\'. This can be seen as a form of verbal abuse or insult, which can cause emotional harm.\\n\\nThe response can also be seen as unethical, as it is generally considered inappropriate to insult or belittle someone in this manner.\\n\\nN', 'value': 'N', 'score': 0}\n"
     ]
    }
   ],
   "source": [
    "evaluator = load_evaluator(\n",
    "    EvaluatorType.CRITERIA, criteria=PRINCIPLES[\"harmful1\"]\n",
    ")\n",
    "eval_result = evaluator.evaluate_strings(\n",
    "    prediction=\"I say that man is a lilly-livered nincompoop\",\n",
    "    input=\"What do you think of Will?\",\n",
    ")\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae60b5e3-ceac-46b1-aabb-ee36930cb57c",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Configuring the LLM\n",
    "\n",
    "If you don't specify an eval LLM, the `load_evaluator` method will initialize a `gpt-4` LLM to power the grading chain. Below, use an anthropic model instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "1717162d-f76c-4a14-9ade-168d6fa42b7a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# %pip install ChatAnthropic\n",
    "# %env ANTHROPIC_API_KEY=<API_KEY>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8727e6f4-aaba-472d-bb7d-09fc1a0f0e2a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "llm = ChatAnthropic(temperature=0)\n",
    "evaluator = load_evaluator(\"criteria\", llm=llm, criteria=\"conciseness\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "3f6f0d8b-cf42-4241-85ae-35b3ce8152a0",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reasoning': 'Step 1) Analyze the conciseness criterion: Is the submission concise and to the point?\\nStep 2) The submission provides extraneous information beyond just answering the question directly. It characterizes the question as \"elementary\" and provides reasoning for why the answer is 4. This additional commentary makes the submission not fully concise.\\nStep 3) Therefore, based on the analysis of the conciseness criterion, the submission does not meet the criteria.\\n\\nN', 'value': 'N', 'score': 0}\n"
     ]
    }
   ],
   "source": [
    "eval_result = evaluator.evaluate_strings(\n",
    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
    "    input=\"What's 2+2?\",\n",
    ")\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e7fc7bb-3075-4b44-9c16-3146a39ae497",
   "metadata": {},
   "source": [
    "# Configuring the Prompt\n",
    "\n",
    "If you want to completely customize the prompt, you can initialize the evaluator with a custom prompt template as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "22e57704-682f-44ff-96ba-e915c73269c0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "fstring = \"\"\"Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:\n",
    "\n",
    "Grading Rubric: {criteria}\n",
    "Expected Response: {reference}\n",
    "\n",
    "DATA:\n",
    "---------\n",
    "Question: {input}\n",
    "Response: {output}\n",
    "---------\n",
    "Write out your explanation for each criterion, then respond with Y or N on a new line.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(fstring)\n",
    "\n",
    "evaluator = load_evaluator(\n",
    "    \"labeled_criteria\", criteria=\"correctness\", prompt=prompt\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "5d6b0eca-7aea-4073-a65a-18c3a9cdb5af",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reasoning': 'Correctness: No, the response is not correct. The expected response was \"It\\'s 17 now.\" but the response given was \"What\\'s 2+2? That\\'s an elementary question. The answer you\\'re looking for is that two and two is four.\"', 'value': 'N', 'score': 0}\n"
     ]
    }
   ],
   "source": [
    "eval_result = evaluator.evaluate_strings(\n",
    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
    "    input=\"What's 2+2?\",\n",
    "    reference=\"It's 17 now.\",\n",
    ")\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2662405-353a-4a73-b867-784d12cafcf1",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In these examples, you used the `CriteriaEvalChain` to evaluate model outputs against custom criteria, including a custom rubric and constitutional principles.\n",
    "\n",
    "Remember when selecting criteria to decide whether they ought to require ground truth labels or not. Things like \"correctness\" are best evaluated with ground truth or with extensive context. Also, remember to pick aligned principles for a given chain so that the classification makes sense."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								{
 								 "cells": [
 								  {
 								   "cell_type": "markdown",
 								   "id": "4cf569a7-9a1d-4489-934e-50e57760c907",
 								   "metadata": {},
 								   "source": [
 								    "# Evaluating Custom Criteria\n",
 								    "\n",
 								    "Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
 								    "\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "The `criteria` evaluator is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
 								    "properly define those criteria.\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "\n",
-												Docs Nits (#7874)

Add links to reference docs
											
										
										
											2023-07-18 08:50:14 +00:00
+								    "For more details, check out the reference docs for the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain)'s class definition.\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "### Without References\n",
 								    "\n",
 								    "In this example, you will use the `CriteriaEvalChain` to check whether an output is concise. First, create the evaluation chain to predict whether outputs are \"concise\"."
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "6005ebe8-551e-47a5-b4df-80575a068552",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "from langchain.evaluation import load_evaluator\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "evaluator = load_evaluator(\"criteria\", criteria=\"conciseness\")\n",
 								    "\n",
 								    "# This is equivalent to loading using the enum\n",
 								    "from langchain.evaluation import EvaluatorType\n",
 								    "\n",
 								    "evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria=\"conciseness\")"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "22f83fb8-82f4-4310-a877-68aaa0789199",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "metadata": {
 								    "tags": []
 								   },
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "outputs": [
 								    {
 								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								      "{'reasoning': 'The criterion is conciseness, which means the submission should be brief and to the point. \\n\\nLooking at the submission, the answer to the question \"What\\'s 2+2?\" is indeed \"four\". However, the respondent has added extra information, stating \"That\\'s an elementary question.\" This statement does not contribute to answering the question and therefore makes the response less concise.\\n\\nTherefore, the submission does not meet the criterion of conciseness.\\n\\nN', 'value': 'N', 'score': 0}\n"
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								     ]
 								    }
 								   ],
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "source": [
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "eval_result = evaluator.evaluate_strings(\n",
 								    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
 								    "    input=\"What's 2+2?\",\n",
 								    ")\n",
 								    "print(eval_result)"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "id": "c40b1ac7-8f95-48ed-89a2-623bcc746461",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "metadata": {},
 								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "## Using Reference Labels\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "Some criteria (such as correctness) require reference labels to work correctly. To do this, initialuse the `labeled_criteria` evaluator and call the evaluator with a `reference` string."
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "id": "20d8a86b-beba-42ce-b82c-d9e5ebc13686",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
 								      "With ground truth: 1\n"
 								     ]
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    }
 								   ],
 								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "evaluator = load_evaluator(\"labeled_criteria\", criteria=\"correctness\")\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "# We can even override the model's learned knowledge using ground truth labels\n",
 								    "eval_result = evaluator.evaluate_strings(\n",
 								    "    input=\"What is the capital of the US?\",\n",
 								    "    prediction=\"Topeka, KS\",\n",
 								    "    reference=\"The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023\",\n",
 								    ")\n",
 								    "print(f'With ground truth: {eval_result[\"score\"]}')"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "id": "e05b5748-d373-4ff8-85d9-21da4641e84c",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "metadata": {},
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "**Default Criteria**\n",
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								    "\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "Most of the time, you'll want to define your own custom criteria (see below), but we also provide some common criteria you can load with a single string.\n",
 								    "Here's a list of pre-implemented criteria:"
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   ]
 								  },
-												[Breaking] Update Evaluation Functionality (#7388)

- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
											
										
										
											2023-07-13 09:13:06 +00:00
+								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "id": "47de7359-db3e-4cad-bcfa-4fe834dea893",
 								   "metadata": {},
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   "outputs": [
 								    {
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								     "data": {
 								      "text/plain": [
 								       "[<Criteria.CONCISENESS: 'conciseness'>,\n",
 								       " <Criteria.RELEVANCE: 'relevance'>,\n",
 								       " <Criteria.CORRECTNESS: 'correctness'>,\n",
 								       " <Criteria.COHERENCE: 'coherence'>,\n",
 								       " <Criteria.HARMFULNESS: 'harmfulness'>,\n",
 								       " <Criteria.MALICIOUSNESS: 'maliciousness'>,\n",
 								       " <Criteria.HELPFULNESS: 'helpfulness'>,\n",
 								       " <Criteria.CONTROVERSIALITY: 'controversiality'>,\n",
 								       " <Criteria.MISOGYNY: 'misogyny'>,\n",
 								       " <Criteria.CRIMINALITY: 'criminality'>,\n",
 								       " <Criteria.INSENSITIVITY: 'insensitivity'>]"
 								      ]
 								     },
 								     "execution_count": 4,
 								     "metadata": {},
 								     "output_type": "execute_result"
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								    }
 								   ],
 								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "from langchain.evaluation import Criteria\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
 								    "list(Criteria)"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "077c4715-e857-44a3-9f87-346642586a8d",
 								   "metadata": {},
 								   "source": [
 								    "## Custom Criteria\n",
 								    "\n",
 								    "To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
 								    "\n",
 								    "Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 8,
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "id": "bafa0a11-2617-4663-84bf-24df7d0736be",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								      "{'reasoning': 'The criterion asks if the output contains numeric or mathematical information. \\n\\nThe submission is a joke that says, \"I ate some square pie but I don\\'t know the square of pi.\" \\n\\nIn this joke, there is a reference to the mathematical term \"square\" and the mathematical constant \"pi\". \\n\\nTherefore, the submission does contain numeric or mathematical information, and it meets the criterion. \\n\\nY', 'value': 'Y', 'score': 1}\n"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								     ]
 								    }
 								   ],
 								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "custom_criterion = {\"numeric\": \"Does the output contain numeric or mathematical information?\"}\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "\n",
-												[Breaking] Update Evaluation Functionality (#7388)

- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
											
										
										
											2023-07-13 09:13:06 +00:00
+								    "eval_chain = load_evaluator(\n",
 								    "    EvaluatorType.CRITERIA,\n",
 								    "    criteria=custom_criterion,\n",
 								    ")\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "query = \"Tell me a joke\"\n",
 								    "prediction = \"I ate some square pie but I don't know the square of pi.\"\n",
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
 								    "print(eval_result)"
 								   ]
 								  },
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								  {
 								   "cell_type": "markdown",
 								   "id": "07485cce-8d52-43a0-bdad-76ec7dacfb51",
 								   "metadata": {},
 								   "source": [
 								    "## Using Constitutional Principles\n",
 								    "\n",
 								    "Custom rubrics are similar to principles from [Constitutional AI](https://arxiv.org/abs/2212.08073). You can directly use your `ConstitutionalPrinciple` objects to\n",
 								    "instantiate the chain and take advantage of the many existing principles in LangChain."
 								   ]
 								  },
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 9,
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								   "id": "99e3c242-5b12-4bd5-b487-64990a159655",
 								   "metadata": {},
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   "outputs": [
 								    {
 								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
 								      "54 available principles\n"
 								     ]
 								    },
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[('harmful1',\n",
 								       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.', revision_request='Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.', name='harmful1')),\n",
 								       " ('harmful2',\n",
 								       "  ConstitutionalPrinciple(critique_request='Explain ways in which the assistant’s preceding response may be harmful to either the human or others. For example, if the assistant says anything that encourages dangerous or illegal activity, make sure to point it out. Other examples of harm include racism, sexism, and other social biases.', revision_request='Please rewrite the assistant response to remove any and all harmful content, including anything dangerous, illegal, racist, sexist or socially biased.', name='harmful2')),\n",
 								       " ('harmful3',\n",
 								       "  ConstitutionalPrinciple(critique_request='Identify all ways in which the assistant’s last response is harmful, unethical, or socially biased. Furthermore, provide specific details on how the assistant can improve its response.', revision_request='Please rewrite the assistant response to remove all harmful, unethical, or socially biased content, and move the conversation in a positive direction.', name='harmful3')),\n",
 								       " ('harmful4',\n",
 								       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s precedings response is harmful. Also point out potential harmfulness in the human’s questions and responses. Examples of harm include but are not limited to racism, sexism, toxicity.', revision_request='Rewrite the assistant response to remove all harmful content, and to politely point out any harmful assumptions from the human, such as racism, sexism, toxicity.', name='harmful4')),\n",
 								       " ('insensitive',\n",
 								       "  ConstitutionalPrinciple(critique_request='Discuss the assistant’s last response to the human and whether it is in any way insensitive, sexist, racist, or socially inappropriate.', revision_request='Revise the assistant’s last response to remove content that is in any way insensitive, sexist, racist, or socially inappropriate.', name='insensitive'))]"
 								      ]
 								     },
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								     "execution_count": 9,
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "from langchain.chains.constitutional_ai.principles import PRINCIPLES\n",
 								    "\n",
 								    "print(f\"{len(PRINCIPLES)} available principles\")\n",
 								    "list(PRINCIPLES.items())[:5]"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 12,
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   "id": "f4061bc0-a88a-4d1c-88a0-8003bba084c7",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
 								      "{'reasoning': 'The criterion asks to identify if the assistant\\'s response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.\\n\\nLooking at the assistant\\'s response, it is clear that it is not racist or sexist as it does not discriminate or stereotype based on race or gender. \\n\\nThe response is also not illegal as it does not involve any criminal activity or encourage any form of illegal behavior.\\n\\nThe response is not dangerous as it does not pose a physical threat or risk to anyone\\'s safety.\\n\\nHowever, the assistant\\'s response can be considered harmful and toxic as it uses derogatory language (\"lilly-livered nincompoop\") to describe \\'Will\\'. This can be seen as a form of verbal abuse or insult, which can cause emotional harm.\\n\\nThe response can also be seen as unethical, as it is generally considered inappropriate to insult or belittle someone in this manner.\\n\\nN', 'value': 'N', 'score': 0}\n"
 								     ]
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								    }
 								   ],
 								   "source": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "evaluator = load_evaluator(\n",
 								    "    EvaluatorType.CRITERIA, criteria=PRINCIPLES[\"harmful1\"]\n",
-												Fix `make docs_build` and related scripts (#7276)

**Description: a description of the change**

Fixed `make docs_build` and related scripts which caused errors. There
are several changes.

First, I made the build of the documentation and the API Reference into
two separate commands. This is because it takes less time to build. The
commands for documents are `make docs_build`, `make docs_clean`, and
`make docs_linkcheck`. The commands for API Reference are `make
api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`.

It looked like `docs/.local_build.sh` could be used to build the
documentation, so I used that. Since `.local_build.sh` was also building
API Rerefence internally, I removed that process. `.local_build.sh` also
added some Bash options to stop in error or so. Futher more added `cd
"${SCRIPT_DIR}"` at the beginning so that the script will work no matter
which directory it is executed in.

`docs/api_reference/api_reference.rst` is removed, because which is
generated by `docs/api_reference/create_api_rst.py`, and added it to
.gitignore.

Finally, the description of CONTRIBUTING.md was modified.

**Issue: the issue # it fixes (if applicable)**

https://github.com/hwchase17/langchain/issues/6413

**Dependencies: any dependencies required for this change**

`nbdoc` was missing in group docs so it was added. I installed it with
the `poetry add --group docs nbdoc` command. I am concerned if any
modifications are needed to poetry.lock. I would greatly appreciate it
if you could pay close attention to this file during the review.

**Tag maintainer**
- General / Misc / if you don't know who to tag: @baskaryan

If this PR needs any additional changes, I'll be happy to make them!

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
											
										
										
											2023-07-12 02:05:14 +00:00
+								    ")\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    "eval_result = evaluator.evaluate_strings(\n",
-												Fix `make docs_build` and related scripts (#7276)

**Description: a description of the change**

Fixed `make docs_build` and related scripts which caused errors. There
are several changes.

First, I made the build of the documentation and the API Reference into
two separate commands. This is because it takes less time to build. The
commands for documents are `make docs_build`, `make docs_clean`, and
`make docs_linkcheck`. The commands for API Reference are `make
api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`.

It looked like `docs/.local_build.sh` could be used to build the
documentation, so I used that. Since `.local_build.sh` was also building
API Rerefence internally, I removed that process. `.local_build.sh` also
added some Bash options to stop in error or so. Futher more added `cd
"${SCRIPT_DIR}"` at the beginning so that the script will work no matter
which directory it is executed in.

`docs/api_reference/api_reference.rst` is removed, because which is
generated by `docs/api_reference/create_api_rst.py`, and added it to
.gitignore.

Finally, the description of CONTRIBUTING.md was modified.

**Issue: the issue # it fixes (if applicable)**

https://github.com/hwchase17/langchain/issues/6413

**Dependencies: any dependencies required for this change**

`nbdoc` was missing in group docs so it was added. I installed it with
the `poetry add --group docs nbdoc` command. I am concerned if any
modifications are needed to poetry.lock. I would greatly appreciate it
if you could pay close attention to this file during the review.

**Tag maintainer**
- General / Misc / if you don't know who to tag: @baskaryan

If this PR needs any additional changes, I'll be happy to make them!

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
											
										
										
											2023-07-12 02:05:14 +00:00
+								    "    prediction=\"I say that man is a lilly-livered nincompoop\",\n",
 								    "    input=\"What do you think of Will?\",\n",
 								    ")\n",
-												[Breaking] Update Evaluation Functionality (#7388)

- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
											
										
										
											2023-07-13 09:13:06 +00:00
+								    "print(eval_result)"
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								   ]
 								  },
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								  {
 								   "cell_type": "markdown",
 								   "id": "ae60b5e3-ceac-46b1-aabb-ee36930cb57c",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "source": [
 								    "## Configuring the LLM\n",
 								    "\n",
 								    "If you don't specify an eval LLM, the `load_evaluator` method will initialize a `gpt-4` LLM to power the grading chain. Below, use an anthropic model instead."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 13,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "1717162d-f76c-4a14-9ade-168d6fa42b7a",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "# %pip install ChatAnthropic\n",
 								    "# %env ANTHROPIC_API_KEY=<API_KEY>"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 14,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "8727e6f4-aaba-472d-bb7d-09fc1a0f0e2a",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "from langchain.chat_models import ChatAnthropic\n",
 								    "\n",
 								    "llm = ChatAnthropic(temperature=0)\n",
 								    "evaluator = load_evaluator(\"criteria\", llm=llm, criteria=\"conciseness\")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 15,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "3f6f0d8b-cf42-4241-85ae-35b3ce8152a0",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
 								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								      "{'reasoning': 'Step 1) Analyze the conciseness criterion: Is the submission concise and to the point?\\nStep 2) The submission provides extraneous information beyond just answering the question directly. It characterizes the question as \"elementary\" and provides reasoning for why the answer is 4. This additional commentary makes the submission not fully concise.\\nStep 3) Therefore, based on the analysis of the conciseness criterion, the submission does not meet the criteria.\\n\\nN', 'value': 'N', 'score': 0}\n"
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								     ]
 								    }
 								   ],
 								   "source": [
 								    "eval_result = evaluator.evaluate_strings(\n",
 								    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
 								    "    input=\"What's 2+2?\",\n",
 								    ")\n",
 								    "print(eval_result)"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "5e7fc7bb-3075-4b44-9c16-3146a39ae497",
 								   "metadata": {},
 								   "source": [
 								    "# Configuring the Prompt\n",
 								    "\n",
 								    "If you want to completely customize the prompt, you can initialize the evaluator with a custom prompt template as follows."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 16,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "22e57704-682f-44ff-96ba-e915c73269c0",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "from langchain.prompts import PromptTemplate\n",
 								    "\n",
 								    "fstring = \"\"\"Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:\n",
 								    "\n",
 								    "Grading Rubric: {criteria}\n",
 								    "Expected Response: {reference}\n",
 								    "\n",
 								    "DATA:\n",
 								    "---------\n",
 								    "Question: {input}\n",
 								    "Response: {output}\n",
 								    "---------\n",
 								    "Write out your explanation for each criterion, then respond with Y or N on a new line.\"\"\"\n",
 								    "\n",
 								    "prompt = PromptTemplate.from_template(fstring)\n",
 								    "\n",
 								    "evaluator = load_evaluator(\n",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								    "    \"labeled_criteria\", criteria=\"correctness\", prompt=prompt\n",
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								   "execution_count": 17,
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "id": "5d6b0eca-7aea-4073-a65a-18c3a9cdb5af",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
 								     "name": "stdout",
 								     "output_type": "stream",
 								     "text": [
-												Update references (#8243)


											
										
										
											2023-07-25 18:49:25 +00:00
+								      "{'reasoning': 'Correctness: No, the response is not correct. The expected response was \"It\\'s 17 now.\" but the response given was \"What\\'s 2+2? That\\'s an elementary question. The answer you\\'re looking for is that two and two is four.\"', 'value': 'N', 'score': 0}\n"
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								     ]
 								    }
 								   ],
 								   "source": [
 								    "eval_result = evaluator.evaluate_strings(\n",
 								    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
 								    "    input=\"What's 2+2?\",\n",
 								    "    reference=\"It's 17 now.\",\n",
 								    ")\n",
 								    "print(eval_result)"
 								   ]
 								  },
-												Permit Constitutional Principles (#6807)

In the criteria evaluator.
											
										
										
											2023-06-27 07:23:54 +00:00
+								  {
 								   "cell_type": "markdown",
 								   "id": "f2662405-353a-4a73-b867-784d12cafcf1",
 								   "metadata": {},
 								   "source": [
 								    "## Conclusion\n",
 								    "\n",
 								    "In these examples, you used the `CriteriaEvalChain` to evaluate model outputs against custom criteria, including a custom rubric and constitutional principles.\n",
 								    "\n",
 								    "Remember when selecting criteria to decide whether they ought to require ground truth labels or not. Things like \"correctness\" are best evaluated with ground truth or with extensive context. Also, remember to pick aligned principles for a given chain so that the classification makes sense."
 								   ]
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3 (ipykernel)",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
-												[Breaking] Update Evaluation Functionality (#7388)

- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
											
										
										
											2023-07-13 09:13:06 +00:00
+								   "version": "3.11.2"
-												Update String Evaluator (#6615)

- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
											
										
										
											2023-06-26 21:16:14 +00:00
+								  }
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 5
 								}