Evaluation
==========

.. note::
   `Conceptual Guide <https://docs.langchain.com/docs/use-cases/evaluation>`_


This section of the documentation covers how we approach and think about evaluation in LangChain:
both how we evaluate internal chains/agents and how we recommend that people building on top of LangChain approach evaluation.

The Problem
-----------

Evaluating LangChain chains and agents can be difficult.
There are two main reasons for this:

**# 1: Lack of data**

You generally don't have much data to evaluate your chains/agents over before starting a project.
This is usually because Large Language Models (the core of most chains/agents) are strong few-shot and zero-shot learners,
meaning you can almost always get started on a particular task (text-to-SQL, question answering, etc.) without
a large dataset of examples.
This is in stark contrast to traditional machine learning, where you first had to collect a set of data points
before you could even start using a model.

**# 2: Lack of metrics**

Most chains/agents perform tasks for which there are no very good metrics to evaluate performance.
For example, one of the most common use cases is generating text of some form.
Evaluating generated text is much more complicated than evaluating a classification prediction or a numeric prediction.
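
To make the contrast concrete, here is a minimal sketch in plain Python (it does not use any LangChain API, and the ``exact_match`` helper is hypothetical). It shows why a simple string comparison is a reasonable metric for a classification label but a poor one for generated text.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only when the strings are identical after light normalization."""
    return prediction.strip().lower() == reference.strip().lower()


# Classification: the label space is closed, so exact match is a sound metric.
label_correct = exact_match("positive", "Positive")

# Generated text: a correct paraphrase is still scored as wrong.
reference = "Paris is the capital of France."
prediction = "The capital of France is Paris."
text_correct = exact_match(prediction, reference)

print(label_correct)  # True
print(text_correct)   # False
```

The second comparison fails even though the prediction is right, which is the gap that the approaches described in this section try to close.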

The Solution
------------

LangChain attempts to tackle both of those issues.
What we have so far are initial passes at solutions; we do not think we have a perfect solution.
We therefore very much welcome feedback, contributions, integrations, and thoughts on this.

Here is what we have for each problem so far:

**# 1: Lack of data**

We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_, a Community space on Hugging Face.
We intend this to be a collection of open source datasets for evaluating common chains and agents.
We have contributed five datasets of our own to start, but we very much intend this to be a community effort.
To contribute a dataset, you simply need to join the community, after which you will be able to upload datasets.

We're also aiming to make it as easy as possible for people to create their own datasets.
As a first pass at this, we've added a QAGenerationChain, which, given a document, comes up
with question-answer pairs that can later be used to evaluate question-answering tasks over that document.
See `this notebook <./evaluation/qa_generation.html>`_ for an example of how to use this chain.
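
The idea behind generating question-answer pairs from a document can be sketched as follows. This is a minimal illustration under stated assumptions, not the QAGenerationChain implementation: ``generate_qa_pairs`` and ``fake_llm`` are hypothetical names, the model is represented by any callable that maps a prompt string to a JSON string, and the paragraph-based splitting is a simplification.

```python
import json
from typing import Callable, Dict, List


def generate_qa_pairs(document: str, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Ask `llm` for one question/answer pair per paragraph of `document`."""
    pairs = []
    for chunk in document.split("\n\n"):
        if not chunk.strip():
            continue  # skip empty paragraphs
        prompt = (
            "Write one question and its answer about the text below. "
            'Reply as JSON with keys "question" and "answer".\n\n' + chunk
        )
        # Assumes the model replies with valid JSON; real output may need validation.
        pairs.append(json.loads(llm(prompt)))
    return pairs


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always returns the same pair."""
    return json.dumps({"question": "What does the text describe?", "answer": "LangChain."})


document = "LangChain has chains.\n\nLangChain also has agents."
examples = generate_qa_pairs(document, fake_llm)
print(len(examples))
```

Each resulting dictionary holds one ``question`` and one ``answer``, which is a convenient shape for an evaluation set: the question is fed to the chain under test and the answer serves as the reference.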

**# 2: Lack of metrics**

We have two solutions to the lack of metrics.

The first solution is to use no metrics and instead rely on looking at results by eye to get a sense of how the chain/agent is performing.
To assist in this, we have developed (and will continue to develop) `tracing <../additional_resources/tracing.html>`_, a UI-based visualizer of your chain and agent runs.

The second solution we recommend is to use Language Models themselves to evaluate outputs.
For this, we have a few different chains and prompts.
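
As a rough sketch of what using a language model as the evaluator can look like, the example below grades predictions against reference answers. It is written in plain Python and is not one of LangChain's evaluation chains: ``GRADE_PROMPT``, ``grade``, and ``fake_grader`` are hypothetical, and the model is represented by a callable that returns the string ``CORRECT`` or ``INCORRECT``.

```python
from typing import Callable, Dict, List

# Hypothetical grading prompt; a real prompt would likely need more guidance.
GRADE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {answer}\n"
    "Predicted answer: {prediction}\n"
    "Reply with CORRECT or INCORRECT."
)


def grade(
    examples: List[Dict[str, str]],
    predictions: List[str],
    llm: Callable[[str], str],
) -> List[bool]:
    """Ask `llm` to judge each prediction against its reference answer."""
    results = []
    for example, prediction in zip(examples, predictions):
        verdict = llm(GRADE_PROMPT.format(prediction=prediction, **example))
        results.append(verdict.strip().upper() == "CORRECT")
    return results


def fake_grader(prompt: str) -> str:
    """Hypothetical stand-in for a model: accepts predictions that mention 'Paris'."""
    predicted = [line for line in prompt.splitlines() if line.startswith("Predicted answer:")][0]
    return "CORRECT" if "Paris" in predicted else "INCORRECT"


examples = [{"question": "What is the capital of France?", "answer": "Paris"}] * 2
predictions = ["The capital is Paris.", "Lyon"]
grades = grade(examples, predictions, fake_grader)
accuracy = sum(grades) / len(grades)
print(grades, accuracy)
```

Unlike an exact string comparison, a grader of this kind can accept a correctly worded paraphrase, at the cost of depending on the judging model's own reliability.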

The Examples
------------

We have created a number of examples combining the above two solutions to show how we internally evaluate chains and agents during development.
In addition to the examples we've curated, we also welcome contributions here.
To facilitate that, we've included a `template notebook <./evaluation/benchmarking_template.html>`_ for community members to use to build their own examples.

The existing examples we have are:

`Question Answering (State of Union) <./evaluation/qa_benchmarking_sota.html>`_: A notebook showing evaluation of a question-answering task over a State-of-the-Union address.

`Question Answering (Paul Graham Essay) <./evaluation/qa_benchmarking_pg.html>`_: A notebook showing evaluation of a question-answering task over a Paul Graham essay.

`SQL Question Answering (Chinook) <./evaluation/sql_qa_benchmarking_chinook.html>`_: A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).

`Agent Vectorstore <./evaluation/agent_vectordb_sota_pg.html>`_: A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.

`Agent Search + Calculator <./evaluation/agent_benchmarking.html>`_: A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.

`Evaluating an OpenAPI Chain <./evaluation/openapi_eval.html>`_: A notebook showing evaluation of an OpenAPI chain, including how to generate test data if you don't have any.


Other Examples
--------------

We also have some more generic resources for evaluation.

`Question Answering <./evaluation/question_answering.html>`_: An overview of using LLMs to evaluate question answering systems in general.

`Data Augmented Question Answering <./evaluation/data_augmented_question_answering.html>`_: An end-to-end example of evaluating a question answering system focused on a specific document (a RetrievalQAChain to be precise). This example shows how to use LLMs to come up with question/answer examples to evaluate over, and then how to use LLMs to evaluate performance on those generated examples.

`Hugging Face Datasets <./evaluation/huggingface_datasets.html>`_: Covers an example of loading and using a dataset from Hugging Face for evaluation.


.. toctree::
   :maxdepth: 1
   :glob:
   :hidden:

   evaluation/*