fix eval guide links (#6319)

Davis Chase 2023-06-16 17:53:46 -07:00 committed by GitHub
parent ad324a39ae
commit 6640293087


@@ -33,7 +33,7 @@ Here is what we have for each problem so far:
 **# 1: Lack of data**
-We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_ a Community space on Hugging Face.
+We have started [LangChainDatasets](https://huggingface.co/LangChainDatasets) a Community space on Hugging Face.
 We intend this to be a collection of open source datasets for evaluating common chains and agents.
 We have contributed five datasets of our own to start, but we highly intend this to be a community effort.
 In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
@@ -41,14 +41,14 @@ In order to contribute a dataset, you simply need to join the community and then
 We're also aiming to make it as easy as possible for people to create their own datasets.
 As a first pass at this, we've added a QAGenerationChain, which given a document comes up
 with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
-See `this notebook <./qa_generation.html>`_ for an example of how to use this chain.
+See [this notebook](/docs/guides/evaluation/qa_generation.html) for an example of how to use this chain.
 **# 2: Lack of metrics**
 We have two solutions to the lack of metrics.
 The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
-To assist in this, we have developed (and will continue to develop) `tracing <../additional_resources/tracing.html>`_, a UI-based visualizer of your chain and agent runs.
+To assist in this, we have developed (and will continue to develop) [tracing](/docs/guides/tracing/), a UI-based visualizer of your chain and agent runs.
 The second solution we recommend is to use Language Models themselves to evaluate outputs.
 For this we have a few different chains and prompts aimed at tackling this issue.
@@ -57,30 +57,30 @@ For this we have a few different chains and prompts aimed at tackling this issue
 We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
 In addition to the examples we've curated, we also highly welcome contributions here.
-To facilitate that, we've included a `template notebook <./benchmarking_template.html>`_ for community members to use to build their own examples.
+To facilitate that, we've included a [template notebook](/docs/guides/evaluation/benchmarking_template.html) for community members to use to build their own examples.
 The existing examples we have are:
-`Question Answering (State of Union) <./qa_benchmarking_sota.html>`_: A notebook showing evaluation of a question-answering task over a State-of-the-Union address.
+[Question Answering (State of Union)](/docs/guides/evaluation/qa_benchmarking_sota.html): A notebook showing evaluation of a question-answering task over a State-of-the-Union address.
-`Question Answering (Paul Graham Essay) <./qa_benchmarking_pg.html>`_: A notebook showing evaluation of a question-answering task over a Paul Graham essay.
+[Question Answering (Paul Graham Essay)](/docs/guides/evaluation/qa_benchmarking_pg.html): A notebook showing evaluation of a question-answering task over a Paul Graham essay.
-`SQL Question Answering (Chinook) <./sql_qa_benchmarking_chinook.html>`_: A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
+[SQL Question Answering (Chinook)](/docs/guides/evaluation/sql_qa_benchmarking_chinook.html): A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
-`Agent Vectorstore <./agent_vectordb_sota_pg.html>`_: A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
+[Agent Vectorstore](/docs/guides/evaluation/agent_vectordb_sota_pg.html): A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
-`Agent Search + Calculator <./agent_benchmarking.html>`_: A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
+[Agent Search + Calculator](/docs/guides/evaluation/agent_benchmarking.html): A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
-`Evaluating an OpenAPI Chain <./openapi_eval.html>`_: A notebook showing evaluation of an OpenAPI chain, including how to generate test data if you don't have any.
+[Evaluating an OpenAPI Chain](/docs/guides/evaluation/openapi_eval.html): A notebook showing evaluation of an OpenAPI chain, including how to generate test data if you don't have any.
 ## Other Examples
 In addition, we also have some more generic resources for evaluation.
-`Question Answering <./question_answering.html>`_: An overview of LLMs aimed at evaluating question answering systems in general.
+[Question Answering](/docs/guides/evaluation/question_answering.html): An overview of LLMs aimed at evaluating question answering systems in general.
-`Data Augmented Question Answering <./data_augmented_question_answering.html>`_: An end-to-end example of evaluating a question answering system focused on a specific document (a RetrievalQAChain to be precise). This example highlights how to use LLMs to come up with question/answer examples to evaluate over, and then highlights how to use LLMs to evaluate performance on those generated examples.
+[Data Augmented Question Answering](/docs/guides/evaluation/data_augmented_question_answering.html): An end-to-end example of evaluating a question answering system focused on a specific document (a RetrievalQAChain to be precise). This example highlights how to use LLMs to come up with question/answer examples to evaluate over, and then highlights how to use LLMs to evaluate performance on those generated examples.
-`Hugging Face Datasets <./huggingface_datasets.html>`_: Covers an example of loading and using a dataset from Hugging Face for evaluation.
+[Hugging Face Datasets](/docs/guides/evaluation/huggingface_datasets.html): Covers an example of loading and using a dataset from Hugging Face for evaluation.
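
Note: most of the link changes in this commit follow one mechanical pattern. A reStructuredText link of the form `` `text <target>`_ `` becomes a Markdown link `[text](target)`, and targets beginning with `./` are rebased onto `/docs/guides/evaluation/`. The snippet below is a rough sketch of that pattern only, not the tooling used for this commit. The function name `rst_links_to_markdown` and its `base` parameter are hypothetical, and it leaves other relative targets alone (for example the `../additional_resources/tracing.html` link, which was remapped by hand to `/docs/guides/tracing/`).

```python
import re

# Matches an RST inline hyperlink of the form `link text <target>`_
RST_LINK = re.compile(r"`([^`<]+?)\s*<([^`>]+)>`_")


def rst_links_to_markdown(text: str, base: str = "/docs/guides/evaluation/") -> str:
    """Rewrite RST inline links as Markdown links.

    Targets starting with './' get the (assumed) `base` prefix; absolute
    URLs and all other targets are passed through unchanged.
    """

    def replace(match: re.Match) -> str:
        label, target = match.group(1), match.group(2)
        if target.startswith("./"):
            target = base + target[2:]
        return f"[{label}]({target})"

    return RST_LINK.sub(replace, text)


if __name__ == "__main__":
    old = "See `this notebook <./qa_generation.html>`_ for an example of how to use this chain."
    print(rst_links_to_markdown(old))
```

Applied to the old side of the `qa_generation` line, this produces the same text as the new side of the diff. Any link whose target moved to a different section of the docs would still need a manual mapping.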