fix eval guide links (#6319)
commit 6640293087 (parent ad324a39ae)

Here is what we have for each problem so far:

**# 1: Lack of data**

We have started [LangChainDatasets](https://huggingface.co/LangChainDatasets), a Community space on Hugging Face.
We intend this to be a collection of open source datasets for evaluating common chains and agents.
We have contributed five datasets of our own to start, but we very much intend this to be a community effort.
In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
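
Once a dataset lives in that space, pulling it down for evaluation only takes the Hugging Face `datasets` client. A minimal sketch, assuming the `agent-search-calculator` dataset name from the space (substitute any dataset listed there):

```python
from datasets import load_dataset

# Assumption: "agent-search-calculator" is one of the datasets published in the
# LangChainDatasets space; swap in whichever dataset you want to evaluate against.
dataset = load_dataset("LangChainDatasets/agent-search-calculator")
print(dataset)  # inspect the splits and fields before wiring it into an eval

```
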
We're also aiming to make it as easy as possible for people to create their own datasets.
As a first pass at this, we've added a QAGenerationChain which, given a document, comes up
with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
See [this notebook](/docs/guides/evaluation/qa_generation.html) for an example of how to use this chain.
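
As a rough sketch of what that looks like (the linked notebook is the authoritative reference; the document text below is just a stand-in):

```python
from langchain.chains import QAGenerationChain
from langchain.chat_models import ChatOpenAI

# Any text you want to evaluate question answering over (stand-in example).
doc_text = "LangChain is a framework for developing applications powered by language models."

# Build the chain from a chat model and generate question/answer pairs from the text.
chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))
qa_pairs = chain.run(doc_text)
print(qa_pairs)  # e.g. [{"question": "...", "answer": "..."}]
```
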
**# 2: Lack of metrics**

We have two solutions to the lack of metrics.

The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
To assist in this, we have developed (and will continue to develop) [tracing](/docs/guides/tracing/), a UI-based visualizer of your chain and agent runs.
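
Tracing is switched on through configuration rather than code changes. As a sketch (the exact environment variable and setup may differ by version, so treat the tracing guide above as authoritative):

```python
import os

# Assumption: LANGCHAIN_TRACING toggles run tracing in this version of LangChain;
# consult the tracing guide for the current setup instructions.
os.environ["LANGCHAIN_TRACING"] = "true"

# ...then build and run your chain or agent as usual; the run appears in the tracing UI.
```
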
The second solution we recommend is to use Language Models themselves to evaluate outputs.
For this we have a few different chains and prompts aimed at tackling this issue.
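
For instance, the QAEvalChain asks an LLM to grade a predicted answer against a reference answer. A minimal sketch (the example and prediction below are placeholders):

```python
from langchain.evaluation.qa import QAEvalChain
from langchain.llms import OpenAI

# Placeholder reference example and model prediction to grade.
examples = [{"query": "What is 2 + 2?", "answer": "4"}]
predictions = [{"result": "2 + 2 equals 4."}]

# Ask an LLM to judge whether each prediction matches the reference answer.
eval_chain = QAEvalChain.from_llm(OpenAI(temperature=0))
graded = eval_chain.evaluate(
    examples,
    predictions,
    question_key="query",
    answer_key="answer",
    prediction_key="result",
)
print(graded)  # typically CORRECT / INCORRECT style grades, one per example
```
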
We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
In addition to the examples we've curated, we also very much welcome contributions here.
To facilitate that, we've included a [template notebook](/docs/guides/evaluation/benchmarking_template.html) for community members to use to build their own examples.

The existing examples we have are:

[Question Answering (State of Union)](/docs/guides/evaluation/qa_benchmarking_sota.html): A notebook showing evaluation of a question-answering task over a State-of-the-Union address.

[Question Answering (Paul Graham Essay)](/docs/guides/evaluation/qa_benchmarking_pg.html): A notebook showing evaluation of a question-answering task over a Paul Graham essay.

[SQL Question Answering (Chinook)](/docs/guides/evaluation/sql_qa_benchmarking_chinook.html): A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).

[Agent Vectorstore](/docs/guides/evaluation/agent_vectordb_sota_pg.html): A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.

[Agent Search + Calculator](/docs/guides/evaluation/agent_benchmarking.html): A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.

[Evaluating an OpenAPI Chain](/docs/guides/evaluation/openapi_eval.html): A notebook showing evaluation of an OpenAPI chain, including how to generate test data if you don't have any.

## Other Examples

In addition, we also have some more generic resources for evaluation.

[Question Answering](/docs/guides/evaluation/question_answering.html): An overview of LLMs aimed at evaluating question answering systems in general.

[Data Augmented Question Answering](/docs/guides/evaluation/data_augmented_question_answering.html): An end-to-end example of evaluating a question answering system focused on a specific document (a RetrievalQAChain, to be precise). This example highlights how to use LLMs to come up with question/answer examples to evaluate over, and then how to use LLMs to evaluate performance on those generated examples.

[Hugging Face Datasets](/docs/guides/evaluation/huggingface_datasets.html): Covers an example of loading and using a dataset from Hugging Face for evaluation.