mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
c5edbea34a
Current problems: 1. Evaluating LLMs or Chat models isn't smooth. Even specifying 'generations' as the output inserts a redundant list into the eval template 2. Configuring input / prediction / reference keys in the `get_qa_evaluator` function is confusing. Unless you are using a chain with the default keys, you have to specify all the variables and need to reason about whether the key corresponds to the traced run's inputs, outputs or the examples inputs or outputs. Proposal: - Configure the run evaluator according to a model. Use the model type and input/output keys to assert compatibility where possible. Only need to specify a reference_key for certain evaluators (which is less confusing than specifying input keys) When does this work: - If you have your langchain model available (assumed always for run_on_dataset flow) - If you are evaluating an LLM, Chat model, or chain - If the LLM or chat models are traced by langchain (wouldn't work if you add an incompatible schema via the REST API) When would this fail: - Currently if you directly create an example from an LLM run, the outputs are generations with all the extra metadata present. A simple `example_key` and dumping all to the template could make the evaluations unreliable - Doesn't help if you're not using the low level API - If you want to instantiate the evaluator without instantiating your chain or LLM (maybe common for monitoring, for instance) -> could also load from run or run type though What's ugly: - Personally think it's better to load evaluators one by one since passing a config down is pretty confusing. - Lots of testing needs to be added - Inconsistent in that it makes a separate run and example input mapper instead of the original `RunEvaluatorInputMapper`, which maps a run and example to a single input. Example usage running the for an LLM, Chat Model, and Agent. ``` # Test running for the string evaluators evaluator_names = ["qa", "criteria"] model = ChatOpenAI() configured_evaluators = load_run_evaluators_for_model(evaluator_names, model=model, reference_key="answer") run_on_dataset(ds_name, model, run_evaluators=configured_evaluators) ``` <details> <summary>Full code with dataset upload</summary> ``` ## Create dataset from langchain.evaluation.run_evaluators.loading import load_run_evaluators_for_model from langchain.evaluation import load_dataset import pandas as pd lcds = load_dataset("llm-math") df = pd.DataFrame(lcds) from uuid import uuid4 from langsmith import Client client = Client() ds_name = "llm-math - " + str(uuid4())[0:8] ds = client.upload_dataframe(df, name=ds_name, input_keys=["question"], output_keys=["answer"]) ## Define the models we'll test over from langchain.llms import OpenAI from langchain.chat_models import ChatOpenAI from langchain.agents import initialize_agent, AgentType from langchain.tools import tool llm = OpenAI(temperature=0) chat_model = ChatOpenAI(temperature=0) @tool def sum(a: float, b: float) -> float: """Add two numbers""" return a + b def construct_agent(): return initialize_agent( llm=chat_model, tools=[sum], agent=AgentType.OPENAI_MULTI_FUNCTIONS, ) agent = construct_agent() # Test running for the string evaluators evaluator_names = ["qa", "criteria"] models = [llm, chat_model, agent] run_evaluators = [] for model in models: run_evaluators.append(load_run_evaluators_for_model(evaluator_names, model=model, reference_key="answer")) # Run on LLM, Chat Model, and Agent from langchain.client.runner_utils import run_on_dataset to_test = [llm, chat_model, construct_agent] for model, configured_evaluators in zip(to_test, run_evaluators): run_on_dataset(ds_name, model, run_evaluators=configured_evaluators, verbose=True) ``` </details> --------- Co-authored-by: Nuno Campos <nuno@boringbits.io> |
||
---|---|---|
.. | ||
__init__.py | ||
test_runner_utils.py |