Feature/add deepeval (#10349)
Description: Adding `DeepEval`, which provides an opinionated framework for testing and evaluating LLMs.
Issue: Missing DeepEval
Dependencies: Optional DeepEval dependency
Tag maintainer: @baskaryan (not 100% sure)
Twitter handle: https://twitter.com/ColabDog
parent 675d57df50
commit 6ad6bb46c4
@ -0,0 +1,22 @@
# Confident AI

![Confident - Unit Testing for LLMs](https://github.com/confident-ai/deepeval)

>[DeepEval](https://confident-ai.com) package for unit testing LLMs.
> Using Confident, everyone can build robust language models through faster iterations
> using both unit testing and integration testing. We provide support for each step in the iteration
> from synthetic data creation to testing.

## Installation and Setup

First, you'll need to install the `DeepEval` Python package as follows:

```bash
pip install deepeval
```

Afterwards, you can get started in as little as a few lines of code.

```python
from langchain.callbacks import DeepEvalCallbackHandler
```
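
For a fuller picture, here is a minimal sketch of wiring the handler into an LLM. It mirrors the docstring example in the handler added below; the `minimum_score` threshold and the placeholder API key are illustrative values, not requirements.

```python
from deepeval.metrics import AnswerRelevancy

from langchain.callbacks import DeepEvalCallbackHandler
from langchain.llms import OpenAI

# Illustrative metric: require an answer relevancy score of at least 0.3
answer_relevancy = AnswerRelevancy(minimum_score=0.3)

# The callback evaluates each prompt/completion pair with the given metrics
deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="exampleImplementation",
    metrics=[answer_relevancy],
)

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",  # placeholder, substitute your own key
)
llm.generate(["What is the best evaluation tool out there? (no bias at all)"])
```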
@ -0,0 +1,188 @@
# flake8: noqa
import os
import warnings
from typing import Any, Dict, List, Optional, Union

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import AgentAction, AgentFinish, LLMResult


class DeepEvalCallbackHandler(BaseCallbackHandler):
    """Callback Handler that logs into deepeval.

    Args:
        implementation_name: name of the `implementation` in deepeval
        metrics: A list of metrics

    Raises:
        ImportError: if the `deepeval` package is not installed.

    Examples:
        >>> from langchain.llms import OpenAI
        >>> from langchain.callbacks import DeepEvalCallbackHandler
        >>> from deepeval.metrics import AnswerRelevancy
        >>> metric = AnswerRelevancy(minimum_score=0.3)
        >>> deepeval_callback = DeepEvalCallbackHandler(
        ...     implementation_name="exampleImplementation",
        ...     metrics=[metric],
        ... )
        >>> llm = OpenAI(
        ...     temperature=0,
        ...     callbacks=[deepeval_callback],
        ...     verbose=True,
        ...     openai_api_key="API_KEY_HERE",
        ... )
        >>> llm.generate([
        ...     "What is the best evaluation tool out there? (no bias at all)",
        ... ])
        "Deepeval, no doubt about it."
    """

    REPO_URL: str = "https://github.com/confident-ai/deepeval"
    ISSUES_URL: str = f"{REPO_URL}/issues"
    BLOG_URL: str = "https://docs.confident-ai.com"  # noqa: E501

    def __init__(
        self,
        metrics: List[Any],
        implementation_name: Optional[str] = None,
    ) -> None:
        """Initializes the `DeepEvalCallbackHandler`.

        Args:
            implementation_name: Name of the implementation you want.
            metrics: What metrics do you want to track?

        Raises:
            ImportError: if the `deepeval` package is not installed.
            ConnectionError: if the connection to deepeval fails.
        """
        super().__init__()

        # Import deepeval (not via `import_deepeval` to keep hints in IDEs)
        try:
            import deepeval  # noqa: F401,I001
        except ImportError:
            raise ImportError(
                """To use the deepeval callback manager you need to have the
                `deepeval` Python package installed. Please install it with
                `pip install deepeval`"""
            )

        if os.path.exists(".deepeval"):
            warnings.warn(
                """You are currently not logging anything to the dashboard, we
                recommend using `deepeval login`."""
            )

        # Set the deepeval variables
        self.implementation_name = implementation_name
        self.metrics = metrics

        warnings.warn(
            (
                "The `DeepEvalCallbackHandler` is currently in beta and is subject to"
                " change based on updates to `langchain`. Please report any issues to"
                f" {self.ISSUES_URL} as an `integration` issue."
            ),
        )

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """Store the prompts."""
        self.prompts = prompts

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Do nothing when a new token is generated."""
        pass

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        """Log records to deepeval when an LLM ends."""
        from deepeval.metrics.answer_relevancy import AnswerRelevancy
        from deepeval.metrics.bias_classifier import UnBiasedMetric
        from deepeval.metrics.metric import Metric  # noqa: F401
        from deepeval.metrics.toxic_classifier import NonToxicMetric

        for metric in self.metrics:
            for i, generation in enumerate(response.generations):
                # Here, we only measure the first generation's output
                output = generation[0].text
                query = self.prompts[i]
                if isinstance(metric, AnswerRelevancy):
                    result = metric.measure(
                        output=output,
                        query=query,
                    )
                    print(f"Answer Relevancy: {result}")
                elif isinstance(metric, UnBiasedMetric):
                    score = metric.measure(output)
                    print(f"Bias Score: {score}")
                elif isinstance(metric, NonToxicMetric):
                    score = metric.measure(output)
                    print(f"Toxic Score: {score}")
                else:
                    raise ValueError(
                        f"""Metric {metric.__class__.__name__} is not supported by
                        deepeval callbacks."""
                    )

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Do nothing when LLM outputs an error."""
        pass

    def on_chain_start(
        self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any
    ) -> None:
        """Do nothing when chain starts."""
        pass

    def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
        """Do nothing when chain ends."""
        pass

    def on_chain_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Do nothing when LLM chain outputs an error."""
        pass

    def on_tool_start(
        self,
        serialized: Dict[str, Any],
        input_str: str,
        **kwargs: Any,
    ) -> None:
        """Do nothing when tool starts."""
        pass

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
        """Do nothing when agent takes a specific action."""
        pass

    def on_tool_end(
        self,
        output: str,
        observation_prefix: Optional[str] = None,
        llm_prefix: Optional[str] = None,
        **kwargs: Any,
    ) -> None:
        """Do nothing when tool ends."""
        pass

    def on_tool_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Do nothing when tool outputs an error."""
        pass

    def on_text(self, text: str, **kwargs: Any) -> None:
        """Do nothing."""
        pass

    def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> None:
        """Do nothing."""
        pass
@ -0,0 +1,26 @@
"""Test Confident."""


def test_confident_deepeval() -> None:
    """Test valid call to Confident."""
    from deepeval.metrics.answer_relevancy import AnswerRelevancy

    from langchain.callbacks.confident_callback import DeepEvalCallbackHandler
    from langchain.llms import OpenAI

    answer_relevancy = AnswerRelevancy(minimum_score=0.3)
    deepeval_callback = DeepEvalCallbackHandler(
        implementation_name="exampleImplementation", metrics=[answer_relevancy]
    )
    llm = OpenAI(
        temperature=0,
        callbacks=[deepeval_callback],
        verbose=True,
        openai_api_key="<YOUR_API_KEY>",
    )
    llm.generate(
        [
            "What is the best evaluation tool out there? (no bias at all)",
        ]
    )
    assert answer_relevancy.is_successful(), "Answer not relevant"