langchain/docs/extras/use_cases/tagging.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "a0507a4b",
"metadata": {},
"source": [
"# Tagging\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/tagging.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"Tagging means labeling a document with classes such as:\n",
"\n",
"- sentiment\n",
"- language\n",
"- style (formal, informal etc.)\n",
"- covered topics\n",
"- political tendency\n",
"\n",
"![Image description](/img/tagging.png)\n",
"\n",
"## Overview\n",
"\n",
"Tagging has a few components:\n",
"\n",
"* `function`: Like [extraction](/docs/use_cases/extraction), tagging uses [functions](https://openai.com/blog/function-calling-and-other-api-updates) to specify how the model should tag a document\n",
"* `schema`: defines how we want to tag the document\n",
"\n",
"## Quickstart\n",
"\n",
"Let's see a very straightforward example of how we can use OpenAI functions for tagging in LangChain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc5cbb6f",
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain openai \n",
"\n",
"# Set env var OPENAI_API_KEY or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_env()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bafb496a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.chains import create_tagging_chain, create_tagging_chain_pydantic"
]
},
{
"cell_type": "markdown",
"id": "b8ca3f93",
"metadata": {},
"source": [
"We specify a few properties with their expected type in our schema."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "39f3ce3e",
"metadata": {},
"outputs": [],
"source": [
"# Schema\n",
"schema = {\n",
" \"properties\": {\n",
" \"sentiment\": {\"type\": \"string\"},\n",
" \"aggressiveness\": {\"type\": \"integer\"},\n",
" \"language\": {\"type\": \"string\"},\n",
" }\n",
"}\n",
"\n",
"# LLM\n",
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")\n",
"chain = create_tagging_chain(schema, llm)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5509b6a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sentiment': 'positive', 'language': 'Spanish'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9154474c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sentiment': 'enojado', 'aggressiveness': 1, 'language': 'es'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Estoy muy enojado con vos! Te voy a dar tu merecido!\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "markdown",
"id": "d921bb53",
"metadata": {},
"source": [
"As we can see in the examples, it correctly interprets what we want.\n",
"\n",
"The results vary so that we get, for example, sentiments in different languages ('positive', 'enojado' etc.).\n",
"\n",
"We will see how to control these results in the next section."
]
},
{
"cell_type": "markdown",
"id": "bebb2f83",
"metadata": {},
"source": [
"## Finer control\n",
"\n",
"Careful schema definition gives us more control over the model's output. \n",
"\n",
"Specifically, we can define:\n",
"\n",
"- possible values for each property\n",
"- description to make sure that the model understands the property\n",
"- required properties to be returned"
]
},
{
"cell_type": "markdown",
"id": "69ef0b9a",
"metadata": {},
"source": [
"Here is an example of how we can use `_enum_`, `_description_`, and `_required_` to control for each of the previously mentioned aspects:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6a5f7961",
"metadata": {},
"outputs": [],
"source": [
"schema = {\n",
" \"properties\": {\n",
" \"aggressiveness\": {\n",
" \"type\": \"integer\",\n",
" \"enum\": [1, 2, 3, 4, 5],\n",
" \"description\": \"describes how aggressive the statement is, the higher the number the more aggressive\",\n",
" },\n",
" \"language\": {\n",
" \"type\": \"string\",\n",
" \"enum\": [\"spanish\", \"english\", \"french\", \"german\", \"italian\"],\n",
" },\n",
" },\n",
" \"required\": [\"language\", \"sentiment\", \"aggressiveness\"],\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e5a5881f",
"metadata": {},
"outputs": [],
"source": [
"chain = create_tagging_chain(schema, llm)"
]
},
{
"cell_type": "markdown",
"id": "5ded2332",
"metadata": {},
"source": [
"Now the answers are much better!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d9b9d53d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'aggressiveness': 0, 'language': 'spanish'}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1c12fa00",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'aggressiveness': 5, 'language': 'spanish'}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Estoy muy enojado con vos! Te voy a dar tu merecido!\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0bdfcb05",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'aggressiveness': 0, 'language': 'english'}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inp = \"Weather is ok here, I can go outside without much more than a coat\"\n",
"chain.run(inp)"
]
},
{
"cell_type": "markdown",
"id": "cf6b7389",
"metadata": {},
"source": [
"The [LangSmith trace](https://smith.langchain.com/public/311e663a-bbe8-4053-843e-5735055c032d/r) lets us peek under the hood:\n",
"\n",
"* As with [extraction](/docs/use_cases/extraction), we call the `information_extraction` function [here](https://github.com/langchain-ai/langchain/blob/269f85b7b7ffd74b38cd422d4164fc033388c3d0/libs/langchain/langchain/chains/openai_functions/extraction.py#L20) on the input string.\n",
"* This OpenAI funtion extraction information based upon the provided schema.\n",
"\n",
"![Image description](/img/tagging_trace.png)"
]
},
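{
"cell_type": "markdown",
"id": "f3a81c2d",
"metadata": {},
"source": [
"If you don't have a LangSmith trace handy, you can also peek at the payload locally. The sketch below assumes that the `LLMChain` returned by `create_tagging_chain` exposes the OpenAI function definition through its `llm_kwargs` attribute (an internal detail that may change between versions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d45e1a",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Print the `information_extraction` function definition that the chain\n",
"# attaches to each OpenAI call (assumes the chain exposes `llm_kwargs`).\n",
"print(json.dumps(chain.llm_kwargs, indent=2))"
]
},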
{
"cell_type": "markdown",
"id": "e68ad17e",
"metadata": {},
"source": [
"## Pydantic"
]
},
{
"cell_type": "markdown",
"id": "2f5970ec",
"metadata": {},
"source": [
"We can also use a Pydantic schema to specify the required properties and types. \n",
"\n",
"We can also send other arguments, such as `enum` or `description`, to each field.\n",
"\n",
"This lets us specify our schema in the same manner that we would a new class or function in Python with purely Pythonic types."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bf1f367e",
"metadata": {},
"outputs": [],
"source": [
"from enum import Enum\n",
"from pydantic import BaseModel, Field"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "83a2e826",
"metadata": {},
"outputs": [],
"source": [
"class Tags(BaseModel):\n",
" sentiment: str = Field(..., enum=[\"happy\", \"neutral\", \"sad\"])\n",
" aggressiveness: int = Field(\n",
" ...,\n",
" description=\"describes how aggressive the statement is, the higher the number the more aggressive\",\n",
" enum=[1, 2, 3, 4, 5],\n",
" )\n",
" language: str = Field(\n",
" ..., enum=[\"spanish\", \"english\", \"french\", \"german\", \"italian\"]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "6e404892",
"metadata": {},
"outputs": [],
"source": [
"chain = create_tagging_chain_pydantic(Tags, llm)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b5fc43c4",
"metadata": {},
"outputs": [],
"source": [
"inp = \"Estoy muy enojado con vos! Te voy a dar tu merecido!\"\n",
"res = chain.run(inp)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "5074bcc3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tags(sentiment='sad', aggressiveness=5, language='spanish')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res"
]
},
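{
"cell_type": "markdown",
"id": "c9e2f7a4",
"metadata": {},
"source": [
"Because the chain now returns a `Tags` instance rather than a dict, we can work with the result like any other Pydantic object. A small usage sketch (assuming the Pydantic v1 API that LangChain uses here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6b31d8e",
"metadata": {},
"outputs": [],
"source": [
"# Access the typed fields directly ...\n",
"print(res.aggressiveness)\n",
"\n",
"# ... or convert back to a plain dict (Pydantic v1 API).\n",
"print(res.dict())"
]
},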
{
"cell_type": "markdown",
"id": "29346d09",
"metadata": {},
"source": [
"### Going deeper\n",
"\n",
"* You can use the [metadata tagger](https://python.langchain.com/docs/integrations/document_transformers/openai_metadata_tagger) document transformer to extract metadata from a LangChain `Document`. \n",
"* This covers the same basic functionality as the tagging chain, only applied to a LangChain `Document`."
]
}
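,
{
"cell_type": "markdown",
"id": "d1f58b3c",
"metadata": {},
"source": [
"Below is a minimal sketch of the metadata tagger, reusing the `llm` defined above. The schema and sample document are hypothetical examples; see the linked integration page for the full API and options."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8c04a9f",
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_transformers.openai_functions import create_metadata_tagger\n",
"\n",
"# Hypothetical schema for illustration; any JSON schema like the ones above works.\n",
"metadata_schema = {\n",
"    \"properties\": {\n",
"        \"sentiment\": {\"type\": \"string\", \"enum\": [\"happy\", \"neutral\", \"sad\"]},\n",
"        \"language\": {\"type\": \"string\"},\n",
"    },\n",
"    \"required\": [\"sentiment\", \"language\"],\n",
"}\n",
"\n",
"document_transformer = create_metadata_tagger(metadata_schema=metadata_schema, llm=llm)\n",
"\n",
"original_documents = [\n",
"    Document(page_content=\"Estoy increiblemente contento de haberte conocido!\"),\n",
"]\n",
"\n",
"# Returns new Documents with the extracted tags added to `.metadata`.\n",
"enhanced_documents = document_transformer.transform_documents(original_documents)\n",
"print(enhanced_documents[0].metadata)"
]
}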
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}