langchain/docs/docs/integrations/document_transformers/openai_metadata_tagger.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenAI metadata tagger\n",
    "\n",
    "It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.\n",
    "\n",
    "The `OpenAIMetadataTagger` document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable `OpenAI Functions`-powered chain under the hood, so if you pass a custom LLM instance, it must be an `OpenAI` model with functions support. \n",
    "\n",
    "**Note:** This document transformer works best with complete documents, so it's best to run it first with whole documents before doing any other splitting or processing!\n",
    "\n",
    "For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid `JSON Schema` object as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_transformers.openai_functions import (\n",
    "    create_metadata_tagger,\n",
    ")\n",
    "from langchain_core.documents import Document\n",
    "from langchain_openai import ChatOpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"movie_title\": {\"type\": \"string\"},\n",
    "        \"critic\": {\"type\": \"string\"},\n",
    "        \"tone\": {\"type\": \"string\", \"enum\": [\"positive\", \"negative\"]},\n",
    "        \"rating\": {\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"The number of stars the critic rated the movie\",\n",
    "        },\n",
    "    },\n",
    "    \"required\": [\"movie_title\", \"critic\", \"tone\"],\n",
    "}\n",
    "\n",
    "# Must be an OpenAI model that supports functions\n",
    "llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")\n",
    "\n",
    "document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can then simply pass the document transformer a list of documents, and it will extract metadata from the contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_documents = [\n",
    "    Document(\n",
    "        page_content=\"Review of The Bee Movie\\nBy Roger Ebert\\n\\nThis is the greatest movie ever made. 4 out of 5 stars.\"\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Review of The Godfather\\nBy Anonymous\\n\\nThis movie was super boring. 1 out of 5 stars.\",\n",
    "        metadata={\"reliable\": False},\n",
    "    ),\n",
    "]\n",
    "\n",
    "enhanced_documents = document_transformer.transform_documents(original_documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Review of The Bee Movie\n",
      "By Roger Ebert\n",
      "\n",
      "This is the greatest movie ever made. 4 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
      "\n",
      "---------------\n",
      "\n",
      "Review of The Godfather\n",
      "By Anonymous\n",
      "\n",
      "This movie was super boring. 1 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Godfather\", \"critic\": \"Anonymous\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "print(\n",
    "    *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
    "    sep=\"\\n\\n---------------\\n\\n\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.\n",
    "\n",
    "You can also initialize the document transformer with a Pydantic schema:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Review of The Bee Movie\n",
      "By Roger Ebert\n",
      "\n",
      "This is the greatest movie ever made. 4 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
      "\n",
      "---------------\n",
      "\n",
      "Review of The Godfather\n",
      "By Anonymous\n",
      "\n",
      "This movie was super boring. 1 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Godfather\", \"critic\": \"Anonymous\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
     ]
    }
   ],
   "source": [
    "from typing import Literal\n",
    "\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class Properties(BaseModel):\n",
    "    movie_title: str\n",
    "    critic: str\n",
    "    tone: Literal[\"positive\", \"negative\"]\n",
    "    rating: int = Field(description=\"Rating out of 5 stars\")\n",
    "\n",
    "\n",
    "document_transformer = create_metadata_tagger(Properties, llm)\n",
    "enhanced_documents = document_transformer.transform_documents(original_documents)\n",
    "\n",
    "print(\n",
    "    *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
    "    sep=\"\\n\\n---------------\\n\\n\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "## Customization\n",
    "\n",
    "You can pass the underlying tagging chain the standard LLMChain arguments in the document transformer constructor. For example, if you wanted to ask the LLM to focus specific details in the input documents, or extract metadata in a certain style, you could pass in a custom prompt:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Review of The Bee Movie\n",
      "By Roger Ebert\n",
      "\n",
      "This is the greatest movie ever made. 4 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Bee Movie\", \"critic\": \"Roger Ebert\", \"tone\": \"positive\", \"rating\": 4}\n",
      "\n",
      "---------------\n",
      "\n",
      "Review of The Godfather\n",
      "By Anonymous\n",
      "\n",
      "This movie was super boring. 1 out of 5 stars.\n",
      "\n",
      "{\"movie_title\": \"The Godfather\", \"critic\": \"Roger Ebert\", \"tone\": \"negative\", \"rating\": 1, \"reliable\": false}\n"
     ]
    }
   ],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(\n",
    "    \"\"\"Extract relevant information from the following text.\n",
    "Anonymous critics are actually Roger Ebert.\n",
    "\n",
    "{input}\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)\n",
    "enhanced_documents = document_transformer.transform_documents(original_documents)\n",
    "\n",
    "print(\n",
    "    *[d.page_content + \"\\n\\n\" + json.dumps(d.metadata) for d in enhanced_documents],\n",
    "    sep=\"\\n\\n---------------\\n\\n\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}