langchain/docs/modules/agents/toolkits/examples/azure_cognitive_services.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Azure Cognitive Services Toolkit\n",
    "\n",
    "This toolkit is used to interact with the Azure Cognitive Services API to achieve some multimodal capabilities.\n",
    "\n",
    "Currently There are four tools bundled in this toolkit:\n",
    "- AzureCogsImageAnalysisTool: used to extract caption, objects, tags, and text from images. (Note: this tool is not available on Mac OS yet, due to the dependency on `azure-ai-vision` package, which is only supported on Windows and Linux currently.)\n",
    "- AzureCogsFormRecognizerTool: used to extract text, tables, and key-value pairs from documents.\n",
    "- AzureCogsSpeech2TextTool: used to transcribe speech to text.\n",
    "- AzureCogsText2SpeechTool: used to synthesize text to speech."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, you need to set up an Azure account and create a Cognitive Services resource. You can follow the instructions [here](https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-apis-create-account?tabs=multiservice%2Cwindows) to create a resource. \n",
    "\n",
    "Then, you need to get the endpoint, key and region of your resource, and set them as environment variables. You can find them in the \"Keys and Endpoint\" page of your resource."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install --upgrade azure-ai-formrecognizer > /dev/null\n",
    "# !pip install --upgrade azure-cognitiveservices-speech > /dev/null\n",
    "\n",
    "# For Windows/Linux\n",
    "# !pip install --upgrade azure-ai-vision > /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-\"\n",
    "os.environ[\"AZURE_COGS_KEY\"] = \"\"\n",
    "os.environ[\"AZURE_COGS_ENDPOINT\"] = \"\"\n",
    "os.environ[\"AZURE_COGS_REGION\"] = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the Toolkit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.agents.agent_toolkits import AzureCognitiveServicesToolkit\n",
    "\n",
    "toolkit = AzureCognitiveServicesToolkit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Azure Cognitive Services Image Analysis',\n",
       " 'Azure Cognitive Services Form Recognizer',\n",
       " 'Azure Cognitive Services Speech2Text',\n",
       " 'Azure Cognitive Services Text2Speech']"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[tool.name for tool in toolkit.get_tools()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use within an Agent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import OpenAI\n",
    "from langchain.agents import initialize_agent, AgentType"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0)\n",
    "agent = initialize_agent(\n",
    "    tools=toolkit.get_tools(),\n",
    "    llm=llm,\n",
    "    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
      "\u001b[32;1m\u001b[1;3m\n",
      "Action:\n",
      "```\n",
      "{\n",
      "  \"action\": \"Azure Cognitive Services Image Analysis\",\n",
      "  \"action_input\": \"https://images.openai.com/blob/9ad5a2ab-041f-475f-ad6a-b51899c50182/ingredients.png\"\n",
      "}\n",
      "```\n",
      "\n",
      "\u001b[0m\n",
      "Observation: \u001b[36;1m\u001b[1;3mCaption: a group of eggs and flour in bowls\n",
      "Objects: Egg, Egg, Food\n",
      "Tags: dairy, ingredient, indoor, thickening agent, food, mixing bowl, powder, flour, egg, bowl\u001b[0m\n",
      "Thought:\u001b[32;1m\u001b[1;3m I can use the objects and tags to suggest recipes\n",
      "Action:\n",
      "```\n",
      "{\n",
      "  \"action\": \"Final Answer\",\n",
      "  \"action_input\": \"You can make pancakes, omelettes, or quiches with these ingredients!\"\n",
      "}\n",
      "```\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'You can make pancakes, omelettes, or quiches with these ingredients!'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent.run(\"What can I make with these ingredients?\"\n",
    "          \"https://images.openai.com/blob/9ad5a2ab-041f-475f-ad6a-b51899c50182/ingredients.png\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
      "\u001b[32;1m\u001b[1;3mAction:\n",
      "```\n",
      "{\n",
      "  \"action\": \"Azure Cognitive Services Text2Speech\",\n",
      "  \"action_input\": \"Why did the chicken cross the playground? To get to the other slide!\"\n",
      "}\n",
      "```\n",
      "\n",
      "\u001b[0m\n",
      "Observation: \u001b[31;1m\u001b[1;3m/tmp/tmpa3uu_j6b.wav\u001b[0m\n",
      "Thought:\u001b[32;1m\u001b[1;3m I have the audio file of the joke\n",
      "Action:\n",
      "```\n",
      "{\n",
      "  \"action\": \"Final Answer\",\n",
      "  \"action_input\": \"/tmp/tmpa3uu_j6b.wav\"\n",
      "}\n",
      "```\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'/tmp/tmpa3uu_j6b.wav'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "audio_file = agent.run(\"Tell me a joke and read it out for me.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython import display\n",
    "\n",
    "audio = display.Audio(audio_file)\n",
    "display.display(audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}