Add AzureCognitiveServicesToolkit to call Azure Cognitive Services API (#5012)

# Add AzureCognitiveServicesToolkit to call Azure Cognitive Services API: achieve some multimodal capabilities

This PR adds a toolkit named AzureCognitiveServicesToolkit which bundles the following tools:

- AzureCogsImageAnalysisTool: calls Azure Cognitive Services image analysis API to extract caption, objects, tags, and text from images.
- AzureCogsFormRecognizerTool: calls Azure Cognitive Services form recognizer API to extract text, tables, and key-value pairs from documents.
- AzureCogsSpeech2TextTool: calls Azure Cognitive Services speech to text API to transcribe speech to text.
- AzureCogsText2SpeechTool: calls Azure Cognitive Services text to speech API to synthesize text to speech.

This toolkit can be used to process image, document, and audio inputs.

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
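
As a rough illustration of how the bundled tools are selected per platform: `toolkit.py` in this commit always returns the form recognizer and the two speech tools, and appends the image analysis tool only when `sys.platform` starts with `linux` or `win`. The sketch below mirrors that check without importing LangChain or any Azure SDK. The helper `select_tool_names` and the two constants are hypothetical names introduced here for illustration; they are not part of the commit.

```python
from typing import List

# Tool names as they appear in the `name` attributes added by this commit.
ALWAYS_AVAILABLE = [
    "Azure Cognitive Services Form Recognizer",
    "Azure Cognitive Services Speech2Text",
    "Azure Cognitive Services Text2Speech",
]
IMAGE_ANALYSIS = "Azure Cognitive Services Image Analysis"


def select_tool_names(platform: str) -> List[str]:
    """Hypothetical helper: names of the tools the toolkit would expose.

    `platform` is expected to look like `sys.platform`,
    e.g. "linux", "win32", or "darwin".
    """
    names = list(ALWAYS_AVAILABLE)
    # Same condition as in toolkit.py: the image analysis tool is added
    # only on Linux and Windows, where azure-ai-vision is supported.
    if platform.startswith("linux") or platform.startswith("win"):
        names.append(IMAGE_ANALYSIS)
    return names


if __name__ == "__main__":
    import sys

    # Show which tools would be available on the current machine.
    print(select_tool_names(sys.platform))
```

On macOS (`"darwin"`) this sketch returns three names; on Linux or Windows it returns four, with the image analysis tool appended last, matching the order in which `get_tools` builds its list.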
12 months ago · d7f807b71f
parent d4fd589638
commit d7f807b71f
14 changed files with 1036 additions and 5 deletions
--- a/docs/modules/agents/toolkits/examples/azure_cognitive_services.ipynb
+++ b/docs/modules/agents/toolkits/examples/azure_cognitive_services.ipynb
@@ -0,0 +1,270 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Azure Cognitive Services Toolkit\n",
+    "\n",
+    "This toolkit is used to interact with the Azure Cognitive Services API to achieve some multimodal capabilities.\n",
+    "\n",
+    "Currently, four tools are bundled in this toolkit:\n",
+    "- AzureCogsImageAnalysisTool: used to extract caption, objects, tags, and text from images. (Note: this tool is not yet available on macOS because it depends on the `azure-ai-vision` package, which currently supports only Windows and Linux.)\n",
+    "- AzureCogsFormRecognizerTool: used to extract text, tables, and key-value pairs from documents.\n",
+    "- AzureCogsSpeech2TextTool: used to transcribe speech to text.\n",
+    "- AzureCogsText2SpeechTool: used to synthesize text to speech."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, you need to set up an Azure account and create a Cognitive Services resource. You can follow the instructions [here](https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-apis-create-account?tabs=multiservice%2Cwindows) to create a resource. \n",
+    "\n",
+    "Then, you need to get the endpoint, key and region of your resource, and set them as environment variables. You can find them in the \"Keys and Endpoint\" page of your resource."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !pip install --upgrade azure-ai-formrecognizer > /dev/null\n",
+    "# !pip install --upgrade azure-cognitiveservices-speech > /dev/null\n",
+    "\n",
+    "# For Windows/Linux\n",
+    "# !pip install --upgrade azure-ai-vision > /dev/null"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = \"sk-\"\n",
+    "os.environ[\"AZURE_COGS_KEY\"] = \"\"\n",
+    "os.environ[\"AZURE_COGS_ENDPOINT\"] = \"\"\n",
+    "os.environ[\"AZURE_COGS_REGION\"] = \"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the Toolkit"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.agents.agent_toolkits import AzureCognitiveServicesToolkit\n",
+    "\n",
+    "toolkit = AzureCognitiveServicesToolkit()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['Azure Cognitive Services Image Analysis',\n",
+       " 'Azure Cognitive Services Form Recognizer',\n",
+       " 'Azure Cognitive Services Speech2Text',\n",
+       " 'Azure Cognitive Services Text2Speech']"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "[tool.name for tool in toolkit.get_tools()]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use within an Agent"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain import OpenAI\n",
+    "from langchain.agents import initialize_agent, AgentType"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "agent = initialize_agent(\n",
+    "    tools=toolkit.get_tools(),\n",
+    "    llm=llm,\n",
+    "    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,\n",
+    "    verbose=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
+      "\u001b[32;1m\u001b[1;3m\n",
+      "Action:\n",
+      "```\n",
+      "{\n",
+      "  \"action\": \"Azure Cognitive Services Image Analysis\",\n",
+      "  \"action_input\": \"https://images.openai.com/blob/9ad5a2ab-041f-475f-ad6a-b51899c50182/ingredients.png\"\n",
+      "}\n",
+      "```\n",
+      "\n",
+      "\u001b[0m\n",
+      "Observation: \u001b[36;1m\u001b[1;3mCaption: a group of eggs and flour in bowls\n",
+      "Objects: Egg, Egg, Food\n",
+      "Tags: dairy, ingredient, indoor, thickening agent, food, mixing bowl, powder, flour, egg, bowl\u001b[0m\n",
+      "Thought:\u001b[32;1m\u001b[1;3m I can use the objects and tags to suggest recipes\n",
+      "Action:\n",
+      "```\n",
+      "{\n",
+      "  \"action\": \"Final Answer\",\n",
+      "  \"action_input\": \"You can make pancakes, omelettes, or quiches with these ingredients!\"\n",
+      "}\n",
+      "```\u001b[0m\n",
+      "\n",
+      "\u001b[1m> Finished chain.\u001b[0m\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'You can make pancakes, omelettes, or quiches with these ingredients!'"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "agent.run(\"What can I make with these ingredients? \"\n",
+    "          \"https://images.openai.com/blob/9ad5a2ab-041f-475f-ad6a-b51899c50182/ingredients.png\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
+      "\u001b[32;1m\u001b[1;3mAction:\n",
+      "```\n",
+      "{\n",
+      "  \"action\": \"Azure Cognitive Services Text2Speech\",\n",
+      "  \"action_input\": \"Why did the chicken cross the playground? To get to the other slide!\"\n",
+      "}\n",
+      "```\n",
+      "\n",
+      "\u001b[0m\n",
+      "Observation: \u001b[31;1m\u001b[1;3m/tmp/tmpa3uu_j6b.wav\u001b[0m\n",
+      "Thought:\u001b[32;1m\u001b[1;3m I have the audio file of the joke\n",
+      "Action:\n",
+      "```\n",
+      "{\n",
+      "  \"action\": \"Final Answer\",\n",
+      "  \"action_input\": \"/tmp/tmpa3uu_j6b.wav\"\n",
+      "}\n",
+      "```\u001b[0m\n",
+      "\n",
+      "\u001b[1m> Finished chain.\u001b[0m\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'/tmp/tmpa3uu_j6b.wav'"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "audio_file = agent.run(\"Tell me a joke and read it out for me.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from IPython import display\n",
+    "\n",
+    "audio = display.Audio(audio_file)\n",
+    "display.display(audio)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/langchain/agents/agent_toolkits/__init__.py
+++ b/langchain/agents/agent_toolkits/__init__.py
@@ -1,5 +1,8 @@
 """Agent toolkits."""

+from langchain.agents.agent_toolkits.azure_cognitive_services.toolkit import (
+    AzureCognitiveServicesToolkit,
+)
 from langchain.agents.agent_toolkits.csv.base import create_csv_agent
 from langchain.agents.agent_toolkits.file_management.toolkit import (
    FileManagementToolkit,
@@ -60,4 +63,5 @@ __all__ = [
    "JiraToolkit",
    "FileManagementToolkit",
    "PlayWrightBrowserToolkit",
+    "AzureCognitiveServicesToolkit",
 ]
--- a/langchain/agents/agent_toolkits/azure_cognitive_services/__init__.py
+++ b/langchain/agents/agent_toolkits/azure_cognitive_services/__init__.py
@@ -0,0 +1,7 @@
+"""Azure Cognitive Services Toolkit."""
+
+from langchain.agents.agent_toolkits.azure_cognitive_services.toolkit import (
+    AzureCognitiveServicesToolkit,
+)
+
+__all__ = ["AzureCognitiveServicesToolkit"]
--- a/langchain/agents/agent_toolkits/azure_cognitive_services/toolkit.py
+++ b/langchain/agents/agent_toolkits/azure_cognitive_services/toolkit.py
@@ -0,0 +1,31 @@
+from __future__ import annotations
+
+import sys
+from typing import List
+
+from langchain.agents.agent_toolkits.base import BaseToolkit
+from langchain.tools.azure_cognitive_services import (
+    AzureCogsFormRecognizerTool,
+    AzureCogsImageAnalysisTool,
+    AzureCogsSpeech2TextTool,
+    AzureCogsText2SpeechTool,
+)
+from langchain.tools.base import BaseTool
+
+
+class AzureCognitiveServicesToolkit(BaseToolkit):
+    """Toolkit for Azure Cognitive Services."""
+
+    def get_tools(self) -> List[BaseTool]:
+        """Get the tools in the toolkit."""
+
+        tools = [
+            AzureCogsFormRecognizerTool(),
+            AzureCogsSpeech2TextTool(),
+            AzureCogsText2SpeechTool(),
+        ]
+
+        # TODO: Remove check once azure-ai-vision supports macOS.
+        if sys.platform.startswith("linux") or sys.platform.startswith("win"):
+            tools.append(AzureCogsImageAnalysisTool())
+        return tools
--- a/langchain/tools/__init__.py
+++ b/langchain/tools/__init__.py
@@ -1,5 +1,11 @@
 """Core toolkit implementations."""

+from langchain.tools.azure_cognitive_services import (
+    AzureCogsFormRecognizerTool,
+    AzureCogsImageAnalysisTool,
+    AzureCogsSpeech2TextTool,
+    AzureCogsText2SpeechTool,
+)
 from langchain.tools.base import BaseTool, StructuredTool, Tool, tool
 from langchain.tools.bing_search.tool import BingSearchResults, BingSearchRun
 from langchain.tools.ddg_search.tool import DuckDuckGoSearchResults, DuckDuckGoSearchRun
@@ -56,6 +62,10 @@ from langchain.tools.zapier.tool import ZapierNLAListActions, ZapierNLARunAction
 __all__ = [
    "AIPluginTool",
    "APIOperation",
+    "AzureCogsFormRecognizerTool",
+    "AzureCogsImageAnalysisTool",
+    "AzureCogsSpeech2TextTool",
+    "AzureCogsText2SpeechTool",
    "BaseTool",
    "BaseTool",
    "BaseTool",
--- a/langchain/tools/azure_cognitive_services/__init__.py
+++ b/langchain/tools/azure_cognitive_services/__init__.py
@@ -0,0 +1,21 @@
+"""Azure Cognitive Services Tools."""
+
+from langchain.tools.azure_cognitive_services.form_recognizer import (
+    AzureCogsFormRecognizerTool,
+)
+from langchain.tools.azure_cognitive_services.image_analysis import (
+    AzureCogsImageAnalysisTool,
+)
+from langchain.tools.azure_cognitive_services.speech2text import (
+    AzureCogsSpeech2TextTool,
+)
+from langchain.tools.azure_cognitive_services.text2speech import (
+    AzureCogsText2SpeechTool,
+)
+
+__all__ = [
+    "AzureCogsImageAnalysisTool",
+    "AzureCogsFormRecognizerTool",
+    "AzureCogsSpeech2TextTool",
+    "AzureCogsText2SpeechTool",
+]
--- a/langchain/tools/azure_cognitive_services/form_recognizer.py
+++ b/langchain/tools/azure_cognitive_services/form_recognizer.py
@@ -0,0 +1,152 @@
+from __future__ import annotations
+
+import logging
+from typing import Any, Dict, List, Optional
+
+from pydantic import root_validator
+
+from langchain.callbacks.manager import (
+    AsyncCallbackManagerForToolRun,
+    CallbackManagerForToolRun,
+)
+from langchain.tools.azure_cognitive_services.utils import detect_file_src_type
+from langchain.tools.base import BaseTool
+from langchain.utils import get_from_dict_or_env
+
+logger = logging.getLogger(__name__)
+
+
+class AzureCogsFormRecognizerTool(BaseTool):
+    """Tool that queries the Azure Cognitive Services Form Recognizer API.
+
+    In order to set this up, follow instructions at:
+    https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/get-started-sdks-rest-api?view=form-recog-3.0.0&pivots=programming-language-python
+    """
+
+    azure_cogs_key: str = ""  #: :meta private:
+    azure_cogs_endpoint: str = ""  #: :meta private:
+    doc_analysis_client: Any  #: :meta private:
+
+    name = "Azure Cognitive Services Form Recognizer"
+    description = (
+        "A wrapper around Azure Cognitive Services Form Recognizer. "
+        "Useful for when you need to "
+        "extract text, tables, and key-value pairs from documents. "
+        "Input should be a url to a document."
+    )
+
+    @root_validator(pre=True)
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validate that api key and endpoint exist in environment."""
+        azure_cogs_key = get_from_dict_or_env(
+            values, "azure_cogs_key", "AZURE_COGS_KEY"
+        )
+
+        azure_cogs_endpoint = get_from_dict_or_env(
+            values, "azure_cogs_endpoint", "AZURE_COGS_ENDPOINT"
+        )
+
+        try:
+            from azure.ai.formrecognizer import DocumentAnalysisClient
+            from azure.core.credentials import AzureKeyCredential
+
+            values["doc_analysis_client"] = DocumentAnalysisClient(
+                endpoint=azure_cogs_endpoint,
+                credential=AzureKeyCredential(azure_cogs_key),
+            )
+
+        except ImportError:
+            raise ImportError(
+                "azure-ai-formrecognizer is not installed. "
+                "Run `pip install azure-ai-formrecognizer` to install."
+            )
+
+        return values
+
+    def _parse_tables(self, tables: List[Any]) -> List[Any]:
+        result = []
+        for table in tables:
+            rc, cc = table.row_count, table.column_count
+            _table = [["" for _ in range(cc)] for _ in range(rc)]
+            for cell in table.cells:
+                _table[cell.row_index][cell.column_index] = cell.content
+            result.append(_table)
+        return result
+
+    def _parse_kv_pairs(self, kv_pairs: List[Any]) -> List[Any]:
+        result = []
+        for kv_pair in kv_pairs:
+            key = kv_pair.key.content if kv_pair.key else ""
+            value = kv_pair.value.content if kv_pair.value else ""
+            result.append((key, value))
+        return result
+
+    def _document_analysis(self, document_path: str) -> Dict:
+        document_src_type = detect_file_src_type(document_path)
+        if document_src_type == "local":
+            with open(document_path, "rb") as document:
+                poller = self.doc_analysis_client.begin_analyze_document(
+                    "prebuilt-document", document
+                )
+        elif document_src_type == "remote":
+            poller = self.doc_analysis_client.begin_analyze_document_from_url(
+                "prebuilt-document", document_path
+            )
+        else:
+            raise ValueError(f"Invalid document path: {document_path}")
+
+        result = poller.result()
+        res_dict = {}
+
+        if result.content is not None:
+            res_dict["content"] = result.content
+
+        if result.tables is not None:
+            res_dict["tables"] = self._parse_tables(result.tables)
+
+        if result.key_value_pairs is not None:
+            res_dict["key_value_pairs"] = self._parse_kv_pairs(result.key_value_pairs)
+
+        return res_dict
+
+    def _format_document_analysis_result(self, document_analysis_result: Dict) -> str:
+        formatted_result = []
+        if "content" in document_analysis_result:
+            formatted_result.append(
+                f"Content: {document_analysis_result['content']}".replace("\n", " ")
+            )
+
+        if "tables" in document_analysis_result:
+            for i, table in enumerate(document_analysis_result["tables"]):
+                formatted_result.append(f"Table {i}: {table}".replace("\n", " "))
+
+        if "key_value_pairs" in document_analysis_result:
+            for kv_pair in document_analysis_result["key_value_pairs"]:
+                formatted_result.append(
+                    f"{kv_pair[0]}: {kv_pair[1]}".replace("\n", " ")
+                )
+
+        return "\n".join(formatted_result)
+
+    def _run(
+        self,
+        query: str,
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool."""
+        try:
+            document_analysis_result = self._document_analysis(query)
+            if not document_analysis_result:
+                return "No good document analysis result was found"
+
+            return self._format_document_analysis_result(document_analysis_result)
+        except Exception as e:
+            raise RuntimeError(f"Error while running AzureCogsFormRecognizerTool: {e}")
+
+    async def _arun(
+        self,
+        query: str,
+        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool asynchronously."""
+        raise NotImplementedError("AzureCogsFormRecognizerTool does not support async")
--- a/langchain/tools/azure_cognitive_services/image_analysis.py
+++ b/langchain/tools/azure_cognitive_services/image_analysis.py
@@ -0,0 +1,156 @@
+from __future__ import annotations
+
+import logging
+from typing import Any, Dict, Optional
+
+from pydantic import root_validator
+
+from langchain.callbacks.manager import (
+    AsyncCallbackManagerForToolRun,
+    CallbackManagerForToolRun,
+)
+from langchain.tools.azure_cognitive_services.utils import detect_file_src_type
+from langchain.tools.base import BaseTool
+from langchain.utils import get_from_dict_or_env
+
+logger = logging.getLogger(__name__)
+
+
+class AzureCogsImageAnalysisTool(BaseTool):
+    """Tool that queries the Azure Cognitive Services Image Analysis API.
+
+    In order to set this up, follow instructions at:
+    https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/image-analysis-client-library-40
+    """
+
+    azure_cogs_key: str = ""  #: :meta private:
+    azure_cogs_endpoint: str = ""  #: :meta private:
+    vision_service: Any  #: :meta private:
+    analysis_options: Any  #: :meta private:
+
+    name = "Azure Cognitive Services Image Analysis"
+    description = (
+        "A wrapper around Azure Cognitive Services Image Analysis. "
+        "Useful for when you need to analyze images. "
+        "Input should be a url to an image."
+    )
+
+    @root_validator(pre=True)
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validate that api key and endpoint exist in environment."""
+        azure_cogs_key = get_from_dict_or_env(
+            values, "azure_cogs_key", "AZURE_COGS_KEY"
+        )
+
+        azure_cogs_endpoint = get_from_dict_or_env(
+            values, "azure_cogs_endpoint", "AZURE_COGS_ENDPOINT"
+        )
+
+        try:
+            import azure.ai.vision as sdk
+
+            values["vision_service"] = sdk.VisionServiceOptions(
+                endpoint=azure_cogs_endpoint, key=azure_cogs_key
+            )
+
+            values["analysis_options"] = sdk.ImageAnalysisOptions()
+            values["analysis_options"].features = (
+                sdk.ImageAnalysisFeature.CAPTION
+                | sdk.ImageAnalysisFeature.OBJECTS
+                | sdk.ImageAnalysisFeature.TAGS
+                | sdk.ImageAnalysisFeature.TEXT
+            )
+        except ImportError:
+            raise ImportError(
+                "azure-ai-vision is not installed. "
+                "Run `pip install azure-ai-vision` to install."
+            )
+
+        return values
+
+    def _image_analysis(self, image_path: str) -> Dict:
+        try:
+            import azure.ai.vision as sdk
+        except ImportError:
+            pass
+
+        image_src_type = detect_file_src_type(image_path)
+        if image_src_type == "local":
+            vision_source = sdk.VisionSource(filename=image_path)
+        elif image_src_type == "remote":
+            vision_source = sdk.VisionSource(url=image_path)
+        else:
+            raise ValueError(f"Invalid image path: {image_path}")
+
+        image_analyzer = sdk.ImageAnalyzer(
+            self.vision_service, vision_source, self.analysis_options
+        )
+        result = image_analyzer.analyze()
+
+        res_dict = {}
+        if result.reason == sdk.ImageAnalysisResultReason.ANALYZED:
+            if result.caption is not None:
+                res_dict["caption"] = result.caption.content
+
+            if result.objects is not None:
+                res_dict["objects"] = [obj.name for obj in result.objects]
+
+            if result.tags is not None:
+                res_dict["tags"] = [tag.name for tag in result.tags]
+
+            if result.text is not None:
+                res_dict["text"] = [line.content for line in result.text.lines]
+
+        else:
+            error_details = sdk.ImageAnalysisErrorDetails.from_result(result)
+            raise RuntimeError(
+                f"Image analysis failed.\n"
+                f"Reason: {error_details.reason}\n"
+                f"Details: {error_details.message}"
+            )
+
+        return res_dict
+
+    def _format_image_analysis_result(self, image_analysis_result: Dict) -> str:
+        formatted_result = []
+        if "caption" in image_analysis_result:
+            formatted_result.append("Caption: " + image_analysis_result["caption"])
+
+        if (
+            "objects" in image_analysis_result
+            and len(image_analysis_result["objects"]) > 0
+        ):
+            formatted_result.append(
+                "Objects: " + ", ".join(image_analysis_result["objects"])
+            )
+
+        if "tags" in image_analysis_result and len(image_analysis_result["tags"]) > 0:
+            formatted_result.append("Tags: " + ", ".join(image_analysis_result["tags"]))
+
+        if "text" in image_analysis_result and len(image_analysis_result["text"]) > 0:
+            formatted_result.append("Text: " + ", ".join(image_analysis_result["text"]))
+
+        return "\n".join(formatted_result)
+
+    def _run(
+        self,
+        query: str,
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool."""
+        try:
+            image_analysis_result = self._image_analysis(query)
+            if not image_analysis_result:
+                return "No good image analysis result was found"
+
+            return self._format_image_analysis_result(image_analysis_result)
+        except Exception as e:
+            raise RuntimeError(f"Error while running AzureCogsImageAnalysisTool: {e}")
+
+    async def _arun(
+        self,
+        query: str,
+        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool asynchronously."""
+        raise NotImplementedError("AzureCogsImageAnalysisTool does not support async")
--- a/langchain/tools/azure_cognitive_services/speech2text.py
+++ b/langchain/tools/azure_cognitive_services/speech2text.py
@@ -0,0 +1,131 @@
+from __future__ import annotations
+
+import logging
+import time
+from typing import Any, Dict, Optional
+
+from pydantic import root_validator
+
+from langchain.callbacks.manager import (
+    AsyncCallbackManagerForToolRun,
+    CallbackManagerForToolRun,
+)
+from langchain.tools.azure_cognitive_services.utils import (
+    detect_file_src_type,
+    download_audio_from_url,
+)
+from langchain.tools.base import BaseTool
+from langchain.utils import get_from_dict_or_env
+
+logger = logging.getLogger(__name__)
+
+
+class AzureCogsSpeech2TextTool(BaseTool):
+    """Tool that queries the Azure Cognitive Services Speech2Text API.
+
+    In order to set this up, follow instructions at:
+    https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-speech-to-text?pivots=programming-language-python
+    """
+
+    azure_cogs_key: str = ""  #: :meta private:
+    azure_cogs_region: str = ""  #: :meta private:
+    speech_language: str = "en-US"  #: :meta private:
+    speech_config: Any  #: :meta private:
+
+    name = "Azure Cognitive Services Speech2Text"
+    description = (
+        "A wrapper around Azure Cognitive Services Speech2Text. "
+        "Useful for when you need to transcribe audio to text. "
+        "Input should be a url to an audio file."
+    )
+
+    @root_validator(pre=True)
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validate that api key and region exist in environment."""
+        azure_cogs_key = get_from_dict_or_env(
+            values, "azure_cogs_key", "AZURE_COGS_KEY"
+        )
+
+        azure_cogs_region = get_from_dict_or_env(
+            values, "azure_cogs_region", "AZURE_COGS_REGION"
+        )
+
+        try:
+            import azure.cognitiveservices.speech as speechsdk
+
+            values["speech_config"] = speechsdk.SpeechConfig(
+                subscription=azure_cogs_key, region=azure_cogs_region
+            )
+        except ImportError:
+            raise ImportError(
+                "azure-cognitiveservices-speech is not installed. "
+                "Run `pip install azure-cognitiveservices-speech` to install."
+            )
+
+        return values
+
+    def _continuous_recognize(self, speech_recognizer: Any) -> str:
+        done = False
+        text = ""
+
+        def stop_cb(evt: Any) -> None:
+            """callback that stops continuous recognition"""
+            speech_recognizer.stop_continuous_recognition_async()
+            nonlocal done
+            done = True
+
+        def retrieve_cb(evt: Any) -> None:
+            """callback that retrieves the intermediate recognition results"""
+            nonlocal text
+            text += evt.result.text
+
+        # retrieve text on recognized events
+        speech_recognizer.recognized.connect(retrieve_cb)
+        # stop continuous recognition on either session stopped or canceled events
+        speech_recognizer.session_stopped.connect(stop_cb)
+        speech_recognizer.canceled.connect(stop_cb)
+
+        # Start continuous speech recognition
+        speech_recognizer.start_continuous_recognition_async()
+        while not done:
+            time.sleep(0.5)
+        return text
+
+    def _speech2text(self, audio_path: str, speech_language: str) -> str:
+        try:
+            import azure.cognitiveservices.speech as speechsdk
+        except ImportError:
+            pass
+
+        audio_src_type = detect_file_src_type(audio_path)
+        if audio_src_type == "local":
+            audio_config = speechsdk.AudioConfig(filename=audio_path)
+        elif audio_src_type == "remote":
+            tmp_audio_path = download_audio_from_url(audio_path)
+            audio_config = speechsdk.AudioConfig(filename=tmp_audio_path)
+        else:
+            raise ValueError(f"Invalid audio path: {audio_path}")
+
+        self.speech_config.speech_recognition_language = speech_language
+        speech_recognizer = speechsdk.SpeechRecognizer(self.speech_config, audio_config)
+        return self._continuous_recognize(speech_recognizer)
+
+    def _run(
+        self,
+        query: str,
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool."""
+        try:
+            text = self._speech2text(query, self.speech_language)
+            return text
+        except Exception as e:
+            raise RuntimeError(f"Error while running AzureCogsSpeech2TextTool: {e}")
+
+    async def _arun(
+        self,
+        query: str,
+        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool asynchronously."""
+        raise NotImplementedError("AzureCogsSpeech2TextTool does not support async")
--- a/langchain/tools/azure_cognitive_services/text2speech.py
+++ b/langchain/tools/azure_cognitive_services/text2speech.py
@@ -0,0 +1,114 @@
+from __future__ import annotations
+
+import logging
+import tempfile
+from typing import Any, Dict, Optional
+
+from pydantic import root_validator
+
+from langchain.callbacks.manager import (
+    AsyncCallbackManagerForToolRun,
+    CallbackManagerForToolRun,
+)
+from langchain.tools.base import BaseTool
+from langchain.utils import get_from_dict_or_env
+
+logger = logging.getLogger(__name__)
+
+
+class AzureCogsText2SpeechTool(BaseTool):
+    """Tool that queries the Azure Cognitive Services Text2Speech API.
+
+    In order to set this up, follow instructions at:
+    https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?pivots=programming-language-python
+    """
+
+    azure_cogs_key: str = ""  #: :meta private:
+    azure_cogs_region: str = ""  #: :meta private:
+    speech_language: str = "en-US"  #: :meta private:
+    speech_config: Any  #: :meta private:
+
+    name = "Azure Cognitive Services Text2Speech"
+    description = (
+        "A wrapper around Azure Cognitive Services Text2Speech. "
+        "Useful for when you need to convert text to speech. "
+    )
+
+    @root_validator(pre=True)
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validate that api key and region exist in environment."""
+        azure_cogs_key = get_from_dict_or_env(
+            values, "azure_cogs_key", "AZURE_COGS_KEY"
+        )
+
+        azure_cogs_region = get_from_dict_or_env(
+            values, "azure_cogs_region", "AZURE_COGS_REGION"
+        )
+
+        try:
+            import azure.cognitiveservices.speech as speechsdk
+
+            values["speech_config"] = speechsdk.SpeechConfig(
+                subscription=azure_cogs_key, region=azure_cogs_region
+            )
+        except ImportError:
+            raise ImportError(
+                "azure-cognitiveservices-speech is not installed. "
+                "Run `pip install azure-cognitiveservices-speech` to install."
+            )
+
+        return values
+
+    def _text2speech(self, text: str, speech_language: str) -> str:
+        try:
+            import azure.cognitiveservices.speech as speechsdk
+        except ImportError:
+            pass
+
+        self.speech_config.speech_synthesis_language = speech_language
+        speech_synthesizer = speechsdk.SpeechSynthesizer(
+            speech_config=self.speech_config, audio_config=None
+        )
+        result = speech_synthesizer.speak_text(text)
+
+        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
+            stream = speechsdk.AudioDataStream(result)
+            with tempfile.NamedTemporaryFile(
+                mode="wb", suffix=".wav", delete=False
+            ) as f:
+                stream.save_to_wav_file(f.name)
+
+            return f.name
+
+        elif result.reason == speechsdk.ResultReason.Canceled:
+            cancellation_details = result.cancellation_details
+            logger.debug(f"Speech synthesis canceled: {cancellation_details.reason}")
+            if cancellation_details.reason == speechsdk.CancellationReason.Error:
+                raise RuntimeError(
+                    f"Speech synthesis error: {cancellation_details.error_details}"
+                )
+
+            return "Speech synthesis canceled."
+
+        else:
+            return f"Speech synthesis failed: {result.reason}"
+
+    def _run(
+        self,
+        query: str,
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool."""
+        try:
+            speech_file = self._text2speech(query, self.speech_language)
+            return speech_file
+        except Exception as e:
+            raise RuntimeError(f"Error while running AzureCogsText2SpeechTool: {e}")
+
+    async def _arun(
+        self,
+        query: str,
+        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
+    ) -> str:
+        """Use the tool asynchronously."""
+        raise NotImplementedError("AzureCogsText2SpeechTool does not support async")
--- a/langchain/tools/azure_cognitive_services/utils.py
+++ b/langchain/tools/azure_cognitive_services/utils.py
@ -0,0 +1,29 @@
+import os
+import tempfile
+from urllib.parse import urlparse
+
+import requests
+
+
+def detect_file_src_type(file_path: str) -> str:
+    """Detect if the file is local or remote."""
+    if os.path.isfile(file_path):
+        return "local"
+
+    parsed_url = urlparse(file_path)
+    if parsed_url.scheme and parsed_url.netloc:
+        return "remote"
+
+    return "invalid"
+
+
+def download_audio_from_url(audio_url: str) -> str:
+    """Download audio from url to local."""
+    ext = os.path.splitext(urlparse(audio_url).path)[1]  # e.g. ".wav"; ignores query
+    response = requests.get(audio_url, stream=True)
+    response.raise_for_status()
+    with tempfile.NamedTemporaryFile(mode="wb", suffix=ext, delete=False) as f:
+        for chunk in response.iter_content(chunk_size=8192):
+            f.write(chunk)
+
+    return f.name
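
Note on `utils.py`: the source-type detection above can be exercised without any Azure dependency. The following is a minimal sketch, not part of the commit. `detect_file_src_type` is copied from the diff so the block is self-contained, `url_extension` is a hypothetical helper, and the `example.com` URL is a placeholder. It shows how a path is classified and how taking the extension from the parsed URL path keeps a query string out of a temp-file suffix.

```python
import os
import tempfile
from urllib.parse import urlparse


def detect_file_src_type(file_path: str) -> str:
    """Classify a path as "local", "remote", or "invalid" (copied from the diff)."""
    if os.path.isfile(file_path):
        return "local"
    parsed_url = urlparse(file_path)
    if parsed_url.scheme and parsed_url.netloc:
        return "remote"
    return "invalid"


def url_extension(url: str) -> str:
    """Hypothetical helper: extension of the URL path, ignoring query and fragment."""
    return os.path.splitext(urlparse(url).path)[1]


# Create a real file so the "local" branch can be observed, then clean it up.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    local_path = f.name

local_kind = detect_file_src_type(local_path)
os.remove(local_path)

# Placeholder URL with a query string, as signed download links often have.
sample_url = "https://example.com/audio/sample.wav?sig=abc"
remote_kind = detect_file_src_type(sample_url)
invalid_kind = detect_file_src_type("not a path or url")
ext = url_extension(sample_url)
```

Because `os.path.isfile` is checked first, a string that names an existing local file is treated as local even if it could also be parsed as a URL.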
--- a/poetry.lock
+++ b/poetry.lock
@ -566,6 +566,64 @@ dev = ["coverage (>=5,<6)", "flake8 (>=3,<4)", "pytest (>=6,<7)", "sphinx-copybu
 docs = ["sphinx-copybutton (>=0.4,<0.5)", "sphinx-rtd-theme (>=1.0,<2.0)", "sphinx-tabs (>=3,<4)", "sphinxcontrib-mermaid (>=0.7,<0.8)"]
 test = ["coverage (>=5,<6)", "pytest (>=6,<7)"]

+[[package]]
+name = "azure-ai-formrecognizer"
+version = "3.2.1"
+description = "Microsoft Azure Form Recognizer Client Library for Python"
+category = "main"
+optional = true
+python-versions = ">=3.7"
+files = [
+    {file = "azure-ai-formrecognizer-3.2.1.zip", hash = "sha256:5768765f9720ce87038f56afe0c0b5259192cfb29c840a39595b1e26e4ddfa32"},
+    {file = "azure_ai_formrecognizer-3.2.1-py3-none-any.whl", hash = "sha256:4db43b9dd0a2bc5296b752c04dbacb838ae2b8726adfe7cf277c2ea34e99419a"},
+]
+
+[package.dependencies]
+azure-common = ">=1.1,<2.0"
+azure-core = ">=1.23.0,<2.0.0"
+msrest = ">=0.6.21"
+typing-extensions = ">=4.0.1"
+
+[[package]]
+name = "azure-ai-vision"
+version = "0.11.1b1"
+description = "Microsoft Azure AI Vision SDK for Python"
+category = "main"
+optional = true
+python-versions = ">=3.7"
+files = [
+    {file = "azure_ai_vision-0.11.1b1-py3-none-manylinux1_x86_64.whl", hash = "sha256:6f8563ae26689da6cdee9b2de009a53546ae2fd86c6c180236ce5da5b45f41d3"},
+    {file = "azure_ai_vision-0.11.1b1-py3-none-win_amd64.whl", hash = "sha256:f5df03b9156feaa1d8c776631967b1455028d30dfd4cd1c732aa0f9c03d01517"},
+]
+
+[[package]]
+name = "azure-cognitiveservices-speech"
+version = "1.28.0"
+description = "Microsoft Cognitive Services Speech SDK for Python"
+category = "main"
+optional = true
+python-versions = ">=3.7"
+files = [
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-macosx_10_14_x86_64.whl", hash = "sha256:a6c277ec9c93f586dcc74d3a56a6aa0259f4cf371f5e03afcf169c691e2c4d0c"},
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:a412c6c5bc528548e0ee5fc5fe89fb8351307d0c5ef7ac4d506fab3d58efcb4a"},
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-manylinux1_x86_64.whl", hash = "sha256:ceb5a8862da4ab861bd06653074a4e5dc2d66a54f03dd4dd9356da7672febbce"},
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:d5cba32e9d8eaffc9d8f482c00950bc471f9dc4d7659c741c083e5e9d831b802"},
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-win32.whl", hash = "sha256:ac52c4549062771db5694346c1547334cf1bb0d08573a193c8dcec8386aa491d"},
+    {file = "azure_cognitiveservices_speech-1.28.0-py3-none-win_amd64.whl", hash = "sha256:5ff042d81d7ff4e50be196419fcd2042e41a97cebb229e0940026e1314ff7751"},
+]
+
+[[package]]
+name = "azure-common"
+version = "1.1.28"
+description = "Microsoft Azure Client Library for Python (Common)"
+category = "main"
+optional = true
+python-versions = "*"
+files = [
+    {file = "azure-common-1.1.28.zip", hash = "sha256:4ac0cd3214e36b6a1b6a442686722a5d8cc449603aa833f3f0f40bda836704a3"},
+    {file = "azure_common-1.1.28-py2.py3-none-any.whl", hash = "sha256:5c12d3dcf4ec20599ca6b0d3e09e86e146353d443e7fcc050c9a19c1f9df20ad"},
+]
+
 [[package]]
 name = "azure-core"
 version = "1.26.4"
@ -3178,6 +3236,21 @@ widgetsnbextension = ">=4.0.7,<4.1.0"
 [package.extras]
 test = ["ipykernel", "jsonschema", "pytest (>=3.6.0)", "pytest-cov", "pytz"]

+[[package]]
+name = "isodate"
+version = "0.6.1"
+description = "An ISO 8601 date/time/duration parser and formatter"
+category = "main"
+optional = true
+python-versions = "*"
+files = [
+    {file = "isodate-0.6.1-py2.py3-none-any.whl", hash = "sha256:0751eece944162659049d35f4f549ed815792b38793f07cf73381c1c87cbed96"},
+    {file = "isodate-0.6.1.tar.gz", hash = "sha256:48c5881de7e8b0a0d648cb024c8062dc84e7b840ed81e864c7614fd3c127bde9"},
+]
+
+[package.dependencies]
+six = "*"
+
 [[package]]
 name = "isoduration"
 version = "20.11.0"
@ -4511,6 +4584,28 @@ files = [
    {file = "msgpack-1.0.5.tar.gz", hash = "sha256:c075544284eadc5cddc70f4757331d99dcbc16b2bbd4849d15f8aae4cf36d31c"},
 ]

+[[package]]
+name = "msrest"
+version = "0.7.1"
+description = "AutoRest swagger generator Python client runtime."
+category = "main"
+optional = true
+python-versions = ">=3.6"
+files = [
+    {file = "msrest-0.7.1-py3-none-any.whl", hash = "sha256:21120a810e1233e5e6cc7fe40b474eeb4ec6f757a15d7cf86702c369f9567c32"},
+    {file = "msrest-0.7.1.zip", hash = "sha256:6e7661f46f3afd88b75667b7187a92829924446c7ea1d169be8c4bb7eeb788b9"},
+]
+
+[package.dependencies]
+azure-core = ">=1.24.0"
+certifi = ">=2017.4.17"
+isodate = ">=0.6.0"
+requests = ">=2.16,<3.0"
+requests-oauthlib = ">=0.5.0"
+
+[package.extras]
+async = ["aiodns", "aiohttp (>=3.0)"]
+
 [[package]]
 name = "multidict"
 version = "6.0.4"
@ -6779,6 +6874,7 @@ files = [
    {file = "pylance-0.4.12-cp38-abi3-macosx_10_15_x86_64.whl", hash = "sha256:2b86fb8dccc03094c0db37bef0d91bda60e8eb0d1eddf245c6971450c8d8a53f"},
    {file = "pylance-0.4.12-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:0bc82914b13204187d673b5f3d45f93219c38a0e9d0542ba251074f639669789"},
    {file = "pylance-0.4.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5a4bcce77f99ecd4cbebbadb01e58d5d8138d40eb56bdcdbc3b20b0475e7a472"},
+    {file = "pylance-0.4.12-cp38-abi3-win_amd64.whl", hash = "sha256:9616931c5300030adb9626d22515710a127d1e46a46737a7a0f980b52f13627c"},
 ]

 [package.dependencies]
@ -10756,8 +10852,8 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
 cffi = ["cffi (>=1.11)"]

 [extras]
-all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "langkit", "lark", "lxml", "manifest-ml", "neo4j", "networkx", "nlpcloud", "nltk", "nomic", "openai", "openlm", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "requests-toolbelt", "sentence-transformers", "spacy", "steamship", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
-azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"]
+all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-ai-formrecognizer", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "langkit", "lark", "lxml", "manifest-ml", "neo4j", "networkx", "nlpcloud", "nltk", "nomic", "openai", "openlm", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "requests-toolbelt", "sentence-transformers", "spacy", "steamship", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
+azure = ["azure-ai-formrecognizer", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-core", "azure-cosmos", "azure-identity", "openai"]
 cohere = ["cohere"]
 docarray = ["docarray"]
 embeddings = ["sentence-transformers"]
@ -10770,4 +10866,4 @@ text-helpers = ["chardet"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "dcb018c4ff95d43dc6ac7bc1c17b971417938ddc06d74b2a16694113009e097c"
+content-hash = "196588e10bb33939f5bae294a194ad01e803f40ed1087fe6a7a4b87e8d80712b"
--- a/pyproject.toml
+++ b/pyproject.toml
@ -93,6 +93,9 @@ langkit = {version = ">=0.0.1.dev3, <0.1.0", optional = true}
 chardet = {version="^5.1.0", optional=true}
 requests-toolbelt = {version = "^1.0.0", optional = true}
 openlm = {version = "^0.0.5", optional = true}
+azure-ai-formrecognizer = {version = "^3.2.1", optional = true}
+azure-ai-vision = {version = "^0.11.1b1", optional = true}
+azure-cognitiveservices-speech = {version = "^1.28.0", optional = true}

 [tool.poetry.group.docs.dependencies]
 autodoc_pydantic = "^1.8.0"
@ -184,7 +187,7 @@ text_helpers = ["chardet"]
 cohere = ["cohere"]
 docarray = ["docarray"]
 embeddings = ["sentence-transformers"]
-azure = ["azure-identity", "azure-cosmos", "openai", "azure-core"]
+azure = ["azure-identity", "azure-cosmos", "openai", "azure-core", "azure-ai-formrecognizer", "azure-ai-vision", "azure-cognitiveservices-speech"]
 all = [
    "anthropic",
    "cohere",
@ -244,7 +247,10 @@ all = [
    "lxml",
    "requests-toolbelt",
    "neo4j",
-    "openlm"
+    "openlm",
+    "azure-ai-formrecognizer",
+    "azure-ai-vision",
+    "azure-cognitiveservices-speech",
 ]

 # An extra used to be able to add extended testing.
--- a/tests/unit_tests/tools/test_public_api.py
+++ b/tests/unit_tests/tools/test_public_api.py
@ -4,6 +4,10 @@ from langchain.tools import __all__ as public_api
 _EXPECTED = [
    "AIPluginTool",
    "APIOperation",
+    "AzureCogsFormRecognizerTool",
+    "AzureCogsImageAnalysisTool",
+    "AzureCogsSpeech2TextTool",
+    "AzureCogsText2SpeechTool",
    "BaseTool",
    "BaseTool",
    "BaseTool",