langchain/docs/extras/integrations/document_loaders/azure_document_intelligence.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Azure Document Intelligence"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Azure Document Intelligence (formerly known as Azure Forms Recognizer) is machine-learning \n",
    "based service that extracts text (including handwriting), tables or key-value-pairs from\n",
    "scanned documents or images.\n",
    "\n",
    "This current implementation of a loader using Document Intelligence is able to incorporate content page-wise and turn it into LangChain documents.\n",
    "\n",
    "Document Intelligence supports PDF, JPEG, PNG, BMP, or TIFF.\n",
    "\n",
    "Further documentation is available at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/?view=doc-intel-3.1.0.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install langchain azure-ai-formrecognizer -q"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first example uses a local file which will be sent to Azure Document Intelligence.\n",
    "\n",
    "First, an instance of a DocumentAnalysisClient is created with endpoint and key for the Azure service.    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.ai.formrecognizer import DocumentAnalysisClient\n",
    "from azure.core.credentials import AzureKeyCredential\n",
    "\n",
    "document_analysis_client = DocumentAnalysisClient(\n",
    "                endpoint=\"<service_endpoint>\", credential=AzureKeyCredential(\"<service_key>\")\n",
    "            )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the initialized document analysis client, we can proceed to create an instance of the DocumentIntelligenceLoader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.pdf import DocumentIntelligenceLoader\n",
    "loader = DocumentIntelligenceLoader(\n",
    "    \"<Local_filename>\",\n",
    "    client=document_analysis_client,\n",
    "    model=\"<model_name>\") # e.g. prebuilt-document\n",
    "\n",
    "documents = loader.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output contains each page of the source document as a LangChain document: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='...', metadata={'source': '...', 'page': 1})]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "f9f85f796d01129d0dd105a088854619f454435301f6ffec2fea96ecbd9be4ac"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Add parser and loader for Azure document intelligence service. (#10136) Hi, this PR contains loader / parser for Azure Document intelligence which is a ML-based service to ingest arbitrary PDFs / images, even if scanned. The loader generates Documents by pages of the original document. This is my first contribution to LangChain. Unfortunately I could not find the correct place for test cases. Happy to add one if you can point me to the location, but as this is a cloud-based service, a test would require network access and credentials - so might be of limited help. Dependencies: The needed dependency was already part of pyproject.toml, no change. Twitter: feel free to mention @LarsAC on the announcement 2023-09-03 21:25:39 +00:00			`{`
			`"cells": [`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Azure Document Intelligence"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Azure Document Intelligence (formerly known as Azure Forms Recognizer) is machine-learning \n",`
			`"based service that extracts text (including handwriting), tables or key-value-pairs from\n",`
			`"scanned documents or images.\n",`
			`"\n",`
			`"This current implementation of a loader using Document Intelligence is able to incorporate content page-wise and turn it into LangChain documents.\n",`
			`"\n",`
			`"Document Intelligence supports PDF, JPEG, PNG, BMP, or TIFF.\n",`
			`"\n",`
			`"Further documentation is available at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/?view=doc-intel-3.1.0.\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"%pip install langchain azure-ai-formrecognizer -q"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Example 1"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The first example uses a local file which will be sent to Azure Document Intelligence.\n",`
			`"\n",`
			`"First, an instance of a DocumentAnalysisClient is created with endpoint and key for the Azure service. "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from azure.ai.formrecognizer import DocumentAnalysisClient\n",`
			`"from azure.core.credentials import AzureKeyCredential\n",`
			`"\n",`
			`"document_analysis_client = DocumentAnalysisClient(\n",`
			`" endpoint=\"<service_endpoint>\", credential=AzureKeyCredential(\"<service_key>\")\n",`
			`" )"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"With the initialized document analysis client, we can proceed to create an instance of the DocumentIntelligenceLoader:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.document_loaders.pdf import DocumentIntelligenceLoader\n",`
			`"loader = DocumentIntelligenceLoader(\n",`
			`" \"<Local_filename>\",\n",`
			`" client=document_analysis_client,\n",`
			`" model=\"<model_name>\") # e.g. prebuilt-document\n",`
			`"\n",`
			`"documents = loader.load()"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The output contains each page of the source document as a LangChain document: "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 18,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[Document(page_content='...', metadata={'source': '...', 'page': 1})]"`
			`]`
			`},`
			`"execution_count": 18,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"name": "python",`
			`"version": "3.9.5"`
			`},`
			`"vscode": {`
			`"interpreter": {`
			`"hash": "f9f85f796d01129d0dd105a088854619f454435301f6ffec2fea96ecbd9be4ac"`
			`}`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`