langchain/docs/extras/integrations/document_loaders/azure_document_intelligence.ipynb
Lars von Wedel 6d82503eb1
Add parser and loader for Azure document intelligence service. (#10136)
Hi,

this PR contains loader / parser for Azure Document intelligence which
is a ML-based service to ingest arbitrary PDFs / images, even if
scanned. The loader generates Documents by pages of the original
document. This is my first contribution to LangChain.

Unfortunately I could not find the correct place for test cases. Happy
to add one if you can point me to the location, but as this is a
cloud-based service, a test would require network access and credentials
- so might be of limited help.

Dependencies: The needed dependency was already part of pyproject.toml,
no change.
Twitter: feel free to mention @LarsAC on the announcement
2023-09-03 14:25:39 -07:00

139 lines
3.4 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Document Intelligence"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure Document Intelligence (formerly known as Azure Forms Recognizer) is machine-learning \n",
"based service that extracts text (including handwriting), tables or key-value-pairs from\n",
"scanned documents or images.\n",
"\n",
"This current implementation of a loader using Document Intelligence is able to incorporate content page-wise and turn it into LangChain documents.\n",
"\n",
"Document Intelligence supports PDF, JPEG, PNG, BMP, or TIFF.\n",
"\n",
"Further documentation is available at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/?view=doc-intel-3.1.0.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install langchain azure-ai-formrecognizer -q"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The first example uses a local file which will be sent to Azure Document Intelligence.\n",
"\n",
"First, an instance of a DocumentAnalysisClient is created with endpoint and key for the Azure service. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.formrecognizer import DocumentAnalysisClient\n",
"from azure.core.credentials import AzureKeyCredential\n",
"\n",
"document_analysis_client = DocumentAnalysisClient(\n",
" endpoint=\"<service_endpoint>\", credential=AzureKeyCredential(\"<service_key>\")\n",
" )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"With the initialized document analysis client, we can proceed to create an instance of the DocumentIntelligenceLoader:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.pdf import DocumentIntelligenceLoader\n",
"loader = DocumentIntelligenceLoader(\n",
" \"<Local_filename>\",\n",
" client=document_analysis_client,\n",
" model=\"<model_name>\") # e.g. prebuilt-document\n",
"\n",
"documents = loader.load()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The output contains each page of the source document as a LangChain document: "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='...', metadata={'source': '...', 'page': 1})]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.9.5"
},
"vscode": {
"interpreter": {
"hash": "f9f85f796d01129d0dd105a088854619f454435301f6ffec2fea96ecbd9be4ac"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}