**Description:** Textract PDF Loader generating linearized output,
meaning it will replicate the structure of the source document as close
as possible based on the features passed into the call (e. g. LAYOUT,
FORMS, TABLES). With LAYOUT reading order for multi-column documents or
identification of lists and figures is supported and with TABLES it will
generate the table structure as well. FORMS will indicate "key: value"
with columms.
- **Issue:** the issue fixes#12068
- **Dependencies:** amazon-textract-textractor is added, which provides
the linearization
- **Tag maintainer:** @3coins
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
"\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier Pillow>=9.4.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[0m\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier pypdf>=2.5.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n",
" Obtaining dependency information for dataclasses-json<0.6.0,>=0.5.7 from https://files.pythonhosted.org/packages/97/5f/e7cc90f36152810cab08b6c9c1125e8bcb9d76f8b3018d101b5f877b386c/dataclasses_json-0.5.14-py3-none-any.whl.metadata\n",
"Building wheels for collected packages: langchain\n",
" Building editable for langchain (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for langchain: filename=langchain-0.0.267-py3-none-any.whl size=5553 sha256=daaf68d6658b27d69a4a092aa0a39e31f32b96868ef195102d2a17cf119f9d86\n",
" Stored in directory: /private/var/folders/s4/y_t_mj094c95t80n023c9wym0000gr/T/pip-ephem-wheel-cache-v1ynlirx/wheels/9f/73/28/b1d250633de6bd5759f959e16889c6c841dd0e0ffb6474185a\n",
"Successfully built langchain\n",
"\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier Pillow>=9.4.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[0m\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier pypdf>=2.5.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n"
" Obtaining dependency information for amazon-textract-caller>=0.2.0 from https://files.pythonhosted.org/packages/35/42/17daacf400060ee1f768553980b7bd6bb77d5b80bcb8a82d8a9665e5bb9b/amazon_textract_caller-0.2.0-py2.py3-none-any.whl.metadata\n",
" Using cached amazon_textract_caller-0.2.0-py2.py3-none-any.whl.metadata (7.1 kB)\n",
"Requirement already satisfied: boto3>=1.26.35 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from amazon-textract-caller>=0.2.0) (1.28.67)\n",
"Requirement already satisfied: botocore in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from amazon-textract-caller>=0.2.0) (1.31.67)\n",
"Requirement already satisfied: amazon-textract-response-parser>=0.1.39 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from amazon-textract-caller>=0.2.0) (0.1.48)\n",
"Requirement already satisfied: marshmallow<4,>=3.14 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from amazon-textract-response-parser>=0.1.39->amazon-textract-caller>=0.2.0) (3.20.1)\n",
"Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from boto3>=1.26.35->amazon-textract-caller>=0.2.0) (1.0.1)\n",
"Requirement already satisfied: s3transfer<0.8.0,>=0.7.0 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from boto3>=1.26.35->amazon-textract-caller>=0.2.0) (0.7.0)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from botocore->amazon-textract-caller>=0.2.0) (2.8.2)\n",
"Requirement already satisfied: urllib3<2.1,>=1.25.4 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from botocore->amazon-textract-caller>=0.2.0) (1.26.18)\n",
"Requirement already satisfied: packaging>=17.0 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from marshmallow<4,>=3.14->amazon-textract-response-parser>=0.1.39->amazon-textract-caller>=0.2.0) (23.2)\n",
"Requirement already satisfied: six>=1.5 in /Users/schadem/.pyenv/versions/3.11.1/envs/langchain/lib/python3.11/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-caller>=0.2.0) (1.16.0)\n",
"\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier Pillow>=9.4.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[0m\u001b[33mDEPRECATION: amazon-textract-pipeline-pagedimensions 0.0.8 has a non-standard dependency specifier pypdf>=2.5.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of amazon-textract-pipeline-pagedimensions or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063\u001b[0m\u001b[33m\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"amazon-textract-textractor 1.4.1 requires amazon-textract-caller<0.1.0,>=0.0.27, but you have amazon-textract-caller 0.2.0 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n"
]
}
],
"source": [
"!pip install \"amazon-textract-caller>=0.2.0\""
]
},
{
@ -53,12 +174,27 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 6,
"id": "1becee92-e82f-42d4-9b4e-b23d77cbe88d",
"metadata": {
"tags": []
},
"outputs": [],
"outputs": [
{
"ename": "ImportError",
"evalue": "cannot import name 'DocumentIntelligenceParser' from 'langchain.document_loaders.parsers.pdf' (/Users/schadem/code/github/schadem/langchain/libs/langchain/langchain/document_loaders/parsers/pdf.py)",
"\u001b[0;31mImportError\u001b[0m: cannot import name 'DocumentIntelligenceParser' from 'langchain.document_loaders.parsers.pdf' (/Users/schadem/code/github/schadem/langchain/libs/langchain/langchain/document_loaders/parsers/pdf.py)"