You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/use_cases/extraction/how_to/handle_files.ipynb

151 lines
4.2 KiB
Plaintext

{
"cells": [
{
"cell_type": "raw",
"id": "8371e5d6-eb65-4c97-aac2-05037356c2c1",
"metadata": {},
"source": [
"---\n",
"title: Handle Files\n",
"sidebar_position: 3\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0d5eea7c-bc69-4da2-b91d-d7c71f7085d0",
"metadata": {},
"source": [
"Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.\n",
"\n",
"You can use LangChain [document loaders](/docs/modules/data_connection/document_loaders/) to parse files into a text format that can be fed into LLMs.\n",
"\n",
"LangChain features a large number of [document loader integrations](/docs/integrations/document_loaders).\n",
"\n",
"## MIME type based parsing\n",
"\n",
"For basic parsing exmaples take a look [at document loaders](/docs/modules/data_connection/document_loaders/).\n",
"\n",
"Here, we'll be looking at mime-type based parsing which is often useful for extraction based applications if you're writing server code that accepts user uploaded files.\n",
"\n",
"In this case, it's best to assume that the file extension of the file provided by the user is wrong and instead infer the mimetype from the binary content of the file.\n",
"\n",
"Let's download some content. This will be an HTML file, but the code below will work with other file types."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "76d42bb2-090b-4a70-a656-d6e9af769eba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'<!DOCTYPE html>\\n<htm'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"\n",
"response = requests.get(\"https://en.wikipedia.org/wiki/Car\")\n",
"data = response.content\n",
"data[:20]"
]
},
{
"cell_type": "markdown",
"id": "389400a2-6f05-48da-810e-9438d626e64b",
"metadata": {},
"source": [
"Configure the parsers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "430806a7-7d8a-4111-9f5d-46840dab0dc0",
"metadata": {},
"outputs": [],
"source": [
"import magic\n",
"from langchain.document_loaders.parsers import BS4HTMLParser, PDFMinerParser\n",
"from langchain.document_loaders.parsers.generic import MimeTypeBasedParser\n",
"from langchain.document_loaders.parsers.txt import TextParser\n",
"from langchain_community.document_loaders import Blob\n",
"\n",
"# Configure the parsers that you want to use per mime-type!\n",
"HANDLERS = {\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"text/html\": BS4HTMLParser(),\n",
"}\n",
"\n",
"# Instantiate a mimetype based parser with the given parsers\n",
"MIMETYPE_BASED_PARSER = MimeTypeBasedParser(\n",
" handlers=HANDLERS,\n",
" fallback_parser=None,\n",
")\n",
"\n",
"mime = magic.Magic(mime=True)\n",
"mime_type = mime.from_buffer(data)\n",
"\n",
"# A blob represents binary data by either reference (path on file system)\n",
"# or value (bytes in memory).\n",
"blob = Blob.from_data(\n",
" data=data,\n",
" mime_type=mime_type,\n",
")\n",
"\n",
"parser = HANDLERS[mime_type]\n",
"documents = parser.parse(blob=blob)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fb618df7-d7be-4f34-8939-6f7b10dfc2b6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Car - Wikipedia\n"
]
}
],
"source": [
"print(documents[0].page_content[:30].strip())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}