Vwp/docs improved document loaders (#4006)

Huge thanks to @leo-gan for improving the document loaders notebooks

---------

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
Zander Chase 1 year ago committed by GitHub
parent 1c68cbdb28
commit aa38355999

@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -20,7 +19,15 @@
]
},
{
"attachments": {},
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install apify-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
@ -39,7 +46,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -60,7 +66,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -85,7 +90,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -102,7 +106,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -156,9 +159,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -11,9 +11,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "d9b2e33e",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import CoNLLULoader"
@ -21,9 +23,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "5b5eec48",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = CoNLLULoader(\"example_data/conllu.conllu\")"
@ -31,9 +35,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"id": "10f3f725",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"document = loader.load()"
@ -41,10 +47,23 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "acbb3579",
"metadata": {},
"outputs": [],
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='They buy and sell books.', metadata={'source': 'example_data/conllu.conllu'})]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document"
]
@ -52,7 +71,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -66,7 +85,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.10.6"
},
"toc": {
"base_numbering": 1,

@ -5,7 +5,22 @@
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte JSON\n",
"# Airbyte JSON"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases."
]
},
{
"cell_type": "markdown",
"id": "1fe72234-3110-4c07-a766-3dc505dd25cc",
"metadata": {},
"source": [
"This covers how to load any source from Airbyte into a local JSON file that can be read in as a document\n",
"\n",
"Prereqs:\n",
@ -25,7 +40,7 @@
"\n",
"6) Set destination as Local JSON, with specified destination path - lets say `/json_data`. Set up manual sync.\n",
"\n",
"7) Run the connection!\n",
"7) Run the connection.\n",
"\n",
"7) To see what files are create, you can navigate to: `file:///tmp/airbyte_local`\n",
"\n",
@ -52,7 +67,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"_airbyte_raw_pokemon.jsonl\r\n"
"_airbyte_raw_pokemon.jsonl\n"
]
}
],

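For context on the Airbyte JSON notebook above, here is a minimal usage sketch; the loader class name and the `/tmp/airbyte_local/...` path are assumptions based on the setup steps described, not lines shown in this diff:

```python
from langchain.document_loaders import AirbyteJSONLoader

# Point the loader at the JSONL file written by the Local JSON destination
loader = AirbyteJSONLoader("/tmp/airbyte_local/json_data/_airbyte_raw_pokemon.jsonl")
data = loader.load()
```
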
@ -1,15 +1,15 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Apify Dataset\n",
"\n",
[Apify Dataset]">
">[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scalable append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, that can then be exported to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store)—serverless cloud programs for various web scraping, crawling, and data extraction use cases.\n",
"\n",
"This notebook shows how to load Apify datasets to LangChain.\n",
"\n",
"[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scaleable append-only storage with sequential access built for storing structured web scraping results, such as a list of products or Google SERPs, and then export them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store)—serverless cloud programs for varius web scraping, crawling, and data extraction use cases.\n",
"\n",
"## Prerequisites\n",
"\n",
@ -17,7 +17,17 @@
]
},
{
"attachments": {},
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install apify-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
@ -35,7 +45,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -77,7 +86,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -167,9 +175,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

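For the Apify Dataset notebook above, a hedged sketch of how a dataset is typically mapped into Documents; the dataset id and the `text`/`url` item fields are placeholders that depend on the Actor's output schema, not values from this diff:

```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

# Map each dataset item to a Document; adjust the field names to your Actor's output
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"], metadata={"source": item["url"]}
    ),
)
documents = loader.load()
```
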
@ -7,7 +7,7 @@
"source": [
"# Arxiv\n",
"\n",
"[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
">[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
"\n",
"This notebook shows how to load scientific articles from `Arxiv.org` into a document format that we can use downstream."
]
@ -37,11 +37,10 @@
},
"outputs": [],
"source": [
"!pip install arxiv"
"#!pip install arxiv"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "094b5f13-7e54-4354-9d83-26d6926ecaa0",
"metadata": {
@ -60,7 +59,7 @@
},
"outputs": [],
"source": [
"!pip install pymupdf"
"#!pip install pymupdf"
]
},
{
@ -72,7 +71,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e29b954c-1407-4797-ae21-6ba8937156be",
"metadata": {},
@ -171,7 +169,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.6"
}
},
"nbformat": 4,

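For the Arxiv notebook above, a minimal sketch of loading an article; the query value is illustrative and the loader class is assumed from the notebook's title rather than shown in these hunks:

```python
from langchain.document_loaders import ArxivLoader

# Fetch up to two documents matching an arXiv identifier or free-text query
docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()
print(docs[0].metadata)
```
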
@ -6,6 +6,9 @@
"metadata": {},
"source": [
"# AZLyrics\n",
"\n",
">[AZLyrics](https://www.azlyrics.com/) is a large, legal, every day growing collection of lyrics.\n",
"\n",
"This covers how to load AZLyrics webpages into a document format that we can use downstream."
]
},
@ -85,7 +88,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -1,34 +1,45 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "a634365e",
"metadata": {},
"source": [
"# Azure Blob Storage Container\n",
"\n",
"This covers how to load document objects from a container on Azure Blob Storage."
">[Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.\n",
"\n",
"`Azure Blob Storage` is designed for:\n",
"- Serving images or documents directly to a browser.\n",
"- Storing files for distributed access.\n",
"- Streaming video and audio.\n",
"- Writing to log files.\n",
"- Storing data for backup and restore, disaster recovery, and archiving.\n",
"- Storing data for analysis by an on-premises or Azure-hosted service.\n",
"\n",
"This notebook covers how to load document objects from a container on `Azure Blob Storage`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2f0cd6a5",
"execution_count": null,
"id": "49815096",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import AzureBlobStorageContainerLoader"
"#!pip install azure-storage-blob"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "49815096",
"metadata": {},
"id": "2f0cd6a5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install azure-storage-blob"
"from langchain.document_loaders import AzureBlobStorageContainerLoader"
]
},
{
@ -127,7 +138,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

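A short, hedged usage sketch for the container loader imported above; the connection string and container name are placeholders:

```python
from langchain.document_loaders import AzureBlobStorageContainerLoader

loader = AzureBlobStorageContainerLoader(
    conn_str="<connection string>",
    container="<container name>",
    # prefix="<optional prefix>",  # narrows the load to blobs under a prefix
)
documents = loader.load()
```
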
@ -1,34 +1,37 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "66a7777e",
"metadata": {},
"source": [
"# Azure Blob Storage File\n",
"\n",
"This covers how to load document objects from a Azure Blob Storage file."
">[Azure Files](https://learn.microsoft.com/en-us/azure/storage/files/storage-files-introduction) offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (`SMB`) protocol, Network File System (`NFS`) protocol, and `Azure Files REST API`.\n",
"\n",
"This covers how to load document objects from a Azure Files."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9ec8a3b3",
"metadata": {},
"id": "43128d8d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import AzureBlobStorageFileLoader"
"#!pip install azure-storage-blob"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "43128d8d",
"execution_count": 1,
"id": "9ec8a3b3",
"metadata": {},
"outputs": [],
"source": [
"#!pip install azure-storage-blob"
"from langchain.document_loaders import AzureBlobStorageFileLoader"
]
},
{
@ -87,7 +90,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

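Similarly, a hedged sketch for the single-blob loader imported above; all values are placeholders:

```python
from langchain.document_loaders import AzureBlobStorageFileLoader

loader = AzureBlobStorageFileLoader(
    conn_str="<connection string>",
    container="<container name>",
    blob_name="<blob name>",
)
documents = loader.load()
```
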
@ -4,15 +4,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# BigQuery Loader\n",
"# BigQuery\n",
"\n",
"Load a BigQuery query with one document per row."
">[BigQuery](https://cloud.google.com/bigquery) is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.\n",
"`BigQuery` is a part of the `Google Cloud Platform`.\n",
"\n",
"Load a `BigQuery` query with one document per row."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install google-cloud-bigquery"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import BigQueryLoader"
@ -194,9 +210,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

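For the BigQuery notebook above, a sketch of splitting query columns between page content and metadata; the query and column names are illustrative:

```python
from langchain.document_loaders import BigQueryLoader

QUERY = "SELECT name, description, url FROM my_dataset.my_table"

# Columns in page_content_columns become the document text;
# metadata_columns are attached as metadata.
loader = BigQueryLoader(
    QUERY,
    page_content_columns=["name", "description"],
    metadata_columns=["url"],
)
documents = loader.load()
```
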
@ -7,29 +7,33 @@
"source": [
"# Bilibili\n",
"\n",
"This loader utilizes the `bilibili-api` to fetch the text transcript from Bilibili, one of the most beloved long-form video sites in China.\n",
"This loader utilizes the [bilibili-api](https://github.com/MoyuScript/bilibili-api) to fetch the text transcript from [Bilibili](https://www.bilibili.tv/), one of the most beloved long-form video sites in China.\n",
"\n",
"With this BiliBiliLoader, users can easily obtain the transcript of their desired video content on the platform."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9ec8a3b3",
"metadata": {},
"execution_count": null,
"id": "43128d8d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders.bilibili import BiliBiliLoader"
"#!pip install bilibili-api"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "43128d8d",
"metadata": {},
"execution_count": null,
"id": "9ec8a3b3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install bilibili-api"
"from langchain.document_loaders.bilibili import BiliBiliLoader"
]
},
{
@ -51,16 +55,20 @@
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"loader.load()"
],
"id": "3470dadf",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
}
},
"outputs": [],
"source": [
"loader.load()"
]
}
],
"metadata": {
@ -79,9 +87,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
}

@ -1,13 +1,18 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Blackboard\n",
"\n",
"This covers how to load data from a Blackboard Learn instance."
"This covers how to load data from a [Blackboard Learn](https://www.anthology.com/products/teaching-and-learning/learning-effectiveness/blackboard-learn) instance.\n",
"\n",
"This loader is not compatible with all `Blackboard` courses. It is only\n",
" compatible with courses that use the new `Blackboard` interface.\n",
" To use this loader, you must have the BbRouter cookie. You can get this\n",
" cookie by logging into the course and then copying the value of the\n",
" BbRouter cookie from the browser's developer tools."
]
},
{
@ -28,11 +33,24 @@
}
],
"metadata": {
"language_info": {
"name": "python"
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"orig_nbformat": 4
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

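A hedged sketch for the Blackboard loader described above, using the BbRouter cookie it mentions; the course URL, cookie value, and parameter names are placeholders and assumptions, not lines from this diff:

```python
from langchain.document_loaders import BlackboardLoader

loader = BlackboardLoader(
    blackboard_course_url="https://blackboard.example.com/webapps/blackboard/execute/announcement?method=search&context=course_entry&course_id=_123456_1",
    bbrouter="expires:12345...",  # value copied from the BbRouter cookie in your browser
    load_all_recursively=True,
)
documents = loader.load()
```
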
@ -1,151 +1,149 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vm8vn9t8DvC_"
},
"source": [
"# Blockchain"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "5WjXERXzFEhg"
},
"source": [
"## Overview"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "juAmbgoWD17u"
},
"source": [
"The intention of this notebook is to provide a means of testing functionality in the Langchain Document Loader for Blockchain.\n",
"\n",
"Initially this Loader supports:\n",
"\n",
"* Loading NFTs as Documents from NFT Smart Contracts (ERC721 and ERC1155)\n",
"* Ethereum Maninnet, Ethereum Testnet, Polgyon Mainnet, Polygon Testnet (default is eth-mainnet)\n",
"* Alchemy's getNFTsForCollection API\n",
"\n",
"It can be extended if the community finds value in this loader. Specifically:\n",
"\n",
"* Additional APIs can be added (e.g. Tranction-related APIs)\n",
"\n",
"This Document Loader Requires:\n",
"\n",
"* A free [Alchemy API Key](https://www.alchemy.com/)\n",
"\n",
"The output takes the following format:\n",
"\n",
"- pageContent= Individual NFT\n",
"- metadata={'source': '0x1a92f7381b9f03921564a437210bb9396471050c', 'blockchain': 'eth-mainnet', 'tokenId': '0x15'})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load NFTs into Document Loader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alchemyApiKey = \"get from https://www.alchemy.com/ and set in environment variable ALCHEMY_API_KEY\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 1: Ethereum Mainnet (default BlockchainType)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "J3LWHARC-Kn0"
},
"outputs": [],
"source": [
"contractAddress = \"0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d\" # Bored Ape Yacht Club contract address\n",
"\n",
"blockchainType = BlockchainType.ETH_MAINNET #default value, optional parameter\n",
"\n",
"blockchainLoader = BlockchainDocumentLoader(contract_address=contractAddress,\n",
" api_key=alchemyApiKey)\n",
"\n",
"nfts = blockchainLoader.load()\n",
"\n",
"nfts[:2]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Polygon Mainnet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"contractAddress = \"0x448676ffCd0aDf2D85C1f0565e8dde6924A9A7D9\" # Polygon Mainnet contract address\n",
"\n",
"blockchainType = BlockchainType.POLYGON_MAINNET \n",
"\n",
"blockchainLoader = BlockchainDocumentLoader(contract_address=contractAddress, \n",
" blockchainType=blockchainType, \n",
" api_key=alchemyApiKey)\n",
"\n",
"nfts = blockchainLoader.load()\n",
"\n",
"nfts[:2]"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"5WjXERXzFEhg"
],
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "vm8vn9t8DvC_"
},
"source": [
"# Blockchain"
]
},
"nbformat": 4,
"nbformat_minor": 0
{
"cell_type": "markdown",
"metadata": {
"id": "5WjXERXzFEhg"
},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "juAmbgoWD17u"
},
"source": [
"The intention of this notebook is to provide a means of testing functionality in the Langchain Document Loader for Blockchain.\n",
"\n",
"Initially this Loader supports:\n",
"\n",
"* Loading NFTs as Documents from NFT Smart Contracts (ERC721 and ERC1155)\n",
"* Ethereum Maninnet, Ethereum Testnet, Polgyon Mainnet, Polygon Testnet (default is eth-mainnet)\n",
"* Alchemy's getNFTsForCollection API\n",
"\n",
"It can be extended if the community finds value in this loader. Specifically:\n",
"\n",
"* Additional APIs can be added (e.g. Tranction-related APIs)\n",
"\n",
"This Document Loader Requires:\n",
"\n",
"* A free [Alchemy API Key](https://www.alchemy.com/)\n",
"\n",
"The output takes the following format:\n",
"\n",
"- pageContent= Individual NFT\n",
"- metadata={'source': '0x1a92f7381b9f03921564a437210bb9396471050c', 'blockchain': 'eth-mainnet', 'tokenId': '0x15'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load NFTs into Document Loader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get ALCHEMY_API_KEY from https://www.alchemy.com/ \n",
"\n",
"alchemyApiKey = \"...\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 1: Ethereum Mainnet (default BlockchainType)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "J3LWHARC-Kn0"
},
"outputs": [],
"source": [
"from langchain.document_loaders.blockchain import BlockchainDocumentLoader, BlockchainType\n",
"contractAddress = \"0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d\" # Bored Ape Yacht Club contract address\n",
"\n",
"blockchainType = BlockchainType.ETH_MAINNET #default value, optional parameter\n",
"\n",
"blockchainLoader = BlockchainDocumentLoader(contract_address=contractAddress,\n",
" api_key=alchemyApiKey)\n",
"\n",
"nfts = blockchainLoader.load()\n",
"\n",
"nfts[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Polygon Mainnet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"contractAddress = \"0x448676ffCd0aDf2D85C1f0565e8dde6924A9A7D9\" # Polygon Mainnet contract address\n",
"\n",
"blockchainType = BlockchainType.POLYGON_MAINNET \n",
"\n",
"blockchainLoader = BlockchainDocumentLoader(contract_address=contractAddress, \n",
" blockchainType=blockchainType, \n",
" api_key=alchemyApiKey)\n",
"\n",
"nfts = blockchainLoader.load()\n",
"\n",
"nfts[:2]"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"5WjXERXzFEhg"
],
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -1,21 +1,22 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### ChatGPT Data Loader\n",
"\n",
"This notebook covers how to load `conversations.json` from your ChatGPT data export folder.\n",
"This notebook covers how to load `conversations.json` from your `ChatGPT` data export folder.\n",
"\n",
"You can get your data export by email by going to: https://chat.openai.com/ -> (Profile) - Settings -> Export data -> Confirm export."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders.chatgpt import ChatGPTLoader"
@ -53,7 +54,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -67,10 +68,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"orig_nbformat": 4
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

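For the ChatGPT data loader above, a minimal sketch; the export path is a placeholder and the `num_logs` option for limiting the number of conversations is an assumption about the loader's parameters:

```python
from langchain.document_loaders.chatgpt import ChatGPTLoader

loader = ChatGPTLoader(log_file="./example_data/fake_conversations.json", num_logs=1)
documents = loader.load()
```
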
@ -6,7 +6,10 @@
"metadata": {},
"source": [
"# College Confidential\n",
"This covers how to load College Confidential webpages into a document format that we can use downstream."
"\n",
">[College Confidential](https://www.collegeconfidential.com/) gives information on 3,800+ colleges and universities.\n",
"\n",
"This covers how to load `College Confidential` webpages into a document format that we can use downstream."
]
},
{
@ -85,7 +88,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -6,18 +6,29 @@
"source": [
"# Confluence\n",
"\n",
"A loader for Confluence pages.\n",
"A loader for [Confluence](https://www.atlassian.com/software/confluence) pages.\n",
"\n",
"\n",
"This currently supports both username/api_key and Oauth2 login.\n",
"This currently supports both `username/api_key` and `Oauth2 login`.\n",
"\n",
"\n",
"Specify a list page_ids and/or space_key to load in the corresponding pages into Document objects, if both are specified the union of both sets will be returned.\n",
"\n",
"\n",
"You can also specify a boolean `include_attachments` to include attachments, this is set to False by default, if set to True all attachments will be downloaded and ConfluenceReader will extract the text from the attachments and add it to the Document object. Currently supported attachment types are: PDF, PNG, JPEG/JPG, SVG, Word and Excel.\n",
"You can also specify a boolean `include_attachments` to include attachments, this is set to False by default, if set to True all attachments will be downloaded and ConfluenceReader will extract the text from the attachments and add it to the Document object. Currently supported attachment types are: `PDF`, `PNG`, `JPEG/JPG`, `SVG`, `Word` and `Excel`.\n",
"\n",
"Hint: space_key and page_id can both be found in the URL of a page in Confluence - https://yoursite.atlassian.com/wiki/spaces/<space_key>/pages/<page_id>\n"
"Hint: `space_key` and `page_id` can both be found in the URL of a page in Confluence - https://yoursite.atlassian.com/wiki/spaces/<space_key>/pages/<page_id>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install atlassian-python-api"
]
},
{
@ -33,7 +44,7 @@
" username=\"me\",\n",
" api_key=\"12345\"\n",
")\n",
"documents = loader.load(space_key=\"SPACE\", include_attachments=True, limit=50)\n"
"documents = loader.load(space_key=\"SPACE\", include_attachments=True, limit=50)"
]
}
],
@ -53,7 +64,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
@ -62,5 +73,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -94,7 +94,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -2,20 +2,21 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"metadata": {},
"source": [
"# CSV Loader\n",
"# CSV Files\n",
"\n",
"Load csv files with a single row per document."
"Load [csv](https://en.wikipedia.org/wiki/Comma-separated_values) data with a single row per document."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
@ -26,7 +27,10 @@
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
@ -39,7 +43,10 @@
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
@ -56,9 +63,7 @@
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"metadata": {},
"source": [
"## Customizing the csv parsing and loading\n",
"\n",
@ -69,7 +74,10 @@
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
@ -86,7 +94,10 @@
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
@ -102,13 +113,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specify a column to be used identify the document source\n",
"## Specify a column to identify the document source\n",
"\n",
"Use the `source_column` argument to specify a column to be set as the source for the document created from each row. Otherwise `file_path` will be used as the source for all documents created from the csv file.\n",
"Use the `source_column` argument to specify a source for the document created from each row. Otherwise `file_path` will be used as the source for all documents created from the CSV file.\n",
"\n",
"This is useful when using documents loaded from CSV files for chains that answer questions using sources."
]
@ -144,7 +154,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -158,9 +168,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 4
}

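To illustrate the `source_column` behavior discussed in the CSV notebook above, a short sketch; the file path and column name are placeholders:

```python
from langchain.document_loaders.csv_loader import CSVLoader

# Each row becomes one Document; its source metadata is taken from the "Team" column
loader = CSVLoader(file_path="example_data/mlb_teams_2012.csv", source_column="Team")
docs = loader.load()
print(docs[0].metadata["source"])
```
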
@ -5,9 +5,19 @@
"id": "213a38a2",
"metadata": {},
"source": [
"# DataFrame Loader\n",
"# Pandas DataFrame\n",
"\n",
"This notebook goes over how to load data from a pandas dataframe"
"This notebook goes over how to load data from a [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6a7a9e4-80d6-486a-b2e3-636c568aa97c",
"metadata": {},
"outputs": [],
"source": [
"#!pip install pandas"
]
},
{
@ -210,7 +220,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

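For the DataFrame loader above, a minimal sketch; the column names are placeholders and one Document is produced per row:

```python
import pandas as pd
from langchain.document_loaders import DataFrameLoader

df = pd.DataFrame({"Team": ["Nationals", "Reds"], "Payroll (millions)": [81.34, 82.20]})

# The named column becomes page_content; remaining columns become metadata
loader = DataFrameLoader(df, page_content_column="Team")
documents = loader.load()
```
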
@ -1,13 +1,16 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "2dfc4698",
"metadata": {},
"source": [
"# Diffbot\n",
"\n",
">Unlike traditional web scraping tools, [Diffbot](https://docs.diffbot.com/docs) doesn't require any rules to read the content on a page.\n",
">It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.\n",
">The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.\n",
"\n",
"This covers how to extract HTML documents from a list of URLs using the [Diffbot extract API](https://www.diffbot.com/products/extract/), into a document format that we can use downstream."
]
},
@ -24,7 +27,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6fffec88",
"metadata": {},
@ -45,7 +47,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e0ce8c05",
"metadata": {},

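A hedged sketch for the Diffbot extract loader described above; the URL list and token handling are placeholders, not content from this diff:

```python
import os
from langchain.document_loaders import DiffbotLoader

urls = ["https://python.langchain.com/en/latest/index.html"]

loader = DiffbotLoader(urls=urls, api_token=os.environ["DIFFBOT_API_TOKEN"])
documents = loader.load()
```
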
@ -69,7 +69,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e633d62f",
"metadata": {},
@ -78,7 +77,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "43911860",
"metadata": {},
@ -119,7 +117,7 @@
"metadata": {},
"source": [
"## Change loader class\n",
"By default this uses the UnstructuredLoader class. However, you can change up the type of loader pretty easily."
"By default this uses the `UnstructuredLoader` class. However, you can change up the type of loader pretty easily."
]
},
{
@ -257,7 +255,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -4,24 +4,41 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# DuckDB Loader\n",
"# DuckDB\n",
"\n",
"Load a DuckDB query with one document per row."
">[DuckDB](https://duckdb.org/) is an in-process SQL OLAP database management system.\n",
"\n",
"Load a `DuckDB` query with one document per row."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import DuckDBLoader"
"#!pip install duckdb"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import DuckDBLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@ -40,8 +57,10 @@
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = DuckDBLoader(\"SELECT * FROM read_csv_auto('example.csv')\")\n",
@ -51,8 +70,10 @@
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@ -167,9 +188,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}

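The DuckDB notebook above loads one document per row; as a hedged extension, the column-splitting parameters (assumed to mirror the BigQuery loader) would look roughly like this:

```python
from langchain.document_loaders import DuckDBLoader

loader = DuckDBLoader(
    "SELECT * FROM read_csv_auto('example.csv')",
    page_content_columns=["Team"],            # columns concatenated into the document text
    metadata_columns=["Payroll (millions)"],  # columns kept as metadata
)
documents = loader.load()
```
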
@ -7,7 +7,7 @@
"source": [
"# Email\n",
"\n",
"This notebook shows how to load email (`.eml`) and Microsoft Outlook (`.msg`) files."
"This notebook shows how to load email (`.eml`) or `Microsoft Outlook` (`.msg`) files."
]
},
{
@ -20,9 +20,23 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "226e50aa-407d-43d9-a81d-f6706298b10c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install unstructured"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "40cd9806",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredEmailLoader"
@ -30,9 +44,11 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 6,
"id": "2d20b852",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredEmailLoader('example_data/fake-email.eml')"
@ -40,9 +56,11 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "579fa702",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -50,17 +68,19 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 8,
"id": "90c1d899",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='This is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', lookup_str='', metadata={'source': 'example_data/fake-email.eml'}, lookup_index=0)]"
"[Document(page_content='This is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': 'example_data/fake-email.eml'})]"
]
},
"execution_count": 4,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@ -128,6 +148,16 @@
"## Using OutlookMessageLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "058e670e-9964-44ee-b888-44f23ffb9310",
"metadata": {},
"outputs": [],
"source": [
"#!pip install extract_msg"
]
},
{
"cell_type": "code",
"execution_count": 8,
@ -204,7 +234,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

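For the `OutlookMessageLoader` section referenced above (it needs the `extract_msg` package), a minimal sketch with a placeholder file path:

```python
from langchain.document_loaders import OutlookMessageLoader

loader = OutlookMessageLoader("example_data/fake-email.msg")
data = loader.load()
```
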
@ -5,16 +5,18 @@
"id": "39af9ecd",
"metadata": {},
"source": [
"# EPubs\n",
"# EPub \n",
"\n",
"This covers how to load `.epub` documents into a document format that we can use downstream. You'll need to install the [`pandocs`](https://pandoc.org/installing.html) package for this loader to work."
"This covers how to load `.epub` documents into the Document format that we can use downstream. You'll need to install the [`pandocs`](https://pandoc.org/installing.html) package for this loader to work."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "721c48aa",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredEPubLoader"
@ -24,7 +26,9 @@
"cell_type": "code",
"execution_count": 2,
"id": "9d3d0e35",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredEPubLoader(\"winter-sports.epub\")"
@ -32,9 +36,11 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "06073f91",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -54,7 +60,9 @@
"cell_type": "code",
"execution_count": 4,
"id": "064f9162",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredEPubLoader(\"winter-sports.epub\", mode=\"elements\")"
@ -62,9 +70,11 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "abefbbdb",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -116,7 +126,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -7,35 +7,41 @@
"source": [
"# EverNote\n",
"\n",
"How to load EverNote file from disk."
">[EverNote](https://evernote.com/) is intended for archiving and creating notes in which photos, audio and saved web content can be embedded. Notes are stored in virtual \"notebooks\" and can be tagged, annotated, edited, searched, and exported.\n",
"\n",
"This notebook shows how to load `EverNote` file from disk."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"id": "1a53ece0",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# !pip install pypandoc\n",
"# import pypandoc\n",
"#!pip install pypandoc\n",
"import pypandoc\n",
"\n",
"# pypandoc.download_pandoc()"
"pypandoc.download_pandoc()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"id": "88df766f",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='testing this\\n\\nwhat happens?\\n\\nto the world?\\n', lookup_str='', metadata={'source': 'example_data/testing.enex'}, lookup_index=0)]"
"[Document(page_content='testing this\\n\\nwhat happens?\\n\\nto the world?\\n', metadata={'source': 'example_data/testing.enex'})]"
]
},
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@ -46,14 +52,6 @@
"loader = EverNoteLoader(\"example_data/testing.enex\")\n",
"loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1329905",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -72,7 +70,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,60 +5,60 @@
{
"sender_name": "User 1",
"timestamp_ms": 1675597435669,
"content": "Oh no worries! Bye",
"content": "Oh no worries! Bye"
},
{
"sender_name": "User 2",
"timestamp_ms": 1675596277579,
"content": "No Im sorry it was my mistake, the blue one is not for sale",
"content": "No Im sorry it was my mistake, the blue one is not for sale"
},
{
"sender_name": "User 1",
"timestamp_ms": 1675595140251,
"content": "I thought you were selling the blue one!",
"content": "I thought you were selling the blue one!"
},
{
"sender_name": "User 1",
"timestamp_ms": 1675595109305,
"content": "Im not interested in this bag. Im interested in the blue one!",
"content": "Im not interested in this bag. Im interested in the blue one!"
},
{
"sender_name": "User 2",
"timestamp_ms": 1675595068468,
"content": "Here is $129",
"content": "Here is $129"
},
{
"sender_name": "User 2",
"timestamp_ms": 1675595060730,
"photos": [
{"uri": "url_of_some_picture.jpg", "creation_timestamp": 1675595059}
],
]
},
{
"sender_name": "User 2",
"timestamp_ms": 1675595045152,
"content": "Online is at least $100",
"content": "Online is at least $100"
},
{
"sender_name": "User 1",
"timestamp_ms": 1675594799696,
"content": "How much do you want?",
"content": "How much do you want?"
},
{
"sender_name": "User 2",
"timestamp_ms": 1675577876645,
"content": "Goodmorning! $50 is too low.",
"content": "Goodmorning! $50 is too low."
},
{
"sender_name": "User 1",
"timestamp_ms": 1675549022673,
"content": "Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!",
},
"content": "Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!"
}
],
"title": "User 1 and User 2 chat",
"is_still_participant": true,
"thread_path": "inbox/User 1 and User 2 chat",
"magic_words": [],
"image": {"uri": "image_of_the_chat.jpg", "creation_timestamp": 1675549016},
"joinable_mode": {"mode": 1, "link": ""},
"joinable_mode": {"mode": 1, "link": ""}
}

@ -6,14 +6,25 @@
"source": [
"### Facebook Chat\n",
"\n",
"This notebook covers how to load data from the Facebook Chats into a format that can be ingested into LangChain."
"This notebook covers how to load data from the [Facebook Chats](https://www.facebook.com/business/help/1646890868956360) into a format that can be ingested into LangChain."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import FacebookChatLoader"
]
@ -21,7 +32,9 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = FacebookChatLoader(\"example_data/facebook_chat.json\")"
@ -29,16 +42,18 @@
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='User 2 on 2023-02-05 12:46:11: Bye!\\n\\nUser 1 on 2023-02-05 12:43:55: Oh no worries! Bye\\n\\nUser 2 on 2023-02-05 12:24:37: No Im sorry it was my mistake, the blue one is not for sale\\n\\nUser 1 on 2023-02-05 12:05:40: I thought you were selling the blue one!\\n\\nUser 1 on 2023-02-05 12:05:09: Im not interested in this bag. Im interested in the blue one!\\n\\nUser 2 on 2023-02-05 12:04:28: Here is $129\\n\\nUser 2 on 2023-02-05 12:04:05: Online is at least $100\\n\\nUser 1 on 2023-02-05 11:59:59: How much do you want?\\n\\nUser 2 on 2023-02-05 07:17:56: Goodmorning! $50 is too low.\\n\\nUser 1 on 2023-02-04 23:17:02: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\\n\\n', lookup_str='', metadata={'source': 'docs/modules/document_loaders/examples/example_data/facebook_chat.json'}, lookup_index=0)]"
"[Document(page_content='User 2 on 2023-02-05 03:46:11: Bye!\\n\\nUser 1 on 2023-02-05 03:43:55: Oh no worries! Bye\\n\\nUser 2 on 2023-02-05 03:24:37: No Im sorry it was my mistake, the blue one is not for sale\\n\\nUser 1 on 2023-02-05 03:05:40: I thought you were selling the blue one!\\n\\nUser 1 on 2023-02-05 03:05:09: Im not interested in this bag. Im interested in the blue one!\\n\\nUser 2 on 2023-02-05 03:04:28: Here is $129\\n\\nUser 2 on 2023-02-05 03:04:05: Online is at least $100\\n\\nUser 1 on 2023-02-05 02:59:59: How much do you want?\\n\\nUser 2 on 2023-02-04 22:17:56: Goodmorning! $50 is too low.\\n\\nUser 1 on 2023-02-04 14:17:02: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\\n\\n', metadata={'source': 'example_data/facebook_chat.json'})]"
]
},
"execution_count": 3,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@ -64,7 +79,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
@ -73,5 +88,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -1,21 +1,24 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "33205b12",
"metadata": {},
"source": [
"# Figma\n",
"\n",
"This notebook covers how to load data from the Figma REST API into a format that can be ingested into LangChain, along with example usage for code generation."
">[Figma](https://www.figma.com/) is a collaborative web application for interface design.\n",
"\n",
"This notebook covers how to load data from the `Figma` REST API into a format that can be ingested into LangChain, along with example usage for code generation."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "90b69c94",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
@ -37,7 +40,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d809744a",
"metadata": {},
@ -117,7 +119,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "baf9b2c9",
"metadata": {},
@ -151,7 +152,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.10.6"
}
},
"nbformat": 4,

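A hedged sketch of constructing the Figma loader set up above; the three arguments (access token, node ids, and file key) are assumptions based on the environment variables the notebook reads, and all values are placeholders:

```python
import os
from langchain.document_loaders.figma import FigmaFileLoader

figma_loader = FigmaFileLoader(
    os.environ.get("ACCESS_TOKEN"),  # Figma access token
    os.environ.get("NODE_IDS"),      # comma-separated node ids
    os.environ.get("FILE_KEY"),      # file key from the Figma URL
)
documents = figma_loader.load()
```
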
@ -7,17 +7,9 @@
"source": [
"# GCS Directory\n",
"\n",
"This covers how to load document objects from an Google Cloud Storage (GCS) directory."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5cfb25c9",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GCSDirectoryLoader"
">[Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage) is a managed service for storing unstructured data.\n",
"\n",
"This covers how to load document objects from an `Google Cloud Storage (GCS) directory (bucket)`."
]
},
{
@ -32,6 +24,16 @@
"# !pip install google-cloud-storage"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5cfb25c9",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GCSDirectoryLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
@ -148,7 +150,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

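For the GCS directory loader above, a minimal sketch; the project, bucket, and prefix values are placeholders:

```python
from langchain.document_loaders import GCSDirectoryLoader

loader = GCSDirectoryLoader(project_name="my-project", bucket="my-bucket", prefix="docs/")
documents = loader.load()
```
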
@ -7,17 +7,9 @@
"source": [
"# GCS File Storage\n",
"\n",
"This covers how to load document objects from an Google Cloud Storage (GCS) file object."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5cfb25c9",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GCSFileLoader"
">[Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage) is a managed service for storing unstructured data.\n",
"\n",
[Google Cloud Storage]">
">[Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage) is a managed service for storing unstructured data.\n",
"\n",
"This covers how to load document objects from a `Google Cloud Storage (GCS)` file object (blob)."
]
},
{
@ -32,6 +24,16 @@
"# !pip install google-cloud-storage"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5cfb25c9",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GCSFileLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
@ -96,7 +98,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -6,7 +6,9 @@
"source": [
"# Git\n",
"\n",
"This notebook shows how to load text files from Git repository."
">[Git](https://en.wikipedia.org/wiki/Git) is a distributed version control system that tracks changes in any set of computer files, usually used for coordinating work among programmers collaboratively developing source code during software development.\n",
"\n",
"This notebook shows how to load text files from `Git` repository."
]
},
{
@ -18,8 +20,21 @@
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install GitPython"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from git import Repo\n",
@ -33,7 +48,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import GitLoader"
@ -184,9 +201,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

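For the Git notebook above, a hedged sketch that clones a repository and then loads files from it; the clone path is a placeholder:

```python
from git import Repo
from langchain.document_loaders import GitLoader

# Clone a repository locally first, then point the loader at it
repo = Repo.clone_from(
    "https://github.com/hwchase17/langchain", to_path="./example_data/test_repo1"
)
loader = GitLoader(repo_path="./example_data/test_repo1/", branch=repo.head.reference)
documents = loader.load()
```
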
@ -6,7 +6,10 @@
"metadata": {},
"source": [
"# GitBook\n",
"How to pull page data from any GitBook."
"\n",
">[GitBook](https://docs.gitbook.com/) is a modern documentation platform where teams can document everything from products to internal knowledge bases and APIs.\n",
"\n",
"This notebook shows how to pull page data from any `GitBook`."
]
},
{
@ -20,21 +23,21 @@
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "849a8d52",
"cell_type": "markdown",
"id": "65d5ddce",
"metadata": {},
"outputs": [],
"source": [
"loader = GitbookLoader(\"https://docs.gitbook.com\")"
"### Load from single GitBook page"
]
},
{
"cell_type": "markdown",
"id": "65d5ddce",
"cell_type": "code",
"execution_count": 2,
"id": "849a8d52",
"metadata": {},
"outputs": [],
"source": [
"### Load from single GitBook page"
"loader = GitbookLoader(\"https://docs.gitbook.com\")"
]
},
{
@ -178,7 +181,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {

@ -6,7 +6,7 @@
"metadata": {},
"source": [
"# Google Drive\n",
"This notebook covers how to load documents from Google Drive. Currently, only Google Docs are supported.\n",
"This notebook covers how to load documents from `Google Drive`. Currently, only `Google Docs` are supported.\n",
"\n",
"## Prerequisites\n",
"\n",
@ -23,6 +23,16 @@
"* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `\"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e40071c-3a65-4e26-b497-3e2be0bd86b9",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "code",
"execution_count": 1,
@ -80,7 +90,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

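For the Google Drive notebook above, a minimal sketch; the folder id is a placeholder, and credentials are resolved from the local token/credentials files described in the prerequisites:

```python
from langchain.document_loaders import GoogleDriveLoader

loader = GoogleDriveLoader(
    folder_id="<folder id>",  # placeholder; copy the id from the folder URL
    recursive=False,
)
documents = loader.load()
```
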
@ -7,14 +7,18 @@
"source": [
"# Gutenberg\n",
"\n",
"This covers how to load links to Gutenberg e-books into a document format that we can use downstream."
">[Project Gutenberg](https://www.gutenberg.org/about/) is an online library of free eBooks.\n",
"\n",
"This notebook covers how to load links to `Gutenberg` e-books into a document format that we can use downstream."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9bfd5e46",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import GutenbergLoader"
@ -22,9 +26,11 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"id": "700e4ef2",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = GutenbergLoader('https://www.gutenberg.org/cache/epub/69972/pg69972.txt')"
@ -32,9 +38,11 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 3,
"id": "b6f28930",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -42,21 +50,49 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"id": "7d436441",
"metadata": {},
"outputs": [],
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'The Project Gutenberg eBook of The changed brides, by Emma Dorothy\\r\\n\\n\\nEliza Nevitte Southworth\\r\\n\\n\\n\\r\\n\\n\\nThis eBook is for the use of anyone anywhere in the United States and\\r\\n\\n\\nmost other parts of the world at no cost and with almost no restrictions\\r\\n\\n\\nwhatsoever. You may copy it, give it away or re-u'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
"data[0].page_content[:300]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b74d755",
"metadata": {},
"outputs": [],
"source": []
"execution_count": 9,
"id": "1481beb1-12a7-4654-9d91-bfd101109891",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://www.gutenberg.org/cache/epub/69972/pg69972.txt'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[0].metadata"
]
}
],
"metadata": {
@ -75,7 +111,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -6,14 +6,19 @@
"metadata": {},
"source": [
"# Hacker News\n",
"How to pull page data and comments from Hacker News"
"\n",
">[Hacker News](https://en.wikipedia.org/wiki/Hacker_News) (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as \"anything that gratifies one's intellectual curiosity.\"\n",
"\n",
"This notebook covers how to pull page data and comments from [Hacker News](https://news.ycombinator.com/)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ff49b177",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import HNLoader"
@ -23,7 +28,9 @@
"cell_type": "code",
"execution_count": 2,
"id": "849a8d52",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = HNLoader(\"https://news.ycombinator.com/item?id=34817881\")"
@ -33,7 +40,9 @@
"cell_type": "code",
"execution_count": 3,
"id": "c2826836",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -43,15 +52,14 @@
"cell_type": "code",
"execution_count": 4,
"id": "fefa2adc",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content=\"delta_p_delta_x 18 hours ago \\n | next [] \\n\\nAstrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\\n \\nreply\", lookup_str='', metadata={'source': 'https://news.ycombinator.com/item?id=34817881', 'title': 'What Lights the Universes Standard Candles?'}, lookup_index=0),\n",
" Document(page_content=\"andrewflnr 19 hours ago \\n | prev | next [] \\n\\nWhoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\\n \\nreply\", lookup_str='', metadata={'source': 'https://news.ycombinator.com/item?id=34817881', 'title': 'What Lights the Universes Standard Candles?'}, lookup_index=0),\n",
" Document(page_content='andreareina 18 hours ago \\n | prev | next [] \\n\\nThis seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\\n \\nreply', lookup_str='', metadata={'source': 'https://news.ycombinator.com/item?id=34817881', 'title': 'What Lights the Universes Standard Candles?'}, lookup_index=0),\n",
" Document(page_content=\"andreareina 18 hours ago \\n | prev [] \\n\\nWouldn't double detonation show up as variance in the brightness?\\n \\nreply\", lookup_str='', metadata={'source': 'https://news.ycombinator.com/item?id=34817881', 'title': 'What Lights the Universes Standard Candles?'}, lookup_index=0)]"
"\"delta_p_delta_x 73 days ago \\n | next [] \\n\\nAstrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs a\""
]
},
"execution_count": 4,
@ -60,16 +68,32 @@
}
],
"source": [
"data"
"data[0].page_content[:300]"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"id": "938ff4ee",
"metadata": {},
"outputs": [],
"source": []
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://news.ycombinator.com/item?id=34817881',\n",
" 'title': 'What Lights the Universes Standard Candles?'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[0].metadata"
]
}
],
"metadata": {
@ -88,7 +112,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {

@ -7,7 +7,7 @@
"source": [
"# HTML\n",
"\n",
"This covers how to load HTML documents into a document format that we can use downstream."
"This covers how to load `HTML` documents into a document format that we can use downstream."
]
},
{
@ -48,7 +48,9 @@
"outputs": [
{
"data": {
"text/plain": "[Document(page_content='My First Heading\\n\\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]"
"text/plain": [
"[Document(page_content='My First Heading\\n\\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]"
]
},
"execution_count": 4,
"metadata": {},
@ -61,20 +63,21 @@
},
{
"cell_type": "markdown",
"id": "00337aae",
"metadata": {},
"source": [
"## Loading HTML with BeautifulSoup4\n",
"\n",
"We can also use BeautifulSoup4 to load HTML documents using the `BSHTMLLoader`. This will extract the text from the html into `page_content`, and the page title as `title` into `metadata`."
],
"metadata": {
"collapsed": false
}
"We can also use `BeautifulSoup4` to load HTML documents using the `BSHTMLLoader`. This will extract the text from the HTML into `page_content`, and the page title as `title` into `metadata`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 1,
"id": "79b1bce4",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import BSHTMLLoader"
@ -82,13 +85,23 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 2,
"id": "4be99e6c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": "[Document(page_content='\\n\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n', lookup_str='', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'}, lookup_index=0)]"
"text/plain": [
"[Document(page_content='\\n\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]"
]
},
"execution_count": 17,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@ -97,19 +110,7 @@
"loader = BSHTMLLoader(\"example_data/fake-content.html\")\n",
"data = loader.load()\n",
"data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
]
}
],
"metadata": {
@ -128,7 +129,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,11 +5,14 @@
"id": "04c9fdc5",
"metadata": {},
"source": [
"# HuggingFace dataset loader \n",
"# HuggingFace dataset \n",
"\n",
"This notebook shows how to load Hugging Face Hub datasets to LangChain.\n",
"The [Hugging Face Hub](https://huggingface.co/docs/hub/index) hosts a large number of community-curated datasets for a diverse range of tasks such as translation,\n",
"automatic speech recognition, and image classification.\n",
"\n",
"The Hugging Face Hub hosts a large number of community-curated datasets for a diverse range of tasks such as translation, automatic speech recognition, and image classification.\n"
">The `Hugging Face Hub` is home to over 5,000 [datasets](https://huggingface.co/docs/hub/index#datasets) in more than 100 languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio.\n",
"\n",
"This notebook shows how to load `Hugging Face Hub` datasets to LangChain."
]
},
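{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, assuming `HuggingFaceDatasetLoader` accepts a dataset path and a `page_content_column` argument (the `imdb` dataset and `text` column below are only illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import HuggingFaceDatasetLoader\n",
"\n",
"# Illustrative dataset and column names; replace with your own.\n",
"dataset_name = \"imdb\"\n",
"page_content_column = \"text\"\n",
"\n",
"loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)\n",
"data = loader.load()\n",
"data[:2]"
]
},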
{
@ -212,7 +215,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

File diff suppressed because one or more lines are too long

@ -18,11 +18,25 @@
"## Using Unstructured"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db8e56db-2e66-443b-8a0b-ef69fa5fae9a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install pdfminer"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0cc0cd42",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders.image import UnstructuredImageLoader"
@ -32,7 +46,9 @@
"cell_type": "code",
"execution_count": 2,
"id": "082d557c",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredImageLoader(\"layout-parser-paper-fast.jpg\")"
@ -40,9 +56,11 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "df11c953",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -137,7 +155,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -6,9 +6,23 @@
"metadata": {},
"source": [
"# Image captions\n",
"\n",
"By default, the loader utilizes the pre-trained [Salesforce BLIP image captioning model](https://huggingface.co/Salesforce/blip-image-captioning-base).\n",
"\n",
"\n",
"This notebook shows how to use the ImageCaptionLoader tutorial to generate a query-able index of image captions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f78585a-a2fa-4ece-834f-66692b959efb",
"metadata": {},
"outputs": [],
"source": [
"#!pip install transformers"
]
},
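{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, assuming `ImageCaptionLoader` accepts a list of image URLs via `path_images` (the URL below is a placeholder):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import ImageCaptionLoader\n",
"\n",
"# Placeholder image URL; replace with your own images.\n",
"list_image_urls = [\n",
"    \"https://example.com/path/to/image.jpg\",\n",
"]\n",
"\n",
"loader = ImageCaptionLoader(path_images=list_image_urls)\n",
"list_docs = loader.load()\n",
"list_docs"
]
},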
{
"cell_type": "code",
"execution_count": 1,
@ -232,7 +246,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

File diff suppressed because one or more lines are too long

@ -7,34 +7,53 @@
"source": [
"# Markdown\n",
"\n",
"This covers how to load markdown documents into a document format that we can use downstream."
">[Markdown](https://en.wikipedia.org/wiki/Markdown) is a lightweight markup language for creating formatted text using a plain-text editor.\n",
"\n",
"This covers how to load `markdown` documents into a document format that we can use downstream."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "721c48aa",
"id": "5282f85c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredMarkdownLoader"
"# !pip install unstructured > /dev/null"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9d3d0e35",
"metadata": {},
"id": "721c48aa",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredMarkdownLoader(\"../../../../README.md\")"
"from langchain.document_loaders import UnstructuredMarkdownLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9d3d0e35",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"markdown_path = \"../../../../../README.md\"\n",
"loader = UnstructuredMarkdownLoader(markdown_path)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "06073f91",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -42,17 +61,19 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "c9adc5cb",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content=\"ð\\x9f¦\\x9cï¸\\x8fð\\x9f”\\x97 LangChain\\n\\nâ\\x9a¡ Building applications with LLMs through composability â\\x9a¡\\n\\nProduction Support: As you move your LangChains into production, we'd love to offer more comprehensive support.\\nPlease fill out this form and we'll set up a dedicated support Slack channel.\\n\\nQuick Install\\n\\npip install langchain\\n\\nð\\x9f¤” What is this?\\n\\nLarge language models (LLMs) are emerging as a transformative technology, enabling\\ndevelopers to build applications that they previously could not.\\nBut using these LLMs in isolation is often not enough to\\ncreate a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.\\n\\nThis library is aimed at assisting in the development of those types of applications. Common examples of these types of applications include:\\n\\nâ\\x9d“ Question Answering over specific documents\\n\\nDocumentation\\n\\nEnd-to-end Example: Question Answering over Notion Database\\n\\nð\\x9f¬ Chatbots\\n\\nDocumentation\\n\\nEnd-to-end Example: Chat-LangChain\\n\\nð\\x9f¤\\x96 Agents\\n\\nDocumentation\\n\\nEnd-to-end Example: GPT+WolframAlpha\\n\\nð\\x9f“\\x96 Documentation\\n\\nPlease see here for full documentation on:\\n\\nGetting started (installation, setting up the environment, simple examples)\\n\\nHow-To examples (demos, integrations, helper functions)\\n\\nReference (full API docs)\\n Resources (high-level explanation of core concepts)\\n\\nð\\x9f\\x9a\\x80 What can this help with?\\n\\nThere are six main areas that LangChain is designed to help with.\\nThese are, in increasing order of complexity:\\n\\nð\\x9f“\\x83 LLMs and Prompts:\\n\\nThis includes prompt management, prompt optimization, generic interface for all LLMs, and common utilities for working with LLMs.\\n\\nð\\x9f”\\x97 Chains:\\n\\nChains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.\\n\\nð\\x9f“\\x9a Data Augmented Generation:\\n\\nData Augmented Generation involves specific types of chains that first interact with an external datasource to fetch data to use in the generation step. Examples of this include summarization of long pieces of text and question/answering over specific data sources.\\n\\nð\\x9f¤\\x96 Agents:\\n\\nAgents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end to end agents.\\n\\nð\\x9f§\\xa0 Memory:\\n\\nMemory is the concept of persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.\\n\\nð\\x9f§\\x90 Evaluation:\\n\\n[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. 
LangChain provides some prompts/chains for assisting in this.\\n\\nFor more information on these concepts, please see our full documentation.\\n\\nð\\x9f\\x81 Contributing\\n\\nAs an open source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infra, or better documentation.\\n\\nFor detailed information on how to contribute, see here.\", lookup_str='', metadata={'source': '../../../../README.md'}, lookup_index=0)]"
"[Document(page_content=\"ð\\x9f¦\\x9cï¸\\x8fð\\x9f”\\x97 LangChain\\n\\nâ\\x9a¡ Building applications with LLMs through composability â\\x9a¡\\n\\nLooking for the JS/TS version? Check out LangChain.js.\\n\\nProduction Support: As you move your LangChains into production, we'd love to offer more comprehensive support.\\nPlease fill out this form and we'll set up a dedicated support Slack channel.\\n\\nQuick Install\\n\\npip install langchain\\nor\\nconda install langchain -c conda-forge\\n\\nð\\x9f¤” What is this?\\n\\nLarge language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.\\n\\nThis library aims to assist in the development of those types of applications. Common examples of these applications include:\\n\\nâ\\x9d“ Question Answering over specific documents\\n\\nDocumentation\\n\\nEnd-to-end Example: Question Answering over Notion Database\\n\\nð\\x9f¬ Chatbots\\n\\nDocumentation\\n\\nEnd-to-end Example: Chat-LangChain\\n\\nð\\x9f¤\\x96 Agents\\n\\nDocumentation\\n\\nEnd-to-end Example: GPT+WolframAlpha\\n\\nð\\x9f“\\x96 Documentation\\n\\nPlease see here for full documentation on:\\n\\nGetting started (installation, setting up the environment, simple examples)\\n\\nHow-To examples (demos, integrations, helper functions)\\n\\nReference (full API docs)\\n\\nResources (high-level explanation of core concepts)\\n\\nð\\x9f\\x9a\\x80 What can this help with?\\n\\nThere are six main areas that LangChain is designed to help with.\\nThese are, in increasing order of complexity:\\n\\nð\\x9f“\\x83 LLMs and Prompts:\\n\\nThis includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.\\n\\nð\\x9f”\\x97 Chains:\\n\\nChains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.\\n\\nð\\x9f“\\x9a Data Augmented Generation:\\n\\nData Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question/answering over specific data sources.\\n\\nð\\x9f¤\\x96 Agents:\\n\\nAgents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.\\n\\nð\\x9f§\\xa0 Memory:\\n\\nMemory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.\\n\\nð\\x9f§\\x90 Evaluation:\\n\\n[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. 
LangChain provides some prompts/chains for assisting in this.\\n\\nFor more information on these concepts, please see our full documentation.\\n\\nð\\x9f\\x81 Contributing\\n\\nAs an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.\\n\\nFor detailed information on how to contribute, see here.\", metadata={'source': '../../../../../README.md'})]"
]
},
"execution_count": 4,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@ -73,19 +94,23 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "064f9162",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredMarkdownLoader(\"../../../../README.md\", mode=\"elements\")"
"loader = UnstructuredMarkdownLoader(markdown_path, mode=\"elements\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "abefbbdb",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -93,17 +118,19 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "a547c534",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='ð\\x9f¦\\x9cï¸\\x8fð\\x9f”\\x97 LangChain', lookup_str='', metadata={'source': '../../../../README.md', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0)"
"Document(page_content='ð\\x9f¦\\x9cï¸\\x8fð\\x9f”\\x97 LangChain', metadata={'source': '../../../../../README.md', 'page_number': 1, 'category': 'Title'})"
]
},
"execution_count": 7,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@ -111,14 +138,6 @@
"source": [
"data[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "381d4139",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -137,7 +156,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.11.2"
}
},
"nbformat": 4,

@ -6,7 +6,13 @@
"source": [
"# Modern Treasury\n",
"\n",
"This notebook covers how to load data from the Modern Treasury REST API into a format that can be ingested into LangChain, along with example usage for vectorization."
">[Modern Treasury](https://www.moderntreasury.com/) simplifies complex payment operations\n",
"A unified platform to power products and processes that move money.\n",
">- Connect to banks and payment systems\n",
">- Track transactions and balances in real-time\n",
">- Automate payment operations for scale\n",
"\n",
"This notebook covers how to load data from the `Modern Treasury REST API` into a format that can be ingested into LangChain, along with example usage for vectorization."
]
},
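{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch, assuming `ModernTreasuryLoader` takes a resource name and reads the organization ID and API key from the environment (the environment variable names and the `payment_orders` resource below are assumptions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from langchain.document_loaders import ModernTreasuryLoader\n",
"\n",
"# Assumed environment variables for authentication; values are placeholders.\n",
"os.environ[\"MODERN_TREASURY_ORGANIZATION_ID\"] = \"...\"\n",
"os.environ[\"MODERN_TREASURY_API_KEY\"] = \"...\"\n",
"\n",
"# Assumed usage: load \"payment_orders\" resources as documents.\n",
"modern_treasury_loader = ModernTreasuryLoader(\"payment_orders\")\n",
"documents = modern_treasury_loader.load()"
]
},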
{
@ -98,9 +104,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -6,13 +6,15 @@
"source": [
"# Notebook\n",
"\n",
"This notebook covers how to load data from an .ipynb notebook into a format suitable by LangChain."
"This notebook covers how to load data from a `Jupyter notebook (.ipynb)` into a format suitable by LangChain."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import NotebookLoader"
@ -20,8 +22,10 @@
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = NotebookLoader(\"example_data/notebook.ipynb\", include_outputs=True, max_output_length=20, remove_newline=True)"
@ -43,16 +47,18 @@
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='\\'markdown\\' cell: \\'[\\'# Notebook\\', \\'\\', \\'This notebook covers how to load data from an .ipynb notebook into a format suitable by LangChain.\\']\\'\\n\\n \\'code\\' cell: \\'[\\'from langchain.document_loaders import NotebookLoader\\']\\'\\n\\n \\'code\\' cell: \\'[\\'loader = NotebookLoader(\"example_data/notebook.ipynb\")\\']\\'\\n\\n \\'markdown\\' cell: \\'[\\'`NotebookLoader.load()` loads the `.ipynb` notebook file into a `Document` object.\\', \\'\\', \\'**Parameters**:\\', \\'\\', \\'* `include_outputs` (bool): whether to include cell outputs in the resulting document (default is False).\\', \\'* `max_output_length` (int): the maximum number of characters to include from each cell output (default is 10).\\', \\'* `remove_newline` (bool): whether to remove newline characters from the cell sources and outputs (default is False).\\', \\'* `traceback` (bool): whether to include full traceback (default is False).\\']\\'\\n\\n \\'code\\' cell: \\'[\\'loader.load(include_outputs=True, max_output_length=20, remove_newline=True)\\']\\'\\n\\n', lookup_str='', metadata={'source': 'example_data/notebook.ipynb'}, lookup_index=0)]"
"[Document(page_content='\\'markdown\\' cell: \\'[\\'# Notebook\\', \\'\\', \\'This notebook covers how to load data from an .ipynb notebook into a format suitable by LangChain.\\']\\'\\n\\n \\'code\\' cell: \\'[\\'from langchain.document_loaders import NotebookLoader\\']\\'\\n\\n \\'code\\' cell: \\'[\\'loader = NotebookLoader(\"example_data/notebook.ipynb\")\\']\\'\\n\\n \\'markdown\\' cell: \\'[\\'`NotebookLoader.load()` loads the `.ipynb` notebook file into a `Document` object.\\', \\'\\', \\'**Parameters**:\\', \\'\\', \\'* `include_outputs` (bool): whether to include cell outputs in the resulting document (default is False).\\', \\'* `max_output_length` (int): the maximum number of characters to include from each cell output (default is 10).\\', \\'* `remove_newline` (bool): whether to remove newline characters from the cell sources and outputs (default is False).\\', \\'* `traceback` (bool): whether to include full traceback (default is False).\\']\\'\\n\\n \\'code\\' cell: \\'[\\'loader.load(include_outputs=True, max_output_length=20, remove_newline=True)\\']\\'\\n\\n', metadata={'source': 'example_data/notebook.ipynb'})]"
]
},
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@ -60,13 +66,6 @@
"source": [
"loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -85,7 +84,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
@ -94,5 +93,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -5,7 +5,10 @@
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Notion\n",
"# Notion DB 1/2\n",
"\n",
">[Notion](https://www.notion.so/) is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.\n",
"\n",
"This notebook covers how to load documents from a Notion database dump.\n",
"\n",
"In order to get this notion dump, follow these instructions:\n",
@ -74,7 +77,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,13 +5,15 @@
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Notion DB Loader\n",
"# Notion DB 2/2\n",
"\n",
"NotionDBLoader is a Python class for loading content from a Notion database. It retrieves pages from the database, reads their content, and returns a list of Document objects.\n",
">[Notion](https://www.notion.so/) is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.\n",
"\n",
"`NotionDBLoader` is a Python class for loading content from a `Notion` database. It retrieves pages from the database, reads their content, and returns a list of Document objects.\n",
"\n",
"## Requirements\n",
"\n",
"- A Notion Database\n",
"- A `Notion` Database\n",
"- Notion Integration Token\n",
"\n",
"## Setup\n",
@ -28,12 +30,12 @@
"## 2. Create a Notion Integration\n",
"To create a Notion Integration, follow these steps:\n",
"\n",
"1. Visit the (Notion Developers)[https://www.notion.com/my-integrations] page and log in with your Notion account.\n",
"1. Visit the [Notion Developers](https://www.notion.com/my-integrations) page and log in with your Notion account.\n",
"2. Click on the \"+ New integration\" button.\n",
"3. Give your integration a name and choose the workspace where your database is located.\n",
"4. Select the require capabilities, this extension only need the Read content capability\n",
"5. Click the \"Submit\" button to create the integration.\n",
"Once the integration is created, you'll be provided with an Integration Token (API key). Copy this token and keep it safe, as you'll need it to use the NotionDBLoader.\n",
"Once the integration is created, you'll be provided with an `Integration Token (API key)`. Copy this token and keep it safe, as you'll need it to use the NotionDBLoader.\n",
"\n",
"### 3. Connect the Integration to the Database\n",
"To connect your integration to the database, follow these steps:\n",
@ -97,7 +99,7 @@
"metadata": {},
"outputs": [],
"source": [
"loader = NotionDBLoader(NOTION_TOKEN, DATABASE_ID)"
"loader = NotionDBLoader(integration_token=NOTION_TOKEN, database_id=DATABASE_ID)"
]
},
{
@ -145,7 +147,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -1,24 +1,29 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Obsidian\n",
"This notebook covers how to load documents from an Obsidian database.\n",
"\n",
"Since Obsidian is just stored on disk as a folder of Markdown files, the loader just takes a path to this directory.\n",
">[Obsidian](https://obsidian.md/) is a powerful and extensible knowledge base\n",
"that works on top of your local folder of plain text files.\n",
"\n",
"Obsidian files also sometimes contain [metadata](https://help.obsidian.md/Editing+and+formatting/Metadata) which is a YAML block at the top of the file. These values will be added to the document's metadata. (`ObsidianLoader` can also be passed a `collect_metadata=False` argument to disable this behavior.)"
"This notebook covers how to load documents from an `Obsidian` database.\n",
"\n",
"Since `Obsidian` is just stored on disk as a folder of Markdown files, the loader just takes a path to this directory.\n",
"\n",
"`Obsidian` files also sometimes contain [metadata](https://help.obsidian.md/Editing+and+formatting/Metadata) which is a YAML block at the top of the file. These values will be added to the document's metadata. (`ObsidianLoader` can also be passed a `collect_metadata=False` argument to disable this behavior.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "007c5cbf",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import ObsidianLoader"
@ -61,7 +66,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -7,7 +7,7 @@
"source": [
"# PDF\n",
"\n",
"This covers how to load pdfs into a document format that we can use downstream."
"This covers how to load PDF documents into the Document format that we use downstream."
]
},
{
@ -22,9 +22,23 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "ae93c9e9-3684-42ab-844c-cc7eef4eed11",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install pypdf"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c428b0c5",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
@ -37,12 +51,14 @@
"cell_type": "code",
"execution_count": 4,
"id": "d333cabb",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='LayoutParser : A Uni\\x0ced Toolkit for Deep\\nLearning Based Document Image Analysis\\nZejiang Shen1( \\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\\nLee4, Jacob Carlson3, and Weining Li5\\n1Allen Institute for AI\\nshannons@allenai.org\\n2Brown University\\nruochen zhang@brown.edu\\n3Harvard University\\nfmelissadell,jacob carlson g@fas.harvard.edu\\n4University of Washington\\nbcgl@cs.washington.edu\\n5University of Waterloo\\nw422li@uwaterloo.ca\\nAbstract. Recent advances in document image analysis (DIA) have been\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomes could be easily deployed in production and extended for further\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model con\\x0cgurations complicate the easy reuse of im-\\nportant innovations by a wide audience. Though there have been on-going\\ne\\x0borts to improve reusability and simplify deep learning (DL) model\\ndevelopment in disciplines like natural language processing and computer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademic research across a wide range of disciplines in the social sciences\\nand humanities. This paper introduces LayoutParser , an open-source\\nlibrary for streamlining the usage of DL in DIA research and applica-\\ntions. The core LayoutParser library comes with a set of simple and\\nintuitive interfaces for applying and customizing DL models for layout de-\\ntection, character recognition, and many other document processing tasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digiti-\\nzation pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\nThe library is publicly available at https://layout-parser.github.io .\\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\\n·Character Recognition ·Open Source library ·Toolkit.\\n1 Introduction\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocument image analysis (DIA) tasks including document image classi\\x0ccation [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021', lookup_str='', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': '0'}, lookup_index=0)"
"Document(page_content='LayoutParser : A Uni\\x0ced Toolkit for Deep\\nLearning Based Document Image Analysis\\nZejiang Shen1( \\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\\nLee4, Jacob Carlson3, and Weining Li5\\n1Allen Institute for AI\\nshannons@allenai.org\\n2Brown University\\nruochen zhang@brown.edu\\n3Harvard University\\nfmelissadell,jacob carlson g@fas.harvard.edu\\n4University of Washington\\nbcgl@cs.washington.edu\\n5University of Waterloo\\nw422li@uwaterloo.ca\\nAbstract. Recent advances in document image analysis (DIA) have been\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomes could be easily deployed in production and extended for further\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model con\\x0cgurations complicate the easy reuse of im-\\nportant innovations by a wide audience. Though there have been on-going\\ne\\x0borts to improve reusability and simplify deep learning (DL) model\\ndevelopment in disciplines like natural language processing and computer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademic research across a wide range of disciplines in the social sciences\\nand humanities. This paper introduces LayoutParser , an open-source\\nlibrary for streamlining the usage of DL in DIA research and applica-\\ntions. The core LayoutParser library comes with a set of simple and\\nintuitive interfaces for applying and customizing DL models for layout de-\\ntection, character recognition, and many other document processing tasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digiti-\\nzation pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\nThe library is publicly available at https://layout-parser.github.io .\\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\\n·Character Recognition ·Open Source library ·Toolkit.\\n1 Introduction\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocument image analysis (DIA) tasks including document image classi\\x0ccation [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': 0})"
]
},
"execution_count": 4,
@ -62,11 +78,44 @@
"An advantage of this approach is that documents can be retrieved with page numbers."
]
},
{
"cell_type": "markdown",
"id": "b334d071-3477-4788-8ff7-9670b47de082",
"metadata": {},
"source": [
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 6,
"id": "95b1674b-ec06-43ed-8d3e-60be0e015aa0",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"OpenAI API Key: ········\n"
]
}
],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "87fa7b3a",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@ -76,72 +125,10 @@
"Fig. 4: Illustration of (a) the original historical Japanese document with layout\n",
"detection results and (b) a recreated version of the document image that achieves\n",
"much better character recognition recall. The reorganization algorithm rearranges\n",
"the tokens based on the their detected bounding boxes given a maximum allowed\n",
"height.\n",
"4LayoutParser Community Platform\n",
"Another focus of LayoutParser is promoting the reusability of layout detection\n",
"models and full digitization pipelines. Similar to many existing deep learning\n",
"libraries, LayoutParser comes with a community model hub for distributing\n",
"layout models. End-users can upload their self-trained models to the model hub,\n",
"and these models can be loaded into a similar interface as the currently available\n",
"LayoutParser pre-trained models. For example, the model trained on the News\n",
"Navigator dataset [17] has been incorporated in the model hub.\n",
"Beyond DL models, LayoutParser also promotes the sharing of entire doc-\n",
"ument digitization pipelines. For example, sometimes the pipeline requires the\n",
"combination of multiple DL models to achieve better accuracy. Currently, pipelines\n",
"are mainly described in academic papers and implementations are often not pub-\n",
"licly available. To this end, the LayoutParser community platform also enables\n",
"the sharing of layout pipelines to promote the discussion and reuse of techniques.\n",
"For each shared pipeline, it has a dedicated project page, with links to the source\n",
"code, documentation, and an outline of the approaches. A discussion panel is\n",
"provided for exchanging ideas. Combined with the core LayoutParser library,\n",
"users can easily build reusable components based on the shared pipelines and\n",
"apply them to solve their unique problems.\n",
"5 Use Cases\n",
"The core objective of LayoutParser is to make it easier to create both large-scale\n",
"and light-weight document digitization pipelines. Large-scale document processing\n",
"the tokens based on the their detect\n",
"3: 4 Z. Shen et al.\n",
"Efficient Data AnnotationC u s t o m i z e d M o d e l T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images \n",
"T h e C o r e L a y o u t P a r s e r L i b r a r yOCR ModuleSt or age & VisualizationLa y out Data Structur e\n",
"Fig. 1: The overall architecture of LayoutParser . For an input document image,\n",
"the core LayoutParser library provides a set of o\u000b",
"\n",
"-the-shelf tools for layout\n",
"detection, OCR, visualization, and storage, backed by a carefully designed layout\n",
"data structure. LayoutParser also supports high level customization via e\u000ecient\n",
"layout annotation and model training functions. These improve model accuracy\n",
"on the target samples. The community platform enables the easy sharing of DIA\n",
"models and whole digitization pipelines to promote reusability and reproducibility.\n",
"A collection of detailed documentation, tutorials and exemplar projects make\n",
"LayoutParser easy to learn and use.\n",
"AllenNLP [ 8] and transformers [ 34] have provided the community with complete\n",
"DL-based support for developing and deploying models for general computer\n",
"vision and natural language processing problems. LayoutParser , on the other\n",
"hand, specializes speci\f",
"\n",
"cally in DIA tasks. LayoutParser is also equipped with a\n",
"community platform inspired by established model hubs such as Torch Hub [23]\n",
"andTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\n",
"full document processing pipelines that are unique to DIA tasks.\n",
"There have been a variety of document data collections to facilitate the\n",
"development of DL models. Some examples include PRImA [ 3](magazine layouts),\n",
"PubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\n",
"papers), Newspaper Navigator Dataset [ 16,17](newspaper \f",
"\n",
"gure layouts) and\n",
"HJDataset [31](historical Japanese document layouts). A spectrum of models\n",
"trained on these datasets are currently available in the LayoutParser model zoo\n",
"to support di\u000b",
"\n",
"erent use cases.\n",
"3 The Core LayoutParser Library\n",
"At the core of LayoutParser is an o\u000b",
"\n",
"-the-shelf toolkit that streamlines DL-\n",
"based document image analysis. Five components support a simple interface\n",
"with comprehensive functionalities: 1) The layout detection models enable using\n",
"pre-trained or self-trained DL models for layout detection with just four lines\n",
"of code. 2) The detected layout information is stored in carefully engineered\n"
"T h e C o r e L a y o u t P a r s e r L i b r a r yOCR ModuleSt or age & VisualizationLa y ou\n"
]
}
],
@ -152,7 +139,7 @@
"faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())\n",
"docs = faiss_index.similarity_search(\"How will the community be engaged?\", k=2)\n",
"for doc in docs:\n",
" print(str(doc.metadata[\"page\"]) + \":\", doc.page_content)"
" print(str(doc.metadata[\"page\"]) + \":\", doc.page_content[:300])"
]
},
{
@ -167,9 +154,11 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "950eb58f",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import MathpixPDFLoader"
@ -671,7 +660,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -1,20 +1,23 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "39af9ecd",
"metadata": {},
"source": [
"# PowerPoint\n",
"\n",
"This covers how to load PowerPoint documents into a document format that we can use downstream."
"This covers how to load `Microsoft PowerPoint` documents into a document format that we can use downstream."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "721c48aa",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredPowerPointLoader"
@ -24,7 +27,9 @@
"cell_type": "code",
"execution_count": 2,
"id": "9d3d0e35",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = UnstructuredPowerPointLoader(\"example_data/fake-power-point.pptx\")"
@ -34,7 +39,9 @@
"cell_type": "code",
"execution_count": 3,
"id": "06073f91",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = loader.load()"
@ -44,12 +51,14 @@
"cell_type": "code",
"execution_count": 4,
"id": "c9adc5cb",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Adding a Bullet Slide\\n\\nFind the bullet slide layout\\n\\nUse _TextFrame.text for first bullet\\n\\nUse _TextFrame.add_paragraph() for subsequent bullets\\n\\nHere is a lot of text!\\n\\nHere is some text in a text box!', lookup_str='', metadata={'source': 'example_data/fake-power-point.pptx'}, lookup_index=0)]"
"[Document(page_content='Adding a Bullet Slide\\n\\nFind the bullet slide layout\\n\\nUse _TextFrame.text for first bullet\\n\\nUse _TextFrame.add_paragraph() for subsequent bullets\\n\\nHere is a lot of text!\\n\\nHere is some text in a text box!', metadata={'source': 'example_data/fake-power-point.pptx'})]"
]
},
"execution_count": 4,
@ -68,7 +77,7 @@
"source": [
"## Retain Elements\n",
"\n",
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
"Under the hood, `Unstructured` creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
]
},
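{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch of the `mode=\"elements\"` usage described above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredPowerPointLoader\n",
"\n",
"# Keep the individual Unstructured elements instead of combining them.\n",
"loader = UnstructuredPowerPointLoader(\n",
"    \"example_data/fake-power-point.pptx\", mode=\"elements\"\n",
")\n",
"data = loader.load()\n",
"data[0]"
]
},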
{
@ -137,7 +146,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -6,11 +6,24 @@
"metadata": {},
"source": [
"# ReadTheDocs Documentation\n",
"This notebook covers how to load content from html that was generated as part of a Read-The-Docs build.\n",
"\n",
">[Read the Docs](https://readthedocs.org/) is an open-sourced free software documentation hosting platform. It generates documentation written with the `Sphinx` documentation generator.\n",
"\n",
"This notebook covers how to load content from HTML that was generated as part of a `Read-The-Docs` build.\n",
"\n",
"For an example of this in the wild, see [here](https://github.com/hwchase17/chat-langchain).\n",
"\n",
"This assumes that the html has already been scraped into a folder. This can be done by uncommenting and running the following command"
"This assumes that the HTML has already been scraped into a folder. This can be done by uncommenting and running the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d153e07-8339-4cbe-8481-fc08644ba927",
"metadata": {},
"outputs": [],
"source": [
"#!pip install beautifulsoup4"
]
},
{
@ -25,9 +38,11 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"id": "92dd950b",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import ReadTheDocsLoader"
@ -70,7 +85,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -1,15 +1,17 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reddit\n",
"\n",
">[Reddit (reddit)](\twww.reddit.com) is an American social news aggregation, content rating, and discussion website.\n",
"\n",
"\n",
"This loader fetches the text from the Posts of Subreddits or Reddit users, using the `praw` Python package.\n",
"\n",
"Make a Reddit Application from https://www.reddit.com/prefs/apps/ and initialize the loader with with your Reddit API credentials."
"Make a [Reddit Application](https://www.reddit.com/prefs/apps/) and initialize the loader with with your Reddit API credentials."
]
},
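{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of initializing the loader; the parameter names below mirror typical `praw` credentials and are assumptions rather than the definitive `RedditPostsLoader` signature:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import RedditPostsLoader\n",
"\n",
"# All values are placeholders; parameter names are assumptions.\n",
"loader = RedditPostsLoader(\n",
"    client_id=\"YOUR_CLIENT_ID\",\n",
"    client_secret=\"YOUR_CLIENT_SECRET\",\n",
"    user_agent=\"extractor by u/your_username\",\n",
"    categories=[\"new\", \"hot\"],  # post categories to fetch\n",
"    mode=\"subreddit\",  # load posts from subreddits\n",
"    search_queries=[\"investing\"],  # subreddits to search\n",
"    number_posts=10,  # posts per category\n",
")\n",
"documents = loader.load()"
]
},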
{
@ -89,7 +91,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "env1",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -103,10 +105,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
},
"orig_nbformat": 4
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -1,11 +1,15 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Roam\n",
"\n",
">[ROAM](https://roamresearch.com/) is a note-taking tool for networked thought, designed to create a personal knowledge base.\n",
"\n",
"This notebook covers how to load documents from a Roam database. This takes a lot of inspiration from the example repo [here](https://github.com/JimmyLv/roam-qa).\n",
"\n",
"## 🧑 Instructions for ingesting your own dataset\n",
@ -40,7 +44,7 @@
"metadata": {},
"outputs": [],
"source": [
"loader = ObsidianLoader(\"Roam_DB\")"
"loader = RoamLoader(\"Roam_DB\")"
]
},
{
@ -70,7 +74,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,36 +5,46 @@
"id": "a634365e",
"metadata": {},
"source": [
"# s3 Directory\n",
"# AWS S3 Directory\n",
"\n",
"This covers how to load document objects from an s3 directory object."
">[Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html) is an object storage service\n",
"\n",
">[AWS S3 Directory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html)\n",
"\n",
"This covers how to load document objects from an `AWS S3 Directory` object."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2f0cd6a5",
"metadata": {},
"execution_count": null,
"id": "49815096",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import S3DirectoryLoader"
"#!pip install boto3"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "49815096",
"metadata": {},
"id": "2f0cd6a5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install boto3"
"from langchain.document_loaders import S3DirectoryLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "321cc7f1",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = S3DirectoryLoader(\"testing-hwc\")"
@ -42,21 +52,12 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "2b11d155",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Lorem ipsum dolor sit amet.', lookup_str='', metadata={'source': '/var/folders/y6/8_bzdg295ld6s1_97_12m4lr0000gn/T/tmpaa9xl6ch/fake.docx'}, lookup_index=0)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader.load()"
]
@ -126,7 +127,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,9 +5,13 @@
"id": "66a7777e",
"metadata": {},
"source": [
"# s3 File\n",
"# AWS S3 File\n",
"\n",
"This covers how to load document objects from an s3 file object."
">[Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html) is an object storage service.\n",
"\n",
">[AWS S3 Buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html)\n",
"\n",
"This covers how to load document objects from an `AWS S3 File` object."
]
},
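{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming `S3FileLoader` takes a bucket name and an object key (both values below are placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install boto3\n",
"\n",
"from langchain.document_loaders import S3FileLoader\n",
"\n",
"# Placeholder bucket and key; replace with your own.\n",
"loader = S3FileLoader(\"my-bucket\", \"path/to/fake.docx\")\n",
"loader.load()"
]
},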
{
@ -86,7 +90,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sitemap Loader\n",
"# Sitemap\n",
"\n",
"Extends from the [WebBaseLoader](), this will load a sitemap from a given URL, and then scrape and load all the pages in the sitemap, returning each page as a document.\n",
"Extends from the `WebBaseLoader`, this will load a sitemap from a given URL, and then scrape and load all pages in the sitemap, returning each page as a Document.\n",
"\n",
"The scraping is done concurrently, using `WebBaseLoader`. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can change the `requests_per_second` parameter to increase the max concurrent requests. Note, while this will speed up the scraping process, but may cause the server to block you. Be careful!"
]
@ -20,10 +20,10 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: nest_asyncio in /Users/tasp/Code/projects/langchain/.venv/lib/python3.10/site-packages (1.5.6)\r\n",
"\r\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\r\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\r\n"
"Requirement already satisfied: nest_asyncio in /Users/tasp/Code/projects/langchain/.venv/lib/python3.10/site-packages (1.5.6)\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
}
],
@ -39,6 +39,7 @@
"source": [
"# fixes a bug with asyncio and jupyter\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
@ -88,7 +89,7 @@
"source": [
"## Filtering sitemap URLs\n",
"\n",
"Sitemaps can be massive files, with thousands of urls. Often you don't need every single one of them. You can filter the urls by passing a list of strings or regex patterns to the `url_filter` parameter. Only urls that match one of the patterns will be loaded."
"Sitemaps can be massive files, with thousands of URLs. Often you don't need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the `url_filter` parameter. Only URLs that match one of the patterns will be loaded."
]
},
{
@ -148,9 +149,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}

@ -1,16 +1,17 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Slack (Local Exported Zipfile)\n",
"\n",
"This notebook covers how to load documents from a Zipfile generated from a Slack export.\n",
">[Slack](slack.com) is an instant messaging program.\n",
"\n",
"In order to get this Slack export, follow these instructions:\n",
"This notebook covers how to load documents from a Zipfile generated from a `Slack` export.\n",
"\n",
"In order to get this `Slack` export, follow these instructions:\n",
"\n",
"## 🧑 Instructions for ingesting your own dataset\n",
"\n",
@ -73,7 +74,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -1,12 +1,13 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spreedly\n",
"\n",
">[Spreedly](https://docs.spreedly.com/) is a service that allows you to securely store credit cards and use them to transact against any number of payment gateways and third party APIs. It does this by simultaneously providing a card tokenization/vault service as well as a gateway and receiver integration service. Payment methods tokenized by Spreedly are stored at `Spreedly`, allowing you to independently store a card and then pass that card to different end points based on your business requirements.\n",
"\n",
"This notebook covers how to load data from the [Spreedly REST API](https://docs.spreedly.com/reference/api/v1/) into a format that can be ingested into LangChain, along with example usage for vectorization.\n",
"\n",
"Note: this notebook assumes the following packages are installed: `openai`, `chromadb`, and `tiktoken`."
@ -107,9 +108,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@ -121,9 +122,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@ -6,14 +6,33 @@
"metadata": {},
"source": [
"# Subtitle Files\n",
"How to load data from subtitle (`.srt`) files"
"\n",
">[The SubRip file format](https://en.wikipedia.org/wiki/SubRip#SubRip_file_format) is described on the `Matroska` multimedia container format website as \"perhaps the most basic of all subtitle formats.\" `SubRip (SubRip Text)` files are named with the extension `.srt`, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator used is the comma, since the program was written in France.\n",
"\n",
"How to load data from subtitle (`.srt`) files\n",
"\n",
"Please, download the [example .srt file from here](https://www.opensubtitles.org/en/subtitles/5575150/star-wars-the-clone-wars-crisis-at-the-heart-en)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": null,
"id": "c6eb0372-ad36-4747-8120-d1557fe632fd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install pysrt"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2cbb7f5c",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import SRTLoader"
@ -21,9 +40,11 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 5,
"id": "865d8a14",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = SRTLoader(\"example_data/Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt\")"
@ -31,9 +52,11 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": null,
"id": "173a9234",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = loader.load()"
@ -59,14 +82,6 @@
"source": [
"docs[0].page_content[:100]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b7a8dc4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -85,7 +100,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,
