diff --git a/docs/integrations/discord.md b/docs/integrations/discord.md new file mode 100644 index 00000000..116ce360 --- /dev/null +++ b/docs/integrations/discord.md @@ -0,0 +1,30 @@ +# Discord + +>[Discord](https://discord.com/) is a VoIP and instant messaging social platform. Users have the ability to communicate +> with voice calls, video calls, text messaging, media and files in private chats or as part of communities called +> "servers". A server is a collection of persistent chat rooms and voice channels which can be accessed via invite links. + +## Installation and Setup + + +```bash +pip install pandas +``` + +Follow these steps to download your `Discord` data: + +1. Go to your **User Settings** +2. Then go to **Privacy and Safety** +3. Head over to **Request all of my Data** and click the **Request Data** button + +It might take 30 days for you to receive your data. You'll receive an email at the address registered +with Discord. That email will have a download button that you can use to download your personal Discord data. + + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/discord.ipynb). + +```python +from langchain.document_loaders import DiscordChatLoader +``` diff --git a/docs/integrations/docugami.md b/docs/integrations/docugami.md index 1547c993..e20adc85 100644 --- a/docs/integrations/docugami.md +++ b/docs/integrations/docugami.md @@ -1,25 +1,20 @@ # Docugami ->[Docugami](https://docugami.com) converts business documents into a Document XML Knowledge Graph, generating forests of -> XML semantic trees representing entire documents. -> This is a rich representation that includes the semantic and +>[Docugami](https://docugami.com) converts business documents into a Document XML Knowledge Graph, generating forests +> of XML semantic trees representing entire documents. 
This is a rich representation that includes the semantic and > structural characteristics of various chunks in the document as an XML tree. +## Installation and Setup -## Quick start -1. Create a Docugami workspace: http://www.docugami.com (free trials available) -2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. There is no fixed set of document types supported by the system, the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later. -3. Create an access token via the Developer Playground for your workspace. Detailed instructions: https://help.docugami.com/home/docugami-api -4. Explore the Docugami API at https://api-docs.docugami.com to get a list of your processed docset IDs, or just the document IDs for a particular docset. -6. Use the DocugamiLoader as detailed in [this notebook](../modules/indexes/document_loaders/examples/docugami.ipynb), to get rich semantic chunks for your documents. -7. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML with better tags based on your preferences, which are then added to the DocugamiLoader output as metadata. Use techniques like [self-querying retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html) to do high accuracy Document QA. +```bash +pip install lxml +``` -## Advantages vs Other Chunking Techniques +## Document Loader -Appropriate chunking of your documents is critical for retrieval from documents. Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. 
Docugami offers a different approach: +See a [usage example](../modules/indexes/document_loaders/examples/docugami.ipynb). -1. **Intelligent Chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking. -2. **Structured Representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction. -3. **Semantic Annotations:** Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause. -4. **Additional Metadata:** Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. See detailed code walk-through in [this notebook](../modules/indexes/document_loaders/examples/docugami.ipynb). 
+```python +from langchain.document_loaders import DocugamiLoader +``` diff --git a/docs/integrations/duckdb.md b/docs/integrations/duckdb.md new file mode 100644 index 00000000..a4cf5964 --- /dev/null +++ b/docs/integrations/duckdb.md @@ -0,0 +1,19 @@ +# DuckDB + +>[DuckDB](https://duckdb.org/) is an in-process SQL OLAP database management system. + +## Installation and Setup + +First, you need to install `duckdb` python package. + +```bash +pip install duckdb +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/duckdb.ipynb). + +```python +from langchain.document_loaders import DuckDBLoader +``` diff --git a/docs/integrations/evernote.md b/docs/integrations/evernote.md new file mode 100644 index 00000000..bf031314 --- /dev/null +++ b/docs/integrations/evernote.md @@ -0,0 +1,20 @@ +# EverNote + +>[EverNote](https://evernote.com/) is intended for archiving and creating notes in which photos, audio and saved web content can be embedded. Notes are stored in virtual "notebooks" and can be tagged, annotated, edited, searched, and exported. + +## Installation and Setup + +First, you need to install `lxml` and `html2text` python packages. + +```bash +pip install lxml +pip install html2text +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/evernote.ipynb). + +```python +from langchain.document_loaders import EverNoteLoader +``` diff --git a/docs/integrations/facebook_chat.md b/docs/integrations/facebook_chat.md new file mode 100644 index 00000000..292ee67f --- /dev/null +++ b/docs/integrations/facebook_chat.md @@ -0,0 +1,21 @@ +# Facebook Chat + +>[Messenger](https://en.wikipedia.org/wiki/Messenger_(software)) is an American proprietary instant messaging app and +> platform developed by `Meta Platforms`. Originally developed as `Facebook Chat` in 2008, the company revamped its +> messaging service in 2010. 
+ +## Installation and Setup + +First, you need to install `pandas` python package. + +```bash +pip install pandas +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/facebook_chat.ipynb). + +```python +from langchain.document_loaders import FacebookChatLoader +``` diff --git a/docs/integrations/figma.md b/docs/integrations/figma.md new file mode 100644 index 00000000..a6e399ed --- /dev/null +++ b/docs/integrations/figma.md @@ -0,0 +1,21 @@ +# Figma + +>[Figma](https://www.figma.com/) is a collaborative web application for interface design. + +## Installation and Setup + +The Figma API requires an `access token`, `node_ids`, and a `file key`. + +The `file key` can be pulled from the URL. https://www.figma.com/file/{filekey}/sampleFilename + +`Node IDs` are also available in the URL. Click on anything and look for the '?node-id={node_id}' param. + +`Access token` [instructions](https://help.figma.com/hc/en-us/articles/8085703771159-Manage-personal-access-tokens). + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/figma.ipynb). + +```python +from langchain.document_loaders import FigmaFileLoader +``` diff --git a/docs/integrations/git.md b/docs/integrations/git.md new file mode 100644 index 00000000..cf6f0fc8 --- /dev/null +++ b/docs/integrations/git.md @@ -0,0 +1,19 @@ +# Git + +>[Git](https://en.wikipedia.org/wiki/Git) is a distributed version control system that tracks changes in any set of computer files, usually used for coordinating work among programmers collaboratively developing source code during software development. + +## Installation and Setup + +First, you need to install `GitPython` python package. + +```bash +pip install GitPython +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/git.ipynb). 
+ +```python +from langchain.document_loaders import GitLoader +``` diff --git a/docs/integrations/gitbook.md b/docs/integrations/gitbook.md new file mode 100644 index 00000000..8781dd6c --- /dev/null +++ b/docs/integrations/gitbook.md @@ -0,0 +1,15 @@ +# GitBook + +>[GitBook](https://docs.gitbook.com/) is a modern documentation platform where teams can document everything from products to internal knowledge bases and APIs. + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/gitbook.ipynb). + +```python +from langchain.document_loaders import GitbookLoader +``` diff --git a/docs/integrations/google_bigquery.md b/docs/integrations/google_bigquery.md new file mode 100644 index 00000000..ada1801c --- /dev/null +++ b/docs/integrations/google_bigquery.md @@ -0,0 +1,20 @@ +# Google BigQuery + +>[Google BigQuery](https://cloud.google.com/bigquery) is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data. +`BigQuery` is a part of the `Google Cloud Platform`. + +## Installation and Setup + +First, you need to install the `google-cloud-bigquery` Python package. + +```bash +pip install google-cloud-bigquery +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/google_bigquery.ipynb). + +```python +from langchain.document_loaders import BigQueryLoader +``` diff --git a/docs/integrations/google_cloud_storage.md b/docs/integrations/google_cloud_storage.md new file mode 100644 index 00000000..3f716acf --- /dev/null +++ b/docs/integrations/google_cloud_storage.md @@ -0,0 +1,26 @@ +# Google Cloud Storage + +>[Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage) is a managed service for storing unstructured data. + +## Installation and Setup + +First, you need to install the `google-cloud-storage` Python package. 
+ +```bash +pip install google-cloud-storage +``` + +## Document Loader + +There are two loaders for `Google Cloud Storage`: the `Directory` and the `File` loaders. + +See a [usage example](../modules/indexes/document_loaders/examples/google_cloud_storage_directory.ipynb). + +```python +from langchain.document_loaders import GCSDirectoryLoader +``` +See a [usage example](../modules/indexes/document_loaders/examples/google_cloud_storage_file.ipynb). + +```python +from langchain.document_loaders import GCSFileLoader +``` diff --git a/docs/integrations/google_drive.md b/docs/integrations/google_drive.md new file mode 100644 index 00000000..6d2cdc08 --- /dev/null +++ b/docs/integrations/google_drive.md @@ -0,0 +1,22 @@ +# Google Drive + +>[Google Drive](https://en.wikipedia.org/wiki/Google_Drive) is a file storage and synchronization service developed by Google. + +Currently, only `Google Docs` are supported. + +## Installation and Setup + +First, you need to install several Python packages. + +```bash +pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib +``` + +## Document Loader + +See a [usage example and authorization instructions](../modules/indexes/document_loaders/examples/google_drive.ipynb). + + +```python +from langchain.document_loaders import GoogleDriveLoader +``` diff --git a/docs/integrations/gutenberg.md b/docs/integrations/gutenberg.md new file mode 100644 index 00000000..c779b47b --- /dev/null +++ b/docs/integrations/gutenberg.md @@ -0,0 +1,15 @@ +# Gutenberg + +>[Project Gutenberg](https://www.gutenberg.org/about/) is an online library of free eBooks. + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/gutenberg.ipynb). 
+ +```python +from langchain.document_loaders import GutenbergLoader +``` diff --git a/docs/integrations/hacker_news.md b/docs/integrations/hacker_news.md new file mode 100644 index 00000000..53953917 --- /dev/null +++ b/docs/integrations/hacker_news.md @@ -0,0 +1,18 @@ +# Hacker News + +>[Hacker News](https://en.wikipedia.org/wiki/Hacker_News) (sometimes abbreviated as `HN`) is a social news +> website focusing on computer science and entrepreneurship. It is run by the investment fund and startup +> incubator `Y Combinator`. In general, content that can be submitted is defined as "anything that gratifies +> one's intellectual curiosity." + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/hacker_news.ipynb). + +```python +from langchain.document_loaders import HNLoader +``` diff --git a/docs/integrations/ifixit.md b/docs/integrations/ifixit.md new file mode 100644 index 00000000..f7462f54 --- /dev/null +++ b/docs/integrations/ifixit.md @@ -0,0 +1,16 @@ +# iFixit + +>[iFixit](https://www.ifixit.com) is the largest, open repair community on the web. The site contains nearly 100k +> repair manuals, 200k Questions & Answers on 42k devices, and all the data is licensed under `CC-BY-NC-SA 3.0`. + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/ifixit.ipynb). + +```python +from langchain.document_loaders import IFixitLoader +``` diff --git a/docs/integrations/imsdb.md b/docs/integrations/imsdb.md new file mode 100644 index 00000000..496f343d --- /dev/null +++ b/docs/integrations/imsdb.md @@ -0,0 +1,16 @@ +# IMSDb + +>[IMSDb](https://imsdb.com/) is the `Internet Movie Script Database`. +> +## Installation and Setup + +There isn't any special setup for it. 
+ +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/imsdb.ipynb). + + +```python +from langchain.document_loaders import IMSDbLoader +``` diff --git a/docs/integrations/mediawikidump.md b/docs/integrations/mediawikidump.md new file mode 100644 index 00000000..1d9aca0a --- /dev/null +++ b/docs/integrations/mediawikidump.md @@ -0,0 +1,31 @@ +# MediaWikiDump + +>[MediaWiki XML Dumps](https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps) contain the content of a wiki +> (wiki pages with all their revisions), without the site-related data. An XML dump does not create a full backup +> of the wiki database; the dump does not contain user accounts, images, edit logs, etc. + + +## Installation and Setup + +We need to install several Python packages. + +The `mediawiki-utilities` supports XML schema 0.11 in unmerged branches. +```bash +pip install -qU git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11 +``` + +The `mediawiki-utilities mwxml` has a bug; a fix PR is pending. + +```bash +pip install -qU git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11 +pip install -qU mwparserfromhell +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/mediawikidump.ipynb). + + +```python +from langchain.document_loaders import MWDumpLoader +``` diff --git a/docs/integrations/microsoft_onedrive.md b/docs/integrations/microsoft_onedrive.md new file mode 100644 index 00000000..ee843451 --- /dev/null +++ b/docs/integrations/microsoft_onedrive.md @@ -0,0 +1,22 @@ +# Microsoft OneDrive + +>[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file-hosting service operated by Microsoft. + +## Installation and Setup + +First, you need to install a Python package. + +```bash +pip install o365 +``` + +Then follow the instructions [here](../modules/indexes/document_loaders/examples/microsoft_onedrive.ipynb). 
+ +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/microsoft_onedrive.ipynb). + + +```python +from langchain.document_loaders import OneDriveLoader +``` diff --git a/docs/integrations/microsoft_powerpoint.md b/docs/integrations/microsoft_powerpoint.md new file mode 100644 index 00000000..c5434ed4 --- /dev/null +++ b/docs/integrations/microsoft_powerpoint.md @@ -0,0 +1,16 @@ +# Microsoft PowerPoint + +>[Microsoft PowerPoint](https://en.wikipedia.org/wiki/Microsoft_PowerPoint) is a presentation program by Microsoft. + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/microsoft_powerpoint.ipynb). + + +```python +from langchain.document_loaders import UnstructuredPowerPointLoader +``` diff --git a/docs/integrations/microsoft_word.md b/docs/integrations/microsoft_word.md new file mode 100644 index 00000000..19190579 --- /dev/null +++ b/docs/integrations/microsoft_word.md @@ -0,0 +1,16 @@ +# Microsoft Word + +>[Microsoft Word](https://www.microsoft.com/en-us/microsoft-365/word) is a word processor developed by Microsoft. + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/microsoft_word.ipynb). + + +```python +from langchain.document_loaders import UnstructuredWordDocumentLoader +``` diff --git a/docs/integrations/modern_treasury.md b/docs/integrations/modern_treasury.md new file mode 100644 index 00000000..fa98f717 --- /dev/null +++ b/docs/integrations/modern_treasury.md @@ -0,0 +1,19 @@ +# Modern Treasury + +>[Modern Treasury](https://www.moderntreasury.com/) simplifies complex payment operations. It is a unified platform to power products and processes that move money. 
+>- Connect to banks and payment systems +>- Track transactions and balances in real-time +>- Automate payment operations for scale + +## Installation and Setup + +There isn't any special setup for it. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/modern_treasury.ipynb). + + +```python +from langchain.document_loaders import ModernTreasuryLoader +``` diff --git a/docs/integrations/notion.md b/docs/integrations/notion.md new file mode 100644 index 00000000..10e3d7ac --- /dev/null +++ b/docs/integrations/notion.md @@ -0,0 +1,27 @@ +# Notion DB + +>[Notion](https://www.notion.so/) is a collaboration platform with modified Markdown support that integrates kanban +> boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, +> and project and task management. + +## Installation and Setup + +All instructions are in examples below. + +## Document Loader + +We have two different loaders: `NotionDirectoryLoader` and `NotionDBLoader`. + +See a [usage example for the NotionDirectoryLoader](../modules/indexes/document_loaders/examples/notion.ipynb). + + +```python +from langchain.document_loaders import NotionDirectoryLoader +``` + +See a [usage example for the NotionDBLoader](../modules/indexes/document_loaders/examples/notiondb.ipynb). + + +```python +from langchain.document_loaders import NotionDBLoader +``` diff --git a/docs/integrations/obsidian.md b/docs/integrations/obsidian.md new file mode 100644 index 00000000..9ceef642 --- /dev/null +++ b/docs/integrations/obsidian.md @@ -0,0 +1,19 @@ +# Obsidian + +>[Obsidian](https://obsidian.md/) is a powerful and extensible knowledge base +that works on top of your local folder of plain text files. + +## Installation and Setup + +All instructions are in examples below. + +## Document Loader + + +See a [usage example](../modules/indexes/document_loaders/examples/obsidian.ipynb). 
+ + +```python +from langchain.document_loaders import ObsidianLoader +``` + diff --git a/docs/integrations/psychic.md b/docs/integrations/psychic.md index f3363ab0..cd08a0e9 100644 --- a/docs/integrations/psychic.md +++ b/docs/integrations/psychic.md @@ -1,19 +1,25 @@ # Psychic -This page covers how to use [Psychic](https://www.psychic.dev/) within LangChain. +>[Psychic](https://www.psychic.dev/) is a platform for integrating with SaaS tools like `Notion`, `Zendesk`, +> `Confluence`, and `Google Drive` via OAuth and syncing documents from these applications to your SQL or vector +> database. You can think of it like Plaid for unstructured data. -## What is Psychic? +## Installation and Setup -Psychic is a platform for integrating with your customer’s SaaS tools like Notion, Zendesk, Confluence, and Google Drive via OAuth and syncing documents from these applications to your SQL or vector database. You can think of it like Plaid for unstructured data. Psychic is easy to set up - you use it by importing the react library and configuring it with your Sidekick API key, which you can get from the [Psychic dashboard](https://dashboard.psychic.dev/). When your users connect their applications, you can view these connections from the dashboard and retrieve data using the server-side libraries. - -## Quick start +```bash +pip install psychicapi +``` +Psychic is easy to set up - you import the `react` library and configure it with your `Sidekick API` key, which you get +from the [Psychic dashboard](https://dashboard.psychic.dev/). When you connect the applications, you +view these connections from the dashboard and retrieve data using the server-side libraries. + 1. Create an account in the [dashboard](https://dashboard.psychic.dev/). -2. Use the [react library](https://docs.psychic.dev/sidekick-link) to add the Psychic link modal to your frontend react app. Users will use this to connect their SaaS apps. -3. 
Once your user has created a connection, you can use the langchain PsychicLoader by following the [example notebook](../modules/indexes/document_loaders/examples/psychic.ipynb) +2. Use the [react library](https://docs.psychic.dev/sidekick-link) to add the Psychic link modal to your frontend react app. You will use this to connect the SaaS apps. +3. Once you have created a connection, you can use the `PsychicLoader` by following the [example notebook](../modules/indexes/document_loaders/examples/psychic.ipynb) -# Advantages vs Other Document Loaders +## Advantages vs Other Document Loaders 1. **Universal API:** Instead of building OAuth flows and learning the APIs for every SaaS app, you integrate Psychic once and leverage our universal API to retrieve data. 2. **Data Syncs:** Data in your customers' SaaS apps can get stale fast. With Psychic you can configure webhooks to keep your documents up to date on a daily or realtime basis. diff --git a/docs/integrations/reddit.md b/docs/integrations/reddit.md new file mode 100644 index 00000000..6026c122 --- /dev/null +++ b/docs/integrations/reddit.md @@ -0,0 +1,22 @@ +# Reddit + +>[Reddit](https://www.reddit.com) is an American social news aggregation, content rating, and discussion website. + +## Installation and Setup + +First, you need to install the `praw` Python package. + +```bash +pip install praw +``` + +Make a [Reddit Application](https://www.reddit.com/prefs/apps/) and initialize the loader with your Reddit API credentials. + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/reddit.ipynb). 
+ + +```python +from langchain.document_loaders import RedditPostsLoader +``` diff --git a/docs/modules/indexes/document_loaders/examples/discord_loader.ipynb b/docs/modules/indexes/document_loaders/examples/discord.ipynb similarity index 100% rename from docs/modules/indexes/document_loaders/examples/discord_loader.ipynb rename to docs/modules/indexes/document_loaders/examples/discord.ipynb diff --git a/docs/modules/indexes/document_loaders/examples/docugami.ipynb b/docs/modules/indexes/document_loaders/examples/docugami.ipynb index 2c9a2a8e..296f9ac1 100644 --- a/docs/modules/indexes/document_loaders/examples/docugami.ipynb +++ b/docs/modules/indexes/document_loaders/examples/docugami.ipynb @@ -5,22 +5,47 @@ "metadata": {}, "source": [ "# Docugami\n", - "This notebook covers how to load documents from `Docugami`. See [here](../../../../ecosystem/docugami.md) for more details, and the advantages of using this system over alternative data loaders.\n", + "This notebook covers how to load documents from `Docugami`. It provides the advantages of using this system over alternative data loaders.\n", "\n", "## Prerequisites\n", - "1. Follow the Quick Start section in [this document](../../../../ecosystem/docugami.md)\n", - "2. Grab an access token for your workspace, and make sure it is set as the DOCUGAMI_API_KEY environment variable\n", + "1. Install necessary python packages.\n", + "2. Grab an access token for your workspace, and make sure it is set as the `DOCUGAMI_API_KEY` environment variable.\n", "3. 
Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# You need the lxml package to use the DocugamiLoader\n", - "!poetry run pip -q install lxml" + "!pip install lxml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Quick start\n", + "\n", + "1. Create a [Docugami workspace](http://www.docugami.com) (free trials available)\n", + "2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. There is no fixed set of document types supported by the system; the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later.\n", + "3. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api)\n", + "4. Explore the [Docugami API](https://api-docs.docugami.com) to get a list of your processed docset IDs, or just the document IDs for a particular docset. \n", + "5. Use the `DocugamiLoader` as detailed below to get rich semantic chunks for your documents.\n", + "6. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML with better tags based on your preferences, which are then added to the DocugamiLoader output as metadata. Use techniques like [self-querying retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html) to do high accuracy Document QA.\n", + "\n", + "## Advantages vs Other Chunking Techniques\n", + "\n", + "Appropriate chunking of your documents is critical for retrieval from documents. 
Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. Docugami offers a different approach:\n", + "\n", + "1. **Intelligent Chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.\n", + "2. **Structured Representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.\n", + "3. **Semantic Annotations:** Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.\n", + "4. **Additional Metadata:** Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. 
See the detailed code walk-through below.\n" ] }, { @@ -398,7 +423,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.10" + "version": "3.10.6" } }, "nbformat": 4, diff --git a/docs/modules/indexes/document_loaders/examples/facebook_chat.ipynb b/docs/modules/indexes/document_loaders/examples/facebook_chat.ipynb index c61b3fad..b4024aec 100644 --- a/docs/modules/indexes/document_loaders/examples/facebook_chat.ipynb +++ b/docs/modules/indexes/document_loaders/examples/facebook_chat.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Facebook Chat\n", + "# Facebook Chat\n", "\n", ">[Messenger](https://en.wikipedia.org/wiki/Messenger_(software)) is an American proprietary instant messaging app and platform developed by `Meta Platforms`. Originally developed as `Facebook Chat` in 2008, the company revamped its messaging service in 2010.\n", "\n", diff --git a/docs/modules/indexes/document_loaders/examples/reddit.ipynb b/docs/modules/indexes/document_loaders/examples/reddit.ipynb index 385a9177..adc562d2 100644 --- a/docs/modules/indexes/document_loaders/examples/reddit.ipynb +++ b/docs/modules/indexes/document_loaders/examples/reddit.ipynb @@ -6,7 +6,7 @@ "source": [ "# Reddit\n", "\n", - ">[Reddit (reddit)](www.reddit.com) is an American social news aggregation, content rating, and discussion website.\n", + ">[Reddit](https://www.reddit.com) is an American social news aggregation, content rating, and discussion website.\n", "\n", "\n", "This loader fetches the text from the Posts of Subreddits or Reddit users, using the `praw` Python package.\n", diff --git a/docs/templates/integration.md b/docs/templates/integration.md new file mode 100644 index 00000000..0388b936 --- /dev/null +++ b/docs/templates/integration.md @@ -0,0 +1,64 @@ + +[comment: Please see the reference example in "docs/integrations/arxiv.md"]:: +[comment: Use this template to create a new .md file in "docs/integrations/"]:: + +# 
Title_REPLACE_ME + +[comment: Only one Title/H1 is allowed!]:: + +> + +[comment: Description: After reading this description, a reader should decide if this integration is good enough to try/keep reading OR]:: +[comment: go to read the next integration doc. ]:: +[comment: Description should include a link to the source for further reading.]:: + +## Installation and Setup + +[comment: Installation and Setup: All necessary additional package installations and setup for tokens, etc]:: + +```bash +pip install package_name_REPLACE_ME +``` + +[comment: OR this text:]:: +There isn't any special setup for it. + + +[comment: The next H2/## sections with names of the integration modules, like "LLM", "Text Embedding Models", etc]:: +[comment: see "Modules" in the "index.html" page]:: +[comment: Each H2 section should include a link to an example(s) and Python code with the import of the integration class]:: +[comment: Below are several example sections. Remove all unnecessary sections. Add all necessary sections not provided here.]:: + +## LLM + +See a [usage example](../modules/models/llms/integrations/INCLUDE_REAL_NAME.ipynb). + +```python +from langchain.llms import integration_class_REPLACE_ME +``` + + +## Text Embedding Models + +See a [usage example](../modules/models/text_embedding/examples/INCLUDE_REAL_NAME.ipynb). + +```python +from langchain.embeddings import integration_class_REPLACE_ME +``` + + +## Chat Models + +See a [usage example](../modules/models/chat/integrations/INCLUDE_REAL_NAME.ipynb). + +```python +from langchain.chat_models import integration_class_REPLACE_ME +``` + +## Document Loader + +See a [usage example](../modules/indexes/document_loaders/examples/INCLUDE_REAL_NAME.ipynb). + +```python +from langchain.document_loaders import integration_class_REPLACE_ME +```
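
The template's structural rules (a single H1, an `## Installation and Setup` section, no leftover placeholders) could be checked mechanically before a new integration page is committed. The following is only a minimal sketch under that assumption: `check_integration_page` is a hypothetical helper, not part of this PR or of LangChain, and it checks only the three rules named here.

```python
import re


def check_integration_page(markdown: str) -> list:
    """Return a list of problems found in an integration page (empty if none).

    Hypothetical helper: it only checks the conventions stated in the template.
    """
    problems = []
    in_fence = False
    h1_count = 0
    h2_titles = []
    for line in markdown.splitlines():
        # Toggle on fence markers so '#' comments inside code are not counted as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        if re.match(r"^# \S", line):
            h1_count += 1
        elif re.match(r"^## \S", line):
            h2_titles.append(line[3:].strip())
    if h1_count != 1:
        problems.append(f"expected exactly one H1, found {h1_count}")
    if "Installation and Setup" not in h2_titles:
        problems.append("missing '## Installation and Setup' section")
    # Placeholder markers used by the template must be replaced in a real page.
    if "REPLACE_ME" in markdown or "INCLUDE_REAL_NAME" in markdown:
        problems.append("template placeholder left in place")
    return problems
```

Such a check would run on the rendered `.md` file contents rather than on the diff; an empty list means the page follows the three conventions it tests, nothing more.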