feat: Add Notion database document loader (#2056)

This PR adds a Notion DB loader for langchain.

It reads content from pages within a Notion Database. It uses the Notion
API to query the database and read the pages. It also reads the metadata
from the pages and stores it in the Document object.
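
The metadata handling can be illustrated with a minimal sketch. It mirrors the property handling in `load_page` from this PR, but `properties_to_metadata` and the sample payload are illustrative assumptions, not part of the change:

```python
from typing import Any, Dict


def properties_to_metadata(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten Notion page properties the way load_page in this PR does."""
    metadata: Dict[str, Any] = {}
    for name, prop in properties.items():
        prop_type = prop["type"]
        if prop_type in ("rich_text", "title"):
            # Only the first text fragment is kept; empty properties give None.
            fragments = prop[prop_type]
            value: Any = fragments[0]["plain_text"] if fragments else None
        elif prop_type == "multi_select":
            value = [item["name"] for item in prop["multi_select"]]
        else:
            # Other property types are not read in this version.
            value = None
        metadata[name.lower()] = value
    return metadata


# Hand-written sample shaped like a Notion page object, not a real API response.
sample_properties = {
    "Title": {"type": "title", "title": [{"plain_text": "Release notes"}]},
    "Categories": {
        "type": "multi_select",
        "multi_select": [{"name": "docs"}, {"name": "internal"}],
    },
    "Done": {"type": "checkbox", "checkbox": True},
}

metadata = properties_to_metadata(sample_properties)
print(metadata)
```

Property names are lower-cased to form the metadata keys, so a `Categories` column ends up under `categories`.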
Stéphane Busso 2023-03-29 04:07:09 +13:00 committed by GitHub
parent 923a7dde5a
commit 0bee219cb3
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 307 additions and 0 deletions

@@ -0,0 +1,153 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1dc7df1d",
"metadata": {},
"source": [
"# Notion DB Loader\n",
"\n",
"NotionDBLoader is a Python class for loading content from a Notion database. It retrieves pages from the database, reads their content, and returns a list of Document objects.\n",
"\n",
"## Requirements\n",
"\n",
"- A Notion Database\n",
"- Notion Integration Token\n",
"\n",
"## Setup\n",
"\n",
"### 1. Create a Notion Table Database\n",
"Create a new table database in Notion. You can add any column to the database and they will be treated as metadata. For example you can add the following columns:\n",
"\n",
"- Title: set Title as the default property.\n",
"- Categories: A Multi-select property to store categories associated with the page.\n",
"- Keywords: A Multi-select property to store keywords associated with the page.\n",
"\n",
"Add your content to the body of each page in the database. The NotionDBLoader will extract the content and metadata from these pages.\n",
"\n",
"## 2. Create a Notion Integration\n",
"To create a Notion Integration, follow these steps:\n",
"\n",
"1. Visit the (Notion Developers)[https://www.notion.com/my-integrations] page and log in with your Notion account.\n",
"2. Click on the \"+ New integration\" button.\n",
"3. Give your integration a name and choose the workspace where your database is located.\n",
"4. Select the require capabilities, this extension only need the Read content capability\n",
"5. Click the \"Submit\" button to create the integration.\n",
"Once the integration is created, you'll be provided with an Integration Token (API key). Copy this token and keep it safe, as you'll need it to use the NotionDBLoader.\n",
"\n",
"### 3. Connect the Integration to the Database\n",
"To connect your integration to the database, follow these steps:\n",
"\n",
"1. Open your database in Notion.\n",
"2. Click on the three-dot menu icon in the top right corner of the database view.\n",
"3. Click on the \"+ New integration\" button.\n",
"4. Find your integration, you may need to start typing its name in the search box.\n",
"5. Click on the \"Connect\" button to connect the integration to the database.\n",
"\n",
"\n",
"### 4. Get the Database ID\n",
"To get the database ID, follow these steps:\n",
"\n",
"1. Open your database in Notion.\n",
"2. Click on the three-dot menu icon in the top right corner of the database view.\n",
"3. Select \"Copy link\" from the menu to copy the database URL to your clipboard.\n",
"4. The database ID is the long string of alphanumeric characters found in the URL. It typically looks like this: https://www.notion.so/username/8935f9d140a04f95a872520c4f123456?v=.... In this example, the database ID is 8935f9d140a04f95a872520c4f123456.\n",
"\n",
"With the database properly set up and the integration token and database ID in hand, you can now use the NotionDBLoader code to load content and metadata from your Notion database.\n",
"\n",
"## Usage\n",
"NotionDBLoader is part of the langchain package's document loaders. You can use it as follows:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6c3a314c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"········\n",
"········\n"
]
}
],
"source": [
"from getpass import getpass\n",
"NOTION_TOKEN = getpass()\n",
"DATABASE_ID = getpass()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "007c5cbf",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import NotionDBLoader"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "a1caec59",
"metadata": {},
"outputs": [],
"source": [
"loader = NotionDBLoader(NOTION_TOKEN, DATABASE_ID)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "b1c30ff7",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "4f5789a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"print(docs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
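
Step 4 of the notebook locates the database ID by eye in the copied link. A minimal sketch of doing the same programmatically, assuming the ID is the 32 hexadecimal characters that end the URL path; `extract_database_id` is a hypothetical helper and not part of langchain:

```python
import re
from urllib.parse import urlparse


def extract_database_id(url: str) -> str:
    """Return the 32-character hexadecimal ID that ends the URL path."""
    # urlparse drops the query string (?v=...); dashes are removed in case
    # the ID is written in dashed UUID form.
    path = urlparse(url).path.rstrip("/").replace("-", "")
    match = re.search(r"[0-9a-fA-F]{32}$", path)
    if match is None:
        raise ValueError(f"No database ID found in {url!r}")
    return match.group(0)


# The example link from the notebook, with its elided view parameter kept as-is.
database_id = extract_database_id(
    "https://www.notion.so/username/8935f9d140a04f95a872520c4f123456?v=..."
)
print(database_id)
```

The result can be passed as the second argument to `NotionDBLoader` in place of the value typed into `getpass()`.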

@@ -32,6 +32,7 @@ from langchain.document_loaders.imsdb import IMSDbLoader
from langchain.document_loaders.markdown import UnstructuredMarkdownLoader
from langchain.document_loaders.notebook import NotebookLoader
from langchain.document_loaders.notion import NotionDirectoryLoader
from langchain.document_loaders.notiondb import NotionDBLoader
from langchain.document_loaders.obsidian import ObsidianLoader
from langchain.document_loaders.pdf import (
    OnlinePDFLoader,
@@ -72,6 +73,7 @@ __all__ = [
    "UnstructuredURLLoader",
    "DirectoryLoader",
    "NotionDirectoryLoader",
    "NotionDBLoader",
    "ReadTheDocsLoader",
    "GoogleDriveLoader",
    "UnstructuredHTMLLoader",

@@ -0,0 +1,152 @@
"""Notion DB loader for langchain"""
from typing import Any, Dict, List
import requests
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
NOTION_BASE_URL = "https://api.notion.com/v1"
DATABASE_URL = NOTION_BASE_URL + "/databases/{database_id}/query"
PAGE_URL = NOTION_BASE_URL + "/pages/{page_id}"
BLOCK_URL = NOTION_BASE_URL + "/blocks/{block_id}/children"


class NotionDBLoader(BaseLoader):
    """Notion DB Loader.

    Reads content from pages within a Notion Database.

    Args:
        integration_token (str): Notion integration token.
        database_id (str): Notion database id.
    """

    def __init__(self, integration_token: str, database_id: str) -> None:
        """Initialize with parameters."""
        if not integration_token:
            raise ValueError("integration_token must be provided")

        if not database_id:
            raise ValueError("database_id must be provided")

        self.token = integration_token
        self.database_id = database_id
        self.headers = {
            "Authorization": "Bearer " + self.token,
            "Content-Type": "application/json",
            "Notion-Version": "2022-06-28",
        }

    def load(self) -> List[Document]:
        """Load documents from the Notion database.

        Returns:
            List[Document]: List of documents.
        """
        page_ids = self._retrieve_page_ids()

        return list(self.load_page(page_id) for page_id in page_ids)

    def _retrieve_page_ids(
        self, query_dict: Dict[str, Any] = {"page_size": 100}
    ) -> List[str]:
        """Get the IDs of all pages in a Notion database."""
        # Work on a copy so the shared default dict is never mutated; otherwise
        # a stale start_cursor would leak into later calls.
        query = dict(query_dict)
        pages: List[Dict[str, Any]] = []

        while True:
            data = self._request(
                DATABASE_URL.format(database_id=self.database_id),
                method="POST",
                query_dict=query,
            )

            pages.extend(data.get("results", []))

            if not data.get("has_more"):
                break

            # Request the next batch, starting where this one ended.
            query["start_cursor"] = data.get("next_cursor")

        page_ids = [page["id"] for page in pages]

        return page_ids

    def load_page(self, page_id: str) -> Document:
        """Read a page."""
        data = self._request(PAGE_URL.format(page_id=page_id))

        # load properties as metadata
        metadata: Dict[str, Any] = {}

        for prop_name, prop_data in data["properties"].items():
            prop_type = prop_data["type"]

            if prop_type == "rich_text":
                value = (
                    prop_data["rich_text"][0]["plain_text"]
                    if prop_data["rich_text"]
                    else None
                )
            elif prop_type == "title":
                value = (
                    prop_data["title"][0]["plain_text"] if prop_data["title"] else None
                )
            elif prop_type == "multi_select":
                value = (
                    [item["name"] for item in prop_data["multi_select"]]
                    if prop_data["multi_select"]
                    else []
                )
            else:
                value = None

            metadata[prop_name.lower()] = value

        metadata["id"] = page_id

        return Document(page_content=self._load_blocks(page_id), metadata=metadata)

    def _load_blocks(self, block_id: str, num_tabs: int = 0) -> str:
        """Read a block and its children."""
        result_lines_arr: List[str] = []
        base_url: str = BLOCK_URL.format(block_id=block_id)
        cur_url: str = base_url

        while cur_url:
            data = self._request(cur_url)

            for result in data["results"]:
                result_obj = result[result["type"]]

                if "rich_text" not in result_obj:
                    continue

                cur_result_text_arr: List[str] = []

                for rich_text in result_obj["rich_text"]:
                    if "text" in rich_text:
                        cur_result_text_arr.append(
                            "\t" * num_tabs + rich_text["text"]["content"]
                        )

                if result["has_children"]:
                    children_text = self._load_blocks(
                        result["id"], num_tabs=num_tabs + 1
                    )
                    cur_result_text_arr.append(children_text)

                result_lines_arr.append("\n".join(cur_result_text_arr))

            # next_cursor is a pagination cursor, not a block id: pass it back
            # to the same endpoint as the start_cursor query parameter.
            next_cursor = data.get("next_cursor")
            cur_url = (
                base_url + "?start_cursor=" + next_cursor if next_cursor else ""
            )

        return "\n".join(result_lines_arr)

    def _request(
        self, url: str, method: str = "GET", query_dict: Dict[str, Any] = {}
    ) -> Any:
        res = requests.request(
            method,
            url,
            headers=self.headers,
            json=query_dict,
            timeout=10,
        )
        res.raise_for_status()
        return res.json()
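
`_retrieve_page_ids` relies on cursor-based pagination: each response carries `has_more` and `next_cursor`, and the cursor is sent back as `start_cursor` until no more results remain. A self-contained sketch of that loop, assuming a stubbed endpoint; `fake_query`, `collect_page_ids`, and the sample data are hypothetical stand-ins for the HTTP call, and real Notion cursors are opaque strings rather than offsets:

```python
from typing import Any, Dict, List

# Three fake pages, standing in for the rows of a Notion database.
_FAKE_PAGES = [{"id": "page-1"}, {"id": "page-2"}, {"id": "page-3"}]


def fake_query(query: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the database query endpoint."""
    # For simplicity the cursor here is just an offset into _FAKE_PAGES.
    start = int(query.get("start_cursor", 0))
    end = start + query["page_size"]
    has_more = end < len(_FAKE_PAGES)
    return {
        "results": _FAKE_PAGES[start:end],
        "has_more": has_more,
        "next_cursor": str(end) if has_more else None,
    }


def collect_page_ids(page_size: int = 2) -> List[str]:
    """Follow next_cursor until has_more is false, collecting page IDs."""
    # A fresh dict per call, so no cursor survives from an earlier run.
    query: Dict[str, Any] = {"page_size": page_size}
    ids: List[str] = []
    while True:
        data = fake_query(query)
        ids.extend(page["id"] for page in data["results"])
        if not data["has_more"]:
            break
        query["start_cursor"] = data["next_cursor"]
    return ids


page_ids = collect_page_ids()
print(page_ids)
```

Building the query dict inside the function is the design point: a dict shared between calls would still hold the last `start_cursor` the next time the function runs.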