Harrison/one drive loader (#4081)

Co-authored-by: José Ferraz Neto <netoferraz@gmail.com>
Harrison Chase 2023-05-03 22:55:34 -07:00 committed by GitHub
parent bd277b5327
commit fba6921b50
10 changed files with 1105 additions and 558 deletions

View File

@@ -0,0 +1,90 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# OneDrive\n",
"This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
"2. When registration finishes, the Azure portal displays the app registration's Overview pane. You see the Application (client) ID. Also called the `client ID`, this value uniquely identifies your application in the Microsoft identity platform.\n",
"3. During the steps you will be following at **item 1**, you can set the redirect URI as `http://localhost:8000/callback`\n",
"4. During the steps you will be following at **item 1**, generate a new password (`client_secret`) under Application Secrets section.\n",
"5. Follow the instructions at this [document](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-configure-app-expose-web-apis#add-a-scope) to add the following `SCOPES` (`offline_access` and `Files.Read.All`) to your application.\n",
"6. Visit the [Graph Explorer Playground](https://developer.microsoft.com/en-us/graph/graph-explorer) to obtain your `OneDrive ID`. The first step is to ensure you are logged in with the account associated your OneDrive account. Then you need to make a request to `https://graph.microsoft.com/v1.0/me/drive` and the response will return a payload with a field `id` that holds the ID of your OneDrive account.\n",
"7. You need to install the o365 package using the command `pip install o365`.\n",
"8. At the end of the steps you must have the following values: \n",
"- `CLIENT_ID`\n",
"- `CLIENT_SECRET`\n",
"- `DRIVE_ID`\n",
"\n",
"## 🧑 Instructions for ingesting your documents from OneDrive\n",
"\n",
"### 🔑 Authentication\n",
"\n",
"By default, the `OneDriveLoader` expects that the values of `CLIENT_ID` and `CLIENT_SECRET` must be stored as environment variables named `O365_CLIENT_ID` and `O365_CLIENT_SECRET` respectively. You could pass those environment variables through a `.env` file at the root of your application or using the following command in your script.\n",
"\n",
"```python\n",
"os.environ['O365_CLIENT_ID'] = \"YOUR CLIENT ID\"\n",
"os.environ['O365_CLIENT_SECRET'] = \"YOUR CLIENT SECRET\"\n",
"```\n",
"\n",
"This loader uses an authentication called [*on behalf of a user*](https://learn.microsoft.com/en-us/graph/auth-v2-user?context=graph%2Fapi%2F1.0&view=graph-rest-1.0). It is a 2 step authentication with user consent. When you instantiate the loader, it will call will print a url that the user must visit to give consent to the app on the required permissions. The user must then visit this url and give consent to the application. Then the user must copy the resulting page url and paste it back on the console. The method will then return True if the login attempt was succesful.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onedrive import OneDriveLoader\n",
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\")\n",
"```\n",
"\n",
"Once the authentication has been done, the loader will store a token (`o365_token.txt`) at `~/.credentials/` folder. This token could be used later to authenticate without the copy/paste steps explained earlier. To use this token for authentication, you need to change the `auth_with_token` parameter to True in the instantiation of the loader.\n",
"\n",
"```python\n",
"from langchain.document_loaders.onedrive import OneDriveLoader\n",
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", auth_with_token=True)\n",
"```\n",
"\n",
"### 🗂️ Documents loader\n",
"\n",
"#### 📑 Loading documents from a OneDrive Directory\n",
"\n",
"`OneDriveLoader` can load documents from a specific folder within your OneDrive. For instance, you want to load all documents that are stored at `Documents/clients` folder within your OneDrive.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onedrive import OneDriveLoader\n",
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", folder_path=\"Documents/clients\", auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n",
"\n",
"#### 📑 Loading documents from a list of Documents IDs\n",
"\n",
"Another possibility is to provide a list of `object_id` for each document you want to load. For that, you will need to query the [Microsoft Graph API](https://developer.microsoft.com/en-us/graph/graph-explorer) to find all the documents ID that you are interested in. This [link](https://learn.microsoft.com/en-us/graph/api/resources/onedrive?view=graph-rest-1.0#commonly-accessed-resources) provides a list of endpoints that will be helpful to retrieve the documents ID.\n",
"\n",
"For instance, to retrieve information about all objects that are stored at the root of the Documents folder, you need make a request to: `https://graph.microsoft.com/v1.0/drives/{YOUR DRIVE ID}/root/children`. Once you have the list of IDs that you are interested in, then you can instantiate the loader with the following parameters.\n",
"\n",
"\n",
"```python\n",
"from langchain.document_loaders.onedrive import OneDriveLoader\n",
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -87,7 +87,7 @@ class SQLAlchemyCache(BaseCache):
"""Look up based on prompt and llm_string."""
stmt = (
select(self.cache_schema.response)
.where(self.cache_schema.prompt == prompt)
.where(self.cache_schema.prompt == prompt) # type: ignore
.where(self.cache_schema.llm == llm_string)
.order_by(self.cache_schema.idx)
)

View File

@@ -52,6 +52,7 @@ from langchain.document_loaders.notebook import NotebookLoader
from langchain.document_loaders.notion import NotionDirectoryLoader
from langchain.document_loaders.notiondb import NotionDBLoader
from langchain.document_loaders.obsidian import ObsidianLoader
from langchain.document_loaders.onedrive import OneDriveLoader
from langchain.document_loaders.pdf import (
MathpixPDFLoader,
OnlinePDFLoader,
@@ -148,6 +149,7 @@ __all__ = [
"NotionDBLoader",
"NotionDirectoryLoader",
"ObsidianLoader",
"OneDriveLoader",
"OnlinePDFLoader",
"OutlookMessageLoader",
"PDFMinerLoader",

View File

@@ -0,0 +1,234 @@
"""Loader that loads data from OneDrive"""
from __future__ import annotations
import logging
import os
import tempfile
from enum import Enum
from pathlib import Path
from typing import TYPE_CHECKING, Dict, List, Optional, Type, Union
from pydantic import BaseModel, BaseSettings, Field, FilePath, SecretStr
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.onedrive_file import OneDriveFileLoader
if TYPE_CHECKING:
from O365 import Account
from O365.drive import Drive, Folder
SCOPES = ["offline_access", "Files.Read.All"]
logger = logging.getLogger(__name__)


class _OneDriveSettings(BaseSettings):
    client_id: str = Field(..., env="O365_CLIENT_ID")
    client_secret: SecretStr = Field(..., env="O365_CLIENT_SECRET")

    class Config:
        env_prefix = ""
        case_sensitive = False
        env_file = ".env"


class _OneDriveTokenStorage(BaseSettings):
    token_path: FilePath = Field(Path.home() / ".credentials" / "o365_token.txt")


class _FileType(str, Enum):
    DOC = "doc"
    DOCX = "docx"
    PDF = "pdf"


class _SupportedFileTypes(BaseModel):
    file_types: List[_FileType]

    def fetch_mime_types(self) -> Dict[str, str]:
        mime_types_mapping = {}
        for file_type in self.file_types:
            if file_type.value == "doc":
                mime_types_mapping[file_type.value] = "application/msword"
            elif file_type.value == "docx":
                mime_types_mapping[
                    file_type.value
                ] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"  # noqa: E501
            elif file_type.value == "pdf":
                mime_types_mapping[file_type.value] = "application/pdf"
        return mime_types_mapping


class OneDriveLoader(BaseLoader, BaseModel):
    settings: _OneDriveSettings = Field(default_factory=_OneDriveSettings)
    drive_id: str = Field(...)
    folder_path: Optional[str] = None
    object_ids: Optional[List[str]] = None
    auth_with_token: bool = False

    def _auth(self) -> Type[Account]:
        """
        Authenticates the OneDrive API client using the specified
        authentication method and returns the Account object.

        Returns:
            Type[Account]: The authenticated Account object.
        """
        try:
            from O365 import Account, FileSystemTokenBackend
        except ImportError:
            raise ValueError(
                "O365 package not found, please install it with `pip install o365`"
            )
        if self.auth_with_token:
            token_storage = _OneDriveTokenStorage()
            token_path = token_storage.token_path
            token_backend = FileSystemTokenBackend(
                token_path=token_path.parent, token_filename=token_path.name
            )
            account = Account(
                credentials=(
                    self.settings.client_id,
                    self.settings.client_secret.get_secret_value(),
                ),
                scopes=SCOPES,
                token_backend=token_backend,
                **{"raise_http_errors": False},
            )
        else:
            token_backend = FileSystemTokenBackend(
                token_path=Path.home() / ".credentials"
            )
            account = Account(
                credentials=(
                    self.settings.client_id,
                    self.settings.client_secret.get_secret_value(),
                ),
                scopes=SCOPES,
                token_backend=token_backend,
                **{"raise_http_errors": False},
            )
        # make the auth
        account.authenticate()
        return account

    def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]:
        """
        Returns the folder or drive object located at the
        specified path relative to the given drive.

        Args:
            drive (Type[Drive]): The root drive from which the folder path is relative.

        Returns:
            Union[Folder, Drive]: The folder or drive object
            located at the specified path.

        Raises:
            FileNotFoundError: If the path does not exist.
        """
        subfolder_drive = drive
        if self.folder_path is None:
            return subfolder_drive

        subfolders = [f for f in self.folder_path.split("/") if f != ""]
        if len(subfolders) == 0:
            return subfolder_drive

        items = subfolder_drive.get_items()
        for subfolder in subfolders:
            try:
                subfolder_drive = list(filter(lambda x: subfolder in x.name, items))[0]
                items = subfolder_drive.get_items()
            except (IndexError, AttributeError):
                raise FileNotFoundError(
                    "Path {} does not exist.".format(self.folder_path)
                )
        return subfolder_drive

    def _load_from_folder(self, folder: Type[Folder]) -> List[Document]:
        """
        Loads all supported document files from the specified folder
        and returns a list of Document objects.

        Args:
            folder (Type[Folder]): The folder object to load the documents from.

        Returns:
            List[Document]: A list of Document objects representing
            the loaded documents.
        """
        docs = []
        file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
        file_mime_types = file_types.fetch_mime_types()
        items = folder.get_items()
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}"
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            for file in items:
                if file.is_file:
                    if file.mime_type in list(file_mime_types.values()):
                        loader = OneDriveFileLoader(file=file)
                        docs.extend(loader.load())
        return docs

    def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]:
        """
        Loads all supported document files from the specified OneDrive
        drive based on their object IDs and returns a list
        of Document objects.

        Args:
            drive (Type[Drive]): The OneDrive drive object
            to load the documents from.

        Returns:
            List[Document]: A list of Document objects representing
            the loaded documents.
        """
        docs = []
        file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
        file_mime_types = file_types.fetch_mime_types()
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}"
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            for object_id in self.object_ids if self.object_ids else [""]:
                file = drive.get_item(object_id)
                if not file:
                    logging.warning(
                        "There isn't a file with "
                        f"object_id {object_id} in drive {drive}."
                    )
                    continue
                if file.is_file:
                    if file.mime_type in list(file_mime_types.values()):
                        loader = OneDriveFileLoader(file=file)
                        docs.extend(loader.load())
        return docs

    def load(self) -> List[Document]:
        """
        Loads all supported document files from the specified OneDrive drive
        and returns a list of Document objects.

        Returns:
            List[Document]: A list of Document objects
            representing the loaded documents.

        Raises:
            ValueError: If the specified drive ID
            does not correspond to a drive in the OneDrive storage.
        """
        account = self._auth()
        storage = account.storage()
        drive = storage.get_drive(self.drive_id)
        docs: List[Document] = []
        if not drive:
            raise ValueError(f"There isn't a drive with id {self.drive_id}.")
        if self.folder_path:
            folder = self._get_folder_from_path(drive=drive)
            docs.extend(self._load_from_folder(folder=folder))
        elif self.object_ids:
            docs.extend(self._load_from_object_ids(drive=drive))
        return docs

View File

@@ -0,0 +1,30 @@
from __future__ import annotations

import tempfile
from typing import TYPE_CHECKING, List

from pydantic import BaseModel, Field

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.unstructured import UnstructuredFileLoader

if TYPE_CHECKING:
    from O365.drive import File

CHUNK_SIZE = 1024 * 1024 * 5


class OneDriveFileLoader(BaseLoader, BaseModel):
    file: File = Field(...)

    class Config:
        arbitrary_types_allowed = True

    def load(self) -> List[Document]:
        """Load Documents"""
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}/{self.file.name}"
            self.file.download(to_path=temp_dir, chunk_size=CHUNK_SIZE)
            loader = UnstructuredFileLoader(file_path)
            return loader.load()

View File

@@ -4,13 +4,22 @@ from __future__ import annotations
import warnings
from typing import Any, Iterable, List, Optional
from sqlalchemy import MetaData, Table, create_engine, inspect, select, text
import sqlalchemy
from sqlalchemy import (
CursorResult,
MetaData,
Table,
create_engine,
inspect,
select,
text,
)
from sqlalchemy.engine import Engine
from sqlalchemy.exc import ProgrammingError, SQLAlchemyError
from sqlalchemy.schema import CreateTable
def _format_index(index: dict) -> str:
def _format_index(index: sqlalchemy.engine.interfaces.ReflectedIndex) -> str:
return (
f'Name: {index["name"]}, Unique: {index["unique"]},'
f' Columns: {str(index["column_names"])}'
@@ -30,7 +39,7 @@ class SQLDatabase:
sample_rows_in_table_info: int = 3,
indexes_in_table_info: bool = False,
custom_table_info: Optional[dict] = None,
view_support: Optional[bool] = False,
view_support: bool = False,
):
"""Create engine from database URI."""
self._engine = engine
@@ -90,7 +99,7 @@
self._metadata.reflect(
views=view_support,
bind=self._engine,
only=self._usable_tables,
only=list(self._usable_tables),
schema=self._schema,
)
@@ -188,10 +197,10 @@
try:
# get the sample rows
with self._engine.connect() as connection:
sample_rows = connection.execute(command)
sample_rows_result: CursorResult = connection.execute(command)
# shorten values in the sample rows
sample_rows = list(
map(lambda ls: [str(i)[:100] for i in ls], sample_rows)
map(lambda ls: [str(i)[:100] for i in ls], sample_rows_result)
)
# save the sample rows in string format
@@ -222,7 +231,7 @@
if fetch == "all":
result = cursor.fetchall()
elif fetch == "one":
result = cursor.fetchone()[0]
result = cursor.fetchone()[0] # type: ignore
else:
raise ValueError("Fetch parameter must be either 'one' or 'all'")
return str(result)

View File

@@ -42,7 +42,7 @@ class CollectionStore(BaseModel):
@classmethod
def get_by_name(cls, session: Session, name: str) -> Optional["CollectionStore"]:
return session.query(cls).filter(cls.name == name).first()
return session.query(cls).filter(cls.name == name).first() # type: ignore
@classmethod
def get_or_create(
@@ -79,7 +79,7 @@
)
collection = relationship(CollectionStore, back_populates="embeddings")
embedding = sqlalchemy.Column(ARRAY(REAL))
embedding: sqlalchemy.Column = sqlalchemy.Column(ARRAY(REAL))
document = sqlalchemy.Column(sqlalchemy.String, nullable=True)
cmetadata = sqlalchemy.Column(JSON, nullable=True)

View File

@@ -42,7 +42,7 @@ class CollectionStore(BaseModel):
@classmethod
def get_by_name(cls, session: Session, name: str) -> Optional["CollectionStore"]:
return session.query(cls).filter(cls.name == name).first()
return session.query(cls).filter(cls.name == name).first() # type: ignore
@classmethod
def get_or_create(

poetry.lock (generated, 1273 changed lines)

File diff suppressed because it is too large

View File

@@ -75,6 +75,7 @@ lark = {version="^1.1.5", optional=true}
lancedb = {version = "^0.1", optional = true}
pexpect = {version = "^4.8.0", optional = true}
pyvespa = {version = "^0.33.0", optional = true}
O365 = {version = "^2.0.26", optional = true}
[tool.poetry.group.docs.dependencies]
autodoc_pydantic = "^1.8.0"
@@ -155,7 +156,7 @@ openai = ["openai"]
cohere = ["cohere"]
embeddings = ["sentence-transformers"]
azure = ["azure-identity", "azure-cosmos", "openai", "azure-core"]
all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa"]
all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa", "O365"]
[tool.ruff]
select = [