Add SharePoint Loader (#4284)

- Added a loader (`SharePointLoader`) that can pull documents (`pdf`,
`docx`, `doc`) from the [SharePoint Document
Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872).
- Added a Base Loader (`O365BaseLoader`) to be used for all Loaders that
use [O365](https://github.com/O365/python-o365) Package
- Code refactoring on `OneDriveLoader` to use the new `O365BaseLoader`.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
José Ferraz Neto 2023-08-21 11:49:07 -03:00 committed by GitHub
parent bb4f7936f9
commit f116e10d53
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 426 additions and 183 deletions

View File

@ -0,0 +1,105 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Microsoft SharePoint\n",
"\n",
"> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
"\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
"2. When registration finishes, the Azure portal displays the app registration's Overview pane. You see the Application (client) ID. Also called the `client ID`, this value uniquely identifies your application in the Microsoft identity platform.\n",
"3. During the steps you will be following at **item 1**, you can set the redirect URI as `https://login.microsoftonline.com/common/oauth2/nativeclient`\n",
"4. During the steps you will be following at **item 1**, generate a new password (`client_secret`) under Application Secrets section.\n",
"5. Follow the instructions at this [document](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-configure-app-expose-web-apis#add-a-scope) to add the following `SCOPES` (`offline_access` and `Sites.Read.All`) to your application.\n",
"6. To retrieve files from your **Document Library**, you will need its ID. To obtain it, you will need values of `Tenant Name`, `Collection ID`, and `Subsite ID`.\n",
"7. To find your `Tenant Name` follow the instructions at this [document](https://learn.microsoft.com/en-us/azure/active-directory-b2c/tenant-management-read-tenant-name). Once you got this, just remove `.onmicrosoft.com` from the value and hold the rest as your `Tenant Name`.\n",
"8. To obtain your `Collection ID` and `Subsite ID`, you will need your **SharePoint** `site-name`. Your `SharePoint` site URL has the following format `https://<tenant-name>.sharepoint.com/sites/<site-name>`. The last part of this URL is the `site-name`.\n",
"9. To Get the Site `Collection ID`, hit this URL in the browser: `https://<tenant>.sharepoint.com/sites/<site-name>/_api/site/id` and copy the value of the `Edm.Guid` property.\n",
"10. To get the `Subsite ID` (or web ID) use: `https://<tenant>.sharepoint.com/<site-name>/_api/web/id` and copy the value of the `Edm.Guid` property.\n",
"11. The `SharePoint site ID` has the following format: `<tenant-name>.sharepoint.com,<Collection ID>,<subsite ID>`. You can hold that value to use in the next step.\n",
"12. Visit the [Graph Explorer Playground](https://developer.microsoft.com/en-us/graph/graph-explorer) to obtain your `Document Library ID`. The first step is to ensure you are logged in with the account associated with your **SharePoint** site. Then you need to make a request to `https://graph.microsoft.com/v1.0/sites/<SharePoint site ID>/drive` and the response will return a payload with a field `id` that holds the ID of your `Document Library ID`.\n",
"\n",
"## 🧑 Instructions for ingesting your documents from SharePoint Document Library\n",
"\n",
"### 🔑 Authentication\n",
"\n",
"By default, the `SharePointLoader` expects that the values of `CLIENT_ID` and `CLIENT_SECRET` must be stored as environment variables named `O365_CLIENT_ID` and `O365_CLIENT_SECRET` respectively. You could pass those environment variables through a `.env` file at the root of your application or using the following command in your script.\n",
"\n",
"```python\n",
"os.environ['O365_CLIENT_ID'] = \"YOUR CLIENT ID\"\n",
"os.environ['O365_CLIENT_SECRET'] = \"YOUR CLIENT SECRET\"\n",
"```\n",
"\n",
"This loader uses an authentication called [*on behalf of a user*](https://learn.microsoft.com/en-us/graph/auth-v2-user?context=graph%2Fapi%2F1.0&view=graph-rest-1.0). It is a 2 step authentication with user consent. When you instantiate the loader, it will call will print a url that the user must visit to give consent to the app on the required permissions. The user must then visit this url and give consent to the application. Then the user must copy the resulting page url and paste it back on the console. The method will then return True if the login attempt was succesful.\n",
"\n",
"```python\n",
"from langchain.document_loaders.sharepoint import SharePointLoader\n",
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\")\n",
"```\n",
"\n",
"Once the authentication has been done, the loader will store a token (`o365_token.txt`) at `~/.credentials/` folder. This token could be used later to authenticate without the copy/paste steps explained earlier. To use this token for authentication, you need to change the `auth_with_token` parameter to True in the instantiation of the loader.\n",
"\n",
"```python\n",
"from langchain.document_loaders.sharepoint import SharePointLoader\n",
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", auth_with_token=True)\n",
"```\n",
"\n",
"### 🗂️ Documents loader\n",
"\n",
"#### 📑 Loading documents from a Document Library Directory\n",
"\n",
"`SharePointLoader` can load documents from a specific folder within your Document Library. For instance, you want to load all documents that are stored at `Documents/marketing` folder within your Document Library.\n",
"\n",
"```python\n",
"from langchain.document_loaders.sharepoint import SharePointLoader\n",
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", folder_path=\"Documents/marketing\", auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n",
"\n",
"#### 📑 Loading documents from a list of Documents IDs\n",
"\n",
"Another possibility is to provide a list of `object_id` for each document you want to load. For that, you will need to query the [Microsoft Graph API](https://developer.microsoft.com/en-us/graph/graph-explorer) to find all the documents ID that you are interested in. This [link](https://learn.microsoft.com/en-us/graph/api/resources/onedrive?view=graph-rest-1.0#commonly-accessed-resources) provides a list of endpoints that will be helpful to retrieve the documents ID.\n",
"\n",
"For instance, to retrieve information about all objects that are stored at `data/finance/` folder, you need make a request to: `https://graph.microsoft.com/v1.0/drives/<document-library-id>/root:/data/finance:/children`. Once you have the list of IDs that you are interested in, then you can instantiate the loader with the following parameters.\n",
"\n",
"```python\n",
"from langchain.document_loaders.sharepoint import SharePointLoader\n",
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -147,6 +147,7 @@ from langchain.document_loaders.rst import UnstructuredRSTLoader
from langchain.document_loaders.rtf import UnstructuredRTFLoader
from langchain.document_loaders.s3_directory import S3DirectoryLoader
from langchain.document_loaders.s3_file import S3FileLoader
from langchain.document_loaders.sharepoint import SharePointLoader
from langchain.document_loaders.sitemap import SitemapLoader
from langchain.document_loaders.slack_directory import SlackDirectoryLoader
from langchain.document_loaders.snowflake_loader import SnowflakeLoader
@ -316,6 +317,7 @@ __all__ = [
"S3FileLoader",
"SRTLoader",
"SeleniumURLLoader",
"SharePointLoader",
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",

View File

@ -0,0 +1,182 @@
"""Base class for all loaders that uses O365 Package"""
from __future__ import annotations
import logging
import os
import tempfile
from abc import abstractmethod
from enum import Enum
from pathlib import Path
from typing import TYPE_CHECKING, Dict, Iterable, List, Sequence, Union
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.blob_loaders.file_system import FileSystemBlobLoader
from langchain.document_loaders.blob_loaders.schema import Blob
from langchain.pydantic_v1 import BaseModel, BaseSettings, Field, FilePath, SecretStr
if TYPE_CHECKING:
from O365 import Account
from O365.drive import Drive, Folder
logger = logging.getLogger(__name__)
CHUNK_SIZE = 1024 * 1024 * 5
class _O365Settings(BaseSettings):
client_id: str = Field(..., env="O365_CLIENT_ID")
client_secret: SecretStr = Field(..., env="O365_CLIENT_SECRET")
class Config:
env_prefix = ""
case_sentive = False
env_file = ".env"
class _O365TokenStorage(BaseSettings):
token_path: FilePath = Path.home() / ".credentials" / "o365_token.txt"
class _FileType(str, Enum):
DOC = "doc"
DOCX = "docx"
PDF = "pdf"
def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
mime_types_mapping = {}
for file_type in file_types:
if file_type.value == "doc":
mime_types_mapping[file_type.value] = "application/msword"
elif file_type.value == "docx":
mime_types_mapping[
file_type.value
] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document" # noqa: E501
elif file_type.value == "pdf":
mime_types_mapping[file_type.value] = "application/pdf"
return mime_types_mapping
class O365BaseLoader(BaseLoader, BaseModel):
settings: _O365Settings = Field(default_factory=_O365Settings)
"""Settings for the Office365 API client."""
auth_with_token: bool = False
"""Whether to authenticate with a token or not. Defaults to False."""
chunk_size: Union[int, str] = CHUNK_SIZE
"""Number of bytes to retrieve from each api call to the server. int or 'auto'."""
@property
@abstractmethod
def _file_types(self) -> Sequence[_FileType]:
"""Return supported file types."""
@property
def _fetch_mime_types(self) -> Dict[str, str]:
"""Return a dict of supported file types to corresponding mime types."""
return fetch_mime_types(self._file_types)
@property
@abstractmethod
def _scopes(self) -> List[str]:
"""Return required scopes."""
def _load_from_folder(self, folder: Folder) -> Iterable[Blob]:
"""Lazily load all files from a specified folder of the configured MIME type.
Args:
folder: The Folder instance from which the files are to be loaded. This
Folder instance should represent a directory in a file system where the
files are stored.
Yields:
An iterator that yields Blob instances, which are binary representations of
the files loaded from the folder.
"""
file_mime_types = self._fetch_mime_types
items = folder.get_items()
with tempfile.TemporaryDirectory() as temp_dir:
os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
for file in items:
if file.is_file:
if file.mime_type in list(file_mime_types.values()):
file.download(to_path=temp_dir, chunk_size=self.chunk_size)
loader = FileSystemBlobLoader(path=temp_dir)
yield from loader.yield_blobs()
def _load_from_object_ids(
self, drive: Drive, object_ids: List[str]
) -> Iterable[Blob]:
"""Lazily load files specified by their object_ids from a drive.
Load files into the system as binary large objects (Blobs) and return Iterable.
Args:
drive: The Drive instance from which the files are to be loaded. This Drive
instance should represent a cloud storage service or similar storage
system where the files are stored.
object_ids: A list of object_id strings. Each object_id represents a unique
identifier for a file in the drive.
Yields:
An iterator that yields Blob instances, which are binary representations of
the files loaded from the drive using the specified object_ids.
"""
file_mime_types = self._fetch_mime_types
with tempfile.TemporaryDirectory() as temp_dir:
for object_id in object_ids:
file = drive.get_item(object_id)
if not file:
logging.warning(
"There isn't a file with"
f"object_id {object_id} in drive {drive}."
)
continue
if file.is_file:
if file.mime_type in list(file_mime_types.values()):
file.download(to_path=temp_dir, chunk_size=self.chunk_size)
loader = FileSystemBlobLoader(path=temp_dir)
yield from loader.yield_blobs()
def _auth(self) -> Account:
"""Authenticates the OneDrive API client
Returns:
The authenticated Account object.
"""
try:
from O365 import Account, FileSystemTokenBackend
except ImportError:
raise ImportError(
"O365 package not found, please install it with `pip install o365`"
)
if self.auth_with_token:
token_storage = _O365TokenStorage()
token_path = token_storage.token_path
token_backend = FileSystemTokenBackend(
token_path=token_path.parent, token_filename=token_path.name
)
account = Account(
credentials=(
self.settings.client_id,
self.settings.client_secret.get_secret_value(),
),
scopes=self._scopes,
token_backend=token_backend,
**{"raise_http_errors": False},
)
else:
token_backend = FileSystemTokenBackend(
token_path=Path.home() / ".credentials"
)
account = Account(
credentials=(
self.settings.client_id,
self.settings.client_secret.get_secret_value(),
),
scopes=self._scopes,
token_backend=token_backend,
**{"raise_http_errors": False},
)
# make the auth
account.authenticate()
return account

View File

@ -2,129 +2,49 @@
from __future__ import annotations
import logging
import os
import tempfile
from enum import Enum
from pathlib import Path
from typing import TYPE_CHECKING, Dict, List, Optional, Type, Union
from typing import TYPE_CHECKING, Iterator, List, Optional, Sequence, Union
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.onedrive_file import OneDriveFileLoader
from langchain.pydantic_v1 import BaseModel, BaseSettings, Field, FilePath, SecretStr
from langchain.document_loaders.base_o365 import (
O365BaseLoader,
_FileType,
)
from langchain.document_loaders.parsers.registry import get_parser
from langchain.pydantic_v1 import Field
if TYPE_CHECKING:
from O365 import Account
from O365.drive import Drive, Folder
SCOPES = ["offline_access", "Files.Read.All"]
logger = logging.getLogger(__name__)
class _OneDriveSettings(BaseSettings):
client_id: str = Field(..., env="O365_CLIENT_ID")
client_secret: SecretStr = Field(..., env="O365_CLIENT_SECRET")
class Config:
env_prefix = ""
case_sentive = False
env_file = ".env"
class _OneDriveTokenStorage(BaseSettings):
token_path: FilePath = Field(Path.home() / ".credentials" / "o365_token.txt")
class _FileType(str, Enum):
DOC = "doc"
DOCX = "docx"
PDF = "pdf"
class _SupportedFileTypes(BaseModel):
file_types: List[_FileType]
def fetch_mime_types(self) -> Dict[str, str]:
mime_types_mapping = {}
for file_type in self.file_types:
if file_type.value == "doc":
mime_types_mapping[file_type.value] = "application/msword"
elif file_type.value == "docx":
mime_types_mapping[
file_type.value
] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document" # noqa: E501
elif file_type.value == "pdf":
mime_types_mapping[file_type.value] = "application/pdf"
return mime_types_mapping
class OneDriveLoader(BaseLoader, BaseModel):
class OneDriveLoader(O365BaseLoader):
"""Load from `Microsoft OneDrive`."""
settings: _OneDriveSettings = Field(default_factory=_OneDriveSettings)
""" The settings for the OneDrive API client."""
drive_id: str = Field(...)
""" The ID of the OneDrive drive to load data from."""
folder_path: Optional[str] = None
""" The path to the folder to load data from."""
object_ids: Optional[List[str]] = None
""" The IDs of the objects to load data from."""
auth_with_token: bool = False
""" Whether to authenticate with a token or not. Defaults to False."""
def _auth(self) -> Type[Account]:
"""
Authenticates the OneDrive API client using the specified
authentication method and returns the Account object.
@property
def _file_types(self) -> Sequence[_FileType]:
"""Return supported file types."""
return _FileType.DOC, _FileType.DOCX, _FileType.PDF
Returns:
Type[Account]: The authenticated Account object.
"""
try:
from O365 import FileSystemTokenBackend
except ImportError:
raise ImportError(
"O365 package not found, please install it with `pip install o365`"
)
if self.auth_with_token:
token_storage = _OneDriveTokenStorage()
token_path = token_storage.token_path
token_backend = FileSystemTokenBackend(
token_path=token_path.parent, token_filename=token_path.name
)
account = Account(
credentials=(
self.settings.client_id,
self.settings.client_secret.get_secret_value(),
),
scopes=SCOPES,
token_backend=token_backend,
**{"raise_http_errors": False},
)
else:
token_backend = FileSystemTokenBackend(
token_path=Path.home() / ".credentials"
)
account = Account(
credentials=(
self.settings.client_id,
self.settings.client_secret.get_secret_value(),
),
scopes=SCOPES,
token_backend=token_backend,
**{"raise_http_errors": False},
)
# make the auth
account.authenticate()
return account
@property
def _scopes(self) -> List[str]:
"""Return required scopes."""
return ["offline_access", "Files.Read.All"]
def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]:
def _get_folder_from_path(self, drive: Drive) -> Union[Folder, Drive]:
"""
Returns the folder or drive object located at the
specified path relative to the given drive.
Args:
drive (Type[Drive]): The root drive from which the folder path is relative.
drive (Drive): The root drive from which the folder path is relative.
Returns:
Union[Folder, Drive]: The folder or drive object
@ -151,90 +71,26 @@ class OneDriveLoader(BaseLoader, BaseModel):
raise FileNotFoundError("Path {} not exist.".format(self.folder_path))
return subfolder_drive
def _load_from_folder(self, folder: Type[Folder]) -> List[Document]:
"""
Loads all supported document files from the specified folder
and returns a list of Document objects.
Args:
folder (Type[Folder]): The folder object to load the documents from.
Returns:
List[Document]: A list of Document objects representing
the loaded documents.
"""
docs = []
file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
file_mime_types = file_types.fetch_mime_types()
items = folder.get_items()
with tempfile.TemporaryDirectory() as temp_dir:
file_path = f"{temp_dir}"
os.makedirs(os.path.dirname(file_path), exist_ok=True)
for file in items:
if file.is_file:
if file.mime_type in list(file_mime_types.values()):
loader = OneDriveFileLoader(file=file)
docs.extend(loader.load())
return docs
def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]:
"""
Loads all supported document files from the specified OneDrive
drive based on their object IDs and returns a list
of Document objects.
Args:
drive (Type[Drive]): The OneDrive drive object
to load the documents from.
Returns:
List[Document]: A list of Document objects representing
the loaded documents.
"""
docs = []
file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
file_mime_types = file_types.fetch_mime_types()
with tempfile.TemporaryDirectory() as temp_dir:
file_path = f"{temp_dir}"
os.makedirs(os.path.dirname(file_path), exist_ok=True)
for object_id in self.object_ids if self.object_ids else [""]:
file = drive.get_item(object_id)
if not file:
logger.warning(
"There isn't a file with "
f"object_id {object_id} in drive {drive}."
def lazy_load(self) -> Iterator[Document]:
"""Load documents lazily. Use this when working at a large scale."""
try:
from O365.drive import Drive
except ImportError:
raise ImportError(
"O365 package not found, please install it with `pip install o365`"
)
continue
if file.is_file:
if file.mime_type in list(file_mime_types.values()):
loader = OneDriveFileLoader(file=file)
docs.extend(loader.load())
return docs
drive = self._auth().storage().get_drive(self.drive_id)
if not isinstance(drive, Drive):
raise ValueError(f"There isn't a Drive with id {self.drive_id}.")
blob_parser = get_parser("default")
if self.folder_path:
folder = self._get_folder_from_path(drive)
for blob in self._load_from_folder(folder):
yield from blob_parser.lazy_parse(blob)
if self.object_ids:
for blob in self._load_from_object_ids(drive, self.object_ids):
yield from blob_parser.lazy_parse(blob)
def load(self) -> List[Document]:
"""
Loads all supported document files from the specified OneDrive drive
and return a list of Document objects.
Returns:
List[Document]: A list of Document objects
representing the loaded documents.
Raises:
ValueError: If the specified drive ID
does not correspond to a drive in the OneDrive storage.
"""
account = self._auth()
storage = account.storage()
drive = storage.get_drive(self.drive_id)
docs: List[Document] = []
if not drive:
raise ValueError(f"There isn't a drive with id {self.drive_id}.")
if self.folder_path:
folder = self._get_folder_from_path(drive=drive)
docs.extend(self._load_from_folder(folder=folder))
elif self.object_ids:
docs.extend(self._load_from_object_ids(drive=drive))
return docs
"""Load all documents."""
return list(self.lazy_load())

View File

@ -0,0 +1,34 @@
from typing import Iterator
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document
class MsWordParser(BaseBlobParser):
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
try:
from unstructured.partition.doc import partition_doc
from unstructured.partition.docx import partition_docx
except ImportError as e:
raise ImportError(
"Could not import unstructured, please install with `pip install "
"unstructured`."
) from e
mime_type_parser = {
"application/msword": partition_doc,
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
partition_docx
),
}
if blob.mimetype not in (
"application/msword",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
):
raise ValueError("This blob type is not supported for this parser.")
with blob.as_bytes_io() as word_document:
elements = mime_type_parser[blob.mimetype](file=word_document)
text = "\n\n".join([str(el) for el in elements])
metadata = {"source": blob.source}
yield Document(page_content=text, metadata=metadata)

View File

@ -1,6 +1,7 @@
"""Module includes a registry of default parser configurations."""
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain.document_loaders.parsers.msword import MsWordParser
from langchain.document_loaders.parsers.pdf import PyMuPDFParser
from langchain.document_loaders.parsers.txt import TextParser
@ -11,6 +12,10 @@ def _get_default_parser() -> BaseBlobParser:
handlers={
"application/pdf": PyMuPDFParser(),
"text/plain": TextParser(),
"application/msword": MsWordParser(),
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
MsWordParser()
),
},
fallback_parser=None,
)

View File

@ -0,0 +1,59 @@
"""Loader that loads data from Sharepoint Document Library"""
from __future__ import annotations
from typing import Iterator, List, Optional, Sequence
from langchain.docstore.document import Document
from langchain.document_loaders.base_o365 import (
O365BaseLoader,
_FileType,
)
from langchain.document_loaders.parsers.registry import get_parser
from langchain.pydantic_v1 import Field
class SharePointLoader(O365BaseLoader):
"""Load from `SharePoint`."""
document_library_id: str = Field(...)
""" The ID of the SharePoint document library to load data from."""
folder_path: Optional[str] = None
""" The path to the folder to load data from."""
object_ids: Optional[List[str]] = None
""" The IDs of the objects to load data from."""
@property
def _file_types(self) -> Sequence[_FileType]:
"""Return supported file types."""
return _FileType.DOC, _FileType.DOCX, _FileType.PDF
@property
def _scopes(self) -> List[str]:
"""Return required scopes."""
return ["sharepoint", "basic"]
def lazy_load(self) -> Iterator[Document]:
"""Load documents lazily. Use this when working at a large scale."""
try:
from O365.drive import Drive, Folder
except ImportError:
raise ImportError(
"O365 package not found, please install it with `pip install o365`"
)
drive = self._auth().storage().get_drive(self.document_library_id)
if not isinstance(drive, Drive):
raise ValueError(f"There isn't a Drive with id {self.document_library_id}.")
blob_parser = get_parser("default")
if self.folder_path:
target_folder = drive.get_item_by_path(self.folder_path)
if not isinstance(target_folder, Folder):
raise ValueError(f"There isn't a folder with path {self.folder_path}.")
for blob in self._load_from_folder(target_folder):
yield from blob_parser.lazy_parse(blob)
if self.object_ids:
for blob in self._load_from_object_ids(drive, self.object_ids):
yield from blob_parser.lazy_parse(blob)
def load(self) -> List[Document]:
"""Load all documents."""
return list(self.lazy_load())