mirror of
https://github.com/hwchase17/langchain
synced 2024-11-08 07:10:35 +00:00
Add SharePoint Loader (#4284)
- Added a loader (`SharePointLoader`) that can pull documents (`pdf`, `docx`, `doc`) from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872).
- Added a base loader (`O365BaseLoader`) to be used by all loaders that use the [O365](https://github.com/O365/python-o365) package.
- Refactored `OneDriveLoader` to use the new `O365BaseLoader`.

Co-authored-by: Bagatur <baskaryan@gmail.com>
parent bb4f7936f9
commit f116e10d53
@ -0,0 +1,105 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Microsoft SharePoint\n",
    "\n",
    "> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system, developed by Microsoft, that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together.\n",
    "\n",
    "This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only `docx`, `doc`, and `pdf` files are supported.\n",
    "\n",
    "## Prerequisites\n",
    "1. Register an application following the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
    "2. When registration finishes, the Azure portal displays the app registration's Overview pane, where you can see the Application (client) ID. Also called the `client ID`, this value uniquely identifies your application in the Microsoft identity platform.\n",
    "3. While following the steps in **item 1**, set the redirect URI to `https://login.microsoftonline.com/common/oauth2/nativeclient`.\n",
    "4. While following the steps in **item 1**, generate a new password (`client_secret`) under the Application Secrets section.\n",
    "5. Follow the instructions in this [document](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-configure-app-expose-web-apis#add-a-scope) to add the following `SCOPES` (`offline_access` and `Sites.Read.All`) to your application.\n",
    "6. To retrieve files from your **Document Library**, you will need its ID. To obtain it, you will need the values of `Tenant Name`, `Collection ID`, and `Subsite ID`.\n",
    "7. To find your `Tenant Name`, follow the instructions in this [document](https://learn.microsoft.com/en-us/azure/active-directory-b2c/tenant-management-read-tenant-name). Once you have it, remove `.onmicrosoft.com` from the value and keep the rest as your `Tenant Name`.\n",
    "8. To obtain your `Collection ID` and `Subsite ID`, you will need your **SharePoint** `site-name`. Your SharePoint site URL has the following format: `https://<tenant-name>.sharepoint.com/sites/<site-name>`. The last part of this URL is the `site-name`.\n",
    "9. To get the Site `Collection ID`, open this URL in your browser: `https://<tenant-name>.sharepoint.com/sites/<site-name>/_api/site/id` and copy the value of the `Edm.Guid` property.\n",
    "10. To get the `Subsite ID` (or web ID), use `https://<tenant-name>.sharepoint.com/sites/<site-name>/_api/web/id` and copy the value of the `Edm.Guid` property.\n",
    "11. The `SharePoint site ID` has the following format: `<tenant-name>.sharepoint.com,<Collection ID>,<Subsite ID>`. Keep that value to use in the next step.\n",
    "12. Visit the [Graph Explorer Playground](https://developer.microsoft.com/en-us/graph/graph-explorer) to obtain your `Document Library ID`. First make sure you are logged in with the account associated with your **SharePoint** site, then make a request to `https://graph.microsoft.com/v1.0/sites/<SharePoint site ID>/drive`; the response payload contains a field `id` that holds your `Document Library ID`.\n",
    "\n",
    "## 🧑 Instructions for ingesting your documents from SharePoint Document Library\n",
    "\n",
    "### 🔑 Authentication\n",
    "\n",
    "By default, the `SharePointLoader` expects the values of `CLIENT_ID` and `CLIENT_SECRET` to be stored as environment variables named `O365_CLIENT_ID` and `O365_CLIENT_SECRET` respectively. You can pass those environment variables through a `.env` file at the root of your application or set them with the following commands in your script.\n",
    "\n",
    "```python\n",
    "os.environ['O365_CLIENT_ID'] = \"YOUR CLIENT ID\"\n",
    "os.environ['O365_CLIENT_SECRET'] = \"YOUR CLIENT SECRET\"\n",
    "```\n",
    "\n",
    "This loader uses an authentication flow called [*on behalf of a user*](https://learn.microsoft.com/en-us/graph/auth-v2-user?context=graph%2Fapi%2F1.0&view=graph-rest-1.0). It is a two-step authentication with user consent. When you instantiate the loader, it will print a URL that the user must visit to grant the application the required permissions. The user must then copy the URL of the resulting page and paste it back into the console. The method will then return `True` if the login attempt was successful.\n",
    "\n",
    "```python\n",
    "from langchain.document_loaders.sharepoint import SharePointLoader\n",
    "\n",
    "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\")\n",
    "```\n",
    "\n",
    "Once the authentication has been done, the loader will store a token (`o365_token.txt`) in the `~/.credentials/` folder. This token can be reused later to authenticate without the copy/paste steps described above. To use it, set the `auth_with_token` parameter to `True` when instantiating the loader.\n",
    "\n",
    "```python\n",
    "from langchain.document_loaders.sharepoint import SharePointLoader\n",
    "\n",
    "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", auth_with_token=True)\n",
    "```\n",
    "\n",
    "### 🗂️ Documents loader\n",
    "\n",
    "#### 📑 Loading documents from a Document Library Directory\n",
    "\n",
    "`SharePointLoader` can load documents from a specific folder within your Document Library. For instance, you may want to load all documents stored in the `Documents/marketing` folder within your Document Library.\n",
    "\n",
    "```python\n",
    "from langchain.document_loaders.sharepoint import SharePointLoader\n",
    "\n",
    "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", folder_path=\"Documents/marketing\", auth_with_token=True)\n",
    "documents = loader.load()\n",
    "```\n",
    "\n",
    "#### 📑 Loading documents from a list of document IDs\n",
    "\n",
    "Another possibility is to provide a list of `object_id`s, one for each document you want to load. For that, you will need to query the [Microsoft Graph API](https://developer.microsoft.com/en-us/graph/graph-explorer) to find the IDs of all the documents you are interested in. This [link](https://learn.microsoft.com/en-us/graph/api/resources/onedrive?view=graph-rest-1.0#commonly-accessed-resources) provides a list of endpoints that are helpful for retrieving document IDs.\n",
    "\n",
    "For instance, to retrieve information about all objects stored in the `data/finance/` folder, make a request to `https://graph.microsoft.com/v1.0/drives/<document-library-id>/root:/data/finance:/children`. Once you have the list of IDs you are interested in, you can instantiate the loader with the following parameters.\n",
    "\n",
    "```python\n",
    "from langchain.document_loaders.sharepoint import SharePointLoader\n",
    "\n",
    "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
    "documents = loader.load()\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
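Prerequisite step 12 above boils down to a single authenticated GET request against Microsoft Graph. A minimal sketch of that lookup in plain Python — the function names are illustrative, not part of langchain, and a valid Graph `access_token` is assumed:

```python
GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_drive_url(site_id: str) -> str:
    """Compose the Graph endpoint that returns the default document library."""
    return f"{GRAPH_BASE}/sites/{site_id}/drive"


def fetch_document_library_id(site_id: str, access_token: str) -> str:
    """Call Microsoft Graph and return the `id` field of the drive payload."""
    import requests  # third-party: pip install requests

    response = requests.get(
        build_drive_url(site_id),
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["id"]
```

The `id` value returned here is what the notebook passes as `document_library_id`.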
@ -147,6 +147,7 @@ from langchain.document_loaders.rst import UnstructuredRSTLoader
 from langchain.document_loaders.rtf import UnstructuredRTFLoader
 from langchain.document_loaders.s3_directory import S3DirectoryLoader
 from langchain.document_loaders.s3_file import S3FileLoader
+from langchain.document_loaders.sharepoint import SharePointLoader
 from langchain.document_loaders.sitemap import SitemapLoader
 from langchain.document_loaders.slack_directory import SlackDirectoryLoader
 from langchain.document_loaders.snowflake_loader import SnowflakeLoader
@ -316,6 +317,7 @@ __all__ = [
     "S3FileLoader",
     "SRTLoader",
     "SeleniumURLLoader",
+    "SharePointLoader",
     "SitemapLoader",
     "SlackDirectoryLoader",
     "SnowflakeLoader",
182 libs/langchain/langchain/document_loaders/base_o365.py Normal file
@ -0,0 +1,182 @@
"""Base class for all loaders that use the O365 package."""
from __future__ import annotations

import logging
import os
import tempfile
from abc import abstractmethod
from enum import Enum
from pathlib import Path
from typing import TYPE_CHECKING, Dict, Iterable, List, Sequence, Union

from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.blob_loaders.file_system import FileSystemBlobLoader
from langchain.document_loaders.blob_loaders.schema import Blob
from langchain.pydantic_v1 import BaseModel, BaseSettings, Field, FilePath, SecretStr

if TYPE_CHECKING:
    from O365 import Account
    from O365.drive import Drive, Folder

logger = logging.getLogger(__name__)

CHUNK_SIZE = 1024 * 1024 * 5


class _O365Settings(BaseSettings):
    client_id: str = Field(..., env="O365_CLIENT_ID")
    client_secret: SecretStr = Field(..., env="O365_CLIENT_SECRET")

    class Config:
        env_prefix = ""
        case_sensitive = False
        env_file = ".env"


class _O365TokenStorage(BaseSettings):
    token_path: FilePath = Path.home() / ".credentials" / "o365_token.txt"


class _FileType(str, Enum):
    DOC = "doc"
    DOCX = "docx"
    PDF = "pdf"


def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
    mime_types_mapping = {}
    for file_type in file_types:
        if file_type.value == "doc":
            mime_types_mapping[file_type.value] = "application/msword"
        elif file_type.value == "docx":
            mime_types_mapping[
                file_type.value
            ] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"  # noqa: E501
        elif file_type.value == "pdf":
            mime_types_mapping[file_type.value] = "application/pdf"
    return mime_types_mapping


class O365BaseLoader(BaseLoader, BaseModel):
    settings: _O365Settings = Field(default_factory=_O365Settings)
    """Settings for the Office365 API client."""
    auth_with_token: bool = False
    """Whether to authenticate with a token or not. Defaults to False."""
    chunk_size: Union[int, str] = CHUNK_SIZE
    """Number of bytes to retrieve from each api call to the server. int or 'auto'."""

    @property
    @abstractmethod
    def _file_types(self) -> Sequence[_FileType]:
        """Return supported file types."""

    @property
    def _fetch_mime_types(self) -> Dict[str, str]:
        """Return a dict of supported file types to corresponding mime types."""
        return fetch_mime_types(self._file_types)

    @property
    @abstractmethod
    def _scopes(self) -> List[str]:
        """Return required scopes."""

    def _load_from_folder(self, folder: Folder) -> Iterable[Blob]:
        """Lazily load all files from a specified folder of the configured MIME type.

        Args:
            folder: The Folder instance from which the files are to be loaded. This
                Folder instance should represent a directory in a file system where the
                files are stored.

        Yields:
            An iterator that yields Blob instances, which are binary representations of
            the files loaded from the folder.
        """
        file_mime_types = self._fetch_mime_types
        items = folder.get_items()
        with tempfile.TemporaryDirectory() as temp_dir:
            os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
            for file in items:
                if file.is_file:
                    if file.mime_type in list(file_mime_types.values()):
                        file.download(to_path=temp_dir, chunk_size=self.chunk_size)
            loader = FileSystemBlobLoader(path=temp_dir)
            yield from loader.yield_blobs()

    def _load_from_object_ids(
        self, drive: Drive, object_ids: List[str]
    ) -> Iterable[Blob]:
        """Lazily load files specified by their object_ids from a drive.

        Load files into the system as binary large objects (Blobs) and return Iterable.

        Args:
            drive: The Drive instance from which the files are to be loaded. This Drive
                instance should represent a cloud storage service or similar storage
                system where the files are stored.
            object_ids: A list of object_id strings. Each object_id represents a unique
                identifier for a file in the drive.

        Yields:
            An iterator that yields Blob instances, which are binary representations of
            the files loaded from the drive using the specified object_ids.
        """
        file_mime_types = self._fetch_mime_types
        with tempfile.TemporaryDirectory() as temp_dir:
            for object_id in object_ids:
                file = drive.get_item(object_id)
                if not file:
                    logger.warning(
                        "There isn't a file with "
                        f"object_id {object_id} in drive {drive}."
                    )
                    continue
                if file.is_file:
                    if file.mime_type in list(file_mime_types.values()):
                        file.download(to_path=temp_dir, chunk_size=self.chunk_size)
            loader = FileSystemBlobLoader(path=temp_dir)
            yield from loader.yield_blobs()

    def _auth(self) -> Account:
        """Authenticate the O365 API client.

        Returns:
            The authenticated Account object.
        """
        try:
            from O365 import Account, FileSystemTokenBackend
        except ImportError:
            raise ImportError(
                "O365 package not found, please install it with `pip install o365`"
            )
        if self.auth_with_token:
            token_storage = _O365TokenStorage()
            token_path = token_storage.token_path
            token_backend = FileSystemTokenBackend(
                token_path=token_path.parent, token_filename=token_path.name
            )
            account = Account(
                credentials=(
                    self.settings.client_id,
                    self.settings.client_secret.get_secret_value(),
                ),
                scopes=self._scopes,
                token_backend=token_backend,
                **{"raise_http_errors": False},
            )
        else:
            token_backend = FileSystemTokenBackend(
                token_path=Path.home() / ".credentials"
            )
            account = Account(
                credentials=(
                    self.settings.client_id,
                    self.settings.client_secret.get_secret_value(),
                ),
                scopes=self._scopes,
                token_backend=token_backend,
                **{"raise_http_errors": False},
            )
        # make the auth
        account.authenticate()
        return account
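The `fetch_mime_types` helper above is a pure mapping from file extensions to MIME types and can be exercised in isolation. A self-contained sketch of the same logic, with the langchain-specific pieces stripped out (`FileType` here is a stand-in for the private `_FileType` enum):

```python
from enum import Enum
from typing import Dict, Sequence


class FileType(str, Enum):
    DOC = "doc"
    DOCX = "docx"
    PDF = "pdf"


DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def fetch_mime_types(file_types: Sequence[FileType]) -> Dict[str, str]:
    """Map each requested extension to its MIME type."""
    mapping = {
        FileType.DOC: "application/msword",
        FileType.DOCX: DOCX_MIME,
        FileType.PDF: "application/pdf",
    }
    # Only the requested file types end up in the result,
    # mirroring how the loaders restrict which files get downloaded.
    return {ft.value: mapping[ft] for ft in file_types}
```

The returned dict's values are compared against `file.mime_type` when deciding whether to download a file.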
@ -2,129 +2,49 @@
 from __future__ import annotations

 import logging
-import os
-import tempfile
-from enum import Enum
-from pathlib import Path
-from typing import TYPE_CHECKING, Dict, List, Optional, Type, Union
+from typing import TYPE_CHECKING, Iterator, List, Optional, Sequence, Union

 from langchain.docstore.document import Document
-from langchain.document_loaders.base import BaseLoader
-from langchain.document_loaders.onedrive_file import OneDriveFileLoader
-from langchain.pydantic_v1 import BaseModel, BaseSettings, Field, FilePath, SecretStr
+from langchain.document_loaders.base_o365 import (
+    O365BaseLoader,
+    _FileType,
+)
+from langchain.document_loaders.parsers.registry import get_parser
+from langchain.pydantic_v1 import Field

 if TYPE_CHECKING:
-    from O365 import Account
     from O365.drive import Drive, Folder

-SCOPES = ["offline_access", "Files.Read.All"]
 logger = logging.getLogger(__name__)


-class _OneDriveSettings(BaseSettings):
-    client_id: str = Field(..., env="O365_CLIENT_ID")
-    client_secret: SecretStr = Field(..., env="O365_CLIENT_SECRET")
-
-    class Config:
-        env_prefix = ""
-        case_sentive = False
-        env_file = ".env"
-
-
-class _OneDriveTokenStorage(BaseSettings):
-    token_path: FilePath = Field(Path.home() / ".credentials" / "o365_token.txt")
-
-
-class _FileType(str, Enum):
-    DOC = "doc"
-    DOCX = "docx"
-    PDF = "pdf"
-
-
-class _SupportedFileTypes(BaseModel):
-    file_types: List[_FileType]
-
-    def fetch_mime_types(self) -> Dict[str, str]:
-        mime_types_mapping = {}
-        for file_type in self.file_types:
-            if file_type.value == "doc":
-                mime_types_mapping[file_type.value] = "application/msword"
-            elif file_type.value == "docx":
-                mime_types_mapping[
-                    file_type.value
-                ] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"  # noqa: E501
-            elif file_type.value == "pdf":
-                mime_types_mapping[file_type.value] = "application/pdf"
-        return mime_types_mapping
-
-
-class OneDriveLoader(BaseLoader, BaseModel):
+class OneDriveLoader(O365BaseLoader):
     """Load from `Microsoft OneDrive`."""

-    settings: _OneDriveSettings = Field(default_factory=_OneDriveSettings)
-    """ The settings for the OneDrive API client."""
     drive_id: str = Field(...)
     """ The ID of the OneDrive drive to load data from."""
     folder_path: Optional[str] = None
     """ The path to the folder to load data from."""
     object_ids: Optional[List[str]] = None
     """ The IDs of the objects to load data from."""
-    auth_with_token: bool = False
-    """ Whether to authenticate with a token or not. Defaults to False."""

-    def _auth(self) -> Type[Account]:
-        """
-        Authenticates the OneDrive API client using the specified
-        authentication method and returns the Account object.
+    @property
+    def _file_types(self) -> Sequence[_FileType]:
+        """Return supported file types."""
+        return _FileType.DOC, _FileType.DOCX, _FileType.PDF

-        Returns:
-            Type[Account]: The authenticated Account object.
-        """
-        try:
-            from O365 import FileSystemTokenBackend
-        except ImportError:
-            raise ImportError(
-                "O365 package not found, please install it with `pip install o365`"
-            )
-        if self.auth_with_token:
-            token_storage = _OneDriveTokenStorage()
-            token_path = token_storage.token_path
-            token_backend = FileSystemTokenBackend(
-                token_path=token_path.parent, token_filename=token_path.name
-            )
-            account = Account(
-                credentials=(
-                    self.settings.client_id,
-                    self.settings.client_secret.get_secret_value(),
-                ),
-                scopes=SCOPES,
-                token_backend=token_backend,
-                **{"raise_http_errors": False},
-            )
-        else:
-            token_backend = FileSystemTokenBackend(
-                token_path=Path.home() / ".credentials"
-            )
-            account = Account(
-                credentials=(
-                    self.settings.client_id,
-                    self.settings.client_secret.get_secret_value(),
-                ),
-                scopes=SCOPES,
-                token_backend=token_backend,
-                **{"raise_http_errors": False},
-            )
-        # make the auth
-        account.authenticate()
-        return account
+    @property
+    def _scopes(self) -> List[str]:
+        """Return required scopes."""
+        return ["offline_access", "Files.Read.All"]

-    def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]:
+    def _get_folder_from_path(self, drive: Drive) -> Union[Folder, Drive]:
         """
         Returns the folder or drive object located at the
         specified path relative to the given drive.

         Args:
-            drive (Type[Drive]): The root drive from which the folder path is relative.
+            drive (Drive): The root drive from which the folder path is relative.

         Returns:
             Union[Folder, Drive]: The folder or drive object
@ -151,90 +71,26 @@ class OneDriveLoader(BaseLoader, BaseModel):
         raise FileNotFoundError("Path {} not exist.".format(self.folder_path))
         return subfolder_drive

-    def _load_from_folder(self, folder: Type[Folder]) -> List[Document]:
-        """
-        Loads all supported document files from the specified folder
-        and returns a list of Document objects.
-
-        Args:
-            folder (Type[Folder]): The folder object to load the documents from.
-
-        Returns:
-            List[Document]: A list of Document objects representing
-            the loaded documents.
-
-        """
-        docs = []
-        file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
-        file_mime_types = file_types.fetch_mime_types()
-        items = folder.get_items()
-        with tempfile.TemporaryDirectory() as temp_dir:
-            file_path = f"{temp_dir}"
-            os.makedirs(os.path.dirname(file_path), exist_ok=True)
-            for file in items:
-                if file.is_file:
-                    if file.mime_type in list(file_mime_types.values()):
-                        loader = OneDriveFileLoader(file=file)
-                        docs.extend(loader.load())
-        return docs
-
-    def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]:
-        """
-        Loads all supported document files from the specified OneDrive
-        drive based on their object IDs and returns a list
-        of Document objects.
-
-        Args:
-            drive (Type[Drive]): The OneDrive drive object
-            to load the documents from.
-
-        Returns:
-            List[Document]: A list of Document objects representing
-            the loaded documents.
-        """
-        docs = []
-        file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
-        file_mime_types = file_types.fetch_mime_types()
-        with tempfile.TemporaryDirectory() as temp_dir:
-            file_path = f"{temp_dir}"
-            os.makedirs(os.path.dirname(file_path), exist_ok=True)
-
-            for object_id in self.object_ids if self.object_ids else [""]:
-                file = drive.get_item(object_id)
-                if not file:
-                    logger.warning(
-                        "There isn't a file with "
-                        f"object_id {object_id} in drive {drive}."
+    def lazy_load(self) -> Iterator[Document]:
+        """Load documents lazily. Use this when working at a large scale."""
+        try:
+            from O365.drive import Drive
+        except ImportError:
+            raise ImportError(
+                "O365 package not found, please install it with `pip install o365`"
+            )
-                    )
-                    continue
-                if file.is_file:
-                    if file.mime_type in list(file_mime_types.values()):
-                        loader = OneDriveFileLoader(file=file)
-                        docs.extend(loader.load())
-        return docs
+        drive = self._auth().storage().get_drive(self.drive_id)
+        if not isinstance(drive, Drive):
+            raise ValueError(f"There isn't a Drive with id {self.drive_id}.")
+        blob_parser = get_parser("default")
+        if self.folder_path:
+            folder = self._get_folder_from_path(drive)
+            for blob in self._load_from_folder(folder):
+                yield from blob_parser.lazy_parse(blob)
+        if self.object_ids:
+            for blob in self._load_from_object_ids(drive, self.object_ids):
+                yield from blob_parser.lazy_parse(blob)

     def load(self) -> List[Document]:
-        """
-        Loads all supported document files from the specified OneDrive drive
-        and return a list of Document objects.
-
-        Returns:
-            List[Document]: A list of Document objects
-            representing the loaded documents.
-
-        Raises:
-            ValueError: If the specified drive ID
-            does not correspond to a drive in the OneDrive storage.
-        """
-        account = self._auth()
-        storage = account.storage()
-        drive = storage.get_drive(self.drive_id)
-        docs: List[Document] = []
-        if not drive:
-            raise ValueError(f"There isn't a drive with id {self.drive_id}.")
-        if self.folder_path:
-            folder = self._get_folder_from_path(drive=drive)
-            docs.extend(self._load_from_folder(folder=folder))
-        elif self.object_ids:
-            docs.extend(self._load_from_object_ids(drive=drive))
-        return docs
+        """Load all documents."""
+        return list(self.lazy_load())
34 libs/langchain/langchain/document_loaders/parsers/msword.py Normal file
@ -0,0 +1,34 @@
|
||||
from typing import Iterator
|
||||
|
||||
from langchain.document_loaders.base import BaseBlobParser
|
||||
from langchain.document_loaders.blob_loaders import Blob
|
||||
from langchain.schema import Document
|
||||
|
||||
|
||||
class MsWordParser(BaseBlobParser):
|
||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||
try:
|
||||
from unstructured.partition.doc import partition_doc
|
||||
from unstructured.partition.docx import partition_docx
|
||||
except ImportError as e:
|
||||
raise ImportError(
|
||||
"Could not import unstructured, please install with `pip install "
|
||||
"unstructured`."
|
||||
) from e
|
||||
|
||||
mime_type_parser = {
|
||||
"application/msword": partition_doc,
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
|
||||
partition_docx
|
||||
),
|
||||
}
|
||||
if blob.mimetype not in (
|
||||
"application/msword",
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
):
|
||||
raise ValueError("This blob type is not supported for this parser.")
|
||||
with blob.as_bytes_io() as word_document:
|
||||
elements = mime_type_parser[blob.mimetype](file=word_document)
|
||||
text = "\n\n".join([str(el) for el in elements])
|
||||
metadata = {"source": blob.source}
|
||||
yield Document(page_content=text, metadata=metadata)
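`MsWordParser` dispatches on `blob.mimetype` through a dict of partition functions. The same table-driven dispatch can be sketched without the `unstructured` dependency — the `parse_doc`/`parse_docx` functions below are stand-ins that only mimic the shape of the real partition functions:

```python
from typing import Callable, Dict

DOC_MIME = "application/msword"
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def parse_doc(data: bytes) -> str:
    # Stand-in for unstructured's partition_doc.
    return f"doc:{len(data)} bytes"


def parse_docx(data: bytes) -> str:
    # Stand-in for unstructured's partition_docx.
    return f"docx:{len(data)} bytes"


# One entry per supported MIME type; unsupported types are rejected up front.
MIME_TYPE_PARSER: Dict[str, Callable[[bytes], str]] = {
    DOC_MIME: parse_doc,
    DOCX_MIME: parse_docx,
}


def parse(mimetype: str, data: bytes) -> str:
    if mimetype not in MIME_TYPE_PARSER:
        raise ValueError("This blob type is not supported for this parser.")
    return MIME_TYPE_PARSER[mimetype](data)
```

Keeping the table and the membership check in sync is the main maintenance point of this pattern; the real parser repeats the two MIME strings in both places.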
@ -1,6 +1,7 @@
 """Module includes a registry of default parser configurations."""
 from langchain.document_loaders.base import BaseBlobParser
 from langchain.document_loaders.parsers.generic import MimeTypeBasedParser
+from langchain.document_loaders.parsers.msword import MsWordParser
 from langchain.document_loaders.parsers.pdf import PyMuPDFParser
 from langchain.document_loaders.parsers.txt import TextParser

@ -11,6 +12,10 @@ def _get_default_parser() -> BaseBlobParser:
         handlers={
             "application/pdf": PyMuPDFParser(),
             "text/plain": TextParser(),
+            "application/msword": MsWordParser(),
+            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
+                MsWordParser()
+            ),
         },
         fallback_parser=None,
     )
59 libs/langchain/langchain/document_loaders/sharepoint.py Normal file
@ -0,0 +1,59 @@
|
||||
"""Loader that loads data from Sharepoint Document Library"""
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Iterator, List, Optional, Sequence
|
||||
|
||||
from langchain.docstore.document import Document
|
||||
from langchain.document_loaders.base_o365 import (
|
||||
O365BaseLoader,
|
||||
_FileType,
|
||||
)
|
||||
from langchain.document_loaders.parsers.registry import get_parser
|
||||
from langchain.pydantic_v1 import Field
|
||||
|
||||
|
||||
class SharePointLoader(O365BaseLoader):
|
||||
"""Load from `SharePoint`."""
|
||||
|
||||
document_library_id: str = Field(...)
|
||||
""" The ID of the SharePoint document library to load data from."""
|
||||
folder_path: Optional[str] = None
|
||||
""" The path to the folder to load data from."""
|
||||
object_ids: Optional[List[str]] = None
|
||||
""" The IDs of the objects to load data from."""
|
||||
|
||||
@property
|
||||
def _file_types(self) -> Sequence[_FileType]:
|
||||
"""Return supported file types."""
|
||||
return _FileType.DOC, _FileType.DOCX, _FileType.PDF
|
||||
|
||||
@property
|
||||
def _scopes(self) -> List[str]:
|
||||
"""Return required scopes."""
|
||||
return ["sharepoint", "basic"]
|
||||
|
||||
def lazy_load(self) -> Iterator[Document]:
|
||||
"""Load documents lazily. Use this when working at a large scale."""
|
||||
try:
|
||||
from O365.drive import Drive, Folder
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"O365 package not found, please install it with `pip install o365`"
|
||||
)
|
||||
drive = self._auth().storage().get_drive(self.document_library_id)
|
||||
if not isinstance(drive, Drive):
|
||||
raise ValueError(f"There isn't a Drive with id {self.document_library_id}.")
|
||||
blob_parser = get_parser("default")
|
||||
if self.folder_path:
|
||||
target_folder = drive.get_item_by_path(self.folder_path)
|
||||
if not isinstance(target_folder, Folder):
|
||||
raise ValueError(f"There isn't a folder with path {self.folder_path}.")
|
||||
for blob in self._load_from_folder(target_folder):
|
||||
yield from blob_parser.lazy_parse(blob)
|
||||
if self.object_ids:
|
||||
for blob in self._load_from_object_ids(drive, self.object_ids):
|
||||
yield from blob_parser.lazy_parse(blob)
|
||||
|
||||
def load(self) -> List[Document]:
|
||||
"""Load all documents."""
|
||||
return list(self.lazy_load())
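`load` above is simply `list(self.lazy_load())`: the eager method materializes the lazy generator. A self-contained sketch of that split (the `FakeLoader` class is illustrative, not part of langchain):

```python
from typing import Iterator, List


class FakeLoader:
    """Illustrative loader: yields items one at a time, or all at once."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def lazy_load(self) -> Iterator[str]:
        # Yield documents one by one, so callers can stop early or
        # process a large library without holding everything in memory.
        for page in self.pages:
            yield page

    def load(self) -> List[str]:
        # Eager variant: materialize the whole generator.
        return list(self.lazy_load())
```

This is why the refactored loaders only implement `lazy_load` with real logic; `load` stays a one-liner.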