{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Microsoft OneDrive\n", "\n", ">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n", "\n", "This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n", "\n", "## Prerequisites\n", "1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n", "2. When registration finishes, the Azure portal displays the app registration's Overview pane. You see the Application (client) ID. Also called the `client ID`, this value uniquely identifies your application in the Microsoft identity platform.\n", "3. During the steps you will be following at **item 1**, you can set the redirect URI as `http://localhost:8000/callback`\n", "4. During the steps you will be following at **item 1**, generate a new password (`client_secret`) under Application Secrets section.\n", "5. Follow the instructions at this [document](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-configure-app-expose-web-apis#add-a-scope) to add the following `SCOPES` (`offline_access` and `Files.Read.All`) to your application.\n", "6. Visit the [Graph Explorer Playground](https://developer.microsoft.com/en-us/graph/graph-explorer) to obtain your `OneDrive ID`. The first step is to ensure you are logged in with the account associated your OneDrive account. Then you need to make a request to `https://graph.microsoft.com/v1.0/me/drive` and the response will return a payload with a field `id` that holds the ID of your OneDrive account.\n", "7. You need to install the o365 package using the command `pip install o365`.\n", "8. At the end of the steps you must have the following values: \n", "- `CLIENT_ID`\n", "- `CLIENT_SECRET`\n", "- `DRIVE_ID`\n", "\n", "## 🧑 Instructions for ingesting your documents from OneDrive\n", "\n", "### 🔑 Authentication\n", "\n", "By default, the `OneDriveLoader` expects that the values of `CLIENT_ID` and `CLIENT_SECRET` must be stored as environment variables named `O365_CLIENT_ID` and `O365_CLIENT_SECRET` respectively. You could pass those environment variables through a `.env` file at the root of your application or using the following command in your script.\n", "\n", "```python\n", "os.environ['O365_CLIENT_ID'] = \"YOUR CLIENT ID\"\n", "os.environ['O365_CLIENT_SECRET'] = \"YOUR CLIENT SECRET\"\n", "```\n", "\n", "This loader uses an authentication called [*on behalf of a user*](https://learn.microsoft.com/en-us/graph/auth-v2-user?context=graph%2Fapi%2F1.0&view=graph-rest-1.0). It is a 2 step authentication with user consent. When you instantiate the loader, it will call will print a url that the user must visit to give consent to the app on the required permissions. The user must then visit this url and give consent to the application. Then the user must copy the resulting page url and paste it back on the console. The method will then return True if the login attempt was succesful.\n", "\n", "\n", "```python\n", "from langchain.document_loaders.onedrive import OneDriveLoader\n", "\n", "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\")\n", "```\n", "\n", "Once the authentication has been done, the loader will store a token (`o365_token.txt`) at `~/.credentials/` folder. This token could be used later to authenticate without the copy/paste steps explained earlier. To use this token for authentication, you need to change the `auth_with_token` parameter to True in the instantiation of the loader.\n", "\n", "```python\n", "from langchain.document_loaders.onedrive import OneDriveLoader\n", "\n", "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", auth_with_token=True)\n", "```\n", "\n", "### 🗂️ Documents loader\n", "\n", "#### 📑 Loading documents from a OneDrive Directory\n", "\n", "`OneDriveLoader` can load documents from a specific folder within your OneDrive. For instance, you want to load all documents that are stored at `Documents/clients` folder within your OneDrive.\n", "\n", "\n", "```python\n", "from langchain.document_loaders.onedrive import OneDriveLoader\n", "\n", "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", folder_path=\"Documents/clients\", auth_with_token=True)\n", "documents = loader.load()\n", "```\n", "\n", "#### 📑 Loading documents from a list of Documents IDs\n", "\n", "Another possibility is to provide a list of `object_id` for each document you want to load. For that, you will need to query the [Microsoft Graph API](https://developer.microsoft.com/en-us/graph/graph-explorer) to find all the documents ID that you are interested in. This [link](https://learn.microsoft.com/en-us/graph/api/resources/onedrive?view=graph-rest-1.0#commonly-accessed-resources) provides a list of endpoints that will be helpful to retrieve the documents ID.\n", "\n", "For instance, to retrieve information about all objects that are stored at the root of the Documents folder, you need make a request to: `https://graph.microsoft.com/v1.0/drives/{YOUR DRIVE ID}/root/children`. Once you have the list of IDs that you are interested in, then you can instantiate the loader with the following parameters.\n", "\n", "\n", "```python\n", "from langchain.document_loaders.onedrive import OneDriveLoader\n", "\n", "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n", "documents = loader.load()\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }