docs: `integrations/retrievers` cleanup (#11388)

Fixed several notebooks:
- headers
- formats

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
parent 5470e730d2
commit c3d2b01adf
5 changed files with 100 additions and 147 deletions
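
Most of the notebook hunks below push section headings down one level (`#` to `##`, `##` to `###`) while keeping the notebook title at level 1. The following is only a rough sketch of how that kind of change could be scripted; it is not part of this commit, and nothing in the commit shows how the edits were actually made. The function name `demote_headings` and its `keep_title` parameter are hypothetical, and the sketch assumes each cell's `source` is a list of strings, as it is in these notebooks.

```python
import copy
import re

# ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})(\s)")


def demote_headings(notebook, levels=1, keep_title=True):
    """Return a copy of a notebook dict with markdown headings pushed down.

    Hypothetical helper, not part of the commit. If keep_title is True,
    the first level-1 heading is left untouched.
    """
    result = copy.deepcopy(notebook)
    title_seen = False
    for cell in result.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # '#' in code cells is a comment, not a heading
        in_fence = False
        new_source = []
        for line in cell.get("source", []):
            if line.lstrip().startswith("```"):
                in_fence = not in_fence  # skip lines inside fenced code
            elif not in_fence:
                match = HEADING.match(line)
                if match:
                    depth = len(match.group(1))
                    if keep_title and depth == 1 and not title_seen:
                        title_seen = True  # leave the notebook title alone
                    else:
                        new_depth = min(depth + levels, 6)
                        line = "#" * new_depth + line[depth:]
            new_source.append(line)
        cell["source"] = new_source
    return result


example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# DocArray\n", "\n"]},
        {"cell_type": "markdown", "source": ["# Document Index Backends"]},
        {"cell_type": "markdown", "source": ["## InMemoryExactNNIndex\n"]},
    ]
}
updated = demote_headings(example)
for cell in updated["cells"]:
    print(cell["source"][0].rstrip())
```

Loading and saving a real `.ipynb` file would go through `json.load` and `json.dump` (or the `nbformat` package); that step is left out to keep the sketch self-contained.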
--- a/docs/docs_skeleton/docs/integrations/retrievers/docarray_retriever.ipynb
+++ b/docs/docs_skeleton/docs/integrations/retrievers/docarray_retriever.ipynb
@@ -5,25 +5,12 @@
   "id": "a0eb506a-f52e-4a92-9204-63233c3eb5bd",
   "metadata": {},
   "source": [
-    "# DocArray Retriever\n",
+    "# DocArray\n",
    "\n",
-    "[DocArray](https://github.com/docarray/docarray) is a versatile, open-source tool for managing your multi-modal data. It lets you shape your data however you want, and offers the flexibility to store and search it using various document index backends. Plus, it gets even better - you can utilize your DocArray document index to create a DocArrayRetriever, and build awesome Langchain apps!\n",
+    ">[DocArray](https://github.com/docarray/docarray) is a versatile, open-source tool for managing your multi-modal data. It lets you shape your data however you want, and offers the flexibility to store and search it using various document index backends. Plus, it gets even better - you can utilize your `DocArray` document index to create a `DocArrayRetriever`, and build awesome Langchain apps!\n",
    "\n",
-    "This notebook is split into two sections. The first section offers an introduction to all five supported document index backends. It provides guidance on setting up and indexing each backend, and also instructs you on how to build a DocArrayRetriever for finding relevant documents. In the second section, we'll select one of these backends and illustrate how to use it through a basic example.\n",
-    "\n",
-    "\n",
-    "[Document Index Backends](#Document-Index-Backends)\n",
-    "1. [InMemoryExactNNIndex](#inmemoryexactnnindex)\n",
-    "2. [HnswDocumentIndex](#hnswdocumentindex)\n",
-    "3. [WeaviateDocumentIndex](#weaviatedocumentindex)\n",
-    "4. [ElasticDocIndex](#elasticdocindex)\n",
-    "5. [QdrantDocumentIndex](#qdrantdocumentindex)\n",
-    "\n",
-    "[Movie Retrieval using HnswDocumentIndex](#Movie-Retrieval-using-HnswDocumentIndex)\n",
-    "\n",
-    "- [Normal Retriever](#normal-retriever)\n",
-    "- [Retriever with Filters](#retriever-with-filters)\n",
-    "- [Retriever with MMR Search](#Retriever-with-MMR-search)\n"
+    "This notebook is split into two sections. The [first section](#document-index-backends) offers an introduction to all five supported document index backends. It provides guidance on setting up and indexing each backend and also instructs you on how to build a `DocArrayRetriever` for finding relevant documents. \n",
+    "In the [second section](#movie-retrieval-using-hnswdocumentindex), we'll select one of these backends and illustrate how to use it through a basic example.\n"
   ]
  },
  {
@@ -31,7 +18,7 @@
   "id": "51db6285-58db-481d-8d24-b13d1888056b",
   "metadata": {},
   "source": [
-    "# Document Index Backends"
+    "## Document Index Backends"
   ]
  },
  {
@@ -86,9 +73,9 @@
    "tags": []
   },
   "source": [
-    "## InMemoryExactNNIndex\n",
+    "### InMemoryExactNNIndex\n",
    "\n",
-    "InMemoryExactNNIndex stores all Documentsin memory. It is a great starting point for small datasets, where you may not want to launch a database server.\n",
+    "`InMemoryExactNNIndex` stores all Documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.\n",
    "\n",
    "Learn more here: https://docs.docarray.org/user_guide/storing/index_in_memory/"
   ]
@@ -159,9 +146,9 @@
   "id": "a9daf2c4-6568-4a49-ba6e-21687962d2c1",
   "metadata": {},
   "source": [
-    "## HnswDocumentIndex\n",
+    "### HnswDocumentIndex\n",
    "\n",
-    "HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html).\n",
+    "`HnswDocumentIndex` is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html).\n",
    "\n",
    "Learn more here: https://docs.docarray.org/user_guide/storing/index_hnswlib/"
   ]
@@ -233,9 +220,9 @@
   "id": "7177442e-3fd3-4f3d-ab22-cd8265b35112",
   "metadata": {},
   "source": [
-    "## WeaviateDocumentIndex\n",
+    "### WeaviateDocumentIndex\n",
    "\n",
-    "WeaviateDocumentIndex is a document index that is built upon [Weaviate](https://weaviate.io/) vector database.\n",
+    "`WeaviateDocumentIndex` is a document index that is built upon [Weaviate](https://weaviate.io/) vector database.\n",
    "\n",
    "Learn more here: https://docs.docarray.org/user_guide/storing/index_weaviate/"
   ]
@@ -331,11 +318,11 @@
   "id": "6ee8f920-9297-4b0a-a353-053a86947d10",
   "metadata": {},
   "source": [
-    "## ElasticDocIndex\n",
+    "### ElasticDocIndex\n",
    "\n",
-    "ElasticDocIndex is a document index that is built upon [ElasticSearch](https://github.com/elastic/elasticsearch)\n",
+    "`ElasticDocIndex` is a document index that is built upon [ElasticSearch](https://github.com/elastic/elasticsearch)\n",
    "\n",
-    "Learn more here: https://docs.docarray.org/user_guide/storing/index_elastic/"
+    "Learn more [here](https://docs.docarray.org/user_guide/storing/index_elastic/)"
   ]
  },
  {
@@ -407,11 +394,11 @@
   "id": "281432f8-87a5-4f22-a582-9d5dac33d158",
   "metadata": {},
   "source": [
-    "## QdrantDocumentIndex\n",
+    "### QdrantDocumentIndex\n",
    "\n",
-    "QdrantDocumentIndex is a document index that is build upon [Qdrant](https://qdrant.tech/) vector database\n",
+    "`QdrantDocumentIndex` is a document index that is built upon [Qdrant](https://qdrant.tech/) vector database\n",
    "\n",
-    "Learn more here: https://docs.docarray.org/user_guide/storing/index_qdrant/"
+    "Learn more [here](https://docs.docarray.org/user_guide/storing/index_qdrant/)"
   ]
  },
  {
@@ -501,7 +488,7 @@
   "id": "3afb65b0-c620-411a-855f-1aa81481bdbb",
   "metadata": {},
   "source": [
-    "# Movie Retrieval using HnswDocumentIndex"
+    "## Movie Retrieval using HnswDocumentIndex"
   ]
  },
  {
@@ -562,7 +549,7 @@
   },
   "outputs": [
    {
-     "name": "stdin",
+     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OpenAI API Key: ········\n"
@@ -638,7 +625,7 @@
    "tags": []
   },
   "source": [
-    "## Normal Retriever"
+    "### Normal Retriever"
   ]
  },
  {
@@ -678,7 +665,7 @@
   "id": "3defa711-51df-4b48-b02a-306706cfacd0",
   "metadata": {},
   "source": [
-    "## Retriever with Filters"
+    "### Retriever with Filters"
   ]
  },
  {
@@ -720,7 +707,7 @@
   "id": "fa10afa6-1554-4c2b-8afc-cff44e32d2f8",
   "metadata": {},
   "source": [
-    "## Retriever with MMR search"
+    "### Retriever with MMR search"
   ]
  },
  {
@@ -783,7 +770,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.17"
+   "version": "3.10.12"
  }
 },
 "nbformat": 4,
--- a/docs/docs_skeleton/docs/integrations/retrievers/google_drive.ipynb
+++ b/docs/docs_skeleton/docs/integrations/retrievers/google_drive.ipynb
@@ -5,8 +5,9 @@
   "id": "b0ed136e-6983-4893-ae1b-b75753af05f8",
   "metadata": {},
   "source": [
-    "# Google Drive Retriever\n",
-    "This notebook covers how to retrieve documents from Google Drive.\n",
+    "# Google Drive\n",
+    "\n",
+    "This notebook covers how to retrieve documents from `Google Drive`.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
@@ -15,9 +16,10 @@
    "1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)\n",
    "1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`\n",
    "\n",
-    "## Instructions for retrieving your Google Docs data\n",
+    "## Retrieve the Google Docs\n",
+    "\n",
    "By default, the `GoogleDriveRetriever` expects the `credentials.json` file to be `~/.credentials/credentials.json`, but this is configurable using the `GOOGLE_ACCOUNT_FILE` environment variable. \n",
-    "The location of `token.json` use the same directory (or use the parameter `token_path`). Note that `token.json` will be created automatically the first time you use the retriever.\n",
+    "The location of `token.json` uses the same directory (or use the parameter `token_path`). Note that `token.json` will be created automatically the first time you use the retriever.\n",
    "\n",
    "`GoogleDriveRetriever` can retrieve a selection of files with some requests. \n",
    "\n",
@@ -36,49 +38,6 @@
    "The special value `root` is for your personal home."
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9c9665c9-a023-4078-9d95-e43021cecb6f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "878928a6-a5ae-4f74-b351-64e3b01733fe",
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2023-05-09T10:45:59.438650905Z",
-     "start_time": "2023-05-09T10:45:57.955900302Z"
-    },
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "from langchain_googledrive.retrievers import GoogleDriveRetriever"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "755907c2-145d-4f0f-9b15-07a628a2d2d2",
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2023-05-09T10:45:59.442890834Z",
-     "start_time": "2023-05-09T10:45:59.440941528Z"
-    },
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "folder_id=\"root\"\n",
-    "#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'"
-   ]
-  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -92,6 +51,11 @@
   },
   "outputs": [],
   "source": [
+    "from langchain_googledrive.retrievers import GoogleDriveRetriever\n",
+    "\n",
+    "folder_id=\"root\"\n",
+    "#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'\n",
+    "\n",
    "retriever = GoogleDriveRetriever(\n",
    "    num_results=2,\n",
    ")"
@@ -228,7 +192,8 @@
   "id": "9b6fed29-1666-452e-b677-401613270388",
   "metadata": {},
   "source": [
-    "# Use GDrive 'description' metadata\n",
+    "## Use Google Drive 'description' metadata\n",
+    "\n",
    "Each Google Drive has a `description` field in metadata (see the *details of a file*).\n",
    "Use the `snippets` mode to return the description of selected files.\n"
   ]
@@ -271,7 +236,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.10.12"
  }
 },
 "nbformat": 4,
--- a/docs/docs_skeleton/docs/integrations/retrievers/re_phrase.ipynb
+++ b/docs/docs_skeleton/docs/integrations/retrievers/re_phrase.ipynb
@@ -5,69 +5,53 @@
   "id": "e8624be2",
   "metadata": {},
   "source": [
-    "# RePhraseQueryRetriever\n",
+    "# RePhraseQuery\n",
    "\n",
-    "Simple retriever that applies an LLM between the user input and the query pass the to retriever.\n",
+    "`RePhraseQuery` is a simple retriever that applies an LLM between the user input and the query passed by the retriever.\n",
    "\n",
    "It can be used to pre-process the user input in any way.\n",
    "\n",
-    "The default prompt used in the `from_llm` classmethod:\n",
+    "## Example\n",
    "\n",
-    "```\n",
-    "DEFAULT_TEMPLATE = \"\"\"You are an assistant tasked with taking a natural language \\\n",
-    "query from a user and converting it into a query for a vectorstore. \\\n",
-    "In this process, you strip out information that is not relevant for \\\n",
-    "the retrieval task. Here is the user query: {question}\"\"\"\n",
-    "```\n",
+    "### Setting up\n",
    "\n",
-    "Create a vectorstore."
+    "Create a vector store."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
-   "id": "1bfa6834",
+   "execution_count": 2,
+   "id": "d0b51556",
   "metadata": {},
   "outputs": [],
   "source": [
+    "import logging\n",
    "from langchain.document_loaders import WebBaseLoader\n",
-    "\n",
-    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
-    "data = loader.load()\n",
-    "\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
-    "\n",
-    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
-    "all_splits = text_splitter.split_documents(data)\n",
-    "\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
+    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
-    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())"
+    "from langchain.retrievers import RePhraseQueryRetriever"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
-   "id": "d0b51556",
+   "execution_count": 1,
+   "id": "1bfa6834",
   "metadata": {},
   "outputs": [],
   "source": [
-    "import logging\n",
-    "\n",
    "logging.basicConfig()\n",
-    "logging.getLogger(\"langchain.retrievers.re_phraser\").setLevel(logging.INFO)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "20e1e787",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain.chat_models import ChatOpenAI\n",
-    "from langchain.retrievers import RePhraseQueryRetriever"
+    "logging.getLogger(\"langchain.retrievers.re_phraser\").setLevel(logging.INFO)\n",
+    "\n",
+    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
+    "data = loader.load()\n",
+    "\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
+    "all_splits = text_splitter.split_documents(data)\n",
+    "\n",
+    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())"
   ]
  },
  {
@@ -75,7 +59,16 @@
   "id": "88c0a972",
   "metadata": {},
   "source": [
-    "## Using the default prompt"
+    "### Using the default prompt\n",
+    "\n",
+    "The default prompt used in the `from_llm` classmethod:\n",
+    "\n",
+    "```\n",
+    "DEFAULT_TEMPLATE = \"\"\"You are an assistant tasked with taking a natural language \\\n",
+    "query from a user and converting it into a query for a vectorstore. \\\n",
+    "In this process, you strip out information that is not relevant for \\\n",
+    "the retrieval task. Here is the user query: {question}\"\"\"\n",
+    "```"
   ]
  },
  {
@@ -95,9 +88,7 @@
   "cell_type": "code",
   "execution_count": 5,
   "id": "8d17ecc9",
-   "metadata": {
-    "scrolled": false
-   },
+   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
@@ -140,7 +131,7 @@
   "id": "0513a6e2",
   "metadata": {},
   "source": [
-    "## Supply a prompt"
+    "### Custom prompt"
   ]
  },
  {
@@ -214,7 +205,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.1"
+   "version": "3.10.12"
  }
 },
 "nbformat": 4,
--- a/docs/docs_skeleton/docs/integrations/retrievers/sec_filings.ipynb
+++ b/docs/docs_skeleton/docs/integrations/retrievers/sec_filings.ipynb
@@ -5,12 +5,12 @@
   "id": "263f914c-9d67-4316-8b3d-03c3b99ba9d8",
   "metadata": {},
   "source": [
-    "SEC filings data\n",
-    "=\n",
+    "# SEC filing\n",
    "\n",
-    "SEC filings data powered by [Kay.ai](https://kay.ai) and [Cybersyn](https://www.cybersyn.com/) via [Snowflake Marketplace](https://app.snowflake.com/marketplace/providers/GZTSZAS2KCS/Cybersyn%2C%20Inc).\n",
    "\n",
-    ">The SEC filing is a financial statement or other formal document submitted to the U.S. Securities and Exchange Commission (SEC). Public companies, certain insiders, and broker-dealers are required to make regular SEC filings. Investors and financial professionals rely on these filings for information about companies they are evaluating for investment purposes."
+    ">The SEC filing is a financial statement or other formal document submitted to the U.S. Securities and Exchange Commission (SEC). Public companies, certain insiders, and broker-dealers are required to make regular SEC filings. Investors and financial professionals rely on these filings for information about companies they are evaluating for investment purposes.\n",
+    ">\n",
+    ">SEC filings data powered by [Kay.ai](https://kay.ai) and [Cybersyn](https://www.cybersyn.com/) via [Snowflake Marketplace](https://app.snowflake.com/marketplace/providers/GZTSZAS2KCS/Cybersyn%2C%20Inc).\n"
   ]
  },
  {
@@ -18,22 +18,12 @@
   "id": "fc507b8e-ea51-417c-93da-42bf998a1195",
   "metadata": {},
   "source": [
-    "Setup\n",
-    "=\n",
+    "## Setup\n",
    "\n",
-    "First you will need to install the `kay` package. You will also need an API key: you can get one for free at [https://kay.ai](https://kay.ai/). Once you have an API key, you must set it as an environment variable `KAY_API_KEY`.\n",
    "\n",
-    "In this example we're going to use the `KayAiRetriever`. Take a look at the [kay notebook](/docs/integrations/retrievers/kay) for more detailed information for the parmeters that it accepts.`"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c923bea0-585a-4f62-8662-efc167e8d793",
-   "metadata": {},
-   "source": [
-    "Examples\n",
-    "=\n",
-    "\n"
+    "First, you will need to install the `kay` package. You will also need an API key: you can get one for free at [https://kay.ai](https://kay.ai/). Once you have an API key, you must set it as an environment variable `KAY_API_KEY`.\n",
+    "\n",
+    "In this example, we're going to use the `KayAiRetriever`. Take a look at the [kay notebook](/docs/integrations/retrievers/kay) for more detailed information for the parameters that it accepts.`"
   ]
  },
  {
@@ -70,6 +60,14 @@
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "c923bea0-585a-4f62-8662-efc167e8d793",
+   "metadata": {},
+   "source": [
+    "## Example"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 7,
@@ -157,7 +155,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.10.12"
  }
 },
 "nbformat": 4,
--- a/docs/docs_skeleton/vercel.json
+++ b/docs/docs_skeleton/vercel.json
@@ -2360,6 +2360,18 @@
      "source": "/en/latest/modules/indexes/vectorstores/examples/awadb.html",
      "destination": "/docs/integrations/vectorstores/awadb"
    },
+    {
+      "source": "/docs/integrations/providers/aws_dynamodb",
+      "destination": "/docs/integrations/platforms/aws#aws-dynamodb"
+    },
+    {
+      "source": "/docs/integrations/providers/google_document_ai",
+      "destination": "/docs/integrations/platforms/google#google-document-ai"
+    },
+    {
+      "source": "/docs/integrations/providers/scann",
+      "destination": "/docs/integrations/platforms/google#google-scann"
+    },
    {
      "source": "/docs/modules/data_connection/vectorstores/integrations/awadb",
      "destination": "/docs/integrations/vectorstores/awadb"
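
The `vercel.json` hunk above adds three redirect rules that send old provider pages to anchors on the platform pages. As a rough illustration of what those rules express, the sketch below resolves a path against a list of `source`/`destination` pairs copied from the hunk. It only does exact string matching; the hosting platform's real matching rules (patterns, status codes, ordering) are not modeled here, and the `resolve_redirect` helper is hypothetical.

```python
# The three rules added in this commit, copied from the hunk above.
NEW_REDIRECTS = [
    {
        "source": "/docs/integrations/providers/aws_dynamodb",
        "destination": "/docs/integrations/platforms/aws#aws-dynamodb",
    },
    {
        "source": "/docs/integrations/providers/google_document_ai",
        "destination": "/docs/integrations/platforms/google#google-document-ai",
    },
    {
        "source": "/docs/integrations/providers/scann",
        "destination": "/docs/integrations/platforms/google#google-scann",
    },
]


def resolve_redirect(path, redirects):
    """Return the destination of the first rule whose source equals path.

    Hypothetical helper: exact matching only. Paths with no matching
    rule are returned unchanged.
    """
    for rule in redirects:
        if rule["source"] == path:
            return rule["destination"]
    return path


print(resolve_redirect("/docs/integrations/providers/scann", NEW_REDIRECTS))
```

A check like this could be used to confirm that every old provider URL still maps to a page after the move, under the exact-match assumption stated above.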