langchain/docs/extras/integrations/vectorstores/vectara.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {},
   "source": [
    "# Vectara\n",
    "\n",
    ">[Vectara](https://vectara.com/) is a API platform for building GenAI applications. It provides an easy-to-use API for document indexing and querying that is managed by Vectara and is optimized for performance and accuracy. \n",
    "See the [Vectara API documentation ](https://docs.vectara.com/docs/) for more information on how to use the API.\n",
    "\n",
    "This notebook shows how to use functionality related to the `Vectara`'s integration with langchain.\n",
    "Note that unlike many other integrations in this category, Vectara provides an end-to-end managed service for [Grounded Generation](https://vectara.com/grounded-generation/) (aka retrieval agumented generation), which includes:\n",
    "1. A way to extract text from document files and chunk them into sentences.\n",
    "2. Its own embeddings model and vector store - each text segment is encoded into a vector embedding and stored in the Vectara internal vector store\n",
    "3. A query service that automatically encodes the query into embedding, and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching))\n",
    "\n",
    "All of these are supported in this LangChain integration."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc0f4344",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps:\n",
    "1. [Sign up](https://console.vectara.com/signup) for a Vectara account if you don't already have one. Once you have completed your sign up you will have a Vectara customer ID. You can find your customer ID by clicking on your name, on the top-right of the Vectara console window.\n",
    "2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data upon ingest from input documents. To create a corpus, use the **\"Create Corpus\"** button. You then provide a name to your corpus as well as a description. Optionally you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID right on the top.\n",
    "3. Next you'll need to create API keys to access the corpus. Click on the **\"Authorization\"** tab in the corpus view and then the **\"Create API Key\"** button. Give your key a name, and choose whether you want query only or query+index for your key. Click \"Create\" and you now have an active API key. Keep this key confidential. \n",
    "\n",
    "To use LangChain with Vectara, you'll need to have these three values: customer ID, corpus ID and api_key.\n",
    "You can provide those to LangChain in two ways:\n",
    "\n",
    "1. Include in your environment these three variables: `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`.\n",
    "\n",
    "> For example, you can set these variables using os.environ and getpass as follows:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"VECTARA_CUSTOMER_ID\"] = getpass.getpass(\"Vectara Customer ID:\")\n",
    "os.environ[\"VECTARA_CORPUS_ID\"] = getpass.getpass(\"Vectara Corpus ID:\")\n",
    "os.environ[\"VECTARA_API_KEY\"] = getpass.getpass(\"Vectara API Key:\")\n",
    "```\n",
    "\n",
    "2. Add them to the Vectara vectorstore constructor:\n",
    "\n",
    "```python\n",
    "vectorstore = Vectara(\n",
    "                vectara_customer_id=vectara_customer_id,\n",
    "                vectara_corpus_id=vectara_corpus_id,\n",
    "                vectara_api_key=vectara_api_key\n",
    "            )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eeead681",
   "metadata": {},
   "source": [
    "## Connecting to Vectara from LangChain\n",
    "\n",
    "To get started, let's ingest the documents using the from_documents() method.\n",
    "We assume here that you've added your VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID and query+indexing VECTARA_API_KEY as environment variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "04a1f1a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import FakeEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import Vectara\n",
    "from langchain.document_loaders import TextLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "be0a4973",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8429667e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:22.525091Z",
     "start_time": "2023-04-04T10:51:22.522015Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "vectara = Vectara.from_documents(\n",
    "    docs,\n",
    "    embedding=FakeEmbeddings(size=768),\n",
    "    doc_metadata={\"speech\": \"state-of-the-union\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90dbf3e7",
   "metadata": {},
   "source": [
    "Vectara's indexing API provides a file upload API where the file is handled directly by Vectara - pre-processed, chunked optimally and added to the Vectara vector store.\n",
    "To use this, we added the add_files() method (as well as from_files()). \n",
    "\n",
    "Let's see this in action. We pick two PDF documents to upload: \n",
    "1. The \"I have a dream\" speech by Dr. King\n",
    "2. Churchill's \"We Shall Fight on the Beaches\" speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "85ef3468",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "import urllib.request\n",
    "\n",
    "urls = [\n",
    "    [\n",
    "        \"https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf\",\n",
    "        \"I-have-a-dream\",\n",
    "    ],\n",
    "    [\n",
    "        \"https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf\",\n",
    "        \"we shall fight on the beaches\",\n",
    "    ],\n",
    "]\n",
    "files_list = []\n",
    "for url, _ in urls:\n",
    "    name = tempfile.NamedTemporaryFile().name\n",
    "    urllib.request.urlretrieve(url, name)\n",
    "    files_list.append(name)\n",
    "\n",
    "docsearch: Vectara = Vectara.from_files(\n",
    "    files=files_list,\n",
    "    embedding=FakeEmbeddings(size=768),\n",
    "    metadatas=[{\"url\": url, \"speech\": title} for url, title in urls],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f9215c8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T09:27:29.920258Z",
     "start_time": "2023-04-04T09:27:29.913714Z"
    }
   },
   "source": [
    "## Similarity search\n",
    "\n",
    "The simplest scenario for using Vectara is to perform a similarity search. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a8c513ab",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.204469Z",
     "start_time": "2023-04-04T10:51:24.855618Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vectara.similarity_search(\n",
    "    query, n_sentence_context=0, filter=\"doc.speech = 'state-of-the-union'\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fc516993",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.220984Z",
     "start_time": "2023-04-04T10:51:25.213943Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.\n"
     ]
    }
   ],
   "source": [
    "print(found_docs[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bda9bf5",
   "metadata": {},
   "source": [
    "## Similarity search with score\n",
    "\n",
    "Sometimes we might want to perform the search, but also obtain a relevancy score to know how good is a particular result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8804a21d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.631585Z",
     "start_time": "2023-04-04T10:51:25.227384Z"
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vectara.similarity_search_with_score(\n",
    "    query, filter=\"doc.speech = 'state-of-the-union'\", score_threshold=0.2,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "756a6887",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.642282Z",
     "start_time": "2023-04-04T10:51:25.635947Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.\n",
      "\n",
      "Score: 0.786569\n"
     ]
    }
   ],
   "source": [
    "document, score = found_docs[0]\n",
    "print(document.page_content)\n",
    "print(f\"\\nScore: {score}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f9876a8",
   "metadata": {},
   "source": [
    "Now let's do similar search for content in the files we uploaded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "47784de5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "With this threshold of 1.2 we have 0 documents\n"
     ]
    }
   ],
   "source": [
    "query = \"We must forever conduct our struggle\"\n",
    "min_score = 1.2\n",
    "found_docs = vectara.similarity_search_with_score(\n",
    "    query, filter=\"doc.speech = 'I-have-a-dream'\", score_threshold=min_score,\n",
    ")\n",
    "print(f\"With this threshold of {min_score} we have {len(found_docs)} documents\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3e22949f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "With this threshold of 0.2 we have 3 documents\n"
     ]
    }
   ],
   "source": [
    "query = \"We must forever conduct our struggle\"\n",
    "min_score = 0.2\n",
    "found_docs = vectara.similarity_search_with_score(\n",
    "    query, filter=\"doc.speech = 'I-have-a-dream'\", score_threshold=min_score,\n",
    ")\n",
    "print(f\"With this threshold of {min_score} we have {len(found_docs)} documents\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "691a82d6",
   "metadata": {},
   "source": [
    "## Vectara as a Retriever\n",
    "\n",
    "Vectara, as all the other LangChain vectorstores, is most often used as a LangChain Retriever:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "9427195f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:26.031451Z",
     "start_time": "2023-04-04T10:51:26.018763Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "VectaraRetriever(tags=['Vectara'], metadata=None, vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x1586bd330>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '2'})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever = vectara.as_retriever()\n",
    "retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f3c70c31",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:26.495652Z",
     "start_time": "2023-04-04T10:51:26.046407Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union'})"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "retriever.get_relevant_documents(query)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2300e785",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}