langchain/docs/extras/integrations/vectorstores/vectara.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {},
   "source": [
    "# Vectara\n",
    "\n",
    ">[Vectara](https://vectara.com/) is a API platform for building GenAI applications. It provides an easy-to-use API for document indexing and query that is managed by Vectara and is optimized for performance and accuracy. \n",
    "See the [Vectara API documentation ](https://docs.vectara.com/docs/) for more information on how to use the API.\n",
    "\n",
    "This notebook shows how to use functionality related to the `Vectara`'s integration with langchain.\n",
    "Note that unlike many other integrations in this category, Vectara provides an end-to-end managed service for [Grounded Generation](https://vectara.com/grounded-generation/) (aka retrieval agumented generation), which includes:\n",
    "1. A way to extract text from document files and chunk them into sentences.\n",
    "2. Its own embeddings model and vector store - each text segment is encoded into a vector embedding and stored in the Vectara internal vector store\n",
    "3. A query service that automatically encodes the query into embedding, and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching))\n",
    "\n",
    "All of these are supported in this LangChain integration."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc0f4344",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps:\n",
    "1. [Sign up](https://console.vectara.com/signup) for a Vectara account if you don't already have one. Once you have completed your sign up you will have a Vectara customer ID. You can find your customer ID by clicking on your name, on the top-right of the Vectara console window.\n",
    "2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data upon ingest from input documents. To create a corpus, use the **\"Create Corpus\"** button. You then provide a name to your corpus as well as a description. Optionally you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID right on the top.\n",
    "3. Next you'll need to create API keys to access the corpus. Click on the **\"Authorization\"** tab in the corpus view and then the **\"Create API Key\"** button. Give your key a name, and choose whether you want query only or query+index for your key. Click \"Create\" and you now have an active API key. Keep this key confidential. \n",
    "\n",
    "To use LangChain with Vectara, you'll need to have these three values: customer ID, corpus ID and api_key.\n",
    "You can provide those to LangChain in two ways:\n",
    "\n",
    "1. Include in your environment these three variables: `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`.\n",
    "\n",
    "> For example, you can set these variables using os.environ and getpass as follows:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"VECTARA_CUSTOMER_ID\"] = getpass.getpass(\"Vectara Customer ID:\")\n",
    "os.environ[\"VECTARA_CORPUS_ID\"] = getpass.getpass(\"Vectara Corpus ID:\")\n",
    "os.environ[\"VECTARA_API_KEY\"] = getpass.getpass(\"Vectara API Key:\")\n",
    "```\n",
    "\n",
    "2. Add them to the Vectara vectorstore constructor:\n",
    "\n",
    "```python\n",
    "vectorstore = Vectara(\n",
    "                vectara_customer_id=vectara_customer_id,\n",
    "                vectara_corpus_id=vectara_corpus_id,\n",
    "                vectara_api_key=vectara_api_key\n",
    "            )\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eeead681",
   "metadata": {},
   "source": [
    "## Connecting to Vectara from LangChain\n",
    "\n",
    "To get started, let's ingest the documents using the from_documents() method.\n",
    "We assume here that you've added your VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID and query+indexing VECTARA_API_KEY as environment variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04a1f1a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import FakeEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import Vectara\n",
    "from langchain.document_loaders import TextLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "be0a4973",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8429667e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:22.525091Z",
     "start_time": "2023-04-04T10:51:22.522015Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "vectara = Vectara.from_documents(\n",
    "    docs,\n",
    "    embedding=FakeEmbeddings(size=768),\n",
    "    doc_metadata={\"speech\": \"state-of-the-union\"},\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "90dbf3e7",
   "metadata": {},
   "source": [
    "Vectara's indexing API provides a file upload API where the file is handled directly by Vectara - pre-processed, chunked optimally and added to the Vectara vector store.\n",
    "To use this, we added the add_files() method (as well as from_files()). \n",
    "\n",
    "Let's see this in action. We pick two PDF documents to upload: \n",
    "1. The \"I have a dream\" speech by Dr. King\n",
    "2. Churchill's \"We Shall Fight on the Beaches\" speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "85ef3468",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "import urllib.request\n",
    "\n",
    "urls = [\n",
    "    [\n",
    "        \"https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf\",\n",
    "        \"I-have-a-dream\",\n",
    "    ],\n",
    "    [\n",
    "        \"https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf\",\n",
    "        \"we shall fight on the beaches\",\n",
    "    ],\n",
    "]\n",
    "files_list = []\n",
    "for url, _ in urls:\n",
    "    name = tempfile.NamedTemporaryFile().name\n",
    "    urllib.request.urlretrieve(url, name)\n",
    "    files_list.append(name)\n",
    "\n",
    "docsearch: Vectara = Vectara.from_files(\n",
    "    files=files_list,\n",
    "    embedding=FakeEmbeddings(size=768),\n",
    "    metadatas=[{\"url\": url, \"speech\": title} for url, title in urls],\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1f9215c8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T09:27:29.920258Z",
     "start_time": "2023-04-04T09:27:29.913714Z"
    }
   },
   "source": [
    "## Similarity search\n",
    "\n",
    "The simplest scenario for using Vectara is to perform a similarity search. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a8c513ab",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.204469Z",
     "start_time": "2023-04-04T10:51:24.855618Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vectara.similarity_search(\n",
    "    query, n_sentence_context=0, filter=\"doc.speech = 'state-of-the-union'\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fc516993",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.220984Z",
     "start_time": "2023-04-04T10:51:25.213943Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
      "\n",
      "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
     ]
    }
   ],
   "source": [
    "print(found_docs[0].page_content)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1bda9bf5",
   "metadata": {},
   "source": [
    "## Similarity search with score\n",
    "\n",
    "Sometimes we might want to perform the search, but also obtain a relevancy score to know how good is a particular result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8804a21d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.631585Z",
     "start_time": "2023-04-04T10:51:25.227384Z"
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vectara.similarity_search_with_score(\n",
    "    query, filter=\"doc.speech = 'state-of-the-union'\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "756a6887",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:25.642282Z",
     "start_time": "2023-04-04T10:51:25.635947Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
      "\n",
      "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
      "\n",
      "Score: 0.4917977\n"
     ]
    }
   ],
   "source": [
    "document, score = found_docs[0]\n",
    "print(document.page_content)\n",
    "print(f\"\\nScore: {score}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1f9876a8",
   "metadata": {},
   "source": [
    "Now let's do similar search for content in the files we uploaded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "47784de5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Document(page_content='We must forever conduct our struggle on the high plane of dignity and discipline.', metadata={'section': '1'}), 0.7962591)\n",
      "(Document(page_content='We must not allow our\\ncreative protests to degenerate into physical violence. . . .', metadata={'section': '1'}), 0.25983918)\n"
     ]
    }
   ],
   "source": [
    "query = \"We must forever conduct our struggle\"\n",
    "found_docs = vectara.similarity_search_with_score(\n",
    "    query, filter=\"doc.speech = 'I-have-a-dream'\"\n",
    ")\n",
    "print(found_docs[0])\n",
    "print(found_docs[1])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "691a82d6",
   "metadata": {},
   "source": [
    "## Vectara as a Retriever\n",
    "\n",
    "Vectara, as all the other LangChain vectorstores, is most often used as a LangChain Retriever:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "9427195f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:26.031451Z",
     "start_time": "2023-04-04T10:51:26.018763Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x12772caf0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever = vectara.as_retriever()\n",
    "retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f3c70c31",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-04-04T10:51:26.495652Z",
     "start_time": "2023-04-04T10:51:26.046407Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "retriever.get_relevant_documents(query)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2300e785",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}