Vectara upd2 (#6506)

Update to Vectara integration 
- By user request, added `add_files` to take advantage of Vectara's
capability to process files on the backend, without the need for
separately loading documents and chunking them in the chain.
- Updated the vectara.ipynb example notebook to be broader, and added testing
of `add_files()`
 
  @hwchase17 - project lead

---------

Co-authored-by: rlm <pexpresss31@gmail.com>

@ -39,6 +39,21 @@ vectara = Vectara(
```
The `customer_id`, `corpus_id` and `api_key` are optional; if they are not supplied, they will be read from the environment variables `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`, respectively.
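For example, a minimal sketch that relies only on the environment variables (the values shown are placeholders):
```python
import os

from langchain.vectorstores import Vectara

# placeholder values; use your actual Vectara credentials
os.environ["VECTARA_CUSTOMER_ID"] = "<YOUR_CUSTOMER_ID>"
os.environ["VECTARA_CORPUS_ID"] = "<YOUR_CORPUS_ID>"
os.environ["VECTARA_API_KEY"] = "<YOUR_API_KEY>"

vectara = Vectara()  # credentials are read from the environment
```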
After you have the vectorstore, you can `add_texts` or `add_documents` as per the standard `VectorStore` interface, for example:
```python
vectara.add_texts(["to be or not to be", "that is the question"])
```
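Or equivalently with `add_documents`, a short sketch using LangChain's `Document` (the metadata shown is illustrative):
```python
from langchain.docstore.document import Document

vectara.add_documents(
    [Document(page_content="to be or not to be", metadata={"speech": "hamlet"})]
)
```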
Since Vectara supports file upload, we also added the ability to upload files (PDF, TXT, HTML, PPT, DOC, etc.) directly. When using this method, each file is uploaded to the Vectara backend, where it is processed and chunked optimally, so you don't have to use the LangChain document loader or chunking mechanism.
As an example:
```python
vectara.add_files(["path/to/file1.pdf", "path/to/file2.pdf",...])
```
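If you don't yet have a vectorstore instance, the new `from_files()` classmethod combines construction and upload in one call. A minimal sketch (paths are placeholders):
```python
vectara = Vectara.from_files(
    files=["path/to/file1.pdf", "path/to/file2.pdf"],
    embedding=None,  # ignored; Vectara computes embeddings on the backend
)
```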
To query the vectorstore, you can use the `similarity_search` method (or `similarity_search_with_score`), which takes a query string and returns a list of results:
```python
results = vectara.similarity_search("what is LangChain?")
```
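You can also get relevance scores back, and narrow the search with a metadata `filter` expression. A short sketch (the filter value is illustrative and assumes a matching `doc_metadata` was set at indexing time):
```python
results = vectara.similarity_search_with_score(
    "what is LangChain?", filter="doc.speech = 'state-of-the-union'"
)
```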

@ -11,43 +11,11 @@
">[Vectara](https://vectara.com/) is a API platform for building LLM-powered applications. It provides a simple to use API for document indexing and query that is managed by Vectara and is optimized for performance and accuracy. \n",
"\n",
"\n",
"This notebook shows how to use functionality related to the `Vectara` vector database. \n",
"This notebook shows how to use functionality related to the `Vectara` vector database or the `Vectara` retriever. \n",
"\n",
"See the [Vectara API documentation ](https://docs.vectara.com/docs/) for more information on how to use the API."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7b2f111b-357a-4f42-9730-ef0603bdc1b5",
"metadata": {},
"source": [
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "082e7e8b-ac52-430c-98d6-8f0924457642",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key:········\n"
]
}
],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
@ -61,58 +29,95 @@
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"import os\n",
"from langchain.embeddings import FakeEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import Vectara\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eeead681",
"metadata": {},
"source": [
"## Connecting to Vectara from LangChain\n",
"\n",
"The Vectara API provides simple API endpoints for indexing and querying, which is encapsulated in the Vectara integration.\n",
"First let's ingest the documents using the from_documents() method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3c3999a",
"id": "be0a4973",
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader('../../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8429667e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.520144Z",
"start_time": "2023-04-04T10:51:22.285826Z"
"end_time": "2023-04-04T10:51:22.525091Z",
"start_time": "2023-04-04T10:51:22.522015Z"
},
"tags": []
},
"outputs": [],
"source": [
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
"vectara = Vectara.from_documents(docs, \n",
" embedding=FakeEmbeddings(size=768), \n",
" doc_metadata = {\"speech\": \"state-of-the-union\"})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eeead681",
"id": "90dbf3e7",
"metadata": {},
"source": [
"## Connecting to Vectara from LangChain\n",
"Vectara's indexing API provides a file upload API where the file is handled directly by Vectara - pre-processed, chunked optimally and added to the Vectara vector store.\n",
"To use this, we added the add_files() method (and from_files()). \n",
"\n",
"The Vectara API provides simple API endpoints for indexing and querying."
"Let's see this in action. We pick two PDF documents to upload: \n",
"1. The \"I have a dream\" speech by Dr. King\n",
"2. Churchill's \"We Shall Fight on the Beaches\" speech"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8429667e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.525091Z",
"start_time": "2023-04-04T10:51:22.522015Z"
},
"tags": []
},
"execution_count": 5,
"id": "85ef3468",
"metadata": {},
"outputs": [],
"source": [
"vectara = Vectara.from_documents(docs, embedding=None)"
"import tempfile\n",
"import urllib.request\n",
"\n",
"urls = [\n",
" ['https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'I-have-a-dream'],\n",
" ['https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf', 'we shall fight on the beaches'],\n",
"]\n",
"files_list = []\n",
"for url,_ in urls:\n",
" name = tempfile.NamedTemporaryFile().name\n",
" urllib.request.urlretrieve(url, name)\n",
" files_list.append(name)\n",
"\n",
"docsearch: Vectara = Vectara.from_files(\n",
" files=files_list,\n",
" embedding=FakeEmbeddings(size=768),\n",
" metadatas=[{\"url\": url, \"speech\": title} for url,title in urls],\n",
")"
]
},
{
@ -133,7 +138,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "a8c513ab",
"metadata": {
"ExecuteTime": {
@ -145,12 +150,12 @@
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = vectara.similarity_search(query, n_sentence_context=0)"
"found_docs = vectara.similarity_search(query, n_sentence_context=0, filter=\"doc.speech = 'state-of-the-union'\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "fc516993",
"metadata": {
"ExecuteTime": {
@ -191,7 +196,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "8804a21d",
"metadata": {
"ExecuteTime": {
@ -202,12 +207,12 @@
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = vectara.similarity_search_with_score(query)"
"found_docs = vectara.similarity_search_with_score(query, filter=\"doc.speech = 'state-of-the-union'\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"id": "756a6887",
"metadata": {
"ExecuteTime": {
@ -228,7 +233,7 @@
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Score: 0.7129974\n"
"Score: 0.4917977\n"
]
}
],
@ -238,6 +243,37 @@
"print(f\"\\nScore: {score}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1f9876a8",
"metadata": {},
"source": [
"Now let's do similar search for content in the files we uploaded"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "47784de5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Document(page_content='We must forever conduct our struggle on the high plane of dignity and discipline.', metadata={'section': '1'}), 0.7962591)\n",
"(Document(page_content='We must not allow our\\ncreative protests to degenerate into physical violence. . . .', metadata={'section': '1'}), 0.25983918)\n"
]
}
],
"source": [
"query = \"We must forever conduct our struggle\"\n",
"found_docs = vectara.similarity_search_with_score(query, filter=\"doc.speech = 'I-have-a-dream'\")\n",
"print(found_docs[0])\n",
"print(found_docs[1])"
]
},
{
"attachments": {},
"cell_type": "markdown",
@ -246,12 +282,12 @@
"source": [
"## Vectara as a Retriever\n",
"\n",
"Vectara, as all the other vector stores, is a LangChain Retriever, by using cosine similarity. "
"Vectara, as all the other vector stores, can be used also as a LangChain Retriever:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 11,
"id": "9427195f",
"metadata": {
"ExecuteTime": {
@ -263,10 +299,10 @@
{
"data": {
"text/plain": [
"VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x122db2830>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})"
"VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x12772caf0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})"
]
},
"execution_count": 9,
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@ -278,7 +314,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 12,
"id": "f3c70c31",
"metadata": {
"ExecuteTime": {
@ -293,7 +329,7 @@
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
]
},
"execution_count": 10,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}

@ -97,7 +97,7 @@ class Vectara(VectorStore):
return False
return True
def _index_doc(self, doc: dict) -> bool:
def _index_doc(self, doc: dict) -> str:
request: dict[str, Any] = {}
request["customer_id"] = self._vectara_customer_id
request["corpus_id"] = self._vectara_corpus_id
@ -115,15 +115,70 @@ class Vectara(VectorStore):
result = response.json()
status_str = result["status"]["code"] if "status" in result else None
if status_code == 409 or (status_str and status_str == "ALREADY_EXISTS"):
return False
if status_code == 409 or status_str == "ALREADY_EXISTS":
return "E_ALREADY_EXISTS"
elif status_str == "FORBIDDEN":
return "E_NO_PERMISSIONS"
else:
return True
return "E_SUCCEEDED"
def add_files(
self,
files_list: Iterable[str],
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> List[str]:
"""
Vectara provides a way to add documents directly via our API, where
pre-processing and chunking occur internally in an optimal way.
This method provides a way to use that API in LangChain.
Args:
files_list: Iterable of strings, each representing a local file path.
Files can be text, HTML, PDF, markdown, doc/docx, ppt/pptx, etc.;
see the API docs for the full list.
metadatas: Optional list of metadatas associated with each file
Returns:
List of ids associated with each of the files indexed
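Example:
.. code-block:: python
# a minimal usage sketch; paths are placeholders
vectara.add_files(["path/to/file1.pdf", "path/to/file2.pdf"])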
"""
doc_ids = []
for inx, file in enumerate(files_list):
if not os.path.exists(file):
logging.error(f"File {file} does not exist, skipping")
continue
md = metadatas[inx] if metadatas else {}
files: dict = {
"file": (file, open(file, "rb")),
"doc_metadata": json.dumps(md),
}
headers = self._get_post_headers()
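# remove Content-Type so requests sets the correct multipart boundary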
headers.pop("Content-Type")
response = self._session.post(
f"https://api.vectara.io/upload?c={self._vectara_customer_id}&o={self._vectara_corpus_id}&d=True",
files=files,
verify=True,
headers=headers,
)
if response.status_code == 409:
doc_id = response.json()["document"]["documentId"]
logging.info(
f"File {file} already exists on Vectara (doc_id={doc_id}), skipping"
)
elif response.status_code == 200:
doc_id = response.json()["document"]["documentId"]
doc_ids.append(doc_id)
else:
logging.info(f"Error indexing file {file}: {response.json()}")
return doc_ids
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
doc_metadata: Optional[dict] = None,
**kwargs: Any,
) -> List[str]:
"""Run more texts through the embeddings and add to the vectorstore.
@ -131,6 +186,12 @@ class Vectara(VectorStore):
Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
doc_metadata: Optional metadata for the document.
This function indexes all the input text strings in the Vectara corpus as a
single Vectara document, where each input text is considered a "part" and the
metadata is associated with each part.
If 'doc_metadata' is provided, it is associated with the Vectara document.
Returns:
List of ids from adding the texts into the vectorstore.
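Example:
.. code-block:: python
# the doc_metadata value here is illustrative
vectara.add_texts(
["to be or not to be", "that is the question"],
doc_metadata={"speech": "hamlet"},
)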
@ -142,18 +203,27 @@ class Vectara(VectorStore):
doc_id = doc_hash.hexdigest()
if metadatas is None:
metadatas = [{} for _ in texts]
if doc_metadata:
doc_metadata["source"] = "langchain"
else:
doc_metadata = {"source": "langchain"}
doc = {
"document_id": doc_id,
"metadataJson": json.dumps({"source": "langchain"}),
"metadataJson": json.dumps(doc_metadata),
"parts": [
{"text": text, "metadataJson": json.dumps(md)}
for text, md in zip(texts, metadatas)
],
}
succeeded = self._index_doc(doc)
if not succeeded:
success_str = self._index_doc(doc)
if success_str == "E_ALREADY_EXISTS":
self._delete_doc(doc_id)
self._index_doc(doc)
elif success_str == "E_NO_PERMISSIONS":
print(
"""No permissions to add document to Vectara.
Check your corpus ID, customer ID and API key"""
)
return [doc_id]
def similarity_search_with_score(
@ -296,8 +366,36 @@ class Vectara(VectorStore):
"""
# Note: Vectara generates its own embeddings, so we ignore the provided
# embeddings (required by interface)
doc_metadata = kwargs.pop("doc_metadata", {})
vectara = cls(**kwargs)
vectara.add_texts(texts, metadatas)
vectara.add_texts(texts, metadatas, doc_metadata=doc_metadata, **kwargs)
return vectara
@classmethod
def from_files(
cls: Type[Vectara],
files: List[str],
embedding: Optional[Embeddings] = None,
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> Vectara:
"""Construct Vectara wrapper from raw documents.
This is intended to be a quick way to get started.
Example:
.. code-block:: python
from langchain import Vectara
vectara = Vectara.from_files(
files_list,
vectara_customer_id=customer_id,
vectara_corpus_id=corpus_id,
vectara_api_key=api_key,
)
"""
# Note: Vectara generates its own embeddings, so we ignore the provided
# embeddings (required by interface)
vectara = cls(**kwargs)
vectara.add_files(files, metadatas)
return vectara
def as_retriever(self, **kwargs: Any) -> VectaraRetriever:
@ -325,7 +423,10 @@ class VectaraRetriever(VectorStoreRetriever):
"""
def add_texts(
self, texts: List[str], metadatas: Optional[List[dict]] = None
self,
texts: List[str],
metadatas: Optional[List[dict]] = None,
doc_metadata: Optional[dict] = None,
) -> None:
"""Add text to the Vectara vectorstore.
@ -333,4 +434,4 @@ class VectaraRetriever(VectorStoreRetriever):
texts (List[str]): The texts
metadatas (List[dict]): Metadata dicts, must line up with existing store
doc_metadata (dict): Optional document-level metadata
"""
self.vectorstore.add_texts(texts, metadatas)
self.vectorstore.add_texts(texts, metadatas, doc_metadata)

@ -1,7 +1,16 @@
import tempfile
import urllib.request
from langchain.docstore.document import Document
from langchain.vectorstores.vectara import Vectara
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
# For this test to run properly, please set up as follows:
# 1. Create a corpus in Vectara, with a filter attribute called "test_num".
# 2. Create an API key for this corpus with permissions for query and indexing.
# 3. Set up environment variables:
#    VECTARA_API_KEY, VECTARA_CORPUS_ID and VECTARA_CUSTOMER_ID
def get_abbr(s: str) -> str:
words = s.split(" ") # Split the string into words
@ -12,25 +21,76 @@ def get_abbr(s: str) -> str:
def test_vectara_add_documents() -> None:
"""Test end to end construction and search."""
# start with some initial documents
# start with some initial texts
texts = ["grounded generation", "retrieval augmented generation", "data privacy"]
docsearch: Vectara = Vectara.from_texts(
texts,
embedding=FakeEmbeddings(),
metadatas=[{"abbr": "gg"}, {"abbr": "rag"}, {"abbr": "dp"}],
metadatas=[
{"abbr": "gg", "test_num": "1"},
{"abbr": "rag", "test_num": "1"},
{"abbr": "dp", "test_num": "1"},
],
doc_metadata={"test_num": "1"},
)
# then add some additional documents
new_texts = ["large language model", "information retrieval", "question answering"]
docsearch.add_documents(
[Document(page_content=t, metadata={"abbr": get_abbr(t)}) for t in new_texts]
[Document(page_content=t, metadata={"abbr": get_abbr(t)}) for t in new_texts],
doc_metadata={"test_num": "1"},
)
# finally do a similarity search to see if all works okay
output = docsearch.similarity_search(
"large language model", k=2, n_sentence_context=0
"large language model",
k=2,
n_sentence_context=0,
filter="doc.test_num = 1",
)
assert output[0].page_content == "large language model"
assert output[0].metadata == {"abbr": "llm"}
assert output[1].page_content == "information retrieval"
assert output[1].metadata == {"abbr": "ir"}
def test_vectara_from_files() -> None:
"""Test end to end construction and search."""
# download documents to local storage and then upload as files
# attention paper and deep learning book
urls = [
("https://arxiv.org/pdf/1706.03762.pdf"),
(
"https://www.microsoft.com/en-us/research/wp-content/uploads/"
"2016/02/Final-DengYu-NOW-Book-DeepLearn2013-ForLecturesJuly2.docx"
),
]
files_list = []
for url in urls:
name = tempfile.NamedTemporaryFile().name
urllib.request.urlretrieve(url, name)
files_list.append(name)
docsearch: Vectara = Vectara.from_files(
files=files_list,
embedding=FakeEmbeddings(),
metadatas=[{"url": url, "test_num": "2"} for url in urls],
)
# finally do a similarity search to see if all works okay
output = docsearch.similarity_search(
"By the commonly adopted machine learning tradition",
k=1,
n_sentence_context=0,
filter="doc.test_num = 2",
)
print(output)
assert output[0].page_content == (
"By the commonly adopted machine learning tradition "
"(e.g., Chapter 28 in Murphy, 2012; Deng and Li, 2013), it may be natural "
"to just classify deep learning techniques into deep discriminative models "
"(e.g., DNNs) and deep probabilistic generative models (e.g., DBN, Deep "
"Boltzmann Machine (DBM))."
)
