mirror of https://github.com/hwchase17/langchain
Add support for Xata as a vector store (#8822)
This adds support for [Xata](https://xata.io) (a serverless data platform based on PostgreSQL) as a vector store. We have recently added [Xata to Langchain.js](https://github.com/hwchase17/langchainjs/pull/2125) and would love to have the equivalent in the Python project as well. The PR includes integration tests and a Jupyter notebook as docs. Please let me know if anything else would be needed or helpful. I have added the xata Python SDK as an optional dependency.

## To run the integration tests

You will need to create a DB in Xata (see the docs), then run something like:

```
OPENAI_API_KEY=sk-... XATA_API_KEY=xau_... XATA_DB_URL='https://....xata.sh/db/langchain' poetry run pytest tests/integration_tests/vectorstores/test_xata.py
```

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Philip Krauss <35487337+philkra@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
parent 472f00ada7
commit aeaef8f3a3
@ -0,0 +1,240 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Xata\n",
    "\n",
    "> [Xata](https://xata.io) is a serverless data platform, based on PostgreSQL. It provides a Python SDK for interacting with your database, and a UI for managing your data.\n",
    "> Xata has a native vector type, which can be added to any table, and supports similarity search. LangChain inserts vectors directly into Xata, and queries it for the nearest neighbors of a given vector, so that you can use all the LangChain Embeddings integrations with Xata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to use Xata as a VectorStore."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Create a database to use as a vector store\n",
    "\n",
    "In the [Xata UI](https://app.xata.io) create a new database. You can name it whatever you want; in this notebook we'll use `langchain`.\n",
    "Create a table; again, you can name it anything, but we will use `vectors`. Add the following columns via the UI:\n",
    "\n",
    "* `content` of type \"Text\". This is used to store the `Document.pageContent` values.\n",
    "* `embedding` of type \"Vector\". Use the dimension used by the model you plan to use. In this notebook we use OpenAI embeddings, which have 1536 dimensions.\n",
    "* `search` of type \"Text\". This is used as a metadata column by this example.\n",
    "* any other columns you want to use as metadata. They are populated from the `Document.metadata` object. For example, if in the `Document.metadata` object you have a `title` property, you can create a `title` column in the table and it will be populated.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first install our dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "!pip install xata==1.0.0a7 openai tiktoken langchain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the OpenAI key into the environment. If you don't have one, you can create an OpenAI account and create a key on this [page](https://platform.openai.com/account/api-keys)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we need to get the environment variables for Xata. You can create a new API key by visiting your [account settings](https://app.xata.io/settings). To find the database URL, go to the Settings page of the database that you have created. The database URL should look something like this: `https://demo-uni3q8.eu-west-1.xata.sh/db/langchain`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "api_key = getpass.getpass(\"Xata API key: \")\n",
    "db_url = input(\"Xata database URL (copy it from your DB settings):\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.vectorstores.xata import XataVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the Xata vector store\n",
    "Let's import our test dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now create the actual vector store, backed by the Xata table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "vector_store = XataVectorStore.from_documents(docs, embeddings, api_key=api_key, db_url=db_url, table_name=\"vectors\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After running the above command, if you go to the Xata UI, you should see the documents loaded together with their embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarity Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vector_store.similarity_search(query)\n",
    "print(found_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarity Search with score (vector distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "result = vector_store.similarity_search_with_score(query)\n",
    "for doc, score in result:\n",
    "    print(f\"document={doc}, score={score}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
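The notebook relies on `CharacterTextSplitter` with `chunk_size=1000, chunk_overlap=0` before embedding. As a rough illustration of the chunk-size/overlap idea only: LangChain's actual splitter works on separators such as blank lines, so this fixed-window version is a simplified sketch, not the real implementation.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Naive fixed-window splitter: emit chunk_size-character windows,
    stepping by (chunk_size - chunk_overlap) characters each time."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

With `chunk_overlap=0` the windows tile the text exactly; a positive overlap repeats the tail of each chunk at the head of the next, which helps keep context intact across chunk boundaries.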
@ -0,0 +1,263 @@
"""Wrapper around Xata as a vector database."""

from __future__ import annotations

import time
from itertools import repeat
from typing import Any, Dict, Iterable, List, Optional, Tuple, Type

from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.base import VectorStore


class XataVectorStore(VectorStore):
    """VectorStore for a Xata database. Assumes you have a Xata database
    created with the right schema. See the guide at:
    https://integrations.langchain.com/vectorstores?integration_name=XataVectorStore
    """

    def __init__(
        self,
        api_key: str,
        db_url: str,
        embedding: Embeddings,
        table_name: str,
    ) -> None:
        """Initialize with Xata client."""
        try:
            from xata.client import XataClient  # noqa: F401
        except ImportError:
            raise ImportError(
                "Could not import xata python package. "
                "Please install it with `pip install xata`."
            )
        self._client = XataClient(api_key=api_key, db_url=db_url)
        self._embedding: Embeddings = embedding
        self._table_name = table_name or "vectors"

    @property
    def embeddings(self) -> Embeddings:
        return self._embedding

    def add_vectors(
        self,
        vectors: List[List[float]],
        documents: List[Document],
        ids: Optional[List[str]] = None,
    ) -> List[str]:
        return self._add_vectors(vectors, documents, ids)

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[Dict[Any, Any]]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        docs = self._texts_to_documents(texts, metadatas)
        vectors = self._embedding.embed_documents(list(texts))
        return self.add_vectors(vectors, docs, ids)

    def _add_vectors(
        self,
        vectors: List[List[float]],
        documents: List[Document],
        ids: Optional[List[str]] = None,
    ) -> List[str]:
        """Add vectors to the Xata database."""
        rows: List[Dict[str, Any]] = []
        for idx, embedding in enumerate(vectors):
            row = {
                "content": documents[idx].page_content,
                "embedding": embedding,
            }
            if ids:
                row["id"] = ids[idx]
            for key, val in documents[idx].metadata.items():
                if key not in ["id", "content", "embedding"]:
                    row[key] = val
            rows.append(row)

        # XXX: I would have liked to use the BulkProcessor here, but it
        # doesn't return the IDs, which we need here. Manual chunking it is.
        chunk_size = 1000
        id_list: List[str] = []
        for i in range(0, len(rows), chunk_size):
            chunk = rows[i : i + chunk_size]

            r = self._client.records().bulk_insert(self._table_name, {"records": chunk})
            if r.status_code != 200:
                raise Exception(f"Error adding vectors to Xata: {r.status_code} {r}")
            id_list.extend(r["recordIDs"])
        return id_list
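`_add_vectors` above builds one row per document and inserts in batches of 1000 to keep each request bounded, collecting the record IDs the server returns. The chunking pattern in isolation, with a stub insert function standing in for the Xata client (names here are illustrative, not part of the SDK):

```python
from typing import Any, Callable, Dict, List


def insert_in_chunks(
    rows: List[Dict[str, Any]],
    bulk_insert: Callable[[List[Dict[str, Any]]], List[str]],
    chunk_size: int = 1000,
) -> List[str]:
    """Send rows in chunk_size batches, collecting the returned record IDs
    so the caller gets one ID per row, in insertion order."""
    id_list: List[str] = []
    for i in range(0, len(rows), chunk_size):
        id_list.extend(bulk_insert(rows[i : i + chunk_size]))
    return id_list
```

The slice `rows[i : i + chunk_size]` naturally handles a short final batch, so no special-casing of the remainder is needed.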
    @staticmethod
    def _texts_to_documents(
        texts: Iterable[str],
        metadatas: Optional[Iterable[Dict[Any, Any]]] = None,
    ) -> List[Document]:
        """Return list of Documents from list of texts and metadatas."""
        if metadatas is None:
            metadatas = repeat({})

        docs = [
            Document(page_content=text, metadata=metadata)
            for text, metadata in zip(texts, metadatas)
        ]

        return docs

    @classmethod
    def from_texts(
        cls: Type["XataVectorStore"],
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        api_key: Optional[str] = None,
        db_url: Optional[str] = None,
        table_name: str = "vectors",
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> "XataVectorStore":
        """Return VectorStore initialized from texts and embeddings."""

        if not api_key or not db_url:
            raise ValueError("Xata api_key and db_url must be set.")

        embeddings = embedding.embed_documents(texts)
        ids = None  # Xata will generate them for us
        docs = cls._texts_to_documents(texts, metadatas)

        vector_db = cls(
            api_key=api_key,
            db_url=db_url,
            embedding=embedding,
            table_name=table_name,
        )

        vector_db._add_vectors(embeddings, docs, ids)
        return vector_db

    def similarity_search(
        self, query: str, k: int = 4, filter: Optional[dict] = None, **kwargs: Any
    ) -> List[Document]:
        """Return docs most similar to query.

        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.

        Returns:
            List of Documents most similar to the query.
        """
        docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
        documents = [d[0] for d in docs_and_scores]
        return documents

    def similarity_search_with_score(
        self, query: str, k: int = 4, filter: Optional[dict] = None, **kwargs: Any
    ) -> List[Tuple[Document, float]]:
        """Run similarity search with Xata and return documents with distance.

        Args:
            query (str): Query text to search for.
            k (int): Number of results to return. Defaults to 4.
            filter (Optional[dict]): Filter by metadata. Defaults to None.

        Returns:
            List[Tuple[Document, float]]: List of documents most similar to the query
                text with distance in float.
        """
        embedding = self._embedding.embed_query(query)
        payload = {
            "queryVector": embedding,
            "column": "embedding",
            "size": k,
        }
        if filter:
            payload["filter"] = filter
        r = self._client.data().vector_search(self._table_name, payload=payload)
        if r.status_code != 200:
            raise Exception(f"Error running similarity search: {r.status_code} {r}")
        hits = r["records"]
        docs_and_scores = [
            (
                Document(
                    page_content=hit["content"],
                    metadata=self._extractMetadata(hit),
                ),
                hit["xata"]["score"],
            )
            for hit in hits
        ]
        return docs_and_scores
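`similarity_search_with_score` above sends Xata a vector-search body with `queryVector`, `column`, `size`, and an optional `filter`. That payload construction, pulled out on its own as a sketch (the field names match what the wrapper sends; the function itself is illustrative, not part of the SDK):

```python
from typing import Any, Dict, List, Optional


def build_vector_search_payload(
    query_vector: List[float], k: int = 4, filter: Optional[dict] = None
) -> Dict[str, Any]:
    """Assemble the search body: nearest-k on the 'embedding' column,
    with the metadata filter included only when one is given."""
    payload: Dict[str, Any] = {
        "queryVector": query_vector,
        "column": "embedding",
        "size": k,
    }
    if filter:
        payload["filter"] = filter
    return payload
```

Omitting the `filter` key entirely when no filter is given (rather than sending `"filter": None`) keeps the request body minimal.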
    def _extractMetadata(self, record: dict) -> dict:
        """Extract metadata from a record. Filters out known columns."""
        metadata = {}
        for key, val in record.items():
            if key not in ["id", "content", "embedding", "xata"]:
                metadata[key] = val
        return metadata

    def delete(
        self,
        ids: Optional[List[str]] = None,
        delete_all: Optional[bool] = None,
        **kwargs: Any,
    ) -> None:
        """Delete by vector IDs.

        Args:
            ids: List of ids to delete.
            delete_all: Delete all records in the table.
        """
        if delete_all:
            self._delete_all()
            self.wait_for_indexing(ndocs=0)
        elif ids is not None:
            chunk_size = 500
            for i in range(0, len(ids), chunk_size):
                chunk = ids[i : i + chunk_size]
                operations = [
                    {"delete": {"table": self._table_name, "id": id}} for id in chunk
                ]
                self._client.records().transaction(payload={"operations": operations})
        else:
            raise ValueError("Either ids or delete_all must be set.")

    def _delete_all(self) -> None:
        """Delete all records in the table."""
        while True:
            r = self._client.data().query(self._table_name, payload={"columns": ["id"]})
            if r.status_code != 200:
                raise Exception(f"Error running query: {r.status_code} {r}")
            ids = [rec["id"] for rec in r["records"]]
            if len(ids) == 0:
                break
            operations = [
                {"delete": {"table": self._table_name, "id": id}} for id in ids
            ]
            self._client.records().transaction(payload={"operations": operations})

    def wait_for_indexing(self, timeout: float = 5, ndocs: int = 1) -> None:
        """Wait for the search index to contain a certain number of
        documents. Useful in tests.
        """
        start = time.time()
        while True:
            r = self._client.data().search_table(
                self._table_name, payload={"query": "", "page": {"size": 0}}
            )
            if r.status_code != 200:
                raise Exception(f"Error running search: {r.status_code} {r}")
            if r["totalCount"] == ndocs:
                break
            if time.time() - start > timeout:
                raise Exception("Timed out waiting for indexing to complete.")
            time.sleep(0.5)
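`wait_for_indexing` above polls Xata's search endpoint until `totalCount` reaches the expected document count, because search indexing is asynchronous, and raises once the timeout elapses. The generic poll-until-true-or-timeout pattern, with the Xata call replaced by an arbitrary predicate (an illustrative helper, not part of this PR):

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout: float = 5, interval: float = 0.5
) -> None:
    """Poll predicate() until it returns True, raising TimeoutError if
    the timeout elapses first; sleep between attempts to avoid busy-waiting."""
    start = time.time()
    while not predicate():
        if time.time() - start > timeout:
            raise TimeoutError("Timed out waiting for condition.")
        time.sleep(interval)
```

The same helper would express `wait_for_indexing` as `wait_until(lambda: current_count() == ndocs)`, keeping the timeout bookkeeping in one place.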
@ -0,0 +1,56 @@
"""Test Xata vector store functionality.

Before running this test, please create a Xata database by following
the instructions from:
https://python.langchain.com/docs/integrations/vectorstores/xata
"""
import os

from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.xata import XataVectorStore


class TestXata:
    @classmethod
    def setup_class(cls) -> None:
        assert os.getenv("XATA_API_KEY"), "XATA_API_KEY environment variable is not set"
        assert os.getenv("XATA_DB_URL"), "XATA_DB_URL environment variable is not set"

    def test_similarity_search_without_metadata(
        self, embedding_openai: OpenAIEmbeddings
    ) -> None:
        """Test end to end construction and search without metadata."""
        texts = ["foo", "bar", "baz"]
        docsearch = XataVectorStore.from_texts(
            api_key=os.getenv("XATA_API_KEY"),
            db_url=os.getenv("XATA_DB_URL"),
            texts=texts,
            embedding=embedding_openai,
        )
        docsearch.wait_for_indexing(ndocs=3)

        output = docsearch.similarity_search("foo", k=1)
        assert output == [Document(page_content="foo")]
        docsearch.delete(delete_all=True)

    def test_similarity_search_with_metadata(
        self, embedding_openai: OpenAIEmbeddings
    ) -> None:
        """Test end to end construction and search with a metadata filter.

        This test requires a column named "a" of type integer to be present
        in the Xata table."""
        texts = ["foo", "foo", "foo"]
        metadatas = [{"a": i} for i in range(len(texts))]
        docsearch = XataVectorStore.from_texts(
            api_key=os.getenv("XATA_API_KEY"),
            db_url=os.getenv("XATA_DB_URL"),
            texts=texts,
            embedding=embedding_openai,
            metadatas=metadatas,
        )
        docsearch.wait_for_indexing(ndocs=3)
        output = docsearch.similarity_search("foo", k=1, filter={"a": 1})
        assert output == [Document(page_content="foo", metadata={"a": 1})]
        docsearch.delete(delete_all=True)
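These tests hit both OpenAI and Xata, so they are integration tests by nature (the `embedding_openai` fixture is supplied by the test suite's conftest). For offline unit testing of pieces like row construction, a deterministic fake could stand in for `OpenAIEmbeddings` — a hypothetical helper, not included in this PR:

```python
import hashlib
from typing import List


class FakeEmbeddings:
    """Deterministic embeddings stand-in: derive a fixed-dimension vector
    from the SHA-256 hash of the text, so the same text always maps to
    the same vector and no network access is needed."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed_query(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes into [0, 1] floats.
        return [b / 255.0 for b in digest[: self.dim]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]
```

A real test would pick `dim` to match the table's vector column (1536 for OpenAI embeddings); the small default here just keeps the fake cheap.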