Harrison/tair (#3770)

Co-authored-by: Seth Huang <848849+seth-hg@users.noreply.github.com>
1 year ago · 0c0f14407c
parent 502ba6a0be
commit 0c0f14407c
6 changed files with 455 additions and 0 deletions
--- a/docs/ecosystem/tair.md
+++ b/docs/ecosystem/tair.md
@@ -0,0 +1,22 @@
+# Tair
+
+This page covers how to use the Tair ecosystem within LangChain.
+
+## Installation and Setup
+
+Install the Tair Python SDK with `pip install tair`.
+
+## Wrappers
+
+### VectorStore
+
+A wrapper around TairVector lets you use it as a vectorstore,
+whether for semantic search or example selection.
+
+To import this vectorstore:
+
+```python
+from langchain.vectorstores import Tair
+```
+
+For a more detailed walkthrough of the Tair wrapper, see [this notebook](../modules/indexes/vectorstores/examples/tair.ipynb).
--- a/docs/modules/indexes/vectorstores/examples/tair.ipynb
+++ b/docs/modules/indexes/vectorstores/examples/tair.ipynb
@@ -0,0 +1,129 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Tair\n",
+    "\n",
+    "This notebook shows how to use functionality related to the Tair vector database.\n",
+    "To run, you should have a [Tair](https://www.alibabacloud.com/help/en/tair/latest/what-is-tair) instance up and running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.embeddings.fake import FakeEmbeddings\n",
+    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "from langchain.vectorstores import Tair"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader\n",
+    "loader = TextLoader('../../../state_of_the_union.txt')\n",
+    "documents = loader.load()\n",
+    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+    "docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "embeddings = FakeEmbeddings(size=128)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Connect to Tair using the `TAIR_URL` environment variable \n",
+    "```\n",
+    "export TAIR_URL=\"redis://{username}:{password}@{tair_address}:{tair_port}\"\n",
+    "```\n",
+    "\n",
+    "or the keyword argument `tair_url`.\n",
+    "\n",
+    "Then store documents and embeddings into Tair."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tair_url = \"redis://localhost:6379\"\n",
+    "\n",
+    "# drop first if index already exists\n",
+    "Tair.drop_index(tair_url=tair_url)\n",
+    "\n",
+    "vector_store = Tair.from_documents(\n",
+    "    docs,\n",
+    "    embeddings,\n",
+    "    tair_url=tair_url\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Query similar documents."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(page_content='We’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.  \\n\\nAnd tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. \\n\\nBy the end of this year, the deficit will be down to less than half what it was before I took office.  \\n\\nThe only president ever to cut the deficit by more than one trillion dollars in a single year. \\n\\nLowering your costs also means demanding more competition. \\n\\nI’m a capitalist, but capitalism without competition isn’t capitalism. \\n\\nIt’s exploitation—and it drives up prices. \\n\\nWhen corporations don’t have to compete, their profits go up, your prices go up, and small businesses and family farmers and ranchers go under. \\n\\nWe see it happening with ocean carriers moving goods in and out of America. \\n\\nDuring the pandemic, these foreign-owned companies raised prices by as much as 1,000% and made record profits.', metadata={'source': '../../../state_of_the_union.txt'})"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
+    "docs = vector_store.similarity_search(query)\n",
+    "docs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
--- a/langchain/vectorstores/__init__.py
+++ b/langchain/vectorstores/__init__.py
@@ -15,6 +15,7 @@ from langchain.vectorstores.pinecone import Pinecone
 from langchain.vectorstores.qdrant import Qdrant
 from langchain.vectorstores.redis import Redis
 from langchain.vectorstores.supabase import SupabaseVectorStore
+from langchain.vectorstores.tair import Tair
 from langchain.vectorstores.weaviate import Weaviate
 from langchain.vectorstores.zilliz import Zilliz

@@ -37,5 +38,6 @@ __all__ = [
    "MyScaleSettings",
    "SupabaseVectorStore",
    "AnalyticDB",
+    "Tair",
    "LanceDB",
 ]
--- a/langchain/vectorstores/tair.py
+++ b/langchain/vectorstores/tair.py
@@ -0,0 +1,286 @@
+"""Wrapper around Tair Vector."""
+from __future__ import annotations
+
+import json
+import logging
+import uuid
+from typing import Any, Iterable, List, Optional, Type
+
+from langchain.docstore.document import Document
+from langchain.embeddings.base import Embeddings
+from langchain.utils import get_from_dict_or_env
+from langchain.vectorstores.base import VectorStore
+
+logger = logging.getLogger(__name__)
+
+
+def _uuid_key() -> str:
+    return uuid.uuid4().hex
+
+
+class Tair(VectorStore):
+    def __init__(
+        self,
+        embedding_function: Embeddings,
+        url: str,
+        index_name: str,
+        content_key: str = "content",
+        metadata_key: str = "metadata",
+        search_params: Optional[dict] = None,
+        **kwargs: Any,
+    ):
+        self.embedding_function = embedding_function
+        self.index_name = index_name
+        try:
+            from tair import Tair as TairClient
+        except ImportError:
+            raise ValueError(
+                "Could not import tair python package. "
+                "Please install it with `pip install tair`."
+            )
+        try:
+            # connect to tair from url
+            client = TairClient.from_url(url, **kwargs)
+        except ValueError as e:
+            raise ValueError(f"Tair failed to connect: {e}")
+
+        self.client = client
+        self.content_key = content_key
+        self.metadata_key = metadata_key
+        self.search_params = search_params
+
+    def create_index_if_not_exist(
+        self,
+        dim: int,
+        distance_type: str,
+        index_type: str,
+        data_type: str,
+        **kwargs: Any,
+    ) -> bool:
+        index = self.client.tvs_get_index(self.index_name)
+        if index is not None:
+            logger.info("Index already exists")
+            return False
+        self.client.tvs_create_index(
+            self.index_name,
+            dim,
+            distance_type,
+            index_type,
+            data_type,
+            **kwargs,
+        )
+        return True
+
+    def add_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[dict]] = None,
+        **kwargs: Any,
+    ) -> List[str]:
+        """Add texts data to an existing index."""
+        ids = []
+        keys = kwargs.get("keys", None)
+        texts = list(texts)  # materialize once so iterators are not exhausted
+        pipeline = self.client.pipeline(transaction=False)
+        embeddings = self.embedding_function.embed_documents(texts)
+        for i, text in enumerate(texts):
+            # Use provided key otherwise use default key
+            key = keys[i] if keys else _uuid_key()
+            metadata = metadatas[i] if metadatas else {}
+            pipeline.tvs_hset(
+                self.index_name,
+                key,
+                embeddings[i],
+                False,
+                **{
+                    self.content_key: text,
+                    self.metadata_key: json.dumps(metadata),
+                },
+            )
+            ids.append(key)
+        pipeline.execute()
+        return ids
+
+    def similarity_search(
+        self, query: str, k: int = 4, **kwargs: Any
+    ) -> List[Document]:
+        """
+        Returns the most similar indexed documents to the query text.
+
+        Args:
+            query (str): The query text for which to find similar documents.
+            k (int): The number of documents to return. Default is 4.
+
+        Returns:
+            List[Document]: A list of documents that are most similar to the query text.
+        """
+        # Creates embedding vector from user query
+        embedding = self.embedding_function.embed_query(query)
+
+        keys_and_scores = self.client.tvs_knnsearch(
+            self.index_name, k, embedding, False, None, **kwargs
+        )
+
+        pipeline = self.client.pipeline(transaction=False)
+        for key, _ in keys_and_scores:
+            pipeline.tvs_hmget(
+                self.index_name, key, self.metadata_key, self.content_key
+            )
+        docs = pipeline.execute()
+
+        return [
+            Document(
+                page_content=d[1],
+                metadata=json.loads(d[0]),
+            )
+            for d in docs
+        ]
+
+    @classmethod
+    def from_texts(
+        cls: Type[Tair],
+        texts: List[str],
+        embedding: Embeddings,
+        metadatas: Optional[List[dict]] = None,
+        index_name: str = "langchain",
+        content_key: str = "content",
+        metadata_key: str = "metadata",
+        **kwargs: Any,
+    ) -> Tair:
+        try:
+            from tair import tairvector
+        except ImportError:
+            raise ValueError(
+                "Could not import tair python package. "
+                "Please install it with `pip install tair`."
+            )
+        url = get_from_dict_or_env(kwargs, "tair_url", "TAIR_URL")
+        if "tair_url" in kwargs:
+            kwargs.pop("tair_url")
+
+        distance_type = tairvector.DistanceMetric.InnerProduct
+        if "distance_type" in kwargs:
+            distance_type = kwargs.pop("distance_type")
+        index_type = tairvector.IndexType.HNSW
+        if "index_type" in kwargs:
+            index_type = kwargs.pop("index_type")
+        data_type = tairvector.DataType.Float32
+        if "data_type" in kwargs:
+            data_type = kwargs.pop("data_type")
+        index_params = {}
+        if "index_params" in kwargs:
+            index_params = kwargs.pop("index_params")
+        search_params = {}
+        if "search_params" in kwargs:
+            search_params = kwargs.pop("search_params")
+
+        keys = None
+        if "keys" in kwargs:
+            keys = kwargs.pop("keys")
+        try:
+            tair_vector_store = cls(
+                embedding,
+                url,
+                index_name,
+                content_key=content_key,
+                metadata_key=metadata_key,
+                search_params=search_params,
+                **kwargs,
+            )
+        except ValueError as e:
+            raise ValueError(f"tair failed to connect: {e}")
+
+        # Create embeddings for documents
+        embeddings = embedding.embed_documents(texts)
+
+        tair_vector_store.create_index_if_not_exist(
+            len(embeddings[0]),
+            distance_type,
+            index_type,
+            data_type,
+            **index_params,
+        )
+
+        tair_vector_store.add_texts(texts, metadatas, keys=keys)
+        return tair_vector_store
+
+    @classmethod
+    def from_documents(
+        cls,
+        documents: List[Document],
+        embedding: Embeddings,
+        metadatas: Optional[List[dict]] = None,
+        index_name: str = "langchain",
+        content_key: str = "content",
+        metadata_key: str = "metadata",
+        **kwargs: Any,
+    ) -> Tair:
+        texts = [d.page_content for d in documents]
+        metadatas = [d.metadata for d in documents]
+
+        return cls.from_texts(
+            texts, embedding, metadatas, index_name, content_key, metadata_key, **kwargs
+        )
+
+    @staticmethod
+    def drop_index(
+        index_name: str = "langchain",
+        **kwargs: Any,
+    ) -> bool:
+        """
+        Drop an existing index.
+
+        Args:
+            index_name (str): Name of the index to drop.
+
+        Returns:
+            bool: True if the index is dropped successfully.
+        """
+        try:
+            from tair import Tair as TairClient
+        except ImportError:
+            raise ValueError(
+                "Could not import tair python package. "
+                "Please install it with `pip install tair`."
+            )
+        url = get_from_dict_or_env(kwargs, "tair_url", "TAIR_URL")
+
+        try:
+            if "tair_url" in kwargs:
+                kwargs.pop("tair_url")
+            client = TairClient.from_url(url=url, **kwargs)
+        except ValueError as e:
+            raise ValueError(f"Tair connection error: {e}")
+        # delete index
+        ret = client.tvs_del_index(index_name)
+        if ret == 0:
+            # index does not exist
+            logger.info("Index does not exist")
+            return False
+        return True
+
+    @classmethod
+    def from_existing_index(
+        cls,
+        embedding: Embeddings,
+        index_name: str = "langchain",
+        content_key: str = "content",
+        metadata_key: str = "metadata",
+        **kwargs: Any,
+    ) -> Tair:
+        """Connect to an existing Tair index."""
+        url = get_from_dict_or_env(kwargs, "tair_url", "TAIR_URL")
+        kwargs.pop("tair_url", None)  # do not forward tair_url to the client
+        search_params = {}
+        if "search_params" in kwargs:
+            search_params = kwargs.pop("search_params")
+
+        return cls(
+            embedding,
+            url,
+            index_name,
+            content_key=content_key,
+            metadata_key=metadata_key,
+            search_params=search_params,
+            **kwargs,
+        )
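
Note on the option handling in `from_texts`: each optional setting is checked with `in` and then popped, so every key name is written twice. A minimal sketch of the same idea using `dict.pop(key, default)` follows. The helper name `split_options` and the placeholder string defaults are hypothetical and only for illustration; the code in this commit uses the `tairvector` enum values as defaults.

```python
import copy
from typing import Any, Dict, Tuple

# Hypothetical placeholder defaults standing in for the tairvector enum values.
DEFAULTS: Dict[str, Any] = {
    "distance_type": "IP",
    "index_type": "HNSW",
    "data_type": "FLOAT32",
    "index_params": {},
    "search_params": {},
    "keys": None,
}


def split_options(kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate index options from the kwargs forwarded to the client."""
    remaining = dict(kwargs)  # copy so the caller's dict is not mutated
    options = {
        # pop returns the supplied value, or a fresh copy of the default
        name: remaining.pop(name, copy.deepcopy(default))
        for name, default in DEFAULTS.items()
    }
    return options, remaining


options, client_kwargs = split_options({"distance_type": "L2", "socket_timeout": 5})
```

Here `options` holds the index settings with defaults filled in, and `client_kwargs` holds only what is left over for the connection. Because each key name appears once, a misspelled key cannot slip in between the membership check and the pop.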
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -126,6 +126,7 @@ python-dotenv = "^1.0.0"
 sentence-transformers = "^2"
 gptcache = "^0.1.9"
 promptlayer = "^0.1.80"
+tair = "^1.3.3"

 [tool.poetry.group.lint.dependencies]
 ruff = "^0.0.249"
--- a/tests/integration_tests/vectorstores/test_tair.py
+++ b/tests/integration_tests/vectorstores/test_tair.py
@@ -0,0 +1,15 @@
+"""Test tair functionality."""
+
+from langchain.docstore.document import Document
+from langchain.vectorstores.tair import Tair
+from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
+
+
+def test_tair() -> None:
+    """Test end to end construction and search."""
+    texts = ["foo", "bar", "baz"]
+    docsearch = Tair.from_texts(
+        texts, FakeEmbeddings(), tair_url="redis://localhost:6379"
+    )
+    output = docsearch.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo")]
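
The assertion in `test_tair` compares against `Document(page_content="foo")`, whose metadata defaults to an empty dict. That comparison relies on the wrapper storing metadata with `json.dumps` and restoring it with `json.loads`. The round trip can be sketched without a running Tair instance. The `fake_index` dict and the `store` and `load` helpers below are hypothetical stand-ins for illustration; no Tair client calls are made.

```python
import json
from typing import Dict, Tuple

# In-memory stand-in for the Tair index: key -> stored fields.
# This is an assumption for illustration, not part of the commit.
fake_index: Dict[str, Dict[str, str]] = {}


def store(key: str, text: str, metadata: dict) -> None:
    # Mirrors add_texts: content is stored as-is, metadata as a JSON string.
    fake_index[key] = {"content": text, "metadata": json.dumps(metadata)}


def load(key: str) -> Tuple[str, dict]:
    # Mirrors similarity_search: metadata is decoded back into a dict.
    fields = fake_index[key]
    return fields["content"], json.loads(fields["metadata"])


store("k1", "foo", {})
store("k2", "bar", {"source": "notes.txt"})
```

With no metadata supplied, `add_texts` writes `{}`, which decodes back to an empty dict and therefore matches the default `Document` metadata.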