Add dashvector vectorstore (#9163)

## Description

Add `DashVector` vectorstore for langchain

- [dashvector quick start](https://help.aliyun.com/document_detail/2510223.html)
- [dashvector package description](https://pypi.org/project/dashvector/)

## How to use

```python
from langchain.vectorstores.dashvector import DashVector

dashvector = DashVector.from_documents(docs, embeddings)
```

---------

Co-authored-by: smallrain.xuxy <smallrain.xuxy@alibaba-inc.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
1 year ago · b30f449dae
parent bfbb97b74c
commit b30f449dae
5 changed files with 702 additions and 0 deletions
--- a/docs/extras/integrations/providers/dashvector.mdx
+++ b/docs/extras/integrations/providers/dashvector.mdx
@@ -0,0 +1,24 @@
 # DashVector
 > [DashVector](https://help.aliyun.com/document_detail/2510225.html) is a fully-managed vectorDB service that supports high-dimension dense and sparse vectors, real-time insertion and filtered search. It is built to scale automatically and can adapt to different application requirements.  
 This document demonstrates how to leverage DashVector within the LangChain ecosystem. In particular, it shows how to install DashVector and how to use it as a VectorStore plugin in LangChain.
 It is broken into two parts: installation and setup, and then references to specific DashVector wrappers.
 ## Installation and Setup
 Install the Python SDK:
 ```bash
 pip install dashvector
 ```
 ## VectorStore
 A DashVector Collection is wrapped as a familiar VectorStore for native usage within LangChain, 
 which allows it to be readily used for various scenarios, such as semantic search or example selection.
 You may import the vectorstore by:
 ```python
 from langchain.vectorstores import DashVector
 ```
 For a detailed walkthrough of the DashVector wrapper, please refer to [this notebook](/docs/integrations/vectorstores/dashvector.html).
--- a/docs/extras/integrations/vectorstores/dashvector.ipynb
+++ b/docs/extras/integrations/vectorstores/dashvector.ipynb
@@ -0,0 +1,236 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# DashVector\n",
    "\n",
    "> [DashVector](https://help.aliyun.com/document_detail/2510225.html) is a fully-managed vectorDB service that supports high-dimension dense and sparse vectors, real-time insertion and filtered search. It is built to scale automatically and can adapt to different application requirements.\n",
    "\n",
    "This notebook shows how to use functionality related to the `DashVector` vector database.\n",
    "\n",
    "To use DashVector, you must have an API key.\n",
    "Here are the [installation instructions](https://help.aliyun.com/document_detail/2510223.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Install"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install dashvector dashscope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We want to use `DashScopeEmbeddings`, so we also need a DashScope API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    },
    "ExecuteTime": {
     "end_time": "2023-08-11T10:37:15.091585Z",
     "start_time": "2023-08-11T10:36:51.859753Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"DASHVECTOR_API_KEY\"] = getpass.getpass(\"DashVector API Key:\")\n",
    "os.environ[\"DASHSCOPE_API_KEY\"] = getpass.getpass(\"DashScope API Key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    },
    "ExecuteTime": {
     "end_time": "2023-08-11T10:42:30.243460Z",
     "start_time": "2023-08-11T10:42:27.783785Z"
    }
   },
   "outputs": [],
   "source": [
    "from langchain.embeddings.dashscope import DashScopeEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import DashVector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": true,
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2023-08-11T10:42:30.391580Z",
     "start_time": "2023-08-11T10:42:30.249021Z"
    }
   },
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "\n",
    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = DashScopeEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We can create DashVector from documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "dashvector = DashVector.from_documents(docs, embeddings)\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = dashvector.similarity_search(query)\n",
    "print(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We can add texts with metadata and ids, and search with a metadata filter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2023-08-11T10:42:51.641309Z",
     "start_time": "2023-08-11T10:42:51.132109Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(page_content='baz', metadata={'key': 2})]\n"
     ]
    }
   ],
   "source": [
    "texts = [\"foo\", \"bar\", \"baz\"]\n",
    "metadatas = [{\"key\": i} for i in range(len(texts))]\n",
    "ids = [\"0\", \"1\", \"2\"]\n",
    "\n",
    "dashvector.add_texts(texts, metadatas=metadatas, ids=ids)\n",
    "\n",
    "docs = dashvector.similarity_search(\"foo\", filter=\"key = 2\")\n",
    "print(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
 }
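The last notebook cell returns `page_content='baz'` with `metadata={'key': 2}`. A minimal sketch of why, assuming (as the wrapper in this commit does) that the text is stored in the same field mapping as the metadata under a `text` key and split out again on read. `build_upsert_batches` and `split_fields` are hypothetical helper names for illustration only; they are not part of langchain or the dashvector SDK, and the embedding step is left out.

```python
import uuid
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, Dict[str, object]]  # (id, fields); the embedding is omitted here


def build_upsert_batches(
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
    text_field: str = "text",
    batch_size: int = 25,
) -> Iterator[List[Record]]:
    """Yield batches of (id, fields) records with the text stored in `text_field`."""
    text_list = list(texts)  # materialize first so a generator is not consumed twice
    ids = ids or [uuid.uuid4().hex for _ in text_list]
    metadatas = metadatas or [{} for _ in text_list]
    for start in range(0, len(text_list), batch_size):
        end = start + batch_size
        yield [
            # copy the metadata so the caller's dicts are not mutated
            (doc_id, {**metadata, text_field: text})
            for doc_id, metadata, text in zip(
                ids[start:end], metadatas[start:end], text_list[start:end]
            )
        ]


def split_fields(
    fields: Dict[str, object], text_field: str = "text"
) -> Tuple[object, dict]:
    """Split stored fields back into (page_content, metadata)."""
    metadata = dict(fields)
    text = metadata.pop(text_field)
    return text, metadata


batches = list(
    build_upsert_batches(
        ["foo", "bar", "baz"],
        metadatas=[{"key": i} for i in range(3)],
        ids=["0", "1", "2"],
        batch_size=2,
    )
)
print([len(batch) for batch in batches])  # [2, 1]
print(split_fields(batches[1][0][1]))     # ('baz', {'key': 2})
```

Copying each metadata dict before adding the text field is a design choice of this sketch; it keeps the caller's `metadatas` list unchanged after the call.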
--- a/libs/langchain/langchain/vectorstores/__init__.py
+++ b/libs/langchain/langchain/vectorstores/__init__.py
@@ -33,6 +33,7 @@ from langchain.vectorstores.cassandra import Cassandra
 from langchain.vectorstores.chroma import Chroma
 from langchain.vectorstores.clarifai import Clarifai
 from langchain.vectorstores.clickhouse import Clickhouse, ClickhouseSettings
 from langchain.vectorstores.dashvector import DashVector
 from langchain.vectorstores.deeplake import DeepLake
 from langchain.vectorstores.dingo import Dingo
 from langchain.vectorstores.docarray import DocArrayHnswSearch, DocArrayInMemorySearch
@@ -83,6 +84,7 @@ __all__ = [
    "Chroma",
    "Clickhouse",
    "ClickhouseSettings",
    "DashVector",
    "DeepLake",
    "Dingo",
    "DocArrayHnswSearch",
--- a/libs/langchain/langchain/vectorstores/dashvector.py
+++ b/libs/langchain/langchain/vectorstores/dashvector.py
@@ -0,0 +1,365 @@
 """Wrapper around DashVector vector database."""
 from __future__ import annotations
 import logging
 import uuid
 from typing import (
    Any,
    Iterable,
    List,
    Optional,
    Tuple,
 )
 import numpy as np
 from langchain.docstore.document import Document
 from langchain.embeddings.base import Embeddings
 from langchain.utils import get_from_env
 from langchain.vectorstores.base import VectorStore
 from langchain.vectorstores.utils import maximal_marginal_relevance
 logger = logging.getLogger(__name__)
 class DashVector(VectorStore):
    """Wrapper around DashVector vector database.
    To use, you should have the ``dashvector`` python package installed.
    Example:
        .. code-block:: python
            from langchain.vectorstores import DashVector
            from langchain.embeddings.openai import OpenAIEmbeddings
            import dashvector
            client = dashvector.Client(api_key="***")
            # dimension must match the embedding size (1536 is assumed here)
            client.create("langchain", dimension=1536)
            collection = client.get("langchain")
            embeddings = OpenAIEmbeddings()
            vectorstore = DashVector(collection, embeddings, "text")
    """
    def __init__(
        self,
        collection: Any,
        embedding: Embeddings,
        text_field: str,
    ):
        """Initialize with DashVector collection."""
        try:
            import dashvector
        except ImportError:
            raise ValueError(
                "Could not import dashvector python package. "
                "Please install it with `pip install dashvector`."
            )
        if not isinstance(collection, dashvector.Collection):
            raise ValueError(
                f"collection should be an instance of dashvector.Collection, "
                f"but got {type(collection)}"
            )
        self._collection = collection
        self._embedding = embedding
        self._text_field = text_field
    def _similarity_search_with_score_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[str] = None,
    ) -> List[Tuple[Document, float]]:
        """Return docs most similar to query vector, along with scores"""
        # query by vector
        ret = self._collection.query(embedding, topk=k, filter=filter)
        if not ret:
            raise ValueError(
                f"Fail to query docs by vector, error {ret.message}"
            )
        docs = []
        for doc in ret:
            metadata = doc.fields
            text = metadata.pop(self._text_field)
            score = doc.score
            docs.append((Document(page_content=text, metadata=metadata), score))
        return docs
    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        batch_size: int = 25,
        **kwargs: Any,
    ) -> List[str]:
        """Run more texts through the embeddings and add to the vectorstore.
        Args:
            texts: Iterable of strings to add to the vectorstore.
            metadatas: Optional list of metadatas associated with the texts.
            ids: Optional list of ids associated with the texts.
            batch_size: Optional batch size to upsert docs.
            kwargs: vectorstore specific parameters
        Returns:
            List of ids from adding the texts into the vectorstore.
        """
        # materialize texts first so a generator is not exhausted by id generation
        text_list = list(texts)
        ids = ids or [uuid.uuid4().hex for _ in text_list]
        for i in range(0, len(text_list), batch_size):
            # batch end
            end = min(i + batch_size, len(text_list))
            batch_texts = text_list[i:end]
            batch_ids = ids[i:end]
            batch_embeddings = self._embedding.embed_documents(list(batch_texts))
            # batch metadatas
            if metadatas:
                batch_metadatas = metadatas[i:end]
            else:
                batch_metadatas = [{} for _ in range(i, end)]
            for metadata, text in zip(batch_metadatas, batch_texts):
                metadata[self._text_field] = text
            # batch upsert to collection
            docs = list(zip(batch_ids, batch_embeddings, batch_metadatas))
            ret = self._collection.upsert(docs)
            if not ret:
                raise ValueError(
                    f"Fail to upsert docs to dashvector vector database, "
                    f"Error: {ret.message}"
                )
        return ids
    def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> bool:
        """Delete by vector ID.
        Args:
            ids: List of ids to delete.
        Returns:
            True if deletion is successful,
            False otherwise.
        """
        return bool(self._collection.delete(ids))
    def similarity_search(
        self,
        query: str,
        k: int = 4,
        filter: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs most similar to query.
        Args:
            query: Text to search documents similar to.
            k: Number of documents to return. Defaults to 4.
            filter: Doc fields filter conditions that meet the SQL where clause
                    specification.
        Returns:
            List of Documents most similar to the query text.
        """
        docs_and_scores = self.similarity_search_with_relevance_scores(query, k, filter)
        return [doc for doc, _ in docs_and_scores]
    def similarity_search_with_relevance_scores(
        self,
        query: str,
        k: int = 4,
        filter: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Return docs most similar to query text, along with relevance scores.
        Lower scores are more similar; higher scores are more dissimilar.
        Args:
            query: input text
            k: Number of Documents to return. Defaults to 4.
            filter: Doc fields filter conditions that meet the SQL where clause
                    specification.
        Returns:
            List of Tuples of (doc, similarity_score)
        """
        embedding = self._embedding.embed_query(query)
        return self._similarity_search_with_score_by_vector(
            embedding, k=k, filter=filter
        )
    def similarity_search_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs most similar to embedding vector.
        Args:
            embedding: Embedding to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            filter: Doc fields filter conditions that meet the SQL where clause
                    specification.
        Returns:
            List of Documents most similar to the query vector.
        """
        docs_and_scores = self._similarity_search_with_score_by_vector(
            embedding, k, filter
        )
        return [doc for doc, _ in docs_and_scores]
    def max_marginal_relevance_search(
        self,
        query: str,
        k: int = 4,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        filter: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs selected using the maximal marginal relevance.
        Maximal marginal relevance optimizes for similarity to query AND diversity
        among selected documents.
        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
            lambda_mult: Number between 0 and 1 that determines the degree
                        of diversity among the results with 0 corresponding
                        to maximum diversity and 1 to minimum diversity.
                        Defaults to 0.5.
            filter: Doc fields filter conditions that meet the SQL where clause
                    specification.
        Returns:
            List of Documents selected by maximal marginal relevance.
        """
        embedding = self._embedding.embed_query(query)
        return self.max_marginal_relevance_search_by_vector(
            embedding, k, fetch_k, lambda_mult, filter
        )
    def max_marginal_relevance_search_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        filter: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs selected using the maximal marginal relevance.
        Maximal marginal relevance optimizes for similarity to query AND diversity
        among selected documents.
        Args:
            embedding: Embedding to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
            lambda_mult: Number between 0 and 1 that determines the degree
                        of diversity among the results with 0 corresponding
                        to maximum diversity and 1 to minimum diversity.
                        Defaults to 0.5.
            filter: Doc fields filter conditions that meet the SQL where clause
                    specification.
        Returns:
            List of Documents selected by maximal marginal relevance.
        """
        # query by vector
        ret = self._collection.query(
            embedding, topk=fetch_k, filter=filter, include_vector=True
        )
        if not ret:
            raise ValueError(
                f"Fail to query docs by vector, error {ret.message}"
            )
        candidate_embeddings = [doc.vector for doc in ret]
        mmr_selected = maximal_marginal_relevance(
            np.array(embedding), candidate_embeddings, lambda_mult, k
        )
        metadatas = [ret.output[i].fields for i in mmr_selected]
        return [
            Document(page_content=metadata.pop(self._text_field), metadata=metadata)
            for metadata in metadatas
        ]
    @classmethod
    def from_texts(
        cls,
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        dashvector_api_key: Optional[str] = None,
        collection_name: str = "langchain",
        text_field: str = "text",
        batch_size: int = 25,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> DashVector:
        """Return DashVector VectorStore initialized from texts and embeddings.
        This is a quick way to get started with the DashVector vector store.
        Example:
            .. code-block:: python
            from langchain.vectorstores import DashVector
            from langchain.embeddings import OpenAIEmbeddings
            embeddings = OpenAIEmbeddings()
            vectorstore = DashVector.from_texts(
                ["foo", "bar", "baz"],
                embeddings,
                dashvector_api_key="{DASHVECTOR_API_KEY}"
            )
        """
        try:
            import dashvector
        except ImportError:
            raise ValueError(
                "Could not import dashvector python package. "
                "Please install it with `pip install dashvector`."
            )
        dashvector_api_key = dashvector_api_key or get_from_env(
            "dashvector_api_key", "DASHVECTOR_API_KEY"
        )
        dashvector_client = dashvector.Client(api_key=dashvector_api_key)
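        # NOTE: this drops any existing collection with the same name before
        # the lookup below, so from_texts starts from an empty collection.
        dashvector_client.delete(collection_name)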
        collection = dashvector_client.get(collection_name)
        if not collection:
            dim = len(embedding.embed_query(texts[0]))
            # create the collection if it does not exist
            resp = dashvector_client.create(collection_name, dimension=dim)
            if resp:
                collection = dashvector_client.get(collection_name)
            else:
                raise ValueError(
                    "Fail to create collection. " f"Error: {resp.message}."
                )
        dashvector_vector_db = cls(collection, embedding, text_field)
        dashvector_vector_db.add_texts(texts, metadatas, ids, batch_size)
        return dashvector_vector_db
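The MMR docstrings above describe `lambda_mult` as a trade-off between similarity to the query and diversity among the selected documents. The following is a minimal pure-Python sketch of that selection step, assuming cosine similarity; `cosine` and `mmr_select` are hypothetical names written for illustration, and this is not langchain's `maximal_marginal_relevance` implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    lambda_mult: float = 0.5,
    k: int = 4,
) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    if not candidates or k <= 0:
        return []
    relevance = [cosine(query, c) for c in candidates]
    # start with the candidate most similar to the query
    selected = [max(range(len(candidates)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(candidates)):
        best_idx, best_score = -1, -math.inf
        for i, cand in enumerate(candidates):
            if i in selected:
                continue
            # penalize similarity to anything already selected
            redundancy = max(cosine(cand, candidates[j]) for j in selected)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


query = [1.0, 0.1]
candidates = [[1.0, 0.0], [1.0, 0.05], [0.6, 0.8]]  # first two are near-duplicates
print(mmr_select(query, candidates, lambda_mult=0.5, k=2))  # [1, 2]
print(mmr_select(query, candidates, lambda_mult=1.0, k=2))  # [1, 0]
```

With `lambda_mult=0.5` the near-duplicate of the first pick is skipped in favor of the more distinct vector; with `lambda_mult=1.0` the ranking falls back to plain similarity to the query.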
--- a/libs/langchain/tests/integration_tests/vectorstores/test_dashvector.py
+++ b/libs/langchain/tests/integration_tests/vectorstores/test_dashvector.py
@@ -0,0 +1,75 @@
 from time import sleep
 from langchain.schema import Document
 from langchain.vectorstores import DashVector
 from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
 texts = ["foo", "bar", "baz"]
 ids = ["1", "2", "3"]
 def test_dashvector_from_texts() -> None:
    dashvector = DashVector.from_texts(
        texts=texts,
        embedding=FakeEmbeddings(),
        ids=ids,
    )
    # the vector insert operation is async by design, we wait here a bit for the
    # insertion to complete.
    sleep(0.5)
    output = dashvector.similarity_search("foo", k=1)
    assert output == [Document(page_content="foo")]
 def test_dashvector_with_text_with_metadatas() -> None:
    metadatas = [{"meta": i} for i in range(len(texts))]
    dashvector = DashVector.from_texts(
        texts=texts,
        embedding=FakeEmbeddings(),
        metadatas=metadatas,
        ids=ids,
    )
    # the vector insert operation is async by design, we wait here a bit for the
    # insertion to complete.
    sleep(0.5)
    output = dashvector.similarity_search("foo", k=1)
    assert output == [Document(page_content="foo", metadata={"meta": 0})]
 def test_dashvector_search_with_filter() -> None:
    metadatas = [{"meta": i} for i in range(len(texts))]
    dashvector = DashVector.from_texts(
        texts=texts,
        embedding=FakeEmbeddings(),
        metadatas=metadatas,
        ids=ids,
    )
    # the vector insert operation is async by design, we wait here a bit for the
    # insertion to complete.
    sleep(0.5)
    output = dashvector.similarity_search("foo", filter="meta=2")
    assert output == [Document(page_content="baz", metadata={"meta": 2})]
 def test_dashvector_search_with_scores() -> None:
    dashvector = DashVector.from_texts(
        texts=texts,
        embedding=FakeEmbeddings(),
        ids=ids,
    )
    # the vector insert operation is async by design, we wait here a bit for the
    # insertion to complete.
    sleep(0.5)
    output = dashvector.similarity_search_with_relevance_scores("foo")
    docs, scores = zip(*output)
    assert scores[0] < scores[1] < scores[2]
    assert list(docs) == [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="baz"),
    ]
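The tests above wait a fixed `sleep(0.5)` because, per their comments, insertion is asynchronous. One possible alternative is to poll until the expected result appears or a timeout expires. The sketch below is an assumption-laden illustration: `wait_until` is a hypothetical helper, not part of langchain or the test suite, and `fake_search` stands in for a call such as `dashvector.similarity_search("foo", k=1)`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> T:
    """Call `fetch` repeatedly until `is_ready(result)` is true or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


calls = {"n": 0}


def fake_search() -> list:
    # stand-in for a vector store query; returns nothing until the third call
    calls["n"] += 1
    return ["foo"] if calls["n"] >= 3 else []


result = wait_until(fake_search, lambda docs: len(docs) == 1, timeout=2.0, interval=0.01)
print(result, calls["n"])  # ['foo'] 3
```

Polling returns as soon as the data is visible and fails with a clear error otherwise, at the cost of issuing several queries instead of one.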