Harrison/myscale (#3352)

Co-authored-by: Fangrui Liu <fangruil@moqi.ai> Co-authored-by: 刘方瑞 <fangrui.liu@outlook.com> Co-authored-by: Fangrui.Liu <fangrui.liu@ubc.ca>
1 year ago · a6664be79c
parent 6200a2a00e
commit a6664be79c
8 changed files with 887 additions and 7 deletions
--- a/docs/ecosystem/myscale.md
+++ b/docs/ecosystem/myscale.md
@@ -0,0 +1,65 @@
+# MyScale
+
+This page covers how to use the MyScale vector database within LangChain.
+It is broken into two parts: installation and setup, and then references to specific MyScale wrappers.
+
+With MyScale, you can manage both structured and unstructured (vectorized) data, and perform joint queries and analytics on both types of data using SQL. MyScale's cloud-native OLAP architecture, built on top of ClickHouse, enables fast data processing even on massive datasets.
+
+## Introduction
+
+[Overview of MyScale and high-performance vector search](https://docs.myscale.com/en/overview/)
+
+You can register on our SaaS and [start a cluster now!](https://docs.myscale.com/en/quickstart/)
+
+If you are also interested in how we integrated SQL and vector search, please refer to [this document](https://docs.myscale.com/en/vector-reference/) for further syntax reference.
+
+We also host live demos on Hugging Face. Please check out our [Hugging Face space](https://huggingface.co/myscale). The demos search millions of vectors in the blink of an eye.
+
+## Installation and Setup
+- Install the Python SDK with `pip install clickhouse-connect`
+
+### Setting up environments
+
+There are two ways to set up parameters for the MyScale index.
+
+1. Environment Variables
+
+    Before you run the app, please set the environment variables with `export`:
+    `export MYSCALE_HOST='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`
+
+    You can find your account, password and other info on our SaaS. For details, please refer to [this document](https://docs.myscale.com/en/cluster-management/).
+    Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_`, and names are case-insensitive.
+
+2. Create a `MyScaleSettings` object with parameters
+
+
+    ```python
+    from langchain.vectorstores import MyScale, MyScaleSettings
+    config = MyScaleSettings(host="<your-backend-url>", port=8443, ...)
+    index = MyScale(embedding_function, config)
+    index.add_documents(...)
+    ```
+  
+## Wrappers
+Supported functions:
+- `add_texts`
+- `add_documents`
+- `from_texts`
+- `from_documents`
+- `similarity_search`
+- `asimilarity_search`
+- `similarity_search_by_vector`
+- `asimilarity_search_by_vector`
+- `similarity_search_with_relevance_scores`
+
+### VectorStore
+
+There is a wrapper around the MyScale database that lets you use it as a vectorstore,
+whether for semantic search or similar-example retrieval.
+
+To import this vectorstore:
+```python
+from langchain.vectorstores import MyScale
+```
+
+For a more detailed walkthrough of the MyScale wrapper, see [this notebook](../modules/indexes/vectorstores/examples/myscale.ipynb)
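
The environment-variable convention documented in `docs/ecosystem/myscale.md` above can be sketched without LangChain or pydantic. This is a minimal illustration, assuming the `MYSCALE_` prefix is matched case-insensitively against the `MyScaleSettings` field names added in this commit. The helper `settings_from_env` and the `DEFAULTS` table are hypothetical stand-ins written with the standard library only; they are not how `pydantic.BaseSettings` is implemented.

```python
import os
from typing import Dict, Mapping, Optional

# Field names and defaults copied from MyScaleSettings in this commit
# (index and column-map fields are left out to keep the sketch short).
DEFAULTS: Dict[str, object] = {
    "host": "localhost",
    "port": 8443,
    "username": None,
    "password": None,
    "database": "default",
    "table": "langchain",
    "metric": "cosine",
}


def settings_from_env(
    env: Optional[Mapping[str, str]] = None, prefix: str = "myscale_"
) -> Dict[str, object]:
    """Hypothetical helper: overlay MYSCALE_* variables on the defaults."""
    env = os.environ if env is None else env
    resolved = dict(DEFAULTS)
    for name, raw in env.items():
        key = name.lower()  # case-insensitive match
        if not key.startswith(prefix):
            continue
        field = key[len(prefix):]
        if field not in resolved:
            continue  # skip variables that do not name a settings field
        # Only the port needs a type conversion in this sketch.
        resolved[field] = int(raw) if field == "port" else raw
    return resolved


example = settings_from_env({"MYSCALE_HOST": "db.example.com", "myscale_port": "443"})
print(example["host"], example["port"])
```

Because the field is named `host`, the variable that sets it under this convention is `MYSCALE_HOST`; a variable whose suffix does not match a field name is simply ignored.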
--- a/docs/modules/indexes/vectorstores/examples/myscale.ipynb
+++ b/docs/modules/indexes/vectorstores/examples/myscale.ipynb
@@ -0,0 +1,267 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "683953b3",
+   "metadata": {},
+   "source": [
+    "# MyScale\n",
+    "\n",
+    "This notebook shows how to use functionality related to the MyScale vector database."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "aac9563e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "from langchain.vectorstores import MyScale\n",
+    "from langchain.document_loaders import TextLoader"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a9d16fa3",
+   "metadata": {},
+   "source": [
+    "## Setting up environments\n",
+    "\n",
+    "There are two ways to set up parameters for the MyScale index.\n",
+    "\n",
+    "1. Environment Variables\n",
+    "\n",
+    "    Before you run the app, please set the environment variables with `export`:\n",
+    "    `export MYSCALE_HOST='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`\n",
+    "\n",
+    "    You can find your account, password and other info on our SaaS. For details, please refer to [this document](https://docs.myscale.com/en/cluster-management/).\n",
+    "\n",
+    "    Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_`, and names are case-insensitive.\n",
+    "\n",
+    "2. Create a `MyScaleSettings` object with parameters\n",
+    "\n",
+    "\n",
+    "    ```python\n",
+    "    from langchain.vectorstores import MyScale, MyScaleSettings\n",
+    "    config = MyScaleSettings(host=\"<your-backend-url>\", port=8443, ...)\n",
+    "    index = MyScale(embedding_function, config)\n",
+    "    index.add_documents(...)\n",
+    "    ```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "a3c3999a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader\n",
+    "loader = TextLoader('../../../state_of_the_union.txt')\n",
+    "documents = loader.load()\n",
+    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+    "docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "6e104aee",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Inserting data...: 100%|██████████| 42/42 [00:18<00:00,  2.21it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for d in docs:\n",
+    "    d.metadata = {'some': 'metadata'}\n",
+    "docsearch = MyScale.from_documents(docs, embeddings)\n",
+    "\n",
+    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
+    "docs = docsearch.similarity_search(query)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "9c608226",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "As Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment they’re conducting on our children for profit. \n",
+      "\n",
+      "It’s time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
+      "\n",
+      "And let’s get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
+      "\n",
+      "Third, support our veterans. \n",
+      "\n",
+      "Veterans are the best of us. \n",
+      "\n",
+      "I’ve always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
+      "\n",
+      "My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free.  \n",
+      "\n",
+      "Our troops in Iraq and Afghanistan faced many dangers.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(docs[0].page_content)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e3a8b105",
+   "metadata": {},
+   "source": [
+    "## Get connection info and data schema"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "69996818",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(str(docsearch))"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f59360c0",
+   "metadata": {},
+   "source": [
+    "## Filtering\n",
+    "\n",
+    "You have direct access to the MyScale SQL `WHERE` statement. You can write a `WHERE` clause following standard SQL.\n",
+    "\n",
+    "**NOTE**: Be aware of SQL injection; this interface must not be called directly by end users.\n",
+    "\n",
+    "If you customized your `column_map` in your settings, you can search with a filter like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "232055f6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Inserting data...: 100%|██████████| 42/42 [00:15<00:00,  2.69it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.vectorstores import MyScale, MyScaleSettings\n",
+    "from langchain.document_loaders import TextLoader\n",
+    "\n",
+    "loader = TextLoader('../../../state_of_the_union.txt')\n",
+    "documents = loader.load()\n",
+    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+    "docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "\n",
+    "for i, d in enumerate(docs):\n",
+    "    d.metadata = {'doc_id': i}\n",
+    "\n",
+    "docsearch = MyScale.from_documents(docs, embeddings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "ddbcee77",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.252379834651947 {'doc_id': 6, 'some': ''} And I’m taking robus...\n",
+      "0.25022566318511963 {'doc_id': 1, 'some': ''} Groups of citizens b...\n",
+      "0.2469480037689209 {'doc_id': 8, 'some': ''} And so many families...\n",
+      "0.2428302764892578 {'doc_id': 0, 'some': 'metadata'} As Frances Haugen, w...\n"
+     ]
+    }
+   ],
+   "source": [
+    "meta = docsearch.metadata_column\n",
+    "output = docsearch.similarity_search_with_relevance_scores('What did the president say about Ketanji Brown Jackson?', \n",
+    "                                                           k=4, where_str=f\"{meta}.doc_id<10\")\n",
+    "for d, dist in output:\n",
+    "    print(dist, d.metadata, d.page_content[:20] + '...')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a359ed74",
+   "metadata": {},
+   "source": [
+    "## Deleting your data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb6a9d36",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docsearch.drop()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48dbd8e0",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
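
The filtering cell in the notebook above passes a raw SQL fragment as `where_str`. One way to keep end-user input out of that fragment is to assemble it from validated pieces. The sketch below assumes filters are simple integer comparisons on one metadata attribute; `build_where` is a hypothetical helper and not part of the MyScale wrapper.

```python
import re

# Plain SQL-style identifiers only: letters, digits, underscore.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_OPS = {"<", "<=", "=", "!=", ">=", ">"}


def build_where(metadata_column: str, attribute: str, op: str, value: int) -> str:
    """Hypothetical helper: return e.g. 'metadata.doc_id<10' from checked parts."""
    for ident in (metadata_column, attribute):
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"invalid identifier: {ident!r}")
    if op not in _OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("only integer values are supported in this sketch")
    return f"{metadata_column}.{attribute}{op}{value}"


where_str = build_where("metadata", "doc_id", "<", 10)
print(where_str)
```

The resulting string has the same shape as the `f"{meta}.doc_id<10"` filter used in the notebook, with `meta` taken from `docsearch.metadata_column`.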
--- a/docs/reference/integrations.md
+++ b/docs/reference/integrations.md
@@ -45,6 +45,8 @@ The following use cases require specific installs and api keys:
  - Set up Elasticsearch backend. If you want to do locally, [this](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/getting-started.html) is a good guide.
 - _FAISS_:
  - Install requirements with `pip install faiss` for Python 3.7 and `pip install faiss-cpu` for Python 3.10+.
+- _MyScale_:
+  - Install requirements with `pip install clickhouse-connect`. For documentation, please refer to [this document](https://docs.myscale.com/en/overview/).
 - _Manifest_:
  - Install requirements with `pip install manifest-ml` (Note: this is only available in Python 3.8+ currently).
 - _OpenSearch_:
--- a/langchain/vectorstores/__init__.py
+++ b/langchain/vectorstores/__init__.py
@@ -8,6 +8,7 @@ from langchain.vectorstores.deeplake import DeepLake
 from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
 from langchain.vectorstores.faiss import FAISS
 from langchain.vectorstores.milvus import Milvus
+from langchain.vectorstores.myscale import MyScale, MyScaleSettings
 from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch
 from langchain.vectorstores.pinecone import Pinecone
 from langchain.vectorstores.qdrant import Qdrant
@@ -29,6 +30,8 @@ __all__ = [
    "AtlasDB",
    "DeepLake",
    "Annoy",
+    "MyScale",
+    "MyScaleSettings",
    "SupabaseVectorStore",
    "AnalyticDB",
 ]
--- a/langchain/vectorstores/myscale.py
+++ b/langchain/vectorstores/myscale.py
@@ -0,0 +1,433 @@
+"""Wrapper around MyScale vector database."""
+from __future__ import annotations
+
+import json
+import logging
+from hashlib import sha1
+from threading import Thread
+from typing import Any, Dict, Iterable, List, Optional, Tuple
+
+from pydantic import BaseSettings
+
+from langchain.docstore.document import Document
+from langchain.embeddings.base import Embeddings
+from langchain.vectorstores.base import VectorStore
+
+logger = logging.getLogger()
+
+
+def has_mul_sub_str(s: str, *args: Any) -> bool:
+    for a in args:
+        if a not in s:
+            return False
+    return True
+
+
+class MyScaleSettings(BaseSettings):
+    """MyScale Client Configuration
+
+    Attributes:
+        host (str) : A URL to connect to the MyScale backend.
+                     Defaults to 'localhost'.
+        port (int) : URL port to connect with HTTP. Defaults to 8443.
+        username (str) : Username to log in. Defaults to None.
+        password (str) : Password to log in. Defaults to None.
+        index_type (str): Index type string. Defaults to 'IVFFLAT'.
+        index_param (dict): Index build parameters. Defaults to None.
+        database (str) : Database name to find the table. Defaults to 'default'.
+        table (str) : Table name to operate on.
+                      Defaults to 'langchain'.
+        metric (str) : Metric to compute distance,
+                       supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'.
+        column_map (Dict) : Column type map to project column name onto langchain
+                            semantics. Must have keys: `id`, `text`, `vector`,
+                            `metadata`; must match the number of columns. For example:
+                            .. code-block:: python
+                            {
+                                'id': 'text_id',
+                                'vector': 'text_embedding',
+                                'text': 'text_plain',
+                                'metadata': 'metadata_dictionary_in_json',
+                            }
+
+                            Defaults to identity map.
+    """
+
+    host: str = "localhost"
+    port: int = 8443
+
+    username: Optional[str] = None
+    password: Optional[str] = None
+
+    index_type: str = "IVFFLAT"
+    index_param: Optional[Dict[str, str]] = None
+
+    column_map: Dict[str, str] = {
+        "id": "id",
+        "text": "text",
+        "vector": "vector",
+        "metadata": "metadata",
+    }
+
+    database: str = "default"
+    table: str = "langchain"
+    metric: str = "cosine"
+
+    def __getitem__(self, item: str) -> Any:
+        return getattr(self, item)
+
+    class Config:
+        env_file = ".env"
+        env_prefix = "myscale_"
+        env_file_encoding = "utf-8"
+
+
+class MyScale(VectorStore):
+    """Wrapper around MyScale vector database
+
+    You need a `clickhouse-connect` python package, and a valid account
+    to connect to MyScale.
+
+    MyScale can not only search with simple vector indexes,
+    it also supports complex queries with multiple conditions,
+    constraints and even sub-queries.
+
+    For more information, please visit
+        [myscale official site](https://docs.myscale.com/en/overview/)
+    """
+
+    def __init__(
+        self,
+        embedding: Embeddings,
+        config: Optional[MyScaleSettings] = None,
+        **kwargs: Any,
+    ) -> None:
+        """MyScale Wrapper to LangChain
+
+        embedding (Embeddings): Embedding model used to embed texts and queries
+        config (MyScaleSettings): Configuration to MyScale Client
+        Other keyword arguments will pass into
+            [clickhouse-connect](https://docs.myscale.com/)
+        """
+        try:
+            from clickhouse_connect import get_client
+        except ImportError:
+            raise ValueError(
+                "Could not import clickhouse connect python package. "
+                "Please install it with `pip install clickhouse-connect`."
+            )
+        try:
+            from tqdm import tqdm
+
+            self.pgbar = tqdm
+        except ImportError:
+            # Just in case if tqdm is not installed
+            self.pgbar = lambda x: x
+        super().__init__()
+        if config is not None:
+            self.config = config
+        else:
+            self.config = MyScaleSettings()
+        assert self.config
+        assert self.config.host and self.config.port
+        assert (
+            self.config.column_map
+            and self.config.database
+            and self.config.table
+            and self.config.metric
+        )
+        for k in ["id", "vector", "text", "metadata"]:
+            assert k in self.config.column_map
+        assert self.config.metric in ["ip", "cosine", "l2"]
+
+        # initialize the schema
+        dim = len(embedding.embed_query("try this out"))
+
+        index_params = (
+            ", " + ",".join([f"'{k}={v}'" for k, v in self.config.index_param.items()])
+            if self.config.index_param
+            else ""
+        )
+        schema_ = f"""
+            CREATE TABLE IF NOT EXISTS {self.config.database}.{self.config.table}(
+                {self.config.column_map['id']} String,
+                {self.config.column_map['text']} String,
+                {self.config.column_map['vector']} Array(Float32),
+                {self.config.column_map['metadata']} JSON,
+                CONSTRAINT cons_vec_len CHECK length(\
+                    {self.config.column_map['vector']}) = {dim},
+                VECTOR INDEX vidx {self.config.column_map['vector']} \
+                    TYPE {self.config.index_type}(\
+                        'metric_type={self.config.metric}'{index_params})
+            ) ENGINE = MergeTree ORDER BY {self.config.column_map['id']}
+        """
+        self.dim = dim
+        self.BS = "\\"
+        self.must_escape = ("\\", "'")
+        self.embedding_function = embedding.embed_query
+        self.dist_order = "ASC" if self.config.metric in ["cosine", "l2"] else "DESC"
+
+        # Create a connection to myscale
+        self.client = get_client(
+            host=self.config.host,
+            port=self.config.port,
+            username=self.config.username,
+            password=self.config.password,
+            **kwargs,
+        )
+        self.client.command("SET allow_experimental_object_type=1")
+        self.client.command(schema_)
+
+    def escape_str(self, value: str) -> str:
+        return "".join(f"{self.BS}{c}" if c in self.must_escape else c for c in value)
+
+    def _build_istr(self, transac: Iterable, column_names: Iterable[str]) -> str:
+        ks = ",".join(column_names)
+        _data = []
+        for n in transac:
+            n = ",".join([f"'{self.escape_str(str(_n))}'" for _n in n])
+            _data.append(f"({n})")
+        i_str = f"""
+                INSERT INTO TABLE 
+                    {self.config.database}.{self.config.table}({ks})
+                VALUES
+                {','.join(_data)}
+                """
+        return i_str
+
+    def _insert(self, transac: Iterable, column_names: Iterable[str]) -> None:
+        _i_str = self._build_istr(transac, column_names)
+        self.client.command(_i_str)
+
+    def add_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[dict]] = None,
+        batch_size: int = 32,
+        ids: Optional[Iterable[str]] = None,
+        **kwargs: Any,
+    ) -> List[str]:
+        """Run more texts through the embeddings and add to the vectorstore.
+
+        Args:
+            texts: Iterable of strings to add to the vectorstore.
+            ids: Optional list of ids to associate with the texts.
+            batch_size: Batch size of insertion
+            metadatas: Optional list of metadata dicts, one per text
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
+
+        """
+        # Embed and create the documents
+        ids = ids or [sha1(t.encode("utf-8")).hexdigest() for t in texts]
+        colmap_ = self.config.column_map
+
+        transac = []
+        column_names = {
+            colmap_["id"]: ids,
+            colmap_["text"]: texts,
+            colmap_["vector"]: map(self.embedding_function, texts),
+        }
+        metadatas = metadatas or [{} for _ in texts]
+        column_names[colmap_["metadata"]] = map(json.dumps, metadatas)
+        assert len(set(colmap_.values()) - set(column_names)) == 0
+        keys, values = zip(*column_names.items())
+        try:
+            t = None
+            for v in self.pgbar(
+                zip(*values), desc="Inserting data...", total=len(metadatas)
+            ):
+                assert len(v[keys.index(self.config.column_map["vector"])]) == self.dim
+                transac.append(v)
+                if len(transac) == batch_size:
+                    if t:
+                        t.join()
+                    t = Thread(target=self._insert, args=[transac, keys])
+                    t.start()
+                    transac = []
+            if len(transac) > 0:
+                if t:
+                    t.join()
+                self._insert(transac, keys)
+            return [i for i in ids]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    @classmethod
+    def from_texts(
+        cls,
+        texts: List[str],
+        embedding: Embeddings,
+        metadatas: Optional[List[Dict[Any, Any]]] = None,
+        config: Optional[MyScaleSettings] = None,
+        text_ids: Optional[Iterable[str]] = None,
+        batch_size: int = 32,
+        **kwargs: Any,
+    ) -> MyScale:
+        """Create MyScale wrapper with existing texts
+
+        Args:
+            embedding (Embeddings): Embedding model used to embed the texts
+            texts (List[str]): List of strings to be added
+            config (MyScaleSettings, Optional): MyScale configuration
+            text_ids (Optional[Iterable], optional): IDs for the texts.
+                                                     Defaults to None.
+            batch_size (int, optional): Batch size when transmitting data to MyScale.
+                                        Defaults to 32.
+            metadatas (List[dict], optional): Metadata for the texts. Defaults to None.
+            Other keyword arguments will pass into
+                [clickhouse-connect](https://clickhouse.com/docs/en/integrations/python#clickhouse-connect-driver-api)
+        Returns:
+            MyScale Index
+        """
+        ctx = cls(embedding, config, **kwargs)
+        ctx.add_texts(texts, ids=text_ids, batch_size=batch_size, metadatas=metadatas)
+        return ctx
+
+    def __repr__(self) -> str:
+        """Text representation for MyScale, prints backends, username and schemas.
+            Easy to use with `str(MyScale())`
+
+        Returns:
+            repr: string to show connection info and data schema
+        """
+        _repr = f"\033[92m\033[1m{self.config.database}.{self.config.table} @ "
+        _repr += f"{self.config.host}:{self.config.port}\033[0m\n\n"
+        _repr += f"\033[1musername: {self.config.username}\033[0m\n\nTable Schema:\n"
+        _repr += "-" * 51 + "\n"
+        for r in self.client.query(
+            f"DESC {self.config.database}.{self.config.table}"
+        ).named_results():
+            _repr += (
+                f"|\033[94m{r['name']:24s}\033[0m|\033[96m{r['type']:24s}\033[0m|\n"
+            )
+        _repr += "-" * 51 + "\n"
+        return _repr
+
+    def _build_qstr(
+        self, q_emb: List[float], topk: int, where_str: Optional[str] = None
+    ) -> str:
+        q_emb_str = ",".join(map(str, q_emb))
+        if where_str:
+            where_str = f"PREWHERE {where_str}"
+        else:
+            where_str = ""
+
+        q_str = f"""
+            SELECT {self.config.column_map['text']}, 
+                {self.config.column_map['metadata']}, dist
+            FROM {self.config.database}.{self.config.table}
+            {where_str}
+            ORDER BY distance({self.config.column_map['vector']}, [{q_emb_str}]) 
+                AS dist {self.dist_order}
+            LIMIT {topk}
+            """
+        return q_str
+
+    def similarity_search(
+        self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
+    ) -> List[Document]:
+        """Perform a similarity search with MyScale
+
+        Args:
+            query (str): query string
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                                                 Defaults to None.
+
+            NOTE: Do not let end users fill this in, and always be aware
+                  of SQL injection. When dealing with metadata, remember to
+                  use `{self.metadata_column}.attribute` instead of `attribute`
+                  alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Document]: List of Documents
+        """
+        return self.similarity_search_by_vector(
+            self.embedding_function(query), k, where_str, **kwargs
+        )
+
+    def similarity_search_by_vector(
+        self,
+        embedding: List[float],
+        k: int = 4,
+        where_str: Optional[str] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        """Perform a similarity search with MyScale by vectors
+
+        Args:
+            embedding (List[float]): query embedding vector
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                                                 Defaults to None.
+
+            NOTE: Do not let end users fill this in, and always be aware
+                  of SQL injection. When dealing with metadata, remember to
+                  use `{self.metadata_column}.attribute` instead of `attribute`
+                  alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Document]: List of Documents
+        """
+        q_str = self._build_qstr(embedding, k, where_str)
+        try:
+            return [
+                Document(
+                    page_content=r[self.config.column_map["text"]],
+                    metadata=r[self.config.column_map["metadata"]],
+                )
+                for r in self.client.query(q_str).named_results()
+            ]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    def similarity_search_with_relevance_scores(
+        self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
+    ) -> List[Tuple[Document, float]]:
+        """Perform a similarity search with MyScale
+
+        Args:
+            query (str): query string
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                                                 Defaults to None.
+
+            NOTE: Do not let end users fill this in, and always be aware
+                  of SQL injection. When dealing with metadata, remember to
+                  use `{self.metadata_column}.attribute` instead of `attribute`
+                  alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Tuple[Document, float]]: List of (Document, distance) tuples
+        """
+        q_str = self._build_qstr(self.embedding_function(query), k, where_str)
+        try:
+            return [
+                (
+                    Document(
+                        page_content=r[self.config.column_map["text"]],
+                        metadata=r[self.config.column_map["metadata"]],
+                    ),
+                    r["dist"],
+                )
+                for r in self.client.query(q_str).named_results()
+            ]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    def drop(self) -> None:
+        """
+        Helper function: Drop data
+        """
+        self.client.command(
+            f"DROP TABLE IF EXISTS {self.config.database}.{self.config.table}"
+        )
+
+    @property
+    def metadata_column(self) -> str:
+        return self.config.column_map["metadata"]
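
The string handling in `langchain/vectorstores/myscale.py` above can be exercised without a running MyScale cluster. The sketch below re-implements the escaping rule from `escape_str` and the query layout from `_build_qstr` as free functions, assuming the default identity column map (`text`, `metadata`, `vector`) and the default `default.langchain` table. The name `build_query` is hypothetical; it mirrors the method in this commit rather than calling it, and it collapses the query onto one line.

```python
from typing import List, Optional

# Characters the wrapper prefixes with a backslash before quoting a value.
MUST_ESCAPE = ("\\", "'")


def escape_str(value: str) -> str:
    """Escape backslashes and single quotes, as MyScale.escape_str does."""
    return "".join("\\" + c if c in MUST_ESCAPE else c for c in value)


def build_query(
    q_emb: List[float],
    topk: int,
    metric: str = "cosine",
    where_str: Optional[str] = None,
    table: str = "default.langchain",
) -> str:
    """Hypothetical mirror of _build_qstr for the default column map."""
    # Smaller is better for cosine and l2 distance; larger is better for ip.
    order = "ASC" if metric in ("cosine", "l2") else "DESC"
    prewhere = f"PREWHERE {where_str}" if where_str else ""
    emb = ",".join(map(str, q_emb))
    return (
        f"SELECT text, metadata, dist FROM {table} {prewhere} "
        f"ORDER BY distance(vector, [{emb}]) AS dist {order} LIMIT {topk}"
    )


query = build_query([0.1, 0.2], 4, where_str="metadata.doc_id<10")
print(query)
print(escape_str("it's"))
```

Note that the filter is emitted as a `PREWHERE` clause and that the sort direction depends on the metric, which is why `dist_order` is computed once in `__init__`.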
--- a/poetry.lock
+++ b/poetry.lock
@@ -1055,7 +1055,7 @@ colorama = {version = "*", markers = "platform_system == \"Windows\""}
 name = "clickhouse-connect"
 version = "0.5.20"
 description = "ClickHouse core driver, SqlAlchemy, and Superset libraries"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "~=3.7"
 files = [
@@ -3519,7 +3519,7 @@ dev = ["Sphinx (==5.3.0)", "colorama (==0.4.5)", "colorama (==0.4.6)", "freezegu
 name = "lz4"
 version = "4.3.2"
 description = "LZ4 Bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@@ -6293,7 +6293,7 @@ dev = ["atomicwrites (==1.2.1)", "attrs (==19.2.0)", "coverage (==6.5.0)", "hatc
 name = "pytz"
 version = "2023.3"
 description = "World timezone definitions, modern and historical"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "*"
 files = [
@@ -9212,7 +9212,7 @@ testing = ["big-O", "flake8 (<5)", "jaraco.functools", "jaraco.itertools", "more
 name = "zstandard"
 version = "0.21.0"
 description = "Zstandard bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@@ -9268,7 +9268,7 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
 cffi = ["cffi (>=1.11)"]

 [extras]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"]
 cohere = ["cohere"]
 llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"]
 openai = ["openai"]
@ -9277,4 +9277,4 @@ qdrant = ["qdrant-client"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "8b0be7a924d83d9afc5e21e95aa529258a3ae916418e0c1c159732291a615af8"
+content-hash = "da027a1b27f348548ca828c6da40795e2f57a7a7858bdeac1a08573d3e031e12"
--- a/pyproject.toml
+++ b/pyproject.toml
@ -34,6 +34,7 @@ jinja2 = {version = "^3", optional = true}
 tiktoken = {version = "^0.3.2", optional = true, python="^3.9"}
 pinecone-client = {version = "^2", optional = true}
 pinecone-text = {version = "^0.4.2", optional = true}
+clickhouse-connect = {version="^0.5.14", optional=true}
 weaviate-client = {version = "^3", optional = true}
 google-api-python-client = {version = "2.70.0", optional = true}
 wolframalpha = {version = "5.0.0", optional = true}
@ -106,6 +107,7 @@ elasticsearch = {extras = ["async"], version = "^8.6.2"}
 redis = "^4.5.4"
 pinecone-client = "^2.2.1"
 pinecone-text = "^0.4.2"
+clickhouse-connect = "^0.5.14"
 pgvector = "^0.1.6"
 transformers = "^4.27.4"
 pandas = "^2.0.0"
@ -142,7 +144,7 @@ llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifes
 qdrant = ["qdrant-client"]
 openai = ["openai"]
 cohere = ["cohere"]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"]

 [tool.ruff]
 select = [
--- a/tests/integration_tests/vectorstores/test_myscale.py
+++ b/tests/integration_tests/vectorstores/test_myscale.py
@ -0,0 +1,108 @@
+"""Test MyScale functionality."""
+import pytest
+
+from langchain.docstore.document import Document
+from langchain.vectorstores import MyScale, MyScaleSettings
+from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
+
+
+def test_myscale() -> None:
+    """Test end to end construction and search."""
+    texts = ["foo", "bar", "baz"]
+    config = MyScaleSettings()
+    config.table = "test_myscale"
+    docsearch = MyScale.from_texts(texts, FakeEmbeddings(), config=config)
+    output = docsearch.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
+    docsearch.drop()
+
+
+@pytest.mark.asyncio
+async def test_myscale_async() -> None:
+    """Test end to end construction and async search."""
+    texts = ["foo", "bar", "baz"]
+    config = MyScaleSettings()
+    config.table = "test_myscale_async"
+    docsearch = MyScale.from_texts(
+        texts=texts, embedding=FakeEmbeddings(), config=config
+    )
+    output = await docsearch.asimilarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
+    docsearch.drop()
+
+
+def test_myscale_with_metadatas() -> None:
+    """Test end to end construction and search with metadata."""
+    texts = ["foo", "bar", "baz"]
+    metadatas = [{"page": str(i)} for i in range(len(texts))]
+    config = MyScaleSettings()
+    config.table = "test_myscale_with_metadatas"
+    docsearch = MyScale.from_texts(
+        texts=texts,
+        embedding=FakeEmbeddings(),
+        config=config,
+        metadatas=metadatas,
+    )
+    output = docsearch.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"page": "0"})]
+    docsearch.drop()
+
+
+def test_myscale_with_metadatas_with_relevance_scores() -> None:
+    """Test end to end construction and scored search."""
+    texts = ["foo", "bar", "baz"]
+    metadatas = [{"page": str(i)} for i in range(len(texts))]
+    config = MyScaleSettings()
+    config.table = "test_myscale_with_metadatas_with_relevance_scores"
+    docsearch = MyScale.from_texts(
+        texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config
+    )
+    output = docsearch.similarity_search_with_relevance_scores("foo", k=1)
+    assert output[0][0] == Document(page_content="foo", metadata={"page": "0"})
+    docsearch.drop()
+
+
+def test_myscale_search_filter() -> None:
+    """Test end to end construction and search with metadata filtering."""
+    texts = ["far", "bar", "baz"]
+    metadatas = [{"first_letter": text[0]} for text in texts]
+    config = MyScaleSettings()
+    config.table = "test_myscale_search_filter"
+    docsearch = MyScale.from_texts(
+        texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config
+    )
+    output = docsearch.similarity_search(
+        "far", k=1, where_str=f"{docsearch.metadata_column}.first_letter='f'"
+    )
+    assert output == [Document(page_content="far", metadata={"first_letter": "f"})]
+    output = docsearch.similarity_search(
+        "bar", k=1, where_str=f"{docsearch.metadata_column}.first_letter='b'"
+    )
+    assert output == [Document(page_content="bar", metadata={"first_letter": "b"})]
+    docsearch.drop()
+
+
+def test_myscale_with_persistence() -> None:
+    """Test end to end construction and search, with persistence."""
+    config = MyScaleSettings()
+    config.table = "test_myscale_with_persistence"
+    texts = [
+        "foo",
+        "bar",
+        "baz",
+    ]
+    docsearch = MyScale.from_texts(
+        texts=texts, embedding=FakeEmbeddings(), config=config
+    )
+
+    output = docsearch.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
+
+    # Get a new VectorStore with the same config;
+    # it will reuse the existing table automatically
+    # unless the table has been dropped
+    docsearch = MyScale(embedding=FakeEmbeddings(), config=config)
+    output = docsearch.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
+    # Clean up
+    docsearch.drop()