Add Support for OpenSearch Vector database (#1191)

### Description This PR adds a wrapper which adds support for the OpenSearch vector database. Using opensearch-py client we are ingesting the embeddings of given text into opensearch cluster using Bulk API. We can perform the `similarity_search` on the index using the 3 popular searching methods of OpenSearch k-NN plugin: - `Approximate k-NN Search` use approximate nearest neighbor (ANN) algorithms from the [nmslib](https://github.com/nmslib/nmslib), [faiss](https://github.com/facebookresearch/faiss), and [Lucene](https://lucene.apache.org/) libraries to power k-NN search. - `Script Scoring` extends OpenSearch’s script scoring functionality to execute a brute force, exact k-NN search. - `Painless Scripting` adds the distance functions as painless extensions that can be used in more complex combinations. Also, supports brute force, exact k-NN search like Script Scoring. ### Issues Resolved https://github.com/hwchase17/langchain/issues/1054 --------- Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
2 years ago · 0118706fd6
parent c5015d77e2
commit 0118706fd6
10 changed files with 786 additions and 5 deletions
--- a/docs/ecosystem/opensearch.md
+++ b/docs/ecosystem/opensearch.md
@ -0,0 +1,21 @@
 # OpenSearch
 This page covers how to use the OpenSearch ecosystem within LangChain.
 It is broken into two parts: installation and setup, and then references to specific OpenSearch wrappers.
 ## Installation and Setup
 - Install the Python package with `pip install opensearch-py`
 ## Wrappers
 ### VectorStore
 There exists a wrapper around OpenSearch vector databases, allowing you to use it as a vectorstore 
 for semantic search using approximate vector search powered by lucene, nmslib and faiss engines 
 or using painless scripting and script scoring functions for bruteforce vector search.
 To import this vectorstore:
 ```python
 from langchain.vectorstores import OpenSearchVectorSearch
 ```
 For a more detailed walkthrough of the OpenSearch wrapper, see [this notebook](../modules/indexes/vectorstore_examples/opensearch.ipynb)
--- a/docs/modules/indexes/chain_examples/qa_with_sources.ipynb
+++ b/docs/modules/indexes/chain_examples/qa_with_sources.ipynb
@ -732,4 +732,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
--- a/docs/modules/indexes/chain_examples/vector_db_qa_with_sources.ipynb
+++ b/docs/modules/indexes/chain_examples/vector_db_qa_with_sources.ipynb
@ -215,4 +215,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
--- a/docs/modules/indexes/vectorstore_examples/opensearch.ipynb
+++ b/docs/modules/indexes/vectorstore_examples/opensearch.ipynb
@ -0,0 +1,220 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {},
   "source": [
    "# OpenSearch\n",
    "\n",
    "This notebook shows how to use functionality related to the OpenSearch database.\n",
    "\n",
    "To run, you should have the opensearch instance up and running: [here](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/index/)\n",
    "`similarity_search` by default performs the Approximate k-NN Search which uses one of the several algorithms like lucene, nmslib, faiss recommended for\n",
    "large datasets. To perform brute force search we have other search methods known as Script Scoring and Painless Scripting.\n",
    "Check [this](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "aac9563e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import OpenSearchVectorSearch\n",
    "from langchain.document_loaders import TextLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a3c3999a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "loader = TextLoader('../../state_of_the_union.txt')\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\")\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(query)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### similarity_search using Approximate k-NN Search with Custom Parameters"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", engine=\"faiss\", space_type=\"innerproduct\", ef_construction=256, m=48)\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(query)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### similarity_search using Script Scoring with Custom Parameters"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", k=1, search_type=\"script_scoring\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### similarity_search using Painless Scripting with Custom Parameters"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
    "filter = {\"bool\": {\"filter\": {\"term\": {\"text\": \"smuggling\"}}}}\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", search_type=\"painless_scripting\", space_type=\"cosineSimilarity\", pre_filter=filter)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }
--- a/docs/reference/integrations.md
+++ b/docs/reference/integrations.md
@ -47,5 +47,9 @@ The following use cases require specific installs and api keys:
  - Install requirements with `pip install faiss` for Python 3.7 and `pip install faiss-cpu` for Python 3.10+.
 - _Manifest_:
  - Install requirements with `pip install manifest-ml` (Note: this is only available in Python 3.8+ currently).
 - _OpenSearch_:
  - Install requirements with `pip install opensearch-py`
  - If you want to set up OpenSearch on your local, [here](https://opensearch.org/docs/latest/)
 If you are using the `NLTKTextSplitter` or the `SpacyTextSplitter`, you will also need to install the appropriate models. For example, if you want to use the `SpacyTextSplitter`, you will need to install the `en_core_web_sm` model with `python -m spacy download en_core_web_sm`. Similarly, if you want to use the `NLTKTextSplitter`, you will need to install the `punkt` model with `python -m nltk.downloader punkt`.
--- a/langchain/vectorstores/init.py
+++ b/langchain/vectorstores/init.py
@ -4,6 +4,7 @@ from langchain.vectorstores.chroma import Chroma
 from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
 from langchain.vectorstores.faiss import FAISS
 from langchain.vectorstores.milvus import Milvus
 from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch
 from langchain.vectorstores.pinecone import Pinecone
 from langchain.vectorstores.qdrant import Qdrant
 from langchain.vectorstores.weaviate import Weaviate
@ -17,4 +18,5 @@ __all__ = [
    "Qdrant",
    "Milvus",
    "Chroma",
    "OpenSearchVectorSearch",
 ]
--- a/langchain/vectorstores/opensearch_vector_search.py
+++ b/langchain/vectorstores/opensearch_vector_search.py
@ -0,0 +1,382 @@
 """Wrapper around OpenSearch vector database."""
 from __future__ import annotations
 import uuid
 from typing import Any, Dict, Iterable, List, Optional
 from langchain.docstore.document import Document
 from langchain.embeddings.base import Embeddings
 from langchain.utils import get_from_dict_or_env
 from langchain.vectorstores.base import VectorStore
 IMPORT_OPENSEARCH_PY_ERROR = (
    "Could not import OpenSearch. Please install it with `pip install opensearch-py`."
 )
 SCRIPT_SCORING_SEARCH = "script_scoring"
 PAINLESS_SCRIPTING_SEARCH = "painless_scripting"
 MATCH_ALL_QUERY = {"match_all": {}}  # type: Dict
 def _import_opensearch() -> Any:
    """Import OpenSearch if available, otherwise raise error."""
    try:
        from opensearchpy import OpenSearch
    except ImportError:
        raise ValueError(IMPORT_OPENSEARCH_PY_ERROR)
    return OpenSearch
 def _import_bulk() -> Any:
    """Import bulk if available, otherwise raise error."""
    try:
        from opensearchpy.helpers import bulk
    except ImportError:
        raise ValueError(IMPORT_OPENSEARCH_PY_ERROR)
    return bulk
 def _get_opensearch_client(opensearch_url: str) -> Any:
    """Get OpenSearch client from the opensearch_url, otherwise raise error."""
    try:
        opensearch = _import_opensearch()
        client = opensearch(opensearch_url)
    except ValueError as e:
        raise ValueError(
            f"OpenSearch client string provided is not in proper format. "
            f"Got error: {e} "
        )
    return client
 def _validate_embeddings_and_bulk_size(embeddings_length: int, bulk_size: int) -> None:
    """Validate Embeddings Length and Bulk Size."""
    if embeddings_length == 0:
        raise RuntimeError("Embeddings size is zero")
    if bulk_size < embeddings_length:
        raise RuntimeError(
            f"The embeddings count, {embeddings_length} is more than the "
            f"[bulk_size], {bulk_size}. Increase the value of [bulk_size]."
        )
 def _bulk_ingest_embeddings(
    client: Any,
    index_name: str,
    embeddings: List[List[float]],
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
 ) -> List[str]:
    """Bulk Ingest Embeddings into given index."""
    bulk = _import_bulk()
    requests = []
    ids = []
    for i, text in enumerate(texts):
        metadata = metadatas[i] if metadatas else {}
        _id = str(uuid.uuid4())
        request = {
            "_op_type": "index",
            "_index": index_name,
            "vector_field": embeddings[i],
            "text": text,
            "metadata": metadata,
            "_id": _id,
        }
        requests.append(request)
        ids.append(_id)
    bulk(client, requests)
    client.indices.refresh(index=index_name)
    return ids
 def _default_scripting_text_mapping(dim: int) -> Dict:
    """For Painless Scripting or Script Scoring,the default mapping to create index."""
    return {
        "mappings": {
            "properties": {
                "vector_field": {"type": "knn_vector", "dimension": dim},
            }
        }
    }
 def _default_text_mapping(
    dim: int,
    engine: str = "nmslib",
    space_type: str = "l2",
    ef_search: int = 512,
    ef_construction: int = 512,
    m: int = 16,
 ) -> Dict:
    """For Approximate k-NN Search, this is the default mapping to create index."""
    return {
        "settings": {"index": {"knn": True, "knn.algo_param.ef_search": ef_search}},
        "mappings": {
            "properties": {
                "vector_field": {
                    "type": "knn_vector",
                    "dimension": dim,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                }
            }
        },
    }
 def _default_approximate_search_query(
    query_vector: List[float], size: int = 4, k: int = 4
 ) -> Dict:
    """For Approximate k-NN Search, this is the default query."""
    return {
        "size": size,
        "query": {"knn": {"vector_field": {"vector": query_vector, "k": k}}},
    }
 def _default_script_query(
    query_vector: List[float],
    space_type: str = "l2",
    pre_filter: Dict = MATCH_ALL_QUERY,
 ) -> Dict:
    """For Script Scoring Search, this is the default query."""
    return {
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "vector_field",
                        "query_value": query_vector,
                        "space_type": space_type,
                    },
                },
            }
        }
    }
 def __get_painless_scripting_source(space_type: str, query_vector: List[float]) -> str:
    """For Painless Scripting, it returns the script source based on space type."""
    source_value = (
        "(1.0 + " + space_type + "(" + str(query_vector) + ", doc['vector_field']))"
    )
    if space_type == "cosineSimilarity":
        return source_value
    else:
        return "1/" + source_value
 def _default_painless_scripting_query(
    query_vector: List[float],
    space_type: str = "l2Squared",
    pre_filter: Dict = MATCH_ALL_QUERY,
 ) -> Dict:
    """For Painless Scripting Search, this is the default query."""
    source = __get_painless_scripting_source(space_type, query_vector)
    return {
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": source,
                    "params": {
                        "field": "vector_field",
                        "query_value": query_vector,
                    },
                },
            }
        }
    }
 def _get_kwargs_value(kwargs: Any, key: str, default_value: Any) -> Any:
    """Get the value of the key if present. Else get the default_value."""
    if key in kwargs:
        return kwargs.get(key)
    return default_value
 class OpenSearchVectorSearch(VectorStore):
    """Wrapper around OpenSearch as a vector database.
    Example:
        .. code-block:: python
            from langchain import OpenSearchVectorSearch
            opensearch_vector_search = OpenSearchVectorSearch(
                "http://localhost:9200",
                "embeddings",
                embedding_function
            )
    """
    def __init__(
        self, opensearch_url: str, index_name: str, embedding_function: Embeddings
    ):
        """Initialize with necessary components."""
        self.embedding_function = embedding_function
        self.index_name = index_name
        self.client = _get_opensearch_client(opensearch_url)
    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        bulk_size: int = 500,
    ) -> List[str]:
        """Run more texts through the embeddings and add to the vectorstore.
        Args:
            texts: Iterable of strings to add to the vectorstore.
            metadatas: Optional list of metadatas associated with the texts.
            bulk_size: Bulk API request count; Default: 500
        Returns:
            List of ids from adding the texts into the vectorstore.
        """
        embeddings = [
            self.embedding_function.embed_documents(list(text))[0] for text in texts
        ]
        _validate_embeddings_and_bulk_size(len(embeddings), bulk_size)
        return _bulk_ingest_embeddings(
            self.client, self.index_name, embeddings, texts, metadatas
        )
    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Return docs most similar to query.
        By default supports Approximate Search.
        Also supports Script Scoring and Painless Scripting.
        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
        Returns:
            List of Documents most similar to the query.
        Optional Args for Approximate Search:
            search_type: "approximate_search"; default: "approximate_search"
            size: number of results the query actually returns; default: 4
        Optional Args for Script Scoring Search:
            search_type: "script_scoring"; default: "approximate_search"
            space_type: "l2", "l1", "linf", "cosinesimil", "innerproduct",
            "hammingbit"; default: "l2"
            pre_filter: script_score query to pre-filter documents before identifying
            nearest neighbors; default: {"match_all": {}}
        Optional Args for Painless Scripting Search:
            search_type: "painless_scripting"; default: "approximate_search"
            space_type: "l2Squared", "l1Norm", "cosineSimilarity"; default: "l2Squared"
            pre_filter: script_score query to pre-filter documents before identifying
            nearest neighbors; default: {"match_all": {}}
        """
        embedding = self.embedding_function.embed_query(query)
        search_type = _get_kwargs_value(kwargs, "search_type", "approximate_search")
        if search_type == "approximate_search":
            size = _get_kwargs_value(kwargs, "size", 4)
            search_query = _default_approximate_search_query(embedding, size, k)
        elif search_type == SCRIPT_SCORING_SEARCH:
            space_type = _get_kwargs_value(kwargs, "space_type", "l2")
            pre_filter = _get_kwargs_value(kwargs, "pre_filter", MATCH_ALL_QUERY)
            search_query = _default_script_query(embedding, space_type, pre_filter)
        elif search_type == PAINLESS_SCRIPTING_SEARCH:
            space_type = _get_kwargs_value(kwargs, "space_type", "l2Squared")
            pre_filter = _get_kwargs_value(kwargs, "pre_filter", MATCH_ALL_QUERY)
            search_query = _default_painless_scripting_query(
                embedding, space_type, pre_filter
            )
        else:
            raise ValueError("Invalid `search_type` provided as an argument")
        response = self.client.search(index=self.index_name, body=search_query)
        hits = [hit["_source"] for hit in response["hits"]["hits"][:k]]
        documents = [
            Document(page_content=hit["text"], metadata=hit["metadata"]) for hit in hits
        ]
        return documents
    @classmethod
    def from_texts(
        cls,
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        bulk_size: int = 500,
        **kwargs: Any,
    ) -> OpenSearchVectorSearch:
        """Construct OpenSearchVectorSearch wrapper from raw documents.
        Example:
            .. code-block:: python
                from langchain import OpenSearchVectorSearch
                from langchain.embeddings import OpenAIEmbeddings
                embeddings = OpenAIEmbeddings()
                opensearch_vector_search = OpenSearchVectorSearch.from_texts(
                    texts,
                    embeddings,
                    opensearch_url="http://localhost:9200"
                )
        OpenSearch by default supports Approximate Search powered by nmslib, faiss
        and lucene engines recommended for large datasets. Also supports brute force
        search through Script Scoring and Painless Scripting.
        Optional Keyword Args for Approximate Search:
            engine: "nmslib", "faiss", "hnsw"; default: "nmslib"
            space_type: "l2", "l1", "cosinesimil", "linf", "innerproduct"; default: "l2"
            ef_search: Size of the dynamic list used during k-NN searches. Higher values
            lead to more accurate but slower searches; default: 512
            ef_construction: Size of the dynamic list used during k-NN graph creation.
            Higher values lead to more accurate graph but slower indexing speed;
            default: 512
            m: Number of bidirectional links created for each new element. Large impact
            on memory consumption. Between 2 and 100; default: 16
        Keyword Args for Script Scoring or Painless Scripting:
            is_appx_search: False
        """
        opensearch_url = get_from_dict_or_env(
            kwargs, "opensearch_url", "OPENSEARCH_URL"
        )
        client = _get_opensearch_client(opensearch_url)
        embeddings = embedding.embed_documents(texts)
        _validate_embeddings_and_bulk_size(len(embeddings), bulk_size)
        dim = len(embeddings[0])
        index_name = uuid.uuid4().hex
        is_appx_search = _get_kwargs_value(kwargs, "is_appx_search", True)
        if is_appx_search:
            engine = _get_kwargs_value(kwargs, "engine", "nmslib")
            space_type = _get_kwargs_value(kwargs, "space_type", "l2")
            ef_search = _get_kwargs_value(kwargs, "ef_search", 512)
            ef_construction = _get_kwargs_value(kwargs, "ef_construction", 512)
            m = _get_kwargs_value(kwargs, "m", 16)
            mapping = _default_text_mapping(
                dim, engine, space_type, ef_search, ef_construction, m
            )
        else:
            mapping = _default_scripting_text_mapping(dim)
        client.indices.create(index=index_name, body=mapping)
        _bulk_ingest_embeddings(client, index_name, embeddings, texts, metadatas)
        return cls(opensearch_url, index_name, embedding)
--- a/poetry.lock
+++ b/poetry.lock
@ -3552,6 +3552,29 @@ dev = ["black (>=21.6b0,<22.0)", "pytest (>=6.0.0,<7.0.0)", "pytest-asyncio", "p
 embeddings = ["matplotlib", "numpy", "openpyxl (>=3.0.7)", "pandas (>=1.2.3)", "pandas-stubs (>=1.1.0.11)", "plotly", "scikit-learn (>=1.0.2)", "sklearn", "tenacity (>=8.0.1)"]
 wandb = ["numpy", "openpyxl (>=3.0.7)", "pandas (>=1.2.3)", "pandas-stubs (>=1.1.0.11)", "wandb"]
 [[package]]
 name = "opensearch-py"
 version = "2.1.1"
 description = "Python low-level client for OpenSearch"
 category = "main"
 optional = true
 python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4"
 files = [
    {file = "opensearch-py-2.1.1.tar.gz", hash = "sha256:dd54a50c6771bc2582741bfdcf629b8d7eed409ae7fc2722249e53f9a10de0d8"},
    {file = "opensearch_py-2.1.1-py2.py3-none-any.whl", hash = "sha256:3e7085bf25487979581416f4ab195c2fe62e90f1f07f393091f8233cbea032eb"},
 ]
 [package.dependencies]
 certifi = "*"
 requests = ">=2.4.0,<3.0.0"
 urllib3 = ">=1.21.1,<2"
 [package.extras]
 async = ["aiohttp (>=3,<4)"]
 develop = ["black", "botocore", "coverage", "jinja2", "mock", "myst-parser", "pytest", "pytest-cov", "pyyaml", "requests (>=2.0.0,<3.0.0)", "sphinx", "sphinx-copybutton", "sphinx-rtd-theme"]
 docs = ["myst-parser", "sphinx", "sphinx-copybutton", "sphinx-rtd-theme"]
 kerberos = ["requests-kerberos"]
 [[package]]
 name = "opt-einsum"
 version = "3.3.0"
@ -7039,10 +7062,10 @@ docs = ["furo", "jaraco.packaging (>=9)", "jaraco.tidelift (>=1.4)", "rst.linker
 testing = ["flake8 (<5)", "func-timeout", "jaraco.functools", "jaraco.itertools", "more-itertools", "pytest (>=6)", "pytest-black (>=0.3.7)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=1.3)", "pytest-flake8", "pytest-mypy (>=0.9.1)"]
 [extras]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "elasticsearch", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx"]
 llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "690fdd08a207a73cb343cfdf25f7ae7d4177dc39b704d8655f3a4f26a881c2fc"
+content-hash = "7997201f64373247d8799baed84a5ad11ab3d92e26cc2114b26e734cfb9664a4"
--- a/pyproject.toml
+++ b/pyproject.toml
@ -20,6 +20,7 @@ numpy = "^1"
 faiss-cpu = {version = "^1", optional = true}
 wikipedia = {version = "^1", optional = true}
 elasticsearch = {version = "^8", optional = true}
 opensearch-py = {version = "^2.0.0", optional = true}
 redis = {version = "^4", optional = true}
 manifest-ml = {version = "^0.0.1", optional = true}
 spacy = {version = "^3", optional = true}
@ -94,7 +95,7 @@ playwright = "^1.28.0"
 [tool.poetry.extras]
 llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "elasticsearch", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx"]
 [tool.isort]
 profile = "black"
--- a/tests/integration_tests/vectorstores/test_opensearch.py
+++ b/tests/integration_tests/vectorstores/test_opensearch.py
@ -0,0 +1,128 @@
 """Test OpenSearch functionality."""
 import pytest
 from langchain.docstore.document import Document
 from langchain.vectorstores.opensearch_vector_search import (
    PAINLESS_SCRIPTING_SEARCH,
    SCRIPT_SCORING_SEARCH,
    OpenSearchVectorSearch,
 )
 from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
 DEFAULT_OPENSEARCH_URL = "http://localhost:9200"
 texts = ["foo", "bar", "baz"]
 def test_opensearch() -> None:
    """Test end to end indexing and search using Approximate Search."""
    docsearch = OpenSearchVectorSearch.from_texts(
        texts, FakeEmbeddings(), opensearch_url=DEFAULT_OPENSEARCH_URL
    )
    output = docsearch.similarity_search("foo", k=1)
    assert output == [Document(page_content="foo")]
 def test_opensearch_with_metadatas() -> None:
    """Test end to end indexing and search with metadata."""
    metadatas = [{"page": i} for i in range(len(texts))]
    docsearch = OpenSearchVectorSearch.from_texts(
        texts,
        FakeEmbeddings(),
        metadatas=metadatas,
        opensearch_url=DEFAULT_OPENSEARCH_URL,
    )
    output = docsearch.similarity_search("foo", k=1)
    assert output == [Document(page_content="foo", metadata={"page": 0})]
 def test_add_text() -> None:
    """Test adding additional text elements to existing index."""
    text_input = ["test", "add", "text", "method"]
    metadatas = [{"page": i} for i in range(len(text_input))]
    docsearch = OpenSearchVectorSearch.from_texts(
        texts, FakeEmbeddings(), opensearch_url=DEFAULT_OPENSEARCH_URL
    )
    docids = OpenSearchVectorSearch.add_texts(docsearch, text_input, metadatas)
    assert len(docids) == len(text_input)
 def test_opensearch_script_scoring() -> None:
    """Test end to end indexing and search using Script Scoring Search."""
    pre_filter_val = {"bool": {"filter": {"term": {"text": "bar"}}}}
    docsearch = OpenSearchVectorSearch.from_texts(
        texts,
        FakeEmbeddings(),
        opensearch_url=DEFAULT_OPENSEARCH_URL,
        is_appx_search=False,
    )
    output = docsearch.similarity_search(
        "foo", k=1, search_type=SCRIPT_SCORING_SEARCH, pre_filter=pre_filter_val
    )
    assert output == [Document(page_content="bar")]
 def test_add_text_script_scoring() -> None:
    """Test adding additional text elements and validating using Script Scoring."""
    text_input = ["test", "add", "text", "method"]
    metadatas = [{"page": i} for i in range(len(text_input))]
    docsearch = OpenSearchVectorSearch.from_texts(
        text_input,
        FakeEmbeddings(),
        opensearch_url=DEFAULT_OPENSEARCH_URL,
        is_appx_search=False,
    )
    OpenSearchVectorSearch.add_texts(docsearch, texts, metadatas)
    output = docsearch.similarity_search(
        "add", k=1, search_type=SCRIPT_SCORING_SEARCH, space_type="innerproduct"
    )
    assert output == [Document(page_content="test")]
 def test_opensearch_painless_scripting() -> None:
    """Test end to end indexing and search using Painless Scripting Search."""
    pre_filter_val = {"bool": {"filter": {"term": {"text": "baz"}}}}
    docsearch = OpenSearchVectorSearch.from_texts(
        texts,
        FakeEmbeddings(),
        opensearch_url=DEFAULT_OPENSEARCH_URL,
        is_appx_search=False,
    )
    output = docsearch.similarity_search(
        "foo", k=1, search_type=PAINLESS_SCRIPTING_SEARCH, pre_filter=pre_filter_val
    )
    assert output == [Document(page_content="baz")]
 def test_add_text_painless_scripting() -> None:
    """Test adding additional text elements and validating using Painless Scripting."""
    text_input = ["test", "add", "text", "method"]
    metadatas = [{"page": i} for i in range(len(text_input))]
    docsearch = OpenSearchVectorSearch.from_texts(
        text_input,
        FakeEmbeddings(),
        opensearch_url=DEFAULT_OPENSEARCH_URL,
        is_appx_search=False,
    )
    OpenSearchVectorSearch.add_texts(docsearch, texts, metadatas)
    output = docsearch.similarity_search(
        "add", k=1, search_type=PAINLESS_SCRIPTING_SEARCH, space_type="cosineSimilarity"
    )
    assert output == [Document(page_content="test")]
 def test_opensearch_invalid_search_type() -> None:
    """Test to validate similarity_search by providing invalid search_type."""
    docsearch = OpenSearchVectorSearch.from_texts(
        texts, FakeEmbeddings(), opensearch_url=DEFAULT_OPENSEARCH_URL
    )
    with pytest.raises(ValueError):
        docsearch.similarity_search("foo", k=1, search_type="invalid_search_type")
 def test_opensearch_embedding_size_zero() -> None:
    """Test to validate indexing when embedding size is zero."""
    with pytest.raises(RuntimeError):
        OpenSearchVectorSearch.from_texts(
            [], FakeEmbeddings(), opensearch_url=DEFAULT_OPENSEARCH_URL
        )