Es knn index search 5346 (#5569)

# Create elastic_vector_search.ElasticKnnSearch class

This extends `langchain/vectorstores/elastic_vector_search.py` by adding
a new class, `ElasticKnnSearch`.

Features:
- Allow creating an index with the `dense_vector` mapping compatible
with kNN search
- Store embeddings in the index for use with kNN search (the correct mapping
creates an HNSW data structure)
- Perform approximate kNN search
- Perform hybrid BM25 (`query{}`) + kNN (`knn{}`) search
- Perform kNN search by either providing a `query_vector` or passing a
hosted `model_id` so that `query_vector_builder` generates the
query vector automatically at search time
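
As a rough sketch of what the features above amount to at the request level, the snippet below builds a `dense_vector` mapping and the `knn` clause in both modes (an explicit `query_vector`, or `query_vector_builder` with a hosted `model_id`). It is plain Python with no Elasticsearch dependency. The helper names `knn_mapping` and `knn_clause`, the model name `my-model`, and the sample vector are illustrative assumptions, not the public API of the class; the `text` and `vector` field names follow the defaults in this commit.

```python
from typing import Any, Dict, List, Optional


def knn_mapping(dims: int) -> Dict[str, Any]:
    """Mapping with an indexed dense_vector field (hypothetical helper)."""
    return {
        "properties": {
            "text": {"type": "text"},
            "vector": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,  # indexed vectors are what make kNN search possible
                "similarity": "dot_product",
            },
        }
    }


def knn_clause(
    query_vector: Optional[List[float]] = None,
    query: Optional[str] = None,
    model_id: Optional[str] = None,
    field: str = "vector",
    k: int = 10,
    num_candidates: int = 10,
) -> Dict[str, Any]:
    """Build the `knn` part of a search request (hypothetical helper)."""
    knn: Dict[str, Any] = {"field": field, "k": k, "num_candidates": num_candidates}
    if query_vector and not model_id:
        # Mode 1: the caller already has an embedding.
        knn["query_vector"] = query_vector
    elif query and model_id:
        # Mode 2: let Elasticsearch embed the text with a hosted model.
        knn["query_vector_builder"] = {
            "text_embedding": {"model_id": model_id, "model_text": query}
        }
    else:
        raise ValueError(
            "Provide either `query_vector`, or `query` together with `model_id`."
        )
    return knn


by_vector = knn_clause(query_vector=[0.1, 0.2, 0.3], k=2)
by_model = knn_clause(query="Hello", model_id="my-model", k=2)
```

The two resulting dictionaries differ only in whether they carry `query_vector` or `query_vector_builder`; everything else in the clause is shared.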

Connection options:
- Using `cloud_id` from Elastic Cloud
- Passing an Elasticsearch client object
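
The two connection options can be sketched as a small resolution step: reuse a client if one is passed, otherwise build one from Cloud credentials. The function `resolve_client`, its `client_factory` parameter, and the `FakeClient` stand-in are assumptions added here so the example runs without a cluster; in real use the factory would default to `elasticsearch.Elasticsearch`, which accepts `cloud_id` and `basic_auth` as shown in this commit.

```python
from typing import Any, Callable, Optional


def resolve_client(
    es_connection: Optional[Any] = None,
    es_cloud_id: Optional[str] = None,
    es_user: Optional[str] = None,
    es_password: Optional[str] = None,
    client_factory: Optional[Callable[..., Any]] = None,
) -> Any:
    """Return an Elasticsearch client from either connection option (sketch)."""
    # Option 1: an existing client object takes precedence.
    if es_connection is not None:
        return es_connection
    # Option 2: create a new client from Elastic Cloud credentials.
    if es_cloud_id and es_user and es_password:
        if client_factory is None:
            # Imported lazily so the sketch works without the package installed.
            from elasticsearch import Elasticsearch

            client_factory = Elasticsearch
        return client_factory(
            cloud_id=es_cloud_id, basic_auth=(es_user, es_password)
        )
    raise ValueError(
        "Either provide an existing Elasticsearch connection, "
        "or valid credentials for creating a new connection."
    )


class FakeClient:
    """Stand-in for a real client; records the keyword arguments it received."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


existing = FakeClient()
reused = resolve_client(es_connection=existing)
created = resolve_client(
    es_cloud_id="my-cloud-id",
    es_user="user",
    es_password="pass",
    client_factory=FakeClient,
)
```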

Search options:
- query
- k
- query_vector
- model_id
- size
- source
- knn_boost (hybrid search)
- query_boost (hybrid search)
- fields


This also adds examples to
`docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb`


Fixes # [5346](https://github.com/hwchase17/langchain/issues/5346)

cc: @dev2049


---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
searx_updates
Jeff Vestal 1 year ago committed by GitHub
parent 8b3df18bcc
commit d1f65d8dc1

3
.gitignore vendored

@ -150,3 +150,6 @@ wandb/
# integration test artifacts
data_map*
\[('_type', 'fake'), ('stop', None)]
# Replit files
*replit*

@ -3,7 +3,9 @@
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"metadata": {
"id": "683953b3"
},
"source": [
"# ElasticSearch\n",
"\n",
@ -12,11 +14,22 @@
"This notebook shows how to use functionality related to the `Elasticsearch` database."
]
},
{
"cell_type": "markdown",
"source": [
"# ElasticVectorSearch class"
],
"metadata": {
"id": "tKSYjyTBtSLc"
},
"id": "tKSYjyTBtSLc"
},
{
"cell_type": "markdown",
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
"metadata": {
"tags": []
"tags": [],
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409"
},
"source": [
"## Installation"
@ -25,7 +38,9 @@
{
"cell_type": "markdown",
"id": "81f43794-f002-477c-9b68-4975df30e718",
"metadata": {},
"metadata": {
"id": "81f43794-f002-477c-9b68-4975df30e718"
},
"source": [
"Check out [Elasticsearch installation instructions](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).\n",
"\n",
@ -89,7 +104,8 @@
"execution_count": null,
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c",
"metadata": {
"tags": []
"tags": [],
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c"
},
"outputs": [],
"source": [
@ -98,10 +114,12 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
"metadata": {
"tags": []
"tags": [],
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
"outputId": "fd16b37f-cb76-40a9-b83f-eab58dd0d912"
},
"outputs": [
{
@ -123,7 +141,8 @@
"cell_type": "markdown",
"id": "f6030187-0bd7-4798-8372-a265036af5e0",
"metadata": {
"tags": []
"tags": [],
"id": "f6030187-0bd7-4798-8372-a265036af5e0"
},
"source": [
"## Example"
@ -131,10 +150,11 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "aac9563e",
"metadata": {
"tags": []
"tags": [],
"id": "aac9563e"
},
"outputs": [],
"source": [
@ -146,10 +166,11 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "a3c3999a",
"metadata": {
"tags": []
"tags": [],
"id": "a3c3999a"
},
"outputs": [],
"source": [
@ -167,7 +188,8 @@
"execution_count": null,
"id": "12eb86d8",
"metadata": {
"tags": []
"tags": [],
"id": "12eb86d8"
},
"outputs": [],
"source": [
@ -179,9 +201,12 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"id": "4b172de8",
"metadata": {},
"metadata": {
"id": "4b172de8",
"outputId": "ca05a209-4514-4b5c-f6cb-2348f58c19a2"
},
"outputs": [
{
"name": "stdout",
@ -205,13 +230,327 @@
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"source": [
"# ElasticKnnSearch Class\n",
"The `ElasticKnnSearch` class stores vectors and documents in Elasticsearch for use with approximate [kNN search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html)."
],
"metadata": {
"id": "FheGPztJsrRB"
},
"id": "FheGPztJsrRB"
},
{
"cell_type": "code",
"source": [
"!pip install langchain elasticsearch"
],
"metadata": {
"id": "gRVcbh5zqCJQ"
},
"execution_count": null,
"outputs": [],
"id": "gRVcbh5zqCJQ"
},
{
"cell_type": "code",
"source": [
"from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch\n",
"from langchain.embeddings import ElasticsearchEmbeddings\n",
"import elasticsearch"
],
"metadata": {
"id": "TJtqiw5AqBp8"
},
"execution_count": null,
"outputs": [],
"id": "TJtqiw5AqBp8"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticsearchEmbeddings\n",
"model_id = \"<model_id_from_es>\" \n",
"dims = dim_count\n",
"es_cloud_id = \"ESS_CLOUD_ID\"\n",
"es_user = \"es_user\"\n",
"es_password = \"es_pass\"\n",
"test_index = \"<index_name>\"\n",
"#input_field = \"your_input_field\" # if different from 'text_field'"
],
"metadata": {
"id": "XHfC0As6qN3T"
},
"execution_count": null,
"outputs": [],
"id": "XHfC0As6qN3T"
},
{
"cell_type": "code",
"source": [
"# Generate embedding object\n",
"embeddings = ElasticsearchEmbeddings.from_credentials(\n",
" model_id,\n",
" #input_field=input_field,\n",
" es_cloud_id=es_cloud_id,\n",
" es_user=es_user,\n",
" es_password=es_password,\n",
")"
],
"metadata": {
"id": "UkTipx1lqc3h"
},
"execution_count": null,
"outputs": [],
"id": "UkTipx1lqc3h"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticKnnSearch\n",
"knn_search = ElasticKnnSearch(\n",
"\tes_cloud_id=es_cloud_id, \n",
"\tes_user=es_user, \n",
"\tes_password=es_password, \n",
"\tindex_name= test_index, \n",
"\tembedding= embeddings\n",
")"
],
"metadata": {
"id": "74psgD0oqjYK"
},
"execution_count": null,
"outputs": [],
"id": "74psgD0oqjYK"
},
{
"cell_type": "markdown",
"source": [
"## Test adding vectors"
],
"metadata": {
"id": "7AfgIKLWqnQl"
},
"id": "7AfgIKLWqnQl"
},
{
"cell_type": "code",
"source": [
"# Test `add_texts` method\n",
"texts = [\"Hello, world!\", \"Machine learning is fun.\", \"I love Python.\"]\n",
"knn_search.add_texts(texts)\n",
"\n",
"# Test `from_texts` method\n",
"new_texts = [\"This is a new text.\", \"Elasticsearch is powerful.\", \"Python is great for data analysis.\"]\n",
"knn_search.from_texts(new_texts, dims=dims)"
],
"metadata": {
"id": "yNUUIaL9qmze"
},
"execution_count": null,
"outputs": [],
"id": "yNUUIaL9qmze"
},
{
"cell_type": "markdown",
"source": [
"## Test knn search using query vector builder "
],
"metadata": {
"id": "0zdR-Iubquov"
},
"id": "0zdR-Iubquov"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)\n",
"print(f\"kNN search results for query '{query}': {knn_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n",
"\n",
"# Test `knn_hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2)\n",
"print(f\"Hybrid search results for query '{query}': {hybrid_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'\")"
],
"metadata": {
"id": "bwR4jYvqqxTo"
},
"execution_count": null,
"outputs": [],
"id": "bwR4jYvqqxTo"
},
{
"cell_type": "markdown",
"source": [
"## Test knn search using pre generated vector \n"
],
"metadata": {
"id": "ltXYqp0qqz7R"
},
"id": "ltXYqp0qqz7R"
},
{
"cell_type": "code",
"source": [
"# Generate embedding for tests\n",
"query_text = 'Hello'\n",
"query_embedding = embeddings.embed_query(query_text)\n",
"print(f\"Length of embedding: {len(query_embedding)}\\nFirst two items in embedding: {query_embedding[:2]}\")\n",
"\n",
"# Test knn Search\n",
"knn_result = knn_search.knn_search(query_vector = query_embedding, k=2)\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n",
"\n",
"# Test hybrid search - Requires both query_text and query_vector\n",
"knn_result = knn_search.knn_hybrid_search(query_vector = query_embedding, query=query_text, k=2)\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")"
],
"metadata": {
"id": "O5COtpTqq23t"
},
"execution_count": null,
"outputs": [],
"id": "O5COtpTqq23t"
},
{
"cell_type": "markdown",
"source": [
"## Test source option"
],
"metadata": {
"id": "0dnmimcJq42C"
},
"id": "0dnmimcJq42C"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, source=False)\n",
"assert '_source' not in knn_result['hits']['hits'][0].keys()\n",
"\n",
"# Test `knn_hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, source=False)\n",
"assert '_source' not in hybrid_result['hits']['hits'][0].keys()"
],
"metadata": {
"id": "v4_B72nHq7g1"
},
"execution_count": null,
"outputs": [],
"id": "v4_B72nHq7g1"
},
{
"cell_type": "markdown",
"source": [
"## Test fields option "
],
"metadata": {
"id": "teHgJgrlq-Jb"
},
"id": "teHgJgrlq-Jb"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, fields=['text'])\n",
"assert 'text' in knn_result['hits']['hits'][0]['fields'].keys()\n",
"\n",
"# Test `knn_hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, fields=['text'])\n",
"assert 'text' in hybrid_result['hits']['hits'][0]['fields'].keys()"
],
"metadata": {
"id": "utNBbpZYrAYW"
},
"execution_count": null,
"id": "a359ed74",
"metadata": {},
"outputs": [],
"source": []
"id": "utNBbpZYrAYW"
},
{
"cell_type": "markdown",
"source": [
"### Test with es client connection rather than cloud_id "
],
"metadata": {
"id": "hddsIFferBy1"
},
"id": "hddsIFferBy1"
},
{
"cell_type": "code",
"source": [
"# Create Elasticsearch connection (the notebook imports the `elasticsearch` module above)\n",
"es_connection = elasticsearch.Elasticsearch(\n",
" hosts=['https://es_cluster_url:port'], \n",
" basic_auth=('user', 'password')\n",
")"
],
"metadata": {
"id": "bXqrUnoirFia"
},
"execution_count": null,
"outputs": [],
"id": "bXqrUnoirFia"
},
{
"cell_type": "code",
"source": [
"# Instantiate ElasticsearchEmbeddings using es_connection\n",
"embeddings = ElasticsearchEmbeddings.from_es_connection(\n",
" model_id,\n",
" es_connection,\n",
")"
],
"metadata": {
"id": "TIM__Hm8rSEW"
},
"execution_count": null,
"outputs": [],
"id": "TIM__Hm8rSEW"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticKnnSearch\n",
"knn_search = ElasticKnnSearch(\n",
"\tes_connection = es_connection,\n",
"\tindex_name= test_index, \n",
"\tembedding= embeddings\n",
")"
],
"metadata": {
"id": "1-CdnOrArVc_"
},
"execution_count": null,
"outputs": [],
"id": "1-CdnOrArVc_"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query = query, model_id= model_id, k=2)\n",
"print(f\"kNN search results for query '{query}': {knn_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n"
],
"metadata": {
"id": "0kgyaL6QrYVF"
},
"execution_count": null,
"outputs": [],
"id": "0kgyaL6QrYVF"
}
],
"metadata": {
@ -231,6 +570,9 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,

@ -3,13 +3,26 @@ from __future__ import annotations
import uuid
from abc import ABC
from typing import Any, Dict, Iterable, List, Optional, Tuple
from typing import (
    TYPE_CHECKING,
    Any,
    Dict,
    Iterable,
    List,
    Mapping,
    Optional,
    Tuple,
    Union,
)
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.utils import get_from_env
from langchain.vectorstores.base import VectorStore
if TYPE_CHECKING:
    from elasticsearch import Elasticsearch


def _default_text_mapping(dim: int) -> Dict:
    return {
@ -304,3 +317,239 @@ class ElasticVectorSearch(VectorStore, ABC):
index=index_name, body={"query": script_query, "size": size}
)
return response
class ElasticKnnSearch(ElasticVectorSearch):
    """
    A class for performing k-Nearest Neighbors (k-NN) search on an Elasticsearch index.
    The class is designed for a text search scenario where documents are text strings
    and their embeddings are vector representations of those strings.
    """

    def __init__(
        self,
        index_name: str,
        embedding: Embeddings,
        es_connection: Optional["Elasticsearch"] = None,
        es_cloud_id: Optional[str] = None,
        es_user: Optional[str] = None,
        es_password: Optional[str] = None,
    ):
        """
        Initializes an instance of the ElasticKnnSearch class and sets up the
        Elasticsearch client.

        Args:
            index_name: The name of the Elasticsearch index.
            embedding: An instance of the Embeddings class, used to generate vector
                representations of text strings.
            es_connection: An existing Elasticsearch connection.
            es_cloud_id: The Cloud ID of the Elasticsearch instance. Required if
                creating a new connection.
            es_user: The username for the Elasticsearch instance. Required if
                creating a new connection.
            es_password: The password for the Elasticsearch instance. Required if
                creating a new connection.
        """
        try:
            import elasticsearch
        except ImportError:
            raise ImportError(
                "Could not import elasticsearch python package. "
                "Please install it with `pip install elasticsearch`."
            )

        self.embedding = embedding
        self.index_name = index_name

        # If a pre-existing Elasticsearch connection is provided, use it.
        if es_connection is not None:
            self.client = es_connection
        else:
            # If credentials for a new Elasticsearch connection are provided,
            # create a new connection.
            if es_cloud_id and es_user and es_password:
                self.client = elasticsearch.Elasticsearch(
                    cloud_id=es_cloud_id, basic_auth=(es_user, es_password)
                )
            else:
                # Adjacent string literals keep the message free of stray whitespace.
                raise ValueError(
                    "Either provide a pre-existing Elasticsearch connection, "
                    "or valid credentials for creating a new connection."
                )
    @staticmethod
    def _default_knn_mapping(dims: int) -> Dict:
        """Generates a default index mapping for kNN search."""
        return {
            "properties": {
                "text": {"type": "text"},
                "vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "dot_product",
                },
            }
        }
    @staticmethod
    def _default_knn_query(
        query_vector: Optional[List[float]] = None,
        query: Optional[str] = None,
        model_id: Optional[str] = None,
        field: Optional[str] = "vector",
        k: Optional[int] = 10,
        num_candidates: Optional[int] = 10,
    ) -> Dict:
        knn: Dict = {
            "field": field,
            "k": k,
            "num_candidates": num_candidates,
        }

        # Case 1: `query_vector` is provided, but not `model_id` -> use query_vector
        if query_vector and not model_id:
            knn["query_vector"] = query_vector

        # Case 2: `query` and `model_id` are provided, -> use query_vector_builder
        elif query and model_id:
            knn["query_vector_builder"] = {
                "text_embedding": {
                    "model_id": model_id,  # use 'model_id' argument
                    "model_text": query,  # use 'query' argument
                }
            }

        else:
            raise ValueError(
                "Either `query_vector` or `model_id` must be provided, but not both."
            )

        return knn
    def knn_search(
        self,
        query: Optional[str] = None,
        k: Optional[int] = 10,
        query_vector: Optional[List[float]] = None,
        model_id: Optional[str] = None,
        size: Optional[int] = 10,
        source: Optional[bool] = True,
        fields: Optional[
            Union[List[Mapping[str, Any]], Tuple[Mapping[str, Any], ...], None]
        ] = None,
    ) -> Dict:
        """
        Performs a k-nearest neighbor (k-NN) search on the Elasticsearch index.

        The search can be conducted using either a raw query vector or a model ID.
        The method first generates the body of the search query, which can be
        interpreted by Elasticsearch. It then performs the k-NN search on the
        Elasticsearch index and returns the results.

        Args:
            query: The query or queries to be used for the search. Required if
                `query_vector` is not provided.
            k: The number of nearest neighbors to return. Defaults to 10.
            query_vector: The query vector to be used for the search. Required if
                `query` is not provided.
            model_id: The ID of the model to use for generating the query vector, if
                `query` is provided.
            size: The number of search hits to return. Defaults to 10.
            source: Whether to include the source of each hit in the results.
            fields: The fields to include in the source of each hit. If None, all
                fields are included.

        Returns:
            The search results.

        Raises:
            ValueError: If neither `query_vector` nor `model_id` is provided, or if
                both are provided.
        """
        knn_query_body = self._default_knn_query(
            query_vector=query_vector, query=query, model_id=model_id, k=k
        )

        # Perform the kNN search on the Elasticsearch index and return the results.
        res = self.client.search(
            index=self.index_name,
            knn=knn_query_body,
            size=size,
            source=source,
            fields=fields,
        )
        return dict(res)
    def knn_hybrid_search(
        self,
        query: Optional[str] = None,
        k: Optional[int] = 10,
        query_vector: Optional[List[float]] = None,
        model_id: Optional[str] = None,
        size: Optional[int] = 10,
        source: Optional[bool] = True,
        knn_boost: Optional[float] = 0.9,
        query_boost: Optional[float] = 0.1,
        fields: Optional[
            Union[List[Mapping[str, Any]], Tuple[Mapping[str, Any], ...], None]
        ] = None,
    ) -> Dict[Any, Any]:
        """Performs a hybrid k-nearest neighbor (k-NN) and text-based search on the
        Elasticsearch index.

        The search can be conducted using either a raw query vector or a model ID.
        The method first generates the body of the k-NN search query and the
        text-based query, which can be interpreted by Elasticsearch. It then
        performs the hybrid search on the Elasticsearch index and returns the
        results.

        Args:
            query: The query or queries to be used for the search. Required if
                `query_vector` is not provided.
            k: The number of nearest neighbors to return. Defaults to 10.
            query_vector: The query vector to be used for the search. Required if
                `query` is not provided.
            model_id: The ID of the model to use for generating the query vector, if
                `query` is provided.
            size: The number of search hits to return. Defaults to 10.
            source: Whether to include the source of each hit in the results.
            knn_boost: The boost factor for the k-NN part of the search.
            query_boost: The boost factor for the text-based part of the search.
            fields: The fields to include in the source of each hit. If None, all
                fields are included. Defaults to None.

        Returns:
            The search results.

        Raises:
            ValueError: If neither `query_vector` nor `model_id` is provided, or if
                both are provided.
        """
        knn_query_body = self._default_knn_query(
            query_vector=query_vector, query=query, model_id=model_id, k=k
        )

        # Modify the knn_query_body to add a "boost" parameter
        knn_query_body["boost"] = knn_boost

        # Generate the body of the standard Elasticsearch query
        match_query_body = {"match": {"text": {"query": query, "boost": query_boost}}}

        # Perform the hybrid search on the Elasticsearch index and return the results.
        res = self.client.search(
            index=self.index_name,
            query=match_query_body,
            knn=knn_query_body,
            fields=fields,
            size=size,
            source=source,
        )
        return dict(res)
