forked from Archives/langchain
Add Alibaba Cloud OpenSearch as a new vector store (#6154)
Hello folks, thanks for creating and maintaining this great project. I'm excited to submit this PR to add Alibaba Cloud OpenSearch as a new vector store. OpenSearch is a one-stop platform to develop intelligent search services. OpenSearch was built on the large-scale distributed search engine developed by Alibaba. OpenSearch serves more than 500 business cases in Alibaba Group and thousands of Alibaba Cloud customers. OpenSearch helps develop search services in different search scenarios, including e-commerce, O2O, multimedia, the content industry, communities and forums, and big data query in enterprises. OpenSearch provides a vector search feature. In specific scenarios, especially test question search and image search scenarios, you can use the vector search feature together with the multimodal search feature to improve the accuracy of search results.

This PR includes:
- An `AlibabaCloudOpenSearch` class that connects to an Alibaba Cloud OpenSearch instance.
- Adding embeddings and metadata to an OpenSearch data source.
- Querying by squared Euclidean distance and by metadata.
- Integration tests.
- An IPython notebook and docs.

I have read your contributing guidelines, and I have passed the tests below:
- [x] make format
- [x] make lint
- [x] make coverage
- [x] make test

---------
Co-authored-by: zhaoshengbo <shengbo.zsb@alibaba-inc.com>
This commit is contained in:
parent
b7ad4c4c30
commit
ab44c24333
@ -0,0 +1,28 @@
# Alibaba Cloud OpenSearch

[Alibaba Cloud OpenSearch](https://www.alibabacloud.com/product/opensearch) is a one-stop platform to develop intelligent search services. OpenSearch was built on the large-scale distributed search engine developed by Alibaba. OpenSearch serves more than 500 business cases in Alibaba Group and thousands of Alibaba Cloud customers. OpenSearch helps develop search services in different search scenarios, including e-commerce, O2O, multimedia, the content industry, communities and forums, and big data query in enterprises.

OpenSearch helps you develop high-quality, maintenance-free, and high-performance intelligent search services to provide your users with high search efficiency and accuracy.

OpenSearch provides the vector search feature. In specific scenarios, especially test question search and image search scenarios, you can use the vector search feature together with the multimodal search feature to improve the accuracy of search results.

## Purchase an instance and configure it

- Purchase OpenSearch Vector Search Edition from [Alibaba Cloud](https://opensearch.console.aliyun.com) and configure the instance according to the help [documentation](https://help.aliyun.com/document_detail/463198.html?spm=a2c4g.465092.0.0.2cd15002hdwavO).

## Alibaba Cloud OpenSearch Vector Store Wrappers

Supported functions:

- `add_texts`
- `add_documents`
- `from_texts`
- `from_documents`
- `similarity_search`
- `asimilarity_search`
- `similarity_search_by_vector`
- `asimilarity_search_by_vector`
- `similarity_search_with_relevance_scores`

For a more detailed walkthrough of the Alibaba Cloud OpenSearch wrapper, see [this notebook](../modules/indexes/vectorstores/examples/alibabacloud_opensearch.ipynb).
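A minimal usage sketch (the endpoint, credentials, and field mapping below are illustrative placeholders; substitute the values of your own instance):

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import AlibabaCloudOpenSearch, AlibabaCloudOpenSearchSettings

# Connection settings for your OpenSearch Vector Search Edition instance.
settings = AlibabaCloudOpenSearchSettings(
    endpoint="<instance endpoint>",
    instance_id="<instance id>",
    datasource_name="<data source name>",
    username="<username>",
    password="<password>",
    embedding_index_name="<vector index name>",
    field_name_mapping={"id": "id", "document": "document", "embedding": "embedding"},
)

# Index a few texts and run a similarity search.
opensearch = AlibabaCloudOpenSearch.from_texts(
    texts=["foo", "bar", "baz"], embedding=OpenAIEmbeddings(), config=settings
)
docs = opensearch.similarity_search("foo")
```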

If you encounter any problems during use, please feel free to contact [xingshaomin.xsm@alibaba-inc.com](mailto:xingshaomin.xsm@alibaba-inc.com), and we will do our best to provide you with assistance and support.
@ -0,0 +1,294 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"# Alibaba Cloud OpenSearch\n",
|
||||
"\n",
|
||||
">[Alibaba Cloud Opensearch](https://www.alibabacloud.com/product/opensearch) OpenSearch is a one-stop platform to develop intelligent search services. OpenSearch was built based on the large-scale distributed search engine developed by Alibaba. OpenSearch serves more than 500 business cases in Alibaba Group and thousands of Alibaba Cloud customers. OpenSearch helps develop search services in different search scenarios, including e-commerce, O2O, multimedia, the content industry, communities and forums, and big data query in enterprises.\n",
|
||||
"\n",
|
||||
">OpenSearch helps you develop high quality, maintenance-free, and high performance intelligent search services to provide your users with high search efficiency and accuracy.\n",
|
||||
"\n",
|
||||
">OpenSearch provides the vector search feature. In specific scenarios, especially test question search and image search scenarios, you can use the vector search feature together with the multimodal search feature to improve the accuracy of search results. This topic describes the syntax and usage notes of vector indexes.\n",
|
||||
"\n",
|
||||
"This notebook shows how to use functionality related to the `Alibaba Cloud OpenSearch Vector Search Edition`.\n",
|
||||
"To run, you should have an [OpenSearch Vector Search Edition](https://opensearch.console.aliyun.com) instance up and running:\n",
|
||||
"- Read the [help document](https://www.alibabacloud.com/help/en/opensearch/latest/vector-search) to quickly familiarize and configure OpenSearch Vector Search Edition instance.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"After completing the configuration, follow these steps to connect to the instance, index documents, and perform vector retrieval."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores import (\n",
|
||||
" AlibabaCloudOpenSearch,\n",
|
||||
" AlibabaCloudOpenSearchSettings,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Split documents and get embeddings by call OpenAI API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"\n",
|
||||
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Create opensearch settings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"settings = AlibabaCloudOpenSearchSettings(\n",
|
||||
" endpoint=\"The endpoint of opensearch instance, You can find it from the console of Alibaba Cloud OpenSearch.\",\n",
|
||||
" instance_id=\"The identify of opensearch instance, You can find it from the console of Alibaba Cloud OpenSearch.\",\n",
|
||||
" datasource_name=\"The name of the data source specified when creating it.\",\n",
|
||||
" username=\"The username specified when purchasing the instance.\",\n",
|
||||
" password=\"The password specified when purchasing the instance.\",\n",
|
||||
" embedding_index_name=\"The name of the vector attribute specified when configuring the instance attributes.\",\n",
|
||||
" field_name_mapping={\n",
|
||||
" \"id\": \"id\", # The id field name mapping of index document.\n",
|
||||
" \"document\": \"document\", # The text field name mapping of index document.\n",
|
||||
" \"embedding\": \"embedding\", # The embedding field name mapping of index document.\n",
|
||||
" \"metadata_x\": \"metadata_x,=\", # The metadata field name mapping of index document, could specify multiple, The value field contains mapping name and operator, the operator would be used when executing metadata filter query.\n",
|
||||
" },\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# for example\n",
|
||||
"# settings = AlibabaCloudOpenSearchSettings(\n",
|
||||
"# endpoint=\"ha-cn-5yd39d83c03.public.ha.aliyuncs.com\",\n",
|
||||
"# instance_id=\"ha-cn-5yd39d83c03\",\n",
|
||||
"# datasource_name=\"ha-cn-5yd39d83c03_test\",\n",
|
||||
"# username=\"this is a user name\",\n",
|
||||
"# password=\"this is a password\",\n",
|
||||
"# embedding_index_name=\"index_embedding\",\n",
|
||||
"# field_name_mapping={\n",
|
||||
"# \"id\": \"id\",\n",
|
||||
"# \"document\": \"document\",\n",
|
||||
"# \"embedding\": \"embedding\",\n",
|
||||
"# \"metadata\": \"metadata,=\" #The value field contains mapping name and operator, the operator would be used when executing metadata filter query\n",
|
||||
"# })"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Create an opensearch access instance by settings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create an opensearch instance and index docs.\n",
|
||||
"opensearch = AlibabaCloudOpenSearch.from_texts(\n",
|
||||
" texts=docs, embedding=embeddings, config=settings\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"or"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create an opensearch instance.\n",
|
||||
"opensearch = AlibabaCloudOpenSearch(embedding=embeddings, config=settings)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Add texts and build index."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"metadatas = {\"md_key_a\": \"md_val_a\", \"md_key_b\": \"md_val_b\"}\n",
|
||||
"# the key of metadatas must match field_name_mapping in settings.\n",
|
||||
"opensearch.add_texts(texts=docs, ids=[], metadatas=metadatas)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Query and retrieve data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"docs = opensearch.similarity_search(query)\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Query and retrieve data with metadata\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"metadatas = {\"md_key_a\": \"md_val_a\"}\n",
|
||||
"docs = opensearch.similarity_search(query, filter=metadatas)\n",
|
||||
"print(docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"If you encounter any problems during use, please feel free to contact <xingshaomin.xsm@alibaba-inc.com>, and we will do our best to provide you with assistance and support.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -1,4 +1,8 @@
|
||||
"""Wrappers on top of vector stores."""
|
||||
from langchain.vectorstores.alibabacloud_opensearch import (
|
||||
AlibabaCloudOpenSearch,
|
||||
AlibabaCloudOpenSearchSettings,
|
||||
)
|
||||
from langchain.vectorstores.analyticdb import AnalyticDB
|
||||
from langchain.vectorstores.annoy import Annoy
|
||||
from langchain.vectorstores.atlas import AtlasDB
|
||||
@ -32,6 +36,8 @@ from langchain.vectorstores.weaviate import Weaviate
|
||||
from langchain.vectorstores.zilliz import Zilliz
|
||||
|
||||
__all__ = [
|
||||
"AlibabaCloudOpenSearch",
|
||||
"AlibabaCloudOpenSearchSettings",
|
||||
"AnalyticDB",
|
||||
"Annoy",
|
||||
"AtlasDB",
|
||||
|
langchain/vectorstores/alibabacloud_opensearch.py (new file, 363 lines)
@ -0,0 +1,363 @@
|
||||
import json
|
||||
import logging
|
||||
import numbers
|
||||
from hashlib import sha1
|
||||
from typing import Any, Dict, Iterable, List, Optional, Tuple
|
||||
|
||||
from langchain.embeddings.base import Embeddings
|
||||
from langchain.schema import Document
|
||||
from langchain.vectorstores.base import VectorStore
|
||||
|
||||
logger = logging.getLogger()
|
||||
|
||||
|
||||
class AlibabaCloudOpenSearchSettings:
|
||||
"""Opensearch Client Configuration
|
||||
Attribute:
|
||||
endpoint (str) : The endpoint of opensearch instance, You can find it
|
||||
from the console of Alibaba Cloud OpenSearch.
|
||||
instance_id (str) : The identify of opensearch instance, You can find
|
||||
it from the console of Alibaba Cloud OpenSearch.
|
||||
datasource_name (str): The name of the data source specified when creating it.
|
||||
username (str) : The username specified when purchasing the instance.
|
||||
password (str) : The password specified when purchasing the instance.
|
||||
embedding_index_name (str) : The name of the vector attribute specified
|
||||
when configuring the instance attributes.
|
||||
field_name_mapping (Dict) : Using field name mapping between opensearch
|
||||
vector store and opensearch instance configuration table field names:
|
||||
{
|
||||
'id': 'The id field name map of index document.',
|
||||
'document': 'The text field name map of index document.',
|
||||
'embedding': 'In the embedding field of the opensearch instance,
|
||||
the values must be in float16 multivalue type and separated by commas.',
|
||||
'metadata_field_x': 'Metadata field mapping includes the mapped
|
||||
field name and operator in the mapping value, separated by a comma
|
||||
between the mapped field name and the operator.',
|
||||
}
|
||||
"""
|
||||
|
||||
endpoint: str
|
||||
instance_id: str
|
||||
username: str
|
||||
password: str
|
||||
datasource_name: str
|
||||
embedding_index_name: str
|
||||
field_name_mapping: Dict[str, str] = {
|
||||
"id": "id",
|
||||
"document": "document",
|
||||
"embedding": "embedding",
|
||||
"metadata_field_x": "metadata_field_x,operator",
|
||||
}
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
endpoint: str,
|
||||
instance_id: str,
|
||||
username: str,
|
||||
password: str,
|
||||
datasource_name: str,
|
||||
embedding_index_name: str,
|
||||
field_name_mapping: Dict[str, str],
|
||||
) -> None:
|
||||
self.endpoint = endpoint
|
||||
self.instance_id = instance_id
|
||||
self.username = username
|
||||
self.password = password
|
||||
self.datasource_name = datasource_name
|
||||
self.embedding_index_name = embedding_index_name
|
||||
self.field_name_mapping = field_name_mapping
|
||||
|
||||
def __getitem__(self, item: str) -> Any:
|
||||
return getattr(self, item)
|
||||
|
||||
|
||||
def create_metadata(fields: Dict[str, Any]) -> Dict[str, Any]:
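    """Collect all fields except id, document, and embedding into a metadata dict."""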
|
||||
metadata: Dict[str, Any] = {}
|
||||
for key, value in fields.items():
|
||||
if key == "id" or key == "document" or key == "embedding":
|
||||
continue
|
||||
metadata[key] = value
|
||||
return metadata
|
||||
|
||||
|
||||
class AlibabaCloudOpenSearch(VectorStore):
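    """Alibaba Cloud OpenSearch vector store."""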
|
||||
def __init__(
|
||||
self,
|
||||
embedding: Embeddings,
|
||||
config: AlibabaCloudOpenSearchSettings,
|
||||
**kwargs: Any,
|
||||
) -> None:
|
||||
try:
|
||||
from alibabacloud_ha3engine import client, models
|
||||
from alibabacloud_tea_util import models as util_models
|
||||
except ImportError:
|
||||
raise ValueError(
|
||||
"Could not import alibaba cloud opensearch python package. "
|
||||
"Please install it with `pip install alibabacloud-ha3engine`."
|
||||
)
|
||||
|
||||
self.config = config
|
||||
self.embedding = embedding
|
||||
|
||||
self.runtime = util_models.RuntimeOptions(
|
||||
connect_timeout=5000,
|
||||
read_timeout=10000,
|
||||
autoretry=False,
|
||||
ignore_ssl=False,
|
||||
max_idle_conns=50,
|
||||
)
|
||||
self.ha3EngineClient = client.Client(
|
||||
models.Config(
|
||||
endpoint=config.endpoint,
|
||||
instance_id=config.instance_id,
|
||||
protocol="http",
|
||||
access_user_name=config.username,
|
||||
access_pass_word=config.password,
|
||||
)
|
||||
)
|
||||
|
||||
self.options_headers: Dict[str, str] = {}
|
||||
|
||||
def add_texts(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
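        """Embed the given texts, push them to the OpenSearch data source,
        and return the generated document ids on success (an empty list otherwise).
        """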
|
||||
def _upsert(push_doc_list: List[Dict]) -> List[str]:
|
||||
if push_doc_list is None or len(push_doc_list) == 0:
|
||||
return []
|
||||
try:
|
||||
push_request = models.PushDocumentsRequestModel(
|
||||
self.options_headers, push_doc_list
|
||||
)
|
||||
push_response = self.ha3EngineClient.push_documents(
|
||||
self.config.datasource_name, field_name_map["id"], push_request
|
||||
)
|
||||
json_response = json.loads(push_response.body)
|
||||
if json_response["status"] == "OK":
|
||||
return [
|
||||
push_doc["fields"][field_name_map["id"]]
|
||||
for push_doc in push_doc_list
|
||||
]
|
||||
return []
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"add doc to endpoint:{self.config.endpoint} "
|
||||
f"instance_id:{self.config.instance_id} failed.",
|
||||
e,
|
||||
)
|
||||
raise e
|
||||
|
||||
from alibabacloud_ha3engine import models
|
||||
|
||||
ids = [sha1(t.encode("utf-8")).hexdigest() for t in texts]
|
||||
embeddings = self.embedding.embed_documents(list(texts))
|
||||
metadatas = metadatas or [{} for _ in texts]
|
||||
field_name_map = self.config.field_name_mapping
|
||||
add_doc_list = []
|
||||
text_list = list(texts)
|
||||
for idx, doc_id in enumerate(ids):
|
||||
embedding = embeddings[idx] if idx < len(embeddings) else None
|
||||
metadata = metadatas[idx] if idx < len(metadatas) else None
|
||||
text = text_list[idx] if idx < len(text_list) else None
|
||||
            add_doc: Dict[str, Any] = {"cmd": "add"}
            add_doc_fields: Dict[str, Any] = {}
            add_doc_fields[field_name_map["id"]] = doc_id
            add_doc_fields[field_name_map["document"]] = text
            if embedding is not None:
                add_doc_fields[field_name_map["embedding"]] = ",".join(
                    str(unit) for unit in embedding
                )
            if metadata is not None:
                for md_key, md_value in metadata.items():
                    add_doc_fields[field_name_map[md_key].split(",")[0]] = md_value
            add_doc["fields"] = add_doc_fields
            add_doc_list.append(add_doc)
|
||||
return _upsert(add_doc_list)
|
||||
|
||||
def similarity_search(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
search_filter: Optional[Dict[str, Any]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
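        """Return the documents most similar to the query text,
        optionally restricted by a metadata filter.
        """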
|
||||
embedding = self.embedding.embed_query(query)
|
||||
return self.create_results(
|
||||
self.inner_embedding_query(
|
||||
embedding=embedding, search_filter=search_filter, k=k
|
||||
)
|
||||
)
|
||||
|
||||
def similarity_search_with_relevance_scores(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
search_filter: Optional[dict] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Tuple[Document, float]]:
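        """Return the documents most similar to the query text,
        together with their raw relevance scores.
        """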
|
||||
embedding: List[float] = self.embedding.embed_query(query)
|
||||
return self.create_results_with_score(
|
||||
self.inner_embedding_query(
|
||||
embedding=embedding, search_filter=search_filter, k=k
|
||||
)
|
||||
)
|
||||
|
||||
def similarity_search_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
search_filter: Optional[dict] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
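        """Return the documents most similar to the given query embedding."""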
|
||||
return self.create_results(
|
||||
self.inner_embedding_query(
|
||||
embedding=embedding, search_filter=search_filter, k=k
|
||||
)
|
||||
)
|
||||
|
||||
def inner_embedding_query(
|
||||
self,
|
||||
embedding: List[float],
|
||||
search_filter: Optional[Dict[str, Any]] = None,
|
||||
k: int = 4,
|
||||
) -> Dict[str, Any]:
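        """Build and execute a vector query against the OpenSearch instance,
        returning the raw JSON response, or an empty result on error.
        """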
|
||||
def generate_embedding_query() -> str:
|
||||
tmp_search_config_str = (
|
||||
f"config=start:0,hit:{k},format:json&&cluster=general&&kvpairs="
|
||||
f"first_formula:proxima_score({self.config.embedding_index_name})&&sort=+RANK"
|
||||
)
|
||||
tmp_query_str = (
|
||||
f"&&query={self.config.embedding_index_name}:"
|
||||
+ "'"
|
||||
+ ",".join(str(x) for x in embedding)
|
||||
+ "'"
|
||||
)
|
||||
if search_filter is not None:
|
||||
filter_clause = "&&filter=" + " AND ".join(
|
||||
[
|
||||
create_filter(md_key, md_value)
|
||||
for md_key, md_value in search_filter.items()
|
||||
]
|
||||
)
|
||||
tmp_query_str += filter_clause
|
||||
|
||||
return tmp_search_config_str + tmp_query_str
|
||||
|
||||
def create_filter(md_key: str, md_value: Any) -> str:
|
||||
md_filter_expr = self.config.field_name_mapping[md_key]
|
||||
if md_filter_expr is None:
|
||||
return ""
|
||||
expr = md_filter_expr.split(",")
|
||||
if len(expr) != 2:
|
||||
logger.error(
|
||||
f"filter {md_filter_expr} express is not correct, "
|
||||
f"must contain mapping field and operator."
|
||||
)
|
||||
return ""
|
||||
md_filter_key = expr[0].strip()
|
||||
md_filter_operator = expr[1].strip()
|
||||
if isinstance(md_value, numbers.Number):
|
||||
return f"{md_filter_key} {md_filter_operator} {md_value}"
|
||||
return f'{md_filter_key}{md_filter_operator}"{md_value}"'
|
||||
|
||||
def search_data(single_query_str: str) -> Dict[str, Any]:
|
||||
search_query = models.SearchQuery(query=single_query_str)
|
||||
search_request = models.SearchRequestModel(
|
||||
self.options_headers, search_query
|
||||
)
|
||||
return json.loads(self.ha3EngineClient.search(search_request).body)
|
||||
|
||||
from alibabacloud_ha3engine import models
|
||||
|
||||
try:
|
||||
query_str = generate_embedding_query()
|
||||
json_response = search_data(query_str)
|
||||
if len(json_response["errors"]) != 0:
|
||||
logger.error(
|
||||
f"query {self.config.endpoint} {self.config.instance_id} "
|
||||
f"errors:{json_response['errors']} failed."
|
||||
)
|
||||
else:
|
||||
return json_response
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"query instance endpoint:{self.config.endpoint} "
|
||||
f"instance_id:{self.config.instance_id} failed.",
|
||||
e,
|
||||
)
|
||||
return {}
|
||||
|
||||
def create_results(self, json_result: Dict[str, Any]) -> List[Document]:
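        """Convert a raw OpenSearch query response into a list of Documents."""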
|
||||
items = json_result["result"]["items"]
|
||||
query_result_list: List[Document] = []
|
||||
for item in items:
|
||||
fields = item["fields"]
|
||||
query_result_list.append(
|
||||
Document(
|
||||
page_content=fields[self.config.field_name_mapping["document"]],
|
||||
metadata=create_metadata(fields),
|
||||
)
|
||||
)
|
||||
return query_result_list
|
||||
|
||||
def create_results_with_score(
|
||||
self, json_result: Dict[str, Any]
|
||||
) -> List[Tuple[Document, float]]:
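        """Convert a raw OpenSearch query response into (Document, score) tuples."""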
|
||||
items = json_result["result"]["items"]
|
||||
query_result_list: List[Tuple[Document, float]] = []
|
||||
for item in items:
|
||||
fields = item["fields"]
|
||||
query_result_list.append(
|
||||
(
|
||||
Document(
|
||||
page_content=fields[self.config.field_name_mapping["document"]],
|
||||
metadata=create_metadata(fields),
|
||||
),
|
||||
float(item["sortExprValues"][0]),
|
||||
)
|
||||
)
|
||||
return query_result_list
|
||||
|
||||
@classmethod
|
||||
def from_texts(
|
||||
cls,
|
||||
texts: List[str],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
config: Optional[AlibabaCloudOpenSearchSettings] = None,
|
||||
**kwargs: Any,
|
||||
) -> "AlibabaCloudOpenSearch":
|
||||
if config is None:
|
||||
raise Exception("config can't be none")
|
||||
|
||||
ctx = cls(embedding, config, **kwargs)
|
||||
ctx.add_texts(texts=texts, metadatas=metadatas)
|
||||
return ctx
|
||||
|
||||
@classmethod
|
||||
def from_documents(
|
||||
cls,
|
||||
documents: List[Document],
|
||||
embedding: Embeddings,
|
||||
ids: Optional[List[str]] = None,
|
||||
config: Optional[AlibabaCloudOpenSearchSettings] = None,
|
||||
**kwargs: Any,
|
||||
) -> "AlibabaCloudOpenSearch":
|
||||
if config is None:
|
||||
raise Exception("config can't be none")
|
||||
|
||||
texts = [d.page_content for d in documents]
|
||||
metadatas = [d.metadata for d in documents]
|
||||
return cls.from_texts(
|
||||
texts=texts,
|
||||
embedding=embedding,
|
||||
metadatas=metadatas,
|
||||
config=config,
|
||||
**kwargs,
|
||||
)
|
@ -0,0 +1,128 @@
|
||||
from typing import List
|
||||
|
||||
from langchain.schema import Document
|
||||
from langchain.vectorstores.alibabacloud_opensearch import (
|
||||
AlibabaCloudOpenSearch,
|
||||
AlibabaCloudOpenSearchSettings,
|
||||
)
|
||||
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
|
||||
|
||||
OS_TOKEN_COUNT = 1536
|
||||
|
||||
texts = ["foo", "bar", "baz"]
|
||||
|
||||
|
||||
class FakeEmbeddingsWithOsDimension(FakeEmbeddings):
|
||||
"""Fake embeddings functionality for testing."""
|
||||
|
||||
def embed_documents(self, embedding_texts: List[str]) -> List[List[float]]:
|
||||
"""Return simple embeddings."""
|
||||
return [
|
||||
[float(1.0)] * (OS_TOKEN_COUNT - 1) + [float(i)]
|
||||
for i in range(len(embedding_texts))
|
||||
]
|
||||
|
||||
def embed_query(self, text: str) -> List[float]:
|
||||
"""Return simple embeddings."""
|
||||
return [float(1.0)] * (OS_TOKEN_COUNT - 1) + [float(texts.index(text))]
|
||||
|
||||
|
||||
settings = AlibabaCloudOpenSearchSettings(
|
||||
endpoint="The endpoint of opensearch instance, "
|
||||
"You can find it from the console of Alibaba Cloud OpenSearch.",
|
||||
instance_id="The identify of opensearch instance, "
|
||||
"You can find it from the console of Alibaba Cloud OpenSearch.",
|
||||
datasource_name="The name of the data source specified when creating it.",
|
||||
username="The username specified when purchasing the instance.",
|
||||
password="The password specified when purchasing the instance.",
|
||||
embedding_index_name="The name of the vector attribute "
|
||||
"specified when configuring the instance attributes.",
|
||||
field_name_mapping={
|
||||
        # Insert data into OpenSearch based on the mapped name of each field.
        "id": "The id field name mapping of the index document.",
        "document": "The text field name mapping of the index document.",
        "embedding": "The embedding field name mapping of the index document; "
        "the values must be of the float16 multi-value type "
        "and separated by commas.",
        "metadata_x": "The metadata field name mapping of the index document. "
        "Multiple metadata fields can be specified. The mapping value contains "
        "the mapped field name and the operator, which is "
        "used when executing a metadata filter query.",
|
||||
},
|
||||
)
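# Note: the values above are placeholders; point them at a real OpenSearch
# Vector Search Edition instance before running these integration tests.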
|
||||
|
||||
embeddings = FakeEmbeddingsWithOsDimension()
|
||||
|
||||
|
||||
def test_create_alibabacloud_opensearch() -> None:
|
||||
opensearch = create_alibabacloud_opensearch()
|
||||
output = opensearch.similarity_search("foo", k=10)
|
||||
assert len(output) == 3
|
||||
|
||||
|
||||
def test_alibabacloud_opensearch_with_text_query() -> None:
|
||||
opensearch = create_alibabacloud_opensearch()
|
||||
output = opensearch.similarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo", metadata={"metadata": "0"})]
|
||||
|
||||
output = opensearch.similarity_search("bar", k=1)
|
||||
assert output == [Document(page_content="bar", metadata={"metadata": "1"})]
|
||||
|
||||
output = opensearch.similarity_search("baz", k=1)
|
||||
assert output == [Document(page_content="baz", metadata={"metadata": "2"})]
|
||||
|
||||
|
||||
def test_alibabacloud_opensearch_with_vector_query() -> None:
|
||||
opensearch = create_alibabacloud_opensearch()
|
||||
output = opensearch.similarity_search_by_vector(embeddings.embed_query("foo"), k=1)
|
||||
assert output == [Document(page_content="foo", metadata={"metadata": "0"})]
|
||||
|
||||
output = opensearch.similarity_search_by_vector(embeddings.embed_query("bar"), k=1)
|
||||
assert output == [Document(page_content="bar", metadata={"metadata": "1"})]
|
||||
|
||||
output = opensearch.similarity_search_by_vector(embeddings.embed_query("baz"), k=1)
|
||||
assert output == [Document(page_content="baz", metadata={"metadata": "2"})]
|
||||
|
||||
|
||||
def test_alibabacloud_opensearch_with_text_and_meta_query() -> None:
|
||||
opensearch = create_alibabacloud_opensearch()
|
||||
output = opensearch.similarity_search(
|
||||
query="foo", search_filter={"metadata": "0"}, k=1
|
||||
)
|
||||
assert output == [Document(page_content="foo", metadata={"metadata": "0"})]
|
||||
|
||||
output = opensearch.similarity_search(
|
||||
query="bar", search_filter={"metadata": "1"}, k=1
|
||||
)
|
||||
assert output == [Document(page_content="bar", metadata={"metadata": "1"})]
|
||||
|
||||
output = opensearch.similarity_search(
|
||||
query="baz", search_filter={"metadata": "2"}, k=1
|
||||
)
|
||||
assert output == [Document(page_content="baz", metadata={"metadata": "2"})]
|
||||
|
||||
output = opensearch.similarity_search(
|
||||
query="baz", search_filter={"metadata": "3"}, k=1
|
||||
)
|
||||
assert len(output) == 0
|
||||
|
||||
|
||||
def test_alibabacloud_opensearch_with_text_and_meta_score_query() -> None:
|
||||
opensearch = create_alibabacloud_opensearch()
|
||||
output = opensearch.similarity_search_with_relevance_scores(
|
||||
query="foo", search_filter={"metadata": "0"}, k=1
|
||||
)
|
||||
assert output == [
|
||||
(Document(page_content="foo", metadata={"metadata": "0"}), 10000.0)
|
||||
]
|
||||
|
||||
|
||||
def create_alibabacloud_opensearch() -> AlibabaCloudOpenSearch:
|
||||
metadatas = [{"metadata": str(i)} for i in range(len(texts))]
|
||||
|
||||
return AlibabaCloudOpenSearch.from_texts(
|
||||
texts=texts,
|
||||
embedding=FakeEmbeddingsWithOsDimension(),
|
||||
metadatas=metadatas,
|
||||
config=settings,
|
||||
)
|