Add SemaDB VST wrapper (#11484)

- **Description**: Adding vectorstore wrapper for
[SemaDB](https://rapidapi.com/semafind-semadb/api/semadb).
- **Issue**: None
- **Dependencies**: None
- **Twitter handle**: semafind

Checks performed:
- [x] `make format`
- [x] `make lint`
- [x] `make test`
- [x] `make spell_check`
- [x] `make docs_build`

Documentation added:

- SemaDB vectorstore wrapper tutorial
pull/11707/head
nuric 12 months ago committed by GitHub
parent 0b743f005b
commit 44da27c07b

@ -0,0 +1,19 @@
# SemaDB

>[SemaDB](https://semafind.com/) is a no-fuss vector similarity search engine. It provides a low-cost, cloud-hosted version to help you build AI applications with ease.

With SemaDB Cloud, our hosted version, no fuss means no pod size calculations, no schema definitions, no partition settings, no parameter tuning, no search algorithm tuning, no complex installation and no complex API. It is integrated with [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb), providing transparent billing, automatic sharding and an interactive API playground.

## Installation

None required; get started directly with SemaDB Cloud at [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb).

## Vector Store

There is a basic wrapper around `SemaDB` collections that allows you to use them as a vector store.

```python
from langchain.vectorstores import SemaDB
```
You can follow a tutorial on how to use the wrapper in [this notebook](/docs/integrations/vectorstores/semadb.html).
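
As a rough illustration of what the wrapper stores, the sketch below builds point payloads in the shape used by the `add_texts` implementation in this PR: a generated UUID, the vector, and the metadata with the original text added under a `text` key. The helper name `build_points` is hypothetical and is not part of the wrapper; no request is sent.

```python
from typing import Any, Dict, List, Optional
from uuid import uuid4


def build_points(
    texts: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Build one point per text (hypothetical helper, for illustration only)."""
    if metadatas is None:
        # No metadata supplied: each point only carries its text.
        metadatas = [{} for _ in texts]
    return [
        {
            "id": str(uuid4()),
            "vector": vector,
            # The text is merged into the metadata so it can be read back later.
            "metadata": {**metadata, "text": text},
        }
        for text, vector, metadata in zip(texts, embeddings, metadatas)
    ]


points = build_points(["hello"], [[0.1, 0.2]], [{"source": "demo.txt"}])
```

Because the text travels with every point, large chunks increase the stored payload, which is why storing references to documents can be preferable for large collections.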

@ -0,0 +1,299 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe1cf4b8-4fee-49d9-aad5-18adabaca692",
"metadata": {},
"source": [
"# SemaDB\n",
"\n",
"> SemaDB is a no-fuss vector similarity database for building AI applications. The hosted SemaDB Cloud offers a no-fuss developer experience to get started.\n",
"\n",
"The full documentation of the API along with examples and an interactive playground is available on [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb).\n",
"\n",
"This notebook demonstrates how the `langchain` wrapper can be used with SemaDB Cloud."
]
},
{
"cell_type": "markdown",
"id": "aa8c1970-52f0-4834-8f06-3ca8f7fac857",
"metadata": {},
"source": [
"## Load document embeddings\n",
"\n",
"To run things locally, we use [Sentence Transformers](https://www.sbert.net/), which are commonly used for embedding sentences. You can use any embedding model LangChain offers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "386a6b49-edee-45f2-9c0e-ebc125507ece",
"metadata": {},
"outputs": [],
"source": [
"!pip install sentence_transformers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5bd07a44-34fd-4318-8033-4c8dbd327559",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"\n",
"embeddings = HuggingFaceEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b0079bdf-b3cd-4856-85d5-f7787f5d93d5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"114\n"
]
}
],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.document_loaders import TextLoader\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"print(len(docs))"
]
},
{
"cell_type": "markdown",
"id": "92ed5523-330d-4697-9008-c910044ac45a",
"metadata": {},
"source": [
"## Connect to SemaDB\n",
"\n",
"SemaDB Cloud uses [RapidAPI keys](https://rapidapi.com/semafind-semadb/api/semadb) to authenticate. You can obtain yours by creating a free RapidAPI account."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c4ffeeef-e6f5-4bcc-8c97-0e4222ca8282",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"SemaDB API Key: ········\n"
]
}
],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ['SEMADB_API_KEY'] = getpass.getpass(\"SemaDB API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ba5f7a81-0f59-448a-93a8-5d8bf3bfc0f9",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import SemaDB\n",
"from langchain.vectorstores.utils import DistanceStrategy"
]
},
{
"cell_type": "markdown",
"id": "320f743c-39ae-456c-8c20-0683196358a4",
"metadata": {},
"source": [
"The parameters to the SemaDB vector store reflect the API directly:\n",
"\n",
"- \"mycollection\": the name of the collection in which we will store these vectors.\n",
"- 768: the dimension of the vectors. In our case, the sentence transformer embeddings yield 768-dimensional vectors.\n",
"- API_KEY: your RapidAPI key. Here it is read from the `SEMADB_API_KEY` environment variable set above.\n",
"- embeddings: the embedding model used to generate the embeddings of documents, texts and queries.\n",
"- DistanceStrategy: the distance metric used. The wrapper automatically normalises vectors if COSINE is used."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c1cb1f78-c25e-41a7-8001-6c84d51514ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = SemaDB(\"mycollection\", 768, embeddings, DistanceStrategy.COSINE)\n",
"\n",
"# Create collection if running for the first time. If the collection\n",
"# already exists this will fail.\n",
"db.create_collection()"
]
},
{
"cell_type": "markdown",
"id": "44348469-1d1f-4f3e-9af3-a955aec3dd71",
"metadata": {},
"source": [
"The SemaDB vector store wrapper stores the document text as point metadata so that it can be retrieved later. Storing large chunks of text is *not recommended*. If you are indexing a large collection, we instead recommend storing references to the documents, such as external IDs."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9adca5d3-e534-4fd2-aace-f436de4630ed",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['813c7ef3-9797-466b-8afa-587115592c6c',\n",
" 'fc392f7f-082b-4932-bfcc-06800db5e017']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.add_documents(docs)[:2]"
]
},
{
"cell_type": "markdown",
"id": "fb177b0d-148b-4cbc-86cc-b62dff135a9d",
"metadata": {},
"source": [
"## Similarity Search\n",
"\n",
"We use the default LangChain similarity search interface to search for the most similar sentences."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7536aba2-a757-4a3f-beda-79cfee5c34cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a51e940e-487e-484d-9dc4-1aa1a6371660",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'text': 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.'}),\n",
" 0.42369342)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = db.similarity_search_with_score(query)\n",
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "79aec3f4-d4d8-4c51-b4b2-074b6c22c3c0",
"metadata": {},
"source": [
"## Clean up\n",
"\n",
"You can delete the collection to remove all data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b00afad5-8ec1-4c19-be6b-1c2ae2d5fead",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.delete_collection()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "239a0bca-5c88-401f-9828-1cb0b652e7d0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
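
The notebook notes that the wrapper normalises vectors when `COSINE` is used. A minimal sketch of why that helps, assuming plain Python lists rather than the NumPy arrays the wrapper uses: once two vectors have unit length, their dot product equals their cosine similarity. The function names are illustrative, and the zero-vector check is an addition that the wrapper itself does not perform.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(vector: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        # Assumption for this sketch: reject vectors that cannot be scaled.
        raise ValueError("Cannot normalise a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
direct = cosine_similarity(a, b)
via_unit_vectors = dot(normalise(a), normalise(b))
```

Both values agree up to floating-point rounding, so normalising on insert and on query gives cosine behaviour from a dot-product style comparison.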

@ -295,6 +295,12 @@ def _import_scann() -> Any:
    return ScaNN


def _import_semadb() -> Any:
    from langchain.vectorstores.semadb import SemaDB

    return SemaDB


def _import_singlestoredb() -> Any:
    from langchain.vectorstores.singlestoredb import SingleStoreDB

@ -486,6 +492,8 @@ def __getattr__(name: str) -> Any:
        return _import_rocksetdb()
    elif name == "ScaNN":
        return _import_scann()
    elif name == "SemaDB":
        return _import_semadb()
    elif name == "SingleStoreDB":
        return _import_singlestoredb()
    elif name == "SKLearnVectorStore":
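
The hunks above register `SemaDB` in a module-level `__getattr__`, so the wrapper is imported only when the name is first requested. A small standalone sketch of that lazy-import pattern follows. It assumes a throwaway module object and standard-library targets (`fractions.Fraction`, `collections.OrderedDict`) instead of LangChain classes, and `_LAZY_TARGETS`, `_lazy_getattr` and `demo` are hypothetical names.

```python
import importlib
import types
from typing import Any

# Attribute name -> module that actually defines it (illustrative targets).
_LAZY_TARGETS = {"Fraction": "fractions", "OrderedDict": "collections"}


def _lazy_getattr(name: str) -> Any:
    """Import the defining module only when the attribute is requested."""
    if name in _LAZY_TARGETS:
        module = importlib.import_module(_LAZY_TARGETS[name])
        return getattr(module, name)
    raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")


# A module-level __getattr__ is consulted only after normal lookup fails.
demo = types.ModuleType("lazy_demo")
demo.__getattr__ = _lazy_getattr

fraction_cls = demo.Fraction  # resolved on first access
```

A dictionary lookup is used here for brevity; the `if`/`elif` chain in the diff achieves the same dispatch.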
@ -576,6 +584,7 @@ __all__ = [
    "Rockset",
    "SKLearnVectorStore",
    "ScaNN",
    "SemaDB",
    "SingleStoreDB",
    "SingleStoreDB",
    "SQLiteVSS",

@ -0,0 +1,272 @@
from typing import Any, Iterable, List, Optional, Tuple
from uuid import uuid4

import numpy as np
import requests

from langchain.schema.document import Document
from langchain.schema.embeddings import Embeddings
from langchain.utils import get_from_env
from langchain.vectorstores import VectorStore
from langchain.vectorstores.utils import DistanceStrategy


class SemaDB(VectorStore):
    """`SemaDB` vector store.

    This vector store is a wrapper around the SemaDB database.

    Example:
        .. code-block:: python

            from langchain.vectorstores import SemaDB

            db = SemaDB('mycollection', 768, embeddings, DistanceStrategy.COSINE)
    """

    HOST = "semadb.p.rapidapi.com"
    BASE_URL = "https://" + HOST

    def __init__(
        self,
        collection_name: str,
        vector_size: int,
        embedding: Embeddings,
        distance_strategy: DistanceStrategy = DistanceStrategy.EUCLIDEAN_DISTANCE,
        api_key: str = "",
    ):
        """Initialise the SemaDB vector store."""
        self.collection_name = collection_name
        self.vector_size = vector_size
        self.api_key = api_key or get_from_env("api_key", "SEMADB_API_KEY")
        self._embedding = embedding
        self.distance_strategy = distance_strategy

    @property
    def headers(self) -> dict:
        """Return the common headers."""
        return {
            "content-type": "application/json",
            "X-RapidAPI-Key": self.api_key,
            "X-RapidAPI-Host": SemaDB.HOST,
        }

    def _get_internal_distance_strategy(self) -> str:
        """Return the internal distance strategy."""
        if self.distance_strategy == DistanceStrategy.EUCLIDEAN_DISTANCE:
            return "euclidean"
        elif self.distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
            raise ValueError("Max inner product is not supported by SemaDB")
        elif self.distance_strategy == DistanceStrategy.DOT_PRODUCT:
            return "dot"
        elif self.distance_strategy == DistanceStrategy.JACCARD:
            raise ValueError("Jaccard is not supported by SemaDB")
        elif self.distance_strategy == DistanceStrategy.COSINE:
            return "cosine"
        else:
            raise ValueError(f"Unknown distance strategy {self.distance_strategy}")

    def create_collection(self) -> bool:
        """Creates the corresponding collection in SemaDB."""
        payload = {
            "id": self.collection_name,
            "vectorSize": self.vector_size,
            "distanceMetric": self._get_internal_distance_strategy(),
        }
        response = requests.post(
            SemaDB.BASE_URL + "/collections",
            json=payload,
            headers=self.headers,
        )
        return response.status_code == 200

    def delete_collection(self) -> bool:
        """Deletes the corresponding collection in SemaDB."""
        response = requests.delete(
            SemaDB.BASE_URL + f"/collections/{self.collection_name}",
            headers=self.headers,
        )
        return response.status_code == 200

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        batch_size: int = 1000,
        **kwargs: Any,
    ) -> List[str]:
        """Add texts to the vector store."""
        if not isinstance(texts, list):
            texts = list(texts)
        embeddings = self._embedding.embed_documents(texts)
        # Check dimensions
        if len(embeddings[0]) != self.vector_size:
            raise ValueError(
                f"Embedding size mismatch {len(embeddings[0])} != {self.vector_size}"
            )
        # Normalise if needed
        if self.distance_strategy == DistanceStrategy.COSINE:
            embed_matrix = np.array(embeddings)
            embed_matrix = embed_matrix / np.linalg.norm(
                embed_matrix, axis=1, keepdims=True
            )
            embeddings = embed_matrix.tolist()
        # Create points
        ids: List[str] = []
        points = []
        if metadatas is not None:
            for text, embedding, metadata in zip(texts, embeddings, metadatas):
                new_id = str(uuid4())
                ids.append(new_id)
                points.append(
                    {
                        "id": new_id,
                        "vector": embedding,
                        "metadata": {**metadata, **{"text": text}},
                    }
                )
        else:
            for text, embedding in zip(texts, embeddings):
                new_id = str(uuid4())
                ids.append(new_id)
                points.append(
                    {
                        "id": new_id,
                        "vector": embedding,
                        "metadata": {"text": text},
                    }
                )
        # Insert points in batches
        for i in range(0, len(points), batch_size):
            batch = points[i : i + batch_size]
            response = requests.post(
                SemaDB.BASE_URL + f"/collections/{self.collection_name}/points",
                json={"points": batch},
                headers=self.headers,
            )
            if response.status_code != 200:
                raise ValueError(f"Error adding points: {response.text}")
            failed_ranges = response.json()["failedRanges"]
            if len(failed_ranges) > 0:
                raise ValueError(f"Error adding points: {failed_ranges}")
        # Return ids
        return ids

    @property
    def embeddings(self) -> Embeddings:
        """Return the embeddings."""
        return self._embedding

    def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
        """Delete by vector ID or other criteria.

        Args:
            ids: List of ids to delete.
            **kwargs: Other keyword arguments that subclasses might use.

        Returns:
            Optional[bool]: True if deletion is successful,
            False otherwise, None if not implemented.
        """
        payload = {
            "ids": ids,
        }
        response = requests.delete(
            SemaDB.BASE_URL + f"/collections/{self.collection_name}/points",
            json=payload,
            headers=self.headers,
        )
        return response.status_code == 200 and len(response.json()["failedPoints"]) == 0

    def _search_points(self, embedding: List[float], k: int = 4) -> List[dict]:
        """Search points."""
        # Normalise if needed
        if self.distance_strategy == DistanceStrategy.COSINE:
            vec = np.array(embedding)
            vec = vec / np.linalg.norm(vec)
            embedding = vec.tolist()
        # Perform search request
        payload = {
            "vector": embedding,
            "limit": k,
        }
        response = requests.post(
            SemaDB.BASE_URL + f"/collections/{self.collection_name}/points/search",
            json=payload,
            headers=self.headers,
        )
        if response.status_code != 200:
            raise ValueError(f"Error searching: {response.text}")
        return response.json()["points"]

    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Return docs most similar to query."""
        query_embedding = self._embedding.embed_query(query)
        return self.similarity_search_by_vector(query_embedding, k=k)

    def similarity_search_with_score(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Tuple[Document, float]]:
        """Run similarity search with distance."""
        query_embedding = self._embedding.embed_query(query)
        points = self._search_points(query_embedding, k=k)
        return [
            (
                Document(page_content=p["metadata"]["text"], metadata=p["metadata"]),
                p["distance"],
            )
            for p in points
        ]

    def similarity_search_by_vector(
        self, embedding: List[float], k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Return docs most similar to embedding vector.

        Args:
            embedding: Embedding to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.

        Returns:
            List of Documents most similar to the query vector.
        """
        points = self._search_points(embedding, k=k)
        return [
            Document(page_content=p["metadata"]["text"], metadata=p["metadata"])
            for p in points
        ]

    @classmethod
    def from_texts(
        cls,
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        collection_name: str = "",
        vector_size: int = 0,
        api_key: str = "",
        distance_strategy: DistanceStrategy = DistanceStrategy.EUCLIDEAN_DISTANCE,
        **kwargs: Any,
    ) -> "SemaDB":
        """Return VectorStore initialized from texts and embeddings."""
        if not collection_name:
            raise ValueError("Collection name must be provided")
        if not vector_size:
            raise ValueError("Vector size must be provided")
        if not api_key:
            raise ValueError("API key must be provided")
        semadb = cls(
            collection_name,
            vector_size,
            embedding,
            distance_strategy=distance_strategy,
            api_key=api_key,
        )
        if not semadb.create_collection():
            raise ValueError("Error creating collection")
        semadb.add_texts(texts, metadatas=metadatas)
        return semadb
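
The batched insertion in `add_texts` slices the list of points into chunks of at most `batch_size` and sends one request per chunk. A minimal sketch of that slicing step on its own, with no network calls, is below. The generator name `iter_batches` is hypothetical, and the check for a non-positive batch size is an addition not present in the wrapper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        # Assumption for this sketch: fail early instead of letting range() raise.
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


batches = list(iter_batches([1, 2, 3, 4, 5], batch_size=2))
# batches == [[1, 2], [3, 4], [5]]
```

The last batch may be shorter than `batch_size`, and an empty input yields no batches, so no request would be sent in that case.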