community[minor]: added jaguar vector store (#14838)

Description: Adds a new vector store, Jaguar. The class, test
scripts, and documentation are added.
Issue: None -- This is the first PR contributing to LangChain
Dependencies: This depends on the jaguardb-http-client HTTP client
package ("pip install -U jaguardb-http-client")
Tag maintainer: @baskaryan, @eyurtsev, @hwchase1
Twitter handle: @workbot

---------

Co-authored-by: JY <jyjy@jaguardb>
Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
JaguarDB 2023-12-19 07:40:18 -08:00 committed by GitHub
parent a5be9f9475
commit 992b04e475
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 1158 additions and 0 deletions

@ -0,0 +1,62 @@
# Jaguar
This page describes how to use the Jaguar vector database within LangChain.
It contains three sections: introduction, installation and setup, and Jaguar API.
## Introduction
The Jaguar vector database has the following characteristics:
1. It is a distributed vector database
2. The “ZeroMove” feature of JaguarDB enables instant horizontal scalability
3. Multimodal: embeddings, text, images, videos, PDFs, audio, time series, and geospatial
4. All-masters: allows both parallel reads and writes
5. Anomaly detection capabilities
6. RAG support: combines LLM with proprietary and real-time data
7. Shared metadata: sharing of metadata across multiple vector indexes
8. Distance metrics: Euclidean, Cosine, InnerProduct, Manhattan, Chebyshev, Hamming, Jaccard, Minkowski
[Overview of Jaguar scalable vector database](http://www.jaguardb.com)
You can run JaguarDB in a Docker container, or download the software and run it on-cloud or off-cloud.
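For readers unfamiliar with the distance metrics listed above, the sketch below gives the textbook definitions of several of them in plain Python. It is only an illustration: the function names are made up for this page, and it does not claim to match how JaguarDB computes, normalizes, or scores distances internally.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute coordinate difference (L-infinity)."""
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Generalization of L1 (p=1) and L2 (p=2) distances."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

Hamming and Jaccard are omitted here because they are usually defined on binary vectors or sets, not on floating-point embeddings.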
## Installation and Setup
- Install JaguarDB on one host or multiple hosts
- Install the Jaguar HTTP Gateway server on one host
- Install the JaguarDB HTTP Client package
The steps are described in [Jaguar Documents](http://www.jaguardb.com/support.html).
Set the following environment variables in client programs:
export OPENAI_API_KEY="......"
export JAGUAR_API_KEY="......"
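A client program can check these variables before it tries to log in. The helper below is a hypothetical sketch (`missing_env_vars` and `REQUIRED_VARS` are not part of LangChain or the Jaguar client). Note that the notebook in this commit says the Jaguar API key may also be stored in `$HOME/.jagrc`, and `OPENAI_API_KEY` is presumably only needed when OpenAI embeddings are used, so a missing variable is a warning sign and not necessarily an error.

```python
import os
from typing import List, Mapping

# Variable names taken from the setup instructions above.
REQUIRED_VARS = ("OPENAI_API_KEY", "JAGUAR_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"warning: environment variable {name} is not set")
```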
## Jaguar API
LangChain provides a Jaguar client class, which can be imported in Python:
```python
from langchain_community.vectorstores.jaguar import Jaguar
```
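The notebook linked at the end of this page constructs the class with six positional settings followed by an embedding model. As a rough sketch, those settings can be grouped in one place before the call. `JaguarConfig` is a hypothetical helper invented for this example, and the endpoint and names are the sample values from the notebook, not defaults.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class JaguarConfig:
    """Hypothetical container for the positional arguments of Jaguar(...)."""

    pod: str               # a pod is a database for vectors
    store: str             # vector store name
    vector_index: str      # vector index name
    vector_type: str       # e.g. "cosine_fraction_float"
    vector_dimension: int  # must match the embedding model's output size
    url: str               # Jaguar HTTP gateway endpoint

    def as_args(self) -> Tuple[str, str, str, str, int, str]:
        """Return the values in the order the notebook passes them."""
        return (
            self.pod,
            self.store,
            self.vector_index,
            self.vector_type,
            self.vector_dimension,
            self.url,
        )


config = JaguarConfig(
    pod="vdb",
    store="langchain_rag_store",
    vector_index="v",
    vector_type="cosine_fraction_float",
    vector_dimension=1536,
    url="http://192.168.5.88:8080/fwww/",
)

# With the client package installed and a server running, the store would
# then be created as in the notebook (embeddings is an Embeddings object):
# vectorstore = Jaguar(*config.as_args(), embeddings)
```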
Supported API functions of the Jaguar class are:
- `add_texts`
- `add_documents`
- `from_texts`
- `from_documents`
- `similarity_search`
- `is_anomalous`
- `create`
- `delete`
- `clear`
- `drop`
- `login`
- `logout`
For more details on the Jaguar API, please refer to [this notebook](/docs/integrations/vectorstores/jaguar).
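The functions listed above are typically called in a fixed order: log in, create the store once, add data, query, then clean up. The sketch below shows that order against a stand-in object so that it runs without a JaguarDB server. `RecordingStore` and `run_lifecycle` are hypothetical names for this illustration; only the method names and sample arguments come from the notebook and tests in this commit.

```python
from typing import List, Optional, Protocol


class VectorStoreLike(Protocol):
    """The subset of the Jaguar API exercised by run_lifecycle."""

    def login(self) -> bool: ...
    def create(self, metadata_str: str, text_size: int) -> None: ...
    def add_texts(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[str]: ...
    def clear(self) -> None: ...
    def drop(self) -> None: ...
    def logout(self) -> None: ...


class RecordingStore:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def login(self) -> bool:
        self.calls.append("login")
        return True

    def create(self, metadata_str: str, text_size: int) -> None:
        self.calls.append("create")

    def add_texts(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[str]:
        self.calls.append("add_texts")
        return [str(i) for i in range(len(texts))]  # fake ids

    def clear(self) -> None:
        self.calls.append("clear")

    def drop(self) -> None:
        self.calls.append("drop")

    def logout(self) -> None:
        self.calls.append("logout")


def run_lifecycle(store: VectorStoreLike) -> List[str]:
    """Walk through the usual call order and return the inserted ids."""
    if not store.login():
        raise RuntimeError("login failed: check the Jaguar API key")
    store.create("author char(32), category char(16)", 1024)  # once per store
    ids = store.add_texts(
        ["foo", "bar"],
        [
            {"author": "Adam", "category": "Music"},
            {"author": "Eve", "category": "Music"},
        ],
    )
    store.clear()   # remove all data in the store
    store.drop()    # remove the store itself
    store.logout()  # release the session
    return ids


recorder = RecordingStore()
inserted_ids = run_lifecycle(recorder)
```

Passing a real `Jaguar` instance instead of `RecordingStore` should follow the same sequence, assuming a reachable server and a valid API key.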

@ -0,0 +1,246 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "671e9ec1-fa00-4c92-a2fb-ceb142168ea9",
"metadata": {},
"source": [
"# Jaguar Vector Database\n",
"\n",
"1. It is a distributed vector database\n",
"2. The “ZeroMove” feature of JaguarDB enables instant horizontal scalability\n",
"3. Multimodal: embeddings, text, images, videos, PDFs, audio, time series, and geospatial\n",
"4. All-masters: allows both parallel reads and writes\n",
"5. Anomaly detection capabilities\n",
"6. RAG support: combines LLM with proprietary and real-time data\n",
"7. Shared metadata: sharing of metadata across multiple vector indexes\n",
"8. Distance metrics: Euclidean, Cosine, InnerProduct, Manhattan, Chebyshev, Hamming, Jaccard, Minkowski"
]
},
{
"cell_type": "markdown",
"id": "1a87dc28-1344-4003-b31a-13e4cb71bf48",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"There are two requirements for running the examples in this file.\n",
"1. You must install and set up the JaguarDB server and its HTTP gateway server.\n",
" Please refer to the instructions in:\n",
" [www.jaguardb.com](http://www.jaguardb.com)\n",
"\n",
"2. You must install the http client package for JaguarDB:\n",
" ```\n",
" pip install -U jaguardb-http-client\n",
" ```\n"
]
},
{
"cell_type": "markdown",
"id": "c7d56993-4809-4e42-a409-94d3a7305ad8",
"metadata": {},
"source": [
"## RAG With LangChain\n",
"\n",
"This section demonstrates chatting with an LLM together with Jaguar in the LangChain software stack.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d62c2393-5c7c-4bb6-8367-c4389fa36a4e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain_community.vectorstores.jaguar import Jaguar\n",
"\n",
"\"\"\" \n",
"Load a text file into a set of documents \n",
"\"\"\"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"\"\"\"\n",
"Instantiate a Jaguar vector store\n",
"\"\"\"\n",
"### Jaguar HTTP endpoint\n",
"url = \"http://192.168.5.88:8080/fwww/\"\n",
"\n",
"### Use OpenAI embedding model\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"### Pod is a database for vectors\n",
"pod = \"vdb\"\n",
"\n",
"### Vector store name\n",
"store = \"langchain_rag_store\"\n",
"\n",
"### Vector index name\n",
"vector_index = \"v\"\n",
"\n",
"### Type of the vector index\n",
"# cosine: distance metric\n",
"# fraction: embedding vectors are decimal numbers\n",
"# float: values stored with floating-point numbers\n",
"vector_type = \"cosine_fraction_float\"\n",
"\n",
"### Dimension of each embedding vector\n",
"vector_dimension = 1536\n",
"\n",
"### Instantiate a Jaguar store object\n",
"vectorstore = Jaguar(\n",
" pod, store, vector_index, vector_type, vector_dimension, url, embeddings\n",
")\n",
"\n",
"\"\"\"\n",
"Login must be performed to authorize the client.\n",
"The environment variable JAGUAR_API_KEY or file $HOME/.jagrc\n",
"should contain the API key for accessing JaguarDB servers.\n",
"\"\"\"\n",
"vectorstore.login()\n",
"\n",
"\n",
"\"\"\"\n",
"Create vector store on the JaguarDB database server.\n",
"This should be done only once.\n",
"\"\"\"\n",
"# Extra metadata fields for the vector store\n",
"metadata = \"category char(16)\"\n",
"\n",
"# Number of characters for the text field of the store\n",
"text_size = 4096\n",
"\n",
"# Create a vector store on the server\n",
"vectorstore.create(metadata, text_size)\n",
"\n",
"\"\"\"\n",
"Add the texts from the text splitter to our vectorstore\n",
"\"\"\"\n",
"vectorstore.add_documents(docs)\n",
"\n",
"\"\"\" Get the retriever object \"\"\"\n",
"retriever = vectorstore.as_retriever()\n",
"# retriever = vectorstore.as_retriever(search_kwargs={\"where\": \"m1='123' and m2='abc'\"})\n",
"\n",
"\"\"\" The retriever object can be used with LangChain and LLM \"\"\""
]
},
{
"cell_type": "markdown",
"id": "11178867-d143-4a10-93bf-278f5f10dc1a",
"metadata": {},
"source": [
"## Interaction With Jaguar Vector Store\n",
"\n",
"Users can interact directly with the Jaguar vector store for similarity search and anomaly detection.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9a53cb5-e298-4125-9ace-0d851198869a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain_community.vectorstores.jaguar import Jaguar\n",
"\n",
"# Instantiate a Jaguar vector store object\n",
"url = \"http://192.168.3.88:8080/fwww/\"\n",
"pod = \"vdb\"\n",
"store = \"langchain_test_store\"\n",
"vector_index = \"v\"\n",
"vector_type = \"cosine_fraction_float\"\n",
"# Must match the embedding size; 1536 is the value used with OpenAIEmbeddings above\n",
"vector_dimension = 1536\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = Jaguar(\n",
" pod, store, vector_index, vector_type, vector_dimension, url, embeddings\n",
")\n",
"\n",
"# Login for authorization\n",
"vectorstore.login()\n",
"\n",
"# Create the vector store with two metadata fields\n",
"# This needs to be run only once.\n",
"metadata_str = \"author char(32), category char(16)\"\n",
"vectorstore.create(metadata_str, 1024)\n",
"\n",
"# Add a list of texts\n",
"texts = [\"foo\", \"bar\", \"baz\"]\n",
"metadatas = [\n",
" {\"author\": \"Adam\", \"category\": \"Music\"},\n",
" {\"author\": \"Eve\", \"category\": \"Music\"},\n",
" {\"author\": \"John\", \"category\": \"History\"},\n",
"]\n",
"ids = vectorstore.add_texts(texts=texts, metadatas=metadatas)\n",
"\n",
"# Search similar text\n",
"output = vectorstore.similarity_search(\n",
" query=\"foo\",\n",
" k=1,\n",
" metadatas=[\"author\", \"category\"],\n",
")\n",
"assert output[0].page_content == \"foo\"\n",
"assert output[0].metadata[\"author\"] == \"Adam\"\n",
"assert output[0].metadata[\"category\"] == \"Music\"\n",
"assert len(output) == 1\n",
"\n",
"# Search with filtering (where)\n",
"where = \"author='Eve'\"\n",
"output = vectorstore.similarity_search(\n",
" query=\"foo\",\n",
" k=3,\n",
" fetch_k=9,\n",
" where=where,\n",
" metadatas=[\"author\", \"category\"],\n",
")\n",
"assert output[0].page_content == \"bar\"\n",
"assert output[0].metadata[\"author\"] == \"Eve\"\n",
"assert output[0].metadata[\"category\"] == \"Music\"\n",
"assert len(output) == 1\n",
"\n",
"# Anomaly detection\n",
"result = vectorstore.is_anomalous(\n",
" query=\"dogs can jump high\",\n",
")\n",
"assert result is False\n",
"\n",
"# Remove all data in the store\n",
"vectorstore.clear()\n",
"assert vectorstore.count() == 0\n",
"\n",
"# Remove the store completely\n",
"vectorstore.drop()\n",
"\n",
"# Logout\n",
"vectorstore.logout()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,271 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "671e9ec1-fa00-4c92-a2fb-ceb142168ea9",
"metadata": {},
"source": [
"# Jaguar Vector Database\n",
"\n",
"1. It is a distributed vector database\n",
"2. The “ZeroMove” feature of JaguarDB enables instant horizontal scalability\n",
"3. Multimodal: embeddings, text, images, videos, PDFs, audio, time series, and geospatial\n",
"4. All-masters: allows both parallel reads and writes\n",
"5. Anomaly detection capabilities\n",
"6. RAG support: combines LLM with proprietary and real-time data\n",
"7. Shared metadata: sharing of metadata across multiple vector indexes\n",
"8. Distance metrics: Euclidean, Cosine, InnerProduct, Manhattan, Chebyshev, Hamming, Jaccard, Minkowski"
]
},
{
"cell_type": "markdown",
"id": "1a87dc28-1344-4003-b31a-13e4cb71bf48",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"There are two requirements for running the examples in this file.\n",
"1. You must install and set up the JaguarDB server and its HTTP gateway server.\n",
" Please refer to the instructions in:\n",
" [www.jaguardb.com](http://www.jaguardb.com)\n",
"\n",
"2. You must install the http client package for JaguarDB:\n",
" ```\n",
" pip install -U jaguardb-http-client\n",
" ```\n"
]
},
{
"cell_type": "markdown",
"id": "c7d56993-4809-4e42-a409-94d3a7305ad8",
"metadata": {},
"source": [
"## RAG With LangChain\n",
"\n",
"This section demonstrates chatting with an LLM together with Jaguar in the LangChain software stack.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d62c2393-5c7c-4bb6-8367-c4389fa36a4e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQAWithSourcesChain\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.document_loaders import TextLoader\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.llms import OpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain_community.vectorstores.jaguar import Jaguar\n",
"\n",
"\"\"\" \n",
"Load a text file into a set of documents \n",
"\"\"\"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"\"\"\"\n",
"Instantiate a Jaguar vector store\n",
"\"\"\"\n",
"### Jaguar HTTP endpoint\n",
"url = \"http://192.168.5.88:8080/fwww/\"\n",
"\n",
"### Use OpenAI embedding model\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"### Pod is a database for vectors\n",
"pod = \"vdb\"\n",
"\n",
"### Vector store name\n",
"store = \"langchain_rag_store\"\n",
"\n",
"### Vector index name\n",
"vector_index = \"v\"\n",
"\n",
"### Type of the vector index\n",
"# cosine: distance metric\n",
"# fraction: embedding vectors are decimal numbers\n",
"# float: values stored with floating-point numbers\n",
"vector_type = \"cosine_fraction_float\"\n",
"\n",
"### Dimension of each embedding vector\n",
"vector_dimension = 1536\n",
"\n",
"### Instantiate a Jaguar store object\n",
"vectorstore = Jaguar(\n",
" pod, store, vector_index, vector_type, vector_dimension, url, embeddings\n",
")\n",
"\n",
"\"\"\"\n",
"Login must be performed to authorize the client.\n",
"The environment variable JAGUAR_API_KEY or file $HOME/.jagrc\n",
"should contain the API key for accessing JaguarDB servers.\n",
"\"\"\"\n",
"vectorstore.login()\n",
"\n",
"\n",
"\"\"\"\n",
"Create vector store on the JaguarDB database server.\n",
"This should be done only once.\n",
"\"\"\"\n",
"# Extra metadata fields for the vector store\n",
"metadata = \"category char(16)\"\n",
"\n",
"# Number of characters for the text field of the store\n",
"text_size = 4096\n",
"\n",
"# Create a vector store on the server\n",
"vectorstore.create(metadata, text_size)\n",
"\n",
"\"\"\"\n",
"Add the texts from the text splitter to our vectorstore\n",
"\"\"\"\n",
"vectorstore.add_documents(docs)\n",
"\n",
"\"\"\" Get the retriever object \"\"\"\n",
"retriever = vectorstore.as_retriever()\n",
"# retriever = vectorstore.as_retriever(search_kwargs={\"where\": \"m1='123' and m2='abc'\"})\n",
"\n",
"template = \"\"\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n",
"Question: {question}\n",
"Context: {context}\n",
"Answer:\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"\"\"\" Obtain a Large Language Model \"\"\"\n",
"LLM = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
"\n",
"\"\"\" Create a chain for the RAG flow \"\"\"\n",
"rag_chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | LLM\n",
" | StrOutputParser()\n",
")\n",
"\n",
"resp = rag_chain.invoke(\"What did the president say about Justice Breyer?\")\n",
"print(resp)"
]
},
{
"cell_type": "markdown",
"id": "11178867-d143-4a10-93bf-278f5f10dc1a",
"metadata": {},
"source": [
"## Interaction With Jaguar Vector Store\n",
"\n",
"Users can interact directly with the Jaguar vector store for similarity search and anomaly detection.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9a53cb5-e298-4125-9ace-0d851198869a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain_community.vectorstores.jaguar import Jaguar\n",
"\n",
"# Instantiate a Jaguar vector store object\n",
"url = \"http://192.168.3.88:8080/fwww/\"\n",
"pod = \"vdb\"\n",
"store = \"langchain_test_store\"\n",
"vector_index = \"v\"\n",
"vector_type = \"cosine_fraction_float\"\n",
"# Must match the embedding size; 1536 is the value used with OpenAIEmbeddings above\n",
"vector_dimension = 1536\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = Jaguar(\n",
" pod, store, vector_index, vector_type, vector_dimension, url, embeddings\n",
")\n",
"\n",
"# Login for authorization\n",
"vectorstore.login()\n",
"\n",
"# Create the vector store with two metadata fields\n",
"# This needs to be run only once.\n",
"metadata_str = \"author char(32), category char(16)\"\n",
"vectorstore.create(metadata_str, 1024)\n",
"\n",
"# Add a list of texts\n",
"texts = [\"foo\", \"bar\", \"baz\"]\n",
"metadatas = [\n",
" {\"author\": \"Adam\", \"category\": \"Music\"},\n",
" {\"author\": \"Eve\", \"category\": \"Music\"},\n",
" {\"author\": \"John\", \"category\": \"History\"},\n",
"]\n",
"ids = vectorstore.add_texts(texts=texts, metadatas=metadatas)\n",
"\n",
"# Search similar text\n",
"output = vectorstore.similarity_search(\n",
" query=\"foo\",\n",
" k=1,\n",
" metadatas=[\"author\", \"category\"],\n",
")\n",
"assert output[0].page_content == \"foo\"\n",
"assert output[0].metadata[\"author\"] == \"Adam\"\n",
"assert output[0].metadata[\"category\"] == \"Music\"\n",
"assert len(output) == 1\n",
"\n",
"# Search with filtering (where)\n",
"where = \"author='Eve'\"\n",
"output = vectorstore.similarity_search(\n",
" query=\"foo\",\n",
" k=3,\n",
" fetch_k=9,\n",
" where=where,\n",
" metadatas=[\"author\", \"category\"],\n",
")\n",
"assert output[0].page_content == \"bar\"\n",
"assert output[0].metadata[\"author\"] == \"Eve\"\n",
"assert output[0].metadata[\"category\"] == \"Music\"\n",
"assert len(output) == 1\n",
"\n",
"# Anomaly detection\n",
"result = vectorstore.is_anomalous(\n",
" query=\"dogs can jump high\",\n",
")\n",
"assert result is False\n",
"\n",
"# Remove all data in the store\n",
"vectorstore.clear()\n",
"assert vectorstore.count() == 0\n",
"\n",
"# Remove the store completely\n",
"vectorstore.drop()\n",
"\n",
"# Logout\n",
"vectorstore.logout()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,441 @@
from __future__ import annotations
import json
import logging
from typing import TYPE_CHECKING, Any, List, Optional, Tuple
if TYPE_CHECKING:
from jaguardb_http_client.JaguarHttpClient import JaguarHttpClient
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import VectorStore
logger = logging.getLogger(__name__)
class Jaguar(VectorStore):
    """`Jaguar API` vector store.

    See http://www.jaguardb.com
    See http://github.com/fserv/jaguar-sdk

    Example:
        .. code-block:: python

            from langchain_community.vectorstores.jaguar import Jaguar

            vectorstore = Jaguar(
                pod = 'vdb',
                store = 'mystore',
                vector_index = 'v',
                vector_type = 'cosine_fraction_float',
                vector_dimension = 1536,
                url='http://192.168.8.88:8080/fwww/',
                embedding=openai_model
            )
    """
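    def __init__(
        self,
        pod: str,
        store: str,
        vector_index: str,
        vector_type: str,
        vector_dimension: int,
        url: str,
        embedding: Embeddings,
    ):
        self._pod = pod
        self._store = store
        self._vector_index = vector_index
        self._vector_type = vector_type
        self._vector_dimension = vector_dimension
        self._embedding = embedding
        # Import at runtime: the module-level import appears to be guarded by
        # TYPE_CHECKING, which would leave the name undefined when this runs.
        try:
            from jaguardb_http_client.JaguarHttpClient import JaguarHttpClient
        except ImportError:
            raise ImportError(
                "Could not import the jaguardb-http-client python package. "
                "Please install it with `pip install -U jaguardb-http-client`."
            )
        self._jag = JaguarHttpClient(url)
        self._token = ""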
    def login(
        self,
        jaguar_api_key: Optional[str] = "",
    ) -> bool:
        """
        Log in to the jaguardb server with a jaguar_api_key, or let
        self._jag find a key.

        Args:
            jaguar_api_key (str, optional): API key of the user for the
                jaguardb server. If empty, the key is obtained from
                self._jag.getApiKey().
        Returns:
            True if successful; False if not successful
        """
        if jaguar_api_key == "":
            jaguar_api_key = self._jag.getApiKey()
        self._jaguar_api_key = jaguar_api_key
        self._token = self._jag.login(jaguar_api_key)
        if self._token == "":
            logger.error("E0001 error init(): invalid jaguar_api_key")
            return False
        return True
    def create(
        self,
        metadata_str: str,
        text_size: int,
    ) -> None:
        """
        Create the vector store on the backend database.

        Args:
            metadata_str (str): columns and their types
            text_size (int): number of characters for the text column
        Returns:
            None
        """
        podstore = self._pod + "." + self._store

        # The source column is required.
        # The v:text column is required.
        q = "create store "
        q += podstore
        q += f" ({self._vector_index} vector({self._vector_dimension},"
        q += f" '{self._vector_type}'),"
        q += f" source char(256), v:text char({text_size}),"
        q += metadata_str + ")"
        self.run(q)
    def run(self, query: str, withFile: bool = False) -> dict:
        """
        Run any query statement in jaguardb.

        Args:
            query (str): query statement to jaguardb
            withFile (bool): passed through to the HTTP client's post() call
        Returns:
            The parsed JSON response, or an empty dict if the token is
            missing or the response is not valid JSON
        """
        if self._token == "":
            logger.error(f"E0005 error run({query})")
            return {}
        resp = self._jag.post(query, self._token, withFile)
        txt = resp.text
        try:
            js = json.loads(txt)
            return js
        except Exception:
            return {}
@property
def embeddings(self) -> Optional[Embeddings]:
return self._embedding
def add_texts(
self,
texts: List[str],
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> List[str]:
"""
Add texts through the embeddings and add to the vectorstore.
Args:
texts: list of text strings to add to the jaguar vector store.
metadatas: Optional list of metadatas associated with the texts.
[{"m1": "v11", "m2": "v12", "m3": "v13", "filecol": "path_file1.jpg" },
{"m1": "v21", "m2": "v22", "m3": "v23", "filecol": "path_file2.jpg" },
{"m1": "v31", "m2": "v32", "m3": "v33", "filecol": "path_file3.jpg" },
{"m1": "v41", "m2": "v42", "m3": "v43", "filecol": "path_file4.jpg" }]
kwargs: vector_index=name_of_vector_index
file_column=name_of_file_column
Returns:
List of ids from adding the texts into the vectorstore
"""
vcol = self._vector_index
filecol = kwargs.get("file_column", "")
podstorevcol = self._pod + "." + self._store + "." + vcol
q = "textcol " + podstorevcol
js = self.run(q)
if not js:  # run() returns {} on failure, never ""
return []
textcol = js["data"]
embeddings = self._embedding.embed_documents(list(texts))
ids = []
if metadatas is None:
### no meta and no files to upload
i = 0
for vec in embeddings:
str_vec = [str(x) for x in vec]
values_comma = ",".join(str_vec)
podstore = self._pod + "." + self._store
q = "insert into " + podstore + " ("
q += vcol + "," + textcol + ") values ('" + values_comma
q += "','" + texts[i] + "')"
js = self.run(q, False)
ids.append(js["zid"])
i += 1
else:
i = 0
for vec in embeddings:
str_vec = [str(x) for x in vec]
nvec, vvec, filepath = self._parseMeta(metadatas[i], filecol)
if filecol != "":
rc = self._jag.postFile(self._token, filepath, 1)
if not rc:
return []
names_comma = ",".join(nvec)
names_comma += "," + vcol
## col1,col2,col3,vecl
values_comma = "'" + "','".join(vvec) + "'"
### 'va1','val2','val3'
values_comma += ",'" + ",".join(str_vec) + "'"
### 'v1,v2,v3'
podstore = self._pod + "." + self._store
q = "insert into " + podstore + " ("
q += names_comma + "," + textcol + ") values (" + values_comma
q += ",'" + texts[i] + "')"
if filecol != "":
js = self.run(q, True)
else:
js = self.run(q, False)
ids.append(js["zid"])
i += 1
return ids
def similarity_search_with_score(
self,
query: str,
k: int = 3,
fetch_k: int = -1,
where: Optional[str] = None,
score_threshold: Optional[float] = -1.0,
metadatas: Optional[List[str]] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
"""
Return Jaguar documents most similar to query, along with scores.
Args:
query: Text to look up documents similar to.
k: Number of Documents to return. Defaults to 3.
fetch_k: value of the fetch_k option sent in the similarity query.
Defaults to -1.
where: the where clause in select similarity. For example a
where can be "rating > 3.0 and (state = 'NV' or state = 'CA')"
score_threshold: minimal score threshold for the result.
If defined, results with score less than this value will be
filtered out.
metadatas: names of the metadata columns to return with each document.
kwargs: not used by this method.
Returns:
List of Documents most similar to the query and score for each.
List of Tuples of (doc, similarity_score):
[ (doc, score), (doc, score), ...]
"""
vcol = self._vector_index
vtype = self._vector_type
embeddings = self._embedding.embed_query(query)
str_embeddings = [str(f) for f in embeddings]
qv_comma = ",".join(str_embeddings)
podstore = self._pod + "." + self._store
q = (
"select similarity("
+ vcol
+ ",'"
+ qv_comma
+ "','topk="
+ str(k)
+ ",fetch_k="
+ str(fetch_k)
+ ",type="
+ vtype
)
q += ",with_score=yes,with_text=yes,score_threshold=" + str(score_threshold)
if metadatas is not None:
meta = "&".join(metadatas)
q += ",metadata=" + meta
q += "') from " + podstore
if where is not None:
q += " where " + where
jarr = self.run(q)
if jarr is None:
return []
docs_with_score = []
for js in jarr:
score = js["score"]
text = js["text"]
zid = js["zid"]
### give metadatas
md = {}
md["zid"] = zid
if metadatas is not None:
for m in metadatas:
mv = js[m]
md[m] = mv
doc = Document(page_content=text, metadata=md)
tup = (doc, score)
docs_with_score.append(tup)
return docs_with_score
    def similarity_search(
        self,
        query: str,
        k: int = 3,
        where: Optional[str] = None,
        metadatas: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """
        Return Jaguar documents most similar to query.

        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 3.
            where: the where clause in select similarity. For example a
                where can be "rating > 3.0 and (state = 'NV' or state = 'CA')"
            metadatas: names of the metadata columns to return with each
                document.
            kwargs: forwarded to similarity_search_with_score
                (for example fetch_k).
        Returns:
            List of Documents most similar to the query
        """
        docs_and_scores = self.similarity_search_with_score(
            query, k=k, where=where, metadatas=metadatas, **kwargs
        )
        return [doc for doc, _ in docs_and_scores]
def is_anomalous(
self,
query: str,
**kwargs: Any,
) -> bool:
"""
Detect if given text is anomalous from the dataset
Args:
query: Text to detect if it is anomaly
Returns:
True or False
"""
vcol = self._vector_index
vtype = self._vector_type
embeddings = self._embedding.embed_query(query)
str_embeddings = [str(f) for f in embeddings]
qv_comma = ",".join(str_embeddings)
podstore = self._pod + "." + self._store
q = "select anomalous(" + vcol + ", '" + qv_comma + "', 'type=" + vtype + "')"
q += " from " + podstore
js = self.run(q)
if not js:  # covers an empty list and the empty dict run() returns on failure
return False
jd = json.loads(js[0])
if jd["anomalous"] == "YES":
return True
return False
@classmethod
def from_texts(
cls,
texts: List[str],
embedding: Embeddings,
url: str,
pod: str,
store: str,
vector_index: str,
vector_type: str,
vector_dimension: int,
metadatas: Optional[List[dict]] = None,
jaguar_api_key: Optional[str] = "",
**kwargs: Any,
) -> Jaguar:
jagstore = cls(
pod, store, vector_index, vector_type, vector_dimension, url, embedding
)
jagstore.login(jaguar_api_key)
jagstore.clear()
jagstore.add_texts(texts, metadatas, **kwargs)
return jagstore
def clear(self) -> None:
"""
Delete all records in jaguardb
Args: No args
Returns: None
"""
podstore = self._pod + "." + self._store
q = "truncate store " + podstore
self.run(q)
    def delete(self, zids: List[str], **kwargs: Any) -> None:
        """
        Delete records in jaguardb by a list of zero-ids.

        Args:
            zids (List[str]): a list of zid as string
        Returns:
            Do not return anything
        """
        podstore = self._pod + "." + self._store
        for zid in zids:
            q = "delete from " + podstore + " where zid='" + zid + "'"
            self.run(q)
def count(self) -> int:
"""
Count records of a store in jaguardb
Args: no args
Returns: (int) number of records in pod store
"""
podstore = self._pod + "." + self._store
q = "select count() from " + podstore
js = self.run(q)
if not js:  # covers an empty list and the empty dict run() returns on failure
return 0
jd = json.loads(js[0])
return int(jd["data"])
def drop(self) -> None:
"""
Drop or remove a store in jaguardb
Args: no args
Returns: None
"""
podstore = self._pod + "." + self._store
q = "drop store " + podstore
self.run(q)
def logout(self) -> None:
"""
Logout to cleanup resources
Args: no args
Returns: None
"""
self._jag.logout(self._token)
def prt(self, msg: str) -> None:
with open("/tmp/debugjaguar.log", "a") as file:
print(f"msg={msg}", file=file, flush=True)
def _parseMeta(self, nvmap: dict, filecol: str) -> Tuple[List[str], List[str], str]:
filepath = ""
if filecol == "":
nvec = list(nvmap.keys())
vvec = list(nvmap.values())
else:
nvec = []
vvec = []
if filecol in nvmap:
nvec.append(filecol)
vvec.append(nvmap[filecol])
filepath = nvmap[filecol]
for k, v in nvmap.items():
if k != filecol:
nvec.append(k)
vvec.append(v)
return nvec, vvec, filepath

@ -0,0 +1,138 @@
import json
from langchain_community.vectorstores.jaguar import Jaguar
from tests.integration_tests.vectorstores.fake_embeddings import (
ConsistentFakeEmbeddings,
)
#############################################################################################
##
## Requirement: fwww http server must be running at 127.0.0.1:8080 (or any end point)
## jaguardb server must be running accepting commands from the http server
##
## FakeEmbeddings is used to create text embeddings with dimension of 10.
##
#############################################################################################
class TestJaguar:
vectorstore: Jaguar
pod: str
store: str
@classmethod
def setup_class(cls) -> None:
url = "http://127.0.0.1:8080/fwww/"
cls.pod = "vdb"
cls.store = "langchain_test_store"
vector_index = "v"
vector_type = "cosine_fraction_float"
vector_dimension = 10
embeddings = ConsistentFakeEmbeddings()
cls.vectorstore = Jaguar(
cls.pod,
cls.store,
vector_index,
vector_type,
vector_dimension,
url,
embeddings,
)
@classmethod
def teardown_class(cls) -> None:
pass
def test_login(self) -> None:
"""
Requires environment variable JAGUAR_API_KEY
or $HOME/.jagrc storing the jaguar api key
"""
self.vectorstore.login()
def test_create(self) -> None:
"""
Create a vector store with vector index 'v' of dimension 10
and 'v:text' to hold text, plus metadata columns author and category
"""
metadata_str = "author char(32), category char(16)"
self.vectorstore.create(metadata_str, 1024)
podstore = self.pod + "." + self.store
js = self.vectorstore.run(f"desc {podstore}")
jd = json.loads(js[0])
assert podstore in jd["data"]
def test_add_texts(self) -> None:
"""
Add some texts
"""
texts = ["foo", "bar", "baz"]
metadatas = [
{"author": "Adam", "category": "Music"},
{"author": "Eve", "category": "Music"},
{"author": "John", "category": "History"},
]
ids = self.vectorstore.add_texts(texts=texts, metadatas=metadatas)
assert len(ids) == len(texts)
def test_search(self) -> None:
"""
Test that `foo` is closest to `foo`
Here k is 1
"""
output = self.vectorstore.similarity_search(
query="foo",
k=1,
metadatas=["author", "category"],
)
assert output[0].page_content == "foo"
assert output[0].metadata["author"] == "Adam"
assert output[0].metadata["category"] == "Music"
assert len(output) == 1
def test_search_filter(self) -> None:
"""
Test filter(where)
"""
where = "author='Eve'"
output = self.vectorstore.similarity_search(
query="foo",
k=3,
fetch_k=9,
where=where,
metadatas=["author", "category"],
)
assert output[0].page_content == "bar"
assert output[0].metadata["author"] == "Eve"
assert output[0].metadata["category"] == "Music"
assert len(output) == 1
def test_search_anomalous(self) -> None:
"""
Test detection of anomalousness
"""
result = self.vectorstore.is_anomalous(
query="dogs can jump high",
)
assert result is False
def test_clear(self) -> None:
"""
Test cleanup of data in the store
"""
self.vectorstore.clear()
assert self.vectorstore.count() == 0
def test_drop(self) -> None:
"""
Destroy the vector store
"""
self.vectorstore.drop()
def test_logout(self) -> None:
"""
Logout and free resources
"""
self.vectorstore.logout()