Community[minor]: Update VDMS vectorstore (#23729)

**Description:** 
- This PR exposes some functions in the VDMS vectorstore, updates the VDMS-related notebooks, updates tests, and upgrades the required VDMS version (>=0.0.20).

**Issue:** N/A

**Dependencies:** 
- Update vdms>=0.0.20
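
**Example (illustrative):** a minimal sketch of the updated interface, assuming a VDMS server is reachable on `localhost:55555`; the embedding model and collection name below are placeholders, not part of this PR.

```python
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_community.vectorstores import VDMS
from langchain_community.vectorstores.vdms import VDMS_Client

# Assumes a VDMS server is already running locally (e.g. the intellabs/vdms docker image)
client = VDMS_Client(host="localhost", port=55555)

# The constructor takes `embedding=` and, with vdms>=0.0.20, supports the FLINNG engine
db = VDMS(
    client=client,
    collection_name="my_collection_Flinng_IP",  # placeholder collection name
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    engine="Flinng",
    distance_strategy="IP",
)

db.add_texts(["foo", "bar", "baz"], ids=["id0", "id1", "id2"])
docs_with_score = db.similarity_search_with_score("foo", k=3)
```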
Chaunte W. Lacewell 2024-07-25 19:13:04 -07:00 committed by GitHub
parent 703491e824
commit 69eacaa887
8 changed files with 627 additions and 350 deletions


@ -36,6 +36,7 @@ Notebook | Description
[llm_symbolic_math.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/llm_symbolic_math.ipynb) | Solve algebraic equations with the help of llms (language learning models) and sympy, a python library for symbolic mathematics.
[meta_prompt.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/meta_prompt.ipynb) | Implement the meta-prompt concept, which is a method for building self-improving agents that reflect on their own performance and modify their instructions accordingly.
[multi_modal_output_agent.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/multi_modal_output_agent.ipynb) | Generate multi-modal outputs, specifically images and text.
[multi_modal_RAG_vdms.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/multi_modal_RAG_vdms.ipynb) | Perform retrieval-augmented generation (rag) on documents including text and images, using unstructured for parsing, Intel's Visual Data Management System (VDMS) as the vectorstore, and chains.
[multi_player_dnd.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/multi_player_dnd.ipynb) | Simulate multi-player dungeons & dragons games, with a custom function determining the speaking schedule of the agents.
[multiagent_authoritarian.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/multiagent_authoritarian.ipynb) | Implement a multi-agent simulation where a privileged agent controls the conversation, including deciding who speaks and when the conversation ends, in the context of a simulated news network.
[multiagent_bidding.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/multiagent_bidding.ipynb) | Implement a multi-agent simulation where agents bid to speak, with the highest bidder speaking next, demonstrated through a fictitious presidential debate example.


@ -18,26 +18,7 @@
"* Use of multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text\n",
"* Use of [VDMS](https://github.com/IntelLabs/vdms/blob/master/README.md) as a vector store with support for multi-modal\n",
"* Retrieval of both images and text using similarity search\n",
"* Passing raw images and text chunks to a multimodal LLM for answer synthesis \n",
"\n",
"\n",
"## Packages\n",
"\n",
"For `unstructured`, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) in your system."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "febbc459-ebba-4c1a-a52b-fed7731593f8",
"metadata": {},
"outputs": [],
"source": [
"# (newest versions required for multi-modal)\n",
"! pip install --quiet -U vdms langchain-experimental\n",
"\n",
"# lock to 0.10.19 due to a persistent bug in more recent versions\n",
"! pip install --quiet pdf2image \"unstructured[all-docs]==0.10.19\" pillow pydantic lxml open_clip_torch"
"* Passing raw images and text chunks to a multimodal LLM for answer synthesis "
]
},
{
@ -53,7 +34,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"id": "5f483872",
"metadata": {},
"outputs": [
@ -61,8 +42,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"docker: Error response from daemon: Conflict. The container name \"/vdms_rag_nb\" is already in use by container \"0c19ed281463ac10d7efe07eb815643e3e534ddf24844357039453ad2b0c27e8\". You have to remove (or rename) that container to be able to reuse that name.\n",
"See 'docker run --help'.\n"
"a1b9206b08ef626e15b356bf9e031171f7c7eb8f956a2733f196f0109246fe2b\n"
]
}
],
@ -75,9 +55,32 @@
"vdms_client = VDMS_Client(port=55559)"
]
},
{
"cell_type": "markdown",
"id": "2498a0a1",
"metadata": {},
"source": [
"## Packages\n",
"\n",
"For `unstructured`, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) in your system."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "febbc459-ebba-4c1a-a52b-fed7731593f8",
"metadata": {},
"outputs": [],
"source": [
"! pip install --quiet -U vdms langchain-experimental\n",
"\n",
"# lock to 0.10.19 due to a persistent bug in more recent versions\n",
"! pip install --quiet pdf2image \"unstructured[all-docs]==0.10.19\" pillow pydantic lxml open_clip_torch"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "78ac6543",
"metadata": {},
"outputs": [],
@ -95,14 +98,9 @@
"\n",
"### Partition PDF text and images\n",
" \n",
"Let's look at an example pdf containing interesting images.\n",
"Let's use famous photographs from the PDF version of Library of Congress Magazine in this example.\n",
"\n",
"Famous photographs from library of congress:\n",
"\n",
"* https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf\n",
"* We'll use this as an example below\n",
"\n",
"We can use `partition_pdf` below from [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#key-concepts) to extract text and images."
"We can use `partition_pdf` from [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#key-concepts) to extract text and images."
]
},
{
@ -116,8 +114,8 @@
"\n",
"import requests\n",
"\n",
"# Folder with pdf and extracted images\n",
"datapath = Path(\"./multimodal_files\").resolve()\n",
"# Folder to store pdf and extracted images\n",
"datapath = Path(\"./data/multimodal_files\").resolve()\n",
"datapath.mkdir(parents=True, exist_ok=True)\n",
"\n",
"pdf_url = \"https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf\"\n",
@ -174,14 +172,8 @@
"source": [
"## Multi-modal embeddings with our document\n",
"\n",
"We will use [OpenClip multimodal embeddings](https://python.langchain.com/docs/integrations/text_embedding/open_clip).\n",
"\n",
"We use a larger model for better performance (set in `langchain_experimental.open_clip.py`).\n",
"\n",
"```\n",
"model_name = \"ViT-g-14\"\n",
"checkpoint = \"laion2b_s34b_b88k\"\n",
"```"
"In this section, we initialize the VDMS vector store for both text and images. For better performance, we use model `ViT-g-14` from [OpenClip multimodal embeddings](https://python.langchain.com/docs/integrations/text_embedding/open_clip).\n",
"The images are stored as base64 encoded strings with `vectorstore.add_images`.\n"
]
},
{
@ -200,9 +192,7 @@
"vectorstore = VDMS(\n",
" client=vdms_client,\n",
" collection_name=\"mm_rag_clip_photos\",\n",
" embedding_function=OpenCLIPEmbeddings(\n",
" model_name=\"ViT-g-14\", checkpoint=\"laion2b_s34b_b88k\"\n",
" ),\n",
" embedding=OpenCLIPEmbeddings(model_name=\"ViT-g-14\", checkpoint=\"laion2b_s34b_b88k\"),\n",
")\n",
"\n",
"# Get image URIs with .jpg extension only\n",
@ -233,7 +223,7 @@
"source": [
"## RAG\n",
"\n",
"`vectorstore.add_images` will store / retrieve images as base64 encoded strings."
"Here we define helper functions for image results."
]
},
{
@ -392,7 +382,8 @@
"id": "1566096d-97c2-4ddc-ba4a-6ef88c525e4e",
"metadata": {},
"source": [
"## Test retrieval and run RAG"
"## Test retrieval and run RAG\n",
"Now let's query for a `woman with children` and retrieve the top results."
]
},
{
@ -452,6 +443,14 @@
" print(doc.page_content)"
]
},
{
"cell_type": "markdown",
"id": "15e9b54d",
"metadata": {},
"source": [
"Now let's use the `multi_modal_rag_chain` to process the same query and display the response."
]
},
{
"cell_type": "code",
"execution_count": 11,
@ -462,10 +461,10 @@
"name": "stdout",
"output_type": "stream",
"text": [
"1. Detailed description of the visual elements in the image: The image features a woman with children, likely a mother and her family, standing together outside. They appear to be poor or struggling financially, as indicated by their attire and surroundings.\n",
"2. Historical and cultural context of the image: The photo was taken in 1936 during the Great Depression, when many families struggled to make ends meet. Dorothea Lange, a renowned American photographer, took this iconic photograph that became an emblem of poverty and hardship experienced by many Americans at that time.\n",
"3. Interpretation of the image's symbolism and meaning: The image conveys a sense of unity and resilience despite adversity. The woman and her children are standing together, displaying their strength as a family unit in the face of economic challenges. The photograph also serves as a reminder of the importance of empathy and support for those who are struggling.\n",
"4. Connections between the image and the related text: The text provided offers additional context about the woman in the photo, her background, and her feelings towards the photograph. It highlights the historical backdrop of the Great Depression and emphasizes the significance of this particular image as a representation of that time period.\n"
" The image depicts a woman with several children. The woman appears to be of Cherokee heritage, as suggested by the text provided. The image is described as having been initially regretted by the subject, Florence Owens Thompson, due to her feeling that it did not accurately represent her leadership qualities.\n",
"The historical and cultural context of the image is tied to the Great Depression and the Dust Bowl, both of which affected the Cherokee people in Oklahoma. The photograph was taken during this period, and its subject, Florence Owens Thompson, was a leader within her community who worked tirelessly to help those affected by these crises.\n",
"The image's symbolism and meaning can be interpreted as a representation of resilience and strength in the face of adversity. The woman is depicted with multiple children, which could signify her role as a caregiver and protector during difficult times.\n",
"Connections between the image and the related text include Florence Owens Thompson's leadership qualities and her regretted feelings about the photograph. Additionally, the mention of Dorothea Lange, the photographer who took this photo, ties the image to its historical context and the broader narrative of the Great Depression and Dust Bowl in Oklahoma. \n"
]
}
],
@ -492,14 +491,6 @@
"source": [
"! docker kill vdms_rag_nb"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ba652da",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -518,7 +509,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.11.9"
}
},
"nbformat": 4,


@ -12,7 +12,8 @@
"VDMS supports:\n",
"* K nearest neighbor search\n",
"* Euclidean distance (L2) and inner product (IP)\n",
"* Libraries for indexing and computing distances: TileDBDense, TileDBSparse, FaissFlat (Default), FaissIVFFlat\n",
"* Libraries for indexing and computing distances: TileDBDense, TileDBSparse, FaissFlat (Default), FaissIVFFlat, Flinng\n",
"* Embeddings for text, images, and video\n",
"* Vector and metadata searches\n",
"\n",
"VDMS has server and client components. To setup the server, see the [installation instructions](https://github.com/IntelLabs/vdms/blob/master/INSTALL.md) or use the [docker image](https://hub.docker.com/r/intellabs/vdms).\n",
@ -40,7 +41,7 @@
],
"source": [
"# Pip install necessary package\n",
"%pip install --upgrade --quiet pip sentence-transformers vdms \"unstructured-inference==0.6.6\";"
"%pip install --upgrade --quiet pip vdms sentence-transformers langchain-huggingface > /dev/null"
]
},
{
@ -62,7 +63,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"e6061b270eef87de5319a6c5af709b36badcad8118069a8f6b577d2e01ad5e2d\n"
"b26917ffac236673ef1d035ab9c91fe999e29c9eb24aa6c7103d7baa6bf2f72d\n"
]
}
],
@ -92,6 +93,9 @@
"outputs": [],
"source": [
"import time\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"from langchain_community.document_loaders.text import TextLoader\n",
"from langchain_community.vectorstores import VDMS\n",
@ -290,7 +294,7 @@
"source": [
"# add data\n",
"collection_name = \"my_collection_faiss_L2\"\n",
"db = VDMS.from_documents(\n",
"db_FaissFlat = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
@ -301,7 +305,7 @@
"# Query (No metadata filtering)\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"returned_docs = db.similarity_search(query, k=k, filter=None)\n",
"returned_docs = db_FaissFlat.similarity_search(query, k=k, filter=None)\n",
"print_results(returned_docs, score=False)"
]
},
@ -392,25 +396,24 @@
"k = 3\n",
"constraints = {\"page_number\": [\">\", 30], \"president_included\": [\"==\", True]}\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"returned_docs = db.similarity_search(query, k=k, filter=constraints)\n",
"returned_docs = db_FaissFlat.similarity_search(query, k=k, filter=constraints)\n",
"print_results(returned_docs, score=False)"
]
},
{
"cell_type": "markdown",
"id": "a5984766",
"id": "92ab3370",
"metadata": {},
"source": [
"### Similarity Search using TileDBDense and Euclidean Distance\n",
"### Similarity Search using Faiss IVFFlat and Inner Product (IP) Distance\n",
"\n",
"In this section, we add the documents to VDMS using TileDB Dense indexing and L2 as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n",
"\n"
"In this section, we add the documents to VDMS using Faiss IndexIVFFlat indexing and IP as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3001ba6e",
"id": "78f502cf",
"metadata": {},
"outputs": [
{
@ -419,7 +422,7 @@
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425491333\n",
"Score:\t1.2032090425\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
@ -437,7 +440,7 @@
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.495247483253479\n",
"Score:\t1.4952471256\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
@ -463,7 +466,224 @@
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008409023284912\n",
"Score:\t1.5008399487\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"db_FaissIVFFlat = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=\"my_collection_FaissIVFFlat_IP\",\n",
" embedding=embedding,\n",
" engine=\"FaissIVFFlat\",\n",
" distance_strategy=\"IP\",\n",
")\n",
"# Query\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db_FaissIVFFlat.similarity_search_with_score(query, k=k, filter=None)\n",
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "e66d9125",
"metadata": {},
"source": [
"### Similarity Search using FLINNG and IP Distance\n",
"\n",
"In this section, we add the documents to VDMS using Filters to Identify Near-Neighbor Groups (FLINNG) indexing and IP as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "add81beb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.4952471256\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008399487\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"db_Flinng = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=\"my_collection_Flinng_IP\",\n",
" embedding=embedding,\n",
" engine=\"Flinng\",\n",
" distance_strategy=\"IP\",\n",
")\n",
"# Query\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db_Flinng.similarity_search_with_score(query, k=k, filter=None)\n",
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "a5984766",
"metadata": {},
"source": [
"### Similarity Search using TileDBDense and Euclidean Distance\n",
"\n",
"In this section, we add the documents to VDMS using TileDB Dense indexing and L2 as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3001ba6e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.4952471256\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008399487\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
@ -505,114 +725,6 @@
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "92ab3370",
"metadata": {},
"source": [
"### Similarity Search using Faiss IVFFlat and Euclidean Distance\n",
"\n",
"In this section, we add the documents to VDMS using Faiss IndexIVFFlat indexing and L2 as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "78f502cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425491333\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.495247483253479\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008409023284912\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"db_FaissIVFFlat = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=\"my_collection_FaissIVFFlat_L2\",\n",
" embedding=embedding,\n",
" engine=\"FaissIVFFlat\",\n",
" distance_strategy=\"L2\",\n",
")\n",
"# Query\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db_FaissIVFFlat.similarity_search_with_score(query, k=k, filter=None)\n",
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "9ed3ec50",
@ -622,12 +734,12 @@
"\n",
"While building toward a real application, you want to go beyond adding data, and also update and delete data.\n",
"\n",
"Here is a basic example showing how to do so. First, we will update the metadata for the document most relevant to the query."
"Here is a basic example showing how to do so. First, we will update the metadata for the document most relevant to the query by adding a date. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 11,
"id": "81a02810",
"metadata": {},
"outputs": [
@ -638,7 +750,7 @@
"Original metadata: \n",
"\t{'id': '32', 'page_number': 32, 'president_included': True, 'source': '../../how_to/state_of_the_union.txt'}\n",
"new metadata: \n",
"\t{'id': '32', 'page_number': 32, 'president_included': True, 'source': '../../how_to/state_of_the_union.txt', 'new_value': 'hello world'}\n",
"\t{'id': '32', 'page_number': 32, 'president_included': True, 'source': '../../how_to/state_of_the_union.txt', 'last_date_read': {'_date': '2024-05-01T14:30:00'}}\n",
"--------------------------------------------------\n",
"\n",
"UPDATED ENTRY (id=32):\n",
@ -655,8 +767,8 @@
"id:\n",
"\t32\n",
"\n",
"new_value:\n",
"\thello world\n",
"last_date_read:\n",
"\t2024-05-01T14:30:00+00:00\n",
"\n",
"page_number:\n",
"\t32\n",
@ -672,19 +784,26 @@
}
],
"source": [
"doc = db.similarity_search(query)[0]\n",
"from datetime import datetime\n",
"\n",
"doc = db_FaissFlat.similarity_search(query)[0]\n",
"print(f\"Original metadata: \\n\\t{doc.metadata}\")\n",
"\n",
"# update the metadata for a document\n",
"doc.metadata[\"new_value\"] = \"hello world\"\n",
"# Update the metadata for a document by adding last datetime document read\n",
"datetime_str = datetime(2024, 5, 1, 14, 30, 0).isoformat()\n",
"doc.metadata[\"last_date_read\"] = {\"_date\": datetime_str}\n",
"print(f\"new metadata: \\n\\t{doc.metadata}\")\n",
"print(f\"{DELIMITER}\\n\")\n",
"\n",
"# Update document in VDMS\n",
"id_to_update = doc.metadata[\"id\"]\n",
"db.update_document(collection_name, id_to_update, doc)\n",
"response, response_array = db.get(\n",
" collection_name, constraints={\"id\": [\"==\", id_to_update]}\n",
"db_FaissFlat.update_document(collection_name, id_to_update, doc)\n",
"response, response_array = db_FaissFlat.get(\n",
" collection_name,\n",
" constraints={\n",
" \"id\": [\"==\", id_to_update],\n",
" \"last_date_read\": [\">=\", {\"_date\": \"2024-05-01T00:00:00\"}],\n",
" },\n",
")\n",
"\n",
"# Display Results\n",
@ -702,7 +821,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"id": "95537fe8",
"metadata": {},
"outputs": [
@ -716,11 +835,13 @@
}
],
"source": [
"print(\"Documents before deletion: \", db.count(collection_name))\n",
"print(\"Documents before deletion: \", db_FaissFlat.count(collection_name))\n",
"\n",
"id_to_remove = ids[-1]\n",
"db.delete(collection_name=collection_name, ids=[id_to_remove])\n",
"print(f\"Documents after deletion (id={id_to_remove}): {db.count(collection_name)}\")"
"db_FaissFlat.delete(collection_name=collection_name, ids=[id_to_remove])\n",
"print(\n",
" f\"Documents after deletion (id={id_to_remove}): {db_FaissFlat.count(collection_name)}\"\n",
")"
]
},
{
@ -739,7 +860,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "1db4d6ed",
"metadata": {},
"outputs": [
@ -758,7 +879,7 @@
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tlast_date_read:\t2024-05-01T14:30:00+00:00\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n"
@ -767,7 +888,7 @@
],
"source": [
"embedding_vector = embedding.embed_query(query)\n",
"returned_docs = db.similarity_search_by_vector(embedding_vector)\n",
"returned_docs = db_FaissFlat.similarity_search_by_vector(embedding_vector)\n",
"\n",
"# Print Results\n",
"print_document_details(returned_docs[0])"
@ -787,7 +908,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"id": "2bc0313b",
"metadata": {},
"outputs": [
@ -795,7 +916,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Returned entry:\n",
"Deleted entry:\n",
"\n",
"blob:\n",
"\tTrue\n",
@ -838,18 +959,18 @@
}
],
"source": [
"response, response_array = db.get(\n",
"response, response_array = db_FaissFlat.get(\n",
" collection_name,\n",
" limit=1,\n",
" include=[\"metadata\", \"embeddings\"],\n",
" constraints={\"id\": [\"==\", \"2\"]},\n",
")\n",
"\n",
"print(\"Returned entry:\")\n",
"print_response([response[0][\"FindDescriptor\"][\"entities\"][0]])\n",
"\n",
"# Delete id=2\n",
"db.delete(collection_name=collection_name, ids=[\"2\"])"
"db_FaissFlat.delete(collection_name=collection_name, ids=[\"2\"])\n",
"\n",
"print(\"Deleted entry:\")\n",
"print_response([response[0][\"FindDescriptor\"][\"entities\"][0]])"
]
},
{
@ -869,7 +990,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 15,
"id": "120f55eb",
"metadata": {},
"outputs": [
@ -888,7 +1009,7 @@
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tlast_date_read:\t2024-05-01T14:30:00+00:00\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n"
@ -896,7 +1017,7 @@
}
],
"source": [
"retriever = db.as_retriever()\n",
"retriever = db_FaissFlat.as_retriever()\n",
"relevant_docs = retriever.invoke(query)[0]\n",
"\n",
"print_document_details(relevant_docs)"
@ -914,7 +1035,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 16,
"id": "f00be6d0",
"metadata": {},
"outputs": [
@ -933,7 +1054,7 @@
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tlast_date_read:\t2024-05-01T14:30:00+00:00\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n"
@ -941,7 +1062,7 @@
}
],
"source": [
"retriever = db.as_retriever(search_type=\"mmr\")\n",
"retriever = db_FaissFlat.as_retriever(search_type=\"mmr\")\n",
"relevant_docs = retriever.invoke(query)[0]\n",
"\n",
"print_document_details(relevant_docs)"
@ -957,7 +1078,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 17,
"id": "ab911470",
"metadata": {},
"outputs": [
@ -967,7 +1088,7 @@
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032092809677124\n",
"Score:\t1.2032091618\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
@ -980,13 +1101,13 @@
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tlast_date_read:\t2024-05-01T14:30:00+00:00\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../how_to/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.507053256034851\n",
"Score:\t1.50705266\n",
"\n",
"Content:\n",
"\tBut cancer from prolonged exposure to burn pits ravaged Heaths lungs and body. \n",
@ -1022,7 +1143,7 @@
}
],
"source": [
"mmr_resp = db.max_marginal_relevance_search_with_score(query, k=2, fetch_k=10)\n",
"mmr_resp = db_FaissFlat.max_marginal_relevance_search_with_score(query, k=2, fetch_k=10)\n",
"print_results(mmr_resp)"
]
},
@ -1037,7 +1158,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 18,
"id": "874e7af9",
"metadata": {},
"outputs": [
@ -1051,11 +1172,11 @@
}
],
"source": [
"print(\"Documents before deletion: \", db.count(collection_name))\n",
"print(\"Documents before deletion: \", db_FaissFlat.count(collection_name))\n",
"\n",
"db.delete(collection_name=collection_name)\n",
"db_FaissFlat.delete(collection_name=collection_name)\n",
"\n",
"print(\"Documents after deletion: \", db.count(collection_name))"
"print(\"Documents after deletion: \", db_FaissFlat.count(collection_name))"
]
},
{
@ -1068,7 +1189,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 19,
"id": "08931796",
"metadata": {},
"outputs": [
@ -1097,7 +1218,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "0386ea81",
"id": "a60725a6",
"metadata": {},
"outputs": [],
"source": []
@ -1119,7 +1240,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.11.9"
}
},
"nbformat": 4,


@ -86,7 +86,7 @@ tree-sitter>=0.20.2,<0.21
tree-sitter-languages>=1.8.0,<2
upstash-redis>=1.1.0,<2
upstash-ratelimit>=1.1.0,<2
vdms==0.0.20
vdms>=0.0.20
xata>=1.0.0a7,<2
xmltodict>=0.13.0,<0.14
nanopq==0.2.1


@ -2,6 +2,7 @@ from __future__ import annotations
import base64
import logging
import os
import uuid
from copy import deepcopy
from typing import (
@ -76,6 +77,41 @@ def _len_check_if_sized(x: Any, y: Any, x_name: str, y_name: str) -> None:
return
def _results_to_docs(results: Any) -> List[Document]:
return [doc for doc, _ in _results_to_docs_and_scores(results)]
def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
final_res: List[Any] = []
try:
responses, blobs = results[0]
if (
len(responses) > 0
and "FindDescriptor" in responses[0]
and "entities" in responses[0]["FindDescriptor"]
):
result_entities = responses[0]["FindDescriptor"]["entities"]
# result_blobs = blobs
for ent in result_entities:
distance = round(ent["_distance"], 10)
txt_contents = ent["content"]
for p in INVALID_DOC_METADATA_KEYS:
if p in ent:
del ent[p]
props = {
mkey: mval
for mkey, mval in ent.items()
if mval not in INVALID_METADATA_VALUE
}
final_res.append(
(Document(page_content=txt_contents, metadata=props), distance)
)
except Exception as e:
logger.warning(f"No results returned. Error while parsing results: {e}")
return final_res
def VDMS_Client(host: str = "localhost", port: int = 55555) -> vdms.vdms:
"""VDMS client for the VDMS server.
@ -122,7 +158,7 @@ class VDMS(VectorStore):
Example:
.. code-block:: python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.vdms import VDMS, VDMS_Client
vectorstore = VDMS(
@ -143,19 +179,20 @@ class VDMS(VectorStore):
distance_strategy: DISTANCE_METRICS = "L2",
engine: ENGINES = "FaissFlat",
relevance_score_fn: Optional[Callable[[float], float]] = None,
embedding_dimensions: Optional[int] = None,
) -> None:
# Check required parameters
self._client = client
self.similarity_search_engine = engine
self.distance_strategy = distance_strategy
self.embedding = embedding
self._check_required_inputs(collection_name)
self._check_required_inputs(collection_name, embedding_dimensions)
# Update other parameters
self.override_relevance_score_fn = relevance_score_fn
# Initialize collection
self._collection_name = self.__add_set(
self._collection_name = self.add_set(
collection_name,
engine=self.similarity_search_engine,
metric=self.distance_strategy,
@ -173,6 +210,14 @@ class VDMS(VectorStore):
p_str += " to be an Embeddings object"
raise ValueError(p_str)
def _embed_video(self, paths: List[str], **kwargs: Any) -> List[List[float]]:
if self.embedding is not None and hasattr(self.embedding, "embed_video"):
return self.embedding.embed_video(paths=paths, **kwargs)
else:
raise ValueError(
"Must provide `embedding` which has attribute `embed_video`"
)
def _embed_image(self, uris: List[str]) -> List[List[float]]:
if self.embedding is not None and hasattr(self.embedding, "embed_image"):
return self.embedding.embed_image(uris=uris)
@ -225,10 +270,10 @@ class VDMS(VectorStore):
if self.override_relevance_score_fn is None:
kwargs["normalize_distance"] = True
docs_and_scores = self.similarity_search_with_score(
query,
k,
fetch_k,
filter,
query=query,
k=k,
fetch_k=fetch_k,
filter=filter,
**kwargs,
)
@ -242,7 +287,7 @@ class VDMS(VectorStore):
)
return docs_and_rel_scores
def __add(
def add(
self,
collection_name: str,
texts: List[str],
@ -275,7 +320,7 @@ class VDMS(VectorStore):
return inserted_ids
def __add_set(
def add_set(
self,
collection_name: str,
engine: ENGINES = "FaissFlat",
@ -333,6 +378,12 @@ class VDMS(VectorStore):
all_queries.append(query)
response, response_array = self.__run_vdms_query(all_queries, all_blobs)
# Update/store indices after deletion
query = _add_descriptorset(
"FindDescriptorSet", collection_name, storeIndex=True
)
responseSet, _ = self.__run_vdms_query([query], all_blobs)
return "FindDescriptor" in response[0]
def __get_add_query(
@ -365,7 +416,7 @@ class VDMS(VectorStore):
if metadata:
props.update(metadata)
if document:
if document not in [None, ""]:
props["content"] = document
for k in props.keys():
@ -515,7 +566,7 @@ class VDMS(VectorStore):
Args:
uris: List of paths to the images to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
metadatas: Optional list of metadatas associated with the images.
ids: Optional list of unique IDs.
batch_size (int): Number of concurrent requests to send to the server.
add_path: Bool to add image path as metadata
@ -545,7 +596,7 @@ class VDMS(VectorStore):
else:
metadatas = [_validate_vdms_properties(m) for m in metadatas]
self.__from(
self.add_from(
texts=b64_texts,
embeddings=embeddings,
ids=ids,
@ -555,6 +606,62 @@ class VDMS(VectorStore):
)
return ids
def add_videos(
self,
paths: List[str],
texts: Optional[List[str]] = None,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
batch_size: int = 1,
add_path: Optional[bool] = True,
**kwargs: Any,
) -> List[str]:
"""Run videos through the embeddings and add to the vectorstore.
Videos are added as embeddings (AddDescriptor) instead of a separate
entity (AddVideo) within VDMS to leverage similarity search capability.
Args:
paths: List of paths to the videos to add to the vectorstore.
texts: Optional list of text associated with the videos.
metadatas: Optional list of metadatas associated with the videos.
ids: Optional list of unique IDs.
batch_size (int): Number of concurrent requests to send to the server.
add_path: Bool to add video path as metadata
Returns:
List of ids from adding videos into the vectorstore.
"""
if texts is None:
texts = ["" for _ in paths]
if add_path and metadatas:
for midx, path in enumerate(paths):
metadatas[midx]["video_path"] = path
elif add_path:
metadatas = []
for path in paths:
metadatas.append({"video_path": path})
# Populate IDs
ids = ids if ids is not None else [str(uuid.uuid4()) for _ in paths]
# Set embeddings
embeddings = self._embed_video(paths=paths, **kwargs)
if metadatas is None:
metadatas = [{} for _ in paths]
self.add_from(
texts=texts,
embeddings=embeddings,
ids=ids,
metadatas=metadatas,
batch_size=batch_size,
**kwargs,
)
return ids
def add_texts(
self,
texts: Iterable[str],
@ -586,7 +693,7 @@ class VDMS(VectorStore):
else:
metadatas = [_validate_vdms_properties(m) for m in metadatas]
inserted_ids = self.__from(
inserted_ids = self.add_from(
texts=texts,
embeddings=embeddings,
ids=ids,
@ -596,7 +703,7 @@ class VDMS(VectorStore):
)
return inserted_ids
def __from(
def add_from(
self,
texts: List[str],
embeddings: List[List[float]],
@ -617,7 +724,7 @@ class VDMS(VectorStore):
if metadatas:
batch_metadatas = metadatas[start_idx:end_idx]
result = self.__add(
result = self.add(
self._collection_name,
embeddings=batch_embedding_vectors,
texts=batch_texts,
@ -633,7 +740,9 @@ class VDMS(VectorStore):
)
return inserted_ids
def _check_required_inputs(self, collection_name: str) -> None:
def _check_required_inputs(
self, collection_name: str, embedding_dimensions: Union[int, None]
) -> None:
# Check connection to client
if not self._client.is_connected():
raise ValueError(
@ -656,7 +765,29 @@ class VDMS(VectorStore):
if self.embedding is None:
raise ValueError("Must provide embedding function")
self.embedding_dimension = len(self._embed_query("This is a sample sentence."))
if embedding_dimensions is not None:
self.embedding_dimension = embedding_dimensions
elif self.embedding is not None and hasattr(self.embedding, "embed_query"):
self.embedding_dimension = len(
self._embed_query("This is a sample sentence.")
)
elif self.embedding is not None and (
hasattr(self.embedding, "embed_image")
or hasattr(self.embedding, "embed_video")
):
if hasattr(self.embedding, "model"):
try:
self.embedding_dimension = (
self.embedding.model.token_embedding.embedding_dim
)
except (AttributeError, ValueError):
raise ValueError(
"Embedding dimension needed. Please define embedding_dimensions"
)
else:
raise ValueError(
"Embedding dimension needed. Please define embedding_dimensions"
)
# Check for properties
current_props = self.__get_properties(collection_name)
@ -727,7 +858,7 @@ class VDMS(VectorStore):
)
response, response_array = self.__run_vdms_query([query], all_blobs)
if normalize:
if normalize and command_str in response[0]:
max_dist = response[0][command_str]["entities"][-1]["_distance"]
return response, response_array, max_dist
@ -769,14 +900,21 @@ class VDMS(VectorStore):
results=results,
)
response, response_array = self.__run_vdms_query([query])
ids_of_interest = [
ent["id"] for ent in response[0][command_str]["entities"]
]
if command_str in response[0] and response[0][command_str]["returned"] > 0:
ids_of_interest = [
ent["id"] for ent in response[0][command_str]["entities"]
]
else:
return [], []
# (2) Find top fetch_k results
response, response_array, max_dist = self.get_k_candidates(
setname, fetch_k, results, all_blobs, normalize=normalize_distance
)
if command_str not in response[0] or (
command_str in response[0] and response[0][command_str]["returned"] == 0
):
return [], []
# (3) Intersection of (1) & (2) using ids
new_entities: List[Dict] = []
@ -792,7 +930,7 @@ class VDMS(VectorStore):
print(p_str) # noqa: T201
if normalize_distance:
max_dist = 1.0 if max_dist == 0 else max_dist
max_dist = 1.0 if max_dist in [0, np.inf] else max_dist
for ent_idx, ent in enumerate(response[0][command_str]["entities"]):
ent["_distance"] = ent["_distance"] / max_dist
response[0][command_str]["entities"][ent_idx]["_distance"] = ent[
@ -946,7 +1084,7 @@ class VDMS(VectorStore):
among selected documents.
Args:
query: Text to look up documents similar to.
query (str): Query to look up. Text or path for image or video.
k: Number of Documents to return. Defaults to 4.
fetch_k: Number of Documents to fetch to pass to MMR algorithm.
lambda_mult: Number between 0 and 1 that determines the degree
@ -963,7 +1101,20 @@ class VDMS(VectorStore):
"For MMR search, you must specify an embedding function on" "creation."
)
embedding_vector: List[float] = self._embed_query(query)
embedding_vector: List[float]
if not os.path.isfile(query) and hasattr(self.embedding, "embed_query"):
embedding_vector = self._embed_query(query)
elif os.path.isfile(query) and hasattr(self.embedding, "embed_image"):
embedding_vector = self._embed_image(uris=[query])[0]
elif os.path.isfile(query) and hasattr(self.embedding, "embed_video"):
embedding_vector = self._embed_video(paths=[query])[0]
else:
error_msg = f"Could not generate embedding for query '{query}'. "
error_msg += "If using a path for an image or video, verify the embedding "
error_msg += "model has callable functions 'embed_image' or 'embed_video'."
raise ValueError(error_msg)
docs = self.max_marginal_relevance_search_by_vector(
embedding_vector,
k,
@ -1006,19 +1157,27 @@ class VDMS(VectorStore):
include=["metadatas", "documents", "distances", "embeddings"],
)
embedding_list = [list(_bytes2embedding(result)) for result in results[0][1]]
if len(results[0][1]) == 0:
# No results returned
return []
else:
embedding_list = [
list(_bytes2embedding(result)) for result in results[0][1]
]
mmr_selected = maximal_marginal_relevance(
np.array(embedding, dtype=np.float32),
embedding_list,
k=k,
lambda_mult=lambda_mult,
)
mmr_selected = maximal_marginal_relevance(
np.array(embedding, dtype=np.float32),
embedding_list,
k=k,
lambda_mult=lambda_mult,
)
candidates = _results_to_docs(results)
candidates = _results_to_docs(results)
selected_results = [r for i, r in enumerate(candidates) if i in mmr_selected]
return selected_results
selected_results = [
r for i, r in enumerate(candidates) if i in mmr_selected
]
return selected_results
def max_marginal_relevance_search_with_score(
self,
@ -1034,7 +1193,7 @@ class VDMS(VectorStore):
among selected documents.
Args:
query: Text to look up documents similar to.
query (str): Query to look up. Text or path for image or video.
k: Number of Documents to return. Defaults to 4.
fetch_k: Number of Documents to fetch to pass to MMR algorithm.
lambda_mult: Number between 0 and 1 that determines the degree
@ -1051,7 +1210,18 @@ class VDMS(VectorStore):
"For MMR search, you must specify an embedding function on" "creation."
)
embedding = self._embed_query(query)
if not os.path.isfile(query) and hasattr(self.embedding, "embed_query"):
embedding = self._embed_query(query)
elif os.path.isfile(query) and hasattr(self.embedding, "embed_image"):
embedding = self._embed_image(uris=[query])[0]
elif os.path.isfile(query) and hasattr(self.embedding, "embed_video"):
embedding = self._embed_video(paths=[query])[0]
else:
error_msg = f"Could not generate embedding for query '{query}'. "
error_msg += "If using a path for an image or video, verify the embedding "
error_msg += "model has callable functions 'embed_image' or 'embed_video'."
raise ValueError(error_msg)
docs = self.max_marginal_relevance_search_with_score_by_vector(
embedding,
k,
@ -1094,21 +1264,27 @@ class VDMS(VectorStore):
include=["metadatas", "documents", "distances", "embeddings"],
)
embedding_list = [list(_bytes2embedding(result)) for result in results[0][1]]
if len(results[0][1]) == 0:
# No results returned
return []
else:
embedding_list = [
list(_bytes2embedding(result)) for result in results[0][1]
]
mmr_selected = maximal_marginal_relevance(
np.array(embedding, dtype=np.float32),
embedding_list,
k=k,
lambda_mult=lambda_mult,
)
mmr_selected = maximal_marginal_relevance(
np.array(embedding, dtype=np.float32),
embedding_list,
k=k,
lambda_mult=lambda_mult,
)
candidates = _results_to_docs_and_scores(results)
candidates = _results_to_docs_and_scores(results)
selected_results = [
(r, s) for i, (r, s) in enumerate(candidates) if i in mmr_selected
]
return selected_results
selected_results = [
(r, s) for i, (r, s) in enumerate(candidates) if i in mmr_selected
]
return selected_results
def query_collection_embeddings(
self,
@ -1162,7 +1338,7 @@ class VDMS(VectorStore):
"""Run similarity search with VDMS.
Args:
query (str): Query text to search for.
query (str): Query to look up. Text or path for image or video.
k (int): Number of results to return. Defaults to 3.
fetch_k (int): Number of candidates to fetch for knn (>= k).
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
@ -1171,7 +1347,7 @@ class VDMS(VectorStore):
List[Document]: List of documents most similar to the query text.
"""
docs_and_scores = self.similarity_search_with_score(
query, k, fetch_k, filter=filter, **kwargs
query, k=k, fetch_k=fetch_k, filter=filter, **kwargs
)
return [doc for doc, _ in docs_and_scores]
@ -1213,7 +1389,7 @@ class VDMS(VectorStore):
"""Run similarity search with VDMS with distance.
Args:
query (str): Query text to search for.
query (str): Query to look up. Text or path for image or video.
k (int): Number of results to return. Defaults to 3.
fetch_k (int): Number of candidates to fetch for knn (>= k).
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
@ -1226,7 +1402,18 @@ class VDMS(VectorStore):
if self.embedding is None:
raise ValueError("Must provide embedding function")
else:
query_embedding: List[float] = self._embed_query(query)
if not os.path.isfile(query) and hasattr(self.embedding, "embed_query"):
query_embedding: List[float] = self._embed_query(query)
elif os.path.isfile(query) and hasattr(self.embedding, "embed_image"):
query_embedding = self._embed_image(uris=[query])[0]
elif os.path.isfile(query) and hasattr(self.embedding, "embed_video"):
query_embedding = self._embed_video(paths=[query])[0]
else:
error_msg = f"Could not generate embedding for query '{query}'. "
error_msg += "If using a path for an image or video, verify the embedding "
error_msg += "model has callable functions 'embed_image' or 'embed_video'."
raise ValueError(error_msg)
results = self.query_collection_embeddings(
query_embeddings=[query_embedding],
n_results=k,
@ -1256,10 +1443,10 @@ class VDMS(VectorStore):
Returns:
List[Tuple[Document, float]]: List of documents most similar to
the query text and cosine distance in float for each.
Lower score represents more similarity.
the query text. Lower score represents more similarity.
"""
kwargs["normalize_distance"] = True
# kwargs["normalize_distance"] = True
results = self.query_collection_embeddings(
query_embeddings=[embedding],
@ -1308,37 +1495,6 @@ class VDMS(VectorStore):
# VDMS UTILITY
def _results_to_docs(results: Any) -> List[Document]:
return [doc for doc, _ in _results_to_docs_and_scores(results)]
def _results_to_docs_and_scores(results: Any) -> List[Tuple[Document, float]]:
final_res: List[Any] = []
responses, blobs = results[0]
if (
"FindDescriptor" in responses[0]
and "entities" in responses[0]["FindDescriptor"]
):
result_entities = responses[0]["FindDescriptor"]["entities"]
# result_blobs = blobs
for ent in result_entities:
distance = ent["_distance"]
txt_contents = ent["content"]
for p in INVALID_DOC_METADATA_KEYS:
if p in ent:
del ent[p]
props = {
mkey: mval
for mkey, mval in ent.items()
if mval not in INVALID_METADATA_VALUE
}
final_res.append(
(Document(page_content=txt_contents, metadata=props), distance)
)
return final_res
def _add_descriptor(
command_str: str,
setname: str,

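Taken together, the `vdms.py` changes above expose `add`, `add_set`, and `add_from` as public methods, add `add_videos`, accept an optional `embedding_dimensions` for embedding models that lack `embed_query`, and let a similarity or MMR query be a file path that is routed to `embed_image` or `embed_video`. A minimal sketch of that surface, assuming OpenCLIP embeddings, a local VDMS server, and illustrative file paths:

```python
from langchain_experimental.open_clip import OpenCLIPEmbeddings

from langchain_community.vectorstores import VDMS
from langchain_community.vectorstores.vdms import VDMS_Client

client = VDMS_Client(port=55555)  # assumes a local VDMS server on the default port

# OpenCLIP provides embed_query and embed_image; pass embedding_dimensions explicitly
# only if the embedding object cannot embed a sample text query on its own.
db = VDMS(
    client=client,
    collection_name="multimodal_demo",  # placeholder name
    embedding=OpenCLIPEmbeddings(model_name="ViT-g-14", checkpoint="laion2b_s34b_b88k"),
    engine="FaissFlat",
)

# Images are stored as base64-encoded strings; videos are embedded through the
# model's embed_video hook, so they need an embedding class that implements it.
db.add_images(uris=["./data/photo.jpg"])      # illustrative path
# db.add_videos(paths=["./data/clip.mp4"])    # requires an embedding with embed_video

# A query may now be plain text or a path to an image/video file
docs = db.similarity_search("./data/photo.jpg", k=3)
```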

@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
# This file is automatically @generated by Poetry 1.8.3 and should not be changed by hand.
[[package]]
name = "aiohttp"
@ -2117,7 +2117,7 @@ files = [
[[package]]
name = "langchain"
version = "0.2.10"
version = "0.2.11"
description = "Building applications with LLMs through composability"
optional = false
python-versions = ">=3.8.1,<4.0"
@ -2127,7 +2127,7 @@ develop = true
[package.dependencies]
aiohttp = "^3.8.3"
async-timeout = {version = "^4.0.0", markers = "python_version < \"3.11\""}
langchain-core = "^0.2.22"
langchain-core = "^0.2.23"
langchain-text-splitters = "^0.2.0"
langsmith = "^0.1.17"
numpy = [
@ -3819,7 +3819,6 @@ files = [
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
{file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
{file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
@ -5358,13 +5357,13 @@ tests = ["Werkzeug (==2.0.3)", "aiohttp", "boto3", "httplib2", "httpx", "pytest"
[[package]]
name = "vdms"
version = "0.0.20"
version = "0.0.21"
description = "VDMS Client Module"
optional = false
python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*, <4"
python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,<4,>=2.6"
files = [
{file = "vdms-0.0.20-py3-none-any.whl", hash = "sha256:7b81127f2981f2dabdcc5880ad7eb4bc2c7833a25aaf79a7b1a560e86bf7b5ec"},
{file = "vdms-0.0.20.tar.gz", hash = "sha256:746c21a96e420b9b034495537b42d70f2326b020a1c6907677f7851a926e8605"},
{file = "vdms-0.0.21-py3-none-any.whl", hash = "sha256:18e785cd7ec66c3a6c5921a6a93fe2ca22d97f45f40dccb9ff0c954675139daf"},
{file = "vdms-0.0.21.tar.gz", hash = "sha256:bbb62d3f1a5cdab6b6bd41950942880cc431729313742870eb255a23c5f0381f"},
]
[package.dependencies]
@ -5759,4 +5758,4 @@ test = ["big-O", "importlib-resources", "jaraco.functools", "jaraco.itertools",
[metadata]
lock-version = "2.0"
python-versions = ">=3.8.1,<4.0"
content-hash = "14d60e1f61fa9c0ba69cb4e227e4af3de395a8dd4a53b121fe488e7b9f75ea66"
content-hash = "324e10fe59335abccbd422d9ee8ae771714edf72078a750b99c87ba853bd617c"

View File

@ -102,7 +102,7 @@ cassio = "^0.1.6"
tiktoken = ">=0.3.2,<0.6.0"
anthropic = "^0.3.11"
fireworks-ai = "^0.9.0"
vdms = "^0.0.20"
vdms = ">=0.0.20"
exllamav2 = "^0.0.18"
[tool.poetry.group.lint.dependencies]

View File

@ -20,6 +20,7 @@ if TYPE_CHECKING:
import vdms
logging.basicConfig(level=logging.DEBUG)
embedding_function = FakeEmbeddings()
# The connection string matches the default settings in the docker-compose file
@ -28,6 +29,7 @@ logging.basicConfig(level=logging.DEBUG)
# cd [root]/docker
# docker compose up -d vdms
@pytest.fixture
@pytest.mark.enable_socket
def vdms_client() -> vdms.vdms:
return VDMS_Client(
host=os.getenv("VDMS_DBHOST", "localhost"),
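The fixture connects to whatever VDMS instance the `VDMS_DBHOST`/`VDMS_DBPORT` environment variables point at, falling back to the docker-compose defaults noted above. A quick standalone connectivity check under the same assumptions (the port variable and its 55555 default are assumptions, not shown in this hunk):

```python
# Standalone sketch (not part of the test suite): open a connection to the
# local VDMS container using the same env-var defaults as the fixture above.
import os

from langchain_community.vectorstores.vdms import VDMS_Client

client = VDMS_Client(
    host=os.getenv("VDMS_DBHOST", "localhost"),
    port=int(os.getenv("VDMS_DBPORT", 55555)),
)
print(type(client))  # vdms.vdms client, ready to pass as VDMS(client=...)
```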
@ -36,19 +38,19 @@ def vdms_client() -> vdms.vdms:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_init_from_client(vdms_client: vdms.vdms) -> None:
embedding_function = FakeEmbeddings()
_ = VDMS( # type: ignore[call-arg]
embedding_function=embedding_function,
embedding=embedding_function,
client=vdms_client,
)
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_from_texts_with_metadatas(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and search."""
collection_name = "test_from_texts_with_metadatas"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_from_texts_with_metadatas_{i}" for i in range(len(texts))]
metadatas = [{"page": str(i)} for i in range(1, len(texts) + 1)]
@ -67,10 +69,10 @@ def test_from_texts_with_metadatas(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_from_texts_with_metadatas_with_scores(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and scored search."""
collection_name = "test_from_texts_with_metadatas_with_scores"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_from_texts_with_metadatas_with_scores_{i}" for i in range(len(texts))]
metadatas = [{"page": str(i)} for i in range(1, len(texts) + 1)]
@ -82,19 +84,19 @@ def test_from_texts_with_metadatas_with_scores(vdms_client: vdms.vdms) -> None:
collection_name=collection_name,
client=vdms_client,
)
output = docsearch.similarity_search_with_score("foo", k=1)
output = docsearch.similarity_search_with_score("foo", k=1, fetch_k=1)
assert output == [
(Document(page_content="foo", metadata={"page": "1", "id": ids[0]}), 0.0)
]
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_from_texts_with_metadatas_with_scores_using_vector(
vdms_client: vdms.vdms,
) -> None:
"""Test end to end construction and scored search, using embedding vector."""
collection_name = "test_from_texts_with_metadatas_with_scores_using_vector"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_from_texts_with_metadatas_{i}" for i in range(len(texts))]
metadatas = [{"page": str(i)} for i in range(1, len(texts) + 1)]
@ -113,10 +115,10 @@ def test_from_texts_with_metadatas_with_scores_using_vector(
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_search_filter(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and search with metadata filtering."""
collection_name = "test_search_filter"
embedding_function = FakeEmbeddings()
texts = ["far", "bar", "baz"]
ids = [f"test_search_filter_{i}" for i in range(len(texts))]
metadatas = [{"first_letter": "{}".format(text[0])} for text in texts]
@ -144,10 +146,10 @@ def test_search_filter(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_search_filter_with_scores(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and scored search with metadata filtering."""
collection_name = "test_search_filter_with_scores"
embedding_function = FakeEmbeddings()
texts = ["far", "bar", "baz"]
ids = [f"test_search_filter_with_scores_{i}" for i in range(len(texts))]
metadatas = [{"first_letter": "{}".format(text[0])} for text in texts]
@ -185,10 +187,10 @@ def test_search_filter_with_scores(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_mmr(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and search."""
collection_name = "test_mmr"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_mmr_{i}" for i in range(len(texts))]
docsearch = VDMS.from_texts(
@ -203,10 +205,10 @@ def test_mmr(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_mmr_by_vector(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and search."""
collection_name = "test_mmr_by_vector"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_mmr_by_vector_{i}" for i in range(len(texts))]
docsearch = VDMS.from_texts(
@ -222,10 +224,10 @@ def test_mmr_by_vector(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_with_include_parameter(vdms_client: vdms.vdms) -> None:
"""Test end to end construction and include parameter."""
collection_name = "test_with_include_parameter"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
docsearch = VDMS.from_texts(
texts=texts,
@ -233,19 +235,23 @@ def test_with_include_parameter(vdms_client: vdms.vdms) -> None:
collection_name=collection_name,
client=vdms_client,
)
response, response_array = docsearch.get(collection_name, include=["embeddings"])
assert response_array != []
for emb in embedding_function.embed_documents(texts):
assert embedding2bytes(emb) in response_array
response, response_array = docsearch.get(collection_name)
assert response_array == []
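The assertions above compare the blobs returned by `get(..., include=["embeddings"])` against `embedding2bytes(...)`, i.e. the raw byte packing of each vector. A short round-trip sketch; the float32 dtype is an assumption about how the helper packs values:

```python
# Sketch: pack a vector with embedding2bytes and recover it with numpy.
# Assumes the helper stores the values as 32-bit floats.
import numpy as np

from langchain_community.vectorstores.vdms import embedding2bytes

vec = [0.1, 0.2, 0.3]
blob = embedding2bytes(vec)
recovered = np.frombuffer(blob, dtype=np.float32)
print(recovered)  # ~[0.1, 0.2, 0.3], up to float32 precision
```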
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_update_document(vdms_client: vdms.vdms) -> None:
"""Test the update_document function in the VDMS class."""
collection_name = "test_update_document"
# Make a consistent embedding
embedding_function = ConsistentFakeEmbeddings()
const_embedding_function = ConsistentFakeEmbeddings()
# Initial document content and id
initial_content = "foo"
@ -259,10 +265,10 @@ def test_update_document(vdms_client: vdms.vdms) -> None:
client=vdms_client,
collection_name=collection_name,
documents=[original_doc],
embedding=embedding_function,
embedding=const_embedding_function,
ids=[document_id],
)
response, old_embedding = docsearch.get(
old_response, old_embedding = docsearch.get(
collection_name,
constraints={"id": ["==", document_id]},
include=["metadata", "embeddings"],
@ -281,17 +287,15 @@ def test_update_document(vdms_client: vdms.vdms) -> None:
)
# Perform a similarity search with the updated content
output = docsearch.similarity_search(updated_content, k=1)
output = docsearch.similarity_search(updated_content, k=3)[0]
# Assert that the updated document is returned by the search
assert output == [
Document(
page_content=updated_content, metadata={"page": "1", "id": document_id}
)
]
assert output == Document(
page_content=updated_content, metadata={"page": "1", "id": document_id}
)
# Assert that the new embedding is correct
response, new_embedding = docsearch.get(
new_response, new_embedding = docsearch.get(
collection_name,
constraints={"id": ["==", document_id]},
include=["metadata", "embeddings"],
@ -299,16 +303,21 @@ def test_update_document(vdms_client: vdms.vdms) -> None:
# new_embedding = response_array[0]
assert new_embedding[0] == embedding2bytes(
embedding_function.embed_documents([updated_content])[0]
const_embedding_function.embed_documents([updated_content])[0]
)
assert new_embedding != old_embedding
assert (
new_response[0]["FindDescriptor"]["entities"][0]["content"]
!= old_response[0]["FindDescriptor"]["entities"][0]["content"]
)
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_with_relevance_score(vdms_client: vdms.vdms) -> None:
"""Test to make sure the relevance score is scaled to 0-1."""
collection_name = "test_with_relevance_score"
embedding_function = FakeEmbeddings()
texts = ["foo", "bar", "baz"]
ids = [f"test_relevance_scores_{i}" for i in range(len(texts))]
metadatas = [{"page": str(i)} for i in range(1, len(texts) + 1)]
@ -320,7 +329,7 @@ def test_with_relevance_score(vdms_client: vdms.vdms) -> None:
collection_name=collection_name,
client=vdms_client,
)
output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
output = docsearch._similarity_search_with_relevance_scores("foo", k=3)
assert output == [
(Document(page_content="foo", metadata={"page": "1", "id": ids[0]}), 0.0),
(Document(page_content="bar", metadata={"page": "2", "id": ids[1]}), 0.25),
@ -329,24 +338,24 @@ def test_with_relevance_score(vdms_client: vdms.vdms) -> None:
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_add_documents_no_metadata(vdms_client: vdms.vdms) -> None:
collection_name = "test_add_documents_no_metadata"
embedding_function = FakeEmbeddings()
db = VDMS( # type: ignore[call-arg]
collection_name=collection_name,
embedding_function=embedding_function,
embedding=embedding_function,
client=vdms_client,
)
db.add_documents([Document(page_content="foo")])
@pytest.mark.requires("vdms")
@pytest.mark.enable_socket
def test_add_documents_mixed_metadata(vdms_client: vdms.vdms) -> None:
collection_name = "test_add_documents_mixed_metadata"
embedding_function = FakeEmbeddings()
db = VDMS( # type: ignore[call-arg]
collection_name=collection_name,
embedding_function=embedding_function,
embedding=embedding_function,
client=vdms_client,
)