langchain/docs/docs/integrations/vectorstores/vdms.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# Intel's Visual Data Management System (VDMS)\n",
"\n",
Intel's [VDMS]">
">Intel's [VDMS](https://github.com/IntelLabs/vdms) is a storage solution for efficient access of big-\"visual\"-data that aims to achieve cloud scale by searching for relevant visual data via visual metadata stored as a graph and enabling machine friendly enhancements to visual data for faster access. VDMS is licensed under MIT.\n",
"\n",
"VDMS supports:\n",
"* K nearest neighbor search\n",
"* Euclidean distance (L2) and inner product (IP)\n",
"* Libraries for indexing and computing distances: TileDBDense, TileDBSparse, FaissFlat (Default), FaissIVFFlat\n",
"* Vector and metadata searches\n",
"\n",
"VDMS has server and client components. To setup the server, see the [installation instructions](https://github.com/IntelLabs/vdms/blob/master/INSTALL.md) or use the [docker image](https://hub.docker.com/r/intellabs/vdms).\n",
"\n",
"This notebook shows how to use VDMS as a vector store using the docker image.\n",
"\n",
"To begin, install the Python packages for the VDMS client and Sentence Transformers:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2167badd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# Pip install necessary package\n",
"%pip install --upgrade --quiet pip sentence-transformers vdms \"unstructured-inference==0.6.6\";"
]
},
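{
"cell_type": "markdown",
"id": "distance-metrics-sketch",
"metadata": {},
"source": [
"To make the two distance metrics listed in the introduction concrete, here is a minimal pure-Python sketch. It is not VDMS code: the helper names `l2_distance` and `inner_product` and the toy vectors are illustrative assumptions. With L2, a smaller value means a closer match; with inner product, a larger value means a closer match. Whether an engine reports squared or unsquared L2 is implementation-specific, so treat the numbers as illustrative only.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def l2_distance(a, b):\n",
"    # Euclidean (L2) distance: smaller means more similar.\n",
"    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))\n",
"\n",
"\n",
"def inner_product(a, b):\n",
"    # Inner product (IP): larger means more similar.\n",
"    return sum(x * y for x, y in zip(a, b))\n",
"\n",
"\n",
"# Hypothetical 2-D vectors standing in for real embeddings.\n",
"query_vec = [1.0, 0.0]\n",
"candidates = {\"a\": [1.0, 0.0], \"b\": [0.0, 1.0], \"c\": [0.6, 0.8]}\n",
"\n",
"# Rank candidates under each metric (ascending for L2, descending for IP).\n",
"by_l2 = sorted(candidates, key=lambda name: l2_distance(query_vec, candidates[name]))\n",
"by_ip = sorted(\n",
"    candidates,\n",
"    key=lambda name: inner_product(query_vec, candidates[name]),\n",
"    reverse=True,\n",
")\n",
"print(by_l2)\n",
"print(by_ip)\n",
"```\n",
"\n",
"For these toy vectors both metrics produce the same ranking, but that is not guaranteed in general unless the vectors are normalized to unit length."
]
},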
{
"cell_type": "markdown",
"id": "af2b4512",
"metadata": {},
"source": [
"## Start VDMS Server\n",
"Here we start the VDMS server with port 55555."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4b1537c7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"e6061b270eef87de5319a6c5af709b36badcad8118069a8f6b577d2e01ad5e2d\n"
]
}
],
"source": [
"!docker run --rm -d -p 55555:55555 --name vdms_vs_test_nb intellabs/vdms:latest"
]
},
{
"cell_type": "markdown",
"id": "2b5ffbf8",
"metadata": {},
"source": [
"## Basic Example (using the Docker Container)\n",
"\n",
"In this basic example, we demonstrate adding documents into VDMS and using it as a vector database.\n",
"\n",
"You can run the VDMS Server in a Docker container separately to use with LangChain which connects to the server via the VDMS Python Client. \n",
"\n",
"VDMS has the ability to handle multiple collections of documents, but the LangChain interface expects one, so we need to specify the name of the collection . The default collection name used by LangChain is \"langchain\".\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5201ba0c",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"from langchain_community.document_loaders.text import TextLoader\n",
"from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings\n",
"from langchain_community.vectorstores import VDMS\n",
"from langchain_community.vectorstores.vdms import VDMS_Client\n",
"from langchain_text_splitters.character import CharacterTextSplitter\n",
"\n",
"time.sleep(2)\n",
"DELIMITER = \"-\" * 50\n",
"\n",
"# Connect to VDMS Vector Store\n",
"vdms_client = VDMS_Client(host=\"localhost\", port=55555)"
]
},
{
"cell_type": "markdown",
"id": "935069bc",
"metadata": {},
"source": [
"Here are some helper functions for printing results."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e78814eb",
"metadata": {},
"outputs": [],
"source": [
"def print_document_details(doc):\n",
" print(f\"Content:\\n\\t{doc.page_content}\\n\")\n",
" print(\"Metadata:\")\n",
" for key, value in doc.metadata.items():\n",
" if value != \"Missing property\":\n",
" print(f\"\\t{key}:\\t{value}\")\n",
"\n",
"\n",
"def print_results(similarity_results, score=True):\n",
" print(f\"{DELIMITER}\\n\")\n",
" if score:\n",
" for doc, score in similarity_results:\n",
" print(f\"Score:\\t{score}\\n\")\n",
" print_document_details(doc)\n",
" print(f\"{DELIMITER}\\n\")\n",
" else:\n",
" for doc in similarity_results:\n",
" print_document_details(doc)\n",
" print(f\"{DELIMITER}\\n\")\n",
"\n",
"\n",
"def print_response(list_of_entities):\n",
" for ent in list_of_entities:\n",
" for key, value in ent.items():\n",
" if value != \"Missing property\":\n",
" print(f\"\\n{key}:\\n\\t{value}\")\n",
" print(f\"{DELIMITER}\\n\")"
]
},
{
"cell_type": "markdown",
"id": "88229867",
"metadata": {},
"source": [
"### Load Document and Obtain Embedding Function\n",
"Here we load the most recent State of the Union Address and split the document into chunks. \n",
"\n",
"LangChain vector stores use a string/keyword `id` for bookkeeping documents. By default, `id` is a uuid but here we're defining it as an integer cast as a string. Additional metadata is also provided with the documents and the HuggingFaceEmbeddings are used for this example as the embedding function."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2ebfc16c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Documents: 42\n",
"# Embedding Dimensions: 768\n"
]
}
],
"source": [
"# load the document and split it into chunks\n",
"document_path = \"../../modules/state_of_the_union.txt\"\n",
"raw_documents = TextLoader(document_path).load()\n",
"\n",
"# split it into chunks\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(raw_documents)\n",
"ids = []\n",
"for doc_idx, doc in enumerate(docs):\n",
" ids.append(str(doc_idx + 1))\n",
" docs[doc_idx].metadata[\"id\"] = str(doc_idx + 1)\n",
" docs[doc_idx].metadata[\"page_number\"] = int(doc_idx + 1)\n",
" docs[doc_idx].metadata[\"president_included\"] = (\n",
" \"president\" in doc.page_content.lower()\n",
" )\n",
"print(f\"# Documents: {len(docs)}\")\n",
"\n",
"\n",
"# create the open-source embedding function\n",
"embedding = HuggingFaceEmbeddings()\n",
"print(\n",
" f\"# Embedding Dimensions: {len(embedding.embed_query('This is a test document.'))}\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a6a596f0",
"metadata": {},
"source": [
"### Similarity Search using Faiss Flat and Euclidean Distance (Default)\n",
"\n",
"In this section, we add the documents to VDMS using FAISS IndexFlat indexing (default) and Euclidena distance (default) as the distance metric for simiarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1f3f43d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"# add data\n",
"collection_name = \"my_collection_faiss_L2\"\n",
"db = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=collection_name,\n",
" embedding=embedding,\n",
")\n",
"\n",
"# Query (No metadata filtering)\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"returned_docs = db.similarity_search(query, k=k, filter=None)\n",
"print_results(returned_docs, score=False)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c2e36c18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tAnd for our LGBTQ+ Americans, lets finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n",
"\n",
"As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
"\n",
"While it often appears that we never agree, that isnt true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n",
"\n",
"And soon, well strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n",
"\n",
"So tonight Im offering a Unity Agenda for the Nation. Four big things we can do together. \n",
"\n",
"First, beat the opioid epidemic.\n",
"\n",
"Metadata:\n",
"\tid:\t35\n",
"\tpage_number:\t35\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Content:\n",
"\tLast month, I announced our plan to supercharge \n",
"the Cancer Moonshot that President Obama asked me to lead six years ago. \n",
"\n",
"Our goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases. \n",
"\n",
"More support for patients and families. \n",
"\n",
"To get there, I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health. \n",
"\n",
"Its based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more. \n",
"\n",
"ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimers, diabetes, and more. \n",
"\n",
"A unity agenda for the nation. \n",
"\n",
"We can do this. \n",
"\n",
"My fellow Americans—tonight , we have gathered in a sacred space—the citadel of our democracy. \n",
"\n",
"In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n",
"\n",
"We have fought for freedom, expanded liberty, defeated totalitarianism and terror.\n",
"\n",
"Metadata:\n",
"\tid:\t40\n",
"\tpage_number:\t40\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"# Query (with filtering)\n",
"k = 3\n",
"constraints = {\"page_number\": [\">\", 30], \"president_included\": [\"==\", True]}\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"returned_docs = db.similarity_search(query, k=k, filter=constraints)\n",
"print_results(returned_docs, score=False)"
]
},
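{
"cell_type": "markdown",
"id": "filter-constraints-sketch",
"metadata": {},
"source": [
"The `constraints` dictionary used above maps each metadata key to an `[operator, value]` pair. The sketch below illustrates that idea client-side in plain Python; it is not how VDMS evaluates filters internally. The `matches` helper and the sample metadata are hypothetical, and the assumption that all constraints must hold together (logical AND) is inferred from the filtered results above rather than from the VDMS documentation.\n",
"\n",
"```python\n",
"import operator\n",
"\n",
"# Only \">\" and \"==\" appear in this notebook; the other operators are assumptions.\n",
"OPS = {\n",
"    \">\": operator.gt,\n",
"    \">=\": operator.ge,\n",
"    \"<\": operator.lt,\n",
"    \"<=\": operator.le,\n",
"    \"==\": operator.eq,\n",
"    \"!=\": operator.ne,\n",
"}\n",
"\n",
"\n",
"def matches(metadata, constraints):\n",
"    # A document matches only if every constraint holds (logical AND).\n",
"    return all(\n",
"        key in metadata and OPS[op](metadata[key], value)\n",
"        for key, (op, value) in constraints.items()\n",
"    )\n",
"\n",
"\n",
"example_constraints = {\"page_number\": [\">\", 30], \"president_included\": [\"==\", True]}\n",
"sample_metadata = [\n",
"    {\"id\": \"32\", \"page_number\": 32, \"president_included\": True},\n",
"    {\"id\": \"33\", \"page_number\": 33, \"president_included\": False},\n",
"    {\"id\": \"2\", \"page_number\": 2, \"president_included\": True},\n",
"]\n",
"kept = [m[\"id\"] for m in sample_metadata if matches(m, example_constraints)]\n",
"print(kept)\n",
"```\n",
"\n",
"Only the entry with `id` 32 satisfies both conditions, which mirrors why documents 33 and 37 drop out of the filtered query results."
]
},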
{
"cell_type": "markdown",
"id": "a5984766",
"metadata": {},
"source": [
"### Similarity Search using TileDBDense and Euclidean Distance\n",
"\n",
"In this section, we add the documents to VDMS using TileDB Dense indexing and L2 as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3001ba6e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425491333\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.495247483253479\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008409023284912\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"db_tiledbD = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=\"my_collection_tiledbD_L2\",\n",
" embedding=embedding,\n",
" engine=\"TileDBDense\",\n",
" distance_strategy=\"L2\",\n",
")\n",
"\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db_tiledbD.similarity_search_with_score(query, k=k, filter=None)\n",
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "92ab3370",
"metadata": {},
"source": [
"### Similarity Search using Faiss IVFFlat and Euclidean Distance\n",
"\n",
"In this section, we add the documents to VDMS using Faiss IndexIVFFlat indexing and L2 as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "78f502cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032090425491333\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.495247483253479\n",
"\n",
"Content:\n",
"\tAs Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment theyre conducting on our children for profit. \n",
"\n",
"Its time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
"\n",
"And lets get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
"\n",
"Third, support our veterans. \n",
"\n",
"Veterans are the best of us. \n",
"\n",
"Ive always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
"\n",
"My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
"\n",
"Our troops in Iraq and Afghanistan faced many dangers.\n",
"\n",
"Metadata:\n",
"\tid:\t37\n",
"\tpage_number:\t37\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.5008409023284912\n",
"\n",
"Content:\n",
"\tA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"\n",
"Metadata:\n",
"\tid:\t33\n",
"\tpage_number:\t33\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"db_FaissIVFFlat = VDMS.from_documents(\n",
" docs,\n",
" client=vdms_client,\n",
" ids=ids,\n",
" collection_name=\"my_collection_FaissIVFFlat_L2\",\n",
" embedding=embedding,\n",
" engine=\"FaissIVFFlat\",\n",
" distance_strategy=\"L2\",\n",
")\n",
"# Query\n",
"k = 3\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db_FaissIVFFlat.similarity_search_with_score(query, k=k, filter=None)\n",
"print_results(docs_with_score)"
]
},
{
"cell_type": "markdown",
"id": "9ed3ec50",
"metadata": {},
"source": [
"### Update and Delete\n",
"\n",
"While building toward a real application, you want to go beyond adding data, and also update and delete data.\n",
"\n",
"Here is a basic example showing how to do so. First, we will update the metadata for the document most relevant to the query."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "81a02810",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original metadata: \n",
"\t{'id': '32', 'page_number': 32, 'president_included': True, 'source': '../../modules/state_of_the_union.txt'}\n",
"new metadata: \n",
"\t{'id': '32', 'page_number': 32, 'president_included': True, 'source': '../../modules/state_of_the_union.txt', 'new_value': 'hello world'}\n",
"--------------------------------------------------\n",
"\n",
"UPDATED ENTRY (id=32):\n",
"\n",
"content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"id:\n",
"\t32\n",
"\n",
"new_value:\n",
"\thello world\n",
"\n",
"page_number:\n",
"\t32\n",
"\n",
"president_included:\n",
"\tTrue\n",
"\n",
"source:\n",
"\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"doc = db.similarity_search(query)[0]\n",
"print(f\"Original metadata: \\n\\t{doc.metadata}\")\n",
"\n",
"# update the metadata for a document\n",
"doc.metadata[\"new_value\"] = \"hello world\"\n",
"print(f\"new metadata: \\n\\t{doc.metadata}\")\n",
"print(f\"{DELIMITER}\\n\")\n",
"\n",
"# Update document in VDMS\n",
"id_to_update = doc.metadata[\"id\"]\n",
"db.update_document(collection_name, id_to_update, doc)\n",
"response, response_array = db.get(\n",
" collection_name, constraints={\"id\": [\"==\", id_to_update]}\n",
")\n",
"\n",
"# Display Results\n",
"print(f\"UPDATED ENTRY (id={id_to_update}):\")\n",
"print_response([response[0][\"FindDescriptor\"][\"entities\"][0]])"
]
},
{
"cell_type": "markdown",
"id": "872a7dff",
"metadata": {},
"source": [
"Next we will delete the last document by ID (id=42)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "95537fe8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Documents before deletion: 42\n",
"Documents after deletion (id=42): 41\n"
]
}
],
"source": [
"print(\"Documents before deletion: \", db.count(collection_name))\n",
"\n",
"id_to_remove = ids[-1]\n",
"db.delete(collection_name=collection_name, ids=[id_to_remove])\n",
"print(f\"Documents after deletion (id={id_to_remove}): {db.count(collection_name)}\")"
]
},
{
"cell_type": "markdown",
"id": "18152965",
"metadata": {},
"source": [
"## Other Information\n",
"VDMS supports various types of visual data and operations. Some of the capabilities are integrated in the LangChain interface but additional workflow improvements will be added as VDMS is under continuous development.\n",
"\n",
"Addtional capabilities integrated into LangChain are below.\n",
"\n",
"### Similarity search by vector\n",
"Instead of searching by string query, you can also search by embedding/vector."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1db4d6ed",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n"
]
}
],
"source": [
"embedding_vector = embedding.embed_query(query)\n",
"returned_docs = db.similarity_search_by_vector(embedding_vector)\n",
"\n",
"# Print Results\n",
"print_document_details(returned_docs[0])"
]
},
{
"cell_type": "markdown",
"id": "daf718b2",
"metadata": {},
"source": [
"### Filtering on metadata\n",
"\n",
"It can be helpful to narrow down the collection before working with it.\n",
"\n",
"For example, collections can be filtered on metadata using the get method. A dictionary is used to filter metadata. Here we retrieve the document where `id = 2` and remove it from the vector store."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2bc0313b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Returned entry:\n",
"\n",
"blob:\n",
"\tTrue\n",
"\n",
"content:\n",
"\tGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n",
"\n",
"In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n",
"\n",
"Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \n",
"\n",
"Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \n",
"\n",
"Throughout our history weve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \n",
"\n",
"They keep moving. \n",
"\n",
"And the costs and the threats to America and the world keep rising. \n",
"\n",
"Thats why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \n",
"\n",
"The United States is a member along with 29 other nations. \n",
"\n",
"It matters. American diplomacy matters. American resolve matters.\n",
"\n",
"id:\n",
"\t2\n",
"\n",
"page_number:\n",
"\t2\n",
"\n",
"president_included:\n",
"\tTrue\n",
"\n",
"source:\n",
"\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"response, response_array = db.get(\n",
" collection_name,\n",
" limit=1,\n",
" include=[\"metadata\", \"embeddings\"],\n",
" constraints={\"id\": [\"==\", \"2\"]},\n",
")\n",
"\n",
"print(\"Returned entry:\")\n",
"print_response([response[0][\"FindDescriptor\"][\"entities\"][0]])\n",
"\n",
"# Delete id=2\n",
"db.delete(collection_name=collection_name, ids=[\"2\"]);"
]
},
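{
"cell_type": "markdown",
"id": "5c1e7a90",
"metadata": {},
"source": [
"The constraint dictionary above maps a metadata key to a list that starts with a comparison operator followed by a value. As a rough mental model only, the next cell sketches such a filter in plain Python over an in-memory list. The `matches` helper, the `OPS` table, and the sample `docs` are hypothetical, and the chained range form (`[\">=\", 2, \"<=\", 3]`) is an assumption about the constraint syntax rather than something shown in this notebook. The real filtering happens inside the VDMS server."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d4b2f36",
"metadata": {},
"outputs": [],
"source": [
"import operator\n",
"\n",
"# Hypothetical operator table for this sketch; not part of the VDMS client\n",
"OPS = {\n",
"    \"==\": operator.eq,\n",
"    \"!=\": operator.ne,\n",
"    \">\": operator.gt,\n",
"    \">=\": operator.ge,\n",
"    \"<\": operator.lt,\n",
"    \"<=\": operator.le,\n",
"}\n",
"\n",
"\n",
"def matches(metadata, constraints):\n",
"    \"\"\"Return True if `metadata` satisfies every constraint.\"\"\"\n",
"    for key, rule in constraints.items():\n",
"        if key not in metadata:\n",
"            return False\n",
"        # A rule is a flat list of operator/value pairs, e.g. [\"==\", \"2\"]\n",
"        for op, value in zip(rule[0::2], rule[1::2]):\n",
"            if not OPS[op](metadata[key], value):\n",
"                return False\n",
"    return True\n",
"\n",
"\n",
"# Made-up metadata records standing in for stored documents\n",
"docs = [\n",
"    {\"id\": \"1\", \"page_number\": 1},\n",
"    {\"id\": \"2\", \"page_number\": 2},\n",
"    {\"id\": \"3\", \"page_number\": 3},\n",
"]\n",
"\n",
"exact = [d for d in docs if matches(d, {\"id\": [\"==\", \"2\"]})]\n",
"in_range = [d[\"id\"] for d in docs if matches(d, {\"page_number\": [\">=\", 2, \"<=\", 3]})]\n",
"print(exact)\n",
"print(in_range)"
]
},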
{
"cell_type": "markdown",
"id": "794a7552",
"metadata": {},
"source": [
"### Retriever options\n",
"\n",
"This section goes over different options for how to use VDMS as a retriever.\n",
"\n",
"\n",
"#### Similarity Search\n",
"\n",
"Here we use similarity search in the retriever object.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "120f55eb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n"
]
}
],
"source": [
"retriever = db.as_retriever()\n",
"relevant_docs = retriever.get_relevant_documents(query)[0]\n",
"\n",
"print_document_details(relevant_docs)"
]
},
{
"cell_type": "markdown",
"id": "e8c0fb24",
"metadata": {},
"source": [
"#### Maximal Marginal Relevance Search (MMR)\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f00be6d0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n"
]
}
],
"source": [
"retriever = db.as_retriever(search_type=\"mmr\")\n",
"relevant_docs = retriever.get_relevant_documents(query)[0]\n",
"\n",
"print_document_details(relevant_docs)"
]
},
{
"cell_type": "markdown",
"id": "ffadbafc",
"metadata": {},
"source": [
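"We can also call MMR directly on the vector store, which returns a score with each document. Here `k` is the number of documents returned and `fetch_k` is the number of candidates fetched before MMR re-ranks them."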
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ab911470",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"\n",
"Score:\t1.2032092809677124\n",
"\n",
"Content:\n",
"\tTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Metadata:\n",
"\tid:\t32\n",
"\tnew_value:\thello world\n",
"\tpage_number:\t32\n",
"\tpresident_included:\tTrue\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n",
"Score:\t1.507053256034851\n",
"\n",
"Content:\n",
"\tBut cancer from prolonged exposure to burn pits ravaged Heaths lungs and body. \n",
"\n",
"Danielle says Heath was a fighter to the very end. \n",
"\n",
"He didnt know how to stop fighting, and neither did she. \n",
"\n",
"Through her pain she found purpose to demand we do better. \n",
"\n",
"Tonight, Danielle—we are. \n",
"\n",
"The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits. \n",
"\n",
"And tonight, Im announcing were expanding eligibility to veterans suffering from nine respiratory cancers. \n",
"\n",
"Im also calling on Congress: pass a law to make sure veterans devastated by toxic exposures in Iraq and Afghanistan finally get the benefits and comprehensive health care they deserve. \n",
"\n",
"And fourth, lets end cancer as we know it. \n",
"\n",
"This is personal to me and Jill, to Kamala, and to so many of you. \n",
"\n",
"Cancer is the #2 cause of death in Americasecond only to heart disease.\n",
"\n",
"Metadata:\n",
"\tid:\t39\n",
"\tpage_number:\t39\n",
"\tpresident_included:\tFalse\n",
"\tsource:\t../../modules/state_of_the_union.txt\n",
"--------------------------------------------------\n",
"\n"
]
}
],
"source": [
"mmr_resp = db.max_marginal_relevance_search_with_score(query, k=2, fetch_k=10)\n",
"print_results(mmr_resp)"
]
},
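{
"cell_type": "markdown",
"id": "3e8a6c14",
"metadata": {},
"source": [
"To make the `k` and `fetch_k` arguments above more concrete, the next cell is a minimal, self-contained sketch of greedy MMR re-ranking in plain Python. It is not the VDMS or LangChain implementation: the `mmr_select` and `cosine_similarity` helpers, the `lambda_mult` trade-off weight, the use of cosine similarity, and the toy 2-D vectors are all assumptions made for illustration. The candidate list plays the role of the `fetch_k` nearest neighbors, from which `k` results are chosen by trading relevance to the query against similarity to results already selected."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7f0d2e5",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"\n",
"def cosine_similarity(a, b):\n",
"    \"\"\"Cosine similarity between two equal-length, non-zero vectors.\"\"\"\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"\n",
"def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):\n",
"    \"\"\"Greedily pick up to k candidate indices, balancing relevance and diversity.\"\"\"\n",
"    relevance = [cosine_similarity(query_vec, c) for c in candidate_vecs]\n",
"    selected = []\n",
"    remaining = list(range(len(candidate_vecs)))\n",
"\n",
"    def mmr_score(i):\n",
"        # Redundancy is the highest similarity to anything already selected\n",
"        redundancy = max(\n",
"            (cosine_similarity(candidate_vecs[i], candidate_vecs[j]) for j in selected),\n",
"            default=0.0,\n",
"        )\n",
"        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy\n",
"\n",
"    while remaining and len(selected) < k:\n",
"        best = max(remaining, key=mmr_score)\n",
"        selected.append(best)\n",
"        remaining.remove(best)\n",
"    return selected\n",
"\n",
"\n",
"# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is less relevant but different\n",
"query_vec = [1.0, 0.0]\n",
"candidates = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]\n",
"\n",
"# Equal weighting skips the near-duplicate in favor of the more diverse candidate\n",
"diverse = mmr_select(query_vec, candidates, k=2, lambda_mult=0.5)\n",
"# lambda_mult=1.0 ignores redundancy and reduces to ranking by relevance\n",
"relevance_only = mmr_select(query_vec, candidates, k=2, lambda_mult=1.0)\n",
"print(diverse, relevance_only)"
]
},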
{
"cell_type": "markdown",
"id": "190bc4b5",
"metadata": {},
"source": [
"### Delete collection\n",
"Previously, we removed a document by its `id`. Here, no IDs are provided, so all documents in the collection are removed."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "874e7af9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Documents before deletion: 40\n",
"Documents after deletion: 0\n"
]
}
],
"source": [
"print(\"Documents before deletion: \", db.count(collection_name))\n",
"\n",
"db.delete(collection_name=collection_name)\n",
"\n",
"print(\"Documents after deletion: \", db.count(collection_name))"
]
},
{
"cell_type": "markdown",
"id": "68b7a400",
"metadata": {},
"source": [
"## Stop VDMS Server"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "08931796",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"vdms_vs_test_nb\n"
]
}
],
"source": [
"!docker kill vdms_vs_test_nb"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}