Add Vespa vector store (#11329)

Addition of Vespa vector store integration including notebook showing
its use.

Maintainer: @lesters 
Twitter handle: LesterSolbakken

@ -0,0 +1,883 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ce0f17b9",
"metadata": {},
"source": [
"# Vespa\n",
"\n",
">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n",
"\n",
"This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n",
"\n",
"In order to create the vector store, we use\n",
"[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n",
"connection a `Vespa` service."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e6a11ab-38bd-4920-ba11-60cb2f075754",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install pyvespa"
]
},
{
"cell_type": "markdown",
"source": [
"Using the `pyvespa` package, you can either connect to a\n",
"[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n",
"or a local\n",
"[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n",
"Here, we will create a new Vespa application and deploy that using Docker.\n",
"\n",
"#### Creating a Vespa application\n",
"\n",
"First, we need to create an application package:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from vespa.package import ApplicationPackage, Field, RankProfile\n",
"\n",
"app_package = ApplicationPackage(name=\"testapp\")\n",
"app_package.schema.add_fields(\n",
" Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n",
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
" indexing=[\"attribute\", \"summary\"],\n",
" attribute=[f\"distance-metric: angular\"]),\n",
")\n",
"app_package.schema.add_rank_profile(\n",
" RankProfile(name=\"default\",\n",
" first_phase=\"closeness(field, embedding)\",\n",
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
" )\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This sets up a Vespa application with a schema for each document that contains\n",
"two fields: `text` for holding the document text and `embedding` for holding\n",
"the embedding vector. The `text` field is set up to use a BM25 index for\n",
"efficient text retrieval, and we'll see how to use this and hybrid search a\n",
"bit later.\n",
"\n",
"The `embedding` field is set up with a vector of length 384 to hold the\n",
"embedding representation of the text. See\n",
"[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n",
"for more on tensors in Vespa.\n",
"\n",
"Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n",
"instruct Vespa how to order documents. Here we set this up with a\n",
"[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n",
"\n",
"Now we can deploy this application locally:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c10dd962",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from vespa.deployment import VespaDocker\n",
"\n",
"vespa_docker = VespaDocker()\n",
"vespa_app = vespa_docker.deploy(application_package=app_package)"
]
},
{
"cell_type": "markdown",
"id": "3df4ce53",
"metadata": {},
"source": [
"This deploys and creates a connection to a `Vespa` service. In case you\n",
"already have a Vespa application running, for instance in the cloud,\n",
"please refer to the PyVespa application for how to connect.\n",
"\n",
"#### Creating a Vespa vector store\n",
"\n",
"Now, let's load some documents:"
]
},
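{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from vespa.application import Vespa\n",
"\n",
"# A sketch only: the URL and port below are placeholders for your own endpoint.\n",
"# vespa_app = Vespa(url=\"https://my-vespa-app.example.com\", port=443)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Creating a Vespa vector store\n",
"\n",
"Now, let's load some documents:"
],
"metadata": {
"collapsed": false
}
},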
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n",
"\n",
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here, we also set up local sentence embedder to transform the text to embedding\n",
"vectors. One could also use OpenAI embeddings, but the vector length needs to\n",
"be updated to `1536` to reflect the larger size of that embedding.\n",
"\n",
"To feed these to Vespa, we need to configure how the vector store should map to\n",
"fields in the Vespa application. Then we create the vector store directly from\n",
"this set of documents:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"vespa_config = dict(\n",
" page_content_field=\"text\",\n",
" embedding_field=\"embedding\",\n",
" input_field=\"query_embedding\"\n",
")\n",
"\n",
"from langchain.vectorstores import VespaStore\n",
"\n",
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This creates a Vespa vector store and feeds that set of documents to Vespa.\n",
"The vector store takes care of calling the embedding function for each document\n",
"and inserts them into the database.\n",
"\n",
"We can now query the vector store:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ccca1f4",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = db.similarity_search(query)\n",
"\n",
"print(results[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "1e7e34e1",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This will use the embedding function given above to create a representation\n",
"for the query and use that to search Vespa. Note that this will use the\n",
"`default` ranking function, which we set up in the application package\n",
"above. You can use the `ranking` argument to `similarity_search` to\n",
"specify which ranking function to use.\n",
"\n",
"Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n",
"for more information.\n",
"\n",
"This covers the basic usage of the Vespa store in LangChain.\n",
"Now you can return the results and continue using these in LangChain.\n",
"\n",
"#### Updating documents\n",
"\n",
"An alternative to calling `from_documents`, you can create the vector\n",
"store directly and call `add_texts` from that. This can also be used to update\n",
"documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = db.similarity_search(query)\n",
"result = results[0]\n",
"\n",
"result.page_content = \"UPDATED: \" + result.page_content\n",
"db.add_texts([result.page_content], [result.metadata], result.metadata[\"id\"])\n",
"\n",
"results = db.similarity_search(query)\n",
"print(results[0].page_content)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"However, the `pyvespa` library contains methods to manipulate\n",
"content on Vespa which you can use directly.\n",
"\n",
"#### Deleting documents\n",
"\n",
"You can delete documents using the `delete` function:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"result = db.similarity_search(query)\n",
"# docs[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
"\n",
"db.delete([\"32\"])\n",
"result = db.similarity_search(query)\n",
"# docs[0].metadata[\"id\"] != \"id:testapp:testapp::32\""
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Again, the `pyvespa` connection contains methods to delete documents as well.\n",
"\n",
"### Returning with scores\n",
"\n",
"The `similarity_search` method only returns the documents in order of\n",
"relevancy. To retrieve the actual scores:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"results = db.similarity_search_with_score(query)\n",
"result = results[0]\n",
"# result[1] ~= 0.463"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This is a result of using the `\"all-MiniLM-L6-v2\"` embedding model using the\n",
"cosine distance function (as given by the argument `angular` in the\n",
"application function).\n",
"\n",
"Different embedding functions need different distance functions, and Vespa\n",
"needs to know which distance function to use when orderings documents.\n",
"Please refer to the\n",
"[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n",
"for more information.\n",
"\n",
"### As retriever\n",
"\n",
"To use this vector store as a\n",
"[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/)\n",
"simply call the `as_retriever` function, which is a standard vector store\n",
"method:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
"retriever = db.as_retriever()\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = retriever.get_relevant_documents(query)\n",
"\n",
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This allows for more general, unstructured, retrieval from the vector store.\n",
"\n",
"### Metadata\n",
"\n",
"In the example so far, we've only used the text and the embedding for that\n",
"text. Documents usually contain additional information, which in LangChain\n",
"is referred to as metadata.\n",
"\n",
"Vespa can contain many fields with different types by adding them to the application\n",
"package:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"app_package.schema.add_fields(\n",
" # ...\n",
" Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
" Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n",
" Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
" # ...\n",
")\n",
"vespa_app = vespa_docker.deploy(application_package=app_package)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We can add some metadata fields in the documents:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Add metadata\n",
"for i, doc in enumerate(docs):\n",
" doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n",
" doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n",
" doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"And let the Vespa vector store know about these fields:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Now, when searching for these documents, these fields will be returned.\n",
"Also, these fields can be filtered on:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = db.similarity_search(query, filter=\"rating > 3\")\n",
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n",
"# results[0].metadata[\"author\"] == \"Unknown\""
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Custom query\n",
"\n",
"If the default behavior of the similarity search does not fit your\n",
"requirements, you can always provide your own query. Thus, you don't\n",
"need to provide all of the configuration to the vector store, but\n",
"rather just write this yourself.\n",
"\n",
"First, let's add a BM25 ranking function to our application:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from vespa.package import FieldSet\n",
"\n",
"app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n",
"app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n",
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Then, to perform a regular text search based on BM25:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"custom_query = {\n",
" \"yql\": f\"select * from sources * where userQuery()\",\n",
" \"query\": query,\n",
" \"type\": \"weakAnd\",\n",
" \"ranking\": \"bm25\",\n",
" \"hits\": 4\n",
"}\n",
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
"# results[0][1] ~= 14.384"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"All of the powerful search and query capabilities of Vespa can be used\n",
"by using a custom query. Please refer to the Vespa documentation on it's\n",
"[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n",
"\n",
"### Hybrid search\n",
"\n",
"Hybrid search means using both a classic term-based search such as\n",
"BM25 and a vector search and combining the results. We need to create\n",
"a new rank profile for hybrid search on Vespa:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"app_package.schema.add_rank_profile(\n",
" RankProfile(name=\"hybrid\",\n",
" first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n",
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
" )\n",
")\n",
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here, we score each document as a combination of it's BM25 score and its\n",
"distance score. We can query using a custom query:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"query_embedding = embedding_function.embed_query(query)\n",
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n",
"custom_query = {\n",
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n",
" \"query\": query,\n",
" \"type\": \"weakAnd\",\n",
" \"input.query(query_embedding)\": query_embedding,\n",
" \"ranking\": \"hybrid\",\n",
" \"hits\": 4\n",
"}\n",
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")\n",
"# results[0][1] ~= 2.897"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Native embedders in Vespa\n",
"\n",
"Up until this point we've used an embedding function in Python to provide\n",
"embeddings for the texts. Vespa supports embedding function natively, so\n",
"you can defer this calculation in to Vespa. One benefit is the ability to use\n",
"GPUs when embedding documents if you have a large collections.\n",
"\n",
"Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n",
"for more information.\n",
"\n",
"First, we need to modify our application package:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from vespa.package import Component, Parameter\n",
"\n",
"app_package.components = [\n",
" Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n",
" parameters=[\n",
" Parameter(\"transformer-model\", {\"path\": \"...\"}),\n",
" Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n",
" ]\n",
" )\n",
"]\n",
"Field(name=\"hfembedding\", type=\"tensor<float>(x[384])\",\n",
" is_document_field=False,\n",
" indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n",
" attribute=[f\"distance-metric: angular\"],\n",
" )\n",
"app_package.schema.add_rank_profile(\n",
" RankProfile(name=\"hf_similarity\",\n",
" first_phase=\"closeness(field, hfembedding)\",\n",
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
" )\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Please refer to the embeddings documentation on adding embedder models\n",
"and tokenizers to the application. Note that the `hfembedding` field\n",
"includes instructions for embedding using the `hf-embedder`.\n",
"\n",
"Now we can query with a custom query:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(internalembedding, query_embedding)\"\n",
"custom_query = {\n",
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n",
" \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n",
" \"ranking\": \"internal_similarity\",\n",
" \"hits\": 4\n",
"}\n",
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")\n",
"# results[0][1] ~= 0.630"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Note that the query here includes an `embed` instruction to embed the query\n",
"using the same model as for the documents.\n",
"\n",
"### Approximate nearest neighbor\n",
"\n",
"In all of the above examples, we've used exact nearest neighbor to\n",
"find results. However, for large collections of documents this is\n",
"not feasible as one has to scan through all documents to find the\n",
"best matches. To avoid this, we can use\n",
"[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n",
"\n",
"First, we can change the embedding field to create a HNSW index:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from vespa.package import HNSW\n",
"\n",
"app_package.schema.add_fields(\n",
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
" indexing=[\"attribute\", \"summary\", \"index\"],\n",
" ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n",
" )\n",
")\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This creates a HNSW index on the embedding data which allows for efficient\n",
"searching. With this set, we can easily search using ANN by setting\n",
"the `approximate` argument to `True`:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = db.similarity_search(query, approximate=True)\n",
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
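{
"cell_type": "markdown",
"source": [
"Finally, the vector store can be plugged into a larger LangChain pipeline.\n",
"A minimal sketch, assuming an OpenAI API key is available in the environment;\n",
"any LLM supported by LangChain could be used instead:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"# A sketch only: builds a simple question-answering chain on top of the retriever.\n",
"qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), retriever=db.as_retriever())\n",
"# qa.run(\"What did the president say about Ketanji Brown Jackson?\")"
],
"metadata": {
"collapsed": false
}
},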
{
"cell_type": "markdown",
"source": [
"This covers most of the functionality in the Vespa vector store in LangChain.\n",
"\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -1,19 +1,16 @@
from __future__ import annotations
import json
from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union
from typing import Any, Dict, List, Literal, Optional, Sequence, Union
from langchain.callbacks.manager import CallbackManagerForRetrieverRun
from langchain.schema import BaseRetriever, Document
if TYPE_CHECKING:
from vespa.application import Vespa
class VespaRetriever(BaseRetriever):
"""`Vespa` retriever."""
app: Vespa
app: Any
"""Vespa application to query."""
body: Dict
"""Body of the query."""

@ -76,6 +76,7 @@ from langchain.vectorstores.usearch import USearch
from langchain.vectorstores.vald import Vald
from langchain.vectorstores.vearch import Vearch
from langchain.vectorstores.vectara import Vectara
from langchain.vectorstores.vespa import VespaStore
from langchain.vectorstores.weaviate import Weaviate
from langchain.vectorstores.zep import ZepVectorStore
from langchain.vectorstores.zilliz import Zilliz
@ -143,6 +144,7 @@ __all__ = [
"Vearch",
"Vectara",
"VectorStore",
"VespaStore",
"Weaviate",
"ZepVectorStore",
"Zilliz",

@ -0,0 +1,267 @@
from __future__ import annotations
from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union
from langchain.docstore.document import Document
from langchain.schema.embeddings import Embeddings
from langchain.vectorstores.base import VectorStore, VectorStoreRetriever
class VespaStore(VectorStore):
"""
`Vespa` vector store.
To use, you should have the python client library ``pyvespa`` installed.
Example:
.. code-block:: python
from langchain.vectorstores import VespaStore
from langchain.embeddings.openai import OpenAIEmbeddings
from vespa.application import Vespa
# Create a vespa client dependent upon your application,
# e.g. either connecting to Vespa Cloud or a local deployment
# such as Docker. Please refer to the PyVespa documentation on
# how to initialize the client.
vespa_app = Vespa(url="...", port=..., application_package=...)
# You need to instruct LangChain on which fields to use for embeddings
vespa_config = dict(
page_content_field="text",
embedding_field="embedding",
input_field="query_embedding",
metadata_fields=["date", "rating", "author"]
)
embedding_function = OpenAIEmbeddings()
vectorstore = VespaStore(vespa_app, embedding_function, **vespa_config)
"""
def __init__(
self,
app: Any,
embedding_function: Optional[Embeddings] = None,
page_content_field: Optional[str] = None,
embedding_field: Optional[str] = None,
input_field: Optional[str] = None,
metadata_fields: Optional[List[str]] = None,
) -> None:
"""
Initialize with a PyVespa client.
"""
try:
from vespa.application import Vespa
except ImportError:
raise ImportError(
"Could not import Vespa python package. "
"Please install it with `pip install pyvespa`."
)
if not isinstance(app, Vespa):
raise ValueError(
f"app should be an instance of vespa.application.Vespa, got {type(app)}"
)
self._vespa_app = app
self._embedding_function = embedding_function
self._page_content_field = page_content_field
self._embedding_field = embedding_field
self._input_field = input_field
self._metadata_fields = metadata_fields
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""
Add texts to the vectorstore.
Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
ids: Optional list of ids associated with the texts.
kwargs: vectorstore specific parameters
Returns:
List of ids from adding the texts into the vectorstore.
"""
embeddings = None
if self._embedding_function is not None:
embeddings = self._embedding_function.embed_documents(list(texts))
if ids is None:
ids = [str(f"{i+1}") for i, _ in enumerate(texts)]
batch = []
for i, text in enumerate(texts):
fields: Dict[str, Union[str, List[float]]] = {}
if self._page_content_field is not None:
fields[self._page_content_field] = text
if self._embedding_field is not None and embeddings is not None:
fields[self._embedding_field] = embeddings[i]
if metadatas is not None and self._metadata_fields is not None:
for metadata_field in self._metadata_fields:
if metadata_field in metadatas[i]:
fields[metadata_field] = metadatas[i][metadata_field]
batch.append({"id": ids[i], "fields": fields})
results = self._vespa_app.feed_batch(batch)
for result in results:
if not (str(result.status_code).startswith("2")):
raise RuntimeError(
f"Could not add document to Vespa. "
f"Error code: {result.status_code}. "
f"Message: {result.json['message']}"
)
return ids
def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
if ids is None:
return False
batch = [{"id": id} for id in ids]
result = self._vespa_app.delete_batch(batch)
return all(r.status_code == 200 for r in result)
def _create_query(
self, query_embedding: List[float], k: int = 4, **kwargs: Any
) -> Dict:
hits = k
doc_embedding_field = self._embedding_field
input_embedding_field = self._input_field
ranking_function = kwargs.get("ranking", "default")
filter = kwargs.get("filter")
approximate = kwargs.get("approximate", False)
approximate = "true" if approximate else "false"
yql = "select * from sources * where "
yql += f"{{targetHits: {hits}, approximate: {approximate}}}"
yql += f"nearestNeighbor({doc_embedding_field}, {input_embedding_field})"
if filter is not None:
yql += f" and {filter}"
query = {
"yql": yql,
f"input.query({input_embedding_field})": query_embedding,
"ranking": ranking_function,
"hits": hits,
}
return query
def similarity_search_by_vector_with_score(
self, query_embedding: List[float], k: int = 4, **kwargs: Any
) -> List[Tuple[Document, float]]:
"""
Performs a similarity search for a given embedding vector.
Args:
query_embedding: Embedding vector to search for.
k: Number of results to return.
custom_query: Use this custom query instead of the default query (kwargs)
kwargs: other vector store specific parameters
Returns:
List of (Document, score) tuples most similar to the query embedding.
"""
if "custom_query" in kwargs:
query = kwargs["custom_query"]
else:
query = self._create_query(query_embedding, k, **kwargs)
try:
response = self._vespa_app.query(body=query)
except Exception as e:
raise RuntimeError(
f"Could not retrieve data from Vespa: "
f"{e.args[0][0]['summary']}. "
f"Error: {e.args[0][0]['message']}"
)
if not str(response.status_code).startswith("2"):
raise RuntimeError(
f"Could not retrieve data from Vespa. "
f"Error code: {response.status_code}. "
f"Message: {response.json['message']}"
)
root = response.json["root"]
if "errors" in root:
import json
raise RuntimeError(json.dumps(root["errors"]))
if response is None or response.hits is None:
return []
docs = []
for child in response.hits:
page_content = child["fields"][self._page_content_field]
score = child["relevance"]
metadata = {"id": child["id"]}
if self._metadata_fields is not None:
for field in self._metadata_fields:
metadata[field] = child["fields"].get(field)
doc = Document(page_content=page_content, metadata=metadata)
docs.append((doc, score))
return docs
def similarity_search_by_vector(
self, embedding: List[float], k: int = 4, **kwargs: Any
) -> List[Document]:
results = self.similarity_search_by_vector_with_score(embedding, k, **kwargs)
return [r[0] for r in results]
def similarity_search_with_score(
self, query: str, k: int = 4, **kwargs: Any
) -> List[Tuple[Document, float]]:
query_emb = []
if self._embedding_function is not None:
query_emb = self._embedding_function.embed_query(query)
return self.similarity_search_by_vector_with_score(query_emb, k, **kwargs)
def similarity_search(
self, query: str, k: int = 4, **kwargs: Any
) -> List[Document]:
results = self.similarity_search_with_score(query, k, **kwargs)
return [r[0] for r in results]
def max_marginal_relevance_search(
self,
query: str,
k: int = 4,
fetch_k: int = 20,
lambda_mult: float = 0.5,
**kwargs: Any,
) -> List[Document]:
raise NotImplementedError("MMR search not implemented")
def max_marginal_relevance_search_by_vector(
self,
embedding: List[float],
k: int = 4,
fetch_k: int = 20,
lambda_mult: float = 0.5,
**kwargs: Any,
) -> List[Document]:
raise NotImplementedError("MMR search by vector not implemented")
@classmethod
def from_texts(
cls: Type[VespaStore],
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> VespaStore:
vespa = cls(embedding_function=embedding, **kwargs)
vespa.add_texts(texts=texts, metadatas=metadatas, ids=ids)
return vespa
def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever:
return super().as_retriever(**kwargs)