mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
a30f98f534
Addition of Vespa vector store integration including notebook showing its use. Maintainer: @lesters Twitter handle: LesterSolbakken
883 lines
25 KiB
Plaintext
883 lines
25 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ce0f17b9",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Vespa\n",
|
|
"\n",
|
|
">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n",
|
|
"\n",
|
|
"This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n",
|
|
"\n",
|
|
"In order to create the vector store, we use\n",
|
|
"[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n",
|
|
"connection a `Vespa` service."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7e6a11ab-38bd-4920-ba11-60cb2f075754",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#!pip install pyvespa"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Using the `pyvespa` package, you can either connect to a\n",
|
|
"[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n",
|
|
"or a local\n",
|
|
"[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n",
|
|
"Here, we will create a new Vespa application and deploy that using Docker.\n",
|
|
"\n",
|
|
"#### Creating a Vespa application\n",
|
|
"\n",
|
|
"First, we need to create an application package:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import ApplicationPackage, Field, RankProfile\n",
|
|
"\n",
|
|
"app_package = ApplicationPackage(name=\"testapp\")\n",
|
|
"app_package.schema.add_fields(\n",
|
|
" Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n",
|
|
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
|
|
" indexing=[\"attribute\", \"summary\"],\n",
|
|
" attribute=[f\"distance-metric: angular\"]),\n",
|
|
")\n",
|
|
"app_package.schema.add_rank_profile(\n",
|
|
" RankProfile(name=\"default\",\n",
|
|
" first_phase=\"closeness(field, embedding)\",\n",
|
|
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
|
" )\n",
|
|
")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This sets up a Vespa application with a schema for each document that contains\n",
|
|
"two fields: `text` for holding the document text and `embedding` for holding\n",
|
|
"the embedding vector. The `text` field is set up to use a BM25 index for\n",
|
|
"efficient text retrieval, and we'll see how to use this and hybrid search a\n",
|
|
"bit later.\n",
|
|
"\n",
|
|
"The `embedding` field is set up with a vector of length 384 to hold the\n",
|
|
"embedding representation of the text. See\n",
|
|
"[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n",
|
|
"for more on tensors in Vespa.\n",
|
|
"\n",
|
|
"Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n",
|
|
"instruct Vespa how to order documents. Here we set this up with a\n",
|
|
"[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n",
|
|
"\n",
|
|
"Now we can deploy this application locally:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "c10dd962",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.deployment import VespaDocker\n",
|
|
"\n",
|
|
"vespa_docker = VespaDocker()\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3df4ce53",
|
|
"metadata": {},
|
|
"source": [
|
|
"This deploys and creates a connection to a `Vespa` service. In case you\n",
|
|
"already have a Vespa application running, for instance in the cloud,\n",
|
|
"please refer to the PyVespa application for how to connect.\n",
|
|
"\n",
|
|
"#### Creating a Vespa vector store\n",
|
|
"\n",
|
|
"Now, let's load some documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.document_loaders import TextLoader\n",
|
|
"from langchain.text_splitter import CharacterTextSplitter\n",
|
|
"\n",
|
|
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
|
|
"documents = loader.load()\n",
|
|
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
|
"docs = text_splitter.split_documents(documents)\n",
|
|
"\n",
|
|
"from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n",
|
|
"\n",
|
|
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Here, we also set up local sentence embedder to transform the text to embedding\n",
|
|
"vectors. One could also use OpenAI embeddings, but the vector length needs to\n",
|
|
"be updated to `1536` to reflect the larger size of that embedding.\n",
|
|
"\n",
|
|
"To feed these to Vespa, we need to configure how the vector store should map to\n",
|
|
"fields in the Vespa application. Then we create the vector store directly from\n",
|
|
"this set of documents:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"vespa_config = dict(\n",
|
|
" page_content_field=\"text\",\n",
|
|
" embedding_field=\"embedding\",\n",
|
|
" input_field=\"query_embedding\"\n",
|
|
")\n",
|
|
"\n",
|
|
"from langchain.vectorstores import VespaStore\n",
|
|
"\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This creates a Vespa vector store and feeds that set of documents to Vespa.\n",
|
|
"The vector store takes care of calling the embedding function for each document\n",
|
|
"and inserts them into the database.\n",
|
|
"\n",
|
|
"We can now query the vector store:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7ccca1f4",
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"\n",
|
|
"print(results[0].page_content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1e7e34e1",
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This will use the embedding function given above to create a representation\n",
|
|
"for the query and use that to search Vespa. Note that this will use the\n",
|
|
"`default` ranking function, which we set up in the application package\n",
|
|
"above. You can use the `ranking` argument to `similarity_search` to\n",
|
|
"specify which ranking function to use.\n",
|
|
"\n",
|
|
"Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n",
|
|
"for more information.\n",
|
|
"\n",
|
|
"This covers the basic usage of the Vespa store in LangChain.\n",
|
|
"Now you can return the results and continue using these in LangChain.\n",
|
|
"\n",
|
|
"#### Updating documents\n",
|
|
"\n",
|
|
"An alternative to calling `from_documents`, you can create the vector\n",
|
|
"store directly and call `add_texts` from that. This can also be used to update\n",
|
|
"documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"result = results[0]\n",
|
|
"\n",
|
|
"result.page_content = \"UPDATED: \" + result.page_content\n",
|
|
"db.add_texts([result.page_content], [result.metadata], result.metadata[\"id\"])\n",
|
|
"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"print(results[0].page_content)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"However, the `pyvespa` library contains methods to manipulate\n",
|
|
"content on Vespa which you can use directly.\n",
|
|
"\n",
|
|
"#### Deleting documents\n",
|
|
"\n",
|
|
"You can delete documents using the `delete` function:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"result = db.similarity_search(query)\n",
|
|
"# docs[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
|
"\n",
|
|
"db.delete([\"32\"])\n",
|
|
"result = db.similarity_search(query)\n",
|
|
"# docs[0].metadata[\"id\"] != \"id:testapp:testapp::32\""
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Again, the `pyvespa` connection contains methods to delete documents as well.\n",
|
|
"\n",
|
|
"### Returning with scores\n",
|
|
"\n",
|
|
"The `similarity_search` method only returns the documents in order of\n",
|
|
"relevancy. To retrieve the actual scores:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"results = db.similarity_search_with_score(query)\n",
|
|
"result = results[0]\n",
|
|
"# result[1] ~= 0.463"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This is a result of using the `\"all-MiniLM-L6-v2\"` embedding model using the\n",
|
|
"cosine distance function (as given by the argument `angular` in the\n",
|
|
"application function).\n",
|
|
"\n",
|
|
"Different embedding functions need different distance functions, and Vespa\n",
|
|
"needs to know which distance function to use when orderings documents.\n",
|
|
"Please refer to the\n",
|
|
"[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n",
|
|
"for more information.\n",
|
|
"\n",
|
|
"### As retriever\n",
|
|
"\n",
|
|
"To use this vector store as a\n",
|
|
"[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/)\n",
|
|
"simply call the `as_retriever` function, which is a standard vector store\n",
|
|
"method:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
|
"retriever = db.as_retriever()\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = retriever.get_relevant_documents(query)\n",
|
|
"\n",
|
|
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This allows for more general, unstructured, retrieval from the vector store.\n",
|
|
"\n",
|
|
"### Metadata\n",
|
|
"\n",
|
|
"In the example so far, we've only used the text and the embedding for that\n",
|
|
"text. Documents usually contain additional information, which in LangChain\n",
|
|
"is referred to as metadata.\n",
|
|
"\n",
|
|
"Vespa can contain many fields with different types by adding them to the application\n",
|
|
"package:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"app_package.schema.add_fields(\n",
|
|
" # ...\n",
|
|
" Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
|
|
" Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n",
|
|
" Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
|
|
" # ...\n",
|
|
")\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"We can add some metadata fields in the documents:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"# Add metadata\n",
|
|
"for i, doc in enumerate(docs):\n",
|
|
" doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n",
|
|
" doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n",
|
|
" doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"And let the Vespa vector store know about these fields:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Now, when searching for these documents, these fields will be returned.\n",
|
|
"Also, these fields can be filtered on:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query, filter=\"rating > 3\")\n",
|
|
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n",
|
|
"# results[0].metadata[\"author\"] == \"Unknown\""
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"### Custom query\n",
|
|
"\n",
|
|
"If the default behavior of the similarity search does not fit your\n",
|
|
"requirements, you can always provide your own query. Thus, you don't\n",
|
|
"need to provide all of the configuration to the vector store, but\n",
|
|
"rather just write this yourself.\n",
|
|
"\n",
|
|
"First, let's add a BM25 ranking function to our application:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import FieldSet\n",
|
|
"\n",
|
|
"app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n",
|
|
"app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Then, to perform a regular text search based on BM25:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"custom_query = {\n",
|
|
" \"yql\": f\"select * from sources * where userQuery()\",\n",
|
|
" \"query\": query,\n",
|
|
" \"type\": \"weakAnd\",\n",
|
|
" \"ranking\": \"bm25\",\n",
|
|
" \"hits\": 4\n",
|
|
"}\n",
|
|
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
|
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
|
"# results[0][1] ~= 14.384"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"All of the powerful search and query capabilities of Vespa can be used\n",
|
|
"by using a custom query. Please refer to the Vespa documentation on it's\n",
|
|
"[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n",
|
|
"\n",
|
|
"### Hybrid search\n",
|
|
"\n",
|
|
"Hybrid search means using both a classic term-based search such as\n",
|
|
"BM25 and a vector search and combining the results. We need to create\n",
|
|
"a new rank profile for hybrid search on Vespa:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"app_package.schema.add_rank_profile(\n",
|
|
" RankProfile(name=\"hybrid\",\n",
|
|
" first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n",
|
|
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
|
" )\n",
|
|
")\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Here, we score each document as a combination of it's BM25 score and its\n",
|
|
"distance score. We can query using a custom query:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"query_embedding = embedding_function.embed_query(query)\n",
|
|
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n",
|
|
"custom_query = {\n",
|
|
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n",
|
|
" \"query\": query,\n",
|
|
" \"type\": \"weakAnd\",\n",
|
|
" \"input.query(query_embedding)\": query_embedding,\n",
|
|
" \"ranking\": \"hybrid\",\n",
|
|
" \"hits\": 4\n",
|
|
"}\n",
|
|
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
|
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")\n",
|
|
"# results[0][1] ~= 2.897"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"### Native embedders in Vespa\n",
|
|
"\n",
|
|
"Up until this point we've used an embedding function in Python to provide\n",
|
|
"embeddings for the texts. Vespa supports embedding function natively, so\n",
|
|
"you can defer this calculation in to Vespa. One benefit is the ability to use\n",
|
|
"GPUs when embedding documents if you have a large collections.\n",
|
|
"\n",
|
|
"Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n",
|
|
"for more information.\n",
|
|
"\n",
|
|
"First, we need to modify our application package:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import Component, Parameter\n",
|
|
"\n",
|
|
"app_package.components = [\n",
|
|
" Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n",
|
|
" parameters=[\n",
|
|
" Parameter(\"transformer-model\", {\"path\": \"...\"}),\n",
|
|
" Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n",
|
|
" ]\n",
|
|
" )\n",
|
|
"]\n",
|
|
"Field(name=\"hfembedding\", type=\"tensor<float>(x[384])\",\n",
|
|
" is_document_field=False,\n",
|
|
" indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n",
|
|
" attribute=[f\"distance-metric: angular\"],\n",
|
|
" )\n",
|
|
"app_package.schema.add_rank_profile(\n",
|
|
" RankProfile(name=\"hf_similarity\",\n",
|
|
" first_phase=\"closeness(field, hfembedding)\",\n",
|
|
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
|
" )\n",
|
|
")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Please refer to the embeddings documentation on adding embedder models\n",
|
|
"and tokenizers to the application. Note that the `hfembedding` field\n",
|
|
"includes instructions for embedding using the `hf-embedder`.\n",
|
|
"\n",
|
|
"Now we can query with a custom query:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(internalembedding, query_embedding)\"\n",
|
|
"custom_query = {\n",
|
|
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n",
|
|
" \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n",
|
|
" \"ranking\": \"internal_similarity\",\n",
|
|
" \"hits\": 4\n",
|
|
"}\n",
|
|
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
|
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")\n",
|
|
"# results[0][1] ~= 0.630"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Note that the query here includes an `embed` instruction to embed the query\n",
|
|
"using the same model as for the documents.\n",
|
|
"\n",
|
|
"### Approximate nearest neighbor\n",
|
|
"\n",
|
|
"In all of the above examples, we've used exact nearest neighbor to\n",
|
|
"find results. However, for large collections of documents this is\n",
|
|
"not feasible as one has to scan through all documents to find the\n",
|
|
"best matches. To avoid this, we can use\n",
|
|
"[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n",
|
|
"\n",
|
|
"First, we can change the embedding field to create a HNSW index:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import HNSW\n",
|
|
"\n",
|
|
"app_package.schema.add_fields(\n",
|
|
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
|
|
" indexing=[\"attribute\", \"summary\", \"index\"],\n",
|
|
" ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n",
|
|
" )\n",
|
|
")\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This creates a HNSW index on the embedding data which allows for efficient\n",
|
|
"searching. With this set, we can easily search using ANN by setting\n",
|
|
"the `approximate` argument to `True`:"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query, approximate=True)\n",
|
|
"# results[0][0].metadata[\"id\"], \"id:testapp:testapp::32\")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"This covers most of the functionality in the Vespa vector store in LangChain.\n",
|
|
"\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
} |