Added full-text search support to the SingleStore vector store.
Full-text search can now be combined with vector search, which gives
users the following search strategies:
* Search solely by vector similarity.
* Conduct searches exclusively based on text similarity, utilizing
Lucene internally.
* Filter search results by text similarity score, with the option to
specify a threshold, followed by a search based on vector similarity.
* Filter results by vector similarity score before conducting a search
based on text similarity.
* Perform searches using a weighted sum of vector and text similarity
scores.
Added integration tests that cover all of these strategies.
Updated notebook with examples.
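The combined strategies listed above can be sketched in plain Python. This is an illustrative sketch only, not the SingleStoreDB or LangChain implementation: the function names, the score dictionaries, and the default weights are hypothetical, and both scores are assumed to be higher-is-better.

```python
from typing import Dict, List, Tuple


def weighted_sum_rank(
    vector_scores: Dict[str, float],
    text_scores: Dict[str, float],
    vector_weight: float = 0.5,
    text_weight: float = 0.5,
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Rank document ids by a weighted sum of vector and text scores."""
    # In this sketch only documents scored by both searches are candidates.
    common_ids = vector_scores.keys() & text_scores.keys()
    combined = {
        doc_id: vector_weight * vector_scores[doc_id]
        + text_weight * text_scores[doc_id]
        for doc_id in common_ids
    }
    # Highest combined score first, truncated to the top k.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


def filter_by_text_then_vector(
    vector_scores: Dict[str, float],
    text_scores: Dict[str, float],
    text_threshold: float,
    k: int = 4,
) -> List[str]:
    """Keep documents whose text score passes the threshold, then rank by vector score."""
    kept = [
        doc_id
        for doc_id, score in text_scores.items()
        if score >= text_threshold and doc_id in vector_scores
    ]
    return sorted(kept, key=lambda doc_id: vector_scores[doc_id], reverse=True)[:k]


# Hypothetical scores for three documents.
vector_scores = {"a": 0.9, "b": 0.2, "c": 0.5}
text_scores = {"a": 0.1, "b": 0.8, "c": 0.5}

ranked = weighted_sum_rank(vector_scores, text_scores, vector_weight=0.7, text_weight=0.3)
filtered = filter_by_text_then_vector(vector_scores, text_scores, text_threshold=0.4)
```

Filtering by vector similarity first and then ranking by text similarity is the same pattern with the two score dictionaries swapped.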
CC: @baskaryan, @hwchase17
---------
Co-authored-by: Volodymyr Tkachuk <vtkachuk-ua@singlestore.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
">[SingleStoreDB](https://singlestore.com/) is a high-performance distributed SQL database that supports deployment both in the [cloud](https://www.singlestore.com/cloud/) and on-premises. It provides vector storage, and vector functions including [dot_product](https://docs.singlestore.com/managed-service/en/reference/sql-reference/vector-functions/dot_product.html) and [euclidean_distance](https://docs.singlestore.com/managed-service/en/reference/sql-reference/vector-functions/euclidean_distance.html), thereby supporting AI applications that require text similarity matching. \n",
">[SingleStoreDB](https://singlestore.com/) is a robust, high-performance distributed SQL database solution designed to excel in both [cloud](https://www.singlestore.com/cloud/) and on-premises environments. Boasting a versatile feature set, it offers seamless deployment options while delivering unparalleled performance.\n",
"\n",
"This tutorial illustrates how to [work with vector data in SingleStoreDB](https://docs.singlestore.com/managed-service/en/developer-resources/functional-extensions/working-with-vector-data.html)."
"A standout feature of SingleStoreDB is its advanced support for vector storage and operations, making it an ideal choice for applications requiring intricate AI capabilities such as text similarity matching. With built-in vector functions like [dot_product](https://docs.singlestore.com/managed-service/en/reference/sql-reference/vector-functions/dot_product.html) and [euclidean_distance](https://docs.singlestore.com/managed-service/en/reference/sql-reference/vector-functions/euclidean_distance.html), SingleStoreDB empowers developers to implement sophisticated algorithms efficiently.\n",
"\n",
"For developers keen on leveraging vector data within SingleStoreDB, a comprehensive tutorial is available, guiding them through the intricacies of [working with vector data](https://docs.singlestore.com/managed-service/en/developer-resources/functional-extensions/working-with-vector-data.html). This tutorial delves into the Vector Store within SingleStoreDB, showcasing its ability to facilitate searches based on vector similarity. Leveraging vector indexes, queries can be executed with remarkable speed, enabling swift retrieval of relevant data.\n",
"\n",
"Moreover, SingleStoreDB's Vector Store seamlessly integrates with [full-text indexing based on Lucene](https://docs.singlestore.com/cloud/developer-resources/functional-extensions/working-with-full-text-search/), enabling powerful text similarity searches. Users can filter search results based on selected fields of document metadata objects, enhancing query precision.\n",
"\n",
"What sets SingleStoreDB apart is its ability to combine vector and full-text searches in various ways, offering flexibility and versatility. Whether prefiltering by text or vector similarity and selecting the most relevant data, or employing a weighted sum approach to compute a final similarity score, developers have multiple options at their disposal.\n",
"\n",
"In essence, SingleStoreDB provides a comprehensive solution for managing and querying vector data, offering unparalleled performance and flexibility for AI-driven applications."
"# we will use some artificial data for this example\n",
"docs = [\n",
" Document(\n",
" page_content=\"\"\"In the parched desert, a sudden rainstorm brought relief,\n",
" as the droplets danced upon the thirsty earth, rejuvenating the landscape\n",
" with the sweet scent of petrichor.\"\"\",\n",
" metadata={\"category\": \"rain\"},\n",
" ),\n",
" Document(\n",
" page_content=\"\"\"Amidst the bustling cityscape, the rain fell relentlessly,\n",
" creating a symphony of pitter-patter on the pavement, while umbrellas\n",
" bloomed like colorful flowers in a sea of gray.\"\"\",\n",
" metadata={\"category\": \"rain\"},\n",
" ),\n",
" Document(\n",
" page_content=\"\"\"High in the mountains, the rain transformed into a delicate\n",
" mist, enveloping the peaks in a mystical veil, where each droplet seemed to\n",
" whisper secrets to the ancient rocks below.\"\"\",\n",
" metadata={\"category\": \"rain\"},\n",
" ),\n",
" Document(\n",
" page_content=\"\"\"Blanketing the countryside in a soft, pristine layer, the\n",
" snowfall painted a serene tableau, muffling the world in a tranquil hush\n",
" as delicate flakes settled upon the branches of trees like nature's own \n",
" lacework.\"\"\",\n",
" metadata={\"category\": \"snow\"},\n",
" ),\n",
" Document(\n",
" page_content=\"\"\"In the urban landscape, snow descended, transforming\n",
" bustling streets into a winter wonderland, where the laughter of\n",
" children echoed amidst the flurry of snowballs and the twinkle of\n",
" holiday lights.\"\"\",\n",
" metadata={\"category\": \"snow\"},\n",
" ),\n",
" Document(\n",
" page_content=\"\"\"Atop the rugged peaks, snow fell with an unyielding\n",
" intensity, sculpting the landscape into a pristine alpine paradise,\n",
" where the frozen crystals shimmered under the moonlight, casting a\n",
" spell of enchantment over the wilderness below.\"\"\",\n",
" metadata={\"category\": \"snow\"},\n",
" ),\n",
"]\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
@ -101,11 +147,33 @@
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"query = \"trees in the snow\"\n",
"docs = docsearch.similarity_search(query) # Find documents that correspond to the query\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "51b2b552",
"metadata": {},
"source": [
"SingleStoreDB lets users refine search results by prefiltering on metadata fields. Filtering on specific metadata attributes narrows a query to the relevant subset of the data, so the results match the user's requirements more closely."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "389bf801",
"metadata": {},
"outputs": [],
"source": [
"query = \"trees branches\"\n",
"docs = docsearch.similarity_search(\n",
" query, filter={\"category\": \"snow\"}\n",
") # Find documents that correspond to the query and have category \"snow\"\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "035cba66",
@ -114,6 +182,70 @@
"Enhance your search efficiency with SingleStoreDB version 8.5 or above by leveraging [ANN vector indexes](https://docs.singlestore.com/cloud/reference/sql-reference/vector-functions/vector-indexing/). Setting `use_vector_index=True` when creating the vector store object activates this feature. If your vectors differ in dimensionality from the default OpenAI embedding size of 1536, be sure to specify the `vector_size` parameter accordingly."
]
},
{
"cell_type": "markdown",
"id": "5308afe5",
"metadata": {},
"source": [
"SingleStoreDB supports several search strategies. The default `VECTOR_ONLY` strategy computes similarity scores directly between vectors using operations such as `dot_product` or `euclidean_distance`. `TEXT_ONLY` uses Lucene-based full-text search, which suits text-centric applications. `FILTER_BY_TEXT` first filters results by text similarity and then compares vectors, whereas `FILTER_BY_VECTOR` first filters results by vector similarity and then assesses text similarity. Both `FILTER_BY_TEXT` and `FILTER_BY_VECTOR` require a full-text index. `WEIGHTED_SUM` computes the final similarity score as a weighted sum of the vector and text similarities; it supports only `dot_product` distance calculations and also requires a full-text index. The three hybrid strategies, `FILTER_BY_TEXT`, `FILTER_BY_VECTOR`, and `WEIGHTED_SUM`, combine vector and text-based search, so users can tune retrieval to their use case."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17db0116",
"metadata": {},
"outputs": [],
"source": [
"docsearch = SingleStoreDB.from_documents(\n",
" docs,\n",
" embeddings,\n",
" distance_strategy=DistanceStrategy.DOT_PRODUCT, # Use dot product for similarity search\n",
" use_vector_index=True, # Use vector index for faster search\n",
" use_full_text_search=True, # Use full text index\n",