{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BagelDB\n", "\n", "> [BagelDB](https://www.bageldb.ai/) (`Open Vector Database for AI`), is like GitHub for AI data.\n", "It is a collaborative platform where users can create,\n", "share, and manage vector datasets. It can support private projects for independent developers,\n", "internal collaborations for enterprises, and public contributions for data DAOs.\n", "\n", "### Installation and Setup\n", "\n", "```bash\n", "pip install betabageldb\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create VectorStore from texts" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from langchain.vectorstores import Bagel\n", "\n", "texts = [\"hello bagel\", \"hello langchain\", \"I love salad\", \"my car\", \"a dog\"]\n", "# create cluster and add texts\n", "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='hello bagel', metadata={}),\n", " Document(page_content='my car', metadata={}),\n", " Document(page_content='I love salad', metadata={})]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# similarity search\n", "cluster.similarity_search(\"bagel\", k=3)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(Document(page_content='hello bagel', metadata={}), 0.27392977476119995),\n", " (Document(page_content='my car', metadata={}), 1.4783176183700562),\n", " (Document(page_content='I love salad', metadata={}), 1.5342965126037598)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the score is a distance metric, so lower is better\n", "cluster.similarity_search_with_score(\"bagel\", k=3)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# delete the cluster\n", "cluster.delete_cluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create VectorStore from docs" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import TextLoader\n", "from langchain.text_splitter import CharacterTextSplitter\n", "\n", "loader = TextLoader(\"../../../state_of_the_union.txt\")\n", "documents = loader.load()\n", "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", "docs = text_splitter.split_documents(documents)[:10]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# create cluster with docs\n", "cluster = Bagel.from_documents(cluster_name=\"testing_with_docs\", documents=docs)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the \n" ] } ], "source": [ "# similarity search\n", "query = \"What did the president say about Ketanji Brown Jackson\"\n", "docs = cluster.similarity_search(query)\n", "print(docs[0].page_content[:102])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get all text/doc from Cluster" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "texts = [\"hello bagel\", \"this is langchain\"]\n", "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)\n", "cluster_data = cluster.get()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['ids', 'embeddings', 'metadatas', 'documents'])" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all keys\n", "cluster_data.keys()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ids': ['578c6d24-3763-11ee-a8ab-b7b7b34f99ba',\n", " '578c6d25-3763-11ee-a8ab-b7b7b34f99ba',\n", " 'fb2fc7d8-3762-11ee-a8ab-b7b7b34f99ba',\n", " 'fb2fc7d9-3762-11ee-a8ab-b7b7b34f99ba',\n", " '6b40881a-3762-11ee-a8ab-b7b7b34f99ba',\n", " '6b40881b-3762-11ee-a8ab-b7b7b34f99ba',\n", " '581e691e-3762-11ee-a8ab-b7b7b34f99ba',\n", " '581e691f-3762-11ee-a8ab-b7b7b34f99ba'],\n", " 'embeddings': None,\n", " 'metadatas': [{}, {}, {}, {}, {}, {}, {}, {}],\n", " 'documents': ['hello bagel',\n", " 'this is langchain',\n", " 'hello bagel',\n", " 'this is langchain',\n", " 'hello bagel',\n", " 'this is langchain',\n", " 'hello bagel',\n", " 'this is langchain']}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all values and keys\n", "cluster_data" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "cluster.delete_cluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create cluster with metadata & filter using metadata" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(Document(page_content='hello bagel', metadata={'source': 'notion'}), 0.0)]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "texts = [\"hello bagel\", \"this is langchain\"]\n", "metadatas = [{\"source\": \"notion\"}, {\"source\": \"google\"}]\n", "\n", "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts, metadatas=metadatas)\n", "cluster.similarity_search_with_score(\"hello bagel\", where={\"source\": \"notion\"})" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "# delete the cluster\n", "cluster.delete_cluster()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }