langchain/docs/extras/integrations/vectorstores/bageldb.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BagelDB\n",
    "\n",
    "> [BagelDB](https://www.bageldb.ai/) (`Open Vector Database for AI`), is like GitHub for AI data.\n",
    "It is a collaborative platform where users can create,\n",
    "share, and manage vector datasets. It can support private projects for independent developers,\n",
    "internal collaborations for enterprises, and public contributions for data DAOs.\n",
    "\n",
    "### Installation and Setup\n",
    "\n",
    "```bash\n",
    "pip install betabageldb\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create VectorStore from texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores import Bagel\n",
    "\n",
    "texts = [\"hello bagel\", \"hello langchain\", \"I love salad\", \"my car\", \"a dog\"]\n",
    "# create cluster and add texts\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='hello bagel', metadata={}),\n",
       " Document(page_content='my car', metadata={}),\n",
       " Document(page_content='I love salad', metadata={})]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# similarity search\n",
    "cluster.similarity_search(\"bagel\", k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(Document(page_content='hello bagel', metadata={}), 0.27392977476119995),\n",
       " (Document(page_content='my car', metadata={}), 1.4783176183700562),\n",
       " (Document(page_content='I love salad', metadata={}), 1.5342965126037598)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the score is a distance metric, so lower is better\n",
    "cluster.similarity_search_with_score(\"bagel\", k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete the cluster\n",
    "cluster.delete_cluster()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create VectorStore from docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "\n",
    "loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create cluster with docs\n",
    "cluster = Bagel.from_documents(cluster_name=\"testing_with_docs\", documents=docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the \n"
     ]
    }
   ],
   "source": [
    "# similarity search\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = cluster.similarity_search(query)\n",
    "print(docs[0].page_content[:102])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get all text/doc from Cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = [\"hello bagel\", \"this is langchain\"]\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)\n",
    "cluster_data = cluster.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['ids', 'embeddings', 'metadatas', 'documents'])"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# all keys\n",
    "cluster_data.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ids': ['578c6d24-3763-11ee-a8ab-b7b7b34f99ba',\n",
       "  '578c6d25-3763-11ee-a8ab-b7b7b34f99ba',\n",
       "  'fb2fc7d8-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  'fb2fc7d9-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '6b40881a-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '6b40881b-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '581e691e-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '581e691f-3762-11ee-a8ab-b7b7b34f99ba'],\n",
       " 'embeddings': None,\n",
       " 'metadatas': [{}, {}, {}, {}, {}, {}, {}, {}],\n",
       " 'documents': ['hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain']}"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# all values and keys\n",
    "cluster_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster.delete_cluster()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create cluster with metadata & filter using metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(Document(page_content='hello bagel', metadata={'source': 'notion'}), 0.0)]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "texts = [\"hello bagel\", \"this is langchain\"]\n",
    "metadatas = [{\"source\": \"notion\"}, {\"source\": \"google\"}]\n",
    "\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts, metadatas=metadatas)\n",
    "cluster.similarity_search_with_score(\"hello bagel\", where={\"source\": \"notion\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete the cluster\n",
    "cluster.delete_cluster()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
BagelDB (bageldb.ai), VectorStore integration. (#8971) - Description: [BagelDB](bageldb.ai) a collaborative vector database. Integrated the bageldb PyPi package with langchain with related tests and code. - Issue: Not applicable. - Dependencies: `betabageldb` PyPi package. - Tag maintainer: @rlancemartin, @eyurtsev, @baskaryan - Twitter handle: bageldb_ai (https://twitter.com/BagelDB_ai) We ran `make format`, `make lint` and `make test` locally. Followed the contribution guideline thoroughly https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --------- Co-authored-by: Towhid1 <nurulaktertowhid@gmail.com> 2023-08-10 23:48:36 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# BagelDB\n",`
			`"\n",`
			"> [BagelDB](https://www.bageldb.ai/) (`Open Vector Database for AI`), is like GitHub for AI data.\n",
			`"It is a collaborative platform where users can create,\n",`
			`"share, and manage vector datasets. It can support private projects for independent developers,\n",`
			`"internal collaborations for enterprises, and public contributions for data DAOs.\n",`
			`"\n",`
			`"### Installation and Setup\n",`
			`"\n",`
			"```bash\n",
			`"pip install betabageldb\n",`
			"```\n",
			`"\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Create VectorStore from texts"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.vectorstores import Bagel\n",`
			`"\n",`
			`"texts = [\"hello bagel\", \"hello langchain\", \"I love salad\", \"my car\", \"a dog\"]\n",`
			`"# create cluster and add texts\n",`
			`"cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[Document(page_content='hello bagel', metadata={}),\n",`
			`" Document(page_content='my car', metadata={}),\n",`
			`" Document(page_content='I love salad', metadata={})]"`
			`]`
			`},`
			`"execution_count": 11,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# similarity search\n",`
			`"cluster.similarity_search(\"bagel\", k=3)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 12,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[(Document(page_content='hello bagel', metadata={}), 0.27392977476119995),\n",`
			`" (Document(page_content='my car', metadata={}), 1.4783176183700562),\n",`
			`" (Document(page_content='I love salad', metadata={}), 1.5342965126037598)]"`
			`]`
			`},`
			`"execution_count": 12,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# the score is a distance metric, so lower is better\n",`
			`"cluster.similarity_search_with_score(\"bagel\", k=3)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 13,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# delete the cluster\n",`
			`"cluster.delete_cluster()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Create VectorStore from docs"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 33,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.document_loaders import TextLoader\n",`
			`"from langchain.text_splitter import CharacterTextSplitter\n",`
			`"\n",`
			`"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",`
			`"documents = loader.load()\n",`
			`"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",`
			`"docs = text_splitter.split_documents(documents)[:10]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 36,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# create cluster with docs\n",`
			`"cluster = Bagel.from_documents(cluster_name=\"testing_with_docs\", documents=docs)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 37,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the \n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# similarity search\n",`
			`"query = \"What did the president say about Ketanji Brown Jackson\"\n",`
			`"docs = cluster.similarity_search(query)\n",`
			`"print(docs[0].page_content[:102])"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Get all text/doc from Cluster"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 53,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"texts = [\"hello bagel\", \"this is langchain\"]\n",`
			`"cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)\n",`
			`"cluster_data = cluster.get()"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 54,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"dict_keys(['ids', 'embeddings', 'metadatas', 'documents'])"`
			`]`
			`},`
			`"execution_count": 54,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# all keys\n",`
			`"cluster_data.keys()"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 56,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"{'ids': ['578c6d24-3763-11ee-a8ab-b7b7b34f99ba',\n",`
			`" '578c6d25-3763-11ee-a8ab-b7b7b34f99ba',\n",`
			`" 'fb2fc7d8-3762-11ee-a8ab-b7b7b34f99ba',\n",`
			`" 'fb2fc7d9-3762-11ee-a8ab-b7b7b34f99ba',\n",`
			`" '6b40881a-3762-11ee-a8ab-b7b7b34f99ba',\n",`
			`" '6b40881b-3762-11ee-a8ab-b7b7b34f99ba',\n",`
			`" '581e691e-3762-11ee-a8ab-b7b7b34f99ba',\n",`
			`" '581e691f-3762-11ee-a8ab-b7b7b34f99ba'],\n",`
			`" 'embeddings': None,\n",`
			`" 'metadatas': [{}, {}, {}, {}, {}, {}, {}, {}],\n",`
			`" 'documents': ['hello bagel',\n",`
			`" 'this is langchain',\n",`
			`" 'hello bagel',\n",`
			`" 'this is langchain',\n",`
			`" 'hello bagel',\n",`
			`" 'this is langchain',\n",`
			`" 'hello bagel',\n",`
			`" 'this is langchain']}"`
			`]`
			`},`
			`"execution_count": 56,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# all values and keys\n",`
			`"cluster_data"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 57,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"cluster.delete_cluster()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Create cluster with metadata & filter using metadata"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 63,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[(Document(page_content='hello bagel', metadata={'source': 'notion'}), 0.0)]"`
			`]`
			`},`
			`"execution_count": 63,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"texts = [\"hello bagel\", \"this is langchain\"]\n",`
			`"metadatas = [{\"source\": \"notion\"}, {\"source\": \"google\"}]\n",`
			`"\n",`
			`"cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts, metadatas=metadatas)\n",`
			`"cluster.similarity_search_with_score(\"hello bagel\", where={\"source\": \"notion\"})"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 64,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# delete the cluster\n",`
			`"cluster.delete_cluster()"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.10.12"`
			`},`
			`"orig_nbformat": 4`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 2`
			`}`