{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Activeloop's Deep Lake\n",
"\n",
">[Activeloop's Deep Lake](https://docs.activeloop.ai/) is a Multi-Modal Vector Store that stores embeddings and their metadata, including text, jsons, images, audio, video, and more. It saves the data locally, in your cloud, or on Activeloop storage. It performs hybrid search over embeddings and their attributes.\n",
"\n",
"This notebook showcases basic functionality related to `Activeloop's Deep Lake`. While `Deep Lake` can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders for deep learning frameworks.\n",
"\n",
"For more information, please see the Deep Lake [documentation](https://docs.activeloop.ai) or [API reference](https://docs.deeplake.ai)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install openai 'deeplake[enterprise]' tiktoken"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import DeepLake"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
"activeloop_token = getpass.getpass(\"activeloop token:\")\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a dataset locally at `./my_deeplake/`, then run similarity search. The Deep Lake + LangChain integration uses Deep Lake datasets under the hood, so `dataset` and `vector store` are used interchangeably. To create a dataset in your own cloud, or in Deep Lake storage, [adjust the path accordingly](https://docs.activeloop.ai/storage-and-credentials/storage-options)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake(\n",
" dataset_path=\"./my_deeplake/\", embedding_function=embeddings, overwrite=True\n",
")\n",
"db.add_documents(docs)\n",
"# or shorter\n",
"# db = DeepLake.from_documents(docs, dataset_path=\"./my_deeplake/\", embedding=embeddings, overwrite=True)\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Later, you can reload the dataset without recomputing embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake(\n",
" dataset_path=\"./my_deeplake/\", embedding_function=embeddings, read_only=True\n",
")\n",
"docs = db.similarity_search(query)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Deep Lake currently supports a single writer and multiple readers. Setting `read_only=True` helps to avoid acquiring the writer lock."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieval Question/Answering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"qa = RetrievalQA.from_chain_type(\n",
" llm=ChatOpenAI(model=\"gpt-3.5-turbo\"),\n",
" chain_type=\"stuff\",\n",
" retriever=db.as_retriever(),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"qa.run(query)"
]
},
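{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The chain can also report which chunks were used for the answer. Below is a minimal sketch using the standard `return_source_documents=True` option of `RetrievalQA` and calling the chain with a dict; the `qa_with_sources` name is just illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same chain, but also returning the source documents used for the answer\n",
"qa_with_sources = RetrievalQA.from_chain_type(\n",
" llm=ChatOpenAI(model=\"gpt-3.5-turbo\"),\n",
" chain_type=\"stuff\",\n",
" retriever=db.as_retriever(),\n",
" return_source_documents=True,\n",
")\n",
"result = qa_with_sources({\"query\": query})\n",
"result[\"result\"], [d.metadata for d in result[\"source_documents\"]]"
]
},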
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attribute-based filtering in metadata"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create another vector store containing metadata with the year the documents were created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"for d in docs:\n",
" d.metadata[\"year\"] = random.randint(2012, 2014)\n",
"\n",
"db = DeepLake.from_documents(\n",
" docs, embeddings, dataset_path=\"./my_deeplake/\", overwrite=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db.similarity_search(\n",
" \"What did the president say about Ketanji Brown Jackson\",\n",
" filter={\"metadata\": {\"year\": 2013}},\n",
")"
]
},
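{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The same metadata filter can be plugged into a retriever through `search_kwargs`, so attribute filtering composes with chains such as `RetrievalQA`. This is a minimal sketch that assumes `as_retriever` forwards `search_kwargs` to `similarity_search`; `filtered_retriever` is just an illustrative name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a retriever that only returns chunks whose metadata year is 2013\n",
"filtered_retriever = db.as_retriever(search_kwargs={\"filter\": {\"metadata\": {\"year\": 2013}}})\n",
"filtered_retriever.get_relevant_documents(\"What did the president say about Ketanji Brown Jackson\")"
]
},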
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing distance function\n",
"The distance function can be `L2` for Euclidean, `L1` for Nuclear, `max` for L-infinity distance, `cos` for cosine similarity, or `dot` for dot product."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db.similarity_search(\n",
" \"What did the president say about Ketanji Brown Jackson?\", distance_metric=\"cos\"\n",
")"
]
},
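{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To inspect how close each hit is under the chosen metric, a minimal sketch using `similarity_search_with_score` is shown below, which returns `(document, score)` pairs; it assumes `distance_metric` is forwarded the same way as in `similarity_search`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: retrieve documents together with their scores under cosine distance\n",
"docs_and_scores = db.similarity_search_with_score(\n",
" \"What did the president say about Ketanji Brown Jackson?\", distance_metric=\"cos\", k=4\n",
")\n",
"[(score, doc.page_content[:80]) for doc, score in docs_and_scores]"
]
},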
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Maximal Marginal Relevance\n",
"Maximal marginal relevance (MMR) balances relevance to the query with diversity among the returned documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db.max_marginal_relevance_search(\n",
" \"What did the president say about Ketanji Brown Jackson?\"\n",
")"
]
},
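{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"MMR can also be used when the vector store is wrapped as a retriever, so diverse results flow into chains. This minimal sketch relies on the standard `as_retriever(search_type=\"mmr\")` option; `k=4` is just an illustrative value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: an MMR-based retriever that can be dropped into chains such as RetrievalQA\n",
"mmr_retriever = db.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 4})\n",
"mmr_retriever.get_relevant_documents(\"What did the president say about Ketanji Brown Jackson?\")"
]
},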
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete dataset"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"db.delete_dataset()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If the delete fails, you can also force-delete the dataset by path:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"DeepLake.force_delete_by_path(\"./my_deeplake\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Lake datasets in the cloud (Activeloop, AWS, GCS, etc.) or in memory\n",
"By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, you can provide the [corresponding path and credentials when creating the vector store](https://docs.activeloop.ai/storage-and-credentials/storage-options). Some paths require registration with Activeloop and creation of an API token that can be [retrieved here](https://app.activeloop.ai/)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"ACTIVELOOP_TOKEN\"] = activeloop_token"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed and store the texts\n",
"username = \"<username>\" # your username on app.activeloop.ai\n",
"dataset_path = f\"hub://{username}/langchain_testing_python\" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.\n",
"\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings, overwrite=True)\n",
"db.add_documents(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `tensor_db` execution option"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To use Deep Lake's Managed Tensor Database, pass the runtime parameter `{'tensor_db': True}` when creating the vector store. This runs queries on the Managed Tensor Database instead of on the client side. Note that this option does not apply to datasets stored locally or in memory. If a vector store was already created outside of the Managed Tensor Database, it can be transferred to it by following the prescribed steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed and store the texts\n",
"username = \"adilkhan\" # your username on app.activeloop.ai\n",
"dataset_path = f\"hub://{username}/langchain_testing\"\n",
"\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"db = DeepLake(\n",
" dataset_path=dataset_path,\n",
" embedding_function=embeddings,\n",
" overwrite=True,\n",
" runtime={\"tensor_db\": True},\n",
")\n",
"db.add_documents(docs)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### TQL Search"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Queries can also be executed within the `similarity_search` method by writing them in Deep Lake's Tensor Query Language (TQL)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"search_id = db.vectorstore.dataset.id[0].numpy()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"docs = db.similarity_search(\n",
" query=None,\n",
" tql_query=f\"SELECT * WHERE id == '{search_id[0]}'\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating vector stores on AWS S3"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"s3://hub-2.0-datasets-n/langchain_test loaded successfully.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Evaluating ingest: 100%|██████████| 1/1 [00:10<00:00\n",
"\\"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='s3://hub-2.0-datasets-n/langchain_test', tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (4, 1536) float32 None \n",
" ids text (4, 1) str None \n",
" metadata json (4, 1) str None \n",
" text text (4, 1) str None \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"dataset_path = \"s3://BUCKET/langchain_test\" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"db = DeepLake.from_documents(\n",
" docs,\n",
" dataset_path=dataset_path,\n",
" embedding=embeddings,\n",
" overwrite=True,\n",
" creds={\n",
" \"aws_access_key_id\": os.environ[\"AWS_ACCESS_KEY_ID\"],\n",
" \"aws_secret_access_key\": os.environ[\"AWS_SECRET_ACCESS_KEY\"],\n",
" \"aws_session_token\": os.environ[\"AWS_SESSION_TOKEN\"], # Optional\n",
" },\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Lake API\n",
"You can access the underlying Deep Lake dataset at `db.vectorstore`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://adilkhan/langchain_testing', tensors=['embedding', 'id', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding embedding (42, 1536) float32 None \n",
" id text (42, 1) str None \n",
" metadata json (42, 1) str None \n",
" text text (42, 1) str None \n"
]
}
],
"source": [
"# get structure of the dataset\n",
"db.vectorstore.summary()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# get embeddings numpy array\n",
"embeds = db.vectorstore.dataset.embedding.numpy()"
]
},
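{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `db.vectorstore.dataset` is a regular Deep Lake dataset, the usual dataset operations are assumed to apply as well; this is a minimal sketch that checks the number of stored rows and lists the available tensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: number of rows and tensor names in the underlying Deep Lake dataset\n",
"ds = db.vectorstore.dataset\n",
"len(ds), list(ds.tensors)"
]
},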
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transfer local dataset to cloud\n",
"Copy an already-created dataset to the cloud. You can also transfer a dataset from the cloud to a local machine."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Copying dataset: 100%|██████████| 56/56 [00:38<00:00\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy\n",
"Your Deep Lake dataset has been successfully created!\n",
"The dataset is private so make sure you are logged in!\n"
]
},
{
"data": {
"text/plain": [
"Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import deeplake\n",
"\n",
"username = \"davitbun\" # your username on app.activeloop.ai\n",
"source = f\"hub://{username}/langchain_test\" # could be local, s3, gcs, etc.\n",
"destination = f\"hub://{username}/langchain_test_copy\" # could be local, s3, gcs, etc.\n",
"\n",
"deeplake.deepcopy(src=source, dest=destination, overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"hub://davitbun/langchain_test_copy loaded successfully.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Deep Lake Dataset in hub://davitbun/langchain_test_copy already exists, loading from the storage\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (4, 1536) float32 None \n",
" ids text (4, 1) str None \n",
" metadata json (4, 1) str None \n",
" text text (4, 1) str None \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Evaluating ingest: 100%|██████████| 1/1 [00:31<00:00\n",
"-"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (8, 1536) float32 None \n",
" ids text (8, 1) str None \n",
" metadata json (8, 1) str None \n",
" text text (8, 1) str None \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"data": {
"text/plain": [
"['ad42f3fe-e188-11ed-b66d-41c5f7b85421',\n",
" 'ad42f3ff-e188-11ed-b66d-41c5f7b85421',\n",
" 'ad42f400-e188-11ed-b66d-41c5f7b85421',\n",
" 'ad42f401-e188-11ed-b66d-41c5f7b85421']"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = DeepLake(dataset_path=destination, embedding_function=embeddings)\n",
"db.add_documents(docs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.6 ('langchain_venv': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"vscode": {
"interpreter": {
"hash": "0b0bacaffd430edc3085253ee7ee1bcda9f76a5e66b369dda8ba68baa6d14ba7"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}