langchain/docs/docs/integrations/vectorstores/astradb.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "66d0270a-b74f-4110-901e-7960b00297af",
   "metadata": {},
   "source": [
    "# Astra DB\n",
    "\n",
    "This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a Vector Store."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab8cd64f-3bb2-4f16-a0a9-12d7b1789bf6",
   "metadata": {},
   "source": [
    "> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
   "metadata": {},
   "source": [
    "_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
   "metadata": {},
   "source": [
    "## Setup and general dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbe7c156-0413-47e3-9237-4769c4248869",
   "metadata": {},
   "source": [
    "Use of the integration requires the corresponding Python package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d00fcf4-9798-4289-9214-d9734690adfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install --upgrade langchain-astradb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2453d83a-bc8f-41e1-a692-befe4dd90156",
   "metadata": {},
   "source": [
    "_**Note.** the following are all packages required to run the full demo on this page. Depending on your LangChain setup, some of them may need to be installed:_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56c1f86e-5921-4976-ac8f-1d62e5a512b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install langchain langchain-openai datasets pypdf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2910035-e61f-48d9-a110-d68c401b62aa",
   "metadata": {},
   "source": [
    "### Import dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b06619af-fea2-4863-8149-7f239a8c9c82",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "from datasets import (\n",
    "    load_dataset,\n",
    ")\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1983f1da-0ae7-4a9b-bf4c-4ade328f7a3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c656df06-e938-4bc5-b570-440b8b7a0189",
   "metadata": {},
   "outputs": [],
   "source": [
    "embe = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22866f09-e10d-4f05-a24b-b9420129462e",
   "metadata": {},
   "source": [
    "## Import the Vector Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b32730d-176e-414c-9d91-fd3644c54211",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_astradb import AstraDBVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
   "metadata": {},
   "source": [
    "## Connection parameters\n",
    "\n",
    "These are found on your Astra DB dashboard:\n",
    "\n",
    "- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
    "- the Token looks like `AstraCS:6gBhNmsk135....`\n",
    "- you may optionally provide a _Namespace_ such as `my_namespace`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
    "ASTRA_DB_APPLICATION_TOKEN = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n",
    "\n",
    "desired_namespace = input(\"(optional) Namespace = \")\n",
    "if desired_namespace:\n",
    "    ASTRA_DB_KEYSPACE = desired_namespace\n",
    "else:\n",
    "    ASTRA_DB_KEYSPACE = None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "196268bd-a950-41c3-bede-f5b55f6a0804",
   "metadata": {},
   "source": [
    "Now you can create the vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
   "metadata": {},
   "outputs": [],
   "source": [
    "vstore = AstraDBVectorStore(\n",
    "    embedding=embe,\n",
    "    collection_name=\"astra_vector_demo\",\n",
    "    api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
    "    token=ASTRA_DB_APPLICATION_TOKEN,\n",
    "    namespace=ASTRA_DB_KEYSPACE,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
   "metadata": {},
   "source": [
    "## Load a dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "552e56b0-301a-4b06-99c7-57ba6faa966f",
   "metadata": {},
   "source": [
    "Convert each entry in the source dataset into a `Document`, then write them into the vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
    "\n",
    "docs = []\n",
    "for entry in philo_dataset:\n",
    "    metadata = {\"author\": entry[\"author\"]}\n",
    "    doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
    "    docs.append(doc)\n",
    "\n",
    "inserted_ids = vstore.add_documents(docs)\n",
    "print(f\"\\nInserted {len(inserted_ids)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79d4f436-ef04-4288-8f79-97c9abb983ed",
   "metadata": {},
   "source": [
    "In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n",
    "\n",
    "_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "084d8802-ab39-4262-9a87-42eafb746f92",
   "metadata": {},
   "source": [
    "Add some more entries, this time with `add_texts`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
    "metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
    "ids = [\"desc_01\", \"huss_xy\"]\n",
    "\n",
    "inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
    "print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63840eb3-8b29-4017-bc2f-301bf5001f28",
   "metadata": {},
   "source": [
    "_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for_\n",
    "_these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings_\n",
    "_for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
   "metadata": {},
   "source": [
    "## Run searches"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a77d8e-1aae-4054-8805-01c77947c49f",
   "metadata": {},
   "source": [
    "This section demonstrates metadata filtering and getting the similarity scores back:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1761806a-1afd-4491-867c-25a80d92b9fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
    "for res in results:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "results_filtered = vstore.similarity_search(\n",
    "    \"Our life is what we make of it\",\n",
    "    k=3,\n",
    "    filter={\"author\": \"plato\"},\n",
    ")\n",
    "for res in results_filtered:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
    "for res, score in results:\n",
    "    print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b14ea558-bfbe-41ce-807e-d70670060ada",
   "metadata": {},
   "source": [
    "### MMR (Maximal-marginal-relevance) search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.max_marginal_relevance_search(\n",
    "    \"Our life is what we make of it\",\n",
    "    k=3,\n",
    "    filter={\"author\": \"aristotle\"},\n",
    ")\n",
    "for res in results:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60fda5df-14e4-4fb0-bd17-65a393fab8a9",
   "metadata": {},
   "source": [
    "### Async\n",
    "\n",
    "Note that the Astra DB vector store supports all fully async methods (`asimilarity_search`, `afrom_texts`, `adelete` and so on) natively, i.e. without thread wrapping involved."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
   "metadata": {},
   "source": [
    "## Deleting stored documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38a70ec4-b522-4d32-9ead-c642864fca37",
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_1 = vstore.delete(inserted_ids[:3])\n",
    "print(f\"all_succeed={delete_1}\")  # True, all documents deleted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_2 = vstore.delete(inserted_ids[2:5])\n",
    "print(f\"some_succeeds={delete_2}\")  # True, though some IDs were gone already"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
   "metadata": {},
   "source": [
    "## A minimal RAG chain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
   "metadata": {},
   "source": [
    "The next cells will implement a simple RAG pipeline:\n",
    "- download a sample PDF file and load it onto the store;\n",
    "- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
    "- run the question-answering chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "!curl -L \\\n",
    "    \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
    "    -o \"what-is-philosophy.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
    "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
    "docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
    "\n",
    "print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
    "inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
    "print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
    "\n",
    "philo_template = \"\"\"\n",
    "You are a philosopher that draws inspiration from great thinkers of the past\n",
    "to craft well-thought answers to user questions. Use the provided context as the basis\n",
    "for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
    "Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.\n",
    "\n",
    "CONTEXT:\n",
    "{context}\n",
    "\n",
    "QUESTION: {question}\n",
    "\n",
    "YOUR ANSWER:\"\"\"\n",
    "\n",
    "philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
    "\n",
    "llm = ChatOpenAI()\n",
    "\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | philo_prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain.invoke(\"How does Russel elaborate on Peirce's idea of the security blanket?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "869ab448-a029-4692-aefc-26b85513314d",
   "metadata": {},
   "source": [
    "For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "177610c7-50d0-4b7b-8634-b03338054c8e",
   "metadata": {},
   "source": [
    "## Cleanup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0da4d19f-9878-4d3d-82c9-09cafca20322",
   "metadata": {},
   "source": [
    "If you want to completely delete the collection from your Astra DB instance, run this.\n",
    "\n",
    "_(You will lose the data you stored in it.)_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd405a13-6f71-46fa-87e6-167238e9c25e",
   "metadata": {},
   "outputs": [],
   "source": [
    "vstore.delete_collection()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}