You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/integrations/vectorstores/astradb.ipynb

544 lines
15 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "66d0270a-b74f-4110-901e-7960b00297af",
"metadata": {},
"source": [
"# Astra DB\n",
"\n",
"This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a Vector Store."
]
},
{
"cell_type": "markdown",
"id": "ab8cd64f-3bb2-4f16-a0a9-12d7b1789bf6",
"metadata": {},
"source": [
"> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API."
]
},
{
"cell_type": "markdown",
"id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
"metadata": {},
"source": [
"_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
]
},
{
"cell_type": "markdown",
"id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
"metadata": {},
"source": [
"## Setup and general dependencies"
]
},
{
"cell_type": "markdown",
"id": "dbe7c156-0413-47e3-9237-4769c4248869",
"metadata": {},
"source": [
"Use of the integration requires the corresponding Python package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d00fcf4-9798-4289-9214-d9734690adfc",
"metadata": {},
"outputs": [],
"source": [
"pip install --upgrade langchain-astradb"
]
},
{
"cell_type": "markdown",
"id": "2453d83a-bc8f-41e1-a692-befe4dd90156",
"metadata": {},
"source": [
"_**Note.** the following are all packages required to run the full demo on this page. Depending on your LangChain setup, some of them may need to be installed:_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56c1f86e-5921-4976-ac8f-1d62e5a512b0",
"metadata": {},
"outputs": [],
"source": [
"pip install langchain langchain-openai datasets pypdf"
]
},
{
"cell_type": "markdown",
"id": "c2910035-e61f-48d9-a110-d68c401b62aa",
"metadata": {},
"source": [
"### Import dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06619af-fea2-4863-8149-7f239a8c9c82",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"from datasets import (\n",
" load_dataset,\n",
")\n",
"from langchain_community.document_loaders import PyPDFLoader\n",
"from langchain_core.documents import Document\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1983f1da-0ae7-4a9b-bf4c-4ade328f7a3a",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c656df06-e938-4bc5-b570-440b8b7a0189",
"metadata": {},
"outputs": [],
"source": [
"embe = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "22866f09-e10d-4f05-a24b-b9420129462e",
"metadata": {},
"source": [
"## Import the Vector Store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b32730d-176e-414c-9d91-fd3644c54211",
"metadata": {},
"outputs": [],
"source": [
"from langchain_astradb import AstraDBVectorStore"
]
},
{
"cell_type": "markdown",
"id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
"metadata": {},
"source": [
"## Connection parameters\n",
"\n",
"These are found on your Astra DB dashboard:\n",
"\n",
"- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
"- the Token looks like `AstraCS:6gBhNmsk135....`\n",
"- you may optionally provide a _Namespace_ such as `my_namespace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_APPLICATION_TOKEN = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n",
"\n",
"desired_namespace = input(\"(optional) Namespace = \")\n",
"if desired_namespace:\n",
" ASTRA_DB_KEYSPACE = desired_namespace\n",
"else:\n",
" ASTRA_DB_KEYSPACE = None"
]
},
{
"cell_type": "markdown",
"id": "196268bd-a950-41c3-bede-f5b55f6a0804",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
"metadata": {},
"outputs": [],
"source": [
"vstore = AstraDBVectorStore(\n",
" embedding=embe,\n",
" collection_name=\"astra_vector_demo\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_APPLICATION_TOKEN,\n",
" namespace=ASTRA_DB_KEYSPACE,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
"metadata": {},
"source": [
"## Load a dataset"
]
},
{
"cell_type": "markdown",
"id": "552e56b0-301a-4b06-99c7-57ba6faa966f",
"metadata": {},
"source": [
"Convert each entry in the source dataset into a `Document`, then write them into the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
"metadata": {},
"outputs": [],
"source": [
"philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
"\n",
"docs = []\n",
"for entry in philo_dataset:\n",
" metadata = {\"author\": entry[\"author\"]}\n",
" doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
" docs.append(doc)\n",
"\n",
"inserted_ids = vstore.add_documents(docs)\n",
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "79d4f436-ef04-4288-8f79-97c9abb983ed",
"metadata": {},
"source": [
"In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n",
"\n",
"_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._"
]
},
{
"cell_type": "markdown",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
"metadata": {},
"source": [
"Add some more entries, this time with `add_texts`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
"metadata": {},
"outputs": [],
"source": [
"texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
"metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
"ids = [\"desc_01\", \"huss_xy\"]\n",
"\n",
"inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "63840eb3-8b29-4017-bc2f-301bf5001f28",
"metadata": {},
"source": [
"_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for_\n",
"_these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings_\n",
"_for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._"
]
},
{
"cell_type": "markdown",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
"metadata": {},
"source": [
"## Run searches"
]
},
{
"cell_type": "markdown",
"id": "02a77d8e-1aae-4054-8805-01c77947c49f",
"metadata": {},
"source": [
"This section demonstrates metadata filtering and getting the similarity scores back:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1761806a-1afd-4491-867c-25a80d92b9fe",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
"metadata": {},
"outputs": [],
"source": [
"results_filtered = vstore.similarity_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"plato\"},\n",
")\n",
"for res in results_filtered:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "b14ea558-bfbe-41ce-807e-d70670060ada",
"metadata": {},
"source": [
"### MMR (Maximal-marginal-relevance) search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.max_marginal_relevance_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"aristotle\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "60fda5df-14e4-4fb0-bd17-65a393fab8a9",
"metadata": {},
"source": [
"### Async\n",
"\n",
"Note that the Astra DB vector store supports all fully async methods (`asimilarity_search`, `afrom_texts`, `adelete` and so on) natively, i.e. without thread wrapping involved."
]
},
{
"cell_type": "markdown",
"id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
"metadata": {},
"source": [
"## Deleting stored documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38a70ec4-b522-4d32-9ead-c642864fca37",
"metadata": {},
"outputs": [],
"source": [
"delete_1 = vstore.delete(inserted_ids[:3])\n",
"print(f\"all_succeed={delete_1}\") # True, all documents deleted"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
"metadata": {},
"outputs": [],
"source": [
"delete_2 = vstore.delete(inserted_ids[2:5])\n",
"print(f\"some_succeeds={delete_2}\") # True, though some IDs were gone already"
]
},
{
"cell_type": "markdown",
"id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
"metadata": {},
"source": [
"## A minimal RAG chain"
]
},
{
"cell_type": "markdown",
"id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
"metadata": {},
"source": [
"The next cells will implement a simple RAG pipeline:\n",
"- download a sample PDF file and load it onto the store;\n",
"- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
"- run the question-answering chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
"metadata": {},
"outputs": [],
"source": [
"!curl -L \\\n",
" \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
" -o \"what-is-philosophy.pdf\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
"metadata": {},
"outputs": [],
"source": [
"pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
"docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
"\n",
"print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
"inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
"print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
"metadata": {},
"outputs": [],
"source": [
"retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
"\n",
"philo_template = \"\"\"\n",
"You are a philosopher that draws inspiration from great thinkers of the past\n",
"to craft well-thought answers to user questions. Use the provided context as the basis\n",
"for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
"Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.\n",
"\n",
"CONTEXT:\n",
"{context}\n",
"\n",
"QUESTION: {question}\n",
"\n",
"YOUR ANSWER:\"\"\"\n",
"\n",
"philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | philo_prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
"metadata": {},
"outputs": [],
"source": [
"chain.invoke(\"How does Russel elaborate on Peirce's idea of the security blanket?\")"
]
},
{
"cell_type": "markdown",
"id": "869ab448-a029-4692-aefc-26b85513314d",
"metadata": {},
"source": [
"For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)."
]
},
{
"cell_type": "markdown",
"id": "177610c7-50d0-4b7b-8634-b03338054c8e",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "markdown",
"id": "0da4d19f-9878-4d3d-82c9-09cafca20322",
"metadata": {},
"source": [
"If you want to completely delete the collection from your Astra DB instance, run this.\n",
"\n",
"_(You will lose the data you stored in it.)_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd405a13-6f71-46fa-87e6-167238e9c25e",
"metadata": {},
"outputs": [],
"source": [
"vstore.delete_collection()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}