Add "Astra DB" vector store integration (#12966)

# Astra DB Vector store integration

- **Description:** This PR adds a `VectorStore` implementation for
DataStax Astra DB using its HTTP API
  - **Issue:** (no related issue)
- **Dependencies:** This integration requires `astrapy` (`>=0.5.3`),
added to pyproject.toml as an optional dependency, as per guidelines
- **Tag maintainer:** I recently mentioned to @baskaryan that this
integration was coming
  - **Twitter handle:** `@rsprrs` if you want to mention me

This PR introduces the `AstraDB` vector store class, extensive
integration test coverage, a reworked documentation section that
presents Cassandra and Astra DB together on a single "provider" page,
and a new, completely reworked vector-store example notebook (shared
with the Cassandra store, since parts of the flow are common to the two
APIs). I also took care to ensure the docs (and the redirects therein)
behave correctly.

All style, linting, type checks and tests pass as far as the `AstraDB`
integration is concerned.

I could build the documentation and verify it renders correctly, but I
ran into trouble with the `api_docs_build` makefile target, which I
could not verify: `Error: Unable to import module
'plan_and_execute.agent_executor' with error: No module named
'langchain_experimental'` was the first of many similar errors.

Thank you for a review!
Stefano

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Stefano Lottini 7 months ago committed by GitHub
parent 13bd83bd61
commit 4f4b020582

@@ -0,0 +1,85 @@
# Astra DB
This page lists the integrations available with [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) and [Apache Cassandra®](https://cassandra.apache.org/).
## Setup
Install the following Python package:
```bash
pip install "astrapy>=0.5.3"
```
## Astra DB
> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Cassandra and made conveniently available
> through an easy-to-use JSON API.
### Vector Store
```python
from langchain.vectorstores import AstraDB

vector_store = AstraDB(
    embedding=my_embedding,
    collection_name="my_store",
    api_endpoint="...",
    token="...",
)
```
Learn more in the [example notebook](/docs/integrations/vectorstores/astradb).
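Conceptually, a similarity search ranks the stored embedding vectors by closeness to the query embedding, typically via cosine similarity. A minimal, self-contained sketch of that ranking (plain Python illustration only, not the store's actual implementation, which delegates the search to the database):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=3):
    # Rank (doc_id, vector) pairs by similarity to the query; keep the best k.
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
print(top_k([1.0, 0.1], stored, k=2))  # ['a', 'c']
```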
## Apache Cassandra and Astra DB through CQL
> [Cassandra](https://cassandra.apache.org/) is a NoSQL, row-oriented, highly scalable and highly available database.
> Starting with version 5.0, the database ships with [vector search capabilities](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html).
> DataStax [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html) is a managed serverless database built on Cassandra, offering the same interface and strengths.
These databases use the CQL protocol (Cassandra Query Language).
Hence, a different set of connectors, outlined below, is needed.
### Vector Store
```python
from langchain.vectorstores import Cassandra

vector_store = Cassandra(
    embedding=my_embedding,
    table_name="my_store",
)
```
Learn more in the [example notebook](/docs/integrations/vectorstores/astradb) (scroll down to the CQL-specific section).
### Memory
```python
from langchain.memory import CassandraChatMessageHistory
message_history = CassandraChatMessageHistory(session_id="my-session")
```
Learn more in the [example notebook](/docs/integrations/memory/cassandra_chat_message_history).
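Under the hood, a chat message history is an append-only, session-keyed list of messages. A toy in-memory sketch of the idea (`ToyChatMessageHistory` is hypothetical; the real class persists messages to the database per `session_id`):

```python
from collections import defaultdict

class ToyChatMessageHistory:
    # Toy stand-in: messages are grouped by session_id, in insertion order.
    _store = defaultdict(list)

    def __init__(self, session_id):
        self.session_id = session_id

    def add_user_message(self, text):
        self._store[self.session_id].append(("human", text))

    def add_ai_message(self, text):
        self._store[self.session_id].append(("ai", text))

    @property
    def messages(self):
        return list(self._store[self.session_id])

history = ToyChatMessageHistory(session_id="my-session")
history.add_user_message("Hi there!")
history.add_ai_message("Hello! How can I help?")
print(history.messages)  # [('human', 'Hi there!'), ('ai', 'Hello! How can I help?')]
```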
### LLM Cache
```python
import langchain
from langchain.cache import CassandraCache

langchain.llm_cache = CassandraCache()
```
Learn more in the [example notebook](/docs/integrations/llms/llm_caching) (scroll to the Cassandra section).
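An exact-match LLM cache returns a stored response only when the prompt and the LLM configuration match an earlier call verbatim. A toy in-memory sketch of that behavior (illustrative only, not the `CassandraCache` implementation):

```python
class ToyExactMatchCache:
    # Keys responses on the exact (prompt, llm_string) pair.
    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_string):
        # Returns the cached response, or None on a miss.
        return self._store.get((prompt, llm_string))

    def update(self, prompt, llm_string, response):
        self._store[(prompt, llm_string)] = response

cache = ToyExactMatchCache()
cache.update("What is Cassandra?", "my-llm", "A distributed NoSQL database.")
print(cache.lookup("What is Cassandra?", "my-llm"))  # hit: the stored response
print(cache.lookup("what is cassandra?", "my-llm"))  # None: exact match only
```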
### Semantic LLM Cache
```python
from langchain.cache import CassandraSemanticCache

cassSemanticCache = CassandraSemanticCache(
    embedding=my_embedding,
    table_name="my_store",
)
```
Learn more in the [example notebook](/docs/integrations/llms/llm_caching) (scroll to the appropriate section).
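A semantic cache, by contrast, serves a stored response whenever a new prompt lands close enough in embedding space to a previous one. A toy sketch of the idea (`toy_embed`, a letter-frequency vector, is a hypothetical stand-in for a real embedding model; this is not the `CassandraSemanticCache` implementation):

```python
import math

def toy_embed(text):
    # Hypothetical stand-in for an embedding model: letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToySemanticCache:
    # Serves a cached response when a prompt is similar enough to a stored one.
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def update(self, prompt, response):
        self.entries.append((toy_embed(prompt), response))

    def lookup(self, prompt):
        query = toy_embed(prompt)
        for embedding, response in self.entries:
            if cosine(query, embedding) >= self.threshold:
                return response
        return None

cache = ToySemanticCache()
cache.update("Tell me about Cassandra", "A distributed NoSQL database.")
print(cache.lookup("tell me about cassandra!"))  # hit despite different wording
```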

@@ -1,35 +0,0 @@
# Cassandra
>[Apache Cassandra®](https://cassandra.apache.org/) is a free and open-source, distributed, wide-column
> store, NoSQL database management system designed to handle large amounts of data across many commodity servers,
> providing high availability with no single point of failure. Cassandra offers support for clusters spanning
> multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
> Cassandra was designed to implement a combination of _Amazon's Dynamo_ distributed storage and replication
> techniques combined with _Google's Bigtable_ data and storage engine model.
## Installation and Setup
```bash
pip install cassandra-driver
pip install cassio
```
## Vector Store
See a [usage example](/docs/integrations/vectorstores/cassandra).
```python
from langchain.vectorstores import Cassandra
```
## Memory
See a [usage example](/docs/integrations/memory/cassandra_chat_message_history).
```python
from langchain.memory import CassandraChatMessageHistory
```

@@ -0,0 +1,749 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
"metadata": {},
"source": [
"# Astra DB\n",
"\n",
"This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) and [Apache Cassandra®](https://cassandra.apache.org/) as a Vector Store.\n",
"\n",
"_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
]
},
{
"cell_type": "markdown",
"id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
"metadata": {},
"source": [
"### Setup and general dependencies"
]
},
{
"cell_type": "markdown",
"id": "dbe7c156-0413-47e3-9237-4769c4248869",
"metadata": {},
"source": [
"Use of the integration requires the following Python package.\n",
"\n",
"_Note: depending on your LangChain setup, you may need to install other dependencies needed for this demo._"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d00fcf4-9798-4289-9214-d9734690adfc",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet \"astrapy>=0.5.3\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06619af-fea2-4863-8149-7f239a8c9c82",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"from datasets import load_dataset # if not present yet, run: pip install \"datasets==2.14.6\"\n",
"\n",
"from langchain.schema import Document\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"from langchain.schema.output_parser import StrOutputParser"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1983f1da-0ae7-4a9b-bf4c-4ade328f7a3a",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c656df06-e938-4bc5-b570-440b8b7a0189",
"metadata": {},
"outputs": [],
"source": [
"embe = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "dd8caa76-bc41-429e-a93b-989ba13aff01",
"metadata": {},
"source": [
"_Keep reading to connect with Astra DB. For usage with Apache Cassandra and Astra DB through CQL, scroll to the section below._"
]
},
{
"cell_type": "markdown",
"id": "22866f09-e10d-4f05-a24b-b9420129462e",
"metadata": {},
"source": [
"## Astra DB"
]
},
{
"cell_type": "markdown",
"id": "5fba47cc-3533-42fc-84b7-9dc14cd68b2b",
"metadata": {},
"source": [
"DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Cassandra and made conveniently available through an easy-to-use JSON API."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b32730d-176e-414c-9d91-fd3644c54211",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import AstraDB"
]
},
{
"cell_type": "markdown",
"id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
"metadata": {},
"source": [
"### Astra DB connection parameters\n",
"\n",
"- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
"- the Token looks like `AstraCS:6gBhNmsk135....`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
"metadata": {},
"outputs": [],
"source": [
"vstore = AstraDB(\n",
" embedding=embe,\n",
" collection_name=\"astra_vector_demo\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_TOKEN,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
"metadata": {},
"source": [
"### Load a dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
"metadata": {},
"outputs": [],
"source": [
"philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
"\n",
"docs = []\n",
"for entry in philo_dataset:\n",
" metadata = {\"author\": entry[\"author\"]}\n",
" doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
" docs.append(doc)\n",
"\n",
"inserted_ids = vstore.add_documents(docs)\n",
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
"metadata": {},
"source": [
"Add some more entries, this time with `add_texts`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
"metadata": {},
"outputs": [],
"source": [
"texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
"metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
"ids = [\"desc_01\", \"huss_xy\"]\n",
"\n",
"inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
"metadata": {},
"source": [
"### Run simple searches"
]
},
{
"cell_type": "markdown",
"id": "02a77d8e-1aae-4054-8805-01c77947c49f",
"metadata": {},
"source": [
"This section demonstrates metadata filtering and getting the similarity scores back:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1761806a-1afd-4491-867c-25a80d92b9fe",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
"metadata": {},
"outputs": [],
"source": [
"results_filtered = vstore.similarity_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"plato\"},\n",
")\n",
"for res in results_filtered:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "b14ea558-bfbe-41ce-807e-d70670060ada",
"metadata": {},
"source": [
"### MMR (Maximal Marginal Relevance) search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.max_marginal_relevance_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"aristotle\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
"metadata": {},
"source": [
"### Deleting stored documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38a70ec4-b522-4d32-9ead-c642864fca37",
"metadata": {},
"outputs": [],
"source": [
"delete_1 = vstore.delete(inserted_ids[:3])\n",
"print(f\"all_succeed={delete_1}\") # True, all documents deleted"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
"metadata": {},
"outputs": [],
"source": [
"delete_2 = vstore.delete(inserted_ids[2:5])\n",
"print(f\"some_succeeds={delete_2}\") # True, though some IDs were gone already"
]
},
{
"cell_type": "markdown",
"id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
"metadata": {},
"source": [
"### A minimal RAG chain"
]
},
{
"cell_type": "markdown",
"id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
"metadata": {},
"source": [
"The next cells will implement a simple RAG pipeline:\n",
"- download a sample PDF file and load it onto the store;\n",
"- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
"- run the question-answering chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
"metadata": {},
"outputs": [],
"source": [
"!curl -L \\\n",
" \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
" -o \"what-is-philosophy.pdf\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
"metadata": {},
"outputs": [],
"source": [
"pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
"docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
"\n",
"print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
"inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
"print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
"metadata": {},
"outputs": [],
"source": [
"retriever = vstore.as_retriever(search_kwargs={'k': 3})\n",
"\n",
"philo_template = \"\"\"\n",
"You are a philosopher that draws inspiration from great thinkers of the past\n",
"to craft well-thought answers to user questions. Use the provided context as the basis\n",
"for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
"Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.\n",
"\n",
"CONTEXT:\n",
"{context}\n",
"\n",
"QUESTION: {question}\n",
"\n",
"YOUR ANSWER:\"\"\"\n",
"\n",
"philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()} \n",
" | philo_prompt \n",
" | llm \n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
"metadata": {},
"outputs": [],
"source": [
"chain.invoke(\"How does Russell elaborate on Peirce's idea of the security blanket?\")"
]
},
{
"cell_type": "markdown",
"id": "869ab448-a029-4692-aefc-26b85513314d",
"metadata": {},
"source": [
"For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)."
]
},
{
"cell_type": "markdown",
"id": "177610c7-50d0-4b7b-8634-b03338054c8e",
"metadata": {},
"source": [
"### Cleanup"
]
},
{
"cell_type": "markdown",
"id": "0da4d19f-9878-4d3d-82c9-09cafca20322",
"metadata": {},
"source": [
"If you want to completely delete the collection from your Astra DB instance, run this.\n",
"\n",
"_(You will lose the data you stored in it.)_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd405a13-6f71-46fa-87e6-167238e9c25e",
"metadata": {},
"outputs": [],
"source": [
"vstore.delete_collection()"
]
},
{
"cell_type": "markdown",
"id": "94ebaab1-7cbf-4144-a147-7b0e32c43069",
"metadata": {},
"source": [
"## Apache Cassandra and Astra DB through CQL"
]
},
{
"cell_type": "markdown",
"id": "bc3931b4-211d-4f84-bcc0-51c127e3027c",
"metadata": {},
"source": [
"[Cassandra](https://cassandra.apache.org/) is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with [vector search capabilities](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html).\n",
"\n",
"DataStax [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html) is a managed serverless database built on Cassandra, offering the same interface and strengths."
]
},
{
"cell_type": "markdown",
"id": "a0055fbf-448d-4e46-9c40-28d43df25ca3",
"metadata": {},
"source": [
"#### What sets this case apart from \"Astra DB\" above?\n",
"\n",
"Thanks to LangChain having a standardized `VectorStore` interface, most of the \"Astra DB\" section above applies to this case as well. However, this time the database uses the CQL protocol, which means you'll use a _different_ class this time and instantiate it in another way.\n",
"\n",
"The cells below show how you should get your `vstore` object in this case and how you can clean up the database resources at the end: for the rest, i.e. the actual usage of the vector store, you will be able to run the very code that was shown above.\n",
"\n",
"In other words, running this demo in full with Cassandra or Astra DB through CQL means:\n",
"\n",
"- **initialization as shown below**\n",
"- \"Load a dataset\", _see above section_\n",
"- \"Run simple searches\", _see above section_\n",
"- \"MMR search\", _see above section_\n",
"- \"Deleting stored documents\", _see above section_\n",
"- \"A minimal RAG chain\", _see above section_\n",
"- **cleanup as shown below**"
]
},
{
"cell_type": "markdown",
"id": "23d12be2-745f-4e72-a82c-334a887bc7cd",
"metadata": {},
"source": [
"### Initialization"
]
},
{
"cell_type": "markdown",
"id": "e3212542-79be-423e-8e1f-b8d725e3cda8",
"metadata": {},
"source": [
"The class to use is the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "941af73e-a090-4fba-b23c-595757d470eb",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Cassandra"
]
},
{
"cell_type": "markdown",
"id": "414d1e72-f7c9-4b6d-bf6f-16075712c7e3",
"metadata": {},
"source": [
"Now, depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when creating the vector store object."
]
},
{
"cell_type": "markdown",
"id": "48ecca56-71a4-4a91-b198-29384c44ce27",
"metadata": {},
"source": [
"#### Initialization (Cassandra cluster)"
]
},
{
"cell_type": "markdown",
"id": "55ebe958-5654-43e0-9aed-d607ffd3fa48",
"metadata": {},
"source": [
"In this case, you first need to create a `cassandra.cluster.Session` object, as described in the [Cassandra driver documentation](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#module-cassandra.cluster). The details vary (e.g. with network settings and authentication), but this might be something like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4642dafb-a065-4063-b58c-3d276f5ad07e",
"metadata": {},
"outputs": [],
"source": [
"from cassandra.cluster import Cluster\n",
"\n",
"cluster = Cluster([\"127.0.0.1\"])\n",
"session = cluster.connect()"
]
},
{
"cell_type": "markdown",
"id": "624c93bf-fb46-4350-bcfa-09ca09dc068f",
"metadata": {},
"source": [
"You can now set the session, along with your desired keyspace name, as a global CassIO parameter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92a4ab28-1c4f-4dad-9671-d47e0b1dde7b",
"metadata": {},
"outputs": [],
"source": [
"import cassio\n",
"\n",
"CASSANDRA_KEYSPACE = input(\"CASSANDRA_KEYSPACE = \")\n",
"\n",
"cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)"
]
},
{
"cell_type": "markdown",
"id": "3b87a824-36f1-45b4-b54c-efec2a2de216",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "853a2a88-a565-4e24-8789-d78c213954a6",
"metadata": {},
"outputs": [],
"source": [
"vstore = Cassandra(\n",
" embedding=embe,\n",
" table_name=\"cassandra_vector_demo\",\n",
" # session=None, keyspace=None # Uncomment on older versions of LangChain\n",
")"
]
},
{
"cell_type": "markdown",
"id": "768ddf7a-0c3e-4134-ad38-25ac53c3da7a",
"metadata": {},
"source": [
"#### Initialization (Astra DB through CQL)"
]
},
{
"cell_type": "markdown",
"id": "4ed4269a-b7e7-4503-9e66-5a11335c7681",
"metadata": {},
"source": [
"In this case you initialize CassIO with the following connection parameters:\n",
"\n",
"- the Database ID, e.g. `01234567-89ab-cdef-0123-456789abcdef`\n",
"- the Token, e.g. `AstraCS:6gBhNmsk135....` (it must be a \"Database Administrator\" token)\n",
"- Optionally a Keyspace name (if omitted, the default one for the database will be used)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fa6bd74-d4b2-45c5-9757-96dddc6242fb",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_ID = input(\"ASTRA_DB_ID = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")\n",
"\n",
"desired_keyspace = input(\"ASTRA_DB_KEYSPACE (optional, can be left empty) = \")\n",
"if desired_keyspace:\n",
" ASTRA_DB_KEYSPACE = desired_keyspace\n",
"else:\n",
" ASTRA_DB_KEYSPACE = None"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "add6e585-17ff-452e-8ef6-7e485ead0b06",
"metadata": {},
"outputs": [],
"source": [
"import cassio\n",
"\n",
"cassio.init(\n",
" database_id=ASTRA_DB_ID,\n",
" token=ASTRA_DB_TOKEN,\n",
" keyspace=ASTRA_DB_KEYSPACE,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b305823c-bc98-4f3d-aabb-d7eb663ea421",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f45f3038-9d59-41cc-8b43-774c6aa80295",
"metadata": {},
"outputs": [],
"source": [
"vstore = Cassandra(\n",
" embedding=embe,\n",
" table_name=\"cassandra_vector_demo\",\n",
" # session=None, keyspace=None # Uncomment on older versions of LangChain\n",
")"
]
},
{
"cell_type": "markdown",
"id": "39284918-cf8a-49bb-a2d3-aef285bb2ffa",
"metadata": {},
"source": [
"### Usage of the vector store"
]
},
{
"cell_type": "markdown",
"id": "3cc1aead-d6ec-48a3-affe-1d0cffa955a9",
"metadata": {},
"source": [
"_See the sections \"Load a dataset\" through \"A minimal RAG chain\" above._\n",
"\n",
"Speaking of the latter, you can check out a full RAG template for Astra DB through CQL [here](https://github.com/langchain-ai/langchain/tree/master/templates/cassandra-entomology-rag)."
]
},
{
"cell_type": "markdown",
"id": "096397d8-6622-4685-9f9d-7e238beca467",
"metadata": {},
"source": [
"### Cleanup"
]
},
{
"cell_type": "markdown",
"id": "cc1e74f9-5500-41aa-836f-235b1ed5f20c",
"metadata": {},
"source": [
"The following essentially retrieves the `Session` object from CassIO and runs a CQL `DROP TABLE` statement with it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5b82c33-0e77-4a37-852c-8d50edbdd991",
"metadata": {},
"outputs": [],
"source": [
"cassio.config.resolve_session().execute(\n",
" f\"DROP TABLE {cassio.config.resolve_keyspace()}.cassandra_vector_demo;\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c10ece4d-ae06-42ab-baf4-4d0ac2051743",
"metadata": {},
"source": [
"### Learn more"
]
},
{
"cell_type": "markdown",
"id": "51ea8b69-7e15-458f-85aa-9fa199f95f9c",
"metadata": {},
"source": [
"For more information, extended quickstarts and additional usage examples, please visit the [CassIO documentation](https://cassio.org/frameworks/langchain/about/) for more on using the LangChain `Cassandra` vector store."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -1,326 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# Cassandra\n",
"\n",
">[Apache Cassandra®](https://cassandra.apache.org) is a NoSQL, row-oriented, highly scalable and highly available database.\n",
"\n",
"Newest Cassandra releases natively [support](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor(ANN)+Vector+Search+via+Storage-Attached+Indexes) Vector Similarity Search.\n",
"\n",
"To run this notebook you need either a running Cassandra cluster equipped with Vector Search capabilities (in pre-release at the time of writing) or a DataStax Astra DB instance running in the cloud (you can get one for free at [datastax.com](https://astra.datastax.com)). Check [cassio.org](https://cassio.org/start_here/) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4c41cad-08ef-4f72-a545-2151e4598efe",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install \"cassio>=0.1.0\""
]
},
{
"cell_type": "markdown",
"id": "b7e46bb0",
"metadata": {},
"source": [
"### Please provide database connection parameters and secrets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36128a32",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"database_mode = (input(\"\\n(C)assandra or (A)stra DB? \")).upper()\n",
"\n",
"keyspace_name = input(\"\\nKeyspace name? \")\n",
"\n",
"if database_mode == \"A\":\n",
" ASTRA_DB_APPLICATION_TOKEN = getpass.getpass('\\nAstra DB Token (\"AstraCS:...\") ')\n",
" #\n",
" ASTRA_DB_SECURE_BUNDLE_PATH = input(\"Full path to your Secure Connect Bundle? \")\n",
"elif database_mode == \"C\":\n",
" CASSANDRA_CONTACT_POINTS = input(\n",
" \"Contact points? (comma-separated, empty for localhost) \"\n",
" ).strip()"
]
},
{
"cell_type": "markdown",
"id": "4f22aac2",
"metadata": {},
"source": [
"#### depending on whether local or cloud-based Astra DB, create the corresponding database connection \"Session\" object"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "677f8576",
"metadata": {},
"outputs": [],
"source": [
"from cassandra.cluster import Cluster\n",
"from cassandra.auth import PlainTextAuthProvider\n",
"\n",
"if database_mode == \"C\":\n",
" if CASSANDRA_CONTACT_POINTS:\n",
" cluster = Cluster(\n",
" [cp.strip() for cp in CASSANDRA_CONTACT_POINTS.split(\",\") if cp.strip()]\n",
" )\n",
" else:\n",
" cluster = Cluster()\n",
" session = cluster.connect()\n",
"elif database_mode == \"A\":\n",
" ASTRA_DB_CLIENT_ID = \"token\"\n",
" cluster = Cluster(\n",
" cloud={\n",
" \"secure_connect_bundle\": ASTRA_DB_SECURE_BUNDLE_PATH,\n",
" },\n",
" auth_provider=PlainTextAuthProvider(\n",
" ASTRA_DB_CLIENT_ID,\n",
" ASTRA_DB_APPLICATION_TOKEN,\n",
" ),\n",
" )\n",
" session = cluster.connect()\n",
"else:\n",
" raise NotImplementedError"
]
},
{
"cell_type": "markdown",
"id": "320af802-9271-46ee-948f-d2453933d44b",
"metadata": {},
"source": [
"### Please provide OpenAI access key\n",
"\n",
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffea66e4-bc23-46a9-9580-b348dfe7b7a7",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"id": "e98a139b",
"metadata": {},
"source": [
"### Creation and usage of the Vector Store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aac9563e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import Cassandra\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"SOURCE_FILE_NAME = \"../../modules/state_of_the_union.txt\"\n",
"\n",
"loader = TextLoader(SOURCE_FILE_NAME)\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embedding_function = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"table_name = \"my_vector_db_table\"\n",
"\n",
"docsearch = Cassandra.from_documents(\n",
" documents=docs,\n",
" embedding=embedding_function,\n",
" session=session,\n",
" keyspace=keyspace_name,\n",
" table_name=table_name,\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f509ee02",
"metadata": {},
"outputs": [],
"source": [
"## if you already have an index, you can load it and use it like this:\n",
"\n",
"# docsearch_preexisting = Cassandra(\n",
"# embedding=embedding_function,\n",
"# session=session,\n",
"# keyspace=keyspace_name,\n",
"# table_name=table_name,\n",
"# )\n",
"\n",
"# docs = docsearch_preexisting.similarity_search(query, k=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c608226",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "d46d1452",
"metadata": {},
"source": [
"### Maximal Marginal Relevance Searches\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr` as retriever.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a359ed74",
"metadata": {},
"outputs": [],
"source": [
"retriever = docsearch.as_retriever(search_type=\"mmr\")\n",
"matched_docs = retriever.get_relevant_documents(query)\n",
"for i, d in enumerate(matched_docs):\n",
" print(f\"\\n## Document {i}\\n\")\n",
" print(d.page_content)"
]
},
{
"cell_type": "markdown",
"id": "7c477287",
"metadata": {},
"source": [
"Or use `max_marginal_relevance_search` directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ca82740",
"metadata": {},
"outputs": [],
"source": [
"found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)\n",
"for i, doc in enumerate(found_docs):\n",
" print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
]
},
{
"cell_type": "markdown",
"id": "da791c5f",
"metadata": {},
"source": [
"### Metadata filtering\n",
"\n",
"You can specify filtering on metadata when running searches in the vector store. By default, when inserting documents, the only metadata is the `\"source\"` (but you can customize the metadata at insertion time).\n",
"\n",
"Since only one files was inserted, this is just a demonstration of how filters are passed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93f132fa",
"metadata": {},
"outputs": [],
"source": [
"filter = {\"source\": SOURCE_FILE_NAME}\n",
"filtered_docs = docsearch.similarity_search(query, filter=filter, k=5)\n",
"print(f\"{len(filtered_docs)} documents retrieved.\")\n",
"print(f\"{filtered_docs[0].page_content[:64]} ...\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b413ec4",
"metadata": {},
"outputs": [],
"source": [
"filter = {\"source\": \"nonexisting_file.txt\"}\n",
"filtered_docs2 = docsearch.similarity_search(query, filter=filter)\n",
"print(f\"{len(filtered_docs2)} documents retrieved.\")"
]
},
{
"cell_type": "markdown",
"id": "a0fea764",
"metadata": {},
"source": [
"Please visit the [cassIO documentation](https://cassio.org/frameworks/langchain/about/) for more on using vector stores with Langchain."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

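The metadata filtering shown in the notebook relies on the store prefixing each user-facing filter key with `metadata.` before sending it to the HTTP API (this mirrors the `_filter_to_metadata` helper in the vector store implementation included in this commit). A minimal, self-contained sketch of that mapping, with illustrative names:

```python
from typing import Any, Dict, Optional


def filter_to_metadata(filter_dict: Optional[Dict[str, str]]) -> Dict[str, Any]:
    """Prefix user-facing filter keys with 'metadata.', since documents
    store user metadata under a nested 'metadata' field."""
    if filter_dict is None:
        return {}
    return {f"metadata.{key}": value for key, value in filter_dict.items()}
```

For example, `filter_to_metadata({"source": "hi.txt"})` yields `{"metadata.source": "hi.txt"}`, which is the shape the API expects for nested-field filters.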
@@ -58,9 +58,9 @@
"1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
"2. Only works with LangChain `vectorstore`'s that support:\n",
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with)\n",
" * delete by id (`delete` method with `ids` argument)\n",
"\n",
"Compatible Vectorstores: `AnalyticDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `ScaNN`, `SupabaseVectorStore`, `TimescaleVector`, `Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `ScaNN`, `SupabaseVectorStore`, `TimescaleVector`, `Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n",
" \n",
"## Caution\n",
"\n",

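The two indexing-API requirements quoted above (addition by id, deletion by id) can be sketched as a tiny in-memory store; the class and method bodies here are purely illustrative, not part of LangChain's API:

```python
from typing import Dict, List, Optional


class TinyIdStore:
    """Illustrative in-memory store exposing only the two operations the
    indexing API needs: add by id and delete by id."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def add_texts(self, texts: List[str], ids: List[str]) -> List[str]:
        # Re-inserting an existing id overwrites the stored document.
        for doc_id, text in zip(ids, texts):
            self.docs[doc_id] = text
        return ids

    def delete(self, ids: Optional[List[str]] = None) -> bool:
        # True only if every requested id was actually present.
        if ids is None:
            return False
        found = [i for i in ids if i in self.docs]
        for doc_id in found:
            del self.docs[doc_id]
        return len(found) == len(ids)
```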
@@ -414,7 +414,15 @@
},
{
"source": "/docs/integrations/cassandra",
"destination": "/docs/integrations/providers/cassandra"
"destination": "/docs/integrations/providers/astradb"
},
{
"source": "/docs/integrations/providers/cassandra",
"destination": "/docs/integrations/providers/astradb"
},
{
"source": "/docs/integrations/vectorstores/cassandra",
"destination": "/docs/integrations/vectorstores/astradb"
},
{
"source": "/docs/integrations/cerebriumai",

@@ -104,6 +104,12 @@ def _import_cassandra() -> Any:
return Cassandra
def _import_astradb() -> Any:
from langchain.vectorstores.astradb import AstraDB
return AstraDB
def _import_chroma() -> Any:
from langchain.vectorstores.chroma import Chroma
@@ -443,6 +449,8 @@ def __getattr__(name: str) -> Any:
return _import_baiducloud_vector_search()
elif name == "Cassandra":
return _import_cassandra()
elif name == "AstraDB":
return _import_astradb()
elif name == "Chroma":
return _import_chroma()
elif name == "Clarifai":
@@ -561,6 +569,7 @@ __all__ = [
"AzureSearch",
"Bagel",
"Cassandra",
"AstraDB",
"Chroma",
"Clarifai",
"Clickhouse",

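The `__init__.py` hunk above follows the module-level `__getattr__` pattern (PEP 562): the heavy import of each vector store happens only on first attribute access. A standalone sketch of the dispatch, using `math.sqrt` as a stand-in target since the real vector-store classes are not assumed here:

```python
import importlib
from typing import Any

# Registry of lazily importable names -> (module, attribute).
# "sqrt" from "math" stands in for the real vector-store classes.
_LAZY_IMPORTS = {"sqrt": ("math", "sqrt")}


def module_getattr(name: str) -> Any:
    """Resolve a name on first access, as a module-level __getattr__ would."""
    try:
        module_name, attr = _LAZY_IMPORTS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    # The actual import is deferred until this call.
    return getattr(importlib.import_module(module_name), attr)
```

This keeps `import langchain.vectorstores` cheap: optional dependencies such as `astrapy` are only imported when `AstraDB` is actually requested.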
@@ -0,0 +1,751 @@
from __future__ import annotations
import uuid
import warnings
from concurrent.futures import ThreadPoolExecutor
from typing import (
Any,
Callable,
Dict,
Iterable,
List,
Optional,
Set,
Tuple,
Type,
TypeVar,
)
import numpy as np
from langchain.docstore.document import Document
from langchain.schema.embeddings import Embeddings
from langchain.schema.vectorstore import VectorStore
from langchain.utils.iter import batch_iterate
from langchain.vectorstores.utils import maximal_marginal_relevance
ADBVST = TypeVar("ADBVST", bound="AstraDB")
T = TypeVar("T")
U = TypeVar("U")
DocDict = Dict[str, Any] # dicts expressing entries to insert
# Batch/concurrency default values (if parameters not provided):
# Size of batches for bulk insertions:
# (20 is the max batch size for the HTTP API at the time of writing)
DEFAULT_BATCH_SIZE = 20
# Number of threads to insert batches concurrently:
DEFAULT_BULK_INSERT_BATCH_CONCURRENCY = 5
# Number of threads in a batch to insert pre-existing entries:
DEFAULT_BULK_INSERT_OVERWRITE_CONCURRENCY = 10
# Number of threads (for deleting multiple rows concurrently):
DEFAULT_BULK_DELETE_CONCURRENCY = 20
def _unique_list(lst: List[T], key: Callable[[T], U]) -> List[T]:
visited_keys: Set[U] = set()
new_lst = []
for item in lst:
item_key = key(item)
if item_key not in visited_keys:
visited_keys.add(item_key)
new_lst.append(item)
return new_lst
class AstraDB(VectorStore):
"""Wrapper around DataStax Astra DB for vector-store workloads.
To use it, you need a recent installation of the `astrapy` library
and an Astra DB cloud database.
For quickstart and details, visit:
docs.datastax.com/en/astra/home/astra.html
Example:
.. code-block:: python
from langchain.vectorstores import AstraDB
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vectorstore = AstraDB(
embedding=embeddings,
collection_name="my_store",
token="AstraCS:...",
api_endpoint="https://<DB-ID>-us-east1.apps.astra.datastax.com"
)
vectorstore.add_texts(["Giraffes", "All good here"])
results = vectorstore.similarity_search("Everything's ok", k=1)
"""
@staticmethod
def _filter_to_metadata(filter_dict: Optional[Dict[str, str]]) -> Dict[str, Any]:
if filter_dict is None:
return {}
else:
return {f"metadata.{mdk}": mdv for mdk, mdv in filter_dict.items()}
def __init__(
self,
*,
embedding: Embeddings,
collection_name: str,
token: Optional[str] = None,
api_endpoint: Optional[str] = None,
astra_db_client: Optional[Any] = None, # 'astrapy.db.AstraDB' if passed
namespace: Optional[str] = None,
metric: Optional[str] = None,
batch_size: Optional[int] = None,
bulk_insert_batch_concurrency: Optional[int] = None,
bulk_insert_overwrite_concurrency: Optional[int] = None,
bulk_delete_concurrency: Optional[int] = None,
) -> None:
        """
        Create an AstraDB vector store object.
        Args (only keyword-arguments accepted):
            embedding (Embeddings): embedding function to use.
            collection_name (str): name of the Astra DB collection to create/use.
            token (Optional[str]): API token for Astra DB usage.
            api_endpoint (Optional[str]): full URL to the API endpoint,
                such as "https://<DB-ID>-us-east1.apps.astra.datastax.com".
            astra_db_client (Optional[Any]): *alternative to token+api_endpoint*,
                you can pass an already-created 'astrapy.db.AstraDB' instance.
            namespace (Optional[str]): namespace (aka keyspace) where the
                collection is created. Defaults to the database's "default namespace".
            metric (Optional[str]): similarity function to use out of those
                available in Astra DB. If left out, it will use Astra DB API's
                defaults (i.e. "cosine" - but, for performance reasons,
                "dot_product" is suggested if embeddings are normalized to one).
        Advanced arguments (coming with sensible defaults):
            batch_size (Optional[int]): Size of batches for bulk insertions.
            bulk_insert_batch_concurrency (Optional[int]): Number of threads
                to insert batches concurrently.
            bulk_insert_overwrite_concurrency (Optional[int]): Number of
                threads in a batch to insert pre-existing entries.
            bulk_delete_concurrency (Optional[int]): Number of threads
                (for deleting multiple rows concurrently).
        """
        try:
            from astrapy.db import (
                AstraDB as LibAstraDB,
            )
            from astrapy.db import (
                AstraDBCollection as LibAstraDBCollection,
            )
        except (ImportError, ModuleNotFoundError):
            raise ImportError(
                "Could not import a recent astrapy Python package. "
                "Please install it with `pip install --upgrade astrapy`."
            )
# Conflicting-arg checks:
if astra_db_client is not None:
if token is not None or api_endpoint is not None:
raise ValueError(
"You cannot pass 'astra_db_client' to AstraDB if passing "
"'token' and 'api_endpoint'."
)
self.embedding = embedding
self.collection_name = collection_name
self.token = token
self.api_endpoint = api_endpoint
self.namespace = namespace
# Concurrency settings
self.batch_size: int = batch_size or DEFAULT_BATCH_SIZE
self.bulk_insert_batch_concurrency: int = (
bulk_insert_batch_concurrency or DEFAULT_BULK_INSERT_BATCH_CONCURRENCY
)
self.bulk_insert_overwrite_concurrency: int = (
bulk_insert_overwrite_concurrency
or DEFAULT_BULK_INSERT_OVERWRITE_CONCURRENCY
)
self.bulk_delete_concurrency: int = (
bulk_delete_concurrency or DEFAULT_BULK_DELETE_CONCURRENCY
)
# "vector-related" settings
self._embedding_dimension: Optional[int] = None
self.metric = metric
if astra_db_client is not None:
self.astra_db = astra_db_client
else:
self.astra_db = LibAstraDB(
token=self.token,
api_endpoint=self.api_endpoint,
namespace=self.namespace,
)
self._provision_collection()
self.collection = LibAstraDBCollection(
collection_name=self.collection_name,
astra_db=self.astra_db,
)
def _get_embedding_dimension(self) -> int:
if self._embedding_dimension is None:
self._embedding_dimension = len(
self.embedding.embed_query("This is a sample sentence.")
)
return self._embedding_dimension
def _drop_collection(self) -> None:
        """
        Drop the collection from storage.
        This is an internal-usage method: no instance members are touched,
        the collection is simply deleted on the backend.
        """
_ = self.astra_db.delete_collection(
collection_name=self.collection_name,
)
return None
def _provision_collection(self) -> None:
        """
        Run the API invocation to create the collection on the backend.
        Internal-usage method: no instance members are set,
        it only acts on the underlying storage.
        """
_ = self.astra_db.create_collection(
dimension=self._get_embedding_dimension(),
collection_name=self.collection_name,
metric=self.metric,
)
return None
@property
def embeddings(self) -> Embeddings:
return self.embedding
@staticmethod
def _dont_flip_the_cos_score(similarity0to1: float) -> float:
        """Keep similarity from client unchanged as it's in [0:1] already."""
return similarity0to1
def _select_relevance_score_fn(self) -> Callable[[float], float]:
"""
        The underlying API calls already return a "score proper",
        i.e. one in [0, 1] where higher means more *similar*,
        so here the final score transformation does not reverse the interval:
"""
return self._dont_flip_the_cos_score
def clear(self) -> None:
"""Empty the collection of all its stored entries."""
self._drop_collection()
self._provision_collection()
return None
def delete_by_document_id(self, document_id: str) -> bool:
"""
Remove a single document from the store, given its document_id (str).
Return True if a document has indeed been deleted, False if ID not found.
"""
deletion_response = self.collection.delete(document_id)
return ((deletion_response or {}).get("status") or {}).get(
"deletedCount", 0
) == 1
def delete(
self,
ids: Optional[List[str]] = None,
concurrency: Optional[int] = None,
**kwargs: Any,
) -> Optional[bool]:
"""Delete by vector ids.
Args:
ids (Optional[List[str]]): List of ids to delete.
concurrency (Optional[int]): max number of threads issuing
single-doc delete requests. Defaults to instance-level setting.
Returns:
Optional[bool]: True if deletion is successful,
False otherwise, None if not implemented.
"""
if kwargs:
warnings.warn(
"Method 'delete' of AstraDB vector store invoked with "
f"unsupported arguments ({', '.join(sorted(kwargs.keys()))}), "
"which will be ignored."
)
if ids is None:
raise ValueError("No ids provided to delete.")
_max_workers = concurrency or self.bulk_delete_concurrency
with ThreadPoolExecutor(max_workers=_max_workers) as tpe:
_ = list(
tpe.map(
self.delete_by_document_id,
ids,
)
)
return True
def delete_collection(self) -> None:
"""
Completely delete the collection from the database (as opposed
to 'clear()', which empties it only).
Stored data is lost and unrecoverable, resources are freed.
Use with caution.
"""
self._drop_collection()
return None
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
*,
batch_size: Optional[int] = None,
batch_concurrency: Optional[int] = None,
overwrite_concurrency: Optional[int] = None,
**kwargs: Any,
) -> List[str]:
"""Run texts through the embeddings and add them to the vectorstore.
        If explicit ids are passed, entries whose id already exists
        in the store will be replaced.
Args:
texts (Iterable[str]): Texts to add to the vectorstore.
metadatas (Optional[List[dict]], optional): Optional list of metadatas.
ids (Optional[List[str]], optional): Optional list of ids.
batch_size (Optional[int]): Number of documents in each API call.
Check the underlying Astra DB HTTP API specs for the max value
(20 at the time of writing this). If not provided, defaults
to the instance-level setting.
batch_concurrency (Optional[int]): number of threads to process
insertion batches concurrently. Defaults to instance-level
setting if not provided.
overwrite_concurrency (Optional[int]): number of threads to process
pre-existing documents in each batch (which require individual
API calls). Defaults to instance-level setting if not provided.
Returns:
List[str]: List of ids of the added texts.
"""
if kwargs:
warnings.warn(
"Method 'add_texts' of AstraDB vector store invoked with "
f"unsupported arguments ({', '.join(sorted(kwargs.keys()))}), "
"which will be ignored."
)
_texts = list(texts)
if ids is None:
ids = [uuid.uuid4().hex for _ in _texts]
if metadatas is None:
metadatas = [{} for _ in _texts]
#
embedding_vectors = self.embedding.embed_documents(_texts)
documents_to_insert = [
{
"content": b_txt,
"_id": b_id,
"$vector": b_emb,
"metadata": b_md,
}
for b_txt, b_emb, b_id, b_md in zip(
_texts,
embedding_vectors,
ids,
metadatas,
)
]
# make unique by id, keeping the last
uniqued_documents_to_insert = _unique_list(
documents_to_insert[::-1],
lambda document: document["_id"],
)[::-1]
all_ids = []
def _handle_batch(document_batch: List[DocDict]) -> List[str]:
im_result = self.collection.insert_many(
documents=document_batch,
options={"ordered": False},
partial_failures_allowed=True,
)
if "status" not in im_result:
raise ValueError(
f"API Exception while running bulk insertion: {str(im_result)}"
)
batch_inserted = im_result["status"]["insertedIds"]
# estimation of the preexisting documents that failed
missed_inserted_ids = {
document["_id"] for document in document_batch
} - set(batch_inserted)
errors = im_result.get("errors", [])
            # watch out for sources of error other than "doc already exists"
num_errors = len(errors)
unexpected_errors = any(
error.get("errorCode") != "DOCUMENT_ALREADY_EXISTS" for error in errors
)
if num_errors != len(missed_inserted_ids) or unexpected_errors:
raise ValueError(
f"API Exception while running bulk insertion: {str(errors)}"
)
# deal with the missing insertions as upserts
missing_from_batch = [
document
for document in document_batch
if document["_id"] in missed_inserted_ids
]
def _handle_missing_document(missing_document: DocDict) -> str:
replacement_result = self.collection.find_one_and_replace(
filter={"_id": missing_document["_id"]},
replacement=missing_document,
)
return replacement_result["data"]["document"]["_id"]
_u_max_workers = (
overwrite_concurrency or self.bulk_insert_overwrite_concurrency
)
with ThreadPoolExecutor(max_workers=_u_max_workers) as tpe2:
batch_replaced = list(
tpe2.map(
_handle_missing_document,
missing_from_batch,
)
)
upsert_ids = batch_inserted + batch_replaced
return upsert_ids
_b_max_workers = batch_concurrency or self.bulk_insert_batch_concurrency
with ThreadPoolExecutor(max_workers=_b_max_workers) as tpe:
all_ids_nested = tpe.map(
_handle_batch,
batch_iterate(
batch_size or self.batch_size,
uniqued_documents_to_insert,
),
)
all_ids = [iid for id_list in all_ids_nested for iid in id_list]
return all_ids
def similarity_search_with_score_id_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[Document, float, str]]:
"""Return docs most similar to embedding vector.
Args:
            embedding (List[float]): Embedding to look up documents similar to.
k (int): Number of Documents to return. Defaults to 4.
Returns:
List of (Document, score, id), the most similar to the query vector.
"""
metadata_parameter = self._filter_to_metadata(filter)
#
hits = list(
self.collection.paginated_find(
filter=metadata_parameter,
sort={"$vector": embedding},
options={"limit": k},
projection={
"_id": 1,
"content": 1,
"metadata": 1,
"$similarity": 1,
},
)
)
#
return [
(
Document(
page_content=hit["content"],
metadata=hit["metadata"],
),
hit["$similarity"],
hit["_id"],
)
for hit in hits
]
def similarity_search_with_score_id(
self,
query: str,
k: int = 4,
filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[Document, float, str]]:
embedding_vector = self.embedding.embed_query(query)
return self.similarity_search_with_score_id_by_vector(
embedding=embedding_vector,
k=k,
filter=filter,
)
def similarity_search_with_score_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[Document, float]]:
"""Return docs most similar to embedding vector.
Args:
            embedding (List[float]): Embedding to look up documents similar to.
k (int): Number of Documents to return. Defaults to 4.
Returns:
List of (Document, score), the most similar to the query vector.
"""
return [
(doc, score)
for (doc, score, doc_id) in self.similarity_search_with_score_id_by_vector(
embedding=embedding,
k=k,
filter=filter,
)
]
def similarity_search(
self,
query: str,
k: int = 4,
filter: Optional[Dict[str, str]] = None,
**kwargs: Any,
) -> List[Document]:
embedding_vector = self.embedding.embed_query(query)
return self.similarity_search_by_vector(
embedding_vector,
k,
filter=filter,
)
def similarity_search_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Dict[str, str]] = None,
**kwargs: Any,
) -> List[Document]:
return [
doc
for doc, _ in self.similarity_search_with_score_by_vector(
embedding,
k,
filter=filter,
)
]
def similarity_search_with_score(
self,
query: str,
k: int = 4,
filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[Document, float]]:
embedding_vector = self.embedding.embed_query(query)
return self.similarity_search_with_score_by_vector(
embedding_vector,
k,
filter=filter,
)
def max_marginal_relevance_search_by_vector(
self,
embedding: List[float],
k: int = 4,
fetch_k: int = 20,
lambda_mult: float = 0.5,
filter: Optional[Dict[str, str]] = None,
**kwargs: Any,
) -> List[Document]:
"""Return docs selected using the maximal marginal relevance.
Maximal marginal relevance optimizes for similarity to query AND diversity
among selected documents.
Args:
embedding: Embedding to look up documents similar to.
k: Number of Documents to return.
fetch_k: Number of Documents to fetch to pass to MMR algorithm.
lambda_mult: Number between 0 and 1 that determines the degree
of diversity among the results with 0 corresponding
to maximum diversity and 1 to minimum diversity.
Returns:
List of Documents selected by maximal marginal relevance.
"""
metadata_parameter = self._filter_to_metadata(filter)
prefetch_hits = list(
self.collection.paginated_find(
filter=metadata_parameter,
sort={"$vector": embedding},
options={"limit": fetch_k},
projection={
"_id": 1,
"content": 1,
"metadata": 1,
"$similarity": 1,
"$vector": 1,
},
)
)
mmr_chosen_indices = maximal_marginal_relevance(
np.array(embedding, dtype=np.float32),
[prefetch_hit["$vector"] for prefetch_hit in prefetch_hits],
k=k,
lambda_mult=lambda_mult,
)
mmr_hits = [
prefetch_hit
for prefetch_index, prefetch_hit in enumerate(prefetch_hits)
if prefetch_index in mmr_chosen_indices
]
return [
Document(
page_content=hit["content"],
metadata=hit["metadata"],
)
for hit in mmr_hits
]
def max_marginal_relevance_search(
self,
query: str,
k: int = 4,
fetch_k: int = 20,
lambda_mult: float = 0.5,
filter: Optional[Dict[str, str]] = None,
**kwargs: Any,
) -> List[Document]:
"""Return docs selected using the maximal marginal relevance.
Maximal marginal relevance optimizes for similarity to query AND diversity
among selected documents.
Args:
query (str): Text to look up documents similar to.
            k (int): Number of Documents to return. Defaults to 4.
            fetch_k (int): Number of Documents to fetch to pass to the
                MMR algorithm. Defaults to 20.
            lambda_mult (float): Number between 0 and 1 that determines the
                degree of diversity among the results, with 0 corresponding
                to maximum diversity and 1 to minimum diversity.
                Defaults to 0.5.
Returns:
List of Documents selected by maximal marginal relevance.
"""
embedding_vector = self.embedding.embed_query(query)
return self.max_marginal_relevance_search_by_vector(
embedding_vector,
k,
fetch_k,
lambda_mult=lambda_mult,
filter=filter,
)
@classmethod
def from_texts(
cls: Type[ADBVST],
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> ADBVST:
"""Create an Astra DB vectorstore from raw texts.
Args:
texts (List[str]): the texts to insert.
embedding (Embeddings): the embedding function to use in the store.
metadatas (Optional[List[dict]]): metadata dicts for the texts.
ids (Optional[List[str]]): ids to associate to the texts.
*Additional arguments*: you can pass any argument that you would
to 'add_texts' and/or to the 'AstraDB' class constructor
(see these methods for details). These arguments will be
routed to the respective methods as they are.
Returns:
            an `AstraDB` vectorstore.
"""
known_kwargs = {
"collection_name",
"token",
"api_endpoint",
"astra_db_client",
"namespace",
"metric",
"batch_size",
"bulk_insert_batch_concurrency",
"bulk_insert_overwrite_concurrency",
"bulk_delete_concurrency",
"batch_concurrency",
"overwrite_concurrency",
}
if kwargs:
unknown_kwargs = set(kwargs.keys()) - known_kwargs
if unknown_kwargs:
warnings.warn(
"Method 'from_texts' of AstraDB vector store invoked with "
f"unsupported arguments ({', '.join(sorted(unknown_kwargs))}), "
"which will be ignored."
)
collection_name: str = kwargs["collection_name"]
token = kwargs.get("token")
api_endpoint = kwargs.get("api_endpoint")
astra_db_client = kwargs.get("astra_db_client")
namespace = kwargs.get("namespace")
metric = kwargs.get("metric")
astra_db_store = cls(
embedding=embedding,
collection_name=collection_name,
token=token,
api_endpoint=api_endpoint,
astra_db_client=astra_db_client,
namespace=namespace,
metric=metric,
batch_size=kwargs.get("batch_size"),
bulk_insert_batch_concurrency=kwargs.get("bulk_insert_batch_concurrency"),
bulk_insert_overwrite_concurrency=kwargs.get(
"bulk_insert_overwrite_concurrency"
),
bulk_delete_concurrency=kwargs.get("bulk_delete_concurrency"),
)
astra_db_store.add_texts(
texts=texts,
metadatas=metadatas,
ids=ids,
batch_size=kwargs.get("batch_size"),
batch_concurrency=kwargs.get("batch_concurrency"),
overwrite_concurrency=kwargs.get("overwrite_concurrency"),
)
return astra_db_store
@classmethod
def from_documents(
cls: Type[ADBVST],
documents: List[Document],
embedding: Embeddings,
**kwargs: Any,
) -> ADBVST:
"""Create an Astra DB vectorstore from a document list.
Utility method that defers to 'from_texts' (see that one).
Args: see 'from_texts', except here you have to supply 'documents'
in place of 'texts' and 'metadatas'.
Returns:
an `AstraDB` vectorstore.
"""
return super().from_documents(documents, embedding, **kwargs)

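`add_texts` above deduplicates incoming documents by id while keeping the *last* occurrence: it reverses the list, applies `_unique_list` (which keeps the first occurrence per key), then reverses back. A standalone sketch of that double-reverse trick, reproduced here for illustration:

```python
from typing import Any, Callable, Dict, List, Set, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def unique_list(lst: List[T], key: Callable[[T], U]) -> List[T]:
    """Keep the first item for each key, preserving order."""
    seen: Set[U] = set()
    out: List[T] = []
    for item in lst:
        k = key(item)
        if k not in seen:
            seen.add(k)
            out.append(item)
    return out


def dedupe_keep_last(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reverse, keep the first occurrence per '_id', reverse back:
    the surviving entry for each id is the last one supplied."""
    return unique_list(docs[::-1], lambda d: d["_id"])[::-1]
```

This matters because the insert-many API rejects duplicate `_id`s within one request; collapsing to the last occurrence matches the "later writes win" semantics of the overwrite path.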
@@ -0,0 +1,468 @@
"""
Test of Astra DB vector store class `AstraDB`
Required to run this test:
- a recent `astrapy` Python package available
- an Astra DB instance;
- the two environment variables set:
export ASTRA_DB_API_ENDPOINT="https://<DB-ID>-us-east1.apps.astra.datastax.com"
export ASTRA_DB_APPLICATION_TOKEN="AstraCS:........."
- optionally this as well (otherwise defaults are used):
export ASTRA_DB_KEYSPACE="my_keyspace"
"""
import json
import math
import os
from typing import Iterable, List
import pytest
from langchain.embeddings.base import Embeddings
from langchain.schema import Document
from langchain.vectorstores import AstraDB
# Ad-hoc embedding classes:
class SomeEmbeddings(Embeddings):
"""
Turn a sentence into an embedding vector in some way.
    How is not important; all that counts is that it is deterministic.
"""
def __init__(self, dimension: int) -> None:
self.dimension = dimension
def embed_documents(self, texts: List[str]) -> List[List[float]]:
return [self.embed_query(txt) for txt in texts]
async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
return self.embed_documents(texts)
def embed_query(self, text: str) -> List[float]:
unnormed0 = [ord(c) for c in text[: self.dimension]]
unnormed = (unnormed0 + [1] + [0] * (self.dimension - 1 - len(unnormed0)))[
: self.dimension
]
norm = sum(x * x for x in unnormed) ** 0.5
normed = [x / norm for x in unnormed]
return normed
async def aembed_query(self, text: str) -> List[float]:
return self.embed_query(text)
class ParserEmbeddings(Embeddings):
"""
Parse input texts: if they are json for a List[float], fine.
Otherwise, return all zeros and call it a day.
"""
def __init__(self, dimension: int) -> None:
self.dimension = dimension
def embed_documents(self, texts: List[str]) -> List[List[float]]:
return [self.embed_query(txt) for txt in texts]
async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
return self.embed_documents(texts)
def embed_query(self, text: str) -> List[float]:
try:
vals = json.loads(text)
assert len(vals) == self.dimension
return vals
except Exception:
print(f'[ParserEmbeddings] Returning a moot vector for "{text}"')
return [0.0] * self.dimension
async def aembed_query(self, text: str) -> List[float]:
return self.embed_query(text)
def _has_env_vars() -> bool:
return all(
[
"ASTRA_DB_APPLICATION_TOKEN" in os.environ,
"ASTRA_DB_API_ENDPOINT" in os.environ,
]
)
@pytest.fixture(scope="function")
def store_someemb() -> Iterable[AstraDB]:
emb = SomeEmbeddings(dimension=2)
v_store = AstraDB(
embedding=emb,
collection_name="lc_test_s",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
yield v_store
v_store.delete_collection()
@pytest.fixture(scope="function")
def store_parseremb() -> Iterable[AstraDB]:
emb = ParserEmbeddings(dimension=2)
v_store = AstraDB(
embedding=emb,
collection_name="lc_test_p",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
yield v_store
v_store.delete_collection()
@pytest.mark.requires("astrapy")
@pytest.mark.skipif(not _has_env_vars(), reason="Missing Astra DB env. vars")
class TestAstraDB:
def test_astradb_vectorstore_create_delete(self) -> None:
"""Create and delete."""
emb = SomeEmbeddings(dimension=2)
# creation by passing the connection secrets
v_store = AstraDB(
embedding=emb,
collection_name="lc_test_1",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
v_store.delete_collection()
# Creation by passing a ready-made astrapy client:
from astrapy.db import AstraDB as LibAstraDB
astra_db_client = LibAstraDB(
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
v_store_2 = AstraDB(
embedding=emb,
collection_name="lc_test_2",
astra_db_client=astra_db_client,
)
v_store_2.delete_collection()
def test_astradb_vectorstore_from_x(self) -> None:
"""from_texts and from_documents methods."""
emb = SomeEmbeddings(dimension=2)
# from_texts
v_store = AstraDB.from_texts(
texts=["Hi", "Ho"],
embedding=emb,
collection_name="lc_test_ft",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
assert v_store.similarity_search("Ho", k=1)[0].page_content == "Ho"
v_store.delete_collection()
        # from_documents
v_store_2 = AstraDB.from_documents(
[
Document(page_content="Hee"),
Document(page_content="Hoi"),
],
embedding=emb,
collection_name="lc_test_fd",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
assert v_store_2.similarity_search("Hoi", k=1)[0].page_content == "Hoi"
# manual collection delete
v_store_2.delete_collection()
def test_astradb_vectorstore_crud(self, store_someemb: AstraDB) -> None:
"""Basic add/delete/update behaviour."""
res0 = store_someemb.similarity_search("Abc", k=2)
assert res0 == []
# write and check again
store_someemb.add_texts(
texts=["aa", "bb", "cc"],
metadatas=[
{"k": "a", "ord": 0},
{"k": "b", "ord": 1},
{"k": "c", "ord": 2},
],
ids=["a", "b", "c"],
)
res1 = store_someemb.similarity_search("Abc", k=5)
assert {doc.page_content for doc in res1} == {"aa", "bb", "cc"}
# partial overwrite and count total entries
store_someemb.add_texts(
texts=["cc", "dd"],
metadatas=[
{"k": "c_new", "ord": 102},
{"k": "d_new", "ord": 103},
],
ids=["c", "d"],
)
res2 = store_someemb.similarity_search("Abc", k=10)
assert len(res2) == 4
# pick one that was just updated and check its metadata
res3 = store_someemb.similarity_search_with_score_id("cc", k=1)
doc3, score3, id3 = res3[0]
assert doc3.page_content == "cc"
assert doc3.metadata == {"k": "c_new", "ord": 102}
assert score3 > 0.999 # leaving some leeway for approximations...
assert id3 == "c"
# delete and count again
del1_res = store_someemb.delete(["b"])
assert del1_res is True
del2_res = store_someemb.delete(["a", "c", "Z!"])
assert del2_res is False # a non-existing ID was supplied
assert len(store_someemb.similarity_search("xy", k=10)) == 1
# clear store
store_someemb.clear()
assert store_someemb.similarity_search("Abc", k=2) == []
# add_documents with "ids" arg passthrough
store_someemb.add_documents(
[
Document(page_content="vv", metadata={"k": "v", "ord": 204}),
Document(page_content="ww", metadata={"k": "w", "ord": 205}),
],
ids=["v", "w"],
)
assert len(store_someemb.similarity_search("xy", k=10)) == 2
res4 = store_someemb.similarity_search("ww", k=1)
assert res4[0].metadata["ord"] == 205
def test_astradb_vectorstore_mmr(self, store_parseremb: AstraDB) -> None:
"""
MMR testing. We work on the unit circle with angle multiples
of 2*pi/20 and prepare a store with known vectors for a controlled
MMR outcome.
"""
def _v_from_i(i: int, N: int) -> str:
angle = 2 * math.pi * i / N
vector = [math.cos(angle), math.sin(angle)]
return json.dumps(vector)
i_vals = [0, 4, 5, 13]
N_val = 20
store_parseremb.add_texts(
[_v_from_i(i, N_val) for i in i_vals], metadatas=[{"i": i} for i in i_vals]
)
res1 = store_parseremb.max_marginal_relevance_search(
_v_from_i(3, N_val),
k=2,
fetch_k=3,
)
res_i_vals = {doc.metadata["i"] for doc in res1}
assert res_i_vals == {0, 4}
def test_astradb_vectorstore_metadata(self, store_someemb: AstraDB) -> None:
"""Metadata filtering."""
store_someemb.add_documents(
[
Document(
page_content="q",
metadata={"ord": ord("q"), "group": "consonant"},
),
Document(
page_content="w",
metadata={"ord": ord("w"), "group": "consonant"},
),
Document(
page_content="r",
metadata={"ord": ord("r"), "group": "consonant"},
),
Document(
page_content="e",
metadata={"ord": ord("e"), "group": "vowel"},
),
Document(
page_content="i",
metadata={"ord": ord("i"), "group": "vowel"},
),
Document(
page_content="o",
metadata={"ord": ord("o"), "group": "vowel"},
),
]
)
# no filters
res0 = store_someemb.similarity_search("x", k=10)
assert {doc.page_content for doc in res0} == set("qwreio")
# single filter
res1 = store_someemb.similarity_search(
"x",
k=10,
filter={"group": "vowel"},
)
assert {doc.page_content for doc in res1} == set("eio")
# multiple filters
res2 = store_someemb.similarity_search(
"x",
k=10,
filter={"group": "consonant", "ord": ord("q")},
)
assert {doc.page_content for doc in res2} == set("q")
# excessive filters
res3 = store_someemb.similarity_search(
"x",
k=10,
filter={"group": "consonant", "ord": ord("q"), "case": "upper"},
)
assert res3 == []
def test_astradb_vectorstore_similarity_scale(
self, store_parseremb: AstraDB
) -> None:
"""Scale of the similarity scores."""
store_parseremb.add_texts(
texts=[
json.dumps([1, 1]),
json.dumps([-1, -1]),
],
ids=["near", "far"],
)
res1 = store_parseremb.similarity_search_with_score(
json.dumps([0.5, 0.5]),
k=2,
)
scores = [sco for _, sco in res1]
sco_near, sco_far = scores
assert abs(1 - sco_near) < 0.001 and abs(sco_far) < 0.001
def test_astradb_vectorstore_massive_delete(self, store_someemb: AstraDB) -> None:
"""Larger-scale bulk deletes."""
M = 50
texts = [str(i + 1 / 7.0) for i in range(2 * M)]
ids0 = ["doc_%i" % i for i in range(M)]
ids1 = ["doc_%i" % (i + M) for i in range(M)]
ids = ids0 + ids1
store_someemb.add_texts(texts=texts, ids=ids)
# deleting a bunch of these
del_res0 = store_someemb.delete(ids0)
assert del_res0 is True
# deleting the rest plus a fake one
del_res1 = store_someemb.delete(ids1 + ["ghost!"])
assert del_res1 is False # not *all* ids could be deleted...
# nothing left
assert store_someemb.similarity_search("x", k=2 * M) == []
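The return-value contract these assertions rely on — `True` only when *every* requested id was deleted — can be sketched with a plain set (an illustration of the semantics, not the store's implementation):

```python
def delete_ids(store: set, ids: list) -> bool:
    # True only if *all* requested ids were present; deletes whatever it finds
    found_all = all(i in store for i in ids)
    for i in ids:
        store.discard(i)
    return found_all

s = {"doc_0", "doc_1", "doc_2"}
all_deleted = delete_ids(s, ["doc_0", "doc_1"])        # True
some_missing = delete_ids(s, ["doc_2", "ghost!"])      # False: "ghost!" absent
```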
def test_astradb_vectorstore_drop(self) -> None:
"""behaviour of 'delete_collection'."""
emb = SomeEmbeddings(dimension=2)
v_store = AstraDB(
embedding=emb,
collection_name="lc_test_d",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
v_store.add_texts(["huh"])
assert len(v_store.similarity_search("hah", k=10)) == 1
# another instance pointing to the same collection on DB
v_store_kenny = AstraDB(
embedding=emb,
collection_name="lc_test_d",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
v_store_kenny.delete_collection()
# dropped on DB, but 'v_store' should have no clue:
with pytest.raises(ValueError):
_ = v_store.similarity_search("hah", k=10)
def test_astradb_vectorstore_custom_params(self) -> None:
"""Custom batch size and concurrency params."""
emb = SomeEmbeddings(dimension=2)
v_store = AstraDB(
embedding=emb,
collection_name="lc_test_c",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
batch_size=17,
bulk_insert_batch_concurrency=13,
bulk_insert_overwrite_concurrency=7,
bulk_delete_concurrency=19,
)
# add_texts
N = 50
texts = [str(i + 1 / 7.0) for i in range(N)]
ids = ["doc_%i" % i for i in range(N)]
v_store.add_texts(texts=texts, ids=ids)
v_store.add_texts(
texts=texts,
ids=ids,
batch_size=19,
batch_concurrency=7,
overwrite_concurrency=13,
)
#
_ = v_store.delete(ids[: N // 2])
_ = v_store.delete(ids[N // 2 :], concurrency=23)
#
v_store.delete_collection()
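A rough sketch of how batched, concurrent bulk inserts like those tested above might be organized. The batching helper and worker below are hypothetical, meant only to illustrate the interplay of `batch_size` and concurrency parameters, not astrapy's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def batched(seq, batch_size):
    # yield consecutive chunks of at most batch_size items
    for i in range(0, len(seq), batch_size):
        yield seq[i : i + batch_size]

def bulk_insert(texts, batch_size=20, concurrency=5):
    inserted = []

    def _insert(batch):
        # placeholder for one API call inserting a batch of documents
        return list(batch)

    # batches are dispatched to a bounded pool of workers
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        for result in ex.map(_insert, batched(texts, batch_size)):
            inserted.extend(result)
    return inserted

inserted = bulk_insert([str(i) for i in range(50)], batch_size=17, concurrency=13)
```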
def test_astradb_vectorstore_metrics(self) -> None:
"""
Different choices of similarity metric.
Both stores (with "cosine" and "euclidea" metrics) contain these two:
- a vector slightly rotated w.r.t query vector
- a vector which is a long multiple of query vector
so, which one is "the closest one" depends on the metric.
"""
emb = ParserEmbeddings(dimension=2)
isq2 = 0.5**0.5
isa = 0.7
isb = (1.0 - isa * isa) ** 0.5
texts = [
json.dumps([isa, isb]),
json.dumps([10 * isq2, 10 * isq2]),
]
ids = [
"rotated",
"scaled",
]
query_text = json.dumps([isq2, isq2])
# creation, population, query - cosine
vstore_cos = AstraDB(
embedding=emb,
collection_name="lc_test_m_c",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
metric="cosine",
)
vstore_cos.add_texts(
texts=texts,
ids=ids,
)
_, _, id_from_cos = vstore_cos.similarity_search_with_score_id(
query_text,
k=1,
)[0]
assert id_from_cos == "scaled"
vstore_cos.delete_collection()
# creation, population, query - euclidean
vstore_euc = AstraDB(
embedding=emb,
collection_name="lc_test_m_e",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
metric="euclidean",
)
vstore_euc.add_texts(
texts=texts,
ids=ids,
)
_, _, id_from_euc = vstore_euc.similarity_search_with_score_id(
query_text,
k=1,
)[0]
assert id_from_euc == "rotated"
vstore_euc.delete_collection()
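The geometric claim in the docstring — that the "closest" vector depends on the metric — can be checked directly with the same two vectors, independently of any store:

```python
import math

isq2 = 0.5 ** 0.5
isa = 0.7
isb = (1.0 - isa * isa) ** 0.5   # completes a unit vector with isa
query = [isq2, isq2]
rotated = [isa, isb]             # unit length, slightly rotated w.r.t. the query
scaled = [10 * isq2, 10 * isq2]  # colinear with the query, but 10x longer

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.dist(u, [0, 0]) * math.dist(v, [0, 0]))

# cosine ranks the colinear "scaled" vector first...
assert cosine(scaled, query) > cosine(rotated, query)
# ...while euclidean distance ranks the nearby "rotated" vector first
assert math.dist(rotated, query) < math.dist(scaled, query)
```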

@ -1125,6 +1125,7 @@ def test_compatible_vectorstore_documentation() -> None:
# These are mentioned in the indexing.ipynb documentation
documented = {
"AnalyticDB",
"AstraDB",
"AzureCosmosDBVectorSearch",
"AwaDB",
"Bagel",

@ -11,6 +11,7 @@ _EXPECTED = [
"AzureSearch",
"Bagel",
"Cassandra",
"AstraDB",
"Chroma",
"Clarifai",
"Clickhouse",

@ -1,7 +1,7 @@
# cassandra-entomology-rag
This template will perform RAG using Astra DB and Apache Cassandra®.
This template will perform RAG using Apache Cassandra® or Astra DB through CQL (`Cassandra` vector store class).
## Environment Setup
@ -53,16 +53,6 @@ export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
To populate the vector store, ensure that you have set all the environment variables, then from this directory, execute the following just once:
```shell
poetry run bash -c "cd [...]/cassandra_entomology_rag; python setup.py"
```
The output will be something like `Done (29 lines inserted).`.
Note: In a full application, the vector store might be populated in other ways. This step is to pre-populate the vector store with some rows for the demo RAG chains to work sensibly.
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell

@ -1,7 +1,7 @@
# cassandra-synonym-caching
This template provides a simple chain template showcasing the usage of LLM Caching backed by Astra DB / Apache Cassandra®.
This template provides a simple chain showcasing the usage of LLM caching backed by Apache Cassandra® or Astra DB through CQL.
## Environment Setup

@ -0,0 +1,5 @@
export OPENAI_API_KEY="..."
export ASTRA_DB_API_ENDPOINT="https://...-....apps.astra.datastax.com"
export ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
export ASTRA_DB_KEYSPACE="..." # Optional - falls back to default if not provided

@ -0,0 +1,78 @@
# rag-astradb
This template will perform RAG using Astra DB (`AstraDB` vector store class).
## Environment Setup
An [Astra DB](https://astra.datastax.com) database is required; free tier is fine.
- You need the database **API endpoint** (such as `https://0123...-us-east1.apps.astra.datastax.com`) ...
- ... and a **token** (`AstraCS:...`).
Also, an **OpenAI API Key** is required. _Note that out-of-the-box this demo supports OpenAI only, unless you tinker with the code._
Provide the connection parameters and secrets through environment variables. Please refer to `.env.template` for the variable names.
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U "langchain-cli[serve]"
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-astradb
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-astradb
```
And add the following code to your `server.py` file:
```python
from astradb_entomology_rag import chain as astradb_entomology_rag_chain
add_routes(app, astradb_entomology_rag_chain, path="/rag-astradb")
```
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-astradb/playground](http://127.0.0.1:8000/rag-astradb/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-astradb")
```
## Reference
Stand-alone repo with LangServe chain: [here](https://github.com/hemidactylus/langserve_astradb_entomology_rag).

@ -0,0 +1,53 @@
import os
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores import AstraDB
from .populate_vector_store import populate
# inits
llm = ChatOpenAI()
embeddings = OpenAIEmbeddings()
vector_store = AstraDB(
embedding=embeddings,
collection_name="langserve_rag_demo",
token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
namespace=os.environ.get("ASTRA_DB_KEYSPACE"),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# For demo reasons, let's ensure there are rows on the vector store.
# Please remove this and/or adapt to your use case!
inserted_lines = populate(vector_store)
if inserted_lines:
print(f"Done ({inserted_lines} lines inserted).")
entomology_template = """
You are an expert entomologist, tasked with answering enthusiast biologists' questions.
You must answer based only on the provided context, do not make up any fact.
Your answers must be concise and to the point, but strive to provide scientific details
(such as family, order, Latin names, and so on when appropriate).
You MUST refuse to answer questions on other topics than entomology,
as well as questions whose answer is not found in the provided context.
CONTEXT:
{context}
QUESTION: {question}
YOUR ANSWER:"""
entomology_prompt = ChatPromptTemplate.from_template(entomology_template)
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| entomology_prompt
| llm
| StrOutputParser()
)

@ -0,0 +1,29 @@
import os
BASE_DIR = os.path.abspath(os.path.dirname(__file__))
def populate(vector_store):
# is the store empty? find out with a probe search
hits = vector_store.similarity_search_by_vector(
embedding=[0.001] * 1536,
k=1,
)
#
if len(hits) == 0:
# this seems a first run:
# must populate the vector store
        src_file_name = os.path.join(BASE_DIR, "..", "sources.txt")
        # read non-empty, non-comment lines, closing the file promptly
        with open(src_file_name) as src_file:
            lines = [
                line.strip()
                for line in src_file
                if line.strip()
                if not line.startswith("#")
            ]
# deterministic IDs to prevent duplicates on multiple runs
ids = ["_".join(line.split(" ")[:2]).lower().replace(":", "") for line in lines]
#
vector_store.add_texts(texts=lines, ids=ids)
return len(lines)
else:
return 0
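The deterministic-ID scheme above (first two space-separated tokens, lowercased, colon removed) turns a source line such as `Order Thysanura: The silverfish...` into `order_thysanura`. A standalone sketch of the derivation:

```python
def line_to_id(line: str) -> str:
    # first two whitespace-separated tokens, lowercased, with ':' removed
    return "_".join(line.split(" ")[:2]).lower().replace(":", "")

sample = "Order Thysanura: The silverfish and firebrats are found in this order."
doc_id = line_to_id(sample)
```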

@ -0,0 +1,5 @@
from astradb_entomology_rag import chain
if __name__ == "__main__":
response = chain.invoke("Are there more coleoptera or bugs?")
print(response)


@ -0,0 +1,28 @@
[tool.poetry]
name = "astradb_entomology_rag"
version = "0.0.1"
description = ""
authors = [
"Stefano Lottini <stefano.lottini@datastax.com>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325"
openai = "^0.28.1"
tiktoken = "^0.5.1"
astrapy = "^0.5.3"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
[tool.langserve]
export_module = "astradb_entomology_rag"
export_attr = "chain"
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"

@ -0,0 +1,31 @@
# source: https://www.thoughtco.com/a-guide-to-the-twenty-nine-insect-orders-1968419
Order Thysanura: The silverfish and firebrats are found in the order Thysanura. They are wingless insects often found in people's attics, and have a lifespan of several years. There are about 600 species worldwide.
Order Diplura: Diplurans are the most primitive insect species, with no eyes or wings. They have the unusual ability among insects to regenerate body parts. There are over 400 members of the order Diplura in the world.
Order Protura: Another very primitive group, the proturans have no eyes, no antennae, and no wings. They are uncommon, with perhaps less than 100 species known.
Order Collembola: The order Collembola includes the springtails, primitive insects without wings. There are approximately 2,000 species of Collembola worldwide.
Order Ephemeroptera: The mayflies of order Ephemeroptera are short-lived, and undergo incomplete metamorphosis. The larvae are aquatic, feeding on algae and other plant life. Entomologists have described about 2,100 species worldwide.
Order Odonata: The order Odonata includes dragonflies and damselflies, which undergo incomplete metamorphosis. They are predators of other insects, even in their immature stage. There are about 5,000 species in the order Odonata.
Order Plecoptera: The stoneflies of order Plecoptera are aquatic and undergo incomplete metamorphosis. The nymphs live under rocks in well flowing streams. Adults are usually seen on the ground along stream and river banks. There are roughly 3,000 species in this group.
Order Grylloblatodea: Sometimes referred to as "living fossils," the insects of the order Grylloblatodea have changed little from their ancient ancestors. This order is the smallest of all the insect orders, with perhaps only 25 known species living today. Grylloblatodea live at elevations above 1500 ft., and are commonly named ice bugs or rock crawlers.
Order Orthoptera: These are familiar insects (grasshoppers, locusts, katydids, and crickets) and one of the largest orders of herbivorous insects. Many species in the order Orthoptera can produce and detect sounds. Approximately 20,000 species exist in this group.
Order Phasmida: The order Phasmida are masters of camouflage, the stick and leaf insects. They undergo incomplete metamorphosis and feed on leaves. There are some 3,000 insects in this group, but only a small fraction of this number is leaf insects. Stick insects are the longest insects in the world.
Order Dermaptera: This order contains the earwigs, an easily recognized insect that often has pincers at the end of the abdomen. Many earwigs are scavengers, eating both plant and animal matter. The order Dermaptera includes less than 2,000 species.
Order Embiidina: The order Embioptera is another ancient order with few species, perhaps only 200 worldwide. The web spinners have silk glands in their front legs and weave nests under leaf litter and in tunnels where they live. Webspinners live in tropical or subtropical climates.
Order Dictyoptera: The order Dictyoptera includes roaches and mantids. Both groups have long, segmented antennae and leathery forewings held tightly against their backs. They undergo incomplete metamorphosis. Worldwide, there are approximately 6,000 species in this order, most living in tropical regions.
Order Isoptera: Termites feed on wood and are important decomposers in forest ecosystems. They also feed on wood products and are thought of as pests for the destruction they cause to man-made structures. There are between 2,000 and 3,000 species in this order.
Order Zoraptera: Little is known about the angel insects, which belong to the order Zoraptera. Though they are grouped with winged insects, many are actually wingless. Members of this group are blind, small, and often found in decaying wood. There are only about 30 described species worldwide.
Order Psocoptera: Bark lice forage on algae, lichen, and fungus in moist, dark places. Booklice frequent human dwellings, where they feed on book paste and grains. They undergo incomplete metamorphosis. Entomologists have named about 3,200 species in the order Psocoptera.
Order Mallophaga: Biting lice are ectoparasites that feed on birds and some mammals. There are an estimated 3,000 species in the order Mallophaga, all of which undergo incomplete metamorphosis.
Order Siphunculata: The order Siphunculata are the sucking lice, which feed on the fresh blood of mammals. Their mouthparts are adapted for sucking or siphoning blood. There are only about 500 species of sucking lice.
Order Hemiptera: Most people use the term "bugs" to mean insects; an entomologist uses the term to refer to the order Hemiptera. The Hemiptera are the true bugs, and include cicadas, aphids, and spittlebugs, and others. This is a large group of over 70,000 species worldwide.
Order Thysanoptera: The thrips of order Thysanoptera are small insects that feed on plant tissue. Many are considered agricultural pests for this reason. Some thrips prey on other small insects as well. This order contains about 5,000 species.
Order Neuroptera: Commonly called the order of lacewings, this group actually includes a variety of other insects, too: dobsonflies, owlflies, mantidflies, antlions, snakeflies, and alderflies. Insects in the order Neuroptera undergo complete metamorphosis. Worldwide, there are over 5,500 species in this group.
Order Mecoptera: This order includes the scorpionflies, which live in moist, wooded habitats. Scorpionflies are omnivorous in both their larval and adult forms. The larva are caterpillar-like. There are less than 500 described species in the order Mecoptera.
Order Siphonaptera: Pet lovers fear insects in the order Siphonaptera - the fleas. Fleas are blood-sucking ectoparasites that feed on mammals, and rarely, birds. There are well over 2,000 species of fleas in the world.
Order Coleoptera: This group, the beetles and weevils, is the largest order in the insect world, with over 300,000 distinct species known. The order Coleoptera includes well-known families: june beetles, lady beetles, click beetles, and fireflies. All have hardened forewings that fold over the abdomen to protect the delicate hindwings used for flight.
Order Strepsiptera: Insects in this group are parasites of other insects, particularly bees, grasshoppers, and the true bugs. The immature Strepsiptera lies in wait on a flower and quickly burrows into any host insect that comes along. Strepsiptera undergo complete metamorphosis and pupate within the host insect's body.
Order Diptera: Diptera is one of the largest orders, with nearly 100,000 insects named to the order. These are the true flies, mosquitoes, and gnats. Insects in this group have modified hindwings which are used for balance during flight. The forewings function as the propellers for flying.
Order Lepidoptera: The butterflies and moths of the order Lepidoptera comprise the second largest group in the class Insecta. These well-known insects have scaly wings with interesting colors and patterns. You can often identify an insect in this order just by the wing shape and color.
Order Trichoptera: Caddisflies are nocturnal as adults and aquatic when immature. The caddisfly adults have silky hairs on their wings and body, which is key to identifying a Trichoptera member. The larvae spin traps for prey with silk. They also make cases from the silk and other materials that they carry and use for protection.
Order Hymenoptera: The order Hymenoptera includes many of the most common insects - ants, bees, and wasps. The larvae of some wasps cause trees to form galls, which then provides food for the immature wasps. Other wasps are parasitic, living in caterpillars, beetles, or even aphids. This is the third-largest insect order with just over 100,000 species.