CrateDB: Documentation about Vector Store, Document Loader, and Memory

This commit is contained in:
Andreas Motl 2024-10-29 14:25:01 +01:00
parent 0606aabfa3
commit 5f04f9bc80
6 changed files with 1430 additions and 1 deletions

View File

@ -4,4 +4,5 @@ node_modules/
.docusaurus
.cache-loader
docs/api
docs/api
example.sqlite

View File

@ -0,0 +1,276 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CrateDB Document Loader\n",
"\n",
"> [CrateDB] is capable of performing both vector and lexical search.\n",
"> It is built on top of the Apache Lucene library, talks SQL,\n",
"> is PostgreSQL-compatible, and scales like Elasticsearch.\n",
"\n",
"This notebook covers how to get started with the CrateDB document loader.\n",
"\n",
"The CrateDB document loader is based on [SQLAlchemy], and uses LangChain's\n",
"SQLDatabaseLoader. It loads the result of a database query with one document\n",
"per row.\n",
"\n",
"[CrateDB]: https://github.com/crate/crate\n",
"[SQLAlchemy]: https://www.sqlalchemy.org/\n",
"\n",
"## Overview\n",
"\n",
"The `CrateDBLoader` class helps you get your unstructured content from CrateDB\n",
"into LangChain's `Document` format.\n",
"\n",
"You must provide an SQLAlchemy-compatible connection string, and a query\n",
"expression in SQL format. \n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"|:-----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------| :---: | :---: | :---: |\n",
"| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_box](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Async Support\n",
"| :---: | :---: | :---: | \n",
"| CrateDBLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.\n",
"\n",
"### Credentials\n",
"\n",
"You will supply credentials through a regular SQLAlchemy connection string, like\n",
"`crate://username:password@cratedb.example.org/`."
]
},
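{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a minimal sketch of assembling such a connection string from an\n",
"environment variable, falling back to a local instance. The variable name\n",
"`CRATEDB_CONNECTION_STRING` follows the other CrateDB examples in this documentation;\n",
"adjust it to your setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Read the connection string from the environment, or fall back to a local CrateDB instance.\n",
"CONNECTION_STRING = os.environ.get(\n",
"    \"CRATEDB_CONNECTION_STRING\",\n",
"    \"crate://crate@localhost/\",\n",
")"
]
},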
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install the **langchain-community** and **sqlalchemy-cratedb** packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community sqlalchemy-cratedb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now, initialize the loader and start loading documents. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import CrateDBLoader\n",
"\n",
"loader = CrateDBLoader(\"SELECT * FROM sys.summits\", url=\"crate://crate@localhost/\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": "## Load"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"documents = loader.load()\n",
"print(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Lazy Load\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
" page.append(doc)\n",
" if len(page) >= 10:\n",
" # do some paged operation, e.g.\n",
" # index.upsert(page)\n",
"\n",
" page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all PyMuPDFLoader features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Tutorial\n",
"\n",
"### Populate database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!crash < ./example_data/mlb_teams_2012.sql\n",
"!crash --command \"REFRESH TABLE mlb_teams_2012;\""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": "### Usage"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"from langchain.document_loaders import CrateDBLoader\n",
"\n",
"CONNECTION_STRING = \"crate://crate@localhost/\"\n",
"\n",
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Specifying Which Columns are Content vs Metadata"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
" page_content_columns=[\"Team\"],\n",
" metadata_columns=[\"Payroll (millions)\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Adding Source to Metadata"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = CrateDBLoader(\n",
" 'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
" url=CONNECTION_STRING,\n",
" source_columns=[\"Team\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(documents)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,6 +1,7 @@
-- Provisioning table "mlb_teams_2012".
--
-- psql postgresql://postgres@localhost < mlb_teams_2012.sql
-- crash < mlb_teams_2012.sql
DROP TABLE IF EXISTS mlb_teams_2012;
CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT);

View File

@ -0,0 +1,359 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f22eab3f84cbeb37",
"metadata": {
"collapsed": false
},
"source": [
"# CrateDB Chat Message History\n",
"\n",
"This notebook demonstrates how to use the `CrateDBChatMessageHistory`\n",
"to manage chat history in CrateDB, for supporting conversational memory."
]
},
{
"cell_type": "markdown",
"id": "7fb27b941602401d91542211134fc71a",
"metadata": {
"collapsed": false
},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acae54e37e7d407bbb7b55eff062a284",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!#pip install langchain sqlalchemy-cratedb"
]
},
{
"cell_type": "markdown",
"id": "f8f2830ee9ca1e01",
"metadata": {
"collapsed": false
},
"source": [
"## Configuration\n",
"\n",
"To use the storage wrapper, you will need to configure two details.\n",
"\n",
"1. Session Id - a unique identifier of the session, like user name, email, chat id etc.\n",
"2. Database connection string: An SQLAlchemy-compatible URI that specifies the database\n",
" connection. It will be passed to SQLAlchemy create_engine function."
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "9a63283cbaf04dbcab1f6479b197f3a8",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from langchain.memory.chat_message_histories import CrateDBChatMessageHistory\n",
"\n",
"CONNECTION_STRING = \"crate://crate@localhost:4200/?schema=example\"\n",
"\n",
"chat_message_history = CrateDBChatMessageHistory(\n",
" session_id=\"test_session\", connection_string=CONNECTION_STRING\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8dd0d8092fe74a7c96281538738b07e2",
"metadata": {
"collapsed": false
},
"source": [
"## Basic Usage"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "4576e914a866fb40",
"metadata": {
"ExecuteTime": {
"end_time": "2023-08-28T10:04:38.077748Z",
"start_time": "2023-08-28T10:04:36.105894Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"chat_message_history.add_user_message(\"Hello\")\n",
"chat_message_history.add_ai_message(\"Hi\")"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "b476688cbb32ba90",
"metadata": {
"ExecuteTime": {
"end_time": "2023-08-28T10:04:38.929396Z",
"start_time": "2023-08-28T10:04:38.915727Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "[HumanMessage(content='Hello', additional_kwargs={}, example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]"
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_message_history.messages"
]
},
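{
"cell_type": "markdown",
"id": "3f9a1c2b7d8e4f50",
"metadata": {
"collapsed": false
},
"source": [
"To start a conversation from scratch, you can also reset the history. This is a\n",
"minimal sketch which assumes the standard `clear()` method from LangChain's\n",
"`BaseChatMessageHistory` interface; it removes all messages stored for the\n",
"configured session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2d9e7c1a8f4e21",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Remove all messages of the current session from the database.\n",
"chat_message_history.clear()"
]
},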
{
"cell_type": "markdown",
"id": "2e5337719d5614fd",
"metadata": {
"collapsed": false
},
"source": [
"## Custom Storage Model\n",
"\n",
"The default data model, which stores information about conversation messages only\n",
"has two slots for storing message details, the session id, and the message dictionary.\n",
"\n",
"If you want to store additional information, like message date, author, language etc.,\n",
"please provide an implementation for a custom message converter.\n",
"\n",
"This example demonstrates how to create a custom message converter, by implementing\n",
"the `BaseMessageConverter` interface."
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "fdfde84c07d071bb",
"metadata": {
"ExecuteTime": {
"end_time": "2023-08-28T10:04:41.510498Z",
"start_time": "2023-08-28T10:04:41.494912Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from datetime import datetime\n",
"from typing import Any\n",
"\n",
"import sqlalchemy as sa\n",
"from langchain.memory.chat_message_histories.sql import BaseMessageConverter\n",
"from langchain.schema import AIMessage, BaseMessage, HumanMessage, SystemMessage\n",
"from sqlalchemy.orm import declarative_base\n",
"\n",
"Base = declarative_base()\n",
"\n",
"\n",
"class CustomMessage(Base):\n",
" __tablename__ = \"custom_message_store\"\n",
"\n",
" id = sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())\n",
" session_id = sa.Column(sa.Text)\n",
" type = sa.Column(sa.Text)\n",
" content = sa.Column(sa.Text)\n",
" created_at = sa.Column(sa.DateTime)\n",
" author_email = sa.Column(sa.Text)\n",
"\n",
"\n",
"class CustomMessageConverter(BaseMessageConverter):\n",
" def __init__(self, author_email: str):\n",
" self.author_email = author_email\n",
"\n",
" def from_sql_model(self, sql_message: Any) -> BaseMessage:\n",
" if sql_message.type == \"human\":\n",
" return HumanMessage(\n",
" content=sql_message.content,\n",
" )\n",
" elif sql_message.type == \"ai\":\n",
" return AIMessage(\n",
" content=sql_message.content,\n",
" )\n",
" elif sql_message.type == \"system\":\n",
" return SystemMessage(\n",
" content=sql_message.content,\n",
" )\n",
" else:\n",
" raise ValueError(f\"Unknown message type: {sql_message.type}\")\n",
"\n",
" def to_sql_model(self, message: BaseMessage, session_id: str) -> Any:\n",
" now = datetime.now()\n",
" return CustomMessage(\n",
" session_id=session_id,\n",
" type=message.type,\n",
" content=message.content,\n",
" created_at=now,\n",
" author_email=self.author_email,\n",
" )\n",
"\n",
" def get_sql_model_class(self) -> Any:\n",
" return CustomMessage\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" Base.metadata.drop_all(bind=sa.create_engine(CONNECTION_STRING))\n",
"\n",
" chat_message_history = CrateDBChatMessageHistory(\n",
" session_id=\"test_session\",\n",
" connection_string=CONNECTION_STRING,\n",
" custom_message_converter=CustomMessageConverter(\n",
" author_email=\"test@example.com\"\n",
" ),\n",
" )\n",
"\n",
" chat_message_history.add_user_message(\"Hello\")\n",
" chat_message_history.add_ai_message(\"Hi\")"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "4a6a54d8a9e2856f",
"metadata": {
"ExecuteTime": {
"end_time": "2023-08-28T10:04:43.497990Z",
"start_time": "2023-08-28T10:04:43.492517Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "[HumanMessage(content='Hello', additional_kwargs={}, example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]"
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_message_history.messages"
]
},
{
"cell_type": "markdown",
"id": "622aded629a1adeb",
"metadata": {
"collapsed": false
},
"source": [
"## Custom Name for Session Column\n",
"\n",
"The session id, a unique token identifying the session, is an important property of\n",
"this subsystem. If your database table stores it in a different column, you can use\n",
"the `session_id_field_name` keyword argument to adjust the name correspondingly."
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "72eea5119410473aa328ad9291626812",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import json\n",
"import typing as t\n",
"\n",
"from langchain.memory.chat_message_histories.cratedb import CrateDBMessageConverter\n",
"from langchain.schema import _message_to_dict\n",
"\n",
"Base = declarative_base()\n",
"\n",
"\n",
"class MessageWithDifferentSessionIdColumn(Base):\n",
" __tablename__ = \"message_store_different_session_id\"\n",
" id = sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())\n",
" custom_session_id = sa.Column(sa.Text)\n",
" message = sa.Column(sa.Text)\n",
"\n",
"\n",
"class CustomMessageConverterWithDifferentSessionIdColumn(CrateDBMessageConverter):\n",
" def __init__(self):\n",
" self.model_class = MessageWithDifferentSessionIdColumn\n",
"\n",
" def to_sql_model(self, message: BaseMessage, custom_session_id: str) -> t.Any:\n",
" return self.model_class(\n",
" custom_session_id=custom_session_id,\n",
" message=json.dumps(_message_to_dict(message)),\n",
" )\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" Base.metadata.drop_all(bind=sa.create_engine(CONNECTION_STRING))\n",
"\n",
" chat_message_history = CrateDBChatMessageHistory(\n",
" session_id=\"test_session\",\n",
" connection_string=CONNECTION_STRING,\n",
" custom_message_converter=CustomMessageConverterWithDifferentSessionIdColumn(),\n",
" session_id_field_name=\"custom_session_id\",\n",
" )\n",
"\n",
" chat_message_history.add_user_message(\"Hello\")\n",
" chat_message_history.add_ai_message(\"Hi\")"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "8edb47106e1a46a883d545849b8ab81b",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "[HumanMessage(content='Hello', additional_kwargs={}, example=False),\n AIMessage(content='Hi', additional_kwargs={}, example=False)]"
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_message_history.messages"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,203 @@
# CrateDB

This documentation section shows how to use the CrateDB vector store
functionality around [`FLOAT_VECTOR`] and [`KNN_MATCH`]. You will learn
how to use it for similarity search and other purposes.

## What is CrateDB?

[CrateDB] is an open-source, distributed, and scalable SQL analytics database
for storing and analyzing massive amounts of data in near real-time, even with
complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits
the shared-nothing distribution layer of [Elasticsearch].

It provides a distributed, multi-tenant-capable relational database and search
engine with HTTP and PostgreSQL interfaces, and schema-free objects. It supports
sharding, partitioning, and replication out of the box.

CrateDB enables you to efficiently store billions of records and terabytes of
data, and to query it using SQL.

- Provides a standards-based SQL interface for querying relational data, nested
  documents, geospatial constraints, and vector embeddings at the same time.
- Improves your operations by storing time-series data, relational metadata,
  and vector embeddings within a single database.
- Builds upon approved technologies from Lucene and Elasticsearch.

## CrateDB Cloud

- Offers on-demand CrateDB clusters without operational overhead,
  with enterprise-grade features and [ISO 27001] certification.
- The entrypoint to [CrateDB Cloud] is the [CrateDB Cloud Console].
- Crate.io offers a free tier via [CrateDB Cloud CRFREE].
- To get started, [sign up] to CrateDB Cloud, deploy a database cluster,
  and follow the upcoming instructions.

## Features

The CrateDB adapter supports the _Vector Store_, _Document Loader_,
and _Conversational Memory_ subsystems of LangChain.

### Vector Store

`CrateDBVectorSearch` is an API wrapper around CrateDB's `FLOAT_VECTOR` type
and the corresponding `KNN_MATCH` function, based on SQLAlchemy and CrateDB's
SQLAlchemy dialect. It provides an interface to store and retrieve floating
point vectors, and to conduct similarity searches.

It supports:

- Approximate nearest neighbor search.
- Euclidean distance.

### Document Loader

`CrateDBLoader` loads documents from a database table, using an SQL query
expression or an SQLAlchemy selectable instance.

### Conversational Memory

`CrateDBChatMessageHistory` uses CrateDB to manage conversation history.

## Installation and Setup

There are multiple ways to get started with CrateDB.

### Install CrateDB on your local machine

You can [download CrateDB], or use the [OCI image] to run CrateDB on Docker or Podman.
Note that this is not recommended for production use.

```shell
docker run --rm -it --name=cratedb --publish=4200:4200 --publish=5432:5432 \
  --env=CRATE_HEAP_SIZE=4g crate/crate:nightly \
  -Cdiscovery.type=single-node
```

### Deploy a cluster on CrateDB Cloud

[CrateDB Cloud] is a managed CrateDB service. Sign up for a [free trial].

### Install Client

```bash
pip install crash langchain langchain-openai sqlalchemy-cratedb
```

## Usage » Vector Store

For a more detailed walkthrough of the `CrateDBVectorSearch` wrapper, there is also
a corresponding [Jupyter notebook](/docs/extras/integrations/vectorstores/cratedb.html).

### Provide input data

The example uses the canonical `state_of_the_union.txt`.

```shell
wget https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt
```

### Set environment variables

Use a valid OpenAI API key and SQL connection string. This one fits a local instance of CrateDB.

```shell
export OPENAI_API_KEY=foobar
export CRATEDB_CONNECTION_STRING=crate://crate@localhost
```

### Example

Load and index documents, and invoke a query.

```python
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import CrateDBVectorSearch


def main():
    # Load the document, split it into chunks, embed each chunk,
    # and load it into the vector store.
    raw_documents = UnstructuredURLLoader(
        urls=["https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt"]
    ).load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(raw_documents)
    db = CrateDBVectorSearch.from_documents(documents, OpenAIEmbeddings())

    query = "What did the president say about Ketanji Brown Jackson"
    docs = db.similarity_search(query)
    print(docs[0].page_content)


if __name__ == "__main__":
    main()
```

## Usage » Document Loader

For a more detailed walkthrough of the `CrateDBLoader`, there is also a corresponding
[Jupyter notebook](/docs/extras/integrations/document_loaders/cratedb.html).

### Provide input data

```shell
wget https://github.com/crate-workbench/langchain/raw/cratedb/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql
crash < mlb_teams_2012.sql
crash --command "REFRESH TABLE mlb_teams_2012;"
```

### Load documents by SQL query

```python
from pprint import pprint

from langchain.document_loaders import CrateDBLoader


def main():
    loader = CrateDBLoader(
        'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
        url="crate://crate@localhost/",
    )
    documents = loader.load()
    pprint(documents)


if __name__ == "__main__":
    main()
```
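
Optionally, the loader also accepts an SQLAlchemy selectable instead of an SQL string,
as mentioned above. A minimal sketch, with an illustrative column selection:

```python
import sqlalchemy as sa
from langchain.document_loaders import CrateDBLoader

# Describe the table and columns to select, and hand the selectable to the loader.
table = sa.table("mlb_teams_2012", sa.column("Team"), sa.column("Wins"))
loader = CrateDBLoader(
    sa.select(table).limit(5),
    url="crate://crate@localhost/",
)
documents = loader.load()
```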

## Usage » Conversational Memory

For a more detailed walkthrough of the `CrateDBChatMessageHistory`, there is also a corresponding
[Jupyter notebook](/docs/extras/integrations/memory/cratedb_chat_message_history.html).

```python
from pprint import pprint

from langchain.memory.chat_message_histories import CrateDBChatMessageHistory


def main():
    chat_message_history = CrateDBChatMessageHistory(
        session_id="test_session",
        connection_string="crate://crate@localhost/",
    )
    chat_message_history.add_user_message("Hello")
    chat_message_history.add_ai_message("Hi")
    pprint(chat_message_history.messages)


if __name__ == "__main__":
    main()
```

[CrateDB]: https://github.com/crate/crate
[CrateDB Cloud]: https://cratedb.com/product
[CrateDB Cloud Console]: https://console.cratedb.cloud/
[CrateDB Cloud CRFREE]: https://community.crate.io/t/new-cratedb-cloud-edge-feature-cratedb-cloud-free-tier/1402
[CrateDB SQLAlchemy dialect]: https://cratedb.com/docs/sqlalchemy-cratedb/
[download CrateDB]: https://cratedb.com/download
[Elasticsearch]: https://github.com/elastic/elasticsearch
[`FLOAT_VECTOR`]: https://cratedb.com/docs/crate/reference/en/master/general/ddl/data-types.html#float-vector
[free trial]: https://cratedb.com/lp-crfree?utm_source=langchain
[ISO 27001]: https://cratedb.com/blog/cratedb-elevates-its-security-standards-and-achieves-iso-27001-certification
[`KNN_MATCH`]: https://cratedb.com/docs/crate/reference/en/master/general/builtins/scalar-functions.html#scalar-knn-match
[Lucene]: https://github.com/apache/lucene
[OCI image]: https://hub.docker.com/_/crate
[sign up]: https://console.cratedb.cloud/

View File

@ -0,0 +1,589 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CrateDB\n",
"\n",
"> [CrateDB] is capable of performing both vector and lexical search.\n",
"> It is built on top of the Apache Lucene library, talks SQL,\n",
"> is PostgreSQL-compatible, and scales like Elasticsearch.\n",
"\n",
"This notebook shows how to use the CrateDB vector store functionality around\n",
"[`FLOAT_VECTOR`] and [`KNN_MATCH`]. You will learn how to use LangChain's\n",
"`CrateDBVectorSearch` adapter for similarity search and other purposes.\n",
"\n",
"It supports:\n",
"- Similarity Search with Euclidean Distance\n",
"- Maximal Marginal Relevance Search (MMR)\n",
"\n",
"## What is CrateDB?\n",
"\n",
"[CrateDB] is an open-source, distributed, and scalable SQL analytics database\n",
"for storing and analyzing massive amounts of data in near real-time, even with\n",
"complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits\n",
"the shared-nothing distribution layer of [Elasticsearch].\n",
"\n",
"This example uses the [Python client driver for CrateDB]. For more documentation,\n",
"see also [LangChain with CrateDB].\n",
"\n",
"\n",
"[CrateDB]: https://github.com/crate/crate\n",
"[Elasticsearch]: https://github.com/elastic/elasticsearch\n",
"[`FLOAT_VECTOR`]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#float-vector\n",
"[`KNN_MATCH`]: https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#scalar-knn-match\n",
"[LangChain with CrateDB]: /docs/extras/integrations/providers/cratedb.html\n",
"[Lucene]: https://github.com/apache/lucene\n",
"[Python client driver for CrateDB]: https://cratedb.com/docs/python/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"In order to use the CrateDB vector search you must install the sqlalchemy-cratedb package."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
},
"tags": []
},
"outputs": [],
"source": [
"# Install required packages: LangChain, OpenAI SDK, and the CrateDB SQLAlchemy adapter.\n",
"%pip install -qU langchain-community langchain-openai sqlalchemy-cratedb"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"You will supply credentials through a regular SQLAlchemy connection string, like\n",
"`crate://username:password@cratedb.example.org/`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"### OpenAI API key\n",
"\n",
"You need to provide an OpenAI API key, optionally using the environment\n",
"variable `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:02:16.802456Z",
"start_time": "2023-09-09T08:02:07.065604Z"
}
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"from dotenv import find_dotenv, load_dotenv\n",
"\n",
"# Run `export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY`.\n",
"# Get OpenAI api key from `.env` file.\n",
"# Otherwise, prompt for it.\n",
"_ = load_dotenv(find_dotenv())\n",
"OPENAI_API_KEY = os.environ.get(\"OPENAI_API_KEY\", getpass.getpass(\"OpenAI API key:\"))\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"You also need to provide a connection string to your CrateDB database cluster,\n",
"optionally using the environment variable `CRATEDB_CONNECTION_STRING`.\n",
"\n",
"This example uses a CrateDB instance on your workstation, which you can start by\n",
"running [CrateDB using Docker]. Alternatively, you can also connect to a cluster\n",
"running on [CrateDB Cloud].\n",
"\n",
"[CrateDB Cloud]: https://console.cratedb.cloud/\n",
"[CrateDB using Docker]: https://cratedb.com/docs/guide/install/container/\n",
"\n",
"### CrateDB connection string\n",
"\n",
"You will need to supply an SQLAlchemy-compatible connection string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"CONNECTION_STRING = os.environ.get(\n",
" \"CRATEDB_CONNECTION_STRING\",\n",
" \"crate://crate@localhost:4200/?schema=langchain\",\n",
")\n",
"\n",
"# For CrateDB Cloud, use:\n",
"# CONNECTION_STRING = os.environ.get(\n",
"# \"CRATEDB_CONNECTION_STRING\",\n",
"# \"crate://username:password@hostname:4200/?ssl=true&schema=langchain\",\n",
"# )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:02:28.174088Z",
"start_time": "2023-09-09T08:02:28.162698Z"
}
},
"outputs": [],
"source": [
"\"\"\"\n",
"# Alternatively, the connection string can be assembled from individual\n",
"# environment variables.\n",
"import os\n",
"\n",
"CONNECTION_STRING = CrateDBVectorSearch.connection_string_from_db_params(\n",
" driver=os.environ.get(\"CRATEDB_DRIVER\", \"crate\"),\n",
" host=os.environ.get(\"CRATEDB_HOST\", \"localhost\"),\n",
" port=int(os.environ.get(\"CRATEDB_PORT\", \"4200\")),\n",
" database=os.environ.get(\"CRATEDB_DATABASE\", \"langchain\"),\n",
" user=os.environ.get(\"CRATEDB_USER\", \"crate\"),\n",
" password=os.environ.get(\"CRATEDB_PASSWORD\", \"\"),\n",
")\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Import Python Modules\n",
"\n",
"You will start by importing all required modules."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"from langchain.document_loaders import UnstructuredURLLoader\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import CrateDBVectorSearch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manage vector store\n",
"\n",
"In the example above, you created a vector store from scratch. When\n",
"aiming to work with an existing vector store, you can initialize it directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings()\n",
"\n",
"store = CrateDBVectorSearch(\n",
" collection_name=\"testdrive\",\n",
" connection_string=CONNECTION_STRING,\n",
" embedding_function=embeddings,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add items to vector store\n",
"\n",
"You can also add documents to an existing vector store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"store.add_documents([Document(page_content=\"foo\")])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [],
"source": [
"docs_with_score = store.similarity_search_with_score(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Update items in vector store\n",
"\n",
"FIXME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Foo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete items from vector store\n",
"FIXME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"store.delete(ids=[\"foo\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Load and Index Documents\n",
"\n",
"Next, you will read input data, and tokenize it. The module will create a table\n",
"with the name of the collection. Make sure the collection name is unique, and\n",
"that you have the permission to create a table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"loader = UnstructuredURLLoader(\n",
" \"https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt\"\n",
")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"COLLECTION_NAME = \"state_of_the_union_test\"\n",
"\n",
"db = CrateDBVectorSearch.from_documents(\n",
" embedding=embeddings,\n",
" documents=docs,\n",
" collection_name=COLLECTION_NAME,\n",
" connection_string=CONNECTION_STRING,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overwriting a Vector Store\n",
"\n",
"If you have an existing collection, you can overwrite it by using `from_documents`,\n",
"aad setting `pre_delete_collection = True`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = CrateDBVectorSearch.from_documents(\n",
" documents=docs,\n",
" embedding=embeddings,\n",
" collection_name=COLLECTION_NAME,\n",
" connection_string=CONNECTION_STRING,\n",
" pre_delete_collection=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score = db.similarity_search_with_score(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Query vector store\n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"Searching by euclidean distance is the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:05:11.104135Z",
"start_time": "2023-09-09T08:05:10.548998Z"
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:05:13.532334Z",
"start_time": "2023-09-09T08:05:13.523191Z"
}
},
"outputs": [],
"source": [
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"#### Maximal Marginal Relevance Search (MMR)\n",
"Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:05:23.276819Z",
"start_time": "2023-09-09T08:05:21.972256Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"docs_with_score = db.max_marginal_relevance_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:05:27.478580Z",
"start_time": "2023-09-09T08:05:27.470138Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"#### Searching in Multiple Collections\n",
"`CrateDBVectorSearchMultiCollection` is a special adapter which provides similarity search across\n",
"multiple collections. It can not be used for indexing documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from langchain.vectorstores.cratedb import CrateDBVectorSearchMultiCollection\n",
"\n",
"multisearch = CrateDBVectorSearchMultiCollection(\n",
" collection_names=[\"test_collection_1\", \"test_collection_2\"],\n",
" embedding_function=embeddings,\n",
" connection_string=CONNECTION_STRING,\n",
")\n",
"docs_with_score = multisearch.similarity_search_with_score(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Query by turning into retriever"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = store.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(retriever)"
]
},
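{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a minimal usage sketch, assuming LangChain's standard retriever\n",
"interface: `invoke()` returns the documents most relevant to a query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the most relevant documents for the query defined above.\n",
"for doc in retriever.invoke(query):\n",
"    print(doc.page_content)"
]
},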
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage for retrieval-augmented generation\n",
"\n",
"For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:\n",
"\n",
"- [Tutorials: working with external knowledge](https://python.langchain.com/docs/tutorials/#working-with-external-knowledge)\n",
"- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)\n",
"- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)"
]
},
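{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, retrieved documents can be stuffed into a prompt and passed to\n",
"a chat model. The `langchain-openai` package is installed above; the `gpt-4o-mini`\n",
"model name is an illustrative choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")\n",
"\n",
"question = \"What did the president say about Ketanji Brown Jackson\"\n",
"\n",
"# Stuff the retrieved document contents into a simple prompt.\n",
"context = \"\\n\\n\".join(doc.page_content for doc in retriever.invoke(question))\n",
"answer = llm.invoke(\n",
"    f\"Answer using only the following context:\\n\\n{context}\\n\\nQuestion: {question}\"\n",
")\n",
"print(answer.content)"
]
},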
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `CrateDBVectorSearch` features and configurations,\n",
"head to the API reference:\n",
"https://python.langchain.com/api_reference/cratedb/vectorstores/langchain_cratedb.vectorstores.CrateDBVectorSearch.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}