Add template for conversational RAG with Timescale Vector (#13041)

**Description:** This template is similar to the rag-conversation template in many
ways. What's different (see the usage sketch below):
- support for a Timescale Vector vectorstore.
- support for time-based filters.
- support for metadata filters.
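For illustration, a minimal sketch of an invocation exercising the new filters (parameter names follow the template's `chain`; the question and filter values are made up):

```python
chain.invoke(
    {
        "question": "What commits did Sven Klemm make?",
        "chat_history": [],
        # time-based filters
        "start_date": "2020-01-01 00:00:00",
        "end_date": "2023-01-01 00:00:00",
        # metadata filter on fields stored with each document
        "metadata_filter": {"commit_hash": "<commit sha>"},
    }
)
```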


---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Author: Matvey Arye
Commit: 180657ca7a (parent 1a1a1a883f)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2023 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,80 @@
# rag-timescale-conversation
This template is used for [conversational](https://python.langchain.com/docs/expression_language/cookbook/retrieval#conversational-retrieval-chain) [retrieval](https://python.langchain.com/docs/use_cases/question_answering/), which is one of the most popular LLM use-cases.
It passes both a conversation history and retrieved documents into an LLM for synthesis.
## Environment Setup
This template uses Timescale Vector as a vectorstore and requires that `TIMESCALE_SERVICE_URL` is set. Sign up for a 90-day trial [here](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) if you don't yet have an account.
To load the sample dataset, set `LOAD_SAMPLE_DATA=1`. To load your own dataset, see the section below.
Set the `OPENAI_API_KEY` environment variable to access the OpenAI models.
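For example, in a bash shell (the service URL below is only a placeholder; use the one from your Timescale console):
```shell
export TIMESCALE_SERVICE_URL="postgres://tsdbadmin:<password>@<host>:<port>/tsdb?sslmode=require"
export OPENAI_API_KEY="sk-..."
export LOAD_SAMPLE_DATA=1  # optional: load the sample dataset at startup
```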
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U "langchain-cli[serve]"
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-timescale-conversation
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-timescale-conversation
```
And add the following code to your `server.py` file:
```python
from rag_timescale_conversation import chain as rag_timescale_conversation_chain
add_routes(app, rag_timescale_conversation_chain, path="/rag-timescale-conversation")
```
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-timescale-conversation/playground](http://127.0.0.1:8000/rag-timescale-conversation/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-timescale-conversation")
```
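For example, a minimal invocation (assuming the server is running and the sample dataset has been loaded; the optional date filters mirror the notebook):
```python
answer = runnable.invoke(
    {
        "question": "What commits did Sven Klemm make?",
        "chat_history": [],
        # optional time-based filters
        "start_date": "2020-01-01 00:00:00",
        "end_date": "2023-01-01 00:00:00",
    }
)
print(answer)
```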
See the `rag_conversation.ipynb` notebook for example usage.
## Loading your own dataset
To load your own dataset, you will have to create a `load_dataset` function. You can see an example in the
`load_ts_git_dataset` function defined in the `load_sample_dataset.py` file. You can then run this as a
standalone function (e.g. in a bash script) or add it to `chain.py` (but then you should run it only once). A minimal sketch is shown below.
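A sketch of such a function, modeled on `load_ts_git_dataset` (the file name, jq schema, collection name, and `load_dataset` signature here are placeholders, not part of the package):
```python
from datetime import timedelta

from langchain.document_loaders import JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector


def load_dataset(service_url: str, collection_name: str = "my_collection"):
    # Load raw records; swap in any LangChain document loader you prefer.
    loader = JSONLoader(
        file_path="my_data.json",
        jq_schema=".records[]",
        text_content=False,
    )
    documents = loader.load()
    # Chunk the documents before embedding.
    docs = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(documents)
    # Embed and write into Timescale Vector, partitioned by time.
    # As in load_ts_git_dataset, consider passing `ids` generated with
    # timescale_vector.client.uuid_from_time(...) so records land in the right time partitions.
    TimescaleVector.from_documents(
        embedding=OpenAIEmbeddings(),
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=timedelta(days=7),
    )
```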

File diff suppressed because it is too large.

@@ -0,0 +1,31 @@
[tool.poetry]
name = "rag-timescale-conversation"
version = "0.1.0"
description = ""
authors = [
"Lance Martin <lance@langchain.dev>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325"
openai = ">=0.28.1"
tiktoken = ">=0.5.1"
pinecone-client = ">=2.2.4"
beautifulsoup4 = "^4.12.2"
python-dotenv = "^1.0.0"
timescale-vector = "^0.0.3"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
[tool.langserve]
export_module = "rag_timescale_conversation"
export_attr = "chain"
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"

@@ -0,0 +1,238 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "424a9d8d",
"metadata": {},
"source": [
"## Run Template\n",
"\n",
"In `server.py`, set -\n",
"```\n",
"add_routes(app, chain_rag_timescale_conv, path=\"/rag_timescale_conversation\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5f521923",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"rag_app = RemoteRunnable(\"http://0.0.0.0:8000/rag_timescale_conversation\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "563a58dd",
"metadata": {},
"source": [
"First, setup the history"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "14541994",
"metadata": {},
"outputs": [],
"source": [
"question = \"My name is Sven Klemm\"\n",
"answer = rag_app.invoke(\n",
" {\n",
" \"question\": question,\n",
" \"chat_history\": [],\n",
" }\n",
")\n",
"chat_history = [(question, answer)]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "63e76c4d",
"metadata": {},
"source": [
"Next, use the history for a question"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b2d8f735",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The person named Sven Klemm made the following commits:\\n\\n1. Commit \"a31c9b9f8cdfe8643499b710dc983e5c5d6457e4\" on \"Mon May 22 11:34:06 2023 +0200\" with the change summary \"Increase number of sqlsmith loops in nightly CI\". The change details are \"To improve coverage with sqlsmith we run it for longer in the scheduled nightly run.\"\\n\\n2. Commit \"e4ba2bcf560568ae68f3775c058f0a8d7f7c0501\" on \"Wed Nov 9 09:29:36 2022 +0100\" with the change summary \"Remove debian 9 from packages tests.\" The change details are \"Debian 9 is EOL since July 2022 so we won\\'t build packages for it anymore and can remove it from CI.\"'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"answer = rag_app.invoke(\n",
" {\n",
" \"question\": \"What commits did the person with my name make?\",\n",
" \"chat_history\": chat_history,\n",
" }\n",
")\n",
"answer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bd62df23",
"metadata": {},
"source": [
"## Filter by time\n",
"\n",
"You can also use timed filters. For example, the sample dataset doesn't include any commits before 2010, so this should return no matches."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b0a598b7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The context does not provide any information about any commits made by a person named Sven Klemm.'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer = rag_app.invoke(\n",
" {\n",
" \"question\": \"What commits did the person with my name make?\",\n",
" \"chat_history\": chat_history,\n",
" \"end_date\": \"2016-01-01 00:00:00\",\n",
" }\n",
")\n",
"answer\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "25851869",
"metadata": {},
"source": [
"However, there is data from 2022, which can be used"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4aef5219",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The person named Sven Klemm made the following commits:\\n\\n1. \"e4ba2bcf560568ae68f3775c058f0a8d7f7c0501\" with the change summary \"Remove debian 9 from packages tests.\" The details of this change are that \"Debian 9 is EOL since July 2022 so we won\\'t build packages for it anymore and can remove it from CI.\"\\n\\n2. \"2f237e6e57e5ac66c126233d66969a1f674ffaa4\" with the change summary \"Add Enterprise Linux 9 packages to RPM package test\". The change details for this commit are not provided.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer = rag_app.invoke(\n",
" {\n",
" \"question\": \"What commits did the person with my name make?\",\n",
" \"chat_history\": chat_history,\n",
" \"start_date\": \"2020-01-01 00:00:00\",\n",
" \"end_date\": \"2023-01-01 00:00:00\",\n",
" }\n",
")\n",
"answer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6ad86fbd",
"metadata": {},
"source": [
"## Filter by metadata\n",
"\n",
"You can also filter by metadata using this chain"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7ac9365f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The person named Sven Klemm made a commit with the ID \"5cd2c038796fb302190b080c90e5acddbef4b8d1\". The change summary for this commit is \"Simplify windows-build-and-test-ignored.yaml\" and the change details are \"Remove code not needed for the skip workflow of the windows test.\" The commit was made on \"Sat Mar 4 10:18:34 2023 +0100\".'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer = rag_app.invoke(\n",
" {\n",
" \"question\": \"What commits did the person with my name make?\",\n",
" \"chat_history\": chat_history,\n",
" \"metadata_filter\": {\"commit_hash\": \" 5cd2c038796fb302190b080c90e5acddbef4b8d1\"},\n",
" }\n",
")\n",
"answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1cde5da5",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,3 @@
from rag_timescale_conversation.chain import chain
__all__ = ["chain"]

@@ -0,0 +1,164 @@
import os
from datetime import datetime, timedelta
from operator import itemgetter
from typing import List, Optional, Tuple
from dotenv import find_dotenv, load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, format_document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import (
RunnableBranch,
RunnableLambda,
RunnableMap,
RunnablePassthrough,
)
from langchain.vectorstores.timescalevector import TimescaleVector
from pydantic import BaseModel, Field
from .load_sample_dataset import load_ts_git_dataset
load_dotenv(find_dotenv())
if os.environ.get("TIMESCALE_SERVICE_URL", None) is None:
    raise Exception("Missing `TIMESCALE_SERVICE_URL` environment variable.")
SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
LOAD_SAMPLE_DATA = os.environ.get("LOAD_SAMPLE_DATA", False)
COLLECTION_NAME = os.environ.get("COLLECTION_NAME", "timescale_commits")
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4")
partition_interval = timedelta(days=7)
if LOAD_SAMPLE_DATA:
    load_ts_git_dataset(
        SERVICE_URL,
        collection_name=COLLECTION_NAME,
        num_records=500,
        partition_interval=partition_interval,
    )
embeddings = OpenAIEmbeddings()
vectorstore = TimescaleVector(
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    time_partition_interval=partition_interval,
)
retriever = vectorstore.as_retriever()
# Condense a chat history and follow-up question into a standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:""" # noqa: E501
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
# RAG answer synthesis prompt
template = """Answer the question based only on the following context:
<context>
{context}
</context>"""
ANSWER_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{question}"),
    ]
)
# Conversational Retrieval Chain
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")
def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)
def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer
# User input
class ChatHistory(BaseModel):
    chat_history: List[Tuple[str, str]] = Field(..., extra={"widget": {"type": "chat"}})
    question: str
    start_date: Optional[datetime]
    end_date: Optional[datetime]
    metadata_filter: Optional[dict]
_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            retriever_query=RunnablePassthrough.assign(
                chat_history=lambda x: _format_chat_history(x["chat_history"])
            )
            | CONDENSE_QUESTION_PROMPT
            | ChatOpenAI(temperature=0, model=OPENAI_MODEL)
            | StrOutputParser()
        ),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnablePassthrough.assign(retriever_query=lambda x: x["question"]),
)
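# Build a retriever whose search kwargs carry the optional time window and metadata filter from the request.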
def get_retriever_with_metadata(x):
    start_dt = x.get("start_date", None)
    end_dt = x.get("end_date", None)
    metadata_filter = x.get("metadata_filter", None)
    opt = {}
    if start_dt is not None:
        opt["start_date"] = start_dt
    if end_dt is not None:
        opt["end_date"] = end_dt
    if metadata_filter is not None:
        opt["filter"] = metadata_filter
    v = vectorstore.as_retriever(search_kwargs=opt)
    return RunnableLambda(itemgetter("retriever_query")) | v
_retriever = RunnableLambda(get_retriever_with_metadata)
_inputs = RunnableMap(
    {
        "question": lambda x: x["question"],
        "chat_history": lambda x: _format_chat_history(x["chat_history"]),
        "start_date": lambda x: x.get("start_date", None),
        "end_date": lambda x: x.get("end_date", None),
        "context": _search_query | _retriever | _combine_documents,
    }
)
_datetime_to_string = RunnablePassthrough.assign(
    start_date=lambda x: x.get("start_date", None).isoformat()
    if x.get("start_date", None) is not None
    else None,
    end_date=lambda x: x.get("end_date", None).isoformat()
    if x.get("end_date", None) is not None
    else None,
).with_types(input_type=ChatHistory)
chain = (
    _datetime_to_string
    | _inputs
    | ANSWER_PROMPT
    | ChatOpenAI(model=OPENAI_MODEL)
    | StrOutputParser()
)

@@ -0,0 +1,84 @@
import os
import tempfile
from datetime import datetime, timedelta
import requests
from langchain.document_loaders import JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector
from timescale_vector import client
def parse_date(date_string: str) -> datetime:
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    return datetime.strptime(date_string, time_format)
def extract_metadata(record: dict, metadata: dict) -> dict:
    dt = parse_date(record["date"])
    metadata["id"] = str(client.uuid_from_time(dt))
    if dt is not None:
        metadata["date"] = dt.isoformat()
    else:
        metadata["date"] = None
    metadata["author"] = record["author"]
    metadata["commit_hash"] = record["commit"]
    return metadata
def load_ts_git_dataset(
    service_url,
    collection_name="timescale_commits",
    num_records: int = 500,
    partition_interval=timedelta(days=7),
):
    json_url = "https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
    tmp_file = "ts_git_log.json"
    temp_dir = tempfile.gettempdir()
    json_file_path = os.path.join(temp_dir, tmp_file)
    if not os.path.exists(json_file_path):
        response = requests.get(json_url)
        if response.status_code == 200:
            with open(json_file_path, "w") as json_file:
                json_file.write(response.text)
        else:
            print(f"Failed to download JSON file. Status code: {response.status_code}")
    loader = JSONLoader(
        file_path=json_file_path,
        jq_schema=".commit_history[]",
        text_content=False,
        metadata_func=extract_metadata,
    )
    documents = loader.load()
    # Remove documents with None dates
    documents = [doc for doc in documents if doc.metadata["date"] is not None]
    if num_records > 0:
        documents = documents[:num_records]
    # Split the documents into chunks for embedding
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings()
    # Create a Timescale Vector instance from the collection of documents
    TimescaleVector.from_documents(
        embedding=embeddings,
        ids=[doc.metadata["id"] for doc in docs],
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=partition_interval,
    )