mirror of https://github.com/hwchase17/langchain
Add template for conversational rag with timescale vector (#13041)
**Description:** This is like the `rag-conversation` template in many ways. What's different is:
- support for a Timescale Vector store
- support for time-based filters
- support for metadata filters

Co-authored-by: Erick Friis <erick@langchain.dev>
parent 1a1a1a883f
commit 180657ca7a
@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@ -0,0 +1,80 @@
# rag-timescale-conversation

This template is used for [conversational](https://python.langchain.com/docs/expression_language/cookbook/retrieval#conversational-retrieval-chain) [retrieval](https://python.langchain.com/docs/use_cases/question_answering/), which is one of the most popular LLM use-cases.

It passes both a conversation history and retrieved documents into an LLM for synthesis.

## Environment Setup

This template uses Timescale Vector as a vectorstore and requires that the `TIMESCALE_SERVICE_URL` environment variable is set. Sign up for a 90-day trial [here](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) if you don't yet have an account.

To load the sample dataset, set `LOAD_SAMPLE_DATA=1`. To load your own dataset, see the section below.

Set the `OPENAI_API_KEY` environment variable to access the OpenAI models.
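
For example, a minimal setup might look like this (the values shown are placeholders for your own credentials, and `LOAD_SAMPLE_DATA` is optional):

```shell
export TIMESCALE_SERVICE_URL="postgres://tsdbadmin:<password>@<host>:<port>/tsdb?sslmode=require"  # placeholder
export OPENAI_API_KEY="sk-..."  # placeholder
export LOAD_SAMPLE_DATA=1  # optional: ingest the sample Timescale git-log dataset on startup
```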

## Usage

To use this package, you should first have the LangChain CLI installed:

```shell
pip install -U "langchain-cli[serve]"
```

To create a new LangChain project and install this as the only package, you can do:

```shell
langchain app new my-app --package rag-timescale-conversation
```

If you want to add this to an existing project, you can just run:

```shell
langchain app add rag-timescale-conversation
```

And add the following code to your `server.py` file:

```python
from rag_timescale_conversation import chain as rag_timescale_conversation_chain

add_routes(app, rag_timescale_conversation_chain, path="/rag-timescale-conversation")
```

(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor, and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project>  # if not specified, defaults to "default"
```

If you are inside this directory, then you can spin up a LangServe instance directly by:

```shell
langchain serve
```

This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)

We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-timescale-conversation/playground](http://127.0.0.1:8000/rag-timescale-conversation/playground)

We can access the template from code with:

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/rag-timescale-conversation")
```
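
The runnable accepts the same inputs as the playground; here is a minimal sketch of an invocation (the question and date values are illustrative, and the keys follow this template's `ChatHistory` input schema):

```python
answer = runnable.invoke(
    {
        "question": "What commits did Sven Klemm make?",
        "chat_history": [],  # list of (human, ai) message tuples
        # Optional time-based and metadata filters supported by this template:
        "start_date": "2020-01-01 00:00:00",
        "end_date": "2023-01-01 00:00:00",
    }
)
```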

See the `rag_conversation.ipynb` notebook for example usage.

## Loading your own dataset

To load your own dataset, you will have to create a `load_dataset` function. You can see an example in the `load_ts_git_dataset` function defined in the `load_sample_dataset.py` file. You can then run this as a standalone function (e.g. in a bash script) or add it to `chain.py` (but then you should run it only once).
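
As a rough sketch of what such a function might look like (the `TextLoader` and file path are hypothetical stand-ins for your own loader; the `TimescaleVector.from_documents` call mirrors the one in `load_sample_dataset.py`):

```python
from datetime import timedelta

from langchain.document_loaders import TextLoader  # hypothetical: use any loader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.timescalevector import TimescaleVector


def load_dataset(service_url: str, collection_name: str = "my_collection"):
    # Load your own documents; any LangChain document loader works here.
    documents = TextLoader("my_data.txt").load()
    # Embed and store them in Timescale Vector, partitioned by time.
    TimescaleVector.from_documents(
        embedding=OpenAIEmbeddings(),
        documents=documents,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=timedelta(days=7),
    )
```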
@ -0,0 +1,31 @@
[tool.poetry]
name = "rag-timescale-conversation"
version = "0.1.0"
description = ""
authors = [
    "Lance Martin <lance@langchain.dev>",
]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325"
openai = ">=0.28.1"
tiktoken = ">=0.5.1"
pinecone-client = ">=2.2.4"
beautifulsoup4 = "^4.12.2"
python-dotenv = "^1.0.0"
timescale-vector = "^0.0.3"

[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"

[tool.langserve]
export_module = "rag_timescale_conversation"
export_attr = "chain"

[build-system]
requires = [
    "poetry-core",
]
build-backend = "poetry.core.masonry.api"
@ -0,0 +1,238 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "424a9d8d",
   "metadata": {},
   "source": [
    "## Run Template\n",
    "\n",
    "In `server.py`, set:\n",
    "```\n",
    "add_routes(app, chain_rag_timescale_conv, path=\"/rag_timescale_conversation\")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5f521923",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langserve.client import RemoteRunnable\n",
    "\n",
    "rag_app = RemoteRunnable(\"http://0.0.0.0:8000/rag_timescale_conversation\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "563a58dd",
   "metadata": {},
   "source": [
    "First, set up the history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "14541994",
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"My name is Sven Klemm\"\n",
    "answer = rag_app.invoke(\n",
    "    {\n",
    "        \"question\": question,\n",
    "        \"chat_history\": [],\n",
    "    }\n",
    ")\n",
    "chat_history = [(question, answer)]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "63e76c4d",
   "metadata": {},
   "source": [
    "Next, use the history for a question"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b2d8f735",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The person named Sven Klemm made the following commits:\\n\\n1. Commit \"a31c9b9f8cdfe8643499b710dc983e5c5d6457e4\" on \"Mon May 22 11:34:06 2023 +0200\" with the change summary \"Increase number of sqlsmith loops in nightly CI\". The change details are \"To improve coverage with sqlsmith we run it for longer in the scheduled nightly run.\"\\n\\n2. Commit \"e4ba2bcf560568ae68f3775c058f0a8d7f7c0501\" on \"Wed Nov 9 09:29:36 2022 +0100\" with the change summary \"Remove debian 9 from packages tests.\" The change details are \"Debian 9 is EOL since July 2022 so we won\\'t build packages for it anymore and can remove it from CI.\"'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer = rag_app.invoke(\n",
    "    {\n",
    "        \"question\": \"What commits did the person with my name make?\",\n",
    "        \"chat_history\": chat_history,\n",
    "    }\n",
    ")\n",
    "answer"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bd62df23",
   "metadata": {},
   "source": [
    "## Filter by time\n",
    "\n",
    "You can also use time-based filters. For example, the sample dataset doesn't include any commits before 2010, so this should return no matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b0a598b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The context does not provide any information about any commits made by a person named Sven Klemm.'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer = rag_app.invoke(\n",
    "    {\n",
    "        \"question\": \"What commits did the person with my name make?\",\n",
    "        \"chat_history\": chat_history,\n",
    "        \"end_date\": \"2016-01-01 00:00:00\",\n",
    "    }\n",
    ")\n",
    "answer\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "25851869",
   "metadata": {},
   "source": [
    "However, there is data from 2022, which can be used"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4aef5219",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The person named Sven Klemm made the following commits:\\n\\n1. \"e4ba2bcf560568ae68f3775c058f0a8d7f7c0501\" with the change summary \"Remove debian 9 from packages tests.\" The details of this change are that \"Debian 9 is EOL since July 2022 so we won\\'t build packages for it anymore and can remove it from CI.\"\\n\\n2. \"2f237e6e57e5ac66c126233d66969a1f674ffaa4\" with the change summary \"Add Enterprise Linux 9 packages to RPM package test\". The change details for this commit are not provided.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer = rag_app.invoke(\n",
    "    {\n",
    "        \"question\": \"What commits did the person with my name make?\",\n",
    "        \"chat_history\": chat_history,\n",
    "        \"start_date\": \"2020-01-01 00:00:00\",\n",
    "        \"end_date\": \"2023-01-01 00:00:00\",\n",
    "    }\n",
    ")\n",
    "answer"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6ad86fbd",
   "metadata": {},
   "source": [
    "## Filter by metadata\n",
    "\n",
    "You can also filter by metadata using this chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7ac9365f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The person named Sven Klemm made a commit with the ID \"5cd2c038796fb302190b080c90e5acddbef4b8d1\". The change summary for this commit is \"Simplify windows-build-and-test-ignored.yaml\" and the change details are \"Remove code not needed for the skip workflow of the windows test.\" The commit was made on \"Sat Mar 4 10:18:34 2023 +0100\".'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer = rag_app.invoke(\n",
    "    {\n",
    "        \"question\": \"What commits did the person with my name make?\",\n",
    "        \"chat_history\": chat_history,\n",
    "        \"metadata_filter\": {\"commit_hash\": \"5cd2c038796fb302190b080c90e5acddbef4b8d1\"},\n",
    "    }\n",
    ")\n",
    "answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cde5da5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@ -0,0 +1,3 @@
from rag_timescale_conversation.chain import chain

__all__ = ["chain"]
@ -0,0 +1,164 @@
import os
from datetime import datetime, timedelta
from operator import itemgetter
from typing import List, Optional, Tuple

from dotenv import find_dotenv, load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, format_document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import (
    RunnableBranch,
    RunnableLambda,
    RunnableMap,
    RunnablePassthrough,
)
from langchain.vectorstores.timescalevector import TimescaleVector
from pydantic import BaseModel, Field

from .load_sample_dataset import load_ts_git_dataset

load_dotenv(find_dotenv())

if os.environ.get("TIMESCALE_SERVICE_URL", None) is None:
    raise Exception("Missing `TIMESCALE_SERVICE_URL` environment variable.")

SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
LOAD_SAMPLE_DATA = os.environ.get("LOAD_SAMPLE_DATA", False)
COLLECTION_NAME = os.environ.get("COLLECTION_NAME", "timescale_commits")
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4")

partition_interval = timedelta(days=7)
if LOAD_SAMPLE_DATA:
    load_ts_git_dataset(
        SERVICE_URL,
        collection_name=COLLECTION_NAME,
        num_records=500,
        partition_interval=partition_interval,
    )

embeddings = OpenAIEmbeddings()
vectorstore = TimescaleVector(
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    time_partition_interval=partition_interval,
)
retriever = vectorstore.as_retriever()

# Condense a chat history and follow-up question into a standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""  # noqa: E501
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

# RAG answer synthesis prompt
template = """Answer the question based only on the following context:
<context>
{context}
</context>"""
ANSWER_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{question}"),
    ]
)

# Conversational Retrieval Chain
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer


# User input
class ChatHistory(BaseModel):
    chat_history: List[Tuple[str, str]] = Field(..., extra={"widget": {"type": "chat"}})
    question: str
    start_date: Optional[datetime]
    end_date: Optional[datetime]
    metadata_filter: Optional[dict]


_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            retriever_query=RunnablePassthrough.assign(
                chat_history=lambda x: _format_chat_history(x["chat_history"])
            )
            | CONDENSE_QUESTION_PROMPT
            | ChatOpenAI(temperature=0, model=OPENAI_MODEL)
            | StrOutputParser()
        ),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnablePassthrough.assign(retriever_query=lambda x: x["question"]),
)


def get_retriever_with_metadata(x):
    # Build a retriever for this request, passing any time-window or
    # metadata filters through as search kwargs for Timescale Vector.
    start_dt = x.get("start_date", None)
    end_dt = x.get("end_date", None)
    metadata_filter = x.get("metadata_filter", None)
    opt = {}

    if start_dt is not None:
        opt["start_date"] = start_dt
    if end_dt is not None:
        opt["end_date"] = end_dt
    if metadata_filter is not None:
        opt["filter"] = metadata_filter
    v = vectorstore.as_retriever(search_kwargs=opt)
    return RunnableLambda(itemgetter("retriever_query")) | v


_retriever = RunnableLambda(get_retriever_with_metadata)

_inputs = RunnableMap(
    {
        "question": lambda x: x["question"],
        "chat_history": lambda x: _format_chat_history(x["chat_history"]),
        "start_date": lambda x: x.get("start_date", None),
        "end_date": lambda x: x.get("end_date", None),
        "context": _search_query | _retriever | _combine_documents,
    }
)

# Serialize the optional datetime filters to ISO-format strings and declare
# the chain's input schema.
_datetime_to_string = RunnablePassthrough.assign(
    start_date=lambda x: x.get("start_date", None).isoformat()
    if x.get("start_date", None) is not None
    else None,
    end_date=lambda x: x.get("end_date", None).isoformat()
    if x.get("end_date", None) is not None
    else None,
).with_types(input_type=ChatHistory)

chain = (
    _datetime_to_string
    | _inputs
    | ANSWER_PROMPT
    | ChatOpenAI(model=OPENAI_MODEL)
    | StrOutputParser()
)
@ -0,0 +1,84 @@
import os
import tempfile
from datetime import datetime, timedelta

import requests
from langchain.document_loaders import JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector
from timescale_vector import client


def parse_date(date_string: str) -> datetime:
    # Parse git-log dates such as "Sat Mar 4 10:18:34 2023 +0100".
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    return datetime.strptime(date_string, time_format)


def extract_metadata(record: dict, metadata: dict) -> dict:
    # Derive a time-based UUID from the commit date so records are
    # partitioned by time in Timescale Vector.
    dt = parse_date(record["date"])
    metadata["id"] = str(client.uuid_from_time(dt))
    if dt is not None:
        metadata["date"] = dt.isoformat()
    else:
        metadata["date"] = None
    metadata["author"] = record["author"]
    metadata["commit_hash"] = record["commit"]
    return metadata


def load_ts_git_dataset(
    service_url,
    collection_name="timescale_commits",
    num_records: int = 500,
    partition_interval=timedelta(days=7),
):
    json_url = "https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
    tmp_file = "ts_git_log.json"

    temp_dir = tempfile.gettempdir()
    json_file_path = os.path.join(temp_dir, tmp_file)

    if not os.path.exists(json_file_path):
        response = requests.get(json_url)
        if response.status_code == 200:
            with open(json_file_path, "w") as json_file:
                json_file.write(response.text)
        else:
            print(f"Failed to download JSON file. Status code: {response.status_code}")

    loader = JSONLoader(
        file_path=json_file_path,
        jq_schema=".commit_history[]",
        text_content=False,
        metadata_func=extract_metadata,
    )

    documents = loader.load()

    # Remove documents with None dates
    documents = [doc for doc in documents if doc.metadata["date"] is not None]

    if num_records > 0:
        documents = documents[:num_records]

    # Split the documents into chunks for embedding
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()

    # Create a Timescale Vector instance from the collection of documents
    TimescaleVector.from_documents(
        embedding=embeddings,
        ids=[doc.metadata["id"] for doc in docs],
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=partition_interval,
    )