langchain/libs/experimental/langchain_experimental/sql/prompt.py

Resolve: VectorSearch enabled SQLChain? (#10177)

Squashed from #7454 with updated features.

We have separated the `SQLDatabaseChain` from `VectorSQLDatabaseChain` and put everything into `experimental/`.

Below is the original PR message from #7454.

-------

We have been working on features to fill the gap among SQL, vector search and LLM applications. Some inspiring works like self-query retrievers for VectorStores (for example [Weaviate](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/weaviate_self_query.html) and [others](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query.html)) really turn those vector search databases into a powerful knowledge base! 🚀🚀

We are wondering whether we can merge everything into one, combining SQL, vector search and LLMChains, and making this SQL vector database memory the only source of your data. Here are some benefits we can think of for now, maybe you have more 👀:

- With ALL data you have: since you store all your data in the database, you don't need to worry about foreign keys or links between names from other data sources.
- Flexible data structure: even if you have changed your schema, for example by adding a table, the LLM will know how to JOIN those tables and use them as filters.
- SQL compatibility: we found that vector databases on the market that support SQL have similar interfaces, which means you can change your backend with no pain. Just change the name of the distance function in your DB solution and you are ready to go!

### Issue resolved:
- [Feature Proposal: VectorSearch enabled SQLChain?](https://github.com/hwchase17/langchain/issues/5122)

### Change made in this PR:
- An improved schema handling that ignores `types.NullType` columns
- A SQL output parser interface in `SQLDatabaseChain` to enable Vector SQL capability and more
- A Retriever based on `SQLDatabaseChain` to retrieve data from the database for RetrievalQAChains and many others
- Allow `SQLDatabaseChain` to retrieve data in Python native format
- Includes PR #6737
- Vector SQL Output Parser for `SQLDatabaseChain` and `SQLDatabaseChainRetriever`
- Prompts that can implement text to VectorSQL
- Corresponding unit-tests and notebook

### Twitter handle:
- @MyScaleDB

### Tag Maintainer:
Prompts / General: @hwchase17, @baskaryan
DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev

### Dependencies:
No dependency added
2023-09-07 00:08:12 +00:00
# flake8: noqa
from langchain.prompts.prompt import PromptTemplate
PROMPT_SUFFIX = """Only use the following tables:
{table_info}
Question: {input}"""
_VECTOR_SQL_DEFAULT_TEMPLATE = """You are a {dialect} expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.
{dialect} queries have a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by the relevance.
When the query asks for the {top_k} closest rows, you have to use this distance function to calculate the distance to the entity's array on the vector column and order by the distance to retrieve relevant rows.
*NOTICE*: `DISTANCE(column, array)` only accepts an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user-defined function called `NeuralArray(entity)` to retrieve the entity's array.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per {dialect}. You should only order according to the distance function.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Use only the column names you can see in the tables below. Be careful not to query for columns that do not exist. Also, pay attention to which column is in which table.
Use the today() function to get the current date if the question involves "today". The `ORDER BY` clause should always come after the `WHERE` clause. DO NOT add a semicolon to the end of the SQL. Pay attention to the comments in the table schema.
Use the following format:
Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"
"""
VECTOR_SQL_PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "dialect", "top_k"],
    template=_VECTOR_SQL_DEFAULT_TEMPLATE + PROMPT_SUFFIX,
)
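# The template above is filled by substituting its four placeholders
# ({dialect}, {top_k}, {table_info}, {input}). The sketch below imitates that
# substitution with plain `str.format` on a shortened stand-in template, so it
# runs without LangChain installed. `TEMPLATE`, `render_prompt`, and the sample
# table are hypothetical names used for illustration only; they are not part
# of this module.

```python
# Shortened stand-in for the full prompt; the placeholder names match the
# input_variables declared for VECTOR_SQL_PROMPT.
TEMPLATE = (
    "You are a {dialect} expert. Query for at most {top_k} results.\n"
    "Only use the following tables:\n"
    "{table_info}\n"
    "Question: {input}"
)


def render_prompt(dialect: str, top_k: int, table_info: str, question: str) -> str:
    """Fill every placeholder of the stand-in template."""
    return TEMPLATE.format(
        dialect=dialect,
        top_k=top_k,
        table_info=table_info,
        input=question,
    )


rendered = render_prompt(
    dialect="clickhouse",
    top_k=5,
    table_info='CREATE TABLE "docs" (id String, vector Array(Float32))',
    question="Which documents discuss vector search?",
)
print(rendered)
```

# Because `{dialect}` appears several times in the real template, a single
# value is reused at every occurrence; only the four names listed in
# `input_variables` need to be supplied.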
_myscale_prompt = """You are a MyScale expert. Given an input question, first create a syntactically correct MyScale query to run, then look at the results of the query and return the answer to the input question.
MyScale queries have a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by the relevance.
When the query asks for the {top_k} closest rows, you have to use this distance function to calculate the distance to the entity's array on the vector column and order by the distance to retrieve relevant rows.
*NOTICE*: `DISTANCE(column, array)` only accepts an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user-defined function called `NeuralArray(entity)` to retrieve the entity's array.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per MyScale. You should only order according to the distance function.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Use only the column names you can see in the tables below. Be careful not to query for columns that do not exist. Also, pay attention to which column is in which table.
Use the today() function to get the current date if the question involves "today". The `ORDER BY` clause should always come after the `WHERE` clause. DO NOT add a semicolon to the end of the SQL. Pay attention to the comments in the table schema.
Use the following format:
======== table info ========
<some table info>
Question: "Question here"
SQLQuery: "SQL Query to run"
Here are some examples:
======== table info ========
CREATE TABLE "ChatPaper" (
abstract String,
id String,
vector Array(Float32),
) ENGINE = ReplicatedReplacingMergeTree()
ORDER BY id
PRIMARY KEY id
Question: What is Feature Pyramid Network?
SQLQuery: SELECT ChatPaper.abstract, ChatPaper.id FROM ChatPaper ORDER BY DISTANCE(vector, NeuralArray(Feature Pyramid Network)) LIMIT {top_k}
Let's begin:
======== table info ========
{table_info}
Question: {input}
SQLQuery: """
MYSCALE_PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "top_k"],
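    # _myscale_prompt already ends with its own table info, question, and
    # "SQLQuery:" cue, so PROMPT_SUFFIX is not appended a second time.
    template=_myscale_prompt,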
)
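# Both prompts tell the model to emit `NeuralArray(entity)` as the second
# argument of `DISTANCE`. Before such a query can run, something has to turn
# that placeholder into an actual vector. The sketch below shows one possible
# post-processing step using a regular expression and a toy embedding function.
# `toy_embed` and `expand_neural_array` are hypothetical helpers written for
# this illustration; a real pipeline would call an embedding model and may
# parse the SQL differently.

```python
import re
from typing import Callable, List

# Non-greedy match of the entity text inside NeuralArray(...).
# Assumption: the entity itself contains no closing parenthesis.
NEURAL_ARRAY = re.compile(r"NeuralArray\((.*?)\)")


def toy_embed(text: str) -> List[float]:
    """Toy stand-in for an embedding model: a deterministic 3-element vector."""
    return [float(len(text)), float(len(text.split())), 1.0]


def expand_neural_array(sql: str, embed: Callable[[str], List[float]] = toy_embed) -> str:
    """Replace every NeuralArray(entity) with an array literal of its embedding."""

    def _replace(match: "re.Match[str]") -> str:
        vector = embed(match.group(1).strip())
        return "[" + ",".join(f"{value:g}" for value in vector) + "]"

    return NEURAL_ARRAY.sub(_replace, sql)


sql = (
    "SELECT id FROM ChatPaper "
    "ORDER BY DISTANCE(vector, NeuralArray(Feature Pyramid Network)) LIMIT 3"
)
expanded = expand_neural_array(sql)
print(expanded)
```

# Keeping the embedding function as a parameter lets the same substitution
# logic be tested with a toy function and reused with a real model.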
VECTOR_SQL_PROMPTS = {
    "myscale": MYSCALE_PROMPT,
}
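# A dialect-keyed mapping like the one above is typically used to pick a
# specialised prompt when one exists and to fall back to the generic prompt
# otherwise. The sketch below shows that lookup with plain strings standing in
# for the PromptTemplate objects. `select_prompt` and the stand-in values are
# hypothetical; how the calling chain actually performs the lookup is not
# shown in this file.

```python
from typing import Dict

# Stand-ins for VECTOR_SQL_PROMPT and VECTOR_SQL_PROMPTS.
DEFAULT_PROMPT = "generic vector SQL prompt"
PROMPTS_BY_DIALECT: Dict[str, str] = {"myscale": "MyScale-specific prompt"}


def select_prompt(
    dialect: str,
    prompts: Dict[str, str] = PROMPTS_BY_DIALECT,
    default: str = DEFAULT_PROMPT,
) -> str:
    """Return the dialect-specific prompt, or the default if none is registered."""
    return prompts.get(dialect.lower(), default)


print(select_prompt("myscale"))     # dialect with a dedicated prompt
print(select_prompt("clickhouse"))  # falls back to the generic prompt
```

# Supporting another backend under this scheme would mean adding one more
# entry to the mapping, with no change to the lookup itself.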