langchain/cookbook/rag_fusion.ipynb

272 lines
7.5 KiB
Plaintext
Raw Normal View History

2023-10-21 22:37:11 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "993c2768",
"metadata": {},
"source": [
"# RAG Fusion\n",
"\n",
"Re-implemented from [this GitHub repo](https://github.com/Raudaschl/rag-fusion), all credit to original author\n",
"\n",
"> RAG-Fusion, a search methodology that aims to bridge the gap between traditional search paradigms and the multifaceted dimensions of human queries. Inspired by the capabilities of Retrieval Augmented Generation (RAG), this project goes a step further by employing multiple query generation and Reciprocal Rank Fusion to re-rank search results."
]
},
{
"cell_type": "markdown",
"id": "ebcc6791",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"For this example, we will use Pinecone and some fake data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "661a1c36",
"metadata": {},
"outputs": [],
"source": [
"import pinecone\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
2023-11-14 22:17:44 +00:00
"from langchain.vectorstores import Pinecone\n",
2023-10-21 22:37:11 +00:00
"\n",
2023-10-29 22:50:09 +00:00
"pinecone.init(api_key=\"...\", environment=\"...\")"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48ef7e93",
"metadata": {},
"outputs": [],
"source": [
"all_documents = {\n",
" \"doc1\": \"Climate change and economic impact.\",\n",
" \"doc2\": \"Public health concerns due to climate change.\",\n",
" \"doc3\": \"Climate change: A social perspective.\",\n",
" \"doc4\": \"Technological solutions to climate change.\",\n",
" \"doc5\": \"Policy changes needed to combat climate change.\",\n",
" \"doc6\": \"Climate change and its impact on biodiversity.\",\n",
" \"doc7\": \"Climate change: The science and models.\",\n",
" \"doc8\": \"Global warming: A subset of climate change.\",\n",
" \"doc9\": \"How climate change affects daily weather.\",\n",
2023-10-29 22:50:09 +00:00
" \"doc10\": \"The history of climate change activism.\",\n",
2023-10-21 22:37:11 +00:00
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fde89f0b",
"metadata": {},
"outputs": [],
"source": [
2023-10-29 22:50:09 +00:00
"vectorstore = Pinecone.from_texts(\n",
" list(all_documents.values()), OpenAIEmbeddings(), index_name=\"rag-fusion\"\n",
")"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "markdown",
"id": "22ddd041",
"metadata": {},
"source": [
"## Define the Query Generator\n",
"\n",
"We will now define a chain to do the query generation"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1d547524",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
docs[patch], templates[patch]: Import from core (#14575) Update imports to use core for the low-hanging fruit changes. Ran following ```bash git grep -l 'langchain.schema.runnable' {docs,templates,cookbook} | xargs sed -i '' 's/langchain\.schema\.runnable/langchain_core.runnables/g' git grep -l 'langchain.schema.output_parser' {docs,templates,cookbook} | xargs sed -i '' 's/langchain\.schema\.output_parser/langchain_core.output_parsers/g' git grep -l 'langchain.schema.messages' {docs,templates,cookbook} | xargs sed -i '' 's/langchain\.schema\.messages/langchain_core.messages/g' git grep -l 'langchain.schema.chat_histry' {docs,templates,cookbook} | xargs sed -i '' 's/langchain\.schema\.chat_history/langchain_core.chat_history/g' git grep -l 'langchain.schema.prompt_template' {docs,templates,cookbook} | xargs sed -i '' 's/langchain\.schema\.prompt_template/langchain_core.prompts/g' git grep -l 'from langchain.pydantic_v1' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.pydantic_v1/from langchain_core.pydantic_v1/g' git grep -l 'from langchain.tools.base' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.tools\.base/from langchain_core.tools/g' git grep -l 'from langchain.chat_models.base' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.chat_models.base/from langchain_core.language_models.chat_models/g' git grep -l 'from langchain.llms.base' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.llms\.base\ /from langchain_core.language_models.llms\ /g' git grep -l 'from langchain.embeddings.base' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.embeddings\.base/from langchain_core.embeddings/g' git grep -l 'from langchain.vectorstores.base' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.vectorstores\.base/from langchain_core.vectorstores/g' git grep -l 'from langchain.agents.tools' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.agents\.tools/from langchain_core.tools/g' git grep -l 'from langchain.schema.output' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.output\ /from langchain_core.outputs\ /g' git grep -l 'from langchain.schema.embeddings' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.embeddings/from langchain_core.embeddings/g' git grep -l 'from langchain.schema.document' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.document/from langchain_core.documents/g' git grep -l 'from langchain.schema.agent' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.agent/from langchain_core.agents/g' git grep -l 'from langchain.schema.prompt ' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.prompt\ /from langchain_core.prompt_values /g' git grep -l 'from langchain.schema.language_model' {docs,templates,cookbook} | xargs sed -i '' 's/from langchain\.schema\.language_model/from langchain_core.language_models/g' ```
2023-12-12 00:49:10 +00:00
"from langchain_core.output_parsers import StrOutputParser"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "af9ab4db",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"\n",
2023-10-29 22:50:09 +00:00
"prompt = hub.pull(\"langchain-ai/rag-fusion-query-generation\")"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3628b552",
"metadata": {},
"outputs": [],
"source": [
"# prompt = ChatPromptTemplate.from_messages([\n",
"# (\"system\", \"You are a helpful assistant that generates multiple search queries based on a single input query.\"),\n",
"# (\"user\", \"Generate multiple search queries related to: {original_query}\"),\n",
"# (\"user\", \"OUTPUT (4 queries):\")\n",
"# ])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8d6cbb73",
"metadata": {},
"outputs": [],
"source": [
2023-10-29 22:50:09 +00:00
"generate_queries = (\n",
" prompt | ChatOpenAI(temperature=0) | StrOutputParser() | (lambda x: x.split(\"\\n\"))\n",
")"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "markdown",
"id": "ee2824cd",
"metadata": {},
"source": [
"## Define the full chain\n",
"\n",
"We can now put it all together and define the full chain. This chain:\n",
" \n",
" 1. Generates a bunch of queries\n",
" 2. Looks up each query in the retriever\n",
" 3. Joins all the results together using reciprocal rank fusion\n",
" \n",
" \n",
"Note that it does NOT do a final generation step"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "ca0bfec4",
"metadata": {},
"outputs": [],
"source": [
"original_query = \"impact of climate change\""
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "02437d65",
"metadata": {},
"outputs": [],
"source": [
"vectorstore = Pinecone.from_existing_index(\"rag-fusion\", OpenAIEmbeddings())\n",
"retriever = vectorstore.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "46a9a0e6",
"metadata": {},
"outputs": [],
"source": [
"from langchain.load import dumps, loads\n",
2023-10-29 22:50:09 +00:00
"\n",
"\n",
2023-10-21 22:37:11 +00:00
"def reciprocal_rank_fusion(results: list[list], k=60):\n",
" fused_scores = {}\n",
" for docs in results:\n",
" # Assumes the docs are returned in sorted order of relevance\n",
" for rank, doc in enumerate(docs):\n",
" doc_str = dumps(doc)\n",
" if doc_str not in fused_scores:\n",
" fused_scores[doc_str] = 0\n",
" previous_score = fused_scores[doc_str]\n",
" fused_scores[doc_str] += 1 / (rank + k)\n",
2023-10-29 22:50:09 +00:00
"\n",
" reranked_results = [\n",
" (loads(doc), score)\n",
" for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)\n",
" ]\n",
" return reranked_results"
2023-10-21 22:37:11 +00:00
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "3f9d4502",
"metadata": {},
"outputs": [],
"source": [
"chain = generate_queries | retriever.map() | reciprocal_rank_fusion"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "d70c4fcd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(Document(page_content='Climate change and economic impact.'),\n",
" 0.06558258417063283),\n",
" (Document(page_content='Climate change: A social perspective.'),\n",
" 0.06400409626216078),\n",
" (Document(page_content='How climate change affects daily weather.'),\n",
" 0.04787506400409626),\n",
" (Document(page_content='Climate change and its impact on biodiversity.'),\n",
" 0.03306010928961749),\n",
" (Document(page_content='Public health concerns due to climate change.'),\n",
" 0.016666666666666666),\n",
" (Document(page_content='Technological solutions to climate change.'),\n",
" 0.016666666666666666),\n",
" (Document(page_content='Policy changes needed to combat climate change.'),\n",
" 0.01639344262295082)]"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke({\"original_query\": original_query})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7866e551",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}