Add RAG to conceptual guide (#22790)

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
pull/22993/head
Lance Martin 3 months ago committed by GitHub
parent c6b7db6587
commit a54deba6bc
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -11,7 +11,7 @@ LangChain as a framework consists of a number of packages.
### `langchain-core`
This package contains base abstractions of different components and ways to compose them together.
The interfaces for core components like LLMs, vectorstores, retrievers and more are defined here.
The interfaces for core components like LLMs, vector stores, retrievers and more are defined here.
No third party integrations are defined here.
The dependencies are kept purposefully very lightweight.
@ -30,7 +30,7 @@ All chains, agents, and retrieval strategies here are NOT specific to any one in
This package contains third party integrations that are maintained by the LangChain community.
Key partner packages are separated out (see below).
This contains all integrations for various components (LLMs, vectorstores, retrievers).
This contains all integrations for various components (LLMs, vector stores, retrievers).
All dependencies in this package are optional to keep the package as lightweight as possible.
### [`langgraph`](https://langchain-ai.github.io/langgraph)
@ -463,7 +463,7 @@ For specifics on how to use vector stores, see the [relevant how-to guides here]
A retriever is an interface that returns documents given an unstructured query.
It is more general than a vector store.
A retriever does not need to be able to store documents, only to return (or retrieve) them.
Retrievers can be created from vectorstores, but are also broad enough to include [Wikipedia search](/docs/integrations/retrievers/wikipedia/) and [Amazon Kendra](/docs/integrations/retrievers/amazon_kendra_retriever/).
Retrievers can be created from vector stores, but are also broad enough to include [Wikipedia search](/docs/integrations/retrievers/wikipedia/) and [Amazon Kendra](/docs/integrations/retrievers/amazon_kendra_retriever/).
Retrievers accept a string query as input and return a list of Document's as output.
@ -816,7 +816,7 @@ For a full list of model providers that support JSON mode, see [this table](/doc
We use the term tool calling interchangeably with function calling. Although
function calling is sometimes meant to refer to invocations of a single function,
we treat all models as though they can return multiple tool or function calls in
each message.
each message
:::
Tool calling allows a model to respond to a given prompt by generating output that
@ -860,30 +860,162 @@ For a full list of model providers that support tool calling, [see this table](/
### Retrieval
LangChain provides several advanced retrieval types. A full list is below, along with the following information:
LLMs are trained on a large but fixed dataset, limiting their ability to reason over private or recent information. Fine-tuning an LLM with specific facts is one way to mitigate this, but is often [poorly suited for factual recall](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts) and [can be costly](https://www.glean.com/blog/how-to-build-an-ai-assistant-for-the-enterprise).
Retrieval is the process of providing relevant information to an LLM to improve its response for a given input. Retrieval augmented generation (RAG) is the process of grounding the LLM generation (output) using the retrieved information.
**Name**: Name of the retrieval algorithm.
:::tip
**Index Type**: Which index type (if any) this relies on.
* See our RAG from Scratch [code](https://github.com/langchain-ai/rag-from-scratch) and [video series](https://youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&feature=shared).
* For a high-level guide on retrieval, see this [tutorial on RAG](/docs/tutorials/rag/).
**Uses an LLM**: Whether this retrieval method uses an LLM.
:::
RAG is only as good as the retrieved documents relevance and quality. Fortunately, an emerging set of techniques can be employed to design and improve RAG systems. We've focused on taxonomizing and summarizing many of these techniques (see below figure) and will share some high-level strategic guidance in the following sections.
You can and should experiment with using different pieces together. You might also find [this LangSmith guide](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application) useful for showing how to evaluate different iterations of your app.
![](/img/rag_landscape.png)
#### Query Translation
**When to Use**: Our commentary on when you should considering using this retrieval method.
First, consider the user input(s) to your RAG system. Ideally, a RAG system can handle a wide range of inputs, from poorly worded questions to complex multi-part queries.
**Using an LLM to review and optionally modify the input is the central idea behind query translation.** This serves as a general buffer, optimizing raw user inputs for your retrieval system.
For example, this can be as simple as extracting keywords or as complex as generating multiple sub-questions for a complex query.
**Description**: Description of what this retrieval algorithm is doing.
| Name | When to use | Description |
|---------------|-------------|-------------|
| [Multi-query](/docs/how_to/MultiQueryRetriever/) | When you need to cover multiple perspectives of a question. | Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, return the unique documents for all queries. |
| [Decomposition](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer). |
| [Step-back](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. |
| [HyDE](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches. |
:::tip
See our RAG from Scratch videos for a few different specific approaches:
- [Multi-query](https://youtu.be/JChPi0CRnDY?feature=shared)
- [Decomposition](https://youtu.be/h0OPWlEOank?feature=shared)
- [Step-back](https://youtu.be/xn1jEjRyJ2U?feature=shared)
- [HyDE](https://youtu.be/SaDzIVkYqyY?feature=shared)
:::
#### Routing
Second, consider the data sources available to your RAG system. You want to query across more than one database or across structured and unstructured data sources. **Using an LLM to review the input and route it to the appropriate data source is a simple and effective approach for querying across sources.**
| Name | When to use | Description |
|------------------|--------------------------------------------|-------------|
| [Logical routing](/docs/how_to/routing/#using-a-runnablebranch) | When you can prompt an LLM with rules to decide where to route the input. | Logical routing can use an LLM to reason about the query and choose which datastore is most appropriate. |
| [Semantic routing](/docs/how_to/routing/#using-a-runnablebranch) | When semantic similarity is an effective way to determine where to route the input. | Semantic routing embeds both query and, typically a set of prompts. It then chooses the appropriate prompt based upon similarity. |
:::tip
See our RAG from Scratch video on [routing](https://youtu.be/pfpIndq7Fi8?feature=shared).
:::
#### Query Construction
Third, consider whether any of your data sources require specific query formats. Many structured databases use SQL. Vector stores often have specific syntax for applying keyword filters to document metadata. **Using an LLM to convert a natural language query into a query syntax is a popular and powerful approach.**
In particular, [text-to-SQL](/docs/tutorials/sql_qa/), [text-to-Cypher](/docs/tutorials/graph/), and [query analysis for metadata filters](/docs/tutorials/query_analysis/#query-analysis) are useful ways to interact with structured, graph, and vector databases respectively.
| Name | When to Use | Description |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Text to SQL](/docs/tutorials/sql_qa/) | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
| [Text-to-Cypher](/docs/tutorials/graph/) | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |
| [Self Query](/docs/how_to/self_query/) | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
:::tip
See our [blog post overview](https://blog.langchain.dev/query-construction/) and RAG from Scratch video on [query construction](https://youtu.be/kl6NwWYxvbM?feature=shared), the process of text-to-DSL where DSL is a domain specific language required to interact with a given database. This converts user questions into structured queries.
:::
#### Indexing
Fouth, consider the design of your document index. A simple and powerful idea is to **decouple the documents that you index for retrieval from the documents that you pass to the LLM for generation.** Indexing frequently uses embedding models with vector stores, which [compress the semantic information in documents to fixed-size vectors](/docs/concepts/#embedding-models).
Many RAG approaches focus on splitting documents into chunks and retrieving some number based on similarity to an input question for the LLM. But chunk size and chunk number can be difficult to set and affect results if they do not provide full context for the LLM to answer a question. Furthermore, LLMs are increasingly capable of processing millions of tokens.
Two approaches can address this tension: (1) [Multi Vector](/docs/how_to/multi_vector/) retriever using an LLM to translate documents into any form (e.g., often into a summary) that is well-suited for indexing, but returns full documents to the LLM for generation. (2) [ParentDocument](/docs/how_to/parent_document_retriever/) retriever embeds document chunks, but also returns full documents. The idea is to get the best of both worlds: use concise representations (summaries or chunks) for retrieval, but use the full documents for answer generation.
| Name | Index Type | Uses an LLM | When to Use | Description |
|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Vector store](/docs/how_to/vectorstore_retriever/) | Vector store | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vector store + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
| [Multi Vector](/docs/how_to/multi_vector/) | Vector store + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
| [Time-Weighted Vector store](/docs/how_to/time_weighted_vectorstore/) | Vector store | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
:::tip
- See our RAG from Scratch video on [indexing fundamentals](https://youtu.be/bjb_EMsTDKI?feature=shared)
- See our RAG from Scratch video on [multi vector retriever](https://youtu.be/gTCU9I6QqCE?feature=shared)
:::
Fifth, consider ways to improve the quality of your similarity search itself. Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is useful for search / retrieval, but puts a heavy burden on that single vector representation to capture the semantic nuance / detail of the document. In some cases, irrelevant or redundant content can dilute the semantic usefulness of the embedding.
[ColBERT](https://docs.google.com/presentation/d/1IRhAdGjIevrrotdplHNcc4aXgIYyKamUKTWtB3m3aMU/edit?usp=sharing) is an interesting approach to address this with a higher granularity embeddings: (1) produce a contextually influenced embedding for each token in the document and query, (2) score similarity between each query token and all document tokens, (3) take the max, (4) do this for all query tokens, and (5) take the sum of the max scores (in step 3) for all query tokens to get a query-document similarity score; this token-wise scoring can yield strong results.
![](/img/colbert.png)
There are some additional tricks to improve the quality of your retrieval. Embeddings excel at capturing semantic information, but may struggle with keyword-based queries. Many [vector stores](https://python.langchain.com/v0.2/docs/integrations/retrievers/pinecone_hybrid_search/) offer built-in [hybrid-search](https://docs.pinecone.io/guides/data/understanding-hybrid-search) to combine keyword and semantic similarity, which marries the benefits of both approaches. Furthermore, many vector stores have [maximal marginal relevance](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/mmr/), which attempts to diversify the results of a search to avoid returning similar and redundant documents.
| Name | When to use | Description |
|-------------------|----------------------------------------------------------|-------------|
| [ColBERT](/docs/integrations/providers/ragatouille/#using-colbert-as-a-reranker) | When higher granularity embeddings are needed. | ColBERT uses contextually influenced embeddings for each token in the document and query to get a granular query-document similarity score. |
| [Hybrid search](/docs/integrations/retrievers/pinecone_hybrid_search/) | When combining keyword-based and semantic similarity. | Hybrid search combines keyword and semantic similarity, marrying the benefits of both approaches. |
| [Maximal Marginal Relevance (MMR) ](/docs/integrations/vectorstores/pinecone/#maximal-marginal-relevance-searches) | When needing to diversify search results. | MMR attempts to diversify the results of a search to avoid returning similar and redundant documents. |
:::tip
See our RAG from Scratch video on [ColBERT](https://youtu.be/cN6S0Ehm7_8?feature=shared>).
:::
#### Post-processing
Sixth, consider ways to filter or rank retrieved documents. This is very useful if you are [combining documents returned from multiple sources](/docs/integrations/retrievers/cohere-reranker/#doing-reranking-with-coherererank), since it can can down-rank less relevant documents and / or [compress similar documents](/docs/how_to/contextual_compression/#more-built-in-compressors-filters).
| Name | Index Type | Uses an LLM | When to Use | Description |
|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Vectorstore](/docs/how_to/vectorstore_retriever/) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
| [Multi Vector](/docs/how_to/multi_vector/) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
| [Self Query](/docs/how_to/self_query/) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filer to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
| [Contextual Compression](/docs/how_to/contextual_compression/) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
| [Time-Weighted Vectorstore](/docs/how_to/time_weighted_vectorstore/) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
| [Multi-Query Retriever](/docs/how_to/MultiQueryRetriever/) | Any | Yes | If users are asking questions that are complex and require multiple pieces of distinct information to respond | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them. |
| [Ensemble](/docs/how_to/ensemble_retriever/) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
| [Re-ranking](/docs/integrations/retrievers/cohere-reranker/) | Any | Yes | If you want to rank retrieved documents based upon relevance, especially if you want to combine results from multiple retrieval methods . | Given a query and a list of documents, Rerank indexes the documents from most to least semantically relevant to the query. |
:::tip
See our RAG from Scratch video on [RAG-Fusion](https://youtu.be/77qELPbNgxA?feature=shared), on approach for post-processing across multiple queries: Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, and combine the ranks of multiple search result lists to produce a single, unified ranking with [Reciprocal Rank Fusion (RRF)](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1).
:::
#### Generation
**Finally, consider ways to build self-correction into your RAG system.** RAG systems can suffer from low quality retrieval (e.g., if a user question is out of the domain for the index) and / or hallucinations in generation. A naive retrieve-generate pipeline has no ability to detect or self-correct from these kinds of errors. The concept of ["flow engineering"](https://x.com/karpathy/status/1748043513156272416) has been introduced [in the context of code generation](https://arxiv.org/abs/2401.08500): iteratively build an answer to a code question with unit tests to check and self-correct errors. Several works have applied this RAG, such as Self-RAG and Corrective-RAG. In both cases, checks for document relevance, hallucinations, and / or answer quality are performed in the RAG answer generation flow.
For a high-level guide on retrieval, see this [tutorial on RAG](/docs/tutorials/rag/).
We've found that graphs are a great way to reliably express logical flows and have implemented ideas from several of these papers [using LangGraph](https://github.com/langchain-ai/langgraph/tree/main/examples/rag), as shown in the figure below (red - routing, blue - fallback, green - self-correction):
- **Routing:** Adaptive RAG ([paper](https://arxiv.org/abs/2403.14403)). Route questions to different retrieval approaches, as discussed above
- **Fallback:** Corrective RAG ([paper](https://arxiv.org/pdf/2401.15884.pdf)). Fallback to web search if docs are not relevant to query
- **Self-correction:** Self-RAG ([paper](https://arxiv.org/abs/2310.11511)). Fix answers w/ hallucinations or dont address question
![](/img/langgraph_rag.png)
| Name | When to use | Description |
|-------------------|-----------------------------------------------------------|-------------|
| Self-RAG | When needing to fix answers with hallucinations or irrelevant content. | Self-RAG performs checks for document relevance, hallucinations, and answer quality during the RAG answer generation flow, iteratively building an answer and self-correcting errors. |
| Corrective-RAG | When needing a fallback mechanism for low relevance docs. | Corrective-RAG includes a fallback (e.g., to web search) if the retrieved documents are not relevant to the query, ensuring higher quality and more relevant retrieval. |
:::tip
See several videos and cookbooks showcasing RAG with LangGraph:
- [LangGraph Corrective RAG](https://www.youtube.com/watch?v=E2shqsYwxck)
- [LangGraph combining Adaptive, Self-RAG, and Corrective RAG](https://www.youtube.com/watch?v=-ROS6gfYIts)
- [Cookbooks for RAG using LangGraph ](https://github.com/langchain-ai/langgraph/tree/main/examples/rag)
See our LangGraph RAG recipes with partners:
- [Meta](https://github.com/meta-llama/llama-recipes/tree/main/recipes/use_cases/agents/langchain)
- [Mistral](https://github.com/mistralai/cookbook/tree/main/third_party/langchain)
:::
### Text splitting

Binary file not shown.

After

Width:  |  Height:  |  Size: 167 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 126 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 605 KiB

Loading…
Cancel
Save