From 8cbce684d4ec861cfd45edc4585365db81b93afd Mon Sep 17 00:00:00 2001 From: ccurme Date: Fri, 31 May 2024 10:57:35 -0400 Subject: [PATCH] docs: update retriever how-to content (#22362) - [x] How to: use a vector store to retrieve data - [ ] How to: generate multiple queries to retrieve data for - [x] How to: use contextual compression to compress the data retrieved - [x] How to: write a custom retriever class - [x] How to: add similarity scores to retriever results ^ done last month - [x] How to: combine the results from multiple retrievers - [x] How to: reorder retrieved results to mitigate the "lost in the middle" effect - [x] How to: generate multiple embeddings per document ^ this PR - [ ] How to: retrieve the whole document for a chunk - [ ] How to: generate metadata filters - [ ] How to: create a time-weighted retriever - [ ] How to: use hybrid vector and keyword retrieval ^ todo --- docs/docs/how_to/ensemble_retriever.ipynb | 80 ++-- docs/docs/how_to/index.mdx | 2 +- docs/docs/how_to/long_context_reorder.ipynb | 138 ++++--- docs/docs/how_to/multi_vector.ipynb | 408 ++++++++++---------- 4 files changed, 323 insertions(+), 305 deletions(-) diff --git a/docs/docs/how_to/ensemble_retriever.ipynb b/docs/docs/how_to/ensemble_retriever.ipynb index 015c3146e5..80b0d50548 100644 --- a/docs/docs/how_to/ensemble_retriever.ipynb +++ b/docs/docs/how_to/ensemble_retriever.ipynb @@ -4,13 +4,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# How to create an Ensemble Retriever\n", + "# How to combine results from multiple retrievers\n", "\n", - "The `EnsembleRetriever` takes a list of retrievers as input and ensemble the results of their `get_relevant_documents()` methods and rerank the results based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n", + "The [EnsembleRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.ensemble.EnsembleRetriever.html) supports ensembling of results from multiple retrievers. It is initialized with a list of [BaseRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_core.retrievers.BaseRetriever.html) objects. EnsembleRetrievers rerank the results of the constituent retrievers based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n", "\n", "By leveraging the strengths of different algorithms, the `EnsembleRetriever` can achieve better performance than any single algorithm. \n", "\n", - "The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\". The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity." + "The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\". 
The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity.\n", + "\n", + "## Basic usage\n", + "\n", + "Below we demonstrate ensembling of a [BM25Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.bm25.BM25Retriever.html) with a retriever derived from the [FAISS vector store](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html)." ] }, { @@ -24,22 +28,15 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from langchain.retrievers import EnsembleRetriever\n", "from langchain_community.retrievers import BM25Retriever\n", "from langchain_community.vectorstores import FAISS\n", - "from langchain_openai import OpenAIEmbeddings" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ + "from langchain_openai import OpenAIEmbeddings\n", + "\n", "doc_list_1 = [\n", " \"I like apples\",\n", " \"I like oranges\",\n", @@ -71,19 +68,19 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='You like apples', metadata={'source': 2}),\n", - " Document(page_content='I like apples', metadata={'source': 1}),\n", - " Document(page_content='You like oranges', metadata={'source': 2}),\n", - " Document(page_content='Apples and oranges are fruits', metadata={'source': 1})]" + "[Document(page_content='I like apples', metadata={'source': 1}),\n", + " Document(page_content='You like apples', metadata={'source': 2}),\n", + " Document(page_content='Apples and oranges are fruits', metadata={'source': 1}),\n", + " Document(page_content='You like oranges', metadata={'source': 2})]" ] }, - "execution_count": 15, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -99,24 +96,17 @@ "source": [ "## Runtime Configuration\n", "\n", - "We can also configure the retrievers at runtime. In order to do this, we need to mark the fields as configurable" + "We can also configure the individual retrievers at runtime using [configurable fields](/docs/how_to/configure). 
Below we update the \"top-k\" parameter for the FAISS retriever specifically:" ] }, { "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_core.runnables import ConfigurableField" - ] - }, - { - "cell_type": "code", - "execution_count": 17, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ + "from langchain_core.runnables import ConfigurableField\n", + "\n", "faiss_retriever = faiss_vectorstore.as_retriever(\n", " search_kwargs={\"k\": 2}\n", ").configurable_fields(\n", @@ -125,15 +115,8 @@ " name=\"Search Kwargs\",\n", " description=\"The search kwargs to use\",\n", " )\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ + ")\n", + "\n", "ensemble_retriever = EnsembleRetriever(\n", " retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]\n", ")" @@ -141,9 +124,22 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='I like apples', metadata={'source': 1}),\n", + " Document(page_content='You like apples', metadata={'source': 2}),\n", + " Document(page_content='Apples and oranges are fruits', metadata={'source': 1})]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "config = {\"configurable\": {\"search_kwargs_faiss\": {\"k\": 1}}}\n", "docs = ensemble_retriever.invoke(\"apples\", config=config)\n", @@ -181,7 +177,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.1" + "version": "3.10.4" } }, "nbformat": 4, diff --git a/docs/docs/how_to/index.mdx b/docs/docs/how_to/index.mdx index 3e3637b6a4..448478a034 100644 --- a/docs/docs/how_to/index.mdx +++ b/docs/docs/how_to/index.mdx @@ -151,7 +151,7 @@ Retrievers are responsible for taking a query and returning relevant documents. 
- [How to: write a custom retriever class](/docs/how_to/custom_retriever) - [How to: add similarity scores to retriever results](/docs/how_to/add_scores_retriever) - [How to: combine the results from multiple retrievers](/docs/how_to/ensemble_retriever) -- [How to: reorder retrieved results to put most relevant documents not in the middle](/docs/how_to/long_context_reorder) +- [How to: reorder retrieved results to mitigate the "lost in the middle" effect](/docs/how_to/long_context_reorder) - [How to: generate multiple embeddings per document](/docs/how_to/multi_vector) - [How to: retrieve the whole document for a chunk](/docs/how_to/parent_document_retriever) - [How to: generate metadata filters](/docs/how_to/self_query) diff --git a/docs/docs/how_to/long_context_reorder.ipynb b/docs/docs/how_to/long_context_reorder.ipynb index f84fad93df..1d20708318 100644 --- a/docs/docs/how_to/long_context_reorder.ipynb +++ b/docs/docs/how_to/long_context_reorder.ipynb @@ -5,28 +5,38 @@ "id": "fc0db1bc", "metadata": {}, "source": [ - "# How to reorder retrieved results to put most relevant documents not in the middle\n", + "# How to reorder retrieved results to mitigate the \"lost in the middle\" effect\n", "\n", - "No matter the architecture of your model, there is a substantial performance degradation when you include 10+ retrieved documents.\n", - "In brief: When models must access relevant information in the middle of long contexts, they tend to ignore the provided documents.\n", - "See: https://arxiv.org/abs/2307.03172\n", + "Substantial performance degradations in [RAG](/docs/tutorials/rag) applications have been [documented](https://arxiv.org/abs/2307.03172) as the number of retrieved documents grows (e.g., beyond ten). In brief: models are liable to miss relevant information in the middle of long contexts.\n", "\n", - "To avoid this issue you can re-order documents after retrieval to avoid performance degradation." + "By contrast, queries against vector stores will typically return documents in descending order of relevance (e.g., as measured by cosine similarity of [embeddings](/docs/concepts/#embedding-models)).\n", + "\n", + "To mitigate the [\"lost in the middle\"](https://arxiv.org/abs/2307.03172) effect, you can re-order documents after retrieval such that the most relevant documents are positioned at extrema (e.g., the first and last pieces of context), and the least relevant documents are positioned in the middle. In some cases this can help surface the most relevant information to LLMs.\n", + "\n", + "The [LongContextReorder](https://api.python.langchain.com/en/latest/document_transformers/langchain_community.document_transformers.long_context_reorder.LongContextReorder.html) document transformer implements this re-ordering procedure. Below we demonstrate an example." ] }, { "cell_type": "code", "execution_count": null, - "id": "74d1ebe8", + "id": "2074fdaa-edff-468a-970f-6f5f26e93d4a", "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade --quiet sentence-transformers langchain-chroma langchain langchain-openai langchain-huggingface > /dev/null" ] }, + { + "cell_type": "markdown", + "id": "c97eaaf2-34b7-4770-9949-e1abc4ca5226", + "metadata": {}, + "source": [ + "First we embed some artificial documents and index them in an (in-memory) [Chroma](/docs/integrations/providers/chroma/) vector store. We will use [Hugging Face](/docs/integrations/text_embedding/huggingfacehub/) embeddings, but any LangChain vector store or embeddings model will suffice." 
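+    "\n",
+    "For example, the Hugging Face embeddings used below could be swapped for any other embeddings model. A minimal sketch of one alternative, assuming `langchain-openai` is installed and `OPENAI_API_KEY` is set:\n",
+    "\n",
+    "```python\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "# Any LangChain embeddings model can back the vector store;\n",
+    "# here we substitute OpenAI's default embedding model.\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "```"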
+ ] + }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "id": "49cbcd8e", "metadata": {}, "outputs": [ @@ -45,20 +55,14 @@ " Document(page_content='This is just a random text.')]" ] }, - "execution_count": 3, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "from langchain.chains import LLMChain, StuffDocumentsChain\n", "from langchain_chroma import Chroma\n", - "from langchain_community.document_transformers import (\n", - " LongContextReorder,\n", - ")\n", - "from langchain_core.prompts import PromptTemplate\n", "from langchain_huggingface import HuggingFaceEmbeddings\n", - "from langchain_openai import OpenAI\n", "\n", "# Get embeddings.\n", "embeddings = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n", @@ -83,14 +87,22 @@ "query = \"What can you tell me about the Celtics?\"\n", "\n", "# Get relevant documents ordered by relevance score\n", - "docs = retriever.get_relevant_documents(query)\n", + "docs = retriever.invoke(query)\n", "docs" ] }, + { + "cell_type": "markdown", + "id": "175d031a-43fa-42f4-93c4-2ba52c3c3ee5", + "metadata": {}, + "source": [ + "Note that documents are returned in descending order of relevance to the query. The `LongContextReorder` document transformer will implement the re-ordering described above:" + ] + }, { "cell_type": "code", - "execution_count": 4, - "id": "34fb9d6e", + "execution_count": 3, + "id": "9a1181f2-a3dc-4614-9233-2196ab65939e", "metadata": {}, "outputs": [ { @@ -108,12 +120,14 @@ " Document(page_content='This is a document about the Boston Celtics')]" ] }, - "execution_count": 4, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "from langchain_community.document_transformers import LongContextReorder\n", + "\n", "# Reorder the documents:\n", "# Less relevant document will be at the middle of the list and more\n", "# relevant elements at beginning / end.\n", @@ -125,58 +139,54 @@ ] }, { - "cell_type": "code", - "execution_count": 5, - "id": "ceccab87", + "cell_type": "markdown", + "id": "a8d2ef0c-c397-4d8d-8118-3f7acf86d241", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'\\n\\nThe Celtics are referenced in four of the nine text extracts. They are mentioned as the favorite team of the author, the winner of a basketball game, a team with one of the best players, and a team with a specific player. Additionally, the last extract states that the document is about the Boston Celtics. This suggests that the Celtics are a basketball team, possibly from Boston, that is well-known and has had successful players and games in the past. 
'" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "# We prepare and run a custom Stuff chain with reordered docs as context.\n", - "\n", - "# Override prompts\n", - "document_prompt = PromptTemplate(\n", - " input_variables=[\"page_content\"], template=\"{page_content}\"\n", - ")\n", - "document_variable_name = \"context\"\n", - "llm = OpenAI()\n", - "stuff_prompt_override = \"\"\"Given this text extracts:\n", - "-----\n", - "{context}\n", - "-----\n", - "Please answer the following question:\n", - "{query}\"\"\"\n", - "prompt = PromptTemplate(\n", - " template=stuff_prompt_override, input_variables=[\"context\", \"query\"]\n", - ")\n", - "\n", - "# Instantiate the chain\n", - "llm_chain = LLMChain(llm=llm, prompt=prompt)\n", - "chain = StuffDocumentsChain(\n", - " llm_chain=llm_chain,\n", - " document_prompt=document_prompt,\n", - " document_variable_name=document_variable_name,\n", - ")\n", - "chain.run(input_documents=reordered_docs, query=query)" + "Below, we show how to incorporate the re-ordered documents into a simple question-answering chain:" ] }, { "cell_type": "code", - "execution_count": null, - "id": "d4696a97", + "execution_count": 5, + "id": "8bbea705-d5b9-4ed5-9957-e12547283622", "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "The Celtics are a professional basketball team and one of the most iconic franchises in the NBA. They are highly regarded and have a large fan base. The team has had many successful seasons and is often considered one of the top teams in the league. They have a strong history and have produced many great players, such as Larry Bird and L. Kornet. The team is based in Boston and is often referred to as the Boston Celtics.\n" + ] + } + ], + "source": [ + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", + "from langchain_core.prompts import PromptTemplate\n", + "from langchain_openai import OpenAI\n", + "\n", + "llm = OpenAI()\n", + "\n", + "prompt_template = \"\"\"\n", + "Given these texts:\n", + "-----\n", + "{context}\n", + "-----\n", + "Please answer the following question:\n", + "{query}\n", + "\"\"\"\n", + "\n", + "prompt = PromptTemplate(\n", + " template=prompt_template,\n", + " input_variables=[\"context\", \"query\"],\n", + ")\n", + "\n", + "# Create and invoke the chain:\n", + "chain = create_stuff_documents_chain(llm, prompt)\n", + "response = chain.invoke({\"context\": reordered_docs, \"query\": query})\n", + "print(response)" + ] } ], "metadata": { @@ -195,7 +205,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.1" + "version": "3.10.4" } }, "nbformat": 4, diff --git a/docs/docs/how_to/multi_vector.ipynb b/docs/docs/how_to/multi_vector.ipynb index 34952b3074..f8733e8b61 100644 --- a/docs/docs/how_to/multi_vector.ipynb +++ b/docs/docs/how_to/multi_vector.ipynb @@ -5,33 +5,36 @@ "id": "d9172545", "metadata": {}, "source": [ - "# How to use the MultiVector Retriever\n", + "# How to retrieve using multiple vectors per document\n", "\n", - "It can often be beneficial to store multiple vectors per document. There are multiple use cases where this is beneficial. LangChain has a base `MultiVectorRetriever` which makes querying this type of setup easy. A lot of the complexity lies in how to create the multiple vectors per document. 
This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.\n", + "It can often be useful to store multiple vectors per document. There are multiple use cases where this is beneficial. For example, we can embed multiple chunks of a document and associate those embeddings with the parent document, allowing retriever hits on the chunks to return the larger document.\n", + "\n", + "LangChain implements a base [MultiVectorRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_vector.MultiVectorRetriever.html), which simplifies this process. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.\n", "\n", "The methods to create multiple vectors per document include:\n", "\n", - "- Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).\n", + "- Smaller chunks: split a document into smaller chunks, and embed those (this is [ParentDocumentRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html)).\n", "- Summary: create a summary for each document, embed that along with (or instead of) the document.\n", "- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.\n", "\n", + "Note that this also enables another method of adding embeddings - manually. This is useful because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control.\n", "\n", - "Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control." + "Below we walk through an example. First we instantiate some documents. We will index them in an (in-memory) [Chroma](/docs/integrations/providers/chroma/) vector store using [OpenAI](https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/) embeddings, but any LangChain vector store or embeddings model will suffice." 
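+    "\n",
+    "Before diving in, here is a minimal sketch of the manual method mentioned above (the document text, query, and collection name are hypothetical):\n",
+    "\n",
+    "```python\n",
+    "import uuid\n",
+    "\n",
+    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
+    "from langchain.storage import InMemoryByteStore\n",
+    "from langchain_chroma import Chroma\n",
+    "from langchain_core.documents import Document\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "retriever = MultiVectorRetriever(\n",
+    "    vectorstore=Chroma(\n",
+    "        collection_name=\"manual_queries\", embedding_function=OpenAIEmbeddings()\n",
+    "    ),\n",
+    "    byte_store=InMemoryByteStore(),\n",
+    "    id_key=\"doc_id\",\n",
+    ")\n",
+    "\n",
+    "doc = Document(page_content=\"A long document about retrievers.\")\n",
+    "doc_id = str(uuid.uuid4())\n",
+    "\n",
+    "# Embed a hand-written query and associate it with the parent document:\n",
+    "retriever.vectorstore.add_documents(\n",
+    "    [Document(page_content=\"How do I customize retrieval?\", metadata={\"doc_id\": doc_id})]\n",
+    ")\n",
+    "retriever.docstore.mset([(doc_id, doc)])\n",
+    "```"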
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09cecd95-3499-465a-895a-944627ffb77f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet langchain-chroma langchain langchain-openai > /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
-   "id": "eed469be",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain.retrievers.multi_vector import MultiVectorRetriever"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
   "id": "18c1421a",
   "metadata": {},
   "outputs": [],
@@ -40,25 +43,22 @@
    "from langchain_chroma import Chroma\n",
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_openai import OpenAIEmbeddings\n",
-    "from langchain_text_splitters import RecursiveCharacterTextSplitter"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "6d869496",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
+    "\n",
    "loaders = [\n",
-    "    TextLoader(\"../../paul_graham_essay.txt\"),\n",
+    "    TextLoader(\"paul_graham_essay.txt\"),\n",
    "    TextLoader(\"state_of_the_union.txt\"),\n",
    "]\n",
    "docs = []\n",
    "for loader in loaders:\n",
    "    docs.extend(loader.load())\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)\n",
-    "docs = text_splitter.split_documents(docs)"
+    "docs = text_splitter.split_documents(docs)\n",
+    "\n",
+    "# The vectorstore to use to index the child chunks\n",
+    "vectorstore = Chroma(\n",
+    "    collection_name=\"full_documents\", embedding_function=OpenAIEmbeddings()\n",
+    ")"
   ]
  },
 {
@@ -68,52 +68,54 @@
   "source": [
    "## Smaller chunks\n",
    "\n",
-    "Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood."
+    "Oftentimes it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. Note that this is what the [ParentDocumentRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html) does. Here we show what is going on under the hood.\n",
+    "\n",
+    "We will make a distinction between the vector store, which indexes embeddings of the (sub) documents, and the document store, which houses the \"parent\" documents and associates them with an identifier.\n",
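+    "\n",
+    "Below we use an in-memory byte store for the parent documents; a persistent byte store could be swapped in instead. A minimal sketch (the path is hypothetical and assumes write access):\n",
+    "\n",
+    "```python\n",
+    "from langchain.storage import LocalFileStore\n",
+    "\n",
+    "# Persists parent documents to disk instead of keeping them in memory\n",
+    "store = LocalFileStore(\"./parent_document_store\")\n",
+    "```"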
] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 2, "id": "0e7b6b45", "metadata": {}, "outputs": [], "source": [ - "# The vectorstore to use to index the child chunks\n", - "vectorstore = Chroma(\n", - " collection_name=\"full_documents\", embedding_function=OpenAIEmbeddings()\n", - ")\n", + "import uuid\n", + "\n", + "from langchain.retrievers.multi_vector import MultiVectorRetriever\n", + "\n", "# The storage layer for the parent documents\n", "store = InMemoryByteStore()\n", "id_key = \"doc_id\"\n", + "\n", "# The retriever (empty to start)\n", "retriever = MultiVectorRetriever(\n", " vectorstore=vectorstore,\n", " byte_store=store,\n", " id_key=id_key,\n", ")\n", - "import uuid\n", "\n", "doc_ids = [str(uuid.uuid4()) for _ in docs]" ] }, { - "cell_type": "code", - "execution_count": 5, - "id": "72a36491", + "cell_type": "markdown", + "id": "d4feded4-856a-4282-91c3-53aabc62e6ff", "metadata": {}, - "outputs": [], "source": [ - "# The splitter to use to create smaller chunks\n", - "child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)" + "We next generate the \"sub\" documents by splitting the original documents. Note that we store the document identifier in the `metadata` of the corresponding [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) object." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 3, "id": "5d23247d", "metadata": {}, "outputs": [], "source": [ + "# The splitter to use to create smaller chunks\n", + "child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)\n", + "\n", "sub_docs = []\n", "for i, doc in enumerate(docs):\n", " _id = doc_ids[i]\n", @@ -123,9 +125,17 @@ " sub_docs.extend(_sub_docs)" ] }, + { + "cell_type": "markdown", + "id": "8e0634f8-90d5-4250-981a-5257c8a6d455", + "metadata": {}, + "source": [ + "Finally, we index the documents in our vector store and document store:" + ] + }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 4, "id": "92ed5861", "metadata": {}, "outputs": [], @@ -134,31 +144,46 @@ "retriever.docstore.mset(list(zip(doc_ids, docs)))" ] }, + { + "cell_type": "markdown", + "id": "14c48c6d-850c-4317-9b6e-1ade92f2f710", + "metadata": {}, + "source": [ + "The vector store alone will retrieve small chunks:" + ] + }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 5, "id": "8afed60c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '2fd77862-9ed5-4fad-bf76-e487b747b333', 'source': 'state_of_the_union.txt'})" + "Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
\\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})" ] }, - "execution_count": 8, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# Vectorstore alone retrieves the small chunks\n", "retriever.vectorstore.similarity_search(\"justice breyer\")[0]" ] }, + { + "cell_type": "markdown", + "id": "717097c7-61d9-4306-8625-ef8f1940c127", + "metadata": {}, + "source": [ + "Whereas the retriever will return the larger parent document:" + ] + }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 6, "id": "3c9017f1", "metadata": {}, "outputs": [ @@ -168,14 +193,13 @@ "9875" ] }, - "execution_count": 9, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# Retriever returns larger chunks\n", - "len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)" + "len(retriever.invoke(\"justice breyer\")[0].page_content)" ] }, { @@ -183,12 +207,12 @@ "id": "cdef8339-f9fa-4b3b-955f-ad9dbdf2734f", "metadata": {}, "source": [ - "The default search type the retriever performs on the vector database is a similarity search. LangChain Vector Stores also support searching via [Max Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search) so if you want this instead you can just set the `search_type` property as follows:" + "The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via [Max Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search). This can be controlled via the `search_type` parameter of the retriever:" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 7, "id": "36739460-a737-4a8e-b70f-50bf8c8eaae7", "metadata": {}, "outputs": [ @@ -198,7 +222,7 @@ "9875" ] }, - "execution_count": 10, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -208,7 +232,7 @@ "\n", "retriever.search_type = SearchType.mmr\n", "\n", - "len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)" + "len(retriever.invoke(\"justice breyer\")[0].page_content)" ] }, { @@ -216,14 +240,37 @@ "id": "d6a7ae0d", "metadata": {}, "source": [ - "## Summary\n", + "## Associating summaries with a document for retrieval\n", "\n", - "Oftentimes a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those." + "A summary may be able to distill more accurately what a chunk is about, leading to better retrieval. 
Here we show how to create summaries, and then embed those.\n",
+    "\n",
+    "We construct a simple [chain](/docs/how_to/sequence) that will receive an input [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) object and generate a summary using an LLM.\n",
+    "\n",
+    "```{=mdx}\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs customVarName=\"llm\" />\n",
+    "```"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 8,
+   "id": "6589291f-55bb-4e9a-b4ff-08f2506ed641",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "llm = ChatOpenAI()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
   "id": "1433dff4",
   "metadata": {},
   "outputs": [],
@@ -233,27 +280,26 @@
    "from langchain_core.documents import Document\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
-    "from langchain_openai import ChatOpenAI"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "id": "35b30390",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "\n",
    "chain = (\n",
    "    {\"doc\": lambda x: x.page_content}\n",
    "    | ChatPromptTemplate.from_template(\"Summarize the following document:\\n\\n{doc}\")\n",
-    "    | ChatOpenAI(max_retries=0)\n",
+    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "3faa9fde-1b09-4849-a815-8b2e89c30a02",
+   "metadata": {},
+   "source": [
+    "Note that we can [batch](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable) the chain across documents:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 10,
   "id": "41a2a738",
   "metadata": {},
   "outputs": [],
   "source": [
    "summaries = chain.batch(docs, {\"max_concurrency\": 5})"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "73ef599e-140b-4905-8b62-6c52cdde1852",
+   "metadata": {},
+   "source": [
+    "We can then initialize a `MultiVectorRetriever` as before, indexing the summaries in our vector store, and retaining the original documents in our document store:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 11,
   "id": "7ac5e4b1",
   "metadata": {},
   "outputs": [],
@@ -279,29 +333,13 @@
    "    byte_store=store,\n",
    "    id_key=id_key,\n",
    ")\n",
-    "doc_ids = [str(uuid.uuid4()) for _ in docs]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "id": "0d93309f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "doc_ids = [str(uuid.uuid4()) for _ in docs]\n",
+    "\n",
    "summary_docs = [\n",
    "    Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
    "    for i, s in enumerate(summaries)\n",
-    "]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "id": "6d5edf0d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "]\n",
+    "\n",
    "retriever.vectorstore.add_documents(summary_docs)\n",
    "retriever.docstore.mset(list(zip(doc_ids, docs)))"
   ]
  },
@@ -320,50 +358,48 @@
    ]
   },
  {
-   "cell_type": "code",
-   "execution_count": 18,
-   "id": "299232d6",
+   "cell_type": "markdown",
+   "id": "f0274892-29c1-4616-9040-d23f9d537526",
   "metadata": {},
-   "outputs": [],
   "source": [
-    "sub_docs = vectorstore.similarity_search(\"justice breyer\")"
+    "Querying the vector store will return summaries:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 19,
-   "id": "10e404c0",
+   "execution_count": 12,
+   "id": "299232d6",
   "metadata": {},
   "outputs": [
{ "data": { "text/plain": [ - "Document(page_content=\"The document is a speech given by President Biden addressing various issues and outlining his agenda for the nation. He highlights the importance of nominating a Supreme Court justice and introduces his nominee, Judge Ketanji Brown Jackson. He emphasizes the need to secure the border and reform the immigration system, including providing a pathway to citizenship for Dreamers and essential workers. The President also discusses the protection of women's rights, including access to healthcare and the right to choose. He calls for the passage of the Equality Act to protect LGBTQ+ rights. Additionally, President Biden discusses the need to address the opioid epidemic, improve mental health services, support veterans, and fight against cancer. He expresses optimism for the future of America and the strength of the American people.\", metadata={'doc_id': '56345bff-3ead-418c-a4ff-dff203f77474'})" + "Document(page_content=\"President Biden recently nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court, emphasizing her qualifications and broad support. The President also outlined a plan to secure the border, fix the immigration system, protect women's rights, support LGBTQ+ Americans, and advance mental health services. He highlighted the importance of bipartisan unity in passing legislation, such as the Violence Against Women Act. The President also addressed supporting veterans, particularly those impacted by exposure to burn pits, and announced plans to expand benefits for veterans with respiratory cancers. Additionally, he proposed a plan to end cancer as we know it through the Cancer Moonshot initiative. President Biden expressed optimism about the future of America and emphasized the strength of the American people in overcoming challenges.\", metadata={'doc_id': '84015b1b-980e-400a-94d8-cf95d7e079bd'})" ] }, - "execution_count": 19, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "sub_docs = retriever.vectorstore.similarity_search(\"justice breyer\")\n", + "\n", "sub_docs[0]" ] }, { - "cell_type": "code", - "execution_count": 20, - "id": "e4cce5c2", + "cell_type": "markdown", + "id": "e4f77ac5-2926-4f60-aad5-b2067900dff9", "metadata": {}, - "outputs": [], "source": [ - "retrieved_docs = retriever.get_relevant_documents(\"justice breyer\")" + "Whereas the retriever will return the larger source document:" ] }, { "cell_type": "code", - "execution_count": 21, - "id": "c8570dbb", + "execution_count": 13, + "id": "e4cce5c2", "metadata": {}, "outputs": [ { @@ -372,12 +408,14 @@ "9194" ] }, - "execution_count": 21, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "retrieved_docs = retriever.invoke(\"justice breyer\")\n", + "\n", "len(retrieved_docs[0].page_content)" ] }, @@ -388,42 +426,28 @@ "source": [ "## Hypothetical Queries\n", "\n", - "An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded" + "An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document, which might bear close semantic similarity to relevant queries in a [RAG](/docs/tutorials/rag) application. 
These questions can then be embedded and associated with the documents to improve retrieval.\n",
+    "\n",
+    "Below, we use the [with_structured_output](/docs/how_to/structured_output/) method to structure the LLM output into a list of strings."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 22,
-   "id": "5219b085",
+   "execution_count": 16,
+   "id": "03d85234-c33a-4a43-861d-47328e1ec2ea",
   "metadata": {},
   "outputs": [],
   "source": [
-    "functions = [\n",
-    "    {\n",
-    "        \"name\": \"hypothetical_questions\",\n",
-    "        \"description\": \"Generate hypothetical questions\",\n",
-    "        \"parameters\": {\n",
-    "            \"type\": \"object\",\n",
-    "            \"properties\": {\n",
-    "                \"questions\": {\n",
-    "                    \"type\": \"array\",\n",
-    "                    \"items\": {\"type\": \"string\"},\n",
-    "                },\n",
-    "            },\n",
-    "            \"required\": [\"questions\"],\n",
-    "        },\n",
-    "    }\n",
-    "]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 23,
-   "id": "523deb92",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain_core.output_parsers.openai_functions import JsonKeyOutputFunctionsParser\n",
+    "from typing import List\n",
+    "\n",
+    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
+    "\n",
+    "\n",
+    "class HypotheticalQuestions(BaseModel):\n",
+    "    \"\"\"Generate hypothetical questions.\"\"\"\n",
+    "\n",
+    "    questions: List[str] = Field(..., description=\"List of questions\")\n",
+    "\n",
    "\n",
    "chain = (\n",
    "    {\"doc\": lambda x: x.page_content}\n",
@@ -431,28 +455,36 @@
    "    | ChatPromptTemplate.from_template(\n",
    "        \"Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\\n\\n{doc}\"\n",
    "    )\n",
-    "    | ChatOpenAI(max_retries=0, model=\"gpt-4\").bind(\n",
-    "        functions=functions, function_call={\"name\": \"hypothetical_questions\"}\n",
+    "    | ChatOpenAI(max_retries=0, model=\"gpt-4o\").with_structured_output(\n",
+    "        HypotheticalQuestions\n",
    "    )\n",
-    "    | JsonKeyOutputFunctionsParser(key_name=\"questions\")\n",
+    "    | (lambda x: x.questions)\n",
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "6dddc40f-62af-413c-b944-f94a5e1f2f4e",
+   "metadata": {},
+   "source": [
+    "Invoking the chain on a single document demonstrates that it outputs a list of questions:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 17,
   "id": "11d30554",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "[\"What was the author's first experience with programming like?\",\n",
-       " 'Why did the author switch their focus from AI to Lisp during their graduate studies?',\n",
-       " 'What led the author to contemplate a career in art instead of computer science?']"
+       "[\"What impact did the IBM 1401 have on the author's early programming experiences?\",\n",
+       " \"How did the transition from using the IBM 1401 to microcomputers influence the author's programming journey?\",\n",
+       " \"What role did Lisp play in shaping the author's understanding and approach to AI?\"]"
      ]
     },
-     "execution_count": 24,
+     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.invoke(docs[0])"
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": 25,
-   "id": "3eb2e48c",
+   "cell_type": "markdown",
+   "id": "dcffc572-7b20-4b77-857a-90ec360a8f7e",
   "metadata": {},
-   "outputs": [],
   "source": [
-    "hypothetical_questions = chain.batch(docs, {\"max_concurrency\": 5})"
+    "We can then batch the chain over all documents and assemble our vector store and document store as before:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 18,
   "id": "b2cd6e75",
   "metadata": {},
"outputs": [], "source": [ + "# Batch chain over documents to generate hypothetical questions\n", + "hypothetical_questions = chain.batch(docs, {\"max_concurrency\": 5})\n", + "\n", + "\n", "# The vectorstore to use to index the child chunks\n", "vectorstore = Chroma(\n", " collection_name=\"hypo-questions\", embedding_function=OpenAIEmbeddings()\n", @@ -491,82 +525,67 @@ " byte_store=store,\n", " id_key=id_key,\n", ")\n", - "doc_ids = [str(uuid.uuid4()) for _ in docs]" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "18831b3b", - "metadata": {}, - "outputs": [], - "source": [ + "doc_ids = [str(uuid.uuid4()) for _ in docs]\n", + "\n", + "\n", + "# Generate Document objects from hypothetical questions\n", "question_docs = []\n", "for i, question_list in enumerate(hypothetical_questions):\n", " question_docs.extend(\n", " [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "224b24c5", - "metadata": {}, - "outputs": [], - "source": [ + " )\n", + "\n", + "\n", "retriever.vectorstore.add_documents(question_docs)\n", "retriever.docstore.mset(list(zip(doc_ids, docs)))" ] }, { - "cell_type": "code", - "execution_count": 29, - "id": "7b442b90", + "cell_type": "markdown", + "id": "75cba8ab-a06f-4545-85fc-cf49d0204b5e", "metadata": {}, - "outputs": [], "source": [ - "sub_docs = vectorstore.similarity_search(\"justice breyer\")" + "Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query:" ] }, { "cell_type": "code", - "execution_count": 30, - "id": "089b5ad0", + "execution_count": 19, + "id": "7b442b90", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='Who has been nominated to serve on the United States Supreme Court?', metadata={'doc_id': '0b3a349e-c936-4e77-9c40-0a39fc3e07f0'}),\n", - " Document(page_content=\"What was the context and content of Robert Morris' advice to the document's author in 2010?\", metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),\n", - " Document(page_content='How did personal circumstances influence the decision to pass on the leadership of Y Combinator?', metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),\n", - " Document(page_content='What were the reasons for the author leaving Yahoo in the summer of 1999?', metadata={'doc_id': 'ce4f4981-ca60-4f56-86f0-89466de62325'})]" + "[Document(page_content='What might be the potential benefits of nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to the United States Supreme Court?', metadata={'doc_id': '43292b74-d1b8-4200-8a8b-ea0cb57fbcdb'}),\n", + " Document(page_content='How might the Bipartisan Infrastructure Law impact the economic competition between the U.S. 
and China?', metadata={'doc_id': '66174780-d00c-4166-9791-f0069846e734'}),\n", + " Document(page_content='What factors led to the creation of Y Combinator?', metadata={'doc_id': '72003c4e-4cc9-4f09-a787-0b541a65b38c'}),\n", + " Document(page_content='How did the ability to publish essays online change the landscape for writers and thinkers?', metadata={'doc_id': 'e8d2c648-f245-4bcc-b8d3-14e64a164b64'})]" ] }, - "execution_count": 30, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "sub_docs = retriever.vectorstore.similarity_search(\"justice breyer\")\n", + "\n", "sub_docs" ] }, { - "cell_type": "code", - "execution_count": 31, - "id": "7594b24e", + "cell_type": "markdown", + "id": "63c32e43-5f4a-463b-a0c2-2101986f70e6", "metadata": {}, - "outputs": [], "source": [ - "retrieved_docs = retriever.get_relevant_documents(\"justice breyer\")" + "And invoking the retriever will return the corresponding document:" ] }, { "cell_type": "code", - "execution_count": 32, - "id": "4c120c65", + "execution_count": 20, + "id": "7594b24e", "metadata": {}, "outputs": [ { @@ -575,22 +594,15 @@ "9194" ] }, - "execution_count": 32, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "retrieved_docs = retriever.invoke(\"justice breyer\")\n", "len(retrieved_docs[0].page_content)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "005072b8", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { @@ -609,7 +621,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.1" + "version": "3.10.4" } }, "nbformat": 4,