langchain/docs/extras/integrations/retrievers/merger_retriever.ipynb

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "fc0db1bc",
"metadata": {},
"source": [
"# LOTR (Merger Retriever)\n",
"\n",
"`Lord of the Retrievers`, also known as `MergerRetriever`, takes a list of retrievers as input and merges the results of their get_relevant_documents() methods into a single list. The merged results will be a list of documents that are relevant to the query and that have been ranked by the different retrievers.\n",
"\n",
"The `MergerRetriever` class can be used to improve the accuracy of document retrieval in a number of ways. First, it can combine the results of multiple retrievers, which can help to reduce the risk of bias in the results. Second, it can rank the results of the different retrievers, which can help to ensure that the most relevant documents are returned first."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fbcc58f",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import chromadb\n",
"from langchain.retrievers.merger_retriever import MergerRetriever\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.document_transformers import (\n",
" EmbeddingsRedundantFilter,\n",
" EmbeddingsClusteringFilter,\n",
")\n",
"from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
"from langchain.retrievers import ContextualCompressionRetriever\n",
"\n",
"# Get 3 diff embeddings.\n",
"all_mini = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
"multi_qa_mini = HuggingFaceEmbeddings(model_name=\"multi-qa-MiniLM-L6-dot-v1\")\n",
"filter_embeddings = OpenAIEmbeddings()\n",
"\n",
"ABS_PATH = os.path.dirname(os.path.abspath(__file__))\n",
"DB_DIR = os.path.join(ABS_PATH, \"db\")\n",
"\n",
"# Instantiate 2 diff cromadb indexs, each one with a diff embedding.\n",
"client_settings = chromadb.config.Settings(\n",
" is_persistent=True,\n",
" persist_directory=DB_DIR,\n",
" anonymized_telemetry=False,\n",
")\n",
"db_all = Chroma(\n",
" collection_name=\"project_store_all\",\n",
" persist_directory=DB_DIR,\n",
" client_settings=client_settings,\n",
" embedding_function=all_mini,\n",
")\n",
"db_multi_qa = Chroma(\n",
" collection_name=\"project_store_multi\",\n",
" persist_directory=DB_DIR,\n",
" client_settings=client_settings,\n",
" embedding_function=multi_qa_mini,\n",
")\n",
"\n",
"# Define 2 diff retrievers with 2 diff embeddings and diff search type.\n",
"retriever_all = db_all.as_retriever(\n",
" search_type=\"similarity\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
")\n",
"retriever_multi_qa = db_multi_qa.as_retriever(\n",
" search_type=\"mmr\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
")\n",
"\n",
"# The Lord of the Retrievers will hold the ouput of boths retrievers and can be used as any other\n",
"# retriever on different types of chains.\n",
"lotr = MergerRetriever(retrievers=[retriever_all, retriever_multi_qa])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c152339d",
"metadata": {},
"source": [
"## Remove redundant results from the merged retrievers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "039faea6",
"metadata": {},
"outputs": [],
"source": [
"# We can remove redundant results from both retrievers using yet another embedding.\n",
"# Using multiples embeddings in diff steps could help reduce biases.\n",
"filter = EmbeddingsRedundantFilter(embeddings=filter_embeddings)\n",
"pipeline = DocumentCompressorPipeline(transformers=[filter])\n",
"compression_retriever = ContextualCompressionRetriever(\n",
" base_compressor=pipeline, base_retriever=lotr\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c10022fa",
"metadata": {},
"source": [
"## Pick a representative sample of documents from the merged retrievers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3885482",
"metadata": {},
"outputs": [],
"source": [
"# This filter will divide the documents vectors into clusters or \"centers\" of meaning.\n",
"# Then it will pick the closest document to that center for the final results.\n",
"# By default the result document will be ordered/grouped by clusters.\n",
"filter_ordered_cluster = EmbeddingsClusteringFilter(\n",
" embeddings=filter_embeddings,\n",
" num_clusters=10,\n",
" num_closest=1,\n",
")\n",
"\n",
"# If you want the final document to be ordered by the original retriever scores\n",
"# you need to add the \"sorted\" parameter.\n",
"filter_ordered_by_retriever = EmbeddingsClusteringFilter(\n",
" embeddings=filter_embeddings,\n",
" num_clusters=10,\n",
" num_closest=1,\n",
" sorted=True,\n",
")\n",
"\n",
"pipeline = DocumentCompressorPipeline(transformers=[filter_ordered_by_retriever])\n",
"compression_retriever = ContextualCompressionRetriever(\n",
" base_compressor=pipeline, base_retriever=lotr\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8f68956e",
"metadata": {},
"source": [
"## Re-order results to avoid performance degradation.\n",
"No matter the architecture of your model, there is a sustancial performance degradation when you include 10+ retrieved documents.\n",
"In brief: When models must access relevant information in the middle of long contexts, then tend to ignore the provided documents.\n",
"See: https://arxiv.org/abs//2307.03172"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "007283f3",
"metadata": {},
"outputs": [],
"source": [
"# You can use an additional document transformer to reorder documents after removing redudance.\n",
"from langchain.document_transformers import LongContextReorder\n",
"\n",
"filter = EmbeddingsRedundantFilter(embeddings=filter_embeddings)\n",
"reordering = LongContextReorder()\n",
"pipeline = DocumentCompressorPipeline(transformers=[filter, reordering])\n",
"compression_retriever_reordered = ContextualCompressionRetriever(\n",
" base_compressor=pipeline, base_retriever=lotr\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}