# langchain/libs/community/tests/integration_tests/test_document_transformers.py

"""Integration test for embedding-based redundant doc filtering."""

from langchain_core.documents import Document

from langchain_community.document_transformers.embeddings_redundant_filter import (
    EmbeddingsClusteringFilter,
    EmbeddingsRedundantFilter,
    _DocumentWithState,
)
from langchain_community.embeddings import OpenAIEmbeddings


def test_embeddings_redundant_filter() -> None:
    texts = [
        "What happened to all of my cookies?",
        "Where did all of my cookies go?",
        "I wish there were better Italian restaurants in my neighborhood.",
    ]
    docs = [Document(page_content=t) for t in texts]
    embeddings = OpenAIEmbeddings()
    redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
    actual = redundant_filter.transform_documents(docs)
    # The two cookie questions are near-duplicates, so only one of them survives.
    assert len(actual) == 2
    assert set(texts[:2]).intersection([d.page_content for d in actual])


def test_embeddings_redundant_filter_with_state() -> None:
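    texts = ["What happened to all of my cookies?", "foo bar baz"]
    # Both documents share one precomputed embedding, so the filter treats them
    # as duplicates regardless of their text.
    state = {"embedded_doc": [0.5] * 10}
    docs = [_DocumentWithState(page_content=t, state=state) for t in texts]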
    embeddings = OpenAIEmbeddings()
    redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
    actual = redundant_filter.transform_documents(docs)
    assert len(actual) == 1


def test_embeddings_clustering_filter() -> None:
    texts = [
        "What happened to all of my cookies?",
        "A cookie is a small, baked sweet treat and you can find it in the cookie",
        "monsters' jar.",
        "Cookies are good.",
        "I have nightmares about the cookie monster.",
        "The most popular pizza styles are: Neapolitan, New York-style and",
        "Chicago-style. You can find them on iconic restaurants in major cities.",
        "Neapolitan pizza: This is the original pizza style,hailing from Naples,",
        "Italy.",
        "I wish there were better Italian Pizza restaurants in my neighborhood.",
        "New York-style pizza: This is characterized by its large, thin crust, and",
        "generous toppings.",
        "The first movie to feature a robot was 'A Trip to the Moon' (1902).",
        "The first movie to feature a robot that could pass for a human was",
        "'Blade Runner' (1982)",
        "The first movie to feature a robot that could fall in love with a human",
        "was 'Her' (2013)",
        "A robot is a machine capable of carrying out complex actions automatically.",
        "There are certainly hundreds, if not thousands movies about robots like:",
        "'Blade Runner', 'Her' and 'A Trip to the Moon'",
    ]

    docs = [Document(page_content=t) for t in texts]
    embeddings = OpenAIEmbeddings()
    clustering_filter = EmbeddingsClusteringFilter(
        embeddings=embeddings,
        num_clusters=3,
        num_closest=1,
        sorted=True,
    )
    actual = clustering_filter.transform_documents(docs)
    # One document is kept per cluster (num_clusters=3, num_closest=1).
    assert len(actual) == 3
    # Indices refer to the entries of `texts` exactly as listed above: every
    # comma-separated string literal is its own document.
    contents = [d.page_content for d in actual]
    assert texts[1] in contents
    assert texts[4] in contents
    assert texts[11] in contents