mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
f1eaa9b626
Motivation: it seems that when dealing with a long context and a large number of relevant documents, we should avoid using the out-of-the-box score ordering from vector stores. See: https://arxiv.org/pdf/2306.01150.pdf

So I added an additional parameter that lets you reorder the retrieved documents to work around this performance degradation. The reordering respects the original search score but places the least relevant documents in the middle of the context.

Extract from the paper (one image speaks 1000 tokens): ![image](https://github.com/hwchase17/langchain/assets/1821407/fafe4843-6e18-4fa6-9416-50cc1d32e811)

This seems to be common to all the different architectures, so I think we need a good generic way to implement this reordering and to run some tests against our existing retrievers. My approach may not be the best one from an architecture point of view, and I am happy to discuss that. For me this was the best place to introduce the change and start retesting the different implementations.

@rlancemartin, @eyurtsev

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
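For illustration only, here is a minimal sketch of the kind of reordering described above. It assumes the input list is already sorted from most to least relevant, as a vector store would return it. The function name `reorder_least_relevant_in_middle` is hypothetical and this is not necessarily how `LongContextReorder` is implemented. It only shows the idea of pushing low-scoring items toward the middle while keeping the top-scoring ones at both ends.

```python
from collections import deque
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_least_relevant_in_middle(ranked: Sequence[T]) -> List[T]:
    """Hypothetical helper: `ranked` is assumed sorted most-relevant first.

    Walk from the least relevant item to the most relevant one, placing
    items alternately at the left and right end of the result. Items
    placed first (the least relevant) end up in the middle, and items
    placed last (the most relevant) end up at the two edges.
    """
    result: deque = deque()
    for i, item in enumerate(reversed(ranked)):
        if i % 2 == 1:
            result.append(item)      # odd step: place on the right edge
        else:
            result.appendleft(item)  # even step: place on the left edge
    return list(result)


# Ranks 1..10, where 1 is the most relevant document.
reordered = reorder_least_relevant_in_middle(list(range(1, 11)))
print(reordered)
```

With ten ranked items, the four most relevant ones land in the first two and last two positions, which is the property the integration test in this commit asserts on.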
36 lines
1.5 KiB
Python
"""Integration test for doc reordering."""
from langchain.document_transformers.long_context_reorder import LongContextReorder
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma


def test_long_context_reorder() -> None:
    """Test Lost in the middle reordering get_relevant_docs."""
    texts = [
        "Basquetball is a great sport.",
        "Fly me to the moon is one of my favourite songs.",
        "The Celtics are my favourite team.",
        "This is a document about the Boston Celtics",
        "I simply love going to the movies",
        "The Boston Celtics won the game by 20 points",
        "This is just a random text.",
        "Elden Ring is one of the best games in the last 15 years.",
        "L. Kornet is one of the best Celtics players.",
        "Larry Bird was an iconic NBA player.",
    ]
    embeddings = OpenAIEmbeddings()
    retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
        search_kwargs={"k": 10}
    )
    reordering = LongContextReorder()
    docs = retriever.get_relevant_documents("Tell me about the Celtics")
    actual = reordering.transform_documents(docs)

    # First 2 and Last 2 elements must contain the most relevant
    first_and_last = list(actual[:2]) + list(actual[-2:])
    assert len(actual) == 10
    assert texts[2] in [d.page_content for d in first_and_last]
    assert texts[3] in [d.page_content for d in first_and_last]
    assert texts[5] in [d.page_content for d in first_and_last]
    assert texts[8] in [d.page_content for d in first_and_last]