Description: This pull request aims to support generating the correct
generic relevancy scores for different vector stores by refactoring the
relevance score functions and their selection in the base class and
subclasses of VectorStore. This is especially relevant with VectorStores
that require a distance metric upon initialization. Note many of the
current implenetations of `_similarity_search_with_relevance_scores` are
not technically correct, as they just return
`self.similarity_search_with_score(query, k, **kwargs)` without applying
the relevant score function
Also includes changes associated with:
https://github.com/hwchase17/langchain/pull/6564 and
https://github.com/hwchase17/langchain/pull/6494
See more indepth discussion in thread in #6494
Issue:
https://github.com/hwchase17/langchain/issues/6526https://github.com/hwchase17/langchain/issues/6481https://github.com/hwchase17/langchain/issues/6346
Dependencies: None
The changes include:
- Properly handling score thresholding in FAISS
`similarity_search_with_score_by_vector` for the corresponding distance
metric.
- Refactoring the `_similarity_search_with_relevance_scores` method in
the base class and removing it from the subclasses for incorrectly
implemented subclasses.
- Adding a `_select_relevance_score_fn` method in the base class and
implementing it in the subclasses to select the appropriate relevance
score function based on the distance strategy.
- Updating the `__init__` methods of the subclasses to set the
`relevance_score_fn` attribute.
- Removing the `_default_relevance_score_fn` function from the FAISS
class and using the base class's `_euclidean_relevance_score_fn`
instead.
- Adding the `DistanceStrategy` enum to the `utils.py` file and updating
the imports in the vector store classes.
- Updating the tests to import the `DistanceStrategy` enum from the
`utils.py` file.
---------
Co-authored-by: Hanit <37485638+hanit-com@users.noreply.github.com>
## Description
The type hint for `FAISS.__init__()`'s `relevance_score_fn` parameter
allowed the parameter to be set to `None`. However, a default function
is provided by the constructor. This led to an unnecessary check in the
code, as well as a test to verify this check.
**ASSUMPTION**: There's no reason to ever set `relevance_score_fn` to
`None`.
This PR changes the type hint and removes the unnecessary code.
### Adding the functionality to return the scores with retrieved
documents when using the max marginal relevance
- Description: Add the method
`max_marginal_relevance_search_with_score_by_vector` to the FAISS
wrapper. Functionality operates the same as
`similarity_search_with_score_by_vector` except for using the max
marginal relevance retrieval framework like is used in the
`max_marginal_relevance_search_by_vector` method.
- Dependencies: None
- Tag maintainer: @rlancemartin @eyurtsev
- Twitter handle: @RianDolphin
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
### Feature
Using FAISS on a retrievalQA task, I found myself wanting to allow in
multiple sources. From what I understood, the filter feature takes in a
dict of form {key: value} which then will check in the metadata for the
exact value linked to that key.
I added some logic to be able to pass a list which will be checked
against instead of an exact value. Passing an exact value will also
work.
Here's an example of how I could then use it in my own project:
```
pdfs_to_filter_in = ["file_A", "file_B"]
filter_dict = {
"source": [f"source_pdfs/{pdf_name}.pdf" for pdf_name in pdfs_to_filter_in]
}
retriever = db.as_retriever()
retriever.search_kwargs = {"filter": filter_dict}
```
I added an integration test based on the other ones I found in
`tests/integration_tests/vectorstores/test_faiss.py` under
`test_faiss_with_metadatas_and_list_filter()`.
It doesn't feel like this is worthy of its own notebook or doc, but I'm
open to suggestions if needed.
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Fixes#5807 (issue)
#### Who can review?
Tag maintainers/contributors who might be interested: @dev2049
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @hwchase17
VectorStores / Retrievers / Memory
- @dev2049
-->
Inspired by the filtering capability available in ChromaDB, added the
same functionality to the FAISS vectorestore as well. Since FAISS does
not have an inbuilt method of filtering used the approach suggested in
this [thread](https://github.com/facebookresearch/faiss/issues/1079)
Langchain Issue inspiration:
https://github.com/hwchase17/langchain/issues/4572
- [x] Added filtering capability to semantic similarly and MMR
- [x] Added test cases for filtering in
`tests/integration_tests/vectorstores/test_faiss.py`
#### Who can review?
Tag maintainers/contributors who might be interested:
VectorStores / Retrievers / Memory
- @dev2049
- @hwchase17
Fixes # 5807
Realigned tests with implementation.
Also reinforced folder unicity for the test_faiss_local_save_load test
using date-time suffix
#### Before submitting
- Integration test updated
- formatting and linting ok (locally)
#### Who can review?
Tag maintainers/contributors who might be interested:
@hwchase17 - project lead
VectorStores / Retrievers / Memory
-@dev2049
Add a method that exposes a similarity search with corresponding
normalized similarity scores. Implement only for FAISS now.
### Motivation:
Some memory definitions combine `relevance` with other scores, like
recency , importance, etc.
While many (but not all) of the `VectorStore`'s expose a
`similarity_search_with_score` method, they don't all interpret the
units of that score (depends on the distance metric and whether or not
the the embeddings are normalized).
This PR proposes a `similarity_search_with_normalized_similarities`
method that lets consumers of the vector store not have to worry about
the metric and embedding scale.
*Most providers default to euclidean distance, with Pinecone being one
exception (defaults to cosine _similarity_).*
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Alternate implementation to PR #960 Again - only FAISS is implemented.
If accepted can add this to other vectorstores or leave as
NotImplemented? Suggestions welcome...
Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
Signed-off-by: Frank Liu <frank.liu@zilliz.com>
Co-authored-by: Filip Haltmayer <81822489+filip-halt@users.noreply.github.com>
Co-authored-by: Frank Liu <frank@frankzliu.com>
- This uses the faiss built-in `write_index` and `load_index` to save
and load faiss indexes locally
- Also fixes#674
- The save/load functions also use the faiss library, so I refactored
the dependency into a function
this will break atm but wanted to get thoughts on implementation.
1. should add() be on docstore interface?
2. should InMemoryDocstore change to take a list of documents as init?
(makes this slightly easier to implement in FAISS -- if we think it is
less clean then could expose a method to get the number of documents
currently in the dict, and perform the logic of creating the necessary
dictionary in the FAISS.add_texts method.
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>