Commit Graph

123 Commits

Author SHA1 Message Date
Lance Martin
7a00f17033
Web research retriever (#8102)
Given a user question, this will -
* Use LLM to generate a set of queries.
* Query for each.
* The URLs from search results are stored in self.urls.
* A check is performed for any new URLs that haven't been processed yet
(not in self.url_database).
* Only these new URLs are loaded, transformed, and added to the
vectorstore.
* The vectorstore is queried for relevant documents based on the
questions generated by the LLM.
* Only unique documents are returned as the final result.

This code will avoid reprocessing of URLs across multiple runs of
similar queries, which should improve the performance of the retriever.
It also keeps track of all URLs that have been processed, which could be
useful for debugging or understanding the retriever's behavior.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-07-25 19:58:00 -07:00
William FH
0a16b3d84b
Update Integrations links (#8206) 2023-07-24 21:20:32 -07:00
Dayuan Jiang
125ae6d9de
add Hybrid retriever that not require any external service (#8108)
- Until now, hybrid search was limited to modules requiring external
services, such as Weaviate/Pinecone Hybrid Search. However, I have
developed a hybrid retriever that can merge a list of retrievers using
the [Reciprocal Rank
Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
algorithm. This new approach, similar to Weaviate hybrid search, does
not require the initialization of any external service.
  - Dependencies: No  - Twitter handle: dayuanjian21687

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-24 19:16:10 -07:00
Adilkhan Sarsen
3e7d2a1b64
SelfQuery support for deeplake (#7888)
Added support SelfQuery for Deeplake
2023-07-24 14:22:33 -07:00
Bagatur
c8c8635dc9
mv module integrations docs (#8101) 2023-07-23 23:23:16 -07:00
Lance Martin
5a084e1b20
Async HTML loader and HTML2Text transformer (#8036)
New HTML loader that asynchronously loader a list of urls. 
 
New transformer using [HTML2Text](https://github.com/Alir3z4/html2text/)
for HTML to clean, easy-to-read plain ASCII text (valid Markdown).
2023-07-20 22:30:59 -07:00
Jacob Lee
56c6ab1715
Fix bad docs sidebar header (#7966)
Quick fix for:

<img width="283" alt="Screenshot 2023-07-19 at 2 49 44 PM"
src="https://github.com/hwchase17/langchain/assets/6952323/91e4868c-b75e-413d-9f8f-d34762abf164">

CC @baskaryan
2023-07-20 19:06:57 -07:00
Kacper Łukawski
ed6a5532ac
Implement async support in Qdrant local mode (#8001)
I've extended the support of async API to local Qdrant mode. It is faked
but allows prototyping without spinning a container. The tests are
improved to test the in-memory case as well.

@baskaryan @rlancemartin @eyurtsev @agola11
2023-07-20 19:04:33 -07:00
Dwai Banerjee
d8c40253c3
Adding endpoint_url to embeddings/bedrock.py and updated docs (#7927)
BedrockEmbeddings does not have endpoint_url so that switching to custom
endpoint is not possible. I have access to Bedrock custom endpoint and
cannot use BedrockEmbeddings

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-20 07:25:59 -07:00
Jeff Huber
5694e7b8cf
Update chroma notebook (#7978)
Fix up the Chroma notebook
- remove `.persist()` -- this is no longer in Chroma as of `0.4.0`
- update output to match `0.4.0`
- other cleanup work
2023-07-20 06:25:31 -07:00
Julien Salinas
3adab5e5be
Integrate NLP Cloud embeddings endpoint (#7931)
Add embeddings for [NLPCloud](https://docs.nlpcloud.com/#embeddings).

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Lance Martin <lance@langchain.dev>
2023-07-19 15:27:34 -07:00
Brendan Collins
9aef79c2e3
Add Geopandas.GeoDataFrame Document Loader (#3817)
Work in Progress.
WIP
Not ready...

Adds Document Loader support for
[Geopandas.GeoDataFrames](https://geopandas.org/)

Example:
- [x] stub out `GeoDataFrameLoader` class
- [x] stub out integration tests
- [ ] Experiment with different geometry text representations
- [ ] Verify CRS is successfully added in metadata
- [ ] Test effectiveness of searches on geometries
- [ ] Test with different geometry types (point, line, polygon with
multi-variants).
- [ ] Add documentation

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Lance Martin <122662504+rlancemartin@users.noreply.github.com>
2023-07-19 12:14:41 -07:00
Jarek Kazmierczak
f2ef3ff54a
Google Cloud Enterprise Search retriever (#7857)
Added a retriever that encapsulated Google Cloud Enterprise Search.


---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-18 18:24:08 -07:00
Jeff Huber
2139d0197e
upgrade chroma to 0.4.0 (#7749)
** This should land Monday the 17th ** 

Chroma is upgrading from `0.3.29` to `0.4.0`. `0.4.0` is easier to
build, more durable, faster, smaller, and more extensible. This comes
with a few changes:

1. A simplified and improved client setup. Instead of having to remember
weird settings, users can just do `EphemeralClient`, `PersistentClient`
or `HttpClient` (the underlying direct `Client` implementation is also
still accessible)

2. We migrated data stores away from `duckdb` and `clickhouse`. This
changes the api for the `PersistentClient` that used to reference
`chroma_db_impl="duckdb+parquet"`. Now we simply set
`is_persistent=true`. `is_persistent` is set for you to `true` if you
use `PersistentClient`.

3. Because we migrated away from `duckdb` and `clickhouse` - this also
means that users need to migrate their data into the new layout and
schema. Chroma is committed to providing extension notification and
tooling around any schema and data migrations (for example - this PR!).

After upgrading to `0.4.0` - if users try to access their data that was
stored in the previous regime, the system will throw an `Exception` and
instruct them how to use the migration assistant to migrate their data.
The migration assitant is a pip installable CLI: `pip install
chroma_migrate`. And is runnable by calling `chroma_migrate`

-- TODO ADD here is a short video demonstrating how it works. 

Please reference the readme at
[chroma-core/chroma-migrate](https://github.com/chroma-core/chroma-migrate)
to see a full write-up of our philosophy on migrations as well as more
details about this particular migration.

Please direct any users facing issues upgrading to our Discord channel
called
[#get-help](https://discord.com/channels/1073293645303795742/1129200523111841883).
We have also created a [email
listserv](https://airtable.com/shrHaErIs1j9F97BE) to notify developers
directly in the future about breaking changes.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-18 17:20:54 -07:00
maciej-skorupka
c6d1d6d7fc
feat: moving azure OpenAI API version to the latest 2023-05-15 (#7764)
Moving to the latest non-preview Azure OpenAI API version=2023-05-15.
The previous 2023-03-15-preview doesn't have support, SLA etc. For
instance, OpenAI SDK has moved to this version
https://github.com/openai/openai-python/releases/tag/v0.27.7

@baskaryan
2023-07-18 09:50:15 -07:00
satorioh
259a409998
docs(zilliz): connection_args add token description for serverless cl… (#7810)
Description:

Currently, Zilliz only support dedicated clusters using a pair of
username and password for connection. Regarding serverless clusters,
they can connect to them by using API keys( [ see official note
detail](https://docs.zilliz.com/docs/manage-cluster-credentials)), so I
add API key(token) description in Zilliz docs to make it more obvious
and convenient for this group of users to better utilize Zilliz. No
changes done to code.

---------

Co-authored-by: Robin.Wang <3Jg$94sbQ@q1>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-18 09:31:39 -07:00
Bill Zhang
dda11d2a05
WeaviateHybridSearchRetriever option to enable scores. (#7861)
Description: This PR adds the option to retrieve scores and explanations
in the WeaviateHybridSearchRetriever. This feature improves the
usability of the retriever by allowing users to understand the scoring
logic behind the search results and further refine their search queries.

Issue: This PR is a solution to the issue #7855 
Dependencies: This PR does not introduce any new dependencies.

Tag maintainer: @rlancemartin, @eyurtsev

I have included a unit test for the added feature, ensuring that it
retrieves scores and explanations correctly. I have also included an
example notebook demonstrating its use.
2023-07-18 07:57:17 -07:00
German Martin
f1eaa9b626
Lost in the middle: We have been ordering documents the WRONG way. (for long context) (#7520)
Motivation, it seems that when dealing with a long context and "big"
number of relevant documents we must avoid using out of the box score
ordering from vector stores.
See: https://arxiv.org/pdf/2306.01150.pdf

So, I added an additional parameter that allows you to reorder the
retrieved documents so we can work around this performance degradation.
The relevance respect the original search score but accommodates the
lest relevant document in the middle of the context.
Extract from the paper (one image speaks 1000 tokens):

![image](https://github.com/hwchase17/langchain/assets/1821407/fafe4843-6e18-4fa6-9416-50cc1d32e811)
This seems to be common to all diff arquitectures. SO I think we need a
good generic way to implement this reordering and run some test in our
already running retrievers.
It could be that my approach is not the best one from the architecture
point of view, happy to have a discussion about that.
For me this was the best place to introduce the change and start
retesting diff implementations.

@rlancemartin, @eyurtsev

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
2023-07-18 07:45:15 -07:00
Jasper
5b4d53e8ef
Add text_content kwarg to BrowserlessLoader (#7856)
Added keyword argument to toggle between getting the text content of a
site versus its HTML when using the `BrowserlessLoader`
2023-07-17 17:02:19 -07:00
Matt Robinson
3c489be773
feat: optional post-processing for Unstructured loaders (#7850)
### Summary

Adds a post-processing method for Unstructured loaders that allows users
to optionally modify or clean extracted elements.

### Testing

```python
from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()
docs[:5]
```


### Reviewrs
  - @rlancemartin
  - @eyurtsev
  - @hwchase17
2023-07-17 12:13:05 -07:00
Bagatur
2a315dbee9
fix nb (#7843) 2023-07-17 09:39:11 -07:00
Bagatur
98c48f303a
fix (#7838) 2023-07-17 07:53:11 -07:00
Dayuan Jiang
ee40d37098
add bm25 module (#7779)
- Description: Add a BM25 Retriever that do not need Elastic search
- Dependencies: rank_bm25(if it is not installed it will be install by
using pip, just like TFIDFRetriever do)
  - Tag maintainer: @rlancemartin, @eyurtsev
  - Twitter handle: DayuanJian21687

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-17 07:30:17 -07:00
sseide
25e3d3f283
Support Redis Sentinel database connections (#5196)
# Support Redis Sentinel database connections

This PR adds the support to connect not only to Redis standalone servers
but High Availability Replication sets too
(https://redis.io/docs/management/sentinel/)
Redis Replica Sets have on Master allowing to write data and 2+ replicas
with read-only access to the data. The additional Redis Sentinel
instances monitor all server and reconfigure the RW-Master on the fly if
it comes unavailable.

Therefore all connections must be made through the Sentinels the query
the current master for a read-write connection. This PR adds basic
support to also allow a redis connection url specifying a Sentinel as
Redis connection.

Redis documentation and Jupyter notebook with Redis examples are updated
to mention how to connect to a redis Replica Set with Sentinels

        - 

Remark - i did not found test cases for Redis server connections to add
new cases here. Therefor i tests the new utility class locally with
different kind of setups to make sure different connection urls are
working as expected. But no test case here as part of this PR.
2023-07-17 07:18:51 -07:00
Lance Martin
b015647e31
Add GPT4All embeddings (#7743)
Support for [GPT4All
embeddings](https://docs.gpt4all.io/gpt4all_python_embedding.html)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-15 10:04:29 -04:00
Lorenzo
77e6bbe6f0
fix typo in deeplake.ipynb (#7718)
- Fixing typos in deeplake documentation
- @baskaryan
2023-07-14 13:38:31 -04:00
Aarav Borthakur
210296a71f
Integrate Rockset as a document loader (#7681)
<!-- Thank you for contributing to LangChain!

Replace this comment with:
  - Description: a description of the change, 
  - Issue: the issue # it fixes (if applicable),
  - Dependencies: any dependencies required for this change,
- Tag maintainer: for a quicker response, tag the relevant maintainer
(see below),
- Twitter handle: we announce bigger features on Twitter. If your PR
gets announced and you'd like a mention, we'll gladly shout you out!

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
  2. an example notebook showing its use.

Maintainer responsibilities:
  - General / Misc / if you don't know who to tag: @baskaryan
  - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
  - Models / Prompts: @hwchase17, @baskaryan
  - Memory: @hwchase17
  - Agents / Tools / Toolkits: @hinthornw
  - Tracing / Callbacks: @agola11
  - Async: @agola11

If no one reviews your PR within a few days, feel free to @-mention the
same people again.

See contribution guidelines for more information on how to write/run
tests, lint, etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
 -->

Integrate [Rockset](https://rockset.com/docs/) as a document loader.

Issue: None
Dependencies: Nothing new (rockset's dependency was already added
[here](https://github.com/hwchase17/langchain/pull/6216))
Tag maintainer: @rlancemartin

I have added a test for the integration and an example notebook showing
its use. I ran `make lint` and everything looks good.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-14 07:58:13 -07:00
Daniel
6d88b23ef7
Update pgembedding.ipynb (#7699)
Update the extension name. It changed from pg_hnsw to pg_embedding.

Thank you. I missed this in my previous commit.
2023-07-14 08:39:01 -04:00
Lance Martin
6325a3517c
Make recursive loader yield while crawling (#7568)
Support actual lazy_load since it can take a while to crawl larger
directories.
2023-07-13 21:55:20 -07:00
UmerHA
82f3e32d8d
[Small upgrade] Allow document limit in AzureCognitiveSearchRetriever (#7690)
Multiple people have asked in #5081 for a way to limit the documents
returned from an AzureCognitiveSearchRetriever. This PR adds the `top_n`
parameter to allow that.


Twitter handle:
 [@UmerHAdil](twitter.com/umerHAdil)
2023-07-13 23:04:40 -04:00
Daniel
854f3fe9b1
Update pgembedding.ipynb (#7682)
Correct links to the pg_embedding repository and the Neon documentation.
2023-07-13 19:54:07 -04:00
Jasper
fbc97a77ed
add browserless loader (#7562)
# Browserless

Added support for Browserless' `/content` endpoint as a document loader.

### About Browserless

Browserless is a cloud service that provides access to headless Chrome
browsers via a REST API. It allows developers to automate Chromium in a
serverless fashion without having to configure and maintain their own
Chrome infrastructure.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Lance Martin <lance@langchain.dev>
2023-07-13 13:18:28 -07:00
Bagatur
7f8ff2a317
add tagger nb (#7637) 2023-07-13 01:48:23 -04:00
Jason Fan
8effd90be0
Add new types of document transformers (#7379)
- Description: Add two new document transformers that translates
documents into different languages and converts documents into q&a
format to improve vector search results. Uses OpenAI function calling
via the [doctran](https://github.com/psychic-api/doctran/tree/main)
library.
  - Issue: N/A
  - Dependencies: `doctran = "^0.0.5"`
  - Tag maintainer: @rlancemartin @eyurtsev @hwchase17 
  - Twitter handle: @psychicapi or @jfan001

Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in
#3182
- refactored `EmbeddingsRedundantFilter` to put it in a file under a new
`document_transformers` module
- Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as
well as the existing `EmbeddingsRedundantFilter`

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 23:53:30 -04:00
Gaurang Pawar
53722dcfdc
Fixed a typo in pinecone_hybrid_search.ipynb (#7627)
Fixed a small typo in documentation
2023-07-12 23:46:41 -04:00
Yaroslav Halchenko
0d92a7f357
codespell: workflow, config + some (quite a few) typos fixed (#6785)
Probably the most  boring PR to review ;)

Individual commits might be easier to digest

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2023-07-12 16:20:08 -04:00
os1ma
2667ddc686
Fix make docs_build and related scripts (#7276)
**Description: a description of the change**

Fixed `make docs_build` and related scripts which caused errors. There
are several changes.

First, I made the build of the documentation and the API Reference into
two separate commands. This is because it takes less time to build. The
commands for documents are `make docs_build`, `make docs_clean`, and
`make docs_linkcheck`. The commands for API Reference are `make
api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`.

It looked like `docs/.local_build.sh` could be used to build the
documentation, so I used that. Since `.local_build.sh` was also building
API Rerefence internally, I removed that process. `.local_build.sh` also
added some Bash options to stop in error or so. Futher more added `cd
"${SCRIPT_DIR}"` at the beginning so that the script will work no matter
which directory it is executed in.

`docs/api_reference/api_reference.rst` is removed, because which is
generated by `docs/api_reference/create_api_rst.py`, and added it to
.gitignore.

Finally, the description of CONTRIBUTING.md was modified.

**Issue: the issue # it fixes (if applicable)**

https://github.com/hwchase17/langchain/issues/6413

**Dependencies: any dependencies required for this change**

`nbdoc` was missing in group docs so it was added. I installed it with
the `poetry add --group docs nbdoc` command. I am concerned if any
modifications are needed to poetry.lock. I would greatly appreciate it
if you could pay close attention to this file during the review.

**Tag maintainer**
- General / Misc / if you don't know who to tag: @baskaryan

If this PR needs any additional changes, I'll be happy to make them!

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-11 22:05:14 -04:00
Kacper Łukawski
1f83b5f47e
Reuse the existing collection if configured properly in Qdrant.from_texts (#7530)
This PR changes the behavior of `Qdrant.from_texts` so the collection is
reused if not requested to recreate it. Previously, calling
`Qdrant.from_texts` or `Qdrant.from_documents` resulted in removing the
old data which was confusing for many.
2023-07-11 16:24:35 -04:00
Lance Martin
28d2b213a4
Update landing page for "question answering over documents" (#7152)
Improve documentation for a central use-case, qa / chat over documents.

This will be merged as an update to `index.mdx`
[here](https://python.langchain.com/docs/use_cases/question_answering/).

Testing w/ local Docusaurus server:

```
From `docs` directory:
mkdir _dist
cp -r {docs_skeleton,snippets} _dist
cp -r extras/* _dist/docs_skeleton/docs
cd _dist/docs_skeleton
yarn install
yarn start
```

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-10 14:15:13 -07:00
Adilkhan Sarsen
5debd5043e
Added deeplake use case examples of the new features (#6528)
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.

Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.

After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.

Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->

<!-- Remove if not applicable -->

Fixes # (issue)

#### Before submitting

<!-- If you're adding a new integration, please include:

1. a test for the integration - favor unit tests that does not rely on
network access.
2. an example notebook showing its use


See contribution guidelines for more information on how to write tests,
lint
etc:


https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->

#### Who can review?

Tag maintainers/contributors who might be interested:

<!-- For a quicker response, figure out the right person to tag with @

  @hwchase17 - project lead

  Tracing / Callbacks
  - @agola11

  Async
  - @agola11

  DataLoaders
  - @eyurtsev

  Models
  - @hwchase17
  - @agola11

  Agents / Tools / Toolkits
  - @hwchase17

  VectorStores / Retrievers / Memory
  - @dev2049

 -->
 
 1. Added use cases of the new features
 2. Done some code refactoring

---------

Co-authored-by: Ivo Stranic <istranic@gmail.com>
2023-07-10 07:04:29 -07:00
Kazuki Maeda
92b4418c8c
Datadog logs loader (#7356)
### Description
Created a Loader to get a list of specific logs from Datadog Logs.

### Dependencies
`datadog_api_client` is required.

### Twitter handle
[kzk_maeda](https://twitter.com/kzk_maeda)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-10 04:27:55 -04:00
Yifei Song
7d29bb2c02
Add Xorbits Dataframe as a Document Loader (#7319)
- [Xorbits](https://doc.xorbits.io/en/latest/) is an open-source
computing framework that makes it easy to scale data science and machine
learning workloads in parallel. Xorbits can leverage multi cores or GPUs
to accelerate computation on a single machine, or scale out up to
thousands of machines to support processing terabytes of data.

- This PR added support for the Xorbits document loader, which allows
langchain to leverage Xorbits to parallelize and distribute the loading
of data.
- Dependencies: This change requires the Xorbits library to be installed
in order to be used.
`pip install xorbits`
- Request for review: @rlancemartin, @eyurtsev
- Twitter handle: https://twitter.com/Xorbitsio

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-10 04:24:47 -04:00
Paul-Emile Brotons
d2cf0d16b3
adding max_marginal_relevance_search method to MongoDBAtlasVectorSearch (#7310)
Adding a maximal_marginal_relevance method to the
MongoDBAtlasVectorSearch vectorstore enhances the user experience by
providing more diverse search results

Issue: #7304
2023-07-10 04:04:19 -04:00
Matt Robinson
bcab894f4e
feat: Add UnstructuredTSVLoader (#7367)
### Summary

Adds an `UnstructuredTSVLoader` for TSV files. Also updates the doc
strings for `UnstructuredCSV` and `UnstructuredExcel` loaders.

### Testing

```python
from langchain.document_loaders.tsv import UnstructuredTSVLoader

loader = UnstructuredTSVLoader(
    file_path="example_data/mlb_teams_2012.csv", mode="elements"
)
docs = loader.load()
```
2023-07-10 03:07:10 -04:00
nikkie
dfc3f83b0f
docs(vectorstores/integrations/chroma): Fix loading and saving (#7437)
- Description: Fix loading and saving code about Chroma
- Issue: the issue #7436 
- Dependencies: -
- Twitter handle: https://twitter.com/ftnext
2023-07-10 02:05:15 -04:00
Roger Yu
633b673b85
Update pinecone.ipynb (#7382)
Fix typo
2023-07-08 01:48:03 -04:00
Georges Petrov
ec033ae277
Rename Databerry to Chaindesk (#7022)
---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-07 17:28:04 -04:00
Harrison Chase
7cdf97ba9b
Harrison/add to imports (#7370)
pgvector cleanup
2023-07-07 16:27:44 -04:00
German Martin
3ce4e46c8c
The Fellowship of the Vectors: New Embeddings Filter using clustering. (#7015)
Continuing with Tolkien inspired series of langchain tools. I bring to
you:
**The Fellowship of the Vectors**, AKA EmbeddingsClusteringFilter.
This document filter uses embeddings to group vectors together into
clusters, then allows you to pick an arbitrary number of documents
vector based on proximity to the cluster centers. That's a
representative sample of the cluster.

The original idea is from [Greg Kamradt](https://github.com/gkamradt)
from this video (Level4):
https://www.youtube.com/watch?v=qaPMdcCqtWk&t=365s

I added few tricks to make it a bit more versatile, so you can
parametrize what to do with duplicate documents in case of cluster
overlap: replace the duplicates with the next closest document or remove
it. This allow you to use it as an special kind of redundant filter too.
Additionally you can choose 2 diff orders: grouped by cluster or
respecting the original retriever scores.
In my use case I was using the docs grouped by cluster to run refine
chains per cluster to generate summarization over a large corpus of
documents.
Let me know if you want to change anything!

@rlancemartin, @eyurtsev, @hwchase17,

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-07-07 10:28:17 -07:00
OwenElliott
3074306ae1
Marqo Vector Store Examples & Type Hints (#7326)
This PR improves the example notebook for the Marqo vectorstore
implementation by adding a new RetrievalQAWithSourcesChain example. The
`embedding` parameter in `from_documents` has its type updated to
`Union[Embeddings, None]` and a default parameter of None because this
is ignored in Marqo.

This PR also upgrades the Marqo version to 0.11.0 to remove the device
parameter after a breaking change to the API.

Related to #7068 @tomhamer @hwchase17

---------

Co-authored-by: Tom Hamer <tom@marqo.ai>
2023-07-07 04:11:20 -04:00