Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
The current document has not mentioned that splits larger than chunk
size would happen. I update the related document and explain why it
happens and how to solve it.
related issue #1349#3838#2140
This PR introduces a persistence layer to help with indexing workflows
into
vectostores.
The indexing code helps users to:
1. Avoid writing duplicated content into the vectostore
2. Avoid over-writing content if it's unchanged
Importantly, this keeps on working even if the content being written is
derived
via a set of transformations from some source content (e.g., indexing
children
documents that were derived from parent documents by chunking.)
The two main components are:
1. Persistence layer that keeps track of which keys were updated and
when.
Keeping track of the timestamp of updates, allows to clean up old
content
safely, and with minimal complexity.
2. HashedDocument which is used to hash the contents (including
metadata) of
the documents. We rely on the hashes for identifying duplicates.
The indexing code works with **ANY** document loader. To add
transformations
to the documents, users for now can add a custom document loader
that composes an existing loader together with document transformers.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Now with ElasticsearchStore VectorStore merged, i've added support for
the self-query retriever.
I've added a notebook also to demonstrate capability. I've also added
unit tests.
**Credit**
@elastic and @phoey1 on twitter.
This PR adds the ability to temporarily cache or persistently store
embeddings.
A notebook has been included showing how to set up the cache and how to
use it with a vectorstore.
begining -> beginning
<!-- Thank you for contributing to LangChain!
Replace this comment with:
- Description: a description of the change,
- Issue: the issue # it fixes (if applicable),
- Dependencies: any dependencies required for this change,
- Tag maintainer: for a quicker response, tag the relevant maintainer
(see below),
- Twitter handle: we announce bigger features on Twitter. If your PR
gets announced and you'd like a mention, we'll gladly shout you out!
Please make sure you're PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.
If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use.
Maintainer responsibilities:
- General / Misc / if you don't know who to tag: @baskaryan
- DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
- Models / Prompts: @hwchase17, @baskaryan
- Memory: @hwchase17
- Agents / Tools / Toolkits: @hinthornw
- Tracing / Callbacks: @agola11
- Async: @agola11
If no one reviews your PR within a few days, feel free to @-mention the
same people again.
See contribution guidelines for more information on how to write/run
tests, lint, etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
## Description
This commit introduces the `DropboxLoader` class, a new document loader
that allows loading files from Dropbox into the application. The loader
relies on a Dropbox app, which requires creating an app on Dropbox,
obtaining the necessary scope permissions, and generating an access
token. Additionally, the dropbox Python package is required.
The `DropboxLoader` class is designed to be used as a document loader
for processing various file types, including text files, PDFs, and
Dropbox Paper files.
## Dependencies
`pip install dropbox` and `pip install unstructured` for PDF reading.
## Tag maintainer
@rlancemartin, @eyurtsev (from Data Loaders). I'd appreciate some
feedback here 🙏 .
## Social Networks
https://github.com/rubenbarraganhttps://www.linkedin.com/in/rgbarragan/https://twitter.com/RubenBarraganP
---------
Co-authored-by: Ruben Barragan <rbarragan@Rubens-MacBook-Air.local>
Given a user question, this will -
* Use LLM to generate a set of queries.
* Query for each.
* The URLs from search results are stored in self.urls.
* A check is performed for any new URLs that haven't been processed yet
(not in self.url_database).
* Only these new URLs are loaded, transformed, and added to the
vectorstore.
* The vectorstore is queried for relevant documents based on the
questions generated by the LLM.
* Only unique documents are returned as the final result.
This code will avoid reprocessing of URLs across multiple runs of
similar queries, which should improve the performance of the retriever.
It also keeps track of all URLs that have been processed, which could be
useful for debugging or understanding the retriever's behavior.
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- Until now, hybrid search was limited to modules requiring external
services, such as Weaviate/Pinecone Hybrid Search. However, I have
developed a hybrid retriever that can merge a list of retrievers using
the [Reciprocal Rank
Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
algorithm. This new approach, similar to Weaviate hybrid search, does
not require the initialization of any external service.
- Dependencies: No - Twitter handle: dayuanjian21687
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
New HTML loader that asynchronously loader a list of urls.
New transformer using [HTML2Text](https://github.com/Alir3z4/html2text/)
for HTML to clean, easy-to-read plain ASCII text (valid Markdown).
I've extended the support of async API to local Qdrant mode. It is faked
but allows prototyping without spinning a container. The tests are
improved to test the in-memory case as well.
@baskaryan @rlancemartin @eyurtsev @agola11
BedrockEmbeddings does not have endpoint_url so that switching to custom
endpoint is not possible. I have access to Bedrock custom endpoint and
cannot use BedrockEmbeddings
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Work in Progress.
WIP
Not ready...
Adds Document Loader support for
[Geopandas.GeoDataFrames](https://geopandas.org/)
Example:
- [x] stub out `GeoDataFrameLoader` class
- [x] stub out integration tests
- [ ] Experiment with different geometry text representations
- [ ] Verify CRS is successfully added in metadata
- [ ] Test effectiveness of searches on geometries
- [ ] Test with different geometry types (point, line, polygon with
multi-variants).
- [ ] Add documentation
---------
Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Lance Martin <122662504+rlancemartin@users.noreply.github.com>
** This should land Monday the 17th **
Chroma is upgrading from `0.3.29` to `0.4.0`. `0.4.0` is easier to
build, more durable, faster, smaller, and more extensible. This comes
with a few changes:
1. A simplified and improved client setup. Instead of having to remember
weird settings, users can just do `EphemeralClient`, `PersistentClient`
or `HttpClient` (the underlying direct `Client` implementation is also
still accessible)
2. We migrated data stores away from `duckdb` and `clickhouse`. This
changes the api for the `PersistentClient` that used to reference
`chroma_db_impl="duckdb+parquet"`. Now we simply set
`is_persistent=true`. `is_persistent` is set for you to `true` if you
use `PersistentClient`.
3. Because we migrated away from `duckdb` and `clickhouse` - this also
means that users need to migrate their data into the new layout and
schema. Chroma is committed to providing extension notification and
tooling around any schema and data migrations (for example - this PR!).
After upgrading to `0.4.0` - if users try to access their data that was
stored in the previous regime, the system will throw an `Exception` and
instruct them how to use the migration assistant to migrate their data.
The migration assitant is a pip installable CLI: `pip install
chroma_migrate`. And is runnable by calling `chroma_migrate`
-- TODO ADD here is a short video demonstrating how it works.
Please reference the readme at
[chroma-core/chroma-migrate](https://github.com/chroma-core/chroma-migrate)
to see a full write-up of our philosophy on migrations as well as more
details about this particular migration.
Please direct any users facing issues upgrading to our Discord channel
called
[#get-help](https://discord.com/channels/1073293645303795742/1129200523111841883).
We have also created a [email
listserv](https://airtable.com/shrHaErIs1j9F97BE) to notify developers
directly in the future about breaking changes.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Moving to the latest non-preview Azure OpenAI API version=2023-05-15.
The previous 2023-03-15-preview doesn't have support, SLA etc. For
instance, OpenAI SDK has moved to this version
https://github.com/openai/openai-python/releases/tag/v0.27.7
@baskaryan
Description:
Currently, Zilliz only support dedicated clusters using a pair of
username and password for connection. Regarding serverless clusters,
they can connect to them by using API keys( [ see official note
detail](https://docs.zilliz.com/docs/manage-cluster-credentials)), so I
add API key(token) description in Zilliz docs to make it more obvious
and convenient for this group of users to better utilize Zilliz. No
changes done to code.
---------
Co-authored-by: Robin.Wang <3Jg$94sbQ@q1>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description: This PR adds the option to retrieve scores and explanations
in the WeaviateHybridSearchRetriever. This feature improves the
usability of the retriever by allowing users to understand the scoring
logic behind the search results and further refine their search queries.
Issue: This PR is a solution to the issue #7855
Dependencies: This PR does not introduce any new dependencies.
Tag maintainer: @rlancemartin, @eyurtsev
I have included a unit test for the added feature, ensuring that it
retrieves scores and explanations correctly. I have also included an
example notebook demonstrating its use.
Motivation, it seems that when dealing with a long context and "big"
number of relevant documents we must avoid using out of the box score
ordering from vector stores.
See: https://arxiv.org/pdf/2306.01150.pdf
So, I added an additional parameter that allows you to reorder the
retrieved documents so we can work around this performance degradation.
The relevance respect the original search score but accommodates the
lest relevant document in the middle of the context.
Extract from the paper (one image speaks 1000 tokens):
![image](https://github.com/hwchase17/langchain/assets/1821407/fafe4843-6e18-4fa6-9416-50cc1d32e811)
This seems to be common to all diff arquitectures. SO I think we need a
good generic way to implement this reordering and run some test in our
already running retrievers.
It could be that my approach is not the best one from the architecture
point of view, happy to have a discussion about that.
For me this was the best place to introduce the change and start
retesting diff implementations.
@rlancemartin, @eyurtsev
---------
Co-authored-by: Lance Martin <lance@langchain.dev>
- Description: Add a BM25 Retriever that do not need Elastic search
- Dependencies: rank_bm25(if it is not installed it will be install by
using pip, just like TFIDFRetriever do)
- Tag maintainer: @rlancemartin, @eyurtsev
- Twitter handle: DayuanJian21687
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
# Support Redis Sentinel database connections
This PR adds the support to connect not only to Redis standalone servers
but High Availability Replication sets too
(https://redis.io/docs/management/sentinel/)
Redis Replica Sets have on Master allowing to write data and 2+ replicas
with read-only access to the data. The additional Redis Sentinel
instances monitor all server and reconfigure the RW-Master on the fly if
it comes unavailable.
Therefore all connections must be made through the Sentinels the query
the current master for a read-write connection. This PR adds basic
support to also allow a redis connection url specifying a Sentinel as
Redis connection.
Redis documentation and Jupyter notebook with Redis examples are updated
to mention how to connect to a redis Replica Set with Sentinels
-
Remark - i did not found test cases for Redis server connections to add
new cases here. Therefor i tests the new utility class locally with
different kind of setups to make sure different connection urls are
working as expected. But no test case here as part of this PR.
<!-- Thank you for contributing to LangChain!
Replace this comment with:
- Description: a description of the change,
- Issue: the issue # it fixes (if applicable),
- Dependencies: any dependencies required for this change,
- Tag maintainer: for a quicker response, tag the relevant maintainer
(see below),
- Twitter handle: we announce bigger features on Twitter. If your PR
gets announced and you'd like a mention, we'll gladly shout you out!
If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use.
Maintainer responsibilities:
- General / Misc / if you don't know who to tag: @baskaryan
- DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
- Models / Prompts: @hwchase17, @baskaryan
- Memory: @hwchase17
- Agents / Tools / Toolkits: @hinthornw
- Tracing / Callbacks: @agola11
- Async: @agola11
If no one reviews your PR within a few days, feel free to @-mention the
same people again.
See contribution guidelines for more information on how to write/run
tests, lint, etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
Integrate [Rockset](https://rockset.com/docs/) as a document loader.
Issue: None
Dependencies: Nothing new (rockset's dependency was already added
[here](https://github.com/hwchase17/langchain/pull/6216))
Tag maintainer: @rlancemartin
I have added a test for the integration and an example notebook showing
its use. I ran `make lint` and everything looks good.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Multiple people have asked in #5081 for a way to limit the documents
returned from an AzureCognitiveSearchRetriever. This PR adds the `top_n`
parameter to allow that.
Twitter handle:
[@UmerHAdil](twitter.com/umerHAdil)
# Browserless
Added support for Browserless' `/content` endpoint as a document loader.
### About Browserless
Browserless is a cloud service that provides access to headless Chrome
browsers via a REST API. It allows developers to automate Chromium in a
serverless fashion without having to configure and maintain their own
Chrome infrastructure.
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Lance Martin <lance@langchain.dev>