# Implements support for Personal Access Token Authentication in the
ConfluenceLoader
Fixes#5191
Implements a new optional parameter for the ConfluenceLoader: `token`.
This allows the use of personal access authentication when using the
on-prem server version of Confluence.
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @Jflick58
Twitter Handle: felipe_yyc
---------
Co-authored-by: Felipe <feferreira@ea.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Unstructured Excel Loader
Adds an `UnstructuredExcelLoader` class for `.xlsx` and `.xls` files.
Works with `unstructured>=0.6.7`. A plain text representation of the
Excel file will be available under the `page_content` attribute in the
doc. If you use the loader in `"elements"` mode, an HTML representation
of the Excel file will be available under the `text_as_html` metadata
key. Each sheet in the Excel document is its own document.
### Testing
```python
from langchain.document_loaders import UnstructuredExcelLoader
loader = UnstructuredExcelLoader(
"example_data/stanley-cups.xlsx",
mode="elements"
)
docs = loader.load()
```
## Who can review?
@hwchase17
@eyurtsev
# Create elastic_vector_search.ElasticKnnSearch class
This extends `langchain/vectorstores/elastic_vector_search.py` by adding
a new class `ElasticKnnSearch`
Features:
- Allow creating an index with the `dense_vector` mapping compataible
with kNN search
- Store embeddings in index for use with kNN search (correct mapping
creates HNSW data structure)
- Perform approximate kNN search
- Perform hybrid BM25 (`query{}`) + kNN (`knn{}`) search
- perform knn search by either providing a `query_vector` or passing a
hosted `model_id` to use query_vector_builder to automatically generate
a query_vector at search time
Connection options
- Using `cloud_id` from Elastic Cloud
- Passing elasticsearch client object
search options
- query
- k
- query_vector
- model_id
- size
- source
- knn_boost (hybrid search)
- query_boost (hybrid search)
- fields
This also adds examples to
`docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb`
Fixes # [5346](https://github.com/hwchase17/langchain/issues/5346)
cc: @dev2049
-->
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Lint sphinx documentation and fix broken links
This PR lints multiple warnings shown in generation of the project
documentation (using "make docs_linkcheck" and "make docs_build").
Additionally documentation internal links to (now?) non-existent files
are modified to point to existing documents as it seemed the new correct
target.
The documentation is not updated content wise.
There are no source code changes.
Fixes # (issue)
- broken documentation links to other files within the project
- sphinx formatting (linting)
## Before submitting
No source code changes, so no new tests added.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# docs: `ecosystem_integrations` update 3
Next cycle of updating the `ecosystem/integrations`
* Added an integration `template` file
* Added missed integration files
* Fixed several document_loaders/notebooks
## Who can review?
Is it possible to assign somebody to review PRs on docs? Thanks.
# Update Unstructured docs to remove the `detectron2` install
instructions
Removes `detectron2` installation instructions from the Unstructured
docs because installing `detectron2` is no longer required for
`unstructured>=0.7.0`. The `detectron2` model now runs using the ONNX
runtime.
## Who can review?
@hwchase17
@eyurtsev
# Support Qdrant filters
Qdrant has an [extensive filtering
system](https://qdrant.tech/documentation/concepts/filtering/) with rich
type support. This PR makes it possible to use the filters in Langchain
by passing an additional param to both the
`similarity_search_with_score` and `similarity_search` methods.
## Who can review?
@dev2049 @hwchase17
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
As the title says, I added more code splitters.
The implementation is trivial, so i don't add separate tests for each
splitter.
Let me know if any concerns.
Fixes # (issue)
https://github.com/hwchase17/langchain/issues/5170
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @hwchase17
---------
Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
# Creates GitHubLoader (#5257)
GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.
Fixes#5257
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Added New Trello loader class and documentation
Simple Loader on top of py-trello wrapper.
With a board name you can pull cards and to do some field parameter
tweaks on load operation.
I included documentation and examples.
Included unit test cases using patch and a fixture for py-trello client
class.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# docs: ecosystem/integrations update
It is the first in a series of `ecosystem/integrations` updates.
The ecosystem/integrations list is missing many integrations.
I'm adding the missing integrations in a consistent format:
1. description of the integrated system
2. `Installation and Setup` section with 'pip install ...`, Key setup,
and other necessary settings
3. Sections like `LLM`, `Text Embedding Models`, `Chat Models`... with
links to correspondent examples and imports of the used classes.
This PR keeps new docs, that are presented in the
`docs/modules/models/text_embedding/examples` but missed in the
`ecosystem/integrations`. The next PRs will cover the next example
sections.
Also updated `integrations.rst`: added the `Dependencies` section with a
link to the packages used in LangChain.
## Who can review?
@hwchase17
@eyurtsev
@dev2049
# docs: ecosystem/integrations update 2
#5219 - part 1
The second part of this update (parts are independent of each other! no
overlap):
- added diffbot.md
- updated confluence.ipynb; added confluence.md
- updated college_confidential.md
- updated openai.md
- added blackboard.md
- added bilibili.md
- added azure_blob_storage.md
- added azlyrics.md
- added aws_s3.md
## Who can review?
@hwchase17@agola11
@agola11
@vowelparrot
@dev2049
# Fix for `update_document` Function in Chroma
## Summary
This pull request addresses an issue with the `update_document` function
in the Chroma class, as described in
[#5031](https://github.com/hwchase17/langchain/issues/5031#issuecomment-1562577947).
The issue was identified as an `AttributeError` raised when calling
`update_document` due to a missing corresponding method in the
`Collection` object. This fix refactors the `update_document` method in
`Chroma` to correctly interact with the `Collection` object.
## Changes
1. Fixed the `update_document` method in the `Chroma` class to correctly
call methods on the `Collection` object.
2. Added the corresponding test `test_chroma_update_document` in
`tests/integration_tests/vectorstores/test_chroma.py` to reflect the
updated method call.
3. Added an example and explanation of how to use the `update_document`
function in the Jupyter notebook tutorial for Chroma.
## Test Plan
All existing tests pass after this change. In addition, the
`test_chroma_update_document` test case now correctly checks the
functionality of `update_document`, ensuring that the function works as
expected and updates the content of documents correctly.
## Reviewers
@dev2049
This fix will ensure that users are able to use the `update_document`
function as expected, without encountering the previous
`AttributeError`. This will enhance the usability and reliability of the
Chroma class for all users.
Thank you for considering this pull request. I look forward to your
feedback and suggestions.
# Add SKLearnVectorStore
This PR adds SKLearnVectorStore, a simply vector store based on
NearestNeighbors implementations in the scikit-learn package. This
provides a simple drop-in vector store implementation with minimal
dependencies (scikit-learn is typically installed in a data scientist /
ml engineer environment). The vector store can be persisted and loaded
from json, bson and parquet format.
SKLearnVectorStore has soft (dynamic) dependency on the scikit-learn,
numpy and pandas packages. Persisting to bson requires the bson package,
persisting to parquet requires the pyarrow package.
## Before submitting
Integration tests are provided under
`tests/integration_tests/vectorstores/test_sklearn.py`
Sample usage notebook is provided under
`docs/modules/indexes/vectorstores/examples/sklear.ipynb`
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Better docs for weaviate hybrid search
<!--
Thank you for contributing to LangChain! Your PR will appear in our next
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
-->
<!-- Remove if not applicable -->
Fixes: NA
## Before submitting
<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
@dev2049
zep-python's sync methods no longer need an asyncio wrapper. This was
causing issues with FastAPI deployment.
Zep also now supports putting and getting of arbitrary message metadata.
Bump zep-python version to v0.30
Remove nest-asyncio from Zep example notebooks.
Modify tests to include metadata.
---------
Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
Co-authored-by: Daniel Chalef <131175+danielchalef@users.noreply.github.com>
For most queries it's the `size` parameter that determines final number
of documents to return. Since our abstractions refer to this as `k`, set
this to be `k` everywhere instead of expecting a separate param. Would
be great to have someone more familiar with OpenSearch validate that
this is reasonable (e.g. that having `size` and what OpenSearch calls
`k` be the same won't lead to any strange behavior). cc @naveentatikonda
Closes#5212
# Add QnA with sources example
<!--
Thank you for contributing to LangChain! Your PR will appear in our next
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
-->
<!-- Remove if not applicable -->
Fixes: see
https://stackoverflow.com/questions/76207160/langchain-doesnt-work-with-weaviate-vector-database-getting-valueerror/76210017#76210017
## Before submitting
<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
@dev2049
# Bibtex integration
Wrap bibtexparser to retrieve a list of docs from a bibtex file.
* Get the metadata from the bibtex entries
* `page_content` get from the local pdf referenced in the `file` field
of the bibtex entry using `pymupdf`
* If no valid pdf file, `page_content` set to the `abstract` field of
the bibtex entry
* Support Zotero flavour using regex to get the file path
* Added usage example in
`docs/modules/indexes/document_loaders/examples/bibtex.ipynb`
---------
Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Add Joplin document loader
[Joplin](https://joplinapp.org/) is an open source note-taking app.
Joplin has a [REST API](https://joplinapp.org/api/references/rest_api/)
for accessing its local database. The proposed `JoplinLoader` uses the
API to retrieve all notes in the database and their metadata. Joplin
needs to be installed and running locally, and an access token is
required.
- The PR includes an integration test.
- The PR includes an example notebook.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Vectara Integration
This PR provides integration with Vectara. Implemented here are:
* langchain/vectorstore/vectara.py
* tests/integration_tests/vectorstores/test_vectara.py
* langchain/retrievers/vectara_retriever.py
And two IPYNB notebooks to do more testing:
* docs/modules/chains/index_examples/vectara_text_generation.ipynb
* docs/modules/indexes/vectorstores/examples/vectara.ipynb
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# DOCS added missed document_loader examples
Added missed examples: `JSON`, `Open Document Format (ODT)`,
`Wikipedia`, `tomarkdown`.
Updated them to a consistent format.
## Who can review?
@hwchase17
@dev2049
# Add link to Psychic from document loaders documentation page
In my previous PR I forgot to update `document_loaders.rst` to link to
`psychic.ipynb` to make it discoverable from the main documentation.
# Add Mastodon toots loader.
Loader works either with public toots, or Mastodon app credentials. Toot
text and user info is loaded.
I've also added integration test for this new loader as it works with
public data, and a notebook with example output run now.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Improve pinecone hybrid search retriever adding metadata support
I simply remove the hardwiring of metadata to the existing
implementation allowing one to pass `metadatas` attribute to the
constructors and in `get_relevant_documents`. I also add one missing pip
install to the accompanying notebook (I am not adding dependencies, they
were pre-existing).
First contribution, just hoping to help, feel free to critique :)
my twitter username is `@andreliebschner`
While looking at hybrid search I noticed #3043 and #1743. I think the
former can be closed as following the example right now (even prior to
my improvements) works just fine, the latter I think can be also closed
safely, maybe pointing out the relevant classes and example. Should I
reply those issues mentioning someone?
@dev2049, @hwchase17
---------
Co-authored-by: Andreas Liebschner <a.liebschner@shopfully.com>
### Submit Multiple Files to the Unstructured API
Enables batching multiple files into a single Unstructured API requests.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.
### Testing
The following should load both documents, using two of the example docs
from the integration tests folder.
```python
from langchain.document_loaders import UnstructuredAPIFileLoader
file_paths = ["examples/layout-parser-paper.pdf", "examples/whatsapp_chat.txt"]
loader = UnstructuredAPIFileLoader(
file_paths=file_paths,
api_key="FAKE_API_KEY",
strategy="fast",
mode="elements",
)
docs = loader.load()
```
# Add self query translator for weaviate vectorstore
Adds support for the EQ comparator and the AND/OR operators.
Co-authored-by: Dominic Chan <dchan@cppib.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Improve Evernote Document Loader
When exporting from Evernote you may export more than one note.
Currently the Evernote loader concatenates the content of all notes in
the export into a single document and only attaches the name of the
export file as metadata on the document.
This change ensures that each note is loaded as an independent document
and all available metadata on the note e.g. author, title, created,
updated are added as metadata on each document.
It also uses an existing optional dependency of `html2text` instead of
`pypandoc` to remove the need to download the pandoc application via
`download_pandoc()` to be able to use the `pypandoc` python bindings.
Fixes#4493
Co-authored-by: Mike McGarry <mike.mcgarry@finbourne.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Zep Retriever - Vector Search Over Chat History with the Zep Long-term
Memory Service
More on Zep: https://github.com/getzep/zep
Note: This PR is related to and relies on
https://github.com/hwchase17/langchain/pull/4834. I did not want to
modify the `pyproject.toml` file to add the `zep-python` dependency a
second time.
Co-authored-by: Daniel Chalef <daniel.chalef@private.org>