**Description:**
Expanding version in all the Confluence API calls so to get when the
page was last modified/created in all cases.
**Issue:** #12812
**Twitter handle:** zzste
**Description:** Update s3_file.py to use arguments **mode** and
**post_processors** from the base class **UnstructuredBaseLoader** to
include more metadata about the files from the S3 bucket such as
*'page_number', 'languages'* etc.
**Issue:** NA
**Dependencies:** None
**Twitter handle:** preak95
---------
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
RecursiveUrlLoader does not currently provide an option to set
`base_url` other than the `url`, though it uses a function with such an
option.
For example, this causes it unable to parse the
`https://python.langchain.com/docs`, as it returns the 404 page, and
`https://python.langchain.com/docs/get_started/introduction` has no
child routes to parse.
`base_url` allows setting the `https://python.langchain.com/docs` to
filter by, while the starting URL is anything inside, that contains
relevant links to continue crawling.
I understand that for this case, the docusaurus loader could be used,
but it's a common issue with many websites.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "community: deprecate DocugamiLoader"
- [x] **PR message**: Deprecate the langchain_community and use the
docugami_langchain DocugamiLoader
---------
Co-authored-by: Kenzie Mihardja <kenzie28@cs.washington.edu>
- **Description:** This change fixes a bug where attempts to load data
from Notion using the NotionDBLoader resulted in a 400 Bad Request
error. The issue was traced to the unconditional addition of an empty
'filter' object in the request payload, which Notion's API does not
accept. The modification ensures that the 'filter' object is only
included in the payload when it is explicitly provided and not empty,
thus preventing the 400 error from occurring.
- **Issue:** Fixes
[#18009](https://github.com/langchain-ai/langchain/issues/18009)
- **Dependencies:** None
- **Twitter handle:** @gunnzolder
Co-authored-by: Anton Parkhomenko <anton@merge.rocks>
BasePDFLoader doesn't parse the suffix of the file correctly when
parsing S3 presigned urls. This fix enables the proper detection and
parsing of S3 presigned URLs to prevent errors such as `OSError: [Errno
36] File name too long`.
No additional dependencies required.
- **Description:** Adding an optional parameter `linearization_config`
to the `AmazonTextractPDFLoader` so the caller can define how the output
will be linearized, instead of forcing a predefined set of linearization
configs. It will still have a default configuration as this will be an
optional parameter.
- **Issue:** #17457
- **Dependencies:** The same ones that already exist for
`AmazonTextractPDFLoader`
- **Twitter handle:** [@lvieirajr19](https://twitter.com/lvieirajr19)
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- **Description:** `S3DirectoryLoader` is failing if prefix is a folder
(ex: `my_folder/`) because `S3FileLoader` will try to load that folder
and will fail. This PR skip nested directories so prefix can be set to
folder instead of `my_folder/files_prefix`.
- **Issue:**
- #11917
- #6535
- #4326
- **Dependencies:** none
- **Twitter handle:** @Falydoor
- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
"community: added a feature to filter documents in Mongoloader"
- **Description:** added a feature to filter documents in Mongoloader
- **Feature:** the feature #18251
- **Dependencies:** No
- **Twitter handle:** https://twitter.com/im_Kushagra
### Description
Changed the value specified for `content_key` in JSONLoader from a
single key to a value based on jq schema.
I created [similar
PR](https://github.com/langchain-ai/langchain/pull/11255) before, but it
has several conflicts because of the architectural change associated
stable version release, so I re-create this PR to fit new architecture.
### Why
For json data like the following, specify `.data[].attributes.message`
for page_content and `.data[].attributes.id` or
`.data[].attributes.attributes. tags`, etc., the `content_key` must also
parse the json structure.
<details>
<summary>sample json data</summary>
```json
{
"data": [
{
"attributes": {
"message": "message1",
"tags": [
"tag1"
]
},
"id": "1"
},
{
"attributes": {
"message": "message2",
"tags": [
"tag2"
]
},
"id": "2"
}
]
}
```
</details>
<details>
<summary>sample code</summary>
```python
def metadata_func(record: dict, metadata: dict) -> dict:
metadata["source"] = None
metadata["id"] = record.get("id")
metadata["tags"] = record["attributes"].get("tags")
return metadata
sample_file = "sample1.json"
loader = JSONLoader(
file_path=sample_file,
jq_schema=".data[]",
content_key=".attributes.message", ## content_key is parsable into jq schema
is_content_key_jq_parsable=True, ## this is added parameter
metadata_func=metadata_func
)
data = loader.load()
data
```
</details>
### Dependencies
none
### Twitter handle
[kzk_maeda](https://twitter.com/kzk_maeda)
**Description:** Fix `metadata_extractor` type for `RecursiveUrlLoader`,
the default `_metadata_extractor` returns `dict` instead of `str`.
**Issue:** N/A
**Dependencies:** N/A
**Twitter handle:** N/A
Signed-off-by: Hemslo Wang <hemslo.wang@gmail.com>
Description-
- Changed the GitHub endpoint as existing was not working and giving 404
not found error
- Also the existing function was failing if file_filter is not passed as
the tree api return all paths including directory as well, and when
get_file_content was iterating over these path, the function was failing
for directory as the api was returning list of files inside the
directory, so added a condition to ignore the paths if it a directory
- Fixes this issue -
https://github.com/langchain-ai/langchain/issues/17453
Co-authored-by: Radhika Bansal <Radhika.Bansal@veritas.com>
- **Description:** A generic document loader adapter for SQLAlchemy on
top of LangChain's `SQLDatabaseLoader`.
- **Needed by:** https://github.com/crate-workbench/langchain/pull/1
- **Depends on:** GH-16655
- **Addressed to:** @baskaryan, @cbornet, @eyurtsev
Hi from CrateDB again,
in the same spirit like GH-16243 and GH-16244, this patch breaks out
another commit from https://github.com/crate-workbench/langchain/pull/1,
in order to reduce the size of this patch before submitting it, and to
separate concerns.
To accompany the SQLAlchemy adapter implementation, the patch includes
integration tests for both SQLite and PostgreSQL. Let me know if
corresponding utility resources should be added at different spots.
With kind regards,
Andreas.
### Software Tests
```console
docker compose --file libs/community/tests/integration_tests/document_loaders/docker-compose/postgresql.yml up
```
```console
cd libs/community
pip install psycopg2-binary
pytest -vvv tests/integration_tests -k sqldatabase
```
```
14 passed
```
![image](https://github.com/langchain-ai/langchain/assets/453543/42be233c-eb37-4c76-a830-474276e01436)
---------
Co-authored-by: Andreas Motl <andreas.motl@crate.io>
**Description**
This PR addresses a rare issue in `OpenAIWhisperParser` that causes it
to crash when processing an audio file with a duration very close to the
class's chunk size threshold of 20 minutes.
**Issue**
#11449
**Dependencies**
None
**Tag maintainer**
@agola11 @eyurtsev
**Twitter handle**
leonardodiegues
---------
Co-authored-by: Leonardo Diegues <leonardo.diegues@grupofolha.com.br>
Co-authored-by: Bagatur <baskaryan@gmail.com>