- **Bug code**: In
langchain_community/document_loaders/csv_loader.py:100
- **Description**: currently, when 'CSVLoader' reads the column as None
in the 'csv' file, it will report an error because the 'CSVLoader' does
not verify whether the column is of str type and does not consider how
to handle the corresponding 'row_data' when the column is' None 'in the
csv. This pr provides a solution.
- **Issue:** Fix#20699
- **thinking:**
1. Refer to the processing method for
'langchain_community/document_loaders/csv_loader.py:100' when **'v'**
equals'None', and apply the same method to '**k**'.
(Reference`csv.DictReader` ,**'k'** will only be None when `
len(columns) < len(number_row_data)` is established)
2. **‘k’** equals None only holds when it is the last column, and its
corresponding **'v'** type is a list. Therefore, I referred to the data
format in 'Document' and used ',' to concatenated the elements in the
list.(But I'm not sure if you accept this form, if you have any other
ideas, communicate)
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
## 'raise_for_status' parameter of WebBaseLoader works in sync load but
not in async load.
In webBaseLoader:
Sync load is calling `_scrape` and has `raise_for_status` properly
handled.
```
def _scrape(
self,
url: str,
parser: Union[str, None] = None,
bs_kwargs: Optional[dict] = None,
) -> Any:
from bs4 import BeautifulSoup
if parser is None:
if url.endswith(".xml"):
parser = "xml"
else:
parser = self.default_parser
self._check_parser(parser)
html_doc = self.session.get(url, **self.requests_kwargs)
if self.raise_for_status:
html_doc.raise_for_status()
if self.encoding is not None:
html_doc.encoding = self.encoding
elif self.autoset_encoding:
html_doc.encoding = html_doc.apparent_encoding
return BeautifulSoup(html_doc.text, parser, **(bs_kwargs or {}))
```
Async load is calling `_fetch` but missing `raise_for_status` logic.
```
async def _fetch(
self, url: str, retries: int = 3, cooldown: int = 2, backoff: float = 1.5
) -> str:
async with aiohttp.ClientSession() as session:
for i in range(retries):
try:
async with session.get(
url,
headers=self.session.headers,
ssl=None if self.session.verify else False,
cookies=self.session.cookies.get_dict(),
) as response:
return await response.text()
```
Co-authored-by: kefan.you <darkfss@sina.com>
**Description:** Add `Origin/langchain` to Apify's client's user-agent
to attribute API activity to LangChain (at Apify, we aim to monitor our
integrations to evaluate whether we should invest more in the LangChain
integration regarding functionality and content)
**Issue:** None
**Dependencies:** None
**Twitter handle:** None
Thank you for contributing to LangChain!
- [x] **PR title**: "community: updated Browserbase loader"
- [x] **PR message**:
Updates the Browserbase loader with more options and improved docs.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
**Description:** Added a few additional arguments to the whisper parser,
which can be consumed by the underlying API.
The prompt is especially important to fine-tune transcriptions.
---------
Co-authored-by: Roi Perlman <roi@fivesigmalabs.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- Oracle AI Vector Search
Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.
- Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.
This Pull Requests Adds the following functionalities
Oracle AI Vector Search : Vector Store
Oracle AI Vector Search : Document Loader
Oracle AI Vector Search : Document Splitter
Oracle AI Vector Search : Summary
Oracle AI Vector Search : Oracle Embeddings
- We have added unit tests and have our own local unit test suite which
verifies all the code is correct. We have made sure to add guides for
each of the components and one end to end guide that shows how the
entire thing runs.
- We have made sure that make format and make lint run clean.
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: skmishraoracle <shailendra.mishra@oracle.com>
Co-authored-by: hroyofc <harichandan.roy@oracle.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- [ ] **PR message**:
- **Description:** Refactored the lazy_load method to use asynchronous
execution for improved performance. The method now initiates scraping of
all URLs simultaneously using asyncio.gather, enhancing data fetching
efficiency. Each Document object is yielded immediately once its content
becomes available, streamlining the entire process.
- **Issue:** N/A
- **Dependencies:** Requires the asyncio library for handling
asynchronous tasks, which should already be part of standard Python
libraries in Python 3.7 and above.
- **Email:** [r73327118@gmail.com](mailto:r73327118@gmail.com)
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
As shown in #13749 , `RecursiveUrlLoader` has encoding issue. This PR is
to solve this.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
## Summary
I ran `ruff check --extend-select RUF100 -n` to identify `# noqa`
comments that weren't having any effect in Ruff, and then `ruff check
--extend-select RUF100 -n --fix` on select files to remove all of the
unnecessary `# noqa: F401` violations. It's possible that these were
needed at some point in the past, but they're not necessary in Ruff
v0.1.15 (used by LangChain) or in the latest release.
Co-authored-by: Erick Friis <erick@langchain.dev>
Issue: When the third-party package is not installed, whenever we need
to `pip install <package>` the ImportError is raised.
But sometimes, the `ValueError` or `ModuleNotFoundError` is raised. It
is bad for consistency.
Change: replaced the `ValueError` or `ModuleNotFoundError` with
`ImportError` when we raise an error with the `pip install <package>`
message.
Note: Ideally, we replace all `try: import... except... raise ... `with
helper functions like `import_aim` or just use the existing
[langchain_core.utils.utils.guard_import](https://api.python.langchain.com/en/latest/utils/langchain_core.utils.utils.guard_import.html#langchain_core.utils.utils.guard_import)
But it would be much bigger refactoring. @baskaryan Please, advice on
this.
Description: The PebbloSafeLoader should first check for owner,
full_path and size in metadata before implementing its own logic.
Dependencies: None
Documentation: NA.
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
**Description**:
_PebbloSafeLoader_: Add support for pebblo server and client version
**Documentation:** NA
**Unit test:** NA
**Issue:** NA
**Dependencies:** None
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
- [ ] **Kinetica Document Loader**: "community: a class to load
Documents from Kinetica"
- [ ] **Kinetica Document Loader**:
- **Description:** implemented KineticaLoader in `kinetica_loader.py`
- **Dependencies:** install the Kinetica API using `pip install
gpudb==7.2.0.1 `
Description: Add support for Semantic topics and entities.
Classification done by pebblo-server is not used to enhance metadata of
Documents loaded by document loaders.
Dependencies: None
Documentation: Updated.
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
**Description:**
The RecursiveUrlLoader loader offers a link_regex parameter that can
filter out URLs. However, this filtering capability is limited, and if
the internal links of the website change, unexpected resources may be
loaded. These resources, such as font files, can cause problems in
subsequent embedding processing.
>
https://blog.langchain.dev/assets/fonts/source-sans-pro-v21-latin-ext_latin-regular.woff2?v=0312715cbf
We can add the Content-Type in the HTTP response headers to the document
metadata so developers can choose which resources to use. This allows
developers to make their own choices.
For example, the following may be a good choice for text knowledge.
- text/plain - simple text file
- text/html - HTML web page
- text/xml - XML format file
- text/json - JSON format data
- application/pdf - PDF file
- application/msword - Word document
and ignore the following
- text/css - CSS stylesheet
- text/javascript - JavaScript script
- application/octet-stream - binary data
- image/jpeg - JPEG image
- image/png - PNG image
- image/gif - GIF image
- image/svg+xml - SVG image
- audio/mpeg - MPEG audio files
- video/mp4 - MP4 video file
- application/font-woff - WOFF font file
- application/font-ttf - TTF font file
- application/zip - ZIP compressed file
- application/octet-stream - binary data
**Twitter handle:** @coolbeevip
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
faster-whisper is a reimplementation of OpenAI's Whisper model using
CTranslate2, which is up to 4 times faster than enai/whisper for the
same accuracy while using less memory. The efficiency can be further
improved with 8-bit quantization on both CPU and GPU.
It can automatically detect the following 14 languages and transcribe
the text into their respective languages: en, zh, fr, de, ja, ko, ru,
es, th, it, pt, vi, ar, tr.
The gitbub repository for faster-whisper is :
https://github.com/SYSTRAN/faster-whisper
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
- **Description:** added the headless parameter as optional argument to
the langchain_community.document_loaders AsyncChromiumLoader class
- **Dependencies:** None
- **Twitter handle:** @perinim_98
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description:** currently, the `DirectoryLoader` progress-bar maximum value is based on an incorrect number of files to process
In langchain_community/document_loaders/directory.py:127:
```python
paths = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
items = [
path
for path in paths
if not (self.exclude and any(path.match(glob) for glob in self.exclude))
]
```
`paths` returns both files and directories. `items` is later used to determine the maximum value of the progress-bar which gives an incorrect progress indication.
Description: Add support for authorized identities in PebbloSafeLoader.
Now with this change, PebbloSafeLoader will extract
authorized_identities from metadata and send it to pebblo server
Dependencies: None
Documentation: None
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
From `langchain_community 0.0.30`, there's a bug that cannot send a
file-like object via `file` parameter instead of `file path` due to
casting the `file_path` to str type even if `file_path` is None.
which means that when I call the `partition_via_api()`, exactly one of
`filename` and `file` must be specified by the following error message.
however, from `langchain_community 0.0.30`, `file_path` is casted into
`str` type even `file_path` is None in `get_elements_from_api()` and got
an error at `exactly_one(filename=filename, file=file)`.
here's an error message
```
---> 51 exactly_one(filename=filename, file=file)
53 if metadata_filename and file_filename:
54 raise ValueError(
55 "Only one of metadata_filename and file_filename is specified. "
56 "metadata_filename is preferred. file_filename is marked for deprecation.",
57 )
File /opt/homebrew/lib/python3.11/site-packages/unstructured/partition/common.py:441, in exactly_one(**kwargs)
439 else:
440 message = f"{names[0]} must be specified."
--> 441 raise ValueError(message)
ValueError: Exactly one of filename and file must be specified.
```
So, I simply made a change that casting to str type when `file_path` is
not None.
I use `UnstructuredAPIFileLoader` like below.
```
from langchain_community.document_loaders.unstructured import UnstructuredAPIFileLoader
documents: list = UnstructuredAPIFileLoader(
file_path=None,
file=file, # file-like object, io.BytesIO type
mode='elements',
url='http://127.0.0.1:8000/general/v0/general',
content_type='application/pdf',
metadata_filename='asdf.pdf',
).load_and_split()
```
## Description:
The PR introduces 3 changes:
1. added `recursive` property to `O365BaseLoader`. (To keep the behavior
unchanged, by default is set to `False`). When `recursive=True`,
`_load_from_folder()` also recursively loads all nested folders.
2. added `folder_id` to SharePointLoader.(similar to (this
PR)[https://github.com/langchain-ai/langchain/pull/10780] ) This
provides an alternative to `folder_path` that doesn't seem to reliably
work.
3. when none of `document_ids`, `folder_id`, `folder_path` is provided,
the loader fetches documets from root folder. Combined with
`recursive=True` this provides an easy way of loading all compatible
documents from SharePoint.
The PR contains the same logic as [this stale
PR](https://github.com/langchain-ai/langchain/pull/10780) by
@WaleedAlfaris. I'd like to ask his blessing for moving forward with
this one.
## Issue:
- As described in https://github.com/langchain-ai/langchain/issues/19938
and https://github.com/langchain-ai/langchain/pull/10780 the sharepoint
loader often does not seem to work with folder_path.
- Recursive loading of subfolders is a missing functionality
## Dependecies: None
Twitter handle:
@martintriska1 @WRhetoric
This is my first PR here, please be gentle :-)
Please review @baskaryan
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Added the [FireCrawl](https://firecrawl.dev) document loader. Firecrawl
crawls and convert any website into LLM-ready data. It crawls all
accessible subpages and give you clean markdown for each.
- **Description:** Adds FireCrawl data loader
- **Dependencies:** firecrawl-py
- **Twitter handle:** @mendableai
ccing contributors: (@ericciarla @nickscamara)
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
This PR should make it easier for linters to do type checking and for IDEs to jump to definition of code.
See #20050 as a template for this PR.
- As a byproduct: Added 3 missed `test_imports`.
- Added missed `SolarChat` in to __init___.py Added it into test_import
ut.
- Added `# type: ignore` to fix linting. It is not clear, why linting
errors appear after ^ changes.
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Description: When multithreading is set to True and using the
DirectoryLoader, there was a bug that caused the return type to be a
double nested list. This resulted in other places upstream not being
able to utilize the from_documents method as it was no longer a
`List[Documents]` it was a `List[List[Documents]]`. The change made was
to just loop through the `future.result()` and yield every item.
Issue: #20093
Dependencies: N/A
Twitter handle: N/A
- **Description:** Bug fix. Removed extra line in `GCSDirectoryLoader`
to allow catching Exceptions. Now also logs the file path if Exception
is raised for easier debugging.
- **Issue:** #20198 Bug since langchain-community==0.0.31
- **Dependencies:** No change
- **Twitter handle:** timothywong731
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [ ] **PR title**: "community: deprecating integrations moved to
langchain_google_community"
- [ ] **PR message**: deprecating integrations moved to
langchain_google_community
---------
Co-authored-by: ccurme <chester.curme@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "community: added support for llmsherpa library"
- [x] **Add tests and docs**:
1. Integration test:
'docs/docs/integrations/document_loaders/test_llmsherpa.py'.
2. an example notebook:
`docs/docs/integrations/document_loaders/llmsherpa.ipynb`.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "community: Implement DirectoryLoader lazy_load
function"
- [x] **Description**: The `lazy_load` function of the `DirectoryLoader`
yields each document separately. If the given `loader_cls` of the
`DirectoryLoader` also implemented `lazy_load`, it will be used to yield
subdocuments of the file.
- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access:
`libs/community/tests/unit_tests/document_loaders/test_directory_loader.py`
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory:
`docs/docs/integrations/document_loaders/directory.ipynb`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>