## What this PR does?
### Currently `O365BaseLoader` (and consequently both derived loaders)
are limited to `pdf`, `doc`, `docx` files.
- **Solution: here we introduce _handlers_ attribute that allows for
custom handlers to be passed in. This is done in _dict_ form:**
**Example:**
```python
from langchain_community.document_loaders.parsers.documentloader_adapter import DocumentLoaderAsParser
# PR for DocumentLoaderAsParser here: https://github.com/langchain-ai/langchain/pull/27749
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
xlsx_parser = DocumentLoaderAsParser(UnstructuredExcelLoader, mode="paged")
# create dictionary mapping file types to handlers (parsers)
handlers = {
"doc": MsWordParser()
"pdf": PDFMinerParser()
"txt": TextParser()
"xlsx": xlsx_parser
}
loader = SharePointLoader(document_library_id="...",
handlers=handlers # pass handlers to SharePointLoader
)
documents = loader.load()
# works the same in OneDriveLoader
loader = OneDriveLoader(document_library_id="...",
handlers=handlers
)
```
This dictionary is then passed to `MimeTypeBasedParser` same as in the
[current
implementation](5a2cfb49e0/libs/community/langchain_community/document_loaders/parsers/registry.py (L13)).
### Currently `SharePointLoader` and `OneDriveLoader` are separate
loaders that both inherit from `O365BaseLoader`
However both of these implement the same functionality. The only
differences are:
- `SharePointLoader` requires argument `document_library_id` whereas
`OneDriveLoader` requires `drive_id`. These are just different names for
the same thing.
- `SharePointLoader` implements significantly more features.
- **Solution: `OneDriveLoader` is replaced with an empty shell just
renaming `drive_id` to `document_library_id` and inheriting from
`SharePointLoader`**
**Dependencies:** None
**Twitter handle:** @martintriska1
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
Thank you for contributing to LangChain!
- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "templates:
..." for template changes, "infra: ..." for CI changes.
- Example: "community: add foobar LLM"
- [ ] **description**
langchain_core.runnables.graph_mermaid.draw_mermaid_png calls this
function, but the Mermaid API returns JPEG by default. To be consistent,
add the option `file_type` with the default `png` type.
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
With this small change, I didn't add tests and docs.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more:
One long sentence was divided into two.
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
Now `encode_kwargs` used for both for documents and queries and this
leads to wrong embeddings. E. g.:
```python
model_kwargs = {"device": "cuda", "trust_remote_code": True}
encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}
model = HuggingFaceEmbeddings(
model_name="dunzhang/stella_en_400M_v5",
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)
query_embedding = np.array(
model.embed_query("What are some ways to reduce stress?",)
)
document_embedding = np.array(
model.embed_documents(
[
"There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
"Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]
)
)
print(model._client.similarity(query_embedding, document_embedding)) # output: tensor([[0.8421, 0.3317]], dtype=torch.float64)
```
But from the [model
card](https://huggingface.co/dunzhang/stella_en_400M_v5#sentence-transformers)
expexted like this:
```python
model_kwargs = {"device": "cuda", "trust_remote_code": True}
encode_kwargs = {"normalize_embeddings": False}
query_encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}
model = HuggingFaceEmbeddings(
model_name="dunzhang/stella_en_400M_v5",
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
query_encode_kwargs=query_encode_kwargs,
)
query_embedding = np.array(
model.embed_query("What are some ways to reduce stress?", )
)
document_embedding = np.array(
model.embed_documents(
[
"There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
"Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]
)
)
print(model._client.similarity(query_embedding, document_embedding)) # tensor([[0.8398, 0.2990]], dtype=torch.float64)
```
There was a change of attribute name which was "max_batch_size". It's
now "get_max_batch_size" method.
I want to use "create_batches" which is right down below.
Please check this PR link.
reference: https://github.com/chroma-core/chroma/pull/2305
---------
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Co-authored-by: Prithvi Kannan <46332835+prithvikannan@users.noreply.github.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Jun Yamog <jkyamog@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: ono-hiroki <86904208+ono-hiroki@users.noreply.github.com>
Co-authored-by: Dobiichi-Origami <56953648+Dobiichi-Origami@users.noreply.github.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Co-authored-by: Duy Huynh <vndee.huynh@gmail.com>
Co-authored-by: Rashmi Pawar <168514198+raspawar@users.noreply.github.com>
Co-authored-by: sifatj <26035630+sifatj@users.noreply.github.com>
Co-authored-by: Eric Pinzur <2641606+epinzur@users.noreply.github.com>
Co-authored-by: Daniel Vu Dao <danielvdao@users.noreply.github.com>
Co-authored-by: Ofer Mendelevitch <ofermend@gmail.com>
Co-authored-by: Stéphane Philippart <wildagsx@gmail.com>
- **Description:** change to do the batch embedding server side and not
client side
- **Twitter handle:** @wildagsx
---------
Co-authored-by: ccurme <chester.curme@gmail.com>
Description:
This fixes an issue that mistakenly created in
https://github.com/langchain-ai/langchain/pull/27253. The issue
currently exists only in `langchain-community==0.3.4`.
Test cases were added to prevent this issue in the future.
Co-authored-by: Erick Friis <erick@langchain.dev>
### Description:
This PR sets a default value of `output_token_limit = 4000` for the
`PowerBIToolkit` to fix the unintentionally validation error.
### Problem:
When attempting to run a code snippet from [Langchain's PowerBI toolkit
documentation](https://python.langchain.com/v0.1/docs/integrations/toolkits/powerbi/)
to interact with a `PowerBIDataset`, the following error occurs:
```
pydantic.v1.error_wrappers.ValidationError: 1 validation error for QueryPowerBITool
output_token_limit
none is not an allowed value (type=type_error.none.not_allowed)
```
### Root Cause:
The issue arises because when creating a `QueryPowerBITool`, the
`output_token_limit` parameter is unintentionally set to `None`, which
is the current default for `PowerBIToolkit`. However, `QueryPowerBITool`
expects a default value of `4000` for `output_token_limit`. This
unintended override causes the error.
17659ca2cd/libs/community/langchain_community/agent_toolkits/powerbi/toolkit.py (L63)17659ca2cd/libs/community/langchain_community/agent_toolkits/powerbi/toolkit.py (L72-L79)17659ca2cd/libs/community/langchain_community/tools/powerbi/tool.py (L39)
### Solution:
To resolve this, the default value of `output_token_limit` is now
explicitly set to `4000` in `PowerBIToolkit` to prevent the accidental
assignment of `None`.
Co-authored-by: ccurme <chester.curme@gmail.com>
- **Description:**
Currently CommaSeparatedListOutputParser can't handle strings that may
contain commas within a column. It would parse any commas as the
delimiter.
Ex.
"foo, foo2", "bar", "baz"
It will create 4 columns: "foo", "foo2", "bar", "baz"
This should be 3 columns:
"foo, foo2", "bar", "baz"
- **Dependencies:**
Added 2 additional imports, but they are built in python packages.
import csv
from io import StringIO
- **Twitter handle:** @jkyamog
- [ ] **Add tests and docs**:
1. added simple unit test test_multiple_items_with_comma
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "core: use friendlier names for duplicated nodes in
mermaid output"
- **Description:** When generating the Mermaid visualization of a chain,
if the chain had multiple nodes of the same type, the reid function
would replace their names with the UUID node_id. This made the generated
graph difficult to understand. This change deduplicates the nodes in a
chain by appending an index to their names.
- **Issue:** None
- **Discussion:**
https://github.com/langchain-ai/langchain/discussions/27714
- **Dependencies:** None
- [ ] **Add tests and docs**:
- Currently this functionality is not covered by unit tests, happy to
add tests if you'd like
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
# Example Code:
```python
from langchain_core.runnables import RunnablePassthrough
def fake_llm(prompt: str) -> str: # Fake LLM for the example
return "completion"
runnable = {
'llm1': fake_llm,
'llm2': fake_llm,
} | RunnablePassthrough.assign(
total_chars=lambda inputs: len(inputs['llm1'] + inputs['llm2'])
)
print(runnable.get_graph().draw_mermaid(with_styles=False))
```
# Before
```mermaid
graph TD;
Parallel_llm1_llm2_Input --> 0b01139db5ed4587ad37964e3a40c0ec;
0b01139db5ed4587ad37964e3a40c0ec --> Parallel_llm1_llm2_Output;
Parallel_llm1_llm2_Input --> a98d4b56bd294156a651230b9293347f;
a98d4b56bd294156a651230b9293347f --> Parallel_llm1_llm2_Output;
Parallel_total_chars_Input --> Lambda;
Lambda --> Parallel_total_chars_Output;
Parallel_total_chars_Input --> Passthrough;
Passthrough --> Parallel_total_chars_Output;
Parallel_llm1_llm2_Output --> Parallel_total_chars_Input;
```
# After
```mermaid
graph TD;
Parallel_llm1_llm2_Input --> fake_llm_1;
fake_llm_1 --> Parallel_llm1_llm2_Output;
Parallel_llm1_llm2_Input --> fake_llm_2;
fake_llm_2 --> Parallel_llm1_llm2_Output;
Parallel_total_chars_Input --> Lambda;
Lambda --> Parallel_total_chars_Output;
Parallel_total_chars_Input --> Passthrough;
Passthrough --> Parallel_total_chars_Output;
Parallel_llm1_llm2_Output --> Parallel_total_chars_Input;
```
PR title: “langchain: add batch request support for text-embedding-v3
model”
PR message:
• Description: This PR introduces batch request support for the
text-embedding-v3 model within LangChain. The new functionality allows
users to process multiple text inputs in a single request, improving
efficiency and performance for high-volume applications.
• Issue: This PR addresses #<issue_number> (if applicable).
• Dependencies: No new external dependencies are required for this
change.
• Twitter handle: If announced on Twitter, please mention me at
@yourhandle.
Add tests and docs:
1. Added unit tests to cover the batch request functionality, ensuring
it operates without requiring network access.
2. Included an example notebook demonstrating the batch request feature,
located in docs/docs/integrations.
Lint and test: All required formatting and linting checks have been
performed using make format and make lint. The changes have been
verified with make test to ensure compatibility.
Additional notes:
• The changes are fully backwards compatible.
• No modifications were made to pyproject.toml, ensuring no new
dependencies were added.
• The update only affects the langchain package and does not involve
other packages.
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
…ING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning}
{category: DEPRECATION} {title: This feature is deprecated and will be
removed in future versions.} {description: CALL subquery without a
variable scope clause is now deprecated." this warning
Thank you for contributing to LangChain!
- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "templates:
..." for template changes, "infra: ..." for CI changes.
- Example: "community: add foobar LLM"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** a description of the change
- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
Co-authored-by: putao520 <putao520@putao282.com>
The metadata["source"] value for the web paths was being set to
temporary path (/tmp).
Fixed it by creating a new variable self.original_file_path, which will
store the original path.
Thank you for contributing to LangChain!
- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "templates:
..." for template changes, "infra: ..." for CI changes.
- Example: "community: add foobar LLM"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** a description of the change
- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
I will keep this PR as small as the changes made.
**Description:** fixes a fatal bug syntax error in
AzureCosmosDBNoSqlVectorSearch
**Issue:** #27269#25468
**Description:**
Update the wrapper to support the Polygon API if not you get an error. I
keeped `STOCKBUSINESS` for retro-compatbility with older endpoints /
other uses
Old Code:
```
if status not in ("OK", "STOCKBUSINESS"):
raise ValueError(f"API Error: {data}")
```
API Respond:
```
API Error: {'results': {'P': 0.22, 'S': 0, 'T': 'ZOM', 'X': 5, 'p': 0.123, 'q': 0, 's': 200, 't': 1729614422813395456, 'x': 1, 'z': 1}, 'status': 'STOCKSBUSINESS', 'request_id': 'XXXXXX'}
```
- **Issue:** N/A Polygon API update
- **Dependencies:** N/A
- **Twitter handle:** @wgcv
---------
Co-authored-by: ccurme <chester.curme@gmail.com>
- **Description:** add missing tool_calls kwargs of delta message in
openai adapter, then tool call will work correctly via adapter's stream
chat completion
- **Issue:** Fixes
https://github.com/langchain-ai/langchain/issues/25436
- **Dependencies:** None
- **Description:** Change `MoonshotCommon.client` type from
`_MoonshotClient` to `Any`.
- **Issue:** Fix the issue #27058
- **Dependencies:** No
- **Twitter handle:** TaoWang2218
In PR #17100, the implementation for Moonshot was added, which defined
two classes:
- `MoonshotChat(MoonshotCommon, ChatOpenAI)` in
`langchain_community.chat_models.moonshot`;
- Here, `validate_environment()` assigns **client** as
`openai.OpenAI().chat.completions`
- Note that **client** here is actually a member variable defined in
`ChatOpenAI`;
- `MoonshotCommon` in `langchain_community.llms.moonshot`;
- And here, `validate_environment()` assigns **_client** as
`_MoonshotClient`;
- Note that this is the underscored **_client**, which is defined within
`MoonshotCommon` itself;
At this time, there was no conflict between the two, one being `client`
and the other `_client`.
However, in PR #25878 which fixed#24390, `_client` in `MoonshotCommon`
was changed to `client`. Since then, a conflict in the definition of
`client` has arisen between `MoonshotCommon` and `MoonshotChat`, which
caused `pydantic` validation error.
To fix this issue, the type of `client` in `MoonshotCommon` should be
changed to `Any`.
Signed-off-by: Tao Wang <twang2218@gmail.com>
Thank you for contributing to LangChain!
- **Description:** Adding an empty metadata field when metadata is not
present in the data
- **Issue:** This PR fixes the issue when the data items doesn't contain
the metadata field. This happens when there is already data in the
container, or cx uses CosmosDB Python SDK to insert data.
- **Dependencies:** No dependencies required
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
- [x] **PR title**:
"community: OCI Generative AI tool calling bug fix
- [x] **PR message**:
- **Description:** bug fix for streaming chat responses with tool calls.
Update to PR 24693
- **Issue:** chat response content is repeated when streaming
- **Dependencies:** NA
- **Twitter handle:** NA
- [x] **Add tests and docs**: NA
- [x] **Lint and test**: make format, make lint and make test we run
successfully
---------
Co-authored-by: Arthur Cheng <arthur.cheng@oracle.com>
Co-authored-by: Erick Friis <erick@langchain.dev>