- **Description:** Fixed `AzureSearchVectorStoreRetriever` to account
for search_kwargs. More explanation is in the mentioned issue.
- **Issue:** #21492
---------
Co-authored-by: MAC <mac@MACs-MacBook-Pro.local>
Co-authored-by: Massimiliano Pronesti <massimiliano.pronesti@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description: This change adds args_schema (pydantic BaseModel) to
WikipediaQueryRun for correct schema formatting on LLM function calls
Issue: currently using WikipediaQueryRun with OpenAI function calling
returns the following error "TypeError: WikipediaQueryRun._run() got an
unexpected keyword argument '__arg1' ". This happens because the schema
sent to the LLM is "input: '{"__arg1":"Hunter x Hunter"}'" while the
method should be called with the "query" parameter.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Added [Scrapfly](https://scrapfly.io/) Web Loader integration. Scrapfly
is a web scraping API that allows extracting web page data into
accessible markdown or text datasets.
- __Description__: Added Scrapfly web loader for retrieving web page
data as markdown or text.
- Dependencies: scrapfly-sdk
- Twitter: @thealchemi1st
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
- Example: "community: add foobar LLM"
- [x] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** Updates Meilisearch vectorstore for compatibility
with v1.8. Adds [”showRankingScore”:
true”](https://www.meilisearch.com/docs/reference/api/search#ranking-score)
in the search parameters and replaces `_semanticScore` field with `
_rankingScore`
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
**Description:**
- Extend AzureSearch with `maximal_marginal_relevance` (for vector and
hybrid search)
- Add construction `from_embeddings` - if the user has already embedded
the texts
- Add `add_embeddings`
- Refactor common parts (`_simple_search`, `_results_to_documents`,
`_reorder_results_with_maximal_marginal_relevance`)
- Add `vector_search_dimensions` as a parameter to the constructor to
avoid extra calls to `embed_query` (most of the time the user applies
the same model and knows the dimension)
**Issue:** none
**Dependencies:** none
- [x] **Add tests and docs**: The docstrings have been added to the new
functions, and unified for the existing ones. The example notebook is
great in illustrating the main usage of AzureSearch, adding the new
methods would only dilute the main content.
- [x] **Lint and test**
---------
Co-authored-by: Oleksii Pokotylo <oleksii.pokotylo@pwc.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description:** Backwards compatible extension of the initialisation
interface of HanaDB to allow the user to specify
specific_metadata_columns that are used for metadata storage of selected
keys which yields increased filter performance. Any not-mentioned
metadata remains in the general metadata column as part of a JSON
string. Furthermore switched to executemany for batch inserts into
HanaDB.
**Issue:** N/A
**Dependencies:** no new dependencies added
**Twitter handle:** @sapopensource
---------
Co-authored-by: Martin Kolb <martin.kolb@sap.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Integrate RankLLM reranker (https://github.com/castorini/rank_llm) into
LangChain
An example notebook is given in
`docs/docs/integrations/retrievers/rankllm-reranker.ipynb`
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- **Bug code**: In
langchain_community/document_loaders/csv_loader.py:100
- **Description**: currently, when 'CSVLoader' reads the column as None
in the 'csv' file, it will report an error because the 'CSVLoader' does
not verify whether the column is of str type and does not consider how
to handle the corresponding 'row_data' when the column is' None 'in the
csv. This pr provides a solution.
- **Issue:** Fix#20699
- **thinking:**
1. Refer to the processing method for
'langchain_community/document_loaders/csv_loader.py:100' when **'v'**
equals'None', and apply the same method to '**k**'.
(Reference`csv.DictReader` ,**'k'** will only be None when `
len(columns) < len(number_row_data)` is established)
2. **‘k’** equals None only holds when it is the last column, and its
corresponding **'v'** type is a list. Therefore, I referred to the data
format in 'Document' and used ',' to concatenated the elements in the
list.(But I'm not sure if you accept this form, if you have any other
ideas, communicate)
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
## Description
The existing public interface for `langchain_community.emeddings` is
broken. In this file, `__all__` is statically defined, but is
subsequently overwritten with a dynamic expression, which type checkers
like pyright do not support. pyright actually gives the following
diagnostic on the line I am requesting we remove:
[reportUnsupportedDunderAll](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportUnsupportedDunderAll):
```
Operation on "__all__" is not supported, so exported symbol list may be incorrect
```
Currently, I get the following errors when attempting to use publicablly
exported classes in `langchain_community.emeddings`:
```python
import langchain_community.embeddings
langchain_community.embeddings.HuggingFaceEmbeddings(...) # error: "HuggingFaceEmbeddings" is not exported from module "langchain_community.embeddings" (reportPrivateImportUsage)
```
This is solved easily by removing the dynamic expression.
Thank you for contributing to LangChain!
- [X] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
- Example: "community: add foobar LLM"
**Description:**
Fix ChatDatabricsk in case that streaming response doesn't have role
field in delta chunk
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
## 'raise_for_status' parameter of WebBaseLoader works in sync load but
not in async load.
In webBaseLoader:
Sync load is calling `_scrape` and has `raise_for_status` properly
handled.
```
def _scrape(
self,
url: str,
parser: Union[str, None] = None,
bs_kwargs: Optional[dict] = None,
) -> Any:
from bs4 import BeautifulSoup
if parser is None:
if url.endswith(".xml"):
parser = "xml"
else:
parser = self.default_parser
self._check_parser(parser)
html_doc = self.session.get(url, **self.requests_kwargs)
if self.raise_for_status:
html_doc.raise_for_status()
if self.encoding is not None:
html_doc.encoding = self.encoding
elif self.autoset_encoding:
html_doc.encoding = html_doc.apparent_encoding
return BeautifulSoup(html_doc.text, parser, **(bs_kwargs or {}))
```
Async load is calling `_fetch` but missing `raise_for_status` logic.
```
async def _fetch(
self, url: str, retries: int = 3, cooldown: int = 2, backoff: float = 1.5
) -> str:
async with aiohttp.ClientSession() as session:
for i in range(retries):
try:
async with session.get(
url,
headers=self.session.headers,
ssl=None if self.session.verify else False,
cookies=self.session.cookies.get_dict(),
) as response:
return await response.text()
```
Co-authored-by: kefan.you <darkfss@sina.com>
- **Description:** Tongyi uses different client for chat model and
vision model. This PR chooses proper client based on model name to
support both chat model and vision model. Reference [tongyi
document](https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-vl-plus-api?spm=a2c4g.11186623.0.0.27404c9a7upm11)
for details.
```
from langchain_core.messages import HumanMessage
from langchain_community.chat_models import ChatTongyi
llm = ChatTongyi(model_name='qwen-vl-max')
image_message = {
"image": "https://lilianweng.github.io/posts/2023-06-23-agent/agent-overview.png"
}
text_message = {
"text": "summarize this picture",
}
message = HumanMessage(content=[text_message, image_message])
llm.invoke([message])
```
- **Issue:** None
- **Dependencies:** None
- **Twitter handle:** None
We add a tool and retriever for the [AskNews](https://asknews.app)
platform with example notebooks.
The retriever can be invoked with:
```py
from langchain_community.retrievers import AskNewsRetriever
retriever = AskNewsRetriever(k=3)
retriever.invoke("impact of fed policy on the tech sector")
```
To retrieve 3 documents in then news related to fed policy impacts on
the tech sector. The included notebook also includes deeper details
about controlling filters such as category and time, as well as
including the retriever in a chain.
The tool is quite interesting, as it allows the agent to decide how to
obtain the news by forming a query and deciding how far back in time to
look for the news:
```py
from langchain_community.tools.asknews import AskNewsSearch
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
tool = AskNewsSearch()
instructions = """You are an assistant."""
base_prompt = hub.pull("langchain-ai/openai-functions-template")
prompt = base_prompt.partial(instructions=instructions)
llm = ChatOpenAI(temperature=0)
asknews_tool = AskNewsSearch()
tools = [asknews_tool]
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
agent=agent,
tools=tools,
verbose=True,
)
agent_executor.invoke({"input": "How is the tech sector being affected by fed policy?"})
```
---------
Co-authored-by: Emre <e@emre.pm>
Please let me know if you see any possible areas of improvement. I would
very much appreciate your constructive criticism if time allows.
**Description:**
- Added a aerospike vector store integration that utilizes
[Aerospike-Vector-Search](https://aerospike.com/products/vector-database-search-llm/)
add-on.
- Added both unit tests and integration tests
- Added a docker compose file for spinning up a test environment
- Added a notebook
**Dependencies:** any dependencies required for this change
- aerospike-vector-search
**Twitter handle:**
- No twitter, you can use my GitHub handle or LinkedIn if you'd like
Thanks!
---------
Co-authored-by: Jesse Schumacher <jschumacher@aerospike.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Closes#20561
This PR fixes MLX LLM stream `AttributeError`.
Recently, `mlx-lm` changed the token decoding logic, which affected the
LC+MLX integration.
Additionally, I made minor fixes such as: docs example broken link and
enforcing pipeline arguments (max_tokens, temp and etc) for invoke.
- **Issue:** #20561
- **Twitter handle:** @Prince_Canuma
Related to #20085
@baskaryan
Thank you for contributing to LangChain!
community:sparkllm[patch]: standardized init args
updated `spark_api_key` so that aliased to `api_key`. Added integration
test for `sparkllm` to test that it continues to set the same underlying
attribute.
updated temperature with Pydantic Field, added to the integration test.
Ran `make format`,`make test`, `make lint`, `make spell_check`
UpTrain has a new dashboard now that makes it easier to view projects
and evaluations. Using this requires specifying both project_name and
evaluation_name when performing evaluations. I have updated the code to
support it.
# Add pricing and max context window for GPT-4o
- community: add cost per 1k tokens and max context window
- partners: add max context window
**Description:** adds static information about GPT-4o based on
https://openai.com/api/pricing/ and
https://platform.openai.com/docs/models/gpt-4o so that GPT-4o reporting
is accurate.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "community: enable SupabaseVectorStore to support
extended table fields"
- [x] **PR message**:
- Added extension fields to the function _add_vectors so that users can
add other custom fields when insert a record into the database. eg:
![image](https://github.com/langchain-ai/langchain/assets/10885578/e1d5ca20-936e-4cab-ba69-8fdd23b8ce8f)
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- **Description:** In the aleph alpha client the paramater `normalize`
is *not* optional. Setting this to `None` gives an error.
- **Dependencies:** None
Co-authored-by: Jens Lücke <jens.luecke@tngtech.com>
Co-authored-by: Jens <jens.luecke@hu-berlin.de>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
While integrating the xinference_embedding, we observed that the
downloaded dependency package is quite substantial in size. With a focus
on resource optimization and efficiency, if the project requirements are
limited to its vector processing capabilities, we recommend migrating to
the xinference_client package. This package is more streamlined,
significantly reducing the storage space requirements of the project and
maintaining a feature focus, making it particularly suitable for
scenarios that demand lightweight integration. Such an approach not only
boosts deployment efficiency but also enhances the application's
maintainability, rendering it an optimal choice for our current context.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description:** Add `Origin/langchain` to Apify's client's user-agent
to attribute API activity to LangChain (at Apify, we aim to monitor our
integrations to evaluate whether we should invest more in the LangChain
integration regarding functionality and content)
**Issue:** None
**Dependencies:** None
**Twitter handle:** None
- **Code:** langchain_community/embeddings/baichuan.py:82
- **Description:** When I make an error using 'baichuan embeddings', the
printed error message is wrapped (there is actually no need to wrap)
```python
# example
from langchain_community.embeddings import BaichuanTextEmbeddings
# error key
BAICHUAN_API_KEY = "sk-xxxxxxxxxxxxx"
embeddings = BaichuanTextEmbeddings(baichuan_api_key=BAICHUAN_API_KEY)
text_1 = "今天天气不错"
query_result = embeddings.embed_query(text_1)
```
![unintended
newline](https://github.com/langchain-ai/langchain/assets/55082429/e1178ce8-62bb-405d-a4af-e3b28eabc158)
This PR improves on the `CassandraCache` and `CassandraSemanticCache`
classes, mainly in the constructor signature, and also introduces
several minor improvements around these classes.
### Init signature
A (sigh) breaking change is tentatively introduced to the constructor.
To me, the advantages outweigh the possible discomfort: the new syntax
places the DB-connection objects `session` and `keyspace` later in the
param list, so that they can be given a default value. This is what
enables the pattern of _not_ specifying them, provided one has
previously initialized the Cassandra connection through the versatile
utility method `cassio.init(...)`.
In this way, a much less unwieldy instantiation can be done, such as
`CassandraCache()` and `CassandraSemanticCache(embedding=xyz)`,
everything else falling back to defaults.
A downside is that, compared to the earlier signature, this might turn
out to be breaking for those doing positional instantiation. As a way to
mitigate this problem, this PR typechecks its first argument trying to
detect the legacy usage.
(And to make this point less tricky in the future, most arguments are
left to be keyword-only).
If this is considered too harsh, I'd like guidance on how to further
smoothen this transition. **Our plan is to make the pattern of optional
session/keyspace a standard across all Cassandra classes**, so that a
repeatable strategy would be ideal. A possibility would be to keep
positional arguments for legacy reasons but issue a deprecation warning
if any of them is actually used, to later remove them with 0.2 - please
advise on this point.
### Other changes
- class docstrings: enriched, completely moved to class level, added
note on `cassio.init(...)` pattern, added tiny sample usage code.
- semantic cache: revised terminology to never mention "distance" (it is
in fact a similarity!). Kept the legacy constructor param with a
deprecation warning if used.
- `llm_caching` notebook: uniform flow with the Cassandra and Astra DB
separate cases; better and Cassandra-first description; all imports made
explicit and from community where appropriate.
- cache integration tests moved to community (incl. the imported tools),
env var bugfix for `CASSANDRA_CONTACT_POINTS`.
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
## Patch Summary
community:openai[patch]: standardize init args
## Details
I made changes to the OpenAI Chat API wrapper test in the Langchain
open-source repository
- **File**: `libs/community/tests/unit_tests/chat_models/test_openai.py`
- **Changes**:
- Updated `max_retries` with Pydantic Field
- Updated the corresponding unit test
- **Related Issues**: #20085
- Updated max_retries with Pydantic Field, updated the unit test.
---------
Co-authored-by: JuHyung Son <sonju0427@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "community: updated Browserbase loader"
- [x] **PR message**:
Updates the Browserbase loader with more options and improved docs.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
0.2 is not a breaking release for core (but it is for langchain and
community)
To keep the core+langchain+community packages in sync at 0.2, we will
relax deps throughout the ecosystem to tolerate `langchain-core` 0.2
## Description
This PR introduces the new `langchain-qdrant` partner package, intending
to deprecate the community package.
## Changes
- Moved the Qdrant vector store implementation `/libs/partners/qdrant`
with integration tests.
- The conditional imports of the client library are now regular with
minor implementation improvements.
- Added a deprecation warning to
`langchain_community.vectorstores.qdrant.Qdrant`.
- Replaced references/imports from `langchain_community` with either
`langchain_core` or by moving the definitions to the `langchain_qdrant`
package itself.
- Updated the Qdrant vector store documentation to reflect the changes.
## Testing
- `QDRANT_URL` and
[`QDRANT_API_KEY`](583e36bf6b)
env values need to be set to [run integration
tests](d608c93d1f)
in the [cloud](https://cloud.qdrant.tech).
- If a Qdrant instance is running at `http://localhost:6333`, the
integration tests will use it too.
- By default, tests use an
[`in-memory`](https://github.com/qdrant/qdrant-client?tab=readme-ov-file#local-mode)
instance(Not comprehensive).
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
This PR makes some small updates for `KuzuQAChain` for graph QA.
- Updated Cypher generation prompt (we now support `WHERE EXISTS`) and
generalize it more
- Support different LLMs for Cypher generation and QA
- Update docs and examples
First Pr for the langchain_huggingface partner Package
- Moved some of the hugging face related class from `community` to the
new `partner package`
Still needed :
- Documentation
- Tests
- Support for the new apply_chat_template in `ChatHuggingFace`
- Confirm choice of class to support for embeddings witht he
sentence-transformer team.
cc : @efriis
---------
Co-authored-by: Cyril Kondratenko <kkn1993@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Thank you for contributing to LangChain!
- [X] **PR title**: "community: Add source metadata to bedrock retriever
response"
- [X] **PR message**:
- **Description:** Bedrock retrieve API returns extra metadata in the
response which is currently not returned in the retriever response
- **Issue:** The change adds the metadata from bedrock retrieve API
response to the bedrock retriever in a backward compatible way. Renamed
metadata to sourceMetadata as metadata term is being used in the
Document already. This is in sync with what we are doing in llama-index
as well.
- **Dependencies:** No
- [X] **Add tests and docs**:
1. Added unit tests
2. Notebook already exists and does not need any change
3. Response from end to end testing, just to ensure backward
compatibility: `[Document(page_content='Exoplanets.',
metadata={'location': {'s3Location': {'uri':
's3://bucket/file_name.txt'}, 'type': 'S3'}, 'score': 0.46886647,
'source_metadata': {'x-amz-bedrock-kb-source-uri':
's3://bucket/file_name.txt', 'tag': 'space', 'team': 'Nasa', 'year':
1946.0}})]`
- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: Piyush Jain <piyushjain@duck.com>
**Description:** Added a few additional arguments to the whisper parser,
which can be consumed by the underlying API.
The prompt is especially important to fine-tune transcriptions.
---------
Co-authored-by: Roi Perlman <roi@fivesigmalabs.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description: Adds NeuralDBClientVectorStore to the langchain, which is
our enterprise client.
---------
Co-authored-by: kartikTAI <129414343+kartikTAI@users.noreply.github.com>
Co-authored-by: Kartik Sarangmath <kartik@thirdai.com>
**Description:**
This PR introduces chunking logic to the `DeepInfraEmbeddings` class to
handle large batch sizes without exceeding maximum batch size of the
backend. This enhancement ensures that embedding generation processes
large batches by breaking them down into smaller, manageable chunks,
each conforming to the maximum batch size limit.
**Issue:**
Fixes#21189
**Dependencies:**
No new dependencies introduced.
- Added new document_transformer: MarkdonifyTransformer, that uses
`markdonify` package with customizable options to convert HTML to
Markdown. It's similar to Html2TextTransformer, but has more flexible
options and also I've noticed that sometimes MarkdownifyTransformer
performs better than html2text one, so that's why I use markdownify on
my project.
- Added docs and tests
- Usage:
```python
from langchain_community.document_transformers import MarkdownifyTransformer
markdownify = MarkdownifyTransformer()
docs_transform = markdownify.transform_documents(docs)
```
- Example of better performance on simple task, that I've noticed:
```
<html>
<head><title>Reports on product movement</title></head>
<body>
<p data-block-key="2wst7">The reports on product movement will be useful for forming supplier orders and controlling outcomes.</p>
</body>
```
**Html2TextTransformer**:
```python
[Document(page_content='The reports on product movement will be useful for forming supplier orders and\ncontrolling outcomes.\n\n')]
# Here we can see 'and\ncontrolling', which has extra '\n' in it
```
**MarkdownifyTranformer**:
```python
[Document(page_content='Reports on product movement\n\nThe reports on product movement will be useful for forming supplier orders and controlling outcomes.')]
```
---------
Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.bbrouter>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.local>
Co-authored-by: Sokolov Fedor <f.sokolov@192.168.1.6>
### GPT4AllEmbeddings parameters
---
**Description:**
As of right now the **Embed4All** class inside _GPT4AllEmbeddings_ is
instantiated as it's default which leaves no room to customize the
chosen model and it's behavior. Thus:
- GPT4AllEmbeddings can now be instantiated with custom parameters like
a different model that shall be used.
---------
Co-authored-by: AlexJauchWalser <alexander.jauch-walser@knime.com>
The `_amake_session()` method does not allow modifying the
`self.session_factory` with
anything other than `async_sessionmaker`. This prohibits advanced uses
of `index()`.
In a RAG architecture, it is necessary to import document chunks.
To keep track of the links between chunks and documents, we can use the
`index()` API.
This API proposes to use an SQL-type record manager.
In a classic use case, using `SQLRecordManager` and a vector database,
it is impossible
to guarantee the consistency of the import. Indeed, if a crash occurs
during the import
(problem with the network, ...)
there is an inconsistency between the SQL database and the vector
database.
With the
[PR](https://github.com/langchain-ai/langchain-postgres/pull/32) we are
proposing for `langchain-postgres`,
it is now possible to guarantee the consistency of the import of chunks
into
a vector database. It's possible only if the outer session is built
with the connection.
```python
def main():
db_url = "postgresql+psycopg://postgres:password_postgres@localhost:5432/"
engine = create_engine(db_url, echo=True)
embeddings = FakeEmbeddings()
pgvector:VectorStore = PGVector(
embeddings=embeddings,
connection=engine,
)
record_manager = SQLRecordManager(
namespace="namespace",
engine=engine,
)
record_manager.create_schema()
with engine.connect() as connection:
session_maker = scoped_session(sessionmaker(bind=connection))
# NOTE: Update session_factories
record_manager.session_factory = session_maker
pgvector.session_maker = session_maker
with connection.begin():
loader = CSVLoader(
"data/faq/faq.csv",
source_column="source",
autodetect_encoding=True,
)
result = index(
source_id_key="source",
docs_source=loader.load()[:1],
cleanup="incremental",
vector_store=pgvector,
record_manager=record_manager,
)
print(result)
```
The same thing is possible asynchronously, but a bug in
`sql_record_manager.py`
in `_amake_session()` must first be fixed.
```python
async def _amake_session(self) -> AsyncGenerator[AsyncSession, None]:
"""Create a session and close it after use."""
# FIXME: REMOVE if not isinstance(self.session_factory, async_sessionmaker):~~
if not isinstance(self.engine, AsyncEngine):
raise AssertionError("This method is not supported for sync engines.")
async with self.session_factory() as session:
yield session
```
Then, it is possible to do the same thing asynchronously:
```python
async def main():
db_url = "postgresql+psycopg://postgres:password_postgres@localhost:5432/"
engine = create_async_engine(db_url, echo=True)
embeddings = FakeEmbeddings()
pgvector:VectorStore = PGVector(
embeddings=embeddings,
connection=engine,
)
record_manager = SQLRecordManager(
namespace="namespace",
engine=engine,
async_mode=True,
)
await record_manager.acreate_schema()
async with engine.connect() as connection:
session_maker = async_scoped_session(
async_sessionmaker(bind=connection),
scopefunc=current_task)
record_manager.session_factory = session_maker
pgvector.session_maker = session_maker
async with connection.begin():
loader = CSVLoader(
"data/faq/faq.csv",
source_column="source",
autodetect_encoding=True,
)
result = await aindex(
source_id_key="source",
docs_source=loader.load()[:1],
cleanup="incremental",
vector_store=pgvector,
record_manager=record_manager,
)
print(result)
asyncio.run(main())
```
---------
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Sean <sean@upstage.ai>
Co-authored-by: JuHyung-Son <sonju0427@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: YISH <mokeyish@hotmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Jason_Chen <820542443@qq.com>
Co-authored-by: Joan Fontanals <joan.fontanals.martinez@jina.ai>
Co-authored-by: Pavlo Paliychuk <pavlo.paliychuk.ca@gmail.com>
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: samanhappy <samanhappy@gmail.com>
Co-authored-by: Lei Zhang <zhanglei@apache.org>
Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com>
Co-authored-by: merdan <48309329+merdan-9@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Andres Algaba <andresalgaba@gmail.com>
Co-authored-by: davidefantiniIntel <115252273+davidefantiniIntel@users.noreply.github.com>
Co-authored-by: Jingpan Xiong <71321890+klaus-xiong@users.noreply.github.com>
Co-authored-by: kaka <kaka@zbyte-inc.cloud>
Co-authored-by: jingsi <jingsi@leadincloud.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Rahul Triptahi <rahul.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Shengsheng Huang <shannie.huang@gmail.com>
Co-authored-by: Michael Schock <mjschock@users.noreply.github.com>
Co-authored-by: Anish Chakraborty <anish749@users.noreply.github.com>
Co-authored-by: am-kinetica <85610855+am-kinetica@users.noreply.github.com>
Co-authored-by: Dristy Srivastava <58721149+dristysrivastava@users.noreply.github.com>
Co-authored-by: Matt <matthew.gotteiner@microsoft.com>
Co-authored-by: William FH <13333726+hinthornw@users.noreply.github.com>
- **Description:** Fix import class name exporeted from
'playwright.async_api' and 'playwright.sync_api' to match the correct
name in playwright tool. Change import from inline guard_import to
helper function that calls guard_import to make code more readable in
gmail tool. Upgrade playwright version to 1.43.0
- **Issue:** #21354
- **Dependencies:** upgrade playwright version(this is not required for
the bugfix itself, just trying to keep dependencies fresh. I can remove
the playwright version upgrade if you want.)
Thank you for contributing to LangChain!
- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
- Example: "community: add foobar LLM"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** a description of the change
- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
0.2rc
migrations
- [x] Move memory
- [x] Move remaining retrievers
- [x] graph_qa chains
- [x] some dependency from evaluation code potentially on math utils
- [x] Move openapi chain from `langchain.chains.api.openapi` to
`langchain_community.chains.openapi`
- [x] Migrate `langchain.chains.ernie_functions` to
`langchain_community.chains.ernie_functions`
- [x] migrate `langchain/chains/llm_requests.py` to
`langchain_community.chains.llm_requests`
- [x] Moving `langchain_community.cross_enoders.base:BaseCrossEncoder`
->
`langchain_community.retrievers.document_compressors.cross_encoder:BaseCrossEncoder`
(namespace not ideal, but it needs to be moved to `langchain` to avoid
circular deps)
- [x] unit tests langchain -- add pytest.mark.community to some unit
tests that will stay in langchain
- [x] unit tests community -- move unit tests that depend on community
to community
- [x] mv integration tests that depend on community to community
- [x] mypy checks
Other todo
- [x] Make deprecation warnings not noisy (need to use warn deprecated
and check that things are implemented properly)
- [x] Update deprecation messages with timeline for code removal (likely
we actually won't be removing things until 0.4 release) -- will give
people more time to transition their code.
- [ ] Add information to deprecation warning to show users how to
migrate their code base using langchain-cli
- [ ] Remove any unnecessary requirements in langchain (e.g., is
SQLALchemy required?)
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
…Endpoint`
Thank you for contributing to LangChain!
- [x] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
- Example: "community: add foobar LLM"
- [x] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** add `bind_tools` and `with_structured_output` support
to `QianfanChatEndpoint`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Description: This PR includes fix for loader_source to be fetched from
metadata in case of GdriveLoaders.
Documentation: NA
Unit Test: NA
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Thank you for contributing to LangChain!
- [ ] **HuggingFaceInferenceAPIEmbeddings**: "Additional Headers"
- Where: langchain, community, embeddings. huggingface.py.
- Community: add additional headers when needed by custom HuggingFace
TEI embedding endpoints. HuggingFaceInferenceAPIEmbeddings"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** Adding the `additional_headers` to be passed to
requests library if needed
- **Dependencies:** none
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. Tested with locally available TEI endpoints with and without
`additional_headers`
2. Example Usage
```python
embeddings=HuggingFaceInferenceAPIEmbeddings(
api_key=MY_CUSTOM_API_KEY,
api_url=MY_CUSTOM_TEI_URL,
additional_headers={
"Content-Type": "application/json"
}
)
```
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: Massimiliano Pronesti <massimiliano.pronesti@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Relates [#17048]
Description : Applied fix to redis and neo4j file.
Error was : `Cannot override writeable attribute with read-only
property`
fix with the same solution of
[[langchain/libs/community/langchain_community/chat_message_histories/elasticsearch.py](d5c412b0a9/libs/community/langchain_community/chat_message_histories/elasticsearch.py (L170-L175))]
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
[Standardized model init args
#20085](https://github.com/langchain-ai/langchain/issues/20085)
- Enable premai chat model to be initialized with `model_name` as an
alias for `model`, `api_key` as an alias for `premai_api_key`.
- Add initialization test `test_premai_initialization`
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
- **Description:** fix: variable names in root validator not allowing
pass credentials as named parameters in llm instancing, also added
sambanova's sambaverse and sambastudio llms to __init__.py for module
import
Description: this change adds args_schema (pydantic BaseModel) to
YahooFinanceNewsTool for correct schema formatting on LLM function calls
Issue: currently using YahooFinanceNewsTool with OpenAI function calling
returns the following error "TypeError("YahooFinanceNewsTool._run() got
an unexpected keyword argument '__arg1'")". This happens because the
schema sent to the LLM is "input: "{'__arg1': 'MSFT'}"" while the method
should be called with the "query" parameter.
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Reverts langchain-ai/langchain#21174
Hey team - going to revert this because it doesn't seem necessary for
testing. We should only be adding optional + extended_testing
dependencies for deps that have extended tests.
otherwise it just increases probability of dependency conflicts in the
community lockfile.
Thank you for contributing to LangChain!
community:baichuan[patch]: standardize init args
updated `baichuan_api_key` so that aliased to `api_key`. Added test that
it continues to set the same underlying attribute. Test checks for
`SecretStr`
updated `temperature` with Pydantic Field, added unit test.
Related to https://github.com/langchain-ai/langchain/issues/20085
If Session and/or keyspace are not provided, they are resolved from
cassio's context. So they are not required.
This change is fully backward compatible.
Issue: the `langkit` package is not presented in the `pyproject.toml`
but it is a requirement for the `WhyLabsCallbackHandler`
Change: added `langkit`
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Thank you for contributing to LangChain!
- Oracle AI Vector Search
Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.
- Oracle AI Vector Search is designed for Artificial Intelligence (AI)
workloads that allows you to query data based on semantics, rather than
keywords. One of the biggest benefit of Oracle AI Vector Search is that
semantic search on unstructured data can be combined with relational
search on business data in one single system. This is not only powerful
but also significantly more effective because you don't need to add a
specialized vector database, eliminating the pain of data fragmentation
between multiple systems.
This Pull Requests Adds the following functionalities
Oracle AI Vector Search : Vector Store
Oracle AI Vector Search : Document Loader
Oracle AI Vector Search : Document Splitter
Oracle AI Vector Search : Summary
Oracle AI Vector Search : Oracle Embeddings
- We have added unit tests and have our own local unit test suite which
verifies all the code is correct. We have made sure to add guides for
each of the components and one end to end guide that shows how the
entire thing runs.
- We have made sure that make format and make lint run clean.
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
---------
Co-authored-by: skmishraoracle <shailendra.mishra@oracle.com>
Co-authored-by: hroyofc <harichandan.roy@oracle.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Issue: we have several helper functions to import third-party libraries
like tools.gmail.utils.import_google in
[community.tools](https://api.python.langchain.com/en/latest/community_api_reference.html#id37).
And we have core.utils.utils.guard_import that works exactly for this
purpose.
The import_<package> functions work inconsistently and rather be private
functions.
Change: replaced these functions with the guard_import function.
Related to #21133
* Introduce individual `fetch_` methods for easier typing.
* Rework some docstrings to google style
* Move some logic to the tool
* Merge the 2 cassandra utility files
**Description:**
This pull request introduces a new feature for LangChain: the
integration with the Rememberizer API through a custom retriever.
This enables LangChain applications to allow users to load and sync
their data from Dropbox, Google Drive, Slack, their hard drive into a
vector database that LangChain can query. Queries involve sending text
chunks generated within LangChain and retrieving a collection of
semantically relevant user data for inclusion in LLM prompts.
User knowledge dramatically improved AI applications.
The Rememberizer integration will also allow users to access general
purpose vectorized data such as Reddit channel discussions and US
patents.
**Issue:**
N/A
**Dependencies:**
N/A
**Twitter handle:**
https://twitter.com/Rememberizer
- [ ] **PR message**:
- **Description:** Refactored the lazy_load method to use asynchronous
execution for improved performance. The method now initiates scraping of
all URLs simultaneously using asyncio.gather, enhancing data fetching
efficiency. Each Document object is yielded immediately once its content
becomes available, streamlining the entire process.
- **Issue:** N/A
- **Dependencies:** Requires the asyncio library for handling
asynchronous tasks, which should already be part of standard Python
libraries in Python 3.7 and above.
- **Email:** [r73327118@gmail.com](mailto:r73327118@gmail.com)
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description**
This pull request updates the Bagel Network package name from
"betabageldb" to "bagelML" to align with the latest changes made by the
Bagel Network team.
The following modifications have been made:
- Updated all references to the old package name ("betabageldb") with
the new package name ("bagelML") throughout the codebase.
- Modified the documentation, and any relevant scripts to reflect the
package name change.
- Tested the changes to ensure that the functionality remains intact and
no breaking changes were introduced.
By merging this pull request, our project will stay up to date with the
latest Bagel Network package naming convention, ensuring compatibility
and smooth integration with their updated library.
Please review the changes and provide any feedback or suggestions. Thank
you!
**Issue:**
Currently `AzureSearch` vector store does not implement `delete` method.
This PR implements it. This also makes it compatible with LangChain
indexer.
**Dependencies:**
None
**Twitter handle:**
@martintriska1
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
As shown in #13749 , `RecursiveUrlLoader` has encoding issue. This PR is
to solve this.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
### Description:
When attempting to download PDF files from arXiv, an unexpected 404
error frequently occurs. This error halts the operation, regardless of
whether there are additional documents to process. As a solution, I
suggest implementing a mechanism to ignore and communicate this error
and continue processing the next document from the list.
Proposed Solution: To address the issue of unexpected 404 errors during
PDF downloads from arXiv, I propose implementing the following solution:
- Error Handling: Implement error handling mechanisms to catch and
handle 404 errors gracefully.
- Communication: Inform the user or logging system about the occurrence
of the 404 error.
- Continued Processing: After encountering a 404 error, continue
processing the remaining documents from the list without interruption.
This solution ensures that the application can handle unexpected errors
without terminating the entire operation. It promotes resilience and
robustness in the face of intermittent issues encountered during PDF
downloads from arXiv.
### Issue:
#20909
### Dependencies:
none
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
## Summary
I ran `ruff check --extend-select RUF100 -n` to identify `# noqa`
comments that weren't having any effect in Ruff, and then `ruff check
--extend-select RUF100 -n --fix` on select files to remove all of the
unnecessary `# noqa: F401` violations. It's possible that these were
needed at some point in the past, but they're not necessary in Ruff
v0.1.15 (used by LangChain) or in the latest release.
Co-authored-by: Erick Friis <erick@langchain.dev>
…/17690
Thank you for contributing to LangChain!
- [x] **Fix Google Lens knowledge graph issue**: "langchain: community"
- Fix for [No "knowledge_graph" property in Google Lens API call from
SerpAPI](https://github.com/langchain-ai/langchain/issues/17690)
- [x] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** handled the existence of keys in the json response of
Google Lens
- **Issue:** [No "knowledge_graph" property in Google Lens API call from
SerpAPI](https://github.com/langchain-ai/langchain/issues/17690)
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
## Description
Adding `UpstashVectorStore` to utilize [Upstash
Vector](https://upstash.com/docs/vector/overall/getstarted)!
#17012 was opened to add Upstash Vector to langchain but was closed to
wait for filtering. Now filtering is added to Upstash vector and we open
a new PR. Additionally, [embedding
feature](https://upstash.com/docs/vector/features/embeddingmodels) was
added and we add this to our vectorstore aswell.
## Dependencies
[upstash-vector](https://pypi.org/project/upstash-vector/) should be
installed to use `UpstashVectorStore`. Didn't update dependencies
because of [this comment in the previous
PR](https://github.com/langchain-ai/langchain/pull/17012#pullrequestreview-1876522450).
## Tests
Tests are added and they pass. Tests are naturally network bound since
Upstash Vector is offered through an API.
There was [a discussion in the previous PR about mocking the
unittests](https://github.com/langchain-ai/langchain/pull/17012#pullrequestreview-1891820567).
We didn't make changes to this end yet. We can update the tests if you
can explain how the tests should be mocked.
---------
Co-authored-by: ytkimirti <yusuftaha9@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Fixed the error that the model name is never actually put into GigaChat
request payload, always defaulting to `GigaChat-Lite`.
With this fix, model selection through
```python
import os
from langchain.chat_models.gigachat import GigaChat
chat = GigaChat(
name="GigaChat-Pro", # <- HERE!!!!!
...
)
```
should actually work, as intended in
[here](804390ba4b/libs/community/langchain_community/llms/gigachat.py (L36)).
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
**Description**: ToolKit and Tools for accessing data in a Cassandra
Database primarily for Agent integration. Initially, this includes the
following tools:
- `cassandra_db_schema` Gathers all schema information for the connected
database or a specific schema. Critical for the agent when determining
actions.
- `cassandra_db_select_table_data` Selects data from a specific keyspace
and table. The agent can pass paramaters for a predicate and limits on
the number of returned records.
- `cassandra_db_query` Expiriemental alternative to
`cassandra_db_select_table_data` which takes a query string completely
formed by the agent instead of parameters. May be removed in future
versions.
Includes unit test and two notebooks to demonstrate usage.
**Dependencies**: cassio
**Twitter handle**: @PatrickMcFadin
---------
Co-authored-by: Phil Miesle <phil.miesle@datastax.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description:** This pull request introduces a new feature to community
tools, enhancing its search capabilities by integrating the Mojeek
search engine
**Dependencies:** None
---------
Co-authored-by: Igor Brai <igor@mojeek.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
Issue: When the third-party package is not installed, whenever we need
to `pip install <package>` the ImportError is raised.
But sometimes, the `ValueError` or `ModuleNotFoundError` is raised. It
is bad for consistency.
Change: replaced the `ValueError` or `ModuleNotFoundError` with
`ImportError` when we raise an error with the `pip install <package>`
message.
Note: Ideally, we replace all `try: import... except... raise ... `with
helper functions like `import_aim` or just use the existing
[langchain_core.utils.utils.guard_import](https://api.python.langchain.com/en/latest/utils/langchain_core.utils.utils.guard_import.html#langchain_core.utils.utils.guard_import)
But it would be much bigger refactoring. @baskaryan Please, advice on
this.
This introduces `store_kwargs` which behaves similarly to `graph_kwargs`
on the `RdfGraph` object, which will enable users to pass `headers` and
other arguments to the underlying `SPARQLStore` object. I have also made
a [PR in `rdflib` to support passing
`default_graph`](https://github.com/RDFLib/rdflib/pull/2761).
Example usage:
```python
from langchain_community.graphs import RdfGraph
graph = RdfGraph(
query_endpoint="http://localhost/sparql",
standard="rdf",
store_kwargs=dict(
default_graph="http://example.com/mygraph"
)
)
```
<!--If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.-->
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description: The PebbloSafeLoader should first check for owner,
full_path and size in metadata before implementing its own logic.
Dependencies: None
Documentation: NA.
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Issue: #20514
The current implementation of `construct_instance` expects a `texts:
List[str]` that will call the embedding function. This might not be
needed when we already have a client with collection and `path, you
don't want to add any text.
This PR adds a class method that returns a qdrant instance with an
existing client.
Here everytime
cb6e5e56c2/libs/community/langchain_community/vectorstores/qdrant.py (L1592)
`construct_instance` is called, this line sends some text for embedding
generation.
---------
Co-authored-by: Anush <anushshetty90@gmail.com>
**Description:** AzureSearch vector store has no tests. This PR adds
initial tests to validate the code can be imported and used.
**Issue:** N/A
**Dependencies:** azure-search-documents and azure-identity are added as
optional dependencies for testing
---------
Co-authored-by: Matt Gotteiner <[email protected]>
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description**:
_PebbloSafeLoader_: Add support for pebblo server and client version
**Documentation:** NA
**Unit test:** NA
**Issue:** NA
**Dependencies:** None
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
- [ ] **Kinetica Document Loader**: "community: a class to load
Documents from Kinetica"
- [ ] **Kinetica Document Loader**:
- **Description:** implemented KineticaLoader in `kinetica_loader.py`
- **Dependencies:** install the Kinetica API using `pip install
gpudb==7.2.0.1 `
- **Description**:
- **add support for more data types**: by default `IpexLLM` will load
the model in int4 format. This PR adds more data types support such as
`sym_in5`, `sym_int8`, etc. Data formats like NF3, NF4, FP4 and FP8 are
only supported on GPU and will be added in future PR.
- Fix a small issue in saving/loading, update api docs
- **Dependencies**: `ipex-llm` library
- **Document**: In `docs/docs/integrations/llms/ipex_llm.ipynb`, added
instructions for saving/loading low-bit model.
- **Tests**: added new test cases to
`libs/community/tests/integration_tests/llms/test_ipex_llm.py`, added
config params.
- **Contribution maintainer**: @shane-huang
Description: Add support for Semantic topics and entities.
Classification done by pebblo-server is not used to enhance metadata of
Documents loaded by document loaders.
Dependencies: None
Documentation: Updated.
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**
- [x] **PR message**:
- **Description:** Deprecate persist method in Chroma no longer exists
in Chroma 0.4.x
- **Issue:** #20851
- **Dependencies:** None
- **Twitter handle:** AndresAlgaba1
- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
**Description:**
The RecursiveUrlLoader loader offers a link_regex parameter that can
filter out URLs. However, this filtering capability is limited, and if
the internal links of the website change, unexpected resources may be
loaded. These resources, such as font files, can cause problems in
subsequent embedding processing.
>
https://blog.langchain.dev/assets/fonts/source-sans-pro-v21-latin-ext_latin-regular.woff2?v=0312715cbf
We can add the Content-Type in the HTTP response headers to the document
metadata so developers can choose which resources to use. This allows
developers to make their own choices.
For example, the following may be a good choice for text knowledge.
- text/plain - simple text file
- text/html - HTML web page
- text/xml - XML format file
- text/json - JSON format data
- application/pdf - PDF file
- application/msword - Word document
and ignore the following
- text/css - CSS stylesheet
- text/javascript - JavaScript script
- application/octet-stream - binary data
- image/jpeg - JPEG image
- image/png - PNG image
- image/gif - GIF image
- image/svg+xml - SVG image
- audio/mpeg - MPEG audio files
- video/mp4 - MP4 video file
- application/font-woff - WOFF font file
- application/font-ttf - TTF font file
- application/zip - ZIP compressed file
- application/octet-stream - binary data
**Twitter handle:** @coolbeevip
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- **Description:** Adapt JinaEmbeddings to run with the new Jina AI
Rerank API
- **Twitter handle:** https://twitter.com/JinaAI_
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Add the remove_unwanted_classnames method to the
BeautifulSoupTransformer class, which can filter more effectively.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description :
- added functionalities - delete, index creation, using existing
connection object etc.
- updated usage
- Added LaceDB cloud OSS support
make lint_diff , make test checks done
Implemented the ability to enable full-text search within the
SingleStore vector store, offering users a versatile range of search
strategies. This enhancement allows users to seamlessly combine
full-text search with vector search, enabling the following search
strategies:
* Search solely by vector similarity.
* Conduct searches exclusively based on text similarity, utilizing
Lucene internally.
* Filter search results by text similarity score, with the option to
specify a threshold, followed by a search based on vector similarity.
* Filter results by vector similarity score before conducting a search
based on text similarity.
* Perform searches using a weighted sum of vector and text similarity
scores.
Additionally, integration tests have been added to comprehensively cover
all scenarios.
Updated notebook with examples.
CC: @baskaryan, @hwchase17
---------
Co-authored-by: Volodymyr Tkachuk <vtkachuk-ua@singlestore.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- added guard on the `pyTigerGraph` import
- added a missed example page in the `docs/integrations/graphs/`
- formatted the `docs/integrations/providers/` page to the consistent
format. Added links.
- **Description:**
This PR adds support for advanced filtering to the integration of HANA
Vector Engine.
The newly supported filtering operators are: $eq, $ne, $gt, $gte, $lt,
$lte, $between, $in, $nin, $like, $and, $or
- **Issue:** N/A
- **Dependencies:** no new dependencies added
Added integration tests to:
`libs/community/tests/integration_tests/vectorstores/test_hanavector.py`
Description of the new capabilities in notebook:
`docs/docs/integrations/vectorstores/hanavector.ipynb`
Thank you for contributing to LangChain!
community:perplexity[patch]: standardize init args
updated pplx_api_key and request_timeout so that aliased to api_key, and
timeout respectively. Added test that both continue to set the same
underlying attributes.
Related to
[20085](https://github.com/langchain-ai/langchain/issues/20085)
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
**Description:**
This PR fixes an issue in message formatting function for Anthropic
models on Amazon Bedrock.
Currently, LangChain BedrockChat model will crash if it uses Anthropic
models and the model return a message in the following type:
- `AIMessageChunk`
Moreover, when use BedrockChat with for building Agent, the following
message types will trigger the same issue too:
- `HumanMessageChunk`
- `FunctionMessage`
**Issue:**
https://github.com/langchain-ai/langchain/issues/18831
**Dependencies:**
No.
**Testing:**
Manually tested. The following code was failing before the patch and
works after.
```
@tool
def square_root(x: str):
"Useful when you need to calculate the square root of a number"
return math.sqrt(int(x))
llm = ChatBedrock(
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
model_kwargs={ "temperature": 0.0 },
)
prompt = ChatPromptTemplate.from_messages(
[
("system", FUNCTION_CALL_PROMPT),
("human", "Question: {user_input}"),
MessagesPlaceholder(variable_name="agent_scratchpad"),
]
)
tools = [square_root]
tools_string = format_tool_to_anthropic_function(square_root)
agent = (
RunnablePassthrough.assign(
user_input=lambda x: x['user_input'],
agent_scratchpad=lambda x: format_to_openai_function_messages(
x["intermediate_steps"]
)
)
| prompt
| llm
| AnthropicFunctionsAgentOutputParser()
)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, return_intermediate_steps=True)
output = agent_executor.invoke({
"user_input": "What is the square root of 2?",
"tools_string": tools_string,
})
```
List of messages returned from Bedrock:
```
<SystemMessage> content='You are a helpful assistant.'
<HumanMessage> content='Question: What is the square root of 2?'
<AIMessageChunk> content="Okay, let's calculate the square root of 2.<scratchpad>\nTo calculate the square root of a number, I can use the square_root tool:\n\n<function_calls>\n <invoke>\n <tool_name>square_root</tool_name>\n <parameters>\n <__arg1>2</__arg1>\n </parameters>\n </invoke>\n</function_calls>\n</scratchpad>\n\n<function_results>\n<search_result>\nThe square root of 2 is approximately 1.414213562373095\n</search_result>\n</function_results>\n\n<answer>\nThe square root of 2 is approximately 1.414213562373095\n</answer>" id='run-92363df7-eff6-4849-bbba-fa16a1b2988c'"
<FunctionMessage> content='1.4142135623730951' name='square_root'
```
Hi! My name is Alex, I'm an SDK engineer from
[Comet](https://www.comet.com/site/)
This PR updates the `CometTracer` class.
Fixed an issue when `CometTracer` failed while logging the data to Comet
because this data is not JSON-encodable.
The problem was in some of the `Run` attributes that could contain
non-default types inside, now these attributes are taken not from the
run instance, but from the `run.dict()` return value.
This PR moves the implementations for chat history to core. So it's
easier to determine which dependencies need to be broken / add
deprecation warnings
Vector indexes in ClickHouse are experimental at the moment and can
sometimes break/change behaviour. So this PR makes it possible to say
that you don't want to specify an index type.
Any queries against the embedding column will be brute force/linear
scan, but that gives reasonable performance for small-medium dataset
sizes.
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
**Description:** implemented GraphStore class for Apache Age graph db
**Dependencies:** depends on psycopg2
Unit and integration tests included. Formatting and linting have been
run.
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Description: you don't need to pass a version for Replicate official
models. That was broken on LangChain until now!
You can now run:
```
llm = Replicate(
model="meta/meta-llama-3-8b-instruct",
model_kwargs={"temperature": 0.75, "max_length": 500, "top_p": 1},
)
prompt = """
User: Answer the following yes/no question by reasoning step by step. Can a dog drive a car?
Assistant:
"""
llm(prompt)
```
I've updated the replicate.ipynb to reflect that.
twitter: @charliebholtz
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
ZhipuAI API only accepts `temperature` parameter between `(0, 1)` open
interval, and if `0` is passed, it responds with status code `400`.
However, 0 and 1 is often accepted by other APIs, for example, OpenAI
allows `[0, 2]` for temperature closed range.
This PR truncates temperature parameter passed to `[0.01, 0.99]` to
improve the compatibility between langchain's ecosystem's and ZhipuAI
(e.g., ragas `evaluate` often generates temperature 0, which results in
a lot of 400 invalid responses). The PR also truncates `top_p` parameter
since it has the same restriction.
Reference: [glm-4 doc](https://open.bigmodel.cn/dev/api#glm-4) (which
unfortunately is in Chinese though).
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
faster-whisper is a reimplementation of OpenAI's Whisper model using
CTranslate2, which is up to 4 times faster than enai/whisper for the
same accuracy while using less memory. The efficiency can be further
improved with 8-bit quantization on both CPU and GPU.
It can automatically detect the following 14 languages and transcribe
the text into their respective languages: en, zh, fr, de, ja, ko, ru,
es, th, it, pt, vi, ar, tr.
The gitbub repository for faster-whisper is :
https://github.com/SYSTRAN/faster-whisper
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>