- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each of a fixed duration. The metadata of each chunk
`Document` includes the start time of each one and a URL to that time in
the video on the YouTube website.
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute this to LangChain community for all to
benefit and to simplify maintenance.
- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)
With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while I was waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature I've proposed in this PR.
I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.
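As an illustration of the chunking idea, here is a minimal sketch that is independent of the actual `YoutubeLoader` code. The `Chunk` dataclass, the `chunk_transcript` function, and the shape of the caption pieces (dicts with `text` and a `start` time in seconds) are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk `Document`."""

    text: str
    start_seconds: int
    source: str


def chunk_transcript(pieces, video_id, chunk_size_seconds=120):
    """Group caption pieces into chunks covering fixed-duration windows."""
    grouped: Dict[int, List[str]] = {}
    for piece in pieces:
        # Each piece belongs to the window that contains its start time.
        index = int(piece["start"] // chunk_size_seconds)
        grouped.setdefault(index, []).append(piece["text"].strip())

    chunks = []
    for index in sorted(grouped):
        start = index * chunk_size_seconds
        chunks.append(
            Chunk(
                text=" ".join(grouped[index]),
                start_seconds=start,
                # URL that opens the video at the start of this chunk.
                source=f"https://www.youtube.com/watch?v={video_id}&t={start}s",
            )
        )
    return chunks
```

Windows that contain no captions simply produce no chunk in this sketch.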
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
This PR adds support for the Azure Cosmos DB for NoSQL vector store.
Summary:
Description: added vector store integration for Azure Cosmos DB for
NoSQL Vector Store,
Dependencies: azure-cosmos dependency,
Tag maintainer: @hwchase17, @baskaryan @efriis @eyurtsev
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
- [ ] **Miscellaneous updates and fixes**:
- **Description:** Handled error in querying; quotes in table names;
updated gpudb API
- **Issue:** Threw an error with an error message difficult to
understand if a query failed or returned no records
- **Dependencies:** Updated GPUDB API version to `7.2.0.9`
@baskaryan @hwchase17
LLMs struggle with Graph RAG because, unlike vector RAG, you don't
provide the whole context, only the answer, and the LLM has to trust it.
That often doesn't work. If you wrap the context as a function response,
however, the accuracy is much better.
By the way, `union[LLMChain, Runnable]` is awkward to lint, which is why
there are so many ignores.
**Description:** This PR adds Volcengine Rerank capability to LangChain.
The Volcengine Rerank API is documented
[here](https://www.volcengine.com/docs/84313/1254474) &
[here](https://www.volcengine.com/docs/84313/1254605).
[Volcengine](https://www.volcengine.com/) is a cloud service platform
developed by ByteDance, the parent company of TikTok. You can obtain
Volcengine API AK/SK from
[here](https://www.volcengine.com/docs/84313/1254553).
**Dependencies:** VolcengineRerank depends on `volcengine` python
package.
**Twitter handle:** my twitter/x account is https://x.com/LastMonopoly
and I'd like a mention, thank you!
**Tests and docs**
1. integration test: `test_volcengine_rerank.py`
2. example notebook: `volcengine_rerank.ipynb`
**Lint and test**: I have run `make format`, `make lint` and `make test`
from the root of the package I've modified.
Hi 👋
First off, thanks a ton for your work on this 💚 Really appreciate what
you're providing here for the community.
## Description
This PR adds a basic language parser for the
[Elixir](https://elixir-lang.org/) programming language. The parser code
is based upon the approach outlined in
https://github.com/langchain-ai/langchain/pull/13318: it's using
`tree-sitter` under the hood and aligns with all the other `tree-sitter`
based parsers added in that PR.
The `CHUNK_QUERY` I'm using here is probably not the most sophisticated
one, but it worked for my application. It's a starting point to provide
"core" parsing support for Elixir in LangChain. It enables people to use
the language parser out in real world applications which may then lead
to further tweaking of the queries. I consider this PR just the ground
work.
- **Dependencies:** requires `tree-sitter` and `tree-sitter-languages`
from the extended dependencies
- **Twitter handle:** `@bitcrowd`
## Checklist
- [x] **PR title**: "package: description"
- [x] **Add tests and docs**
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.
Adding `UpstashRatelimitHandler` callback for rate limiting based on
number of chain invocations or LLM token usage.
For more details, see [upstash/ratelimit-py
repository](https://github.com/upstash/ratelimit-py) or the notebook
guide included in this PR.
Twitter handle: @cahidarda
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
They cause `poetry lock` to take a ton of time, and `uv pip install` can
resolve the constraints from these toml files in trivial time
(addressing problem with #19153)
This allows us to properly upgrade lockfile dependencies moving forward,
which revealed some issues that were either fixed or type-ignored (see
file comments)
**Description:** This PR addresses an issue with an existing test that
was not effectively testing the intended functionality. The previous
test setup did not adequately validate the filtering of the labels in
neo4j, because the nodes and relationship in the test data did not have
any properties set. Without properties these labels would not have been
returned, regardless of the filtering.
---------
Co-authored-by: Oskar Hane <oh@oskarhane.com>
This PR adds a constructor `metadata_indexing` parameter to the
Cassandra vector store to allow optional fine-tuning of which fields of
the metadata are to be indexed.
This is a feature supported by the underlying CassIO library. Indexing
mode of "all", "none" or deny- and allow-list based choices are
available.
The rationale is, in some cases it's advisable to programmatically
exclude some portions of the metadata from the index if one knows in
advance they won't ever be used at search time. This keeps the index
more lightweight and performant and avoids limitations on the length of
_indexed_ strings.
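To make the indexing choices concrete, here is a minimal sketch of how such a policy could select the metadata fields to index. It is not the CassIO implementation: the function `select_indexed_metadata` and the mode names `"allowlist"` and `"denylist"` are assumptions made for this illustration.

```python
from typing import Any, Dict, Iterable, Tuple, Union

# Either "all" / "none", or a (mode, field names) pair.
IndexingPolicy = Union[str, Tuple[str, Iterable[str]]]


def select_indexed_metadata(
    metadata: Dict[str, Any], policy: IndexingPolicy = "all"
) -> Dict[str, Any]:
    """Return the subset of metadata that would be indexed under the policy."""
    if policy == "all":
        return dict(metadata)
    if policy == "none":
        return {}
    if isinstance(policy, str):
        raise ValueError(f"Unknown indexing policy: {policy!r}")

    mode, fields = policy
    names = set(fields)
    if mode == "allowlist":
        # Index only the listed fields.
        return {k: v for k, v in metadata.items() if k in names}
    if mode == "denylist":
        # Index everything except the listed fields.
        return {k: v for k, v in metadata.items() if k not in names}
    raise ValueError(f"Unknown indexing mode: {mode!r}")
```

With a deny-list, a long field such as a raw text blob can stay in the stored metadata while being kept out of the index.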
I added an integration test of the feature. I also added the possibility
of running the integration test with Cassandra on an arbitrary IP
address (e.g. Dockerized), via
`CASSANDRA_CONTACT_POINTS=10.1.1.5,10.1.1.6 poetry run pytest [...]` or
similar.
While I was at it, I added a line to the `.gitignore` since the mypy
_test_ cache was not ignored yet.
My X (Twitter) handle: @rsprrs.
# package community: Fix SQLChatMessageHistory
## Description
Here is a rewrite of `SQLChatMessageHistory` to properly implement the
asynchronous approach. The code circumvents [issue
22021](https://github.com/langchain-ai/langchain/issues/22021) by
accepting a synchronous call to `def add_messages()` in an asynchronous
scenario. This bypasses the bug.
For the same reasons as in [PR
32](https://github.com/langchain-ai/langchain-postgres/pull/32) of
`langchain-postgres`, we use a lazy strategy for table creation. Indeed,
the promise of the constructor cannot be fulfilled without this: it is
not possible to invoke an asynchronous call in a constructor. We
compensate for this by waiting for the next asynchronous method call to
create the table.
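The lazy strategy can be sketched as follows. This is an illustration only, not the code of `SQLChatMessageHistory`: the class `LazyTableHistory`, its attributes, and the `asyncio.sleep(0)` call standing in for an asynchronous `CREATE TABLE IF NOT EXISTS` are all assumptions.

```python
import asyncio
from typing import List


class LazyTableHistory:
    """Hypothetical history class that defers table creation."""

    def __init__(self) -> None:
        # A constructor cannot await, so no table is created here.
        self._table_ready = False
        self.create_calls = 0
        self.messages: List[str] = []

    async def _ensure_table(self) -> None:
        if self._table_ready:
            return
        # Stand-in for an idempotent, asynchronous table creation.
        await asyncio.sleep(0)
        self.create_calls += 1
        self._table_ready = True

    async def aadd_messages(self, messages: List[str]) -> None:
        # The first asynchronous call creates the table; later calls skip it.
        await self._ensure_table()
        self.messages.extend(messages)


async def demo() -> LazyTableHistory:
    history = LazyTableHistory()
    await history.aadd_messages(["hi"])
    await history.aadd_messages(["there"])
    return history
```

Running `asyncio.run(demo())` performs the creation step once, on the first `aadd_messages` call.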
The goal of the `PostgresChatMessageHistory` class (in
`langchain-postgres`) is, among other things, to be able to recycle
database connections. The implementation of the class is problematic, as
we have demonstrated in [issue
22021](https://github.com/langchain-ai/langchain/issues/22021).
Our new implementation of `SQLChatMessageHistory` achieves this by using
a singleton of type (`Async`)`Engine` for the database connection. The
connection pool is managed by this singleton, and the code is then
reentrant.
We also accept the type `str`, optionally complemented by `async_mode`.
(I know you don't like this much, but it's the only way to allow an
asynchronous connection string.)
In order to unify the different classes handling database connections,
we have renamed `connection_string` to `connection`, and `Session` to
`session_maker`.
Now, a single transaction is used to add a list of messages. Thus, a
crash during this write operation will not leave the database in an
unstable state with a partially added message list. This makes the code
resilient.
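The all-or-nothing behaviour can be demonstrated with the standard library's `sqlite3` module. This is a sketch under assumptions, not the PR's SQLAlchemy code: the table layout and the `add_messages` helper are made up for the example.

```python
import sqlite3
from typing import List, Optional


def add_messages(
    conn: sqlite3.Connection, session_id: str, messages: List[Optional[str]]
) -> None:
    """Insert all messages in one transaction: all are stored, or none."""
    # As a context manager, the connection commits on success and
    # rolls back if any statement raises.
    with conn:
        for message in messages:
            conn.execute(
                "INSERT INTO message_store (session_id, message) VALUES (?, ?)",
                (session_id, message),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message_store (session_id TEXT, message TEXT NOT NULL)")

add_messages(conn, "s1", ["hi", "hello"])
try:
    # The second row violates NOT NULL, so the first row is rolled back too.
    add_messages(conn, "s1", ["ok", None])
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM message_store").fetchone()[0]
```

After the failed call, `count` still reflects only the two messages from the first, successful call.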
We believe that the `PostgresChatMessageHistory` class is no longer
necessary and can be replaced by:
```
PostgresChatMessageHistory = SQLChatMessageHistory
```
This also fixes the bug.
## Issue
- [issue 22021](https://github.com/langchain-ai/langchain/issues/22021)
- Bug in _exit_history()
- Bugs in PostgresChatMessageHistory and sync usage
- Bugs in PostgresChatMessageHistory and async usage
- [issue
36](https://github.com/langchain-ai/langchain-postgres/issues/36)
## Twitter handle:
pprados
## Tests
- libs/community/tests/unit_tests/chat_message_histories/test_sql.py
(add async test)
@baskaryan, @eyurtsev or @hwchase17, can you check this PR?
I've also been waiting a long time for reviews of other PRs. Can
you take a look?
- [PR 32](https://github.com/langchain-ai/langchain-postgres/pull/32)
- [PR 15575](https://github.com/langchain-ai/langchain/pull/15575)
- [PR 13200](https://github.com/langchain-ai/langchain/pull/13200)
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
- **Description:** The InMemoryVectorStore is a nice and simple vector
store implementation for quick development and debugging. The current
implementation is quite limited in functionality. This PR extends
it by adding utility functions to persist the vector
store to a JSON file and to load it from a JSON file. We chose the JSON
file format because it allows inspection of the database contents in a
text editor, which is great for debugging. Furthermore, it adds a
`filter` keyword that can be used to filter documents on their
`page_content` or `metadata`.
- **Issue:** -
- **Dependencies:** -
- **Twitter handle:** @Vincent_Min
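The persistence and filtering ideas can be sketched with a toy store. This is not the `InMemoryVectorStore` code: the `TinyStore` class, its method names, and the callable form of `filter` are assumptions for illustration, and embeddings and similarity scoring are left out.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class TinyStore:
    """Toy document store that round-trips through a JSON file."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}

    def add(self, doc_id: str, page_content: str, metadata: Dict[str, Any]) -> None:
        self.docs[doc_id] = {"page_content": page_content, "metadata": metadata}

    def dump(self, path: str) -> None:
        # Indented JSON stays readable in a text editor.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.docs, f, indent=2)

    @classmethod
    def load(cls, path: str) -> "TinyStore":
        store = cls()
        with open(path, encoding="utf-8") as f:
            store.docs = json.load(f)
        return store

    def search(
        self, filter: Optional[Callable[[Dict[str, Any]], bool]] = None
    ) -> List[Dict[str, Any]]:
        # Keep only documents for which the filter returns True.
        return [d for d in self.docs.values() if filter is None or filter(d)]
```

A filter such as `lambda d: d["metadata"]["lang"] == "de"` would restrict results to documents whose metadata matches.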
Thank you for contributing to LangChain!
**Description:** update to the Vectara / Langchain integration to
integrate new Vectara capabilities:
- Full RAG implemented as a Runnable with as_rag()
- Vectara chat supported with as_chat()
- Both support streaming response
- Updated documentation and example notebook to reflect all the changes
- Updated Vectara templates
**Twitter handle:** ofermend
**Add tests and docs**: no new tests or docs, but updated both existing
tests and existing docs
- [ ] **Packages affected**:
- community: fix `cosine_similarity` to support simsimd beyond 3.7.7
- partners/milvus: fix `cosine_similarity` to support simsimd beyond
3.7.7
- partners/mongodb: fix `cosine_similarity` to support simsimd beyond
3.7.7
- partners/pinecone: fix `cosine_similarity` to support simsimd beyond
3.7.7
- partners/qdrant: fix `cosine_similarity` to support simsimd beyond
3.7.7
- [ ] **Broadcast operation failure while using simsimd beyond v3.7.7**:
- **Description:** I was using simsimd 4.3.1 and the unsupported operand
type issue popped up. When I checked out the repo and ran the tests,
they failed as well (have attached a screenshot for that). Looks like it
is a variant of https://github.com/langchain-ai/langchain/issues/18022 .
Prior to 3.7.7, `simd.cdist` returned an ndarray, but now it returns a
`simsimd.DistancesTensor`, which cannot be used in a broadcast operation
with numpy. This change also removes the need to explicitly cast
`Z` to a numpy array.
- **Issue:** #19905
- **Dependencies:** No
- **Twitter handle:** https://x.com/GetzJoydeep
<img width="1622" alt="Screenshot 2024-05-29 at 2 50 00 PM"
src="https://github.com/langchain-ai/langchain/assets/31132555/fb27b383-a9ae-4a6f-b355-6d503b72db56">
- [ ] **Considerations**:
1. I started with community, but since similar code existed in
Milvus, MongoDB, Pinecone, and Qdrant, I modified their files as well.
If touching multiple packages in one PR is not the norm, I can
remove them from this PR and raise separate ones.
2. I have run the tests and verified that they pass. Since only MongoDB had
tests, I ran theirs and verified that they pass as well. Screenshots
attached:
<img width="1573" alt="Screenshot 2024-05-29 at 2 52 13 PM"
src="https://github.com/langchain-ai/langchain/assets/31132555/ce87d1ea-19b6-4900-9384-61fbc1a30de9">
<img width="1614" alt="Screenshot 2024-05-29 at 3 33 51 PM"
src="https://github.com/langchain-ai/langchain/assets/31132555/6ce1d679-db4c-4291-8453-01028ab2dca5">
I have added a test for simsimd. It may not fit well with the
CI/CD setup, since simsimd is not a required dependency. I
just import simsimd in the test to ensure that the simsimd cosine
similarity path is invoked. However, it's not a good approach.
Suggestions are welcome, and I can make the required changes to the PR.
Please provide guidance, as I am new to the community.
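For context, the pattern under discussion can be sketched roughly as below. This is a simplified approximation rather than the exact code in the affected packages: it assumes non-zero input rows and uses simsimd only when it is installed, falling back to plain NumPy otherwise.

```python
import numpy as np


def cosine_similarity(X, Y):
    """Row-wise cosine similarity between two 2-D arrays of vectors."""
    X = np.array(X, dtype=np.float32)
    Y = np.array(Y, dtype=np.float32)
    try:
        import simsimd as simd

        # Wrapping the result in np.array yields a plain ndarray whether
        # cdist returns an ndarray or a DistancesTensor, so the
        # subtraction broadcasts as usual.
        return 1 - np.array(simd.cdist(X, Y, metric="cosine"))
    except ImportError:
        # NumPy fallback; assumes no all-zero rows.
        X_norm = np.linalg.norm(X, axis=1, keepdims=True)
        Y_norm = np.linalg.norm(Y, axis=1, keepdims=True)
        return (X @ Y.T) / (X_norm * Y_norm.T)
```

Both branches return an array of shape `(len(X), len(Y))`.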
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
### Description
Add tools implementation to `ChatEdenAI`:
- `bind_tools()`
- `with_structured_output()`
### Documentation
Updated `docs/docs/integrations/chat/edenai.ipynb`
### Notes
We don't support streaming with tools yet. If stream is called with
tools, we directly yield the whole message from `generate` (implemented
the same way as Anthropic did).
- **Description:** When I was running the SparkLLMTextEmbeddings,
app_id, api_key and api_secret are all correct, but it cannot run
normally using the current URL.
```python
# example
from langchain_community.embeddings import SparkLLMTextEmbeddings

embedding = SparkLLMTextEmbeddings(
    spark_app_id="my-app-id",
    spark_api_key="my-api-key",
    spark_api_secret="my-api-secret",
)
text = "hello"
print(embedding.embed_query(text))
```
![sparkembedding](https://github.com/langchain-ai/langchain/assets/55082429/11daa853-4f67-45b2-aae2-c95caa14e38c)
So I updated the url and request body parameters according to
[Embedding_api](https://www.xfyun.cn/doc/spark/Embedding_api.html), now
it is runnable.
**Description:** [IPEX-LLM](https://github.com/intel-analytics/ipex-llm)
is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local
PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low
latency. This PR adds ipex-llm integrations to langchain for BGE
embedding support on both Intel CPU and GPU.
**Dependencies:** `ipex-llm`, `sentence-transformers`
**Contribution maintainer**: @Oscilloscope98
**tests and docs**:
- langchain/docs/docs/integrations/text_embedding/ipex_llm.ipynb
- langchain/docs/docs/integrations/text_embedding/ipex_llm_gpu.ipynb
-
langchain/libs/community/tests/integration_tests/embeddings/test_ipex_llm.py
---------
Co-authored-by: Shengsheng Huang <shannie.huang@gmail.com>
- [x] **PR title**: community: Add Zep Cloud components + docs +
examples
- [x] **PR message**:
We have recently released our new zep-cloud SDKs, which are compatible
with Zep Cloud (not Zep Open Source). We have also maintained our Cloud
versions of the LangChain components (ChatMessageHistory, VectorStore) as
part of our SDKs. This PR's goal is to port these components to the
LangChain community repo and close the gap with the existing Zep Open
Source components already present in the community repo (added
ZepCloudMemory, ZepCloudVectorStore, ZepCloudRetriever).
We also added a ZepCloudChatMessageHistory component together with an
expression language example ported from our repo. We have left the
original open source components intact on purpose so as not to introduce
any breaking changes.
- **Issue:** -
- **Dependencies:** Added optional dependency of our new cloud sdk
`zep-cloud`
- **Twitter handle:** @paulpaliychuk51
- [x] **Add tests and docs**
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
- [ ] **PR title**: "Add CloudBlobLoader"
- community: Add CloudBlobLoader
- [ ] **PR message**: Add cloud blob loader
- **Description:**
Langchain provides several approaches to read different file formats:
specific loaders (`CSVLoader`) or blob-compatible loaders
(`FileSystemBlobLoader`). The only implementation proposed for
`BlobLoader` is `FileSystemBlobLoader`.
Many projects retrieve files from cloud storage. We propose a new
implementation of `BlobLoader` to read files from three cloud
storage systems: AWS S3, Azure Blob Storage, and Google Cloud Storage.
The interface is strictly identical to
`FileSystemBlobLoader`. The only difference is the constructor, which
takes a cloud URL such as `s3://my-bucket`, `az://my-bucket`,
or `gs://my-bucket`.
By streamlining the process, this novel implementation eliminates the
requirement to pre-download files from cloud storage to local temporary
files (which are seldom removed).
The code relies on the
[CloudPathLib](https://cloudpathlib.drivendata.org/stable/) library to
interpret cloud URLs. This has been added as an optional dependency.
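```Python
from langchain_community.document_loaders import CloudBlobLoader

loader = CloudBlobLoader("s3://mybucket/id")
# Blobs are read lazily from cloud storage, one at a time.
for blob in loader.yield_blobs():
    print(blob)
```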
- [X] **Dependencies:** CloudPathLib
- [X] **Twitter handle:** pprados
- [X] **Add tests and docs**: Add unit test, but it's easy to convert to
integration test, with some files in a cloud storage (see
`test_cloud_blob_loader.py`)
- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.
Hello from Paris @hwchase17. Can you review this PR?
---------
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
Added [Scrapfly](https://scrapfly.io/) Web Loader integration. Scrapfly
is a web scraping API that allows extracting web page data into
accessible markdown or text datasets.
- __Description__: Added Scrapfly web loader for retrieving web page
data as markdown or text.
- Dependencies: scrapfly-sdk
- Twitter: @thealchemi1st
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
**Description:** Backwards compatible extension of the initialisation
interface of HanaDB to allow the user to specify
specific_metadata_columns that are used for metadata storage of selected
keys which yields increased filter performance. Any not-mentioned
metadata remains in the general metadata column as part of a JSON
string. Furthermore switched to executemany for batch inserts into
HanaDB.
**Issue:** N/A
**Dependencies:** no new dependencies added
**Twitter handle:** @sapopensource
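The two ideas in this description, dedicated columns for selected metadata keys and batched inserts, can be sketched with the standard library's `sqlite3` module as a stand-in database. This is not the HanaDB integration code: the `split_metadata` helper, the table layout, and the sample documents are assumptions made for illustration.

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple


def split_metadata(
    metadata: Dict[str, Any], specific_columns: List[str]
) -> Tuple[tuple, str]:
    """Split metadata into dedicated column values and a JSON remainder."""
    specific = tuple(metadata.get(c) for c in specific_columns)
    rest = {k: v for k, v in metadata.items() if k not in specific_columns}
    return specific, json.dumps(rest)


docs = [
    ("first text", {"owner": "alice", "year": 2023}),
    ("second text", {"owner": "bob", "year": 2024}),
]
specific_columns = ["owner"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (text TEXT, metadata TEXT, owner TEXT)")

rows = []
for text, metadata in docs:
    specific, rest_json = split_metadata(metadata, specific_columns)
    rows.append((text, rest_json, *specific))

# One batched call instead of one INSERT round trip per document.
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)
conn.commit()

# Filtering on a dedicated column needs no JSON parsing.
bob_texts = [
    r[0] for r in conn.execute("SELECT text FROM docs WHERE owner = ?", ("bob",))
]
```

Here `owner` gets its own column, while `year` stays in the JSON string stored in the general `metadata` column.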
---------
Co-authored-by: Martin Kolb <martin.kolb@sap.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Integrate RankLLM reranker (https://github.com/castorini/rank_llm) into
LangChain
An example notebook is given in
`docs/docs/integrations/retrievers/rankllm-reranker.ipynb`
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- **Bug code**: In
langchain_community/document_loaders/csv_loader.py:100
- **Description**: Currently, when `CSVLoader` reads a column key of
`None` from the CSV file, it raises an error, because `CSVLoader` does
not check whether the column key is a `str` and does not handle the
corresponding row data when the key is `None`. This PR provides a
solution.
- **Issue:** Fixes #20699
- **Thinking:**
1. Follow the handling already used at
`langchain_community/document_loaders/csv_loader.py:100` when **`v`**
is `None`, and apply the same handling to **`k`**.
(With `csv.DictReader`, **`k`** is `None` only when
`len(columns) < len(number_row_data)` holds, i.e. a data row has more
fields than there are columns.)
2. **`k`** is `None` only for the trailing extra fields, and its
corresponding **`v`** is a list. I therefore followed the data
format in `Document` and joined the list elements with `,`. (I'm not
sure whether you accept this form; if you have other ideas, let me
know.)
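The `csv.DictReader` behaviour behind this fix can be reproduced with the standard library alone. The helper `row_to_content` below is a hypothetical simplification of the loader's formatting logic, written for illustration rather than copied from `csv_loader.py`.

```python
import csv
import io


def row_to_content(row):
    """Format one DictReader row as 'key: value' lines."""
    lines = []
    for k, v in row.items():
        # The key is None for fields beyond the header columns.
        key = k.strip() if isinstance(k, str) else k
        if isinstance(v, str):
            value = v.strip()
        elif isinstance(v, list):
            # Extra fields arrive as a list; join them with commas.
            value = ",".join(item.strip() for item in v)
        else:
            value = v
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# The data row has four fields but the header names only two columns.
data = io.StringIO("name,city\nAda,London,extra1,extra2\n")
row = next(csv.DictReader(data))
content = row_to_content(row)
```

The two surplus fields end up under the `None` key as the list `["extra1", "extra2"]`, which is why calling `.strip()` on the key or value without a type check fails.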
---------
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
## Description
The existing public interface for `langchain_community.embeddings` is
broken. In this file, `__all__` is statically defined, but is
subsequently overwritten with a dynamic expression, which type checkers
like pyright do not support. pyright actually gives the following
diagnostic on the line I am requesting we remove:
[reportUnsupportedDunderAll](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportUnsupportedDunderAll):
```
Operation on "__all__" is not supported, so exported symbol list may be incorrect
```
Currently, I get the following errors when attempting to use publicly
exported classes in `langchain_community.embeddings`:
```python
import langchain_community.embeddings
langchain_community.embeddings.HuggingFaceEmbeddings(...) # error: "HuggingFaceEmbeddings" is not exported from module "langchain_community.embeddings" (reportPrivateImportUsage)
```
This is solved easily by removing the dynamic expression.
- **Description:** Tongyi uses different clients for chat models and
vision models. This PR chooses the proper client based on the model name
to support both. See the [Tongyi
documentation](https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-vl-plus-api?spm=a2c4g.11186623.0.0.27404c9a7upm11)
for details.
```
from langchain_core.messages import HumanMessage
from langchain_community.chat_models import ChatTongyi

llm = ChatTongyi(model_name='qwen-vl-max')
image_message = {
    "image": "https://lilianweng.github.io/posts/2023-06-23-agent/agent-overview.png"
}
text_message = {
    "text": "summarize this picture",
}
message = HumanMessage(content=[text_message, image_message])
llm.invoke([message])
```
- **Issue:** None
- **Dependencies:** None
- **Twitter handle:** None
We add a tool and retriever for the [AskNews](https://asknews.app)
platform with example notebooks.
The retriever can be invoked with:
```py
from langchain_community.retrievers import AskNewsRetriever
retriever = AskNewsRetriever(k=3)
retriever.invoke("impact of fed policy on the tech sector")
```
This retrieves 3 news documents related to Fed policy impacts on
the tech sector. The included notebook also covers controlling filters
such as category and time, as well as including the retriever in a
chain.
The tool is quite interesting, as it allows the agent to decide how to
obtain the news by forming a query and deciding how far back in time to
look for the news:
```py
from langchain_community.tools.asknews import AskNewsSearch
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI

instructions = """You are an assistant."""
base_prompt = hub.pull("langchain-ai/openai-functions-template")
prompt = base_prompt.partial(instructions=instructions)
llm = ChatOpenAI(temperature=0)
asknews_tool = AskNewsSearch()
tools = [asknews_tool]
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
)
agent_executor.invoke({"input": "How is the tech sector being affected by fed policy?"})
```
---------
Co-authored-by: Emre <e@emre.pm>