The Docugami loader was not returning the source metadata key. This was
triggering this exception when used with retrievers, per
https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/schema/prompt_template.py#L193C1-L195C41
The fix is simple and just updates the metadata key name for the
document each chunk is sourced from, from "name" to "source" as
expected.
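A minimal sketch of the rename, assuming chunk metadata is a plain dict. `rename_metadata_key` and the sample values are hypothetical and are not the loader's actual code:

```python
from typing import Any, Dict


def rename_metadata_key(
    metadata: Dict[str, Any], old: str = "name", new: str = "source"
) -> Dict[str, Any]:
    """Return a copy of `metadata` with the key `old` renamed to `new`."""
    renamed = dict(metadata)  # copy so the caller's dict is left untouched
    if old in renamed:
        renamed[new] = renamed.pop(old)
    return renamed


# Illustrative chunk metadata; real Docugami metadata carries more fields.
chunk_metadata = {"name": "lease.pdf", "id": "chunk-1"}
fixed_metadata = rename_metadata_key(chunk_metadata)
```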
I tested by running the python notebook that has an end to end scenario
in it.
Tagging DataLoader maintainers @rlancemartin @eyurtsev
It is not obvious what the error is when indexing fails. This PR adds
the ability to log the reason for the first error, to help the user
diagnose the issue.
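A rough sketch of the idea, assuming each bulk error item has the shape `{"index": {"error": {"reason": "..."}}}`. `first_error_reason` is a hypothetical helper and the sample error is made up for illustration:

```python
from typing import Any, Dict, List


def first_error_reason(errors: List[Dict[str, Any]]) -> str:
    """Extract a readable reason from the first bulk error item, if any."""
    if not errors:
        return "unknown"
    # The item is keyed by the bulk action ("index", "create", ...).
    for action_result in errors[0].values():
        if isinstance(action_result, dict):
            error = action_result.get("error", {})
            if isinstance(error, dict):
                return error.get("reason", "unknown")
            return str(error)
    return "unknown"


# Made-up error list in the assumed shape.
sample_errors = [
    {"index": {"status": 400, "error": {"type": "mapper_parsing_exception",
                                        "reason": "failed to parse field [vector]"}}}
]
reason = first_error_reason(sample_errors)
```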
Also added some more documentation for when you want to use the
vectorstore with an embedding model deployed in elasticsearch.
Credit: @elastic and @phoey1
- Description: when I set `content_format=ContentFormat.VIEW` and
`keep_markdown_format=True` on ConfluenceLoader, it shows the following
error:
```
langchain/document_loaders/confluence.py", line 459, in process_page
page["body"]["storage"]["value"], heading_style="ATX"
KeyError: 'storage'
```
The reason is that the content format was set to `view`, but the loader
was still trying to get the content from `page["body"]["storage"]["value"]`.
Also added the other content formats that are supported by the Atlassian
API:
https://stackoverflow.com/questions/34353955/confluence-rest-api-expanding-page-body-when-retrieving-page-by-title/34363386#34363386
- Issue: Not applicable.
- Dependencies: Added optional dependency `markdownify` for anyone who
wants to extract in markdown format.
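For the content-format fix described above, here is a minimal sketch of reading the body from the key that matches the selected format. `ContentFormat.VIEW` comes from the description; the enum values and the `get_content` helper are assumptions for illustration:

```python
from enum import Enum


class ContentFormat(str, Enum):
    """Sketch of a content-format enum; values are assumed expand paths."""

    STORAGE = "body.storage"
    VIEW = "body.view"

    def get_content(self, page: dict) -> str:
        # "body.view" -> read page["body"]["view"]["value"]
        body_key = self.value.split(".", 1)[1]
        return page["body"][body_key]["value"]


# A page fetched with the `view` expansion has no "storage" entry.
page = {"body": {"view": {"value": "<p>hello</p>"}}}
html = ContentFormat.VIEW.get_content(page)
```

Looking the key up from the selected format avoids the hard-coded `"storage"` access that raised the `KeyError`.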
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: Added the capability to handle structured data from
Google Enterprise Search,
- Issue: Retriever failed when the underlying search engine was
integrated with structured data,
- Dependencies: google-api-core
- Tag maintainer: @jarokaz
- Twitter handle: anifort
---------
Co-authored-by: Christos Aniftos <aniftos@google.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Updates the hub stubs so they do not fail when no API key is found. This
supports singleton tenants and default values from SDK 0.1.6.
Also adds the ability to define is_public and description for backup
repo creation on push.
Currently, `generation_info` is not respected because only the messages
in chunks are merged. This changes it to add the generations themselves,
so that generation chunks are merged properly.
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- Description: current code does not work very well on jupyter notebook,
so I changed the code so that it imports `tqdm.auto` instead.
- Issue: #9582
- Dependencies: N/A
- Tag maintainer: @hwchase17, @baskaryan
- Twitter handle: N/A
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
It's possible that langchain-experimental works fine with the latest
*published* langchain, but is broken with the langchain on `master`.
Unfortunately, you can see this is currently the case — this is why this
PR also includes a minor fix for the `langchain` package itself.
We want to catch situations like that *before* releasing a new
langchain, hence this test.
# Description
This PR introduces a new toolkit for interacting with the AINetwork
blockchain. The toolkit provides a set of tools for performing various
operations on the AINetwork blockchain, such as transferring AIN,
reading and writing values to the blockchain database, managing apps,
setting rules and owners.
# Dependencies
[ain-py](https://github.com/ainblockchain/ain-py) >= 1.0.2
# Misc
The example notebook
(langchain/docs/extras/integrations/toolkits/ainetwork.ipynb) is in the
PR
---------
Co-authored-by: kriii <kriii@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Introduces a conditional in `ArangoGraph.generate_schema()` to exclude
empty ArangoDB Collections from the schema
- Add empty collection test case
Issue: N/A
Dependencies: None
### Description
Polars is a DataFrame interface on top of an OLAP Query Engine
implemented in Rust.
Polars is faster to read than pandas, so I'm looking forward to seeing
it added to the document loader.
### Dependencies
polars (https://pola-rs.github.io/polars-book/user-guide/)
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
I have restructured the code to ensure uniform handling of ImportError.
In place of previously used ValueError, I've adopted the standard
practice of raising ImportError with explanatory messages. This
modification enhances code readability and clarifies that any problems
stem from module importation.
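A minimal sketch of the pattern, using only the standard library. `import_optional` and the package names passed to it are hypothetical:

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, raising ImportError with install help."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise as ImportError (not ValueError) so callers can tell the
        # failure is a missing package rather than bad input.
        raise ImportError(
            f"Could not import {module_name}. "
            f"Please install it with `pip install {pip_name}`."
        ) from exc


# A stdlib module stands in for an optional dependency here.
json_module = import_optional("json", "json")
```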
@eyurtsev , @baskaryan
Thanks
Add PromptGuard integration
-------
There are two approaches to integrate PromptGuard with a LangChain
application.
1. PromptGuardLLMWrapper
2. functions that can be used in LangChain expression.
-----
- Dependencies
`promptguard` python package, which is a runtime requirement if you'd
try out the demo.
- @baskaryan @hwchase17 Thanks for the ideas and suggestions along the
development process.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
### Description
When loading documents with `ConfluenceLoader.load`, if both
`include_comments=True` and `keep_markdown_format=True` are set, we get
the error `NameError: free variable 'BeautifulSoup' referenced before
assignment in enclosing scope`.
```py
loader = ConfluenceLoader(url="URI", token="TOKEN")
documents = loader.load(
    space_key="SPACE",
    include_comments=True,
    keep_markdown_format=True,
)
```
This happens because the previous imports only considered the
`keep_markdown_format` parameter; however, including the comments also
uses `BeautifulSoup`.
This is now fixed to handle all four scenarios, considering both
`include_comments` and `keep_markdown_format`.
### Twitter
`@SathinduGA`
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: Allows the user of `ConfluenceLoader` to pass a
`requests.Session` object in lieu of an authentication mechanism
- Issue: None
- Dependencies: None
- Tag maintainer: @hwchase17
- Improved docs
- Improved performance in multiple ways through batching, threading,
etc.
- fixed error message
- Added support for metadata filtering during similarity search.
@baskaryan PTAL
[Epsilla](https://github.com/epsilla-cloud/vectordb) vectordb is an
open-source vector database that leverages the advanced academic
parallel graph traversal techniques for vector indexing.
This PR adds basic integration with
[pyepsilla](https://github.com/epsilla-cloud/epsilla-python-client)(Epsilla
vectordb python client) as a vectorstore.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
The package is linted with mypy, so its type hints are correct and
should be exposed publicly. Without this file, the type hints remain
private and cannot be used by downstream users of the package.
- Description: Updated marqo integration to use tensor_fields instead of
non_tensor_fields. Upgraded marqo version to 1.2.4
- Dependencies: marqo 1.2.4
---------
Co-authored-by: Raynor Kirkson E. Chavez <raynor.chavez@192.168.254.171>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: support [ERNIE
Embedding-V1](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu),
which is part of the ERNIE ecosystem
- Issue: None
- Dependencies: None
- Tag maintainer: @baskaryan
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: Changed metadata retrieval so that it combines Vectara
doc level and part level metadata
- Tag maintainer: @rlancemartin
- Twitter handle: @ofermend
**Description**:
- Unified the currently valid suffixes (file formats) for loading agents
from hubs and files (to better handle future additions);
- Clarified exception messages (also in unit test).
@rlancemartin The current implementation within `Geopandas.GeoDataFrame`
loader uses the python builtin `str()` function on the input geometries.
While this looks very close to WKT (Well known text), Python's str
function doesn't guarantee that.
In the interest of interop, I've changed to the use of the `wkt`
property on the Shapely geometries for generating the text
representation of the geometries.
Also, included here:
- validation of the input `page_content_column` as being a GeoSeries.
- geometry `crs` (Coordinate Reference System) / bounds
(xmin/ymin/xmax/ymax) added to Document metadata. Having the CRS is
critical... having the bounds is just helpful!
I think there is a larger question of "Should the geometry live in the
`page_content`, or should the record be better summarized and tuck the
geom into metadata?" ...something for another day and another PR.
This is an extension of #8104. I updated some of the signatures so all
the tests pass.
@danhnn I couldn't commit to your PR, so I created a new one. Thanks for
your contribution!
@baskaryan Could you please merge it?
---------
Co-authored-by: Danh Nguyen <dnncntt@gmail.com>
### Summary
Fixes a bug from #7850 where post-processing functions in Unstructured
loaders were not applied. Adds an assertion to the test to verify that
the post-processing function was applied, and also updates the
explanation in the example notebook.
Issue: https://github.com/langchain-ai/langchain/issues/9401
In the Async mode, SequentialChain implementation seems to run the same
callbacks over and over since it is re-using the same callbacks object.
Langchain version: 0.0.264, master
The implementation of the async route differs from the sync route. The
sync approach follows the right pattern of generating a new callbacks
object for each step instead of re-using the old one, thus avoiding the
cascading run of callbacks at each step.
Async mode:
```
_run_manager = run_manager or AsyncCallbackManagerForChainRun.get_noop_manager()
callbacks = _run_manager.get_child()
...
for i, chain in enumerate(self.chains):
    _input = await chain.arun(_input, callbacks=callbacks)
    ...
```
Regular mode:
```
_run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
for i, chain in enumerate(self.chains):
    _input = chain.run(_input, callbacks=_run_manager.get_child(f"step_{i+1}"))
    ...
```
Notice how we are reusing the callbacks object in the Async code which
will have a cascading effect as we run through the chain. It runs the
same callbacks over and over resulting in issues.
Solution:
Define the async function in the same pattern as the regular one and
add tests.
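A self-contained toy sketch of that pattern, requesting a fresh child per step inside the loop. `ToyRunManager`, `toy_step` and `run_sequential` are hypothetical stand-ins, not LangChain classes:

```python
import asyncio


class ToyRunManager:
    """Hypothetical stand-in for a run manager; records each child it creates."""

    def __init__(self):
        self.children = []

    def get_child(self, tag=None):
        child = {"tag": tag, "events": []}
        self.children.append(child)
        return child


async def toy_step(value, callbacks):
    # Each step records on the callbacks object it was handed.
    callbacks["events"].append(f"ran on {value}")
    return value + 1


async def run_sequential(steps, value, run_manager):
    for i, step in enumerate(steps):
        # A new child per step, mirroring the sync implementation.
        value = await step(value, callbacks=run_manager.get_child(f"step_{i+1}"))
    return value


manager = ToyRunManager()
result = asyncio.run(run_sequential([toy_step, toy_step, toy_step], 0, manager))
```

Each child ends up with exactly one recorded event, whereas a single shared object would accumulate events from every step.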
---------
Co-authored-by: vamsee_yarlagadda <vamsee.y@airbnb.com>
📜
- updated the top-level descriptions to a consistent format;
- changed the format of several 100% internal functions from "name" to
"_name". So, these functions are not shown in the Top-level API
Reference page (with lists of classes/functions)
Refactored code to ensure consistent handling of ImportError. Replaced
instances of raising ValueError with raising ImportError.
The choice of raising a ValueError here is somewhat unconventional and
might lead to confusion for anyone reading the code. Typically, when
dealing with import-related errors, the recommended approach is to raise
an ImportError with a descriptive message explaining the issue. This
provides a clearer indication that the problem is related to importing
the required module.
@hwchase17 , @baskaryan , @eyurtsev
Thanks
Aashish
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
This PR fills in more missing type annotations on pydantic models.
It's OK if it missed some annotations, we just don't want it to get
annotations wrong at this stage.
I'll do a few more passes over the same files!
This PR fixes the Airbyte loaders when doing incremental syncs. The
notebooks are calling out to access `loader.last_state` to get the
current state of incremental syncs, but this didn't work due to a
refactoring of how the loaders are structured internally in the original
PR.
This PR fixes the issue by adding a `last_state` property that forwards
the state correctly from the CDK adapter.
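A toy sketch of a forwarding property, assuming the loader wraps an adapter object that tracks the state. `ToyAdapter`, `ToyLoader` and the state shape are hypothetical; only the `last_state` name comes from the description:

```python
class ToyAdapter:
    """Hypothetical stand-in for the CDK adapter that tracks sync state."""

    def __init__(self):
        self.last_state = None

    def read(self, records):
        for position, record in enumerate(records):
            self.last_state = {"cursor": position}  # made-up state shape
            yield record


class ToyLoader:
    """Hypothetical loader that delegates reading to the adapter."""

    def __init__(self):
        self._adapter = ToyAdapter()

    def load(self, records):
        return list(self._adapter.read(records))

    @property
    def last_state(self):
        # Forward the adapter's state so callers can use `loader.last_state`.
        return self._adapter.last_state


loader = ToyLoader()
docs = loader.load(["a", "b", "c"])
```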
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
## Type:
Improvement
---
## Description:
Running QAWithSourcesChain sometimes raises ValueError as mentioned in
issue #7184:
```
ValueError: too many values to unpack (expected 2)
Traceback:
response = qa({"question": pregunta}, return_only_outputs=True)
File "C:\Anaconda3\envs\iagen_3_10\lib\site-packages\langchain\chains\base.py", line 166, in __call__
raise e
File "C:\Anaconda3\envs\iagen_3_10\lib\site-packages\langchain\chains\base.py", line 160, in __call__
self._call(inputs, run_manager=run_manager)
File "C:\Anaconda3\envs\iagen_3_10\lib\site-packages\langchain\chains\qa_with_sources\base.py", line 132, in _call
answer, sources = re.split(r"SOURCES:\s", answer)
```
This is due to the LLM generating a subsequent question, answer and
sources after the first answer, so the completion has a form similar to
the one below:
```
<final_answer>
SOURCES: <sources>
QUESTION: <new_or_repeated_question>
FINAL ANSWER: <new_or_repeated_final_answer>
SOURCES: <new_or_repeated_sources>
```
This causes the following line
```
re.split(r"SOURCES:\s", answer)
```
to return more than 2 elements, resulting in a ValueError. The simple
fix is to also split on "QUESTION:\s" and take the first two elements:
```
answer, sources = re.split(r"SOURCES:\s|QUESTION:\s", answer)[:2]
```
Sometimes the LLM might also generate other text, such as alternative
answers in the form:
```
<final_answer_1>
SOURCES: <sources>
<final_answer_2>
SOURCES: <sources>
<final_answer_3>
SOURCES: <sources>
```
In such cases it is best to split the previously obtained sources on
newlines and keep the first line:
```
sources = re.split(r"\n", sources.lstrip())[0]
```
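Putting the two splits together, here is a small runnable sketch. `split_answer_and_sources` is a hypothetical helper name, the sample completion is made up, and the fallback for output without a `SOURCES:` marker is an assumption of this sketch:

```python
import re
from typing import Tuple


def split_answer_and_sources(text: str) -> Tuple[str, str]:
    """Split an LLM completion into (answer, sources) using the two fixes."""
    if not re.search(r"SOURCES:\s", text):
        return text, ""  # assumed fallback: no sources section at all
    # Keep only the first answer and the first sources segment.
    answer, sources = re.split(r"SOURCES:\s|QUESTION:\s", text)[:2]
    # Drop anything after the first line of the sources segment.
    sources = re.split(r"\n", sources.lstrip())[0]
    return answer, sources


completion = (
    "Paris is the capital.\n"
    "SOURCES: doc1, doc2\n"
    "QUESTION: What about Rome?\n"
    "FINAL ANSWER: Rome.\n"
    "SOURCES: doc3"
)
answer, sources = split_answer_and_sources(completion)
```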
---
## Issue:
Resolves #7184
---
## Maintainer:
@baskaryan
A quick change to allow the output key of create_openai_fn_chain to
optionally be changed.
@baskaryan
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: Added improvements in Nebula LLM to perform auto-retry;
more generation parameters supported. Conversation is no longer required
to be passed in the LLM object. Examples are updated.
- Issue: N/A
- Dependencies: N/A
- Tag maintainer: @baskaryan
- Twitter handle: symbldotai
---------
Co-authored-by: toshishjawale <toshish@symbl.ai>
Update documentation and URLs for the Langchain Context integration.
We've moved from getcontext.ai to context.ai \o/
Thanks in advance for the review!
* PR updates test.yml to test with both pydantic versions
* Code should be refactored to make it easier to do testing in matrix
format w/ packages
* Added steps to assert that pydantic version in the environment is as
expected
Now that the ElasticsearchStore VectorStore is merged, I've added
support for the self-query retriever.
I've also added a notebook to demonstrate the capability, as well as
unit tests.
**Credit**
@elastic and @phoey1 on twitter.
# Poetry updates
This PR updates LangChain's poetry file to remove
any dependencies that aren't pydantic v2 compatible yet.
All packages remain usable under pydantic v1, and can be installed
separately.
## Bumping the following packages:
* langsmith
## Removing the following packages
not used in extended unit-tests:
* zep-python, anthropic, jina, spacy, steamship, betabageldb
not used at all:
* octoai-sdk
Cleaning up extras for removed packages.
## Snapshots updated
Some snapshots had to be updated due to a change in the data model in
langsmith. RunType used to be Union of Enum and string and was changed
to be string only.
This PR adds serialization support for protocol buffers in
`WandbTracer`. This allows code generation chains to be visualized.
Additionally, it fixes a minor bug where the settings are not
honored when a run is initialized before using the `WandbTracer`.
@agola11
---------
Co-authored-by: Bharat Ramanathan <ramanathan.parameshwaran@gohuddl.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Todo:
- [x] Connection options (cloud, localhost url, es_connection) support
- [x] Logging support
- [x] Customisable field support
- [x] Distance Similarity support
- [x] Metadata support
- [x] Metadata Filter support
- [x] Retrieval Strategies
- [x] Approx
- [x] Approx with Hybrid
- [x] Exact
- [x] Custom
- [x] ELSER (excluding hybrid as we are working on RRF support)
- [x] integration tests
- [x] Documentation
👋 This is a contribution to improve the Elasticsearch integration with
LangChain. It's based loosely on the changes that are in master, but
with some notable changes:
## Package name & design improvements
The import name is now `ElasticsearchStore`, to aid discoverability of
the VectorStore.
```py
## Before
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch, ElasticKnnSearch
## Now
from langchain.vectorstores.elasticsearch import ElasticsearchStore
```
## Retrieval Strategy support
Before we had a number of classes, depending on the strategy you wanted.
`ElasticKnnSearch` for approx, `ElasticVectorSearch` for exact / brute
force.
With `ElasticsearchStore` we have retrieval strategies:
### Approx Example
Default strategy for the vast majority of developers who use
Elasticsearch will be inferring the embeddings from outside of
Elasticsearch. Uses KNN functionality of _search.
```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index"
)
output = docsearch.similarity_search("foo", k=1)
```
### Approx, with hybrid
For developers who want to search using both the embedding and the text
BM25 match. It's simple to enable.
```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True)
)
output = docsearch.similarity_search("foo", k=1)
```
### Approx, with `query_model_id`
For developers who want to infer within Elasticsearch, using the model
loaded in the ML node.
This relies on the developer to set up the pipeline and index if they
wish to embed the text in Elasticsearch. There is an example of this in
the test.
```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(
        query_model_id="sentence-transformers__all-minilm-l6-v2"
    ),
)
output = docsearch.similarity_search("foo", k=1)
```
### I want to provide my own custom Elasticsearch Query
You might want to have more control over the query, to perform
multi-phase retrieval such as LTR, linearly boosting on document
parameters like recently updated or geo-distance. You can do this with
`custom_query_fn`
```py
def my_custom_query(query_body: dict, query: str) -> dict:
    # Ignore the generated query body and return a hand-written match query.
    return {"query": {"match": {"text": {"query": "bar"}}}}

texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts, FakeEmbeddings(), **elasticsearch_connection, index_name=index_name
)
docsearch.similarity_search("foo", k=1, custom_query=my_custom_query)
```
### Exact Example
For developers who have a small dataset in Elasticsearch and don't want
the cost of indexing the dims, accepting a higher cost at query time
as the tradeoff. Uses script_score.
```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ExactRetrievalStrategy(),
)
output = docsearch.similarity_search("foo", k=1)
```
### ELSER Example
Elastic provides its own sparse vector model called ELSER. With these
changes, it's really easy to use. The vector store creates a pipeline
and index that are set up for ELSER. All the developer needs to do is
configure, ingest and query via LangChain tooling.
```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.SparseVectorStrategy(),
)
output = docsearch.similarity_search("foo", k=1)
```
## Architecture
In the future, we can introduce new strategies without breaking
backwards compatibility as we evolve the index / query strategy.
## Credit
On release, could you credit @elastic and @phoey1 please? Thank you!
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Updated prompts for the MultiOn toolkit for better functionality
- Non-blocking but good to have it merged to improve the overall
performance for the toolkit
@hinthornw @hwchase17
---------
Co-authored-by: Naman Garg <ngarg3@binghamton.edu>
Add the ability to track LangChain usage for Rockset. Rockset's new
Python client allows setting this. To prevent old clients from failing,
the exception is ignored if the setting throws (we can't track old
versions).
Tested locally with the old and new Rockset Python clients.
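A toy sketch of the "try to set it, ignore failures" approach. The two client classes and the `set_application` method name are assumptions for illustration, not a statement of the Rockset client's API:

```python
class NewClient:
    """Hypothetical client that supports tagging requests with an app name."""

    def __init__(self):
        self.application = None

    def set_application(self, name):
        self.application = name


class OldClient:
    """Hypothetical older client without that capability."""


def try_enable_tracking(client, name="langchain"):
    """Return True if tracking was enabled, False if the client lacks support."""
    try:
        client.set_application(name)  # assumed method name
    except Exception:
        # Old clients cannot be tracked; carry on instead of failing.
        return False
    return True


new_client = NewClient()
enabled_new = try_enable_tracking(new_client)
enabled_old = try_enable_tracking(OldClient())
```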
cc @baskaryan
2 things:
- Implement the private method rather than the public one so callbacks
are handled properly
- Add search_kwargs (Open to not adding this if we are trying to
deprecate this UX but seems like as a user i'd assume similar args to
the vector store retriever. In fact some may assume this implements the
same interface but I'm not dealing with that here)