- Description:
- When `keep_separator` is `True` the `_split_text_with_regex()` method
in `text_splitter` uses regex to split, but when `keep_separator` is
`False` it uses `str.split()`. This causes problems when the separator
is a special regex character like `.` or `*`. This PR fixes that by
using `re.split()` in both cases.
- Issue: #7262
- Tag maintainer: @baskaryan
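For illustration, here is a minimal sketch of the unified behaviour. It is not the actual `text_splitter` code: the function name, the way separators are re-attached, and the empty-chunk filtering are assumptions. It treats the separator as a regex pattern in both branches, so a literal separator such as `.` has to be escaped by the caller.

```python
import re
from typing import List


def split_text_with_regex(text: str, separator: str, keep_separator: bool) -> List[str]:
    """Split `text` on the regex `separator` (hypothetical helper, not library code)."""
    if not separator:
        # No separator: fall back to one chunk per character.
        return list(text)
    if keep_separator:
        # A capture group makes re.split() return the separators as well:
        # [chunk, sep, chunk, sep, chunk, ...]
        parts = re.split(f"({separator})", text)
        # Attach each separator to the chunk that follows it.
        splits = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
    else:
        # Same regex engine as above, just without the capture group.
        splits = re.split(separator, text)
    return [chunk for chunk in splits if chunk != ""]


# A literal "." must be escaped, otherwise it matches every character.
print(split_text_with_regex("a.b.c", re.escape("."), keep_separator=False))
print(split_text_with_regex("a.b.c", re.escape("."), keep_separator=True))
```

Because both branches go through `re.split()`, the same separator string means the same thing regardless of `keep_separator`.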
### Description
This pull request introduces the "Cube Semantic Layer" document loader,
which demonstrates the retrieval of Cube's data model metadata in a
format suitable for passing to LLMs as embeddings. This enhancement aims
to provide contextual information and improve the understanding of data.
Twitter handle:
@the_cube_dev
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
- Description: Allow `InMemoryDocstore` to be created without passing a
dict to the constructor; the constructor can create a dict at runtime if
one isn't provided.
- Tag maintainer: @dev2049
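A minimal sketch of the idea, assuming a simplified stand-in class (the class name, the `_dict` parameter name, and the `add`/`search` methods here are illustrative and not copied from the library):

```python
from typing import Dict, Optional


class InMemoryDocstoreSketch:
    """Simplified stand-in for an in-memory docstore (hypothetical)."""

    def __init__(self, _dict: Optional[Dict[str, str]] = None) -> None:
        # Create a fresh dict per instance when none is passed in.
        # Using `None` as the default avoids sharing one mutable default dict.
        self._dict: Dict[str, str] = _dict if _dict is not None else {}

    def add(self, texts: Dict[str, str]) -> None:
        self._dict.update(texts)

    def search(self, key: str) -> str:
        return self._dict.get(key, f"ID {key} not found.")


store = InMemoryDocstoreSketch()  # no dict required
store.add({"doc-1": "hello"})
print(store.search("doc-1"))
```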
## Description
Added Office365 tool modules to `__init__.py` files
## Issue
As described in Issue
https://github.com/hwchase17/langchain/issues/6936, the Office365
toolkit can't be loaded easily because it is not included in the
`__init__.py` files.
## Reviewer
@dev2049
- Description: If there are missing or extra variables when validating
a Jinja2 template, a warning is now issued rather than an exception
being raised. This gives the developer more flexibility, as described
in #7044. The relevant test was also changed so that pytest checks for
raised warnings rather than exceptions.
- Issue: #7044
- Tag maintainer: @hwchase17, @baskaryan
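A rough sketch of the warn-instead-of-raise behaviour, under stated assumptions: the function name is hypothetical, and the regex below is a simplified stand-in for real Jinja2 parsing that only recognises bare `{{ name }}` placeholders.

```python
import re
import warnings
from typing import List

# Simplified stand-in for Jinja2 parsing: matches only `{{ name }}` placeholders.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def validate_template_variables(template: str, input_variables: List[str]) -> None:
    """Warn (rather than raise) when declared and used variables differ."""
    used = set(_PLACEHOLDER.findall(template))
    declared = set(input_variables)
    missing = declared - used  # declared but never used in the template
    extra = used - declared    # used in the template but never declared
    problems = []
    if missing:
        problems.append(f"Missing variables: {sorted(missing)}")
    if extra:
        problems.append(f"Extra variables: {sorted(extra)}")
    if problems:
        warnings.warn(" ".join(problems))


validate_template_variables("Hello {{ name }}", ["name"])  # no warning
```

In a test, the warning can be checked with `pytest.warns(UserWarning)` in place of `pytest.raises(...)`.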
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
**Description**:
The JSON Lines format is used by some services such as OpenAI and
HuggingFace. It's also a convenient alternative to CSV.
This PR adds JSON Lines support to `JSONLoader` and also updates related
tests.
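For readers unfamiliar with the format: JSON Lines stores one JSON value per line. A minimal parsing sketch using only the standard library (the helper name is hypothetical and this is not the `JSONLoader` implementation):

```python
import json
from typing import Any, Iterator


def iter_json_lines(text: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


sample = '{"text": "first"}\n{"text": "second"}\n'
for record in iter_json_lines(sample):
    print(record["text"])
```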
**Tag maintainer**: @rlancemartin, @eyurtsev.
P.S. I was not able to build the docs locally, so I didn't update the
related section.
should be no functional changes
also keep __init__ exposing a lot for backwards compat
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Handle the new retriever events in a way that (I think) is entirely
backwards compatible? Needs more testing for some of the chain changes
and all.
This creates an entire new run type, however. We could also just treat
this as an event within a chain run presumably (same with memory)
Adds a subclass initializer that upgrades old retriever implementations
to the new schema, along with tests to ensure they work.
First commit doesn't upgrade any of our retriever implementations (to
show that we can pass the tests along with additional ones testing the
upgrade logic).
Second commit upgrades the known universe of retrievers in langchain.
- [X] Add callback handling methods for retriever start/end/error (open
to renaming to 'retrieval' if you want that)
- [X] Update BaseRetriever schema to support callbacks
- [X] Tests for upgrading old "v1" retrievers for backwards
compatibility
- [X] Update existing retriever implementations to implement the new
interface
- [X] Update calls within chains to `(a)get_relevant_documents` to pass
the child callback manager
- [X] Update the notebooks/docs to reflect the new interface
- [X] Test notebooks thoroughly
Not handled:
- Memory pass throughs: retrieval memory doesn't have a parent callback
manager passed through the method
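A rough sketch of how a subclass initializer can upgrade old-style retrievers. All names here are hypothetical, and callbacks are modelled as a plain list of event tuples instead of a real callback manager. An old "v1" subclass overrides the public method directly; `__init_subclass__` moves that override to the private hook and restores the public wrapper that emits start/end/error events.

```python
from typing import Any, List, Optional


class BaseRetrieverSketch:
    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Old-style subclasses override the public method directly.
        if cls.get_relevant_documents is not BaseRetrieverSketch.get_relevant_documents:
            old_impl = cls.get_relevant_documents

            def _hook(self, query: str, *, run_manager: Optional[list] = None) -> List[str]:
                # The old implementation knows nothing about callbacks.
                return old_impl(self, query)

            cls._get_relevant_documents = _hook
            # Restore the wrapper so events are still emitted.
            cls.get_relevant_documents = BaseRetrieverSketch.get_relevant_documents

    def _get_relevant_documents(self, query: str, *, run_manager: Optional[list] = None) -> List[str]:
        raise NotImplementedError

    def get_relevant_documents(self, query: str, callbacks: Optional[list] = None) -> List[str]:
        events = callbacks if callbacks is not None else []
        events.append(("start", query))
        try:
            docs = self._get_relevant_documents(query, run_manager=events)
        except Exception as exc:
            events.append(("error", repr(exc)))
            raise
        events.append(("end", len(docs)))
        return docs


class OldStyleRetriever(BaseRetrieverSketch):
    # "v1" signature: no callbacks, overrides the public method.
    def get_relevant_documents(self, query: str) -> List[str]:
        return [query.upper()]


events: list = []
print(OldStyleRetriever().get_relevant_documents("hi", callbacks=events), events)
```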
---------
Co-authored-by: Nuno Campos <nuno@boringbits.io>
Co-authored-by: William Fu-Hinthorn <13333726+hinthornw@users.noreply.github.com>
If you create a dataset from runs and run the same chain or llm on it
later, it usually works great.
If you have an agent dataset and want to run a different agent on it, or
have a more complex schema, it's hard for us to automatically map these
values every time. This PR lets you pass in an `input_mapper` function
that converts the example inputs to whatever format your model expects.
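As a hedged illustration of what such a mapper might look like: the dataset schema (`question`), the model's expected key (`input`), and the `run_on_examples` helper below are all assumptions made up for this sketch, not the actual evaluation API.

```python
from typing import Any, Callable, Dict, List, Optional


def input_mapper(example_inputs: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical schema: the dataset stores "question", the agent expects "input".
    return {"input": example_inputs["question"]}


def run_on_examples(
    model: Callable[[Dict[str, Any]], Any],
    examples: List[Dict[str, Any]],
    input_mapper: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,
) -> List[Any]:
    """Toy runner: apply the mapper (if any) before calling the model."""
    results = []
    for example in examples:
        inputs = input_mapper(example) if input_mapper else example
        results.append(model(inputs))
    return results


echo_agent = lambda inputs: inputs["input"].upper()  # stand-in for a real agent
print(run_on_examples(echo_agent, [{"question": "hi"}], input_mapper=input_mapper))
```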
### Scientific Article PDF Parsing via Grobid
`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text, etc. The GrobidParser
uses a local Grobid server to return PDF documents as XML and parses the
XML to optionally produce documents of individual sentences or of whole
paragraphs. Metadata includes the text, paragraph number, PDF-relative
bboxes, pages (text may span two pages), section title
(Introduction, Methodology, etc.), section_number (e.g. 1.1, 2.3), the
title of the paper and finally the file path.
Grobid parsing is useful beyond standard PDF parsing because it
accurately outputs sections and the paragraphs within them. This allows
for post-filtering of results for specific sections, e.g. limiting
results to the methodology or results section. While sections are split
via headings, ideally they could be classified specifically into
introduction, methodology, results, discussion, conclusion. I'm
currently experimenting with chatgpt-3.5 for this function, which could
later be implemented as a textsplitter.
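The post-filtering idea can be sketched as follows. This assumes each parsed document is a plain dict with a `metadata` dict, and that the section title lives under a `section_title` key; both are assumptions for illustration, not the parser's confirmed output format.

```python
from typing import Any, Dict, List


def filter_by_section(documents: List[Dict[str, Any]], section_title: str) -> List[Dict[str, Any]]:
    """Keep only documents whose (assumed) `section_title` metadata matches."""
    wanted = section_title.lower()
    return [
        doc
        for doc in documents
        if doc["metadata"].get("section_title", "").lower() == wanted
    ]


docs = [
    {"text": "We sampled 40 papers.", "metadata": {"section_title": "Methodology"}},
    {"text": "Prior work exists.", "metadata": {"section_title": "Introduction"}},
]
print(filter_by_section(docs, "methodology"))
```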
`Dependencies:`
To use it, the Grobid repo must be cloned and Java must be installed.
For Colab, this is:
```
import os

!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```
Once installed, the server is run on localhost:8070 via:
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
@rlancemartin, @eyurtsev
Twitter Handle: @Corranmac
Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
- Description: Add Async functionality to Zapier NLA Tools
- Issue: n/a
- Dependencies: n/a
- Tag maintainer: @vowelparrot (Agents / Tools / Toolkits), @agola11
(Async)
Description: When a 401 response is given back by Zapier, hint to the
end user why that may have occurred
- If an API Key was initialized with the wrapper, ask them to check
their API Key value
- if an access token was initialized with the wrapper, ask them to check
their access token or verify that it doesn't need to be refreshed.
Tag maintainer: @dev2049
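A minimal sketch of the hinting logic, assuming a hypothetical helper name and message wording; which credential takes precedence when both are set is also an assumption here.

```python
from typing import Optional


def unauthorized_hint(api_key: Optional[str], access_token: Optional[str]) -> str:
    """Build a hint for a 401 response (hypothetical helper, wording is illustrative)."""
    if access_token:
        return (
            "Received 401 from Zapier. Please check your access token, "
            "or verify that it does not need to be refreshed."
        )
    if api_key:
        return "Received 401 from Zapier. Please check your API key value."
    return "Received 401 from Zapier. No credentials were provided."


print(unauthorized_hint(api_key="sk-example", access_token=None))
```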
#### Summary
A new approach to loading source code is implemented:
Each top-level function and class in the code is loaded into separate
documents. Then, an additional document is created with the top-level
code, but without the already loaded functions and classes.
This could improve the accuracy of QA chains over source code.
For instance, having this script:
```
class MyClass:
    def __init__(self, name):
        self.name = name
    def greet(self):
        print(f"Hello, {self.name}!")
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
if __name__ == '__main__':
    main()
```
The loader will create three documents with this content:
First document:
```
class MyClass:
    def __init__(self, name):
        self.name = name
    def greet(self):
        print(f"Hello, {self.name}!")
```
Second document:
```
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
```
Third document:
```
# Code for: class MyClass:
# Code for: def main():
if __name__ == '__main__':
    main()
```
A threshold parameter is added to control whether small scripts are
split in this way or not.
At this moment, only Python and JavaScript are supported. The
appropriate parser is determined by examining the file extension.
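The segmentation described above can be sketched for Python sources with the standard `ast` module. This is an illustrative sketch, not the loader's actual code: the function name is hypothetical, decorators are ignored, the threshold is left out, and `end_lineno` requires Python 3.8+.

```python
import ast
from typing import List, Optional


def segment_python_source(source: str) -> List[str]:
    """Return one document per top-level function/class, plus the simplified remainder."""
    tree = ast.parse(source)
    lines = source.splitlines()
    simplified: List[Optional[str]] = list(lines)
    documents: List[str] = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno - 1, node.end_lineno
            documents.append("\n".join(lines[start:end]))
            # Replace the extracted code with a one-line marker comment.
            simplified[start] = f"# Code for: {lines[start]}"
            for i in range(start + 1, end):
                simplified[i] = None
    documents.append("\n".join(line for line in simplified if line is not None))
    return documents


example = "def main():\n    print('hi')\nif __name__ == '__main__':\n    main()"
for doc in segment_python_source(example):
    print(doc, end="\n---\n")
```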
#### Tests
This PR adds:
- Unit tests
- Integration tests
#### Dependencies
Only one dependency was added as optional (needed for the JavaScript
parser).
#### Documentation
A notebook is added showing how the loader can be used.
#### Who can review?
@eyurtsev @hwchase17
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
Description: Update documentation to
1) point to updated documentation links at Zapier.com (we've revamped
our help docs and paths), and
2) clarify how to use the wrapper with an access token for OAuth
support
Demo:
Initializing the Zapier Wrapper with an OAuth Access Token
`ZapierNLAWrapper(zapier_nla_oauth_access_token="<redacted>")`
Using LangChain to resolve the current weather in Vancouver BC,
leveraging Zapier NLA to look up weather by coords.
```
> Entering new chain...
I need to use a tool to get the current weather.
Action: The Weather: Get Current Weather
Action Input: Get the current weather for Vancouver BC
Observation: {"coord__lon": -123.1207, "coord__lat": 49.2827, "weather": [{"id": 802, "main": "Clouds", "description": "scattered clouds", "icon": "03d", "icon_url": "http://openweathermap.org/img/wn/03d@2x.png"}], "weather[]icon_url": ["http://openweathermap.org/img/wn/03d@2x.png"], "weather[]icon": ["03d"], "weather[]id": [802], "weather[]description": ["scattered clouds"], "weather[]main": ["Clouds"], "base": "stations", "main__temp": 71.69, "main__feels_like": 71.56, "main__temp_min": 67.64, "main__temp_max": 76.39, "main__pressure": 1015, "main__humidity": 64, "visibility": 10000, "wind__speed": 3, "wind__deg": 155, "wind__gust": 11.01, "clouds__all": 41, "dt": 1687806607, "sys__type": 2, "sys__id": 2011597, "sys__country": "CA", "sys__sunrise": 1687781297, "sys__sunset": 1687839730, "timezone": -25200, "id": 6173331, "name": "Vancouver", "cod": 200, "summary": "scattered clouds", "_zap_search_was_found_status": true}
Thought: I now know the current weather in Vancouver BC.
Final Answer: The current weather in Vancouver BC is scattered clouds with a temperature of 71.69 and wind speed of 3
```
Notebook shows preference scoring between two chains and reports the
Wilson score interval + p-value.
I think I'll add the option to insert ground truth labels, but that
doesn't have to be in this PR.
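For reference, a minimal sketch of the Wilson score interval for a win rate. The function name and the default `z = 1.96` (roughly a 95% interval) are choices made for this sketch; the notebook's own computation may differ.

```python
import math
from typing import Tuple


def wilson_score_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion wins / n."""
    if n == 0:
        return (0.0, 1.0)  # no observations: the interval is uninformative
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half_width, center + half_width)


# Chain A preferred in 14 of 20 pairwise comparisons.
print(wilson_score_interval(14, 20))
```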
MHTML is a very interesting format since it's used both for emails and
for archived webpages. Some scraping projects want to store pages on
disk to process them later, and MHTML is perfect for that use case.
This is heavily inspired by the BeautifulSoup HTML loader, but it
extracts the HTML part from the MHTML file first.
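Since MHTML is a MIME multipart container, the HTML part can be pulled out with the standard `email` module. A minimal sketch under that assumption (the helper name is hypothetical and this is not necessarily how the loader does it):

```python
import email
from typing import Optional


def extract_html_from_mhtml(raw: str) -> Optional[str]:
    """Return the first text/html part of an MHTML document, if any."""
    message = email.message_from_string(raw)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>Hello</p></body></html>\n"
    "--BOUND--\n"
)
print(extract_html_from_mhtml(sample))
```

The returned string can then be handed to an HTML parser such as BeautifulSoup to extract the text.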
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
# Add caching to BaseChatModel
Fixes #1644
(Sidenote: While testing, I noticed we have multiple implementations of
Fake LLMs, used for testing. I consolidated them.)
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
Models
- @hwchase17
- @agola11
Twitter: [@UmerHAdil](https://twitter.com/@UmerHAdil) | Discord:
RicChilligerDude#7589
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Fixes #5456
This PR removes the `callbacks` argument from a tool's schema when
creating a `Tool` or `StructuredTool` with the `from_function` method
and `infer_schema` is set to `True`. The `callbacks` argument is now
removed in the `create_schema_from_function` and `_get_filtered_args`
methods. As suggested by @vowelparrot, this fix provides a
straightforward solution that minimally affects the existing
implementation.
A test was added to verify that this change enables the expected use of
`Tool` and `StructuredTool` when using a `CallbackManager` and inferring
the tool's schema.
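The filtering idea can be sketched with `inspect.signature`. The helper name and the `exclude` parameter below are hypothetical and only illustrate dropping `callbacks` before a schema is inferred; they are not the library's `_get_filtered_args`.

```python
import inspect
from typing import Callable, List, Sequence


def get_schema_arg_names(func: Callable, exclude: Sequence[str] = ("callbacks",)) -> List[str]:
    """Return the parameter names that should appear in the inferred schema."""
    return [name for name in inspect.signature(func).parameters if name not in exclude]


def search(query: str, limit: int = 5, callbacks=None) -> str:
    return f"{query}:{limit}"


print(get_schema_arg_names(search))
```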
- @hwchase17
A new implementation of `StreamlitCallbackHandler`. It formats Agent
thoughts into Streamlit expanders.
You can see the handler in action here:
https://langchain-mrkl.streamlit.app/
Per a discussion with Harrison, we'll be adding a
`StreamlitCallbackHandler` implementation to an upcoming
[Streamlit](https://github.com/streamlit/streamlit) release as well, and
will be updating it as we add new LLM- and LangChain-specific features
to Streamlit.
The idea with this PR is that the LangChain `StreamlitCallbackHandler`
will "auto-update" in a way that keeps it forward- (and backward-)
compatible with Streamlit. If the user has an older Streamlit version
installed, the LangChain `StreamlitCallbackHandler` will be used; if
they have a newer Streamlit version that has an updated
`StreamlitCallbackHandler`, that implementation will be used instead.
(I'm opening this as a draft to get the conversation going and make sure
we're on the same page. We're really excited to land this into
LangChain!)
#### Who can review?
@agola11, @hwchase17
Already supported in the reverse operation in
`_convert_message_to_dict()`, this just provides parity.
@hwchase17
@agola11
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
After recent changes, ChatPromptTemplate does not accept partial
variables. This PR should fix that issue.
Fixes #6431
#### Who can review?
@hwchase17
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Throw a ToolException when incorrect arguments are passed to tools so
that the agent can course correct.
# Incorrect argument count handling
I was facing an error where the agent passed incorrect arguments to
tools. As per the discussions going around, I started throwing
ToolException to allow the model to course correct.
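A minimal sketch of the pattern, assuming a locally defined `ToolException` stand-in and a hypothetical `run_tool` helper: the arguments are checked against the function signature first, and a mismatch is re-raised as a `ToolException` whose message can be fed back to the model.

```python
import inspect
from typing import Any, Callable


class ToolException(Exception):
    """Stand-in for the tool error type (defined locally for this sketch)."""


def run_tool(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    try:
        # bind() raises TypeError on missing, extra, or unknown arguments.
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError as exc:
        raise ToolException(f"Incorrect arguments for {func.__name__}: {exc}") from exc
    return func(*args, **kwargs)


def add(a: int, b: int) -> int:
    return a + b


print(run_tool(add, 1, 2))
try:
    run_tool(add, 1)
except ToolException as exc:
    print(exc)  # the agent can use this message to correct its call
```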
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
#### Before submitting
Add memory support for `OpenAIFunctionsAgent` like
`StructuredChatAgent`.
#### Who can review?
@hwchase17
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
In LangChain, all module classes are enumerated in the `__init__.py`
file of the corresponding module, but some classes were missed and were
not included in the module `__init__.py`.
This PR:
- added the missed classes to the module `__init__.py` files
- sorted the `__all__` variable (a list of the class names) in the
`__init__.py` files
- renamed `langchain.tools.sql_database.tool.QueryCheckerTool` to
`QuerySQLCheckerTool` because it conflicted with
`langchain.tools.spark_sql.tool.QueryCheckerTool`
- changes to `pyproject.toml`:
  - added `pgvector` to `pyproject.toml:extended_testing`
  - added `pandas` to
`pyproject.toml:[tool.poetry.group.test.dependencies]`
- commented out `streamlit` in `callbacks/__init__.py`, because
`streamlit` now requires Python >=3.7, !=3.9.7
- fixed duplicate names in `tools`
- fixed the corresponding unit tests
#### Who can review?
@hwchase17
@dev2049
## Add Solidity programming language support for code splitter.
Twitter: @0xjord4n_
#### Who can review?
Tag maintainers/contributors who might be interested:
@hwchase17