# Fix DeepLake Overwrite Flag Issue
Fixes Issue #4682: essentially, setting overwrite to False in the
DeepLake constructor still triggers an overwrite, because the logic is
just checking for the presence of "overwrite" in kwargs. The fix is
simple--just add some checks to inspect if "overwrite" in kwargs AND
kwargs["overwrite"]==True.
Added a new test in
tests/integration_tests/vectorstores/test_deeplake.py to reflect the
desired behavior.
Co-authored-by: Anirudh Suresh <ani@Anirudhs-MBP.cable.rcn.com>
Co-authored-by: Anirudh Suresh <ani@Anirudhs-MacBook-Pro.local>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Add summarization task type for HuggingFace APIs
Add summarization task type for HuggingFace APIs.
This task type is described by [HuggingFace inference
API](https://huggingface.co/docs/api-inference/detailed_parameters#summarization-task)
My project utilizes LangChain to connect multiple LLMs, including
various HuggingFace models that support the summarization task.
Integrating this task type is highly convenient and beneficial.
Fixes#4720
# Add GraphQL Query Support
This PR introduces a GraphQL API Wrapper tool that allows LLM agents to
query GraphQL databases. The tool utilizes the httpx and gql Python
packages to interact with GraphQL APIs and provides a simple interface
for running queries with LLM agents.
@vowelparrot
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
### Adds a document loader for Docugami
Specifically:
1. Adds a data loader that talks to the [Docugami](http://docugami.com)
API to download processed documents as semantic XML
2. Parses the semantic XML into chunks, with additional metadata
capturing chunk semantics
3. Adds a detailed notebook showing how you can use additional metadata
returned by Docugami for techniques like the [self-querying
retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html)
4. Adds an integration test, and related documentation
Here is an example of a result that is not possible without the
capabilities added by Docugami (from the notebook):
<img width="1585" alt="image"
src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b">
---------
Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>
# Improve video_id extraction in `YoutubeLoader`
`YoutubeLoader.from_youtube_url` can only deal with one specific url
format. I've introduced `YoutubeLoader.extract_video_id` which can
extract video id from common YT urls.
Fixes#4451
@eyurtsev
---------
Co-authored-by: Kamil Niski <kamil.niski@gmail.com>
# Respect User-Specified User-Agent in WebBaseLoader
This pull request modifies the `WebBaseLoader` class initializer from
the `langchain.document_loaders.web_base` module to preserve any
User-Agent specified by the user in the `header_template` parameter.
Previously, even if a User-Agent was specified in `header_template`, it
would always be overridden by a random User-Agent generated by the
`fake_useragent` library.
With this change, if a User-Agent is specified in `header_template`, it
will be used. Only in the case where no User-Agent is specified will a
random User-Agent be generated and used. This provides additional
flexibility when using the `WebBaseLoader` class, allowing users to
specify their own User-Agent if they have a specific need or preference,
while still providing a reasonable default for cases where no User-Agent
is specified.
This change has no impact on existing users who do not specify a
User-Agent, as the behavior in this case remains the same. However, for
users who do specify a User-Agent, their choice will now be respected
and used for all subsequent requests made using the `WebBaseLoader`
class.
Fixes#4167
## Before submitting
============================= test session starts
==============================
collecting ... collected 1 item
test_web_base.py::TestWebBaseLoader::test_respect_user_specified_user_agent
============================== 1 passed in 3.64s
===============================
PASSED [100%]
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested: @eyurtsev
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
[OpenWeatherMapAPIWrapper](f70e18a5b3/docs/modules/agents/tools/examples/openweathermap.ipynb)
works wonderfully, but the _tool_ itself can't be used in master branch.
- added OpenWeatherMap **tool** to the public api, to be loadable with
`load_tools` by using "openweathermap-api" tool name (that name is used
in the existing
[docs](aff33d52c5/docs/modules/agents/tools/getting_started.md),
at the bottom of the page)
- updated OpenWeatherMap tool's **description** to make the input format
match what the API expects (e.g. `London,GB` instead of `'London,GB'`)
- added [ecosystem documentation page for
OpenWeatherMap](f9c41594fe/docs/ecosystem/openweathermap.md)
- added tool usage example to [OpenWeatherMap's
notebook](f9c41594fe/docs/modules/agents/tools/examples/openweathermap.ipynb)
Let me know if there's something I missed or something needs to be
updated! Or feel free to make edits yourself if that makes it easier for
you 🙂
Currently, all Zapier tools are built using the pre-written base Zapier
prompt. These small changes (that retain default behavior) will allow a
user to create a Zapier tool using the ZapierNLARunTool while providing
their own base prompt.
Their prompt must contain input fields for zapier_description and
params, checked and enforced in the tool's root validator.
An example of when this may be useful: user has several, say 10, Zapier
tools enabled. Currently, the long generic default Zapier base prompt is
attached to every single tool, using an extreme number of tokens for no
real added benefit (repeated). User prompts LLM on how to use Zapier
tools once, then overrides the base prompt.
Or: user has a few specific Zapier tools and wants to maximize their
success rate. So, user writes prompts/descriptions for those tools
specific to their use case, and provides those to the ZapierNLARunTool.
A consideration - this is the simplest way to implement this I could
think of... though ideally custom prompting would be possible at the
Toolkit level as well. For now, this should be sufficient in solving the
concerns outlined above.
The error in #4087 was happening because of the use of csv.Dialect.*
which is just an empty base class. we need to make a choice on what is
our base dialect. I usually use excel so I put it as excel, if
maintainers have other preferences do let me know.
Open Questions:
1. What should be the default dialect?
2. Should we rework all tests to mock the open function rather than the
csv.DictReader?
3. Should we make a separate input for `dialect` like we have for
`encoding`?
---------
Co-authored-by: = <=>
### Refactor the BaseTracer
- Remove the 'session' abstraction from the BaseTracer
- Rename 'RunV2' object(s) to be called 'Run' objects (Rename previous
Run objects to be RunV1 objects)
- Ditto for sessions: TracerSession*V2 -> TracerSession*
- Remove now deprecated conversion from v1 run objects to v2 run objects
in LangChainTracerV2
- Add conversion from v2 run objects to v1 run objects in V1 tracer
## Change Chain argument in client to accept a chain factory
The `run_over_dataset` functionality seeks to treat each iteration of an
example as an independent trial.
Chains have memory, so it's easier to permit this type of behavior if we
accept a factory method rather than the chain object directly.
There's still corner cases / UX pains people will likely run into, like:
- Caching may cause issues
- if memory is persisted to a shared object (e.g., same redis queue) ,
this could impact what is retrieved
- If we're running the async methods with concurrency using local
models, if someone naively instantiates the chain and loads each time,
it could lead to tons of disk I/O or OOM
# Adds testing options to pytest
This PR adds the following options:
* `--only-core` will skip all extended tests, running all core tests.
* `--only-extended` will skip all core tests. Forcing alll extended
tests to be run.
Running `py.test` without specifying either option will remain
unaffected. Run
all tests that can be run within the unit_tests direction. Extended
tests will
run if required packages are installed.
## Before submitting
## Who can review?
### Add Invocation Params to Logged Run
Adds an llm type to each chat model as well as an override of the dict()
method to log the invocation parameters for each call
---------
Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
### Add on_chat_message_start to callback manager and base tracer
Goal: trace messages directly to permit reloading as chat messages
(store in an integration-agnostic way)
Add an `on_chat_message_start` method. Fall back to `on_llm_start()` for
handlers that don't have it implemented.
Does so in a non-backwards-compat breaking way (for now)
# Parameterize Redis vectorstore index
Redis vectorstore allows for three different distance metrics: `L2`
(flat L2), `COSINE`, and `IP` (inner product). Currently, the
`Redis._create_index` method hard codes the distance metric to COSINE.
I've parameterized this as an argument in the `Redis.from_texts` method
-- pretty simple.
Fixes#4368
## Before submitting
I've added an integration test showing indexes can be instantiated with
all three values in the `REDIS_DISTANCE_METRICS` literal. An example
notebook seemed overkill here. Normal API documentation would be more
appropriate, but no standards are in place for that yet.
## Who can review?
Not sure who's responsible for the vectorstore module... Maybe @eyurtsev
/ @hwchase17 / @agola11 ?
# Add option to `load_huggingface_tool`
Expose a method to load a huggingface Tool from the HF hub
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Thanks to @anna-charlotte and @jupyterjazz for the contribution! Made
few small changes to get it across the finish line
---------
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Co-authored-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>
# Add action to test with all dependencies installed
PR adds a custom action for setting up poetry that allows specifying a
cache key:
https://github.com/actions/setup-python/issues/505#issuecomment-1273013236
This makes it possible to run 2 types of unit tests:
(1) unit tests with only core dependencies
(2) unit tests with extended dependencies (e.g., those that rely on an
optional pdf parsing library)
As part of this PR, we're moving some pdf parsing tests into the
unit-tests section and making sure that these unit tests get executed
when running with extended dependencies.
# ODF File Loader
Adds a data loader for handling Open Office ODT files. Requires
`unstructured>=0.6.3`.
### Testing
The following should work using the `fake.odt` example doc from the
[`unstructured` repo](https://github.com/Unstructured-IO/unstructured).
```python
from langchain.document_loaders import UnstructuredODTLoader
loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements")
loader.load()
loader = UnstructuredODTLoader(file_path="fake.odt", mode="single")
loader.load()
```
Fixes#4153
If the sender of a message in a group chat isn't in your contact list,
they will appear with a ~ prefix in the exported chat. This PR adds
support for parsing such lines.
# Add support for Qdrant nested filter
This extends the filter functionality for the Qdrant vectorstore. The
current filter implementation is limited to a single-level metadata
structure; however, Qdrant supports nested metadata filtering. This
extends the functionality for users to maximize the filter functionality
when using Qdrant as the vectorstore.
Reference: https://qdrant.tech/documentation/filtering/#nested-key
---------
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
This pr makes it possible to extract more metadata from websites for
later use.
my usecase:
parsing ld+json or microdata from sites and store it as structured data
in the metadata field
- added `Wikipedia` retriever. It is effectively a wrapper for
`WikipediaAPIWrapper`. It wrapps load() into get_relevant_documents()
- sorted `__all__` in the `retrievers/__init__`
- added integration tests for the WikipediaRetriever
- added an example (as Jupyter notebook) for the WikipediaRetriever
# Add PDF parser implementations
This PR separates the data loading from the parsing for a number of
existing PDF loaders.
Parser tests have been designed to help encourage developers to create a
consistent interface for parsing PDFs.
This interface can be made more consistent in the future by adding
information into the initializer on desired behavior with respect to splitting by
page etc.
This code is expected to be backwards compatible -- with the exception
of a bug fix with pymupdf parser which was returning `bytes` in the page
content rather than strings.
Also changing the lazy parser method of document loader to return an
Iterator rather than Iterable over documents.
## Before submitting
<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoader Abstractions
- @eyurtsev
LLM/Chat Wrappers
- @hwchase17
- @agola11
Tools / Toolkits
- @vowelparrot
-->
# Add MimeType Based Parser
This PR adds a MimeType Based Parser. The parser inspects the mime-type
of the blob it is parsing and based on the mime-type can delegate to the sub
parser.
## Before submitting
Waiting on adding notebooks until more implementations are landed.
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@hwchase17
@vowelparrot
Handle duplicate and incorrectly specified OpenAI params
Thanks @PawelFaron for the fix! Made small update
Closes#4331
---------
Co-authored-by: PawelFaron <42373772+PawelFaron@users.noreply.github.com>
Co-authored-by: Pawel Faron <ext-pawel.faron@vaisala.com>
### Description
Add `similarity_search_with_score` method for OpenSearch to return
scores along with documents in the search results
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
This PR adds:
* Option to show a tqdm progress bar when using the file system blob loader
* Update pytest run configuration to be stricter
* Adding a new marker that checks that required pkgs exist
- Update the load_tools method to properly accept `callbacks` arguments.
- Add a deprecation warning when `callback_manager` is passed
- Add two unit tests to check the deprecation warning is raised and to
confirm the callback is passed through.
Closes issue #4096
- Added the `Wikipedia` document loader. It is based on the existing
`unilities/WikipediaAPIWrapper`
- Added a respective ut-s and example notebook
- Sorted list of classes in __init__
Hello
1) Passing `embedding_function` as a callable seems to be outdated and
the common interface is to pass `Embeddings` instance
2) At the moment `Qdrant.add_texts` is designed to be used with
`embeddings.embed_query`, which is 1) slow 2) causes ambiguity due to 1.
It should be used with `embeddings.embed_documents`
This PR solves both problems and also provides some new tests
- Update the RunCreate object to work with recent changes
- Add optional Example ID to the tracer
- Adjust default persist_session behavior to attempt to load the session
if it exists
- Raise more useful HTTP errors for logging
- Add unit testing
- Fix the default ID to be a UUID for v2 tracer sessions
Broken out from the big draft here:
https://github.com/hwchase17/langchain/pull/4061
- confirm creation
- confirm functionality with a simple dimension check.
The test now is calling OpenAI API directly, but learning from
@vowelparrot that we’re caching the requests, so that it’s not that
expensive. I also found we’re calling OpenAI api in other integration
tests. Please lmk if there is any concern of real external API calls. I
can alternatively make a fake LLM for this test. Thanks
This implements a loader of text passages in JSON format. The `jq`
syntax is used to define a schema for accessing the relevant contents
from the JSON file. This requires dependency on the `jq` package:
https://pypi.org/project/jq/.
---------
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
This PR updates the `message_line_regex` used by `WhatsAppChatLoader` to
support different date-time formats used in WhatsApp chat exports;
resolves#4153.
The new regex handles the following input formats:
```terminal
[05.05.23, 15:48:11] James: Hi here
[11/8/21, 9:41:32 AM] User name: Message 123
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes
```
Tests have been added to verify that the loader works correctly with all
formats.
The forward ref annotations don't get updated if we only iimport with
type checking
---------
Co-authored-by: Abhinav Verma <abhinav_win12@yahoo.co.in>
* implemented arun, results, and aresults. Reuses aiosession if
available.
* helper tools GoogleSerperRun and GoogleSerperResults
* support for Google Images, Places and News (examples given) and
filtering based on time (e.g. past hour)
* updated docs
This PR includes two main changes:
- Refactor the `TelegramChatLoader` and `FacebookChatLoader` classes by
removing the dependency on pandas and simplifying the message filtering
process.
- Add test cases for the `TelegramChatLoader` and `FacebookChatLoader`
classes. This test ensures that the class correctly loads and processes
the example chat data, providing better test coverage for this
functionality.
The Blockchain Document Loader's default behavior is to return 100
tokens at a time which is the Alchemy API limit. The Document Loader
exposes a startToken that can be used for pagination against the API.
This enhancement includes an optional get_all_tokens param (default:
False) which will:
- Iterate over the Alchemy API until it receives all the tokens, and
return the tokens in a single call to the loader.
- Manage all/most tokenId formats (this can be int, hex16 with zero or
all the leading zeros). There aren't constraints as to how smart
contracts can represent this value, but these three are most common.
Note that a contract with 10,000 tokens will issue 100 calls to the
Alchemy API, and could take about a minute, which is why this param will
default to False. But I've been using the doc loader with these
utilities on the side, so figured it might make sense to build them in
for others to use.
Seems the pyllamacpp package is no longer the supported bindings from
gpt4all. Tested that this works locally.
Given that the older models weren't very performant, I think it's better
to migrate now without trying to include a lot of try / except blocks
---------
Co-authored-by: Nissan Pow <npow@users.noreply.github.com>
Co-authored-by: Nissan Pow <pownissa@amazon.com>
- ActionAgent has a property called, `allowed_tools`, which is declared
as `List`. It stores all provided tools which is available to use during
agent action.
- This collection shouldn’t allow duplicates. The original datatype List
doesn’t make sense. Each tool should be unique. Even when there are
variants (assuming in the future), it would be named differently in
load_tools.
Test:
- confirm the functionality in an example by initializing an agent with
a list of 2 tools and confirm everything works.
```python3
def test_agent_chain_chat_bot():
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper
chat = ChatOpenAI(temperature=0)
llm = OpenAI(temperature=0)
tools = load_tools(["ddg-search", "llm-math"], llm=llm)
agent = initialize_agent(tools, chat, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?")
test_agent_chain_chat_bot()
```
Result:
<img width="863" alt="Screenshot 2023-05-01 at 7 58 11 PM"
src="https://user-images.githubusercontent.com/62768671/235572157-0937594c-ddfb-4760-acb2-aea4cacacd89.png">
Modified Modern Treasury and Strip slightly so credentials don't have to
be passed in explicitly. Thanks @mattgmarcus for adding Modern Treasury!
---------
Co-authored-by: Matt Marcus <matt.g.marcus@gmail.com>
Haven't gotten to all of them, but this:
- Updates some of the tools notebooks to actually instantiate a tool
(many just show a 'utility' rather than a tool. More changes to come in
separate PR)
- Move the `Tool` and decorator definitions to `langchain/tools/base.py`
(but still export from `langchain.agents`)
- Add scene explain to the load_tools() function
- Add unit tests for public apis for the langchain.tools and langchain.agents modules
Move tool validation to each implementation of the Agent.
Another alternative would be to adjust the `_validate_tools()` signature
to accept the output parser (and format instructions) and add logic
there. Something like
`parser.outputs_structured_actions(format_instructions)`
But don't think that's needed right now.
- Add langchain.llms.GooglePalm for text completion,
- Add langchain.chat_models.ChatGooglePalm for chat completion,
- Add langchain.embeddings.GooglePalmEmbeddings for sentence embeddings,
- Add example field to HumanMessage and AIMessage so that users can feed
in examples into the PaLM Chat API,
- Add system and unit tests.
Note async completion for the Text API is not yet supported and will be
included in a future PR.
Happy for feedback on any aspect of this PR, especially our choice of
adding an example field to Human and AI Message objects to enable
passing example messages to the API.
This pull request adds unit tests for various output parsers
(BooleanOutputParser, CommaSeparatedListOutputParser, and
StructuredOutputParser) to ensure their correct functionality and to
increase code reliability and maintainability. The tests cover both
valid and invalid input cases.
Changes:
Added unit tests for BooleanOutputParser.
Added unit tests for CommaSeparatedListOutputParser.
Added unit tests for StructuredOutputParser.
Testing:
All new unit tests have been executed, and they pass successfully.
The overall test suite has been run, and all tests pass.
Notes:
These tests cover both successful parsing scenarios and error handling
for invalid inputs.
If any new output parsers are added in the future, corresponding unit
tests should also be created to maintain coverage.
Enum to string conversion handled differently between python 3.9 and
3.11, currently breaking in 3.11 (see #3788). Thanks @peter-brady for
catching this!
In the current solution, AgentType and AGENT_TO_CLASS are placed in two
separate files and both manually maintained. This might cause
inconsistency when we update either of them.
— latest —
based on the discussion with hwchase17, we don’t know how to further use
the newly introduced AgentTypeConfig type, so it doesn’t make sense yet
to add it. Instead, it’s better to move the dictionary to another file
to keep the loading.py file clear. The consistency is a good point.
Instead of asserting the consistency during linting, we added a unittest
for consistency check. I think it works as auto unittest is triggered
every time with clear failure notice. (well, force push is possible, but
we all know what we are doing, so let’s show trust. :>)
~~This PR includes~~
- ~~Introduced AgentTypeConfig as the source of truth of all AgentType
related meta data.~~
- ~~Each AgentTypeConfig is a annotated class type which can be used for
annotation in other places.~~
- ~~Each AgentTypeConfig can be easily extended when we have more meta
data needs.~~
- ~~Strong assertion to ensure AgentType and AGENT_TO_CLASS are always
consistent.~~
- ~~Made AGENT_TO_CLASS automatically generated.~~
~~Test Plan:~~
- ~~since this change is focusing on annotation, lint is the major test
focus.~~
- ~~lint, format and test passed on local.~~