Commit Graph

394 Commits

Author SHA1 Message Date
Matt Robinson
3637d6da6e
feat: add loader for open office odt files (#4405)
# ODF File Loader

Adds a data loader for handling Open Office ODT files. Requires
`unstructured>=0.6.3`.

### Testing

The following should work using the `fake.odt` example doc from the
[`unstructured` repo](https://github.com/Unstructured-IO/unstructured).

```python
from langchain.document_loaders import UnstructuredODTLoader

loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements")
loader.load()

loader = UnstructuredODTLoader(file_path="fake.odt", mode="single")
loader.load()
```
2023-05-10 01:37:17 -07:00
Harrison Chase
6b8d144ccc
Harrison/plan and solve (#4422) 2023-05-09 21:07:56 -07:00
Rukmani
2b14036126
Update WhatsAppChatLoader to include the character ~ in the sender name (#4420)
Fixes #4153

If the sender of a message in a group chat isn't in your contact list,
they will appear with a ~ prefix in the exported chat. This PR adds
support for parsing such lines.
2023-05-09 15:00:04 -07:00
Zander Chase
f2150285a4
Fix nested runs example ID (#4413)
#### Only reference example ID on the parent run

Previously, I was assigning the example ID to every child run. 
Adds a test.
2023-05-09 12:21:53 -07:00
Aivin V. Solatorio
6335cb5b3a
Add support for Qdrant nested filter (#4354)
# Add support for Qdrant nested filter

This extends the filter functionality for the Qdrant vectorstore. The
current filter implementation is limited to a single-level metadata
structure; however, Qdrant supports nested metadata filtering. This
extends the functionality for users to maximize the filter functionality
when using Qdrant as the vectorstore.

Reference: https://qdrant.tech/documentation/filtering/#nested-key

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
2023-05-09 10:34:11 -07:00
Martin Holzhauer
872605a5c5
Add an option to extract more metadata from crawled websites (#4347)
This pr makes it possible to extract more metadata from websites for
later use.

my usecase:
parsing ld+json or microdata from sites and store it as structured data
in the metadata field
2023-05-09 10:18:33 -07:00
Leonid Ganeline
ce15ffae6a
added Wikipedia retriever (#4302)
- added `Wikipedia` retriever. It is effectively a wrapper for
`WikipediaAPIWrapper`. It wrapps load() into get_relevant_documents()
- sorted `__all__` in the `retrievers/__init__`
- added integration tests for the WikipediaRetriever
- added an example (as Jupyter notebook) for the WikipediaRetriever
2023-05-09 10:08:39 -07:00
Eugene Yurtsev
2ceb807da2
Add PDF parser implementations (#4356)
# Add PDF parser implementations

This PR separates the data loading from the parsing for a number of
existing PDF loaders.

Parser tests have been designed to help encourage developers to create a
consistent interface for parsing PDFs.

This interface can be made more consistent in the future by adding
information into the initializer on desired behavior with respect to splitting by
page etc.

This code is expected to be backwards compatible -- with the exception
of a bug fix with pymupdf parser which was returning `bytes` in the page
content rather than strings.

Also changing the lazy parser method of document loader to return an
Iterator rather than Iterable over documents.

## Before submitting

<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@

<!-- For a quicker response, figure out the right person to tag with @

        @hwchase17 - project lead

        Tracing / Callbacks
        - @agola11

        Async
        - @agola11

        DataLoader Abstractions
        - @eyurtsev

        LLM/Chat Wrappers
        - @hwchase17
        - @agola11

        Tools / Toolkits
        - @vowelparrot
 -->
2023-05-09 10:24:17 -04:00
Eugene Yurtsev
ae0c3382dd
Add MimeType based parser (#4376)
# Add MimeType Based Parser

This PR adds a MimeType Based Parser. The parser inspects the mime-type
of the blob it is parsing and based on the mime-type can delegate to the sub
parser.

## Before submitting

Waiting on adding notebooks until more implementations are landed. 

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:


@hwchase17
@vowelparrot
2023-05-09 10:22:56 -04:00
Davis Chase
ba0057c077
Check OpenAI model kwargs (#4366)
Handle duplicate and incorrectly specified OpenAI params

Thanks @PawelFaron for the fix! Made small update

Closes #4331

---------

Co-authored-by: PawelFaron <42373772+PawelFaron@users.noreply.github.com>
Co-authored-by: Pawel Faron <ext-pawel.faron@vaisala.com>
2023-05-08 16:37:34 -07:00
Davis Chase
02ebb15c4a
Fix TextSplitter.from_tiktoken(#4361)
Thanks to @danb27 for the fix! Minor update

Fixes https://github.com/hwchase17/langchain/issues/4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>
2023-05-08 16:36:38 -07:00
Naveen Tatikonda
782df1db10
OpenSearch: Add Similarity Search with Score (#4089)
### Description
Add `similarity_search_with_score` method for OpenSearch to return
scores along with documents in the search results

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
2023-05-08 16:35:21 -07:00
Eugene Yurtsev
aa11f7c89b
Add progress bar to filesystemblob loader, update pytest config for unit tests (#4212)
This PR adds:

* Option to show a tqdm progress bar when using the file system blob loader
* Update pytest run configuration to be stricter
* Adding a new marker that checks that required pkgs exist
2023-05-08 16:15:09 -04:00
Zander Chase
8b284f9ad0
Pass parsed inputs through to tool _run (#4309) 2023-05-08 09:13:05 -07:00
Zander Chase
35c9e6ab40
Pass Callbacks through load_tools (#4298)
- Update the load_tools method to properly accept `callbacks` arguments.
- Add a deprecation warning when `callback_manager` is passed
- Add two unit tests to check the deprecation warning is raised and to
confirm the callback is passed through.

Closes issue #4096
2023-05-08 08:44:26 -07:00
Jinto Jose
8a338412fa
mongodb support for chat history (#4266) 2023-05-08 08:34:05 -07:00
Harrison Chase
c8b0b6e6c1
add youtube tools (#4320) 2023-05-08 08:29:30 -07:00
Leonid Ganeline
9544b30821
added Wikipedia document loader (#4141)
- Added the `Wikipedia` document loader. It is based on the existing
`unilities/WikipediaAPIWrapper`
- Added a respective ut-s and example notebook
- Sorted list of classes in __init__
2023-05-06 09:32:45 -07:00
Eugene Yurtsev
423f497168
Add BlobParser abstraction (#3979)
This PR adds the BlobParser abstraction.

It follows the proposal described here:
https://github.com/hwchase17/langchain/pull/2833#issuecomment-1509097756
2023-05-05 21:43:38 -04:00
Davis Chase
5ca13cc1f0
Dev2049/pypdfium2 (#4209)
thanks @jerrytigerxu for the addition!

---------

Co-authored-by: Jere Xu <jtxu2008@gmail.com>
Co-authored-by: jerrytigerxu <jere.tiger.xu@gmailc.om>
2023-05-05 17:55:31 -07:00
George
2324f19c85
Update qdrant interface (#3971)
Hello

1) Passing `embedding_function` as a callable seems to be outdated and
the common interface is to pass `Embeddings` instance

2) At the moment `Qdrant.add_texts` is designed to be used with
`embeddings.embed_query`, which is 1) slow 2) causes ambiguity due to 1.
It should be used with `embeddings.embed_documents`

This PR solves both problems and also provides some new tests
2023-05-05 16:46:40 -07:00
Zander Chase
1017e5cee2
Add LCP Client (#4198)
Adding a client to fetch datasets, examples, and runs from a LCP
instance and run objects over them.
2023-05-05 16:28:56 -07:00
Zander Chase
a30f42da4e
Update V2 Tracer (#4193)
- Update the RunCreate object to work with recent changes
- Add optional Example ID to the tracer
- Adjust default persist_session behavior to attempt to load the session
if it exists
- Raise more useful HTTP errors for logging
- Add unit testing
- Fix the default ID to be a UUID for v2 tracer sessions


Broken out from the big draft here:
https://github.com/hwchase17/langchain/pull/4061
2023-05-05 14:55:01 -07:00
Mike Wang
c3044b1bf0
[test] Add integration_test for PandasAgent (#4056)
- confirm creation
- confirm functionality with a simple dimension check.

The test now is calling OpenAI API directly, but learning from
@vowelparrot that we’re caching the requests, so that it’s not that
expensive. I also found we’re calling OpenAI api in other integration
tests. Please lmk if there is any concern of real external API calls. I
can alternatively make a fake LLM for this test. Thanks
2023-05-05 14:49:02 -07:00
Aivin V. Solatorio
6567b73e1a
JSON loader (#4067)
This implements a loader of text passages in JSON format. The `jq`
syntax is used to define a schema for accessing the relevant contents
from the JSON file. This requires dependency on the `jq` package:
https://pypi.org/project/jq/.

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
2023-05-05 14:48:13 -07:00
hp0404
2a3c5f8353
Update WhatsAppChatLoader regex to handle multiple date-time formats (#4186)
This PR updates the `message_line_regex` used by `WhatsAppChatLoader` to
support different date-time formats used in WhatsApp chat exports;
resolves #4153.

The new regex handles the following input formats:
```terminal
[05.05.23, 15:48:11] James: Hi here
[11/8/21, 9:41:32 AM] User name: Message 123
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes
```

Tests have been added to verify that the loader works correctly with all
formats.
2023-05-05 13:13:05 -07:00
Zander Chase
84cfa76e00
Update Cohere Reranker (#4180)
The forward ref annotations don't get updated if we only iimport with
type checking

---------

Co-authored-by: Abhinav Verma <abhinav_win12@yahoo.co.in>
2023-05-05 09:11:37 -07:00
Zander Chase
6032a051e9
Add Tenant ID to V2 Tracer (#4135)
Update the V2 tracer to
- use UUIDs instead of int's
- load a tenant ID and use that when saving sessions
2023-05-04 21:35:20 -07:00
Zander Chase
2f087d63af
Fix Python RePL Tool (#4137)
Filter out kwargs from inferred schema when determining if a tool is
single input.

Add a couple unit tests.

Move tool unit tests to the tools dir
2023-05-04 20:31:16 -07:00
Harrison Chase
d4cf1eb60a
Add firestore memory (#3792) (#3941)
If you have any other suggestions or feedback, please let me know.

---------

Co-authored-by: yakigac <10434946+yakigac@users.noreply.github.com>
2023-05-03 22:55:47 -07:00
Mike Wang
67db495fcf
[agent] Add Spark Agent (#4020)
- added support for spark through pyspark library.
- added jupyter notebook as example.
2023-05-03 22:45:23 -07:00
rogerserper
b1446bea5f
google-serper: async + full json results + support for Google Images, Places and News (#4078)
* implemented arun, results, and aresults. Reuses aiosession if
available.
* helper tools GoogleSerperRun and GoogleSerperResults
* support for Google Images, Places and News (examples given) and
filtering based on time (e.g. past hour)
* updated docs
2023-05-03 22:35:48 -07:00
Zander Chase
65c3b146c9
Accept str or list[str] for shell (#4060)
Relax the requirements
2023-05-03 21:11:06 -07:00
hp0404
374725a715
Refactor TelegramChatLoader and FacebookChatLoader classes and add tests (#3863)
This PR includes two main changes:

- Refactor the `TelegramChatLoader` and `FacebookChatLoader` classes by
removing the dependency on pandas and simplifying the message filtering
process.

- Add test cases for the `TelegramChatLoader` and `FacebookChatLoader`
classes. This test ensures that the class correctly loads and processes
the example chat data, providing better test coverage for this
functionality.
2023-05-03 15:59:19 -07:00
Jon Saginaw
ea64b1716d
Enhancement: option to Get All Tokens with a single Blockchain Document Loader call (#3797)
The Blockchain Document Loader's default behavior is to return 100
tokens at a time which is the Alchemy API limit. The Document Loader
exposes a startToken that can be used for pagination against the API.

This enhancement includes an optional get_all_tokens param (default:
False) which will:

- Iterate over the Alchemy API until it receives all the tokens, and
return the tokens in a single call to the loader.
- Manage all/most tokenId formats (this can be int, hex16 with zero or
all the leading zeros). There aren't constraints as to how smart
contracts can represent this value, but these three are most common.

Note that a contract with 10,000 tokens will issue 100 calls to the
Alchemy API, and could take about a minute, which is why this param will
default to False. But I've been using the doc loader with these
utilities on the side, so figured it might make sense to build them in
for others to use.
2023-05-03 15:46:44 -07:00
Zander Chase
afa9d1292b
Re-Permit Partials in Tool (#4058)
Resolved issue #4053

Now that StructuredTool is a separate class, this constraint is no
longer needed.

Added/updated a unit test
2023-05-03 13:16:41 -07:00
Harrison Chase
a5dd73c1a6
Revert "[agent][property type] Change allowed_tools to Set as Duplicate doesn’t make sense" (#4014)
Reverts hwchase17/langchain#3840
2023-05-02 18:58:05 -07:00
Davis Chase
f08a76250f
Better custom model handling OpenAICallbackHandler (#4009)
Thanks @maykcaldas for flagging! think this should resolve #3988. Let me
know if you still see issues after next release.
2023-05-02 16:19:57 -07:00
Harrison Chase
48ea27ba60
Harrison/blockwise sitemap (#3940)
Co-authored-by: Martin Holzhauer <martin@holzhauer.eu>
2023-05-01 21:34:07 -07:00
Harrison Chase
f04faf8496
Harrison/spreedly (#3937)
Co-authored-by: Esmit Pérez <esmitperez@users.noreply.github.com>
2023-05-01 20:56:56 -07:00
Harrison Chase
cd3f8582cb
Harrison/combined memory (#3935)
Co-authored-by: engkheng <60956360+outday29@users.noreply.github.com>
2023-05-01 20:55:56 -07:00
Zander Chase
c4cb55a0c5
[Breaking] Migrate GPT4All to use PyGPT4All (#3934)
Seems the pyllamacpp package is no longer the supported bindings from
gpt4all. Tested that this works locally.

Given that the older models weren't very performant, I think it's better
to migrate now without trying to include a lot of try / except blocks

---------

Co-authored-by: Nissan Pow <npow@users.noreply.github.com>
Co-authored-by: Nissan Pow <pownissa@amazon.com>
2023-05-01 20:42:45 -07:00
Zander Chase
c582f2e9e3
Add Structure Chat Agent (#3912)
Create a new chat agent that is compatible with the Multi-input tools
2023-05-01 20:34:50 -07:00
Mike Wang
ec21b7126c
[agent][property type] Change allowed_tools to Set as Duplicate doesn’t make sense (#3840)
- ActionAgent has a property called, `allowed_tools`, which is declared
as `List`. It stores all provided tools which is available to use during
agent action.
- This collection shouldn’t allow duplicates. The original datatype List
doesn’t make sense. Each tool should be unique. Even when there are
variants (assuming in the future), it would be named differently in
load_tools.


Test:
- confirm the functionality in an example by initializing an agent with
a list of 2 tools and confirm everything works.
```python3
def test_agent_chain_chat_bot():
	from langchain.agents import load_tools
	from langchain.agents import initialize_agent
	from langchain.agents import AgentType
	from langchain.chat_models import ChatOpenAI
	from langchain.llms import OpenAI
	from langchain.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper

	chat = ChatOpenAI(temperature=0)
	llm = OpenAI(temperature=0)
	tools = load_tools(["ddg-search", "llm-math"], llm=llm)

	agent = initialize_agent(tools, chat, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
	agent.run("Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?")
test_agent_chain_chat_bot()
```
Result:
<img width="863" alt="Screenshot 2023-05-01 at 7 58 11 PM"
src="https://user-images.githubusercontent.com/62768671/235572157-0937594c-ddfb-4760-acb2-aea4cacacd89.png">
2023-05-01 20:30:10 -07:00
Davis Chase
e7e29f9937
Dev2049/add modern treasury (#3924)
Modified Modern Treasury and Strip slightly so credentials don't have to
be passed in explicitly. Thanks @mattgmarcus for adding Modern Treasury!

---------

Co-authored-by: Matt Marcus <matt.g.marcus@gmail.com>
2023-05-01 20:28:02 -07:00
Davis Chase
5db6b796cf
Dev2049/hf emb encode kwargs (#3925)
Thanks @amogkam for the addition! Refactored slightly

---------

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2023-05-01 20:27:41 -07:00
Zander Chase
9b9b231e10
Update some Tools Docs (#3913)
Haven't gotten to all of them, but this:
- Updates some of the tools notebooks to actually instantiate a tool
(many just show a 'utility' rather than a tool. More changes to come in
separate PR)
- Move the `Tool` and decorator definitions to `langchain/tools/base.py`
(but still export from `langchain.agents`)
- Add scene explain to the load_tools() function
- Add unit tests for public apis for the langchain.tools and langchain.agents modules
2023-05-01 19:07:26 -07:00
Zander Chase
84ea17b786
Move Tool Validation (#3923)
Move tool validation to each implementation of the Agent.

Another alternative would be to adjust the `_validate_tools()` signature
to accept the output parser (and format instructions) and add logic
there. Something like

`parser.outputs_structured_actions(format_instructions)`

But don't think that's needed right now.
2023-05-01 18:44:24 -07:00
Eugene Yurtsev
7cce68a051
Add minimal file system blob loader (#3669)
This adds a minimal file system blob loader.

If looks good, this PR will be merged and a few additional enhancements will be made.
2023-05-01 21:37:26 -04:00
Zura Isakadze
647bbf61c1
Add SQLiteChatMessageHistory (#3534)
It's based on already existing `PostgresChatMessageHistory`

Use case somewhere in between multiple files and Postgres storage.
2023-05-01 15:40:00 -07:00