Commit Graph

430 Commits (49ce5ce1ca70657e34b63c2f239222e9557be115)

Author SHA1 Message Date
Eugene Yurtsev 09587a3201
Clean up tests for pdf parsers (#4595)
# Organize tests for pdf parsers

Clean up tests for pdf parsers, remove duplicate tests, convert to unit
tests.
1 year ago
Eugene Yurtsev d3300bd799
YouTube Loader: Replace regexp with built-in parsing (#4729) 1 year ago
Eugene Yurtsev 3c490b5ba3
Docugami DataLoader (#4727)
### Adds a document loader for Docugami

Specifically:

1. Adds a data loader that talks to the [Docugami](http://docugami.com)
API to download processed documents as semantic XML
2. Parses the semantic XML into chunks, with additional metadata
capturing chunk semantics
3. Adds a detailed notebook showing how you can use additional metadata
returned by Docugami for techniques like the [self-querying
retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html)
4. Adds an integration test, and related documentation

Here is an example of a result that is not possible without the
capabilities added by Docugami (from the notebook):

<img width="1585" alt="image"
src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b">

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>
1 year ago
KNiski c2761aa8f4
Improve video_id extraction in YoutubeLoader (#4452)
# Improve video_id extraction in `YoutubeLoader`

`YoutubeLoader.from_youtube_url` can only deal with one specific url
format. I've introduced `YoutubeLoader.extract_video_id` which can
extract video id from common YT urls.

Fixes #4451 


@eyurtsev

---------

Co-authored-by: Kamil Niski <kamil.niski@gmail.com>
1 year ago
Lester Yang cd3f9865f3
Feature: pdfplumber PDF loader with BaseBlobParser (#4552)
# Feature: pdfplumber PDF loader with BaseBlobParser

* Adds pdfplumber as a PDF loader
* Adds pdfplumber as a blob parser.
1 year ago
Harrison Chase b6e3ac17c4
Harrison/sitemap local (#4704)
Co-authored-by: Lukas Bauer <lukas.bauer@mayflower.de>
1 year ago
Harrison Chase 12b4ee1fc7
Harrison/telegram chat loader (#4698)
Co-authored-by: Akinwande Komolafe <47945512+Sensei-akin@users.noreply.github.com>
Co-authored-by: Akinwande Komolafe <akhinoz@gmail.com>
1 year ago
Li Yuanzheng 3b6206af49
Respect User-Specified User-Agent in WebBaseLoader (#4579)
# Respect User-Specified User-Agent in WebBaseLoader
This pull request modifies the `WebBaseLoader` class initializer from
the `langchain.document_loaders.web_base` module to preserve any
User-Agent specified by the user in the `header_template` parameter.
Previously, even if a User-Agent was specified in `header_template`, it
would always be overridden by a random User-Agent generated by the
`fake_useragent` library.

With this change, if a User-Agent is specified in `header_template`, it
will be used. Only in the case where no User-Agent is specified will a
random User-Agent be generated and used. This provides additional
flexibility when using the `WebBaseLoader` class, allowing users to
specify their own User-Agent if they have a specific need or preference,
while still providing a reasonable default for cases where no User-Agent
is specified.

This change has no impact on existing users who do not specify a
User-Agent, as the behavior in this case remains the same. However, for
users who do specify a User-Agent, their choice will now be respected
and used for all subsequent requests made using the `WebBaseLoader`
class.


Fixes #4167

## Before submitting

============================= test session starts
==============================
collecting ... collected 1 item


test_web_base.py::TestWebBaseLoader::test_respect_user_specified_user_agent

============================== 1 passed in 3.64s
===============================
PASSED [100%]

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested: @eyurtsev

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
1 year ago
Samuli Rauatmaa 66828ad231
add the existing OpenWeatherMap tool to the public api (#4292)
[OpenWeatherMapAPIWrapper](f70e18a5b3/docs/modules/agents/tools/examples/openweathermap.ipynb)
works wonderfully, but the _tool_ itself can't be used in master branch.

- added OpenWeatherMap **tool** to the public api, to be loadable with
`load_tools` by using "openweathermap-api" tool name (that name is used
in the existing
[docs](aff33d52c5/docs/modules/agents/tools/getting_started.md),
at the bottom of the page)
- updated OpenWeatherMap tool's **description** to make the input format
match what the API expects (e.g. `London,GB` instead of `'London,GB'`)
- added [ecosystem documentation page for
OpenWeatherMap](f9c41594fe/docs/ecosystem/openweathermap.md)
- added tool usage example to [OpenWeatherMap's
notebook](f9c41594fe/docs/modules/agents/tools/examples/openweathermap.ipynb)

Let me know if there's something I missed or something needs to be
updated! Or feel free to make edits yourself if that makes it easier for
you 🙂
1 year ago
Harrison Chase cdc20d1203
Harrison/json loader fix (#4686)
Co-authored-by: Triet Le <112841660+triet-lq-holistics@users.noreply.github.com>
1 year ago
Harrison Chase f2f2aced6d
allow partials in from_template (#4638) 1 year ago
Harrison Chase fbfa49f2c1
agent serialization (#4642) 1 year ago
Harrison Chase f5e2f70115
Harrison/json new line (#4646)
Co-authored-by: David Chen <davidchen@gliacloud.com>
1 year ago
Harrison Chase 279605b4d3
Harrison/metaphor search (#4657)
Co-authored-by: Jeffrey Wang <jeffreyzhiyuanwang@gmail.com>
1 year ago
Prerit Das 2747ccbcf1
Allow custom base Zapier prompt (#4213)
Currently, all Zapier tools are built using the pre-written base Zapier
prompt. These small changes (that retain default behavior) will allow a
user to create a Zapier tool using the ZapierNLARunTool while providing
their own base prompt.

Their prompt must contain input fields for zapier_description and
params, checked and enforced in the tool's root validator.

An example of when this may be useful: user has several, say 10, Zapier
tools enabled. Currently, the long generic default Zapier base prompt is
attached to every single tool, using an extreme number of tokens for no
real added benefit (repeated). User prompts LLM on how to use Zapier
tools once, then overrides the base prompt.

Or: user has a few specific Zapier tools and wants to maximize their
success rate. So, user writes prompts/descriptions for those tools
specific to their use case, and provides those to the ZapierNLARunTool.

A consideration - this is the simplest way to implement this I could
think of... though ideally custom prompting would be possible at the
Toolkit level as well. For now, this should be sufficient in solving the
concerns outlined above.
1 year ago
Paresh Mathur e2bc836571
Fix #4087 by setting the correct csv dialect (#4103)
The error in #4087 was happening because of the use of csv.Dialect.*
which is just an empty base class. we need to make a choice on what is
our base dialect. I usually use excel so I put it as excel, if
maintainers have other preferences do let me know.

Open Questions:
1. What should be the default dialect?
2. Should we rework all tests to mock the open function rather than the
csv.DictReader?
3. Should we make a separate input for `dialect` like we have for
`encoding`?

---------

Co-authored-by: = <=>
1 year ago
Zander Chase 928cdd57a4
[Breaking] Refactor Base Tracer(#4549)
### Refactor the BaseTracer
- Remove the 'session' abstraction from the BaseTracer
- Rename 'RunV2' object(s) to be called 'Run' objects (Rename previous
Run objects to be RunV1 objects)
- Ditto for sessions: TracerSession*V2 -> TracerSession*
- Remove now deprecated conversion from v1 run objects to v2 run objects
in LangChainTracerV2
- Add conversion from v2 run objects to v1 run objects in V1 tracer
1 year ago
Harrison Chase 485ecc3580
option for csv agent to not include df in prompt (#4610) 1 year ago
Zander Chase 0c6ed657ef
Convert Chain to a Chain Factory (#4605)
## Change Chain argument in client to accept a chain factory

The `run_over_dataset` functionality seeks to treat each iteration of an
example as an independent trial.
Chains have memory, so it's easier to permit this type of behavior if we
accept a factory method rather than the chain object directly.

There's still corner cases / UX pains people will likely run into, like:
- Caching may cause issues
- if memory is persisted to a shared object (e.g., same redis queue) ,
this could impact what is retrieved
- If we're running the async methods with concurrency using local
models, if someone naively instantiates the chain and loads each time,
it could lead to tons of disk I/O or OOM
1 year ago
Davis Chase 36f9e9a0ba
Skip flaky unit test (#4591) 1 year ago
Eugene Yurtsev 08ed927c32
Turn on extended tests (#4588)
# Turn on strict extended tests

This PR turns on strict testing for extended tests.
1 year ago
Zander Chase d96f6a106b
Add Steamship Image Generation Tool (#4580)
Co-authored-by: Enias Cailliau <enias@steamship.com>
1 year ago
Eugene Yurtsev a5371a0fa2
Add pytest --only-extended and --only-core options (#4494)
# Adds testing options to pytest

This PR adds the following options: 

* `--only-core` will skip all extended tests, running all core tests.
* `--only-extended` will skip all core tests. Forcing alll extended
tests to be run.

Running `py.test` without specifying either option will remain
unaffected. Run
all tests that can be run within the unit_tests direction. Extended
tests will
run if required packages are installed.

## Before submitting

## Who can review?
1 year ago
Leonid Ganeline e17d0319d5
Add `arxiv` retriever (#4538) 1 year ago
SimFG 7bcf238a1a
Optimize the initialization method of GPTCache (#4522)
Optimize the initialization method of GPTCache, so that users can use GPTCache more quickly.
1 year ago
Zander Chase f4d3cf2dfb
Add Invocation Params (#4509)
### Add Invocation Params to Logged Run


Adds an llm type to each chat model as well as an override of the dict()
method to log the invocation parameters for each call

---------

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
1 year ago
Zander Chase 4ee47926ca
Add on_chat_message_start (#4499)
### Add on_chat_message_start to callback manager and base tracer

Goal: trace messages directly to permit reloading as chat messages
(store in an integration-agnostic way)

Add an `on_chat_message_start` method. Fall back to `on_llm_start()` for
handlers that don't have it implemented.

Does so in a non-backwards-compat breaking way (for now)
1 year ago
Sunish Sheth 812e5f43f5
Add _type for all parsers (#4189)
Used for serialization. Also add test that recurses through
our subclasses to check they have them implemented

Would fix https://github.com/hwchase17/langchain/issues/3217
Blocking: https://github.com/mlflow/mlflow/pull/8297

---------

Signed-off-by: Sunish Sheth <sunishsheth2009@gmail.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
1 year ago
kYLe 0d51a1f12b
Add LLMs support for Anyscale Service (#4350)
Add Anyscale service integration under LLM

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
1 year ago
Evan Jones f668251948
parameterized distance metrics; lint; format; tests (#4375)
# Parameterize Redis vectorstore index

Redis vectorstore allows for three different distance metrics: `L2`
(flat L2), `COSINE`, and `IP` (inner product). Currently, the
`Redis._create_index` method hard codes the distance metric to COSINE.

I've parameterized this as an argument in the `Redis.from_texts` method
-- pretty simple.

Fixes #4368 

## Before submitting

I've added an integration test showing indexes can be instantiated with
all three values in the `REDIS_DISTANCE_METRICS` literal. An example
notebook seemed overkill here. Normal API documentation would be more
appropriate, but no standards are in place for that yet.

## Who can review?

Not sure who's responsible for the vectorstore module... Maybe @eyurtsev
/ @hwchase17 / @agola11 ?
1 year ago
Zander Chase d969f43ed8
Load HuggingFace Tool (#4475)
# Add option to `load_huggingface_tool`

Expose a method to load a huggingface Tool from the HF hub

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
1 year ago
Davis Chase 9ec60ad832
Add azure cognitive search retriever (#4467)
All credit to @UmerHA, made a couple small changes

---------

Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>
1 year ago
Davis Chase 46b100ea63
Add DocArray vector stores (#4483)
Thanks to @anna-charlotte and @jupyterjazz for the contribution! Made
few small changes to get it across the finish line

---------

Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Co-authored-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>
1 year ago
Harrison Chase b2f920e891
add tracing v2 env var (#4465)
Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
1 year ago
Davis Chase 04475bea7d
Mv plan and execute to experimental (#4459) 1 year ago
Eugene Yurtsev 80558b5b27
Add workflow for testing with all deps (#4410)
# Add action to test with all dependencies installed

PR adds a custom action for setting up poetry that allows specifying a
cache key:
https://github.com/actions/setup-python/issues/505#issuecomment-1273013236

This makes it possible to run 2 types of unit tests: 

(1) unit tests with only core dependencies
(2) unit tests with extended dependencies (e.g., those that rely on an
optional pdf parsing library)


As part of this PR, we're moving some pdf parsing tests into the
unit-tests section and making sure that these unit tests get executed
when running with extended dependencies.
1 year ago
Matt Robinson 3637d6da6e
feat: add loader for open office odt files (#4405)
# ODF File Loader

Adds a data loader for handling Open Office ODT files. Requires
`unstructured>=0.6.3`.

### Testing

The following should work using the `fake.odt` example doc from the
[`unstructured` repo](https://github.com/Unstructured-IO/unstructured).

```python
from langchain.document_loaders import UnstructuredODTLoader

loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements")
loader.load()

loader = UnstructuredODTLoader(file_path="fake.odt", mode="single")
loader.load()
```
1 year ago
Harrison Chase 6b8d144ccc
Harrison/plan and solve (#4422) 1 year ago
Rukmani 2b14036126
Update WhatsAppChatLoader to include the character ~ in the sender name (#4420)
Fixes #4153

If the sender of a message in a group chat isn't in your contact list,
they will appear with a ~ prefix in the exported chat. This PR adds
support for parsing such lines.
1 year ago
Zander Chase f2150285a4
Fix nested runs example ID (#4413)
#### Only reference example ID on the parent run

Previously, I was assigning the example ID to every child run. 
Adds a test.
1 year ago
Aivin V. Solatorio 6335cb5b3a
Add support for Qdrant nested filter (#4354)
# Add support for Qdrant nested filter

This extends the filter functionality for the Qdrant vectorstore. The
current filter implementation is limited to a single-level metadata
structure; however, Qdrant supports nested metadata filtering. This
extends the functionality for users to maximize the filter functionality
when using Qdrant as the vectorstore.

Reference: https://qdrant.tech/documentation/filtering/#nested-key

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
1 year ago
Martin Holzhauer 872605a5c5
Add an option to extract more metadata from crawled websites (#4347)
This pr makes it possible to extract more metadata from websites for
later use.

my usecase:
parsing ld+json or microdata from sites and store it as structured data
in the metadata field
1 year ago
Leonid Ganeline ce15ffae6a
added `Wikipedia` retriever (#4302)
- added `Wikipedia` retriever. It is effectively a wrapper for
`WikipediaAPIWrapper`. It wrapps load() into get_relevant_documents()
- sorted `__all__` in the `retrievers/__init__`
- added integration tests for the WikipediaRetriever
- added an example (as Jupyter notebook) for the WikipediaRetriever
1 year ago
Eugene Yurtsev 2ceb807da2
Add PDF parser implementations (#4356)
# Add PDF parser implementations

This PR separates the data loading from the parsing for a number of
existing PDF loaders.

Parser tests have been designed to help encourage developers to create a
consistent interface for parsing PDFs.

This interface can be made more consistent in the future by adding
information into the initializer on desired behavior with respect to splitting by
page etc.

This code is expected to be backwards compatible -- with the exception
of a bug fix with pymupdf parser which was returning `bytes` in the page
content rather than strings.

Also changing the lazy parser method of document loader to return an
Iterator rather than Iterable over documents.

## Before submitting

<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

@

<!-- For a quicker response, figure out the right person to tag with @

        @hwchase17 - project lead

        Tracing / Callbacks
        - @agola11

        Async
        - @agola11

        DataLoader Abstractions
        - @eyurtsev

        LLM/Chat Wrappers
        - @hwchase17
        - @agola11

        Tools / Toolkits
        - @vowelparrot
 -->
1 year ago
Eugene Yurtsev ae0c3382dd
Add MimeType based parser (#4376)
# Add MimeType Based Parser

This PR adds a MimeType Based Parser. The parser inspects the mime-type
of the blob it is parsing and based on the mime-type can delegate to the sub
parser.

## Before submitting

Waiting on adding notebooks until more implementations are landed. 

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:


@hwchase17
@vowelparrot
1 year ago
Davis Chase ba0057c077
Check OpenAI model kwargs (#4366)
Handle duplicate and incorrectly specified OpenAI params

Thanks @PawelFaron for the fix! Made small update

Closes #4331

---------

Co-authored-by: PawelFaron <42373772+PawelFaron@users.noreply.github.com>
Co-authored-by: Pawel Faron <ext-pawel.faron@vaisala.com>
1 year ago
Davis Chase 02ebb15c4a
Fix TextSplitter.from_tiktoken(#4361)
Thanks to @danb27 for the fix! Minor update

Fixes https://github.com/hwchase17/langchain/issues/4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>
1 year ago
Naveen Tatikonda 782df1db10
OpenSearch: Add Similarity Search with Score (#4089)
### Description
Add `similarity_search_with_score` method for OpenSearch to return
scores along with documents in the search results

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
1 year ago
Eugene Yurtsev aa11f7c89b
Add progress bar to filesystemblob loader, update pytest config for unit tests (#4212)
This PR adds:

* Option to show a tqdm progress bar when using the file system blob loader
* Update pytest run configuration to be stricter
* Adding a new marker that checks that required pkgs exist
1 year ago
Zander Chase 8b284f9ad0
Pass parsed inputs through to tool _run (#4309) 1 year ago