Commit Graph

1608 Commits

Author SHA1 Message Date
Holt Skinner
e05bb938de
Merge pull request #12433
* feat: Add Google Cloud Translation document transformer

* Merge branch 'langchain-ai:master' into google-translate

* Add documentation for Google Translate Document Transformer

* Fix line length error

* Merge branch 'master' into google-translate

* Merge branch 'google-translate' of https://github.com/holtskinner/lan…

* Addressed code review comments

* Merge branch 'master' into google-translate

* Merge branch 'google-translate' of https://github.com/holtskinner/lan…

* Removed extra variable

* Merge branch 'google-translate' of https://github.com/holtskinner/lan…

* Merge branch 'master' into google-translate

* Merge branch 'google-translate' of https://github.com/holtskinner/lan…

* Removed extra import
2023-10-29 21:22:36 -04:00
Samad Koita
d1fdcd4fcb
Masking of API Key for GooseAI LLM (#12496)
Description: Add masking of API Key for GooseAI LLM when printed.
Issue: https://github.com/langchain-ai/langchain/issues/12165
Dependencies: None
Tag maintainer: @eyurtsev

---------

Co-authored-by: Samad Koita <>
2023-10-29 21:21:33 -04:00
Andrew Zhou
64c4a698a8
More comprehensive readthedocs document loader (#12382)
## **Description:**
When building our own readthedocs.io scraper, we noticed a couple
interesting things:

1. Text lines with a lot of nested <span> tags would give unclean text
with a bunch of newlines. For example, for [Langchain's
documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader),
a single line is represented in a complicated nested HTML structure, and
the naive `soup.get_text()` call currently being made will create a
newline for each nested HTML element. Therefore, the document loader
would give a messy, newline-separated blob of text. This would be true
in a lot of cases.

<img width="945" alt="Screenshot 2023-10-26 at 6 15 39 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/eca85d1f-d2bf-4487-a18a-e1e732fadf19">
<img width="1031" alt="Screenshot 2023-10-26 at 6 16 00 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/035938a0-9892-4f6a-83cd-0d7b409b00a3">

Additionally, content from iframes, code from scripts, css from styles,
etc. will be gotten if it's a subclass of the selector (which happens
more often than you'd think). For example, [this
page](https://pydeck.gl/gallery/contour_layer.html#) will scrape 1.5
million characters of content that looks like this:

<img width="1372" alt="Screenshot 2023-10-26 at 6 32 55 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/dbd89e39-9478-4a18-9e84-f0eb91954eac">

Therefore, I wrote a recursive _get_clean_text(soup) class function that
1. skips all irrelevant elements, and 2. only adds newlines when
necessary.

2. Index pages (like [this
one](https://api.python.langchain.com/en/latest/api_reference.html))
would be loaded, chunked, and eventually embedded. This is really bad
not just because the user will be embedding irrelevant information - but
because index pages are very likely to show up in retrieved content,
making retrieval less effective (in our tests). Therefore, I added a
bool parameter `exclude_index_pages` defaulted to False (which is the
current behavior — although I'd petition to default this to True) that
will skip all pages where links take up 50%+ of the page. Through manual
testing, this seems to be the best threshold.



## Other Information:
  - **Issue:** n/a
  - **Dependencies:** n/a
  - **Tag maintainer:** n/a
  - **Twitter handle:** @andrewthezhou

---------

Co-authored-by: Andrew Zhou <andrew@heykona.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-29 16:26:53 -07:00
Peter Vandenabeele
3468c038ba
Add unit tests for document_transformers/beautiful_soup_transformer.py (#12520)
- **Description:**
* Add unit tests for document_transformers/beautiful_soup_transformer.py
* Basic functionality is tested (extract tags, remove tags, drop lines)
    * add a FIXME comment about the order of tags that is not preserved
      (and a passing test, but with the expected tags now out-of-order)
  - **Issue:** None
  - **Dependencies:** None
  - **Tag maintainer:** @rlancemartin 
  - **Twitter handle:** `peter_v`

Please make sure your PR is passing linting and testing before
submitting.

=> OK: I ran `make format`, `make test` (passing after install of
beautifulsoup4) and `make lint`.
2023-10-29 16:24:47 -07:00
Anirudh Gautam
b257e6a4e8
Mask API key for AI21 LLM (#12418)
- **Description:** Added masking of the API Key for AI21 LLM when
printed and improved the docstring for AI21 LLM.
- Updated the AI21 LLM to utilize SecretStr from pydantic to securely
manage API key.
- Made improvements in the docstring of AI21 LLM. It now mentions that
the API key can also be passed as a named parameter to the constructor.
    - Added unit tests.
  - **Issue:** #12165 
  - **Tag maintainer:** @eyurtsev

---------

Co-authored-by: Anirudh Gautam <anirudh@Anirudhs-Mac-mini.local>
2023-10-29 14:53:41 -07:00
silvhua
9dead1034c
_dalle_image_url returns list of urls if n>1 (#11800)
- **Description:** Updated the `_dalle_image_url` method to return a
list of URLs if self.n>1,
  - **Issue:** #10691,
  - **Dependencies:** unsure,
  - **Tag maintainer:** @eyurtsev,
  - **Twitter handle:** @silvhua
---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-29 14:23:23 -07:00
Bagatur
1815ea2fdb
OpenAI runnable constructor (#12455) 2023-10-29 13:40:30 -07:00
William FH
a830b809f3
Patch forward ref bug (#12508)
Currently this gives a bug:
```
from langchain.schema.runnable import RunnableLambda

bound = RunnableLambda(lambda x: x).with_config({"callbacks": []})

# ConfigError: field "callbacks" not yet prepared so type is still a ForwardRef, you might need to call RunnableConfig.update_forward_refs().
```

Rather than deal with cyclic imports and extra load time, etc., I think
it makes sense to just have a separate Callbacks definition here that is
a relaxed typehint.
2023-10-29 00:53:01 -07:00
William FH
36204c2baf
Evaluation Callback Multi Response (#12505)
1. Allow run evaluators to return {"results": [list of evaluation
results]} in the evaluator callback.
2. Allows run evaluators to pick the target run ID to provide feedback
to

(1) means you could do something like a function call that populates a
full rubric in one go (not sure how reliable that is in general though)
rather than splitting off into separate LLM calls - cheaper and less
code to write
(2) means you can provide feedback to runs on subsequent calls.
Immediate use case is if you wanted to add an evaluator to a chat bot
and assign to assign to previous conversation turns


have a corresponding one in the SDK
2023-10-28 23:18:29 -07:00
Harrison Chase
9e0ae56287
various templates improvements (#12500) 2023-10-28 22:13:22 -07:00
0xC9
79cf01366e
Update tool.py (#12472)
In the GoogleSerperResults class, the name field is defined as
'google_serrper_results_json'. This looks like a typo, and perhaps
should be 'google_serper_results_json'.

<!-- Thank you for contributing to LangChain!

Replace this entire comment with:
  - **Description:** a description of the change, 
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!

Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.

See contribution guidelines for more information on how to write/run
tests, lint, etc:

https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in `docs/extras`
directory.

If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
 -->
2023-10-28 21:49:01 -07:00
Harrison Chase
eb903e211c
bump to 36 (#12487) 2023-10-28 08:51:23 -07:00
Tyler Hutcherson
4209457bdc
Redis langserve template (#12443)
Add Redis langserve template! Eventually will add semantic caching to
this too. But I was struggling to get that to work for some reason with
the LCEL implementation here.

- **Description:** Introduces the Redis LangServe template. A simple RAG
based app built on top of Redis that allows you to chat with company's
public financial data (Edgar 10k filings)
  - **Issue:** None
- **Dependencies:** The template contains the poetry project
requirements to run this template
  - **Tag maintainer:** @baskaryan @Spartee 
  - **Twitter handle:** @tchutch94

**Note**: this requires the commit here that deletes the
`_aget_relevant_documents()` method from the Redis retriever class that
wasn't implemented. That was breaking the langserve app.

---------

Co-authored-by: Sam Partee <sam.partee@redis.com>
2023-10-28 08:31:12 -07:00
Erick Friis
9adaa78c65
cli improvements (#12465)
Features
- add multiple repos by their branch/repo
- generate `pip install` commands and `add_route()` code
![Screenshot 2023-10-27 at 4 49 52
PM](https://github.com/langchain-ai/langchain/assets/9557659/3aec4cbb-3f67-4f04-8370-5b54ea983b2a)

Optimizations:
- group installs by repo/branch to avoid duplicate cloning
2023-10-28 08:25:31 -07:00
Adam Law
df4960a6d8
add reranking to azuresearch (#12454)
-**Description** Adds returning the reranking score when using semantic
search
-**Issue:* #12317

---------

Co-authored-by: Adam Law <adamlaw@microsoft.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-27 14:14:09 -07:00
Eugene Yurtsev
60d009f75a
Add security note to API chain (#12452)
Add security note
2023-10-27 17:09:42 -04:00
Matvey Arye
11505f95d3
Improve handling of empty queries for timescale vector (#12393)
**Description:** Improve handling of empty queries in timescale-vector.
For timescale-vector it is more efficient to get a None embedding when
the embedding has no semantic meaning. It allows timescale-vector to
perform more optimizations. Thus, when the query is empty, use a None
embedding.

 Also pass down constructor arguments to the timescale vector client.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-27 13:55:16 -07:00
Erick Friis
38cee5fae0
cli updates 2 (#12447)
- extras group
- readme
- another readme

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2023-10-27 13:37:03 -07:00
William FH
5d40e36c75
Trace if run tree set (#12444)
This code path is hit in the following case:
- Start in langchain code and manually provide a tracer
- Handoff to the traceable
- Hand back to langchain code.

Which happens for evaluating `@traceable` functions unfortunately
2023-10-27 12:29:18 -07:00
Bagatur
c2a0a6b6df
make doc utils public (#12394) 2023-10-27 12:08:08 -07:00
Henter
d6888a90d0
Fix the missing temperature parameter for Baichuan-AI chat_model (#12420)
**Description:** the missing `temperature` parameter for Baichuan-AI
chat_model

Baichuan-AI api doc: https://platform.baichuan-ai.com/docs/api
2023-10-27 12:07:21 -07:00
Erick Friis
6908634428
cli updates oct27 (#12436) 2023-10-27 12:06:46 -07:00
HwangJohn
d38c8369b3
added rrf argument in ApproxRetrievalStrategy class __init__() (#11987)
- **Description: To handle the hybrid search with RRF(Reciprocal Rank
Fusion) in the Elasticsearch, rrf argument was added for adjusting
'rank_constant' and 'window_size' to combine multiple result sets with
different relevance indicators into a single result set. (ref:
https://www.elastic.co/kr/blog/whats-new-elastic-enterprise-search-8-9-0),
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** No dependencies changed,
  - **Tag maintainer:** @baskaryan,

Nice to meet you,
I'm a newbie for contributions and it's my first PR.

I only changed the langchain/vectorstores/elasticsearch.py file.
I did make format&lint 
I got this message,
```shell
make lint_diff  
./scripts/check_pydantic.sh .
./scripts/check_imports.sh
poetry run ruff .
[ "langchain/vectorstores/elasticsearch.py" = "" ] || poetry run black langchain/vectorstores/elasticsearch.py --check
All done!  🍰 
1 file would be left unchanged.
[ "langchain/vectorstores/elasticsearch.py" = "" ] || poetry run mypy langchain/vectorstores/elasticsearch.py
langchain/__init__.py: error: Source file found twice under different module names: "mvp.nlp.langchain.libs.langchain.langchain" and "langchain"
Found 1 error in 1 file (errors prevented further checking)
make: *** [lint_diff] Error 2
```

Thank you

---------

Co-authored-by: 황중원 <jwhwang@amorepacific.com>
2023-10-27 11:53:19 -07:00
Roman Vasilyev
2c58dca5f0
optional reusable connection (#12051)
My postgres out of connections after continuous PGVector usage, and the
reason because it constantly creates new connections, so adding a
reusable pre established connection seems like solves an issue

---------

Co-authored-by: Roman Vasilyev <rvasilyev@mozilla.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2023-10-27 11:52:42 -07:00
Ennio Pastore
48fde2004f
Update long_context_reorder.py (#12422)
The function comment was confusing and inaccurate
2023-10-27 11:52:28 -07:00
Bagatur
a8c68d4ffa
Type LLMChain.llm as runnable (#12385) 2023-10-27 11:52:01 -07:00
Bagatur
d12b88557a
Bagatur/bump 325 (#12440) 2023-10-27 11:49:09 -07:00
Eugene Yurtsev
cadfce295f
Deprecate PythonRepl tools and Pandas/Xorbits/Spark DataFrame/Python/CSV agents (#12427)
See discussion here:
https://github.com/langchain-ai/langchain/discussions/11680

The code is available for usage from langchain_experimental. The reason
for the deprecation is that the agents are relying on a Python REPL. The
code can only be run safely with appropriate sandboxing.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-27 14:16:42 -04:00
Harrison Chase
0ca539eb85
Clean up deprecated agents and update __init__ in experimental (#12231)
Update init paths in experimental
2023-10-27 13:52:50 -04:00
Holt Skinner
134f085824
feat: Add Google Speech to Text API Document Loader (#12298)
- Add Document Loader for Google Speech to Text
  - Similar Structure to [Assembly AI Document Loader][1]

[1]:
https://python.langchain.com/docs/integrations/document_loaders/assemblyai
2023-10-27 09:34:26 -07:00
David Duong
52c194ec3a
Fix templates typos (#12428) 2023-10-27 09:32:57 -07:00
Massimiliano Pronesti
c8195769f2
fix(openai-callback): completion count logic (#12383)
The changes introduced in #12267 and #12190 broke the cost computation
of the `completion` tokens for fine-tuned models because of the early
return. This PR aims at fixing this.
@baskaryan.
2023-10-27 09:08:54 -07:00
Stefan Langenbach
b22da81af8
Mask API key for Aleph Alpha LLM (#12377)
- **Description:** Add masking of API Key for Aleph Alpha LLM when
printed.
- **Issue**: #12165
- **Dependencies:** None
- **Tag maintainer:** @eyurtsev

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2023-10-27 11:32:43 -04:00
William FH
4254028c52
Str Evaluator Mapper (#12401) 2023-10-26 21:38:47 -07:00
William FH
fcad1d2965
Add space (#12395) 2023-10-26 20:32:23 -07:00
William FH
922d7910ef
Wfh/json schema evaluation (#12389)
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2023-10-26 20:32:05 -07:00
Christian Kasim Loan
a35445c65f
johnsnowlabs embeddings support (#11271)
- **Description:** Introducing the
[JohnSnowLabsEmbeddings](https://www.johnsnowlabs.com/)
  - **Dependencies:** johnsnowlabs
  - **Tag maintainer:** @C-K-Loan
- **Twitter handle:** https://twitter.com/JohnSnowLabs
https://twitter.com/ChristianKasimL

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-26 20:22:50 -07:00
SteveLiao
c08b622b2d
Add HTML Title and Page Language into metadata for AsyncHtmlLoader (#11326)
**Description:** 
Revise `libs/langchain/langchain/document_loaders/async_html.py` to
store the HTML Title and Page Language in the `metadata` of
`AsyncHtmlLoader`.
2023-10-26 20:22:31 -07:00
Shorthills AI
25c98dbba9
Fixed some grammatical and Exception types issues (#12015)
Fixed some grammatical issues and Exception types.

@baskaryan , @eyurtsev

---------

Co-authored-by: Sanskar Tanwar <142409040+SanskarTanwarShorthillsAI@users.noreply.github.com>
Co-authored-by: UpneetShorthillsAI <144228282+UpneetShorthillsAI@users.noreply.github.com>
Co-authored-by: HarshGuptaShorthillsAI <144897987+HarshGuptaShorthillsAI@users.noreply.github.com>
Co-authored-by: AdityaKalraShorthillsAI <143726711+AdityaKalraShorthillsAI@users.noreply.github.com>
Co-authored-by: SakshiShorthillsAI <144228183+SakshiShorthillsAI@users.noreply.github.com>
2023-10-26 21:12:38 -04:00
William FH
923696b664
Wfh/json edit dist (#12361)
Compare predicted json to reference. First canonicalize (sort keys, rm
whitespace separators), then return normalized string edit distance.

Not a silver bullet but maybe an easy way to capture structure
differences in a less flakey way

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2023-10-26 18:10:28 -07:00
Erick Friis
4db8d82c55
CLI CI 2 (#12387)
Will run all CI because of _test change, but future PRs against CLI will
only trigger the new CLI one

Has a bunch of file changes related to formatting/linting.

No mypy yet - coming soon
2023-10-26 17:01:31 -07:00
Tyler Hutcherson
231d553824
Update broken redis tests (#12371)
Update broken redis tests -- tiny PR :) 
- **Description:** Fixes Redis tests on master (look like it was broken
by https://github.com/langchain-ai/langchain/pull/11257)
  - **Issue:** None,
  - **Dependencies:** No
  - **Tag maintainer:** @baskaryan @Spartee 
  - **Twitter handle:** N/A

Co-authored-by: Sam Partee <sam.partee@redis.com>
2023-10-26 16:13:14 -07:00
Erick Friis
03e79e62c2
cli fix (#12380) 2023-10-26 15:29:49 -07:00
Bagatur
76230d2c08
fireworks scheduled integration tests (#12373) 2023-10-26 14:24:42 -07:00
Josh Phillips
01c5cd365b
Fix SupbaseVectoreStore write operation timeout (#12318)
**Description**
This small change will make chunk_size a configurable parameter for
loading documents into a Supabase database.

**Issue**
https://github.com/langchain-ai/langchain/issues/11422

**Dependencies**
No chanages

**Twitter**
@ j1philli

**Reminder**
If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.

---------

Co-authored-by: Greg Richardson <greg.nmr@gmail.com>
2023-10-26 14:19:17 -07:00
Bagatur
b10cefb160
lint fix: rm init (#12374) 2023-10-26 14:16:25 -07:00
Harrison Chase
b43996e553
Harrison/improve cli (#12368) 2023-10-26 13:53:59 -07:00
Harrison Chase
9ce38726a2
fix some stuff (#12292)
Co-authored-by: Erick Friis <erick@langchain.dev>
2023-10-26 13:30:36 -07:00
Cynthia Yang
6ce276e099
Support Fireworks batching (#8) (#12052)
Description

* Add _generate and _agenerate to support Fireworks batching.
* Add stop words test cases
* Opt out retry mechanism

Issue - Not applicable
Dependencies - None
Tag maintainer - @baskaryan
2023-10-26 16:01:08 -04:00
Tyler Hutcherson
2f0c9d8269
Fix redis vectorfield schema defaults (#12223)
- **Description:** refactors the redis vector field schema to properly
handle default values, includes a new unit test suite.
  - **Issue:** N/A
  - **Dependencies:** nothing new.
  - **Tag maintainer:** @baskaryan @Spartee 
  - **Twitter handle:** this is a tiny fix/improvement :) 

This issue was causing some clients/cuatomers issues when building a
vector index on Redis on smaller db instances (due to fault default
values in index configuration). It would raise an error like:

```redis.exceptions.ResponseError: Vector index initial capacity 20000 exceeded server limit (852 with the given parameters)```

This PR will address this moving forward.
2023-10-26 12:17:58 -07:00