**Description**
Adding different threshold types to the semantic chunker. I’ve had much
better and predictable performance when using standard deviations
instead of percentiles.
![image](https://github.com/langchain-ai/langchain/assets/44395485/066e84a8-460e-4da5-9fa1-4ff79a1941c5)
For all the documents I’ve tried, the distribution of distances look
similar to the above: positively skewed normal distribution. All skews
I’ve seen are less than 1 so that explains why standard deviations
perform well, but I’ve included IQR if anyone wants something more
robust.
Also, using the percentile method backwards, you can declare the number
of clusters and use semantic chunking to get an ‘optimal’ splitting.
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
## Amazon Personalize support on Langchain
This PR is a successor to this PR -
https://github.com/langchain-ai/langchain/pull/13216
This PR introduces an integration with [Amazon
Personalize](https://aws.amazon.com/personalize/) to help you to
retrieve recommendations and use them in your natural language
applications. This integration provides two new components:
1. An `AmazonPersonalize` client, that provides a wrapper around the
Amazon Personalize API.
2. An `AmazonPersonalizeChain`, that provides a chain to pull in
recommendations using the client, and then generating the response in
natural language.
We have added this to langchain_experimental since there was feedback
from the previous PR about having this support in experimental rather
than the core or community extensions.
Here is some sample code to explain the usage.
```python
from langchain_experimental.recommenders import AmazonPersonalize
from langchain_experimental.recommenders import AmazonPersonalizeChain
from langchain.llms.bedrock import Bedrock
recommender_arn = "<insert_arn>"
client=AmazonPersonalize(
credentials_profile_name="default",
region_name="us-west-2",
recommender_arn=recommender_arn
)
bedrock_llm = Bedrock(
model_id="anthropic.claude-v2",
region_name="us-west-2"
)
chain = AmazonPersonalizeChain.from_llm(
llm=bedrock_llm,
client=client
)
response = chain({'user_id': '1'})
```
Reviewer: @3coins
Noticed and fixed a few typos in the SmartLLMChain default ideation and
critique prompts
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- **Description:**
[AS-IS] When dealing with a yaml file, the extension must be .yaml.
[TO-BE] In the absence of extension length constraints in the OS, the
extension of the YAML file is yaml, but control over the yml extension
must still be made.
It's as if it's an error because it's a .jpg extension in jpeg support.
- **Issue:** -
- **Dependencies:**
no dependencies required for this change,
## Summary
This PR upgrades LangChain's Ruff configuration in preparation for
Ruff's v0.2.0 release. (The changes are compatible with Ruff v0.1.5,
which LangChain uses today.) Specifically, we're now warning when
linter-only options are specified under `[tool.ruff]` instead of
`[tool.ruff.lint]`.
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
As described in issue #17060, in the case in which text has only one
sentence the following function fails. Checking for that and adding a
return case fixed the issue.
```python
def split_text(self, text: str) -> List[str]:
"""Split text into multiple components."""
# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r"(?<=[.?!])\s+", text)
sentences = [
{"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
]
sentences = combine_sentences(sentences)
embeddings = self.embeddings.embed_documents(
[x["combined_sentence"] for x in sentences]
)
for i, sentence in enumerate(sentences):
sentence["combined_sentence_embedding"] = embeddings[i]
distances, sentences = calculate_cosine_distances(sentences)
start_index = 0
# Create a list to hold the grouped sentences
chunks = []
breakpoint_percentile_threshold = 95
breakpoint_distance_threshold = np.percentile(
distances, breakpoint_percentile_threshold
) # If you want more chunks, lower the percentile cutoff
indices_above_thresh = [
i for i, x in enumerate(distances) if x > breakpoint_distance_threshold
] # The indices of those breakpoints on your list
# Iterate through the breakpoints to slice the sentences
for index in indices_above_thresh:
# The end index is the current breakpoint
end_index = index
# Slice the sentence_dicts from the current start index to the end index
group = sentences[start_index : end_index + 1]
combined_text = " ".join([d["sentence"] for d in group])
chunks.append(combined_text)
# Update the start index for the next group
start_index = index + 1
# The last group, if any sentences remain
if start_index < len(sentences):
combined_text = " ".join([d["sentence"] for d in sentences[start_index:]])
chunks.append(combined_text)
return chunks
```
Co-authored-by: Giulio Zani <salamanderxing@Giulios-MBP.homenet.telecomitalia.it>
- **Description:** Presidio-based anonymizers are not working because
`_remove_conflicts_and_get_text_manipulation_data` was being called
without a conflict resolution strategy. This PR fixes this issue. In
addition, it removes some mutable default arguments (antipattern).
To reproduce the issue, just run the very first cell of this
[notebook](https://python.langchain.com/docs/guides/privacy/2/) from
langchain's documentation.
<!-- Thank you for contributing to LangChain!
Please title your PR "<package>: <description>", where <package> is
whichever of langchain, community, core, experimental, etc. is being
modified.
Replace this entire comment with:
- **Description:** a description of the change,
- **Issue:** the issue # it fixes if applicable,
- **Dependencies:** any dependencies required for this change,
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!
Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` from the root
of the package you've modified to check this locally.
See contribution guidelines for more information on how to write/run
tests, lint, etc: https://python.langchain.com/docs/contributing/
If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
-->
experimental relies on `from langchain_core.runnables.config import
run_in_executor` which was introduced in core 0.1.5.
Updated pyproject dependency as well as minimum version test.
…tch]: import models from community
ran
```bash
git grep -l 'from langchain\.chat_models' | xargs -L 1 sed -i '' "s/from\ langchain\.chat_models/from\ langchain_community.chat_models/g"
git grep -l 'from langchain\.llms' | xargs -L 1 sed -i '' "s/from\ langchain\.llms/from\ langchain_community.llms/g"
git grep -l 'from langchain\.embeddings' | xargs -L 1 sed -i '' "s/from\ langchain\.embeddings/from\ langchain_community.embeddings/g"
git checkout master libs/langchain/tests/unit_tests/llms
git checkout master libs/langchain/tests/unit_tests/chat_models
git checkout master libs/langchain/tests/unit_tests/embeddings/test_imports.py
make format
cd libs/langchain; make format
cd ../experimental; make format
cd ../core; make format
```
- **Description:** This PR fixes test failures on Windows caused by path
handling differences and unescaped special characters in regex. The
failing tests are:
```
FAILED tests/unit_tests/storage/test_filesystem.py::test_yield_keys - AssertionError: assert ['key1', 'subdir\\key2'] == ['key1', 'subdir/key2']
FAILED tests/unit_tests/test_imports.py::test_importable_all - ModuleNotFoundError: No module named 'langchain_community.langchain_community\\adapters'
FAILED tests/unit_tests/tools/file_management/test_utils.py::test_get_validated_relative_path_errs_on_absolute - re.error: incomplete escape \U at position 53
FAILED tests/unit_tests/tools/file_management/test_utils.py::test_get_validated_relative_path_errs_on_parent_dir - re.error: incomplete escape \U at position 69
FAILED tests/unit_tests/tools/file_management/test_utils.py::test_get_validated_relative_path_errs_for_symlink_outside_root - re.error: incomplete escape \U at position 64
```
- **Issue:** fixes
https://github.com/langchain-ai/langchain/issues/11775 (partially)
- **Dependencies:** none
Addded missed docstrings. Fixed inconsistency in docstrings.
**Note** CC @efriis
There were PR errors on
`langchain_experimental/prompt_injection_identifier/hugging_face_identifier.py`
But, I didn't touch this file in this PR! Can it be some cache problems?
I fixed this error.
- **Description:** This is addition to [my previous
PR](https://github.com/langchain-ai/langchain/pull/13930) with
improvements to flexibility allowing different models and notebook to
use ONNX runtime for faster speed. Since the last PR, [our
model](https://huggingface.co/laiyer/deberta-v3-base-prompt-injection)
got more than 660k downloads, and with the [public
benchmark](https://huggingface.co/spaces/laiyer/prompt-injection-benchmark)
showed much fewer false-positives than the previous one from deepset.
Additionally, on the ONNX runtime, it can be running 3x faster on the
CPU, which might be handy for builders using Langchain.
**Issue:** N/A
- **Dependencies:** N/A
- **Tag maintainer:** N/A
- **Twitter handle:** `@laiyer_ai`
TIL `**` globstar doesn't work in make
Makefile changes fix that.
`__getattr__` changes allow import of all files, but raise error when
accessing anything from the module.
file deletions were corresponding libs change from #14559
**Description**
The `SmartLLMChain` was was fixed to output key "resolution".
Unfortunately, this prevents the ability to use multiple `SmartLLMChain`
in a `SequentialChain` because of colliding output keys. This change
simply gives the option the customize the output key to allow for
sequential chaining. The default behavior is the same as the current
behavior.
Now, it's possible to do the following:
```
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_experimental.smart_llm import SmartLLMChain
from langchain.chains import SequentialChain
joke_prompt = PromptTemplate(
input_variables=["content"],
template="Tell me a joke about {content}.",
)
review_prompt = PromptTemplate(
input_variables=["scale", "joke"],
template="Rate the following joke from 1 to {scale}: {joke}"
)
llm = ChatOpenAI(temperature=0.9, model_name="gpt-4-32k")
joke_chain = SmartLLMChain(llm=llm, prompt=joke_prompt, output_key="joke")
review_chain = SmartLLMChain(llm=llm, prompt=review_prompt, output_key="review")
chain = SequentialChain(
chains=[joke_chain, review_chain],
input_variables=["content", "scale"],
output_variables=["review"],
verbose=True
)
response = chain.run({"content": "chickens", "scale": "10"})
print(response)
```
---------
Co-authored-by: Erick Friis <erick@langchain.dev>