langchain/libs
Raviraj 858ce264ef
SemanticChunker : Feature Addition ("Semantic Splitting with gradient") (#22895)
```SemanticChunker``` currently provides three methods to split text semantically:
- percentile
- standard_deviation
- interquartile

I propose a new method, ```gradient```. In this method, the gradient of the distance is used to split chunks, together with the percentile method (technically). It is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.
I have tested this merge on a set of 10 domain-specific documents (mostly legal).
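
A minimal sketch of the idea described above, for illustration only and not the merged implementation. It assumes the distances between consecutive sentence embeddings are already available as a list of floats, takes a finite-difference gradient of that list, and marks a breakpoint wherever the gradient exceeds a percentile threshold. The function names (`gradient`, `percentile`, `gradient_breakpoints`), the `threshold_percentile` parameter, its value of 95, and the sample distances are all assumptions made for this example.

```python
from typing import List, Tuple


def gradient(values: List[float]) -> List[float]:
    """Finite-difference gradient with unit spacing.

    Central differences in the interior, one-sided differences at both ends.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two distances")
    grads = [values[1] - values[0]]
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2.0)
    grads.append(values[-1] - values[-2])
    return grads


def percentile(values: List[float], q: float) -> float:
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def gradient_breakpoints(
    distances: List[float], threshold_percentile: float = 95.0
) -> Tuple[List[int], float]:
    """Return indices whose gradient exceeds the percentile threshold.

    The threshold is computed on the gradient array, not on the raw
    distances (assumed reading of the proposal).
    """
    grads = gradient(distances)
    threshold = percentile(grads, threshold_percentile)
    indices = [i for i, g in enumerate(grads) if g > threshold]
    return indices, threshold


# Hypothetical distances: mostly similar sentences with one sharp topic shift.
sample_distances = [0.10, 0.12, 0.11, 0.13, 0.60, 0.12, 0.11, 0.10]
breakpoints, threshold = gradient_breakpoints(sample_distances)
print(breakpoints, threshold)
```

Because the threshold is taken over the gradient rather than the raw distances, the sketch reacts to how sharply the distance changes, which is the property the proposal relies on for highly correlated text.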

Details:
    - **Issue:** Improvement
    - **Dependencies:** NA
    - **Twitter handle:** [x.com/prajapat_ravi](https://x.com/prajapat_ravi)


@hwchase17

---------

Co-authored-by: Raviraj Prajapat <raviraj.prajapat@sirionlabs.com>
Co-authored-by: isaac hershenson <ihershenson@hmc.edu>
3 months ago
..
| Name | Last commit | Updated |
| --- | --- | --- |
| cli | cli[minor]: remove redefined DEFAULT_GIT_REF (#21471) | 3 months ago |
| community | LanceDB integration update (#22869) | 3 months ago |
| core | Include "no escape" and "inverted section" mustache vars in Prompt.input_variables and Prompt.input_schema (#22981) | 3 months ago |
| experimental | SemanticChunker : Feature Addition ("Semantic Splitting with gradient") (#22895) | 3 months ago |
| langchain | langchain: add id_key option to EnsembleRetriever for metadata-based document merging (#22950) | 3 months ago |
| partners | standard-tests[patch]: Update chat model standard tests (#22378) | 3 months ago |
| standard-tests | standard-tests[patch]: Update chat model standard tests (#22378) | 3 months ago |
| text-splitters | text-splitters[patch]: Fix HTMLSectionSplitter (#22812) | 3 months ago |