mirror of https://github.com/hwchase17/langchain
synced 2024-11-11 19:11:02 +00:00
commit a4896da2a0
**Description**

Adds alternative breakpoint threshold types to the semantic chunker. I've had much better and more predictable performance when using standard deviations instead of percentiles.

![image](https://github.com/langchain-ai/langchain/assets/44395485/066e84a8-460e-4da5-9fa1-4ff79a1941c5)

For all the documents I've tried, the distribution of distances looks similar to the above: a positively skewed normal distribution. Every skew I've seen is less than 1, which explains why standard deviations perform well, but I've also included IQR for anyone who wants something more robust. Additionally, by using the percentile method in reverse, you can declare a target number of chunks and use semantic chunking to get an 'optimal' split.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>