langchain/docs
Anthony Shaw 6c9b0f96f3
docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)
The default separator list for `RecursiveCharacterTextSplitter`
assumes that spaces mark word boundaries. Some languages [do not put
spaces between
words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries)
(Chinese, Japanese, Thai, Burmese).

This PR extends the documentation to explain how to handle those
languages by adding extra punctuation marks to the separators, along
with zero-width spaces, which some typesetters insert between words, so
that the splitter avoids breaking in the middle of a word.
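
As a rough illustration of the idea, the sketch below uses a simplified, standard-library-only recursive splitter; it is not LangChain's implementation. The names `CJK_SEPARATORS`, `SPACE_ONLY_SEPARATORS`, and `split_recursive` are hypothetical, and the exact characters chosen are an assumption about which punctuation is useful. In LangChain itself, a list like this would be passed as the `separators` argument of `RecursiveCharacterTextSplitter`.

```python
# Hypothetical separator list: space-based defaults plus punctuation and the
# zero-width space used in text without spaces between words.
CJK_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # zero-width space
    "\uff0c",  # fullwidth comma
    "\u3001",  # ideographic comma
    "\uff0e",  # fullwidth full stop
    "\u3002",  # ideographic full stop
    "",        # last resort: cut at a fixed length
]

# Space-oriented list, used here only for comparison.
SPACE_ONLY_SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursive(text, chunk_size, separators=CJK_SEPARATORS):
    """Simplified sketch: try each separator in order until chunks fit.

    Unlike a full splitter, this drops the separators and does not merge
    small pieces back together.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            # No separator helped: fall back to fixed-size cuts.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(split_recursive(piece, chunk_size, separators[i + 1:]))
            return chunks
    return [text]


sample = "今日は晴れです。明日は雨です。"
print(split_recursive(sample, 10))
print(split_recursive(sample, 10, SPACE_ONLY_SEPARATORS))
```

With the extended list, the sample is cut at the ideographic full stops; with the space-only list, no separator matches and the text is cut at a fixed length, which lands in the middle of a sentence.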

Ideally, **these separators would be a constant in the module**, but for
now, defining them in the documentation is a start.
2024-03-26 00:34:00 +00:00

| Name | Last commit | Date |
|---|---|---|
| api_reference | community[patch], langchain[minor]: Add retriever self_query and score_threshold in DingoDB (#18106) | 2024-03-05 15:47:29 -08:00 |
| data | 👥 Update LangChain people data (#18473) | 2024-03-03 19:58:58 -08:00 |
| docs | docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295) | 2024-03-26 00:34:00 +00:00 |
| scripts | ci[minor]: Bump LC scripts package, add retry option (#19285) | 2024-03-19 10:42:59 -07:00 |
| src | docs[minor]: Add chat model selection tabs component (#19296) | 2024-03-19 18:12:46 -07:00 |
| static | docs: Add graph construction docs (#18904) | 2024-03-13 12:27:58 -07:00 |
| .gitignore | docs[minor]: Swap gtag for supabase (#18937) | 2024-03-11 14:23:12 -07:00 |
| .local_build.sh | docs: partner packages (#16960) | 2024-02-02 15:12:21 -08:00 |
| .yarnrc.yml | docs[minor]: Add thumbs up/down to all docs pages (#18526) | 2024-03-04 15:14:28 -08:00 |
| babel.config.js | | |
| code-block-loader.js | | |
| docusaurus.config.js | docs[patch]: properly load/use env vars (#18942) | 2024-03-11 15:38:05 -07:00 |
| package.json | ci[minor]: Bump LC scripts package, add retry option (#19285) | 2024-03-19 10:42:59 -07:00 |
| README.md | | |
| settings.ini | | |
| sidebars.js | docs: Toolkits menu (#16217) | 2024-02-08 14:52:26 -08:00 |
| vercel_build.sh | docs: fix vercel build script (#19090) | 2024-03-14 20:53:43 +00:00 |
| vercel_requirements.txt | | |
| vercel.json | docs: providers update 4 (#18540) | 2024-03-09 13:30:48 -08:00 |
| yarn.lock | ci[minor]: Bump LC scripts package, add retry option (#19285) | 2024-03-19 10:42:59 -07:00 |

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.