Mirror of https://github.com/hwchase17/langchain, synced 2024-11-10 01:10:59 +00:00
6c9b0f96f3
The default separator list for the `RecursiveCharacterTextSplitter` assumes that spaces mark word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese). This PR extends the documentation to explain how to handle those languages by adding extra punctuation to the separators, along with zero-width spaces, which some typesetters insert and which help the splitter avoid splitting inside words. Ideally, **these separators would be a constant in the module**, but for now, defining them in the documentation is a start.
api_reference
data
docs
scripts
src
static
.gitignore
.local_build.sh
.yarnrc.yml
babel.config.js
code-block-loader.js
docusaurus.config.js
package.json
README.md
settings.ini
sidebars.js
vercel_build.sh
vercel_requirements.txt
vercel.json
yarn.lock
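The commit description above talks about extending the separator list with extra punctuation and zero-width spaces. The following is a minimal, standard-library-only sketch of that idea, not LangChain's implementation: `split_recursively` and `CJK_FRIENDLY_SEPARATORS` are hypothetical names, and the specific separators listed are an illustrative assumption rather than values taken from the commit.

```python
from typing import List

# Illustrative separator list (an assumption, not copied from the library).
# Earlier entries are preferred; later entries are tried only if needed.
CJK_FRIENDLY_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # zero-width space
    "\uff0c",  # fullwidth comma
    "\u3001",  # ideographic comma
    "\uff0e",  # fullwidth full stop
    "\u3002",  # ideographic full stop
]


def split_recursively(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    The first separator found in the text is used; pieces that are still
    too long are split again with the remaining separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep and sep in text:
            rest = separators[index + 1:]
            break
    else:
        # No usable separator is left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Keep each separator attached to the end of the piece before it.
    parts = text.split(sep)
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= chunk_size:
            current += piece  # greedily merge small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(split_recursively(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "今日は晴れです。明日は雨が降るでしょう。週末は曇りです。"

# With only whitespace separators, the text is cut at arbitrary positions.
naive_chunks = split_recursively(text, 12, ["\n\n", "\n", " "])

# With the extended list, the cuts land after the ideographic full stops.
chunks = split_recursively(text, 12, CJK_FRIENDLY_SEPARATORS)
print(naive_chunks)
print(chunks)
```

In this sketch the whitespace-only list has nothing to split on, so the first chunk ends mid-sentence, while the extended list yields one sentence per chunk. The same list could be supplied wherever a splitter accepts custom separators.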
LangChain Documentation
For more information on contributing to our documentation, see the Documentation Contributing Guide.