Previously the CJK character detection defined only characters in the
range U+4000..U+AFFF as "CJK characters". This excludes an incredibly
large number of CJK characters within the BMP, let alone the whole two
planes dedicated to rarer CJK characters (the SIP and TIP). As a result,
a very large number of Chinese, Japanese, and Korean characters were not
detected as being CJK characters.
While slightly less elegant-looking, it is far more accurate to compute
the codepoint from the utf8 character and then see if it falls within
one of the defined CJK blocks. This is not future-proof against future
CJK ideograph extensions in future Unicode versions, but there is no
real way to accurately predict such changes so this is the best we can
do without accidentally treating characters explicitily defined as being
non-CJK in Unicode as CJK.
While we're at it, copy Lua 5.3's utf8.charpattern constant definition
so that we can more easily write utf8 iterators with string.gmatch (at
least in the interim until there is a rework of utf8 handling in
KOReader and everything is rebuilt on top of utf8proc).
Some unit tests are added for Korean and Japanese text, and the existing
unit tests needed a minor adjustment to handle the fact that
isSplittable now correctly detects CJK punctuation as a character to
compare against the forbidden split rules.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
* Fix typo in dropbox
Reported by @lescheck
* Rephrase text justification explanation more elegantly
* CoverBrowser: fix up some plurals
* Statistics: remove random use of template function
* Use ngettext for minute/minutes and second/seconds
* Change KB/MB/GB to kB/MB/GB SI units
- show icons or letters as prefix of items
- various footer separators
- progress percentage format with decimal digits
- time in 12/24 format
- two duration formats (1:30, 1h30')
- move some options into Settings submenu
- Factored out duplicate code from filemanager.lua and filemanagerhistory.lua
to new filemanagerbookinfo.lua (and other common code to filemanagerutil.lua).
- Uses sidecar files' new doc_props and doc_pages settings, or fallback to
old 'stats' settings, or to opening document.
- Shows filename, filetype and directory.
- Shows description (Hold to see whole truncated text), keywords, and
cover image (tap to extract image from document and display it if available).
- Book information now available from reader menu, to display info about
the currently opened book.
- Convert possibly HTML description to plain text via added
util.htmlToPlainTextIfHtml() (for simple HTML conversion).
util.isSplitable() accepts now also the previous char to help
decide if a space can be used to split a line.
TextBoxWidget:_splitCharWidthList() : simplified logic
util: made isSplitable() accept an optional next_char
for wiser decision
textboxwidget: speed up rendering, enhanced text wrapping,
allow selection of multiple words with Hold.
scrolltextwidget: allow scrolling with Tap.
Details in #2393