Improves output quality by making these tokenizers more closely
match the behavior of the huggingface `tokenizers` based BPE
tokenizers these models were trained with.
Featuring:
* Fixed unicode handling (via ICU)
* Fixed BPE token merge handling
* Complete added vocabulary handling
Fix occurrences of the prompt callback being incorrectly specified, or
the response callback's prototype being incorrectly used in its place.
Signed-off-by: Juuso Alasuutari <juuso.alasuutari@gmail.com>
Caught with AddressSanitizer running a basic prompt test against llmodel
standalone. This fix allows ASan builds to complete a simple prompt
without illegal accesses but there are still notably several leaks.
Fixes the bug where llmodel_model_create prints "Invalid model file" even though the model is loaded correctly. Credits and thanks to @serendipity for the fix.