Commit Graph

205 Commits

Author SHA1 Message Date
Adam Treat
f605a5b686 Add q8_0 kernels to kompute shaders and bump to latest llama/gguf. 2023-10-05 18:16:19 -04:00
Cebtenzzre
672cb850f9 differentiate between init failure and unsupported models 2023-10-05 18:16:19 -04:00
Adam Treat
906699e8e9 Bump to latest llama/gguf branch. 2023-10-05 18:16:19 -04:00
Cebtenzzre
088afada49 llamamodel: fix static vector in LLamaModel::endTokens 2023-10-05 18:16:19 -04:00
Adam Treat
b4d82ea289 Bump to the latest fixes for vulkan in llama. 2023-10-05 18:16:19 -04:00
Adam Treat
12f943e966 Fix regenerate button to be deterministic and bump the llama version to latest we have for gguf. 2023-10-05 18:16:19 -04:00
Adam Treat
5d346e13d7 Add q6_k kernels for vulkan. 2023-10-05 18:16:19 -04:00
Adam Treat
4eefd386d0 Refactor for subgroups on mat * vec kernel. 2023-10-05 18:16:19 -04:00
Cebtenzzre
3c2aa299d8 gptj: remove unused variables 2023-10-05 18:16:19 -04:00
Cebtenzzre
f9deb87d20 convert scripts: add feed-forward length for better compatiblilty
This GGUF key is used by all llama.cpp models with upstream support.
2023-10-05 18:16:19 -04:00
Cebtenzzre
cc7675d432 convert scripts: make gptj script executable 2023-10-05 18:16:19 -04:00
Cebtenzzre
0493e6eb07 convert scripts: use bytes_to_unicode from transformers 2023-10-05 18:16:19 -04:00
Cebtenzzre
d5d72f0361 gpt-j: update inference to match latest llama.cpp insights
- Use F16 KV cache
- Store transposed V in the cache
- Avoid unnecessary Q copy

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml upstream commit 0265f0813492602fec0e1159fe61de1bf0ccaf78
2023-10-05 18:16:19 -04:00
Cebtenzzre
050e7f076e backend: port GPT-J to GGUF 2023-10-05 18:16:19 -04:00
Cebtenzzre
8f3abb37ca fix references to removed model types 2023-10-05 18:16:19 -04:00
Cebtenzzre
4219c0e2e7 convert scripts: make them directly executable 2023-10-05 18:16:19 -04:00
Cebtenzzre
ce7be1db48 backend: use llamamodel.cpp for Falcon 2023-10-05 18:16:19 -04:00
Cebtenzzre
cca9e6ce81 convert_mpt_hf_to_gguf.py: better tokenizer decoding 2023-10-05 18:16:19 -04:00
Cebtenzzre
25297786db convert scripts: load model as late as possible 2023-10-05 18:16:19 -04:00
Cebtenzzre
fd47088f2b conversion scripts: cleanup 2023-10-05 18:16:19 -04:00
Cebtenzzre
6277eac9cc backend: use llamamodel.cpp for StarCoder 2023-10-05 18:16:19 -04:00
Cebtenzzre
17fc9e3e58 backend: port Replit to GGUF 2023-10-05 18:16:19 -04:00
Cebtenzzre
7c67262a13 backend: port MPT to GGUF 2023-10-05 18:16:19 -04:00
Cebtenzzre
42bcb814b3 backend: port BERT to GGUF 2023-10-05 18:16:19 -04:00
Cebtenzzre
1d29e4696c llamamodel: metal supports all quantization types now 2023-10-05 18:16:19 -04:00
Aaron Miller
507753a37c macos build fixes 2023-10-05 18:16:19 -04:00
Adam Treat
d90d003a1d Latest rebase on llama.cpp with gguf support. 2023-10-05 18:16:19 -04:00
Adam Treat
99c106e6b5 Fix a bug seen on AMD RADEON cards with vulkan backend. 2023-09-26 11:59:47 -04:00
Jacob Nguyen
e86c63750d Update llama.cpp.cmake
Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>
2023-09-16 11:42:56 -07:00
Adam Treat
84905aa281 Fix for crashes on systems where vulkan is not installed properly. 2023-09-16 12:19:46 -04:00
Adam Treat
045f6e6cdc Link against ggml in bin so we can get the available devices without loading a model. 2023-09-15 14:45:25 -04:00
Adam Treat
aa33419c6e Fallback to CPU more robustly. 2023-09-14 16:53:11 -04:00
Adam Treat
9013a089bd Bump to new llama with new bugfix. 2023-09-14 10:02:11 -04:00
Adam Treat
3076e0bf26 Only show GPU when we're actually using it. 2023-09-14 09:59:19 -04:00
Adam Treat
cf4eb530ce Sync to a newer version of llama.cpp with bugfix for vulkan. 2023-09-13 21:01:44 -04:00
Adam Treat
4b9a345aee Update the submodule. 2023-09-13 17:05:46 -04:00
Aaron Miller
6f038c136b init at most one vulkan device, submodule update
fixes issues w/ multiple of the same gpu
2023-09-13 12:49:53 -07:00
Adam Treat
8f99dca70f Bring the vulkan backend to the GUI. 2023-09-13 11:26:10 -04:00
Aaron Miller
f0735efa7d vulkan python bindings on windows fixes 2023-09-12 14:16:02 -07:00
Adam Treat
c953b321b7 Don't link against libvulkan. 2023-09-12 14:26:56 -04:00
Aaron Miller
c4d23512e4 remove extra dynamic linker deps when building with vulkan 2023-09-11 08:44:39 -07:00
Adam Treat
85e34598f9 more circleci 2023-08-31 15:29:54 -04:00
Adam Treat
f578fa6cdf Fix for windows. 2023-08-31 15:29:54 -04:00
Adam Treat
17d3e4976c Add a comment indicating future work. 2023-08-31 15:29:54 -04:00
Adam Treat
987546c63b Nomic vulkan backend licensed under the Software for Open Models License (SOM), version 1.0. 2023-08-31 15:29:54 -04:00
Adam Treat
d55cbbee32 Update to newer llama.cpp and disable older forks. 2023-08-31 15:29:54 -04:00
Aaron Miller
0bc2274869 bump llama.cpp version + needed fixes for that 2023-08-31 15:29:54 -04:00
aaron miller
33c22be2aa starcoder: use ggml_graph_plan 2023-08-31 15:29:54 -04:00
Cosmic Snow
108d950874 Fix Windows unable to load models on older Windows builds
- Replace high-level IsProcessorFeaturePresent
- Reintroduce low-level compiler intrinsics implementation
2023-08-09 09:27:43 +02:00
Adam Treat
6d03b3e500 Add starcoder support. 2023-07-27 09:15:16 -04:00
cosmic-snow
2d02c65177
Handle edge cases when generating embeddings (#1215)
* Handle edge cases when generating embeddings
* Improve Python handling & add llmodel_c.h note
- In the Python bindings fail fast with a ValueError when text is empty
- Advice other bindings authors to do likewise in llmodel_c.h
2023-07-17 13:21:03 -07:00
Aaron Miller
1c4a244291 bump mem allocation a bit 2023-07-14 09:48:57 -04:00
Adam Treat
ee4186d579 Fixup bert python bindings. 2023-07-14 09:48:57 -04:00
cosmic-snow
6200900677
Fix Windows MSVC arch detection (#1194)
- in llmodel.cpp to fix AVX-only handling

Signed-off-by: cosmic-snow <134004613+cosmic-snow@users.noreply.github.com>
2023-07-13 14:44:17 -04:00
Adam Treat
4963db8f43 Bump the version numbers for both python and c backend. 2023-07-13 14:21:46 -04:00
Adam Treat
0efdbfcffe Bert 2023-07-13 14:21:46 -04:00
Adam Treat
315a1f2aa2 Move it back as internal class. 2023-07-13 14:21:46 -04:00
Adam Treat
ae8eb297ac Add sbert backend. 2023-07-13 14:21:46 -04:00
Adam Treat
1f749d7633 Clean up backend code a bit and hide impl. details. 2023-07-13 14:21:46 -04:00
Adam Treat
33557b1f39 Move the implementation out of llmodel class. 2023-07-13 14:21:46 -04:00
Aaron Miller
432b7ebbd7 include windows.h just to be safe 2023-07-12 12:46:46 -04:00
Aaron Miller
95b8fb312e windows/msvc: use high level processor feature detection API
see https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent
2023-07-12 12:46:46 -04:00
Aaron Miller
f0faa23ad5
cmakelists: always export build commands (#1179)
friendly for using editors with clangd integration that don't also
manage the build themselves
2023-07-12 10:49:24 -04:00
Aaron Miller
4a24b586df llama.cpp: metal buffer freeing 2023-06-30 21:07:21 -03:00
Aaron Miller
137bc2c367 replit: free metal context 2023-06-30 21:07:21 -03:00
Aaron Miller
57dc0c8953 adjust eval buf sizes to pass long input test 2023-06-30 21:07:21 -03:00
Aaron Miller
7a5f6e4726 limit prompt batch size to 128 2023-06-30 21:07:21 -03:00
Aaron Miller
883775bc5f move 230511 submodule to nomic fork, fix alibi assert 2023-06-30 21:07:21 -03:00
Andriy Mulyar
46a0762bd5
Python Bindings: Improved unit tests, documentation and unification of API (#1090)
* Makefiles, black, isort

* Black and isort

* unit tests and generation method

* chat context provider

* context does not reset

* Current state

* Fixup

* Python bindings with unit tests

* GPT4All Python Bindings: chat contexts, tests

* New python bindings and backend fixes

* Black and Isort

* Documentation error

* preserved n_predict for backwords compat with langchain

---------

Co-authored-by: Adam Treat <treat.adam@gmail.com>
2023-06-30 16:02:02 -04:00
Aaron Miller
40a3faeb05
Use ggml scratch bufs for mpt and gptj models (#1104)
* backend/gptj: use scratch buffers

reduces total memory required and makes eval buf not grow with n_past

* backend/mpt: use scratch bufs

* fix format-related compile warnings
2023-06-30 10:53:45 -07:00
Aaron Miller
8d19ef3909
backend: factor out common elements in model code (#1089)
* backend: factor out common structs in model code

prepping to hack on these by hopefully making there be fewer places to fix the same bug

rename

* use common buffer wrapper instead of manual malloc

* fix replit compile warnings
2023-06-28 17:35:07 -07:00
Aaron Miller
28d41d4f6d
falcon: use *model-local* eval & scratch bufs (#1079)
fixes memory leaks copied from ggml/examples based implementation
2023-06-27 16:09:11 -07:00
Zach Nussbaum
2565f6a94a feat: add conversion script 2023-06-27 14:06:39 -03:00
Aaron Miller
198b5e4832 add Falcon 7B model
Tested with https://huggingface.co/TheBloke/falcon-7b-instruct-GGML/blob/main/falcon7b-instruct.ggmlv3.q4_0.bin
2023-06-27 14:06:39 -03:00
Aaron Miller
db34a2f670 llmodel: skip attempting Metal if model+kvcache > 53% of system ram 2023-06-26 19:46:49 -03:00
Aaron Miller
b19a3e5b2c add requiredMem method to llmodel impls
most of these can just shortcut out of the model loading logic llama is a bit worse to deal with because we submodule it so I have to at least parse the hparams, and then I just use the size on disk as an estimate for the mem size (which seems reasonable since we mmap() the llama files anyway)
2023-06-26 18:27:58 -03:00
Adam Treat
a0f80453e5 Use sysinfo in backend. 2023-06-26 14:14:49 -04:00
niansa/tuxifan
47323f8591 Update replit.cpp
replit_tokenizer_detokenize returnins std::string now

Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-26 14:49:58 -03:00
niansa
0855c0df1d Fixed Replit implementation compile warnings 2023-06-26 14:49:58 -03:00
Aaron Miller
1290b32451 update to latest mainline llama.cpp
add max_size param to ggml_metal_add_buffer - introduced in https://github.com/ggerganov/llama.cpp/pull/1826
2023-06-26 14:40:52 -03:00
niansa/tuxifan
5eee16c97c Do not specify "success" as error for unsupported models
Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-22 09:28:40 +02:00
Adam Treat
bd58c46da0 Initialize these to nullptr to prevent double deletion when a model fails to load. 2023-06-20 18:23:45 -04:00
niansa/tuxifan
68f9786ed9
Use operator ""_MiB (#991) 2023-06-16 15:56:22 -04:00
Aaron Miller
abc081e48d
fix llama.cpp k-quants (#988)
* enable k-quants on *all* mainline builds
2023-06-15 14:06:14 -07:00
Aaron Miller
c4319d2c8e
dlhandle: prevent libs from using each other's symbols (#977)
use RTLD_LOCAL so that symbols are *only* exposed via dlsym

without this all symbols exported by the libs are available for symbol
resolution, resulting in different lib versions potentially resolving
*each other's* symbols, causing incredibly cursed behavior such as
https://gist.github.com/apage43/085c1ff69f6dd05387793ebc301840f6
2023-06-13 14:52:11 -04:00
Aaron Miller
f71d8efc71
metal replit (#931)
metal+replit

makes replit work with Metal and removes its use of `mem_per_token`
in favor of fixed size scratch buffers (closer to llama.cpp)
2023-06-13 07:29:14 -07:00
Aaron Miller
85964a7635
bump llama.cpp mainline to latest (#964) 2023-06-13 08:40:38 -04:00
Tim Miller
797891c995
Initial Library Loader for .NET Bindings / Update bindings to support newest changes (#763)
* Initial Library Loader

* Load library as part of Model factory

* Dynamically search and find the dlls

* Update tests to use locally built runtimes

* Fix dylib loading, add macos runtime support for sample/tests

* Bypass automatic loading by default.

* Only set CMAKE_OSX_ARCHITECTURES if not already set, allow cross-compile

* Switch Loading again

* Update build scripts for mac/linux

* Update bindings to support newest breaking changes

* Fix build

* Use llmodel for Windows

* Actually, it does need to be libllmodel

* Name

* Remove TFMs, bypass loading by default

* Fix script

* Delete mac script

---------

Co-authored-by: Tim Miller <innerlogic4321@ghmail.com>
2023-06-13 14:05:34 +02:00
Aaron Miller
88616fde7f
llmodel: change tokenToString to not use string_view (#968)
fixes a definite use-after-free and likely avoids some other
potential ones - std::string will convert to a std::string_view
automatically but as soon as the std::string in question goes out of
scope it is already freed and the string_view is pointing at freed
memory - this is *mostly* fine if its returning a reference to the
tokenizer's internal vocab table but it's, imo, too easy to return a
reference to a dynamically constructed string with this as replit is
doing (and unfortunately needs to do to convert the internal whitespace
replacement symbol back to a space)
2023-06-13 07:14:02 -04:00
Adam Treat
84deebd223 Fix compile for windows and linux again. PLEASE DON'T REVERT THISgit gui! 2023-06-12 17:08:55 -04:00
Juuso Alasuutari
5cfb1bda89
llmodel: add model wrapper destructor, fix mem leak in golang bindings (#862)
Signed-off-by: Juuso Alasuutari <juuso.alasuutari@gmail.com>
2023-06-12 09:41:22 -07:00
Cosmic Snow
ae4a275bcd Fix Windows MSVC AVX builds
- bug introduced in 0cb2b86730
- currently getting: `warning C5102: ignoring invalid command-line macro definition '/arch:AVX2'`
- solution is to use `_options(...)` not `_definitions(...)`
2023-06-12 08:55:55 -07:00
Adam Treat
b906fb4057 When recalculating context we can't erase the BOS. 2023-06-12 08:43:20 -07:00
Aaron Miller
d3ba1295a7
Metal+LLama take two (#929)
Support latest llama with Metal
---------

Co-authored-by: Adam Treat <adam@nomic.ai>
Co-authored-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-09 16:48:46 -04:00
Adam Treat
b162b5c64e Revert "llama on Metal (#885)"
This reverts commit c55f81b860.
2023-06-09 15:08:46 -04:00
Aaron Miller
c55f81b860
llama on Metal (#885)
Support latest llama with Metal

---------

Co-authored-by: Adam Treat <adam@nomic.ai>
Co-authored-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-09 14:58:12 -04:00
niansa/tuxifan
14e9ccbc6a Do auto detection by default in C++ API
Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-09 17:01:19 +02:00
niansa/tuxifan
f03da8d732 Removed double-static from variables in replit.cpp
The anonymous namespace already makes it static.

Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-09 08:55:15 -04:00
niansa
0cb2b86730 Synced llama.cpp.cmake with upstream 2023-06-08 18:21:32 -04:00
Aaron Miller
47fbc0e309
non-llama: explicitly greedy sampling for temp<=0 (#901)
copied directly from llama.cpp - without this temp=0.0 will just
scale all the logits to infinity and give bad output
2023-06-08 11:08:30 -07:00