cebtenzzre
4338e72a51
MPT: use upstream llama.cpp implementation ( #1515 )
2023-10-19 15:25:17 -04:00
cebtenzzre
0fe2e19691
llamamodel: re-enable error messages by default ( #1537 )
2023-10-19 13:46:33 -04:00
cebtenzzre
017c3a9649
python: prepare version 2.0.0rc1 ( #1529 )
2023-10-18 20:24:54 -04:00
cebtenzzre
9a19c740ee
kompute: fix library loading issues with kp_logger ( #1517 )
2023-10-16 16:58:17 -04:00
Aaron Miller
f79557d2aa
speedup: just use mat*vec shaders for mat*mat
...
so far my from-scratch mat*mats are still slower than just running more
invocations of the existing Metal ported mat*vec shaders - it should be
theoretically possible to make a mat*mat that's faster (for actual
mat*mat cases) than an optimal mat*vec, but it will need to be at
*least* as fast as the mat*vec op and then take special care to be
cache-friendly and save memory bandwidth, as the # of compute ops is the
same
2023-10-16 13:45:51 -04:00
cebtenzzre
22de3c56bd
convert scripts: fix AutoConfig typo ( #1512 )
2023-10-13 14:16:51 -04:00
Aaron Miller
2490977f89
q6k, q4_1 mat*mat
2023-10-12 14:56:54 -04:00
Aaron Miller
afaa291eab
python bindings should be quiet by default
...
* disable llama.cpp logging unless GPT4ALL_VERBOSE_LLAMACPP envvar is
nonempty
* make verbose flag for retrieve_model default false (but also be
overridable via gpt4all constructor)
should be able to run a basic test:
```python
import gpt4all
model = gpt4all.GPT4All('/Users/aaron/Downloads/rift-coder-v0-7b-q4_0.gguf')
print(model.generate('def fib(n):'))
```
and see no non-model output when successful
2023-10-11 14:14:36 -07:00
cebtenzzre
7b611b49f2
llmodel: print an error if the CPU does not support AVX ( #1499 )
2023-10-11 15:09:40 -04:00
Aaron Miller
043617168e
do not process prompts on gpu yet
2023-10-11 13:15:50 -04:00
Aaron Miller
64001a480a
mat*mat for q4_0, q8_0
2023-10-11 13:15:50 -04:00
cebtenzzre
7a19047329
llmodel: do not call magic_match unless build variant is correct ( #1488 )
2023-10-11 11:30:48 -04:00
Cebtenzzre
5fe685427a
chat: clearer CPU fallback messages
2023-10-06 11:35:14 -04:00
Adam Treat
eec906aa05
Speculative fix for build on mac.
2023-10-05 18:37:33 -04:00
Adam Treat
a9acdd25de
Push a new version number for llmodel backend now that it is based on gguf.
2023-10-05 18:18:07 -04:00
Cebtenzzre
8bb6a6c201
rebase on newer llama.cpp
2023-10-05 18:16:19 -04:00
Cebtenzzre
d87573ea75
remove old llama.cpp submodules
2023-10-05 18:16:19 -04:00
Cebtenzzre
cc6db61c93
backend: fix build with Visual Studio generator
...
Use the $<CONFIG> generator expression instead of CMAKE_BUILD_TYPE. This
is needed because Visual Studio is a multi-configuration generator, so
we do not know what the build type will be until `cmake --build` is
called.
Fixes #1470
2023-10-05 18:16:19 -04:00
Adam Treat
f605a5b686
Add q8_0 kernels to kompute shaders and bump to latest llama/gguf.
2023-10-05 18:16:19 -04:00
Cebtenzzre
672cb850f9
differentiate between init failure and unsupported models
2023-10-05 18:16:19 -04:00
Adam Treat
906699e8e9
Bump to latest llama/gguf branch.
2023-10-05 18:16:19 -04:00
Cebtenzzre
088afada49
llamamodel: fix static vector in LLamaModel::endTokens
2023-10-05 18:16:19 -04:00
Adam Treat
b4d82ea289
Bump to the latest fixes for vulkan in llama.
2023-10-05 18:16:19 -04:00
Adam Treat
12f943e966
Fix regenerate button to be deterministic and bump the llama version to latest we have for gguf.
2023-10-05 18:16:19 -04:00
Adam Treat
5d346e13d7
Add q6_k kernels for vulkan.
2023-10-05 18:16:19 -04:00
Adam Treat
4eefd386d0
Refactor for subgroups on mat * vec kernel.
2023-10-05 18:16:19 -04:00
Cebtenzzre
3c2aa299d8
gptj: remove unused variables
2023-10-05 18:16:19 -04:00
Cebtenzzre
f9deb87d20
convert scripts: add feed-forward length for better compatiblilty
...
This GGUF key is used by all llama.cpp models with upstream support.
2023-10-05 18:16:19 -04:00
Cebtenzzre
cc7675d432
convert scripts: make gptj script executable
2023-10-05 18:16:19 -04:00
Cebtenzzre
0493e6eb07
convert scripts: use bytes_to_unicode from transformers
2023-10-05 18:16:19 -04:00
Cebtenzzre
d5d72f0361
gpt-j: update inference to match latest llama.cpp insights
...
- Use F16 KV cache
- Store transposed V in the cache
- Avoid unnecessary Q copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggml upstream commit 0265f0813492602fec0e1159fe61de1bf0ccaf78
2023-10-05 18:16:19 -04:00
Cebtenzzre
050e7f076e
backend: port GPT-J to GGUF
2023-10-05 18:16:19 -04:00
Cebtenzzre
8f3abb37ca
fix references to removed model types
2023-10-05 18:16:19 -04:00
Cebtenzzre
4219c0e2e7
convert scripts: make them directly executable
2023-10-05 18:16:19 -04:00
Cebtenzzre
ce7be1db48
backend: use llamamodel.cpp for Falcon
2023-10-05 18:16:19 -04:00
Cebtenzzre
cca9e6ce81
convert_mpt_hf_to_gguf.py: better tokenizer decoding
2023-10-05 18:16:19 -04:00
Cebtenzzre
25297786db
convert scripts: load model as late as possible
2023-10-05 18:16:19 -04:00
Cebtenzzre
fd47088f2b
conversion scripts: cleanup
2023-10-05 18:16:19 -04:00
Cebtenzzre
6277eac9cc
backend: use llamamodel.cpp for StarCoder
2023-10-05 18:16:19 -04:00
Cebtenzzre
17fc9e3e58
backend: port Replit to GGUF
2023-10-05 18:16:19 -04:00
Cebtenzzre
7c67262a13
backend: port MPT to GGUF
2023-10-05 18:16:19 -04:00
Cebtenzzre
42bcb814b3
backend: port BERT to GGUF
2023-10-05 18:16:19 -04:00
Cebtenzzre
1d29e4696c
llamamodel: metal supports all quantization types now
2023-10-05 18:16:19 -04:00
Aaron Miller
507753a37c
macos build fixes
2023-10-05 18:16:19 -04:00
Adam Treat
d90d003a1d
Latest rebase on llama.cpp with gguf support.
2023-10-05 18:16:19 -04:00
Adam Treat
99c106e6b5
Fix a bug seen on AMD RADEON cards with vulkan backend.
2023-09-26 11:59:47 -04:00
Jacob Nguyen
e86c63750d
Update llama.cpp.cmake
...
Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>
2023-09-16 11:42:56 -07:00
Adam Treat
84905aa281
Fix for crashes on systems where vulkan is not installed properly.
2023-09-16 12:19:46 -04:00
Adam Treat
045f6e6cdc
Link against ggml in bin so we can get the available devices without loading a model.
2023-09-15 14:45:25 -04:00
Adam Treat
aa33419c6e
Fallback to CPU more robustly.
2023-09-14 16:53:11 -04:00
Adam Treat
9013a089bd
Bump to new llama with new bugfix.
2023-09-14 10:02:11 -04:00
Adam Treat
3076e0bf26
Only show GPU when we're actually using it.
2023-09-14 09:59:19 -04:00
Adam Treat
cf4eb530ce
Sync to a newer version of llama.cpp with bugfix for vulkan.
2023-09-13 21:01:44 -04:00
Adam Treat
4b9a345aee
Update the submodule.
2023-09-13 17:05:46 -04:00
Aaron Miller
6f038c136b
init at most one vulkan device, submodule update
...
fixes issues w/ multiple of the same gpu
2023-09-13 12:49:53 -07:00
Adam Treat
8f99dca70f
Bring the vulkan backend to the GUI.
2023-09-13 11:26:10 -04:00
Aaron Miller
f0735efa7d
vulkan python bindings on windows fixes
2023-09-12 14:16:02 -07:00
Adam Treat
c953b321b7
Don't link against libvulkan.
2023-09-12 14:26:56 -04:00
Aaron Miller
c4d23512e4
remove extra dynamic linker deps when building with vulkan
2023-09-11 08:44:39 -07:00
Adam Treat
85e34598f9
more circleci
2023-08-31 15:29:54 -04:00
Adam Treat
f578fa6cdf
Fix for windows.
2023-08-31 15:29:54 -04:00
Adam Treat
17d3e4976c
Add a comment indicating future work.
2023-08-31 15:29:54 -04:00
Adam Treat
987546c63b
Nomic vulkan backend licensed under the Software for Open Models License (SOM), version 1.0.
2023-08-31 15:29:54 -04:00
Adam Treat
d55cbbee32
Update to newer llama.cpp and disable older forks.
2023-08-31 15:29:54 -04:00
Aaron Miller
0bc2274869
bump llama.cpp version + needed fixes for that
2023-08-31 15:29:54 -04:00
aaron miller
33c22be2aa
starcoder: use ggml_graph_plan
2023-08-31 15:29:54 -04:00
Cosmic Snow
108d950874
Fix Windows unable to load models on older Windows builds
...
- Replace high-level IsProcessorFeaturePresent
- Reintroduce low-level compiler intrinsics implementation
2023-08-09 09:27:43 +02:00
Adam Treat
6d03b3e500
Add starcoder support.
2023-07-27 09:15:16 -04:00
cosmic-snow
2d02c65177
Handle edge cases when generating embeddings ( #1215 )
...
* Handle edge cases when generating embeddings
* Improve Python handling & add llmodel_c.h note
- In the Python bindings fail fast with a ValueError when text is empty
- Advice other bindings authors to do likewise in llmodel_c.h
2023-07-17 13:21:03 -07:00
Aaron Miller
1c4a244291
bump mem allocation a bit
2023-07-14 09:48:57 -04:00
Adam Treat
ee4186d579
Fixup bert python bindings.
2023-07-14 09:48:57 -04:00
cosmic-snow
6200900677
Fix Windows MSVC arch detection ( #1194 )
...
- in llmodel.cpp to fix AVX-only handling
Signed-off-by: cosmic-snow <134004613+cosmic-snow@users.noreply.github.com>
2023-07-13 14:44:17 -04:00
Adam Treat
4963db8f43
Bump the version numbers for both python and c backend.
2023-07-13 14:21:46 -04:00
Adam Treat
0efdbfcffe
Bert
2023-07-13 14:21:46 -04:00
Adam Treat
315a1f2aa2
Move it back as internal class.
2023-07-13 14:21:46 -04:00
Adam Treat
ae8eb297ac
Add sbert backend.
2023-07-13 14:21:46 -04:00
Adam Treat
1f749d7633
Clean up backend code a bit and hide impl. details.
2023-07-13 14:21:46 -04:00
Adam Treat
33557b1f39
Move the implementation out of llmodel class.
2023-07-13 14:21:46 -04:00
Aaron Miller
432b7ebbd7
include windows.h just to be safe
2023-07-12 12:46:46 -04:00
Aaron Miller
95b8fb312e
windows/msvc: use high level processor feature detection API
...
see https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent
2023-07-12 12:46:46 -04:00
Aaron Miller
f0faa23ad5
cmakelists: always export build commands ( #1179 )
...
friendly for using editors with clangd integration that don't also
manage the build themselves
2023-07-12 10:49:24 -04:00
Aaron Miller
4a24b586df
llama.cpp: metal buffer freeing
2023-06-30 21:07:21 -03:00
Aaron Miller
137bc2c367
replit: free metal context
2023-06-30 21:07:21 -03:00
Aaron Miller
57dc0c8953
adjust eval buf sizes to pass long input test
2023-06-30 21:07:21 -03:00
Aaron Miller
7a5f6e4726
limit prompt batch size to 128
2023-06-30 21:07:21 -03:00
Aaron Miller
883775bc5f
move 230511 submodule to nomic fork, fix alibi assert
2023-06-30 21:07:21 -03:00
Andriy Mulyar
46a0762bd5
Python Bindings: Improved unit tests, documentation and unification of API ( #1090 )
...
* Makefiles, black, isort
* Black and isort
* unit tests and generation method
* chat context provider
* context does not reset
* Current state
* Fixup
* Python bindings with unit tests
* GPT4All Python Bindings: chat contexts, tests
* New python bindings and backend fixes
* Black and Isort
* Documentation error
* preserved n_predict for backwords compat with langchain
---------
Co-authored-by: Adam Treat <treat.adam@gmail.com>
2023-06-30 16:02:02 -04:00
Aaron Miller
40a3faeb05
Use ggml scratch bufs for mpt and gptj models ( #1104 )
...
* backend/gptj: use scratch buffers
reduces total memory required and makes eval buf not grow with n_past
* backend/mpt: use scratch bufs
* fix format-related compile warnings
2023-06-30 10:53:45 -07:00
Aaron Miller
8d19ef3909
backend: factor out common elements in model code ( #1089 )
...
* backend: factor out common structs in model code
prepping to hack on these by hopefully making there be fewer places to fix the same bug
rename
* use common buffer wrapper instead of manual malloc
* fix replit compile warnings
2023-06-28 17:35:07 -07:00
Aaron Miller
28d41d4f6d
falcon: use *model-local* eval & scratch bufs ( #1079 )
...
fixes memory leaks copied from ggml/examples based implementation
2023-06-27 16:09:11 -07:00
Zach Nussbaum
2565f6a94a
feat: add conversion script
2023-06-27 14:06:39 -03:00
Aaron Miller
198b5e4832
add Falcon 7B model
...
Tested with https://huggingface.co/TheBloke/falcon-7b-instruct-GGML/blob/main/falcon7b-instruct.ggmlv3.q4_0.bin
2023-06-27 14:06:39 -03:00
Aaron Miller
db34a2f670
llmodel: skip attempting Metal if model+kvcache > 53% of system ram
2023-06-26 19:46:49 -03:00
Aaron Miller
b19a3e5b2c
add requiredMem method to llmodel impls
...
most of these can just shortcut out of the model loading logic llama is a bit worse to deal with because we submodule it so I have to at least parse the hparams, and then I just use the size on disk as an estimate for the mem size (which seems reasonable since we mmap() the llama files anyway)
2023-06-26 18:27:58 -03:00
Adam Treat
a0f80453e5
Use sysinfo in backend.
2023-06-26 14:14:49 -04:00
niansa/tuxifan
47323f8591
Update replit.cpp
...
replit_tokenizer_detokenize returnins std::string now
Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-26 14:49:58 -03:00
niansa
0855c0df1d
Fixed Replit implementation compile warnings
2023-06-26 14:49:58 -03:00
Aaron Miller
1290b32451
update to latest mainline llama.cpp
...
add max_size param to ggml_metal_add_buffer - introduced in https://github.com/ggerganov/llama.cpp/pull/1826
2023-06-26 14:40:52 -03:00
niansa/tuxifan
5eee16c97c
Do not specify "success" as error for unsupported models
...
Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
2023-06-22 09:28:40 +02:00
Adam Treat
bd58c46da0
Initialize these to nullptr to prevent double deletion when a model fails to load.
2023-06-20 18:23:45 -04:00