Implement llama-cpp with CUBLAS

Also switch to the devel image for deployment; it will be more flexible for variant builds.
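For reference, any of the variant stages in the Dockerfile below can be built directly with `docker build --target`; a minimal sketch (the image tag is illustrative, not part of this commit):

```
# Build the new llama-cublas variant; --target selects the Dockerfile stage
# (default, triton, cuda, or llama-cublas).
docker build --target llama-cublas -t text-generation-webui-docker:llama-cublas .
```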
commit dd8dab6fb9 (parent bd884491e4) by Atinoda on `pull/2/head`, 1 year ago

@@ -12,7 +12,6 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools && \
    pip3 install torch torchvision torchaudio
FROM env_base AS app_base
### DEVELOPERS/ADVANCED USERS ###
# Clone oobabooga/text-generation-webui
@@ -36,7 +35,7 @@ ARG TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5;8.0;8.6+PTX"
RUN cd /app/repositories/GPTQ-for-LLaMa/ && python3 setup_cuda.py install
-FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04 AS base
+FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
python3-venv git
@@ -61,7 +60,6 @@ RUN chmod +x /scripts/docker-entrypoint.sh
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]
# VARIANT BUILDS
FROM base AS cuda
RUN echo "CUDA" >> /variant.txt
@@ -73,7 +71,6 @@ RUN pip3 uninstall -y quant-cuda && \
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]
FROM base AS triton
RUN echo "TRITON" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
@@ -84,6 +81,14 @@ RUN pip3 uninstall -y quant-cuda && \
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]
+FROM base AS llama-cublas
+RUN echo "LLAMA-CUBLAS" >> /variant.txt
+RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
+ENV LLAMA_CUBLAS=1
+RUN pip uninstall -y llama-cpp-python && \
+    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
+ENV EXTRA_LAUNCH_ARGS=""
+CMD ["python3", "/app/server.py"]
FROM base AS default
RUN echo "DEFAULT" >> /variant.txt

@@ -1,10 +1,10 @@
# Introduction
-This project dockerises the deployment of [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) and its variants. It provides a default configuration (corresponding to a vanilla deployment of the application) as well as pre-configured support for other set-ups (e.g., the more recent `triton` and `cuda` branches of GPTQ).
+This project dockerises the deployment of [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) and its variants. It provides a default configuration (corresponding to a vanilla deployment of the application) as well as pre-configured support for other set-ups (e.g., the latest `llama-cpp-python` with GPU offloading, and the more recent `triton` and `cuda` branches of GPTQ).
*The goal of this project is to be to [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) what [AbdBarho/stable-diffusion-webui-docker](https://github.com/AbdBarho/stable-diffusion-webui-docker) is to [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui).*
# Usage
-*This project currently supports Linux as the deployment platform.*
+*This project currently supports Linux as the deployment platform. It will probably also work under WSL2.*
## Pre-Requisites
- docker
@@ -22,8 +22,8 @@ Choose the desired variant by setting the build `target` in `docker-compose.yml`
| `default` | Minimal implementation of the default deployment from source. |
| `triton` | Updated GPTQ using the latest `triton` branch from `qwopqwop200/GPTQ-for-LLaMa`. Suitable for Linux only. |
| `cuda` | Updated GPTQ using the latest `cuda` branch from `qwopqwop200/GPTQ-for-LLaMa`. |
-*See: [oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md) for more information on variants.*
+| `llama-cublas` | CUDA GPU offloading enabled for llama-cpp. Enable it by setting the option `n-gpu-layers` > 0. |
+*See: [oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md) and [oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md) for more information on variants.*
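For illustration, the new variant could be launched with GPU offloading as follows (this assumes the entrypoint script appends `EXTRA_LAUNCH_ARGS` to the `server.py` launch, and that the webui listens on its default port 7860; the image tag is illustrative):

```
# Run the llama-cublas variant with GPU access, offloading 35 layers to the GPU
docker run --rm --gpus all -p 7860:7860 \
  -e EXTRA_LAUNCH_ARGS="--n-gpu-layers 35" \
  text-generation-webui-docker:llama-cublas
```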
### Build
Build the image:
