Implement `monkey-patch` variant

Branch: pull/2/head
Author: Atinoda, 1 year ago
Parent: 1cfe526bb9
Commit: 14a3bd3138

```diff
@@ -90,6 +90,16 @@ RUN pip uninstall -y llama-cpp-python && \
 ENV EXTRA_LAUNCH_ARGS=""
 CMD ["python3", "/app/server.py"]
 
+FROM base AS monkey-patch
+RUN echo "4-BIT MONKEY-PATCH" >> /variant.txt
+RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
+RUN git clone https://github.com/johnsmith0031/alpaca_lora_4bit /app/repositories/alpaca_lora_4bit && \
+    cd /app/repositories/alpaca_lora_4bit && git checkout 2f704b93c961bf202937b10aac9322b092afdce0
+ARG TORCH_CUDA_ARCH_LIST="8.6"
+RUN pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
+ENV EXTRA_LAUNCH_ARGS=""
+CMD ["python3", "/app/server.py", "--monkey-patch"]
+
 FROM base AS default
 RUN echo "DEFAULT" >> /variant.txt
 ENV EXTRA_LAUNCH_ARGS=""
```
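The new stage pins `TORCH_CUDA_ARCH_LIST` to `8.6` (Ampere compute capability, e.g. RTX 30-series cards). Because it is declared as a build `ARG`, it can be overridden at build time without editing the Dockerfile. A minimal sketch of doing so from `docker-compose.yml`, assuming a hypothetical service name `text-generation-webui` and build context `.` (neither is taken from this commit):

```yaml
# Sketch only: overriding the pinned CUDA arch when building the monkey-patch stage.
# The service name and context path are assumptions, not the repo's actual compose file.
services:
  text-generation-webui:
    build:
      context: .
      target: monkey-patch             # select the new 4-bit monkey-patch stage
      args:
        TORCH_CUDA_ARCH_LIST: "7.5"    # e.g. Turing (RTX 20-series) instead of the default 8.6
```

Owners of pre-Ampere cards would substitute the compute capability matching their GPU.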

```diff
@@ -22,7 +22,9 @@ Choose the desired variant by setting the build `target` in `docker-compose.yml`
 | `default` | Minimal implementation of the default deployment from source. |
 | `triton` | Updated GPTQ using the latest `triton` branch from `qwopqwop200/GPTQ-for-LLaMa`. Suitable for Linux only. |
 | `cuda` | Updated GPTQ using the latest `cuda` branch from `qwopqwop200/GPTQ-for-LLaMa`. |
+| `monkey-patch` | Use LoRAs in 4-bit GPTQ mode. |
 | `llama-cublas` | CUDA GPU offloading enabled for llama-cpp. Use by setting option `n-gpu-layers` > 0. |
 
 *See: [oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md) and [oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md) for more information on variants.*
```
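As a minimal sketch of the selection mechanism the README describes, assuming a hypothetical service name and port mapping, and assuming the image's launch setup consumes `EXTRA_LAUNCH_ARGS` (the exec-form `CMD` above does not expand it by itself):

```yaml
# Sketch only: choosing the monkey-patch variant via the compose build target.
# Service name and port are assumptions; how EXTRA_LAUNCH_ARGS reaches server.py
# depends on launch plumbing that is not shown in this commit.
services:
  text-generation-webui:
    build:
      context: .
      target: monkey-patch             # any variant key from the table above
    environment:
      EXTRA_LAUNCH_ARGS: "--listen"    # hypothetical extra server.py flags
    ports:
      - "7860:7860"
```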
### Build
