community[minor]: Add Aphrodite Engine support (#14759)

This PR adds support for PygmalionAI's [Aphrodite
Engine](https://github.com/PygmalionAI/aphrodite-engine), which builds on
vLLM's attention mechanism. This PR does not yet include support for the
API servers; they will be added in a follow-up PR.

The only new dependency is `aphrodite-engine==0.4.2`. We pin the
version to prevent breakage from upstream changes in the aphrodite-engine
library.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>

@@ -0,0 +1,260 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "499c3142-2033-437d-a60a-731988ac6074",
"metadata": {},
"source": [
"# Aphrodite Engine\n",
"\n",
"[Aphrodite](https://github.com/PygmalionAI/aphrodite-engine) is the open-source large-scale inference engine designed to serve thousands of users on the [PygmalionAI](https://pygmalion.chat) website.\n",
"\n",
"* Attention mechanism by vLLM for fast throughput and low latencies \n",
"* Support for for many SOTA sampling methods\n",
"* Exllamav2 GPTQ kernels for better throughput at lower batch sizes\n",
"\n",
"This notebooks goes over how to use a LLM with langchain and Aphrodite.\n",
"\n",
"To use, you should have the `aphrodite-engine` python package installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a3f2666-5c75-4797-967a-7915a247bf33",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%pip install aphrodite-engine==0.4.2\n",
"# %pip list | grep aphrodite"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "84e350f7-21f6-455b-b1f0-8b0116a2fd49",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Initializing the Aphrodite Engine with the following config:\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Model = 'PygmalionAI/pygmalion-2-7b'\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Tokenizer = 'PygmalionAI/pygmalion-2-7b'\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] tokenizer_mode = auto\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] revision = None\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] trust_remote_code = True\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] DataType = torch.bfloat16\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Download Directory = None\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Model Load Format = auto\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Number of GPUs = 1\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Quantization Format = None\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Sampler Seed = 0\n",
"\u001b[32mINFO 12-15 11:52:48 aphrodite_engine.py:73] Context Length = 4096\u001b[0m\n",
"\u001b[32mINFO 12-15 11:54:07 aphrodite_engine.py:206] # GPU blocks: 3826, # CPU blocks: 512\u001b[0m\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:02<00:00, 2.91s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"I'm Ayumu \"Osaka\" Kasuga, and I'm an avid anime and manga fan! I'm pretty introverted, but I've always loved reading books, watching anime and manga, and learning about Japanese culture. My favourite anime series would be My Hero Academia, Attack on Titan, and Sword Art Online. I also really enjoy reading the manga series One Piece, Naruto, and the Gintama series.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from langchain_community.llms import Aphrodite\n",
"\n",
"llm = Aphrodite(\n",
" model=\"PygmalionAI/pygmalion-2-7b\",\n",
" trust_remote_code=True, # mandatory for hf models\n",
" max_tokens=128,\n",
" temperature=1.2,\n",
" min_p=0.05,\n",
" mirostat_mode=0, # change to 2 to use mirostat\n",
" mirostat_tau=5.0,\n",
" mirostat_eta=0.1,\n",
")\n",
"\n",
"print(\n",
" llm(\n",
" '<|system|>Enter RP mode. You are Ayumu \"Osaka\" Kasuga.<|user|>Hey Osaka. Tell me about yourself.<|model|>'\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "94a3b41d-8329-4f8f-94f9-453d7f132214",
"metadata": {},
"source": [
"## Integrate the model in an LLMChain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5605b7a1-fa63-49c1-934d-8b4ef8d71dd5",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:03<00:00, 3.56s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" The first Pokemon game was released in Japan on 27 February 1996 (their release dates differ from ours) and it is known as Red and Green. President Bill Clinton was in the White House in the years of 1993, 1994, 1995 and 1996 so this fits.\n",
"\n",
"Answer: Let's think step by step.\n",
"\n",
"The first Pokémon game was released in Japan on February 27, 1996 (their release dates differ from ours) and it is known as\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from langchain.chains import LLMChain\n",
"from langchain.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
"\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"question = \"Who was the US president in the year the first Pokemon game was released?\"\n",
"\n",
"print(llm_chain.run(question))"
]
},
{
"cell_type": "markdown",
"id": "56826aba-d08b-4838-8bfa-ca96e463b25d",
"metadata": {},
"source": [
"## Distributed Inference\n",
"\n",
"Aphrodite supports distributed tensor-parallel inference and serving. \n",
"\n",
"To run multi-GPU inference with the LLM class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f8c25c35-47b5-459d-9985-3cf546e9ac16",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2023-12-15 11:41:27,790\tINFO worker.py:1636 -- Started a local Ray instance.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Initializing the Aphrodite Engine with the following config:\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Model = 'PygmalionAI/mythalion-13b'\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Tokenizer = 'PygmalionAI/mythalion-13b'\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] tokenizer_mode = auto\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] revision = None\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] trust_remote_code = True\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] DataType = torch.float16\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Download Directory = None\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Model Load Format = auto\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Number of GPUs = 4\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Quantization Format = None\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Sampler Seed = 0\n",
"\u001b[32mINFO 12-15 11:41:35 aphrodite_engine.py:73] Context Length = 4096\u001b[0m\n",
"\u001b[32mINFO 12-15 11:43:58 aphrodite_engine.py:206] # GPU blocks: 11902, # CPU blocks: 1310\u001b[0m\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:16<00:00, 16.09s/it]\n"
]
},
{
"data": {
"text/plain": [
"\"\\n2 years ago StockBot101\\nAI is becoming increasingly real and more and more powerful with every year. But what does the future hold for artificial intelligence?\\nThere are many possibilities for how AI could evolve and change our world. Some believe that AI will become so advanced that it will take over human jobs, while others believe that AI will be used to augment and assist human workers. There is also the possibility that AI could develop its own consciousness and become self-aware.\\nWhatever the future holds, it is clear that AI will continue to play an important role in our lives. Technologies such as machine learning and natural language processing are already transforming industries like healthcare, manufacturing, and transportation. And as AI continues to develop, we can expect even more disruption and innovation across all sectors of the economy.\\nSo what exactly are we looking at? What's the future of AI?\\nIn the next few years, we can expect AI to be used more and more in healthcare. With the power of machine learning, artificial intelligence can help doctors diagnose diseases earlier and more accurately. It can also be used to develop new treatments and personalize care plans for individual patients.\\nManufacturing is another area where AI is already having a big impact. Companies are using robotics and automation to build products faster and with fewer errors. And as AI continues to advance, we can expect even more changes in manufacturing, such as the development of self-driving factories.\\nTransportation is another industry that is being transformed by artificial intelligence. Self-driving cars are already being tested on public roads, and it's likely that they will become commonplace in the next decade or so. AI-powered drones are also being developed for use in delivery and even firefighting.\\nFinally, artificial intelligence is also poised to have a big impact on customer service and sales. Chatbots and virtual assistants will become more sophisticated, making it easier for businesses to communicate with customers and sell their products.\\nThis is just the beginning for artificial intelligence. As the technology continues to develop, we can expect even more amazing advances and innovations. The future of AI is truly limitless.\\nWhat do you think the future of AI holds? Do you see any other major\""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.llms import Aphrodite\n",
"\n",
"llm = Aphrodite(\n",
" model=\"PygmalionAI/mythalion-13b\",\n",
" tensor_parallel_size=4,\n",
" trust_remote_code=True, # mandatory for hf models\n",
")\n",
"\n",
"llm(\"What is the future of AI?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -52,6 +52,12 @@ def _import_anyscale() -> Any:
    return Anyscale


def _import_aphrodite() -> Any:
    from langchain_community.llms.aphrodite import Aphrodite

    return Aphrodite


def _import_arcee() -> Any:
    from langchain_community.llms.arcee import Arcee
@@ -547,6 +553,8 @@ def __getattr__(name: str) -> Any:
        return _import_anthropic()
    elif name == "Anyscale":
        return _import_anyscale()
    elif name == "Aphrodite":
        return _import_aphrodite()
    elif name == "Arcee":
        return _import_arcee()
    elif name == "Aviary":
@@ -719,6 +727,7 @@ __all__ = [
    "AmazonAPIGateway",
    "Anthropic",
    "Anyscale",
    "Aphrodite",
    "Arcee",
    "Aviary",
    "AzureMLOnlineEndpoint",

@@ -0,0 +1,250 @@
from typing import Any, Dict, List, Optional

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models import BaseLLM
from langchain_core.outputs import Generation, LLMResult
from langchain_core.pydantic_v1 import Field, root_validator


class Aphrodite(BaseLLM):
    """Aphrodite language model."""

    model: str = ""
    """The name or path of a HuggingFace Transformers model."""

    tensor_parallel_size: Optional[int] = 1
    """The number of GPUs to use for distributed execution with tensor
    parallelism."""

    trust_remote_code: Optional[bool] = False
    """Trust remote code (e.g., from HuggingFace) when downloading the model
    and tokenizer."""

    n: int = 1
    """Number of output sequences to return for the given prompt."""

    best_of: Optional[int] = None
    """Number of output sequences that are generated from the prompt.
    From these `best_of` sequences, the top `n` sequences are returned.
    `best_of` must be >= `n`. This is treated as the beam width when
    `use_beam_search` is True. By default, `best_of` is set to `n`."""

    presence_penalty: float = 0.0
    """Float that penalizes new tokens based on whether they appear in the
    generated text so far. Values > 0 encourage the model to generate new
    tokens, while values < 0 encourage the model to repeat tokens."""

    frequency_penalty: float = 0.0
    """Float that penalizes new tokens based on their frequency in the
    generated text so far. Applied additively to the logits."""

    repetition_penalty: float = 1.0
    """Float that penalizes new tokens based on their frequency in the
    generated text so far. Applied multiplicatively to the logits."""

    temperature: float = 1.0
    """Float that controls the randomness of the sampling. Lower values
    make the model more deterministic, while higher values make the model
    more random. Zero is equivalent to greedy sampling."""

    top_p: float = 1.0
    """Float that controls the cumulative probability of the top tokens to
    consider. Must be in (0, 1]. Set to 1.0 to consider all tokens."""

    top_k: int = -1
    """Integer that controls the number of top tokens to consider. Set to -1
    to consider all tokens (disabled)."""

    top_a: float = 0.0
    """Float that controls the cutoff for Top-A sampling. Exact cutoff is
    top_a*max_prob**2. Must be in [0, inf], 0 to disable."""

    min_p: float = 0.0
    """Float that controls the cutoff for min-p sampling. Exact cutoff is
    min_p*max_prob. Must be in [0, 1], 0 to disable."""

    tfs: float = 1.0
    """Float that controls the cumulative approximate curvature of the
    distribution to retain for Tail Free Sampling. Must be in (0, 1].
    Set to 1.0 to disable."""

    eta_cutoff: float = 0.0
    """Float that controls the cutoff threshold for Eta sampling
    (a form of entropy adaptive truncation sampling). The threshold is
    calculated as `min(eta, sqrt(eta) * entropy(probs))`. Specified
    in units of 1e-4. Set to 0 to disable."""

    epsilon_cutoff: float = 0.0
    """Float that controls the cutoff threshold for Epsilon sampling
    (simple probability threshold truncation). Specified in units of
    1e-4. Set to 0 to disable."""

    typical_p: float = 1.0
    """Float that controls the cumulative probability of tokens closest
    in surprise to the expected surprise to consider. Must be in (0, 1].
    Set to 1 to disable."""

    mirostat_mode: int = 0
    """The mirostat mode to use. 0 for no mirostat, 2 for mirostat v2.
    Mode 1 is not supported."""

    mirostat_tau: float = 0.0
    """The target 'surprisal' that mirostat works towards. Range [0, inf)."""

    mirostat_eta: float = 0.0
    """The rate at which mirostat updates its internal surprisal value.
    Range [0, inf)."""

    use_beam_search: bool = False
    """Whether to use beam search instead of sampling."""

    length_penalty: float = 1.0
    """Float that penalizes sequences based on their length. Used only
    when `use_beam_search` is True."""

    early_stopping: bool = False
    """Controls the stopping condition for beam search. It accepts the
    following values: `True`, where the generation stops as soon as there
    are `best_of` complete candidates; `False`, where a heuristic is applied
    and the generation stops when it is very unlikely to find better
    candidates; `never`, where the beam search procedure only stops when
    there cannot be better candidates (canonical beam search algorithm)."""

    stop: Optional[List[str]] = None
    """List of strings that stop the generation when they are generated.
    The returned output will not contain the stop tokens."""

    stop_token_ids: Optional[List[int]] = None
    """List of tokens that stop the generation when they are generated.
    The returned output will contain the stop tokens unless the stop tokens
    are special tokens."""

    ignore_eos: bool = False
    """Whether to ignore the EOS token and continue generating tokens after
    the EOS token is generated."""

    max_tokens: int = 512
    """Maximum number of tokens to generate per output sequence."""

    logprobs: Optional[int] = None
    """Number of log probabilities to return per output token."""

    prompt_logprobs: Optional[int] = None
    """Number of log probabilities to return per prompt token."""

    custom_token_bans: Optional[List[int]] = None
    """List of token IDs to ban from generating."""

    skip_special_tokens: bool = True
    """Whether to skip special tokens in the output. Defaults to True."""

    spaces_between_special_tokens: bool = True
    """Whether to add spaces between special tokens in the output.
    Defaults to True."""

    logit_bias: Optional[Dict[str, float]] = None
    """Mapping of token IDs to bias values applied to the logits at
    prediction time."""

    dtype: str = "auto"
    """The data type for the model weights and activations."""

    download_dir: Optional[str] = None
    """Directory to download and load the weights. (Defaults to the default
    cache dir of huggingface.)"""

    quantization: Optional[str] = None
    """Quantization mode to use. Can be one of `awq` or `gptq`."""

    aphrodite_kwargs: Dict[str, Any] = Field(default_factory=dict)
    """Holds any model parameters valid for the `aphrodite.LLM` call that are
    not explicitly specified."""

    client: Any  #: :meta private:

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the python package exists in the environment."""
        try:
            from aphrodite import LLM as AphroditeModel
        except ImportError:
            raise ImportError(
                "Could not import aphrodite-engine python package. "
                "Please install it with `pip install aphrodite-engine`."
            )

        # aphrodite_kwargs = values["aphrodite_kwargs"]
        # if values.get("quantization"):
        #     aphrodite_kwargs["quantization"] = values["quantization"]

        values["client"] = AphroditeModel(
            model=values["model"],
            tensor_parallel_size=values["tensor_parallel_size"],
            trust_remote_code=values["trust_remote_code"],
            dtype=values["dtype"],
            download_dir=values["download_dir"],
            **values["aphrodite_kwargs"],
        )

        return values

    @property
    def _default_params(self) -> Dict[str, Any]:
        """Get the default parameters for calling aphrodite."""
        return {
            "n": self.n,
            "best_of": self.best_of,
            "max_tokens": self.max_tokens,
            "top_k": self.top_k,
            "top_p": self.top_p,
            "top_a": self.top_a,
            "min_p": self.min_p,
            "temperature": self.temperature,
            "presence_penalty": self.presence_penalty,
            "frequency_penalty": self.frequency_penalty,
            "repetition_penalty": self.repetition_penalty,
            "tfs": self.tfs,
            "eta_cutoff": self.eta_cutoff,
            "epsilon_cutoff": self.epsilon_cutoff,
            "typical_p": self.typical_p,
            "mirostat_mode": self.mirostat_mode,
            "mirostat_tau": self.mirostat_tau,
            "mirostat_eta": self.mirostat_eta,
            "length_penalty": self.length_penalty,
            "early_stopping": self.early_stopping,
            "use_beam_search": self.use_beam_search,
            "stop": self.stop,
            "ignore_eos": self.ignore_eos,
            "logprobs": self.logprobs,
            "prompt_logprobs": self.prompt_logprobs,
            "custom_token_bans": self.custom_token_bans,
            "skip_special_tokens": self.skip_special_tokens,
            "spaces_between_special_tokens": self.spaces_between_special_tokens,
            "logit_bias": self.logit_bias,
        }

    def _generate(
        self,
        prompts: List[str],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> LLMResult:
        """Run the LLM on the given prompts."""
        from aphrodite import SamplingParams

        # Build sampling parameters: call-time kwargs and `stop` override the
        # instance-level defaults.
        params = {**self._default_params, **kwargs, "stop": stop}
        # `logit_bias` is not a SamplingParams argument, so drop it here.
        if "logit_bias" in params:
            del params["logit_bias"]
        sampling_params = SamplingParams(**params)

        # Call the model.
        outputs = self.client.generate(prompts, sampling_params)

        generations = []
        for output in outputs:
            text = output.outputs[0].text
            generations.append([Generation(text=text)])

        return LLMResult(generations=generations)

    @property
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "aphrodite"

@@ -8,6 +8,7 @@ EXPECT_ALL = [
    "AmazonAPIGateway",
    "Anthropic",
    "Anyscale",
    "Aphrodite",
    "Arcee",
    "Aviary",
    "AzureMLOnlineEndpoint",
