<a href="https://colab.research.google.com/github/mlabonne/llm-course/blob/main/GPT2_GPTQ_4bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a 4-bit GPT-2 model using AutoGPTQ
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

## Quantize model

In [None]:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers huggingface_hub

In [None]:
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

examples = [
    "In the wake of the Federal Reserve's recent decision, market analysts predict a shift in the stock market dynamics, urging investors to reassess their portfolios.",
    "As quantum computing continues its rapid development, it promises to revolutionize fields such as cryptography and machine learning, posing a significant leap from classical computing.",
    "The recent elections have brought a seismic shift in the political landscape, with the newly elected government pledging to focus on healthcare and education reform.",
    "The Renaissance, a significant period in European history, was marked by a cultural rebirth and dramatic advances in art, science, and philosophical thought.",
    "With the rise of machine learning and AI, Python has emerged as a dominant language in programming due to its simplicity and powerful libraries such as TensorFlow and PyTorch.",
    "Jane Austen's 'Pride and Prejudice' continues to captivate readers with its intricate exploration of societal norms and the complexities of human relationships during the Regency era.",
    "Following an intense season, the Golden State Warriors have emerged as the NBA champions, underscoring their remarkable team play and strategic finesse.",
    "The latest Marvel film, 'Avengers: Infinity Gauntlet', has shattered box office records worldwide, reinforcing the global appeal of superhero narratives.",
    "The increasing instances of wildfires and erratic weather patterns underscore the urgent need to address climate change and implement sustainable environmental practices.",
    "In recent news, a breakthrough in the peace negotiations between the two countries has sparked hope for an end to the decade-long conflict.",
]

# Define base model and output directory
model_id = "gpt2"
out_dir = model_id + "-GPTQ"

# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Determine device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Tokenize examples
examples_ids = [tokenizer(text, truncation=True) for text in examples]

# Quantize
model.quantize(
  examples_ids,
  use_triton=True,
  autotune_warmup_after_quantized=True,
  batch_size=1,
)

# Save model (both .bin and safetensors formats) and tokenizer
model.save_quantized(out_dir, use_safetensors=False)
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)

100%|██████████| 11/11 [03:16<00:00, 17.87s/it]


('gpt2-GPTQ/tokenizer_config.json',
 'gpt2-GPTQ/special_tokens_map.json',
 'gpt2-GPTQ/vocab.json',
 'gpt2-GPTQ/merges.txt',
 'gpt2-GPTQ/added_tokens.json',
 'gpt2-GPTQ/tokenizer.json')
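
To give a feel for what `bits=4` and `group_size=128` mean, here is a minimal sketch of group-wise quantization in plain Python. It uses simple round-to-nearest, which is an assumption made for illustration: GPTQ itself additionally compensates for quantization error layer by layer, and AutoGPTQ's packed storage format differs. The helper names `quantize_group` and `dequantize_group` are hypothetical.

```python
import math

BITS = 4                 # mirrors bits=4 in BaseQuantizeConfig
GROUP_SIZE = 128         # mirrors group_size=128
LEVELS = 2 ** BITS - 1   # 15, so integer codes span 0..15


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus a per-group scale and offset."""
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (w_max - w_min) / LEVELS or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + w_min for c in codes]


# Toy "weights": one group of 128 values (illustrative data, not model weights).
weights = [math.sin(i) for i in range(GROUP_SIZE)]
codes, scale, w_min = quantize_group(weights)
restored = dequantize_group(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max reconstruction error={max_error:.4f}")
```

Each group shares one scale and one offset, so a smaller `group_size` tracks the weights more closely at the cost of storing more metadata.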

In [None]:
# Reload model and tokenizer
model = AutoGPTQForCausalLM.from_quantized(
    out_dir,
    use_triton=True,
    device=device,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)



In [None]:
def generate_text(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(device)

    output = model.to(device).generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_length=50,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id
    )
    output = tokenizer.decode(output[0], skip_special_tokens=True)

    return output

# Generate text
input_text = "I have a dream"
generate_text(input_text)

'I have a dream,,,,,,,,, at,--,,,,,,,,,,,,,,---,,,, ( (,//,,,,---'
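
The `generate` call above combines `do_sample=True` with `top_k=50`. As a rough illustration of what top-k sampling does, the sketch below keeps only the `k` highest logits and samples among them. It is a plain-Python approximation written for this explanation, not the code path `transformers` uses, and `top_k_sample` is a hypothetical helper.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax over the kept logits (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # choices() normalizes the weights itself.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy logits over a 5-token vocabulary (illustrative values only).
logits = [0.1, 2.0, -1.0, 1.5, 0.3]
rng = random.Random(0)
picks = {top_k_sample(logits, k=2, rng=rng) for _ in range(200)}
print(picks)  # only indices 1 and 3 can ever be drawn with k=2
```

With `k=1` this reduces to greedy decoding; larger values of `k` allow more varied text.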

## Save and load model using Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login
from huggingface_hub import HfApi
import locale
# Colab workaround: report UTF-8 as the preferred encoding so `!` shell commands keep working
locale.getpreferredencoding = lambda: "UTF-8"

REPO_ID = "insert your repo/model ID" # example: "mlabonne/gpt2-GPTQ-4bit"

notebook_login()
api = HfApi()
!git config --global credential.helper store

api.upload_folder(
    folder_path=out_dir,
    repo_id=REPO_ID,
    repo_type="model",
)

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

gptq_model-4bit-128g.bin:   0%|          | 0.00/123M [00:00<?, ?B/s]

'https://huggingface.co/mlabonne/gpt2-GPTQ-4bit/tree/main/'
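
As a sanity check on the 123M upload shown above, the file size can be estimated with some back-of-the-envelope arithmetic. The sketch below assumes the standard GPT-2 (124M) dimensions, that only the linear layers inside the transformer blocks are packed to 4 bits, and that embeddings, scales, and zero points are stored as listed in the comments. These are assumptions for illustration, and the exact on-disk layout of AutoGPTQ may differ.

```python
# Standard GPT-2 (124M) dimensions; assumed here, not read from the checkpoint.
N_LAYER, N_EMBD, VOCAB, N_POS = 12, 768, 50257, 1024
BITS, GROUP_SIZE = 4, 128

# Linear weights per block: attn.c_attn, attn.c_proj, mlp.c_fc, mlp.c_proj.
linear_per_block = (
    N_EMBD * 3 * N_EMBD
    + N_EMBD * N_EMBD
    + N_EMBD * 4 * N_EMBD
    + 4 * N_EMBD * N_EMBD
)
linear_params = N_LAYER * linear_per_block
embed_params = VOCAB * N_EMBD + N_POS * N_EMBD

packed_bytes = linear_params * BITS // 8   # 4-bit codes, two per byte
groups = linear_params // GROUP_SIZE
scale_bytes = groups * 2                   # one fp16 scale per group (assumption)
zero_bytes = groups * BITS // 8            # one 4-bit zero point per group (assumption)
embed_bytes = embed_params * 2             # embeddings kept in fp16 (assumption)

# Biases, layer norms, and index tensors are ignored; they are comparatively small.
total_bytes = packed_bytes + scale_bytes + zero_bytes + embed_bytes
print(f"quantized linear weights: {packed_bytes / 1e6:.1f} MB")
print(f"unquantized embeddings:   {embed_bytes / 1e6:.1f} MB")
print(f"estimated total:          {total_bytes / 1e6:.1f} MB")
```

Under these assumptions the estimate lands in the same range as the uploaded file, and it suggests that the unquantized embeddings account for most of the size of a model this small.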

In [None]:
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Reload the quantized checkpoint from the Hub.
# from_quantized reads the packed GPTQ tensors (qweight, qzeros, scales, g_idx);
# from_pretrained would build a regular GPT-2 model and ignore them.
model_id = REPO_ID
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_triton=True,
    device=device,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
input_text = "I have a dream"
generate_text(input_text)

'I have a dream,,,,, and,,,, and,,,,,,,,,,).,,,,,,,,,,,,,,,,,,,,,,,'