Experimental quantization.

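Working inference code. Regular inference with AutoGPTQ fails unless token_type_ids is kept out of the generate call (for example by calling the tokenizer with return_token_type_ids=False). I did not get the model to work with text-generation-webui.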

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Placeholder: set this to the directory (or Hub repo ID) of the quantized model.
quantized_model_dir = "path/to/quantized-model"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

# Pass only input_ids, so token_type_ids never reaches generate().
input_ids = tokenizer("Question: What is the purpose of life?\n\nAnswer:", return_tensors="pt").input_ids.to("cuda:0")
out = model.generate(input_ids=input_ids, max_length=300)
print(tokenizer.decode(out[0]))

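Alternatively, pass the full tokenizer output to generate and tell the tokenizer not to return token_type_ids:

print(tokenizer.decode(model.generate(**tokenizer("test is", return_tensors="pt", return_token_type_ids=False).to("cuda:0"))[0]))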
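If the tokenizer call cannot easily be changed (for instance when it happens inside other code), the same effect can probably be had by dropping the token_type_ids entry before unpacking the inputs into generate. The sketch below is only an illustration: the helper name drop_token_type_ids is hypothetical and not part of AutoGPTQ or transformers, and it assumes the failure comes from generate receiving a token_type_ids argument the model does not accept.

```python
from typing import Any, Dict, Mapping


def drop_token_type_ids(encoding: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a plain-dict copy of a tokenizer output without 'token_type_ids'.

    Hypothetical helper (not a library function). It accepts any mapping,
    so it works on an ordinary dict as well as on a tokenizer's output.
    """
    return {key: value for key, value in encoding.items() if key != "token_type_ids"}


# Stand-in for a tokenizer output; real values would be tensors.
example = {
    "input_ids": [[1, 2, 3]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}
cleaned = drop_token_type_ids(example)
print(sorted(cleaned))
```

With a real tokenizer output, move it to the GPU first and then filter it, e.g. model.generate(**drop_token_type_ids(inputs.to("cuda:0"))), since the returned plain dict no longer has a .to() method.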
