Thank you for this nice model. Could you make a q8 gguf, please?
...
You can use the shared sample Colab notebooks to convert the models to GGUF. Unsloth uses llama.cpp for the conversion, and the code below does it.
For whichever quantization you want, change the corresponding False to True.
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
On the free version of Colab (T4 GPU), building the GGUF file takes about 20 minutes.
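If you would rather not flip the False/True flags by hand, the same calls can be wrapped in a small helper. This is only a sketch: `export_gguf` and `DryRunModel` are hypothetical names, not part of Unsloth, and the helper assumes `model` and `tokenizer` come from the earlier notebook cells. Leaving `quantization_method` unset relies on the default, which per the snippet above gives Q8_0.

```python
def export_gguf(model, tokenizer, out_dir, quantization_method=None,
                hub_repo=None, token=""):
    """Save a GGUF locally and, if hub_repo is given, push it to the Hub.

    Hypothetical helper: it only forwards to the two notebook calls,
    save_pretrained_gguf and push_to_hub_gguf.
    """
    # Omit the keyword entirely when unset, so the library default applies.
    kwargs = {}
    if quantization_method is not None:
        kwargs["quantization_method"] = quantization_method

    model.save_pretrained_gguf(out_dir, tokenizer, **kwargs)
    if hub_repo is not None:
        model.push_to_hub_gguf(hub_repo, tokenizer, token=token, **kwargs)


class DryRunModel:
    """Stand-in that records calls, so the helper can be tried without a GPU."""

    def __init__(self):
        self.calls = []

    def save_pretrained_gguf(self, path, tokenizer, **kwargs):
        self.calls.append(("save", path, kwargs))

    def push_to_hub_gguf(self, repo, tokenizer, token="", **kwargs):
        self.calls.append(("push", repo, kwargs))


# Dry run: default (Q8_0) export, then a q4_k_m export that is also pushed.
dry = DryRunModel()
export_gguf(dry, tokenizer=None, out_dir="model")
export_gguf(dry, tokenizer=None, out_dir="model",
            quantization_method="q4_k_m", hub_repo="hf/model")
print(dry.calls)
```

In the notebook you would pass the real `model` and `tokenizer` instead of `DryRunModel()`, and your own repo name and token instead of the placeholders.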
Thanks for helping out as always ewre! ❤️
You can also try our Kaggle notebooks, which provide 30 free hours per week: https://www.kaggle.com/code/danielhanchen/kaggle-llama-3-8b-unsloth-notebook
@NikolayKozloff Here it is, in case you or anyone else is still looking for it: https://huggingface.co/akumaburn/llama-3-8b-bnb-4bit-GGUF
Thanks. Your GGUF made it possible to merge it with a LoRA, which resulted in what is probably the first Albanian LLM with acceptable chat quality: https://huggingface.co/NikolayKozloff/bleta-8B-v0.5-Albanian-shqip-GGUF
Great job, thanks. How do I fine-tune with my custom data?
"I'm encountering the following problem:
When I fine-tune an LLM using one of your Colab codes, I get a model that gives good answers in the editors.
But when I save it in GGUF format with llama.cpp, push it to my Hugging Face repo, and then download and use it in LMStudio, the model fails to answer any questions: it bugs out, doesn't work at all, and freezes.
Note that the output is a 16GB file for a Llama3 7B, while the GGUF models in LMStudio are 5GB to 7GB.
Here's the part of the code that saves:
[
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
#if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("Llama3_7B_finetuned_lora_f16", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("Llama3_7B_finetuned_lora_f16", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("Llama3_7B_finetuned_lora_q4_k_m", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("Llama3_7B_finetuned_lora_q4_k_m", tokenizer, quantization_method = "q4_k_m", token = "")
]
Please tell me how to save with a reasonable file size that can work correctly locally.
Thank you."
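A rough size check may help with the 16GB figure. The sketch below estimates GGUF file size from parameter count and bits per weight. The numbers are assumptions, not measurements: about 8.03 billion parameters for the Llama 3 8B model, 16 bits per weight for f16, 8.5 for q8_0 (32 int8 weights plus one 16-bit scale per block), and roughly 4.85 on average for q4_k_m. `estimate_gguf_size_gb` is a hypothetical helper and ignores metadata overhead.

```python
# Approximate bits per weight for a few GGUF types (assumed figures).
BITS_PER_WEIGHT = {
    "f16": 16.0,     # 2 bytes per weight
    "q8_0": 8.5,     # 32 int8 weights plus one 16-bit scale per block
    "q4_k_m": 4.85,  # rough average; the format mixes several block types
}

N_PARAMS = 8.03e9  # assumed parameter count for Llama 3 8B


def estimate_gguf_size_gb(n_params, quantization_method):
    """Return a rough file size in GB (10^9 bytes), ignoring metadata."""
    bits = BITS_PER_WEIGHT[quantization_method]
    return n_params * bits / 8 / 1e9


for method in BITS_PER_WEIGHT:
    size = estimate_gguf_size_gb(N_PARAMS, method)
    print(f"{method}: ~{size:.1f} GB")
```

Under these assumptions, f16 comes out at about 16 GB, which matches the file described above, while q4_k_m lands near 5 GB, in the range of the models listed in LMStudio. So a 16GB file suggests the f16 export was the one that got saved or downloaded.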