New unique base models
Base models offer a completely unique experience and serve as a foundation for model creators to build upon. Base models are what keeps the entire AI community going, yet they often don't get the attention and recognition they deserve. While llama.cpp supports many base models, some of them got almost completely overlooked and have either no GGUF quants at all or only static ones on HuggingFace. We should change this by quantizing the following base models. Most of them have non-instruct and smaller variants, but for this request I focused on the largest instruction-tuned version, as I believe this will best demonstrate the capability of such base models. Further, I excluded all base models smaller than 13B as well as multimodal ones.
- https://huggingface.co/xverse/XVERSE-65B-Chat
- https://huggingface.co/GritLM/GritLM-8x7B
- https://huggingface.co/allenai/OLMo-2-1124-13B-Instruct
- https://huggingface.co/LumiOpen/Poro-34B
- https://huggingface.co/inceptionai/jais-13b-chat (gated, but one button click gives you immediate access)
- https://huggingface.co/Snowflake/snowflake-arctic-instruct
That one is absolutely massive. Q8 will not fit in 500 GiB of RAM unless we use GPU offloading to both RTX 4090 GPUs and turn off everything else, but it should be barely possible to do the imatrix computation without RPC. Due to the sheer size, and because I want to archive the original model anyway, I will download it to a separate HDD tomorrow and then convert it to a source GGUF to spool when you are ready.
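As a back-of-the-envelope check (assuming the ~479B total parameter count mentioned further down, and llama.cpp's Q8_0 layout, which stores each block of 32 weights in 34 bytes, i.e. 8.5 bits per weight):

```python
# rough Q8_0 size estimate for snowflake-arctic-instruct (weights only, no KV cache
# or compute buffers); Q8_0 packs 32 weights into 34 bytes (32x int8 + one fp16 scale)
params = 478.58e9                     # total parameter count reported below
bytes_per_weight = 34 / 32            # 1.0625 bytes = 8.5 bits per weight
size_gib = params * bytes_per_weight / 2**30
print(f"{size_gib:.0f} GiB")          # ~474 GiB, leaving almost nothing of 500 GiB for buffers
```

So the weights alone eat nearly the entire 500 GiB, which is why everything else has to be offloaded or turned off.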
I tried all of these already and they failed, but that was a while ago, so I will queue them again. As for snowflake, you can symlink the gguf into /tmp and /tmp/quant, under "reponame.gguf" as usual, and when I queue a job it will find it. The imatrix one I will have to parametrize manually, of course, but you can prepare for that any time and symlink it.
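In concrete terms, the symlink step amounts to roughly this (the source path is a placeholder; only the destination filename, which has to match the repo name, matters):

```python
# illustrative only: drop a symlink named after the repo into the directories the
# queue scans (/tmp and /tmp/quant, as described above)
import os

src = "/path/to/snowflake-arctic-instruct.gguf"   # wherever the source GGUF actually lives
for d in ("/tmp", "/tmp/quant"):
    os.symlink(src, os.path.join(d, "snowflake-arctic-instruct.gguf"))
```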
yup, and it already starts: olmo is still not supported. will see you tomorrow :)
@mradermacher I copied the snowflake-arctic-instruct source GGUF to /mradermacher/tmp/snowflake-arctic-instruct.gguf
I did not hard link it as I had no SSD storage left myself due to the still ongoing qwen2.5 series performance measurement project, which is using 8 TB of my SSD storage. Speaking of the performance measurement project: my laptop is currently outside so it doesn't overheat, which would otherwise mess with the measurements.
I'm very positively surprised by snowflake-arctic-instruct. It uses 128 experts, and because only 2 of them are active per token it is super fast (16 tokens per second generation speed on CPU) despite having 478.58B parameters. Speaking of the parameter count: I hardcoded shape[-3] to be 128 inside get_total_parameter_count in gguf_writer.py, because there were some 1-dimensional expert layers that otherwise crashed it. I believe this is correct, but in the worst case only the parameter count metadata would be slightly affected.
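For reference, the workaround looks roughly like this (a sketch only; the surrounding code in gguf-py/gguf/gguf_writer.py is paraphrased, not quoted verbatim):

```python
# inside GGUFWriter.get_total_parameter_count() in gguf-py/gguf/gguf_writer.py
# (sketch of the workaround described above; surrounding code paraphrased)
if "_exps." in name:
    # some snowflake-arctic expert tensors are 1-dimensional, so indexing shape[-3]
    # raises an IndexError during conversion
    n_experts = 128                     # workaround: hard-coded (was: shape[-3])
    expert_params += size // n_experts  # per-expert parameter accounting as before
```

Since this only feeds the size label in the metadata, the worst case really is a slightly off parameter count, not a broken model.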
Here are two examples I tried on snowflake-arctic-instruct_Q2_K.gguf, which I computed to see whether the model is supported by llama.cpp, which it surprisingly is:
./llama-cli -m snowflake-arctic-instruct_Q2_K.gguf -p "I believe the meaning of life is" -n 128
I believe the meaning of life is different for different people. For me, it's about finding happiness and fulfillment in my daily activities and relationships. It's about making a positive impact on the world and those around me. What does it mean for you? [end of text]
./llama-cli -m /mradermacher/root/snowflake-arctic-instruct_Q2_K.gguf -p "Proxmox is" -n 128
Proxmox is a powerful open-source virtualization platform that provides comprehensive management for virtual machines, containers, and storage. It allows you to create and manage VMs and containers with ease, and it offers a wide range of features such as live migration, high availability, and distributed storage.
I did some more testing with snowflake-arctic-instruct and there is even better news: the model seems to be fully uncensored! This might be the first fully uncensored high-quality base model we have had so far. It is not finetuned to be uncensored; they simply did not do any "safety" alignment. Now I know why they omitted all this AI safety garbage from their model card and blog posts. Having a fully uncensored base model is massive. It is so cool to see a model larger than Llama 3.1 405B generating over 16 tokens/second on a CPU. I think I underestimated the power of MoE models.
llama.cpp support for another base model just dropped 2 hours ago! Make sure to update to the latest llama.cpp before trying it.
https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0
https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0
https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0
https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0
https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0
minerva: As usual, "support" doesn't mean it works, or that anybody has actually tested it with the actual model. All non-7b variants have converted, but the 7b, the only one for which support is actually claimed, of course hasn't:
WARNING:hf-to-gguf:** chkhsh: 68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373
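For context, that warning comes from get_vocab_base_pre() in convert_hf_to_gguf.py: the tokenizer hash is not in the list of known pre-tokenizers, so the conversion refuses to guess. The usual fix is to register the hash there (and regenerate it via convert_hf_to_gguf_update.py); roughly like this, where the res name is an assumption on my part, not upstream code:

```python
# in convert_hf_to_gguf.py, get_vocab_base_pre() -- sketch of the usual fix for an
# unrecognized pre-tokenizer hash; "minerva-7b" is a placeholder name, not upstream code
if chkhsh == "68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373":
    # ref: https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0
    res = "minerva-7b"
```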