imatrix.dat missing output.weight and token_embd.weight
When making my own quants using your imatrix.dat (thank you so much for providing these; for a model like this it would be far too compute-intensive for me to do myself, and even on smaller models your dataset has, in my experience, produced better quants than the open-source datasets I could find), the output contained:
====== llama_model_quantize_internal: did not find weights for output.weight
[...]
====== llama_model_quantize_internal: did not find weights for token_embd.weight
Looking at the source code shows that happens when it is missing from the imatrix.
I'm just curious if this happened with your imatrix quants, and if so anything you know about why this model has this issue. I've used your imatrix.dat for other models and I'm fairly certain I didn't have this happen.
This can happen when there is not enough measurement coverage for a tensor, or when llama.cpp decides not to measure it, which I think is the case here, i.e. it is almost certainly normal - these tensors will likely be quantized differently (i.e. with more bits).
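Roughly speaking, the logic is along these lines (a simplified sketch, not llama.cpp's actual code; the function names and the specific fallback types here are just illustrative):

```cpp
// Hypothetical sketch of the behaviour described above: look up per-tensor
// importance data and fall back gracefully when a tensor is absent from the
// imatrix. Not llama.cpp's real implementation.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// imatrix: tensor name -> per-column importance weights
using imatrix_t = std::map<std::string, std::vector<float>>;

const float * get_imatrix_weights(const imatrix_t & imatrix, const std::string & tensor_name) {
    auto it = imatrix.find(tensor_name);
    if (it == imatrix.end()) {
        // No measurements for this tensor (e.g. output.weight, token_embd.weight):
        // warn and quantize without importance weighting.
        fprintf(stderr, "====== did not find weights for %s\n", tensor_name.c_str());
        return nullptr;
    }
    return it->second.data();
}

// Hypothetical type selection: sensitive tensors that lack imatrix data are
// usually given more bits than the bulk of the model.
std::string pick_quant_type(const std::string & tensor_name, bool has_imatrix) {
    const bool sensitive = tensor_name == "output.weight" || tensor_name == "token_embd.weight";
    if (sensitive && !has_imatrix) {
        return "Q6_K";   // fall back to a higher-bit type
    }
    return "Q4_K";       // default type for this (hypothetical) mix
}
```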
it is almost certainly normal
Thanks for the confirmation. It takes me 4 hours just to make a quant, so I don't know how many recipes I will try. I know Unsloth's recipes use non-standard types for down_proj, embed, and lm_head in their Q2 mixes, but that's most likely to deal with the fact that Q2 is too small. My current quant is ~4.5 bpw, and I'm thinking of either keeping the same size with better accuracy, or seeing if I can go smaller with negligible accuracy loss, which would boost performance.
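For illustration, a recipe like that boils down to a per-tensor type mapping along these lines (a hypothetical sketch; the patterns and types are made up for the example, not Unsloth's actual mix):

```cpp
// Hypothetical per-tensor quant "recipe": map tensor-name patterns to quant
// types so that down_proj, token embeddings, and the output head get more
// bits than the rest of the layers.
#include <string>
#include <vector>

struct recipe_rule {
    std::string pattern;   // substring matched against the tensor name
    std::string type;      // quant type to use for matching tensors
};

// Illustrative mix only:
static const std::vector<recipe_rule> recipe = {
    { "token_embd.weight", "Q4_K" },  // embeddings kept at higher precision
    { "output.weight",     "Q6_K" },  // lm_head kept at higher precision
    { "ffn_down",          "Q4_K" },  // down_proj treated as more sensitive than gate/up
    { "",                  "Q2_K" },  // everything else at low bpw
};

std::string type_for_tensor(const std::string & name) {
    for (const auto & rule : recipe) {
        if (rule.pattern.empty() || name.find(rule.pattern) != std::string::npos) {
            return rule.type;
        }
    }
    return "Q2_K";
}
```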
All the power to you :)
Unsloth found some low bpw quant recipes that they evaluated to be much better than the standard recipes. https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit
The highlight is their 158 GB recipe, which completed the test objective ("Create a Flappy Bird game in Python. You must include these things:...") with an average rubric score of 9.08/10 over 3 runs, while the larger standard recipe at 175 GB only gets 6.17, and the comparably sized standard recipe at 149 GB gets 1.67.
Unlike any of my recipes that turned out well, theirs can easily be replicated by you if you want. Also, your quants will either have less-than-optimal performance or no longer be supported once some upcoming Deepseek changes land in llama.cpp.
It's very easy to optimize for one specific answer, at the expense of making the model globally worse. For example, our imatrix data contains a lot more English than other languages, and so trades quality in those languages for better English. Optimizing for a single question seems useless to me, as a goal.
Anyways, there might be something good in here, but at the moment, it looks more like a publicity stunt than a usable method. If it's usable, it will eventually find its way into llama.cpp, at which point we will happily provide these quants.
It's very easy to optimize for one specific answer, at the expense of making the model globally worse. [...] Optimizing for a single question seems useless to me, as a goal.
Although that may be true, I don't think that's the case here. They show the full output of IQ1_S (131 GB their version, 133 GB standard recipe), and in all three examples the standard recipe gets stuck in a loop of repeating itself, while the modified recipe actually finishes the task. They may not have shown enough data to prove that they are generally better, but it definitely looks promising, and I don't see any reason to assume it is a one-trick pony. There are also good reasons to believe that there might be recipes that are FAR better suited to Deepseek, since it differs a lot from the models that the current recipes were designed for, even if the Unsloth ones aren't them.
If it's usable, it will eventually find its way into llama.cpp
The quants work with stock llama.cpp, the fork is only needed to generate them.
at which point we will happily provide these quants.
I'm not asking you to provide the quants; I don't mind making my own (with your imatrix.dat). I just thought you'd be interested.
The main reason I commented was to give you advance notice that this PR by the person who originally added Deepseek to llama.cpp adds two tensors to the GGUF and is not backwards compatible (and even if backwards compatibility is added, it will be at degraded performance).
I don't doubt that it looks promising... Anyway, yeah, if things break, we will deal with them, as always.