unsloth/DeepSeek-R1-GGUF · Perplexity comparison results

I had been wondering how the dynamic quants rank in terms of accuracy compared to the usual quants.
The question of benchmarks has also come up here repeatedly.
The only metric that was reasonably feasible on my limited system was perplexity (which requires only one run per quant).

Settings: -c 1024 -b 1024 (the four dynamic quants additionally with cache type q4_0).
The tests are based on a custom text file to limit the number of chunks (in addition, wiki.test produced nan errors in very early chunks).
Some tests consistently hit nan errors in llama-perplexity, so the run could not produce a final PPL (llama-perplexity uses its own calculation, not the simple average of all chunk values). Nevertheless, at least 16 of 40 chunks were always completed. The first chunks are volatile, but the same is true with wiki.test. That's why I think it's good to require a certain minimum number of chunks.
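To make the "minimum number of chunks" idea concrete, here is a minimal sketch in Python. It assumes the per-chunk values have already been copied out of the llama-perplexity output into a list, and that a run is only usable up to the first nan. The function name `usable_prefix` and the parameter `min_chunks` are made up for illustration and are not part of llama.cpp.

```python
import math

def usable_prefix(chunk_values, min_chunks=16):
    """Return the chunk values before the first NaN.

    Raises ValueError if fewer than min_chunks values are usable.
    Assumption: everything from the first NaN onward is discarded.
    """
    prefix = []
    for value in chunk_values:
        if math.isnan(value):
            break  # stop at the first nan error
        prefix.append(value)
    if len(prefix) < min_chunks:
        raise ValueError(
            f"only {len(prefix)} usable chunks, need at least {min_chunks}"
        )
    return prefix

# Placeholder numbers, not measured results: 20 good chunks, then a nan.
example = [5.0] * 20 + [float("nan")] + [5.0] * 19
print(len(usable_prefix(example)))
```

A quant whose run breaks off before chunk 16 would be rejected here instead of silently entering the comparison with too few values.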

A mix of different GGUFs was tested, including all four dynamic quants from unsloth and some others. The reference point is Q5_K_M (anything higher was not possible on my system). Bartowski mentioned that they use the same source model, so the results should be comparable on that basis, and I threw unsloth and bartowski quants together.

The graph is based on all chunks that worked (at least 16 of 40).
The delta% is based on the average value of the first 16 chunks (which every quant completed).
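For anyone who wants to reproduce the delta% from their own chunk values, here is a rough sketch. It assumes delta% means the relative difference of each quant's 16-chunk average to the Q5_K_M average, which is my reading of the description above and not a confirmed formula. The function name `delta_percent` and all numbers are placeholders, not the measured results.

```python
from statistics import mean

def delta_percent(chunks_by_quant, reference="Q5_K_M", n=16):
    """Relative difference (in %) of each quant's mean over the first n
    chunk values to the reference quant's mean over the same n chunks."""
    for name, values in chunks_by_quant.items():
        if len(values) < n:
            raise ValueError(f"{name} has fewer than {n} chunk values")
    ref_avg = mean(chunks_by_quant[reference][:n])
    return {
        name: (mean(values[:n]) / ref_avg - 1.0) * 100.0
        for name, values in chunks_by_quant.items()
    }

# Placeholder values only, chosen to keep the arithmetic easy to follow.
runs = {
    "Q5_K_M": [4.0] * 16,
    "UD_IQ1_S": [5.0] * 20,
}
for quant, delta in delta_percent(runs).items():
    print(f"{quant}: {delta:+.1f}%")
```

Using only the first `n` values keeps the comparison fair, since every quant is averaged over the same chunks even if some runs got further than others.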

Graphically, the quant results broadly cluster into 4 areas, and within each area the quants are close to each other:

UD_IQ1_S(unsloth)
UD_IQ1_M(unsloth)
UD_IQ2_XXS(unsloth), Q2_K(bartowski), UD_Q2_K_XL(unsloth)
IQ3_M(bartowski), IQ4_XS(bartowski), Q4_K_S(bartowski), Q5_K_M(unsloth)

Conclusions for me with regard to the dynamic quants:

UD_IQ2_XXS and UD_Q2_K_XL are very similar. The larger gaps are down to UD_IQ1_M and then again to UD_IQ1_S.
The two best dynamic quants are in the range of the usual Q2_K quant.
IQ3_M is still a clear step up in quality from UD_Q2_K_XL.

There are also other short tests:

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/21#67af6a33a44a3738ba47e476
Thanks @TobDeBer. The gaps between his quants look smaller than mine. He also used his own text file (tests with 3 chunks).
reddit: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/?rdt=62843
It is also mentioned there that nan errors can occur. This seems to be a general "problem" with DeepSeek R1 and llama-perplexity.

Of course, all the results are to be taken with a grain of salt; perplexity is only perplexity :)
But for me it was exciting as a first point of reference in terms of accuracy compared to the usual quants.

Finally, thanks for cooking all the great ggufs @shimmyshimmer @danielhanchen @bartowski and all other chefs on HF!