Interesting fate of my quant

#1
by mishima - opened

Hi, just saw it :) No idea why the original model is gone now; I only make quants for my own hardware, nothing else. And unfortunately I can't remember who the original author was :/

Hey, thanks for stopping by!
It's really sad that the fp16 is gone.. (I guess you don't have it anymore?)
But thanks to your quant not being too destructive, something could be done with it!

Yep, fp16 is gone :(

I merged the original fp16 by taking Sao10K/WinterGoddess-1.4x-70B-L2 and adding Yukang/Llama-2-70b-longlora-32k at full weight and Doctor-Shotgun/limarpv3-llama2-70b-qlora at 0.5 weight.

Did this with a couple other 70B models as well, but overall I felt like the results were poor. If you have sufficient CPU RAM (or swap space), you should be able to recreate the fp16 from scratch.
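For anyone wondering what "full weight" and "0.5 weight" mean in practice, here is a minimal sketch of the arithmetic, assuming both add-ons are applied as low-rank (LoRA-style) deltas on top of the base weights. The tiny matrices, the `alpha / rank` scaling and the helper names are illustrative assumptions, not the actual script used for this merge.

```python
# Toy illustration of stacking two LoRA-style deltas onto one base weight matrix.
# All shapes and values are made up; a real 70B merge does this per layer with
# a tensor library, not with nested lists.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, alpha, rank, merge_weight):
    """Return weight + merge_weight * (alpha / rank) * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    scale = merge_weight * alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]                # stand-in for one base model matrix
long_b, long_a = [[1.0], [0.0]], [[0.0, 2.0]]  # stand-in long-context adapter (rank 1)
rp_b, rp_a = [[0.0], [1.0]], [[4.0, 0.0]]      # stand-in roleplay adapter (rank 1)

# First adapter at full weight, second adapter at half weight.
merged = apply_lora(base, long_b, long_a, alpha=1.0, rank=1, merge_weight=1.0)
merged = apply_lora(merged, rp_b, rp_a, alpha=1.0, rank=1, merge_weight=0.5)
print(merged)
```

On the RAM remark: a 70B model stored at fp16 takes roughly 70e9 x 2 bytes, so about 140 GB for the weights alone, which is why the merge is usually done in CPU RAM or swap rather than on a GPU.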

Well, at least we got the recipe, thanks!

It was a fun experiment for me, and it worked quite well at 10k context (linear rope 2.5), where the model showed its best compromise between quality and context length on my rig.

The lack of finetuned 70b models with 32k context, besides Aurelian Alpha, prompted me to do these requants; I also wanted to test the brand-new iMatrix feature of llama.cpp.

Now, this model is outdated for my taste, considering that Miqu/Mistral 70b entered the 70b scene in quite a flamboyant manner, and that Gryphe's MergeMonster tool made a remarkable entry in the Yi 34b segment with TeeZee's Kyllene 34b merge, where many GPTisms/Llamaisms/Yi-isms are trimmed down.

Nevertheless, DS, you should leave your experiments online when they fill a gap in what's available, as your 70b 32k experiments did. The linear rope 2.5 trick really improved things to the point of making these models quite viable at 10k, and in a less destructive manner than NTK rope alpha 3.5 (for a similar context size), in my opinion.

The experiments were meant to be used with linear rope of 8.0, as that was what longlora-32k was trained with. Linear rope 2.5 should cause the results to be even more borked if anything.
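As a rough sketch of what the linear rope factor does: positions are divided by the scale before the rotary angles are computed, so a scale of 8 maps 32k positions back into the native 4k range. The snippet assumes a Llama 2 native context of 4096, a head dimension of 128 and a rotary base of 10000; the function names are made up for illustration.

```python
# Linear RoPE scaling (position interpolation): positions are divided by a
# scale factor before the rotary angles are computed.
NATIVE_CTX = 4096  # Llama 2 training context length

def max_context(scale, native_ctx=NATIVE_CTX):
    """Longest context whose scaled positions stay inside the native range."""
    return int(native_ctx * scale)

def rotary_angles(position, scale, head_dim=128, base=10000.0):
    """Rotary angles for one token position under linear scaling."""
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

for scale in (2.5, 8.0):
    print(f"linear scale {scale}: up to {max_context(scale)} tokens")

# Position 10240 at scale 2.5 lands on the same angles as position 4096 unscaled.
assert rotary_angles(10240, 2.5) == rotary_angles(4096, 1.0)
```

This is only the positional arithmetic; it says nothing about which scale a given adapter tolerates best, which is what the thread is debating.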

If you want 10k ctx, I would recommend dropping longlora-32k entirely and just using the base model with NTK alpha of ~3.55. I was going to link you to the merge without longlora-32k, but then I just realized that I never public'd that repo at all. Anyhow here that is: https://huggingface.co/Doctor-Shotgun/WinterGoddess-1.4x-limarpv3-70B-L2

I don't have much faith in linear rope in general at this point. For training long context speculative decoding models, I found that rope theta extension (which is similar to how NTK functions) works better. And this seems to be the method favored by Mistral's newer models and Codellama as well.
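For comparison with linear scaling, here is a minimal sketch of how an NTK alpha value is commonly turned into a larger rotary base (theta) in Llama 2 loaders. The formula and the head dimension of 128 are stated as assumptions about typical implementations, not taken from any specific tool mentioned in this thread.

```python
# NTK-aware scaling as commonly implemented: the alpha value raises the rotary
# base (theta) instead of compressing the positions.

def ntk_rope_base(alpha, head_dim=128, base=10000.0):
    """Rotary base after applying an NTK alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))

for alpha in (2.0, 3.55):
    print(f"alpha {alpha}: rope base ~ {ntk_rope_base(alpha):.0f}")
```

Under these assumptions an alpha of about 3.55 raises the base from 10000 to roughly 36000, which is still far below the theta of 1,000,000 that some models ship with out of the box.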

Well, look at the difference between linear rope 8 (rows tagged PEC8) and linear rope 2.5 (rows tagged PEC2.5) in some benchmarks I ran:

WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag,86.75,,400,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag,86.7,,1000,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag_Bin,80.5,,400,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag_Bin,84,,1000,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Arc-Challenge,49.16387960,,299,2024-01-29 05:40:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Arc-Easy,77.01754386,,570,2024-01-29 05:40:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,MMLU,46.64536741,,313,2024-01-29 05:40:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Truthful-QA,35.49571603,,817,2024-01-29 05:40:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Winogrande,76.3220,,1267,2024-01-29 05:40:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,wikitext,5.7239,512,512,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,wikitext,4.5767,4096,4096,2024-01-29 00:00:00,PEC8,70b,Mistral_Medium,32768,,,GGUF,

WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag,88.5,,400,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag,87.9,,1000,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag_Bin,82.5,,400,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Hellaswag_Bin,86.6,,1000,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Arc-Challenge,52.84280936,,299,2024-01-29 05:40:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Arc-Easy,76.14035088,,570,2024-01-29 05:40:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,MMLU,47.60383387,,313,2024-01-29 05:40:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Truthful-QA,37.57649939,,817,2024-01-29 05:40:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,Winogrande,78.0584,,1267,2024-01-29 05:40:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,wikitext,4.1818,512,512,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,
WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-b2007-iMat-c32_ch2500-IQ3_XXS.gguf,-,wikitext,3.5759,4096,4096,2024-01-29 00:00:00,PEC2.5,70b,Mistral_Medium,32768,,,GGUF,

Not so bad for a borking!
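To make the two groups of rows easier to compare, here is a small sketch that puts the scores side by side. It assumes the fourth CSV field is the score, that the wikitext rows are perplexities (lower is better) and that every other row is an accuracy (higher is better); the short row labels are mine, with the task count or context size appended where a benchmark appears twice.

```python
# Scores transcribed from the rows above as (PEC8, PEC2.5) pairs.
results = {
    "Hellaswag-400": (86.75, 88.5),
    "Hellaswag-1000": (86.7, 87.9),
    "Hellaswag_Bin-400": (80.5, 82.5),
    "Hellaswag_Bin-1000": (84.0, 86.6),
    "Arc-Challenge": (49.16387960, 52.84280936),
    "Arc-Easy": (77.01754386, 76.14035088),
    "MMLU": (46.64536741, 47.60383387),
    "Truthful-QA": (35.49571603, 37.57649939),
    "Winogrande": (76.3220, 78.0584),
    "wikitext-512": (5.7239, 4.1818),
    "wikitext-4096": (4.5767, 3.5759),
}
LOWER_IS_BETTER = {"wikitext-512", "wikitext-4096"}  # assumed to be perplexity

def better_at_2_5(name, pec8, pec25):
    """True when the linear rope 2.5 run beats the linear rope 8 run."""
    return pec25 < pec8 if name in LOWER_IS_BETTER else pec25 > pec8

wins = [name for name, (a, b) in results.items() if better_at_2_5(name, a, b)]
for name, (a, b) in results.items():
    print(f"{name:20s} PEC8={a:8.4f}  PEC2.5={b:8.4f}  delta={b - a:+.4f}")
print(f"linear rope 2.5 is ahead on {len(wins)} of {len(results)} rows")
```

These are single runs on small samples, so the per-row differences should be read as indicative rather than statistically solid.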

I didn't notice any issue in SillyTavern when I lowered the linear rope, which is actually quite flexible downwards in most of the linear-rope extended-context models I toyed with (the best result is at 2, but 2.5 adds only 0.01-0.02 perplexity in exchange for 2k more context). NTK, on the other hand, is destructive with numbers and, in my experience, is best kept at alpha 2-2.2 maximum to limit the degradation.

But of course, models using a rope theta of 1,000,000 or more out of the box are the best compromise; Yi and now the Mistral leak proved that, not to mention that weird CodeLlama family (the 70b apparently even lost its theta of 1,000,000 and went back to 4k context..). Also, I'd rather not have to bother with scale or base frequency anymore, so..

Could you please check if the model can expand an input article of about 2500 words? In all the cases I've seen so far, when I add an article of around 2500 words and ask the model to expand it, the output is only around 1000 words, which means the model couldn't handle it properly. Could you please test it and report the result here?

I'm sorry, Hoioi, but I don't have the time to spare for such specific tests, especially because they require human review and several iterations.

Nevertheless, here are a few tips to help you:

  1. You have an option in SillyTavern to auto-continue the inference of an answer (in the panel where you set the prompt format).

  2. Here's what Doctor Shotgun says about LimaRP, which is included in this model:

"Message length control

Due to the inclusion of LimaRP v3, it is possible to append a length modifier to the response instruction sequence, like this:

Input

User: {utterance}

Response: (length = medium)

Character: {utterance}

This has an immediately noticeable effect on bot responses. The available lengths are: micro, tiny, short, medium, long, massive, huge, enormous, humongous, unlimited. The recommended starting length is medium. Keep in mind that the AI may ramble or impersonate the user with very long messages."

The fact that this model is a merge might make that feature work less reliably, though.
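If you build prompts from a script rather than a frontend, the length modifier can be appended along these lines. This is only a sketch that reproduces the layout as quoted above; the function name is hypothetical, and any extra header markers the original model card may use around "Input" and "Response" are not visible in the quote, so they are left out here.

```python
# Length labels listed in the quoted LimaRP v3 note.
LENGTHS = ("micro", "tiny", "short", "medium", "long", "massive",
           "huge", "enormous", "humongous", "unlimited")

def build_turn(user_text, length="medium"):
    """Format one user turn with a length modifier on the response line."""
    if length not in LENGTHS:
        raise ValueError(f"unknown length {length!r}; expected one of {LENGTHS}")
    return (
        "Input\n\n"
        f"User: {user_text}\n\n"
        f"Response: (length = {length})\n\n"
        "Character:"
    )

print(build_turn("Please expand the article below. <article text>", length="unlimited"))
```

For a long expansion task, one of the larger labels such as humongous or unlimited would be the obvious thing to try, keeping in mind the quoted warning about rambling with very long messages.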

Thank you so much for your reply. I will test your suggestions and I hope they work for me.
