https://huggingface.co/databricks/dbrx-instruct
Mirror that isn't gated: https://huggingface.co/alpindale/dbrx-instruct
For me, ./convert-hf-to-gguf didn't work, so you might need to use ggml-dbrx-instruct-16x12b-f16.gguf from dranger003/dbrx-instruct-iMat.GGUF instead.
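If it helps, that f16 can be grabbed with something like the following huggingface-cli command (the include pattern is just a guess at matching the split files):
huggingface-cli download dranger003/dbrx-instruct-iMat.GGUF --include "ggml-dbrx-instruct-16x12b-f16*" --local-dir .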
Can you please quantize the DBRX Instruct model? It is an awesome model for software developers. The currently available DBRX GGUF quants use an imatrix worse than yours and are missing the i1-Q5_K_M size I need. In addition, I would love to compare your quantization with the one I created myself.
How are you creating i1-Q5_K_M quants? I tried creating them myself so I don't have to bother you, but I'm not sure whether the imatrix even got applied. Here is the command I used - is this similar to how you do it?
./quantize --imatrix dbrx-16x12b-instruct-f16.imatrix ./ggml-dbrx-instruct-16x12b-f16-00001-of-00006.gguf dbrx-16x12b-instruct.i1-Q5_K_M.gguf Q5_K_M 12
Here are my perplexity results using wiki.test.raw:
dbrx-16x12b-instruct.Q5_K_M.gguf: PPL = 4.6092 +/- 0.03003
dbrx-16x12b-instruct.i1-Q5_K_M.gguf: PPL = 4.6003 +/- 0.02989
The difference seems very minor and within the margin of error, but the one where I hopefully applied the imatrix during quantization did perform slightly better.
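For reference, these numbers come from llama.cpp's perplexity tool, run roughly like this for each quant (the model path is just a placeholder):
./perplexity -m dbrx-16x12b-instruct.i1-Q5_K_M.gguf -f wiki.test.raw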
Thank you so much for providing your highest quality GGUF quants for free. Your service is invaluable for me and for many other software developers who use your quantized models daily as a programming assistant.
I had the same experience (convert-hf-to-gguf), and had no idea that dranger003 helpfully provides the converted model. My pipeline doesn't support quantizing from another gguf repo (and split ggufs), but I can give it a try manually.
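(Manually meaning, presumably: merge the splits first with llama.cpp's gguf-split, something like the line below if I remember the syntax right, and then quantize from the merged f16.)
./gguf-split --merge ggml-dbrx-instruct-16x12b-f16-00001-of-00006.gguf ggml-dbrx-instruct-16x12b-f16.gguf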
I am not sure why my imatrixes would be better (other than wishful thinking :), but I'm always happy for comparisons.
I generate the Q5_K_S the same way as any other quant, essentially using no special switches.
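In other words, essentially the same as your command, roughly the standard two steps (paths purely illustrative):
./imatrix -m model-f16.gguf -f training-data.txt -o model.imatrix
./quantize --imatrix model.imatrix model-f16.gguf model.i1-Q5_K_S.gguf Q5_K_S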
As for perplexity... the expectation is that there isn't a big difference between the Q5_K_S with and without imatrix in the first place, and perplexity can be all over the place, so it is not an indication of bad (or good) quality.
Anyway, I'll try to generate quants, static first and then imatrix ones, as usual.
It can take a while as all my servers are stuck with big models, and are at the limits of their available disk space.
Thank you so much for giving it a try!
It can take a while as all my servers are stuck with big models, and are at the limits of their available disk space.
No hurry. While I'm really excited for it, I have no problem waiting. I'm really happy you are doing them despite much more manual work being required than usual. Thanks a lot!
There might be a small difference depending on which imatrix is used. For DBRX there is f16_imatrix-wiki.dat trained on wiki.train from dranger003, and jukofyork/dbrx-instruct-imatrix trained on groups_merged.txt from jukofyork. According to jukofyork, his imatrix is likely better suited for coding tasks compared to the one from dranger003. I have the feeling yours will be better than both of them.
Thanks a lot for answering all my questions. If perplexity is not a good way to measure the strength, what should I use instead? I guess I will try comparing GSM8K benchmark results and see if that gives a more accurate comparison. Comparing quantized models is a lot of fun and I have spare resources (512 GB RAM/66 GB VRAM) to run some benchmarks. Comparisons between all the different quants and no imatrix vs. imatrix would be quite cool, as most graphs comparing GGUF quants seem to compare perplexity instead of real benchmark results.
I usually shy away from non-english and/or coding model imatrix quants, because I can't imagine my training data (basically groups_merged.txt plus a lot of english sentence fragments) would be better for that specific target, but from what I saw, it makes relatively little difference what training data you use, as practically all imatrix quants are much better than static ones. However, measurements/data trumps beliefs.
As for perplexity - it's not necessarily a bad metric. KL divergence is considered to be better when comparing quants, but none of the existing metrics correspond very well to human perception and checks, i.e. the best check is human evaluation for the specific task, which in turn is quite subjective :/
The advantage of perplexity is that it's easy to compute, that's probably why it is so popular.
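If you want to try KL divergence, the perplexity tool can compute it, roughly the two runs below (from memory, so double-check the flags): first dump the reference logits from the unquantized model, then compare the quant against them.
./perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat
./perplexity -m model.i1-Q5_K_M.gguf --kl-divergence-base logits-f16.dat --kl-divergence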
I noticed that the f16 from dranger is already quantized (the source is in bf16). That means I am quantizing from a potentially already quality-reduced model. Shouldn't really be a problem, though.
https://huggingface.co/mradermacher/dbrx-instruct-i1-GGUF
The Q5 is there, and the remaining 7 quants will arrive in the next few hours. Have fun and measure away :)
I'll be interested for feedback on your quants of this model!
I've quantized it from scratch 4 times now and twice it's had some weird issue where it turns out awful: worse than a broken-frankenmerge!
I stupidly deleted my copy today, expecting to be able to re-quant it using the new llama.cpp version with improved BPE tokenizer support... Only to find DBRX isn't added yet :O
Managed to create a broken model yet again, so I've gone right back to the original b2665 branch where support for DBRX was merged... I've also reduced the number of threads in case it's some kind of race condition causing this.
I'm almost certain the FP16 model is getting created correctly and I've run the imatrix creation twice today and got the same results too. It's definitely not corrupted .safetensors files either, as when this happened last time I did a full re-download to be sure.
Just ran a few iterations of perplexity and it's showing nice low values of 2.7..3, etc., so fingers crossed I've not made another broken version.
@mradermacher are you having any other problems with the new llama.cpp versions released since the BPE stuff was fixed? I can't seem to quant qwen-110b-chat now and mixtral-8x22b-instruct is just totally broken whatever I do :/
Gonna retry these on the b2665 branch I compiled for DBRX to see if it makes any difference.
@nicoboss if you do get this working, make sure not to use any repeat penalty at all (set it to 1).
I found it was super-sensitive to this and became so lazy it was almost funny (much worse than GPT4, etc):
I asked it to write some C++ code to train a logistic regression model and it started a for loop at a number like 178! When I asked it why, it said because the loop can start at anything?! It was actually the repeat penalty that caused this lol.
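e.g. if you're running it via llama.cpp's main, that's just something like the line below (the model path is whatever quant you're using):
./main -m dbrx-instruct.i1-Q5_K_M.gguf --repeat-penalty 1.0 -p "Write some C++ code to train a logistic regression model."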
llama.cpp was a disaster, and it absolutely got worse recently. the solution of having to patch the script is clunky, and you can't get a clear answer as to whether convert.py or convert-hf-to-gguf is the right choice for llama (and others) now. I currently patch the pretokenizer config into a copy of that script, and my job scheduler selects the appropriate script, which helps a lot, but there is a lot of uncertainty. without that, almost nothing converts at the moment (including llama3-8b itself...).
on the other hand, I don't even have time to test quants at the moment, so unless somebody tests and points out that there is a problem, i am blissfully unaware. i do wish convert.py wouldn't silently generate stuff that insta-crashes with llama.cpp though. but even if it works, who knows whether it's correct - maybe it works, but is only reduced in quality. so much uncertainty.
but after having been told that imatrix is just a gimmick and nobody cares whether it works, and only base models matter to llama.cpp because all these amateur model merges are likely useless anyway, I guess I learned my place and reduced my expectations considerably.
Yeah, this is almost farcical! I've just managed to get qwen-110b working by quantizing with an older branch of llama.cpp from mid-April (it just spouts gibberish when I try to quantize it with the recent llama.cpp).
The crazy thing is they seem to have done this to fix a small error affecting a few percent of models' tokenizers and in the process made the vast majority of all other models either unquantizable or just plain broken :/
I still can't get dbrx to work though... I'm pretty sure it must be the imatrix I'm creating, as the original Q4_0 posted by the guy who added the dbrx PR to llama.cpp works better than even Q5_K_M in my tests... I'm now running some tests with n_experts set to 16 whilst creating the imatrix, in case some of the MLPs in the MoE are just never getting chosen or something. If that doesn't work then I'm gonna set the context length to 32k and set the imatrix calculation running all night.
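(In case anyone wants to replicate the 16-expert imatrix run: one way to force the expert count is llama.cpp's KV override, roughly like the line below - the exact key name is from memory, so double-check it.)
./imatrix -m dbrx-instruct-f16.gguf -f groups_merged.txt -o dbrx-16experts.imatrix --override-kv dbrx.expert_used_count=int:16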
I'll report back and upload my imatrix if I ever get this working!
Well, we are all volunteers, and that includes the llama.cpp developers. I just sometimes wish they wouldn't add new features such as imatrix and then let it rot. Or clearly say they ignore a bug report because they consider the model too unimportant.
If there are tokenizer changes, the imatrix should/must be redone, because it was done using the wrong tokens (although I don't know of anybody who studied the effects). But what you are seeing is bad quality with a new imatrix?
And gibberish is not the expected failure mode of newer llama.cpp - the expected failure mode is refusal. Maybe that's a different issue?
I'll try qwen-110b and see where it leads me.
If there are tokenizer changes, the imatrix should/must be redone, because it was done using the wrong tokens (although I don't know of anybody who studied the effects). But what you are seeing is bad quality with a new imatrix?
And gibberish is not the expected failure mode of newer llama.cpp - the expected failure mode is refusal. Maybe that's a different issue?
No, this is using the new code for the imatrix calculation too. I think it might be that something in the new llama.cpp tokenizer code is just slightly broken:
- The qwen-110b-chat I made didn't work at all and just output "semi-sensible text" (i.e. not just random characters) and kept stopping all the time.
- The official mixtral-8x22b-instruct failed in almost the same way as qwen-110b-chat, but I have successfully quantized (with imatrix) two other 8x22b models and they worked absolutely fine using the same pull of llama.cpp!?
- The "broken" dbrx-instruct model just acts really dumb (compared to when it wasn't broken), but I also noticed it stopped randomly in the middle of the text too. One refactoring code test I gave it had a camelCase variable name ending in "Hunt" and it wrote "Hunk" and then stopped, like the other two broken models above.
I've also noticed that command-r-plus seems to work differently when quantized with the old version of llama.cpp (with a fresh imatrix too): when asked to create a story with "internal monologues of the characters", the old version will use markdown stars to turn the text italic, whereas the new version will use speech marks, which in turn makes the stories it generates worse and the thoughts much more simple/childish...
For now I'm just going to stick to the mid-April pull for anything I quantize until all this upheaval calms down.
Indeed, the pretokenizer is not detected and conversion fails (this is with another qwen2 model, could be a fluke, but probably is not). I don't even see code support for qwen2. From what I see, the default would not be correct (that's essentially gpt2).
But there is progress on the llama.cpp side:
readme : add note that LLaMA 3 is not supported with convert.py (#7065)
and command-r support for the tokenizer has been merged.
"old" gguf should work as "badly" with old llama.cpp as with new llama.cpp. The changes "should" only take effect when the gguf was converted with a new version.
Well, we are all volunteers, and that includes the llama.cpp developers. I just sometimes wish they wouldn't add new features such as imatrix and then let it rot. Or clearly say they ignore a bug report because they consider the model too unimportant.
I think I can improve the imatrix code (in theory):
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-9163642
I don't really know anything about CUDA so I hoped somebody would reply, and the silence makes me wary of even starting :/
I do know a lot about optimization (or at least I did 20 years ago!) and there are many glaring errors with the method being used that could easily be fixed... The problem is that one of the fixes might require scaling from O(n) to O(n^2) in terms of FLOPs and RAM use, and this may or may not be possible to do using CUDA very efficiently. If it is not feasible then a different approach would be needed to sample from the covariance matrix (if compute bound) and/or try to create a low-rank approximation (if memory bound).
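Roughly what I mean, as I understand the current code (so treat this as a sketch rather than a statement of fact): the imatrix only accumulates a per-column second moment of the activations, while a proper second-order treatment would need the full (uncentred) covariance:
w_i = \frac{1}{N} \sum_{t=1}^{N} x_{t,i}^2 \quad (\text{what is accumulated now; } O(n) \text{ per tensor})
\Sigma_{ij} = \frac{1}{N} \sum_{t=1}^{N} x_{t,i} \, x_{t,j} \quad (\text{full covariance; } O(n^2) \text{ per tensor, hence the scaling worry})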
There are lots of other things they are doing that would make any statistician cringe (like how similar the wiki.train.raw and wiki.test.raw data distributions are) and likely explain some of the head-scratching results, like Q8_0 having lower perplexity than FP16 sometimes.
"old" gguf should work as "badly" with old llama.cpp as with new llama.cpp. The changes "should" only take effect when the gguf was converted with a new version.
Yeah, I think I'm just going to wait it out and see. I'm in no real rush to use llama3 until somebody successfully extends the context.
Don't let me dissuade you from contributing to llama.cpp! But you can't really force them to react. Maybe contact ikawrakow directly?
And I don't think it can explain anything about Q8_0, because that's not an imatrix quant, afaik. Also, I am sure they know that wiki.*.raw is not optimal - and not everybody is using it.
I've just got the results from setting dbrx to use all 16 experts instead of just 4:
imatrix created with 4 experts
Final estimate: PPL = 8.7773 +/- 0.17953
imatrix created with 16 experts
Final estimate: PPL = 8.8745 +/- 0.17301
Now using that imatrix to quantize the original FP16 (with 4 experts) to see the effect.
Don't let me dissuade you from contributing to llama.cpp! But you can't really force them to react. Maybe contact ikawrakow directly?
And I don't think it can explain anything about Q8_0, because that's not an imatrix quant, afaik. Also, I am sure they know that wiki.*.raw is not optimal - and not everybody is using it.
No I'm not that bothered really :) It's just the path to take depends on how feasible it is - there's no point in making a much better imatrix file if it takes 3 weeks to run!
What you definitely should do is open an issue for qwen/qwen2. AFAICS, nobody has done that yet, and none of their existing ggufs have a pretokenizer set, all use gpt2 as tokenizer (which is the incorrect one according to the qwen docs).
I think I may have got somewhere with seeing the problem:
https://github.com/ggerganov/llama.cpp/pull/6515#issuecomment-2094888744
This might also explain why my mixtral-8x22b-instruct model was so broken too...
I'd hold off on quantizing any more MoE models for a while:
See the bottom of this PR: https://github.com/ggerganov/llama.cpp/pull/6387
and this PR: https://github.com/ggerganov/llama.cpp/pull/7099
I can confirm that it fixed all the problems I had with mixtral:8x22b-instruct writing gibberish and stopping mid-sentence, and it definitely seemed to make dbrx-instruct work better than either with no imatrix or with the old/broken imatrix (neither of the models is that great for coding though, as they seem to be trained heavily on "lazy GPT-4" output...).
Just redoing the wizard-lm-2:8x22b imatrix now to compare (it was working pretty well before though).
I've uploaded fixed imatrix files for these:
https://huggingface.co/jukofyork/dbrx-instruct-imatrix
https://huggingface.co/jukofyork/Mixtral-8x22B-Instruct-v0.1-imatrix
https://huggingface.co/jukofyork/WizardLM-2-8x22B-imatrix
All created using groups_merged.txt.
I'd probably hold off using these though, as slaren is checking my PR and it might end up getting merged soon.
Even though I never noticed any problems with wizard-lm-2, it clearly was affected by this problem as it produces even better code now:
https://github.com/ggerganov/llama.cpp/pull/7099#issuecomment-2096089599
The other 2 MoE models went from gibbering idiots to borderline-meh, but I think that's more the models themselves than the imatrix now.
I'm going to try this on Eurux:8x22b-nca now to see if it has the same effect.
Why does it always affect the models that were by far the most expensive to quant :) Anyway, this is great and terrifying news. Not sure how to deal with all the bad quants now, though (not the least of them my own).
I just compared your Q5_K_M quant with a Q5_K_M one I created using the fixed imatrix from https://huggingface.co/jukofyork/dbrx-instruct-imatrix. The fixed imatrix indeed performed much better in the logieval benchmark. I too noticed the issue with the quantized model randomly stopping mid-sentence which seems to be fixed when quantizing using the fixed imatrix from @jukofyork as well.
lm_eval --model local-chat-completions --tasks logieval --num_fewshot=0 --limit 1500 --batch_size=1 --output_path=result --log_samples --model_args model=dbrx-16x12b-instruct.i1-Q5_K_M.gguf,base_url=https://ai.nico.re:5000/v1
dbrx-16x12b-instruct.i1-Q5_K_M_jukofyork_fixed.gguf: 0.4753 (stderr: 0.0129)
dbrx-16x12b-instruct.i1-Q5_K_M_mradermacher.gguf: 0.4647 (stderr: 0.0129)
turboderp_dbrx-instruct-exl2_3.0bpw: 0.448 (stderr: 0.0128)
logieval results of other models as comparison:
dolphin-2.9-mixtral-8x22b.i1-Q5_K_M_mradermacher.gguf: 0.5693 (0.0128) (version with fixed imatrix still needs to be tested)
blockblockblock_dolphin-2.9-mixtral-8x22b-bpw3-exl2: 0.5353 (0.0129)
Dracones_WizardLM-2-8x22B_exl2_3.0bpw: 0.4213 (0.0128)
@mradermacher It would be awesome if you could redo the broken imatrix quants using the fixed llama.cpp. If not possible at least use the fixed version for future models.
You don't need to use my imatrix now either - the PR got merged so all new MoE imatrix files from now on should be fine.
I've made a very diverse imatrix file for wizard-lm-2:8x22b if that would save you having to redo it? It is aimed purely at coding though:
https://github.com/ggerganov/llama.cpp/pull/7099#issuecomment-2099086062
but it definitely seemed to work well (even compared to the new "fixed" imatrix created from groups_merged.txt).
I will do one for Mixtral-8x22B-Instruct-v0.1 and dbrx-instruct sometime next week if I get a chance too.
I uploaded the imatrix file here: https://huggingface.co/jukofyork/WizardLM-2-8x22B-imatrix-CODE-SPECIFIC
I also added the C++ code I used to generate the 'pseudo-groups_merged' dataset (written 100% by wizard-lm-2!).
I uploaded the imatrix file here: https://huggingface.co/jukofyork/WizardLM-2-8x22B-imatrix-CODE-SPECIFIC
I also added the C++ code I used to generate the 'pseudo-groups_merged' dataset (written 100% by wizard-lm-2!).
Thanks a lot! I might have underestimated WizardLM-2. Really impressive how good it is at coding. I will definitely apply your imatrix and give it a try. Cool how you showed the comparison between the code generation using different imatrix files and how it affects the output. Your imatrix dataset generator is great. I might give it a try, because the groups_merged.txt imatrix seems to perform worse than the imatrix files generated using mradermacher's private imatrix dataset. I hope we can have a great public imatrix dataset sometime in the future.
I will do one for Mixtral-8x22B-Instruct-v0.1 and dbrx-instruct sometime next week if I get a chance too.
That would be absolutely awesome. I love both of those models. Please do dolphin-2.9-mixtral-8x22b as well. I assume that one is likely broken too as it is based on Mixtral-8x22B.
Oh my, sorting through all this now. Soo.. it's merged, very good!
As for using another imatrix, I don't have a problem with that per se, but my scripts are currently completely hardcoded for "-i1" which indicates my training data, and I am too lazy to fix that until I absolutely have to. My training data is probably not very good for coding models anyway (it's optimized for english rp, but it contains groups_merged.txt, so has some coding), and the "-i1" is the indicator for my quality (whether good or bad :). And mixtral 8x22b models kill my servers due to the enormous space requirements specifically those models have, so I have to do quants on slow computers, taking weeks.
Sooo... I would be happy if people would list the models they would like to have redone (as I understand it, static quants are unaffected), and I will just delete the imatrix quants and eventually redo them.
In the meantime, I think it is absolutely great to have some variety in imatrix quants, so if juky wants to do some imatrix quants, please, do so! I can even mention them on the model cards (if people tell me about it). The fact that I might or might not do my own ones should not stop anybody else.
PS: what's the wizardlm you used (upstream is gone, and surprisingly, didn't come back)
Guess it would affect at least all these. That sucks.
/gguf/imatrix/Eurux-8x22b-nca.imatrix
/gguf/imatrix/Goku-8x22B-v0.1.imatrix
/gguf/imatrix/Goku-8x22B-v0.2.imatrix
/gguf/imatrix/Karasu-Mixtral-8x22B-v0.1.imatrix
/gguf/imatrix/Matter-0.2-8x22B.imatrix
/gguf/imatrix/Mixtral-8x22B-Capyboros-v1.imatrix
/gguf/imatrix/Mixtral-8x22B-v0.1.Q4_K_M.imatrix
/gguf/imatrix/SchizoGPT-8x22B.imatrix
/gguf/imatrix/Tess-2.0-Mixtral-8x22B.imatrix
/gguf/imatrix/Wizard-Mixtral-8x22B-Instruct-v0.1.imatrix
/gguf/imatrix/dolphin-2.9-mixtral-8x22b.imatrix
/gguf/imatrix/mixtral-8x22b-instruct-oh.imatrix
PS: what's the wizardlm you used (upstream is gone, and surprisingly, didn't come back)
I'm using https://huggingface.co/alpindale/WizardLM-2-8x22B for WizardLM-2-8x22B. I have the feeling it's what everyone is currently using. It's a re-upload of the official release and so using it instead of the now deleted official release shouldn't make any difference.
Sooo... I would be happy if people would list the models they would like to have redone
If I had to make a wishlist of the ones I care about the most:
Top 1: dolphin-2.9-mixtral-8x22b.i1-Q5_K_M.gguf
Top 2: dbrx-instruct.i1-Q5_K_M.gguf
Top 3: Mixtral-8x22B-Instruct-v0.1.i1-Q5_K_M.gguf
Top 3: WizardLM-2-8x22B.i1-Q5_K_M.gguf
thanks. yeah, i trust alpindale. i'll redo these for the moment:
https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1
https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
https://huggingface.co/databricks/dbrx-instruct
https://huggingface.co/cognitivecomputations/dolphin-2.9-mixtral-8x22b
https://huggingface.co/migtissera/Tess-2.0-Mixtral-8x22B
https://huggingface.co/openbmb/Eurux-8x22b-nca
https://huggingface.co/tdrussell/Mixtral-8x22B-Capyboros-v1
https://huggingface.co/alpindale/WizardLM-2-8x22B
Can't give an ETA though - it will probably take multiple days per model, and only one machine can reasonably do them (or I invent something to move jobs between servers halfway through once the storage requirements are down).
In the meantime, I think it is absolutely great to have some variety in imatrix quants, so if juky wants to do some imatrix quants, please, do so! I can even mention them on the model cards (if people tell me about it). The fact that I might or might not do my own ones should not stop anybody else.
I absolutely love mradermacher's quants as they are just consistently good no matter the use case. Having them standardized across thousands of models adds a lot of value, as it makes different models more comparable. I highly recommend mradermacher stick with his well-established i1 standard. However, having the choice of which imatrix to use is great. I highly appreciate jukofyork's work and am excited to try out a coding-specific imatrix for models I will primarily use for coding. I encourage everyone to share their imatrix quants for the community to try. It's quite interesting to compare them and see which one works best for a specific use case.
thanks. yeah, i trust alpindale. i'll redo these for the moment:
Awesome! Thanks a lot for redoing all of them.
Can't give an ETA though - it will probably take multiple days per model, and only one machine can reasonably do them (or I invent something to move jobs between servers halfway through once the storage requirements are down).
No hurry. Take your time. You are doing an absolutely amazing job. You have no idea how much I appreciate you doing all this work for the community. Your quants already saved me weeks of time quantizing them myself. If there is any way I can help regarding hardware resources just let me know. My main PC has 512 GB DDR5 octa-channel RAM, 7975WX CPU (32 cores 64 threads), 2x RTX 4090, RTX 3080, RTX 2070s, 10 TB of M.2 SSD, 1 Gbit/s down 100 Mbit/s up and uses Proxmox as hypervisor. I also have an old PC with 256 GB DDR4 quad-channel RAM, a 3970X CPU (32 cores 64 threads) and an empty 2 TB M.2 SSD sitting completely unused.
Wow, hmm, that's a rather cool hardware setup you have there... Right now, I can't really make imatrixes from very big models without using low-bpw quants (I top out around 120GB for a quant), so that sounds like a very useful offer. I have zero experience with proxmox, but I assume I can give you some linux disk image that you can then run virtualised, allocating some resources to it? All my image needs would be access to the internet, but not from the internet (it would use a wireguard tunnel). At least, that's the setup all my quantize nodes have right now.
And, sure, if you don't use your old PC, we could convert it into a quantize node the same way. The limitation there would be upload bandwidth (at full speed, 100 Mbit/s would be about 1 TB/day - obviously I wouldn't want to hog your connection, so probably much less in practice), so that would be the limiting factor for quantising, I assume.
How dynamic can proxmox manage memory? It wouldn't make sense to allocate hundreds of GB of RAM for the occasional big model that needs it.
PS: I am rarely jealous. But today, I am :) And looked it up, seems proxmox can simply import a qcow2 image.
Wow, hmm, that's a rather cool hardware setup you have there... Right now, I can't really make imatrixes from very big models without using low-bpw quants (I top out around 120GB for a quant), so that sounds like a very useful offer. I have zero experience with proxmox, but I assume I can give you some linux disk image that you can then run virtualised, allocating some resources to it? All my image needs would be access to the internet, but not from the internet (it would use a wireguard tunnel). At least, that's the setup all my quantize nodes have right now.
Awesome to hear that my hardware could help. No worries I'm an expert with Proxmox. You can just give me a disk image and list me the required resources and I will create a VM for you. WireGuard VPN is awesome. I'm using it myself to access my home PC from work. Just tell me the required port forwarding rules so I can configure the router in a way that allows you to connect to it.
And, sure, if you don't use your old PC, we could convert it into a quantize node the same way. The limitation there would be upload bandwidth (at full speed, 100 Mbit/s would be about 1 TB/day - obviously I wouldn't want to hog your connection, so probably much less in practice), so that would be the limiting factor for quantising, I assume.
We can try and see if it's worth it. My ISP unfortunately can't offer more upload speed over coaxial cable, and switching to fiber would make the internet connection much more expensive.
How dynamic can proxmox manage memory? It wouldn't make sense to allocate hundreds of GB of RAM for the occasional big model that needs it.
There is memory ballooning, but that only works if enough unused memory is available. While memory ballooning is awesome, I believe an easier and more reliable way is to just turn off the AI VM, either manually using the Proxmox Web Interface or automatically over the Proxmox API, before booting up your own VM. That way you will always have 300 GB RAM (I can assign more RAM to the AI VM if 300 GB isn't enough) and 3 GPUs (2x RTX 4090 + RTX 3080) available without really affecting me in any way, as I'm only using the AI VM to play around with AI models in my spare time, which I can easily do on my main VM if the AI VM is unavailable. To access the Proxmox Web Interface and Proxmox API, either give me a static IP address to whitelist or I could give you an always-running LXC container to which you can SSH. Try running compute-heavy tasks during European daytime if possible to make use of free solar energy.
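(Shutting the AI VM down is a one-liner; as an illustration, assuming it had VMID 101 on a node called pve, with both values being placeholders:)
qm shutdown 101
pvesh create /nodes/pve/qemu/101/status/shutdown
The first runs on the Proxmox host itself, the second goes over the Proxmox API.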
PS: I am rarely jealous. But today, I am :) And looked it up, seems proxmox can simply import a qcow2 image.
Yes, absolutely no problem. I can import almost any kind of disk image into Proxmox.
If you need any private way of communicating with me, email me at nico at bosshome dot ch or write me on Discord using the username nicobosshard.
How dynamic can proxmox manage memory? It wouldn't make sense to allocate hundreds of GB of RAM for the occasional big model that needs it.
I just did some testing using memory ballooning and it is a truly amazing technology, but it breaks if you assign GPUs to a VM. I could assign 4 GB minimum and 400 GB maximum memory to your VM, keep it running 24/7 and let memory ballooning take care of dynamic memory allocation. If we go this path, just make sure to either completely disable the file system cache or drop it once you are done, using echo 3 > /proc/sys/vm/drop_caches, so the RAM can be reallocated to other VMs. The only disadvantage of memory ballooning is that there is a small possibility of the host running out of RAM if all VMs need a lot of RAM at once, which would forcefully shut down your VM.
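(i.e. something like the following once the work is done; the sync beforehand is just my suggestion to flush dirty pages before dropping the cache:)
sync && echo 3 > /proc/sys/vm/drop_caches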