I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
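For reference, the GB/GiB arithmetic (a quick sketch using the numbers above):

```python
# ls -l reports powers of 10 (GB), while RAM is usually counted in powers of 2 (GiB).
size_gib = 509e9 / 2**30          # 509 GB reported for the Q8_0 gguf -> ~474 GiB
ram_gib = 503                     # usable RAM after hardware reservations
print(f"model: {size_gib:.0f} GiB, headroom: {ram_gib - size_gib:.0f} GiB for context")
# -> model: 474 GiB, headroom: 29 GiB, hence "extremely tight"
```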
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
Q6_K is fine for me. Q8_0 might not fit without offloading, and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally you should give it a try and see if it fits, but if not, Q6_K is fine for me.
I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf
so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.
474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.
I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
457.4g after warming up.
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)
llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1, and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:
Yes it is clearly streaming from SSD now:
Once the quantisation tasks are interrupted it should work without SSD streaming again.
This is somewhat worrying:
[1]2.9360,[2]2.3937,[3]2.4731,[4]2.5391,[5]2.8621,[6]2.8125,[7]2.6349,[8]2.9891,[9]2.8659,
save_imatrix: entry ' blk.34.ffn_up_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.34.ffn_down_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.34.ffn_gate_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.1.ffn_up_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_down_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_gate_exps.weight' has partial data (94.53%) - skipping
save_imatrix: storing only 373 out of 385 entries
Yes, one iteration after both quant tasks finished it stopped streaming. But these are big tasks.
Nope, started again.
As for the quantize tasks, I don't know what is going on. I was also able to see this, but now I am unable to see any processes.
I think it stopped streaming for good. It is possible that it also takes a few iterations for everything to stay in memory.
Top now at 461.3g (495GB). So it isn't tight. Let's see what happens.
This is somewhat worrying:
It should be fine and maybe expected for a MoE model with 128 experts. According to the llama.cpp source code (https://github.com/ggerganov/llama.cpp/blob/d9c3ba2b7749c00df477599aa141a98b4521aa2c/examples/imatrix/imatrix.cpp#L218-L219) this warning is part of the code that avoids writing imatrix entries that do not have full data, which can happen with MoE models where some of the experts end up not being exercised by the provided training data.
Storing 373 out of 385 entries seems to be good enough.
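In other words, the save step conceptually does something like this (a simplified Python sketch of the idea, not the actual imatrix.cpp code):

```python
# Simplified sketch (not the real imatrix.cpp code): an entry for an expert
# tensor is only saved if every expert slot received at least one activation.
def save_entries(stats, n_experts=128):
    saved, skipped = [], []
    for name, per_expert_counts in stats.items():   # activation counts per expert
        covered = sum(1 for c in per_expert_counts if c > 0)
        coverage = covered / n_experts
        if coverage < 1.0:
            print(f"save_imatrix: entry '{name}' has partial data "
                  f"({coverage * 100:.2f}%) - skipping")
            skipped.append(name)
        else:
            saved.append(name)
    print(f"save_imatrix: storing only {len(saved)} out of {len(saved) + len(skipped)} entries")
    return saved
```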
It's reducing. These look like useful new messages, actually.
[10]3.1400,[11]3.2586,[12]3.0453,[13]3.0821,[14]3.3073,[15]3.5876,[16]3.7071,[17]3.9026,[18]4.0482,[19]4.1979,
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (89.06%) - skipping
save_imatrix: storing only 379 out of 385 entries
It's reducing. These look like useful new messages, actually.
This is expected: the longer we train, the more likely experts are to be included during imatrix training. I'm wondering if MoE models need longer imatrix training compared to monolithic models. This one has 128 experts while only 2 are active for a given token, so we only use 1/64th of the model for every token.
If it stays that way, we'll have good chances that the imatrix quantization will fail (if the message means what I think it does). If true, it intuitively makes sense - it's harder to tickle all experts in such a massive MoE model. Well, we have another 330 chunks.
I'm wondering if MoE models need longer imatrix training
Longer is unlikely to help - the right training data is more likely to. The top two (with 99.22%) have not reduced in the last iterations. And good that I save every 10 iterations; I knew someday it would be useful for something :)
Pretty exciting. Anyway, over and out for a while.
What is interesting is that it doesn't show a message for every tensor it skips. And it really is quite fast - obvious in hindsight. But I don't think the remaining chunks will do anything. Let's see if it quants. My prediction would be that it will likely fail with low bit quants.
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea, as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
Unfortunately, some quantisation tasks have now started to run as well:
1 66 I huihui-ai-abliterated-Qwen2.5-32B-Inst-BaseMerge-TIES run/imatrix 9/25,IQ4_XS [705/771] (hfu i1-Q6_K)
Not sure what I should do to pause the quantization tasks. I could pause the entire host, but that seems a bit overkill and might cause other issues.
If it stays that way, we'll have good chances that the imatrix quantization will fail
I don't think it will fail. It will hopefully just statically quant blk.0.ffn_down_exps.weight, blk.0.ffn_gate_exps.weight and blk.0.ffn_up_exps.weight, which should be fine, as then the vast majority of the model will have the imatrix applied and it seems unlikely there would be any meaningful real-world quality difference. The question is more whether llama.cpp is capable of quantizing with a partial imatrix. I don't think this was ever tested.
The top two (with 99.22%) have not reduced in the last iterations.
(100/128)*127 = 99.21875% => 99.22%
So they are just missing a single expert on a single layer. For some reason none of our training data seems to get routed to this specific expert in the first layer. All other layers have already reached full coverage.
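The same arithmetic recovers the expert coverage behind the other percentages in the log (a quick check):

```python
# Invert the reported percentages back into expert counts (out of 128).
for pct in (83.59, 94.53, 98.44, 99.22):
    covered = round(pct / 100 * 128)
    print(f"{pct}% -> {covered}/128 experts seen, {128 - covered} missing")
# 83.59% -> 107 (21 missing), 94.53% -> 121 (7), 98.44% -> 126 (2), 99.22% -> 127 (1)
```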
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
I would assume a very specific expert. I couldn't even come up with 128 different types of experts, so I expect some of them to have really specific areas of activation.
As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
We would ideally prevent the scheduler from starting any tasks while the imatrix of such massive models is being computed. It is not that bad if this happens while running them normally, as they will just start streaming from SSD, essentially pausing until there is enough RAM, but with RPC, running out of RAM will result in a total system crash. I likely should have just let it stream from SSD until you had time to fix it, but I know that the /tmp/pause flag only makes new imatrix tasks wait in an endless loop, which, unlike pausing the entire host, should be safe.
While we are on the topic of pausing: the performance measurement project is coming along extremely well, so soon I will have to pause the entire nico1 host for multiple nights if we want to do the performance measurements on StormPeak. I hope this is not too disruptive, or I might not do it. I'm currently doing performance measurements on Threadripper, CastlePeak, Raspberry Pi 4 and the 7840S laptop, and all of them should be done within the next few days. I will try to keep the StormPeak measurement to an absolute minimum and only measure with 32 threads, which based on my current results should be the setting that gives the best performance on a 32-core/64-thread CPU.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Awesome, and I see that snowflake imatrix quantizing seems to work! Thanks a lot for doing imatrix quants of this amazing model. If the imatrix quants turn out well we can do the snowflake base model too. I will give them a try tomorrow.
I likely should have just let it stream from
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag (which helped nothing in this situation, as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem).
I will have to pause the entire nico1 host for multiple nights
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already ran dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
It's up to you, though, and I can try to cope, but it would add a level of manual managing that I could avoid at this point :)
Normally it is not an issue to pause, especially for a night. It is always an issue when the system is in an exceptional state though, e.g. when doing big models (which requires some intervention due to dependencies the system cannot see) or shortly before switching off nodes.
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag
What do you mean? I did communicate everything a few messages ago, as you can see under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6754b97f1bc6b93608c48774 or in the following quote:
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea, as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
I did describe exactly what I did and why I did so.
which helped nothing in this situation, as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem
No, it did start the imatrix computation for Gugugo-koen-7B-V1.1 while the snowflake-arctic-instruct imatrix computation was still running (as can be seen in the status page snippet posted above), and later it even tried to start another one but luckily got paused by the /tmp/pause flag. Please check your logs why this happened.
Yes, the quantization tasks were an issue as well, but they are not as bad as parallel imatrix tasks. Quantization tasks will eventually finish and free up enough RAM for imatrix tasks to no longer stream from SSD, while if two imatrix tasks start streaming from SSD none of them will ever finish. We were lucky it was only a 7B model and so fully offloaded to GPU. What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started, and it just happened to only allocate memory on the GPU not used by snowflake-arctic-instruct. If a model uses multiple GPUs for imatrix computation there should never be a case where another imatrix task starts, or GPU memory conflicts might occur.
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already ran dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
No hurry then, I will wait for dbX/backup1 to be gone. I already have really good performance measurements, so I can start analyzing them even without waiting for data from StormPeak, or use this time to measure some other devices like my phone.
I did describe exactly what I did and why I did so.
You are right, somehow I didn't see that message, and you acted well. Sorry for doubting you.
Please check your logs why this happened.
It happened because I wanted to see the effect of it - since that model would fit completely into the vram, it should have worked, after a small disruption due to loading the model. Either that, or I would have gained understanding. I was still there when it happened, and even if it weren't planned, I would have cleaned up. The same happened when the quant jobs were started at 22:00, which was not planned :)
There was plenty of RAM available - it might still have started streaming due to bad memory management in Linux, but that is another story.
I also don't think (and don't see) how it would have started a third imatrix job, as so far it has never tried to start three jobs, simply because it would not have the budget and gpu available. It did start a second one after snowflake was done, though.
We were lucky it was only a 7B model
It wasn't luck, there simply was no budget for (much) more.
What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started
It was only running on one gpu - I changed the job to reflect that (the status display never reflected that because it was originally overwritten).
If a model uses multiple GPUs for imatrix computation there should never be a case where another imatrix task starts, or GPU memory conflicts might occur.
Right, and as far as I can see, that rule was never violated.
I will wait for dbX/backup1 to be gone.
Thanks, that helps a lot.
I don't think it will fail. [missing imatrix data for a tensor]
llama.cpp commonly fails to quantize MoEs for this reason (I have lots of models where I don't have imatrix quants for that reason). I do not know if this message correlates perfectly with that (the message is new), but llama.cpp does not quantize tensors it has no imatrix data for - it's the same message you get when trying to do low-bpw quants without an imatrix. It predominantly happens on "less important" models, so I usually do not make a fuss of it and simply skip the model, or in some cases the imatrix quants.
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
There goes my IQ1 of snowflake :/
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I'm generating the remaining quants. I see only two options: a) find training data that exercises that expert, b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
There goes my IQ1 of snowflake :/
It indeed fails, but only for very low bit per weight quants. This is because, as expected, it statically quants the layers containing missing experts, which in this case is layer 0. There is a check in llama.cpp that stops the quantization process if one tries to statically quant with too low a bit per weight, as this usually results in an unusable model. You are right: if there is still partial data at the end of imatrix training, imatrix quantization will fail for all low bit per weight quants. All other imatrix quants will work without any issues and without any real-world quality impact, as only one out of 35 layers is quantized statically, so 97.1% of the model is quantized using the imatrix. Here is the full llama.cpp error:
============================================================
Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================
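Conceptually the guard behaves something like this (a simplified sketch based on the error above, not the actual llama.cpp code; the set of affected quant types is an assumption for illustration):

```python
# Rough sketch of the behaviour described above (not the actual llama.cpp code).
VERY_LOW_BPW = {"IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S"}   # assumed set

def quantize_tensor(name, quant_type, imatrix):
    if name not in imatrix:
        if quant_type in VERY_LOW_BPW:
            raise RuntimeError(
                f"Missing importance matrix for tensor {name} in a very low-bit "
                "quantization\nThe result will be garbage, so bailing out"
            )
        # otherwise: fall back to static (imatrix-free) quantization for this tensor
```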
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I think in this specific architecture things can get rerouted to different experts for each layer, so bad training data would only affect the first layer. But honestly the snowflake architecture is extremely complicated and poorly documented, so I do not yet fully understand it.
I'm generating the remaining quants.
Awesome!
a) find training data that exercises that expert
I will try bartowski's imatrix training data on some smaller quant on the RTX 3080 GPU to check if it will activate all the experts.
b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage. The main issue is that it will not just affect this expert but the entire first layer. Despite there only being one expert missing, the imatrix training skips storing it in its entirety. The first and last layers are usually quite important, so there will be some negative impact on quality, but it will be far from garbage. A better option would be to force imatrix training to store the partial data of the first layer, but I have the feeling that if that were easy the llama.cpp developers would have long done so.
Just had a somewhat worrying experience: huggingface-cli silently failed to download all files, but also did not report an error - when I tried to redo https://huggingface.co/xxx777xxxASD/L3-SnowStorm-v1.15-4x8B-B it skipped over 3 model files that are nevertheless in the repo.
I wonder how much silent corruption that would cause.
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage.
That's not what I proposed - b) proposes to use the data, not to skip it in quantize.
Also, llama.cpp tends to crash during quantisation; it did not actually generate garbage quants that often, although that was one outcome.
With your proposal I would expect very bad results, because we would force low-bpw quantisation without any data on a tensor that seems vital, while the b) proposal would hopefully only leave it partially trashed. The problem I see is that just writing e.g. 0 might make llama.cpp crash, so we might even have to synthesize data. The latter problem could be tackled when it happens, though.
None of this seems trivial to me.
I really don't want to implement your proposal in any case; I think it would be better to just leave out those quants in that case. Which also destroys my chance of getting an IQ1_S :)
Despite there only being one expert missing, the imatrix training skips storing it in its entirety.
You think all the experts are in that one tensor? (or those three, actually)
You think all the experts are in that one tensor? (or those three, actually)
The dimension of blk.0.ffn_down_exps.weight is [4864, 7168, 128, 1], which indicates it contains data for all 128 experts. If you look at https://github.com/ggerganov/llama.cpp/blob/43ed389a3f102517e6f7d5620d8e451e88afbf27/gguf-py/gguf/gguf_writer.py#L138 you see that all tensors with "_exps." in the name are supposed to contain data for all experts.
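For reference, the tensor shapes can be checked with the gguf-py package that ships with llama.cpp (a small sketch; the file path is a placeholder):

```python
# List the expert tensors of layer 0 and their shapes with gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("snowflake-arctic-instruct.Q6_K.gguf")   # placeholder path
for t in reader.tensors:
    if t.name.startswith("blk.0.") and "_exps." in t.name:
        print(t.name, list(t.shape))   # one dimension should be the expert count, 128
```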
That is exactly what I mean - that means your suggestion that one expert is missing does not match the data. We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
So the naive explanation, that our training data fails to activate an expert, must be wrong.
BTW, I don't understand any details of what the imatrix actually measures, specifically, what it means to lack data for part of a tensor.
I would not be terribly surprised if this was just a model defect (and maybe not even a defect). My reasoning is that we have models that generate NaNs, and according to the llama devs, this means the model is completely unusable, yet they still work fine, so there must be a way for parts of a model to be "unused". Of course, that reasoning is weak because f16 and below can't even represent NaNs, afaicr.
And for something completely different, in the downloader/interactive model summary page, I have changed the quality score calculation to be strictly monotonic - before, Q8_0 would unstably sort after Q6_K because they'd end up with the same integer score of 99. Now Q8, i-Q6 and Q6 get 99, 98, 97, respectively. I think that's a reasonable trade-off between being a simple ranking and assigning meaning to absolute quality differences. It also allows sharing imatrix and static quants in one table.
I don't think I can improve it much in the short term (especially since I didn't do client work in the last days at all), but once I find a nice way to make the link, I will put the link on the model page and update all models. On the other hand, when I am more active working on my day job, I also tend to be more active on my hobby side. Strange how these things work - if there is little for-money work, I lack the impetus to do side projects, too.
(Example: https://hf.tst.eu/model#testrepo-i1-GGUF)
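The tie-break idea, as a hypothetical sketch (not the actual implementation; the quant naming and ordering are placeholders):

```python
# Hypothetical sketch of a strictly monotonic score: quants that previously
# collapsed to the same integer (Q8_0 and Q6_K both 99) now get adjacent ranks.
QUALITY_ORDER = ["Q8_0", "i1-Q6_K", "Q6_K"]   # assumed naming/ordering

def quality_score(quant):
    return 99 - QUALITY_ORDER.index(quant)    # 99, 98, 97, ... strictly decreasing

print(sorted(["Q6_K", "Q8_0", "i1-Q6_K"], key=quality_score, reverse=True))
# ['Q8_0', 'i1-Q6_K', 'Q6_K'] - the sort is now stable across static and imatrix quants
```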
We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
blk.0.ffn_down_exps.weight contains data for all 128 experts, but we only imatrix-measure 99.22% of it, so we are missing exactly one expert for that specific tensor. We do get data for all experts on all tensors not associated with layer 0. We miss one expert in one layer, which causes llama.cpp to not save any imatrix data for this specific layer. We do have data for all experts for every other layer.
In any case I will soon try different imatrix training data to see if I can somehow manage to cover this specific expert in layer 0.
I would not be terribly surprised if this was just a model defect
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
I agree fully with your explanation (which matches my much fuzzier understanding), but clearly this expert must somehow be activated if the other tensors for this expert are. Clearly my understanding is flawed/missing, because I am surprised you can activate only part of an expert. I would assume all weights get used. But I don't know how the imatrix measurement decides what was active and what not - my understanding is that using a tensor, or an expert "slice" of it, is essentially just a matrix multiplication, which should "use" all of it.
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
Seems Alsebay has deleted almost all of his models. Haven't been fast enough in quantizing them.
Seems Alsebay has deleted almost all of his models. Haven't been fast enough in quantizing them.
How sad. Turns out he deleted them all for nothing. Today we finally got an official document explaining the new HuggingFace storage quota: https://huggingface.co/docs/hub/storage-limits and discussed in https://huggingface.co/posts/julien-c/388331843225875
*We aim to continue providing the AI community with free storage space for public repositories, please don't abuse and upload dozens of TBs of generated anime 😁. If possible, we still ask that you consider upgrading to PRO and/or Enterprise Hub whenever possible.*
Maybe we should consider upgrading the mradermacher account to PRO, as it is just $9/month, which is nothing compared to our operation cost, but it is not required for us or anyone else to do so.
I think if hf restricts the mradermacher account after saying "unlimited repositories" they are shooting themselves in the foot. They already did, though. Not sure what to do, but I am not sure it should be supported. Anyway, let's see what happens. I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too. I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less. And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
My account looks the same btw., i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Still feeling a bit woozy after so much relieving news today. On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too.
This is exactly what it means. Even for free accounts, storage for public repositories is unlimited as long as it is not getting abused. They are mostly just begging for PRO. As with every tech company, for every PRO subscriber they have they can get a much larger sum of money from investors. This is also why the price of a PRO subscription is way lower than it should be given what you get.
I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
They for sure are aware of us and appreciate our work.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less.
Awesome to hear!
And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
Great to hear that you didn't lose any important files.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
I would likely do the same if I had a file system as massive as yours.
My account looks the same, i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
That's exactly what they mean.
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Amazon S3 frequent-access storage for 500TB+ is $21/TB/month, so they already pay around 100k/month in storage costs for us, but that's still almost nothing compared to what the bandwidth cost must be. Let's appreciate what they give us and not ask for more. If there are no models on HuggingFace there is no point in it even existing, so our and other users' time and resource investment is HuggingFace's biggest value and what keeps HuggingFace alive as a platform. We are essentially donating the resources of 11 servers and a massive amount of time to HuggingFace and the open-source AI community, so I'm sure they see and appreciate what we do.
Here is a screenshot of their community post which clarifies things:
Still feeling a bit woozy after so much relieving news today.
Today was awesome. I'm especially relieved about HuggingFace removing the storage quota for public repositories, as the storage limit worried me way more than it should have.
On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
Similar things have happened to me so many times as well. It always seems to happen when I explicitly tell them to let me sleep.
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
I wonder if removing the GPU from quantisation tasks would have any performance impact. I think those 400 MB don't really matter, as we never really use the full GPU memory for imatrix anyway. But if it serves no purpose for quantisation we can just use llama.cpp without CUDA or set CUDA_VISIBLE_DEVICES to nothing.
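Hiding the GPUs from a quantize job could look roughly like this (a sketch; the binary name and arguments are the usual llama.cpp ones and are placeholders here):

```python
# Launch a quantize job with the GPUs hidden, so no CUDA context (and no ~400 MB
# of GPU memory) gets allocated. Paths and arguments are placeholders.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="")   # empty value hides all GPUs
subprocess.run(
    ["llama-quantize", "model.f16.gguf", "model.Q6_K.gguf", "Q6_K"],
    env=env,
    check=True,
)
```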
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
Datasets I tried so far:
- c4_en_ja_imatrix
- calibration_datav3
- imatrix-with-rp-format-data
- 4chan pol_062016-112019_labeled
- Tech-Awesome-Hub/mix-data
- GitHub Readme
- MMLU
- Merges between above datasets
The only ones that reached 127 out of 128 experts, other than yours, were "calibration_datav3" from bartowski and "imatrix-with-rp-format-data". Many datasets got far fewer experts than that. It clearly is the quality of training data and not the amount that matters. 4chan pol_062016-112019_labeled is massive, but when I aborted it, it only had 122 out of 128 experts on layer 0. MMLU, which I thought is really diverse, only managed to trigger 121 out of 128 experts on layer 0. "Tech-Awesome-Hub/mix-data", with just 120 out of 128 experts on layer 0, was even worse than that.
In conclusion, you have really awesome imatrix training data, and much of the training data I tried was significantly worse. So "imatrix-training-full-3" is likely better than you think. I will continue trying to find datasets that activate all experts. If you have any idea what datasets to try please let me know. I'm really interested in this topic.
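A sketch of how such a coverage test can be scripted (a hypothetical helper, not the exact script used; the llama-imatrix flags and file names are placeholders, and layer-0 coverage is estimated from the "partial data" messages it prints):

```python
# Hypothetical sketch: run llama-imatrix against each candidate dataset and
# estimate layer-0 expert coverage from the "partial data" messages.
import re
import subprocess

N_EXPERTS = 128
PARTIAL = re.compile(r"entry ' ?(blk\.0\.\S+)' has partial data \(([\d.]+)%\)")

def layer0_coverage(model, dataset):
    proc = subprocess.run(
        ["llama-imatrix", "-m", model, "-f", dataset,
         "-o", "/tmp/test.imatrix", "-ngl", "99"],
        capture_output=True, text=True,
    )
    worst = 100.0
    for _name, pct in PARTIAL.findall(proc.stdout + proc.stderr):
        worst = min(worst, float(pct))          # lowest coverage among layer-0 tensors
    covered = round(worst / 100 * N_EXPERTS)
    print(f"{dataset}: {covered}/{N_EXPERTS} experts covered on layer 0")
    return covered

for ds in ["calibration_datav3.txt", "imatrix-with-rp-format-data.txt"]:  # placeholders
    layer0_coverage("snowflake-arctic-instruct.Q6_K.gguf", ds)
```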
A somewhat urgent request for your input - the deepseek imatrix just failed:
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
So, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift in all llama-imatrix calls?
I wonder if removing the GPU from quantisation tasks would have any performance impact.
I am, as usual, unburdened by actual knowledge, but I always thought it's cpu-only. And I suspect the 384MB is some kind of, well, not leak, but probably some dummy workspace allocation. In any case the gpu is completely idle when quantizing.
or set CUDA_VISIBLE_DEVICES to nothing.
I'll do that and see what happens.
In conclusion you have really awesome imatrix training data
No wonder, as the first part is bartowski's training data :)
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
So, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift in all llama-imatrix calls?
Support for the --no-context-shift option was added to imatrix computation yesterday by bartowski in https://github.com/ggerganov/llama.cpp/pull/10766, so make sure to use the latest llama.cpp or it will not have any effect.
According to https://github.com/ggerganov/llama.cpp/issues/9390 if disabled:
- Requests bigger than context window will result in an error.
- n_predict for each sequence will be capped to n_ctx - n_tokens_prompt
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
Online repacking got merged which removes llama.cpp support for all Q4_0_N_M quants: https://github.com/ggerganov/llama.cpp/pull/10446
I highly recommend no longer generating them, as they no longer run in the latest llama.cpp. Even bartowski will no longer upload the now deprecated and unsupported ARM/RISC-V quants: https://huggingface.co/posts/bartowski/807894839859408
I'm quite happy about this llama.cpp change as ARM/RISC-V quants were kind of stupid: they used the same data, just aligned differently to be optimized for a specific architecture.
I'm quite happy about this llama.cpp change as ARM/RISC-V
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
Anyway, thanks a lot for your updates/feedback. I'll try it out on deepseek asap, and then probably hardcode it.
[snowflake] If you have any idea what datasets to try please let me know.
I don't, but maybe something was published on the training material, or its area of expertise. For example, if it lists support for 22 languages, maybe we need some of these languages.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of MoEs which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
I've overridden deepseek for the time being.
PS: I haven't watched top, so I don't know if memory usage for deepseek (or the new llama-imatrix) is considerably larger than for other models.
PPS: you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
I started nico1 again. You can do DeepSeek-V2.5-1210 now as nothing beside nico1 is currently running. I recommend you interrupt any other quantisation and imatrix task before starting it as RAM will be relatively tight.
Sorry, was sleeping. I'll have a look. I'll investigate why rich1 can no longer reach nico1.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum number of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility.)
And another unimportant FYI: I have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). Hopefully this hack will fix the stuck uploads.
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (February to end of August). The last months in that range were indeed pretty much empty. Very weird.
I plan to look at the post-August months, and at the time before February. I expect the former range to yield few models, and I plan to be much more selective with the pre-February range, so I think this is the likely maximum queue extent we will ever see.
Phew.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
- Only the files in /dev/shm (model.status, model.log) keep the model in error state. Once removed, the scheduler will try again once it runs.
- You can edit things, fix things, and then delete the error status files, followed by pushing (echo push nico1 >/dev/tcp/10.28.1.1/16713) - see the sketch after this list.
- You could move the original download away and replace it by the model subdirectory, in which case the scheduler would try to pick it up from there.
- I will eventually provide you with better tools, though... Long term, it could make sense to move everything to nico1 (and either have containers everywhere, or simply give you a user account - I planned for these eventualities many months ago by making the default umask 0 :)
- If things go wrong, you can do the next step manually, e.g. you could somehow provide the .gguf file, and when the scheduler runs and error state is cleared, it would simply pick off from there.
- There is very little state that is not externalised, e.g. the scheduler distinguishes a partial download from a successful download by the existence of the model.hfd-success file. There are also .override, .interrupt, .nobudget and .force files. You can stop a model by creating a model.override, make it ignore the budget, or simply force-start it.
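A sketch of the first two steps in script form (a hypothetical helper based purely on the hints above; file names and the push address are taken from them and may differ in detail):

```python
# Hypothetical helper based on the hints above: clear a model's error state
# and ask the scheduler to push nico1 again.
import os
import socket

def retry_model(model, shm="/dev/shm"):
    # Only the status/log files keep the model in error state; remove them.
    for suffix in ("status", "log"):
        path = os.path.join(shm, f"{model}.{suffix}")
        if os.path.exists(path):
            os.remove(path)
    # Equivalent of: echo push nico1 >/dev/tcp/10.28.1.1/16713
    with socket.create_connection(("10.28.1.1", 16713)) as s:
        s.sendall(b"push nico1\n")
```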
I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.
You seem to be quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible. And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
Adding them as separate quants was a mistake. In hindsight, online conversion should have been the way to implement this from the beginning. What started with a few ARM quants got out of hand quickly, and soon we would likely have had dozens of Q4_N_M quants optimized for different architectures, so switching to online conversion was the only reasonable thing for them to do. Now that there is online conversion, supporting the existing Q4_N_M quants is useless, as llama.cpp can now just write data to memory in an optimized way while loading the model.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
It's probably because imatrix computation reuses the same code as other llama.cpp components and so offers similar configurations, even if some of them don't really make sense for imatrix computation.
I don't, but maybe something was published on the training material, or its area of expertise.
According to their advertisement everything should be public, but I'm having trouble locating anything useful. They spread everything across random blog articles and papers, and this massive fragmentation makes finding anything too time consuming.
For example, if it lists support for 22 languages, maybe we need some of these languages.
I already tried multiple multilingual imatrix datasets without any success.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of MoEs which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
I already tried around 10 MB worth of datasets, so yes, it might indeed be unlikely that any reasonable prompt will activate that expert. It is likely something super niche, like an enterprise programming language such as COBOL or Erlang, as it is an enterprise-focused model.
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
Maybe that really is the way to go, and it would also solve this issue with other MoE models. What I already tried is forcing the router to use all experts using --override-kv llama.expert_used_count=int:128, but it unfortunately had no effect for imatrix computation.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit. There was a VM running using 24 GB of memory at the time, and maybe some other things.
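Something along these lines would be enough for a pre-flight warning (a sketch; the margin and paths are illustrative):

```python
# Sketch: read MemAvailable from the host's /proc/meminfo (mounted at
# /host/proc/meminfo inside the container) and warn if a model won't fit.
def mem_available_gib(path="/host/proc/meminfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kib = int(line.split()[1])     # value is reported in kB (KiB)
                return kib / 2**20
    raise RuntimeError("MemAvailable not found")

def fits(model_size_gib, margin_gib=20):       # margin for context/KV cache, illustrative
    return mem_available_gib() >= model_size_gib + margin_gib

if not fits(474):
    print("model of this size will not fit right now - notify before starting")
```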
you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
Yes, I will likely look into it, but it is quite a pain with ZFS. I really hate swap, but the current behavior of just rebooting on OOM also isn't ideal. I wonder what happened to the OOM reaper that always prevented OOM crashes in the past.
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
mlock would explain the crash.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
No idea why this happened either.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
100 GB should be enough to test models privately. He likely got way more than 100 GB, as you got CurrentPrivateStorageUsed + 100 GB when they introduced this limit. Besides going Pro, he could also email them to request more private storage for testing, which they would most likely accept as a valid reason for their new private storage grant program. I wonder why he is not testing them before uploading. It seems quite wasteful to upload models you have not even tested. The machine you use to train a model should also be able to run it as far as I'm aware, unless for merges.
I like the new policy, as closed models are almost always meant for commercial use and so used by operations that really should pay for HuggingFace. They have to make money somehow, and enterprise customers make the most sense in my opinion.
By the way, when I researched HuggingFace's finances it seems like the vast majority of their earnings comes from consulting services. I luckily work for a company where we don't waste money hiring consultants.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum number of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
Awesome to hear!
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility.)
Your observation is likely correct. There was a lot more activity back then. For example take a look at https://huggingface.co/cognitivecomputations which created a Dolphin version of every good AI base model. Most of them are from early 2024.
And another unimportant FYI: I have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). Hopefully this hack will fix the stuck uploads.
Nice. This seems like a good workaround. Let's hope this fixes this issue.
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (February to end of August). The last months in that range were indeed pretty much empty. Very weird.
Today we reached a queue size of over 4000, so I'm really happy it will now finally go down from here. Especially now that we lose 4 hosts in one day.
I plan to look at the post-August months, and at the time before February. I expect the former range to yield few models, and I plan to be much more selective with the pre-February range, so I think this is the likely maximum queue extent we will ever see.
Post-August you already had nico1, so there should be way fewer, and as observed there generally are way fewer models recently. Before February would likely be insane, but we can be way more selective.
Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
Sounds like a strange HuggingFace bug. Maybe they never anticipated someone ungating so many models. For easy models you can always ask me or Richard to ungate them, and for hard ones we always have Guilherme34.
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.
No worries. Using the command line is perfectly fine for me, as I'm mainly a command line person as well. In a fraction of the time required to create a webpage we could likely create a nice command line application/shell script to automate all common manual tasks.
You seem to be quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible.
I would love to help with this. Mainly with queuing models requested by users, so they get their requests fulfilled faster if you are unavailable and you don't have to care about this when you are busy. In that case it should also not matter if I'm ever too busy to help, as any time I can spend on this will be an improvement over the current situation.
Should the queue ever get empty, I will queue some historical models I feel are important and then maybe do some model authors I like, but I would likely run out of ideas at some point. I don't think I would have the dedication to go through and judge 30000 models to select the best ones. Your work on selecting models is highly appreciated.
And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
No worries, I like to help get interesting models to work. My time is always limited, so I can't look into every single model that failed, so I focus on interesting models and the ones requested by users.
Now that there is online conversion, supporting the existing Q4_N_M quants is useless
Well, not for those already downloaded... In any case, yes, I agree that, if it's a maintenance burden, it should indeed just go.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit.
That's... I guess you'll have to tell me what you would want me to look for and/or calculate. It is, however, notoriously difficult to check this beforehand, so likely this would just make the imatrix job fail (i.e. the imatrix job would check). That's not super-bad, as that is already happening for special models.
KaraKaraWitch is now publishing all repos
Well, it's another psychological effect. The alternative would be to gate the models, I guess, and keep them public, until they are tested.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
Yes, and don't be shy, even if I was a bit, ehe, cranky recently. You are interfering a great deal, and it's predominantly very useful :)
No worries. Using the command line is perfectly fine for me, as I'm mainly a command line person as well.
I was thinking less of you, and more of others. But, yeah, command line is the way to go at first.
You have been rate-limited; you can retry this action in about 24 hours. If you're a new user, your limits will raise progressively over time. Get in touch with us at [email protected] if you need access now.
Small models are too much for huggingface. I'll mail them.
Small models are too much for huggingface. I'll mail them.
Oh no, maybe it was not so intelligent after all to do all the large models first. We are using many nodes, and such limits are usually either token- or IP-based but rarely user-based, so this should not be an issue. If it's an upload limit, try giving each host a separate upload token. If it's a download limit, then maybe we are always using Guilherme34's token and so exceed that token's rate limit, in which case either download anonymously or use a dedicated token by default. If this issue only occurs on rich1 then maybe it is because we just started some face picture data collection project there. In the end there really could be a per-user limit, in which case we have to email them or work around the limitation.
Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!
-2000 14 Falcon3-Mamba-7B-Instruct run/imatrix (GPU-2d) 101/64 22.58s/c 42.5/126.0m(127.0) [112/335] 6.9410
Something tells me that a 7b should not take 2h for imatrix quantization.
Oh no, maybe it was not so intelligent after all to do all the large models first.
We did not do all the large models first, we only preferentially did them. rain/kaos/back etc, all did small models the whole time. So if we did it more "intelligently", we would just have hit it earlier when rich/marco/nico would hit small models randomly.
The limit is on repository creation btw. I can try to optimize it, but I will ask for an exception first. The issue started on rich1, but has now affected everything. I suspect it was simply the speed of quantizing small static models.
Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!
Hmm, I thought I had mentioned it already, but what happened is that whenever I clicked on "Quantizations" on a model page, I got a blocking invitation page asking me to either join or refuse to join that organisation. Normally I ignore those things until I have made up my mind (not sure I want to be part of any such organisation :) but since I was forced, as my access to the webpage was limited, I hit accept.
BTW, it also made all uploads fail, which I have to clean up manually. At least it didn't hammer their servers. And since I am rather limited w.r.t. email (my private mail server is the one currently being restored), I had to ask marco to contact the website team :)
Actually, what's the company mail server queuing... only 3130 mails it can't deliver to me. Sigh.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
Try using different upload tokens on each host. Even these limits are probably applied on a per-upload-token level according to @RichardErkhov. It's at least worth a try.
@RichardErkhov already reached upload limits, api limits, inference limits, file list limits, repo creation limits, request limits, gated request limits, file size limits, file count limits and space size limits, so he should be an expert when it comes to limits. Almost all limits he encountered were on a per-token basis. He has since been using 3 different upload tokens to no longer hit a single limit.
I've changed the upload to retry after 10 minutes when it happens, so only newly started jobs will fail (which are far easier to clean up). I'll wait for a while to see if hf increases the limit - they did it before (when it was so low that simply me clicking around on the webserver triggered it regularly). Depending on the situation I will try different hf tokens per host. However, I am pretty sure a single host can already trigger it, so I might need more tokens per host.
However, I am pretty sure a single host can already trigger it, so I might need more tokens per host.
That's exactly what @RichardErkhov is doing. 1 token per host is usually enough for him, but if not he uses a separate token for every python instance.
For now I would just do different tokens for each host as this should be simple to implement and a single host triggering the limit is relatively unlikely.
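For illustration, a minimal sketch of what per-host tokens could look like (the token file location and repo/file names are hypothetical; huggingface-cli and huggingface_hub honor the HF_TOKEN environment variable):
# hypothetical per-host token setup; /etc/hf-token.<hostname> is an assumed location
export HF_TOKEN="$(cat /etc/hf-token.$(hostname))"
# everything below (huggingface-cli upload, wrapper scripts, ...) now uses this host's token
huggingface-cli upload mradermacher/Example-GGUF ./Example.Q4_K_M.gguf Example.Q4_K_M.gguf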
Nope, doesn't work, leia just triggered it, and leia already uses a separate hf token.
I will try to reduce the load on the api and check in another way for successful repo creation, later tonight when I have time. Thanks, hf.
requests.exceptions.HTTPError: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing huggingface-cli login, and if you did pass a user token, double-check it's correct.
I am now completely blocked, it seems. I can't do anything whatsoever.
Creating new tokens does not have any effect, but from time to time, an upload goes through. It seems I am severely rate limited, and I feel this is not due to the increased upload frequency. It feels like some new limit. Also, huggingface-cli upload uses the limited create_repo API, so me reducing calls to it will have very little effect.
I guess only hf can do something about it, and if they don't in a few days, we have to shut down.
Things are slowly starting to upload again. I guess we'll find out more this evening.
Nope, the rate limit is still there, and severe. I'll try some tricks and hope there won't be a disaster tonight.
Reducing the amount of API calls to the bare minimum seems to be the only solution for now, so try every trick possible. As far as I'm aware every commit is an API call, so maybe we should batch together some files for small models. Also make sure downloads don't use any mradermacher token.
The rate limit doesn't seem that severe. All uploads seem to eventually make it through. Commits to 15 models were successfully made in the past hour and I see the same outgoing network traffic on nico1 as on any normal day:
The rate limit doesn't seem that severe.
I have already almost halved the number of api calls yesterday and implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it, but I think it is pretty severe, especially as we had similar rates earlier, when we did the first batch of small models (the nice 800 ones), so this definitely looks like something that has been changed since then.
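For illustration only (not necessarily how it is implemented here): uploading a whole directory with huggingface-cli bundles all files into a single commit, so one call covers many quants. Repo and directory names below are made up:
# one call, one commit, many files - repo and directory are hypothetical
huggingface-cli upload mradermacher/Example-GGUF ./quants/Example . --repo-type model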
Ok, the first batched uploads are through, and nothing seems to have imploded. I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
I have already almost halved the number of api calls yesterday
Oh wow, I wasn't aware of that. It's quite insane that we are still hitting the limit despite those changes and the decommissioning of db1, db2, db3 and backup1.
implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it
I think and hope so as well.
so this definitely looks like something that has been changed since then.
Yes, this definitely seems like a new limit or @RichardErkhov would have known about it. He already managed to exceed almost every rate limit possible on HuggingFace.
Ok, the first batched uploads are through, and nothing seems to have imploded.
Awesome to hear. Let's hope everything continues to go well.
I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
Definitely not an ideal situation, but better than hitting the rate limit. Everything looks good to me for the repositories I checked. I will be here and watch things, but I'm quite confident nothing bad will happen. Have a great night!
Maybe things are not going so great after all. "auto-patch README.md" is going a bit crazy and is removing references to existing static quants on some long-completed models:
- https://huggingface.co/mradermacher/Nemotron-4-340B-Instruct-hf-i1-GGUF/commit/f4bc99d59dcd92f65c681dfc50bd6a757435f300
- https://huggingface.co/mradermacher/Hermes-3-Llama-3.1-70B-lorablated-i1-GGUF/commit/baab2edf5a54d43d775b368ff065be2d063c1da4
- https://huggingface.co/mradermacher/SILMA-9B-Instruct-v1.0-i1-GGUF/commit/d8fbfa1fe718e78034b552dbe4318482a9ace9e7
It does the same to static quants, where it removes references to imatrix quants:
I assume this is caused by poor error handling inside the "auto-patch README.md" job, where it assumes an API rate limit status code means the model doesn't exist. Also, scanning every model ever uploaded is not so great an idea if we are concerned about API rate limits.
Interesting, it now started to fix things it previously broke:
@RichardErkhov would have known about it. He already managed to exceed almost every rate limit possible on HuggingFace.
Haha, he often reminds me of younger me. The real test is when we hit other large blocks of static-only jobs again (today it mostly did much slower imatrix jobs).
I assume the amount of create_repo calls has gone down by a factor of about 5.
I assume this is caused by poor error handling inside the "auto-patch README.md" job, where it assumes an API rate limit status code means the model doesn't exist.
Good idea, but unfortunately, it checks for that either by downloading the README.md without the API (original model) or by using the list of all mradermacher models (for finding other quant repos). I'll have to look at it. As long as the original url is still intact, it will be fixable.
Also, scanning every model ever uploaded is not so great an idea if we are concerned about API rate limits.
I'm not doing that on every change, fortunately; that's a background job that has essentially a fixed rate limit (more models == fewer iterations per time). The API affected seems to be only repo creation (which is called once or twice per job, and was called twice per upload).
I'll have a look into the problem, thanks for catching those, which is a job well done :)
Interesting, it now started to fix things it previously broke:
Fascinating, so, obviously intermittent errors of some kind. It runs on each repo after each job finishes, and tries to scan through all repos separately every 2 days at the moment. What you see is likely the background job that fixes the links to original urls and so on.
Hmm, not good, all those wrongly updated model pages are not in the llmjob log, so it must have been done by the background job. Unfortunately, that one really uses the list_models api call to get a list of all repos once, and then just checks if the static/imatrix repo exists, while the foreground job doesn't use the api but does a GET on the actual model (html) page to see if the model exists.
Unfortunately, I indeed key the latter on status 200, because you get all kinds of status codes when the model doesn't exist (404, 401...), so it's quite hard to know when it temporarily failed. I guess we'll have to live with this at the moment, unless I want to add more api calls for this.
I think repo creation has an especially low(*) api limit, and whoever did that was probably not aware of every upload calling this endpoint (in fact, I was not aware - maybe it is a recent addition, because I manually create the repo on upload because hf-upload would otherwise fail).
*: comparatively speaking :)
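For illustration, a hypothetical sketch of how such a GET-based existence check could tell a missing repo apart from a transient failure (not the actual checker; the repo name is made up):
repo="mradermacher/Example-GGUF"   # hypothetical repo name
status=$(curl -s -o /dev/null -w '%{http_code}' "https://huggingface.co/$repo")
case "$status" in
  200)     echo "repo exists" ;;
  401|404) echo "repo missing" ;;                             # hf returns various codes for absent/gated repos
  429|5*)  echo "transient failure, keep existing links" ;;   # rate limit or server error
  *)       echo "unknown status $status, better not to touch the README" ;;
esac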
Unfortunately, that one really uses the list_models api call to get a list of all repos once
That is where I was wrong: it should have done it, but due to heavy refactoring, it failed, so this explains it. The foreground job can still fail to correctly patch it, but the background job should get that part right. And if the original model page is missing occasionally, that shouldn't cause a diff. Famous last words.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us. Someone likely just thought that limiting repository creation to 100 per hour makes sense, as nobody could reasonably exceed that, not realizing that the same API call is made for every commit.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
They are notorious for being slow. @RichardErkhov successfully contacted them in the past regarding an API issue by creating an issue on their GitHub, but they have not yet fixed it after almost 3 months despite confirming the issue: https://github.com/huggingface/huggingface_hub/issues/2581
Especially now, most of them are likely already on Christmas holiday, so I'm really not surprised information like this is not reaching the right people. Even in my much smaller company, bug reports often get lost somewhere in middle management. I recommend you create an issue on their huggingface_hub GitHub instead, where you are much more likely to reach someone capable of fixing this issue.
But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput, so maybe we can just live with it. It seems unlikely that we will ever again have such a massive queue of only small models. Speaking of queue size, I'm currently very satisfied with the progress: we already got it down from over 4K to below 3.5K in just a few days.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us.
No, but I contacted them three days ago, and they didn't even bother to reply. I judge by actions.
They are notorious for being slow. @RichardErkhov successfully contacted
Your example shows a reaction time of less than a day, though, so clearly they can if they want to.
I recommend you create an issue on their huggingface_hub
I am not going to create potential drama somewhere - they asked me to use e-mail, and I used e-mail. If somebody wants to do that, that is fine, but, again, I went through the official channels for this, I don't want any special service.
But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput, so maybe we can just live with it.
I can of course live with this, but it obviously affects our throughput. An hour ago, no quanting was done, and right now, four nodes are still not doing anything much.
Nico, I feel you are panicking a bit because I sound so negative - don't worry, I am not trying to declare war on hf or giving up, I am merely adjusting their way too good reputation in my head. And I have learned to judge companies by their actions, not by the goodwill of fans. Or should have learned :) This is an attitude correction for me, not a disaster.
Addendum: you can probably tell by now that I am a staunch anti-neoliberalist and work for a tiny, very personal company for a reason :) Don't worry, I am also a realist :)
@mradermacher The status page (http://hf.tst.eu/status.html) has been frozen since 2024-12-20 16:05:00+0100 and both nico1 and rich1 are idle. There no longer seem to be any models to be uploaded, so I assume something critical broke and I don't think there is anything I can do to fix it.
I checked the kernel log on StormPeak and the time it broke seems to somewhat align with the time my RTX 3080 GPU crashed, but that GPU is not used by nico1 as only the RTX 4090 GPUs are assigned to your LXC container, so it should not be related:
Dec 20 15:55:19 StormPeak kernel: NVRM: GPU at PCI:0000:c1:00: GPU-c8fe94f9-541b-e16b-da0f-b8d38ea5283e
Dec 20 15:55:19 StormPeak kernel: NVRM: Xid (PCI:0000:c1:00): 62, pid='<unknown>', name=<unknown>, 2027f626 2027f426 2027fcf4 20288f2a 20288e30 2021b5b8>
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x55:2477)
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nv_open_q:2903 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nv_open_q state:D stack:0 pid:2903 tgid:2903 ppid:2 flags:0x00004000
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nvidia-smi:2356875 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nvidia-smi state:D stack:0 pid:2356875 tgid:2356875 ppid:2341557 flags:0x00004006
(...)
Dec 20 16:00:50 StormPeak kernel: INFO: task nv_queue:2901 blocked for more than 245 seconds.
Dec 20 16:00:50 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 16:00:50 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 16:00:50 StormPeak kernel: task:nv_queue state:D stack:0 pid:2901 tgid:2901 ppid:2 flags:0x0000400
After more carefully reviewing the kernel log it indeed seems that nico1 somehow got affected by the issue with the RTX 3080 GPU:
Dec 20 15:58:48 StormPeak kernel: INFO: task llama-quantize:2364235 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:llama-quantize state:D stack:0 pid:2364235 tgid:2364235 ppid:1469293 flags:0x0000000
llama-quantize should not use any GPU and the faulty GPU is not even attached to your LXC container, so it is really strange this happened. There are tasks running, so I'm not sure if the system is in a state where it can tolerate a reboot of nico1, but it currently is not working at all so it likely can't get any worse. It would be really interesting to know how a stuck quantize task on nico1 brought the entire system to a halt.
I disconnected nico1 from the internet but still kept it running. Let's see if that is enough for the system to fix itself. All other hosts should now detect nico1 as offline and hopefully manage to recover.
It didn't help. I will reboot StormPeak now, but it is unlikely that this fixes anything, as even without nico1 the system didn't recover.
I rebooted StormPeak which fixed the RTX 3080 issue and started nico1 again but as expected this unfortunately didn't fix whatever issue brought the entire system to a halt.
Good morning. I don't know what happened. A llama-quantize should hang the job only, but maybe something else also went wrong. The connection timeout (once established) is currently 3600 seconds, but that either didn't trigger or it somehow persisted across multiple runs of the scheduler. rich1 is also gone at the moment, which might play a role as well.
I also disabled the local scheduler a week or so ago because there is some weird bug where static jobs finish successfully within 10 seconds without doing anything, meaning static quants are not generated at all, so that didn't help either.
Obviously, there is a bug somewhere.
Since I am still not in such great shape, I opted to kill all processes holding locks and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)
In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write mode again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.
And if we don't talk to each other much, merry christmas and a happy new year :)
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.
(D'oh, forgot the README patcher)
Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.
Maybe finally I can find some time to link the download page before it becomes obsolete.
gee, found another locking bug that kept jobs from being started all night.
Since I am still not in such great shape, I opted to kill all processes holding locks and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
gee, found another locking bug that kept jobs from being started all night.
Awesome to hear that you were able to find and fix another locking bug. I can only imagine how complex maintaining this entire system must be. I wrote a distributed system for the satellite project I'm doing together with Richard, where we have around 30 concurrent workers often only staying for a few hours, and there were so many edge cases to consider.
Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)
Richard would for sure appreciate it if you can keep fully utilizing his server and don't run out of models for him to quant. If the queue gets too small you can maybe make it so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first, which to my knowledge are the only servers where someone has to pay for electricity.
Just so you know, we are currently also using the same server that hosts rich1 for a satellite project worker, so when we had that rich1 LXC outage we just scaled up satellite to use all resources and downscaled it again once your LXC container was fixed. I'm sure Richard will always find some other temporary use for this server should the queue ever run dry. I also have quite close contact with him, so don't worry about it.
In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write mode again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.
I'm so glad to hear that. This for sure must have been a really bad time for you.
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
That sounds like a great idea.
It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.
Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.
Wow thanks a lot! This is awesome. I'm so happy we managed to find a workaround to avoid the rate limit.
Maybe finally I can find some time to link the download page before it becomes obsolete.
It would be really cool if you could do so. I love your download page! It would be great if you can show me an example before you do all of them as this might be the last time we change all the model cards so it needs to be as good as possible. Something else I noticed is that sometimes our quants appear as "Finetunes" instead of "Quantizations" in the parent model as can be seen in https://huggingface.co/models?other=base_model:finetune:nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored - maybe this can be fixed as well when we have to update all model cards anyways.
And if we don't talk to each other much, merry christmas and a happy new year :)
I wish you a happy new year as well!
I can only imagine how complex maintaining this entire system must be.
The problem is that code is constantly added and algorithms changed while the system is running :-)
[download page] It would be great if you can show me an example before
I hope I can do it incrementally, e.g. for new models only at first. But yeah, I'll try to ask for your opinion. If you wish, you can even make a suggestion - I want to use some custom css to make a small box with the link only, and some very short explanation, such as "Compare static/imatrix quants, download and search on our... [[[Model summary page for this model]]]" or so. Suggestions or even outright examples are welcome :*)
so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first
The problem is in the specifics. Unless requested, models get queued in batches, and then we have two choices: leave a model in the queue, or queue it somewhere. In the latter case, we can choose where to queue.
If rich1 simply has priority, it would simply accept all models till the budget is full or the queue size limit is reached, neither of which is likely for daily batches, and also not desirable. At the moment, it is kind of distributed by speed, as nodes do static quants first, so faster nodes gobble up more jobs.
We need some kind of back pressure/weighting. And something like this is implemented (differently for models with nice <= 50), but it wouldn't be able to avoid scheduling on nico1 or marco. The current scheduling restrictions on nico1 are nice, because they mostly answer the question at night, and I will put a different scheduling restriction on marco (basically take it out completely once our queue is usually empty).
The only solution, I am afraid, is to essentially block nico1 completely (other than imatrix generation). And that might be doable, after all, we did this for many months. Or we only manually schedule jobs on nico1. Or only bigger jobs, which would be delayed on the much slower rich1 (which also might conceivably be busy with other jobs, as it is effectively a shared server). Something like that. Share your thoughts :)
gpus@nico1
As an unrelated side note, for a few days now, I was using only one graphics card on purpose, except when I was in trouble (because of scheduling or downtime issues unrelated to the actual model workload), and at the moment, one gfx card is sufficient.
I really do plan to queue a few more months' worth of models before the queue runs dry, though.
Update: Yeah, I think that's it - disable automatic quanting on nico1 except maybe for requested models (<= -1000), hand-queued models and very big models.
peculiar: we have been rate-limited again. pretty sure our repo creation rate was very average (especially as nico is paused).
more peculiar: even though our rate is way lower, the wait time (once rate limited) is much higher.
i hope they didn't further restrict uploads, or repo creations :/
I saw that yesterday, rich1 was pretty idle, we even decided to finish off satellite by doubling the processing power because rich1 otherwise was completely idle ... What is going on? Did huggingface answer anything in the email??
hf completely ignored my mail, afaics. it's quite strange, every time i reduced the rate of repo creation api calls, it worked for a few days, then -> new rate limit. or, alternatively, the rate limit is weirdly implemented. right now, I think we are at the theoretical minimum rate (one repo creation request per actually created repo).
it's also possible that the rate limit is not strictly implemented as a per-account rate limit. maybe it's just not reliable, just like anything else they implemented :)
I should try contacting them lol. What should I write haha? I'm not the best at email writing, so would appreciate if you could draft it =)
or I can try contacting them elsewhere where I have contact with them
i hope they didn't further restrict uploads, or repo creations :/
I don't think it changed since it got introduced. They for sure wouldn't introduce such changes during the Christmas/New Year holiday period, when most of their developers are on holiday.
especially as nico is paused
When I paused nico1 today for the performance measurement project I got the following error, but it all seems to work despite this:
./nico1-pause: line 19: /root/s2/llmjob: No such file or directory
I checked and was able to confirm that the entire "s2" folder is missing. The only thing that didn't work was unfreezing and completing the frozen task, but that's not important as I don't intend to reboot this time. Let's just hope they don't automatically start as long as nico1 is paused.
140+ 14 CosmicNoodle-7B blocked/imatrix/gpu
Any idea what this means? I saw similar blocked statuses for the entire day before I paused nico1.
I checked and was able to confirm that the entire "s2" folder is missing.
Right, everything is now in /llmjob, rather than splattered over the system. I forgot to update the script(s). Will update them.
All you missed out on was resuming the frozen/Stopped quantize jobs, so they didn't interrupt and didn't exit.
140+ 14 CosmicNoodle-7B blocked/imatrix/gpu
The status of jobs does not update when paused, so this is simply the last status update. I think :) If it does not clear up when resumed, I will have to have a look.
It might also be that the job has failed somehow, but didn't have an exit status. In that case, the job scheduler doesn't know what to do and just ignores it. (Well, it might actually block a gpu in that case, but that isn't the case here).
nico1 is now unpaused.
-2000 360 si falcon-180B
-2000 236 si goliath-120b
Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also requantize the already existing imatrix quants. I definitely want to give falcon-180B another try. I remember how excited I was when it released as it was the biggest openly released LLM at that time, but then the model turned out to be quite underwhelming. Maybe with modern CoT prompting techniques and better system prompts this almost forgotten base model can be of use. While finetunes are nice, in the end base models contain the knowledge I seek to extract and so are of much greater value.
Edit: Seems like it is requesting the existing imatrix quants. How awesome!
-999 205 I Llama-3-Motif-102B error/134 12/24,IQ1_M [691/867]
What a strange error - not something I've ever seen before, but you might be familiar with it. So strange how all the other quants so far worked.
[ 691/ 867] blk.76.attn_q.weight - [ 9216, 9216, 1, 1], type = f16, converting to iq1_m .. /root/cvs/llama.cpp-cuda512/ggml/src/ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also requantize the already existing imatrix
I was missing the static quants only (and incidentally, any missing imatrix ones). I was also so disappointed in falcon-180b. Anyway, I'll redo the imatrix ones, too, then.
error/134
That is the process exit code, in this case, ABRT: ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
I will just put it here in case you didn't notice it =) @mradermacher
I should try contacting them lol. What should I write haha? I'm not the best at email writing, so would appreciate if you could draft it =)
or I can try contacting them elsewhere where I have contact with them
I didn't notice, indeed. Hmm.... I'm in a bit of a cinch here - I normally don't want to be in a position to ask for special treatment, but obviously, I am very specially treated by hf already. And sometimes it might be better not to wake up sleeping tigers.
So... I mailed them, nicely, and they didn't consider it. At the moment, it is fine most of the time, and annoying some of the time. And we are creating repos faster than normal, due to going through all the small ones. So maybe it's best to not push them further and delay mailing them until we really run into a painful rate limit.
It might be an issue for you, if you start quickly quantizing all the, uhm, remaining models (yes, I haven't forgotten about the list :)
alright then. when it bothers you too much, I guess just send me a text for a message, I will try to do something with it. I guess it will be when we start quanting "remaining models" haha
Yeah, but that will hopefully be your problem :)
when will I start quanting lol? In 2026 haha? When will it be my problem? Maybe I should just send a message now to see with them about it? Or should I pursue other projects while waiting for your part to be done?
I wanted to provide it much earlier, but too much other stuff came in between that I... couldn't preempt. Turns out the problem is a bit harder than I thought, too, but I have most of the filtering stuff in place.
well I guess I will eventually get it haha, well good luck with anything you have =)
Thanks for your understanding :) I'll try to provide it before rich1 runs dry(er)
@mradermacher
The RPC setup is ready for DeepSeek-V3, DeepSeek-V3-Base, Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha. We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit and Hermes-3-Llama-3.1-405B-Uncensored/Llama-3.1-405B-Samantha in 16-bit.
The servers are not primed yet and I have no idea if this is still required on latest llama.cpp. To prime, just call llama-cli -m /tmp/model.gguf -p "Hi" -n 2 with the RPC arguments. Should priming still be required, we would ideally automate it.
Here are the RPC arguments to use: --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 -ngl 10000
Please make absolutely sure no imatrix or quantization tasks are triggered while an RPC task is running, or the entire host will crash due to OOM while GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set. Especially for the 405B models RAM will be extremely tight.
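Putting the pieces above together, the priming call would presumably look something like this (model path as in the example above):
llama-cli -m /tmp/model.gguf -p "Hi" -n 2 \
  --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 \
  -ngl 10000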
To move the GPU to CastlePeak I had to reboot StormPeak. I stopped nico1 and waited for the script to terminate and then shut down the LXC container and host. Somehow this ungracefully killed the Gemma-2-Ataraxy-Gemmasutra-9B-slerp and Gemma-2-Ataraxy-v2a-9B imatrix computations, so please restart those.
I'll have a look when I get up again. I'll try without priming (and then complaining). As for automating it, I can basically just run llama-cli with the model, the rpc arguments and -n 2? That would be easily automatable.
Somehow this ungracefully killed
Yeah, the scheduler can't know what goes wrong when the connection fails. More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/
But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.
We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit
Might be a good time to test the hf quant download code that exists, but has not been tested yet (it had to be rewritten for nico1). Did we ever get zero-copy concatenating to work on nico1? We'll probably find out...
Hmm, or maybe not.
Did we ever get zero-copy concatenating to work on nico1? We'll probably find out...
I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha that at the time was uploading to HuggingFace because I forgot to hardlink. cat concatenation worked instantaneously. I was so impressed that I didn't have to wait 10 minutes for the data to copy like on ZFS. That was almost like magic. I assume the file system somehow created a new file based on all the blocks of the old files without copying anything.
As for automating it, I can basically just run llama-cli with the model, the rpc arguments and -n 2? That would be easily automatable.
Yes, it just needs to do prompt processing for a token and generate 1 token if they still have not fixed that issue. Awesome that it is not that hard to automate, because manual priming always requires so much coordination.
I'll have a look when I get up again.
Any idea when that will approximately be? I'm asking because I obviously need to have all my services and the ones I provide to @RichardErkhov and @Guilherme34 turned off before we start with RPC imatrix computation. I already have everything turned off, but I might turn some services on again in the meantime if I know when you will start.
I have a minor request for the download page: could you show the raw perplexity values for a model? The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).
I have a minor request for the download page: could you show the raw perplexity values for a model? The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).
The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to provide the user the average quality of a specific quant and do not depend on the model shown. It is based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
We don't measure the perplexity of every single model besides the perplexity value llama.cpp computes during imatrix computation. I'm not sure how useful providing that would be given that we use a proprietary imatrix training dataset. The model you download is never leaked to the server hosting the new download page. All dynamic content is generated using client-side JavaScript for privacy reasons, so I don't think it's the right place to provide any model-specific data. If there is a valid use-case for it we could consider adding the perplexity value computed during imatrix computation to the model card or maybe upload the imatrix training log as a dedicated file for future models.
Regarding the llama.cpp version: I installed b4435 017cc5f on all RPC servers, which was and still is the latest release. I recommend you use the exact same version. I recommend against using latest develop 53ff6b9 as it majorly refactors the llama.cpp backends and I don't feel confident that this version is stable. I would prefer not spending another week redoing all the RPC imatrix quants because their refactoring turns out to be flawed. Latest develop currently seems so bad that even their automated release pipeline failed, which is why b4435 017cc5f is still the latest release at the time of writing.
Don't forget to compile llama.cpp without CUDA and with RPC support for the RPC setup to work.
Script to install it:
#!/bin/bash
# fresh checkout of the pinned release
rm -rf llama.cpp/
git clone --recursive https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
git checkout b4435
# enable the RPC backend; CUDA stays off since it is not requested
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release -j
Yeah, the scheduler can't know what goes wrong when the connection fails.
To my knowledge nico1-pause informs both the imatrix and quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection because preparation for a reboot is a very common reason for me to pause nico1.
More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/
That's strange. rclone with a timeout survived all the internet issues I had back in the coaxial days, and now a restart caused it to hang. That's indeed quite surprising.
But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.
I know, but it would be nice if it were possible to reboot a paused host without causing unnecessary work for you.
It is based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
I see, it's based on that data (I've been meaning to augment it with custom quants and KV compression, haven't had a chance to do that yet).
The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to provide the user the average quality of a specific quant and do not depend on the model shown.
I don't think that is possible, since different model families behave very differently when it comes to quantization. It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.
A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl, where static Q5_1 is marginally better but still not worth the increase in size, and evaluation, which is very noisy. This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.
For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).
Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.
I don't think that is possible, since different model families behave very differently when it comes to quantization.
The quality values shown on the download page are meant to help the user choose the highest-quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.
For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).
Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.
As you already figured out it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory. It would be awesome if one could, based on the model architecture, tell for each quant how good it will be. I'm quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.
It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.
I guess it should be better labeled.
A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl, where static Q5_1 is marginally better but still not worth the increase in size, and evaluation, which is very noisy.
Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.
This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.
This might be the case. I measured many of them in the past. Some inaccuracies based on different architectures are expected.
The in my opinion greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It also can be implemented in a way that keeps our privacy-focused download page design.
@mradermacher I have some great news regarding the RPC based imatrix computation setup. It seems as if priming is no longer required in latest llama.cpp. At least I was able to compute an imatrix over RPC without priming first.
I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha that at the time was uploading to HuggingFace because I forgot to hardlink.
I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.
To my knowledge nico1-pause informs both the imatrix and quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection because preparation for a reboot is a very common reason for me to pause nico1.
It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running. And there is really no other way; for the scheduler, the job simply times out, with no information on why (it might still be running, the host might be rebooted etc). At the very least I would have to check uptime monotonicity before every job start. Unless we increase the reboot frequency, I'd rather clean up occasionally than implement and test that in a running system :)
On the other hand, if it reboots when idle, the scheduler should indeed be able to cope with it, although in the past there have been issues with ssh not timing out etc.
I guess it should be better labeled.
The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)
The in my opinion greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under
Except we don't have the model size available via the API. We would have to download either metadata (as for the search, which is not synchronized to the repos), or (partially) download a quant and parse it. Or use some heuristic on the file size. I don't think quant sizes make much of a difference to warrant that, though - model families would make a bigger difference.
Also, I wonder about hf doing that (partial quant downloading), because I heard that one hidden cost of aws is that partial few-byte downloads cause the whole file to be copied internally and they would pay for that. At least there seem to have been some cases where such money-amplification attacks were done. I wonder if they are aware of that (if true). In any case, that was a tangent :)
I have some great news regarding the RPC
Indeed!
sleep
Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.
I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.
Just using cat doesn't work inside your container to instantaneously concatenate them? I thought I did it yesterday inside your container and it worked but maybe I was on the host instead.
It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running.
The pause script waits for all jobs to be completed and uploaded - at least for quantization jobs. It apparently doesn't wait for running imatrix jobs to finish before terminating. We could for sure make the pause script wait until no more imatrix processes are running. In any case, now that I know, I will just make sure they are all done before I reboot, so this won't happen again.
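A rough sketch of what that wait could look like (assuming the imatrix process is named llama-imatrix; not the actual pause script):
# hypothetical addition to the pause script: block until no imatrix process is left
while pgrep -x llama-imatrix > /dev/null; do
    sleep 60
done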
The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)
Yes, don't worry. I intend to improve it a lot. Once it is on every model card I will for sure be way more motivated to do so.
Except we don't have the model size available via the API.
It's on the HuggingFace model page, so you can likely just scrape it or figure out how the webpage obtains this information. But honestly, just going by the model size should be good enough as it is just to give the user a rough quality estimation.
Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.
Awesome! I have not started anything and all hosts are ready to be used for RPC. I unfortunately won't be able to help you much as I have to sleep now because I have work tomorrow (or I guess technically today, because it is already past midnight).
Should something with the RPC servers go wrong you can always SSH into them from nico1 using [email protected] and then enter tmux attach to access the RPC server console.
The quality values shown on the download page are meant to help the user choose the highest-quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.
As you already figured out it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory.
Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.
I'm sorry if I'm coming across as demanding. I really do appreciate the work team mradermacher does. I understand that measuring every quant would require a herculean amount of compute and is not feasible and I was not suggesting that.
My point was that the numbers on the download page can easily lead to confusion and misunderstandings, and I was just highlighting one such example, since for 4 of the 6 metrics on that page (including KL-divergence) it will show Q5_1 static being worse than Q5_0 static, and the other 2 metrics are either extremely noisy or extremely close. I've seen data (not going based on theory) that shows the other models I mentioned do not behave the same in that regard (and even then I probably should have been more specific on the exact models, as even within a model family that isn't always true; gemma-2 27b is erratic with the legacy quants but the 9B is not). This issue doesn't exist for the imatrix version of Q5_0 and Q5_1, both in your data and the other data I've seen.
The only other anomaly I've seen data on, where a larger quant performs worse or the same, is mistral nemo instruct 2407 and static k-quants around 3-4 bpw.
Personally, I didn't see a point to the legacy quants anymore as they are legacy for a reason, but I found out from this account's discussion page that for some platforms and some users they are worth it for the lower energy consumption. I also like KLD data, which is why I'm so grateful you gave me a lot of it. It's hard to find and resource intensive to create.
It would be awesome if one could, based on the model architecture, tell for each quant how good it will be. I'm quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.
That is impossible; like I mentioned, the training data and ordering matter, and at that point, even if it is possible to estimate, I don't see how that would be easier than just testing the resultant LLM.
The in my opinion greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It also can be implemented in a way that keeps our privacy-focused download page design.
I have much less data on this, as there aren't that many great sources for metrics of larger (>15B) models, but are you sure that the closest Qwen 2.5 model size is representative of the quant quality of a non-Qwen 2.5 model? I don't think so, but like I said I don't have enough data to be completely confident about this; as it stands I still don't believe that is true.
I think the notes section on the current model cards and download page, and the quality metric which is derived from correct token but only shows integers and has no ties besides source/fp16, are helpful (maybe adding a comment to Q5_1/Q4_1 static explaining that whether it is better than Q5_0/Q4_0 is extremely hit or miss depending on the model). I think the other 5 categories (KLD, Correct token, Same Token, PPL, and Evaluation) are not helpful, as they have nothing to do with the model you are currently viewing, and are suggestive that they do.
For example, with a Llama-2 based model the KLD of smaller quants should be much better than what the table indicates, as Llama-2 is nowhere near as dense as Qwen-2.5. Llama-2 was trained with 2 trillion tokens vs 18 trillion for Qwen-2.5, and the data I've seen also reflects that. I think that issue will persist even if you compare to the closest Qwen 2.5 size rather than the overall Qwen 2.5.
The pause script waits for all jobs to be completed and uploaded
It's hard because the jobs technically run on kaos.
Just using cat doesn't work inside your container to instantaneously concatenate them?
I have no idea, I thought so. But maybe you enabled that and I forgot? Anyway, if it does, scripts will use it; if it doesn't, they will still work, so no worries here :)
It's on the HuggingFace model page, so you can likely just scrape it
Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.
Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if staggered with the imatrix quants.
I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.
It's hard because the jobs technically run on kaos.
It's fine. Now that I know, I will just check the status page and GPU utilization and see if there are any imatrix processes running before I reboot.
Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.
It's not on all SafeTensors models? Also, the parameter count should be super accurate as it originates from the ModelTensorsParams object. Just search the HTML source code for <div class="SVELTE_HYDRATER contents" data-target="ModelTensorsParams" data-props="{ and you will find the raw ModelTensorsParams object containing a ton of important model metadata including the parameter count. We can also use it to check if a model is llama.cpp compatible before even downloading, as ModelTensorsParams contains the tokenizer_config, which must contain LlamaForCausalLM or another tokenizer supported by llama.cpp.
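An untested sketch of such a scrape (assumes the attribute order shown above and plain HTML-escaping of the embedded JSON; further parsing, e.g. with jq, is left out):
model="${1:?usage: $0 org/model}"
curl -s "https://huggingface.co/$model" \
  | grep -o 'data-target="ModelTensorsParams" data-props="[^"]*"' \
  | sed -e 's/.*data-props="//' -e 's/"$//' -e 's/&quot;/"/g' -e 's/&amp;/\&/g'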
Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if it is staggered against the imatrix quants.
No, still the same BTRFS 2x SAMSUNG MZVL22T0HBLB-00B00 SSD pool as always. Each of them should reach 7 GB/s read and 5.2 GB/s write if empty and trimmed, with 1,000,000 4K read IOPS and 850,000 4K write IOPS. Because we are using RAID 0 it should be up to twice as fast under optimal conditions. Make sure to trim your SSDs and keep them empty enough that they run in SLC instead of TLC mode when possible.
I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.
Just check the host's memory first using /host/proc/meminfo
to make sure enough is available and adapt the cgroup limit accordingly. Please also leave a few GB as a buffer just in case. Keep in mind that while the host has 512 GiB of RAM, only 503 GiB of it are usable, and a few GB are also needed for the host itself and the containers hosting the StormPeak RPC servers.
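For example, a quick check before starting anything could look roughly like this (the 25 GB figure is just the kind of cgroup limit mentioned above and the 10 GB buffer is arbitrary):

    # Rough sketch: read MemAvailable from the bind-mounted host meminfo and decide whether
    # a task with a given cgroup limit would fit, keeping a safety buffer.
    def host_available_gib(path="/host/proc/meminfo"):
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / 1024**2  # value is reported in kB
        raise RuntimeError("MemAvailable not found")

    planned_gib = 25   # e.g. the cgroup limit you intend to set for a quantize task
    buffer_gib = 10    # leave a few GB for the host and the RPC containers
    print("ok to start" if host_available_gib() >= planned_gib + buffer_gib else "too tight")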
Deepseek should be slightly less tight than 405B, but both will be quite tight.
It's fine now that I know. I will just check the status page and GPU utilization to see if there are any imatrix processes running before I reboot.
I can probably register the jobs on nico1 as well somehow, so it would be easy to wait. But not today :)
I can probably register the jobs on nico1 as well somehow, so it would be easy to wait.
Not so easily, but I can make network rpc calls in bash. Yay. (I've updated the pause script, it might wait for imatrix jobs now, pretty much untested).
Note to self, that's how you configure rpc imatrix for big models, also, set DONTRUN*.
"extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204",
"force" : 1,
"llama" : "/root/cvs/llama.cpp-nocuda",
"ngl" : "10000",
"quant" : "Q8_0",
I had secretly hoped Deepseek would be faster...
It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to? There is not yet a DeepSeek-V3-Base-i1-GGUF repository on HuggingFace. I guess to kaos. Hopefully nothing broke, because after the imatrix task was done things went quite crazy and it even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3
as the next massive RPC imatrix computation task.
-3000 713 DeepSeek-V3-Base run/hfu
Edit: after about half an hour the DeepSeek-V3-Base hfu task is now completed/gone.
I had secretly hoped Deepseek would be faster...
It was for sure faster than expected. It only took around 10 hours, while 405B takes like 20 hours and FatLlama took like 40 hours. Keep in mind that half the performance is lost because we use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 while allocating far more memory than the available GPU memory, instead of -ngl 0, due to that not being supported for RPC servers. The RPC overhead must be almost negligible.
nico1 completed the entire imatrix job backlog and is currently idle. Let's start RPC imatrix computation for DeepSeek-V3
if you have time. Any idea what happened to DeepSeek-V3-Base
was everything successful?
Oh nice, I see you just started it! Thanks. It is using all the RPC servers as expected.
The deepseek-v3-base imatrix is safe.
It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to?
To kaos, which has all imatrix files in a directory, and serves them to the other hosts. Actually multiple directories, for various reasons. Currently ~70GB.
The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file (at least, I hope so - I rely on rsync doing the right thing, and rsync rarely fails me completely, it is one of the few tools I trust a lot :), which causes the next attempt to fail as well.
(It ends with the path of the imatrix training data, so I assume it is complete)
Hopefully nothing broke, because after the imatrix task was done things went quite crazy and it even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3 as the next massive RPC imatrix computation task.
The imatrix-training-remote script had a little refactoring while deepseek was imatrixing. And an unclosed $()... and unfortunately, this was not counted as an error, so all following imatrix quants failed in a way that made the scheduler think an imatrix was created when it wasn't. Quite messy to watch, but nothing is lost.
I can't imagine the rpc server crashed because of anything that happened after deepseek, because the following imatrix jobs would have used the syntactically broken script, which was not capable of running any commands (basically it failed to compile after the first few lines), so no llama calls were made. I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation, so we should look out for this when deepseek finishes tomorrow noon or so. I will try to queue some other models before then going to the next big model, just like today.
It was for sure faster than expected
Your expectation was based on better understanding :)
nico1 completed the entire imatrix job backlog and is currently idle.
Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
The deepseek-v3-base imatrix is safe.
The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file
I'm glad and relieved to hear that.
it is one of the few tools I trust a lot
Great to know. I will use it more often in this case.
Quite messy to watch, but nothing is lost.
Great that nothing got lost. Refactoring scripts directly in production must be stressful.
I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation
That is probably exactly what happened. I remember that we had the RPC server crashing after imatrix computation back when we did FatLlama as well. It was even the same RPC server that had all its layers in GPU memory instead of using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. But it is all fine, as I want to manually restart the RPC servers after every imatrix computation anyways, since I don't trust the llama.cpp RPC code to properly handle the transition from one model to another.
I will try to queue some other models before then going to the next big model, just like today.
There is very high demand for DeepSeek-V3
imatrix quants, as nobody so far has been able to compute an imatrix for it, so let's do them first. I'm personally really interested in trying the imatrix quants of this model as well, and we even have oobabooga asking for it to do some Q2 quality measurements. DeepSeek-V3 should now also be prioritized higher than DeepSeek-V3-Base.
Hermes-3-Llama-3.1-405B-Samantha
and Hermes-3-Llama-3.1-405B-Uncensored
will take around 20 hours each for imatrix computation and be extremely tight to fit into RAM so let’s complete imatrix quants for DeepSeek-V3
first to not further delay that.
Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don't think I've ever had such a long one.
it's all manually managed
I highly appreciate all the effort you put into all of this.
Great to know. I will use it more often in this case.
In that case, there are two main things that impressed me: whenever I needed an option, it was either already there or already in the queue. And when rsync fails to recreate the same file (by checksum) it will delay the update and try once more, and only then fail with an error, i.e. even if the algorithm has a bug, it would detect and report that. It's a lot of small things like that that increased my trust - it's not only trying to be a swiss army knife w.r.t. features, but it also cares about correctness a lot.
Great that nothing got lost. Refactoring scripts directly in production must be stressful.
Mostly only if you don't have the time to deal with the fallout at that very moment. Otherwise it's mostly a challenge. You should try it more often in your company :-)
But, seriously, it was an absolutely trivial change... Aren't they all :(
There is very high demand for DeepSeek-V3
OK. I will probably be gone when it finishes, I can try to pause nico1 instead of essentially switching it off, so whoever sees the "I" first can unpause.
That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don't think I've ever had such a long one.
I probably have very similar problems focusing, but these phone calls are very relaxed, and I can almost always disengage for a while when I need to. It's not like a tense customer conference call or anything, so don't get the wrong impression. It mostly means I will watch the queue status from time to time, so if something goes wrong... too bad.
I think falcon-180b-chat is as disappointing as always, especially at lower quants, but I'd be happy to hear your assessment (we didn't have the -chat before btw.)
I successfully resumed nico1 10 minutes after it finished. DeepSeek-V3 hfu is stuck again, but it doesn't matter as it happened when it was already uploaded. DeepSeek-V3 and DeepSeek-V3-Base are now both quantizing. Thanks a lot!
I must say, you were quick :)
If it got stuck again, there is either another issue, or there is some generic issue with larger imatrix files (and indeed, in the past, it only happened with larger files). I'll have a look, but indeed, if the transfer is successful, it will distribute it, even if the imatrix scheduler feels drunk.
How fast is /bpool? I assume slower than your cpu would like. I will try to copy the models to my local disk to get faster quanting.
Or maybe not, I get weird speeds. Will experiment.
@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)
@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)
It's likely because of the relatively high compression level I used to make all these massive models fit on bpool. I used zstd-6 in case you're wondering.
Oh also I reduced ZFS ARC cache to 1 GB during RPC computation and forgot to increase it to something more reasonable. I now increased it to 50 GB. Not sure if that will have any impact as this highly depends on how llama.cpp is reading the data.
Possibly, if only one core does the decompression, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).
Anyway, I have barely space for one source model. I copied it over to my disk and the two quant jobs, which were at the same tensor when I noticed, are now widely apart (they have different cpu priorities, but that had no effect before).
And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.
Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.
The CPU is now mostly busy, but it's also doing IQ quants instead of the Q2_K quant from earlier. However, since both jobs are doing the same quants, I guess separating the I/O still has an effect.
Another reason I completely forgot to mention: back when it started I wanted things to go faster, so I increased the number of cores assigned to your LXC container from 48 to 60. Because the first quantization task had already started at that time, it likely created fewer threads than optimal, resulting in less CPU utilization than usual.
How fast is /bpool?
Because I was curious, I checked what disk bpool is using. bpool consists of a single Samsung SSD 990 PRO 4TB
. It has 7,450 MB/s read and 6,900 MB/s write speed when in SLC mode, but it is currently in TLC mode as it is almost full. It has 1,600,000 4K read IOPS and 1,550,000 4K write IOPS.
Possibly, if only one core does the decompression, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).
And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.
That 500 MB/s per-process limit is likely related to decompression speed. It is not a per-process limit but a per-read limit. If you access the file using many concurrent read operations, ZFS will spread them across separate threads, resulting in much better read performance.
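If you want to verify that theory, something like this throwaway sketch should show the difference between one and several concurrent readers (path and sizes are placeholders; drop caches between runs or the page/ARC cache will skew the numbers):

    # Throwaway sketch: compare 1 vs. 3 sequential readers on the same file. Within one run
    # each reader streams a disjoint region so they don't serve each other from cache.
    import threading, time

    PATH = "/bpool/DeepSeek-V3-Base.SOURCE.gguf"   # placeholder
    CHUNK = 16 * 1024 * 1024
    PER_READER = 8 * 1024**3                       # 8 GiB per reader

    def reader(offset):
        with open(PATH, "rb", buffering=0) as f:   # unbuffered; read() releases the GIL
            f.seek(offset)
            left = PER_READER
            while left > 0:
                data = f.read(min(CHUNK, left))
                if not data:
                    break
                left -= len(data)

    def run(n):
        threads = [threading.Thread(target=reader, args=(i * PER_READER,)) for i in range(n)]
        t0 = time.time()
        for t in threads: t.start()
        for t in threads: t.join()
        print(f"{n} reader(s): {n * PER_READER / 1024**3 / (time.time() - t0):.2f} GiB/s")

    run(1)
    run(3)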
Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.
Performance is awesome now. Copying it for sure was the right decision. Once DeepSeek-V3 is done we continue with RPC imatrix computation without waiting for the now slower DeepSeek-V3-Base
Because the first quantization task had already started at that time, it likely created fewer threads than optimal, resulting in less CPU utilization than usual.
Well, 99% idle means it didn't matter how many threads were created :) If you look at the disk I/O and CPU stats, you can clearly see the pattern (or could) - about 25s of disk I/O, followed by 6 seconds of CPU. Now the disk I/O phase takes about 7.5s for V3 (and the same old 25s for V3-Base).
when in SLC mode
Shouldn't matter, as TLC mode should have the exact same reading speed.
The problem is clearly not the hardware. I can easily get >1GBps when reading with multiple threads. But somehow, a single thread (such as llama-quantize or cp) tops out at around 450MBps.
It is not a per-process limit but a per-read limit.
I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then. What happened to readahead, interleaving? I'm not asking for concurrent decompression, just, like, basic filesystem advancements we had since the early 90ies... (This is only half-joking :)
I also don't buy any such decompression limit. A single 4.3GHz efficiency(!) core of my desktop CPU decompresses a zstd -14 compressed f16 gguf at 1.3GiBps, while piping it into another program.
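(If you want to cross-check that number on your cores, here is a rough sketch using the third-party zstandard package, with a placeholder path:)

    # Rough sketch: single-stream zstd decompression throughput of a compressed gguf.
    import time, zstandard

    class CountingSink:                      # discard the output, just count bytes
        def __init__(self): self.n = 0
        def write(self, b):
            self.n += len(b)
            return len(b)

    sink = CountingSink()
    t0 = time.time()
    with open("/tmp/model.f16.gguf.zst", "rb") as f:   # placeholder path
        zstandard.ZstdDecompressor().copy_stream(f, sink)
    print(f"{sink.n / 1024**3 / (time.time() - t0):.2f} GiB/s decompressed")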
Alas, I had hoped it would have been some container I/O rate limit of sorts - not only would I then want to know how that works, but it would also be fixable :)
I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then.
I'm starting to get quite convinced to switch to BTRFS. ZFS performance is bad, and its lack of zero-copy support for quickly concatenating files makes downloading quants over the command line annoying. I plan on switching all my AI-related storage pools to BTRFS. This would make all the temporary storage attached to your LXC container BTRFS as well.
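For the concatenation part, my understanding (not verified on our setup) is that on BTRFS os.copy_file_range can often be satisfied as a reflink/clone so the parts are not rewritten at all, while ZFS mostly falls back to a normal copy - a rough sketch with placeholder paths:

    # Rough sketch: concatenate split .gguf parts via copy_file_range (Python 3.8+, Linux).
    # On BTRFS the kernel can often turn this into a clone/reflink; elsewhere it degrades to
    # an in-kernel copy, which still avoids shuffling the data through userspace.
    import glob, os

    def concat(parts, dest):
        with open(dest, "wb") as out:
            for part in parts:
                with open(part, "rb") as src:
                    remaining = os.fstat(src.fileno()).st_size
                    while remaining > 0:
                        n = os.copy_file_range(src.fileno(), out.fileno(), remaining)
                        if n == 0:
                            break
                        remaining -= n

    parts = sorted(glob.glob("/tmp/quant/DeepSeek-V3.i1-Q4_1.gguf.part*of9"))  # placeholder
    concat(parts, "/tmp/quant/DeepSeek-V3.i1-Q4_1.gguf")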
I'm mainly concerned about BTRFS RAID 5 support, which seems not to be considered stable. I will soon build a 4x18 TiB RAID 5 pool replacing my current hpool. Using BTRFS for that would make a lot of sense, as it is not possible to defragment a ZFS file system, making HDD performance quite terrible after a few years. Will BTRFS RAID 5 read performance increase as well? ZFS RAID 5 with 4 disks gives you an up to 3x read speed increase compared to a single disk.
I'm mainly concerned about BTRFS RAID 5 support, which seems not to be considered stable.
I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages (but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this). I still wouldn't use it, because practically nobody uses it in production, afaik.
(I once asked on the xfs list whether realtime subvolumes are finally stable in xfs on linux, after having used them on irix to good effect, and I was essentially told, "nobody knows, please start using them and then tell us. if nobody uses them, nobody will ever know" - I decided not to use them, but it was an honest response).
Personally, I use hardware raid5 (which has its own perils, although my experience has been pretty good) for main storage, and multi-device btrfs filesystems for cold(er) storage (with 4 times redundancy for metadata...). And I have a backup for the important stuff, although restoring my latest 140TB disaster took slightly over one month :(
ZFS is probably more reliable, in some sense. But maybe that's just as with OS X - I thought it had a well-thought-out user interface until I actually used it myself for a bit, and was appalled at how much worse than even Windows it is in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)
But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5
with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.
I'm starting to get quite convinced to switch to BTRFS.
I pray that will work out fine, otherwise you can rightly complain to me :) Although, zero copy support, while maybe a killer feature, is only one in a long series of features. I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you can still do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.
And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).
Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :) Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem. But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).
I know you already know to use the right tool for the job, but I had to say it as insurance :)
Will BTRFS RAID 5 read performance increase as well?
I can't tell. Generally though, raid5 read performance will not be (much) faster than the equivalent raid0 volume, and with respect to btrfs, I don't think they have any optimisations, i.e. it will be ever so slightly slower than an equivalent raid0 because it won't use the redundancy. But that's just conjecture, not knowledge.
If you need redundancy, and mirroring is too expensive, I would recommend not to use btrfs raid5, but linux software raid. Or zfs... And with software raid, you can then choose whether writes are slow but safe, or fast and very slightly unsafe.
Or you have some kind of backup, and feel bold enough to accept potential problems. Then I'd be happy to hear about your long term btrfs raid5 experiences.
Was working on the model link button, only to find out that huggingface's markdown dialect seems completely undocumented, thwarting my plans. Sigh.
@nicoboss
since my favourite eye-catching button (http://data.plan9.de/hfbut.html) fails due to lack of support on hf's side, why not go oldschool
and simply link a nice gif^Wwebp animation as button. That way, we can replace its contents on every page without changing the markdown at all.
@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.
I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.
I'm on b4457 btw.
soon b4458
I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.
We would ideally start imatrixing as soon as DeepSeek-V3 is done uploading, because doing RPC on Monday from 08:00 to 18:00 would not fit well as I then need the infrastructure for work, and the only way to finish both of them before that is to start imatrix as soon as possible and no later than Saturday morning.
b4458
OK, I will make sure to update the RPC servers now, because the latest llama.cpp doesn't seem to be compatible with the current ones. I figured this out the hard way when I tried measuring the network bandwidth.
I updated all RPC servers to b4458 and they are ready to be used.
The DeepSeek-V3 Q4_1 hfu task has already been stuck for 5 hours with outgoing traffic averaging around 60 bytes/second. I checked DeepSeek-V3-i1-GGUF-DeepSeek-V3.i1-Q4_1.gguf*.log:
DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/5d/08/
I killed the hfu process and hope it will retry the upload. I have copied DeepSeek-V3-Base.SOURCE.gguf to /tmp storage, but due to this unexpected upload issue storage is getting somewhat tight, with only 700 GB free.
Edit: The llmjob parent process doesn't seem to care and instead of retrying is just waiting for a no-longer-existing hfu process. Moving the log to /root also didn't help.
Edit2: Started another one using /usr/bin/perl /llmjob/share/bin/llmjob hf-upload-folder DeepSeek-V3-i1-GGUF DeepSeek-V3.i1-Q4_1.gguf*
- feel free to kill 3804514
if you want as it is not doing anything.
Edit3: Yes, just starting another one seemed to work, which is good as there is only 485 GB of storage left.
Edit4: It uploaded it and even continued where it stopped! https://huggingface.co/mradermacher/DeepSeek-V3-i1-GGUF/commit/d6c0da4b6cde336b2da5c767a00cbeaf6ffc7e25
Edit5: Killed 3804514
as it is now useless.
Edit6: Manually deleted all the DeepSeek-V3.i1-Q4_1.gguf.part* files because they were not auto-deleted, probably because I only started a single task of a much bigger process, but everything is fine as it detected that this task is now done and finally started with the last DeepSeek-V3 quant, IQ3_S.
Good morning ;)
Let me sort through this.
The disk was full practically minutes after I left. Great. The quantize scheduler does not take imatrix jobs into account (and vice versa), but it normally works because about half of the disk is available to imatrix. But not at the moment, due to the massive deepseek gguf. Well, it still probably paid off.
The disk was full because we had a few 70b's too many. I think the python exception chain is a bit confusing - I don't think there was a protocol error anywhere and the OSError was simply local. I also don't see why disk full would affect uploading - why would huggingface_hub have to write anything to disk? But... yeah, it probably did and failed.
Now we also know what llama-imatrix does when the disk is full - it keeps running till the very end, despite diagnosing the disk full almost at the beginning (it saves the imatrix every 10 chunks), and then fails. Lovely. That must have cost some extra programming over the boring "crash when write failed" approach of us lower coders :)
The hfu process is the parent of the llmjob upload. It starts llmjob upload, so killing hfu does nothing much. The llmjob upload runs python as a child, using the huggingface_hub library, and communicates with it via a pipe. Killing the python3 child will kill the upload and retry. Killing the llmjob parent of the python3 process will also retry, but might keep python running, trying to upload as well.
The whole thing works like this:
quantize is done with a quant and creates a child process (bash forks) for uploading => the child runs hfu (also bash) and deletes the files after success => runs llmjob upload* (perl) in a loop => python that does the work.
quantize => quantize-fork => hfu => llmjob => python3
Killing hfu will keep the upload running, but will also then keep the quant files. If the quantize job is restarted, it would wait for the upload to finish and then try to upload it again, causing it to be deleted. If the quantize job finishes, you will eventually get a pink line in the status display because the job is done, but the GGUF directory is not empty.
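If it helps, here is a toy sketch of that same pattern (definitely not the actual quantize/hfu/llmjob code) - fork an uploader, retry until success, delete the files only afterwards - which is why killing the middle layer behaves the way it does:

    # Toy illustration only, NOT the real scripts: the parent forks an upload child, the child
    # retries until the upload succeeds and only then deletes the files. Killing the retry loop
    # can leave an in-flight worker running; killing the worker just triggers another retry.
    import os, subprocess, sys, time

    def upload_with_retries(files):
        while True:
            worker = subprocess.run([sys.executable, "-c", "print('pretend upload')"])
            if worker.returncode == 0:
                return
            time.sleep(60)

    def on_quant_done(files):
        pid = os.fork()
        if pid == 0:                      # child, analogous to "hfu"
            upload_with_retries(files)
            for f in files:
                os.unlink(f)              # delete only after a successful upload
            os._exit(0)
        return pid                        # parent (quantize) moves on to the next quant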
You can quickly get a list of relevant processes using the "ils" command:
hfu 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b-GGUF 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf 141233 712736
hfu-New-Dawn-Midnight-34b.Q5_K_S.gguf 141234 707273
llmjob-Llama-3.1-8b-ITA-imatrix 136909 61139 61140 61141
llmjob-New-Dawn-Midnight-34b-static 141233 141234 707271 707273 712734 712736
"hfu" is all upload jobs, hfu-MODEL-GGUF all quantize-related ones (there is also an -i1-GGUF), and the Q5_K_M.gguf ones are uploading that one. The hfu processes are the ones from hfu downards, that is, hfu, llmjob upload, python (or other processes, such as sleep when it waits for a retry) and does not include the quantize child that waits for it.
The llmjob ones are the ones doing the work, for example by running the quantize script, which is responsible for the noquant and quantize phases.
It's exactly how I started, with a for loop that iterates through the quant types, then I added conversion to it, and then uploads. And now I am loath to touch it except for small changes :)
There is also an ikil command ("ikil -9 hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf" would kill the upload and leave the files on disk). There are a few others, such as "iwait NAME" which waits for all processes with that name to exit (e.g. "iwait hfu" in the pause script waits for all uploads).
The quantize child that deletes files after a successful hfu should not be part of any named group, but I do not know if I fucked this up or not :)
Now you know maybe more than you ever wanted to know about this.
It uploaded it and even continued where it stopped!
the hub library will hash files before upload, and if a single file already was uploaded before, it will not upload it again, only the missing files. But it does not resume individual files. I assume that is what you saw.
However, for a month or so, the huggingface_hub lib now has a way to actually resume files, but it is a bit clumsy for our purposes (requires one dir per upload), requires cleanup, and I haven't looked deeper into it yet. It would be beneficial, though, as it only hashes the files once (but that also means extra trouble if the files change).
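If I read the release notes right, that is the upload_large_folder API - roughly like this hedged sketch (repo and paths are placeholders, and I haven't checked how well it plays with our cleanup):

    # Hedged sketch of the resumable path mentioned above (presumably HfApi.upload_large_folder).
    # It keeps per-upload state inside the folder, so an interrupted transfer can pick up where
    # it left off. Repo name and paths are placeholders.
    from huggingface_hub import HfApi

    HfApi().upload_large_folder(
        repo_id="mradermacher/DeepSeek-V3-i1-GGUF",    # placeholder target repo
        folder_path="/tmp/quant/DeepSeek-V3-i1-GGUF",  # one directory per upload
        repo_type="model",
        num_workers=4,
    )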
BTW., this is a very peculiar issue:
DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')
The wrapper I use calls the upload method in a try block and reports any exception back to perl (which would possibly die or retry, but not report it in this format). So that means huggingface_hub failed internally, and simply printed the exception and then... chose to hang or so?
And something fishy is going on, 2 TB in use (du /tmp) but 2.8TB in use (df).
Ah right, the huggingface uploader was still hanging and keeping the deleted file around. Now we have plenty of free space again. Sigh.
DeepSeek-V3-Base failed:
/llmjob/share/bin/quantize: line 230: 685578 Bus error $QRUN "$QUANTIZE" --allow-requantize "${OVERRIDE_KV[@]}" $IMATRIX "$srcgguf" ./"$OUT.$HOSTNAME~" "$qmethod" $threads
A Bus Error... often means that the underlying file of an mmap is gone. What the heck (the source gguf is still there, afaics). I am also currently copying the SOURCE for Base, which didn't run into issues, other than getting relatively slow (normally I get a very steady >400MBps, now it's more like 300MBps). I will resume once the file is copied.
Can we start with RPC imatrix computation now? All the RPC servers are on version b4458 and ready.
If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?
Also, unrelated, I wonder if it is normal for DeepSeek-V3-Base to be so slow. It's basically crunching on IQ2_XS for the whole morning till now, and is only half-way through. That strikes me as a bit extreme - hopefully the new llama doesn't have a slowdown, and IQ2 is really just so much slower.
The other issue is that we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet, but the effective absence of nico1 for so many days is felt), but my plan of continuing tonight (or whenever deepseek-v3-base can be interrupted) does not include any.
If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?
We should obviously wait for things to finish. I see you already started the process by interrupting it once done.
I wonder if it is normal for DeepSeek-V3-Base to be so slow.
IQ2 took insanely long for DeepSeek-V3 as well. I'm not really sure why but wouldn't blame it on latest llama.cpp.
we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet
The imatrix computation queue ran basically dry. You must mean the quant backlog. We can always create nico2 on Threadripper and nico3 on CastlePeak if we temporarily need more quant resources, or have nico1 delegate tasks to other nodes accessing the same disk via network storage. If we go this route, just keep in mind that CastlePeak is only turned on on demand (but that could be automated via wake-on-LAN), while Threadripper is always running but less energy efficient. All nodes will be unusable during RPC computation. With CastlePeak + Threadripper we could double the quant throughput of the nico nodes.
The imatrix computation queue ran basically dry.
for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)
nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.
In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I had hoped I could get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)
Anyway, you are the boss, so I will start asap.
for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)
You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp for faster quants, as we can always just copy it again if it's even worth it for the few remaining quants.
I'm generally wondering if we need to increase storage on nico1. It would be nice to never have to worry about it, but it really is only an issue when we are doing these massive models, which are rare. If we are doing normal models even the 4 TB seems somewhat underutilized.
nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.
That should luckily be rare. Having 4 such massive models at once was really unfortunate timing. We should never have more than 2 of them unless multiple large models happen to release at the exact same time, as was the case here.
In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I had hoped I could get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)
Sorry for not responding to it. I thought that plan got somewhat obsolete due to the delays we encountered. As long as we don't start it in the early morning the timing should be fine. Starting it now means it should complete sometime tomorrow morning, when both of us are awake.
As mentioned before, the reason I pressed so hard to start the RPC imatrix tasks is that I hoped to get all remaining RPC imatrix tasks done before Monday working hours, when I usually need my infrastructure for work, but now that we started so late this probably isn't going to happen anyways. Having all the hardware configured for RPC is somewhat disruptive because the RTX 3080 GPU currently inside CastlePeak is the GPU I would otherwise use as display output on StormPeak, which I use as my main PC.
While RPC tasks are running I turn off every service on every one of my nodes to make sure enough memory is available. This includes the LXC container I use to host the development environment for my job. Luckily doing RPC on the upcoming Monday will be fine, as I'm spending the entire day doing server hardware maintenance (installing GPUs into servers) and meetings, so I really don't need my development environment. Honestly, it's just a lot of drama for nothing because I'm overly careful that nothing I do in my spare time could ever affect my job.
Anyway, you are the boss, so I will start asap.
I'm not. I always suggest what I believe is most beneficial for this project but in the end, you can always overrule me and do whatever you want. If you for example know you will be sleeping on Sunday morning it wouldn't have made sense to start it now.
Some good news regarding the 405B RPC tasks. Thanks to us using the same GPU offloading setup as for the even larger FatLlama 1.7T
memory is not as tight as I feared.
CastlePeak: 87.49% (220.07 GiB of 251.53 GiB)
StormPeak: 92.54% (465.65 GiB of 503.19 GiB)
Threadripper: 92.20% (115.82 GiB of 125.63 GiB)
NVIDIA GeForce RTX 3080: 9909MiB / 10240MiB
NVIDIA GeForce RTX 4090 01:00.0: 19147MiB / 24564MiB
NVIDIA GeForce RTX 4090 C1:00.0: 24115MiB / 24564MiB
NVIDIA GeForce RTX 2070 Super: 7787MiB / 8192MiB
It is still tight but if there is any task that can fit into the remaining memory feel free to run it at your own risk.
Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.
I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours. I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.
You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp
You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.
But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.
If we are doing normal models even the 4 TB seems somewhat underutilized.
I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations it is totally adequate (2TB would be too small though).
And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.
That should luckily be rare. Having 4 such massive models at once was really unfortunate timing.
Oh, we also had an uncommon number of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.
I thought that plan got somewhat obsolete due to the delays we encountered.
Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour runtime, and it seems to be more like 15 hours + ~1h setup or so.
but now that we started so late this probably isn't going to happen anyways.
It might actually happen... We could even try immediate back-to-back.
And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.
Anyway, you are the boss, so I will start asap.
Well, you are, over your hardware. Don't worry, you didn't give me the feeling that you'd brutally overruled me.
If you for example know you will be sleeping on Sunday morning it wouldn't have made sense to start it now.
It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)
I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages
After every system crash as well, and a scrub of 72 TB must take at least one day.
but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this
I'm glad ZFS doesn't have this issue.
I still wouldn't use it, because practically nobody uses it in production, afaik.
Seems unfortunately too risky for now, so I will likely have to go for ZFS again for the next generation of hpool.
ZFS is probably more reliable, in some sense.
It likely is, but it is also slow and misses many cool features that BTRFS has, like defragmentation, zero copy and file/directory-specific compression. In return ZFS has some features it implements better than BTRFS, like easily seeing the compressed size of a file, setting a size limit for a subvolume, or efficient virtual file systems for VMs, and, in my opinion, a better caching system with ARC, L2ARC and metadata caching. Like always there are many tradeoffs and there is no clear winner. One just has to decide on a case-by-case basis.
But maybe that's just as with OS X - I thought it had a well-thought-out user interface until I actually used it myself for a bit, and was appalled at how much worse than even Windows it is in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)
OS X is terrible in every way, and so is Xcode, which in my opinion is the worst popular IDE ever created. Every OS is better than OS X. I would even prefer ReactOS over OS X, despite it being an unstable mess with its own NT-like kernel.
But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5 with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.
I think so as well. For all AI related workloads, the advantages of BTRFS clearly beat ZFS. I will switch bpool to BTRFS once we are done with Hermes-3-Llama-3.1-405B-Samantha
and Hermes-3-Llama-3.1-405B-Uncensored
.
I pray that will work out fine, otherwise you can rightly complain to me :)
No worries, I would never do that. It is not your fault if you convince me of something and I don't do enough research/testing myself to be sure it actually fits my purpose and is stable enough for my use case. I would be fully to blame if I let that happen, and I really appreciate your honest opinion about BTRFS.
Although, zero copy support, while maybe a killer feature, is only one in a long series of features.
The ability to defragment is quite a massive killer feature for any HDD-based storage pool, because having to copy all data to some temporary storage and back to a newly created pool just to defragment must be one of the worst designs ever. I don't even want to think about how I will find 54 TB of temporary storage to rebuild it should the new hpool ever get too fragmented. This is the main reason I would have liked going with BTRFS over ZFS for hpool.
I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you can still do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.
The main thing regarding management I will lose is the ability to limit the size of a subvolume without ruining performance, but I rarely need to limit storage and instead prefer it if everyone can use as much as they need until the storage pool is full, which then forces me to clean up or move things to different storage pools. If limiting the size is required, I can always create the storage via the Proxmox UI, which will then create a size-limited EXT4 loopback device on top of BTRFS. It is a bit annoying that there is no way to create BTRFS-native storage pools using the UI, but I can implement that myself by editing the Proxmox web interface if I ever feel the need for it.
And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).
Like the bitrot on the SSDs you are using. I probably should run scheduled scrubs on them like I do on my ZFS pools, because as far as I'm aware that doesn't happen automatically for BTRFS by default.
Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :)
As mentioned before, it is my responsibility to do my own research before doing something and not to randomly trust someone's personal opinion. The same goes for everyone else. Nobody has the right to be upset with you for providing them with free advice.
I would say even with paid experts one would be stupid to blindly trust them, as they often seem to have some kind of personal agenda, like selling you certain types of products from which they get a commission.
Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem
Don’t worry I will always use whatever I feel best fits my use-case.
But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).
ZFS cannot either, because someone thought that creating a file system without defragmentation capabilities is a good idea, despite releasing it in 2005 when HDDs were the norm.
I know you already know to use the right tool for the job, but I had to say it as insurance :)
No worries, I will not and never would blame you for my own decisions, no matter how much your input influenced them, as they are my own responsibility. But I understand that you need to cover your ass, as there are so many entitled idiots who blindly trust your advice and then blame you for their mistakes. I should likely start adding disclaimers to all my recommendations as well, just in case.
After every system crash as well, and a scrub of 72 TB must take at least one day.
With 8 disks I usually have no issue saturating 12Gbit/s, but yes, "about a day" sounds right. But the disk is usable during that time.
Still wouldn't use btrfs raid5, too few people use it :)
I would even prefer ReactOS over OS X
That is very hardcore :)
I probably should run scheduled scrubs on them like I do on my ZFS
It's probably not worth doing it (for my pool), though - if it's raid1 metadata, then the chances of having a second corruption in the same block are low, and it will then likely be found during normal usage, or not be important. For data (in single profile) it would only detect, not correct, anything anyways, and we scrub all data we write, pretty much :)
For archives, sure.
ZFS cannot either, because someone thought that creating a file system without defragmentation capabilities
For decades, I copied my storage volumes once every 1-2 years (usually because of a disk upgrade), and that was the only way to recover performance. For a while. At least on the busy main raid volumes.
@nicoboss there is currently only 69G free on / - df shows 4023 GB used, but du only 3.4T (uncompressed size, even). lsof also doesn't show any deleted files that could account for that. (In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again).
Any idea what is going on? I don't currently see where these extra 600G could be.
For the time being, I've disabled automatic file uploads, so unless more space is going missing, at least the current imatrix should be safe.
Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.
I think technically using 30 GB might be safe. RPC doesn't use mmap, so the cached memory might not be needed. That should be enough for quantization tasks if you cgroup-limit them to 25 GB.
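Something along these lines should do for the cgroup part (rough sketch, cgroup v2 assumed; the group name is made up and writing there needs the appropriate permissions):

    # Rough sketch (cgroup v2, made-up group name): cap a quantize task at 25 GiB by moving
    # the current process into its own cgroup and setting memory.max before exec'ing the tool.
    import os
    from pathlib import Path

    cg = Path("/sys/fs/cgroup/quantize-capped")
    cg.mkdir(exist_ok=True)
    (cg / "memory.max").write_text(str(25 * 1024**3))    # 25 GiB hard limit
    (cg / "cgroup.procs").write_text(str(os.getpid()))   # children inherit the cgroup
    os.execvp("llama-quantize", ["llama-quantize", "--help"])  # placeholder command line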
The NVIDIA GeForce RTX 4090 on PCIe 01:00.0, unlike the other RTX 4090, doesn't use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 and instead runs all its layers fully in GPU memory, so those 30 GB of RAM together with the 5 GB of remaining GPU memory might be enough for -ngl 0 imatrix computation on small models.
Personally I don't think doing anything during RPC would be worth the time, effort and risk but feel free to go for it if you want.
I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours.
That's awesome. This must be because we are using the FatLlama 1.7T RPC configuration, which distributes the layers better across nodes, makes more intelligent use of the faster GPU memory, and ensures the two RPC servers on StormPeak don't interfere with each other. Didn't expect that to save 4 hours, going from 19+1 hours to 15+1 hours.
I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.
Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour runtime, and it seems to be more like 15 hours + ~1h setup or so.
It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)
Great. It only taking 15 hours kind of messes with my plan as well, as it now might finish so early on Sunday morning that I will still be asleep.
It might actually happen... We could even try immediate back-to-back.
If we start Hermes-3-Llama-3.1-405B-Uncensored at the same time as or slightly earlier than Hermes-3-Llama-3.1-405B-Samantha today, we should be able to get it done before working hours, and I can start my development environment at 08:17 before leaving for work in the unlikely case I would need it.
And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.
I'm really happy with how well my use and our use can coexist without impacting each other. Thanks a lot for how well you are handling this. I'm extremely happy with the current setup; it really couldn't be any better. I don't even feel any slowdowns when working on StormPeak while we are using it for imatrix and quants. It is just RPC where things get a bit difficult, but even there it is just a matter of planning RPC tasks in a way that they have the least impact. It is absolutely worth it to do RPC imatrix computations, even if they require some effort and sacrifices, as those are the best openly available LLMs and the ones I end up using the most. The slight inconvenience of the RPC setup is nothing in comparison to what I went through to create the Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha finetunes.
You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.
Yes, you told me to do so:
@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.
When I saw your message and saw that DeepSeek-V3 was doing hfu while showing 24/24, I softlinked it back to /bpool and deleted it from /tmp, then started copying DeepSeek-V3-Base to /tmp, which I then softlinked once the copy was done. I wasn't aware that after hfu 24/24 there was still a DeepSeek-V3 quant left, nor did it matter, as it just ended up doing that one from the slow storage pool. The only unfortunate thing is that it somehow managed to run out of storage. Maybe because you copied the same file despite telling me I should copy it?
But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.
That is indeed quite unfortunate. I don't think there is much we can do about that. Maybe we could run some small imatrix tasks while doing RPC, but large ones will always have to wait. The best mitigation for sure is to always complete the imatrix queue between RPC tasks, as we currently do.
I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations it is totally adequate (2TB would be too small though).
Should it at some point no longer be enough just let me know and we could consider adding a third SSD to spool.
And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.
It's currently not possible to automate RPC, mainly because I have to physically move the RTX 3080 GPU from StormPeak to CastlePeak - at least until I buy another GPU. I could automate shutting down services, and the configuration part on your side could maybe be automated as well. Luckily models requiring RPC are so rare that automating them is not a big concern, and doing them manually allows us to carefully plan when we do them to minimize their impact.
Oh, we also had an uncommon number of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.
I like the strategy of moving some things to temporary storage, as that way I can use the storage for other projects when we are not currently doing big models. That way we can make optimal use of storage resources at the cost of some additional work. I will soon switch bpool to btrfs, increasing its performance and making sure it will always be reserved for AI workloads.
Any idea what is going on? I don't currently see where these extra 600G could be.
I will investigate this and let you know once I figured it out.
Personally I don't think doing anything during RPC would be worth the time
The only thing worth it would be running hfdprep or quantisations, unless somebody eagerly waits for an 8b imatrix - doing small imatrix ones between big rpc jobs is fine - when I only look for models once per day, we already have 24h maximum latency...
Didn't expect that to save 4
Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...
Maybe because you copied the same file despite telling me I should copy it?
Maybe. I am wholly confused now.
I don't think there is much we can do about that. Maybe we could run some small imatrix task
Well, doing some imatrix between rpc ones is already helping, and is usually good for a few days. But queueing theory says that arrival times will be clumpy, so it's just unfortunate that we had such an overload :)
In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.
The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before it starts the next 405b job, as it should, but that has never been tested. But with some luck I'll be awake watching. If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.
Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...
It will be right, as both of the DeepSeek-V3 RPC imatrix jobs were faster than expected as well. At first I thought maybe MoEs are just faster, but now it's clear that it's the setup, maybe in combination with some llama.cpp improvements.
In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.
Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective. Having to move the GPU back and forth between StormPeak and CastlePeak for every model we want to do over RPC would be quite time consuming. The GPU is too heavy for CastlePeak and so requires single-use cable ties to prevent it from sagging so much that the GPU fan hits the cables below, while on the StormPeak side the power cable is a bit too short, so it takes a while to get it in and out, but an additional GPU would solve these issues.
In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again
That is so scary. I'm glad you were able to prevent it from failing just in time.
Any idea what is going on? I don't currently see where these extra 600G could be.
If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.
Turns out the culprit was the deleted 800 GiB EXT4 image I used on 26th December to convert the DeepSeek models into the BF16 base model. It was still using around 750 GB of storage despite being empty and deleted. I deleted it via the Proxmox UI and the image was gone, but the storage wasn't freed because there was still a terminal open somewhere that had that folder as its working directory, which apparently is enough to prevent it and its contents from being deleted.
lsof | grep spool
bash root cwd DIR 0,0 16 256 /spool/images/107/vm-107-disk-0 (deleted)
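For future reference, a minimal sketch (plain lsof/proc usage, not taken from any existing script here) for spotting space pinned by deleted-but-still-open files or directories:
# open files on /spool whose link count is 0, i.e. deleted but still held open
lsof +L1 /spool
# shells sitting in a deleted directory also pin it; /proc marks their cwd as "(deleted)"
for pid in $(lsof -t /spool 2>/dev/null); do
  readlink /proc/$pid/cwd | grep -q deleted && echo "pid $pid has a deleted cwd"
done
# cd / in that shell (or kill the process) and the space is released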
The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before it starts the next 405b job, as it should, but that has never been tested. But with some luck I'll be awake watching.
Let's hope that works out. I'm also hoping the RPC servers can do this without a restart, but they probably can. Should they crash, I made it so they immediately restart, and in the worst case you can even SSH into them or wait for me to be awake. Even if we start it on Sunday noon it will still easily finish before Monday 08:17, assuming it only takes 16 hours.
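For illustration only, an auto-restarting setup could look roughly like this (unit name, binary path and flags are assumptions, not the actual configuration):
# hypothetical systemd unit so a crashed RPC server comes right back
cat > /etc/systemd/system/llama-rpc.service <<'EOF'
[Unit]
Description=llama.cpp RPC server
[Service]
# placeholder path/arguments - substitute the real rpc-server invocation
ExecStart=/opt/llama.cpp/build/bin/rpc-server --host 0.0.0.0 --port 50052
Restart=always
RestartSec=2
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now llama-rpc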
We unfortunately experienced an OOM event on StormPeak which ended up killing the llama-imatrix process but ironically none of the RPC workers:
-2000 811 Hermes-3-Llama-3.1-405B-Samantha error/1 (GPU-2d) / 240.52s/c 588.1/1258.7m(938.7-1009.1) [183/314] 6.6381 (status: failure)
[Sun Jan 12 00:27:39 2025] Out of memory: Killed process 2080792 (llama-imatrix) total-vm:14656992kB, anon-rss:615044kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:10008kB oom_score_adj:800
ZFS is to blame for this. I forgot that it is 00:17 on the second Sunday of the month. By default, ZFS does all its scrubs then. Because ZFS developers lack some common sense they decided it is a good idea to do the scrubs of all the storage pools at the exact same time which leads to a massive resource peak. Because it is all at once it managed to eat up enough memory to OOM kill the llama-imatrix process. I'm quite surprised the kernel didn't OOM crash because with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 it really should have crashed.
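A possible mitigation sketch, assuming the stock zfsutils cron job is what fires all the scrubs at once (pool names below are just examples):
# disable the packaged all-pools-at-once scrub (the cron job checks the executable bit)
chmod -x /usr/lib/zfs-linux/scrub
# stagger per-pool scrubs on different nights instead
cat > /etc/cron.d/zfs-scrub-staggered <<'EOF'
0 3 * * 0 root /sbin/zpool scrub rpool
0 3 * * 3 root /sbin/zpool scrub spool
0 3 * * 5 root /sbin/zpool scrub bpool
EOF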
Most of the scrub tasks finished by themselves after a few minutes and the other ones I canceled. Thanks to your awesome preparation nico1 is not idle until you wake up but instead started working on the Hermes-3-Llama-3.1-405B-Uncensored RPC imatrix. I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it. The new plan is to finish Hermes-3-Llama-3.1-405B-Uncensored, let the other imatrix quants run and then immediately retry Hermes-3-Llama-3.1-405B-Samantha.
49 811 Hermes-3-Llama-3.1-405B-Uncensored run/imatrix (GPU-18) / 232.73s/c 208.9/1218.0m(1042.7-10932.4) [6/314] 3.2876
I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it.
I am even more surprised - the "pre-allocation" is because it just shows the object member which isn't cleared, it would be ignored when it is not running.
I would assume the failed job might still allocate resources (because the scheduler does not know in which state it is), and the other job has the force flag set to ignore the budget. Sucks.
Update: yeah, since it was force'd, it would simply ignore resource allocation, because I would need a distinct scheduling class ("rpc") to model separate resources. So the whole setup wouldn't have worked either way. Worse, if the scheduler had run for whatever reason, it would have immediately started the next rpc quant. I think I wanted to rely on the fact that the GPU allocation still does its job and reduced the number of gpus to 1, but then accidentally commented out that line again. Very unstable.
Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective.
I meant quantization - it would have been easy to only quantize deepseek-v3 and some smaller models in parallel. The reason why I did both together was so that I could give ...-base a higher nice level, so deepseek-v3 had priority. For smaller jobs I would have to code it into the scheduler instead of manually renicing.
Turns out the culprit was the deleted 800 GiB
I am so relieved :)
compute_imatrix: 76.42 seconds per pass - ETA 7 hours 44.85 minutes
That is for a 20B. That kind of thwarted my plan for quickly doing some imatrix calculations (the time has updated to 100-120min, but that's still remarkable for a 20B).
Must have been some weird nvidia thing - after 260 chunks it kind of normalised. But boy are we behind the schedule.
And unfortunately, I'll be gone for two hours. Will try to start the next model before I come back though.
Must have been some weird nvidia thing - after 260 chunks it kind of normalised.
No, it was your scheduler starting the Hermes-3-Llama-3.1-405B-Samantha RPC imatrix computation while doing the other imatrix computations and quantisation tasks.
[Sun Jan 12 15:44:52 2025] Out of memory: Killed process 3298227 (llama-imatrix) total-vm:801635412kB, anon-rss:586492kB, file-rss:9728kB, shmem-rss:0kB, UID:100000 pgtables:1553044kB oom_score_adj:800
It also crashed the GPU-only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.
Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:
49 811 Hermes-3-Llama-3.1-405B-Samantha run/imatrix (GPU-2d) / 236.79s/c 68.6/1239.2m(59811.9-2394.0) [9/314] 3.3868
-9001 689 I DeepSeek-V3-Base run/imatrix 17/24,Q5_K_S [89/1025]
It seems to work based on the available RAM so everything will be fine; just make sure to stick with one quantisation task while RPC imatrix is running:
It also crashed the GPU-only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.
Holy shit! We can also be lucky the rpc servers didn't accept both processes.
Update: ok, I see, not both processes, it was after the other 405b was finished.
Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:
Yes, unlike the double imatrix one, this is intended. I had some trouble understanding how nested systemd-run calls work w.r.t. resource limits - apparently, new scope == new independent limits, which is a bit annoying, because I wanted to run the quantize shell script and all uploads in the same scope, but quantize runs llama-quantize in its own scope, with again new resource limits.
It's because you were kind of... sounding... in an experimental mood yesterday, and I thought, now or never (the imatrix just having been started).
In any case, right now, there is still 26G of cache, so I guess we are not that tight. And deepseek has pretty tiny tensors (~8GB max unless I missed one).
Holy shit!
Seems the rule of "start if job is forced and current ram_usage is 0" somehow triggered despite ram usage obviously not being 0. I have no idea how that happened.
Just a heads-up: /bpool is no longer in use by me.
Why are currently so many imatrix tasks marked as blocked/imatrix/gpu?
Maybe because I paused them for a few hours yesterday, but I unpaused them long ago? While we are at pausing: would it be possible to have a separate /tmp/pause trigger for each GPU? I always end up having to pause both of them even if I only need one. Maybe we could get rid of /tmp/pause and implement pausing/unpausing imatrix tasks similarly to nico1-pause and nico1-resume so the scheduler is aware which GPUs are available. I'm currently using /root/handlePause.sh to pause/unpause, so if you have time feel free to edit this script accordingly by adding arguments to specify the action and GPU, and making it blocking so it waits for the specified GPU to finish its current imatrix tasks when paused.
Why are currently so many imatrix tasks marked as blocked/imatrix/gpu?
There were empty ".slog" files for each of those on kaos. Basically the screen/job output. But no .status file (with the exit code). As a result, the scheduler had no idea what state they were in and left them alone.
This is usually the result of a job either still running, or being killed without having a chance to write the exit code. For example, when I press ^C in screen, it would be like that. But of course I did not.
Now, as to why it was like that... I don't know. They are all from yesterday afternoon 15:20-15:30 CET.
The touch file method of pausing them should be absolutely harmless - it's just the shell script looping, i.e. for the scheduler, it should just be a longer job.
The log file does not show anything of interest (e.g. for Anubis, it downloaded the gguf, detected its size, then didn't start it because other jobs were running), it did continue to queue others, so it wasn't immediately obvious. Maybe I did something at the time, but I don't remember.
I suspect it's the problem where screen (apparently) recreates a zero-byte log file long after the job is finished, i.e. job sets exit status, scheduler cleans up all files, screen recreates the log file, scheduler is stumped. Possibly because kaos was so busy at the time. It is somewhere on my todo list to either change how log files are written or get rid of screen, which did its job during development. But you know, everything will subtly break when I do that, so ... :)
imatrix tasks similarly to nico1-pause
Actually, there is, but not per-gpu. It would have been exposed fully whenever I get around to letting you take control of the queue etc., alas, life. I'll think about it.
A /tmp/pause.gpuid or so would be a quick fix, but that will just block a job. I once suggested a config file that would be fetched before each job is scheduled. But I'll try to do something more intelligent on the server side.
echo pause GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
echo resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
The other gpu "uuid" is GPU-188a5143-db69-7058-63b5-f2f1d2354f91
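A hypothetical wrapper (the name and GPU numbering are just my own convention, reusing the commands above) could look like this:
#!/bin/bash
# usage: gpu-pause.sh pause|resume 0|1
case "$2" in
  0) uuid=GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc ;;
  1) uuid=GPU-188a5143-db69-7058-63b5-f2f1d2354f91 ;;
  *) echo "usage: $0 pause|resume 0|1" >&2; exit 1 ;;
esac
case "$1" in
  pause|resume) echo "$1 $uuid" >/dev/tcp/10.28.1.1/16713 ;;
  *) echo "usage: $0 pause|resume 0|1" >&2; exit 1 ;;
esac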
I'm testing it right now.
Works for me.
I should mention that there is no feedback for this pause on the status screen. I'll probably change how that is reported, too.
All pause flags are shown in the status header now:
last updated: 2025-01-19 13:42:01+0100 (1s) (imatrix.GPU-188a5143-db69-7058-63b5-f2f1d2354f91)
echo pause GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
echo resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.
All pause flags are shown in the status header
That's perfect.
Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.
It's indeed great for the future, but so far, that wasn't holding us back. What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)
@nicoboss Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)
In other news, I'm finished queuing everything I'd ever wanted to queue from February to December last year. On to Richard's list.
Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)
I did pause the second GPU intentionally around an hour ago to give Guilherme34 the opportunity to test his new models. Guilherme34 needing some GPU resources today is the reason why I asked for the single-GPU pause feature to be implemented, and I'm really glad to have it. I would usually give him the RTX 3080 but I'm currently using it myself.
What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)
Having many exciting new models is awesome, so don't worry about them delaying our progress on the low-priority ones. We will eventually get to them. The model backlog has already shrunk massively compared to our peak of over 4000 models.
In other news, I'm finished queuing everything I'd ever wanted to queue from February to December last year. On to Richard's list.
That's awesome to hear! We are making such great progress.
I did pause the second GPU intentionally
That's a relief :) I forgot about the timing and my command history from testing was a bit jumbled, so I really wasn't sure.
That's awesome to hear! We are making such great progress.
Yeah, and on to january and 2023 g
Venting: sometimes, it is the little things. I am trying to automate (some) llava mmproj extraction.
fname_out = f"{model_name.replace('/', '-').lower()}-vision.gguf"
Of course, the output filename is not configurable. Sigh. Why would anyone go to these lengths to make the output filename hard to guess.
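A possible workaround sketch, assuming the converter writes that filename relative to the current directory (the invocation below is a placeholder, not the real script name):
# run the converter in an empty scratch directory, then rename whatever *-vision.gguf appears
tmp=$(mktemp -d)
( cd "$tmp" && python3 /path/to/vision-extract.py --model "$MODEL_DIR" )   # placeholder invocation
mv "$tmp"/*-vision.gguf "$OUT_DIR/$MODEL_NAME.mmproj.gguf"
rm -rf "$tmp"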
@nicoboss rich1 seems to hang again (ssh does not greet, but wireguard pings still work)
Wow that was fast. After waiting 15 minutes, I decided to notify you (cry for help), and seconds later, the problem seems solved :)
OK, I think that extracting vision data from models takes enormous amounts of memory, multiple times the size of the whole model data, apparently (32GB is not enough to extract a 7B), and this caused the hang.
Sigh. This does not work out.
That practically means nico1 is the only box that can do vision model extraction.
You have any cool ideas around this? Because that means I have to schedule certain model architectures on certain hosts now.
@mradermacher my server was hanging from mmproj for some reason, so I guess please don't generate it there. I guess it's because it doesn't have enough ram
Wow that was fast. After waiting 15 minutes, I decided to notify you (cry for help), and seconds later, the problem seems solved :)
I was sitting there for quite a while trying to find the source of the DDOS attack lmao
That practically means nico1 is the only box that can do vision model extraction.
I'm fine with having all the vision models on nico1.
You have any cool ideas around this?
Have you tried to just cgroup limit the mmproj extraction and see what happens? Unfortunately I'm quite certain it will crash with an out-of-memory error, as I had similar issues back when I did the mmproj extraction for https://huggingface.co/mradermacher/model_requests/discussions/415.
@mradermacher
When would it work best for you to start with the DeepSeek-R1 imatrix computation? I would need to reboot StormPeak and move the RTX 3080 GPU back to CastlePeak before we can start, which requires me to pause nico1. We could start tomorrow late morning/early afternoon as by then the DeepSeek-R1-GGUF static quants should be done and there is enough time for you to prepare a Q8 model to be used for imatrix computation. It also gives enough time for the morning imatrix queue to be completed.
Have you tried to just cgroup limit the mmproj extraction and see what happens?
Yup, it gets killed.
We could start tomorrow late morning/early afternoon
Sounds like a good tentative plan, modulo disaster. And I can probably even quant during the imatrix computation. Non-vision models, that is. I hope I can make some inroads with the queue, but it's close to being normal finally - only two 70Bs left on nico (much worse elsewhere, but we are getting there).
(Well, I'll probably only have a bit of time during noon to prepare, that's the only issue with the plan)
I'm fine with having all the vision models on nico1.
I've already changed job adding so it does that now. It does add horrible dependencies though, such as not having defined memory requirements for quanting. And it's probably a one-line change to fix (use_temp_files=True or so). I really don't understand why the llama.cpp developers think memory is free.
I'm actually scared to look at the code, because I can't fathom why a 15GB model resulting in 1.4GB vision tensor output would need more than 64GB of RAM to produce. Do they expand it to double or simply load the model twice? I mean, what else could it be?!
I will queue new jobs before I go to bed, and then probably before noon, then force as many jobs as reasonable on other nodes so they can get some imatrix computations in.
It gets better and better. Apparently some qwen2vl vision models insist on cuda.
Yeah, it seems two 70B models vision extraction triggers the oom killer on nico1. This is troubling.
Update: yeah, 270g peak for a single 70B model. And all it outputs is 1.4GB. It must load and convert all tensors to f32.
Well, I'll probably only have a bit of time during noon to prepare, that's the only issue with the plan
I will try to have everything ready by noon.
And I can probably even quant during the imatrix computation.
Yes, you can quant non-vision models while RPC imatrix is running, but maybe only with one concurrent task.
Yeah, it seems two 70B models vision extraction triggers the oom killer on nico1. This is troubling.
Can you somehow have the mmproj task check if another mmproj task is currently running and if so, wait for it to finish? That way we should never OOM unless there is an absolutely massive model. It makes me happy to finally see the OOM reaper do its job instead of letting the kernel crash. I'm currently using slightly over 100 GB myself so that likely contributed to the OOM situation as well.
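If it ever has to be enforced externally, a minimal flock sketch would do (lock path and the extraction command are placeholders):
# allow only one mmproj extraction at a time on this host
exec 9>/tmp/mmproj.lock
flock 9                          # blocks until any other extraction has released the lock
run-mmproj-extraction "$MODEL"   # placeholder for the actual extraction step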
It gets better and better. Apparently some qwen2vl vision models insist on cuda.
Good thing we are doing them on nico1, but let's hope they don't need as much GPU memory as they need RAM.
It does add horrible dependencies though, such as not having defined memory requirements for quanting.
It will probably just steal mmap RAM from the imatrix tasks and then free it again once it's done, so it shouldn't be an issue as long as you don't run multiple of them at once.
I can't fathom why a 15GB model resulting in 1.4GB vision tensor output would need more than 64GB of RAM to produce
I'm now somewhat intrigued what they are doing as well. Seems quite ridiculous.
Can you somehow have the mmproj task check if another mmproj task is currently running and if so, wait for it to finish?
The mmproj task is the "noquant" task, and the default is to only run one. It was only a problem because I did maybe 15 models tonight, and let up to 6 run concurrently.
The bigger issue is a) interference with other big tasks such as imatrix and b) cuda.
[cuda] Good thing we are doing them on nico1
Actually I currently have to skip them because I compile all quant-related stuff without cuda, and some libraries like to pick up cuda if its available, and I don't want to install cuda libraries on all hosts. So far, it affected maybe 4 models, and the problem is bitsandbytes.
It will probably just steal mmap RAM from the imatrix tasks and then free it again
Yeah. I'll have a look and see if something obvious can be done about it. But I think you noticed how much I like maintaining forks :)
I'm now somewhat intrigued what they are doing as well. Seems quite ridiculous.
Oh, my, I would never try to stop you from having a look yourself :-)
Currently I only support qwen2vl, btw.
@mradermacher The RPC servers are now ready to be used for DeepSeek-R1-GGUF in Q8 (F16 obviously won't fit). I updated them to latest llama.cpp.
I slightly changed the weights distribution to put slightly more layers on CastlePeak, so if it fits with that configuration we might have slightly more RAM available on StormPeak while the RPC imatrix computation is running.
Morning. Haha, that brutally didn't work out. I don't even know why imatrix calculations stopped. Sigh. I'll try to find out.
Ah, OK, most did get through, but again kaos was apparently too busy for some. Hmmhmm.
OK, things are not that bad, nico1 is pretty empty. I'll see how far I get with noromaid, but probably by the time everything is ready it will be through as well.
Ok, not perfect, but we are all set to go. Unfortunately, I will have to remove the override manually once the imatrix jobs have cleared, and I will probably be a bit busy when it happens, but I will give my best :)
@nicoboss actually, I have touched /tmp/pause on nico1. The job should start but pause when one gpu is free, so whoever sees both gpus free first can rm that file.
Actually, sorry for the noise, there actually is code that should only start it once all gpus are unused, so I unpaused and will hope for the best.
@nicoboss
Also, regarding DeepSeek-R1-Zero, do you think we can have a Q8 in time? If you manage to convert it, you can rm -rf tmp/quant/DeepSeek-R1-Zero to free some space, and maybe make a quantize from it to /tmp/DeepSeek-R1-Zero.Q8_0.gguf, and I can set up the job so it will start once the previous job is done, or so.
Update: I also wish space would exit the @name autocompleter.
Haha. Everything configured correctly (a first!), but I managed to put the quant into / not /tmp. And then I moved it to ~ instead of /tmp. Smooth operator :)
regarding DeepSeek-R1-Zero, do you think we can have a Q8 in time?
Yes, I can by juggling things around. I can convert the model to BF16 onto some SSD NFS network storage, then delete the HF model on spool and put the source GGUF on spool. Then I can move the source GGUF back to the NFS network storage and Q8-quantize it to spool. Possible but harder than usual. I will put it to /tmp/DeepSeek-R1-Zero.Q8_0.gguf once done.
Everything configured correctly (a first!),
That's awesome to hear. Everything is looking great so far.
but I managed to put the quant into / not /tmp. And then I moved it to ~ instead of /tmp. Smooth operator :)
No problem. Luckily that is a relatively quick error to fix. Stupid mistakes like this happen to me as well when I'm distracted.
Doing the BF16 conversion with such limited resources was much harder than I thought. NFS can only be used within privileged containers, so I had to create a new one and mount spool into it. Then I had to set up and mount the NFS share and copy over all the BF16 scripts. Once all of this was set up, I tried to run fp8_cast_bf16.py just to realize it requires CUDA because it uses Triton. I then had to copy over and install the NVIDIA drivers and figure out how to give a privileged container GPU access, which was different than for an unprivileged one. I then tried running it with 12 GiB of RAM and immediately OOM-crashed the entire container, and thanks to the NFS share the container got stuck in kernel mode and didn't even want to stop/start anymore. Now I gave it 25 GiB RAM and it seems to be happy, and luckily it also doesn't make the GPU run out of GPU memory. I also underestimated how many resources an NFS server needs and only gave it 8 cores and 4 GiB RAM, so it now spends 75% of the CPU running system code, making things take a bit longer than expected. In any case, most importantly it works and it will eventually be done.
I will let the BF16 conversion finish overnight and then run the much simpler and hopefully faster conversions during tomorrow morning, so we can start DeepSeek-R1-Zero at lunch time if the conversion finishes by then. That way we again have a morning for all the imatrix computation tasks to be completed.
Besides that, I also somehow managed to get a now-unkillable process stuck busy-waiting on /sys/bus/pci/drivers/nvidia/unbind because I forgot that moving the RTX 3080 from StormPeak to CastlePeak caused the PCIe IDs to change. So should one of the RPC servers crash for whatever reason, one of the RTX 4090 GPUs might permanently disappear until I fix it.
wow, i feel your pain. and even more luckily, the imatrix calculation survived so far. and nice juggling!
but to be honest, for me, solving these kind of problems under resource constraints is the most fun. it's like a puzzle. hacking computers can be the same kind of fun, or was, before these pesky buffer overflows became the norm :-)
/tmp/DeepSeek-R1-Zero.Q8_0.gguf
r1 will probably be finished while I am still fast asleep, or close. then there will be a bit of time where some other imatrix quants can be done, and if the quant is there, i will relatively quickly be able to start r1-zero.
also, the bf16 => q8_0 conversion is likely going to be I/O speed (maybe 40 minutes or so :)
/tmp/DeepSeek-R1-Zero.Q8_0.gguf
and all the RPC servers are updated to the latest llama.cpp and ready. nico1 is currently paused as it would otherwise run out of storage. I will resume it in around 15 minutes once the DeepSeek-R1-Zero source GGUF is done being moved to the network disk.
Morning :) Ok :)
Good morning! I resumed nico1 and everything is now ready for the DeepSeek-R1-Zero RPC imatrix computation.
DeepSeek-R1.not-override ?
(haha, cute, hf offers a translation for this post :)
DeepSeek-R1.not-override ?
nico1 was idle as it completed all important models. I then checked and there was enough RAM and storage available. I wanted DeepSeek-R1 quants to be computed so I can try them out tomorrow. Unfortunately the task was in a blocked/override state. I saw that there is a DeepSeek-R1.override file, so I thought that this file might be what causes it to be in this state, so I renamed it to DeepSeek-R1.not-override. Not sure if it worked or if you ended up unpausing it, as nothing happened for quite a while, but just when I wanted to write you about it, it started quantizing it. The main reason I renamed and didn't delete it is so I can easily rename it back and you are not confused about why DeepSeek-R1 is no longer blocked.
The only issue with doing DeepSeek-R1 quants is that we cannot really tolerate upload failures, so maybe we have to pause it again when I go to bed, as then there is nobody to monitor and interfere should it run low on storage. Ideally we would have it always wait for the upload to be completed before starting with the new quant but that is likely too much effort to implement. Regarding what to do if an upload fails, I thought about "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause to make it wait for the upload, but last time I tried this it didn't really work for me. I would obviously try that first and if it doesn't work, heavily limit the CPU so it quantizes around 10 times slower.
What would be the correct way to make it stop after the current quant? Creating DeepSeek-R1.override or DeepSeek-R1.interrupt? Because that is something I should probably do before going to bed, as I really don't want it to run out of storage when left unattended.
(haha, cute, hf offers a translation for this post :)
Haha nice. It tried using facebook/nllb-200-distilled-600M.
Well, it wasn't my plan, but good to see you learn the ropes :) Yes, the file is what puts the job into override mode, but the scheduler always has to run. Which happens from time to time when other jobs finish.
You can force it, until I provide the (as of yet mythical) llmc command, using echo push >/dev/tcp/10.28.1.1/16713. This would have more or less immediately started the job. You can also telnet 10.28.1.1 16732 to get the status daemon and press return to ask it for an update so you don't have to wait for the web page to update (q + return quits).
The only issue with doing DeepSeek-R1 quants is that we cannot really tolerate upload failures
The quantize script itself should pause when df reports less space than 1x the gguf or so, which should keep it from doing bad things if only one job is running that eats up space. But, yeah, who knows what will happen, and right now, the configured budget for nico is ~1.5TB more than normal, so it's good to keep an eye on it.
Update: QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
that should be 700G minimum free. The problem is that for very big jobs, I sometimes disable this check via touch /tmp/ignoredf, so that should be removed (I removed it, it was actually on nico1, and I am prone to forget about it).
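So, spelled out, the behaviour is roughly this (a simplified sketch, not the actual quantize script):
QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
while [ ! -e /tmp/ignoredf ]; do
  free=$(( $(df --output=avail -B1 /tmp | tail -1) ))
  [ "$free" -ge "$QUANTDISKUSAGE" ] && break      # enough headroom, carry on
  echo "low disk: $free bytes free, waiting for uploads" >&2
  sleep 60
done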
Ideally we would have it always wait for the upload to be completed before starting with the new quant but that is likely too much effort to implement.
It is implemented, but right now, the limit is configured to be 16 on nico1. We did run deepseek quantize before, with even less diskspace, and it worked, so I would not worry.
Regarding what to do if an upload fails, I thought about "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause to make it
This will take effect (in screen) when the script tries to output something. As long as it is quiet, it will not hang. It will pause a running quantize, though.
In worst case, ctrl-c it.
And, in case you want to know, ctrl-c means the job will still "run" because it couldn't write an exit status. If that happens and you are sure the job doesn't run (use ils), or if the job failed and you want to restart it, delete /dev/shm/JOBNAME.log (logfile) and ...status (exit code). And then "push" the scheduler, and it will retry. You could practise some time in the future (preferably not on deepseek :-)
What would be the correct way to make it stop after the current quant?
Yes, create an .override file and an .interrupt file, it will check after every quant and the scheduler will remove the .interrupt file.
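Concretely (assuming the control files live where the other job files do), stopping after the current quant would be:
# hypothetical paths - the directory is wherever the job's control files live
touch DeepSeek-R1.override    # keep the scheduler from (re)starting the job
touch DeepSeek-R1.interrupt   # make quantize exit after the current quant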
Thanks a lot for taking the time to provide me all this valuable information.
Well, it wasn't my plan
If you want to run another model feel free to do so but DeepSeek-R1 seems to have the highest priority according to your own metric.
but good to see you learn the ropes :)
I still plan to someday help you manage the queue, and for that I'd better get familiar with the system. I'm slowly starting to understand it.
Yes, the file is what puts the job into override mode, but the scheduler always has to run. Which happens from time to time when other jobs finish.
Great to know. I expected that this is why it was delayed, because I remembered this mechanism. I think back then it triggered when something happened or at 07:00 in the morning, but if I remember correctly, we changed it to be like once an hour or when something happens, back when you implemented the advanced nico1 electricity cost optimization.
You can force it, until I provide the (as of yet mythical) llmc command, using echo push >/dev/tcp/10.28.1.1/16713 This would have more or less immediately started the job.
Thanks. That will for sure turn out to be really useful.
You can also telnet 10.28.1.1 16732 to get the status daemon and press return to ask it for an update so you don't have to wait for the web page to update (q + return quits).
I still remember the telnet version of the webpage. The webpage updates relatively often but might still be useful to have even faster status updates.
q + return quits
That was one of the main reasons I barely used telnet status page. I didn't understand how to get out of it without closing the entire terminal.
QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
That is awesome. In that case we can just let it run over night. Realistically, even without any limit there would have to be such massive upload failures that it is quite unlikely to happen, but this protection should make an out-of-space event almost impossible.
/tmp/ignoredf
I was wondering about that file earlier today. Thanks for explaining it.
It is implemented, but right now, the limit is configured to be 16 on nico1. We did run deepseek quantize before, with even less diskspace, and it worked, so I would not worry.
The upload limit is quite cool and I remember it from the past when I had terrible internet. No need to change that now thanks to the much better auto-pause during low-disk situations.
This will take effect (in screen) when the script tries to output something. As long as it is quiet, it will not hang. It will pause a running quantize, though.
Last time I attached to the quantization screen session and it seems to have just ignored the shortcut and kept outputting things. Maybe I tried using screen inside tmux or had some other strange setup that made it not work. I'll try again on a not-so-important model in the future.
In worst case, ctrl-c it.
That would be quite sad but yes, it can be done in a worst-case scenario to prevent RPC imatrix from running out of space. But before that I can just set all cores but 2 offline to make the entire LXC container almost pause, except networking, which would still be fast due to being handled by the kernel. Not sure if you ever realized, but I completely switched to changing the number of CPU cores to load-balance nico1 with any other CPU resources I might need on StormPeak. It seems to work much better than adjusting the CPU limit or CPU units.
Yes, create an .override file and an .interrupt file, it will check after every quant and the scheduler will remove the .interrupt file.
I get it, so after every quant it blocks if there is an .override after it was interrupted using .interrupt, so putting both will have the desired effect.
If you want to run another model feel free to do so but DeepSeek-R1 seems to have the highest priority according to your own metric.
I normally manually manage things when we force-schedule big models, but you made the right decision according to the data you had. Even if it wasn't my decision I would be happy if you continue being more active like this, and I want to provide more tools so you can do so.
but if I remember correctly, we changed it to be like once an hour
At the moment (and for many months) it is purely event driven again, i.e. without anything "push"ing it, nothing will happen.
Unrelated: in recent days, I sometimes found jobs to be "idle", which can practically only happen when a push gets lost. The push is at the moment literally the echo I gave you - it was a quick hack, without error checking or retrying. But it worked fine, so I wonder what happened recently.
That was one of the main reasons I barely used telnet status page. I didn't understand how to get out of it without closing the entire terminal.
q+return is a relatively recent addition. Also, very few people remember the telnet escape (ctrl-altgr-] on german keyboards, then "close"+return). Can't say I ever used (unix) telnet for actual login, only as a simple tool to connect to a tcp port. You could use socat stdio: tcp:10.28.1.1:16713 or so I guess. Or netcat. But I always found telnet to be most convenient for such testing. (I used ncsa telnet on dos extensively, though..., which shows my age :)
The upload limit is quite cool and I remember it from the past when I had terrible internet.
It is pretty recent - I had some hack in the quantize script.
If I never explained that to you, the architecture is like this: llmjob is a perl script that copies itself to all hosts and manages the jobs. I don't think perl is your language of choice, so I won't recommend looking at it. Also, it's full of ad-hoc code. noquant and quantize phases are done by a bash script called "quantize" - I think you are quite good with posix sh. Not sure why I think all that, but it's the impression I got. I think all sources are on all machines, too, in case you ever need to look at it. There is also "imatrixjob-remote", which runs the imatrix jobs - logically, all imatrix jobs run on kaos/10.28.1.1, so you don't see much of it. It's basically a hacked copy of a hacked copy of llmjob.
I wouldn't design it like this if I would write it again, but the basic design is ok. And I think over the last year, it evolved quite a bit, and the oldest/most stable parts have been refactored into something nice. The scheduling and queuing algorithms are the most hacky atm.
Maybe I tried using screen inside tmux
You'd need to make sure all the keys are indeed sent through all layers. It's quite annoying. For screen-in-screen (a relatively common case for me), it's just ctrl-a ctrl-a s. Or killing an ssh in a screen in ssh would be "return ~ ~ ."
But it should be possible. You can practise by either setting it up yourself or waiting for some output after XOFF, and then seeing if a lot of output appears after XON. Because when done right, you should see the output continue as if it were frozen, not continue as if it was buffered and continued in the background without you seeing it.
Well, that was not a good description...
You can also send stop signals. e.g. with "ikil -STOP ". We know it works because that's what the cronjobs on nico1 do at 17:00/22:00/07:00
I get it, so after every quant it blocks if there is an .override after it was interrupted using .interrupt, so putting both will have the desired effect.
Uhm, functionally that's correct, but let me clarify this: quantize (the script that runs llama-quantize or convert-hf-to-gguf) will check for an .interrupt file and exit with a special exit code. It does not care about the .override file.
But the scheduler (llmjob, triggered from kaos) does care and will ignore jobs with .override when starting new ones (but continue managing running ones).
So if you'd set .interrupt alone, quantize would likely exit, then (if it's the top job in the queue) immediately start again, and then it would have to wait for all uploads to finish first - which is incidentally on my list of things to optimize.
--
Anyway, what I actually came here to write was that I am going to sleep now, and very experimentally, when deepseek is done, almost everything should automatically return to normal, i.e. nico1 should start quantizing two jobs (eventually, the "push" event for this does not exist, so it will have to wait until something pushes it).
Or maybe everything will implode in various ways. But at least I am pretty sure the imatrix.dat will be safe :)
@mradermacher I think DeepSeek-R1 i1-Q2_K_S huggingface upload got stuck.
DeepSeek-R1 has been in run/imatrix 12/24,IQ1_M,waiting for prev hfu, df (hfu i1-Q2_K_S) since I woke up 3 hours ago.
stat 'DeepSeek-R1-i1-GGUF-DeepSeek-R1.i1-Q2_K_S.gguf*.log'
Access: 2025-01-24 11:45:01.247003144 +0100
Modify: 2025-01-24 06:08:14.583974792 +0100
Birth: 2025-01-24 05:49:14.126076567 +0100
The log also contains an interesting error:
(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 3aa3b504-5be6-4274-b3bf-f7837edcc6fb)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/72/60/7260819ed2f1619e5a91bb148b0eff76fca17b11fa4502382724c6eb4ebc5bcd/df9da4c9f10d5956db5e2f928c411f7b847de87fd9c4722438b631d35438b32d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=<CENSORED>&X-Amz-Date=20250124T045302Z&X-Amz-Expires=86400&X-Amz-Signature=<CENSORED>&X-Amz-SignedHeaders=host&partNumber=62&uploadId=<CENSORED>&x-id=UploadPart
I see no upload bandwidth utilization while nothing but i1-Q2_K_S is uploading, indicating it is likely not doing anything.
Now we know why ignoredf was set. My explanation was incomplete. Yes, it waits for all uploads when diskspace is < QUANTDISKUSAGE, but it also waits for the previous upload to finish when disk space is < QUANTDISKUSAGE * 4.
Meh.
Well, that overallocation saved my ass many times.
Now for the upload, I think there is a bug somewhere, such as not closing the other end of the pipe or so: the python upload process does not exist anymore but the parent is waiting for something (likely the python process). That shouldn't be the case, as it's a >>30 year old well-tested library I am using for that, so it's probably something else.
I am pleased enough that nico1 correctly switched back to the normal job limits on its own.
Ah, no, python is still running. Right, I forgot that children of llmjob are not being tagged, so they don't show up in ils. I'll have to rectify this. Then it's probably that bug that sometimes happens where the huggingface libs simply print an error and then hang instead of reporting it to the caller. I'll investigate some more, but likely there is nothing I can do about it, I have to rely on python throwing an exception or returning form the call.
some python threads are waiting for these:
python3 1934007 root 7u IPv4 10146651 0t0 TCP mradermacher.nico.re:40512->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
python3 1934007 root 9u IPv4 10628179 0t0 TCP mradermacher.nico.re:40530->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
python3 1934007 root 11u IPv4 10472890 0t0 TCP mradermacher.nico.re:40520->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
And I suspect that the remote end does not know about these connections anymore. Unfortunately, my little timeout wrapper is loaded:
python3 1934007 root mem REG 0,40 1842788 /llmjob/share/hfu-preload.so (path dev=0,101)
So let's see why that one doesn't trigger.
Ah right, I didn't realise python would use multiple threads, so my solution with alarm() is obviously broken. That, uhm, complicates things a "bit".
Anyway, imagine for some reason you wanted to just kill this upload and retry (which you wisely didn't so I can look at it), then you have options.
The code that waits in quantize is this:
iwait $PIDS || true
(quantize is one of those very few shell scripts I wrote that use set -e). But the || true means it is safe to kill without immediately causing havoc, and in this case, it is safe to kill the iwait child of quantize, and then quantize will simply continue as if the upload had finished, which is what I did. That would preserve the upload processes for inspection.
Or, what I will do now that I think I understand why my wrapper didn't help (ok, I'll still have to attach gdb to see if I am likely right :) is kill the python subprocess that should be part of the ils output but isn't yet, or the "llmjob hf-upload-folder" caller. That will cause the upload to fail, but the parent process (the hfu wrapper that started as a single line....) will retry.
PS: If you don't particularly enjoy reading through these thoughts, I will not be sad if you say so and I will be shorter next time. I suspect it does help me to document these things to somebody involved, though :)
Alternatively, maybe I could just enable tcp keepalive on connect(). That would be a much more sexy solution than calling poll() before every read... Hmm....
Update: better yet, let's do it at socket() time, then I don't even have to check that it is a tcp socket.
Update 2: Exciting. I've never configured keepalive parameters programmatically.
Update 3: even more interesting, keepalive enabling is a generic socket option, not a tcp-layer specific one. Only the actual parameters are tcp-specific. Are there any other protocols that even implement this?
let's see if this works better:
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [5], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPCNT, [20], 4) = 0
So, the above method works in the sense that it shuts down connections (whether due to keepalive or not, I can't tell), but python is still hanging, because the other side does not close the connection. It's actually quite interesting. Clearly, the other side is not interested in replying. Could be a very misconfigured firewall on the cloudflare side (cloudflare needs to die urgently) - the remote host is pingable and connectable on port 443, so I suspect it's typical cloudflare brokenness and shit all over the internet-ness.
I must admit I am not sure why keepalive doesn't completely kill the connection(s) here - either it's not enabled (but it seems to get enabled when I strace python3), or keepalive only shuts down the sending side, which doesn't make much sense either.
23:08:59.266571 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150262 ecr 1588015898], length 0
23:08:59.267570 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150263 ecr 1924325188], length 0
23:08:59.474572 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150470 ecr 1588015898], length 0
23:08:59.475572 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150471 ecr 1924325188], length 0
23:08:59.883573 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150879 ecr 1924325188], length 0
23:08:59.891577 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150887 ecr 1588015898], length 0
23:09:00.707580 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247151703 ecr 1924325188], length 0
23:09:00.771573 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247151767 ecr 1588015898], length 0
23:09:02.371575 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247153367 ecr 1924325188], length 0
23:09:02.435573 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247153431 ecr 1588015898], length 0
I think tcp keepalive wasn't enabled - python creates some sockets with a proprietary linux extension: socket(..., SOCK_STREAM | SOCK_CLOEXEC, ...) or other flags, and what's worse, there is no portable way to detect this, as the mask for the actual type is not exposed to userspace (being non-posix). Grm.
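One generic way to check whether keepalive actually got armed on those connections (not something from the existing scripts):
# -o prints the timer column; keepalive-enabled sockets show timer:(keepalive,...)
ss -tnop state established '( dport = :443 )'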
This turns out way more complicated than initially expected.
As a sidenote, it seems uploads have to wait after every "chunk" for an ack from the other side (probably what aws requires, never looked into that). Very interesting. At least it is a learning opportunity.
trying to pause rich1
rich1 ~# ./rich1-pause
./rich1-pause: connect: Connection timed out
./rich1-pause: line 7: /dev/tcp/10.28.1.1/16713: Connection timed out
[Exit 1]
interesting. Well, it seems that https://hf.tst.eu/status.html is showing it as unreachable. Can I reboot the server? I need to do a few things before it can run again
@richarderkhov A working network is required for the whole thing to work, yeah (neither wireguard nor ssh work) (I wonder why it keeps failing). In emergencies you can reboot any time, of course, I just have to clean up, so make it count please :)
@nicoboss you can copy the r1-zero gguf to /tmp/quant and, if you want, remove the override file and push.
Oh, it's already being copied :)
Quick status update regarding rich1. Richard decided to install Mail-in-a-Box on the host but missed that "Technically, Mail-in-a-Box turns a fresh cloud computer into a working mail server" was meant literally; instead of installing a mail server on top of the existing OS, it replaces the OS to turn it into a mail server, in a process that cannot be undone. We spent hours trying to save the host but it is beyond saving, so we will have to reformat it. This time with Debian 12 with Proxmox. rich1 should be available again tomorrow once the host is properly set up again.
The rich1 LXC container is on a separate disk and so, besides the extended downtime, should be unaffected. It seems all quants queued to rich1 were completed and uploaded before he stopped it, as the tmp folder seems empty. We made a remote backup of the rich1 container just in case.
Oh, it's already being copied :)
I started the copy first thing in the morning, but it took 4 hours and so only finished in the early afternoon. NFS ran at 600 Mbit/s, and this despite the source and destination disks being SSDs on the same server. The only reason I had to use NFS is because the SSD was assigned to a VM.
I already regret having dismantled the RPC setup, as llama.cpp support for more massive awesome models will likely come soon: https://github.com/ggerganov/llama.cpp/issues/11290
wow, lots of, eh, mixed news :)
an empty /tmp folder on rich1 would be surprising, but we'll see what's going on when its back up. shit happens :)
wow, never heard of minimax. but let's face it, if 4xxB models become commonplace, it might be prudent to use Q8_0 for imatrix. I don't have an issue with that.
Regarding rich1: we successfully installed Proxmox on it today. I unfortunately caused an IP conflict while setting up OpenWrt minutes after he went to bed, so I currently have to wait for him to use iKVM to fix this. I'm confident we can get rich1 working again tomorrow.
Regarding the reason why nico1 is currently offline: my ISP decided to do maintenance today from 01:00 to 06:00 and on 3rd of February from 05:00 to 06:00. I wasn't aware of it and spent quite a while diagnosing the issue because they had not put it on their website, but I then found it on the website of their upstream ISP. They usually inform me weeks in advance, but it could be that I missed that.
nico1 is currently reasoning finetuning DeepSeek-R1-Distill-Llama-70B-Uncensored. This is scheduled to take almost a day, but I will probably interrupt it at 0.5 epochs to not block imatrix quants for too long. I wanted to test auto_resume_from_checkpoints for the first time anyway. It also happened to be good timing with the internet outage.
wow, never heard of minimax. but let's face it, if 4xxB models become commonplace, it might be prudent to use Q8_0 for imatrix. I don't have an issue with that.
minimax is a completely new base model and so probably warrants the effort of doing it in 16-bit, even if it realistically will barely make a difference. The minimax model is extremely good, getting close to the much larger DeepSeek-V3. Likely because, while smaller in the sense of total parameters, it has more active parameters.
It suddenly felt so lonely... :)
That's a long maintenance interval, but it happens.
minimax is a completely new base model
So... you do kind of agree :) I don't expect minimax to suddenly become popular for fine-tunes, though, and I don't expect many finetunes of llama-405b either.
nico1 is currently reasoning finetuning DeepSeek-R1-Distill-Llama-70B-Uncensored.
btw., you could, if you wanted, let it quantize (if it doesn't do that already, most likely it will work on r1-zero) - if it stops, you could edit /llmjob/share/bin/llmjob, find this line:
} elsif ($cmd eq "slave-scheduler") {
and replace the rich1 a few lines below that by nico1:
if ($HOSTNAME eq "rich1") {
Then "llmjob slave-scheduler" will run the scheduler locally, which is currently disabled everywhere except on rich1.
I tell you not so much because I really want you to do that, but more to trickle knowledge about the internal workings to you. llmjob slave-scheduler is invoked at the end of every job, and because of some bug I am hunting it only tries to locally schedule jobs on rich1, not anywhere else. And oh my, it still uses bash to actually send a push to the scheduler afterwards, why did I look at that code.
The file will be overwritten automatically the next time kaos contacts rich1 (it replaces itself, so that only works if it's actually compiling, though).
In other news, I have a good lead on the weird job scheduling failures I have seen in the last month.
rich1 is alive again! I recommend to check if everything with it is fine and no work got lost. I forwarded TCP port 2222 for SSH and UDP port 7103 for WireGuard. rich1 now uses a similar Proxmox with OpenWrt router setup as nico1.
Since rich1 is online I see a lot of error/12 errors:
ram budget 490 use 0
0 ? Reasoning-Llama-3.1-CoT-RE1 error/255 (from rain)
0 ? Llama-3-Yollisa-SCE error/12 (from rich1)
0 ? SauerHuatuoSkywork-o1-Llama-3.1-8B error/12 (from rich1)
0 ? Janus-1.3B-LM error/12 (from rich1)
0 ? SJT-2.1B error/12 (from rich1)
0 ? Qwen2.5-7B-Instruct-1M-abliterated error/12 (from rich1)
0 ? Taurus-Opus-7B error/12 (from rich1)
0 ? DeepSeek-R1-Distill-Qwen-7B-RRP-Ex error/12 (from rich1)
0 ? SJT-990M error/12 (from rich1)
0 ? Qwen2.5-7B-RRP-ID error/12 (from rich1)
rich1 also has quite a few likely model-related errors:
rich1 nice size (static/imatrix) -- free 1219 budget 1057 uploads 0 hfd 1
0 17 si Llama-3-Yollisa-SCE-TopK_0.45 error/2 converting...
0 2 si ChainBlind-HadithIsnadParser-AraT5 error/1 missing spiece.model
0 2 si ChainAware-HadithIsnadParser-AraT5 error/1 missing spiece.model
0 2 si ChainBlind-HadithIsnadParser-withPrefix-AraT5 error/1 missing spiece.model
0 16 s Zurich-7b-GCv2-5m error/2 converting...
The reason for these errors is that most files from /tmp are gone. Also, I can't log in to rich1 normally (connection refused) - did the IP address or port change?
Something semi-catastrophic must have happened on rich1.
I wasn't there when it came back, so I am not 100% sure what the state was, but it is a bit fishy that all big jobs are missing. I wonder if the job queue was deleted as well. That means an unknown number of jobs have been lost, a lot of 70Bs as well.
-rw-rw-rw- 1 root root 5.2k Jan 29 02:09 backup_rich1_meta.cbor
-rw------- 1 root root 141k Dec 28 00:14 backup_rich1_meta.cbor.x
almost certainly. the .x file is a copy i made while rich1 was down. sigh, now I have to somehow extract the jobs from there.
Anyway, until the network is fixed, nothing much can be done about all this. I suspect the port forwardings are missing.
Indeed, just after midnight on the 28th, files in /etc/ were deleted, causing /tmp to be deleted on the next boot.
Any idea what else in the vm might have been changed? I'd rather start from debian than work with a partially corrupted vm with such surprises.
There are other changed files in etc, most benign (network/interfaces, hosts). But what would delete /etc/tmpfiles.d/tmp.conf, and why?
Just in case it ever comes up: I chattr +i'd tmp.conf, because if that file is removed it's quite disastrous, and I don't normally have a backup.
If I can trust the mtime, then the only obvious change outside of /usr is the tmpfiles.d/tmp.conf deletion.
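For reference, the guard boils down to roughly this (just a sketch - the exact contents of the original tmp.conf are an assumption on my part, not copied from the file):
echo 'd /tmp 1777 root root -' > /etc/tmpfiles.d/tmp.conf   # keep /tmp world-writable and never age-clean it
chattr +i /etc/tmpfiles.d/tmp.conf                           # immutable, so a stray cleanup can't remove it again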
resolv.conf also changed weirdly:
search example.com
nameserver 1.1.1.1
Is this the intended resolv.conf?
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
@nicoboss to summarize, so you don't have to read through my debug stuff:
- ssh (2222 => 22) and (more importantly) wireguard (7103 => 7103) forwardings are missing. the latter is required for nico1 to get a reliable connection to rich1
- on the 28th 00:02 (likely), /etc/tmpfiles.d/tmp.conf was removed, causing /tmp to be deleted, which causes a loss of all models and jobs. i was able to restore most jobs with some work from a backup, but I don't always have a backup. it is important to find out what happened so it can be prevented in the future.
- about 500GB of disk space seems to be missing, causing jobs to fail
Update: I tried a quick hack to regain connectivity, but somehow failed, so I think I now need ssh to be able to fix it.
ssh (2222 => 22) and (more importantly) wireguard (7103 => 7103) forwardings are missing. the latter is required for nico1 to get a reliable connection to rich1
This is fixed now. Sorry for the networking issues. ifupdown wasn't installed as it wasn't required with the old networking setup, so the /etc/network/interfaces set by the Proxmox host got ignored. The container instead used systemd-networkd, which resulted in it getting a random IP over DHCP, breaking the port forwarding rules pointing to 192.168.1.101. I have now installed ifupdown and enabled the networking service, so this shouldn't happen again.
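For the record, the restored forwardings amount to roughly the following, expressed here as plain iptables DNAT rules (just a sketch - OpenWrt normally keeps these in its own firewall config, and the WAN interface match is omitted for brevity):
iptables -t nat -A PREROUTING -p tcp --dport 2222 -j DNAT --to-destination 192.168.1.101:22
iptables -t nat -A PREROUTING -p udp --dport 7103 -j DNAT --to-destination 192.168.1.101:7103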
on the 28th 00:02 (likely), /etc/tmpfiles.d/tmp.conf was removed, causing /tmp to be deleted, which causes a loss of all models and jobs. i was able to restore most jobs with some work from a backup, but I don't always have a backup. it is important to find out what happened so it can be prevented in the future.
No idea who or what deleted this config. /tmp was empty after Richard stopped the container on the 26th of January. I don't think it will happen again as we are now using Proxmox to manage the container instead of LXC directly. Very unfortunate that we lost the entirety of /tmp.
resolv.conf also changed weirdly
That makes sense as Proxmox is injecting its own network configuration into LXC containers so nothing to worry about.
If I can trust the mtime, then the only obvious change outside of /usr is the tmpfiles.d/tmp.conf deletion.
You should be able to trust it, as the container is still pointing to the same rootfs folder on the same disk. We didn't copy or move the container at all.
I'd rather start from debian than work with a partially corrupted vm with such surprises.
If you want to start fresh just let me know and I can easily give you a new container. It takes 1 minute for me to create a new one and doing so would be cleaner.
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
This is because the same disk contains a backup of Richard's website and of rich1, just in case. The rich1 backup was unfortunately made when /tmp was already gone. I will delete the backups as soon as Richard confirms I can do so.
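If the discrepancy ever needs checking again, it is easy to confirm from inside the container - a sketch, assuming everything lives on the root mount:
df -h /     # counts everything on the underlying filesystem, including data outside the container's view
du -xsh /   # counts only what the container itself can see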
It all looks fine from my side, thanks for your work. The chattr +i should prevent accidental deletion in the future, but it is very weird. I could chalk it up to my script messing up and forgetting about it, but then it would have happened on previous reboots, and the directory had an mtime from when it was down. Very strange.
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
I deleted the backups half an hour ago so all the storage should now be available for you to use again.
It all looks fine from my side, thanks for your work.
Thanks. Great to finally see rich1 working again.
I could chalk it up to my script messing up and forgetting about it, but then it would have happened on previous reboots
I don't think we ever rebooted rich1 after we had to reinstall it after the LXC corruption incident.
I don't think we ever rebooted rich1 after we had to reinstall it after the LXC corruption incident.
No, but rich1 and the vm rebooted multiple times before, and once after, and the only time that file was created was when I initially ran my script to configure wireguard and other stuff (i.e. twice only). I can only imagine some script went around and either deleted all 0-size files or any file starting with tmp.* - just very weird. But who knows, maybe whatever script was run to essentially destroy rich1 also ran a find over the whole disk.
The only evidence is that something mucked with that directory on jan 28th, so it's unlikely to have been something that happened before. I was lucky that I made a copy of the queue just in case when it went down, otherwise restoring the jobs would be... difficult.
Thanks. Great to finally see rich1 working again.
Yeah, I was getting a bit desperate - nico1 much less than 50% usable for weeks, rich1 gone, and an unprecedented number of models, and big ones, too (I mean 70B..130B, not deepseek), made for very tense moments. All in all, it's making good progress despite it all, and we even made a tiny bit of progress on the nice 1000+ models.
Why does it say:
0 66 si Virtuoso-Medium-v2 error/255 repo create
The repository clearly exists under https://huggingface.co/mradermacher/Virtuoso-Medium-v2-GGUF - it is supposed to do static quants to that repo, as the status shows si.
Edit: Now that the imatrix is done it shows sI as status but is still stuck at error/255 repo create. Luckily it just skips this task and works on other tasks in the meantime.
Edit: Ah nice, it either fixed itself or you manually fixed it. In any case the model is now getting quantized.
Last night and also this morning HF had enormous timeout problems. Everything was affected, including web page loading. It's not fully fixed yet, but it's much better. I need to manually retry when it fails at this step.
Ah, and yes, if "s" is in the flags, it will never try imatrix quanting first.
Oh, and btw., Hetzner sometimes has good offers, which might or might not be something to consider for Richard, if he actually pays €250/month. Can't see an obvious candidate, but didn't look long, and the offers change considerably over time, e.g.
https://www.hetzner.com/sb/#price_from=180&price_to=250&cpuType=AMD&search=threadripper
All of these are a bit faster than his box, and cheaper, afaics.