Join the force

#426
by RichardErkhov - opened

Hello @mradermacher, as you noticed we have been competing on the number of models for quite a while. So instead of competing, want to join forces? I talked to @nicoboss, he is up for it, and I have my quant server for you with 2 big bananas (E5-2697Av4), 64 GB of RAM, and a 10 Gbps line ready for you!

Well, "take what I have" and "join forces" are not exactly the same thing. When we talked last about it, I realised we were doing very different things and thought diversity is good, especially when I actually saw what models you quantize and how :) BTW, I am far from beating your amount of models (remember, I have roughly two repos per model, so you have twice the amount), and wasn't in the business of competing, as it was clear I couldn't :)

But of course, I won't say no to such an offer, especially not at this moment (if you have seen my queue recently...).

So how do we go about it? Nico runs some virtualisation solution, and we decided on a Linux container to be able to access his graphics cards, but since direct hardware access is not a concern here, a more traditional VM would probably be the simplest option. I could give you an image, or you could create a VM with Debian 12/bookworm and my ssh key on it (nico can just copy the authorized_keys file).

Or, if you have any other ideas, let's talk.

Oh, and how much diskspace are you willing to give me? :)

Otherwise, welcome to team mradermacher. Really should have called it something else in the beginning.

Ah, and as for network access, I only need some port to reach ssh, and to be able to get a tunnel out (wireguard, udp). Having a random port go to the VM's ssh port and forwarding udp port 7103 to the same port on the VM would be ideal. I can help with all that, and am open to alternative arrangements, but I have total trust in you that you can figure everything out :)
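
For what it's worth, the TCP half of such a setup is easy to sanity-check from outside once the forward is in place. A minimal Python sketch, assuming nothing about the actual setup: `tcp_port_open` is a hypothetical helper name, and the host and port in the note below are placeholders. It only covers the ssh forward; a UDP forward such as the one for port 7103 cannot be verified with a plain connect, since UDP has no handshake.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Something like `tcp_port_open("vm.example.org", 40022)` (placeholder host and port) would then tell whether the forwarded ssh port answers at all. It says nothing about whether the ssh key is accepted.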

No worries, I will help him set up everything infrastructure-wise. He has already successfully created a Debian 12 LXC container. While a VM might be easier, the few percent of lost performance bother me, but if you prefer a VM I can also help him with that.

LXC sits perfectly well with me.

this brings me joy

@mradermacher Your new server "richard1" is ready. Make sure to abuse the internet as hard as you can. Details were provided by email by @nicoboss , so check it please as soon as you can

Oh, and how much diskspace are you willing to give me? :)

2 TB of SSD, as this is all he has. Some resources are currently still in use by his own quantize tasks, but those should be gone by tomorrow once the models currently being processed are done. Just start your own tasks as soon as the container is ready. He is also running a satellite imagery data processing project for me for the next few weeks, but its resource usage will be minimal. Just go all in and try to use as many resources as you can on this server. For his quantization tasks he usually runs 10 models in parallel and uses an increased number of connections to download them, in order to make optimal use of all available resources.

I'm on it. Wow, load average of 700 :)

@mradermacher rich1 doesn't seem to be down. It is on the status page and, according to it, is processing multiple tasks. The status page is not just frozen, either: it keeps getting updated with the progress rich1 makes on its currently assigned tasks. Yes, I cannot reach rich1 over SSH, but that is a known issue.

It's back, yes - when I wrote that, it was down for at least one hour (unpingable), and likely longer (but maybe multiple times), which was more than normal (it's frequently offline for a few minutes in some way, but not normally that long).

now rich1 has an exciting new problem :) i get asked for a password with ssh. i do not think this is a problem with ssh per se, something worse seems to be going on.

something worse seems to be going on.

You were right. Somehow the content of /home, /proc, /sys, /media, /mnt, /srv, /run and /boot was gone from all LXC containers on Richard's server. We have no idea how this happened, but it caused all containers to break. We tried our best to save it, but in the end we moved all your data (/root and /tmp) to a new container. SSH works again as usual, but the VPN you will unfortunately have to fix yourself. To connect to it over SSH, use the public IP and port 2222 I wrote you in my original rich1 mail.

Thanks for your rescuing efforts. Installation is semi-automated, so not such a big problem, just work.

@nicoboss Hmm, and where did you put /tmp? du / gives me 2.5G only, but df shows 500GB used.

PS: if you move it back, do not replace the existing /tmp. I'll start quanting other models, so no hurry.
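
A gap like that between du and df usually means the filesystem holds data that is not reachable through the directory tree being walked, for example files hidden underneath a mount point, files deleted but still held open, or data sitting outside the visible root. The Python sketch below is a rough way to put a number on it, under stated assumptions: `visible_bytes` and `usage_gap` are hypothetical helper names, sizes are apparent sizes (closer to `du --apparent-size -x` than to plain `du`), and filesystem metadata overhead is ignored, so a small gap is normal.

```python
import os
import shutil


def _device(path: str):
    """Device id of path, or None if it cannot be inspected."""
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None


def visible_bytes(root: str) -> int:
    """Sum the apparent sizes of files reachable under root on one filesystem."""
    root_dev = os.lstat(root).st_dev
    seen = set()  # (device, inode) pairs, so hard links are counted once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # directories that live on another filesystem.
        dirnames[:] = [
            d for d in dirnames
            if _device(os.path.join(dirpath, d)) == root_dev
        ]
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_dev != root_dev:
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total


def usage_gap(root: str = "/") -> dict:
    """Compare what the filesystem reports as used with what is visible."""
    used = shutil.disk_usage(root).used
    visible = visible_bytes(root)
    return {"used": used, "visible": visible, "unaccounted": used - visible}
```

Calling `usage_gap("/")` as root inside the container gives the three numbers side by side. A large `unaccounted` value points at data that exists on the volume but not in the tree the container can see.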

Thanks for your rescuing efforts. Installation is semi-automated, so not such a big problem, just work.

Thanks a lot for getting rich1 working again!

@nicoboss Hmm, and where did you put /tmp? du / gives me 2.5G only, but df shows 500GB used.

Sorry, no idea why it didn't work, as we used the same command that we used for /root, where it worked. We have now moved it to /tmpold. I can confirm the content of your old /tmp folder is now accessible from within your new container under /tmpold.

It's now been integrated again, and rich1 is working hard on cleaning up :)
