How much VRAM do you need?

#12 opened by hyun10

I have 4× RTX 3090, 96 GB of VRAM in total, but when I try to run this with vLLM I get a CUDA out-of-memory error.

Should I use a quantized model?

BF16 memory: 70B params × 2 bytes = 140 GB
Q8 memory: 70B params × 1 byte = 70 GB
You should use a Q8 model.
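As a quick sanity check, here is a rough weights-only estimate in Python. It ignores KV cache, activations, and runtime overhead, so real usage will be higher than these numbers:

```python
# Rough VRAM needed for the model weights alone, ignoring KV cache,
# activations, and framework overhead (real usage is higher).
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    # params_billions * 1e9 params * bytes_per_param bytes, expressed in GB
    return params_billions * bytes_per_param

for fmt, bytes_per_param in [("BF16", 2.0), ("FP8/Q8", 1.0), ("Q4 (AWQ)", 0.5)]:
    print(f"{fmt}: ~{weight_memory_gb(70, bytes_per_param):.0f} GB")
# BF16: ~140 GB, FP8/Q8: ~70 GB, Q4 (AWQ): ~35 GB
```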

The official version is FP8, not Q8.

Is there any AWQ Q8 version of this model?

Is this the Q8 AWQ version of this model? https://huggingface.co/Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ/

That model is Q4. A Q8 model file would be around 70 GB or more.
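For reference, a minimal sketch of serving that Q4 AWQ checkpoint with vLLM across the four 3090s; the parameter values (memory utilization, context length) are assumptions you would tune for your own setup:

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint and split it across the 4 GPUs.
llm = LLM(
    model="Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=4,       # one shard per RTX 3090
    gpu_memory_utilization=0.90,  # leave headroom for the CUDA context
    max_model_len=8192,           # cap context length to bound KV-cache memory
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```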

I am looking for a Q8 model quantized with AWQ. Do you know of any?

You can perform AWQ yourself.
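For example, a minimal sketch with the AutoAWQ library; the paths and settings below are assumptions. Note that the standard AWQ kernels are 4-bit (w_bit=4), which is one reason true 8-bit AWQ checkpoints are hard to find:

```python
# Minimal AutoAWQ quantization sketch (assumed paths and settings).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
quant_path = "DeepSeek-R1-Distill-Llama-70B-AWQ"  # output directory (hypothetical)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate, quantize, and save the AWQ checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```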
