Long-form input takes too long
#35
by htkim27 · opened
I expanded the context length to around 60k tokens and then ran inference, but generation took far too long. Is it just me experiencing this?
I am using 8 A100 GPUs on a single node.
I served the model with the following llama.cpp command:
MODEL_PATH="/PATH/TO/GGUF_DIR/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf"
N_GPU_LAYERS=64    # use all layers on GPU
N_THREADS=64       # increase thread count
CONTEXT_SIZE=65536

# Run llama.cpp server
./llama.cpp/build/bin/llama-server \
    -m "$MODEL_PATH" \
    --n-gpu-layers "$N_GPU_LAYERS" \
    --threads "$N_THREADS" \
    --prio 3 \
    --host 0.0.0.0 \
    --port 8080 \
    --flash-attn \
    --batch-size "$CONTEXT_SIZE" \
    --ubatch-size 1 \
    --ctx-size "$CONTEXT_SIZE" \
    --keep 1 \
    --seed 3407
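For reference, here is a minimal sketch of how the slowdown can be reproduced and timed: a single request to llama-server's /completion endpoint. The file long_prompt.txt and the n_predict value are placeholders, not part of my actual setup; the server's response includes a timings field that separates prompt processing from token generation.

# Minimal sketch: time one long-prompt request against the running server.
# Assumes long_prompt.txt holds the ~60k-token input; n_predict is arbitrary.
time jq -n --rawfile p long_prompt.txt '{prompt: $p, n_predict: 256}' \
    | curl -s http://localhost:8080/completion \
        -H "Content-Type: application/json" \
        --data @- \
    | jq '.timings'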