Long-form input takes too long

#35
by htkim27

I expanded the context length to around 60k tokens and ran inference, but generation takes far too long. Is anyone else experiencing this?

I am running on a single node with 8 A100 GPUs.

I served the model with llama.cpp using the following script:

MODEL_PATH="/PATH/TO/GGUF_DIR/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf"
N_GPU_LAYERS=64  # Use all layers on GPU
N_THREADS=64     # Increase thread count
CONTEXT_SIZE=65536

# Run llama.cpp server
./llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --n-gpu-layers "$N_GPU_LAYERS" \
  --threads "$N_THREADS" \
  --prio 3 \
  --host 0.0.0.0 \
  --port 8080 \
  --flash-attn \
  --batch-size "$CONTEXT_SIZE" \
  --ubatch-size 1 \
  --ctx-size "$CONTEXT_SIZE" \
  --keep 1 \
  --seed 3407
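
To quantify "too long", I time a single long-prompt request against the server and look at the timings it reports. This is a minimal sketch; it assumes the server above is reachable on localhost:8080 and that this llama.cpp build exposes the native /completion endpoint with a "timings" object in its JSON response (adjust if your build differs):

# Build a long dummy prompt (~30k repeated words) and write the request body
# to a file to avoid shell argument-length limits.
PROMPT=$(printf 'hello %.0s' {1..30000})
printf '{"prompt": "%s", "n_predict": 64}' "$PROMPT" > /tmp/req.json

# Send one request; `time` gives wall-clock latency, and the grep pulls the
# server-side timings (assumed field name) out of the JSON response.
time curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d @/tmp/req.json | grep -o '"timings":{[^}]*}'

Comparing the prompt-processing and token-generation entries in those timings against the wall-clock time should show whether prompt ingestion or generation is the slow phase.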
