Use aphrodite-engine to infer this model error

#2
by zanepoe - opened

There is an error when using aphrodite-engine to run inference on this model. With the same configuration, there is no error when running VPTQ-community/Meta-Llama-3.3-70B-Instruct-v8-k65536-0-woft.
This seems to be an issue with the quantized model; it looks like one (or all) of the layers doesn't have a quantization config defined for it.
The complete command and error message are as follows.

(aphrodite-engine) zane@zane-desktop:~/workspace/models$ aphrodite run Qwen2.5-32B-Instruct-v8-k65536-256-woft -tp 4 --host 0.0.0.0 --port 6668 --max-model-len 20480 --guided-decoding-backend xgrammar --dtype=half --enable-prefix-caching --gpu-memory-utilization 0.84 --disable-frontend-multiprocessing
WARNING:  Casting torch.bfloat16 to torch.float16.
WARNING:  vptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Defaulting to use mp for distributed inference.
INFO:     -------------------------------------------------------------------------------------
INFO:     Initializing Aphrodite Engine (v0.6.7 commit e64075b8) with the following non-default config:
INFO:     cache.enable_prefix_caching=True
INFO:     cache.gpu_memory_utilization=0.84
INFO:     device.device=device(type='cuda')
INFO:     model.dtype=torch.float16
INFO:     model.max_model_len=20480
INFO:     model.max_seq_len_to_capture=20480
INFO:     model.model='Qwen2.5-32B-Instruct-v8-k65536-256-woft'
INFO:     model.quantization='vptq'
INFO:     parallel.distributed_executor_backend='mp'
INFO:     parallel.tensor_parallel_size=4
INFO:     scheduler.max_num_batched_tokens=20480
INFO:     -------------------------------------------------------------------------------------
WARNING:  Reducing Torch parallelism from 28 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461983) INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461983) INFO:     Using XFormers backend.
(AphroditeWorkerProcess pid=461981) INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461981) INFO:     Using XFormers backend.
(AphroditeWorkerProcess pid=461983) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461983)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461982) INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461982) INFO:     Using XFormers backend.
(AphroditeWorkerProcess pid=461981) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461981)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461983) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461983)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461982) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461982)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461981) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461981)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461982) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461982)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461983) INFO:     Worker ready; awaiting tasks
(AphroditeWorkerProcess pid=461981) INFO:     Worker ready; awaiting tasks
(AphroditeWorkerProcess pid=461982) INFO:     Worker ready; awaiting tasks
WARNING:  Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO:     Loading model Qwen2.5-32B-Instruct-v8-k65536-256-woft...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/bin/aphrodite", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/cli.py", line 229, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/cli.py", line 32, in serve
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
[rank0]:     return __asyncio.run(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]:     return runner.run(main)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]:     return self._loop.run_until_complete(task)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]:     return await main
[rank0]:            ^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 1194, in run_server
[rank0]:     async with build_engine_client(args) as engine_client:
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 121, in build_engine_client
[rank0]:     async with build_engine_client_from_engine_args(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 154, in build_engine_client_from_engine_args
[rank0]:     engine_client = await asyncio.get_running_loop().run_in_executor(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/concurrent/futures/thread.py", line 59, in run
[rank0]:     result = self.fn(*self.args, **self.kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 633, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 526, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 263, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/aphrodite_engine.py", line 334, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 214, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/executor_base.py", line 46, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 111, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 191, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/worker/worker.py", line 157, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/worker/model_runner.py", line 1038, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/__init__.py", line 20, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 404, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 172, in _initialize_model
[rank0]:     return build_model(
[rank0]:            ^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 157, in build_model
[rank0]:     return model_class(config=hf_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 400, in __init__
[rank0]:     self.model = Qwen2Model(config, cache_config, quant_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 256, in __init__
[rank0]:     self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]:                                                     ^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/utils.py", line 404, in make_layers
[rank0]:     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 258, in <lambda>
[rank0]:     lambda prefix: Qwen2DecoderLayer(config=config,
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 176, in __init__
[rank0]:     self.self_attn = Qwen2Attention(
[rank0]:                      ^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 118, in __init__
[rank0]:     self.qkv_proj = QKVParallelLinear(
[rank0]:                     ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 727, in __init__
[rank0]:     super().__init__(input_size=input_size,
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 293, in __init__
[rank0]:     super().__init__(input_size, output_size, skip_bias_add, params_dtype,
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 184, in __init__
[rank0]:     self.quant_method = quant_config.get_quant_method(self,
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/quantization/vptq.py", line 360, in get_quant_method
[rank0]:     quant_config = self.get_config_for_key(base_name, linear_name)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/quantization/vptq.py", line 340, in get_config_for_key
[rank0]:     raise ValueError(f"Cannot find config for ({prefix}, {key})")
[rank0]: ValueError: Cannot find config for (, )
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

It's really strange that Meta-Llama-3.3-70B doesn't have this problem.

This problem exists across the VPTQ-community Qwen2.5-series models.

Let me figure out what happened, thanks!

I guess it is related to the model configurations; maybe we missed something.
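Since the traceback ends in `get_config_for_key` raising `Cannot find config for (, )` (both the layer prefix and key are empty), one way to narrow this down is to compare the per-layer quantization entries in the two checkpoints' `config.json` files. The sketch below is a minimal diagnostic, not part of aphrodite or VPTQ; the field names `quantization_config` and `config_for_layers` are assumptions based on the usual VPTQ config layout and may differ in your checkpoint.

```python
import json

def layer_config_keys(config_path):
    """List the layer names that have a per-layer quantization config.

    Assumes the VPTQ-style layout
    config["quantization_config"]["config_for_layers"]; adjust the field
    names if your checkpoint's config.json differs.
    """
    with open(config_path) as f:
        config = json.load(f)
    quant = config.get("quantization_config", {})
    return sorted(quant.get("config_for_layers", {}))
```

Any layer name the engine constructs (e.g. something like `model.layers.0.self_attn.qkv_proj`) that is absent from this list would plausibly trigger the lookup failure; running this on both the Qwen2.5 and the working Llama-3.3 checkpoint and diffing the naming scheme may show what the loader fails to match.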
