Error when using aphrodite-engine to run inference on this model
#2 opened by zanepoe
There is an error when using aphrodite-engine to run inference on this model. With the same configuration, there is no error when running VPTQ-community/Meta-Llama-3.3-70B-Instruct-v8-k65536-0-woft.
This seems to be an issue with the quantized model; it looks like one (or all) of the layers doesn't have a quantization config defined for it.
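The failure mode can be sketched with a minimal, hypothetical version of a per-layer config lookup (the names and structure below are illustrative, not the actual `aphrodite/quantization/vptq.py` code): if the module name can't be parsed into a layer prefix and a linear-layer key, both end up as empty strings and the lookup raises exactly the error seen in the traceback.

```python
# Hypothetical sketch of a per-layer quantization-config lookup that
# mimics the failure in the traceback. Real code in vptq.py differs.
def get_config_for_key(layer_configs, prefix, key):
    # Real code derives `prefix` (e.g. "model.layers.0") and `key`
    # (e.g. "qkv_proj") from the module name; if that parsing fails,
    # both can be empty strings and the lookup fails.
    full_name = f"{prefix}.{key}" if prefix else key
    if full_name in layer_configs:
        return layer_configs[full_name]
    raise ValueError(f"Cannot find config for ({prefix}, {key})")

# Per-layer config table with one entry, as an example.
configs = {"model.layers.0.qkv_proj": {"vector_lens": [-1, 8]}}

# A present key resolves normally:
print(get_config_for_key(configs, "model.layers.0", "qkv_proj"))

# Empty prefix/key reproduces the error message from the traceback:
try:
    get_config_for_key(configs, "", "")
except ValueError as e:
    print(e)
```

This would explain why the error message reads `Cannot find config for (, )` with both fields empty: the lookup key was never populated for this model's layer names.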
The complete command and error message are as follows:
(aphrodite-engine) zane@zane-desktop:~/workspace/models$ aphrodite run Qwen2.5-32B-Instruct-v8-k65536-256-woft -tp 4 --host 0.0.0.0 --port 6668 --max-model-len 20480 --guided-decoding-backend xgrammar --dtype=half --enable-prefix-caching --gpu-memory-utilization 0.84 --disable-frontend-multiprocessing
WARNING: Casting torch.bfloat16 to torch.float16.
WARNING: vptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Defaulting to use mp for distributed inference.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.7 commit e64075b8) with the following non-default config:
INFO: cache.enable_prefix_caching=True
INFO: cache.gpu_memory_utilization=0.84
INFO: device.device=device(type='cuda')
INFO: model.dtype=torch.float16
INFO: model.max_model_len=20480
INFO: model.max_seq_len_to_capture=20480
INFO: model.model='Qwen2.5-32B-Instruct-v8-k65536-256-woft'
INFO: model.quantization='vptq'
INFO: parallel.distributed_executor_backend='mp'
INFO: parallel.tensor_parallel_size=4
INFO: scheduler.max_num_batched_tokens=20480
INFO: -------------------------------------------------------------------------------------
WARNING: Reducing Torch parallelism from 28 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461983) INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461983) INFO: Using XFormers backend.
(AphroditeWorkerProcess pid=461981) INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461981) INFO: Using XFormers backend.
(AphroditeWorkerProcess pid=461983) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461983) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461982) INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(AphroditeWorkerProcess pid=461982) INFO: Using XFormers backend.
(AphroditeWorkerProcess pid=461981) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461981) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461983) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461983) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461982) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461982) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(AphroditeWorkerProcess pid=461981) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461981) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461982) /home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(AphroditeWorkerProcess pid=461982) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(AphroditeWorkerProcess pid=461983) INFO: Worker ready; awaiting tasks
(AphroditeWorkerProcess pid=461981) INFO: Worker ready; awaiting tasks
(AphroditeWorkerProcess pid=461982) INFO: Worker ready; awaiting tasks
WARNING: Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO: Loading model Qwen2.5-32B-Instruct-v8-k65536-256-woft...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/bin/aphrodite", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/cli.py", line 229, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/cli.py", line 32, in serve
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
[rank0]: return __asyncio.run(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]: return await main
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 1194, in run_server
[rank0]: async with build_engine_client(args) as engine_client:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 121, in build_engine_client
[rank0]: async with build_engine_client_from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/endpoints/openai/api_server.py", line 154, in build_engine_client_from_engine_args
[rank0]: engine_client = await asyncio.get_running_loop().run_in_executor(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/concurrent/futures/thread.py", line 59, in run
[rank0]: result = self.fn(*self.args, **self.kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 633, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 526, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 263, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/engine/aphrodite_engine.py", line 334, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 214, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/executor_base.py", line 46, in __init__
[rank0]: self._init_executor()
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 111, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/executor/multiproc_gpu_executor.py", line 191, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/worker/worker.py", line 157, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/worker/model_runner.py", line 1038, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/__init__.py", line 20, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 404, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 172, in _initialize_model
[rank0]: return build_model(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/model_loader/loader.py", line 157, in build_model
[rank0]: return model_class(config=hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 400, in __init__
[rank0]: self.model = Qwen2Model(config, cache_config, quant_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 256, in __init__
[rank0]: self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/utils.py", line 404, in make_layers
[rank0]: maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 258, in <lambda>
[rank0]: lambda prefix: Qwen2DecoderLayer(config=config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 176, in __init__
[rank0]: self.self_attn = Qwen2Attention(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/models/qwen2.py", line 118, in __init__
[rank0]: self.qkv_proj = QKVParallelLinear(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 727, in __init__
[rank0]: super().__init__(input_size=input_size,
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 293, in __init__
[rank0]: super().__init__(input_size, output_size, skip_bias_add, params_dtype,
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/modeling/layers/linear.py", line 184, in __init__
[rank0]: self.quant_method = quant_config.get_quant_method(self,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/quantization/vptq.py", line 360, in get_quant_method
[rank0]: quant_config = self.get_config_for_key(base_name, linear_name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/site-packages/aphrodite/quantization/vptq.py", line 340, in get_config_for_key
[rank0]: raise ValueError(f"Cannot find config for ({prefix}, {key})")
[rank0]: ValueError: Cannot find config for (, )
/home/zane/miniconda3/envs/aphrodite-engine/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
It's really strange that Meta-Llama-3.3-70B doesn't have this problem.
The problem affects the models in the VPTQ-community Qwen2.5 series.
Let me figure out what happened, thanks!
I suspect it is related to the model configuration; maybe we missed something.
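One way to narrow this down would be to compare what quantization-related sections each checkpoint's config.json actually contains, since the working Llama model and the failing Qwen models should expose the same structure. Below is a hypothetical helper for that comparison; the field names it might find (e.g. a `quantization_config` section) are assumptions about the checkpoint layout, not confirmed contents of these repos.

```python
import json

# Hypothetical diagnostic: list every key under any quantization-related
# top-level section of a checkpoint's config.json, so the working and
# failing models can be diffed side by side. Field names are guesses;
# adjust to whatever the actual configs contain.
def list_quant_layer_keys(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    # Collect top-level sections whose name mentions quantization.
    sections = {k: v for k, v in cfg.items() if "quant" in k.lower()}
    keys = []
    for name, section in sections.items():
        if isinstance(section, dict):
            keys.extend(f"{name}.{sub}" for sub in section)
    return keys
```

Running this against both models' config.json files and diffing the output would show whether the Qwen checkpoints are missing per-layer entries that the Llama checkpoint has.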