Bug Description
Issue Summary
Encountered NotImplementedError: No FP8 MoE backend supports the deployment configuration when attempting to deploy MiniMax-M2.5 using vLLM. The error persists even when explicitly specifying --dtype float16 in the launch command.
Steps to Reproduce
- Download the MiniMax-M2.5 model
- Inspect the model's config.json, which contains the following FP8 quantization configuration:
{
  "quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "float8_e4m3fn",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "modules_to_not_convert": ["gate", "e_score_correction_bias", "lm_head"]
  }
}
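For context, the error is independent of --dtype: vLLM picks its quantization path from the checkpoint's quantization_config, not from the runtime dtype. A minimal sketch of that inspection (the helper function name is illustrative, not vLLM's actual API), using the config above:

```python
import json
import tempfile
from pathlib import Path

# quantization_config as shipped in the MiniMax-M2.5 checkpoint (copied from the issue)
config = {
    "quantization_config": {
        "activation_scheme": "dynamic",
        "fmt": "float8_e4m3fn",
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
        "modules_to_not_convert": ["gate", "e_score_correction_bias", "lm_head"],
    }
}

def detect_quant_method(model_dir):
    """Return the quant_method recorded in config.json, or None if absent."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return cfg.get("quantization_config", {}).get("quant_method")

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "config.json").write_text(json.dumps(config))
    print(detect_quant_method(d))  # fp8
```

Because quant_method is "fp8", the FP8 MoE code path is selected regardless of --dtype float16.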
- Execute the following launch command:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8356 \
--model models/MiniMax-M2.5 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--tensor-parallel-size 8 \
--dtype float16 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name MiniMax-M2.5 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enforce-eager
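The failure mode can be sketched as a backend-selection check: vLLM probes each FP8 MoE backend for compatibility with the current platform and raises once none accepts the configuration, and the runtime dtype never enters that decision. A simplified, hypothetical illustration (the function, class, and backend names are invented for this sketch, not vLLM's actual API):

```python
from dataclasses import dataclass

@dataclass
class DeployConfig:
    quant_method: str   # from config.json, e.g. "fp8"
    platform: str       # e.g. "kunlun", "cuda"
    dtype: str          # runtime dtype from --dtype; note: unused below

# Hypothetical support matrix: which platforms each FP8 MoE backend accepts.
FP8_MOE_BACKENDS = {
    "cutlass_fp8": {"cuda"},
    "triton_fp8": {"cuda", "rocm"},
}

def select_fp8_moe_backend(cfg: DeployConfig) -> str:
    """Pick the first FP8 MoE backend that supports the platform."""
    for name, platforms in FP8_MOE_BACKENDS.items():
        if cfg.platform in platforms:
            return name
    # dtype is never consulted, so --dtype float16 cannot avoid this branch.
    raise NotImplementedError(
        "No FP8 MoE backend supports the deployment configuration"
    )
```

Under these assumptions, select_fp8_moe_backend(DeployConfig("fp8", "kunlun", "float16")) raises the same NotImplementedError reported above, which is consistent with --dtype float16 having no effect.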
Installation Documentation
https://vllm-kunlun.readthedocs.io/en/v0.15.1/installation.html