FP8 Dynamic/W8A16 Quants Please
For FP8, you can use this model with the latest vLLM nightly: https://huggingface.co/nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
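For anyone landing here, a minimal sketch of serving that checkpoint (assuming a recent or nightly vLLM build and a GPU with FP8 support, e.g. Hopper/Ada):

```
# Minimal sketch: serve the FP8-dynamic checkpoint with a recent vLLM build.
# Assumes FP8 hardware support; add your usual flags (context length, etc.).
vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
```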
The chat template is broken in the nm-testing repo. See also https://github.com/vllm-project/vllm/pull/15505#issuecomment-2768873223.
It has been updated now, thanks!
Thanks! It seems that with the nm-testing repo, one can only use the default setting to host vLLM 0.8.3: "vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic --tool-call-parser mistral --enable-auto-tool-choice". However, "--tokenizer_mode mistral --config_format mistral --load_format mistral" is not allowed, since params.json is missing from this version. The difference is that nm-testing uses the Transformers-based tokenizer, while Mistral-Small-3.1-24B-Instruct-2503 uses V7-Tekken. Will there be a significant performance difference in terms of function calling between the two versions?
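For context, the two launch configurations being compared look roughly like this (the mistralai/ org prefix on the original repo is assumed here); the second fails against the nm-testing repo because params.json is absent:

```
# Works: nm-testing FP8 repo, using its Transformers-format tokenizer/config
vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic \
  --tool-call-parser mistral --enable-auto-tool-choice

# Fails on the nm-testing repo (no params.json): Mistral-native format,
# as used with the original checkpoint and its V7-Tekken tokenizer
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --tool-call-parser mistral --enable-auto-tool-choice
```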