FP8 Dynamic/W8A16 Quants Please

#44
by rjmehta - opened


Mistral AI_ org

You can use this model in FP8 with the latest vLLM nightly: https://huggingface.co/nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
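For example, a minimal sketch (assumes a recent vLLM nightly with FP8 support is already installed; the tool-call flags are optional and only needed for function calling):

# Serve the FP8-dynamic checkpoint via the OpenAI-compatible API (default port 8000).
vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic \
    --tool-call-parser mistral --enable-auto-tool-choice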

The chat template is broken in the nm-testing repo. See also https://github.com/vllm-project/vllm/pull/15505#issuecomment-2768873223.

Mistral AI_ org

It has been updated now, thanks!

Thanks! It seems that with the nm-testing repo one can only use the default settings when hosting it on vLLM 0.8.3, i.e. "vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic --tool-call-parser mistral --enable-auto-tool-choice". The flags "--tokenizer_mode mistral --config_format mistral --load_format mistral" are not allowed, since params.json is missing from this version. The difference is that nm-testing uses the transformers-based tokenizer, while Mistral-Small-3.1-24B-Instruct-2503 uses V7-Tekken. Will there be a significant difference in function-calling performance between the two versions?
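For reference, this is roughly how I exercise function calling against each server to compare them. A minimal sketch: it assumes the serve command above is running on localhost:8000, and get_weather is a hypothetical example tool, not something from either repo:

# Send a chat request with a tool definition to the OpenAI-compatible endpoint
# and check whether the model emits a well-formed tool call.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

A well-behaved setup should return a tool_calls entry naming get_weather with the city in its arguments; running the same request against both versions gives a quick side-by-side check.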
