Mistral-Small-24B-Instruct-2501-NVFP4

An NVFP4-quantized version of mistralai/Mistral-Small-24B-Instruct-2501, produced with llmcompressor.

Notes

  • Quantization scheme: NVFP4 (linear layers, lm_head excluded)
  • Calibration samples: 512
  • Max sequence length during calibration: 2048
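
A minimal sketch of how a quantization like this could be reproduced with llmcompressor's oneshot API, assuming the built-in NVFP4 scheme; the calibration dataset named here is illustrative, not necessarily the one actually used:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# NVFP4 on all linear layers, lm_head excluded (matches the notes above).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",  # illustrative calibration dataset (an assumption)
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Mistral-Small-24B-Instruct-2501-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Mistral-Small-24B-Instruct-2501-NVFP4")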

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Mistral-Small-24B-Instruct-2501-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat as a single prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.
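
For example, after starting a server with vllm serve, any OpenAI-compatible client can query the model. A minimal sketch, assuming the default host and port; the API key value is a placeholder, as vLLM does not check it by default:

# Start an OpenAI-compatible server first (shell):
#   vllm serve llmat/Mistral-Small-24B-Instruct-2501-NVFP4

from openai import OpenAI

# Point the client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Mistral-Small-24B-Instruct-2501-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)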
