
AngelSlim

Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.

📖 Documentation   |   🤗 Hugging Face   |   🤖 ModelScope   |   💬 WeChat


📣 Latest Updates

  • [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We have also open-sourced the Qwen3-32B-NVFP4 and Qwen3-235B-A22B-NVFP4 weights.
  • [25/09/01] We now support FP8 quantization for the Hunyuan-MT-7B translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for FLUX, and quantization for Seed-OSS.
  • [25/08/06] We now support quantization for Hunyuan 0.5B/1.8B/4B/7B and the multimodal Qwen2.5VL 3B/7B/32B/72B models with FP8/INT4 algorithms, as well as quantization for DeepSeek-R1/V3 and Kimi-K2 with FP8-Static and W4A8-FP8 algorithms. We have also open-sourced the Hunyuan 1.8B/4B/7B series Eagle3 model weights.
  • [25/07/04] We now support quantization for Hunyuan, Qwen2.5, Qwen3, DeepSeek-R1-Distill-Qwen, and other models with INT8/FP8/INT4 algorithms. We have also open-sourced the Qwen3 series Eagle3 model weights.

Coming soon:

  • Diffusion model compression support.
  • Release of a new speculative sampling algorithm.

🌟 Key Features

  • Highly Integrated: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
  • Continuous Innovation: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
  • Performance-Driven: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.

💼 Supported Models

Quantization

Currently supported LLMs include Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, the DeepSeek-R1 distilled Qwen models, and QwQ, with the following quantization algorithms: FP8-Dynamic, FP8-Static, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ.

Speculative Decoding

Eagle3

Eagle3 weights are now available for the Qwen3 series models, as well as for the Hunyuan series (see the updates above).
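
As a sketch, these weights can be fetched from the Hugging Face Hub with huggingface_hub; the repo id below is an illustrative assumption, so substitute the Eagle3 repository you actually need:

from huggingface_hub import snapshot_download

# Download Eagle3 draft-model weights into the local Hugging Face cache.
# NOTE: the repo id is an assumed example, not a confirmed path.
local_dir = snapshot_download(repo_id="AngelSlim/Qwen3-8B_eagle3")
print(local_dir)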

🛎️ How to Use

Install AngelSlim

We recommend using pip to install the latest stable version of AngelSlim:

pip install angelslim

Alternatively, you can clone the repository and install from source:

cd AngelSlim && python setup.py install

For more detailed installation instructions, please refer to the Installation Documentation.

Quick Start

Quantization

After installing AngelSlim, you can quickly start by running the following script to perform static FP8 quantization on the Qwen3-1.7B model:

  • One-click Start

    python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
    

    This example loads the Hugging Face model, performs activation calibration using the dataset specified in the config file, and saves the quantized model weights.

  • Code-based Start

    To perform dynamic FP8 quantization on Qwen3-1.7B:

    from angelslim.engine import Engine
    
    slim_engine = Engine()
    # Prepare model
    slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
    # Initialize compressor
    slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
    # Compress model
    slim_engine.run()
    # Save compressed model
    slim_engine.save("./output")
    
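Once saved, the quantized checkpoint in ./output loads like any regular Hugging Face model. A minimal sketch, assuming the checkpoint was exported in a transformers-compatible format (see Deployment and Testing below):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized checkpoint produced by slim_engine.save("./output").
tokenizer = AutoTokenizer.from_pretrained("./output")
model = AutoModelForCausalLM.from_pretrained("./output", device_map="auto")

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))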

For more details, please refer to the Quick Start Documentation.

Speculative Decoding

Eagle3 PyTorch Performance Testing

After installing AngelSlim, you can quickly start Eagle3 PyTorch performance testing with the following script:

python3 tools/spec_benchmark.py \
    --base-model-path /path/to/base/model \
    --eagle-model-path /path/to/eagle/model \
    --model-id your_model_id \
    --mode both
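
For reference, the Speedup and τ numbers reported in the Benchmark section below can be understood as follows; this is an illustrative sketch of the metric definitions, not AngelSlim code, and the variable names are assumptions:

# Speedup: wall-clock ratio of vanilla decoding to Eagle3 decoding.
def speedup(baseline_seconds: float, eagle_seconds: float) -> float:
    return baseline_seconds / eagle_seconds

# tau: mean number of tokens accepted per draft-and-verify cycle.
def mean_accept_length(accepted_tokens: int, draft_cycles: int) -> float:
    return accepted_tokens / draft_cycles

print(speedup(10.0, 4.2))             # -> ~2.38x
print(mean_accept_length(1490, 500))  # -> 2.98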

For more details, please refer to the Quick Start Documentation.

Deployment and Testing

1. Offline Inference

If you need to load a quantized model via transformers, set deploy_backend: huggingface in the global section of the configuration before quantizing the model, or manually rename the ignored_layers field in the config.json file (located in the quantized model output directory) to ignore.
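
For an already-quantized checkpoint, the rename can be scripted in a few lines; a minimal sketch, assuming the field lives under quantization_config as in vLLM-style FP8 checkpoints (back up config.json first):

import json

# Hypothetical path: point this at the quantized model output directory.
path = "/path/to/quantized/model/config.json"
with open(path) as f:
    cfg = json.load(f)

# Rename the vLLM-style "ignored_layers" key to the "ignore" key expected
# by transformers-compatible loaders.
quant = cfg.get("quantization_config", {})
if "ignored_layers" in quant:
    quant["ignore"] = quant.pop("ignored_layers")

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)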

To test offline inference with a quantized model loaded via transformers, run the following command:

python deploy/offline.py $MODEL_PATH

Where MODEL_PATH is the path to the quantized model output.

2. API Service Deployment

After specifying the quantized model path MODEL_PATH, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:

vLLM

Use the following script to launch a vLLM server; vllm>=0.8.5.post1 is recommended. For MoE INT8-quantized models, vllm>=0.9.0 is required.

bash deploy/run_vllm.sh $MODEL_PATH

SGLang

Use the following script to launch an SGLang server; sglang>=0.4.6.post1 is recommended.

bash deploy/run_sglang.sh $MODEL_PATH

3. Service Invocation

Invoke the service using OpenAI's API format:

bash deploy/openai.sh $MODEL_PATH
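
Any OpenAI-compatible client also works. A minimal Python sketch, assuming the server listens on http://localhost:8000/v1 (a common default for vLLM and SGLang) and that the model name matches your deployment:

from openai import OpenAI

# The API key is unused by local servers but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="your_model_name",  # must match the name the server registered
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)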

4. Performance Evaluation

Evaluate the performance of the quantized model using lm-evaluation-harness; lm-eval>=0.4.8 is recommended:

bash deploy/lm_eval.sh $MODEL_PATH
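
The same evaluation can also be driven from Python through lm-eval's programmatic API; a minimal sketch, with an illustrative task list and model path:

import lm_eval

# Evaluate a quantized checkpoint with the Hugging Face backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/quantized/model,dtype=auto",
    tasks=["gsm8k"],
    batch_size=8,
)
print(results["results"]["gsm8k"])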

For more details, please refer to the Deployment Documentation.

📈 Benchmark

(1) Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the Benchmark documentation.

Hunyuan Series Models

Benchmark results for the Hunyuan-Instruct models with FP8, INT4-AWQ, and INT4-GPTQ quantization algorithms on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:

| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|---|---|---|---|---|---|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |

Qwen3 Series Models

Benchmark results for Qwen3 series models with FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including CEVAL, MMLU, GSM8K, and HUMANEVAL:

| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |

Qwen2.5VL Series Models

Benchmark results for Qwen2.5VL series models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |

DeepSeek Series Models

Benchmark results for DeepSeek-R1-0528 series models with FP8-Block-Wise and W4A8-FP8 quantization algorithms on datasets including GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:

| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |

Note

  • The above results are averages over 5 test runs, deployed with TRT-LLM.
  • The hyperparameters used during evaluation are as follows:
{
 "top_k": 20,
 "top_p": 0.6,
 "temperature": 0.7,
 "output_seq_len": 32768,
 "max_input_seq_len": 16384
}

Other Models

Benchmark results for other models with FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including CEVAL, MMLU and GSM8K:

| Model | Quantization | CEVAL | MMLU | GSM8K |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |

(2) Speculative Decoding

Qwen3 Series Models

Benchmark results for Qwen3 series models with the Eagle3 speculative decoding algorithm on MT-bench, HumanEval, GSM8K, and Alpaca (each cell reports Speedup / τ):

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|---|---|---|---|---|---|---|
| T=0 | Qwen3-1.7B | 2.05x / 2.81 | 2.07x / 2.93 | 2.11x / 2.98 | 1.93x / 2.69 | 2.04x / 2.85 |
| T=0 | Qwen3-4B | 2.21x / 3.01 | 2.36x / 3.24 | 2.42x / 3.13 | 2.32x / 2.75 | 2.33x / 3.03 |
| T=0 | Qwen3-8B | 2.63x / 3.65 | 2.76x / 3.85 | 2.82x / 3.90 | 2.62x / 3.48 | 2.70x / 3.72 |
| T=0 | Qwen3-14B | 2.23x / 3.30 | 2.53x / 3.74 | 2.56x / 3.79 | 2.16x / 3.13 | 2.37x / 3.49 |
| T=0 | Qwen3-32B | 2.39x / 2.78 | 2.37x / 2.81 | 2.47x / 2.92 | 2.42x / 2.53 | 2.41x / 2.76 |
| T=0 | Qwen3-30B-A3B | 2.84x / 3.63 | 2.27x / 3.09 | 2.64x / 3.42 | 2.83x / 3.56 | 2.64x / 3.42 |
| T=1 | Qwen3-1.7B | 1.74x / 2.53 | 1.86x / 2.70 | 1.82x / 2.69 | 1.72x / 2.46 | 1.93x / 2.60 |
| T=1 | Qwen3-4B | 1.93x / 2.60 | 2.00x / 2.84 | 2.11x / 2.82 | 2.34x / 2.50 | 1.75x / 2.69 |
| T=1 | Qwen3-8B | 1.98x / 2.75 | 2.25x / 3.11 | 2.31x / 3.15 | 2.10x / 2.76 | 2.90x / 2.94 |
| T=1 | Qwen3-14B | 1.71x / 2.61 | 1.95x / 2.87 | 2.04x / 3.08 | 1.68x / 2.55 | 2.90x / 2.78 |
| T=1 | Qwen3-32B | 1.62x / 1.91 | 1.71x / 2.05 | 1.78x / 2.10 | 1.80x / 1.95 | 1.62x / 2.00 |
| T=1 | Qwen3-30B-A3B | 1.91x / 2.46 | 2.00x / 2.64 | 1.90x / 2.53 | 1.80x / 2.32 | 1.90x / 2.48 |

Hunyuan Series Models

Benchmark results for Hunyuan series models with the Eagle3 speculative decoding algorithm on MT-bench, HumanEval, GSM8K, and Alpaca (each cell reports Speedup / τ):

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|---|---|---|---|---|---|---|
| T=0 | Hunyuan-1.8B-Instruct | 1.97x / 2.90 | 2.58x / 3.73 | 2.61x / 3.71 | 1.71x / 2.43 | 2.22x / 3.19 |
| T=0 | Hunyuan-4B-Instruct | 1.77x / 2.60 | 2.64x / 3.35 | 2.14x / 3.17 | 1.72x / 2.57 | 2.07x / 2.92 |
| T=0 | Hunyuan-7B-Instruct | 2.22x / 3.58 | 3.59x / 5.47 | 2.96x / 4.68 | 1.64x / 2.56 | 2.60x / 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x / 2.36 | 2.35x / 3.56 | 2.23x / 3.38 | 1.26x / 1.87 | 1.86x / 2.79 |
| T=1 | Hunyuan-4B-Instruct | 1.36x / 2.05 | 1.97x / 2.86 | 1.72x / 2.68 | 1.14x / 1.76 | 1.55x / 2.34 |
| T=1 | Hunyuan-7B-Instruct | 1.90x / 3.11 | 3.12x / 5.09 | 2.74x / 4.34 | 1.47x / 2.39 | 2.31x / 3.73 |

📝 License

The code for this project is open-sourced under the License for AngelSlim.

🔗 Citation

@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim},
}

💬 Technical Discussion
