Improve model card: Add metadata, links, and detailed usage example
This PR significantly enhances the model card for the `GuidedQuant` model.
Key improvements include:
- **Corrected License**: Updated the license from `llama2` to `mit`, matching the license explicitly declared in the project's GitHub repository.
- **Enhanced Metadata**: Added `pipeline_tag: text-generation` and `library_name: transformers` so the model is discoverable under the text-generation task on the Hugging Face Hub and integrates with the Transformers inference API. Also added a `quantization` tag for better categorization.
- **Comprehensive Content**:
  - Provided a detailed overview of the GuidedQuant approach based on the paper's abstract.
  - Added clear links to the paper ([GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)), the project page ([https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)), and the GitHub repository ([https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)).
  - Integrated a runnable Python code snippet for quick model usage, sourced directly from the project's GitHub.
  - Included sections for the overview, inference speed-up, acknowledgements, and citation information, making the model card a more complete resource.
These updates ensure users have comprehensive information to understand, find, and effectively use the model.
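For maintainers who want to apply the same front-matter changes to sibling repositories, here is a minimal sketch using the `huggingface_hub` client. This is not part of the PR itself, and the repo id is a placeholder, not a real repository name:

```python
from huggingface_hub import metadata_update

# Placeholder repo id: substitute the actual Hub repository the card belongs to.
repo_id = "user/model-repo"

# Merge the new card metadata into the README front matter.
# overwrite=True allows existing keys (e.g. the old `llama2` license)
# to be replaced rather than raising an error.
metadata_update(
    repo_id,
    {
        "license": "mit",
        "pipeline_tag": "text-generation",
        "library_name": "transformers",
        "tags": ["quantization"],
    },
    overwrite=True,
)
```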
The full `README.md` diff follows:

````diff
@@ -1,20 +1,95 @@
 ---
 base_model:
 - meta-llama/Llama-2-7b-hf
+license: mit
 base_model_relation: quantized
-license: llama2
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- quantization
 ---
-# Model Card
 
-
-- Quantization method: SqueezeLLM
-- Target bit-width: 4
-- Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
-- Calibration data: RedPajama (1024 sentences / 4096 tokens)
-- Calibration objective: Next-token prediction
+# GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
 
-
-- Follow the instruction in https://github.com/snu-mllab/GuidedQuant.
+**GuidedQuant** is a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. It consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, it introduces a novel non-uniform scalar quantization algorithm, LNQ, which is guaranteed to monotonically decrease the quantization objective value and outperforms existing methods in this category.
 
-
-
+* **Paper**: [GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)
+* **Project Page**: [https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)
+* **GitHub Repository**: [https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)
+
+## Model Details
+
+* Base model: `meta-llama/Llama-2-7b-hf`
+* Quantization method: SqueezeLLM
+* Target bit-width: 4
+* Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
+* Calibration data: RedPajama (1024 sentences / 4096 tokens)
+* Calibration objective: Next-token prediction
+
+## Overview
+
+*(Overview figure from the GuidedQuant repository, in light- and dark-mode variants.)*
+
+**GuidedQuant** enhances LLM quantization by integrating gradient information from the end loss into the quantization objective, boosting the performance of SOTA *weight-only scalar*, *weight-only vector*, and *weight-and-activation* quantization. Additionally, we introduce **LNQ**, a non-uniform scalar quantization algorithm which is guaranteed to monotonically decrease the quantization objective value.
+
+## Quick Start
+
+You can easily load and test the pre-quantized models using the `AnyPrecisionForCausalLM` class, as shown in the following example (runs on one RTX 3090).
+
+```python
+from any_precision.modules.AnyPrecisionForCausalLM import AnyPrecisionForCausalLM
+from transformers import AutoTokenizer, TextStreamer
+import torch
+
+quantized_model_name = "jusjinuk/Llama-3.3-70B-Instruct-2bit-GuidedQuant-LNQ"
+# Use float16 for Llama models, and bfloat16 for Qwen / Gemma models
+dtype = torch.float16 if "llama" in quantized_model_name.lower() else torch.bfloat16
+
+model = AnyPrecisionForCausalLM.from_quantized(quantized_model_name, torch_dtype=dtype)
+tokenizer = AutoTokenizer.from_pretrained(quantized_model_name)
+streamer = TextStreamer(tokenizer)
+
+prompt = "Write me a short and concise story about Harry, Ron, and Hermione.\n"
+chat = [
+    {"role": "system", "content": "You are a helpful assistant.\n"},
+    {"role": "user", "content": prompt},
+]
+
+inputs = tokenizer.apply_chat_template(
+    chat, tokenize=True, return_tensors="pt", add_generation_prompt=True
+).to(model.device)
+
+model.generate(
+    inputs, max_new_tokens=200, do_sample=False, temperature=1.0,
+    streamer=streamer, pad_token_id=tokenizer.eos_token_id,
+)
+```
+
+## How to Run & Inference Speed-up
+
+For detailed instructions on installation, downloading calibration data, weight-only scalar, vector, and weight-and-activation quantization, evaluation, and inference speed-up, please refer to the documentation and scripts in the [GuidedQuant GitHub repository](https://github.com/snu-mllab/GuidedQuant).
+
+## Acknowledgement
+
+This code is heavily based on the following repositories:
+
+- [Any-Precision-LLM](https://github.com/SNU-ARC/any-precision-llm)
+- [QTIP](https://github.com/Cornell-RelaxML/qtip)
+- [SpinQuant](https://github.com/facebookresearch/SpinQuant)
+- [AQLM](https://github.com/Vahe1994/AQLM)
+- [Fast Hadamard Transform](https://github.com/Dao-AILab/fast-hadamard-transform)
+- [gpt-fast](https://github.com/pytorch-labs/gpt-fast)
+
+We thank the authors for their open-source implementations and contributions to the community.
+
+## Citation
+
+Please cite our paper if you find our work useful:
+
+```bibtex
+@inproceedings{kim2025guidedquant,
+  title={GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance},
+  author={Jinuk Kim and Marwa El Halabi and Wonpyo Park and Clemens JS Schaefer and Deokjae Lee and Yeonhong Park and Jae W. Lee and Hyun Oh Song},
+  booktitle={International Conference on Machine Learning (ICML)},
+  year={2025}
+}
+```
````
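Once merged, the new metadata can be sanity-checked from Python. A minimal sketch using `huggingface_hub`, again with a placeholder repo id rather than this model's actual repository name:

```python
from huggingface_hub import ModelCard

# Placeholder repo id: substitute the actual Hub repository the card belongs to.
card = ModelCard.load("user/model-repo")

# card.data holds the parsed YAML front matter added in this PR.
print(card.data.license)       # expected: "mit"
print(card.data.pipeline_tag)  # expected: "text-generation"
print(card.data.library_name)  # expected: "transformers"
print(card.data.tags)          # expected: ["quantization"]
```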