nielsr (HF Staff) committed
Commit 6c0d3ba · verified · 1 Parent(s): e13284b

Improve model card: Add metadata, links, and detailed usage example


This PR significantly enhances the model card for the `GuidedQuant` model.

Key improvements include:
- **Corrected License**: Updated the license from `llama2` to `mit`, reflecting the explicit license declared in the project's GitHub repository.
- **Enhanced Metadata**: Added `pipeline_tag: text-generation` and `library_name: transformers` to improve model discoverability on the Hugging Face Hub and integrate with the Transformers inference API. Also added `tags: quantization` for better categorization.
- **Comprehensive Content**:
  - Provided a detailed overview of the GuidedQuant approach based on the paper's abstract.
  - Added clear links to the paper ([GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)), the project page ([https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)), and the GitHub repository ([https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)).
  - Integrated a runnable Python code snippet for quick model usage, sourced directly from the project's GitHub repository.
  - Included sections for overview, inference speed-up, acknowledgements, and proper citation information, making the model card a more complete resource.

These updates ensure users have comprehensive information to understand, find, and effectively use the model.

Files changed (1)

1. README.md +87 -12

README.md CHANGED
@@ -1,20 +1,95 @@
  ---
  base_model:
  - meta-llama/Llama-2-7b-hf
  base_model_relation: quantized
- license: llama2
  ---
- # Model Card
-
- - Base model: `meta-llama/Llama-2-7b-hf`
- - Quantization method: SqueezeLLM
- - Target bit-width: 4
- - Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
- - Calibration data: RedPajama (1024 sentences / 4096 tokens)
- - Calibration objective: Next-token prediction
-
- # How to run
- - Follow the instruction in https://github.com/snu-mllab/GuidedQuant.
-
- # References
- - [Model Paper](https://arxiv.org/abs/2505.07004)
  ---
  base_model:
  - meta-llama/Llama-2-7b-hf
+ license: mit
  base_model_relation: quantized
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - quantization
  ---

+ # GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
+
+ **GuidedQuant** is a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. It consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. The paper also introduces **LNQ**, a novel non-uniform scalar quantization algorithm that is guaranteed to monotonically decrease the quantization objective value and outperforms existing methods in this category.
+
+ * **Paper**: [GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)
+ * **Project Page**: [https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)
+ * **GitHub Repository**: [https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)
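
To make the description above concrete, here is a minimal, hypothetical sketch of such an objective: it weights each output channel's reconstruction error by end-loss gradients while keeping cross-weight interactions within the channel. The function name, tensor shapes, and the exact weighting form are illustrative assumptions, not the repository's implementation.

```python
import torch

def guided_layer_objective(dW: torch.Tensor, X: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Gradient-weighted reconstruction error for one linear layer (illustrative).

    dW: (C_out, C_in)  weight error W - W_hat
    X : (T, C_in)      calibration inputs to the layer
    G : (T, C_out)     gradients of the end loss w.r.t. the layer's outputs
    """
    total = dW.new_zeros(())
    for c in range(dW.shape[0]):
        # Scale each token's input by that token's gradient for this output channel.
        Xg = X * G[:, c:c + 1]
        # Per-channel Hessian proxy: dense (C_in, C_in), so cross-weight
        # dependencies *within* an output channel are preserved.
        H_c = Xg.T @ Xg
        total = total + dW[c] @ H_c @ dW[c]
    return total
```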
+
+ ## Model Details
+
+ * Base model: `meta-llama/Llama-2-7b-hf`
+ * Quantization method: SqueezeLLM
+ * Target bit-width: 4
+ * Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
+ * Calibration data: RedPajama (1024 sentences / 4096 tokens)
+ * Calibration objective: Next-token prediction
+
+ ## Overview
+ ![Light Mode](https://github.com/snu-mllab/GuidedQuant/blob/main/assets/objective-light.png?raw=true#gh-light-mode-only)
+ ![Dark Mode](https://github.com/snu-mllab/GuidedQuant/blob/main/assets/objective-dark.png?raw=true#gh-dark-mode-only)
+
+ **GuidedQuant** enhances LLM quantization by integrating gradient information from the end loss into the quantization objective, boosting the performance of SOTA *weight-only scalar*, *weight-only vector*, and *weight-and-activation* quantization. Additionally, we introduce **LNQ**, a non-uniform scalar quantization algorithm which is guaranteed to monotonically decrease the quantization objective value.
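
The monotone-descent property claimed for LNQ can be illustrated with a generic alternating scheme for non-uniform scalar quantization. This is a hypothetical sketch, not the paper's LNQ algorithm: `imp` stands in for positive per-weight importance values (e.g. gradient-derived), and each assignment/codebook update step cannot increase the weighted squared error.

```python
import torch

def weighted_scalar_quantize(w: torch.Tensor, imp: torch.Tensor, bits: int = 4, iters: int = 10):
    # w:   (N,) flattened weights; imp: (N,) positive importance weights
    k = 2 ** bits
    # Initialize the non-uniform codebook from quantiles of the weights.
    C = torch.quantile(w, torch.linspace(0, 1, k))
    for _ in range(iters):
        # Assignment step: nearest code minimizes imp_i * (w_i - c)^2 since imp_i > 0.
        assign = (w[:, None] - C[None, :]).abs().argmin(dim=1)
        # Codebook step: weighted centroid is the exact minimizer per code.
        for j in range(k):
            m = assign == j
            if m.any():
                C[j] = (imp[m] * w[m]).sum() / imp[m].sum()
    return C[assign]
```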
+
+ ## Quick Start
+
+ You can easily load and test the pre-quantized models using the `AnyPrecisionForCausalLM` class, as shown in the following example (runs on a single RTX 3090).
+
+ ```python
+ from any_precision.modules.AnyPrecisionForCausalLM import AnyPrecisionForCausalLM
+ from transformers import AutoTokenizer, TextStreamer
+ import torch
+
+ quantized_model_name = "jusjinuk/Llama-3.3-70B-Instruct-2bit-GuidedQuant-LNQ"
+ # Use float16 for Llama models, and bfloat16 for Qwen / Gemma models
+ dtype = torch.float16 if "llama" in quantized_model_name.lower() else torch.bfloat16
+
+ model = AnyPrecisionForCausalLM.from_quantized(quantized_model_name, torch_dtype=dtype)
+ tokenizer = AutoTokenizer.from_pretrained(quantized_model_name)
+ streamer = TextStreamer(tokenizer)
+
+ prompt = "Write me a short and concise story about Harry, Ron, and Hermione.\n"
+ chat = [
+     {"role": "system", "content": "You are a helpful assistant.\n"},
+     {"role": "user", "content": prompt},
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     chat, tokenize=True, return_tensors="pt", add_generation_prompt=True
+ ).to(model.device)
+
+ model.generate(
+     inputs,
+     max_new_tokens=200,
+     do_sample=False,
+     temperature=1.0,
+     streamer=streamer,
+     pad_token_id=tokenizer.eos_token_id,
+ )
+ ```
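
For a rough sense of decoding throughput on the `ap-gemv` kernel, the following timing sketch reuses `model` and `tokenizer` from the snippet above. It is an illustrative assumption, not a benchmark script from the repository.

```python
import time
import torch

# Reuse `model` and `tokenizer` loaded in the Quick Start snippet.
ids = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(model.device)

torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(ids, max_new_tokens=128, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

# Tokens generated per second of wall-clock time.
print(f"{(out.shape[1] - ids.shape[1]) / dt:.1f} tokens/s")
```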
+
+ ## How to Run & Inference Speed-up
+
+ For detailed instructions on installation, downloading calibration data, weight-only scalar/vector/activation quantization, evaluation, and inference speed-up, please refer to the documentation and scripts in the [GuidedQuant GitHub repository](https://github.com/snu-mllab/GuidedQuant).
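
As a quick sanity check before running the repository's full evaluation scripts, here is a hedged perplexity sketch on a single text sample, reusing `model` and `tokenizer` from the Quick Start and assuming the wrapper exposes the standard transformers causal-LM forward signature with `labels` (an assumption, not documented API).

```python
import torch

# Assumes `model(ids, labels=ids)` returns an output with a `.loss` field,
# as the standard transformers causal-LM interface does.
text = "GuidedQuant integrates end-loss gradient information into quantization."
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    loss = model(ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())
```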
+
+ ## Acknowledgement
+ This code is heavily based on the following repositories:
+ - [Any-Precision-LLM](https://github.com/SNU-ARC/any-precision-llm)
+ - [QTIP](https://github.com/Cornell-RelaxML/qtip)
+ - [SpinQuant](https://github.com/facebookresearch/SpinQuant)
+ - [AQLM](https://github.com/Vahe1994/AQLM)
+ - [Fast Hadamard Transform](https://github.com/Dao-AILab/fast-hadamard-transform)
+ - [gpt-fast](https://github.com/pytorch-labs/gpt-fast)
+
+ We thank the authors for their open-source implementations and contributions to the community.
+
+ ## Citation
+
+ Please cite our paper if you find our work useful:
+
+ ```bibtex
+ @inproceedings{kim2025guidedquant,
+   title={GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance},
+   author={Jinuk Kim and Marwa El Halabi and Wonpyo Park and Clemens JS Schaefer and Deokjae Lee and Yeonhong Park and Jae W. Lee and Hyun Oh Song},
+   booktitle={International Conference on Machine Learning (ICML)},
+   year={2025}
+ }
+ ```