Improve model card: Add metadata, links, and detailed usage example
This PR significantly enhances the model card for the `GuidedQuant` model.
Key improvements include:
- **Corrected License**: Updated the license from `llama2` to `mit`, matching the license explicitly declared in the project's GitHub repository.
- **Enhanced Metadata**: Added `pipeline_tag: text-generation` and `library_name: transformers` so the model is discoverable under the text-generation task on the Hugging Face Hub and integrates with the Transformers inference API. Also added a `quantization` tag for better categorization.
- **Comprehensive Content**:
  - Provided a detailed overview of the GuidedQuant approach based on the paper's abstract.
  - Added clear links to the paper ([GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)), the project page ([https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)), and the GitHub repository ([https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)).
  - Integrated a runnable Python code snippet for quick model usage, sourced directly from the project's GitHub.
  - Included sections for the overview, inference speed-up, acknowledgements, and citation information, making the model card a more complete resource.
These updates ensure users have comprehensive information to understand, find, and effectively use the model.
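For maintainers who want to apply the same front-matter changes to sibling repositories, here is a minimal sketch using the `huggingface_hub` client. This is not part of the PR itself, and the repo id is a placeholder, not a real repository name:

```python
from huggingface_hub import metadata_update

# Placeholder repo id: substitute the actual Hub repository the card belongs to.
repo_id = "user/model-repo"

# Merge the new card metadata into the README front matter.
# overwrite=True allows existing keys (e.g. the old `llama2` license)
# to be replaced rather than raising an error.
metadata_update(
    repo_id,
    {
        "license": "mit",
        "pipeline_tag": "text-generation",
        "library_name": "transformers",
        "tags": ["quantization"],
    },
    overwrite=True,
)
```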
The full `README.md` diff follows:

````diff
@@ -1,20 +1,95 @@
 ---
 base_model:
 - meta-llama/Llama-2-7b-hf
+license: mit
 base_model_relation: quantized
-license: llama2
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- quantization
 ---
-# Model Card
 
-
-- Quantization method: SqueezeLLM
-- Target bit-width: 4
-- Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
-- Calibration data: RedPajama (1024 sentences / 4096 tokens)
-- Calibration objective: Next-token prediction
+# GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
 
-
-- Follow the instruction in https://github.com/snu-mllab/GuidedQuant.
+**GuidedQuant** is a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. It consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, it introduces a novel non-uniform scalar quantization algorithm, LNQ, which is guaranteed to monotonically decrease the quantization objective value and outperforms existing methods in this category.
 
-
-
+* **Paper**: [GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)
+* **Project Page**: [https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)
+* **GitHub Repository**: [https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)
+
+## Model Details
+
+* Base model: `meta-llama/Llama-2-7b-hf`
+* Quantization method: SqueezeLLM
+* Target bit-width: 4
+* Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
+* Calibration data: RedPajama (1024 sentences / 4096 tokens)
+* Calibration objective: Next-token prediction
+
+## Overview
+
+*(Overview figure from the GuidedQuant repository, in light- and dark-mode variants.)*
+
+**GuidedQuant** enhances LLM quantization by integrating gradient information from the end loss into the quantization objective, boosting the performance of SOTA *weight-only scalar*, *weight-only vector*, and *weight-and-activation* quantization. Additionally, we introduce **LNQ**, a non-uniform scalar quantization algorithm which is guaranteed to monotonically decrease the quantization objective value.
+
+## Quick Start
+
+You can easily load and test the pre-quantized models using the `AnyPrecisionForCausalLM` class, as shown in the following example (runs on one RTX 3090).
+
+```python
+from any_precision.modules.AnyPrecisionForCausalLM import AnyPrecisionForCausalLM
+from transformers import AutoTokenizer, TextStreamer
+import torch
+
+quantized_model_name = "jusjinuk/Llama-3.3-70B-Instruct-2bit-GuidedQuant-LNQ"
+# Use float16 for Llama models, and bfloat16 for Qwen / Gemma models
+dtype = torch.float16 if "llama" in quantized_model_name.lower() else torch.bfloat16
+
+model = AnyPrecisionForCausalLM.from_quantized(quantized_model_name, torch_dtype=dtype)
+tokenizer = AutoTokenizer.from_pretrained(quantized_model_name)
+streamer = TextStreamer(tokenizer)
+
+prompt = "Write me a short and concise story about Harry, Ron, and Hermione.\n"
+chat = [
+    {"role": "system", "content": "You are a helpful assistant.\n"},
+    {"role": "user", "content": prompt},
+]
+
+inputs = tokenizer.apply_chat_template(
+    chat, tokenize=True, return_tensors="pt", add_generation_prompt=True
+).to(model.device)
+
+model.generate(
+    inputs, max_new_tokens=200, do_sample=False, temperature=1.0,
+    streamer=streamer, pad_token_id=tokenizer.eos_token_id,
+)
+```
+
+## How to Run & Inference Speed-up
+
+For detailed instructions on installation, downloading calibration data, weight-only scalar, vector, and weight-and-activation quantization, evaluation, and inference speed-up, please refer to the documentation and scripts in the [GuidedQuant GitHub repository](https://github.com/snu-mllab/GuidedQuant).
+
+## Acknowledgement
+
+This code is heavily based on the following repositories:
+
+- [Any-Precision-LLM](https://github.com/SNU-ARC/any-precision-llm)
+- [QTIP](https://github.com/Cornell-RelaxML/qtip)
+- [SpinQuant](https://github.com/facebookresearch/SpinQuant)
+- [AQLM](https://github.com/Vahe1994/AQLM)
+- [Fast Hadamard Transform](https://github.com/Dao-AILab/fast-hadamard-transform)
+- [gpt-fast](https://github.com/pytorch-labs/gpt-fast)
+
+We thank the authors for their open-source implementations and contributions to the community.
+
+## Citation
+
+Please cite our paper if you find our work useful:
+
+```bibtex
+@inproceedings{kim2025guidedquant,
+  title={GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance},
+  author={Jinuk Kim and Marwa El Halabi and Wonpyo Park and Clemens JS Schaefer and Deokjae Lee and Yeonhong Park and Jae W. Lee and Hyun Oh Song},
+  booktitle={International Conference on Machine Learning (ICML)},
+  year={2025}
+}
+```
````
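Once merged, the new metadata can be sanity-checked from Python. A minimal sketch using `huggingface_hub`, again with a placeholder repo id rather than this model's actual repository name:

```python
from huggingface_hub import ModelCard

# Placeholder repo id: substitute the actual Hub repository the card belongs to.
card = ModelCard.load("user/model-repo")

# card.data holds the parsed YAML front matter added in this PR.
print(card.data.license)       # expected: "mit"
print(card.data.pipeline_tag)  # expected: "text-generation"
print(card.data.library_name)  # expected: "transformers"
print(card.data.tags)          # expected: ["quantization"]
```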