Update README.md
README.md
CHANGED
@@ -47,7 +47,8 @@ Only weights and activations of the linear operators within transformers blocks
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
Linear scaling factors are computed by minimizing the mean squared error (MSE).
-The [
+The [SmoothQuant](https://arxiv.org/abs/2211.10438) algorithm is used to alleviate outliers in the activations, while the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization.
+Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).

## Deployment
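For intuition, the scheme described in this hunk can be sketched in a few lines of PyTorch: a fixed per-output-channel scale for the weights, a per-token scale computed at runtime for the activations, and a simple grid search standing in for the MSE-driven scale selection. This is an illustration only, not the llm-compressor implementation; the function names and the grid-search observer are assumptions made for the sketch.

```python
import torch

def per_channel_static_weight_quant(weight: torch.Tensor, n_grid: int = 100):
    """Symmetric static per-channel INT8 weight quantization: one fixed scale per
    output channel, picked by minimizing the MSE of the INT8 round-trip
    (a plain grid search stands in for an MSE observer)."""
    absmax = weight.abs().amax(dim=1)                       # [out_channels]
    best_scale = absmax / 127.0
    best_err = torch.full_like(absmax, float("inf"))
    for shrink in torch.linspace(0.2, 1.0, n_grid):
        scale = (shrink * absmax / 127.0).clamp(min=1e-8)   # candidate per-channel scales
        q = torch.clamp(torch.round(weight / scale[:, None]), -127, 127)
        err = ((q * scale[:, None] - weight) ** 2).mean(dim=1)
        better = err < best_err
        best_scale = torch.where(better, scale, best_scale)
        best_err = torch.minimum(err, best_err)
    q = torch.clamp(torch.round(weight / best_scale[:, None]), -127, 127).to(torch.int8)
    return q, best_scale                                    # scales are frozen after calibration

def per_token_dynamic_act_quant(x: torch.Tensor):
    """Symmetric dynamic per-token INT8 activation quantization: one scale per token,
    recomputed at runtime from that token's absolute maximum."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale                                         # dequantize as q.float() * scale
```

The point of the sketch is only the shape of the computation: weight scales are fixed once calibration is done, while activation scales depend on each incoming token.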
@@ -108,14 +109,24 @@ ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

-recipe =
-
-
-
-
-
-
-
+recipe = [
+    SmoothQuantModifier(
+        smoothing_strength=0.7,
+        mappings=[
+            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
+            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
+            [["re:.*down_proj"], "re:.*up_proj"],
+        ],
+    ),
+    GPTQModifier(
+        sequential=True,
+        targets="Linear",
+        scheme="W8A8",
+        ignore=["lm_head"],
+        dampening_frac=0.01,
+        observer="mse",
+    )
+]

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
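The hunk above is cut off before the compression call. The sketch below shows how a recipe like this is typically wired into a one-shot run with llm-compressor; the import paths, the `oneshot` keyword arguments, the `max_seq_length` value, and the placeholder model id and save directory are assumptions based on the library's usual usage, not text recovered from this README.

```python
# Sketch only: wiring the recipe above into a one-shot compression run with
# llm-compressor. Assumptions (not from this README): the import paths, the
# oneshot() keyword arguments, max_seq_length, and the placeholder model id
# and save directory.
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

model_id = "<base-model-id>"   # placeholder; the README defines model_id earlier
num_samples = 512              # 512 calibration sequences, as stated above
max_seq_length = 4096          # assumption; the actual value is not visible in this diff

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
# the README also maps a preprocess_fn over ds; its definition is outside this diff

recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.7,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        sequential=True, targets="Linear", scheme="W8A8",
        ignore=["lm_head"], dampening_frac=0.01, observer="mse",
    ),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto",
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_samples,
)
model.save_pretrained("<quantized-model-dir>")  # placeholder output path
```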
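As background on the `smoothing_strength=0.7` setting: following the cited SmoothQuant paper, each mapped activation/weight pair is rescaled by per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha), migrating quantization difficulty from the activations into the weights while leaving the layer's output unchanged. A minimal sketch under those assumptions (not the llm-compressor implementation; function names are illustrative):

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.7) -> torch.Tensor:
    """Per-input-channel SmoothQuant factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    act_absmax: per-channel absolute maxima of the activations feeding a linear layer,
                collected over calibration data, shape [in_features].
    weight:     the linear layer's weight, shape [out_features, in_features].
    """
    w_absmax = weight.abs().amax(dim=0)              # per input channel, shape [in_features]
    s = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)
    return s.clamp(min=1e-5)

def apply_smoothing(x: torch.Tensor, weight: torch.Tensor, s: torch.Tensor):
    """Equivalent rescaling: (x / s) @ (weight * s).T == x @ weight.T,
    but the rescaled activations have milder outliers and quantize better."""
    return x / s, weight * s
```

With alpha set to 0.7, the balance leans toward smoothing the activations more aggressively, which matches the recipe above; in practice the division by s is folded into the preceding normalization layers listed in the mappings, so no extra runtime work is introduced.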