Having Trouble playing quantize with llama.cpp

#8
by bobchenyx - opened

First of all, thanks for all the amazing work!

I pulled the BF16 weights and `imatrix_unsloth.dat` from unsloth/DeepSeek-V3-0324-GGUF-UD

and tried to play around with llama.cpp quantizations.

However, I ran into the issue below: `tensor cols 128 x 512 are not divisible by 256`.

```
================================ Have weights data with 720 entries
[   1/1086]                        output.weight - [ 7168, 129280,     1,     1], type =   bf16,
====== llama_model_quantize_impl: did not find weights for output.weight
converting to q8_0 .. load_imatrix: imatrix dataset='unsloth_calibration_DeepSeek-V3-0324.txt'
load_imatrix: loaded 720 importance matrix entries from /home/user1/workspace/llm-work/unsloth/DeepSeek-V3-0324-GGUF-UD/imatrix_unsloth.dat computed on 60 chunks
prepare_imatrix: have 720 importance matrix entries
size =  1767.50 MiB ->   938.98 MiB
[   2/1086]                   output_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   3/1086]                    token_embd.weight - [ 7168, 129280,     1,     1], type =   bf16,
====== llama_model_quantize_impl: did not find weights for token_embd.weight
converting to q8_0 .. size =  1767.50 MiB ->   938.98 MiB
[   4/1086]                blk.0.attn_k_b.weight - [  128,   512,   128,     1], type =   bf16,

llama_tensor_get_type : tensor cols 128 x 512 are not divisible by 256, required for iq1_m - using fallback quantization iq4_nl

====== llama_model_quantize_impl: imatrix size 128 is different from tensor size 16384 for blk.0.attn_k_b.weight
llama_model_quantize: failed to quantize: imatrix size 128 is different from tensor size 16384 for blk.0.attn_k_b.weight
main: failed to quantize model from '/home/user1/workspace/llm-work/unsloth/DeepSeek-V3-0324-GGUF-UD/BF16/DeepSeek-V3-0324-BF16-00001-of-00030.gguf'
```

I'd like to kindly ask: is this a llama.cpp issue, or am I not using things correctly?
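For what it's worth, the numbers in the error line themselves are consistent: `blk.0.attn_k_b.weight` is a 3D tensor of shape `[128, 512, 128]`, and 128 × 128 = 16384, which is exactly the "tensor size 16384" that the quantizer expects, while the imatrix file only stored 128 entries for that tensor. Here's a minimal sketch of that arithmetic (my assumption about what the length check compares, inferred from the log above, not the actual llama.cpp code):

```python
# Shape of blk.0.attn_k_b.weight, copied from the quantize log.
ne = [128, 512, 128, 1]

# Assumption: for a 3D tensor the quantizer wants one importance value
# per column per slice, i.e. ne[0] * ne[2] entries.
expected = ne[0] * ne[2]

# Entries actually present in imatrix_unsloth.dat for this tensor,
# per the error message ("imatrix size 128").
stored = 128

print(expected)            # 16384 -- matches "tensor size 16384" in the error
print(expected == stored)  # False -> the mismatch that aborts quantization
```

So the imatrix appears to have been computed against a differently-shaped version of this tensor, which is why quantization aborts rather than falling back.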

Here's my command for reference:

```shell
build/bin/llama-quantize \
    --imatrix unsloth/DeepSeek-V3-0324-GGUF-UD/imatrix_unsloth.dat \
    --token-embedding-type Q8_0 \
    --output-tensor-type Q8_0 \
    unsloth/DeepSeek-V3-0324-GGUF-UD/BF16/DeepSeek-V3-0324-BF16-00001-of-00030.gguf \
    DeepSeek-V3-0324-IQ1_M/DeepSeek-V3-0324-IQ1_M.gguf \
    IQ1_M \
    48 2>&1 | tee DeepSeek-V3-0324-IQ1_M.log
```
bobchenyx changed discussion status to closed
