benchmarks

#8
by BernardH

In case anyone is interested, here are results on one socket of a dual-socket Epyc Gen2 plus a 4090:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf -c 32768 -fmoe -mla 3 -amb 512 --n-gpu-layers 63 -ctk q8_0 -fa --override-tensor exps=CPU -ub 512
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 63, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   40.301 |    12.70 |   28.735 |     4.45 |
|   512 |    128 |    512 |   34.637 |    14.78 |   24.329 |     5.26 |
|   512 |    128 |   1024 |   29.901 |    17.12 |   24.911 |     5.14 |
|   512 |    128 |   1536 |   35.815 |    14.30 |   25.276 |     5.06 |
|   512 |    128 |   2048 |   42.172 |    12.14 |   25.239 |     5.07 |
|   512 |    128 |   2560 |   31.488 |    16.26 |   25.812 |     4.96 |
CUDA error: out of memory
current device: 0, in function alloc at /home/b/ik_llama.cpp/ggml/src/ggml-cuda.cu:384
cuMemCreate(&handle, reserve_size, &prop, 0)
/home/b/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Aborted

real 9m44.155s
user 133m46.209s
sys 5m45.586s

Owner

Thanks Bernard!

Interesting that it OOM'd on VRAM with n_kv only just over ~2560, as I'm pretty sure 32k context will fit in under 24GB VRAM, especially at -ctk q8_0...

Otherwise your command looks reasonable to me. I personally don't pin the CPU cores directly, and I use echo 0 | sudo tee -a /proc/sys/kernel/numa_balancing, though imo it doesn't do much.
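For anyone following along, checking the current value and flipping that knob looks like this (it resets on reboot, and the append form above does the same thing):

# Show whether automatic NUMA balancing is currently on (1) or off (0)
cat /proc/sys/kernel/numa_balancing
# Turn it off for this boot
echo 0 | sudo tee /proc/sys/kernel/numa_balancing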

Given this is the largest model of the set, probably not enough extra VRAM available to offload an exps layer onto GPU either.

Also, without more free VRAM, cranking up -b 4096 -ub 4096 or other combinations might not be possible, though there could be a little room to play there. You might be able to free up ~500MiB of VRAM by trying -ot attn_k_b=CPU as well, assuming that tensor isn't actually being used for inferencing; a sketch with that override added follows the command below. I'm still experimenting with that one, which I learned from anikifoss.

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
../ik_llama.cpp/build/bin/llama-sweep-bench \
    --numa numactl \
    -t 45 \
    --threads-batch 45 \
    --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
    -c 32768 \
    -fmoe -mla 3 \
    -amb 512 \
    --n-gpu-layers 63 \
    -ctk q8_0 \
    -fa \
    --override-tensor exps=CPU \
    -ub 512
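
For reference, the attn_k_b experiment mentioned above is just one more override in the same invocation. Treat this as an untested sketch; I haven't confirmed the VRAM saving on this exact rig:

# Same sweep-bench run as above, with attn_k_b kept on CPU in the hope
# of freeing a bit of VRAM
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
../ik_llama.cpp/build/bin/llama-sweep-bench \
    --numa numactl \
    -t 45 \
    --threads-batch 45 \
    --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
    -c 32768 \
    -fmoe -mla 3 \
    -amb 512 \
    --n-gpu-layers 63 \
    -ctk q8_0 \
    -fa \
    --override-tensor attn_k_b=CPU \
    --override-tensor exps=CPU \
    -ub 512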

Have fun with these big models on your monster rig!

Thx!
Btw, I was thinking that all the example command lines focus only on the speed aspects; maybe it would be useful to also include the recommended sampling parameters (temp 0.6, top_p 0.95) for actual generation.
Best Regards.

Owner

Good idea, I just pushed a small note in the README with those official recommended settings you list: --temp 0.6 --top-p 0.95!
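
For actual generation (as opposed to the sweep benches above), those settings can also be sent per-request. A hedged sketch against llama-server's OpenAI-compatible endpoint, assuming it is listening on the default port 8080:

# temperature/top_p set per request instead of on the server command line
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Hello"}],
          "temperature": 0.6,
          "top_p": 0.95
        }'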

Equipment:

  • Chinese mod of 4090 D, with 48GB VRAM
  • Intel Xeon 5218 (16 cores)
  • 6 channels of DDR4-2666, 64GB each
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:00:11.0 Off |                  Off |
| 36%   56C    P0             95W /  425W |   42265MiB /  49140MiB |     39%      Default |
|                                         |                        |                  N/A |

Build:

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12

./build/bin/llama-server --version
version: 3745 (a0ac16b9)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Model is 220GB. Running params:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
    --model /mnt/models/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 32768 \
    --cache-type-k  q8_0  --flash-attn -mla 3 \
    -amb 512 \
    -fmoe \
    --temp 0.6 --top-p 0.95 \
    --n-gpu-layers 999 \
    --override-tensor "blk\.([1-9])\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16

Are these the right parameters?

Results:

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   43.612 |    11.74 |   19.097 |     6.70 |
|   512 |    128 |    512 |   44.266 |    11.57 |   20.046 |     6.39 |
|   512 |    128 |   1024 |   39.652 |    12.91 |   19.219 |     6.66 |
|   512 |    128 |   1536 |   44.086 |    11.61 |   19.128 |     6.69 |
|   512 |    128 |   2048 |   43.610 |    11.74 |   19.326 |     6.62 |
|   512 |    128 |   2560 |   44.873 |    11.41 |   20.361 |     6.29 |
|   512 |    128 |   3072 |   43.796 |    11.69 |   19.408 |     6.60 |
|   512 |    128 |   3584 |   43.963 |    11.65 |   19.361 |     6.61 |
|   512 |    128 |   4096 |   43.927 |    11.66 |   19.285 |     6.64 |
|   512 |    128 |   4608 |   43.925 |    11.66 |   19.188 |     6.67 |
|   512 |    128 |   5120 |   43.956 |    11.65 |   19.703 |     6.50 |
|   512 |    128 |   5632 |   44.270 |    11.57 |   19.878 |     6.44 |

If I add -b 4096 -ub 4096:

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   49.803 |    82.24 |  153.659 |     6.66 |
|  4096 |   1024 |   4096 |   51.596 |    79.39 |  156.569 |     6.54 |
|  4096 |   1024 |   8192 |   52.323 |    78.28 |  158.353 |     6.47 |

Found that speed is about the same for:

  • -mla 3 vs -mla 2
  • -fmoe vs no -fmoe

@SlavikF

Great start!

Hardware

A few thoughts that might allow slightly better performance:

Intel Xeon 5218

How many NUMA nodes are configured? Ideally you want all RAM presented as a single NUMA node. Configure the BIOS with something like SNC=Disable, or possibly adjust the snoop modes, to get the best RAM read bandwidth possible.
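
A quick way to check what the BIOS is actually presenting (both are standard Linux tools):

# Count NUMA nodes and see how much RAM is attached to each
numactl --hardware
lscpu | grep -i numa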

DDR4-2666

This is probably the bottleneck for token generation speed. If you could clock it faster, you might get a little more tok/sec during TG.
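
If you want a rough number for what those six channels actually deliver, a simple sysbench run works as a sketch (assuming sysbench is installed; the theoretical peak for 6 channels of DDR4-2666 is roughly 128 GB/s):

# Multi-threaded sequential read test; compare the reported MiB/sec
# against the ~128 GB/s theoretical peak for 6 x DDR4-2666
sysbench memory --threads=16 --memory-block-size=1M \
    --memory-total-size=64G --memory-oper=read run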

Build

For DeepSeek R1 models with a CUDA device I would recommend:

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1

-DGGML_CUDA_IQK_FORCE_BF16=1 will prevent NaNs, though it might be slightly slower on a 4090. It is actually faster on a 3090 in my experience.

-DGGML_SCHED_MAX_COPIES=1 will probably save you some VRAM, though you can play with it. It is very useful for multi-GPU setups.

-DCMAKE_CUDA_HOST_COMPILER I'm not familiar with; if it works for you, that's great.
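
Putting that together with your host-compiler flag, a full configure-and-build sketch (assuming the same g++-12 setup as your original command):

cmake -B build -DGGML_CUDA=ON \
    -DGGML_CUDA_IQK_FORCE_BF16=1 \
    -DGGML_SCHED_MAX_COPIES=1 \
    -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12
cmake --build build --config Release -j $(nproc)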

Inference

Your command looks good. -mla 3 is probably the recommended choice now with the latest version of ik_llama.cpp, and -fmoe is recommended as well. It might not have much effect at short prompt lengths though, given your RAM bandwidth is probably the bottleneck.

--override-tensor "blk\.([1-9])\.ffn_.*=CUDA0" \
Keep in mind that layers ("blk") 0-2 are the shared-expert (shexps) layers; the routed experts (exps) begin at layer 3. So while your command is okay, it is redundant. If you want to be clearer and avoid the redundancy, you can use -ot "blk\.([3-9])\.ffn_.*=CUDA0", which should be equivalent. You might have enough VRAM left over to offload an additional layer, especially if you drop to -amb 256.

Let's put it all together now:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
    --model /mnt/models/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 32768 \
    --cache-type-k  q8_0  --flash-attn -mla 3 \
    -amb 256 \
    -fmoe \
    --n-gpu-layers 999 \
    --override-tensor "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    --warmup-batch \
    --parallel 1 \
    --threads 16

You might be able to extend the list with |11|12 and keep going if VRAM allows. This is easier to read and think about. Use whatever regex style you prefer.

Keep in mind the tensor names are different for Qwen3-235B, so offloading that model is a little different.
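
One way to see the exact tensor names for whichever model you are offloading is to dump the tensor list. A hedged sketch using the gguf Python package from llama.cpp's gguf-py (swap in the Qwen3 file to see its names):

# List the ffn tensor names so the -ot regexes can be written against them
pip install gguf
gguf-dump /mnt/models/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
    | grep -E 'blk\.[0-9]+\.ffn' | head -n 20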

Good luck!
