Benchmarks
In case anyone is interested, here are results on one socket of a dual-socket Epyc Gen2 plus a 4090:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf -c 32768 -fmoe -mla 3 -amb 512 --n-gpu-layers 63 -ctk q8_0 -fa --override-tensor exps=CPU -ub 512
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 63, n_threads = 45, n_threads_batch = 45
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 40.301 | 12.70 | 28.735 | 4.45 |
| 512 | 128 | 512 | 34.637 | 14.78 | 24.329 | 5.26 |
| 512 | 128 | 1024 | 29.901 | 17.12 | 24.911 | 5.14 |
| 512 | 128 | 1536 | 35.815 | 14.30 | 25.276 | 5.06 |
| 512 | 128 | 2048 | 42.172 | 12.14 | 25.239 | 5.07 |
| 512 | 128 | 2560 | 31.488 | 16.26 | 25.812 | 4.96 |

CUDA error: out of memory
current device: 0, in function alloc at /home/b/ik_llama.cpp/ggml/src/ggml-cuda.cu:384
cuMemCreate(&handle, reserve_size, &prop, 0)
/home/b/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Aborted
real 9m44.155s
user 133m46.209s
sys 5m45.586s
Thanks Bernard!
Interesting that it OOM'd on VRAM with n_kv only just over ~2560, as I'm pretty sure a 32k context should fit in under 24GB VRAM, especially at -ctk q8_0.
...
Otherwise your command looks reasonable to me. I personally don't pin the CPU cores directly, and I use echo 0 | sudo tee -a /proc/sys/kernel/numa_balancing, though in my opinion it doesn't do much.
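If you want to verify that setting took effect, a quick check with standard procfs (nothing ik_llama.cpp specific):
cat /proc/sys/kernel/numa_balancing   # 0 means automatic NUMA balancing is off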
Given this is the largest model of the set, there is probably not enough extra VRAM available to offload an exps layer onto the GPU either.
Also, without more free VRAM, cranking up -b 4096 -ub 4096 or other combinations might not be possible, though there could be a little room to play there. You might be able to free up roughly ~500MiB of VRAM by trying -ot attn_k_b=CPU as well (see the sketch after the command below), assuming that tensor isn't actually being used for inference. I'm still experimenting with that one, which I learned from anikifoss.
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
../ik_llama.cpp/build/bin/llama-sweep-bench \
--numa numactl \
-t 45 \
--threads-batch 45 \
--model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
-c 32768 \
-fmoe -mla 3 \
-amb 512 \
--n-gpu-layers 63 \
-ctk q8_0 \
-fa \
--override-tensor exps=CPU \
-ub 512
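For the attn_k_b idea mentioned above, the only change would be one extra override alongside the exps one; the tail of the command would look roughly like this (a sketch I haven't verified on this exact setup):
--override-tensor attn_k_b=CPU \
--override-tensor exps=CPU \
-ub 512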
Have fun with these big models on your monster rig!
Thanks!
Btw, I was thinking that all the example command lines focus only on the speed aspects; maybe it would also be useful to include the recommended sampling parameters (temp 0.6, top-p 0.95) for actual generation.
Best Regards.
Good idea, I just pushed a small note to the README with the official recommended settings you list: --temp 0.6 --top-p 0.95!
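For actual generation (as opposed to llama-sweep-bench), those settings just get appended to the serving command, roughly like this (the model path is a placeholder, and the same context/offload flags as in the benchmark commands above would go here too):
./build/bin/llama-server \
--model /path/to/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
--temp 0.6 \
--top-p 0.95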
Equipment:
- Chinese mod of 4090 D, with 48GB VRAM
- Intel Xeon 5218 (16 cores)
- 6 channels of DDR4-2666 * 64GB
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 1 NVIDIA GeForce RTX 4090 D On | 00000000:00:11.0 Off | Off |
| 36% 56C P0 95W / 425W | 42265MiB / 49140MiB | 39% Default |
| | | N/A |
Build:
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12
./build/bin/llama-server --version
version: 3745 (a0ac16b9)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Model is 220GB. Running params:
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
--model /mnt/models/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
--ctx-size 32768 \
--cache-type-k q8_0 --flash-attn -mla 3 \
-amb 512 \
-fmoe \
--temp 0.6 --top-p 0.95 \
--n-gpu-layers 999 \
--override-tensor "blk\.([1-9])\.ffn_.*=CUDA0" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16
Are these the right parameters?
Results:
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 43.612 | 11.74 | 19.097 | 6.70 |
| 512 | 128 | 512 | 44.266 | 11.57 | 20.046 | 6.39 |
| 512 | 128 | 1024 | 39.652 | 12.91 | 19.219 | 6.66 |
| 512 | 128 | 1536 | 44.086 | 11.61 | 19.128 | 6.69 |
| 512 | 128 | 2048 | 43.610 | 11.74 | 19.326 | 6.62 |
| 512 | 128 | 2560 | 44.873 | 11.41 | 20.361 | 6.29 |
| 512 | 128 | 3072 | 43.796 | 11.69 | 19.408 | 6.60 |
| 512 | 128 | 3584 | 43.963 | 11.65 | 19.361 | 6.61 |
| 512 | 128 | 4096 | 43.927 | 11.66 | 19.285 | 6.64 |
| 512 | 128 | 4608 | 43.925 | 11.66 | 19.188 | 6.67 |
| 512 | 128 | 5120 | 43.956 | 11.65 | 19.703 | 6.50 |
| 512 | 128 | 5632 | 44.270 | 11.57 | 19.878 | 6.44 |
If I add -b 4096 -ub 4096:
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 49.803 | 82.24 | 153.659 | 6.66 |
| 4096 | 1024 | 4096 | 51.596 | 79.39 | 156.569 | 6.54 |
| 4096 | 1024 | 8192 | 52.323 | 78.28 | 158.353 | 6.47 |
Found that speed is about the same for -mla 3 vs -mla 2, and for -fmoe vs no -fmoe.
Great start!
Hardware
A few thoughts that might allow slightly better performance:
Intel Xeon 5218
How many NUMA nodes are configured? Ideally you have all the RAM presented as a single NUMA node. Configure the BIOS with some kind of SNC=Disable setting, or possibly one of the snoop modes, to get the best RAM read bandwidth possible.
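A quick way to see how many nodes the firmware is currently exposing (standard Linux tools, just a sketch of what to look for):
lscpu | grep -i numa
numactl --hardware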
DDR4-2666
This is probably the bottleneck for token generation speed. If you could clock it faster, you might get a little more tok/sec during TG.
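A rough back-of-envelope for why (my arithmetic, assuming all six channels are populated and running at the rated speed):

$$6 \text{ channels} \times 8 \text{ bytes/transfer} \times 2666 \text{ MT/s} \approx 128 \text{ GB/s theoretical peak}$$

Token generation has to stream the active expert weights out of RAM for every token, so that ceiling (and the fraction of it you actually achieve) largely determines TG speed.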
Build
Build:
For DeepSeek R1 models with a CUDA device I would recommend:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
-DGGML_CUDA_IQK_FORCE_BF16 will prevent nan outputs, though it might be slightly slower on a 4090. It is actually faster on a 3090 in my experience.
-DGGML_SCHED_MAX_COPIES=1 will probably save you some VRAM, though you can play with it. Very useful for multi-GPU.
-DCMAKE_CUDA_HOST_COMPILER: I'm not familiar with this one; if it works for you, that's great.
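Putting those together, a minimal build sketch (standard CMake invocation; the build directory name and job count are arbitrary choices):
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)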
Inference
Your command looks good. -mla 3 is probably recommended now with the latest version of ik_llama.cpp. -fmoe is recommended as well, though it might not have much effect at short prompt lengths, given your RAM bandwidth is probably the bottleneck.
--override-tensor "blk\.([1-9])\.ffn_.*=CUDA0" \
Keep in mind that the routed experts exps only begin at layer 3; layers blk 0-2 are the dense/shexps part of the model. So while your command is okay, it is redundant. If you want to be clearer and avoid the redundancy, you can use -ot "blk\.([3-9])\.ffn_.*=CUDA0", which should be equivalent. You might have enough VRAM left over to offload an additional layer, especially if you drop -amb to 256.
Let's put it all together now:
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
--model /mnt/models/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
--ctx-size 32768 \
--cache-type-k q8_0 --flash-attn -mla 3 \
-amb 256 \
-fmoe \
--n-gpu-layers 999 \
--override-tensor "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
--override-tensor exps=CPU \
--warmup-batch \
--parallel 1 \
--threads 16
You might be able to extend that with |11 and keep going if VRAM allows. This style is easier to read and think about; use whatever regex style you prefer.
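If you ever want to double-check which layers a pattern grabs before loading a 200GB model, a quick shell sketch like this works (grep -E is not the exact regex engine ik_llama.cpp uses, but it behaves the same for patterns this simple, and the tensor name is just an illustrative example):
for i in $(seq 0 12); do
  name="blk.${i}.ffn_gate_exps.weight"
  echo "$name" | grep -Eq 'blk\.(3|4|5|6|7|8|9|10)\.ffn_' && echo "layer $i matches"
done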
Keep in mind the tensor names are different for Qwen3-235B, so offloading that model works a little differently.
Good luck!