Running the model on 3*4090 fails with an out-of-memory error. The documentation says it requires ~55 GB of GPU RAM in bf16 or fp16, so three 24 GB cards (~72 GB in total) should in principle be enough.
docker run -d --name Mistral --runtime nvidia --gpus '"device=1,2,3"' \
-v /home/sunwei/models/Mistral-Small-3.1-24B-Instruct-2503:/root/.cache/modelscope/hub/Mistral \
-p 18081:8080 \
--ipc=host \
vllm/vllm-openai:v0.8.4 \
--model /root/.cache/modelscope/hub/Mistral \
--served-model-name Mistral-Small \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--limit_mm_per_prompt 'image=10' \
--tool-call-parser mistral --enable-auto-tool-choice \
--chat-template examples/tool_chat_template_mistral.jinja \
--tensor-parallel-size 1 \
--pipeline-parallel-size 3 \
--port 8080
ERROR:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 946.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 614.69 MiB is free. Process 2416128 has 23.03 GiB memory in use. Of the allocated memory 22.43 GiB is allocated by PyTorch, and 37.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 04-24 00:46:39 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W424 00:46:40.011589917 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
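The OOM message itself suggests enabling expandable segments, and vLLM's startup profiling footprint can shrink if the context length and per-prompt image limit are capped. Below is a sketch of the same command with those knobs turned down for the 3-GPU case; the -e variable is taken straight from the error text, while the 16384 context length and image=2 limit are guesses, not a confirmed fix:

docker run -d --name Mistral --runtime nvidia --gpus '"device=1,2,3"' \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v /home/sunwei/models/Mistral-Small-3.1-24B-Instruct-2503:/root/.cache/modelscope/hub/Mistral \
-p 18081:8080 \
--ipc=host \
vllm/vllm-openai:v0.8.4 \
--model /root/.cache/modelscope/hub/Mistral \
--served-model-name Mistral-Small \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--limit_mm_per_prompt 'image=2' \
--tool-call-parser mistral --enable-auto-tool-choice \
--chat-template examples/tool_chat_template_mistral.jinja \
--tensor-parallel-size 1 \
--pipeline-parallel-size 3 \
--max-model-len 16384 \
--port 8080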
With 4*4090 it works fine, but it uses about 85 GB of GPU memory:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4000 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 6352 C /usr/local/bin/python 2796MiB |
| 0 N/A N/A 2441844 C /usr/bin/python3 19216MiB |
| 1 N/A N/A 4000 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2441935 C /usr/bin/python3 22016MiB |
| 2 N/A N/A 4000 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2441992 C /usr/bin/python3 22016MiB |
| 3 N/A N/A 4000 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2442066 C /usr/bin/python3 22016MiB |
+-----------------------------------------------------------------------------------------+
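Summing the four /usr/bin/python3 rows gives 85,264 MiB, which is where the ~85 GB figure comes from; presumably vLLM fills each card up to its default gpu-memory-utilization (0.9) with KV cache rather than stopping at the ~55 GB the weights need. A quick way to tally per-GPU usage, using standard nvidia-smi query options:

# per-GPU usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# total used across all GPUs, in MiB (includes non-vLLM processes)
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{s+=$1} END {print s " MiB"}'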