---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen2.5-VL
- Qwen2.5-VL-3B-Instruct
- Int8
- VLM
---

# Qwen2.5-VL-3B-Instruct

This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 3.4

## Conversion tool links

If you are interested in model conversion, you can try to export the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct

[Pulsar2: How to Convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera)

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Image Process**

| Chip | Input size | Image num | Image encoder latency | TTFT (320 tokens) | w8a16 decode speed | DDR | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 448*448 | 1 | 780 ms | 2857 ms | 6.2 tokens/sec | 4.3 GiB | 4.6 GiB |

**Video Process**

| Chip | Input size | Image num | Image encoder latency | TTFT (512 tokens) | w8a16 decode speed | DDR | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 308*308 | 8 | 1400 ms | 5400 ms | 6.1 tokens/sec | 4.4 GiB | 4.7 GiB |

The DDR column is the CMM memory consumed at runtime. Make sure the CMM allocation on the development board is larger than this value.

## How to use

Download all files from this repository to the device.

**If you are using the AX650 board**

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
.
├── image
│   └── ssd_car.jpg
├── main
├── python
│   ├── cv_resize.py
│   ├── infer_image.py
│   ├── infer_text.py
│   ├── infer_video.py
│   ├── preprocess.py
│   └── utils.py
├── qwen2_5-vl-3b-image-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p320_l0_together.axmodel
......
│   ├── qwen2_5_vl_p320_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-3b-video-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p512_l0_together.axmodel
......
│   ├── qwen2_5_vl_p512_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-tokenizer
│   ├── chat_template.json
│   ├── config.json
│   ├── generation_config.json
│   ├── merges.txt
│   ├── model.safetensors.index.json
│   ├── preprocessor_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── qwen2_tokenizer_image_448.py
├── qwen2_tokenizer_video_308.py
├── run_qwen2_5_vl_image.sh
├── run_qwen2_5_vl_video.sh
└── video
    ├── frame_0075.jpg
......
    └── frame_0089.jpg
```

### Prepare tokenizer server

#### Install transformers

```
pip install transformers==4.41.1 jinja2
```

### Demo Run

#### Image understanding demo

##### Start the tokenizer server for the image understanding demo

```
python3 qwen2_tokenizer_image_448.py --port 12345
```

##### Run the image understanding demo

- input text

```
描述下图片
```

- input image

![](./image/ssd_car.jpg)

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  40 /  40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 320
[I][ Init][ 292]: vpm_height : 1024,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running

prompt >> who are you?
image >>
[I][ Run][ 638]: ttft: 2854.47 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.

[N][ Run][ 779]: hit eos,avg 6.05 token/s

prompt >> 描述下图片
image >> image/ssd_car.jpg
[I][ Encode][ 416]: image encode time : 795.614014 ms, size : 524288
[I][ Run][ 638]: ttft: 2856.88 ms
这张图片展示了一条繁忙的城市街道。前景中,一名女子站在人行道上,她穿着黑色外套,面带微笑。她旁边是一辆红色的双层巴士,巴士上有一个广告,
上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的小型货车。背景中可以看到一些商店和行人,
街道两旁的建筑物是现代的玻璃幕墙建筑。整体氛围显得繁忙而充满活力。

[N][ Run][ 779]: hit eos,avg 5.96 token/s
```

#### Video understanding demo

Pre-process the frames of the video file into 308x308 images before running the demo; a sketch of this step is shown below.
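The repository ships python/cv_resize.py and python/preprocess.py for this purpose; their exact behavior is not documented here, so the snippet below is only a minimal, hypothetical sketch of the idea: sample frames from a video with OpenCV and resize each one to 308x308 before handing the paths to the demo. The file name demo.mp4, the sampling stride, and the output layout are assumptions for illustration.

```
# resize_frames.py -- hypothetical helper, not part of this repository.
# Samples every Nth frame from a video and writes it as a 308x308 JPEG,
# following the video/frame_XXXX.jpg layout used by the video demo below.
import os
import cv2  # pip install opencv-python

VIDEO_PATH = "demo.mp4"  # assumption: any local video file
OUT_DIR = "video"        # the demo reads frames from ./video
STRIDE = 8               # keep every 8th frame (frame_0000, frame_0008, ...)
SIZE = (308, 308)        # frame size expected by the 308x308 video pipeline

os.makedirs(OUT_DIR, exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % STRIDE == 0:
        resized = cv2.resize(frame, SIZE, interpolation=cv2.INTER_LINEAR)
        cv2.imwrite(os.path.join(OUT_DIR, f"frame_{idx:04d}.jpg"), resized)
    idx += 1
cap.release()
```

The generated frame_*.jpg files can then be fed to the demo as in the run log below.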
##### Start the tokenizer server for the video understanding demo

```
python qwen2_tokenizer_video_308.py --port 12345
```

##### Run the video understanding demo

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  40 /  40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 512
[I][ Init][ 292]: vpm_height : 484,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running

prompt >> 描述下视频
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ Encode][ 416]: image encode time : 1487.557007 ms, size : 991232
[I][ Run][ 638]: ttft: 5488.29 ms
视频展示了两只松鼠在户外的场景。背景是模糊的山脉和蓝天,前景中有松鼠在互动。松鼠的毛色主要是棕色和白色,它们的爪子是橙色的。松鼠似乎在互相玩耍或争抢,它们的爪子和嘴巴都伸向对方。整个场景显得非常自然和生动。
```

#### Inference with M.2 Accelerator card

What is the M.2 Accelerator card? See the link in the Support Platform section above. A demo running on a Raspberry Pi 5 host: TODO