---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen2.5-VL
- Qwen2.5-VL-3B-Instruct
- Int8
- VLM
---
# Qwen2.5-VL-3B-Instruct
This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 3.4
## Convert tools links:
If you are interested in model conversion, you can export the axmodel yourself, starting from the original model repo:
https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera)
## Support Platform
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
**Image Process**
| Chip | Input size | Image num | Image encoder latency | TTFT (320 tokens) | Decode speed (w8a16) | DDR (CMM) | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 448×448 | 1 | 780 ms | 2857 ms | 6.2 tokens/s | 4.3 GiB | 4.6 GiB |
**Video Process**
| Chip | Input size | Image num | Image encoder latency | TTFT (512 tokens) | Decode speed (w8a16) | DDR (CMM) | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 308×308 | 8 | 1400 ms | 5400 ms | 6.1 tokens/s | 4.4 GiB | 4.7 GiB |
The DDR column is the CMM memory the demo consumes at runtime. Make sure the CMM allocation on the development board is larger than this value (e.g. more than 4.3 GiB for the image demo).
## How to use
Download all files from this repository to the device.
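For example, with the `huggingface_hub` Python package (a minimal sketch; the `repo_id` below is an assumption, substitute this repository's actual id):
```python
# Sketch: fetch all files of the repo into a local directory,
# then copy that directory onto the board (e.g. with scp).
# NOTE: repo_id is a placeholder/assumption -- use this repository's real id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AXERA-TECH/Qwen2.5-VL-3B-Instruct",
    local_dir="./qwen2.5-vl-3b",
)
```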
**If you are using an AX650 board:**
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
.
├── image
│   └── ssd_car.jpg
├── main
├── python
│   ├── cv_resize.py
│   ├── infer_image.py
│   ├── infer_text.py
│   ├── infer_video.py
│   ├── preprocess.py
│   └── utils.py
├── qwen2_5-vl-3b-image-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p320_l0_together.axmodel
......
│   ├── qwen2_5_vl_p320_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-3b-video-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p512_l0_together.axmodel
......
│   ├── qwen2_5_vl_p512_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-tokenizer
│   ├── chat_template.json
│   ├── config.json
│   ├── generation_config.json
│   ├── merges.txt
│   ├── model.safetensors.index.json
│   ├── preprocessor_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── qwen2_tokenizer_image_448.py
├── qwen2_tokenizer_video_308.py
├── run_qwen2_5_vl_image.sh
├── run_qwen2_5_vl_video.sh
└── video
    ├── frame_0075.jpg
......
    └── frame_0089.jpg
```
### Prepare tokenizer server
#### Install transformers
```
pip install transformers==4.41.1 jinja2
```
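Optionally, sanity-check that the pinned `transformers` version can load the tokenizer shipped in `qwen2_5-vl-tokenizer/` (path taken from the directory tree above):
```python
# Quick check: load the bundled tokenizer and round-trip a prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("qwen2_5-vl-tokenizer")
ids = tok.encode("描述下图片")  # "Describe the image"
print(ids)
print(tok.decode(ids))
```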
### Demo Run
#### Image understanding demo
##### Start the tokenizer server for the image understanding demo
```
python3 qwen2_tokenizer_image_448.py --port 12345
```
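The runtime on the board talks to this process over the network instead of running the tokenizer natively. As a rough illustration of the idea only (the authoritative implementation is `qwen2_tokenizer_image_448.py` in this repo; the endpoint names and JSON payloads below are assumptions):
```python
# Illustrative tokenizer-service sketch -- NOT the actual protocol used by
# qwen2_tokenizer_image_448.py; endpoints and payload fields are assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qwen2_5-vl-tokenizer")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/encode":
            result = {"token_ids": tokenizer.encode(body["text"])}
        else:  # assume anything else is /decode
            result = {"text": tokenizer.decode(body["token_ids"])}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()
```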
##### Run the image understanding demo
- input text (Chinese for "Describe the image")
```
描述下图片
```
- input image

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
2% | █ | 1 / 40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 40 / 40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 320
[I][ Init][ 292]: vpm_height : 1024,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you?
image >>
[I][ Run][ 638]: ttft: 2854.47 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.
[N][ Run][ 779]: hit eos,avg 6.05 token/s
prompt >> 描述下图片
image >> image/ssd_car.jpg
[I][ Encode][ 416]: image encode time : 795.614014 ms, size : 524288
[I][ Run][ 638]: ttft: 2856.88 ms
这张图片展示了一条繁忙的城市街道。前景中，一名女子站在人行道上，她穿着黑色外套，面带微笑。她旁边是一辆红色的双层巴士，巴士上有一个广告，上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的小型货车。背景中可以看到一些商店和行人，街道两旁的建筑物是现代的玻璃幕墙建筑。整体氛围显得繁忙而充满活力。
[N][ Run][ 779]: hit eos,avg 5.96 token/s
```
*Prompt: "Describe the image". Answer (translated): "This picture shows a busy city street. In the foreground, a woman stands on the sidewalk, wearing a black jacket and smiling. Next to her is a red double-decker bus with an advertisement reading 'THINGS GET MORE EXITING WHEN YOU SAY YES'. The bus's license plate is 'L15'. A black van is parked beside the bus. Some shops and pedestrians are visible in the background, and the buildings on both sides of the street are modern glass-curtain-wall buildings. The overall atmosphere is busy yet full of life."*
#### Video understanding demo
Pre-process the frames of the video file into 308×308 images before running the demo, as sketched below.
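The repo ships `python/cv_resize.py` for this step; here is a minimal OpenCV sketch of the same idea (the input file name is a placeholder):
```python
# Sketch: extract frames from a video and resize them to 308x308 for the
# video demo. The repo's python/cv_resize.py is the reference implementation;
# this is an illustrative stand-in, and "input.mp4" is a placeholder name.
import os
import cv2

os.makedirs("video", exist_ok=True)
cap = cv2.VideoCapture("input.mp4")
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (308, 308))
    cv2.imwrite(f"video/frame_{idx:04d}.jpg", frame)
    idx += 1
cap.release()
```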
##### Start the tokenizer server for the video understanding demo
```
python qwen2_tokenizer_video_308.py --port 12345
```
##### Run the video understanding demo
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
2% | █ | 1 / 40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 40 / 40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 512
[I][ Init][ 292]: vpm_height : 484,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述下视频
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ Encode][ 416]: image encode time : 1487.557007 ms, size : 991232
[I][ Run][ 638]: ttft: 5488.29 ms
视频展示了两只松鼠在户外的场景。背景是模糊的山脉和蓝天，前景中有松鼠在互动。松鼠的毛色主要是棕色和白色，它们的爪子是橙色的。松鼠似乎在互相玩耍或争抢，它们的爪子和嘴巴都伸向对方。整个场景显得非常自然和生动。
```
*Prompt: "Describe the video". Answer (translated): "The video shows two squirrels in an outdoor scene. The background is blurred mountains and blue sky, with the squirrels interacting in the foreground. Their fur is mainly brown and white, and their paws are orange. The squirrels seem to be playing with or grabbing at each other, with paws and mouths reaching toward one another. The whole scene looks very natural and lively."*
#### Inference with M.2 Accelerator card
What is the M.2 accelerator card? This demo will be shown running on a Raspberry Pi 5.
TODO