dongdaxiang committed
Commit ebffdac · verified · 1 Parent(s): f165d50

Upload folder using huggingface_hub
LICENSE CHANGED
@@ -6,7 +6,7 @@ Composite License: MIT (for Original Contributions) + Qwen Research License (for
 
  MIT License
 
- Copyright (c) 2025 Qianfan
+ Copyright (c) 2025 Baidu
 
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
@@ -86,6 +86,6 @@ You may distribute copies or make the Materials, or derivative works thereof, av
 
 
  === Scope Clarification (Non-operative summary) ===
- - Section A (MIT) covers only the Project’s original contributions authored by Qianfan.
+ - Section A (MIT) covers only the Project’s original contributions authored by Baidu.
  - Section B (Qwen Research License) governs any included Qwen Materials and any derivatives thereof (e.g., fine-tuned weights).
  - In the event of any conflict, the applicable license for the relevant component controls (MIT for original contributions; Qwen Research License for Qwen Materials).
README.md ADDED
@@ -0,0 +1,336 @@
---
license: other
license_link: LICENSE
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
---
# Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale | Document Understanding & OCR Enhancement | Chain-of-Thought Reasoning Support

## Model Description

Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models are deeply optimized for high-frequency industrial deployment scenarios while maintaining strong general capabilities.

### Model Variants

| Model | Parameters | Context Length | CoT Support | Best For |
| ------------------ | ---------- | -------------- | ----------- | ------------------------------------------ |
| **Qianfan-VL-3B** | 3B | 32k | ❌ | Edge deployment, real-time OCR |
| **Qianfan-VL-8B** | 8B | 32k | ✅ | Server-side general scenarios, fine-tuning |
| **Qianfan-VL-70B** | 70B | 32k | ✅ | Complex reasoning, data synthesis |

### Architecture

- **Language Model**:
  - Qianfan-VL-3B: based on Qwen2.5-3B
  - Qianfan-VL-8B/70B: based on the Llama 3.1 architecture
  - Enhanced with a 3T-token multilingual corpus
- **Vision Encoder**: InternViT-based; supports dynamic patching up to 4K resolution (see `dynamic_preprocess` in the Quick Start below)
- **Cross-modal Fusion**: an MLP adapter bridges vision and language representations, as sketched below
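To make the fusion design concrete, here is a minimal, illustrative sketch of an MLP projector in the style of InternVL-family models. The hidden sizes and the two-layer shape are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Illustrative MLP projector: vision hidden size -> LLM hidden size.
    Sizes below are assumptions, not the shipped configuration."""
    def __init__(self, vision_hidden: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_hidden),
            nn.Linear(vision_hidden, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patch_tokens, vision_hidden)
        return self.proj(vision_features)  # (batch, num_patch_tokens, llm_hidden)
```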
## Key Capabilities

### 🔍 OCR & Document Understanding

- **Full-Scenario OCR**: Handwriting, formulas, natural scenes, cards/documents
- **Document Intelligence**: Layout analysis, table parsing, chart understanding, document Q&A
- **High Precision**: Strong scores across the OCR benchmarks reported below

### 🧮 Chain-of-Thought Reasoning (8B & 70B)

- Complex chart analysis and reasoning
- Mathematical problem-solving with step-by-step derivation
- Visual reasoning and logical inference
- Statistical computation and trend prediction
### 📊 Benchmark Performance

#### General Vision-Language Benchmarks

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| --------------- | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| A-Bench_VAL | 75.65 | 75.72 | **78.1** | 75.86 | 75.86 | 76.49 | **79.22** |
| CCBench | 66.86 | 70.39 | **80.98** | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | **79.13** | 77.0 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | **73.17** | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 67.34 | **80.28** | 78.9 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | **60.78** | 51.0 | **65.78** |
| ScienceQA_TEST | 95.19 | 97.62 | **98.76** | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | **98.81** | **97.81** | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | **71.06** | 65.17 | 63.67 | 61.4 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | **32.18** | 30.3 | 27.62 | 29.08 | **31.48** |
| BLINK | 49.97 | 56.81 | **59.44** | 55.87 | 51.87 | 54.55 | **63.02** |
| MMStar | 57.93 | 64.07 | **69.47** | 68.4 | 66.07 | 61.53 | 66.0 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | **74.25** | 69.28 | **73.86** |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | **77.99** | **78.1** | **79.93** |
| POPE | 85.08 | 86.06 | 88.97 | **90.59** | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | **91.01** | 89.65 | **91.40** | 86.56 | 90.25 |

#### OCR & Document Understanding

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ------------ | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | ------------- | -------------- |
| OCRBench | 831 | 854 | 873 | **881** | 847 | 810 | **883** | 874 |
| AI2D_TEST | 81.38 | **85.07** | **87.23** | **85.07** | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | **74.06** | 39.03 | 35.58 | 69.24 | **71.02** | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | **84.48** | 82.15 | 83.52 | 79.09 | **84.962** | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | **94.91** | **95.75** |
| ChartQA_TEST | 81.79 | **87.72** | **89.6** | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |

#### Mathematical Reasoning

| Benchmark | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ----------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| Mathvista-mini | 69.19 | **78.6** | 69.5 | 70.1 | 67.2 | 73.9 |
| Mathvision | 32.82 | **50.29** | 29.61 | 34.8 | 25.95 | 39.34 |
| Mathverse | 48.4 | **61.04** | 43.68 | 49.26 | 44.21 | 55.18 |
| ChartQA Pro | 50.43 | **52** | 37.32 | 44.43 | 43.73 | 45.3 |
| HallusionBench | 51.72 | **54.52** | 49.2 | 40.2 | 47.9 | 49.9 |
| InHouse Dataset A | 59.87 | **71.78** | 40.64 | 41.47 | 45.58 | 57.2 |
| InHouse Dataset B | 61.33 | **75.6** | 36.25 | 42.65 | 30.62 | 59.68 |
## Quick Start

### Installation

```bash
pip install transformers accelerate torch torchvision pillow einops
```

### Using Transformers

```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Standard ImageNet normalization applied to each tile
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tiling grid whose aspect ratio best matches the input image;
    # break ties in favor of the grid that preserves more pixel area.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tiling grids (i columns x j rows) within the tile budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize, then split the image into image_size x image_size tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        # append a global thumbnail so the model also sees the whole image
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = torch.stack([transform(tile) for tile in images])
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process the image; move it to the model's dtype and device
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).to(model.device)

# Inference ("请识别图中所有文字" = "Recognize all text in the image")
prompt = "<image>请识别图中所有文字"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
```
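The `chat` helper above follows the InternVL-style interface exposed by the remote code. Assuming that interface, text-only and multi-turn use would look like the sketch below; the `history`/`return_history` keywords come from the InternVL-style `chat` signature and are not verified against this exact release:

```python
# Text-only turn: pass pixel_values=None (assumes the InternVL-style chat()).
response, history = model.chat(
    tokenizer,
    pixel_values=None,
    question="What kinds of documents can you parse?",
    generation_config={"max_new_tokens": 256},
    return_history=True,
)
print(response)

# Follow-up turn that reuses the conversation history
response, history = model.chat(
    tokenizer,
    pixel_values=None,
    question="Summarize that in one sentence.",
    generation_config={"max_new_tokens": 64},
    history=history,
    return_history=True,
)
print(response)
```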
### Using vLLM

You can deploy Qianfan-VL with vLLM's official Docker image for high-performance inference behind an OpenAI-compatible API.

#### Start vLLM Service

```bash
docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
```

The `--hf-overrides` flag rewrites the checkpoint's config fields so vLLM serves the model through its InternVL chat implementation.

#### Call the API

The example prompt "请识别图中所有文字" asks the model to recognize all text in the image:

```bash
curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'
```
Or use Python with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"  # "Describe this image"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
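For incremental output, the OpenAI SDK's standard streaming interface should also work against this endpoint. A minimal sketch, assuming the server started above and the `client` from the previous example:

```python
stream = client.chat.completions.create(
    model="qianfan-vl",
    messages=[{"role": "user", "content": "Introduce Qianfan-VL in one sentence."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # deltas may omit content (e.g., the final chunk)
        print(delta.content, end="", flush=True)
print()
```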
## Training Details

### Four-Stage Progressive Training

1. **Cross-modal Alignment** (100B tokens): Establishes vision-language connections
2. **General Knowledge Injection** (3.5T tokens): Builds strong foundational capabilities
3. **Domain Enhancement** (300B tokens): Specialized OCR and reasoning capabilities
4. **Post-training** (1B tokens): Instruction following and preference alignment

Together, the four stages total roughly 3.9T training tokens.

### Infrastructure

- Trained on 5,000+ Baidu Kunlun chips
- Single-task parallel training across 5,000 chips
- 90%+ scaling efficiency in large-scale distributed training
- Communication-computation fusion techniques
## Model Card

- **Developed by**: Baidu AI Cloud Qianfan Team
- **Model type**: Vision-Language Transformer
- **Languages**: Multilingual (English and Chinese)
- **License**: Composite; see the [LICENSE](LICENSE) file (MIT for original contributions, Qwen Research License for Qwen Materials)
- **Base Architecture**: See the technical report

## Citation

If you use Qianfan-VL in your research, please cite:

```bibtex
@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}
```

## Contact

For more information and API access, visit the [Baidu Qianfan Platform](https://qianfan.cloud.baidu.com/).

## Acknowledgments

This model series combines general multimodal capability with domain-specific enhancements for real-world applications.
configuration_qianfanvl_chat.py CHANGED
@@ -1,4 +1,4 @@
- # Copyright (c) 2025 Qianfan
+ # Copyright (c) 2025 Baidu
  # Licensed under the MIT License. See LICENSE file in the project root for full license information.
  import copy
modeling_qianfanvl_chat.py CHANGED
@@ -1,4 +1,4 @@
- # Copyright (c) 2025 Qianfan
+ # Copyright (c) 2025 Baidu
  # Licensed under the MIT License. See LICENSE file in the project root for full license information.
  import warnings
  from typing import List, Optional, Tuple, Union