hanqer committed on
Commit b221ae8 · verified · 1 Parent(s): 68b4c49

Update README.md

  1. README.md +2 -269
README.md CHANGED
@@ -59,275 +59,8 @@ Step3 maintains exceptional efficiency across both flagship and low-end accelera
59
 
60
 
61
  ## Evaluation Results
62
- <table>
63
- <thead>
64
- <tr>
65
- <th></th>
66
- <th>Model</th>
67
- <th>Total Params.</th>
68
- <th>MMMU</th>
69
- <th>MathVision</th>
70
- <th>ZeroBench(sub)</th>
71
- <th>DYNAMATH</th>
72
- <th>SimpleVQA</th>
73
- <th>HallusionBench</th>
74
- <th>AIME25</th>
75
- <th>HMMT25</th>
76
- <th>CNMO24</th>
77
- <th>GPQA-Diamond</th>
78
- <th>LiveCodeBench<br>(24.8-25.5)</th>
79
- </tr>
80
- </thead>
81
- <tbody>
82
- <tr>
83
- <td rowspan="6">Open-Source VLM</td>
84
- <td>Step3</td>
85
- <td>321B</td>
86
- <td>74.2</td>
87
- <td>64.8</td>
88
- <td>23.0</td>
89
- <td>50.1</td>
90
- <td>62.2</td>
91
- <td>64.2</td>
92
- <td>82.9</td>
93
- <td>70.0</td>
94
- <td>83.7</td>
95
- <td>73.0</td>
96
- <td>67.1</td>
97
- </tr>
98
- <tr>
99
- <td>ERINE4.5 - thinking</td>
100
- <td>300B/424B</td>
101
- <td>70.0</td>
102
- <td>47.6</td>
103
- <td>22.5</td>
104
- <td>46.9</td>
105
- <td>59.8</td>
106
- <td>60.0</td>
107
- <td>35.1</td>
108
- <td>40.5*</td>
109
- <td>75.5</td>
110
- <td>76.8</td>
111
- <td>38.8</td>
112
- </tr>
113
- <tr>
114
- <td>GLM-4.1V-thinking</td>
115
- <td>9B</td>
116
- <td>68.0</td>
117
- <td>49.4</td>
118
- <td>22.8</td>
119
- <td>41.9</td>
120
- <td>48.1</td>
121
- <td>60.8</td>
122
- <td>13.3</td>
123
- <td>6.7</td>
124
- <td>25.0</td>
125
- <td>47.4</td>
126
- <td>24.2</td>
127
- </tr>
128
- <tr>
129
- <td>MiMo-VL</td>
130
- <td>7B</td>
131
- <td>66.7</td>
132
- <td>60.4</td>
133
- <td>18.6</td>
134
- <td>45.9</td>
135
- <td>48.5</td>
136
- <td>59.6</td>
137
- <td>60.0</td>
138
- <td>34.6</td>
139
- <td>69.9</td>
140
- <td>55.5</td>
141
- <td>50.1</td>
142
- </tr>
143
- <tr>
144
- <td>QvQ-72B-Preview</td>
145
- <td>72B</td>
146
- <td>70.3</td>
147
- <td>35.9</td>
148
- <td>15.9</td>
149
- <td>30.7</td>
150
- <td>40.3</td>
151
- <td>50.8</td>
152
- <td>22.7</td>
153
- <td>49.5</td>
154
- <td>47.3</td>
155
- <td>10.9</td>
156
- <td>24.1</td>
157
- </tr>
158
- <tr>
159
- <td>LLaMA-Maverick</td>
160
- <td>400B</td>
161
- <td>73.4</td>
162
- <td>47.2</td>
163
- <td>22.8</td>
164
- <td>47.1</td>
165
- <td>45.4</td>
166
- <td>57.1</td>
167
- <td>19.2</td>
168
- <td>8.91</td>
169
- <td>41.6</td>
170
- <td>69.8</td>
171
- <td>33.9</td>
172
- </tr>
173
- <tr>
174
- <td rowspan="4">Open-Source LLM</td>
175
- <td>MiniMax-M1-80k</td>
176
- <td>456B</td>
177
- <td>-</td>
178
- <td>-</td>
179
- <td>-</td>
180
- <td>-</td>
181
- <td>-</td>
182
- <td>-</td>
183
- <td>76.9</td>
184
- <td>-</td>
185
- <td>-</td>
186
- <td>70.0</td>
187
- <td>65.0</td>
188
- </tr>
189
- <tr>
190
- <td>Qwen3-235B-A22B-Thinking</td>
191
- <td>235B</td>
192
- <td>-</td>
193
- <td>-</td>
194
- <td>-</td>
195
- <td>-</td>
196
- <td>-</td>
197
- <td>-</td>
198
- <td>81.5</td>
199
- <td>62.5</td>
200
- <td>-</td>
201
- <td>71.1</td>
202
- <td>65.9</td>
203
- </tr>
204
- <tr>
205
- <td>DeepSeek R1-0528</td>
206
- <td>671B</td>
207
- <td>-</td>
208
- <td>-</td>
209
- <td>-</td>
210
- <td>-</td>
211
- <td>-</td>
212
- <td>-</td>
213
- <td>87.5</td>
214
- <td>79.4</td>
215
- <td>86.9</td>
216
- <td>81.0</td>
217
- <td>73.3</td>
218
- </tr>
219
- <tr>
220
- <td>Qwen3-235B-A22B-Thinking-2507</td>
221
- <td>235B</td>
222
- <td>-</td>
223
- <td>-</td>
224
- <td>-</td>
225
- <td>-</td>
226
- <td>-</td>
227
- <td>-</td>
228
- <td>92.3</td>
229
- <td>83.9</td>
230
- <td>-</td>
231
- <td>81.1</td>
232
- <td>-</td>
233
- </tr>
234
- <tr>
235
- <td rowspan="6">Proprietary VLM</td>
236
- <td>O3</td>
237
- <td>-</td>
238
- <td>82.9</td>
239
- <td>72.8</td>
240
- <td>25.2</td>
241
- <td>58.1</td>
242
- <td>59.8</td>
243
- <td>60.1</td>
244
- <td>88.9</td>
245
- <td>70.1</td>
246
- <td>86.7</td>
247
- <td>83.3</td>
248
- <td>75.8</td>
249
- </tr>
250
- <tr>
251
- <td>Claude4 Sonnet (thinking)</td>
252
- <td>-</td>
253
- <td>76.9</td>
254
- <td>64.6</td>
255
- <td>26.1</td>
256
- <td>48.1</td>
257
- <td>43.7</td>
258
- <td>57.0</td>
259
- <td>70.5</td>
260
- <td>-</td>
261
- <td>-</td>
262
- <td>75.4</td>
263
- <td>55.9</td>
264
- </tr>
265
- <tr>
266
- <td>Claude4 opus (thinking)</td>
267
- <td>-</td>
268
- <td>79.8</td>
269
- <td>66.1</td>
270
- <td>25.2</td>
271
- <td>49.3</td>
272
- <td>47.2</td>
273
- <td>59.9</td>
274
- <td>75.5</td>
275
- <td>-</td>
276
- <td>-</td>
277
- <td>79.6</td>
278
- <td>56.6</td>
279
- </tr>
280
- <tr>
281
- <td>Gemini 2.5 Flash (thinking)</td>
282
- <td>-</td>
283
- <td>73.2</td>
284
- <td>57.3</td>
285
- <td>20.1</td>
286
- <td>57.1</td>
287
- <td>61.1</td>
288
- <td>65.2</td>
289
- <td>72.0</td>
290
- <td>-</td>
291
- <td>-</td>
292
- <td>82.8</td>
293
- <td>61.9</td>
294
- </tr>
295
- <tr>
296
- <td>Gemini 2.5 Pro</td>
297
- <td>-</td>
298
- <td>81.7</td>
299
- <td>73.3</td>
300
- <td>30.8</td>
301
- <td>56.3</td>
302
- <td>66.8</td>
303
- <td>66.8</td>
304
- <td>88.0</td>
305
- <td>-</td>
306
- <td>-</td>
307
- <td>86.4</td>
308
- <td>71.8</td>
309
- </tr>
310
- <!-- Added: Grok 4 -->
311
- <tr>
312
- <td>Grok 4</td>
313
- <td>-</td>
314
- <td>80.9</td>
315
- <td>70.3</td>
316
- <td>22.5</td>
317
- <td>40.7</td>
318
- <td>55.9</td>
319
- <td>64.8</td>
320
- <td>98.8</td>
321
- <td>93.9</td>
322
- <td>85.5</td>
323
- <td>87.5</td>
324
- <td>79.3</td>
325
- </tr>
326
- </tbody>
327
- </table>
328
-
329
- Note: Parts of the evaluation results are reproduced using the same settings.
330
- †: Evaluation results of Gemini 2.5 Flash (thinking) may be lower than real model performance, especially on MathVision, due to insufficient instruction following ability.
331
  ## Deployment
332
 
333
  > [!Note]
 
59
 
60
 
61
  ## Evaluation Results
62
+ ![](figures/step3_bmk.jpeg)
63
+
64
  ## Deployment
65
 
66
  > [!Note]