HectorHe committed on
Commit
e89f2e0
·
verified ·
1 Parent(s): fbabf9d

Model save

README.md CHANGED
@@ -1,11 +1,9 @@
 ---
 base_model: allenai/OLMoE-1B-7B-0125
-datasets: fw407/Commonsense-15K
 library_name: transformers
 model_name: OLMoE-1B-7B-0125-aux-free-sft-commonsense15k
 tags:
 - generated_from_trainer
-- open-r1
 - trl
 - sft
 licence: license
@@ -13,7 +11,7 @@ licence: license
 
 # Model Card for OLMoE-1B-7B-0125-aux-free-sft-commonsense15k
 
-This model is a fine-tuned version of [allenai/OLMoE-1B-7B-0125](https://huggingface.co/allenai/OLMoE-1B-7B-0125) on the [fw407/Commonsense-15K](https://huggingface.co/datasets/fw407/Commonsense-15K) dataset.
+This model is a fine-tuned version of [allenai/OLMoE-1B-7B-0125](https://huggingface.co/allenai/OLMoE-1B-7B-0125).
 It has been trained using [TRL](https://github.com/huggingface/trl).
 
 ## Quick start
@@ -29,7 +27,7 @@ print(output["generated_text"])
 
 ## Training procedure
 
-[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/hector_-carnegie-mellon-university/huggingface/runs/y1ji2w8y)
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/hector_-carnegie-mellon-university/huggingface/runs/5zw6bqnr)
 
 
 This model was trained with SFT.
all_results.json CHANGED
@@ -1,8 +1,8 @@
 {
-    "total_flos": 3245435781120.0,
-    "train_loss": 1.2809894014807308,
-    "train_runtime": 1398.581,
+    "total_flos": 1.3663861802532864e+17,
+    "train_loss": 1.2117820017448457,
+    "train_runtime": 1211.3807,
     "train_samples": 15119,
-    "train_samples_per_second": 10.81,
-    "train_steps_per_second": 0.085
+    "train_samples_per_second": 12.481,
+    "train_steps_per_second": 0.196
 }
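As a quick sanity check, the updated throughput fields in all_results.json are self-consistent with the reported runtime. The arithmetic below only uses numbers taken from the diff; the implied effective batch size is an inference, not a value stated in this commit:

```python
# Reported values from the updated all_results.json
train_samples = 15119
train_runtime = 1211.3807  # seconds

# Matches the reported train_samples_per_second of 12.481
samples_per_second = round(train_samples / train_runtime, 3)
print(samples_per_second)  # 12.481

# samples/sec divided by steps/sec suggests an effective batch of ~64
implied_batch = round(12.481 / 0.196)
print(implied_batch)  # 64
```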
generation_config.json CHANGED
@@ -2,5 +2,6 @@
     "_from_model_config": true,
     "eos_token_id": 50279,
     "pad_token_id": 1,
-    "transformers_version": "4.51.0"
+    "transformers_version": "4.51.0",
+    "use_cache": false
 }
moe_bias_states.json ADDED
@@ -0,0 +1,1179 @@
+{
+  "metadata": {
+    "total_moe_layers": 16,
+    "save_timestamp": "2025-09-23T18:18:04.615500",
+    "model_type": "OlmoeForCausalLM",
+    "pytorch_version": "2.6.0+cu124",
+    "description": "Auxiliary-loss-free MoE bias states saved during training"
+  },
+  "moe_bias_states": {
+    "model.layers.0.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.1.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.2.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.3.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.4.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.5.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.6.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.7.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.8.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.9.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.10.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.11.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.12.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.13.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.14.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    },
+    "model.layers.15.mlp": {
+      "bias_values": [
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
+      ],
+      "bias_update_speed": 0.0001,
+      "num_experts": 64,
+      "module_type": "AuxFreeOlmoeSparseMoeBlock",
+      "device": "cuda:0",
+      "dtype": "torch.bfloat16"
+    }
+  }
+}
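For context on what these bias states represent: in auxiliary-loss-free load balancing (the scheme popularized by DeepSeek-V3), each MoE layer keeps one bias per expert that is added to the router scores for top-k expert selection only, and is nudged by a fixed step (`bias_update_speed` above) toward balancing expert load. The `AuxFreeOlmoeSparseMoeBlock` implementation itself is not part of this commit, so the following is only a minimal sketch of that update rule; the function names and the sign-based update are assumptions, not the repository's code:

```python
def update_expert_biases(biases, expert_load, update_speed=1e-4):
    """Assumed sign-based update: raise the bias of underloaded experts
    and lower it for overloaded ones, each by a fixed step, so routing
    drifts toward a balanced load without an auxiliary loss term."""
    mean_load = sum(expert_load) / len(expert_load)
    def step(load):  # +1 if underloaded, -1 if overloaded, 0 if balanced
        return (load < mean_load) - (load > mean_load)
    return [b + update_speed * step(load) for b, load in zip(biases, expert_load)]


def select_top_k_experts(router_scores, biases, k=8):
    """Biases influence which experts get selected; the gating weights
    that mix expert outputs would still come from the raw scores."""
    biased = [s + b for s, b in zip(router_scores, biases)]
    return sorted(range(len(biased)), key=biased.__getitem__, reverse=True)[:k]
```

With all `bias_values` still at 0.0 and a step of 1e-4, the saved states are consistent with either a freshly initialized balancer or routing that stayed balanced over this short run.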
train_results.json CHANGED
@@ -1,8 +1,8 @@
 {
-    "total_flos": 3245435781120.0,
-    "train_loss": 1.2809894014807308,
-    "train_runtime": 1398.581,
+    "total_flos": 1.3663861802532864e+17,
+    "train_loss": 1.2117820017448457,
+    "train_runtime": 1211.3807,
     "train_samples": 15119,
-    "train_samples_per_second": 10.81,
-    "train_steps_per_second": 0.085
+    "train_samples_per_second": 12.481,
+    "train_steps_per_second": 0.196
 }
trainer_state.json CHANGED
@@ -4,975 +4,1919 @@
   "best_model_checkpoint": null,
   "epoch": 1.0,
   "eval_steps": 500,
-  "global_step": 119,
   "is_hyper_param_search": false,
   "is_local_process_zero": true,
   "is_world_process_zero": true,
   "log_history": [
     {
-      "epoch": 0.008403361344537815,
-      "grad_norm": 61.42735123885721,
       "learning_rate": 0.0,
-      "loss": 3.4985,
-      "mean_token_accuracy": 0.429544597864151,
       "step": 1
     },
     {
-      "epoch": 0.01680672268907563,
-      "grad_norm": 57.4818046786993,
-      "learning_rate": 8.333333333333333e-07,
-      "loss": 3.4966,
-      "mean_token_accuracy": 0.4261890649795532,
       "step": 2
     },
     {
-      "epoch": 0.025210084033613446,
-      "grad_norm": 52.36502980325029,
-      "learning_rate": 1.6666666666666667e-06,
-      "loss": 3.4063,
-      "mean_token_accuracy": 0.43464240431785583,
       "step": 3
     },
     {
-      "epoch": 0.03361344537815126,
-      "grad_norm": 61.1108274343337,
-      "learning_rate": 2.5e-06,
-      "loss": 3.3932,
-      "mean_token_accuracy": 0.4462195634841919,
       "step": 4
     },
     {
-      "epoch": 0.04201680672268908,
-      "grad_norm": 48.99953073096186,
-      "learning_rate": 3.3333333333333333e-06,
-      "loss": 3.2191,
-      "mean_token_accuracy": 0.4612500071525574,
       "step": 5
     },
     {
-      "epoch": 0.05042016806722689,
-      "grad_norm": 49.025280289581815,
-      "learning_rate": 4.166666666666667e-06,
-      "loss": 2.7456,
-      "mean_token_accuracy": 0.5155220627784729,
       "step": 6
     },
     {
-      "epoch": 0.058823529411764705,
-      "grad_norm": 49.39781059087277,
-      "learning_rate": 5e-06,
-      "loss": 2.7539,
-      "mean_token_accuracy": 0.5053491592407227,
       "step": 7
     },
     {
-      "epoch": 0.06722689075630252,
-      "grad_norm": 57.904833936071505,
-      "learning_rate": 5.833333333333334e-06,
-      "loss": 2.1641,
-      "mean_token_accuracy": 0.5849207043647766,
       "step": 8
     },
     {
-      "epoch": 0.07563025210084033,
-      "grad_norm": 53.449478044902804,
-      "learning_rate": 6.666666666666667e-06,
-      "loss": 2.0289,
-      "mean_token_accuracy": 0.5940138697624207,
       "step": 9
     },
     {
-      "epoch": 0.08403361344537816,
-      "grad_norm": 41.45801954741997,
-      "learning_rate": 7.500000000000001e-06,
-      "loss": 1.8407,
-      "mean_token_accuracy": 0.6121459007263184,
       "step": 10
     },
     {
-      "epoch": 0.09243697478991597,
-      "grad_norm": 17.301367329136305,
-      "learning_rate": 8.333333333333334e-06,
-      "loss": 1.4726,
-      "mean_token_accuracy": 0.6945884823799133,
       "step": 11
     },
     {
-      "epoch": 0.10084033613445378,
-      "grad_norm": 13.318364706508635,
-      "learning_rate": 9.166666666666666e-06,
-      "loss": 1.3569,
-      "mean_token_accuracy": 0.7142514586448669,
       "step": 12
     },
     {
-      "epoch": 0.1092436974789916,
-      "grad_norm": 7.9100322503188405,
-      "learning_rate": 1e-05,
-      "loss": 1.2797,
-      "mean_token_accuracy": 0.7263978719711304,
       "step": 13
     },
     {
-      "epoch": 0.11764705882352941,
-      "grad_norm": 2.8666413464434886,
-      "learning_rate": 9.998060528021493e-06,
-      "loss": 1.3058,
-      "mean_token_accuracy": 0.7304229140281677,
       "step": 14
     },
     {
-      "epoch": 0.12605042016806722,
-      "grad_norm": 3.976454136858748,
-      "learning_rate": 9.992243783886663e-06,
-      "loss": 1.2797,
-      "mean_token_accuracy": 0.7234476804733276,
       "step": 15
     },
     {
-      "epoch": 0.13445378151260504,
-      "grad_norm": 1.882632638506074,
-      "learning_rate": 9.982554781556512e-06,
-      "loss": 1.2544,
-      "mean_token_accuracy": 0.728027880191803,
       "step": 16
     },
     {
-      "epoch": 0.14285714285714285,
-      "grad_norm": 2.0420338361657984,
-      "learning_rate": 9.969001872830383e-06,
-      "loss": 1.1662,
-      "mean_token_accuracy": 0.756281852722168,
       "step": 17
     },
     {
-      "epoch": 0.15126050420168066,
-      "grad_norm": 1.9922358783832426,
-      "learning_rate": 9.951596740146809e-06,
-      "loss": 1.2733,
-      "mean_token_accuracy": 0.7221024632453918,
       "step": 18
     },
     {
-      "epoch": 0.15966386554621848,
-      "grad_norm": 2.6134470765459854,
-      "learning_rate": 9.930354386513399e-06,
-      "loss": 1.2796,
-      "mean_token_accuracy": 0.7271506190299988,
       "step": 19
     },
     {
-      "epoch": 0.16806722689075632,
-      "grad_norm": 1.325098516402124,
-      "learning_rate": 9.905293122574433e-06,
-      "loss": 1.165,
-      "mean_token_accuracy": 0.7458003163337708,
       "step": 20
     },
     {
-      "epoch": 0.17647058823529413,
-      "grad_norm": 1.4486654015778349,
-      "learning_rate": 9.876434550827315e-06,
-      "loss": 1.1875,
-      "mean_token_accuracy": 0.7506506443023682,
       "step": 21
     },
     {
-      "epoch": 0.18487394957983194,
-      "grad_norm": 1.3154551387689228,
-      "learning_rate": 9.843803547001487e-06,
-      "loss": 1.1617,
-      "mean_token_accuracy": 0.7380812168121338,
       "step": 22
     },
     {
-      "epoch": 0.19327731092436976,
-      "grad_norm": 1.5343790176286574,
-      "learning_rate": 9.807428238615858e-06,
-      "loss": 1.2258,
-      "mean_token_accuracy": 0.7297688126564026,
       "step": 23
     },
     {
-      "epoch": 0.20168067226890757,
-      "grad_norm": 1.2371348889686988,
-      "learning_rate": 9.76733998073324e-06,
-      "loss": 1.2359,
-      "mean_token_accuracy": 0.7324696183204651,
       "step": 24
     },
     {
-      "epoch": 0.21008403361344538,
-      "grad_norm": 1.1633581437587461,
-      "learning_rate": 9.723573328932669e-06,
-      "loss": 1.1231,
-      "mean_token_accuracy": 0.7543428540229797,
       "step": 25
     },
     {
-      "epoch": 0.2184873949579832,
-      "grad_norm": 1.186997167984166,
-      "learning_rate": 9.676166009522925e-06,
-      "loss": 1.3298,
-      "mean_token_accuracy": 0.7144016027450562,
       "step": 26
     },
     {
-      "epoch": 0.226890756302521,
-      "grad_norm": 1.1350670150932245,
-      "learning_rate": 9.62515888702293e-06,
-      "loss": 1.117,
-      "mean_token_accuracy": 0.7577504515647888,
       "step": 27
     },
     {
-      "epoch": 0.23529411764705882,
-      "grad_norm": 1.1006616707114463,
-      "learning_rate": 9.57059592893704e-06,
-      "loss": 1.2008,
-      "mean_token_accuracy": 0.7340885400772095,
       "step": 28
     },
     {
-      "epoch": 0.24369747899159663,
-      "grad_norm": 1.1240894442990508,
-      "learning_rate": 9.51252416785559e-06,
-      "loss": 1.2089,
-      "mean_token_accuracy": 0.734373927116394,
       "step": 29
     },
     {
-      "epoch": 0.25210084033613445,
-      "grad_norm": 1.0852353510826063,
-      "learning_rate": 9.450993660913418e-06,
-      "loss": 1.1635,
-      "mean_token_accuracy": 0.7386614084243774,
       "step": 30
     },
     {
-      "epoch": 0.2605042016806723,
-      "grad_norm": 1.1086196712408876,
-      "learning_rate": 9.386057446641195e-06,
-      "loss": 1.1637,
-      "mean_token_accuracy": 0.7464680671691895,
       "step": 31
     },
     {
-      "epoch": 0.2689075630252101,
-      "grad_norm": 1.1309482702848446,
-      "learning_rate": 9.317771499246918e-06,
-      "loss": 1.177,
-      "mean_token_accuracy": 0.7361091375350952,
       "step": 32
     },
     {
-      "epoch": 0.2773109243697479,
-      "grad_norm": 1.0682767953323058,
-      "learning_rate": 9.246194680366802e-06,
-      "loss": 1.135,
-      "mean_token_accuracy": 0.7523439526557922,
       "step": 33
     },
     {
-      "epoch": 0.2857142857142857,
-      "grad_norm": 1.1205579094583278,
-      "learning_rate": 9.171388688327307e-06,
-      "loss": 1.2141,
-      "mean_token_accuracy": 0.7343338131904602,
       "step": 34
     },
     {
-      "epoch": 0.29411764705882354,
-      "grad_norm": 1.070459745860962,
-      "learning_rate": 9.093418004961939e-06,
-      "loss": 1.1817,
-      "mean_token_accuracy": 0.733535885810852,
       "step": 35
     },
     {
-      "epoch": 0.3025210084033613,
-      "grad_norm": 1.2186326847566449,
-      "learning_rate": 9.012349840028707e-06,
-      "loss": 1.1218,
-      "mean_token_accuracy": 0.75095134973526,
       "step": 36
     },
     {
-      "epoch": 0.31092436974789917,
-      "grad_norm": 1.0915960765396624,
-      "learning_rate": 8.928254073276166e-06,
-      "loss": 1.1548,
-      "mean_token_accuracy": 0.7447705268859863,
       "step": 37
     },
     {
-      "epoch": 0.31932773109243695,
-      "grad_norm": 1.0113396310620022,
-      "learning_rate": 8.841203194207925e-06,
-      "loss": 1.0825,
-      "mean_token_accuracy": 0.7554290890693665,
       "step": 38
     },
     {
-      "epoch": 0.3277310924369748,
-      "grad_norm": 1.0227576546610853,
-      "learning_rate": 8.751272239597612e-06,
-      "loss": 1.0612,
-      "mean_token_accuracy": 0.7643552422523499,
       "step": 39
     },
     {
-      "epoch": 0.33613445378151263,
-      "grad_norm": 1.0589227478863605,
-      "learning_rate": 8.658538728808097e-06,
-      "loss": 1.1339,
-      "mean_token_accuracy": 0.7472735643386841,
       "step": 40
     },
     {
-      "epoch": 0.3445378151260504,
-      "grad_norm": 1.0297238689876962,
-      "learning_rate": 8.563082596970785e-06,
-      "loss": 1.1277,
-      "mean_token_accuracy": 0.7488000988960266,
       "step": 41
     },
     {
-      "epoch": 0.35294117647058826,
-      "grad_norm": 1.006983117306359,
-      "learning_rate": 8.4649861260825e-06,
-      "loss": 1.0795,
-      "mean_token_accuracy": 0.755338191986084,
       "step": 42
     },
     {
-      "epoch": 0.36134453781512604,
-      "grad_norm": 1.0929095890205571,
-      "learning_rate": 8.364333874079462e-06,
-      "loss": 1.1576,
-      "mean_token_accuracy": 0.7435667514801025,
       "step": 43
     },
     {
-      "epoch": 0.3697478991596639,
-      "grad_norm": 1.084386434492771,
-      "learning_rate": 8.261212601949374e-06,
-      "loss": 1.0902,
-      "mean_token_accuracy": 0.7600305676460266,
       "step": 44
     },
     {
-      "epoch": 0.37815126050420167,
-      "grad_norm": 0.9786422969060671,
-      "learning_rate": 8.155711198944536e-06,
-      "loss": 1.1568,
-      "mean_token_accuracy": 0.7446728348731995,
       "step": 45
     },
     {
-      "epoch": 0.3865546218487395,
-      "grad_norm": 1.0807855846470553,
-      "learning_rate": 8.047920605960428e-06,
-      "loss": 1.071,
-      "mean_token_accuracy": 0.7608465552330017,
       "step": 46
     },
     {
-      "epoch": 0.3949579831932773,
-      "grad_norm": 1.022702731208646,
-      "learning_rate": 7.937933737145777e-06,
-      "loss": 1.1201,
-      "mean_token_accuracy": 0.7528531551361084,
       "step": 47
     },
     {
-      "epoch": 0.40336134453781514,
-      "grad_norm": 1.062223911366596,
-      "learning_rate": 7.825845399811723e-06,
-      "loss": 1.1107,
-      "mean_token_accuracy": 0.7442110776901245,
       "step": 48
     },
     {
-      "epoch": 0.4117647058823529,
-      "grad_norm": 1.0050833455182735,
-      "learning_rate": 7.711752212709106e-06,
-      "loss": 1.1187,
-      "mean_token_accuracy": 0.7482293844223022,
       "step": 49
     },
     {
-      "epoch": 0.42016806722689076,
-      "grad_norm": 1.007940805785554,
-      "learning_rate": 7.595752522744287e-06,
-      "loss": 1.0953,
-      "mean_token_accuracy": 0.7530031204223633,
       "step": 50
     },
     {
-      "epoch": 0.42857142857142855,
-      "grad_norm": 0.9650750125007934,
-      "learning_rate": 7.477946320205358e-06,
-      "loss": 1.1336,
-      "mean_token_accuracy": 0.7465465068817139,
       "step": 51
     },
     {
-      "epoch": 0.4369747899159664,
-      "grad_norm": 1.0334836528707236,
-      "learning_rate": 7.358435152571749e-06,
-      "loss": 1.1931,
-      "mean_token_accuracy": 0.7283521890640259,
       "step": 52
     },
     {
-      "epoch": 0.44537815126050423,
-      "grad_norm": 1.0058006955982113,
-      "learning_rate": 7.237322036981568e-06,
-      "loss": 1.1619,
-      "mean_token_accuracy": 0.7378183603286743,
       "step": 53
     },
     {
-      "epoch": 0.453781512605042,
-      "grad_norm": 1.0129742292617223,
-      "learning_rate": 7.114711371432113e-06,
-      "loss": 1.1214,
-      "mean_token_accuracy": 0.7524005770683289,
       "step": 54
     },
     {
-      "epoch": 0.46218487394957986,
-      "grad_norm": 0.9634013156694173,
-      "learning_rate": 6.990708844790093e-06,
-      "loss": 1.0842,
-      "mean_token_accuracy": 0.754132091999054,
       "step": 55
     },
     {
-      "epoch": 0.47058823529411764,
-      "grad_norm": 0.9862552262307049,
-      "learning_rate": 6.865421345689147e-06,
-      "loss": 1.1125,
-      "mean_token_accuracy": 0.7493283748626709,
       "step": 56
     },
     {
-      "epoch": 0.4789915966386555,
-      "grad_norm": 0.9874154330933039,
463
- "learning_rate": 6.73895687039317e-06,
464
- "loss": 1.1484,
465
- "mean_token_accuracy": 0.7431809306144714,
466
  "step": 57
467
  },
468
  {
469
- "epoch": 0.48739495798319327,
470
- "grad_norm": 0.9698004650017749,
471
- "learning_rate": 6.611424429704879e-06,
472
- "loss": 1.0733,
473
- "mean_token_accuracy": 0.7608862519264221,
474
  "step": 58
475
  },
476
  {
477
- "epoch": 0.4957983193277311,
478
- "grad_norm": 0.9883985888705087,
479
- "learning_rate": 6.482933954999858e-06,
480
- "loss": 1.1497,
481
- "mean_token_accuracy": 0.7470613718032837,
482
  "step": 59
483
  },
484
  {
485
- "epoch": 0.5042016806722689,
486
- "grad_norm": 1.0185742380167386,
487
- "learning_rate": 6.353596203467085e-06,
488
- "loss": 1.1337,
489
- "mean_token_accuracy": 0.7488591074943542,
490
  "step": 60
491
  },
492
  {
493
- "epoch": 0.5126050420168067,
494
- "grad_norm": 0.973270451011091,
495
- "learning_rate": 6.2235226626376075e-06,
496
- "loss": 1.0861,
497
- "mean_token_accuracy": 0.7639545202255249,
498
  "step": 61
499
  },
500
  {
501
- "epoch": 0.5210084033613446,
502
- "grad_norm": 0.9694005114210438,
503
- "learning_rate": 6.0928254542836855e-06,
504
- "loss": 1.0873,
505
- "mean_token_accuracy": 0.7524347901344299,
506
  "step": 62
507
  },
508
  {
509
- "epoch": 0.5294117647058824,
510
- "grad_norm": 0.9724600628868327,
511
- "learning_rate": 5.961617237771217e-06,
512
- "loss": 1.1536,
513
- "mean_token_accuracy": 0.7458714246749878,
514
  "step": 63
515
  },
516
  {
517
- "epoch": 0.5378151260504201,
518
- "grad_norm": 0.9500352117756259,
519
- "learning_rate": 5.830011112948768e-06,
520
- "loss": 1.2304,
521
- "mean_token_accuracy": 0.7270871996879578,
522
  "step": 64
523
  },
524
  {
525
- "epoch": 0.5462184873949579,
526
- "grad_norm": 0.9945227488227443,
527
- "learning_rate": 5.698120522656916e-06,
528
- "loss": 1.0556,
529
- "mean_token_accuracy": 0.7674956321716309,
530
  "step": 65
531
  },
532
  {
533
- "epoch": 0.5546218487394958,
534
- "grad_norm": 0.9676883555750646,
535
- "learning_rate": 5.566059154941925e-06,
536
- "loss": 1.098,
537
- "mean_token_accuracy": 0.7561368942260742,
538
  "step": 66
539
  },
540
  {
541
- "epoch": 0.5630252100840336,
542
- "grad_norm": 1.04781134622391,
543
- "learning_rate": 5.433940845058076e-06,
544
- "loss": 1.0937,
545
- "mean_token_accuracy": 0.7666974067687988,
546
  "step": 67
547
  },
548
  {
549
- "epoch": 0.5714285714285714,
550
- "grad_norm": 1.0183244679566534,
551
- "learning_rate": 5.301879477343086e-06,
552
- "loss": 1.1181,
553
- "mean_token_accuracy": 0.7474605441093445,
554
  "step": 68
555
  },
556
  {
557
- "epoch": 0.5798319327731093,
558
- "grad_norm": 0.9591192764372636,
559
- "learning_rate": 5.169988887051234e-06,
560
- "loss": 1.13,
561
- "mean_token_accuracy": 0.7436277270317078,
562
  "step": 69
563
  },
564
  {
565
- "epoch": 0.5882352941176471,
566
- "grad_norm": 0.9787662786555585,
567
- "learning_rate": 5.038382762228786e-06,
568
- "loss": 1.1328,
569
- "mean_token_accuracy": 0.7496426701545715,
570
  "step": 70
571
  },
572
  {
573
- "epoch": 0.5966386554621849,
574
- "grad_norm": 0.9507355108991862,
575
- "learning_rate": 4.907174545716317e-06,
576
- "loss": 1.1048,
577
- "mean_token_accuracy": 0.7508357763290405,
578
  "step": 71
579
  },
580
  {
581
- "epoch": 0.6050420168067226,
582
- "grad_norm": 0.9941578495647477,
583
- "learning_rate": 4.776477337362394e-06,
584
- "loss": 1.1155,
585
- "mean_token_accuracy": 0.7493728399276733,
586
  "step": 72
587
  },
588
  {
589
- "epoch": 0.6134453781512605,
590
- "grad_norm": 1.005111788535626,
591
- "learning_rate": 4.646403796532916e-06,
592
- "loss": 1.1115,
593
- "mean_token_accuracy": 0.757051944732666,
594
  "step": 73
595
  },
596
  {
597
- "epoch": 0.6218487394957983,
598
- "grad_norm": 0.9443186660832495,
599
- "learning_rate": 4.517066045000142e-06,
600
- "loss": 1.1915,
601
- "mean_token_accuracy": 0.7324733734130859,
602
  "step": 74
603
  },
604
  {
605
- "epoch": 0.6302521008403361,
606
- "grad_norm": 1.0844270342796254,
607
- "learning_rate": 4.388575570295123e-06,
608
- "loss": 1.0884,
609
- "mean_token_accuracy": 0.7623071670532227,
610
  "step": 75
611
  },
612
  {
613
- "epoch": 0.6386554621848739,
614
- "grad_norm": 0.9849374469980112,
615
- "learning_rate": 4.261043129606832e-06,
616
- "loss": 1.1225,
617
- "mean_token_accuracy": 0.7571775913238525,
618
  "step": 76
619
  },
620
  {
621
- "epoch": 0.6470588235294118,
622
- "grad_norm": 0.9770353860403225,
623
- "learning_rate": 4.134578654310854e-06,
624
- "loss": 1.1438,
625
- "mean_token_accuracy": 0.7406225204467773,
626
  "step": 77
627
  },
628
  {
629
- "epoch": 0.6554621848739496,
630
- "grad_norm": 0.9963027454603423,
631
- "learning_rate": 4.009291155209909e-06,
632
- "loss": 0.9975,
633
- "mean_token_accuracy": 0.7706347703933716,
634
  "step": 78
635
  },
636
  {
637
- "epoch": 0.6638655462184874,
638
- "grad_norm": 1.0245462387900903,
639
- "learning_rate": 3.8852886285678896e-06,
640
- "loss": 1.0277,
641
- "mean_token_accuracy": 0.7736483812332153,
642
  "step": 79
643
  },
644
  {
645
- "epoch": 0.6722689075630253,
646
- "grad_norm": 0.9867453808790595,
647
- "learning_rate": 3.762677963018433e-06,
648
- "loss": 1.0945,
649
- "mean_token_accuracy": 0.7500572800636292,
650
  "step": 80
651
  },
652
  {
653
- "epoch": 0.680672268907563,
654
- "grad_norm": 0.9907470452421272,
655
- "learning_rate": 3.641564847428254e-06,
656
- "loss": 1.2412,
657
- "mean_token_accuracy": 0.721602201461792,
658
  "step": 81
659
  },
660
  {
661
- "epoch": 0.6890756302521008,
662
- "grad_norm": 0.9877738858581707,
663
- "learning_rate": 3.5220536797946447e-06,
664
- "loss": 1.0882,
665
- "mean_token_accuracy": 0.7540343999862671,
666
  "step": 82
667
  },
668
  {
669
- "epoch": 0.6974789915966386,
670
- "grad_norm": 0.9735916464629358,
671
- "learning_rate": 3.4042474772557143e-06,
672
- "loss": 1.0889,
673
- "mean_token_accuracy": 0.7557320594787598,
674
  "step": 83
675
  },
676
  {
677
- "epoch": 0.7058823529411765,
678
- "grad_norm": 1.011109227075983,
679
- "learning_rate": 3.2882477872908965e-06,
680
- "loss": 1.0123,
681
- "mean_token_accuracy": 0.779578447341919,
682
  "step": 84
683
  },
684
  {
685
- "epoch": 0.7142857142857143,
686
- "grad_norm": 0.9739026856030252,
687
- "learning_rate": 3.1741546001882773e-06,
688
- "loss": 1.0548,
689
- "mean_token_accuracy": 0.7688317894935608,
690
  "step": 85
691
  },
692
  {
693
- "epoch": 0.7226890756302521,
694
- "grad_norm": 0.9785606022441755,
695
- "learning_rate": 3.0620662628542256e-06,
696
- "loss": 1.185,
697
- "mean_token_accuracy": 0.7312161326408386,
698
  "step": 86
699
  },
700
  {
701
- "epoch": 0.7310924369747899,
702
- "grad_norm": 0.9961905704635389,
703
- "learning_rate": 2.9520793940395735e-06,
704
- "loss": 1.1403,
705
- "mean_token_accuracy": 0.744536280632019,
706
  "step": 87
707
  },
708
  {
709
- "epoch": 0.7394957983193278,
710
- "grad_norm": 0.9312440257308606,
711
- "learning_rate": 2.8442888010554658e-06,
712
- "loss": 1.0966,
713
- "mean_token_accuracy": 0.7522168755531311,
714
  "step": 88
715
  },
716
  {
717
- "epoch": 0.7478991596638656,
718
- "grad_norm": 0.9585122926929834,
719
- "learning_rate": 2.7387873980506286e-06,
720
- "loss": 1.0615,
721
- "mean_token_accuracy": 0.7616564035415649,
722
  "step": 89
723
  },
724
  {
725
- "epoch": 0.7563025210084033,
726
- "grad_norm": 0.938343524190116,
727
- "learning_rate": 2.6356661259205396e-06,
728
- "loss": 1.1033,
729
- "mean_token_accuracy": 0.7596261501312256,
730
  "step": 90
731
  },
732
  {
733
- "epoch": 0.7647058823529411,
734
- "grad_norm": 0.9837271772834161,
735
- "learning_rate": 2.535013873917501e-06,
736
- "loss": 0.9993,
737
- "mean_token_accuracy": 0.7763923406600952,
738
  "step": 91
739
  },
740
  {
741
- "epoch": 0.773109243697479,
742
- "grad_norm": 0.9643380006931658,
743
- "learning_rate": 2.436917403029219e-06,
744
- "loss": 1.1516,
745
- "mean_token_accuracy": 0.7430367469787598,
746
  "step": 92
747
  },
748
  {
749
- "epoch": 0.7815126050420168,
750
- "grad_norm": 1.0041291438351077,
751
- "learning_rate": 2.341461271191903e-06,
752
- "loss": 1.148,
753
- "mean_token_accuracy": 0.753593921661377,
754
  "step": 93
755
  },
756
  {
757
- "epoch": 0.7899159663865546,
758
- "grad_norm": 0.9710567606823283,
759
- "learning_rate": 2.248727760402391e-06,
760
- "loss": 1.0827,
761
- "mean_token_accuracy": 0.754690945148468,
762
  "step": 94
763
  },
764
  {
765
- "epoch": 0.7983193277310925,
766
- "grad_norm": 0.9667789737502585,
767
- "learning_rate": 2.158796805792076e-06,
768
- "loss": 1.0567,
769
- "mean_token_accuracy": 0.7643100023269653,
770
  "step": 95
771
  },
772
  {
773
- "epoch": 0.8067226890756303,
774
- "grad_norm": 0.9958478241274705,
775
- "learning_rate": 2.071745926723836e-06,
776
- "loss": 1.1122,
777
- "mean_token_accuracy": 0.7497897148132324,
778
  "step": 96
779
  },
780
  {
781
- "epoch": 0.8151260504201681,
782
- "grad_norm": 0.9523710901661049,
783
- "learning_rate": 1.9876501599712933e-06,
784
- "loss": 1.077,
785
- "mean_token_accuracy": 0.7561772465705872,
786
  "step": 97
787
  },
788
  {
789
- "epoch": 0.8235294117647058,
790
- "grad_norm": 0.9532880854709574,
791
- "learning_rate": 1.9065819950380634e-06,
792
- "loss": 1.1662,
793
- "mean_token_accuracy": 0.7427510619163513,
794
  "step": 98
795
  },
796
  {
797
- "epoch": 0.8319327731092437,
798
- "grad_norm": 0.9478756657575517,
799
- "learning_rate": 1.8286113116726928e-06,
800
- "loss": 1.1194,
801
- "mean_token_accuracy": 0.7442958354949951,
802
  "step": 99
803
  },
804
  {
805
- "epoch": 0.8403361344537815,
806
- "grad_norm": 0.9921226812907276,
807
- "learning_rate": 1.7538053196331988e-06,
808
- "loss": 1.0637,
809
- "mean_token_accuracy": 0.7638644576072693,
810
  "step": 100
811
  },
812
  {
813
- "epoch": 0.8487394957983193,
814
- "grad_norm": 0.9508865763052654,
815
- "learning_rate": 1.682228500753083e-06,
816
- "loss": 1.051,
817
- "mean_token_accuracy": 0.7656214237213135,
818
  "step": 101
819
  },
820
  {
821
- "epoch": 0.8571428571428571,
822
- "grad_norm": 0.9546268203643725,
823
- "learning_rate": 1.6139425533588055e-06,
824
- "loss": 1.166,
825
- "mean_token_accuracy": 0.741737425327301,
826
  "step": 102
827
  },
828
  {
829
- "epoch": 0.865546218487395,
830
- "grad_norm": 1.007082676416536,
831
- "learning_rate": 1.5490063390865845e-06,
832
- "loss": 1.0537,
833
- "mean_token_accuracy": 0.7664616107940674,
834
  "step": 103
835
  },
836
  {
837
- "epoch": 0.8739495798319328,
838
- "grad_norm": 0.9826420216540788,
839
- "learning_rate": 1.4874758321444091e-06,
840
- "loss": 1.1865,
841
- "mean_token_accuracy": 0.7296122312545776,
842
  "step": 104
843
  },
844
  {
845
- "epoch": 0.8823529411764706,
846
- "grad_norm": 0.9487926386987494,
847
- "learning_rate": 1.4294040710629617e-06,
848
- "loss": 1.138,
849
- "mean_token_accuracy": 0.7487072944641113,
850
  "step": 105
851
  },
852
  {
853
- "epoch": 0.8907563025210085,
854
- "grad_norm": 0.9945929706480489,
855
- "learning_rate": 1.3748411129770703e-06,
856
- "loss": 1.0522,
857
- "mean_token_accuracy": 0.7531285285949707,
858
  "step": 106
859
  },
860
  {
861
- "epoch": 0.8991596638655462,
862
- "grad_norm": 0.950668612631682,
863
- "learning_rate": 1.323833990477076e-06,
864
- "loss": 1.1069,
865
- "mean_token_accuracy": 0.7477149367332458,
866
  "step": 107
867
  },
868
  {
869
- "epoch": 0.907563025210084,
870
- "grad_norm": 1.0455035364134098,
871
- "learning_rate": 1.2764266710673335e-06,
872
- "loss": 1.1656,
873
- "mean_token_accuracy": 0.7393743395805359,
874
  "step": 108
875
  },
876
  {
877
- "epoch": 0.9159663865546218,
878
- "grad_norm": 0.9438442542715659,
879
- "learning_rate": 1.2326600192667612e-06,
880
- "loss": 1.1505,
881
- "mean_token_accuracy": 0.7463235259056091,
882
  "step": 109
883
  },
884
  {
885
- "epoch": 0.9243697478991597,
886
- "grad_norm": 0.9441161468931093,
887
- "learning_rate": 1.1925717613841432e-06,
888
- "loss": 1.0912,
889
- "mean_token_accuracy": 0.7579004764556885,
890
  "step": 110
891
  },
892
  {
893
- "epoch": 0.9327731092436975,
894
- "grad_norm": 0.9756496645004279,
895
- "learning_rate": 1.1561964529985143e-06,
896
- "loss": 1.0809,
897
- "mean_token_accuracy": 0.7611285448074341,
898
  "step": 111
899
  },
900
  {
901
- "epoch": 0.9411764705882353,
902
- "grad_norm": 0.9686595951669469,
903
- "learning_rate": 1.1235654491726853e-06,
904
- "loss": 1.0244,
905
- "mean_token_accuracy": 0.7686207294464111,
906
  "step": 112
907
  },
908
  {
909
- "epoch": 0.9495798319327731,
910
- "grad_norm": 0.9359227547987794,
911
- "learning_rate": 1.0947068774255675e-06,
912
- "loss": 1.2457,
913
- "mean_token_accuracy": 0.7190503478050232,
914
  "step": 113
915
  },
916
  {
917
- "epoch": 0.957983193277311,
918
- "grad_norm": 0.9948246155805508,
919
- "learning_rate": 1.0696456134866027e-06,
920
- "loss": 1.0868,
921
- "mean_token_accuracy": 0.7551665902137756,
922
  "step": 114
923
  },
924
  {
925
- "epoch": 0.9663865546218487,
926
- "grad_norm": 0.9848176758071714,
927
- "learning_rate": 1.0484032598531933e-06,
928
- "loss": 1.0888,
929
- "mean_token_accuracy": 0.7574664950370789,
930
  "step": 115
931
  },
932
  {
933
- "epoch": 0.9747899159663865,
934
- "grad_norm": 0.9578094192978062,
935
- "learning_rate": 1.0309981271696186e-06,
936
- "loss": 1.2089,
937
- "mean_token_accuracy": 0.7329251766204834,
938
  "step": 116
939
  },
940
  {
941
- "epoch": 0.9831932773109243,
942
- "grad_norm": 0.9741614673227095,
943
- "learning_rate": 1.0174452184434888e-06,
944
- "loss": 1.0675,
945
- "mean_token_accuracy": 0.7658007740974426,
946
  "step": 117
947
  },
948
  {
949
- "epoch": 0.9915966386554622,
950
- "grad_norm": 1.0164869507154413,
951
- "learning_rate": 1.0077562161133376e-06,
952
- "loss": 1.0093,
953
- "mean_token_accuracy": 0.775924801826477,
954
  "step": 118
955
  },
956
  {
957
- "epoch": 1.0,
958
- "grad_norm": 0.9640139778034071,
959
- "learning_rate": 1.0019394719785073e-06,
960
- "loss": 1.0321,
961
- "mean_token_accuracy": 0.7566091418266296,
962
  "step": 119
963
  },
964
  {
965
  "epoch": 1.0,
966
- "step": 119,
967
- "total_flos": 3245435781120.0,
968
- "train_loss": 1.2809894014807308,
969
- "train_runtime": 1398.581,
970
- "train_samples_per_second": 10.81,
971
- "train_steps_per_second": 0.085
972
  }
973
  ],
974
  "logging_steps": 1,
975
- "max_steps": 119,
976
  "num_input_tokens_seen": 0,
977
  "num_train_epochs": 1,
978
  "save_steps": 500,
@@ -988,8 +1932,8 @@
988
  "attributes": {}
989
  }
990
  },
991
- "total_flos": 3245435781120.0,
992
- "train_batch_size": 16,
993
  "trial_name": null,
994
  "trial_params": null
995
  }
 
4
  "best_model_checkpoint": null,
5
  "epoch": 1.0,
6
  "eval_steps": 500,
7
+ "global_step": 237,
8
  "is_hyper_param_search": false,
9
  "is_local_process_zero": true,
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "epoch": 0.004219409282700422,
14
+ "grad_norm": 56.492855072021484,
15
  "learning_rate": 0.0,
16
+ "loss": 3.4465,
17
+ "mean_token_accuracy": 0.43144476413726807,
18
  "step": 1
19
  },
20
  {
21
+ "epoch": 0.008438818565400843,
22
+ "grad_norm": 67.59725189208984,
23
+ "learning_rate": 4.1666666666666667e-07,
24
+ "loss": 3.5603,
25
+ "mean_token_accuracy": 0.4281432032585144,
26
  "step": 2
27
  },
28
  {
29
+ "epoch": 0.012658227848101266,
30
+ "grad_norm": 57.31755065917969,
31
+ "learning_rate": 8.333333333333333e-07,
32
+ "loss": 3.5257,
33
+ "mean_token_accuracy": 0.4222416579723358,
34
  "step": 3
35
  },
36
  {
37
+ "epoch": 0.016877637130801686,
38
+ "grad_norm": 60.031646728515625,
39
+ "learning_rate": 1.25e-06,
40
+ "loss": 3.5073,
41
+ "mean_token_accuracy": 0.4332667887210846,
42
  "step": 4
43
  },
44
  {
45
+ "epoch": 0.02109704641350211,
46
+ "grad_norm": 52.10445022583008,
47
+ "learning_rate": 1.6666666666666667e-06,
48
+ "loss": 3.3834,
49
+ "mean_token_accuracy": 0.44102635979652405,
50
  "step": 5
51
  },
52
  {
53
+ "epoch": 0.02531645569620253,
54
+ "grad_norm": 49.48136520385742,
55
+ "learning_rate": 2.0833333333333334e-06,
56
+ "loss": 3.2116,
57
+ "mean_token_accuracy": 0.4614466726779938,
58
  "step": 6
59
  },
60
  {
61
+ "epoch": 0.029535864978902954,
62
+ "grad_norm": 57.7535400390625,
63
+ "learning_rate": 2.5e-06,
64
+ "loss": 3.1654,
65
+ "mean_token_accuracy": 0.4769577383995056,
66
  "step": 7
67
  },
68
  {
69
+ "epoch": 0.03375527426160337,
70
+ "grad_norm": 55.622840881347656,
71
+ "learning_rate": 2.916666666666667e-06,
72
+ "loss": 2.7736,
73
+ "mean_token_accuracy": 0.5238496661186218,
74
  "step": 8
75
  },
76
  {
77
+ "epoch": 0.0379746835443038,
78
+ "grad_norm": 53.84645080566406,
79
+ "learning_rate": 3.3333333333333333e-06,
80
+ "loss": 2.7429,
81
+ "mean_token_accuracy": 0.5103054046630859,
82
  "step": 9
83
  },
84
  {
85
+ "epoch": 0.04219409282700422,
86
+ "grad_norm": 49.74085998535156,
87
+ "learning_rate": 3.7500000000000005e-06,
88
+ "loss": 2.7235,
89
+ "mean_token_accuracy": 0.5065439343452454,
90
  "step": 10
91
  },
92
  {
93
+ "epoch": 0.046413502109704644,
94
+ "grad_norm": 55.516387939453125,
95
+ "learning_rate": 4.166666666666667e-06,
96
+ "loss": 2.1734,
97
+ "mean_token_accuracy": 0.5752225518226624,
98
  "step": 11
99
  },
100
  {
101
+ "epoch": 0.05063291139240506,
102
+ "grad_norm": 51.78127670288086,
103
+ "learning_rate": 4.583333333333333e-06,
104
+ "loss": 1.9645,
105
+ "mean_token_accuracy": 0.5979234576225281,
106
  "step": 12
107
  },
108
  {
109
+ "epoch": 0.05485232067510549,
110
+ "grad_norm": 41.615440368652344,
111
+ "learning_rate": 5e-06,
112
+ "loss": 1.9885,
113
+ "mean_token_accuracy": 0.5931289196014404,
114
  "step": 13
115
  },
116
  {
117
+ "epoch": 0.05907172995780591,
118
+ "grad_norm": 18.521259307861328,
119
+ "learning_rate": 5.416666666666667e-06,
120
+ "loss": 1.6583,
121
+ "mean_token_accuracy": 0.6572901606559753,
122
  "step": 14
123
  },
124
  {
125
+ "epoch": 0.06329113924050633,
126
+ "grad_norm": 16.77658462524414,
127
+ "learning_rate": 5.833333333333334e-06,
128
+ "loss": 1.2774,
129
+ "mean_token_accuracy": 0.7345430254936218,
130
  "step": 15
131
  },
132
  {
133
+ "epoch": 0.06751054852320675,
134
+ "grad_norm": 16.902246475219727,
135
+ "learning_rate": 6.25e-06,
136
+ "loss": 1.4676,
137
+ "mean_token_accuracy": 0.6940106153488159,
138
  "step": 16
139
  },
140
  {
141
+ "epoch": 0.07172995780590717,
142
+ "grad_norm": 4.634657382965088,
143
+ "learning_rate": 6.666666666666667e-06,
144
+ "loss": 1.331,
145
+ "mean_token_accuracy": 0.7168394327163696,
146
  "step": 17
147
  },
148
  {
149
+ "epoch": 0.0759493670886076,
150
+ "grad_norm": 2.60319447517395,
151
+ "learning_rate": 7.083333333333335e-06,
152
+ "loss": 1.2297,
153
+ "mean_token_accuracy": 0.738610565662384,
154
  "step": 18
155
  },
156
  {
157
+ "epoch": 0.08016877637130802,
158
+ "grad_norm": 2.7165443897247314,
159
+ "learning_rate": 7.500000000000001e-06,
160
+ "loss": 1.0861,
161
+ "mean_token_accuracy": 0.7637795209884644,
162
  "step": 19
163
  },
164
  {
165
+ "epoch": 0.08438818565400844,
166
+ "grad_norm": 3.0834693908691406,
167
+ "learning_rate": 7.916666666666667e-06,
168
+ "loss": 1.2613,
169
+ "mean_token_accuracy": 0.7090403437614441,
170
  "step": 20
171
  },
172
  {
173
+ "epoch": 0.08860759493670886,
174
+ "grad_norm": 2.5113463401794434,
175
+ "learning_rate": 8.333333333333334e-06,
176
+ "loss": 1.1188,
177
+ "mean_token_accuracy": 0.7468186020851135,
178
  "step": 21
179
  },
180
  {
181
+ "epoch": 0.09282700421940929,
182
+ "grad_norm": 2.150643825531006,
183
+ "learning_rate": 8.750000000000001e-06,
184
+ "loss": 1.2615,
185
+ "mean_token_accuracy": 0.7228550910949707,
186
  "step": 22
187
  },
188
  {
189
+ "epoch": 0.0970464135021097,
190
+ "grad_norm": 2.469194173812866,
191
+ "learning_rate": 9.166666666666666e-06,
192
+ "loss": 1.198,
193
+ "mean_token_accuracy": 0.7393919825553894,
194
  "step": 23
195
  },
196
  {
197
+ "epoch": 0.10126582278481013,
198
+ "grad_norm": 2.4474411010742188,
199
+ "learning_rate": 9.583333333333335e-06,
200
+ "loss": 1.096,
201
+ "mean_token_accuracy": 0.7572800517082214,
202
  "step": 24
203
  },
204
  {
205
+ "epoch": 0.10548523206751055,
206
+ "grad_norm": 2.088087320327759,
207
+ "learning_rate": 1e-05,
208
+ "loss": 0.985,
209
+ "mean_token_accuracy": 0.7962148785591125,
210
  "step": 25
211
  },
212
  {
213
+ "epoch": 0.10970464135021098,
214
+ "grad_norm": 1.9643747806549072,
215
+ "learning_rate": 9.999510542279196e-06,
216
+ "loss": 1.209,
217
+ "mean_token_accuracy": 0.729297935962677,
218
  "step": 26
219
  },
220
  {
221
+ "epoch": 0.11392405063291139,
222
+ "grad_norm": 2.398698568344116,
223
+ "learning_rate": 9.998042275591827e-06,
224
+ "loss": 1.1201,
225
+ "mean_token_accuracy": 0.755721390247345,
226
  "step": 27
227
  },
228
  {
229
+ "epoch": 0.11814345991561181,
230
+ "grad_norm": 2.144404888153076,
231
+ "learning_rate": 9.995595519339882e-06,
232
+ "loss": 1.2033,
233
+ "mean_token_accuracy": 0.737983763217926,
234
  "step": 28
235
  },
236
  {
237
+ "epoch": 0.12236286919831224,
238
+ "grad_norm": 1.771583914756775,
239
+ "learning_rate": 9.992170805782799e-06,
240
+ "loss": 1.2612,
241
+ "mean_token_accuracy": 0.7275724411010742,
242
  "step": 29
243
  },
244
  {
245
+ "epoch": 0.12658227848101267,
246
+ "grad_norm": 1.857326626777649,
247
+ "learning_rate": 9.987768879921685e-06,
248
+ "loss": 1.089,
249
+ "mean_token_accuracy": 0.7615371942520142,
250
  "step": 30
251
  },
252
  {
253
+ "epoch": 0.1308016877637131,
254
+ "grad_norm": 1.5715709924697876,
255
+ "learning_rate": 9.982390699337253e-06,
256
+ "loss": 1.1158,
257
+ "mean_token_accuracy": 0.748799741268158,
258
  "step": 31
259
  },
260
  {
261
+ "epoch": 0.1350210970464135,
262
+ "grad_norm": 1.6975351572036743,
263
+ "learning_rate": 9.976037433981505e-06,
264
+ "loss": 1.2293,
265
+ "mean_token_accuracy": 0.7248790264129639,
266
  "step": 32
267
  },
268
  {
269
+ "epoch": 0.13924050632911392,
270
+ "grad_norm": 1.6748303174972534,
271
+ "learning_rate": 9.968710465923233e-06,
272
+ "loss": 1.0872,
273
+ "mean_token_accuracy": 0.7640039920806885,
274
  "step": 33
275
  },
276
  {
277
+ "epoch": 0.14345991561181434,
278
+ "grad_norm": 1.5389747619628906,
279
+ "learning_rate": 9.960411389047366e-06,
280
+ "loss": 1.091,
281
+ "mean_token_accuracy": 0.7640097737312317,
282
  "step": 34
283
  },
284
  {
285
+ "epoch": 0.14767932489451477,
286
+ "grad_norm": 1.5563571453094482,
287
+ "learning_rate": 9.951142008708238e-06,
288
+ "loss": 1.1572,
289
+ "mean_token_accuracy": 0.7440212368965149,
290
  "step": 35
291
  },
292
  {
293
+ "epoch": 0.1518987341772152,
294
+ "grad_norm": 1.5131391286849976,
295
+ "learning_rate": 9.940904341336859e-06,
296
+ "loss": 1.2426,
297
+ "mean_token_accuracy": 0.7158628106117249,
298
  "step": 36
299
  },
300
  {
301
+ "epoch": 0.15611814345991562,
302
+ "grad_norm": 1.5075509548187256,
303
+ "learning_rate": 9.929700614002265e-06,
304
+ "loss": 1.1656,
305
+ "mean_token_accuracy": 0.7480515837669373,
306
  "step": 37
307
  },
308
  {
309
+ "epoch": 0.16033755274261605,
310
+ "grad_norm": 1.4747117757797241,
311
+ "learning_rate": 9.91753326392706e-06,
312
+ "loss": 1.2723,
313
+ "mean_token_accuracy": 0.7222742438316345,
314
  "step": 38
315
  },
316
  {
317
+ "epoch": 0.16455696202531644,
318
+ "grad_norm": 1.5765788555145264,
319
+ "learning_rate": 9.904404937957213e-06,
320
+ "loss": 1.1191,
321
+ "mean_token_accuracy": 0.7518899440765381,
322
  "step": 39
323
  },
324
  {
325
+ "epoch": 0.16877637130801687,
326
+ "grad_norm": 1.4843032360076904,
327
+ "learning_rate": 9.890318491986282e-06,
328
+ "loss": 1.1076,
329
+ "mean_token_accuracy": 0.7483693361282349,
330
  "step": 40
331
  },
332
  {
333
+ "epoch": 0.1729957805907173,
334
+ "grad_norm": 1.5332094430923462,
335
+ "learning_rate": 9.87527699033415e-06,
336
+ "loss": 1.118,
337
+ "mean_token_accuracy": 0.7508232593536377,
338
  "step": 41
339
  },
340
  {
341
+ "epoch": 0.17721518987341772,
342
+ "grad_norm": 1.5774556398391724,
343
+ "learning_rate": 9.85928370508043e-06,
344
+ "loss": 1.1516,
345
+ "mean_token_accuracy": 0.7587430477142334,
346
  "step": 42
347
  },
348
  {
349
+ "epoch": 0.18143459915611815,
350
+ "grad_norm": 1.5169801712036133,
351
+ "learning_rate": 9.842342115352647e-06,
352
+ "loss": 1.0526,
353
+ "mean_token_accuracy": 0.7587090730667114,
354
  "step": 43
355
  },
356
  {
357
+ "epoch": 0.18565400843881857,
358
+ "grad_norm": 1.476678729057312,
359
+ "learning_rate": 9.824455906569423e-06,
360
+ "loss": 1.1723,
361
+ "mean_token_accuracy": 0.7248958349227905,
362
  "step": 44
363
  },
364
  {
365
+ "epoch": 0.189873417721519,
366
+ "grad_norm": 1.8056585788726807,
367
+ "learning_rate": 9.805628969638757e-06,
368
+ "loss": 1.2247,
369
+ "mean_token_accuracy": 0.728428065776825,
370
  "step": 45
371
  },
372
  {
373
+ "epoch": 0.1940928270042194,
374
+ "grad_norm": 1.4431560039520264,
375
+ "learning_rate": 9.785865400111593e-06,
376
+ "loss": 1.1691,
377
+ "mean_token_accuracy": 0.7389825582504272,
378
  "step": 46
379
  },
380
  {
381
+ "epoch": 0.19831223628691982,
382
+ "grad_norm": 1.5770704746246338,
383
+ "learning_rate": 9.765169497290908e-06,
384
+ "loss": 1.2224,
385
+ "mean_token_accuracy": 0.7342731356620789,
386
  "step": 47
387
  },
388
  {
389
+ "epoch": 0.20253164556962025,
390
+ "grad_norm": 1.505393624305725,
391
+ "learning_rate": 9.743545763296451e-06,
392
+ "loss": 1.1314,
393
+ "mean_token_accuracy": 0.731634795665741,
394
  "step": 48
395
  },
396
  {
397
+ "epoch": 0.20675105485232068,
398
+ "grad_norm": 1.5118578672409058,
399
+ "learning_rate": 9.720998902085354e-06,
400
+ "loss": 1.0991,
401
+ "mean_token_accuracy": 0.7543997764587402,
402
  "step": 49
403
  },
404
  {
405
+ "epoch": 0.2109704641350211,
406
+ "grad_norm": 1.4458030462265015,
407
+ "learning_rate": 9.697533818428863e-06,
408
+ "loss": 1.101,
409
+ "mean_token_accuracy": 0.7582892775535583,
410
  "step": 50
411
  },
412
  {
413
+ "epoch": 0.21518987341772153,
414
+ "grad_norm": 1.3428678512573242,
415
+ "learning_rate": 9.673155616845362e-06,
416
+ "loss": 1.219,
417
+ "mean_token_accuracy": 0.7282519936561584,
418
  "step": 51
419
  },
420
  {
421
+ "epoch": 0.21940928270042195,
422
+ "grad_norm": 1.4228544235229492,
423
+ "learning_rate": 9.647869600489954e-06,
424
+ "loss": 1.3229,
425
+ "mean_token_accuracy": 0.7029238343238831,
426
  "step": 52
427
  },
428
  {
429
+ "epoch": 0.22362869198312235,
430
+ "grad_norm": 1.427255630493164,
431
+ "learning_rate": 9.621681270000833e-06,
432
+ "loss": 1.0733,
433
+ "mean_token_accuracy": 0.7589775919914246,
434
  "step": 53
435
  },
436
  {
437
+ "epoch": 0.22784810126582278,
438
+ "grad_norm": 1.4552180767059326,
439
+ "learning_rate": 9.594596322302688e-06,
440
+ "loss": 1.0936,
441
+ "mean_token_accuracy": 0.7616468071937561,
442
  "step": 54
443
  },
444
  {
445
+ "epoch": 0.2320675105485232,
446
+ "grad_norm": 1.6069968938827515,
447
+ "learning_rate": 9.566620649367418e-06,
448
+ "loss": 1.2073,
449
+ "mean_token_accuracy": 0.7224194407463074,
450
  "step": 55
451
  },
452
  {
453
+ "epoch": 0.23628691983122363,
454
+ "grad_norm": 1.412222146987915,
455
+ "learning_rate": 9.537760336932406e-06,
456
+ "loss": 1.1143,
457
+ "mean_token_accuracy": 0.7492688298225403,
458
  "step": 56
459
  },
460
  {
461
+ "epoch": 0.24050632911392406,
462
+ "grad_norm": 1.4546841382980347,
463
+ "learning_rate": 9.508021663176648e-06,
464
+ "loss": 1.1594,
465
+ "mean_token_accuracy": 0.7387874126434326,
466
  "step": 57
467
  },
468
  {
469
+ "epoch": 0.24472573839662448,
470
+ "grad_norm": 1.3827760219573975,
471
+ "learning_rate": 9.477411097355025e-06,
472
+ "loss": 1.1539,
473
+ "mean_token_accuracy": 0.7320411801338196,
474
  "step": 58
475
  },
476
  {
477
+ "epoch": 0.2489451476793249,
478
+ "grad_norm": 1.4128773212432861,
479
+ "learning_rate": 9.445935298390994e-06,
480
+ "loss": 1.1729,
481
+ "mean_token_accuracy": 0.7295903563499451,
482
  "step": 59
483
  },
484
  {
485
+ "epoch": 0.25316455696202533,
486
+ "grad_norm": 1.436807632446289,
487
+ "learning_rate": 9.413601113428032e-06,
488
+ "loss": 1.1122,
489
+ "mean_token_accuracy": 0.7527050375938416,
490
  "step": 60
491
  },
492
  {
493
+ "epoch": 0.25738396624472576,
494
+ "grad_norm": 1.4155519008636475,
495
+ "learning_rate": 9.380415576340127e-06,
496
+ "loss": 1.134,
497
+ "mean_token_accuracy": 0.748845100402832,
498
  "step": 61
499
  },
500
  {
501
+ "epoch": 0.2616033755274262,
502
+ "grad_norm": 1.3568230867385864,
503
+ "learning_rate": 9.346385906201653e-06,
504
+ "loss": 1.0927,
505
+ "mean_token_accuracy": 0.7473200559616089,
506
  "step": 62
507
  },
508
  {
509
+ "epoch": 0.26582278481012656,
510
+ "grad_norm": 1.3414636850357056,
511
+ "learning_rate": 9.311519505716963e-06,
512
+ "loss": 1.116,
513
+ "mean_token_accuracy": 0.7434830665588379,
514
  "step": 63
515
  },
516
  {
517
+ "epoch": 0.270042194092827,
518
+ "grad_norm": 1.4687381982803345,
519
+ "learning_rate": 9.275823959610019e-06,
520
+ "loss": 1.1996,
521
+ "mean_token_accuracy": 0.7335405349731445,
522
  "step": 64
523
  },
524
  {
525
+ "epoch": 0.2742616033755274,
526
+ "grad_norm": 1.3659785985946655,
527
+ "learning_rate": 9.239307032974438e-06,
528
+ "loss": 1.1548,
529
+ "mean_token_accuracy": 0.7464311718940735,
530
  "step": 65
531
  },
532
  {
533
+ "epoch": 0.27848101265822783,
534
+ "grad_norm": 1.3824652433395386,
535
+ "learning_rate": 9.201976669584299e-06,
536
+ "loss": 1.0442,
537
+ "mean_token_accuracy": 0.7634408473968506,
538
  "step": 66
539
  },
540
  {
541
+ "epoch": 0.28270042194092826,
542
+ "grad_norm": 1.3764418363571167,
543
+ "learning_rate": 9.163840990166085e-06,
544
+ "loss": 1.163,
545
+ "mean_token_accuracy": 0.7362051606178284,
546
  "step": 67
547
  },
548
  {
549
+ "epoch": 0.2869198312236287,
550
+ "grad_norm": 1.4887207746505737,
551
+ "learning_rate": 9.124908290632119e-06,
552
+ "loss": 1.2148,
553
+ "mean_token_accuracy": 0.7365679144859314,
554
  "step": 68
555
  },
556
  {
557
+ "epoch": 0.2911392405063291,
558
+ "grad_norm": 1.412133812904358,
559
+ "learning_rate": 9.085187040275903e-06,
560
+ "loss": 1.1769,
561
+ "mean_token_accuracy": 0.7347424626350403,
562
  "step": 69
563
  },
564
  {
565
+ "epoch": 0.29535864978902954,
566
+ "grad_norm": 1.4466365575790405,
567
+ "learning_rate": 9.044685879929734e-06,
568
+ "loss": 1.1561,
569
+ "mean_token_accuracy": 0.736750066280365,
570
  "step": 70
571
  },
572
  {
573
+ "epoch": 0.29957805907172996,
574
+ "grad_norm": 1.452510952949524,
575
+ "learning_rate": 9.003413620085002e-06,
576
+ "loss": 1.0451,
577
+ "mean_token_accuracy": 0.7633549571037292,
578
  "step": 71
579
  },
580
  {
581
+ "epoch": 0.3037974683544304,
582
+ "grad_norm": 1.4171886444091797,
583
+ "learning_rate": 8.961379238975594e-06,
584
+ "loss": 1.1657,
585
+ "mean_token_accuracy": 0.7385682463645935,
586
  "step": 72
587
  },
588
  {
589
+ "epoch": 0.3080168776371308,
590
+ "grad_norm": 1.3945873975753784,
591
+ "learning_rate": 8.918591880624783e-06,
592
+ "loss": 1.154,
593
+ "mean_token_accuracy": 0.7416154742240906,
594
  "step": 73
595
  },
596
  {
597
+ "epoch": 0.31223628691983124,
598
+ "grad_norm": 1.3587473630905151,
599
+ "learning_rate": 8.875060852856082e-06,
600
+ "loss": 1.1147,
601
+ "mean_token_accuracy": 0.7544776201248169,
602
  "step": 74
603
  },
604
  {
605
+ "epoch": 0.31645569620253167,
606
+ "grad_norm": 1.3392646312713623,
607
+ "learning_rate": 8.830795625268437e-06,
608
+ "loss": 1.0643,
609
+ "mean_token_accuracy": 0.7560762763023376,
610
  "step": 75
611
  },
612
  {
613
+ "epoch": 0.3206751054852321,
614
+ "grad_norm": 1.3936909437179565,
615
+ "learning_rate": 8.785805827176256e-06,
616
+ "loss": 1.0533,
617
+ "mean_token_accuracy": 0.7552059292793274,
618
  "step": 76
619
  },
620
  {
621
+ "epoch": 0.32489451476793246,
622
+ "grad_norm": 1.3902052640914917,
623
+ "learning_rate": 8.740101245514659e-06,
624
+ "loss": 0.9963,
625
+ "mean_token_accuracy": 0.784263014793396,
626
  "step": 77
627
  },
628
  {
629
+ "epoch": 0.3291139240506329,
630
+ "grad_norm": 1.37273371219635,
631
+ "learning_rate": 8.69369182271048e-06,
632
+ "loss": 1.0645,
633
+ "mean_token_accuracy": 0.7502308487892151,
634
  "step": 78
635
  },
636
  {
637
+ "epoch": 0.3333333333333333,
638
+ "grad_norm": 1.4530737400054932,
639
+ "learning_rate": 8.646587654519413e-06,
640
+ "loss": 1.048,
641
+ "mean_token_accuracy": 0.765147864818573,
642
  "step": 79
643
  },
644
  {
645
+ "epoch": 0.33755274261603374,
646
+ "grad_norm": 1.351982593536377,
647
+ "learning_rate": 8.598798987829816e-06,
648
+ "loss": 1.1773,
649
+ "mean_token_accuracy": 0.7323355674743652,
650
  "step": 80
651
  },
652
  {
653
+ "epoch": 0.34177215189873417,
654
+ "grad_norm": 1.4358004331588745,
655
+ "learning_rate": 8.550336218433631e-06,
656
+ "loss": 1.0669,
657
+ "mean_token_accuracy": 0.7604398727416992,
658
  "step": 81
659
  },
660
  {
661
+ "epoch": 0.3459915611814346,
662
+ "grad_norm": 1.4607288837432861,
663
+ "learning_rate": 8.501209888764928e-06,
664
+ "loss": 1.1499,
665
+ "mean_token_accuracy": 0.7327597737312317,
666
  "step": 82
667
  },
668
  {
669
+ "epoch": 0.350210970464135,
670
+ "grad_norm": 1.3822574615478516,
671
+ "learning_rate": 8.451430685606532e-06,
672
+ "loss": 1.0957,
673
+ "mean_token_accuracy": 0.7444460988044739,
674
  "step": 83
675
  },
676
  {
677
+ "epoch": 0.35443037974683544,
678
+ "grad_norm": 1.3216774463653564,
679
+ "learning_rate": 8.401009437765248e-06,
680
+ "loss": 1.0046,
681
+ "mean_token_accuracy": 0.767160177230835,
682
  "step": 84
683
  },
684
  {
685
+ "epoch": 0.35864978902953587,
686
+ "grad_norm": 1.6017900705337524,
687
+ "learning_rate": 8.349957113716213e-06,
688
+ "loss": 1.0585,
689
+ "mean_token_accuracy": 0.7692307829856873,
690
  "step": 85
691
  },
692
  {
693
+ "epoch": 0.3628691983122363,
694
+ "grad_norm": 1.3391270637512207,
695
+ "learning_rate": 8.29828481921683e-06,
696
+ "loss": 1.2253,
697
+ "mean_token_accuracy": 0.7242193818092346,
698
  "step": 86
699
  },
700
  {
701
+ "epoch": 0.3670886075949367,
702
+ "grad_norm": 1.373449683189392,
703
+ "learning_rate": 8.246003794890885e-06,
704
+ "loss": 1.0709,
705
+ "mean_token_accuracy": 0.757455587387085,
706
  "step": 87
707
  },
708
  {
709
+ "epoch": 0.37130801687763715,
710
+ "grad_norm": 1.4508413076400757,
711
+ "learning_rate": 8.19312541378326e-06,
712
+ "loss": 1.0736,
713
+ "mean_token_accuracy": 0.7621853947639465,
714
  "step": 88
715
  },
716
  {
717
+ "epoch": 0.3755274261603376,
718
+ "grad_norm": 1.3206504583358765,
719
+ "learning_rate": 8.139661178885912e-06,
720
+ "loss": 1.1301,
721
+ "mean_token_accuracy": 0.7476067543029785,
722
  "step": 89
723
  },
724
  {
725
+ "epoch": 0.379746835443038,
726
+ "grad_norm": 1.3155889511108398,
727
+ "learning_rate": 8.085622720635536e-06,
728
+ "loss": 1.1623,
729
+ "mean_token_accuracy": 0.743880569934845,
730
  "step": 90
731
  },
732
  {
733
+ "epoch": 0.38396624472573837,
734
+ "grad_norm": 1.4986933469772339,
735
+ "learning_rate": 8.031021794383513e-06,
736
+ "loss": 0.9997,
737
+ "mean_token_accuracy": 0.7751938104629517,
738
  "step": 91
739
  },
740
  {
741
+ "epoch": 0.3881856540084388,
742
+ "grad_norm": 1.3492070436477661,
743
+ "learning_rate": 7.975870277838695e-06,
744
+ "loss": 1.1146,
745
+ "mean_token_accuracy": 0.7491279244422913,
746
  "step": 92
747
  },
748
  {
749
+ "epoch": 0.3924050632911392,
750
+ "grad_norm": 1.3896995782852173,
751
+ "learning_rate": 7.920180168483565e-06,
752
+ "loss": 1.0942,
753
+ "mean_token_accuracy": 0.7597774267196655,
754
  "step": 93
755
  },
756
  {
757
+ "epoch": 0.39662447257383965,
758
+ "grad_norm": 1.3334070444107056,
759
+ "learning_rate": 7.863963580964344e-06,
760
+ "loss": 1.0688,
761
+ "mean_token_accuracy": 0.7475572228431702,
762
  "step": 94
763
  },
764
  {
765
+ "epoch": 0.4008438818565401,
766
+ "grad_norm": 1.4157761335372925,
767
+ "learning_rate": 7.80723274445561e-06,
768
+ "loss": 1.1208,
769
+ "mean_token_accuracy": 0.731880784034729,
770
  "step": 95
771
  },
772
  {
773
+ "epoch": 0.4050632911392405,
774
+ "grad_norm": 1.4356557130813599,
775
+ "learning_rate": 7.75e-06,
776
+ "loss": 1.0452,
777
+ "mean_token_accuracy": 0.7568075060844421,
778
  "step": 96
779
  },
780
  {
781
+ "epoch": 0.4092827004219409,
782
+ "grad_norm": 1.4047561883926392,
783
+ "learning_rate": 7.692277797823585e-06,
784
+ "loss": 1.0809,
785
+ "mean_token_accuracy": 0.7537745237350464,
786
  "step": 97
787
  },
788
  {
789
+ "epoch": 0.41350210970464135,
790
+ "grad_norm": 1.3407511711120605,
791
+ "learning_rate": 7.634078694627483e-06,
792
+ "loss": 1.1126,
793
+ "mean_token_accuracy": 0.7462708353996277,
794
  "step": 98
795
  },
796
  {
797
+ "epoch": 0.4177215189873418,
798
+ "grad_norm": 1.371414303779602,
799
+ "learning_rate": 7.575415350856316e-06,
800
+ "loss": 1.1067,
801
+ "mean_token_accuracy": 0.7437711954116821,
802
  "step": 99
803
  },
804
  {
805
+ "epoch": 0.4219409282700422,
806
+ "grad_norm": 1.3650238513946533,
807
+ "learning_rate": 7.516300527944104e-06,
808
+ "loss": 1.0487,
809
+ "mean_token_accuracy": 0.7662240862846375,
810
  "step": 100
811
  },
812
  {
813
+ "epoch": 0.42616033755274263,
814
+ "grad_norm": 1.3607968091964722,
815
+ "learning_rate": 7.456747085538173e-06,
816
+ "loss": 1.0891,
817
+ "mean_token_accuracy": 0.7554155588150024,
818
  "step": 101
819
  },
820
  {
821
+ "epoch": 0.43037974683544306,
822
+ "grad_norm": 1.3348058462142944,
823
+ "learning_rate": 7.3967679787017166e-06,
824
+ "loss": 1.147,
825
+ "mean_token_accuracy": 0.7373363971710205,
826
  "step": 102
827
  },
828
  {
829
+ "epoch": 0.4345991561181435,
830
+ "grad_norm": 1.4145907163619995,
831
+ "learning_rate": 7.336376255095592e-06,
832
+ "loss": 1.093,
833
+ "mean_token_accuracy": 0.7504742741584778,
834
  "step": 103
835
  },
836
  {
837
+ "epoch": 0.4388185654008439,
838
+ "grad_norm": 1.3660974502563477,
839
+ "learning_rate": 7.275585052139975e-06,
840
+ "loss": 1.2449,
841
+ "mean_token_accuracy": 0.7118442058563232,
842
  "step": 104
843
  },
844
  {
845
+ "epoch": 0.4430379746835443,
846
+ "grad_norm": 1.3254221677780151,
847
+ "learning_rate": 7.2144075941564835e-06,
848
+ "loss": 1.1414,
849
+ "mean_token_accuracy": 0.7412657737731934,
850
  "step": 105
851
  },
852
  {
853
+ "epoch": 0.4472573839662447,
854
+ "grad_norm": 1.3703272342681885,
855
+ "learning_rate": 7.152857189491406e-06,
856
+ "loss": 1.1501,
857
+ "mean_token_accuracy": 0.7333996891975403,
858
  "step": 106
859
  },
860
  {
861
+ "epoch": 0.45147679324894513,
862
+ "grad_norm": 1.3824867010116577,
863
+ "learning_rate": 7.090947227620646e-06,
864
+ "loss": 1.0566,
865
+ "mean_token_accuracy": 0.761897087097168,
866
  "step": 107
867
  },
868
  {
869
+ "epoch": 0.45569620253164556,
870
+ "grad_norm": 1.353622317314148,
871
+ "learning_rate": 7.028691176237018e-06,
872
+ "loss": 1.1275,
873
+ "mean_token_accuracy": 0.7441860437393188,
874
  "step": 108
875
  },
876
  {
877
+ "epoch": 0.459915611814346,
878
+ "grad_norm": 1.343522548675537,
879
+ "learning_rate": 6.966102578320531e-06,
880
+ "loss": 1.0808,
881
+ "mean_token_accuracy": 0.750849723815918,
882
  "step": 109
883
  },
884
  {
885
+ "epoch": 0.4641350210970464,
886
+ "grad_norm": 1.2934088706970215,
887
+ "learning_rate": 6.903195049192285e-06,
888
+ "loss": 1.0279,
889
+ "mean_token_accuracy": 0.7599811553955078,
890
  "step": 110
891
  },
892
  {
893
+ "epoch": 0.46835443037974683,
894
+ "grad_norm": 1.288739562034607,
895
+ "learning_rate": 6.839982273552651e-06,
896
+ "loss": 1.1493,
897
+ "mean_token_accuracy": 0.7389063835144043,
898
  "step": 111
899
  },
900
  {
901
+ "epoch": 0.47257383966244726,
902
+ "grad_norm": 1.377466082572937,
903
+ "learning_rate": 6.776478002504335e-06,
904
+ "loss": 1.0465,
905
+ "mean_token_accuracy": 0.7633302211761475,
906
  "step": 112
907
  },
908
  {
909
+ "epoch": 0.4767932489451477,
910
+ "grad_norm": 1.2997636795043945,
911
+ "learning_rate": 6.712696050561014e-06,
912
+ "loss": 1.129,
913
+ "mean_token_accuracy": 0.7459065914154053,
914
  "step": 113
915
  },
916
  {
917
+ "epoch": 0.4810126582278481,
918
+ "grad_norm": 1.320365071296692,
919
+ "learning_rate": 6.648650292642166e-06,
920
+ "loss": 1.1049,
921
+ "mean_token_accuracy": 0.7416779398918152,
922
  "step": 114
923
  },
924
  {
925
+ "epoch": 0.48523206751054854,
926
+ "grad_norm": 1.377439260482788,
927
+ "learning_rate": 6.584354661054765e-06,
928
+ "loss": 1.004,
929
+ "mean_token_accuracy": 0.7778915166854858,
930
  "step": 115
931
  },
932
  {
933
+ "epoch": 0.48945147679324896,
934
+ "grad_norm": 1.2823392152786255,
935
+ "learning_rate": 6.519823142462501e-06,
936
+ "loss": 1.0903,
937
+ "mean_token_accuracy": 0.7483987808227539,
938
  "step": 116
939
  },
940
  {
941
+ "epoch": 0.4936708860759494,
942
+ "grad_norm": 1.3214064836502075,
943
+ "learning_rate": 6.4550697748431545e-06,
944
+ "loss": 1.1488,
945
+ "mean_token_accuracy": 0.7386626601219177,
946
  "step": 117
947
  },
948
  {
949
+ "epoch": 0.4978902953586498,
950
+ "grad_norm": 1.4070407152175903,
951
+ "learning_rate": 6.390108644434828e-06,
952
+ "loss": 1.1147,
953
+ "mean_token_accuracy": 0.7577263712882996,
954
  "step": 118
955
  },
956
  {
957
+ "epoch": 0.5021097046413502,
958
+ "grad_norm": 1.3729873895645142,
959
+ "learning_rate": 6.32495388267167e-06,
960
+ "loss": 1.1647,
961
+ "mean_token_accuracy": 0.7341417670249939,
962
  "step": 119
963
  },
964
+ {
965
+ "epoch": 0.5063291139240507,
966
+ "grad_norm": 1.3975082635879517,
967
+ "learning_rate": 6.259619663109762e-06,
968
+ "loss": 1.0578,
969
+ "mean_token_accuracy": 0.7666885852813721,
970
+ "step": 120
971
+ },
972
+ {
973
+ "epoch": 0.510548523206751,
974
+ "grad_norm": 1.3510339260101318,
975
+ "learning_rate": 6.19412019834386e-06,
976
+ "loss": 1.0236,
977
+ "mean_token_accuracy": 0.7666836380958557,
978
+ "step": 121
979
+ },
980
+ {
981
+ "epoch": 0.5147679324894515,
982
+ "grad_norm": 1.2858039140701294,
983
+ "learning_rate": 6.1284697369156276e-06,
984
+ "loss": 1.1033,
985
+ "mean_token_accuracy": 0.7595654726028442,
986
+ "step": 122
987
+ },
988
+ {
989
+ "epoch": 0.5189873417721519,
990
+ "grad_norm": 1.325225830078125,
991
+ "learning_rate": 6.062682560214053e-06,
992
+ "loss": 1.0949,
993
+ "mean_token_accuracy": 0.749600350856781,
994
+ "step": 123
995
+ },
996
+ {
997
+ "epoch": 0.5232067510548524,
998
+ "grad_norm": 1.3359731435775757,
999
+ "learning_rate": 5.996772979368715e-06,
1000
+ "loss": 1.0567,
1001
+ "mean_token_accuracy": 0.7532724738121033,
1002
+ "step": 124
1003
+ },
1004
+ {
1005
+ "epoch": 0.5274261603375527,
1006
+ "grad_norm": 1.315093755722046,
1007
+ "learning_rate": 5.930755332136604e-06,
1008
+ "loss": 1.0807,
1009
+ "mean_token_accuracy": 0.7582170367240906,
1010
+ "step": 125
1011
+ },
1012
+ {
1013
+ "epoch": 0.5316455696202531,
1014
+ "grad_norm": 1.3243067264556885,
1015
+ "learning_rate": 5.86464397978312e-06,
1016
+ "loss": 1.1816,
1017
+ "mean_token_accuracy": 0.7343863248825073,
1018
+ "step": 126
1019
+ },
1020
+ {
1021
+ "epoch": 0.5358649789029536,
1022
+ "grad_norm": 1.2779967784881592,
1023
+ "learning_rate": 5.798453303957968e-06,
1024
+ "loss": 1.2504,
1025
+ "mean_token_accuracy": 0.7217909097671509,
1026
+ "step": 127
1027
+ },
1028
+ {
1029
+ "epoch": 0.540084388185654,
1030
+ "grad_norm": 1.363757848739624,
1031
+ "learning_rate": 5.73219770356663e-06,
1032
+ "loss": 1.1925,
1033
+ "mean_token_accuracy": 0.7341822981834412,
1034
+ "step": 128
1035
+ },
1036
+ {
1037
+ "epoch": 0.5443037974683544,
1038
+ "grad_norm": 1.3086532354354858,
1039
+ "learning_rate": 5.665891591638063e-06,
1040
+ "loss": 1.1309,
1041
+ "mean_token_accuracy": 0.7532561421394348,
1042
+ "step": 129
1043
+ },
1044
+ {
1045
+ "epoch": 0.5485232067510548,
1046
+ "grad_norm": 1.367034673690796,
1047
+ "learning_rate": 5.5995493921893415e-06,
1048
+ "loss": 0.9554,
1049
+ "mean_token_accuracy": 0.7872604727745056,
1050
+ "step": 130
1051
+ },
1052
+ {
1053
+ "epoch": 0.5527426160337553,
1054
+ "grad_norm": 1.3385508060455322,
1055
+ "learning_rate": 5.533185537087906e-06,
1056
+ "loss": 1.0339,
1057
+ "mean_token_accuracy": 0.76613849401474,
1058
+ "step": 131
1059
+ },
1060
+ {
1061
+ "epoch": 0.5569620253164557,
1062
+ "grad_norm": 1.315830945968628,
1063
+ "learning_rate": 5.4668144629120945e-06,
1064
+ "loss": 1.1457,
1065
+ "mean_token_accuracy": 0.7451325058937073,
1066
+ "step": 132
1067
+ },
1068
+ {
1069
+ "epoch": 0.5611814345991561,
1070
+ "grad_norm": 1.4468579292297363,
1071
+ "learning_rate": 5.40045060781066e-06,
1072
+ "loss": 1.0689,
1073
+ "mean_token_accuracy": 0.7672370076179504,
1074
+ "step": 133
1075
+ },
1076
+ {
1077
+ "epoch": 0.5654008438818565,
1078
+ "grad_norm": 1.419268250465393,
1079
+ "learning_rate": 5.33410840836194e-06,
1080
+ "loss": 1.0781,
1081
+ "mean_token_accuracy": 0.7676616907119751,
1082
+ "step": 134
1083
+ },
1084
+ {
1085
+ "epoch": 0.569620253164557,
1086
+ "grad_norm": 1.3178187608718872,
1087
+ "learning_rate": 5.267802296433372e-06,
1088
+ "loss": 1.0813,
1089
+ "mean_token_accuracy": 0.7537773847579956,
1090
+ "step": 135
1091
+ },
1092
+ {
1093
+ "epoch": 0.5738396624472574,
1094
+ "grad_norm": 1.352331280708313,
1095
+ "learning_rate": 5.201546696042033e-06,
1096
+ "loss": 1.1175,
1097
+ "mean_token_accuracy": 0.74346524477005,
1098
+ "step": 136
1099
+ },
1100
+ {
1101
+ "epoch": 0.5780590717299579,
1102
+ "grad_norm": 1.2893943786621094,
1103
+ "learning_rate": 5.13535602021688e-06,
1104
+ "loss": 1.146,
1105
+ "mean_token_accuracy": 0.7399700880050659,
1106
+ "step": 137
1107
+ },
1108
+ {
1109
+ "epoch": 0.5822784810126582,
1110
+ "grad_norm": 1.373579978942871,
1111
+ "learning_rate": 5.069244667863397e-06,
1112
+ "loss": 1.0754,
1113
+ "mean_token_accuracy": 0.747799277305603,
1114
+ "step": 138
1115
+ },
1116
+ {
1117
+ "epoch": 0.5864978902953587,
1118
+ "grad_norm": 1.3490039110183716,
1119
+ "learning_rate": 5.003227020631287e-06,
1120
+ "loss": 1.064,
1121
+ "mean_token_accuracy": 0.764332115650177,
1122
+ "step": 139
1123
+ },
1124
+ {
1125
+ "epoch": 0.5907172995780591,
1126
+ "grad_norm": 1.341293454170227,
1127
+ "learning_rate": 4.937317439785949e-06,
1128
+ "loss": 1.16,
1129
+ "mean_token_accuracy": 0.7380607724189758,
1130
+ "step": 140
1131
+ },
1132
+ {
1133
+ "epoch": 0.5949367088607594,
1134
+ "grad_norm": 1.2629183530807495,
1135
+ "learning_rate": 4.871530263084373e-06,
1136
+ "loss": 1.0139,
1137
+ "mean_token_accuracy": 0.7695739269256592,
1138
+ "step": 141
1139
+ },
1140
+ {
1141
+ "epoch": 0.5991561181434599,
1142
+ "grad_norm": 1.3168911933898926,
1143
+ "learning_rate": 4.80587980165614e-06,
1144
+ "loss": 1.1634,
1145
+ "mean_token_accuracy": 0.7377873659133911,
1146
+ "step": 142
1147
+ },
1148
+ {
1149
+ "epoch": 0.6033755274261603,
1150
+ "grad_norm": 1.3361846208572388,
1151
+ "learning_rate": 4.74038033689024e-06,
1152
+ "loss": 1.126,
1153
+ "mean_token_accuracy": 0.7410577535629272,
1154
+ "step": 143
1155
+ },
1156
+ {
1157
+ "epoch": 0.6075949367088608,
1158
+ "grad_norm": 1.3500585556030273,
1159
+ "learning_rate": 4.675046117328333e-06,
1160
+ "loss": 1.0654,
1161
+ "mean_token_accuracy": 0.7601693868637085,
1162
+ "step": 144
1163
+ },
1164
+ {
1165
+ "epoch": 0.6118143459915611,
1166
+ "grad_norm": 1.4159785509109497,
1167
+ "learning_rate": 4.609891355565172e-06,
1168
+ "loss": 1.0892,
1169
+ "mean_token_accuracy": 0.7582122683525085,
1170
+ "step": 145
1171
+ },
1172
+ {
1173
+ "epoch": 0.6160337552742616,
1174
+ "grad_norm": 1.3257557153701782,
1175
+ "learning_rate": 4.544930225156847e-06,
1176
+ "loss": 1.0994,
1177
+ "mean_token_accuracy": 0.7577795386314392,
1178
+ "step": 146
1179
+ },
1180
+ {
1181
+ "epoch": 0.620253164556962,
1182
+ "grad_norm": 1.3204015493392944,
1183
+ "learning_rate": 4.480176857537499e-06,
1184
+ "loss": 1.1573,
1185
+ "mean_token_accuracy": 0.7304109334945679,
1186
+ "step": 147
1187
+ },
1188
+ {
1189
+ "epoch": 0.6244725738396625,
1190
+ "grad_norm": 1.2947508096694946,
1191
+ "learning_rate": 4.415645338945236e-06,
1192
+ "loss": 1.1672,
1193
+ "mean_token_accuracy": 0.7346352934837341,
1194
+ "step": 148
1195
+ },
1196
+ {
1197
+ "epoch": 0.6286919831223629,
1198
+ "grad_norm": 1.4755604267120361,
1199
+ "learning_rate": 4.351349707357836e-06,
1200
+ "loss": 1.0376,
1201
+ "mean_token_accuracy": 0.7728136777877808,
1202
+ "step": 149
1203
+ },
1204
+ {
1205
+ "epoch": 0.6329113924050633,
1206
+ "grad_norm": 1.375398874282837,
1207
+ "learning_rate": 4.287303949438987e-06,
1208
+ "loss": 1.1152,
1209
+ "mean_token_accuracy": 0.751935601234436,
1210
+ "step": 150
1211
+ },
1212
+ {
1213
+ "epoch": 0.6371308016877637,
1214
+ "grad_norm": 1.297082543373108,
1215
+ "learning_rate": 4.223521997495665e-06,
1216
+ "loss": 1.1076,
1217
+ "mean_token_accuracy": 0.7536709904670715,
1218
+ "step": 151
1219
+ },
1220
+ {
1221
+ "epoch": 0.6413502109704642,
1222
+ "grad_norm": 1.289937138557434,
1223
+ "learning_rate": 4.160017726447352e-06,
1224
+ "loss": 1.0683,
1225
+ "mean_token_accuracy": 0.7612445950508118,
1226
+ "step": 152
1227
+ },
1228
+ {
1229
+ "epoch": 0.6455696202531646,
1230
+ "grad_norm": 1.3394392728805542,
1231
+ "learning_rate": 4.096804950807717e-06,
1232
+ "loss": 1.1548,
1233
+ "mean_token_accuracy": 0.7326341271400452,
1234
+ "step": 153
1235
+ },
1236
+ {
1237
+ "epoch": 0.6497890295358649,
1238
+ "grad_norm": 1.3397821187973022,
1239
+ "learning_rate": 4.033897421679472e-06,
1240
+ "loss": 1.0725,
1241
+ "mean_token_accuracy": 0.7469461560249329,
1242
+ "step": 154
1243
+ },
1244
+ {
1245
+ "epoch": 0.6540084388185654,
1246
+ "grad_norm": 1.314746379852295,
1247
+ "learning_rate": 3.971308823762983e-06,
1248
+ "loss": 1.021,
1249
+ "mean_token_accuracy": 0.7617459893226624,
1250
+ "step": 155
1251
+ },
1252
+ {
1253
+ "epoch": 0.6582278481012658,
1254
+ "grad_norm": 1.3997515439987183,
1255
+ "learning_rate": 3.9090527723793545e-06,
1256
+ "loss": 0.9554,
1257
+ "mean_token_accuracy": 0.783367931842804,
1258
+ "step": 156
1259
+ },
1260
+ {
1261
+ "epoch": 0.6624472573839663,
1262
+ "grad_norm": 1.3897455930709839,
1263
+ "learning_rate": 3.847142810508596e-06,
1264
+ "loss": 1.0588,
1265
+ "mean_token_accuracy": 0.7604950666427612,
1266
+ "step": 157
1267
+ },
1268
+ {
1269
+ "epoch": 0.6666666666666666,
1270
+ "grad_norm": 1.29047429561615,
1271
+ "learning_rate": 3.785592405843518e-06,
1272
+ "loss": 0.9476,
1273
+ "mean_token_accuracy": 0.7861513495445251,
1274
+ "step": 158
1275
+ },
1276
+ {
1277
+ "epoch": 0.6708860759493671,
1278
+ "grad_norm": 1.3335379362106323,
1279
+ "learning_rate": 3.724414947860027e-06,
1280
+ "loss": 1.1159,
1281
+ "mean_token_accuracy": 0.7479957938194275,
1282
+ "step": 159
1283
+ },
1284
+ {
1285
+ "epoch": 0.6751054852320675,
1286
+ "grad_norm": 1.3141283988952637,
1287
+ "learning_rate": 3.6636237449044077e-06,
1288
+ "loss": 1.0423,
1289
+ "mean_token_accuracy": 0.7584823966026306,
1290
+ "step": 160
1291
+ },
1292
+ {
1293
+ "epoch": 0.679324894514768,
1294
+ "grad_norm": 1.2607753276824951,
1295
+ "learning_rate": 3.603232021298284e-06,
1296
+ "loss": 1.3062,
1297
+ "mean_token_accuracy": 0.704585075378418,
1298
+ "step": 161
1299
+ },
1300
+ {
1301
+ "epoch": 0.6835443037974683,
1302
+ "grad_norm": 1.414121150970459,
1303
+ "learning_rate": 3.5432529144618287e-06,
1304
+ "loss": 1.1291,
1305
+ "mean_token_accuracy": 0.7416020631790161,
1306
+ "step": 162
1307
+ },
1308
+ {
1309
+ "epoch": 0.6877637130801688,
1310
+ "grad_norm": 1.4330233335494995,
1311
+ "learning_rate": 3.483699472055897e-06,
1312
+ "loss": 1.0693,
1313
+ "mean_token_accuracy": 0.7502612471580505,
1314
+ "step": 163
1315
+ },
1316
+ {
1317
+ "epoch": 0.6919831223628692,
1318
+ "grad_norm": 1.2901924848556519,
1319
+ "learning_rate": 3.424584649143685e-06,
1320
+ "loss": 1.0724,
1321
+ "mean_token_accuracy": 0.7590579986572266,
1322
+ "step": 164
1323
+ },
1324
+ {
1325
+ "epoch": 0.6962025316455697,
1326
+ "grad_norm": 1.3194284439086914,
1327
+ "learning_rate": 3.365921305372519e-06,
1328
+ "loss": 1.0791,
1329
+ "mean_token_accuracy": 0.755479633808136,
1330
+ "step": 165
1331
+ },
1332
+ {
1333
+ "epoch": 0.70042194092827,
1334
+ "grad_norm": 1.3910928964614868,
1335
+ "learning_rate": 3.307722202176417e-06,
1336
+ "loss": 1.0845,
1337
+ "mean_token_accuracy": 0.7579113841056824,
1338
+ "step": 166
1339
+ },
1340
+ {
1341
+ "epoch": 0.7046413502109705,
1342
+ "grad_norm": 1.3976861238479614,
1343
+ "learning_rate": 3.2500000000000015e-06,
1344
+ "loss": 0.9902,
1345
+ "mean_token_accuracy": 0.7839325666427612,
1346
+ "step": 167
1347
+ },
1348
+ {
1349
+ "epoch": 0.7088607594936709,
1350
+ "grad_norm": 1.344634771347046,
1351
+ "learning_rate": 3.19276725554439e-06,
1352
+ "loss": 1.004,
1353
+ "mean_token_accuracy": 0.7765747904777527,
1354
+ "step": 168
1355
+ },
1356
+ {
1357
+ "epoch": 0.7130801687763713,
1358
+ "grad_norm": 1.4013010263442993,
1359
+ "learning_rate": 3.136036419035656e-06,
1360
+ "loss": 0.9893,
1361
+ "mean_token_accuracy": 0.7824804782867432,
1362
+ "step": 169
1363
+ },
1364
+ {
1365
+ "epoch": 0.7172995780590717,
1366
+ "grad_norm": 1.2817473411560059,
1367
+ "learning_rate": 3.0798198315164345e-06,
1368
+ "loss": 1.0701,
1369
+ "mean_token_accuracy": 0.7619868516921997,
1370
+ "step": 170
1371
+ },
1372
+ {
1373
+ "epoch": 0.7215189873417721,
1374
+ "grad_norm": 1.3074806928634644,
1375
+ "learning_rate": 3.024129722161305e-06,
1376
+ "loss": 1.1384,
1377
+ "mean_token_accuracy": 0.7425307631492615,
1378
+ "step": 171
1379
+ },
1380
+ {
1381
+ "epoch": 0.7257383966244726,
1382
+ "grad_norm": 1.3982939720153809,
1383
+ "learning_rate": 2.9689782056164874e-06,
1384
+ "loss": 1.1811,
1385
+ "mean_token_accuracy": 0.7242642045021057,
1386
+ "step": 172
1387
+ },
1388
+ {
1389
+ "epoch": 0.729957805907173,
1390
+ "grad_norm": 1.3266452550888062,
1391
+ "learning_rate": 2.914377279364464e-06,
1392
+ "loss": 1.1075,
1393
+ "mean_token_accuracy": 0.739130437374115,
1394
+ "step": 173
1395
+ },
1396
+ {
1397
+ "epoch": 0.7341772151898734,
1398
+ "grad_norm": 1.354162573814392,
1399
+ "learning_rate": 2.8603388211140886e-06,
1400
+ "loss": 1.0977,
1401
+ "mean_token_accuracy": 0.7508479952812195,
1402
+ "step": 174
1403
+ },
1404
+ {
1405
+ "epoch": 0.7383966244725738,
1406
+ "grad_norm": 1.3136208057403564,
1407
+ "learning_rate": 2.8068745862167423e-06,
1408
+ "loss": 1.0426,
1409
+ "mean_token_accuracy": 0.7577039003372192,
1410
+ "step": 175
1411
+ },
1412
+ {
1413
+ "epoch": 0.7426160337552743,
1414
+ "grad_norm": 1.264014720916748,
1415
+ "learning_rate": 2.7539962051091185e-06,
1416
+ "loss": 1.1094,
1417
+ "mean_token_accuracy": 0.7489492893218994,
1418
+ "step": 176
1419
+ },
1420
+ {
1421
+ "epoch": 0.7468354430379747,
1422
+ "grad_norm": 1.2856388092041016,
1423
+ "learning_rate": 2.70171518078317e-06,
1424
+ "loss": 0.9631,
1425
+ "mean_token_accuracy": 0.7795804738998413,
1426
+ "step": 177
1427
+ },
1428
+ {
1429
+ "epoch": 0.7510548523206751,
1430
+ "grad_norm": 1.2878668308258057,
1431
+ "learning_rate": 2.6500428862837878e-06,
1432
+ "loss": 1.1275,
1433
+ "mean_token_accuracy": 0.7491475343704224,
1434
+ "step": 178
1435
+ },
1436
+ {
1437
+ "epoch": 0.7552742616033755,
1438
+ "grad_norm": 1.2011187076568604,
1439
+ "learning_rate": 2.5989905622347538e-06,
1440
+ "loss": 1.0777,
1441
+ "mean_token_accuracy": 0.7586644291877747,
1442
+ "step": 179
1443
+ },
1444
+ {
1445
+ "epoch": 0.759493670886076,
1446
+ "grad_norm": 1.2956939935684204,
1447
+ "learning_rate": 2.5485693143934704e-06,
1448
+ "loss": 1.078,
1449
+ "mean_token_accuracy": 0.7640592455863953,
1450
+ "step": 180
1451
+ },
1452
+ {
1453
+ "epoch": 0.7637130801687764,
1454
+ "grad_norm": 1.278355598449707,
1455
+ "learning_rate": 2.498790111235072e-06,
1456
+ "loss": 1.0595,
1457
+ "mean_token_accuracy": 0.7603550553321838,
1458
+ "step": 181
1459
+ },
1460
+ {
1461
+ "epoch": 0.7679324894514767,
1462
+ "grad_norm": 1.3742746114730835,
1463
+ "learning_rate": 2.4496637815663697e-06,
1464
+ "loss": 0.9225,
1465
+ "mean_token_accuracy": 0.7922610640525818,
1466
+ "step": 182
1467
+ },
1468
+ {
1469
+ "epoch": 0.7721518987341772,
1470
+ "grad_norm": 1.2835192680358887,
1471
+ "learning_rate": 2.4012010121701853e-06,
1472
+ "loss": 1.2256,
1473
+ "mean_token_accuracy": 0.7330555319786072,
1474
+ "step": 183
1475
+ },
1476
+ {
1477
+ "epoch": 0.7763713080168776,
1478
+ "grad_norm": 1.303166389465332,
1479
+ "learning_rate": 2.353412345480587e-06,
1480
+ "loss": 1.0578,
1481
+ "mean_token_accuracy": 0.7578330039978027,
1482
+ "step": 184
1483
+ },
1484
+ {
1485
+ "epoch": 0.7805907172995781,
1486
+ "grad_norm": 1.3752117156982422,
1487
+ "learning_rate": 2.3063081772895203e-06,
1488
+ "loss": 1.0769,
1489
+ "mean_token_accuracy": 0.761528730392456,
1490
+ "step": 185
1491
+ },
1492
+ {
1493
+ "epoch": 0.7848101265822784,
1494
+ "grad_norm": 1.3514289855957031,
1495
+ "learning_rate": 2.2598987544853428e-06,
1496
+ "loss": 1.1491,
1497
+ "mean_token_accuracy": 0.7445311546325684,
1498
+ "step": 186
1499
+ },
1500
+ {
1501
+ "epoch": 0.7890295358649789,
1502
+ "grad_norm": 1.342996597290039,
1503
+ "learning_rate": 2.2141941728237467e-06,
1504
+ "loss": 1.1003,
1505
+ "mean_token_accuracy": 0.7531036734580994,
1506
+ "step": 187
1507
+ },
1508
+ {
1509
+ "epoch": 0.7932489451476793,
1510
+ "grad_norm": 1.3291714191436768,
1511
+ "learning_rate": 2.1692043747315628e-06,
1512
+ "loss": 1.0323,
1513
+ "mean_token_accuracy": 0.7538182139396667,
1514
+ "step": 188
1515
+ },
1516
+ {
1517
+ "epoch": 0.7974683544303798,
1518
+ "grad_norm": 1.3257864713668823,
1519
+ "learning_rate": 2.1249391471439206e-06,
1520
+ "loss": 0.9746,
1521
+ "mean_token_accuracy": 0.7849647402763367,
1522
+ "step": 189
1523
+ },
1524
+ {
1525
+ "epoch": 0.8016877637130801,
1526
+ "grad_norm": 1.2531793117523193,
1527
+ "learning_rate": 2.081408119375219e-06,
1528
+ "loss": 1.0839,
1529
+ "mean_token_accuracy": 0.7478733062744141,
1530
+ "step": 190
1531
+ },
1532
+ {
1533
+ "epoch": 0.8059071729957806,
1534
+ "grad_norm": 1.3834004402160645,
+ "learning_rate": 2.0386207610244073e-06,
+ "loss": 1.1095,
+ "mean_token_accuracy": 0.7386733889579773,
+ "step": 191
+ },
+ {
+ "epoch": 0.810126582278481,
+ "grad_norm": 1.2729295492172241,
+ "learning_rate": 1.9965863799149988e-06,
+ "loss": 1.0797,
+ "mean_token_accuracy": 0.7641941905021667,
+ "step": 192
+ },
+ {
+ "epoch": 0.8143459915611815,
+ "grad_norm": 1.3292949199676514,
+ "learning_rate": 1.955314120070269e-06,
+ "loss": 1.0022,
+ "mean_token_accuracy": 0.7704975605010986,
+ "step": 193
+ },
+ {
+ "epoch": 0.8185654008438819,
+ "grad_norm": 1.3078043460845947,
+ "learning_rate": 1.9148129597240984e-06,
+ "loss": 1.1246,
+ "mean_token_accuracy": 0.7446621656417847,
+ "step": 194
+ },
+ {
+ "epoch": 0.8227848101265823,
+ "grad_norm": 1.3361434936523438,
+ "learning_rate": 1.8750917093678824e-06,
+ "loss": 1.1086,
+ "mean_token_accuracy": 0.7590915560722351,
+ "step": 195
+ },
+ {
+ "epoch": 0.8270042194092827,
+ "grad_norm": 1.3019007444381714,
+ "learning_rate": 1.8361590098339168e-06,
+ "loss": 1.1935,
+ "mean_token_accuracy": 0.7272245287895203,
+ "step": 196
+ },
+ {
+ "epoch": 0.8312236286919831,
+ "grad_norm": 1.2692325115203857,
+ "learning_rate": 1.7980233304157025e-06,
+ "loss": 1.0738,
+ "mean_token_accuracy": 0.7482107281684875,
+ "step": 197
+ },
+ {
+ "epoch": 0.8354430379746836,
+ "grad_norm": 1.2975091934204102,
+ "learning_rate": 1.760692967025564e-06,
+ "loss": 1.1519,
+ "mean_token_accuracy": 0.7438170313835144,
+ "step": 198
+ },
+ {
+ "epoch": 0.8396624472573839,
+ "grad_norm": 1.3825592994689941,
+ "learning_rate": 1.724176040389982e-06,
+ "loss": 1.0687,
+ "mean_token_accuracy": 0.7617847323417664,
+ "step": 199
+ },
+ {
+ "epoch": 0.8438818565400844,
+ "grad_norm": 1.2977200746536255,
+ "learning_rate": 1.6884804942830373e-06,
+ "loss": 1.0253,
+ "mean_token_accuracy": 0.7679784893989563,
+ "step": 200
+ },
+ {
+ "epoch": 0.8481012658227848,
+ "grad_norm": 1.2826578617095947,
+ "learning_rate": 1.6536140937983469e-06,
+ "loss": 0.9576,
+ "mean_token_accuracy": 0.7874261736869812,
+ "step": 201
+ },
+ {
+ "epoch": 0.8523206751054853,
+ "grad_norm": 1.2712202072143555,
+ "learning_rate": 1.619584423659875e-06,
+ "loss": 1.1153,
+ "mean_token_accuracy": 0.7474706768989563,
+ "step": 202
+ },
+ {
+ "epoch": 0.8565400843881856,
+ "grad_norm": 1.3052817583084106,
+ "learning_rate": 1.5863988865719702e-06,
+ "loss": 1.1858,
+ "mean_token_accuracy": 0.7376619577407837,
+ "step": 203
+ },
+ {
+ "epoch": 0.8607594936708861,
+ "grad_norm": 1.2947852611541748,
+ "learning_rate": 1.5540647016090066e-06,
+ "loss": 1.1035,
+ "mean_token_accuracy": 0.7478904724121094,
+ "step": 204
+ },
+ {
+ "epoch": 0.8649789029535865,
+ "grad_norm": 1.3773938417434692,
+ "learning_rate": 1.5225889026449754e-06,
+ "loss": 1.0303,
+ "mean_token_accuracy": 0.7679511308670044,
+ "step": 205
+ },
+ {
+ "epoch": 0.869198312236287,
+ "grad_norm": 1.3938205242156982,
+ "learning_rate": 1.4919783368233525e-06,
+ "loss": 1.0553,
+ "mean_token_accuracy": 0.7651947140693665,
+ "step": 206
+ },
+ {
+ "epoch": 0.8734177215189873,
+ "grad_norm": 1.3168855905532837,
+ "learning_rate": 1.462239663067596e-06,
+ "loss": 1.1293,
+ "mean_token_accuracy": 0.7346562743186951,
+ "step": 207
+ },
+ {
+ "epoch": 0.8776371308016878,
+ "grad_norm": 1.2759041786193848,
+ "learning_rate": 1.4333793506325832e-06,
+ "loss": 1.1841,
+ "mean_token_accuracy": 0.7245597243309021,
+ "step": 208
+ },
+ {
+ "epoch": 0.8818565400843882,
+ "grad_norm": 1.2906190156936646,
+ "learning_rate": 1.4054036776973123e-06,
+ "loss": 1.1123,
+ "mean_token_accuracy": 0.7488489747047424,
+ "step": 209
+ },
+ {
+ "epoch": 0.8860759493670886,
+ "grad_norm": 1.3179748058319092,
+ "learning_rate": 1.378318729999169e-06,
+ "loss": 1.1058,
+ "mean_token_accuracy": 0.7502433657646179,
+ "step": 210
+ },
+ {
+ "epoch": 0.890295358649789,
+ "grad_norm": 1.3706414699554443,
+ "learning_rate": 1.3521303995100479e-06,
+ "loss": 0.9837,
+ "mean_token_accuracy": 0.7637858986854553,
+ "step": 211
+ },
+ {
+ "epoch": 0.8945147679324894,
+ "grad_norm": 1.302856683731079,
+ "learning_rate": 1.32684438315464e-06,
+ "loss": 1.0673,
+ "mean_token_accuracy": 0.7459544539451599,
+ "step": 212
+ },
+ {
+ "epoch": 0.8987341772151899,
+ "grad_norm": 1.2482903003692627,
+ "learning_rate": 1.3024661815711387e-06,
+ "loss": 1.1662,
+ "mean_token_accuracy": 0.7322877049446106,
+ "step": 213
+ },
+ {
+ "epoch": 0.9029535864978903,
+ "grad_norm": 1.2914732694625854,
+ "learning_rate": 1.2790010979146467e-06,
+ "loss": 0.984,
+ "mean_token_accuracy": 0.7709949016571045,
+ "step": 214
+ },
+ {
+ "epoch": 0.9071729957805907,
+ "grad_norm": 1.3376529216766357,
+ "learning_rate": 1.2564542367035502e-06,
+ "loss": 1.1432,
+ "mean_token_accuracy": 0.7400686144828796,
+ "step": 215
+ },
+ {
+ "epoch": 0.9113924050632911,
+ "grad_norm": 1.3194866180419922,
+ "learning_rate": 1.2348305027090923e-06,
+ "loss": 1.1384,
+ "mean_token_accuracy": 0.7444621324539185,
+ "step": 216
+ },
+ {
+ "epoch": 0.9156118143459916,
+ "grad_norm": 1.2742234468460083,
+ "learning_rate": 1.2141345998884092e-06,
+ "loss": 1.1669,
+ "mean_token_accuracy": 0.7380778789520264,
+ "step": 217
+ },
+ {
+ "epoch": 0.919831223628692,
+ "grad_norm": 1.3064154386520386,
+ "learning_rate": 1.1943710303612459e-06,
+ "loss": 1.0964,
+ "mean_token_accuracy": 0.7555238604545593,
+ "step": 218
+ },
+ {
+ "epoch": 0.9240506329113924,
+ "grad_norm": 1.261291742324829,
+ "learning_rate": 1.175544093430577e-06,
+ "loss": 1.107,
+ "mean_token_accuracy": 0.751292884349823,
+ "step": 219
+ },
+ {
+ "epoch": 0.9282700421940928,
+ "grad_norm": 1.2909098863601685,
+ "learning_rate": 1.1576578846473558e-06,
+ "loss": 1.0531,
+ "mean_token_accuracy": 0.7663896679878235,
+ "step": 220
+ },
+ {
+ "epoch": 0.9324894514767933,
+ "grad_norm": 1.3752104043960571,
+ "learning_rate": 1.1407162949195732e-06,
+ "loss": 1.0887,
+ "mean_token_accuracy": 0.7531322240829468,
+ "step": 221
+ },
+ {
+ "epoch": 0.9367088607594937,
+ "grad_norm": 1.297967791557312,
+ "learning_rate": 1.1247230096658498e-06,
+ "loss": 1.0218,
+ "mean_token_accuracy": 0.7693407535552979,
+ "step": 222
+ },
+ {
+ "epoch": 0.9409282700421941,
+ "grad_norm": 1.298868179321289,
+ "learning_rate": 1.1096815080137196e-06,
+ "loss": 0.9778,
+ "mean_token_accuracy": 0.7718530297279358,
+ "step": 223
+ },
+ {
+ "epoch": 0.9451476793248945,
+ "grad_norm": 1.3202515840530396,
+ "learning_rate": 1.0955950620427883e-06,
+ "loss": 1.0321,
+ "mean_token_accuracy": 0.7685170769691467,
+ "step": 224
+ },
+ {
+ "epoch": 0.9493670886075949,
+ "grad_norm": 1.2280904054641724,
+ "learning_rate": 1.0824667360729408e-06,
+ "loss": 1.2625,
+ "mean_token_accuracy": 0.7183030843734741,
+ "step": 225
+ },
+ {
+ "epoch": 0.9535864978902954,
+ "grad_norm": 1.2865333557128906,
+ "learning_rate": 1.070299385997735e-06,
+ "loss": 1.1937,
+ "mean_token_accuracy": 0.7263870239257812,
+ "step": 226
+ },
+ {
+ "epoch": 0.9578059071729957,
+ "grad_norm": 1.3348139524459839,
+ "learning_rate": 1.0590956586631432e-06,
+ "loss": 1.1109,
+ "mean_token_accuracy": 0.743833601474762,
+ "step": 227
+ },
+ {
+ "epoch": 0.9620253164556962,
+ "grad_norm": 1.3875958919525146,
+ "learning_rate": 1.0488579912917633e-06,
+ "loss": 1.017,
+ "mean_token_accuracy": 0.7680346369743347,
+ "step": 228
+ },
+ {
+ "epoch": 0.9662447257383966,
+ "grad_norm": 1.3476899862289429,
+ "learning_rate": 1.0395886109526346e-06,
+ "loss": 1.0319,
+ "mean_token_accuracy": 0.7578980922698975,
+ "step": 229
+ },
+ {
+ "epoch": 0.9704641350210971,
+ "grad_norm": 1.4462693929672241,
+ "learning_rate": 1.0312895340767674e-06,
+ "loss": 1.1329,
+ "mean_token_accuracy": 0.761381208896637,
+ "step": 230
+ },
+ {
+ "epoch": 0.9746835443037974,
+ "grad_norm": 1.2707955837249756,
+ "learning_rate": 1.0239625660184965e-06,
+ "loss": 1.2035,
+ "mean_token_accuracy": 0.7266802787780762,
+ "step": 231
+ },
+ {
+ "epoch": 0.9789029535864979,
+ "grad_norm": 1.331416368484497,
+ "learning_rate": 1.0176093006627485e-06,
+ "loss": 1.1598,
+ "mean_token_accuracy": 0.7420715689659119,
+ "step": 232
+ },
+ {
+ "epoch": 0.9831223628691983,
+ "grad_norm": 1.3013097047805786,
+ "learning_rate": 1.0122311200783154e-06,
+ "loss": 0.9994,
+ "mean_token_accuracy": 0.7827889323234558,
+ "step": 233
+ },
+ {
+ "epoch": 0.9873417721518988,
+ "grad_norm": 1.3347793817520142,
+ "learning_rate": 1.0078291942172018e-06,
+ "loss": 1.0984,
+ "mean_token_accuracy": 0.7526082992553711,
+ "step": 234
+ },
+ {
+ "epoch": 0.9915611814345991,
+ "grad_norm": 1.4117809534072876,
+ "learning_rate": 1.0044044806601188e-06,
+ "loss": 0.9413,
+ "mean_token_accuracy": 0.7868911623954773,
+ "step": 235
+ },
+ {
+ "epoch": 0.9957805907172996,
+ "grad_norm": 1.3429399728775024,
+ "learning_rate": 1.0019577244081736e-06,
+ "loss": 1.0494,
+ "mean_token_accuracy": 0.7685214877128601,
+ "step": 236
+ },
+ {
+ "epoch": 1.0,
+ "grad_norm": 1.2474669218063354,
+ "learning_rate": 1.0004894577208057e-06,
+ "loss": 1.0154,
+ "mean_token_accuracy": 0.7552247643470764,
+ "step": 237
+ },
  {
  "epoch": 1.0,
+ "step": 237,
+ "total_flos": 1.3663861802532864e+17,
+ "train_loss": 1.2117820017448457,
+ "train_runtime": 1211.3807,
+ "train_samples_per_second": 12.481,
+ "train_steps_per_second": 0.196
  }
  ],
  "logging_steps": 1,
+ "max_steps": 237,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 1,
  "save_steps": 500,

  "attributes": {}
  }
  },
+ "total_flos": 1.3663861802532864e+17,
+ "train_batch_size": 8,
  "trial_name": null,
  "trial_params": null
  }
training.log CHANGED
@@ -188,3 +188,40 @@ weight_decay=0.0,
  )
  (lm_head): Linear(in_features=2048, out_features=50304, bias=False)
  )
+ 2025-09-23 18:18:04 - INFO - __main__ - *** Save model ***
+ 2025-09-23 18:18:04 - INFO - __main__ - πŸ’Ύ Saving MoE bias states...
+ 2025-09-23 18:18:04 - INFO - __main__ - πŸ” Searching for MoE layers with bias states...
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.0.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.1.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.2.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.3.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.4.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.5.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.6.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.7.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.8.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.9.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.10.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.11.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.12.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.13.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.14.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - βœ… Saved bias from model.layers.15.mlp: 64 experts, update_speed=0.000100
+ 2025-09-23 18:18:04 - INFO - __main__ - πŸŽ‰ Successfully saved 16 MoE bias states to /tmp/data/OLMoE-1B-7B-0125/aux_free_sft/commonsense15k/moe_bias_states.json
+ 2025-09-23 18:18:04 - INFO - __main__ - πŸ“Š Bias States Summary:
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.0.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.1.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.2.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.3.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.4.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.5.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.6.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.7.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.8.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.9.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.10.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.11.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.12.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.13.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.14.mlp: 64 experts, range=[0.0000, 0.0000]
+ 2025-09-23 18:18:04 - INFO - __main__ - model.layers.15.mlp: 64 experts, range=[0.0000, 0.0000]
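The per-step records written to `trainer_state.json` above can be summarized with a short script. A minimal sketch, assuming the record schema shown in the diff (`step`, `loss`, `mean_token_accuracy` keys); the inlined records are sample data copied from the final steps above, and `best_step` is a hypothetical helper, not part of the training code:

```python
# Summarize "log_history" records from a trainer_state.json like the one
# in this commit. A few records are inlined as sample data; the real file
# has one record per optimizer step.
records = [
    {"step": 233, "loss": 0.9994, "mean_token_accuracy": 0.7827889323234558},
    {"step": 234, "loss": 1.0984, "mean_token_accuracy": 0.7526082992553711},
    {"step": 235, "loss": 0.9413, "mean_token_accuracy": 0.7868911623954773},
]

def best_step(history):
    """Return the logged record with the lowest training loss."""
    return min(history, key=lambda r: r["loss"])

best = best_step(records)
print(best["step"], best["loss"])  # prints: 235 0.9413
```

The same pattern works on the full file by loading it with `json.load` and passing `state["log_history"]` to `best_step`.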