Update README.md
README.md
CHANGED
@@ -263,6 +263,8 @@ Data Labeling for Evaluation Datasets:
 ## Evaluation Results

 We evaluate the model using temperature=`0.6`, top_p=`0.95`, and a 64k sequence length. We run each benchmark up to 16 times and average the scores for a more accurate estimate.

+All evaluations were done using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills). We published a [tutorial](https://nvidia.github.io/NeMo-Skills/tutorials/2025/08/15/reproducing-llama-nemotron-super-49b-v15-evals/) with all details necessary to reproduce our evaluation results.
+
 ### MATH500

 | Reasoning Mode | pass@1 (avg. over 4 runs) |
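The averaging procedure described in the README section above (score each run independently, then average over repeated runs to reduce sampling variance) can be sketched as follows. This is an illustrative sketch, not code from NeMo-Skills; the function names and the example data are hypothetical, and only the sampling settings (`temperature=0.6`, `top_p=0.95`, 64k sequence length) come from the README.

```python
# Hypothetical sketch of the reported metric: pass@1 is computed per run,
# then averaged over several independent runs of the same benchmark.
# Names and data below are illustrative, not taken from the repository.

def pass_at_1(correct_flags):
    """Fraction of problems answered correctly in a single run."""
    return sum(correct_flags) / len(correct_flags)

def averaged_pass_at_1(runs):
    """Average pass@1 over repeated independent runs (reduces sampling noise)."""
    return sum(pass_at_1(run) for run in runs) / len(runs)

# Sampling settings stated in the README (used when generating answers):
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "max_seq_len": 64 * 1024}

# Four illustrative runs of a 5-problem benchmark (True = correct answer).
runs = [
    [True, True, False, True, True],
    [True, False, True, True, True],
    [True, True, True, False, True],
    [True, True, False, True, True],
]

print(averaged_pass_at_1(runs))  # 0.8
```

Averaging over repeated runs matters because a single sampled run at temperature 0.6 can swing by several points on small benchmarks such as MATH500.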