Eval numbers for Llama 3.2 1B in Table 1 don't match Meta's results
The eval numbers in Table 1 of the paper https://arxiv.org/pdf/2504.12285 for Llama 3.2 1B don't match Meta's published results at https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. For example, Table 1 quotes 37.8 for ARC Challenge, while Meta reports 59.4. There are discrepancies across all tasks.
Thank you for noting the difference in Llama 3.2 1B evaluation scores. Evaluation results for LLMs can indeed vary significantly based on the specific framework, prompts, few-shot settings, and dataset versions used.
In our study, the priority was a consistent comparison across all models evaluated. To achieve this, we used the widely adopted lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) with uniform settings for all models. The scores in Table 1 reflect performance under this specific, unified evaluation setup.
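For reference, a run of this kind can be reproduced with the harness roughly as in the sketch below. This is a minimal illustration only: the task names, few-shot count, dtype, and batch size shown are assumed values for demonstration, not the exact configuration used in our paper.

```python
# Minimal sketch of a unified lm-evaluation-harness run (v0.4.x API).
# Task list, num_fewshot, dtype, and batch_size are illustrative assumptions,
# not the exact settings used for Table 1.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,   # assumed zero-shot for this sketch
    batch_size=8,
)

# Print the per-task metrics produced by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Even within the harness, choices such as the reported metric (e.g., acc vs. acc_norm), the number of few-shot examples, and the prompt template can shift scores by several points, which is why cross-framework comparisons of absolute numbers are generally unreliable.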
Therefore, while our results enable fair relative comparisons within our paper, they may understandably diverge from Meta's published figures, which are likely based on different internal evaluation protocols, harness configurations, or prompt formats.