Eval numbers for Llama 3.2 1B in Table 1 don't match Meta's results
The eval numbers in Table 1 of the paper https://arxiv.org/pdf/2504.12285 for Llama 3.2 1B don't match Meta's published results at https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. For example, Table 1 quotes 37.8 for ARC Challenge, while Meta reports 59.4. There are discrepancies across all tasks.
Thank you for noting the difference in Llama 3.2 1B evaluation scores. Evaluation results for LLMs can indeed vary significantly based on the specific framework, prompts, few-shot settings, and dataset versions used.
In our study, the priority was a consistent comparison across all models evaluated. To achieve this, we used the widely adopted lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) with uniform settings for all models. The scores in Table 1 reflect performance under this specific, unified evaluation setup.
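For reference, a run of this kind can be reproduced with the harness roughly as in the sketch below. This is a minimal illustration only: the task names, few-shot count, dtype, and batch size shown are assumed values for demonstration, not the exact configuration used in our paper.

```python
# Minimal sketch of a unified lm-evaluation-harness run (v0.4.x API).
# Task list, num_fewshot, dtype, and batch_size are illustrative assumptions,
# not the exact settings used for Table 1.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,   # assumed zero-shot for this sketch
    batch_size=8,
)

# Print the per-task metrics produced by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Even within the harness, choices such as the reported metric (e.g., acc vs. acc_norm), the number of few-shot examples, and the prompt template can shift scores by several points, which is why cross-framework comparisons of absolute numbers are generally unreliable.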
Therefore, while our results enable fair relative comparisons within our paper, they may understandably diverge from Meta's published figures, which are likely based on different internal evaluation protocols, harness configurations, or prompt formats.