update readme

README.md
- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains, published by MBZUAI.
- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions translated from the original MMLU benchmark, published by OpenAI.

**Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples covering several domains, e.g., Islamic studies, science, humanities, and physics.

**Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 true/false questions from 58 different areas.
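
For readers who want to inspect these benchmarks locally, the public ones can be pulled with the Hugging Face `datasets` library. A minimal sketch, assuming the default configuration (some of these repos expose several configurations, so check each dataset card):

```python
# Minimal sketch: pulling one of the Arabic benchmarks listed above with the
# Hugging Face `datasets` library. If a repo exposes multiple configurations,
# load_dataset() will ask for a config name; see the dataset card.
from datasets import load_dataset

acva = load_dataset("FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment")
print(acva)  # available splits and row counts

first_split = next(iter(acva))
print(acva[first_split][0])  # first true/false question in that split
```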
All models were evaluated using our proprietary evaluation pipeline together with the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework to ensure fair comparisons. For API-based models, we used exact-match evaluation of the generated outputs.
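
As a rough illustration of the harness side of that setup (this is not our internal pipeline), a zero-shot run over a public Arabic task might look like the sketch below; the task name and model arguments are assumptions and depend on the installed harness version.

```python
# Rough sketch of an lm-evaluation-harness run (pip install lm-eval). This is
# not the proprietary pipeline; the task name and model args are assumptions
# that depend on the harness version -- see `lm-eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # local Hugging Face backend
    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,dtype=bfloat16",
    tasks=["arabicmmlu"],  # assumed task name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics
```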
The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
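
A minimal sketch for fetching one of those score files programmatically; the exact filename under `evaluation/` is hypothetical, so list the folder on the Hub for the real names.

```python
# Minimal sketch: downloading a published score file from the model repo with
# huggingface_hub. The filename below is hypothetical -- browse the
# evaluation/ folder on the Hub for the actual file names.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ALLaM-AI/ALLaM-7B-Instruct-preview",
    filename="evaluation/arabicmmlu.json",  # hypothetical file name
)
with open(path, encoding="utf-8") as f:
    scores = json.load(f)
print(scores)
```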
| model | EXAMS (ar) 5 Shot | ACVA 5 Shot | ETECH 0 Shot | MOE-IEN-MCQ 0 Shot | MOE-IEN-TF 0 Shot | SDAIA MCQs 0 Shot | ArabicMMLU 0 Shot | AraMath 5 Shot | OpenAI MMLU-ar 0 Shot | GAT 0 Shot |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-72B-Instruct | 60.71 | 79.92 | 79.92 | 89.25 | 87.23 | 79.30 | 74.10 | 92.17 | 73.59 | 59.54 |
| Llama-3.1-70B-Instruct | 60.34 | 77.07 | 72.30 | 85.02 | 70.21 | 76.79 | 71.46 | 85.17 | 69.88 | 42.36 |
#### English Benchmarks

| model | AGIEval 0 Shot | ARC-Challenge 0 Shot | GPQA (main) 0 Shot | Hendrycks Ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro 5 Shot | Minerva Math 4 Shot | MMLU 0 Shot | TruthfulQA-mc2 0 Shot | IFEval (prompt_level strict) 0 Shot | IFEval (inst_level strict) 0 Shot | GSM8K 5 Shot |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.90 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
| Llama-3.1-70B-Instruct | 52.60 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59.00 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.40 |