sbmaruf committed
Commit f6049df · 1 Parent(s): cd1d2eb

update readme

Files changed (1)
  1. README.md +6 -3
README.md CHANGED
@@ -106,7 +106,7 @@ ALLaM is a generative model that comes with inherent uncertainties. Trials canno
  - [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains published by MBZUAI.
  - [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions, translated from the original MMLU benchmark published by OpenAI.

- **Exams Arabic** ([Exams Ar](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple choice question dataset with 537 samples, covering several domains e.g., Islamic studies, science, humanities, and physics.
+ **Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples covering several domains, e.g., Islamic studies, science, humanities, and physics.

  **Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 True and False questions from 58 different areas.

@@ -120,7 +120,10 @@ ALLaM is a generative model that comes with inherent uncertainties. Trials canno


  All models were evaluated using our proprietary evaluation pipeline and [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact match evaluations of the generated outputs.
- | model | Exams_ar 5 Shot | Acva 5 Shot | Etec 0 Shot | Moe_ien_mcq 0 Shot | Moe_ien_tf 0 Shot | Sdaia_mcq 0 Shot | Arabicmmlu 0 Shot | Aramath_v2 5 Shot | Openaimmlu 0 Shot | Gat_overall_average 0 Shot |
+
+ The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
+
+ | model | EXAMS (ar) 5 Shot | ACVA 5 Shot | ETEC 0 Shot | MOE-IEN-MCQ 0 Shot | MOE-IEN-TF 0 Shot | SDAIA MCQs 0 Shot | ArabicMMLU 0 Shot | AraMath 5 Shot | OpenAI MMLU-ar 0 Shot | GAT 0 Shot |
  |:----------------------------------|------------------:|--------------:|--------------:|---------------------:|--------------------:|-------------------:|--------------------:|--------------------:|--------------------:|-----------------------------:|
  | Qwen2.5-72B-Instruct | 60.71 | 79.92 | 79.92 | 89.25 | 87.23 | 79.3 | 74.1 | 92.17 | 73.59 | 59.54 |
  | Llama-3.1-70B-Instruct | 60.34 | 77.07 | 72.3 | 85.02 | 70.21 | 76.79 | 71.46 | 85.17 | 69.88 | 42.36 |
@@ -144,7 +147,7 @@ All models were evaluated using our proprietary evaluation pipeline and [LM Eval
  #### English Benchmarks


- | model | Agieval 0 Shot | Arc_challenge 0 Shot | Gpqa_main_n_shot 0 Shot | Hendrycks_ethics 0 Shot | Winogrande 0 Shot | Hellaswag 0 Shot | Triviaqa 5 Shot | Mmlu_pro 5 Shot | Minerva_math 4 Shot | Mmlu 0 Shot | Truthfulqa_mc2 0 Shot | Ifeval_prompt_level strict Shot | Ifeval_inst_level strict Shot | Gsm8k 5 Shot |
+ | model | AGIEval 0 Shot | ARC (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks Ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro 5 Shot | Minerva Math 4 Shot | MMLU 0 Shot | TruthfulQA-mc2 0 Shot | IFEval (prompt_level strict) 0 Shot | IFEval (inst_level strict) 0 Shot | GSM8K 5 Shot |
  |:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
  | Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
  | Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |
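The updated text above states that comparisons were run with the LM Evaluation Harness (plus a proprietary pipeline for the closed benchmarks). As a minimal, hedged sketch of how a comparable open-benchmark run could be launched with the harness's public `simple_evaluate` API: the task identifiers below are assumptions, and the in-house sets in the table (ETEC, MOE-IEN, SDAIA MCQs, AraMath) are not public harness tasks, so this is illustrative only, not the authors' pipeline.

```python
# Illustrative sketch only: an lm-evaluation-harness run in the style of the
# open benchmarks above. Task names are assumptions, not the authors' config.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,dtype=bfloat16",
    tasks=["arabicmmlu", "acva"],  # assumed harness task names
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task name to its metric dict (accuracy, stderr, ...)
for task, metrics in results["results"].items():
    print(task, metrics)
```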
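The newly added line also points to per-benchmark scores in JSON format under `evaluation/` in the model repo. A minimal sketch for pulling them with `huggingface_hub`; no specific file names are assumed, whatever is present is listed and downloaded:

```python
# Minimal sketch: download and inspect the published evaluation JSON files.
# No specific file names are assumed; we list whatever exists under evaluation/.
import json
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "ALLaM-AI/ALLaM-7B-Instruct-preview"
eval_files = [
    f for f in list_repo_files(repo_id)
    if f.startswith("evaluation/") and f.endswith(".json")
]

for name in eval_files:
    path = hf_hub_download(repo_id=repo_id, filename=name)
    with open(path) as fh:
        scores = json.load(fh)
    print(name, list(scores)[:5])  # peek at the top-level keys
```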