update readme

README.md
- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains, published by MBZUAI.
- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions translated from the original MMLU benchmark, published by OpenAI.

**Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples covering several domains, e.g., Islamic studies, science, humanities, and physics.

**Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 true/false questions from 58 different areas.
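
For readers who want to inspect these benchmarks locally, the public ones can be pulled with the Hugging Face `datasets` library. A minimal sketch, assuming the default configuration (some of these repos expose several configurations, so check each dataset card):

```python
# Minimal sketch: pulling one of the Arabic benchmarks listed above with the
# Hugging Face `datasets` library. If a repo exposes multiple configurations,
# load_dataset() will ask for a config name; see the dataset card.
from datasets import load_dataset

acva = load_dataset("FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment")
print(acva)  # available splits and row counts

first_split = next(iter(acva))
print(acva[first_split][0])  # first true/false question in that split
```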
All models were evaluated using our proprietary evaluation pipeline together with the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework to ensure fair comparisons. For API-based models, we used exact-match evaluation of the generated outputs.
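
As a rough illustration of the harness side of that setup (this is not our internal pipeline), a zero-shot run over a public Arabic task might look like the sketch below; the task name and model arguments are assumptions and depend on the installed harness version.

```python
# Rough sketch of an lm-evaluation-harness run (pip install lm-eval). This is
# not the proprietary pipeline; the task name and model args are assumptions
# that depend on the harness version -- see `lm-eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # local Hugging Face backend
    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,dtype=bfloat16",
    tasks=["arabicmmlu"],  # assumed task name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics
```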
The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
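
A minimal sketch for fetching one of those score files programmatically; the exact filename under `evaluation/` is hypothetical, so list the folder on the Hub for the real names.

```python
# Minimal sketch: downloading a published score file from the model repo with
# huggingface_hub. The filename below is hypothetical -- browse the
# evaluation/ folder on the Hub for the actual file names.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ALLaM-AI/ALLaM-7B-Instruct-preview",
    filename="evaluation/arabicmmlu.json",  # hypothetical file name
)
with open(path, encoding="utf-8") as f:
    scores = json.load(f)
print(scores)
```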
| model | EXAMS (ar) 5 Shot | ACVA 5 Shot | ETECH 0 Shot | MOE-IEN-MCQ 0 Shot | MOE-IEN-TF 0 Shot | SDAIA MCQs 0 Shot | ArabicMMLU 0 Shot | AraMath 5 Shot | OpenAI MMLU-ar 0 Shot | GAT 0 Shot |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-72B-Instruct | 60.71 | 79.92 | 79.92 | 89.25 | 87.23 | 79.30 | 74.10 | 92.17 | 73.59 | 59.54 |
| Llama-3.1-70B-Instruct | 60.34 | 77.07 | 72.30 | 85.02 | 70.21 | 76.79 | 71.46 | 85.17 | 69.88 | 42.36 |
#### English Benchmarks

| model | AGIEval 0 Shot | ARC-Challenge 0 Shot | GPQA (main) 0 Shot | Hendrycks Ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro 5 Shot | Minerva Math 4 Shot | MMLU 0 Shot | TruthfulQA-mc2 0 Shot | IFEval (prompt_level strict) 0 Shot | IFEval (inst_level strict) 0 Shot | GSM8K 5 Shot |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.90 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
| Llama-3.1-70B-Instruct | 52.60 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59.00 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.40 |