naazahrani committed on
Commit 54ea89a · verified · 1 Parent(s): 0e95d3d

Adding evaluation results

Files changed (1)
  1. README.md +31 -21
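The README text in this diff points to the per-benchmark scores stored as JSON under the repo's `evaluation/` folder. As a minimal sketch (assuming the `huggingface_hub` package is installed and that the folder holds plain JSON files; file names are discovered at runtime rather than assumed), one way to pull a score file locally:

```python
# Sketch: list and load one of the evaluation JSON files referenced in the README.
# Assumes huggingface_hub is installed; file names under evaluation/ are discovered
# at runtime rather than hard-coded.
import json
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "ALLaM-AI/ALLaM-7B-Instruct-preview"

# Enumerate JSON files stored under evaluation/ in the model repo.
eval_files = [
    f for f in list_repo_files(repo_id)
    if f.startswith("evaluation/") and f.endswith(".json")
]

# Download the first file to the local cache and parse it.
local_path = hf_hub_download(repo_id=repo_id, filename=eval_files[0])
with open(local_path, encoding="utf-8") as fh:
    scores = json.load(fh)

print(eval_files[0], "->", list(scores))
```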
README.md CHANGED
@@ -123,31 +123,41 @@ All models were evaluated using our proprietary evaluation pipeline and [LM Eval
123
 
124
  The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
125
 
126
- | model | EXAMS (ar) 5 Shot | ACVA 5 Shot | ETECH 0 Shot | MOE-IEN-MCQ 0 Shot | MOE-IEN-TF 0 Shot | SDAIA MCQs 0 Shot | ArabicMMLU 0 Shot | AraMath 5 Shot | OpenAI MMLU-ar 0 Shot | GAT 0 Shot |
127
- |:----------------------------------|------------------:|--------------:|--------------:|---------------------:|--------------------:|-------------------:|--------------------:|--------------------:|--------------------:|-----------------------------:|
128
- | Qwen2.5-72B-Instruct | 60.71 | 79.92 | 79.92 | 89.25 | 87.23 | 79.3 | 74.1 | 92.17 | 73.59 | 59.54 |
129
- | Llama-3.1-70B-Instruct | 60.34 | 77.07 | 72.3 | 85.02 | 70.21 | 76.79 | 71.46 | 85.17 | 69.88 | 42.36 |
130
- | jais-adapted-70b-chat | 54.75 | 73.33 | 59.57 | 76 | 56.97 | 69.39 | 65.74 | 52.17 | 56.82 | 39.15 |
131
- | jais-family-30b-8k-chat | 50.28 | 74.47 | 55.71 | 73.02 | 72.14 | 65.31 | 63.11 | 50.33 | 50.9 | 36.44 |
132
- | jais-family-30b-16k-chat | 49.72 | 60.08 | 27.64 | 40.56 | 60.03 | 26.98 | 62.04 | 46.5 | 50.98 | 34.85 |
133
- | AceGPT-v2-8B-Chat | 51.96 | 72.69 | 56.71 | 77.02 | 75.85 | 68.44 | 57.02 | 40 | 49.99 | 36.15 |
134
- | jais-family-6p7b-chat | 46.93 | 73.8 | 48.31 | 61.55 | 68.14 | 60.66 | 56.15 | 30.83 | 44.96 | 31.71 |
135
- | jais-adapted-7b-chat | 40.6 | 70.44 | 40.96 | 58.69 | 68.63 | 51.38 | 49.75 | 24.17 | 38.54 | 29.68 |
136
- | jais-adapted-13b-chat | 48.23 | 67.78 | 47.89 | 70.75 | 41.54 | 61.92 | 56.42 | 41.17 | 46.83 | 33.4 |
137
- | Qwen2.5-14B-Instruct | 57.54 | 75.04 | 73.41 | 83.76 | 71.04 | 73.59 | 69.36 | 91.17 | 63.8 | 51.7 |
138
- | Mistral-7B-Instruct-v0.3 | 33.71 | 61.21 | 33.83 | 53.9 | 65.38 | 44.1 | 45.27 | 24.33 | 32.32 | 26.65 |
139
- | falcon-mamba-7b-instruct | 28.49 | 63.52 | 34.62 | 47.22 | 71.34 | 39.9 | 39.27 | 31.67 | 28.45 | 29.69 |
140
- | Mistral-Nemo-Instruct-2407 | 47.49 | 76.92 | 51.43 | 70.23 | 71.73 | 61.1 | 55.97 | 43.33 | 46.15 | 25.44 |
141
- | Qwen2.5-7B-Instruct | 50.65 | 78.17 | 64.11 | 78.31 | 75.17 | 68.7 | 61.54 | 60.5 | 56.1 | 41.42 |
142
- | Llama-3.1-8B-Instruct | 54 | 70.54 | 51.9 | 70.01 | 76.99 | 62.42 | 56.53 | 42.83 | 44.67 | 30.76 |
143
- | jais-family-13b-chat | 45.07 | 71.18 | 46.83 | 60.92 | 50.87 | 54.83 | 58.14 | 41.67 | 47.73 | 31.72 |
144
- | Mistral-Small-Instruct-2409 | 38.73 | 68.93 | 44.03 | 62.16 | 75.87 | 52.51 | 50.43 | 46.33 | 39.63 | 28.82 |
145
- | ALLaM-7B-Instruct-preview | 51.58 | 76.33 | 66.81 | 91.54 | 85.57 | 73.9 | 67.78 | 65.5 | 55.91 | 44.53 |
146
 
147
  #### English Benchmarks
148
 
149
 
150
- | model | AGIEval 0 Shot | Arc-(challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQa 5 Shot | MMLU Pro 5 Shot | Minerva Math 4 Shot | MMLU 0 Shot | TruthfulQA-mc2 0 Shot | IFEval (prompt_level strict) 0 Shot | IFEval (inst_level strict`) 0 Shot | GSM8k 5 Shot |
151
  |:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
152
  | Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
153
  | Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |
 
123
 
124
  The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
125
 
126
+
127
+
128
+ | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | ArabicMMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
129
+ |:----------------------------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
130
+ | ALLaM-7B-Instruct-preview | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
131
+ | AceGPT-v2-8B-Chat | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
132
+ | jais-family-6p7b-chat | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
133
+ | jais-family-13b-chat | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
134
+ | jais-family-30b-8k-chat | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
135
+ | jais-family-30b-16k-chat | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
136
+ | jais-adapted-7b-chat | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
137
+ | jais-adapted-13b-chat | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
138
+ | jais-adapted-70b-chat | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
139
+ | Qwen2.5-7B-Instruct | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
140
+ | Qwen2.5-14B-Instruct | 78.33 | 84.93 | 81.92 | 71.81 | 91.9 | 56.9 | 82.87 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
141
+ | Qwen2.5-72B-Instruct | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
142
+ | Mistral-7B-Instruct-v0.3 | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
143
+ | Mistral-Small-Instruct-2409 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
144
+ | Mistral-Nemo-Instruct-2407 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
145
+ | falcon-mamba-7b-instruct | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 28.49 | 63.52 | 39.27 | 28.45 | 29.69 |
146
+ | Llama-3.1-8B-Instruct | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 54.0 | 70.54 | 56.53 | 44.67 | 30.76 |
147
+ | Llama-3.3-70B-Instruct | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |
148
+
149
+ Closed-model evaluations:
150
+
151
+ | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | ArabicMMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
152
+ |:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
153
+ | GPT-4o (API Generation) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
154
+ | Claude 3.5 Sonnet (API Generation, Oct) | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
155
+ | Gemini 1.5 Pro | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |
156
 
157
  #### English Benchmarks
158
 
159
 
160
+ | Model | AGIEval 0 Shot | ARC (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks <br>ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro <br>5 Shot | Minerva Math <br>4 Shot | MMLU 0 Shot | TruthfulQA <br>(mc2) 0 Shot | IFEval <br>(prompt strict) <br>0 Shot | IFEval <br>(inst strict) <br>0 Shot | GSM8k 5 Shot |
161
  |:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
162
  | Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
163
  | Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |