Adding evaluation results
README.md
CHANGED
@@ -123,31 +123,41 @@ All models were evaluated using our proprietary evaluation pipeline and [LM Eval
(Removed in this change: the previous Arabic benchmarks table and the old English benchmarks table header; only model-name fragments such as jais-family-, jais-adapted-, Qwen2.5-, and Mistral- survive in this rendering. The replacement tables follow.)

The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
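As a quick way to inspect them, the sketch below pulls those JSON files locally with `huggingface_hub`; the `evaluation/` folder comes from the link above, but the individual file names inside it are not listed here, so the glob pattern is an assumption.

```python
# Minimal sketch: fetch ALLaM's evaluation JSONs and list their top-level keys.
# Assumption: the scores live under evaluation/ in the model repo (per the link above);
# the exact file names inside that folder are not specified in this card.
import json
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ALLaM-AI/ALLaM-7B-Instruct-preview",
    allow_patterns=["evaluation/*"],  # only download the evaluation artifacts
)

for path in sorted(Path(local_dir, "evaluation").rglob("*.json")):
    with open(path) as f:
        results = json.load(f)
    print(path.name, "->", list(results)[:5])  # peek at the top-level keys
```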
| Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | ArabicMMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
|:----------------------------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
| ALLaM-7B-Instruct-preview | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
| AceGPT-v2-8B-Chat | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
| jais-family-6p7b-chat | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
| jais-family-13b-chat | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
| jais-family-30b-8k-chat | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
| jais-family-30b-16k-chat | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
| jais-adapted-7b-chat | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
| jais-adapted-13b-chat | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
| jais-adapted-70b-chat | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
| Qwen2.5-7B-Instruct | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
| Qwen2.5-14B-Instruct | 78.33 | 84.93 | 81.92 | 71.81 | 91.9 | 56.9 | 82.87 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
| Qwen2.5-72B-Instruct | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
| Mistral-7B-Instruct-v0.3 | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
| Mistral-Small-Instruct-2409 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
| Mistral-Nemo-Instruct-2407 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
| falcon-mamba-7b-instruct | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 28.49 | 63.52 | 39.27 | 28.45 | 29.69 |
| Llama-3.1-8B-Instruct | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 54.0 | 70.54 | 56.53 | 44.67 | 30.76 |
| Llama-3.3-70B-Instruct | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |

Closed-model evaluations:

| Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | ArabicMMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
|:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
| GPT4o (API Generation) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
| Claude Sonnet 3.5 (API Generation) oct | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
| gemini pro 1.5 | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |
#### English Benchmarks
| Model | AGIEval 0 Shot | ARC (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks <br>Ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro <br>5 Shot | Minerva Math <br>4 Shot | MMLU 0 Shot | TruthfulQA <br>(mc2) 0 Shot | IFEval <br>(prompt strict) <br>0 Shot | IFEval <br>(inst strict) <br>0 Shot | GSM8K 5 Shot |
|:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
| Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
| Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |
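The hunk context above mentions that models were scored with a proprietary pipeline plus LM Evaluation Harness, but the exact tasks, few-shot settings, and harness version are not stated in this diff. The sketch below is an illustration only of how one of the standard English columns (GSM8K, 5-shot) could be re-run with lm-eval's Python API; the `simple_evaluate` entry point, dtype, and batch size are assumptions, not the card's documented setup.

```python
# Illustrative only: re-run a single English benchmark (GSM8K, 5-shot) with lm-evaluation-harness.
# Assumptions: lm-eval >= 0.4 (simple_evaluate API); the card does not state which tasks,
# few-shot settings, or harness version produced the table above.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)

print(json.dumps(results["results"]["gsm8k"], indent=2))
```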