Commit fb4e4cc (verified) by naazahrani · 1 Parent(s): 650b81d

Update evaluation results

Files changed (1):
  1. README.md (+40 -34)
README.md CHANGED
@@ -110,14 +110,17 @@ ALLaM is a generative model that comes with inherent uncertainties. Trials canno

 **Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 True and False questions from 58 different areas.

- **Education and Training Evaluation Commission** (ETEC): This dataset consists of Arabic-language multiple-choice questions, compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans various educational levels, from elementary through post-college, with a total of 1,188 test samples.

- **IEN**: This dataset was curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar), organized by grade, topic, and difficulty level. It comprehensively covers the entire Saudi curriculum from 1st grade through high school, with 10,435 multiple-choice questions and 10,910 true/false questions.

 **GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions, representing various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal). The sections include algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.

- **SDAIA MCQs**: A curated collection of 1,594 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans various subjects, including mathematics, science, and other relevant fields, providing a diverse set of questions for evaluation purposes.

 All models were evaluated using our proprietary evaluation pipeline and [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact match evaluations of the generated outputs.

@@ -125,58 +128,61 @@ The evaluation scores of ALLaM can be found in JSON format [here](https://huggin

- | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br> 5 shot | Arabicmmlu <br>0 Shot | Openai mmlu <br>0 shot | GAT 0 shot |
 |:----------------------------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
 | ALLaM-7B-Instruct-preview | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
- | AceGPT-v2-8B-Chat | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
- | jais-family-6p7b-chat | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
- | jais-family-13b-chat | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
- | jais-family-30b-8k-chat | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
- | jais-family-30b-16k-chat | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
- | jais-adapted-7b-chat | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
- | jais-adapted-13b-chat | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
- | jais-adapted-70b-chat | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
- | Qwen2.5-7B-Instruct | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
- | Qwen2.5-14B-Instruct | 78.33 | 84.93 | 81.92 | 71.81 | 91.9 | 56.9 | 82.87 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
- | Qwen2.5-72B-Instruct | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
- | Mistral-7B-Instruct-v0.3 | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
- | Mistral-Small-Instruct-2409 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
- | Mistral-Nemo-Instruct-2407 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
- | falcon-mamba-7b-instruct | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 28.49 | 63.52 | 39.27 | 28.45 | 29.69 |
- | Llama-3.1-8B-Instruct | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 54.0 | 70.54 | 56.53 | 44.67 | 30.76 |
- | Llama-3.3-70B-Instruct | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |

 Closed models evaluations:

 | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | ARIFEval <br>(prompt strict) <br>0 shot | ARIFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br> 5 shot | Arabicmmlu <br>0 Shot | Openai mmlu <br>0 shot | GAT 0 shot |
 |:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
- | GPT4o (API Generation) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
- | Claude Sonnet 3.5 (API Generation) oct | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
- | gemini pro 1.5 | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |

 #### English Benchmarks

 | model | AGIEval 0 Shot | Arc (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks <br>ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQa 5 Shot | MMLU Pro<br>5 Shot | Minerva Math <br>4 Shot | MMLU 0 Shot | TruthfulQA <br>(mc2) 0 Shot | IFEval <br>(prompt strict)<br>0 Shot | IFEval <br>(inst strict)<br>0 Shot | GSM8k 5 Shot |
 |:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
- | Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
- | Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |
- | jais-adapted-70b-chat | 39.96 | 59.56 | 20.98 | 70.77 | 77.27 | 84.06 | 68.6 | 37.33 | 27.72 | 65.23 | 44.49 | 31.98 | 44.36 | 76.8 |
- | jais-family-30b-8k-chat | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.46 | 57.46 | 49.49 | 23.84 | 37.41 | 72.18 |
- | jais-family-30b-16k-chat | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 30.22 | 68.01 |
 | AceGPT-v2-8B-Chat | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
 | jais-family-6p7b-chat | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
 | jais-adapted-7b-chat | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22 | 35.73 | 58.07 |
 | jais-adapted-13b-chat | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
 | Qwen2.5-14B-Instruct | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 49.6 | 22.6 | 78.93 | 69.01 | 51.57 | 64.27 | 79.38 |
 | Mistral-7B-Instruct-v0.3 | 36.54 | 58.87 | 23.21 | 72.53 | 73.95 | 82.93 | 67.97 | 33.61 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
- | falcon-mamba-7b-instruct | 35.57 | 57.68 | 27.46 | 67.68 | 73.4 | 78.21 | 33.1 | 24.79 | 16.82 | 60.51 | 53.19 | 28.1 | 38.61 | 56.33 |
 | Mistral-Nemo-Instruct-2407 | 39.88 | 59.13 | 24.33 | 67.82 | 74.74 | 82.35 | 72.93 | 44.27 | 30.22 | 65.56 | 54.88 | 30.13 | 38.97 | 74.15 |
- | Qwen2.5-7B-Instruct | 59.1 | 51.28 | 26.56 | 72.78 | 69.38 | 79.55 | 50.59 | 36.1 | 12.04 | 70.56 | 58.93 | 58.04 | 68.35 | 43.29 |
- | Llama-3.1-8B-Instruct | 42.38 | 55.12 | 27.01 | 66.69 | 73.88 | 79.28 | 70.08 | 41.16 | 34.14 | 67.97 | 54.05 | 42.7 | 57.55 | 75.82 |
- | jais-family-13b-chat | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75 | 35.82 | 24.4 | 18.92 | 51.91 | 40.57 | 20.52 | 31.89 | 64.59 |
 | Mistral-Small-Instruct-2409 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
- | ALLaM-7B-Instruct-preview | 41.75 | 51.28 | 22.1 | 73.17 | 70.48 | 76.26 | 15.96 | 30.43 | 17.42 | 59.6 | 46.67 | 37.71 | 48.68 | 62.02 |

  ### MT-bench
 
 **Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 True and False questions from 58 different areas.

+ **Education and Training Evaluation Commission** (ETEC): This dataset consists of Arabic-language multiple-choice questions, compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans various educational levels, from elementary through post-college, with a total of 1,887 test samples.

+ **IEN**: This dataset was curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar), organized by grade, topic, and difficulty level. It comprehensively covers the entire Saudi curriculum from 1st grade through high school, with 9,990 multiple-choice questions and 5,823 true/false questions.

 **GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions, representing various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal). The sections include algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.

+ **AraPro**: A curated collection of 5,001 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans various subjects, including mathematics, science, and other relevant fields, providing a diverse set of questions for evaluation purposes.

+ **AraMath**: AraMath consists of 605 MCQs derived from [ArMath](https://github.com/reem-codes/ArMATH), a collection of mathematical word problems that were transformed into MCQs internally.
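
The internal word-problem-to-MCQ conversion is not documented in this card; the snippet below is only a minimal illustrative sketch of the general idea (recasting a problem with a known numeric answer as a multiple-choice item by adding numeric distractors), not the actual AraMath tooling.

```python
# Illustrative sketch only: turn a math word problem with a known numeric answer
# into a 4-option MCQ by generating perturbed numeric distractors.
# The real AraMath conversion pipeline is internal and may differ.
import random

def word_problem_to_mcq(question: str, answer: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    distractors = set()
    while len(distractors) < 3:
        candidate = answer + rng.choice([-10, -3, -2, -1, 1, 2, 3, 10])
        if candidate != answer:
            distractors.add(candidate)
    options = list(distractors) + [answer]
    rng.shuffle(options)
    labels = ["A", "B", "C", "D"]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(answer)],
    }

print(word_problem_to_mcq("Sara has 3 baskets with 12 apples each. How many apples does she have?", 36))
```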

+ **Ar-IFEval**: An Arabic instruction-following (IF) evaluation dataset designed to automatically assess language models' compliance with specified instructions through verifiable methods. The dataset consists of 535 instances, each containing two to four verifiable instructions that can be validated using deterministic programming approaches.
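
As a rough sketch of what deterministic, verifiable checking looks like (the instruction types below are illustrative assumptions, not the actual Ar-IFEval checkers), each instruction can be mapped to a pure function over the model response, and the prompt-level strict score requires every instruction in an instance to pass:

```python
# Minimal sketch of deterministic instruction checking; the instruction types
# here are hypothetical examples, not the real Ar-IFEval instruction set.
def check_min_word_count(response: str, min_words: int) -> bool:
    return len(response.split()) >= min_words

def check_ends_with(response: str, suffix: str) -> bool:
    return response.strip().endswith(suffix)

def check_no_latin_letters(response: str) -> bool:
    return not any("a" <= ch.lower() <= "z" for ch in response)

CHECKS = {
    "min_word_count": check_min_word_count,
    "ends_with": check_ends_with,
    "no_latin_letters": check_no_latin_letters,
}

def evaluate_instance(response: str, instructions: list[dict]) -> dict:
    results = [CHECKS[i["type"]](response, **i.get("kwargs", {})) for i in instructions]
    return {
        "inst_level_strict": results,         # per-instruction pass/fail ("inst strict")
        "prompt_level_strict": all(results),  # all instructions must pass ("prompt strict")
    }

print(evaluate_instance(
    "هذه إجابة قصيرة جدا.",
    [{"type": "min_word_count", "kwargs": {"min_words": 3}}, {"type": "no_latin_letters"}],
))
```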

 All models were evaluated using our proprietary evaluation pipeline and [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact match evaluations of the generated outputs.
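
For generation-based scoring, "exact match" can be read as comparing a normalized form of the generated answer against the gold label; the normalization rules below are assumptions for illustration, not the proprietary pipeline's exact logic:

```python
# Sketch of exact-match scoring over generated MCQ answers. The normalization
# (keep a leading option letter if present, otherwise strip punctuation and
# casefold) is an illustrative assumption.
import re

def normalize(text: str) -> str:
    text = text.strip().casefold()
    m = re.match(r"^\(?([a-dأ-ي])[\s).:]", text + " ")  # e.g. "B) ..." -> "b"
    if m:
        return m.group(1)
    return re.sub(r"[\s().,:;]+", "", text)

def exact_match(predictions: list[str], references: list[str]) -> float:
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

print(round(exact_match(["B) الخيار الثاني", " b ", "C"], ["b", "B", "D"]), 2))  # 66.67
```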
+ | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | Arabic MMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
 |:----------------------------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
 | ALLaM-7B-Instruct-preview | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
+ | AceGPT-v2-32B-Chat | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 55.31 | 71.57 | 68.3 | 60.8 | 43.21 |
+ | jais-family-6p7b-chat | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
+ | jais-family-13b-chat | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
+ | jais-family-30b-8k-chat | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
+ | jais-family-30b-16k-chat | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
+ | jais-adapted-7b-chat | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
+ | jais-adapted-13b-chat | nan | nan | nan | nan | nan | nan | nan | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
+ | jais-adapted-70b-chat | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
+ | Qwen2.5-7B-Instruct | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
+ | Qwen2.5-14B-Instruct | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
+ | Qwen2.5-72B-Instruct | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
+ | Mistral-7B-Instruct-v0.3 | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
+ | Mistral-Nemo-Instruct-2407 | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
+ | Mistral-Small-Instruct-2409 | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
+ | falcon-mamba-7b-instruct | nan | nan | nan | nan | nan | nan | nan | 28.49 | 63.52 | 39.27 | 28.45 | 29.69 |
+ | Llama-3.1-8B-Instruct | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 54.0 | 70.54 | 56.53 | 44.67 | 30.76 |
+ | Llama-3.3-70B-Instruct | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |
+ <!-- | AceGPT-v2-8B-Chat | nan | nan | nan | nan | nan | nan | nan | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 | -->

 Closed models evaluations:

 | Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | Arabic MMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
 |:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
+ | Azure ML GPT-4o (gpt-4o-900ptu) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
+ | Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
+ | Gemini 1.5 Pro (gemini-1.5-pro) | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |

 #### English Benchmarks

 | model | AGIEval 0 Shot | Arc (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks <br>ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQa 5 Shot | MMLU Pro<br>5 Shot | Minerva Math <br>4 Shot | MMLU 0 Shot | TruthfulQA <br>(mc2) 0 Shot | IFEval <br>(prompt strict)<br>0 Shot | IFEval <br>(inst strict)<br>0 Shot | GSM8k 5 Shot |
 |:----------------------------------|-----------------:|-----------------------:|--------------------------:|--------------------------:|--------------------:|-------------------:|------------------:|------------------:|----------------------:|--------------:|------------------------:|----------------------------------:|--------------------------------:|---------------:|
+ | ALLaM-7B-Instruct-preview | 41.75 | 51.28 | 22.1 | 73.17 | 70.48 | 76.26 | 15.96 | 30.43 | 17.42 | 59.6 | 46.67 | 37.71 | 48.68 | 62.02 |
 | AceGPT-v2-8B-Chat | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
 | jais-family-6p7b-chat | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
+ | jais-family-13b-chat | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75 | 35.82 | 24.4 | 18.92 | 51.91 | 40.57 | 20.52 | 31.89 | 64.59 |
+ | jais-family-30b-8k-chat | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.46 | 57.46 | 49.49 | 23.84 | 37.41 | 72.18 |
+ | jais-family-30b-16k-chat | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 30.22 | 68.01 |
 | jais-adapted-7b-chat | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22 | 35.73 | 58.07 |
 | jais-adapted-13b-chat | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
+ | jais-adapted-70b-chat | 39.96 | 59.56 | 20.98 | 70.77 | 77.27 | 84.06 | 68.6 | 37.33 | 27.72 | 65.23 | 44.49 | 31.98 | 44.36 | 76.8 |
+ | Qwen2.5-7B-Instruct | 59.1 | 51.28 | 26.56 | 72.78 | 69.38 | 79.55 | 50.59 | 36.1 | 12.04 | 70.56 | 58.93 | 58.04 | 68.35 | 43.29 |
 | Qwen2.5-14B-Instruct | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 49.6 | 22.6 | 78.93 | 69.01 | 51.57 | 64.27 | 79.38 |
+ | Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
 | Mistral-7B-Instruct-v0.3 | 36.54 | 58.87 | 23.21 | 72.53 | 73.95 | 82.93 | 67.97 | 33.61 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
 | Mistral-Nemo-Instruct-2407 | 39.88 | 59.13 | 24.33 | 67.82 | 74.74 | 82.35 | 72.93 | 44.27 | 30.22 | 65.56 | 54.88 | 30.13 | 38.97 | 74.15 |
 | Mistral-Small-Instruct-2409 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
+ | falcon-mamba-7b-instruct | 35.57 | 57.68 | 27.46 | 67.68 | 73.4 | 78.21 | 33.1 | 24.79 | 16.82 | 60.51 | 53.19 | 28.1 | 38.61 | 56.33 |
+ | Llama-3.1-8B-Instruct | 42.38 | 55.12 | 27.01 | 66.69 | 73.88 | 79.28 | 70.08 | 41.16 | 34.14 | 67.97 | 54.05 | 42.7 | 57.55 | 75.82 |
+ | Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |

  ### MT-bench