---
license: apache-2.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- pytorch
library_name: transformers
---
# ALLaM-7B-Instruct-preview

ALLaM is a series of powerful language models designed to advance Arabic Language Technology (ALT), developed by the National Center for Artificial Intelligence (NCAI) at the [Saudi Data and AI Authority (SDAIA)](https://sdaia.gov.sa/en/default.aspx). `ALLaM-7B-Instruct-preview` is trained from scratch. Our pretraining recipe consists of two steps: training on 4T English tokens, followed by training on 1.2T mixed Arabic/English tokens. This retains the English capabilities of the model without catastrophic forgetting, effectively transferring knowledge from one language distribution to another.

## Intended Use

`ALLaM` is specifically designed to expedite research and development in ALT through large language models (LLMs). It serves as a foundational element for building product offerings and for facilitating experimental initiatives.

ALLaM series models are designed to be components of larger AI systems, and it is important for developers to incorporate safety measures when creating these systems. Such measures are crucial for balancing effectiveness with security and for minimizing potential risks, such as those arising from integrating the model with external tools.

## Model Details

ALLaM is a family of LLMs specially trained for Arabic. The two main paths followed for pretraining are:

- **ALLaM**: pretraining models from scratch
- **ALLaM-Adapted** (also written ALLaM-(\*\*) or (\*\*)-ALLaM): continued training from open-source/open-weight models

For this release, we are providing our instruction-tuned 7B-parameter generative model pretrained from scratch.

Some parameters for this model are provided in the following table:

| Size | Context Length | Pretraining Tokens | Instructions | Preference Pairs |
|---------------|----------------|------------------------|--------------|------------------|
| 7B parameters | 4096 tokens | 4T (en) + 1.2T (en+ar) | 7M | 260K |
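
These parameters can be double-checked against a downloaded checkpoint by reading the model config. A minimal sketch follows; the checkpoint name mirrors the usage example below, and the config field names are the usual Llama-style ones, so treat both as assumptions:

```python
from transformers import AutoConfig

# Load only the config; no weights are downloaded for this check.
config = AutoConfig.from_pretrained("ALLaM-2-7B-SFT")

# max_position_embeddings is the standard field for the context window;
# hidden_size / num_hidden_layers give a rough sense of model scale.
print("context length:", config.max_position_embeddings)  # expected: 4096
print("hidden size:   ", config.hidden_size)
print("layers:        ", config.num_hidden_layers)
```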

## Model Description

- **Developed by:** National Center for Artificial Intelligence at [SDAIA](https://sdaia.gov.sa/en/default.aspx)
- **Model type:** Autoregressive Transformer
- **Language(s):** Arabic, English
- **License:** Please see the LICENSE file
- **Input:** Text
- **Output:** Text

## Training Details

ALLaM-7B-Instruct-preview is pretrained on a total of 5.2 trillion tokens in English and Arabic (4T English tokens, followed by 1.2T mixed Arabic/English tokens). Our training codebase is built on [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM). Average MFU during training was ~42%, and we trained our model using bf16 mixed precision.
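
Since training used bf16, loading the weights in bfloat16 at inference time keeps memory use at roughly half of fp32 while matching the training precision. A minimal sketch, assuming the checkpoint name from the usage example below (`torch_dtype` is standard `transformers` API):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in bfloat16, matching the training precision.
# "ALLaM-2-7B-SFT" is the checkpoint name used in the usage example below.
model = AutoModelForCausalLM.from_pretrained(
    "ALLaM-2-7B-SFT",
    torch_dtype=torch.bfloat16,  # ~2 bytes per parameter instead of 4
)
print(model.dtype)  # torch.bfloat16
```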

## Getting started

### System Prompt

It is important to note that this model is optimized to function without a predefined system prompt.
While ALLaM does not come with a default system prompt, it does provide the flexibility to add a custom one.
For instance, a well-crafted system prompt could be:

"You are ALLaM, a bilingual English and Arabic AI assistant."

System prompts can also be in Arabic:

"أنت علام، مساعد ذكاء اصطناعي مطور من الهيئة السعودية للبيانات والذكاء الاصطناعي، تجيب على الأسئلة بطريقة مفيدة مع مراعاة القيم الثقافية المحلية."
(Translation: "You are ALLaM, an AI assistant developed by the Saudi Data and AI Authority; you answer questions helpfully while taking local cultural values into account.")

Alternatively, users can get creative with their prompts, such as:

"You are an AI assistant who responds to everything like a pirate."

The system prompt is integrated inside the tokenizer config and is applied via the `apply_chat_template()` method.
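
A minimal sketch of passing a custom system prompt through the chat template, assuming the checkpoint name used in the usage example below:

```python
from transformers import AutoTokenizer

# The tokenizer's chat template knows where to place a system message.
tokenizer = AutoTokenizer.from_pretrained("ALLaM-2-7B-SFT")

messages = [
    # Optional custom system prompt; drop this entry to run without one.
    {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."},
    {"role": "user", "content": "What is the capital of Saudi Arabia?"},
]

# Render the conversation into the exact prompt string the model expects.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```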

### Example Usages

The weights for ALLaM model checkpoints can be accessed via [Hugging Face transformers](https://github.com/huggingface/transformers) (tested with `transformers>=4.40.1`). The following code snippet demonstrates how to load the model and generate text using the `ALLaM-2-7B-SFT` model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

allam_model = AutoModelForCausalLM.from_pretrained("ALLaM-2-7B-SFT")
tokenizer = AutoTokenizer.from_pretrained("ALLaM-2-7B-SFT")

messages = [
    {"role": "user", "content": "كيف أجهز كوب شاهي؟"},  # "How do I prepare a cup of tea?"
]

# Render the chat template, then tokenize the resulting prompt string.
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(inputs, return_tensors='pt', return_token_type_ids=False)

# Move inputs and model to the GPU before generating.
inputs = {k: v.to('cuda') for k, v in inputs.items()}
allam_model = allam_model.to('cuda')

response = allam_model.generate(**inputs, max_new_tokens=4096, do_sample=True, top_k=50, top_p=0.95, temperature=0.6)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```
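
Note that with `do_sample=True`, outputs vary from run to run. For reproducible outputs (e.g., when comparing checkpoints), pass `do_sample=False` to `generate()` instead; this switches to greedy decoding, making `top_k`, `top_p`, and `temperature` unnecessary.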

## Ethical Considerations and Limitations

ALLaM is a generative model that comes with inherent uncertainties. Testing cannot cover every possible use case, so ALLaM's responses cannot be predicted in every context, and the model may occasionally produce incorrect or biased outputs. Developers must conduct thorough safety evaluations and make model-specific adjustments to ensure the model is suitable for the intended purposes.

*The output generated by this model is not considered a statement of NCAI, SDAIA, or any other organization.*

## Evaluation

### Automatic Benchmarks

#### Arabic Benchmarks

**Massive Multitask Language Understanding** (MMLU) is a collection of multiple-choice evaluation questions sourced from various academic levels (elementary to college), typically covering the humanities, STEM, or the social sciences. It was originally an English dataset, but variants have been developed for Arabic:

- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains, published by MBZUAI.
- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions translated from the original MMLU benchmark, published by OpenAI.

**Exams Arabic** ([Exams Ar](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples covering several domains, e.g., Islamic studies, science, humanities, and physics.

**Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 true/false questions from 58 different areas.

**Education and Training Evaluation Commission** (ETEC): This dataset consists of Arabic-language multiple-choice questions compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans various educational levels, from elementary through post-college, with a total of 1,188 test samples.

**IEN**: This dataset was curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar) and is organized by grade, topic, and difficulty level. It comprehensively covers the entire Saudi curriculum from 1st grade through high school, with 10,435 multiple-choice questions and 10,910 true/false questions.

**GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions representing various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal). The sections include algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.

**SDAIA MCQs**: A curated collection of 1,594 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans various subjects, including mathematics, science, and other relevant fields, providing a diverse set of questions for evaluation purposes.

All models were evaluated using our proprietary evaluation pipeline and the [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact-match evaluation of the generated outputs.

| Model | Exams Ar (5-shot) | ACVA (5-shot) | ETEC (0-shot) | IEN MCQ (0-shot) | IEN T/F (0-shot) | SDAIA MCQs (0-shot) | ArabicMMLU (0-shot) | AraMath v2 (5-shot) | OpenAI MMLU-ar (0-shot) | GAT overall average (0-shot) |
|:----------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| Qwen2.5-72B-Instruct | 60.71 | 79.92 | 79.92 | 89.25 | 87.23 | 79.3 | 74.1 | 92.17 | 73.59 | 59.54 |
| Llama-3.1-70B-Instruct | 60.34 | 77.07 | 72.3 | 85.02 | 70.21 | 76.79 | 71.46 | 85.17 | 69.88 | 42.36 |
| jais-adapted-70b-chat | 54.75 | 73.33 | 59.57 | 76 | 56.97 | 69.39 | 65.74 | 52.17 | 56.82 | 39.15 |
| jais-family-30b-8k-chat | 50.28 | 74.47 | 55.71 | 73.02 | 72.14 | 65.31 | 63.11 | 50.33 | 50.9 | 36.44 |
| jais-family-30b-16k-chat | 49.72 | 60.08 | 27.64 | 40.56 | 60.03 | 26.98 | 62.04 | 46.5 | 50.98 | 34.85 |
| AceGPT-v2-8B-Chat | 51.96 | 72.69 | 56.71 | 77.02 | 75.85 | 68.44 | 57.02 | 40 | 49.99 | 36.15 |
| jais-family-6p7b-chat | 46.93 | 73.8 | 48.31 | 61.55 | 68.14 | 60.66 | 56.15 | 30.83 | 44.96 | 31.71 |
| jais-adapted-7b-chat | 40.6 | 70.44 | 40.96 | 58.69 | 68.63 | 51.38 | 49.75 | 24.17 | 38.54 | 29.68 |
| jais-adapted-13b-chat | 48.23 | 67.78 | 47.89 | 70.75 | 41.54 | 61.92 | 56.42 | 41.17 | 46.83 | 33.4 |
| Qwen2.5-14B-Instruct | 57.54 | 75.04 | 73.41 | 83.76 | 71.04 | 73.59 | 69.36 | 91.17 | 63.8 | 51.7 |
| Mistral-7B-Instruct-v0.3 | 33.71 | 61.21 | 33.83 | 53.9 | 65.38 | 44.1 | 45.27 | 24.33 | 32.32 | 26.65 |
| falcon-mamba-7b-instruct | 28.49 | 63.52 | 34.62 | 47.22 | 71.34 | 39.9 | 39.27 | 31.67 | 28.45 | 29.69 |
| Mistral-Nemo-Instruct-2407 | 47.49 | 76.92 | 51.43 | 70.23 | 71.73 | 61.1 | 55.97 | 43.33 | 46.15 | 25.44 |
| Qwen2.5-7B-Instruct | 50.65 | 78.17 | 64.11 | 78.31 | 75.17 | 68.7 | 61.54 | 60.5 | 56.1 | 41.42 |
| Llama-3.1-8B-Instruct | 54 | 70.54 | 51.9 | 70.01 | 76.99 | 62.42 | 56.53 | 42.83 | 44.67 | 30.76 |
| jais-family-13b-chat | 45.07 | 71.18 | 46.83 | 60.92 | 50.87 | 54.83 | 58.14 | 41.67 | 47.73 | 31.72 |
| Mistral-Small-Instruct-2409 | 38.73 | 68.93 | 44.03 | 62.16 | 75.87 | 52.51 | 50.43 | 46.33 | 39.63 | 28.82 |
| v27_dpo_model_7b-alpha-v1.27.2.25 | 51.58 | 76.33 | 66.81 | 91.54 | 85.57 | 73.9 | 67.78 | 65.5 | 55.91 | 44.53 |
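
For open-weight models, a scoring run can be reproduced roughly as follows with the harness's public `simple_evaluate` API. This is a minimal sketch: the task name and checkpoint id are illustrative assumptions, so check the harness's task registry for the exact task names used in our pipeline.

```python
import lm_eval

# Evaluate a Hugging Face checkpoint on an Arabic benchmark task.
# "arabicmmlu" is an illustrative task name; "ALLaM-2-7B-SFT" is a placeholder id.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ALLaM-2-7B-SFT,dtype=bfloat16",
    tasks=["arabicmmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```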

#### English Benchmarks

| Model | AGIEval (0-shot) | ARC Challenge (0-shot) | GPQA main (0-shot) | Hendrycks Ethics (0-shot) | Winogrande (0-shot) | HellaSwag (0-shot) | TriviaQA (5-shot) | MMLU-Pro (5-shot) | Minerva Math (4-shot) | MMLU (0-shot) | TruthfulQA MC2 (0-shot) | IFEval prompt-level (strict) | IFEval inst-level (strict) | GSM8K (5-shot) |
|:----------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| Qwen2.5-72B-Instruct | 71.09 | 63.48 | 25.67 | 78.33 | 76.24 | 87.41 | 70.9 | 62.77 | 54.04 | 83.44 | 69.54 | 67.47 | 76.86 | 93.25 |
| Llama-3.1-70B-Instruct | 52.6 | 63.05 | 27.01 | 80.28 | 79.08 | 84.67 | 82.09 | 59 | 49.18 | 82.36 | 59.92 | 70.98 | 79.74 | 88.4 |
| jais-adapted-70b-chat | 39.96 | 59.56 | 20.98 | 70.77 | 77.27 | 84.06 | 68.6 | 37.33 | 27.72 | 65.23 | 44.49 | 31.98 | 44.36 | 76.8 |
| jais-family-30b-8k-chat | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.46 | 57.46 | 49.49 | 23.84 | 37.41 | 72.18 |
| jais-family-30b-16k-chat | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 30.22 | 68.01 |
| AceGPT-v2-8B-Chat | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
| jais-family-6p7b-chat | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
| jais-adapted-7b-chat | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22 | 35.73 | 58.07 |
| jais-adapted-13b-chat | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
| Qwen2.5-14B-Instruct | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 49.6 | 22.6 | 78.93 | 69.01 | 51.57 | 64.27 | 79.38 |
| Mistral-7B-Instruct-v0.3 | 36.54 | 58.87 | 23.21 | 72.53 | 73.95 | 82.93 | 67.97 | 33.61 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
| falcon-mamba-7b-instruct | 35.57 | 57.68 | 27.46 | 67.68 | 73.4 | 78.21 | 33.1 | 24.79 | 16.82 | 60.51 | 53.19 | 28.1 | 38.61 | 56.33 |
| Mistral-Nemo-Instruct-2407 | 39.88 | 59.13 | 24.33 | 67.82 | 74.74 | 82.35 | 72.93 | 44.27 | 30.22 | 65.56 | 54.88 | 30.13 | 38.97 | 74.15 |
| Qwen2.5-7B-Instruct | 59.1 | 51.28 | 26.56 | 72.78 | 69.38 | 79.55 | 50.59 | 36.1 | 12.04 | 70.56 | 58.93 | 58.04 | 68.35 | 43.29 |
| Llama-3.1-8B-Instruct | 42.38 | 55.12 | 27.01 | 66.69 | 73.88 | 79.28 | 70.08 | 41.16 | 34.14 | 67.97 | 54.05 | 42.7 | 57.55 | 75.82 |
| jais-family-13b-chat | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75 | 35.82 | 24.4 | 18.92 | 51.91 | 40.57 | 20.52 | 31.89 | 64.59 |
| Mistral-Small-Instruct-2409 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
| v27_dpo_model_7b-alpha-v1.27.2.25 | 41.75 | 51.28 | 22.1 | 73.17 | 70.48 | 76.26 | 15.96 | 30.43 | 17.42 | 59.6 | 46.67 | 37.71 | 48.68 | 62.02 |

### MT-bench

**Multi-Turn Bench** (MT-bench): A challenging multi-turn benchmark that uses GPT-4o as a judge. MT-bench comprises 80 questions from 8 domains. Each question is presented to the model, and the responses are submitted to GPT-4o, which assigns a score to each response; the judge returns separate scores for the first and second turns.
This dataset was also automatically translated to Arabic, then manually verified and culturally aligned.

| Model | AR Average | AR Turn 1 | AR Turn 2 | EN Average | EN Turn 1 | EN Turn 2 |
|---------------------------|------------|-----------|-----------|------------|-----------|-----------|
| AceGPT-v2-32B-Chat | 5.43 | 6.61 | 4.26 | **6.5** | 7.41 | **5.58** |
| AceGPT-v1.5-13B-Chat | 4.61 | 5.28 | 3.93 | 4.86 | 5.56 | 4.17 |
| Jais-family-13b-chat | 4.89 | 5.37 | 4.41 | 4.77 | 5.57 | 3.97 |
| Jais-family-30b-16k-chat | 4.87 | 5.50 | 4.25 | 5.13 | 5.86 | 4.4 |
| Jais-adapted-70b-chat | 5.86 | 6.33 | **5.38** | 5.88 | 6.41 | 5.36 |
| ALLaM-7B-Instruct-preview | **5.9** | **6.93** | 4.88 | **6.5** | **7.49** | 5.15 |

## Citation

If you find this work helpful or use any part of it, please cite:

```bibtex
@misc{bari2024allamlargelanguagemodels,
      title={ALLaM: Large Language Models for Arabic and English},
      author={M Saiful Bari and Yazeed Alnumay and Norah A. Alzahrani and Nouf M. Alotaibi and Hisham A. Alyahya and Sultan AlRashed and Faisal A. Mirza and Shaykhah Z. Alsubaie and Hassan A. Alahmed and Ghadah Alabduljabbar and Raghad Alkhathran and Yousef Almushayqih and Raneem Alnajim and Salman Alsubaihi and Maryam Al Mansour and Majed Alrubaian and Ali Alammari and Zaki Alawami and Abdulmohsen Al-Thubaity and Ahmed Abdelali and Jeril Kuriakose and Abdalghani Abujabal and Nora Al-Twairesh and Areeb Alowisheq and Haidar Khan},
      year={2024},
      eprint={2407.15390},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.15390},
}
```