Commit
73932a3
·
verified ·
1 Parent(s): 1c5c24e

Upload README.md

Files changed (1)
  1. README.md +242 -0
README.md ADDED
@@ -0,0 +1,242 @@
1
+ ---
2
+ base_model: Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2
3
+ datasets:
4
+ - Omartificial-Intelligence-Space/Arabic-stsb
5
+ - Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class
6
+ language:
7
+ - ar
8
+ library_name: sentence-transformers
9
+ metrics:
10
+ - pearson_cosine
11
+ - spearman_cosine
12
+ - pearson_manhattan
13
+ - spearman_manhattan
14
+ - pearson_euclidean
15
+ - spearman_euclidean
16
+ - pearson_dot
17
+ - spearman_dot
18
+ - pearson_max
19
+ - spearman_max
20
+ pipeline_tag: sentence-similarity
21
+ tags:
22
+ - mteb
23
+ - sentence-transformers
24
+ - sentence-similarity
25
+ - feature-extraction
26
+ - generated_from_trainer
27
+ - dataset_size:947818
28
+ - loss:SoftmaxLoss
29
+ - loss:CosineSimilarityLoss
30
+ - transformers
31
+ model-index:
32
+ - name: Omartificial-Intelligence-Space/GATE-AraBert-v1
33
+ results:
34
+ - dataset:
35
+ config: ar-ar
36
+ name: MTEB STS17 (ar-ar)
37
+ revision: faeb762787bd10488a50c8b5be4a3b82e411949c
38
+ split: test
39
+ type: mteb/sts17-crosslingual-sts
40
+ metrics:
41
+ - type: cosine_pearson
42
+ value: 82.06597171670848
43
+ - type: cosine_spearman
44
+ value: 82.7809395809498
45
+ - type: euclidean_pearson
46
+ value: 79.23996991139896
47
+ - type: euclidean_spearman
48
+ value: 81.5287595404711
49
+ - type: main_score
50
+ value: 82.7809395809498
51
+ - type: manhattan_pearson
52
+ value: 78.95407006608013
53
+ - type: manhattan_spearman
54
+ value: 81.15109493737467
55
+ task:
56
+ type: STS
57
+ - dataset:
58
+ config: ar
59
+ name: MTEB STS22.v2 (ar)
60
+ revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
61
+ split: test
62
+ type: mteb/sts22-crosslingual-sts
63
+ metrics:
64
+ - type: cosine_pearson
65
+ value: 54.912880452465004
66
+ - type: cosine_spearman
67
+ value: 63.09788380910325
68
+ - type: euclidean_pearson
69
+ value: 57.92665617677832
70
+ - type: euclidean_spearman
71
+ value: 62.76032598469037
72
+ - type: main_score
73
+ value: 63.09788380910325
74
+ - type: manhattan_pearson
75
+ value: 58.0736648155273
76
+ - type: manhattan_spearman
77
+ value: 62.94190582776664
78
+ task:
79
+ type: STS
80
+ - dataset:
81
+ config: ar
82
+ name: MTEB STS22 (ar)
83
+ revision: de9d86b3b84231dc21f76c7b7af1f28e2f57f6e3
84
+ split: test
85
+ type: mteb/sts22-crosslingual-sts
86
+ metrics:
87
+ - type: cosine_pearson
88
+ value: 51.72534929358701
89
+ - type: cosine_spearman
90
+ value: 59.75149627160101
91
+ - type: euclidean_pearson
92
+ value: 53.894835373598774
93
+ - type: euclidean_spearman
94
+ value: 59.44278354697161
95
+ - type: main_score
96
+ value: 59.75149627160101
97
+ - type: manhattan_pearson
98
+ value: 54.076675975406985
99
+ - type: manhattan_spearman
100
+ value: 59.610061143235725
101
+ task:
102
+ type: STS
103
+ widget:
104
+ - source_sentence: امرأة تكتب شيئاً
105
+ sentences:
106
+ - مراهق يتحدث إلى فتاة عبر كاميرا الإنترنت
107
+ - امرأة تقطع البصل الأخضر.
108
+ - مجموعة من كبار السن يتظاهرون حول طاولة الطعام.
109
+ - source_sentence: تتشكل النجوم في مناطق تكوين النجوم، والتي تنشأ نفسها من السحب الجزيئية.
110
+ sentences:
111
+ - لاعب كرة السلة على وشك تسجيل نقاط لفريقه.
112
+ - المقال التالي مأخوذ من نسختي من "أطلس البطريق الجديد للتاريخ الوسطى"
113
+ - قد يكون من الممكن أن يوجد نظام شمسي مثل نظامنا خارج المجرة
114
+ - source_sentence: >-
115
+ تحت السماء الزرقاء مع الغيوم البيضاء، يصل طفل لمس مروحة طائرة واقفة على حقل
116
+ من العشب.
117
+ sentences:
118
+ - امرأة تحمل كأساً
119
+ - طفل يحاول لمس مروحة طائرة
120
+ - اثنان من عازبين عن الشرب يستعدون للعشاء
121
+ - source_sentence: رجل في منتصف العمر يحلق لحيته في غرفة ذات جدران بيضاء والتي لا تبدو كحمام
122
+ sentences:
123
+ - فتى يخطط اسمه على مكتبه
124
+ - رجل ينام
125
+ - المرأة وحدها وهي نائمة في غرفة نومها
126
+ - source_sentence: الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.
127
+ sentences:
128
+ - شخص طويل القامة
129
+ - المرأة تنظر من النافذة.
130
+ - لقد مات الكلب
131
+ license: apache-2.0
132
+ ---
133
+
134
+ # GATE-AraBert-V1
135
+
136
+ This is **GATE (General Arabic Text Embedding)**, a sentence embedding model trained with SentenceTransformers in a **multi-task** setup on the **AllNLI** and **STS** datasets.
137
+
138
+ ## Model Details
139
+
140
+ ### Model Description
141
+ - **Model Type:** Sentence Transformer
142
+ - **Base model:** [Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2](https://huggingface.co/Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2) <!-- at revision 5ce4f80f3ede26de623d6ac10681399dba5c684a -->
143
+ - **Maximum Sequence Length:** 512 tokens
144
+ - **Output Dimensionality:** 768 dimensions
145
+ - **Similarity Function:** Cosine Similarity
146
+ - **Training Datasets:**
147
+ - [all-nli](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class)
148
+ - [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb)
149
+ - **Language:** ar
150
+
151
+
152
+ ## Usage
153
+
154
+ ### Direct Usage (Sentence Transformers)
155
+
156
+ First install the Sentence Transformers library:
157
+
158
+ ```bash
159
+ pip install -U sentence-transformers
160
+ ```
161
+
162
+ Then you can load this model and run inference.
163
+ ```python
164
+ from sentence_transformers import SentenceTransformer
165
+
166
+ # Download from the 🤗 Hub
167
+ model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")
168
+ # Run inference
169
+ sentences = [
170
+ 'الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.',
171
+ 'لقد مات الكلب',
172
+ 'شخص طويل القامة',
173
+ ]
174
+ embeddings = model.encode(sentences)
175
+ print(embeddings.shape)
176
+ # (3, 768)
177
+
178
+ # Get the similarity scores for the embeddings
179
+ similarities = model.similarity(embeddings, embeddings)
180
+ print(similarities.shape)
181
+ # torch.Size([3, 3])
182
+ ```
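The `model.similarity` call above returns one score per pair of inputs, and the card lists cosine similarity as the similarity function. As a rough illustration of what such a matrix contains, the sketch below computes cosine similarity in plain Python on made-up 3-dimensional vectors. The toy vectors and the helper names `cosine_similarity` and `similarity_matrix` are assumptions for illustration only; they are not part of the Sentence Transformers API and not real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical stand-ins for sentence embeddings (real ones have 768 dimensions).
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 3) for score in row])
```

Each diagonal entry compares a vector with itself, so it is 1. Off-diagonal entries approach 1 for vectors pointing in nearly the same direction and 0 for orthogonal ones.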
183
+
184
+
185
+ ## Evaluation
186
+
187
+ ### Metrics
188
+
189
+ #### Semantic Similarity
190
+ * Dataset: `sts-dev`
191
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
192
+
193
+ | Metric | Value |
194
+ |:--------------------|:----------|
195
+ | pearson_cosine | 0.8391 |
196
+ | **spearman_cosine** | **0.841** |
197
+ | pearson_manhattan | 0.8277 |
198
+ | spearman_manhattan | 0.8361 |
199
+ | pearson_euclidean | 0.8274 |
200
+ | spearman_euclidean | 0.8358 |
201
+ | pearson_dot | 0.8154 |
202
+ | spearman_dot | 0.818 |
203
+ | pearson_max | 0.8391 |
204
+ | spearman_max | 0.841 |
205
+
206
+ #### Semantic Similarity
207
+ * Dataset: `sts-test`
208
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
209
+
210
+ | Metric | Value |
211
+ |:--------------------|:-----------|
212
+ | pearson_cosine | 0.813 |
213
+ | **spearman_cosine** | **0.8173** |
214
+ | pearson_manhattan | 0.8114 |
215
+ | spearman_manhattan | 0.8164 |
216
+ | pearson_euclidean | 0.8103 |
217
+ | spearman_euclidean | 0.8158 |
218
+ | pearson_dot | 0.7908 |
219
+ | spearman_dot | 0.7887 |
220
+ | pearson_max | 0.813 |
221
+ | spearman_max | 0.8173 |
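The `pearson_cosine` and `spearman_cosine` rows in the tables above measure how closely the cosine similarity predicted for each sentence pair tracks the human-annotated similarity score. The sketch below shows that calculation in plain Python on invented numbers. It is a simplified illustration under stated assumptions, not the code of `EmbeddingSimilarityEvaluator`; the names `pearson`, `average_ranks`, and `spearman` and all values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented gold scores (0 to 5 scale) and model cosine similarities for five pairs.
gold_scores = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted_cosine = [0.10, 0.35, 0.30, 0.80, 0.95]

print("pearson_cosine:", round(pearson(gold_scores, predicted_cosine), 4))
print("spearman_cosine:", round(spearman(gold_scores, predicted_cosine), 4))
```

Spearman depends only on the ordering of the pairs, so it does not change if the model's scores are rescaled monotonically, while Pearson also reflects how linear the relationship is.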
222
+
223
+
224
+ ## <span style="color:blue">Acknowledgments</span>
225
+
226
+ The author would like to thank Prince Sultan University for their invaluable support in this project. Their contributions and resources have been instrumental in the development and fine-tuning of these models.
227
+
228
+
229
+
230
+ ## Citation
231
+
232
+ If you use GATE, please cite it as follows:
233
+ ```bibtex
234
+ @misc{nacar2025GATE,
235
+   title={GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Hybrid Loss Training},
236
+   author={Omer Nacar and Anis Koubaa and Serry Taiseer Sibaee and Lahouari Ghouti},
237
+   year={2025},
238
+   note={Submitted to COLING 2025},
239
+   url={https://huggingface.co/Omartificial-Intelligence-Space/GATE-AraBert-v1},
240
+ }
241
+ ```
242
+