Add comprehensive model card
README.md
CHANGED
@@ -1,458 +1,216 @@
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
widget:
- sentences:
  - شخصان يصعدان على الدرج
  - الناس يجلسون
  - رجل يجلس ويستمع للمحادثات
- source_sentence: لاعب كرة قدم يرتدي زيًا أحمر وأسود يحمل الرقم 3 وخوذة سوداء يحمل الكرة ويحيط به لاعبون معارضون يرتدون زيًا أبيض وأرجواني بيكسفيل.
  sentences:
  - لاعب كرة قدم يحمل كرة
  - الرجل مستعد لالتقاط كرة القدم
  - الكلاب بالخارج
- source_sentence: بعثة لوس أنجلوس هي عيادة مجانية
  sentences:
  - إنها مساعدة ممرضة في بعثة لوس أنجلوس
  - تعمل كطبيبة رئيسة في "لوس أنجلوس ميسيون" عيادة مجانية في حي فقير
  - التوافق مطلوب من الأجهزة أو البرمجيات.
- source_sentence: رجل يرتدي قميصًا بنيًا مخططًا يقف يثني ذراعيه على قمة مبنى على سطح منزل.
  sentences:
  - رجل ينظر من نافذة المطبخ
  - شخص على السطح
  - لا يجوز إظهار أي مبلغ من الأصول في الميزانية العمومية للمهمة الفيدرالية
- source_sentence: الحيوانات الأليفة تلعب دور الجدار
  sentences:
  - كلبان يلعبان في منطقة محصورة من الحصى.
  - الكلاب تجري لالتقاط عصا عبر الشارع.
  - يمكن تطوير التكنولوجيا.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: SentenceTransformer based on aubmindlab/bert-base-arabertv02
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: arabic-nli-dev
      type: arabic-nli-dev
    metrics:
    - type: pearson_cosine
      value: 0.5891
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.5933
      name: Spearman Cosine
---

# SentenceTransformer based on aubmindlab/bert-base-arabertv02

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) <!-- at revision 016fb9d6768f522a59c6e0d2d5d5d43a4e1bff60 -->
- **Maximum Sequence Length:** 75 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka")
# Run inference; the original example sentences were lost, these are taken
# from the widget examples above
sentences = [
    "الحيوانات الأليفة تلعب دور الجدار",
    "كلبان يلعبان في منطقة محصورة من الحصى.",
    "الكلاب تجري لالتقاط عصا عبر الشارع.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics

#### Semantic Similarity

* Dataset: `arabic-nli-dev`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.5891     |
| **spearman_cosine** | **0.5933** |
## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 457,216 training samples
* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
* Approximate statistics based on the first 1000 samples:
  | | sentence_0 | sentence_1 | sentence_2 |
  |:--------|:-----------|:-----------|:-----------|
  | type | string | string | string |
  | details | <ul><li>min: 4 tokens</li><li>mean: 12.5 tokens</li><li>max: 66 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.33 tokens</li><li>max: 68 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 9.59 tokens</li><li>max: 33 tokens</li></ul> |
* Samples:
  | sentence_0 | sentence_1 | sentence_2 |
  |:-----------|:-----------|:-----------|
  | <code>يجلس طفل أحمر الشعر ينظر من خلال السور إلى الماء بينما يلعب الناس على الشاطئ في المسافة.</code> | <code>طفل أحمر الشعر مهتم بالماء والناس يلعبون على الشاطئ في المسافة.</code> | <code>فتى شقراء يراقب القارب مع الناس عليه يبحر بعيدا.</code> |
  | <code>عامل نظافة على وشك التنظيف في محطة القطار</code> | <code>البواب سيقوم بتنظيف محطة القطار</code> | <code>البواب يجلس في محطة القطار</code> |
  | <code>رجل يرتدي قميصاً أخضر وبنطال جينز ينحني فوق مرمى الهوكي الأحمر مع ثقب فوقه.</code> | <code>رجل يرتدي قميصاً أخضر.</code> | <code>امرأة ترتدي قميصاً أخضر.</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `fp16`: True
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin

</details>

### Training Logs

| Epoch  | Step  | Training Loss | arabic-nli-dev_spearman_cosine |
|:------:|:-----:|:-------------:|:------------------------------:|
| 0.1400 | 500   | 10.0831       | -                              |
| 0.1999 | 714   | -             | 0.4417                         |
| 0.2800 | 1000  | 5.6335        | -                              |
| 0.3998 | 1428  | -             | 0.5157                         |
| 0.4199 | 1500  | 4.7627        | -                              |
| 0.5599 | 2000  | 4.3656        | -                              |
| 0.5997 | 2142  | -             | 0.5443                         |
| 0.6999 | 2500  | 4.085         | -                              |
| 0.7996 | 2856  | -             | 0.5569                         |
| 0.8399 | 3000  | 3.8314        | -                              |
| 0.9798 | 3500  | 3.5961        | -                              |
| 0.9994 | 3570  | -             | 0.5612                         |
| 1.0    | 3572  | -             | 0.5617                         |
| 1.1198 | 4000  | 3.2502        | -                              |
| 1.1993 | 4284  | -             | 0.5819                         |
| 1.2598 | 4500  | 3.1274        | -                              |
| 1.3992 | 4998  | -             | 0.5848                         |
| 1.3998 | 5000  | 3.0461        | -                              |
| 1.5398 | 5500  | 2.9606        | -                              |
| 1.5991 | 5712  | -             | 0.5930                         |
| 1.6797 | 6000  | 2.9263        | -                              |
| 1.7990 | 6426  | -             | 0.5906                         |
| 1.8197 | 6500  | 2.8313        | -                              |
| 1.9597 | 7000  | 2.7663        | -                              |
| 1.9989 | 7140  | -             | 0.5868                         |
| 2.0    | 7144  | -             | 0.5888                         |
| 2.0997 | 7500  | 2.4814        | -                              |
| 2.1988 | 7854  | -             | 0.5864                         |
| 2.2396 | 8000  | 2.3545        | -                              |
| 2.3796 | 8500  | 2.3052        | -                              |
| 2.3987 | 8568  | -             | 0.5898                         |
| 2.5196 | 9000  | 2.3227        | -                              |
| 2.5985 | 9282  | -             | 0.5924                         |
| 2.6596 | 9500  | 2.3185        | -                              |
| 2.7984 | 9996  | -             | 0.5933                         |
| 2.7996 | 10000 | 2.2571        | -                              |
| 2.9395 | 10500 | 2.2335        | -                              |
| 2.9983 | 10710 | -             | 0.5925                         |
| 3.0    | 10716 | -             | 0.5933                         |

### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 4.1.0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
---
language:
- ar
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- matryoshka
- arabic
- natural-language-inference
- bert
- nli
- arabert
datasets:
- Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class
metrics:
- cosine_accuracy
- cosine_f1
- accuracy
- f1
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: aubmindlab/bert-base-arabertv02
license: apache-2.0
model-index:
- name: Arabic BERT NLI Matryoshka
  results:
  - task:
      type: natural-language-inference
      name: Natural Language Inference
    dataset:
      type: Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class
      name: Arabic NLI Pair Classification
    metrics:
    - type: accuracy
      value: 0.8125
      name: Best Accuracy (128 dim)
    - type: f1
      value: 0.8142
      name: Best F1 (256 dim)
---

# Arabic BERT NLI Matryoshka Embeddings

## Model Description

This model is a **Matryoshka representation learning** version of AraBERT, fine-tuned for Arabic Natural Language Inference (NLI) tasks. It generates embeddings that can be truncated to different dimensions (768, 512, 256, 128, 64) while maintaining strong performance at every size.

The model is based on `aubmindlab/bert-base-arabertv02` and trained with the Matryoshka Representation Learning approach, so a single model serves all embedding dimensions without retraining.
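Recent versions of sentence-transformers (v2.7+) can apply this truncation at load time via the `truncate_dim` argument, so downstream code never sees the full vector. A minimal sketch:

```python
from sentence_transformers import SentenceTransformer

# encode() will return 256-dimensional vectors directly
# (truncate_dim requires sentence-transformers >= 2.7)
model_256 = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka", truncate_dim=256)

embeddings = model_256.encode(["الطقس جميل اليوم"])  # "The weather is nice today"
print(embeddings.shape)  # (1, 256)
```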

## Key Features

- 🔄 **Flexible Dimensions**: Single model supports embeddings of size 768, 512, 256, 128, and 64
- 🚀 **High Performance**: Consistently outperforms the base model across all dimensions
- 📊 **Arabic NLI Optimized**: Specifically trained on Arabic Natural Language Inference tasks
- ⚡ **Efficient**: Smaller dimensions offer faster inference with minimal performance loss
- 🎯 **Binary Classification**: Optimized for entailment vs contradiction classification

## Performance Results

Our model shows significant improvements over the base AraBERT model across all embedding dimensions:

| Dimension | Matryoshka Accuracy | Base Accuracy | Matryoshka F1 | Base F1 | F1 Improvement |
|-----------|---------------------|---------------|---------------|---------|----------------|
| 768       | 80.3%               | 56.8%         | 81.15%        | 41.94%  | +39.21%        |
| 512       | 80.6%               | 56.9%         | 81.36%        | 44.32%  | +37.05%        |
| 256       | 80.95%              | 55.65%        | 81.42%        | 38.7%   | +42.72%        |
| 128       | 81.25%              | 56.7%         | 81.37%        | 40.6%   | +40.77%        |
| 64        | 81.0%               | 55.8%         | 80.51%        | 37.92%  | +42.59%        |
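A sketch of how numbers like these can be reproduced: encode both sides of each labeled pair, threshold the cosine similarity, and score accuracy/F1 at each truncation width. The split and column names below are assumptions, not the exact evaluation script; check the dataset card for the real ones.

```python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score  # assumes scikit-learn is installed

model = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka")
# Hypothetical split and column names
ds = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class", split="test")

emb1 = model.encode(ds["premise"], convert_to_numpy=True)
emb2 = model.encode(ds["hypothesis"], convert_to_numpy=True)
labels = np.array(ds["label"])  # assumed: 1 = entailment, 0 = contradiction

for dim in (768, 512, 256, 128, 64):
    a, b = emb1[:, :dim], emb2[:, :dim]
    # Cosine similarity on the truncated vectors
    sims = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    preds = (sims > 0.5).astype(int)  # 0.5 is an arbitrary threshold for this sketch
    print(dim, accuracy_score(labels, preds), f1_score(labels, preds))
```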

## Quick Start

### Installation

```bash
pip install sentence-transformers torch
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('AhmedZaky1/arabic-bert-nli-matryoshka')

# Example sentences
sentences = [
    "الطقس جميل اليوم",      # "The weather is nice today"
    "إنه يوم مشمس وجميل",    # "It is a sunny, beautiful day"
    "أحب قراءة الكتب",       # "I love reading books"
]

# Generate embeddings (default: full 768 dimensions)
embeddings = model.encode(sentences)
print(f"Full embeddings shape: {embeddings.shape}")  # (3, 768)

# Use different dimensions by truncating
embeddings_256 = embeddings[:, :256]  # first 256 dimensions
embeddings_128 = embeddings[:, :128]  # first 128 dimensions
embeddings_64 = embeddings[:, :64]    # first 64 dimensions

print(f"256-dim embeddings shape: {embeddings_256.shape}")  # (3, 256)
```

### Similarity Computation

```python
from sentence_transformers import util

# Compute similarity between two sentences
sentence1 = "القطة تجلس على السجادة"   # "The cat sits on the rug"
sentence2 = "الكلب يلعب في الحديقة"    # "The dog plays in the garden"

embeddings = model.encode([sentence1, sentence2])
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### NLI Classification

```python
def classify_nli_pair(premise, hypothesis, threshold=0.6):
    """
    Classify the Natural Language Inference relationship between two sentences.

    Args:
        premise: The premise sentence
        hypothesis: The hypothesis sentence
        threshold: Cosine-similarity threshold for classification

    Returns:
        str: 'entailment' if similarity > threshold, else 'contradiction'
    """
    embeddings = model.encode([premise, hypothesis])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return 'entailment' if similarity > threshold else 'contradiction'

# Example usage
premise = "الرجل يقرأ كتاباً في المكتبة"   # "The man is reading a book in the library"
hypothesis = "شخص يقرأ في مكان هادئ"       # "Someone is reading in a quiet place"

result = classify_nli_pair(premise, hypothesis)
print(f"Relationship: {result}")
```
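The 0.6 threshold above is a reasonable default, not a tuned value. If you have a labeled dev set, a simple grid sweep picks a better one; `tune_threshold` below is a hypothetical helper (it reuses `model` and `util` from the snippets above):

```python
import numpy as np

def tune_threshold(pairs, labels, grid=None):
    """Pick the cosine threshold that maximizes accuracy on labeled pairs.

    pairs:  list of (premise, hypothesis) tuples
    labels: 1 for entailment, 0 for contradiction
    """
    grid = grid if grid is not None else np.arange(0.30, 0.90, 0.01)
    sims = np.array([
        util.cos_sim(*model.encode([p, h])).item() for p, h in pairs
    ])
    labels = np.array(labels)
    accs = [((sims > t).astype(int) == labels).mean() for t in grid]
    return float(grid[int(np.argmax(accs))])
```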

### Choosing the Right Dimension

- **768 dimensions**: Maximum accuracy for critical applications
- **512 dimensions**: Best balance of performance and efficiency
- **256 dimensions**: Good performance with 3x smaller vectors and faster similarity search
- **128 dimensions**: Suitable for real-time applications
- **64 dimensions**: Ultra-fast similarity search for large-scale processing
+
## Training Details
|
|
|
|
|
|
|
157 |
|
158 |
+
### Dataset
|
159 |
+
- **Training Data**: Arabic-NLI-Pair-Class dataset from Omartificial-Intelligence-Space
|
160 |
+
- **Language**: Modern Standard Arabic (MSA)
|
161 |
+
- **Task Type**: Binary classification (entailment vs contradiction)
|
162 |
|
163 |
+
### Training Configuration
|
164 |
+
- **Base Model**: aubmindlab/bert-base-arabertv02
|
165 |
+
- **Max Sequence Length**: 75 tokens
|
166 |
+
- **Batch Size**: 64
|
167 |
+
- **Epochs**: 5
|
168 |
+
- **Matryoshka Dimensions**: [768, 512, 256, 128, 64]
|
169 |
+
- **Loss Function**: MatryoshkaLoss with CosineSimilarityLoss
|
170 |
+
- **Optimization**: AdamW with automatic mixed precision (AMP)
|
171 |
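The exact training script is not published; the following is a minimal sketch of the configuration described above using the sentence-transformers `MatryoshkaLoss` wrapper. The trainer wiring and data preparation are assumptions.

```python
from sentence_transformers import SentenceTransformer, losses

# Start from the AraBERT base model (mean pooling is added automatically)
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Wrap the base loss so it is applied at every Matryoshka dimension
base_loss = losses.CosineSimilarityLoss(model)
loss = losses.MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)
# `loss` is then passed to SentenceTransformerTrainer / model.fit as usual
```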

## Use Cases

1. **Arabic Text Similarity**: Measure semantic similarity between Arabic texts
2. **Natural Language Inference**: Determine logical relationships between Arabic sentences
3. **Information Retrieval**: Find relevant Arabic documents based on queries
4. **Semantic Search**: Build Arabic search engines with semantic understanding (see the sketch after this list)
5. **Text Classification**: Use embeddings as features for downstream Arabic NLP tasks
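A small semantic-search sketch over a hypothetical Arabic corpus, using `util.semantic_search`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka")

corpus = [
    "القاهرة هي عاصمة مصر",        # "Cairo is the capital of Egypt"
    "كرة القدم رياضة شعبية",       # "Football is a popular sport"
    "البرمجة بلغة بايثون ممتعة",   # "Programming in Python is fun"
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "ما هي عاصمة مصر؟"          # "What is the capital of Egypt?"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))
```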

## Limitations

- Primarily trained on Modern Standard Arabic (MSA)
- Performance may vary on dialectal Arabic
- Optimized for shorter texts (up to 75 tokens)
- Binary classification focus (entailment/contradiction)

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{arabic-bert-nli-matryoshka,
  title={Arabic BERT NLI Matryoshka Embeddings},
  author={Ahmed Mouad},
  year={2025},
  url={https://huggingface.co/AhmedZaky1/arabic-bert-nli-matryoshka}
}
```

## Acknowledgments

- **AraBERT Team**: For the excellent base model (aubmindlab/bert-base-arabertv02)
- **Sentence Transformers**: For the robust training framework
- **Matryoshka Representation Learning**: For the innovative approach to nested embeddings
- **Arabic NLI Dataset**: Omartificial-Intelligence-Space for the training data

## License

This model is released under the Apache 2.0 License.

---

**Model Version**: 1.0
**Last Updated**: May 2025
**Framework**: sentence-transformers
**Language**: Arabic (العربية)
|