AhmedZaky1 committed on
Commit
e5ce13c
verified
1 Parent(s): a2a3654

Add comprehensive model card

Files changed (1)
  1. README.md +148 -390
README.md CHANGED
@@ -1,458 +1,216 @@
  ---
  tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
- - generated_from_trainer
- - dataset_size:457216
- - loss:MatryoshkaLoss
- - loss:MultipleNegativesRankingLoss
- base_model: aubmindlab/bert-base-arabertv02
- widget:
- - source_sentence: الناس يسيرون
-   sentences:
-   - شخصان يصعدان على الدرج
-   - الناس يجلسون
-   - رجل يجلس ويستمع للمحادثات
- - source_sentence: لاعب كرة قدم يرتدي زيًا أحمر وأسود يحمل الرقم 3 وخوذة سوداء يحمل
-     الكرة ويحيط به لاعبون معارضون يرتدون زيًا أبيض وأرجواني بيكسفيل.
-   sentences:
-   - لاعب كرة قدم يحمل كرة
-   - الرجل مستعد لالتقاط كرة القدم
-   - الكلاب بالخارج
- - source_sentence: بعثة لوس أنجلوس هي عيادة مجانية
-   sentences:
-   - إنها مساعدة ممرضة في بعثة لوس أنجلوس
-   - تعمل كطبيبة رئيسة في "لوس أنجلوس ميسيون" عيادة مجانية في حي فقير
-   - التوافق مطلوب من الأجهزة أو البرمجيات.
- - source_sentence: رجل يرتدي قميصًا بنيًا مخططًا يقف يثني ذراعيه على قمة مبنى على
-     سطح منزل.
-   sentences:
-   - رجل ينظر من نافذة المطبخ
-   - شخص على السطح
-   - لا يجوز إظهار أي مبلغ من الأصول في الميزانية العمومية للمهمة الفيدرالية
- - source_sentence: الحيوانات الأليفة تلعب دور الجدار
-   sentences:
-   - كلبان يلعبان في منطقة محصورة من الحصى.
-   - الكلاب تجري لالتقاط عصا عبر الشارع.
-   - يمكن تطوير التكنولوجيا.
- pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  metrics:
- - pearson_cosine
- - spearman_cosine
  model-index:
- - name: SentenceTransformer based on aubmindlab/bert-base-arabertv02
    results:
    - task:
-       type: semantic-similarity
-       name: Semantic Similarity
      dataset:
-       name: arabic nli dev
-       type: arabic-nli-dev
      metrics:
-     - type: pearson_cosine
-       value: 0.5891378532917348
-       name: Pearson Cosine
-     - type: spearman_cosine
-       value: 0.5933477548023721
-       name: Spearman Cosine
  ---

- # SentenceTransformer based on aubmindlab/bert-base-arabertv02

- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) <!-- at revision 016fb9d6768f522a59c6e0d2d5d5d43a4e1bff60 -->
- - **Maximum Sequence Length:** 75 tokens
- - **Output Dimensionality:** 768 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```

- ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:

  ```bash
- pip install -U sentence-transformers
  ```

- Then you can load this model and run inference.

  ```python
  from sentence_transformers import SentenceTransformer

- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
  sentences = [
-     'الحيوانات الأليفة تلعب دور الجدار',
-     'كلبان يلعبان في منطقة محصورة من الحصى.',
-     'الكلاب تجري لالتقاط عصا عبر الشارع.',
  ]
  embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 768]

- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
  ```

- <!--
- ### Direct Usage (Transformers)

- <details><summary>Click to see the direct usage in Transformers</summary>

- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)

- You can finetune this model on your own dataset.

- <details><summary>Click to expand</summary>

- </details>
- -->

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- ## Evaluation

- ### Metrics

- #### Semantic Similarity

- * Dataset: `arabic-nli-dev`
- * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

- | Metric              | Value      |
- |:--------------------|:-----------|
- | pearson_cosine      | 0.5891     |
- | **spearman_cosine** | **0.5933** |

- <!--
- ## Bias, Risks and Limitations

- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

- ## Training Details

- ### Training Dataset

- #### Unnamed Dataset

- * Size: 457,216 training samples
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
- * Approximate statistics based on the first 1000 samples:
-   |         | sentence_0 | sentence_1 | sentence_2 |
-   |:--------|:-----------|:-----------|:-----------|
-   | type    | string     | string     | string     |
-   | details | <ul><li>min: 4 tokens</li><li>mean: 12.5 tokens</li><li>max: 66 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.33 tokens</li><li>max: 68 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 9.59 tokens</li><li>max: 33 tokens</li></ul> |
- * Samples:
-   | sentence_0 | sentence_1 | sentence_2 |
-   |:-----------|:-----------|:-----------|
-   | <code>يجلس طفل أحمر الشعر ينظر من خلال السور إلى الماء بينما يلعب الناس على الشاطئ في المسافة.</code> | <code>طفل أحمر الشعر مهتم بالماء والناس يلعبون على الشاطئ في المسافة.</code> | <code>فتى شقراء يراقب القارب مع الناس عليه يبحر بعيدا.</code> |
-   | <code>عامل نظافة على وشك التنظيف في محطة القطار</code> | <code>البواب سيقوم بتنظيف محطة القطار</code> | <code>البواب يجلس في محطة القطار</code> |
-   | <code>رجل يرتدي قميصاً أخضر وبنطال جينز ينحني فوق مرمى الهوكي الأحمر مع ثقب فوقه.</code> | <code>رجل يرتدي قميصاً أخضر.</code> | <code>امرأة ترتدي قميصاً أخضر.</code> |
- * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
-   ```json
-   {
-       "loss": "MultipleNegativesRankingLoss",
-       "matryoshka_dims": [768, 512, 256, 128, 64],
-       "matryoshka_weights": [1, 1, 1, 1, 1],
-       "n_dims_per_step": -1
-   }
-   ```

- ### Training Hyperparameters
- #### Non-Default Hyperparameters

- - `eval_strategy`: steps
- - `per_device_train_batch_size`: 64
- - `per_device_eval_batch_size`: 64
- - `fp16`: True
- - `batch_sampler`: no_duplicates
- - `multi_dataset_batch_sampler`: round_robin

- #### All Hyperparameters
- <details><summary>Click to expand</summary>

- - `overwrite_output_dir`: False
- - `do_predict`: False
- - `eval_strategy`: steps
- - `prediction_loss_only`: True
- - `per_device_train_batch_size`: 64
- - `per_device_eval_batch_size`: 64
- - `per_gpu_train_batch_size`: None
- - `per_gpu_eval_batch_size`: None
- - `gradient_accumulation_steps`: 1
- - `eval_accumulation_steps`: None
- - `torch_empty_cache_steps`: None
- - `learning_rate`: 5e-05
- - `weight_decay`: 0.0
- - `adam_beta1`: 0.9
- - `adam_beta2`: 0.999
- - `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1
- - `num_train_epochs`: 3
- - `max_steps`: -1
- - `lr_scheduler_type`: linear
- - `lr_scheduler_kwargs`: {}
- - `warmup_ratio`: 0.0
- - `warmup_steps`: 0
- - `log_level`: passive
- - `log_level_replica`: warning
- - `log_on_each_node`: True
- - `logging_nan_inf_filter`: True
- - `save_safetensors`: True
- - `save_on_each_node`: False
- - `save_only_model`: False
- - `restore_callback_states_from_checkpoint`: False
- - `no_cuda`: False
- - `use_cpu`: False
- - `use_mps_device`: False
- - `seed`: 42
- - `data_seed`: None
- - `jit_mode_eval`: False
- - `use_ipex`: False
- - `bf16`: False
- - `fp16`: True
- - `fp16_opt_level`: O1
- - `half_precision_backend`: auto
- - `bf16_full_eval`: False
- - `fp16_full_eval`: False
- - `tf32`: None
- - `local_rank`: 0
- - `ddp_backend`: None
- - `tpu_num_cores`: None
- - `tpu_metrics_debug`: False
- - `debug`: []
- - `dataloader_drop_last`: False
- - `dataloader_num_workers`: 0
- - `dataloader_prefetch_factor`: None
- - `past_index`: -1
- - `disable_tqdm`: False
- - `remove_unused_columns`: True
- - `label_names`: None
- - `load_best_model_at_end`: False
- - `ignore_data_skip`: False
- - `fsdp`: []
- - `fsdp_min_num_params`: 0
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- - `tp_size`: 0
- - `fsdp_transformer_layer_cls_to_wrap`: None
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- - `deepspeed`: None
- - `label_smoothing_factor`: 0.0
- - `optim`: adamw_torch
- - `optim_args`: None
- - `adafactor`: False
- - `group_by_length`: False
- - `length_column_name`: length
- - `ddp_find_unused_parameters`: None
- - `ddp_bucket_cap_mb`: None
- - `ddp_broadcast_buffers`: False
- - `dataloader_pin_memory`: True
- - `dataloader_persistent_workers`: False
- - `skip_memory_metrics`: True
- - `use_legacy_prediction_loop`: False
- - `push_to_hub`: False
- - `resume_from_checkpoint`: None
- - `hub_model_id`: None
- - `hub_strategy`: every_save
- - `hub_private_repo`: None
- - `hub_always_push`: False
- - `gradient_checkpointing`: False
- - `gradient_checkpointing_kwargs`: None
- - `include_inputs_for_metrics`: False
- - `include_for_metrics`: []
- - `eval_do_concat_batches`: True
- - `fp16_backend`: auto
- - `push_to_hub_model_id`: None
- - `push_to_hub_organization`: None
- - `mp_parameters`:
- - `auto_find_batch_size`: False
- - `full_determinism`: False
- - `torchdynamo`: None
- - `ray_scope`: last
- - `ddp_timeout`: 1800
- - `torch_compile`: False
- - `torch_compile_backend`: None
- - `torch_compile_mode`: None
- - `dispatch_batches`: None
- - `split_batches`: None
- - `include_tokens_per_second`: False
- - `include_num_input_tokens_seen`: False
- - `neftune_noise_alpha`: None
- - `optim_target_modules`: None
- - `batch_eval_metrics`: False
- - `eval_on_start`: False
- - `use_liger_kernel`: False
- - `eval_use_gather_object`: False
- - `average_tokens_across_devices`: False
- - `prompts`: None
- - `batch_sampler`: no_duplicates
- - `multi_dataset_batch_sampler`: round_robin

- </details>

- ### Training Logs
- | Epoch  | Step  | Training Loss | arabic-nli-dev_spearman_cosine |
- |:------:|:-----:|:-------------:|:------------------------------:|
- | 0.1400 | 500   | 10.0831       | -                              |
- | 0.1999 | 714   | -             | 0.4417                         |
- | 0.2800 | 1000  | 5.6335        | -                              |
- | 0.3998 | 1428  | -             | 0.5157                         |
- | 0.4199 | 1500  | 4.7627        | -                              |
- | 0.5599 | 2000  | 4.3656        | -                              |
- | 0.5997 | 2142  | -             | 0.5443                         |
- | 0.6999 | 2500  | 4.085         | -                              |
- | 0.7996 | 2856  | -             | 0.5569                         |
- | 0.8399 | 3000  | 3.8314        | -                              |
- | 0.9798 | 3500  | 3.5961        | -                              |
- | 0.9994 | 3570  | -             | 0.5612                         |
- | 1.0    | 3572  | -             | 0.5617                         |
- | 1.1198 | 4000  | 3.2502        | -                              |
- | 1.1993 | 4284  | -             | 0.5819                         |
- | 1.2598 | 4500  | 3.1274        | -                              |
- | 1.3992 | 4998  | -             | 0.5848                         |
- | 1.3998 | 5000  | 3.0461        | -                              |
- | 1.5398 | 5500  | 2.9606        | -                              |
- | 1.5991 | 5712  | -             | 0.5930                         |
- | 1.6797 | 6000  | 2.9263        | -                              |
- | 1.7990 | 6426  | -             | 0.5906                         |
- | 1.8197 | 6500  | 2.8313        | -                              |
- | 1.9597 | 7000  | 2.7663        | -                              |
- | 1.9989 | 7140  | -             | 0.5868                         |
- | 2.0    | 7144  | -             | 0.5888                         |
- | 2.0997 | 7500  | 2.4814        | -                              |
- | 2.1988 | 7854  | -             | 0.5864                         |
- | 2.2396 | 8000  | 2.3545        | -                              |
- | 2.3796 | 8500  | 2.3052        | -                              |
- | 2.3987 | 8568  | -             | 0.5898                         |
- | 2.5196 | 9000  | 2.3227        | -                              |
- | 2.5985 | 9282  | -             | 0.5924                         |
- | 2.6596 | 9500  | 2.3185        | -                              |
- | 2.7984 | 9996  | -             | 0.5933                         |
- | 2.7996 | 10000 | 2.2571        | -                              |
- | 2.9395 | 10500 | 2.2335        | -                              |
- | 2.9983 | 10710 | -             | 0.5925                         |
- | 3.0    | 10716 | -             | 0.5933                         |

- ### Framework Versions
- - Python: 3.11.11
- - Sentence Transformers: 4.1.0
- - Transformers: 4.50.0.dev0
- - PyTorch: 2.6.0+cu124
- - Accelerate: 1.4.0
- - Datasets: 3.3.2
- - Tokenizers: 0.21.0

  ## Citation

- ### BibTeX

- #### Sentence Transformers
  ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
-     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-     author = "Reimers, Nils and Gurevych, Iryna",
-     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-     month = "11",
-     year = "2019",
-     publisher = "Association for Computational Linguistics",
-     url = "https://arxiv.org/abs/1908.10084",
  }
  ```

- #### MatryoshkaLoss
- ```bibtex
- @misc{kusupati2024matryoshka,
-     title={Matryoshka Representation Learning},
-     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
-     year={2024},
-     eprint={2205.13147},
-     archivePrefix={arXiv},
-     primaryClass={cs.LG}
- }
- ```

- #### MultipleNegativesRankingLoss
- ```bibtex
- @misc{henderson2017efficient,
-     title={Efficient Natural Language Response Suggestion for Smart Reply},
-     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
-     year={2017},
-     eprint={1705.00652},
-     archivePrefix={arXiv},
-     primaryClass={cs.CL}
- }
- ```

- <!--
- ## Glossary

- *Clearly define terms in order to be accessible across audiences.*
- -->

- <!--
- ## Model Card Authors

- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact

- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->

  ---
+ language:
+ - ar
  tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
+ - matryoshka
+ - arabic
+ - natural-language-inference
+ - bert
+ - nli
+ - arabert
+ datasets:
+ - Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class
  metrics:
+ - cosine_accuracy
+ - cosine_f1
+ - accuracy
+ - f1
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ base_model: aubmindlab/bert-base-arabertv02
+ license: apache-2.0
  model-index:
+ - name: Arabic BERT NLI Matryoshka
    results:
    - task:
+       type: natural-language-inference
+       name: Natural Language Inference
      dataset:
+       type: Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class
+       name: Arabic NLI Pair Classification
      metrics:
+     - type: accuracy
+       value: 0.8125
+       name: Best Accuracy (128 dim)
+     - type: f1
+       value: 0.8142
+       name: Best F1 (256 dim)
  ---

+ # Arabic BERT NLI Matryoshka Embeddings

+ ## Model Description

+ This model is a **Matryoshka representation learning** version of AraBERT fine-tuned specifically for Arabic Natural Language Inference (NLI). It generates embeddings that can be truncated to different dimensions (768, 512, 256, 128, or 64) while maintaining strong performance at every size.

+ The model is based on `aubmindlab/bert-base-arabertv02` and trained with the Matryoshka Representation Learning approach, which allows flexible embedding dimensions without retraining.

+ ## Key Features

+ - 🔄 **Flexible Dimensions**: A single model supports embeddings of size 768, 512, 256, 128, and 64
+ - 🚀 **High Performance**: Consistently outperforms the base model across all dimensions
+ - 📊 **Arabic NLI Optimized**: Trained specifically on Arabic Natural Language Inference data
+ - ⚡ **Efficient**: Smaller dimensions offer faster downstream processing with minimal performance loss
+ - 🎯 **Binary Classification**: Optimized for entailment-vs-contradiction classification

+ ## Performance Results

+ Our model shows significant improvements over the base AraBERT model across all embedding dimensions. The last column is the absolute F1 gain over the base model, in percentage points:

+ | Dimension | Matryoshka Accuracy | Base Accuracy | Matryoshka F1 | Base F1 | F1 Improvement |
+ |-----------|---------------------|---------------|---------------|---------|----------------|
+ | 768       | 80.3%               | 56.8%         | 81.15%        | 41.94%  | +39.21%        |
+ | 512       | 80.6%               | 56.9%         | 81.36%        | 44.32%  | +37.05%        |
+ | 256       | 80.95%              | 55.65%        | 81.42%        | 38.7%   | +42.72%        |
+ | 128       | 81.25%              | 56.7%         | 81.37%        | 40.6%   | +40.77%        |
+ | 64        | 81.0%               | 55.8%         | 80.51%        | 37.92%  | +42.59%        |

+ ## Quick Start

+ ### Installation

  ```bash
+ pip install sentence-transformers torch
  ```

+ ### Basic Usage

  ```python
  from sentence_transformers import SentenceTransformer

+ # Load the model
+ model = SentenceTransformer('AhmedZaky1/arabic-bert-nli-matryoshka')

+ # Example sentences
  sentences = [
+     "الطقس جميل اليوم",
+     "إنه يوم مشمس وجميل",
+     "أحب قراءة الكتب"
  ]

+ # Generate embeddings (default: full 768 dimensions)
  embeddings = model.encode(sentences)
+ print(f"Full embeddings shape: {embeddings.shape}")

+ # Use different dimensions by truncating
+ embeddings_256 = embeddings[:, :256]  # first 256 dimensions
+ embeddings_128 = embeddings[:, :128]  # first 128 dimensions
+ embeddings_64 = embeddings[:, :64]    # first 64 dimensions

+ print(f"256-dim embeddings shape: {embeddings_256.shape}")
  ```
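
+ If you would rather not slice manually, recent sentence-transformers releases (2.7 and later) accept a `truncate_dim` argument at load time, so every encoded vector already comes back truncated. A minimal sketch (the model ID is this repository; everything else is illustrative):

+ ```python
+ from sentence_transformers import SentenceTransformer

+ # Ask the library to truncate every encoded vector to 256 dimensions
+ model_256 = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka", truncate_dim=256)

+ embeddings = model_256.encode(["الطقس جميل اليوم"])
+ print(embeddings.shape)  # (1, 256)
+ ```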

+ ### Similarity Computation

+ ```python
+ from sentence_transformers import util

+ # Compute similarity between two sentences
+ sentence1 = "القطة تجلس على السجادة"
+ sentence2 = "الكلب يلعب في الحديقة"

+ embeddings = model.encode([sentence1, sentence2])
+ similarity = util.cos_sim(embeddings[0], embeddings[1])
+ print(f"Similarity: {similarity.item():.4f}")
+ ```

+ ### NLI Classification

+ ```python
+ def classify_nli_pair(premise, hypothesis, threshold=0.6):
+     """
+     Classify the Natural Language Inference relationship between two sentences.

+     Args:
+         premise: The premise sentence
+         hypothesis: The hypothesis sentence
+         threshold: Similarity threshold for classification

+     Returns:
+         str: 'entailment' if similarity > threshold, else 'contradiction'
+     """
+     embeddings = model.encode([premise, hypothesis])
+     similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

+     return 'entailment' if similarity > threshold else 'contradiction'

+ # Example usage
+ premise = "الرجل يقرأ كتاباً في المكتبة"
+ hypothesis = "شخص يقرأ في مكان هادئ"

+ result = classify_nli_pair(premise, hypothesis)
+ print(f"Relationship: {result}")
+ ```
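
+ The default `threshold=0.6` is only a starting point; the best cut-off depends on your data and on the embedding dimension you use. One common approach (not part of this model's training procedure) is to sweep thresholds on a small labeled dev set and keep the one with the best F1. A self-contained sketch, with illustrative placeholder pairs:

+ ```python
+ import numpy as np
+ from sklearn.metrics import f1_score
+ from sentence_transformers import SentenceTransformer, util

+ model = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka")

+ # Tiny illustrative dev set: 1 = entailment, 0 = contradiction
+ dev_pairs = [
+     ("الرجل يقرأ كتاباً في المكتبة", "شخص يقرأ في مكان هادئ", 1),
+     ("الطقس جميل اليوم", "الطقس سيء للغاية", 0),
+ ]

+ premises = model.encode([p for p, _, _ in dev_pairs])
+ hypotheses = model.encode([h for _, h, _ in dev_pairs])
+ labels = np.array([y for _, _, y in dev_pairs])

+ # Cosine similarity of each aligned (premise, hypothesis) pair
+ sims = util.cos_sim(premises, hypotheses).diagonal().numpy()

+ # Sweep candidate thresholds and keep the one with the best F1
+ thresholds = np.linspace(0.1, 0.9, 81)
+ best = max(thresholds, key=lambda t: f1_score(labels, sims > t))
+ print(f"Best threshold: {best:.2f}")
+ ```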

+ ### Choosing the Right Dimension

+ - **768 dimensions**: Maximum accuracy for critical applications
+ - **512 dimensions**: Best balance of performance and efficiency
+ - **256 dimensions**: Good performance with 3× smaller vectors, speeding up similarity search and cutting storage
+ - **128 dimensions**: Suitable for real-time applications
+ - **64 dimensions**: Ultra-fast similarity search for large-scale processing

+ ## Training Details

+ ### Dataset
+ - **Training Data**: Arabic-NLI-Pair-Class dataset from Omartificial-Intelligence-Space
+ - **Language**: Modern Standard Arabic (MSA)
+ - **Task Type**: Binary classification (entailment vs. contradiction)

+ ### Training Configuration
+ - **Base Model**: aubmindlab/bert-base-arabertv02
+ - **Max Sequence Length**: 75 tokens
+ - **Batch Size**: 64
+ - **Epochs**: 5
+ - **Matryoshka Dimensions**: [768, 512, 256, 128, 64]
+ - **Loss Function**: MatryoshkaLoss with CosineSimilarityLoss (see the sketch after this list)
+ - **Optimization**: AdamW with automatic mixed precision (AMP)
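
+ A minimal sketch of this configuration using the sentence-transformers training API. The two pairs and the label mapping (1.0 = entailment, 0.0 = contradiction) are illustrative assumptions; the actual run used the Arabic-NLi-Pair-Class dataset listed above:

+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import InputExample, SentenceTransformer, losses

+ model = SentenceTransformer("aubmindlab/bert-base-arabertv02")
+ model.max_seq_length = 75

+ # Binary NLI pairs mapped to similarity labels (illustrative placeholders)
+ train_examples = [
+     InputExample(texts=["الطقس جميل اليوم", "إنه يوم مشمس وجميل"], label=1.0),
+     InputExample(texts=["الطقس جميل اليوم", "الطقس سيء للغاية"], label=0.0),
+ ]
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

+ # MatryoshkaLoss applies the inner loss at every truncated dimension
+ inner_loss = losses.CosineSimilarityLoss(model)
+ train_loss = losses.MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

+ model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, use_amp=True)
+ ```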

+ ## Use Cases

+ 1. **Arabic Text Similarity**: Measure semantic similarity between Arabic texts
+ 2. **Natural Language Inference**: Determine logical relationships between Arabic sentences
+ 3. **Information Retrieval**: Find relevant Arabic documents based on queries
+ 4. **Semantic Search**: Build Arabic search engines with semantic understanding (see the sketch after this list)
+ 5. **Text Classification**: Use embeddings as features for downstream Arabic NLP tasks
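
+ For semantic search, `sentence_transformers.util.semantic_search` works directly with these embeddings, and combining it with `truncate_dim` keeps large corpora cheap to index. A small sketch (the corpus and query are illustrative):

+ ```python
+ from sentence_transformers import SentenceTransformer, util

+ # 128 dimensions: a good fit for large corpora
+ model = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka", truncate_dim=128)

+ corpus = [
+     "القطة تجلس على السجادة",
+     "الرجل يقرأ كتاباً في المكتبة",
+     "الطقس جميل اليوم",
+ ]
+ corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

+ query_embedding = model.encode("شخص يقرأ في مكان هادئ", convert_to_tensor=True)

+ # Top-2 nearest corpus sentences by cosine similarity
+ hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
+ for hit in hits[0]:
+     print(corpus[hit["corpus_id"]], round(hit["score"], 4))
+ ```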
 
+ ## Limitations

+ - Primarily trained on Modern Standard Arabic (MSA)
+ - Performance may vary on dialectal Arabic
+ - Optimized for shorter texts (up to 75 tokens)
+ - Binary classification focus (entailment/contradiction)

  ## Citation

+ If you use this model in your research, please cite:

  ```bibtex
+ @misc{arabic-bert-nli-matryoshka,
+   title={Arabic BERT NLI Matryoshka Embeddings},
+   author={Ahmed Mouad},
+   year={2025},
+   url={https://huggingface.co/AhmedZaky1/arabic-bert-nli-matryoshka}
  }
  ```

+ ## Acknowledgments

+ - **AraBERT Team**: For the excellent base model (aubmindlab/bert-base-arabertv02)
+ - **Sentence Transformers**: For the robust training framework
+ - **Matryoshka Representation Learning**: For the innovative approach to nested embeddings
+ - **Arabic NLI Dataset**: Omartificial-Intelligence-Space for the training data

+ ## License

+ This model is released under the Apache 2.0 License.

+ ---

+ **Model Version**: 1.0
+ **Last Updated**: May 2025
+ **Framework**: sentence-transformers
+ **Language**: Arabic (العربية)