|
--- |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
language: |
|
- en |
|
- ar |
|
base_model: |
|
- BAAI/bge-m3 |
|
--- |
|
|
|
# mhaseeb1604/bge-m3-law |
|
|
|
This model is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), specialized for sentence similarity tasks on legal texts in both Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space, useful for tasks such as clustering, semantic search, and more.
|
|
|
## Model Overview |
|
|
|
- **Architecture**: Built with [sentence-transformers](https://www.SBERT.net) on top of an XLM-RoBERTa-based BGE-M3 backbone.
|
- **Training Data**: Fine-tuned on a large legal dataset containing bilingual Arabic and English data.
|
- **Embedding Size**: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text. |
|
- **Applications**: Ideal for legal applications, such as semantic similarity comparisons, document clustering, and retrieval in a bilingual Arabic-English legal context. |
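
As a quick sanity check, the embedding dimension can be read off the loaded model. A minimal sketch (installation and usage details follow in the next sections):

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Should print 1024, matching the embedding size listed above
print(model.get_sentence_embedding_dimension())
```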
|
|
|
## Installation |
|
|
|
To use this model, you need to have the `sentence-transformers` library installed. You can install it via pip: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
## Usage |
|
|
|
You can easily load and use this model in Python with the following code: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Load the model |
|
model = SentenceTransformer('mhaseeb1604/bge-m3-law') |
|
|
|
# Sample sentences |
|
sentences = ["This is an example sentence", "Each sentence is converted"] |
|
|
|
# Generate embeddings |
|
embeddings = model.encode(sentences) |
|
|
|
# Output embeddings |
|
print(embeddings) |
|
``` |
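
Because the model targets bilingual Arabic-English legal text, a typical next step is scoring cross-lingual similarity. Below is a minimal sketch using `sentence_transformers.util.cos_sim`; the two sentences are illustrative placeholders, not taken from the training data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# An English legal sentence and an Arabic sentence with similar meaning (illustrative only)
english_sentence = "The contract shall terminate upon breach of its terms."
arabic_sentence = "ينتهي العقد عند الإخلال بشروطه."

# Encode both sentences; the model's Normalize layer L2-normalizes the embeddings
emb_en = model.encode(english_sentence, convert_to_tensor=True)
emb_ar = model.encode(arabic_sentence, convert_to_tensor=True)

# Cosine similarity between the two embeddings
score = util.cos_sim(emb_en, emb_ar)
print(score.item())
```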
|
|
|
## Model Training |
|
|
|
The model was fine-tuned on Arabic and English legal texts using the following configuration; a minimal sketch reproducing this setup follows the list:
|
|
|
- **DataLoader**: |
|
- Batch size: 4 |
|
- Sampler: SequentialSampler |
|
- **Loss Function**: `MultipleNegativesRankingLoss` with cosine similarity. |
|
- **Optimizer**: AdamW with learning rate `2e-05`. |
|
- **Training Parameters**: |
|
- Epochs: 2 |
|
- Warmup Steps: 20 |
|
- Weight Decay: 0.01 |
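
The original training script is not included in this repository, but a hedged sketch of an equivalent setup using the library's `fit` API (with placeholder bilingual positive pairs, since the actual dataset is not distributed) would look roughly like this:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model being fine-tuned
model = SentenceTransformer('BAAI/bge-m3')

# Placeholder Arabic-English positive pairs; the real legal dataset is not distributed here
train_examples = [
    InputExample(texts=["The contract shall terminate upon breach.", "ينتهي العقد عند الإخلال بشروطه."]),
    InputExample(texts=["The court dismissed the appeal.", "رفضت المحكمة الاستئناف."]),
]

# Batch size 4; shuffle=False keeps sequential sampling, as listed above
train_dataloader = DataLoader(train_examples, shuffle=False, batch_size=4)

# MultipleNegativesRankingLoss uses cosine similarity by default
train_loss = losses.MultipleNegativesRankingLoss(model)

# AdamW is the default optimizer for fit(); learning rate and weight decay match the list above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=20,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
)
```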
|
|
|
## Full Model Architecture |
|
|
|
This model consists of three main components: |
|
|
|
```python |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) - XLMRobertaModel |
|
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False}) |
|
(2): Normalize() |
|
) |
|
``` |
|
|
|
- **Transformer Layer**: Uses the XLM-RoBERTa model with a maximum sequence length of 8192 tokens.
|
- **Pooling Layer**: Utilizes CLS token pooling to generate sentence embeddings. |
|
- **Normalization Layer**: L2-normalizes the output vectors for better performance in similarity tasks.
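
Equivalently, a model with this structure can be assembled from `sentence_transformers.models` building blocks. The following is a minimal sketch, assuming the `BAAI/bge-m3` backbone used for this fine-tune:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone (XLM-RoBERTa-based BGE-M3) with an 8192-token max sequence length
word_embedding_model = models.Transformer('BAAI/bge-m3', max_seq_length=8192)

# CLS-token pooling, matching the configuration shown above
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 1024
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

# L2 normalization of the output embeddings
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
```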
|
|
|
## Citing & Authors |
|
|
|
If you find this repository useful, please consider giving it a star and a citation:
|
|
|
```bibtex |
|
@misc {muhammad_haseeb_2024, |
|
author = { {Muhammad Haseeb} }, |
|
title = { bge-m3-law (Revision 2fc0289) }, |
|
year = 2024, |
|
url = { https://huggingface.co/mhaseeb1604/bge-m3-law }, |
|
doi = { 10.57967/hf/3217 }, |
|
publisher = { Hugging Face } |
|
} |
|
``` |
|
|
|