---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
language:
- en
- ar
base_model:
- BAAI/bge-m3
---
# mhaseeb1604/bge-m3-law
This model is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), specialized for sentence-similarity tasks on legal texts in Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for clustering, semantic search, and related tasks.
## Model Overview
- **Architecture**: Based on [sentence-transformers](https://www.SBERT.net).
- **Training Data**: Fine-tuned on a large legal dataset containing bilingual Arabic-English data.
- **Embedding Size**: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text.
- **Applications**: Ideal for legal applications, such as semantic similarity comparisons, document clustering, and retrieval in a bilingual Arabic-English legal context.
## Installation
To use this model, you need to have the `sentence-transformers` library installed. You can install it via pip:
```bash
pip install -U sentence-transformers
```
## Usage
You can easily load and use this model in Python with the following code:
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')
# Sample sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
# Generate embeddings
embeddings = model.encode(sentences)
# Output embeddings
print(embeddings)
```
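Because the model is bilingual, a common use is scoring cross-lingual similarity between Arabic and English legal passages. The snippet below is a minimal sketch using the standard `sentence_transformers.util.cos_sim` helper; the example sentences are invented for illustration and are not taken from the training data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Illustrative Arabic-English pair (invented examples)
english = "The tenant must vacate the property within thirty days of notice."
arabic = "يجب على المستأجر إخلاء العقار خلال ثلاثين يوماً من الإشعار."

# Embeddings are L2-normalized by the model's final Normalize layer,
# so cosine similarity and dot product give the same score.
embeddings = model.encode([english, arabic])
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```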
## Model Training
The model was fine-tuned on Arabic and English legal texts with the following configuration (a reproduction sketch follows the list):
- **DataLoader**:
- Batch size: 4
- Sampler: SequentialSampler
- **Loss Function**: `MultipleNegativesRankingLoss` with cosine similarity.
- **Optimizer**: AdamW with learning rate `2e-05`.
- **Training Parameters**:
- Epochs: 2
- Warmup Steps: 20
- Weight Decay: 0.01
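The sketch below shows how a comparable fine-tuning run could be set up with the hyperparameters listed above, using the classic `model.fit` API from sentence-transformers. The training pairs are placeholders, not the author's actual dataset or script.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder bilingual query/passage pairs; the real legal dataset is not published here.
train_examples = [
    InputExample(texts=["What is the notice period for eviction?",
                        "يجب على المستأجر إخلاء العقار خلال ثلاثين يوماً من الإشعار."]),
    InputExample(texts=["Penalty for late payment of rent",
                        "تفرض غرامة مالية على التأخر في سداد الإيجار."]),
]

model = SentenceTransformer("BAAI/bge-m3")

# shuffle=False keeps the SequentialSampler mentioned in the card
train_dataloader = DataLoader(train_examples, batch_size=4, shuffle=False)
# In-batch negatives with cosine-based scoring (the loss's default similarity function)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=20,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
)
```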
## Full Model Architecture
This model consists of three main components:
```python
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) - XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False})
(2): Normalize()
)
```
- **Transformer Layer**: Uses XLM-Roberta model with a max sequence length of 8192.
- **Pooling Layer**: Utilizes CLS token pooling to generate sentence embeddings.
- **Normalization Layer**: Ensures normalized output vectors for better performance in similarity tasks.
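Because of the final `Normalize` module, every embedding has unit length, so dot products between embeddings are already cosine similarities. A quick check (numpy ships as a dependency of sentence-transformers):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mhaseeb1604/bge-m3-law")
emb = model.encode(["An example legal clause"])

print(emb.shape)               # (1, 1024) — the embedding size reported above
print(np.linalg.norm(emb[0]))  # ~1.0, thanks to the Normalize layer
```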
## Citing & Authors
If you find this repository useful, please consider giving it a star and citing it:
```bibtex
@misc{muhammad_haseeb_2024,
  author    = {Muhammad Haseeb},
  title     = {bge-m3-law (Revision 2fc0289)},
  year      = 2024,
  url       = {https://huggingface.co/mhaseeb1604/bge-m3-law},
  doi       = {10.57967/hf/3217},
  publisher = {Hugging Face}
}
```