---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
language:
- en
- ar
base_model:
- BAAI/bge-m3
---

# mhaseeb1604/bge-m3-law

This model is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), specialized for sentence-similarity tasks on legal texts in both Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for tasks such as clustering and semantic search.

## Model Overview

- **Architecture**: Based on [sentence-transformers](https://www.SBERT.net).
- **Training Data**: Trained on a large Arabic law dataset containing bilingual data in Arabic and English.
- **Embedding Size**: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text.
- **Applications**: Ideal for legal applications such as semantic similarity comparison, document clustering, and retrieval in a bilingual Arabic-English legal context.

## Installation

To use this model, you need the `sentence-transformers` library, which you can install via pip:

```bash
pip install -U sentence-transformers
```

## Usage

You can load and use this model in Python with the following code:

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Sample sentences
sentences = ["This is an example sentence", "Each sentence is converted"]

# Generate embeddings
embeddings = model.encode(sentences)

# Output embeddings
print(embeddings)
```

## Model Training

The model was fine-tuned on Arabic and English legal texts with the following configuration:

- **DataLoader**:
  - Batch size: 4
  - Sampler: `SequentialSampler`
- **Loss Function**: `MultipleNegativesRankingLoss` with cosine similarity.
- **Optimizer**: AdamW with learning rate `2e-05`.
- **Training Parameters**:
  - Epochs: 2
  - Warmup steps: 20
  - Weight decay: 0.01

## Full Model Architecture

The model consists of three main components:

```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False})
  (2): Normalize()
)
```

- **Transformer layer**: XLM-RoBERTa model with a maximum sequence length of 8192 tokens.
- **Pooling layer**: CLS-token pooling to generate sentence embeddings.
- **Normalization layer**: Normalizes the output vectors, so embeddings can be compared directly with cosine similarity.

## Citing & Authors

If you find this repository useful, please consider giving it a star and citing it:

```bibtex
@misc{muhammad_haseeb_2024,
  author    = { {Muhammad Haseeb} },
  title     = { bge-m3-law (Revision 2fc0289) },
  year      = 2024,
  url       = { https://huggingface.co/mhaseeb1604/bge-m3-law },
  doi       = { 10.57967/hf/3217 },
  publisher = { Hugging Face }
}
```
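
## Example: Bilingual Similarity

As a minimal sketch of the bilingual legal use case described above, the snippet below scores English candidate passages against an Arabic query with cosine similarity via `sentence_transformers.util.cos_sim`. The query and passages are illustrative placeholders, not taken from the training data.

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Hypothetical Arabic legal query:
# "What is the notice period required to terminate an employment contract?"
query = "ما هي مدة الإشعار المطلوبة لإنهاء عقد العمل؟"

# Hypothetical English candidate passages (placeholders for illustration only)
passages = [
    "The employer must provide written notice before terminating the employment contract.",
    "The court of appeal reviews judgments issued by courts of first instance.",
]

# Encode the query and passages into 1024-dimensional embeddings
query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)

# Cosine similarity between the query and each passage
scores = util.cos_sim(query_embedding, passage_embeddings)
print(scores)  # higher score = more semantically similar
```

The passage most relevant to the query should receive the highest similarity score, which is the basis for using the model in bilingual legal retrieval.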