|
--- |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
language: |
|
- en |
|
- ar |
|
base_model: |
|
- BAAI/bge-m3 |
|
--- |
|
|
|
# mhaseeb1604/bge-m3-law |
|
|
|
This model is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), specialized for sentence similarity tasks on legal texts in both Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space, useful for tasks such as clustering, semantic search, and more.
|
|
|
## Model Overview |
|
|
|
- **Architecture**: Built with [sentence-transformers](https://www.SBERT.net) on top of an XLM-RoBERTa-based BGE-M3 backbone.
|
- **Training Data**: Fine-tuned on a large legal dataset containing bilingual Arabic and English data.
|
- **Embedding Size**: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text. |
|
- **Applications**: Ideal for legal applications, such as semantic similarity comparisons, document clustering, and retrieval in a bilingual Arabic-English legal context. |
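
As a quick sanity check, the embedding dimension can be read off the loaded model. A minimal sketch (installation and usage details follow in the next sections):

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Should print 1024, matching the embedding size listed above
print(model.get_sentence_embedding_dimension())
```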
|
|
|
## Installation |
|
|
|
To use this model, you need to have the `sentence-transformers` library installed. You can install it via pip: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
## Usage |
|
|
|
You can easily load and use this model in Python with the following code: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Load the model |
|
model = SentenceTransformer('mhaseeb1604/bge-m3-law') |
|
|
|
# Sample sentences |
|
sentences = ["This is an example sentence", "Each sentence is converted"] |
|
|
|
# Generate embeddings |
|
embeddings = model.encode(sentences) |
|
|
|
# Output embeddings |
|
print(embeddings) |
|
``` |
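
Because the model targets bilingual Arabic-English legal text, a typical next step is scoring cross-lingual similarity. Below is a minimal sketch using `sentence_transformers.util.cos_sim`; the two sentences are illustrative placeholders, not taken from the training data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# An English legal sentence and an Arabic sentence with similar meaning (illustrative only)
english_sentence = "The contract shall terminate upon breach of its terms."
arabic_sentence = "ينتهي العقد عند الإخلال بشروطه."

# Encode both sentences; the model's Normalize layer L2-normalizes the embeddings
emb_en = model.encode(english_sentence, convert_to_tensor=True)
emb_ar = model.encode(arabic_sentence, convert_to_tensor=True)

# Cosine similarity between the two embeddings
score = util.cos_sim(emb_en, emb_ar)
print(score.item())
```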
|
|
|
## Model Training |
|
|
|
The model was fine-tuned on Arabic and English legal texts using the following configuration; a minimal sketch reproducing this setup follows the list:
|
|
|
- **DataLoader**: |
|
- Batch size: 4 |
|
- Sampler: SequentialSampler |
|
- **Loss Function**: `MultipleNegativesRankingLoss` with cosine similarity. |
|
- **Optimizer**: AdamW with learning rate `2e-05`. |
|
- **Training Parameters**: |
|
- Epochs: 2 |
|
- Warmup Steps: 20 |
|
- Weight Decay: 0.01 |
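
The original training script is not included in this repository, but a hedged sketch of an equivalent setup using the library's `fit` API (with placeholder bilingual positive pairs, since the actual dataset is not distributed) would look roughly like this:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model being fine-tuned
model = SentenceTransformer('BAAI/bge-m3')

# Placeholder Arabic-English positive pairs; the real legal dataset is not distributed here
train_examples = [
    InputExample(texts=["The contract shall terminate upon breach.", "ينتهي العقد عند الإخلال بشروطه."]),
    InputExample(texts=["The court dismissed the appeal.", "رفضت المحكمة الاستئناف."]),
]

# Batch size 4; shuffle=False keeps sequential sampling, as listed above
train_dataloader = DataLoader(train_examples, shuffle=False, batch_size=4)

# MultipleNegativesRankingLoss uses cosine similarity by default
train_loss = losses.MultipleNegativesRankingLoss(model)

# AdamW is the default optimizer for fit(); learning rate and weight decay match the list above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=20,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
)
```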
|
|
|
## Full Model Architecture |
|
|
|
This model consists of three main components: |
|
|
|
```python |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) - XLMRobertaModel |
|
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False}) |
|
(2): Normalize() |
|
) |
|
``` |
|
|
|
- **Transformer Layer**: Uses the XLM-RoBERTa model with a maximum sequence length of 8192 tokens.
|
- **Pooling Layer**: Utilizes CLS token pooling to generate sentence embeddings. |
|
- **Normalization Layer**: L2-normalizes the output vectors for better performance in similarity tasks.
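
Equivalently, a model with this structure can be assembled from `sentence_transformers.models` building blocks. The following is a minimal sketch, assuming the `BAAI/bge-m3` backbone used for this fine-tune:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone (XLM-RoBERTa-based BGE-M3) with an 8192-token max sequence length
word_embedding_model = models.Transformer('BAAI/bge-m3', max_seq_length=8192)

# CLS-token pooling, matching the configuration shown above
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 1024
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

# L2 normalization of the output embeddings
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
```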
|
|
|
## Citing & Authors |
|
|
|
If you find this repository useful, please consider giving it a star and a citation:
|
|
|
```bibtex |
|
@misc {muhammad_haseeb_2024, |
|
author = { {Muhammad Haseeb} }, |
|
title = { bge-m3-law (Revision 2fc0289) }, |
|
year = 2024, |
|
url = { https://huggingface.co/mhaseeb1604/bge-m3-law }, |
|
doi = { 10.57967/hf/3217 }, |
|
publisher = { Hugging Face } |
|
} |
|
``` |
|
|
|