SentenceTransformer based on intfloat/multilingual-e5-small

This is a sentence-transformers model finetuned from intfloat/multilingual-e5-small. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-small
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Model Size: 118M parameters (F32 safetensors)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
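
The Pooling module above applies attention-masked mean pooling over the token embeddings produced by the BertModel encoder. For readers who want the equivalent computation outside Sentence Transformers, below is a minimal sketch using the plain transformers library; the mean_pool helper is illustrative and not part of this repository.

import torch
from transformers import AutoTokenizer, AutoModel

# Load the underlying encoder directly (Sentence Transformers repositories
# also ship standard transformers weights).
tokenizer = AutoTokenizer.from_pretrained("josangho99/ko-multilingual-e5-small-multiTask")
encoder = AutoModel.from_pretrained("josangho99/ko-multilingual-e5-small-multiTask")

def mean_pool(last_hidden_state, attention_mask):
    # Mask out padding tokens, then average the remaining token embeddings,
    # mirroring pooling_mode_mean_tokens=True above.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(
    ["여자와 아이가 게임을 하고 있다."],  # "A woman and a child are playing a game."
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    out = encoder(**batch)
embedding = mean_pool(out.last_hidden_state, batch["attention_mask"])
print(embedding.shape)  # torch.Size([1, 384])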

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("josangho99/ko-multilingual-e5-small-multiTask")
# Run inference
sentences = [
    '여자와 아이가 게임을 하고 있다.',  # "A woman and a child are playing a game."
    '여자와 어린 아이는 보드게임을 하면서 즐거운 시간을 보낸다.',  # "A woman and a young child have a good time playing a board game."
    '한 여성과 그녀의 아이가 영화를 보고 있다.',  # "A woman and her child are watching a movie."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6531, 0.4611],
#         [0.6531, 1.0000, 0.3023],
#         [0.4611, 0.3023, 1.0000]])
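
The same embeddings also support a simple semantic-search loop: encode a corpus once, encode each query, and rank by cosine similarity. The sketch below is illustrative only (the corpus and query strings are made up) and reuses the model loaded above.

corpus = [
    "여자와 어린 아이는 보드게임을 하면서 즐거운 시간을 보낸다.",
    "한 여성과 그녀의 아이가 영화를 보고 있다.",
    "두 남자가 공원에서 축구를 한다.",  # "Two men are playing soccer in a park." (made-up distractor)
]
query = "여자와 아이가 게임을 하고 있다."

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# model.similarity uses cosine similarity, this model's configured similarity function
scores = model.similarity(query_embedding, corpus_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.4f}  {corpus[idx]}")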

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine 0.8491
spearman_cosine 0.8481
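
These values are the Pearson and Spearman correlations between the model's cosine similarities and gold scores on an STS-style dev set (the sts-dev column in the Training Logs below). A comparable measurement can be reproduced with the library's EmbeddingSimilarityEvaluator, as in the hedged sketch below; the sentence pairs and gold scores shown are placeholders, not the actual dev data.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("josangho99/ko-multilingual-e5-small-multiTask")

# Placeholder STS-style pairs with gold similarity scores in [0, 1]
sentences1 = ["여자와 아이가 게임을 하고 있다.", "한 남자가 기타를 치고 있다."]
sentences2 = ["여자와 어린 아이는 보드게임을 한다.", "한 남자가 피아노를 치고 있다."]
gold_scores = [0.9, 0.4]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
results = evaluator(model)
print(results)  # includes sts-dev_pearson_cosine and sts-dev_spearman_cosine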

Training Details

Training Datasets

Unnamed Dataset

  • Size: 568,640 training samples
  • Columns: sentence_0, sentence_1, and sentence_2
  • Approximate statistics based on the first 1000 samples:
    • sentence_0: string; min 4 tokens, mean 19.75 tokens, max 99 tokens
    • sentence_1: string; min 4 tokens, mean 19.98 tokens, max 128 tokens
    • sentence_2: string; min 4 tokens, mean 14.82 tokens, max 47 tokens
  • Samples (sentence_0 | sentence_1 | sentence_2):
    • 실버민 베이에는 섬의 다른 지역으로 가는 노선이 있는 버스 터미널이 있다. | 페리는 오전 6시 10분에서 오후 10시 30분 사이에 2시간마다 센트럴에서 실버 마인 베이 (Mui Wo)로 출발하며 버스 터미널에는 섬의 모든 지역으로 버스가 있습니다. | 실버민 만의 버스는 섬의 한 부분으로만 간다.
    • 사람이 사형선고를 받으면 국가가 형을 집행하고 그 사람을 사형에 처해야 하는데 항소절차가 없어야 한다. | 어쨌든 내가 4 년 동안 주 교도소 시스템과 함께 일한 후, 당신이 그 사람에게 사형 선고를 계속하고 문장을 이해할 수있게 해주겠다고 결론을 내렸다는 결론은 누군가에게 사형 선고를 내리면 30 일 이내에 항소에 실패하면 한 번의 항소를 받게됩니다. | 그것에 대해 사형수라면 형량이 감형될 때까지 원하는 만큼 항소할 수 있도록 해야 한다.
    • 웹 사이트 기능은 불편한 자원 봉사자를 지원합니다. | 두 번째 웹 사이트 기능은 새로운 법률 분야에서 불편하고 지침과 방향이 필요한 자원 봉사자를 지원합니다. | 두 번째 웹사이트 기능은 자원봉사자들에게 무료 아이스캔디를 제공한다.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
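
For reference, the parameters listed above map onto the loss constructor roughly as sketched below; util.cos_sim is the cosine-similarity function that "cos_sim" refers to.

from sentence_transformers import SentenceTransformer, util
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-small")
# scale=20.0 and cosine similarity match the listed parameters; each
# (anchor, positive, negative) row also treats other in-batch positives
# and negatives as additional negatives.
loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)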
    

Unnamed Dataset

  • Size: 5,749 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence_0: string; min 6 tokens, mean 19.17 tokens, max 54 tokens
    • sentence_1: string; min 5 tokens, mean 19.06 tokens, max 63 tokens
    • label: float; min 0.0, mean 0.54, max 1.0
  • Samples (sentence_0 | sentence_1 | label):
    • 케냐 경찰 체포 키 알샤바브 신병 모집인 금융가 | 케냐 : 대테러 경찰에 의해 체포된 쌍 | 0.64
    • "요정은 존재하지 않는다" - 좋아. | "레프리콘은 존재하지 않는다" - 좋아. | 0.2
    • 당신이 필요로 하는 모든 것에 대한 가격은 인플레이션 때문에 상승했다. | 당신이 소유한 모든 IE 자산의 가격이 하락했다. | 0.4
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
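
CosineSimilarityLoss regresses the cosine similarity of the two embeddings onto the gold label using the MSELoss listed above. A hedged sketch of how the two datasets and their losses could be wired together for multi-task training with SentenceTransformerTrainer follows; the tiny in-memory datasets are stand-ins for the real training data.

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CosineSimilarityLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Toy stand-ins for the two unnamed training datasets described above
triplet_dataset = Dataset.from_dict({
    "sentence_0": ["여자와 아이가 게임을 하고 있다."],
    "sentence_1": ["여자와 어린 아이는 보드게임을 한다."],
    "sentence_2": ["한 여성과 그녀의 아이가 영화를 보고 있다."],
})
sts_dataset = Dataset.from_dict({
    "sentence_0": ["\"요정은 존재하지 않는다\" - 좋아."],
    "sentence_1": ["\"레프리콘은 존재하지 않는다\" - 좋아."],
    "label": [0.2],
})

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset={"triplets": triplet_dataset, "sts": sts_dataset},
    loss={
        "triplets": MultipleNegativesRankingLoss(model, scale=20.0),
        "sts": CosineSimilarityLoss(model),  # MSE between cos(u, v) and the label
    },
)
trainer.train()  # optimizes each loss on its own dataset, per the configured batch sampler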
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • num_train_epochs: 5
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: round_robin
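
These settings correspond to SentenceTransformerTrainingArguments roughly as sketched below; output_dir is a placeholder, and the batch size of 8 is the value listed under All Hyperparameters.

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs/ko-multilingual-e5-small-multiTask",  # placeholder path
    eval_strategy="steps",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)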

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch  | Step | Training Loss | sts-dev_spearman_cosine
0.3477 | 500  | 0.3954        | -
0.6954 | 1000 | 0.3058        | 0.8398
1.0    | 1438 | -             | 0.8456
1.0431 | 1500 | 0.2945        | -
1.3908 | 2000 | 0.2003        | 0.8440
1.7385 | 2500 | 0.1785        | -
2.0    | 2876 | -             | 0.8459
2.0862 | 3000 | 0.1774        | 0.8471
2.4339 | 3500 | 0.1271        | -
2.7816 | 4000 | 0.1278        | 0.8481

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.1
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}