clapAI/mmBERT-small-multilingual-sentiment

Introduction

mmBERT-small-multilingual-sentiment is a multilingual sentiment classification model, part of the Multilingual-Sentiment collection.

The model is fine-tuned from jhu-clsp/mmBERT-small using the multilingual sentiment dataset clapAI/MultiLingualSentiment.

The model supports sentiment classification in 16+ languages, including English, Vietnamese, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, and Arabic.

Key Highlights

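📈 Improved accuracy: Achieves F1 = 82.2 on the clapAI/MultiLingualSentiment test split, ahead of the ModernBERT-base (80.16) and XLM-RoBERTa-base (81.8) variants.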
📜 Long context support: Handles sequences up to 8192 tokens.
🪶 Efficient size: Only 140M parameters, about half the size of XLM-RoBERTa-base (278M), with a higher F1-score (82.2 vs. 81.8).
⚡ FlashAttention-2 support: Enables much faster inference on modern GPUs.

Evaluation & Performance

Results on the test split of clapAI/MultiLingualSentiment

Model	Pretrained model	Parameters	Context length (tokens)	F1-score
clapAI/mmBERT-small-multilingual-sentiment	jhu-clsp/mmBERT-small	140M	8192	82.2
modernBERT-base-multilingual-sentiment	ModernBERT-base	150M	8192	80.16
roberta-base-multilingual-sentiment	XLM-roberta-base	278M	512	81.8
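
To make the size and accuracy trade-off in the table concrete, the short sketch below recomputes the relative parameter count and the F1 difference against the other two models. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (parameters in millions, F1-score) copied from the results table
results = {
    "clapAI/mmBERT-small-multilingual-sentiment": (140, 82.2),
    "modernBERT-base-multilingual-sentiment": (150, 80.16),
    "roberta-base-multilingual-sentiment": (278, 81.8),
}

ours = "clapAI/mmBERT-small-multilingual-sentiment"
ours_params, ours_f1 = results[ours]

comparison = {}
for name, (params, f1) in results.items():
    if name == ours:
        continue
    comparison[name] = {
        "size_ratio": ours_params / params,  # < 1 means this model is smaller
        "f1_gain": ours_f1 - f1,             # > 0 means this model scores higher
    }
    print(f"vs {name}: {ours_params / params:.0%} of the parameters, {ours_f1 - f1:+.2f} F1")
```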

How to use

Installation

pip install torch==2.8
pip install transformers==4.55.0

Optional: accelerate inference with FlashAttention-2 (if supported by your GPU):

pip install packaging==25.0 ninja==1.13.0
MAX_JOBS=4 pip install flash-attn==2.8.3 --no-build-isolation
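
Because the FlashAttention-2 install is optional and can fail to build, it may help to check for the package before requesting it. The sketch below is one possible way to do that; `pick_attn_implementation` is a hypothetical helper (not part of transformers), and it only checks that `flash_attn` is importable, not that the GPU actually supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else None.

    Hypothetical helper for illustration: it does not verify GPU support.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return None  # None leaves the choice to the transformers default


attn_implementation = pick_attn_implementation()
print(f"attn_implementation={attn_implementation!r}")
```

The returned value can be passed as `attn_implementation=` to `from_pretrained`.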

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "clapAI/mmBERT-small-multilingual-sentiment"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
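# Use half precision on GPU (bf16 where supported); fall back to float32 on CPU
if device.type == "cuda":
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    dtype = torch.float32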
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=dtype,
    # Uncomment if flash-attn is installed and the GPU supports FlashAttention-2
    # attn_implementation="flash_attention_2"
)

model.to(device)
model.eval()

# Retrieve labels from the model's configuration
id2label = model.config.id2label

texts = [
    "I absolutely love the new design of this app!",  # English
    "الخدمة كانت سيئة للغاية.",  # Arabic
    "Ich bin sehr zufrieden mit dem Kauf.",  # German
    "El producto llegó roto y no funciona.",  # Spanish
    "J'adore ce restaurant, la nourriture est délicieuse!",  # French
    "Makanannya benar-benar tidak enak.",  # Indonesian
    "この製品は本当に素晴らしいです！",  # Japanese
    "고객 서비스가 정말 실망스러웠어요.",  # Korean
    "Этот фильм просто потрясающий!",  # Russian
    "Tôi thực sự yêu thích sản phẩm này!",  # Vietnamese
    "质量真的很差。"  # Chinese
]

for text in texts:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model(**inputs)
        prediction = id2label[outputs.logits.argmax(dim=-1).item()]
    print(f"Text: {text} | Prediction: {prediction}")
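
The loop above reports only the top label. When a confidence score is also useful, the logits can be converted to probabilities with a softmax. The following is a minimal sketch in plain Python; the logits and the label mapping are made up for illustration, and `softmax` and `top_label` are hypothetical helper names. With the real model, the inputs would be `model.config.id2label` and `outputs.logits[0].float().tolist()`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical values for illustration only; the real mapping comes from
# model.config.id2label and the real logits from the model output.
example_id2label = {0: "negative", 1: "neutral", 2: "positive"}
example_logits = [-1.2, 0.3, 2.5]

label, prob = top_label(example_logits, example_id2label)
print(f"Prediction: {label} ({prob:.1%})")
```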

Citation

If you use this model, please consider citing:

@misc{clapAI_mmbert_small_multilingual_sentiment,
      title={mmBERT-small-multilingual-sentiment: A Multilingual Sentiment Classification Model},
      author={clapAI},
      howpublished={\url{https://huggingface.co/clapAI/mmBERT-small-multilingual-sentiment}},
      year={2025},
}
