---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
repo_url: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
---

# ModernBERT Reddit Discussion Classifier

A lightweight, high-throughput ModernBERT-based model for classifying constructive vs. non-constructive conversations in online forums such as Reddit, optimized to process large volumes of Reddit discussion data efficiently.

## Model Description

This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base`, designed as a **lightweight** solution for large-scale Reddit discussion analysis.

- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English

### Model Source

- **Repository**: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git

## Intended Uses

### Primary Use Cases

- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify text (optimized for batch processing)
def classify_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=4096
    )

    # Move inputs to the same device as the model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage - Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```
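
### Batched Inference

The model is intended for high-throughput analysis, so batching many threads per forward pass is usually far more efficient than calling `classify_text` one thread at a time. The sketch below is illustrative rather than part of the released code: it reuses the `model` and `tokenizer` objects loaded above, and the `classify_batch` helper and `batch_size` value are assumptions you can tune to your hardware.

```python
# Illustrative batched-inference helper (not part of the released code).
# Assumes `model` and `tokenizer` from the Direct Use example above.
def classify_batch(texts, batch_size=32):
    device = next(model.parameters()).device
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=4096
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
        for row in probs:
            label = int(torch.argmax(row).item())
            results.append({
                'class': 'constructive' if label == 1 else 'non-constructive',
                'confidence': row[label].item()
            })
    return results

# Example: classify several threads in one call
print(classify_batch([text, text]))
```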
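
### Low-Memory Loading (4-bit)

The adapters were trained on top of a 4-bit quantized base model (see Training Details and Technical Specifications below). If GPU memory is limited, the base model can likewise be loaded in 4-bit for inference. This is a sketch under assumptions, not the exact training configuration: it requires a CUDA GPU and the `bitsandbytes` package, and the specific `BitsAndBytesConfig` values shown are illustrative.

```python
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import PeftModel
import torch

# Illustrative 4-bit NF4 quantization settings (the training run may have used
# different bitsandbytes options)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then attach the full-precision LoRA adapters
quantized_model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",
)
quantized_model = PeftModel.from_pretrained(
    quantized_model, "NiklasKoch/modernbert-discussion-classifier"
)
quantized_model.eval()
```

Expect small numerical differences compared with the full-precision loading shown under Direct Use.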

## Training Details

### Training Data

- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered for English and a minimum of 2 authors
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training

### Training Procedure

- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config**:
  - `r`: 16
  - `lora_alpha`: 32
  - `lora_dropout`: 0.1
  - Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6

### Training Hardware

- 48 hours on 4x NVIDIA A100 40GB GPUs

## Performance

### Evaluation Results

```
Dataset   Accuracy  Precision  F1-Score
YNACC       0.63      0.63       0.65
IAC         0.79      0.85       0.87
Reddit      0.57      0.74       0.67
```

## Limitations and Bias

- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and training data

## Ethical Considerations

- Human oversight is recommended for important moderation decisions

## Technical Specifications

- **Model Architecture**: ModernBERT + classification head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters
- **Framework**: PyTorch, Transformers, PEFT (any recent version; you may see harmless warnings about configuration parameters)

## Model Card Authors

Niklas Koch, Georg August University of Göttingen

## Model Card Contact

niklas.koch01@stud.uni-goettingen.de