DistilBERT RSS Advertisement Detection

A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.

Model Description

This model is fine-tuned from distilbert-base-uncased for binary text classification. It assigns each RSS article title to one of two classes:

  • Advertisement: Promotional content, deals, sales, sponsored content
  • News: Legitimate news articles, editorial content, research findings

Intended Use

  • Primary: Filtering RSS feeds to separate advertisements from news
  • Secondary: Content moderation, spam detection, content categorization
  • Research: Text classification, advertisement detection studies

Performance

  • Accuracy: ~95%
  • F1 Score: ~94%
  • Precision: ~93%
  • Recall: ~94%
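
The card does not state which class was treated as positive or how the evaluation split was built. As a rough sketch of how the four metrics relate, assuming advertisement is the positive class, they can be computed from true and predicted labels as shown here. The function name `binary_metrics`, the label strings, and the toy data are illustrative only and are not taken from the model or its evaluation set.

```python
def binary_metrics(y_true, y_pred, positive="advertisement"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration (not the model's evaluation data).
y_true = ["advertisement", "news", "news", "advertisement", "news"]
y_pred = ["advertisement", "news", "advertisement", "advertisement", "news"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, the reported ~93% precision and ~94% recall imply an F1 of roughly 93.5%, which is consistent with the rounded ~94% figure.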

Training Data

  • Feeds: 75+ RSS feeds from major tech news outlets
  • Articles: 1,600+ RSS articles
  • Labeled: 1,000+ manually labeled examples
  • Outlets include: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, and others

Usage

from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", 
                     model="SoroushXYZ/distilbert-rss-ad-detection")

# Classify examples
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!"
]

for text in examples:
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
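
For the primary use case, filtering a feed, the pipeline output can drive a simple filter. The following is a minimal sketch, not part of the published model: `keep_news`, `fake_classify`, the `"advertisement"` label string, and the 0.5 threshold are all assumptions. The stand-in classifier is a keyword heuristic so the example runs without downloading anything; in practice you would pass the `classifier` pipeline instead and check `classifier.model.config.id2label` for the label strings the model actually emits.

```python
def keep_news(titles, classify, ad_label="advertisement", threshold=0.5):
    """Return the titles that are not confidently classified as advertisements.

    `classify` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, which is the shape the
    text-classification pipeline returns for list input.
    """
    results = classify(titles)
    kept = []
    for title, result in zip(titles, results):
        is_ad = result["label"] == ad_label and result["score"] >= threshold
        if not is_ad:
            kept.append(title)
    return kept


def fake_classify(titles):
    """Stand-in for the pipeline: a keyword heuristic, NOT the real model."""
    ad_markers = ("% off", "buy now", "free shipping")
    results = []
    for title in titles:
        looks_like_ad = any(marker in title.lower() for marker in ad_markers)
        label = "advertisement" if looks_like_ad else "news"
        results.append({"label": label, "score": 0.99})
    return results


titles = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]
news_only = keep_news(titles, fake_classify)
print(news_only)
```

Raising `threshold` keeps more borderline titles in the feed; lowering it filters more aggressively at the cost of dropping some genuine news.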

Model Architecture

  • Base Model: distilbert-base-uncased
  • Task: Binary text classification
  • Input: Text (max 128 tokens)
  • Output: Class probabilities (news, advertisement)
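
A sequence classification head produces one raw score (logit) per class, and a softmax turns those scores into the class probabilities listed above. The sketch below shows that step in plain Python. The label order and the logit values are made up for illustration; the real order is defined in the model's configuration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order and logits, not taken from the model.
labels = ["news", "advertisement"]
logits = [-1.2, 2.3]

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), "->", prediction)
```

By default the Transformers pipeline reports only the top class. In recent versions, calling `classifier(text, top_k=None)` returns the score for every class, and tokenizer arguments such as `truncation=True, max_length=128` can be passed at call time for inputs longer than the 128-token limit.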

Training Details

  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Optimizer: AdamW
  • Framework: PyTorch + Transformers
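
To make these settings concrete, the sketch below collects them under key names that mirror keyword arguments of `transformers.TrainingArguments`, and estimates how many optimizer steps they imply. Whether the author used the `Trainer` API is not stated, so treat the mapping as an assumption. The training-set size of 1,000 is a placeholder based on the "1,000+" labeled examples; the actual train/validation split is not given.

```python
import math

# Hyperparameters as listed above. The key names mirror keyword arguments
# of transformers.TrainingArguments; using that class is an assumption.
hyperparams = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device, no gradient accumulation,
    keeping the final partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Placeholder training-set size (assumption, see the note above).
steps = total_optimizer_steps(
    1000,
    hyperparams["per_device_train_batch_size"],
    hyperparams["num_train_epochs"],
)
print(steps)  # 63 steps per epoch * 3 epochs = 189
```

With a dataset this small the whole run is only a few hundred updates, which is one reason a short schedule of 3 epochs is a common choice for fine-tuning.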

Limitations

  • Trained primarily on tech news content
  • May not generalize well to other domains
  • Performance depends on title quality and clarity
  • Limited to English language content

Citation

If you use this model, please cite:

@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}