
wissamantoun/WebOrganizer-FormatClassifier-ModernBERT

[Paper] [Website] [GitHub]

All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model (ModernBERT-base).

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages. The model is a ModernBERT-base with ~150M parameters, fine-tuned on the following training data:

  1. WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
  2. WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

All Domain Classifiers

Usage

This classifier expects input in the following format:

{url}

{text}

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the retrained classifier. ModernBERT is supported natively in recent
# versions of transformers, so no custom attention settings are required.
tokenizer = AutoTokenizer.from_pretrained("wissamantoun/WebOrganizer-FormatClassifier-ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT",
    trust_remote_code=True,
)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> predicted label index; map it to a category name via model.config.id2label
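
To classify more than one page at a time, the same pipeline works on a batch. The sketch below is illustrative (the URLs and texts are made up) and reuses the tokenizer and model loaded above; padding and truncation keep long web pages within the model's context window.

import torch

# Placeholder pages -- illustrative values only.
pages = [
    ("http://www.example.com", "How to build a computer from scratch? ..."),
    ("http://blog.example.org", "Today I finally finished my first marathon ..."),
]

# Each input follows the "{url}\n\n{text}" format described above.
batch = [f"{url}\n\n{text}" for url, text in pages]

# Pad to the longest sequence and truncate anything beyond the model's maximum length.
inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

print(logits.softmax(dim=-1).argmax(dim=-1))  # one predicted label index per page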

You can convert the model's logits with a softmax into a probability distribution over the following 24 categories (listed in label order; see also id2label and label2id in the model config). A short sketch after the list shows how to map these probabilities to category names:

  1. Academic Writing
  2. Content Listing
  3. Creative Writing
  4. Customer Support
  5. Comment Section
  6. FAQ
  7. Truncated
  8. Knowledge Article
  9. Legal Notices
  10. Listicle
  11. News Article
  12. Nonfiction Writing
  13. About (Org.)
  14. News (Org.)
  15. About (Pers.)
  16. Personal Blog
  17. Product Page
  18. Q&A Forum
  19. Spam / Ads
  20. Structured Data
  21. Documentation
  22. Audio Transcript
  23. Tutorial
  24. User Review
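
For example, the following sketch reuses the model, tokenizer, and web_page variables from the example above and prints the five highest-probability categories by name, using the id2label mapping in the model config:

import torch

# Reuses `model`, `tokenizer`, and `web_page` from the example above.
inputs = tokenizer([web_page], return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Pair each probability with its category name via the config's id2label mapping.
ranked = sorted(
    ((model.config.id2label[i], p.item()) for i, p in enumerate(probs)),
    key=lambda x: x[1],
    reverse=True,
)
for name, p in ranked[:5]:
    print(f"{name:20s} {p:.3f}")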

The full definitions of the categories can be found in the taxonomy config.

Scores

***** pred metrics *****
  test_accuracy                      =     0.8154
  test_accuracy__0                   =      0.855
  test_accuracy__1                   =     0.7558
  test_accuracy__10                  =     0.9071
  test_accuracy__11                  =     0.6869
  test_accuracy__12                  =     0.8055
  test_accuracy__13                  =     0.7897
  test_accuracy__14                  =     0.8592
  test_accuracy__15                  =     0.8541
  test_accuracy__16                  =     0.8788
  test_accuracy__17                  =     0.7733
  test_accuracy__18                  =     0.7286
  test_accuracy__19                  =     0.6989
  test_accuracy__2                   =     0.7474
  test_accuracy__20                  =     0.7609
  test_accuracy__21                  =     0.7807
  test_accuracy__22                  =     0.7703
  test_accuracy__23                  =     0.7931
  test_accuracy__3                   =     0.6351
  test_accuracy__4                   =      0.871
  test_accuracy__5                   =     0.8333
  test_accuracy__6                   =     0.6125
  test_accuracy__7                   =     0.6416
  test_accuracy__8                   =       0.78
  test_accuracy__9                   =     0.7668
  test_accuracy_conf50               =     0.8312
  test_accuracy_conf50__0            =     0.8852
  test_accuracy_conf50__1            =     0.7651
  test_accuracy_conf50__10           =     0.9167
  test_accuracy_conf50__11           =     0.7168
  test_accuracy_conf50__12           =     0.8256
  test_accuracy_conf50__13           =     0.7996
  test_accuracy_conf50__14           =     0.8696
  test_accuracy_conf50__15           =     0.8684
  test_accuracy_conf50__16           =     0.8878
  test_accuracy_conf50__17           =     0.7838
  test_accuracy_conf50__18           =     0.7663
  test_accuracy_conf50__19           =     0.7276
  test_accuracy_conf50__2            =     0.7609
  test_accuracy_conf50__20           =     0.7907
  test_accuracy_conf50__21           =        0.8
  test_accuracy_conf50__22           =     0.7927
  test_accuracy_conf50__23           =     0.7904
  test_accuracy_conf50__3            =     0.6617
  test_accuracy_conf50__4            =      0.877
  test_accuracy_conf50__5            =     0.8571
  test_accuracy_conf50__6            =     0.6299
  test_accuracy_conf50__7            =     0.6786
  test_accuracy_conf50__8            =     0.7755
  test_accuracy_conf50__9            =     0.7796
  test_accuracy_conf75               =     0.9003 <--- Metric from the paper
  test_accuracy_conf75__0            =     0.9412
  test_accuracy_conf75__1            =     0.8318
  test_accuracy_conf75__10           =     0.9542
  test_accuracy_conf75__11           =     0.8478
  test_accuracy_conf75__12           =     0.8841
  test_accuracy_conf75__13           =     0.8724
  test_accuracy_conf75__14           =      0.914
  test_accuracy_conf75__15           =     0.9345
  test_accuracy_conf75__16           =     0.9316
  test_accuracy_conf75__17           =     0.8667
  test_accuracy_conf75__18           =     0.8446
  test_accuracy_conf75__19           =     0.8209
  test_accuracy_conf75__2            =     0.8333
  test_accuracy_conf75__20           =     0.9333
  test_accuracy_conf75__21           =     0.8587
  test_accuracy_conf75__22           =     0.8708
  test_accuracy_conf75__23           =     0.8309
  test_accuracy_conf75__3            =     0.7292
  test_accuracy_conf75__4            =     0.9357
  test_accuracy_conf75__5            =     0.9032
  test_accuracy_conf75__6            =     0.7816
  test_accuracy_conf75__7            =     0.8011
  test_accuracy_conf75__8            =     0.8409
  test_accuracy_conf75__9            =     0.8592
  test_accuracy_label_average        =     0.7744
  test_accuracy_label_average_conf50 =     0.7919
  test_accuracy_label_average_conf75 =     0.8676
  test_accuracy_label_min            =     0.6125
  test_accuracy_label_min_conf75     =     0.7292 <--- Metric from the paper
  test_loss                          =     0.6023
  test_proportion_conf50             =     0.9638
  test_proportion_conf75             =     0.7951
  test_runtime                       = 0:00:08.38
  test_samples_per_second            =   1192.262
  test_steps_per_second              =     37.318
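
The conf50 and conf75 entries are read here as confidence-thresholded metrics: assuming test_accuracy_confXX is accuracy computed only over test documents whose top softmax probability is at least 0.50 or 0.75, and test_proportion_confXX is the share of documents clearing that threshold, they could be computed as in the following sketch (an assumption based on the metric names, not the authors' evaluation code):

import torch

def confidence_thresholded_accuracy(logits, labels, threshold):
    # Accuracy over examples whose top softmax probability is >= threshold,
    # plus the proportion of examples that clear the threshold
    # (assumed reading of test_accuracy_confXX / test_proportion_confXX).
    probs = logits.softmax(dim=-1)
    confidence, preds = probs.max(dim=-1)
    mask = confidence >= threshold
    proportion = mask.float().mean().item()
    if mask.sum() == 0:
        return float("nan"), proportion
    accuracy = (preds[mask] == labels[mask]).float().mean().item()
    return accuracy, proportion

# Example with random tensors standing in for real model outputs and labels:
logits = torch.randn(1000, 24)
labels = torch.randint(0, 24, (1000,))
acc75, prop75 = confidence_thresholded_accuracy(logits, labels, 0.75)
print(f"accuracy@conf75={acc75:.4f}, proportion@conf75={prop75:.4f}")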

Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}