---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---
# wissamantoun/WebOrganizer-FormatClassifier-ModernBERT
[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*
The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 149M parameters, fine-tuned in two stages on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT) *← you are here!*
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT)
## Usage
This classifier expects input in the following format:
```
{url}
{text}
```
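As a minimal sketch of the layout above, the helper below joins a URL and page text into a single input string. The function name `build_classifier_input`, the whitespace stripping, and the `separator` default are assumptions made for illustration; in particular, check whether the training preprocessing used a single newline or a blank line between the URL and the text.

```python
def build_classifier_input(url: str, text: str, separator: str = "\n") -> str:
    """Join a page URL and its text into the layout the classifier expects.

    Hypothetical helper: the default separator and the stripping of
    surrounding whitespace are assumptions, not documented requirements.
    """
    return f"{url.strip()}{separator}{text.strip()}"


web_page = build_classifier_input(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
print(web_page)
```

Keeping the separator as a parameter makes it easy to switch conventions without touching the call sites.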
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ModernBERT ships with recent transformers releases, so no remote code is needed.
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The input is the page URL followed by the page text.
web_page = """http://www.example.com
How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax over the logits gives one probability per format category.
probs = outputs.logits.softmax(dim=-1)
pred_id = probs.argmax(dim=-1).item()
# Print the predicted label index and its name from the model config.
print(pred_id, model.config.id2label[pred_id])
```
You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see `id2label` and `label2id` in the model config):
0. Academic Writing
1. Content Listing
2. Creative Writing
3. Customer Support
4. Comment Section
5. FAQ
6. Truncated
7. Knowledge Article
8. Legal Notices
9. Listicle
10. News Article
11. Nonfiction Writing
12. About (Org.)
13. News (Org.)
14. About (Pers.)
15. Personal Blog
16. Product Page
17. Q&A Forum
18. Spam / Ads
19. Structured Data
20. Documentation
21. Audio Transcript
22. Tutorial
23. User Review
The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
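To illustrate how a logits vector maps onto the category list, here is a small dependency-free sketch that applies a softmax and returns the top-k format names. `FORMAT_LABELS`, `softmax`, and `top_k_formats` are names introduced for this example, and the label strings are transcribed from the list above; the `id2label` mapping in the model config remains the authoritative source. The logits are made up for illustration.

```python
import math

# Transcribed from the category list; model.config.id2label is authoritative.
FORMAT_LABELS = [
    "Academic Writing", "Content Listing", "Creative Writing",
    "Customer Support", "Comment Section", "FAQ", "Truncated",
    "Knowledge Article", "Legal Notices", "Listicle", "News Article",
    "Nonfiction Writing", "About (Org.)", "News (Org.)", "About (Pers.)",
    "Personal Blog", "Product Page", "Q&A Forum", "Spam / Ads",
    "Structured Data", "Documentation", "Audio Transcript", "Tutorial",
    "User Review",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_formats(logits, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(FORMAT_LABELS[i], p) for i, p in ranked[:k]]


# Made-up logits: index 22 ("Tutorial") is given the largest score.
example_logits = [0.0] * len(FORMAT_LABELS)
example_logits[22] = 5.0
example_logits[7] = 2.0
for label, prob in top_k_formats(example_logits):
    print(f"{label}: {prob:.3f}")
```

With real model output, `outputs.logits[0].tolist()` could be passed in place of `example_logits`.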
## Scores
```
***** pred metrics *****
test_accuracy = 0.8154
test_accuracy__0 = 0.855
test_accuracy__1 = 0.7558
test_accuracy__10 = 0.9071
test_accuracy__11 = 0.6869
test_accuracy__12 = 0.8055
test_accuracy__13 = 0.7897
test_accuracy__14 = 0.8592
test_accuracy__15 = 0.8541
test_accuracy__16 = 0.8788
test_accuracy__17 = 0.7733
test_accuracy__18 = 0.7286
test_accuracy__19 = 0.6989
test_accuracy__2 = 0.7474
test_accuracy__20 = 0.7609
test_accuracy__21 = 0.7807
test_accuracy__22 = 0.7703
test_accuracy__23 = 0.7931
test_accuracy__3 = 0.6351
test_accuracy__4 = 0.871
test_accuracy__5 = 0.8333
test_accuracy__6 = 0.6125
test_accuracy__7 = 0.6416
test_accuracy__8 = 0.78
test_accuracy__9 = 0.7668
test_accuracy_conf50 = 0.8312
test_accuracy_conf50__0 = 0.8852
test_accuracy_conf50__1 = 0.7651
test_accuracy_conf50__10 = 0.9167
test_accuracy_conf50__11 = 0.7168
test_accuracy_conf50__12 = 0.8256
test_accuracy_conf50__13 = 0.7996
test_accuracy_conf50__14 = 0.8696
test_accuracy_conf50__15 = 0.8684
test_accuracy_conf50__16 = 0.8878
test_accuracy_conf50__17 = 0.7838
test_accuracy_conf50__18 = 0.7663
test_accuracy_conf50__19 = 0.7276
test_accuracy_conf50__2 = 0.7609
test_accuracy_conf50__20 = 0.7907
test_accuracy_conf50__21 = 0.8
test_accuracy_conf50__22 = 0.7927
test_accuracy_conf50__23 = 0.7904
test_accuracy_conf50__3 = 0.6617
test_accuracy_conf50__4 = 0.877
test_accuracy_conf50__5 = 0.8571
test_accuracy_conf50__6 = 0.6299
test_accuracy_conf50__7 = 0.6786
test_accuracy_conf50__8 = 0.7755
test_accuracy_conf50__9 = 0.7796
test_accuracy_conf75 = 0.9003 <--- Metric from the paper
test_accuracy_conf75__0 = 0.9412
test_accuracy_conf75__1 = 0.8318
test_accuracy_conf75__10 = 0.9542
test_accuracy_conf75__11 = 0.8478
test_accuracy_conf75__12 = 0.8841
test_accuracy_conf75__13 = 0.8724
test_accuracy_conf75__14 = 0.914
test_accuracy_conf75__15 = 0.9345
test_accuracy_conf75__16 = 0.9316
test_accuracy_conf75__17 = 0.8667
test_accuracy_conf75__18 = 0.8446
test_accuracy_conf75__19 = 0.8209
test_accuracy_conf75__2 = 0.8333
test_accuracy_conf75__20 = 0.9333
test_accuracy_conf75__21 = 0.8587
test_accuracy_conf75__22 = 0.8708
test_accuracy_conf75__23 = 0.8309
test_accuracy_conf75__3 = 0.7292
test_accuracy_conf75__4 = 0.9357
test_accuracy_conf75__5 = 0.9032
test_accuracy_conf75__6 = 0.7816
test_accuracy_conf75__7 = 0.8011
test_accuracy_conf75__8 = 0.8409
test_accuracy_conf75__9 = 0.8592
test_accuracy_label_average = 0.7744
test_accuracy_label_average_conf50 = 0.7919
test_accuracy_label_average_conf75 = 0.8676
test_accuracy_label_min = 0.6125
test_accuracy_label_min_conf75 = 0.7292 <--- Metric from the paper
test_loss = 0.6023
test_proportion_conf50 = 0.9638
test_proportion_conf75 = 0.7951
test_runtime = 0:00:08.38
test_samples_per_second = 1192.262
test_steps_per_second = 37.318
```
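The `conf50` and `conf75` suffixes in the log are not defined in this card. One plausible reading, stated here as an assumption, is that accuracy is computed only on test examples whose confidence score is at least 0.50 or 0.75, with `test_proportion_conf*` giving the fraction of examples retained. The sketch below computes overall, label-averaged, and minimum per-label accuracy under that reading. The function name `accuracy_report`, the `>=` comparison, the source of the confidence scores, and the grouping by gold label are all assumptions, and the toy data is invented.

```python
from collections import defaultdict


def accuracy_report(preds, labels, confidences, threshold=0.0):
    """Accuracy metrics on examples with confidence >= threshold.

    Assumed interpretation of the conf50/conf75 metrics; not taken
    from the original evaluation code.
    """
    kept = [(p, y) for p, y, c in zip(preds, labels, confidences) if c >= threshold]
    if not kept:
        raise ValueError("no examples meet the confidence threshold")

    # Count correct and total predictions per gold label.
    per_label = defaultdict(lambda: [0, 0])
    for p, y in kept:
        per_label[y][0] += int(p == y)
        per_label[y][1] += 1
    label_acc = {y: correct / total for y, (correct, total) in per_label.items()}

    return {
        "accuracy": sum(p == y for p, y in kept) / len(kept),
        "proportion": len(kept) / len(preds),
        "label_average": sum(label_acc.values()) / len(label_acc),
        "label_min": min(label_acc.values()),
    }


# Toy data for illustration only.
preds = [0, 0, 1, 1]
labels = [0, 1, 1, 1]
confidences = [0.9, 0.6, 0.8, 0.4]
print(accuracy_report(preds, labels, confidences))
print(accuracy_report(preds, labels, confidences, threshold=0.75))
```

Under this reading, raising the threshold trades coverage for accuracy, which matches the pattern in the log: `test_accuracy_conf75` is higher than `test_accuracy`, while `test_proportion_conf75` is below 1.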
## Citation
```bibtex
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}
```