	---
	library_name: transformers
	datasets:
	- WebOrganizer/TopicAnnotations-Llama-3.1-8B
	- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
	base_model:
	- answerdotai/ModernBERT-base
	---
	# wissamantoun/WebOrganizer-FormatClassifier-ModernBERT

	[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

	All credit goes to the original authors of the model and dataset. This model is a retraining of the original classifier with a different base model.

	The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
	The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 140M parameters fine-tuned on the following training data:

	1. [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
	2. [WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

	#### All Domain Classifiers
	- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT) ← you are here!
	- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT)

	## Usage

	This classifier expects input in the following format:
	```
	{url}

	{text}
	```
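The format above can be produced with a small helper. This is a minimal sketch; the function name `format_input` is hypothetical and not part of the model's code. It joins the URL and the page text with exactly one blank line between them.

```python
def format_input(url: str, text: str) -> str:
    """Join a page's URL and text as: URL, one blank line, then the text."""
    return f"{url}\n\n{text}"


# Example page (illustrative values only)
page = format_input(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
print(page)
```

The resulting string can be passed directly to the tokenizer.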

	Example:
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_name = "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	# ModernBERT is supported natively in recent transformers releases,
	# so no remote code or custom attention flags should be needed.
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Input format: URL, one blank line, then the page text
	web_page = """http://www.example.com

	How to build a computer from scratch? Here are the components you need..."""

	inputs = tokenizer([web_page], return_tensors="pt")
	outputs = model(**inputs)

	# Softmax turns the logits into a distribution over the 24 format labels
	probs = outputs.logits.softmax(dim=-1)
	pred = probs.argmax(dim=-1).item()
	# Print the predicted label id and its name from the model config
	print(pred, model.config.id2label[pred])
	```

	You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see `id2label` and `label2id` in the model config):

	0. Academic Writing
	1. Content Listing
	2. Creative Writing
	3. Customer Support
	4. Comment Section
	5. FAQ
	6. Truncated
	7. Knowledge Article
	8. Legal Notices
	9. Listicle
	10. News Article
	11. Nonfiction Writing
	12. About (Org.)
	13. News (Org.)
	14. About (Pers.)
	15. Personal Blog
	16. Product Page
	17. Q&A Forum
	18. Spam / Ads
	19. Structured Data
	20. Documentation
	21. Audio Transcript
	22. Tutorial
	23. User Review

	The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
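To illustrate the softmax step without loading the model, the sketch below uses only the Python standard library. The label names are copied from the list above. The helper names `softmax` and `top_k_labels` and the logits are made up for illustration; in practice the model config's `id2label` is the authoritative mapping.

```python
import math

# Label names in id order, copied from the category list in this README.
FORMAT_LABELS = [
    "Academic Writing", "Content Listing", "Creative Writing", "Customer Support",
    "Comment Section", "FAQ", "Truncated", "Knowledge Article", "Legal Notices",
    "Listicle", "News Article", "Nonfiction Writing", "About (Org.)", "News (Org.)",
    "About (Pers.)", "Personal Blog", "Product Page", "Q&A Forum", "Spam / Ads",
    "Structured Data", "Documentation", "Audio Transcript", "Tutorial", "User Review",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, k=3):
    """Return the k most probable (label_id, label_name, probability) triples."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, FORMAT_LABELS[i], probs[i]) for i in ranked[:k]]


# Made-up logits for illustration only (not real model output)
example_logits = [0.0] * len(FORMAT_LABELS)
example_logits[22] = 4.0
example_logits[7] = 2.0

for label_id, name, p in top_k_labels(example_logits, k=2):
    print(label_id, name, round(p, 3))
```

With real model output, `outputs.logits[0].tolist()` would take the place of `example_logits`.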

	## Scores
	```
	*** pred metrics ***
	test_accuracy = 0.8154
	test_accuracy__0 = 0.855
	test_accuracy__1 = 0.7558
	test_accuracy__10 = 0.9071
	test_accuracy__11 = 0.6869
	test_accuracy__12 = 0.8055
	test_accuracy__13 = 0.7897
	test_accuracy__14 = 0.8592
	test_accuracy__15 = 0.8541
	test_accuracy__16 = 0.8788
	test_accuracy__17 = 0.7733
	test_accuracy__18 = 0.7286
	test_accuracy__19 = 0.6989
	test_accuracy__2 = 0.7474
	test_accuracy__20 = 0.7609
	test_accuracy__21 = 0.7807
	test_accuracy__22 = 0.7703
	test_accuracy__23 = 0.7931
	test_accuracy__3 = 0.6351
	test_accuracy__4 = 0.871
	test_accuracy__5 = 0.8333
	test_accuracy__6 = 0.6125
	test_accuracy__7 = 0.6416
	test_accuracy__8 = 0.78
	test_accuracy__9 = 0.7668
	test_accuracy_conf50 = 0.8312
	test_accuracy_conf50__0 = 0.8852
	test_accuracy_conf50__1 = 0.7651
	test_accuracy_conf50__10 = 0.9167
	test_accuracy_conf50__11 = 0.7168
	test_accuracy_conf50__12 = 0.8256
	test_accuracy_conf50__13 = 0.7996
	test_accuracy_conf50__14 = 0.8696
	test_accuracy_conf50__15 = 0.8684
	test_accuracy_conf50__16 = 0.8878
	test_accuracy_conf50__17 = 0.7838
	test_accuracy_conf50__18 = 0.7663
	test_accuracy_conf50__19 = 0.7276
	test_accuracy_conf50__2 = 0.7609
	test_accuracy_conf50__20 = 0.7907
	test_accuracy_conf50__21 = 0.8
	test_accuracy_conf50__22 = 0.7927
	test_accuracy_conf50__23 = 0.7904
	test_accuracy_conf50__3 = 0.6617
	test_accuracy_conf50__4 = 0.877
	test_accuracy_conf50__5 = 0.8571
	test_accuracy_conf50__6 = 0.6299
	test_accuracy_conf50__7 = 0.6786
	test_accuracy_conf50__8 = 0.7755
	test_accuracy_conf50__9 = 0.7796
	test_accuracy_conf75 = 0.9003 <--- Metric from the paper
	test_accuracy_conf75__0 = 0.9412
	test_accuracy_conf75__1 = 0.8318
	test_accuracy_conf75__10 = 0.9542
	test_accuracy_conf75__11 = 0.8478
	test_accuracy_conf75__12 = 0.8841
	test_accuracy_conf75__13 = 0.8724
	test_accuracy_conf75__14 = 0.914
	test_accuracy_conf75__15 = 0.9345
	test_accuracy_conf75__16 = 0.9316
	test_accuracy_conf75__17 = 0.8667
	test_accuracy_conf75__18 = 0.8446
	test_accuracy_conf75__19 = 0.8209
	test_accuracy_conf75__2 = 0.8333
	test_accuracy_conf75__20 = 0.9333
	test_accuracy_conf75__21 = 0.8587
	test_accuracy_conf75__22 = 0.8708
	test_accuracy_conf75__23 = 0.8309
	test_accuracy_conf75__3 = 0.7292
	test_accuracy_conf75__4 = 0.9357
	test_accuracy_conf75__5 = 0.9032
	test_accuracy_conf75__6 = 0.7816
	test_accuracy_conf75__7 = 0.8011
	test_accuracy_conf75__8 = 0.8409
	test_accuracy_conf75__9 = 0.8592
	test_accuracy_label_average = 0.7744
	test_accuracy_label_average_conf50 = 0.7919
	test_accuracy_label_average_conf75 = 0.8676
	test_accuracy_label_min = 0.6125
	test_accuracy_label_min_conf75 = 0.7292 <--- Metric from the paper
	test_loss = 0.6023
	test_proportion_conf50 = 0.9638
	test_proportion_conf75 = 0.7951
	test_runtime = 0:00:08.38
	test_samples_per_second = 1192.262
	test_steps_per_second = 37.318
	```
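The per-label lines in the log are sorted by label id as a string (0, 1, 10, 11, ...), which makes them hard to scan. The sketch below copies the 24 `test_accuracy__N` values from the log and recomputes the two summary figures. The unweighted mean of the per-label accuracies reproduces `test_accuracy_label_average`, and the minimum reproduces `test_accuracy_label_min`. The helper name `summarize` is hypothetical.

```python
# Per-label test accuracies copied from the test_accuracy__N lines of the log.
PER_LABEL_ACCURACY = {
    0: 0.855, 1: 0.7558, 2: 0.7474, 3: 0.6351, 4: 0.871, 5: 0.8333,
    6: 0.6125, 7: 0.6416, 8: 0.78, 9: 0.7668, 10: 0.9071, 11: 0.6869,
    12: 0.8055, 13: 0.7897, 14: 0.8592, 15: 0.8541, 16: 0.8788, 17: 0.7733,
    18: 0.7286, 19: 0.6989, 20: 0.7609, 21: 0.7807, 22: 0.7703, 23: 0.7931,
}


def summarize(per_label):
    """Return (unweighted mean accuracy, worst label id, worst accuracy)."""
    macro = sum(per_label.values()) / len(per_label)
    worst = min(per_label, key=per_label.get)
    return macro, worst, per_label[worst]


macro_avg, worst_label, worst_acc = summarize(PER_LABEL_ACCURACY)
print(f"label average = {macro_avg:.4f}")                    # 0.7744
print(f"label min = {worst_acc:.4f} (label {worst_label})")  # 0.6125 (label 6)
```

Label 6 is "Truncated" in the category list, followed by labels 3 ("Customer Support") and 7 ("Knowledge Article") as the next weakest classes.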



	## Citation
	```bibtex
	@article{wettig2025organize,
	title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
	author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
	journal={arXiv preprint arXiv:2502.10341},
	year={2025}
	}
	```