Hatevolution: What Static Benchmarks Don't Tell Us
Abstract
Empirical evaluation reveals temporal misalignment in the robustness of language models on evolving hate speech benchmarks, highlighting the need for time-sensitive linguistic assessments.
Language changes over time, including in the hate speech domain, which evolves quickly in response to social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet hate speech benchmarks play a crucial role in ensuring model safety. In this paper, we empirically evaluate the robustness of 20 language models across two experiments on evolving hate speech, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.
Community
Authors: Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
TL;DR: We design two novel time-sensitive experiments and metrics to empirically evaluate 20 state-of-the-art language models on evolving hate speech. We find that language models are not robust to evolving hate speech: they exhibit short- and long-term volatility under time-sensitive shifts in Experiment 1 and sensitivity to counterfactuals containing neologisms in Experiment 2. Moreover, we show the temporal misalignment between static and time-sensitive evaluations, calling for time-sensitive linguistic hate speech benchmarks in order to reliably evaluate language models' safety over time.
Paper: https://arxiv.org/pdf/2506.12148
Code: https://github.com/ChiaraDiBonaventura/hatevolution
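The paper's exact metrics are defined in the linked repository; as a hypothetical illustration of the general idea behind a time-sensitive evaluation, the sketch below contrasts a single static accuracy score with per-period scores, and measures volatility as the spread of accuracy across time periods. All function names, the data format, and the volatility definition here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's code): a static benchmark reports one
# aggregate score, while a time-sensitive evaluation scores each time period
# separately and quantifies how much performance varies over time.
from statistics import stdev


def accuracy(pairs):
    """Fraction of (prediction, gold_label) pairs that match."""
    return sum(p == g for p, g in pairs) / len(pairs)


def time_sensitive_eval(examples):
    """examples: list of (year, prediction, gold_label) tuples.

    Returns the static (pooled) accuracy, per-year accuracies, and an
    illustrative volatility score (std. dev. of per-year accuracy).
    """
    static_score = accuracy([(p, g) for _, p, g in examples])
    by_year = {}
    for year, pred, gold in examples:
        by_year.setdefault(year, []).append((pred, gold))
    per_year = {y: accuracy(v) for y, v in sorted(by_year.items())}
    scores = list(per_year.values())
    volatility = stdev(scores) if len(scores) > 1 else 0.0
    return static_score, per_year, volatility


# Toy data: the pooled score looks moderate, but slicing by year reveals
# that performance degrades on more recent (evolved) hate speech.
data = [(2018, 1, 1), (2018, 0, 0),
        (2020, 0, 1), (2020, 1, 1),
        (2022, 0, 1), (2022, 0, 1)]
static, per_year, vol = time_sensitive_eval(data)
# static == 0.5, per_year == {2018: 1.0, 2020: 0.5, 2022: 0.0}, vol == 0.5
```

A static benchmark would report only the pooled 0.5 here, hiding the complete failure on the most recent slice; the per-period view is what exposes the temporal misalignment the paper describes.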
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study (2025)
- Compositional Generalisation for Explainable Hate Speech Detection (2025)
- A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings (2025)
- FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking (2025)
- LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification (2025)
- Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models (2025)
- MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP (2025)
Datasets citing this paper: 2