Abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints drawn from natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. We find that all evaluated models degrade as the number of constraints increases, leaving considerable room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction following under complex, realistic conditions.
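For anyone who wants to poke at the released data, here is a minimal sketch of loading the benchmark from the Hub and tallying how many constraints each instruction carries. The dataset ID (`gililior/wild-if-eval`), the split name, and the `decomposition` field are assumptions on my part; check the paper's release page for the actual schema.

```python
# Minimal sketch: load WildIFEval and count constraints per instruction.
# NOTE: the Hub ID, split name, and field names below are assumptions,
# not confirmed by the paper text; adjust to the actual release.
from collections import Counter

from datasets import load_dataset

# Hypothetical Hub ID for the released dataset.
ds = load_dataset("gililior/wild-if-eval", split="test")

# Assumed schema: each example holds the original instruction plus a
# list of its decomposed constraints.
lengths = Counter(len(ex["decomposition"]) for ex in ds)
for n_constraints, count in sorted(lengths.items()):
    print(f"{n_constraints} constraints: {count} instructions")
```

A histogram like this is a quick way to see how heavily the benchmark skews toward multi-constraint prompts, which is exactly the regime where the paper reports model degradation.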
Community
This is Wild! 🤯 Hard benchmark alert! 🚨
Our new WildIFEval benchmark is out, and it's challenging for leading models!
No model scores above 65% — not even DS-v3 or Llama-3-405B!
How do LLMs cope with multi-constraint instructions from real users?
Not too well, it turns out... So lots of room for improvement! 🦾
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following (2025)
- WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale (2025)
- Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following (2025)
- CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom (2025)
- NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions (2025)
- AIR: Complex Instruction Generation via Automatic Iterative Refinement (2025)
- WritingBench: A Comprehensive Benchmark for Generative Writing (2025)
Sharing another relevant paper on instruction following based on real user queries: https://aclanthology.org/2024.findings-emnlp.458/