Abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints drawn from natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. We find that all evaluated models degrade as the number of constraints increases, leaving considerable room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction following under complex, realistic conditions.
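For anyone who wants to poke at the released data, here is a minimal sketch of loading the benchmark from the Hub and tallying how many constraints each instruction carries. The dataset ID (`gililior/wild-if-eval`), the split name, and the `decomposition` field are assumptions on my part; check the paper's release page for the actual schema.

```python
# Minimal sketch: load WildIFEval and count constraints per instruction.
# NOTE: the Hub ID, split name, and field names below are assumptions,
# not confirmed by the paper text; adjust to the actual release.
from collections import Counter

from datasets import load_dataset

# Hypothetical Hub ID for the released dataset.
ds = load_dataset("gililior/wild-if-eval", split="test")

# Assumed schema: each example holds the original instruction plus a
# list of its decomposed constraints.
lengths = Counter(len(ex["decomposition"]) for ex in ds)
for n_constraints, count in sorted(lengths.items()):
    print(f"{n_constraints} constraints: {count} instructions")
```

A histogram like this is a quick way to see how heavily the benchmark skews toward multi-constraint prompts, which is exactly the regime where the paper reports model degradation.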
Community
This is Wild! 🤯 Hard benchmark alert! 🚨
Our new WildIFEval benchmark is out, and it's challenging for leading models!
No model scores above 65% — not even DS-v3 or Llama-3-405B!
How do LLMs cope with multi-constraint instructions from real users?
Not too well, it turns out... So lots of room for improvement! 🦾
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following (2025)
- WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale (2025)
- Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following (2025)
- CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom (2025)
- NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions (2025)
- AIR: Complex Instruction Generation via Automatic Iterative Refinement (2025)
- WritingBench: A Comprehensive Benchmark for Generative Writing (2025)
Sharing another relevant paper on instruction following based on real user queries: https://aclanthology.org/2024.findings-emnlp.458/