WildIFEval: Instruction Following in the Wild
read the original abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have a large room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under pertu...
-
Revisiting the Reliability of Language Models in Instruction-Following
LLMs exhibit up to 61.8% performance drops on nuanced rephrasings of instruction-following tasks, revealing insufficient nuance-oriented reliability across 46 tested models.
-
Generalizing Verifiable Instruction Following
Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.