
arxiv: 2503.06573 · v3 · pith:AKNJI62Inew · submitted 2025-03-09 · 💻 cs.CL · cs.AI

WildIFEval: Instruction Following in the Wild

keywords: constraints, instructions, wildifeval, user, conditions, dataset, following, instruction-following

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns in how models follow constraints. We release our dataset to promote further research on instruction following under complex, realistic conditions.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

    cs.CL 2026-05 unverdicted novelty 7.0

    MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under pertu...

  2. Revisiting the Reliability of Language Models in Instruction-Following

    cs.SE 2025-12 conditional novelty 6.0

    LLMs exhibit up to 61.8% performance drops on nuanced rephrasings of instruction-following tasks, revealing insufficient nuance-oriented reliability across 46 tested models.

  3. Generalizing Verifiable Instruction Following

    cs.CL 2025-07 unverdicted novelty 6.0

    Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.