arxiv: 2507.02833 · v3 · pith:RXRGM7AQnew · submitted 2025-07-03 · 💻 cs.CL

Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi

classification 💻 cs.CL

keywords constraints, following, instruction, models, verifiable, precise, addition, answer

abstract

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. Instructions commonly include output constraints, such as "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times", that the user adds to craft a more useful answer. Even today's strongest models struggle to fulfill such constraints, a skill called precise instruction following. We find that most models strongly overfit to the small set of verifiable constraints in the benchmarks that test this skill and do not generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. Alongside IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
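To make "verifiable constraint" concrete, here is a minimal sketch of what verification functions for the two example constraints in the abstract could look like, together with a simple reward built on them. This is not the IFBench code: the function names, the normalization rules, and the choice of reward (the fraction of satisfied constraints) are all assumptions made for illustration.

```python
import re
from typing import Callable, Sequence


def only_yes_or_no(response: str) -> bool:
    """Hypothetical verifier: the whole response is 'yes' or 'no'.

    Assumption: case, surrounding whitespace, and trailing '.'/'!' are ignored.
    """
    cleaned = response.strip().rstrip(".!").strip().lower()
    return cleaned in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, n: int) -> bool:
    """Hypothetical verifier: `word` occurs as a whole word at least `n` times.

    Assumption: matching is case-insensitive.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n


def verifiable_reward(
    response: str, checks: Sequence[Callable[[str], bool]]
) -> float:
    """Illustrative reward: fraction of constraint checks the response passes.

    Assumption: this aggregation is chosen for the sketch only; the paper's
    reward design may differ.
    """
    if not checks:
        return 0.0
    return sum(1.0 for check in checks if check(response)) / len(checks)


if __name__ == "__main__":
    checks = [
        only_yes_or_no,
        lambda r: mentions_word_at_least(r, "abrakadabra", 3),
    ]
    print(verifiable_reward("Yes.", checks))
```

Because each check is a deterministic function of the response text, a score of this kind can be computed without a learned reward model, which is what makes such constraints usable as an RLVR training signal.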

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
cs.LG 2026-05 conditional novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
cs.AI 2026-04 unverdicted novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
cs.CL 2026-05 unverdicted novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
cs.LG 2026-05 unverdicted novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL 2026-04 accept novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
cs.CL 2026-04 unverdicted novelty 7.0

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...
Many-Tier Instruction Hierarchy in LLM Agents
cs.CL 2026-04 unverdicted novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution
cs.SE 2026-02 unverdicted novelty 7.0

IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
eess.AS 2025-09 unverdicted novelty 7.0

Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
cs.LG 2026-05 unverdicted novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
cs.CL 2026-05 conditional novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
cs.AI 2026-05 unverdicted novelty 6.0

Dynamic Boundary Evaluation adaptively identifies each LLM's performance boundary on a shared difficulty scale using a calibrated item bank and a search algorithm.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
cs.CL 2026-04 unverdicted novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
cs.CL 2026-04 unverdicted novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
cs.CL 2026-04 unverdicted novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.