RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 1polarities
unclear 1representative citing papers
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
An automated environment construction pipeline plus verifiable rewards enables RL training that improves LLM tool-use performance across scales without harming general capabilities.
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
citing papers explorer
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
-
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
An automated environment construction pipeline plus verifiable rewards enables RL training that improves LLM tool-use performance across scales without harming general capabilities.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.