Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
Findings of the Association for Computational Linguistics: EMNLP 2024 , month = nov, year =
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3years
2026 3verdicts
UNVERDICTED 3representative citing papers
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
citing papers explorer
-
Understanding helpfulness and harmless tension in reward models
Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
-
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
-
DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity
DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.