Findings of the Association for Computational Linguistics: EMNLP 2024 (November)

Interpretable Preferences via Multi-Objective Reward Modeling · 2024 · DOI 10.18653/v1/2024.findings-emnlp.620

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Understanding helpfulness and harmless tension in reward models

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.

citing papers explorer

Showing 3 of 3 citing papers.

Understanding helpfulness and harmless tension in reward models · cs.LG · 2026-06-11 · unverdicted · none · ref 45
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill · cs.LG · 2026-06-02 · unverdicted · none · ref 13
DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity · cs.LG · 2026-06-08 · unverdicted · none · ref 44
