Multi-Domain Explainability of Preferences

Calderon, Nitay, Ein-Dor, Liat, Reichart, Roi · 2025 · DOI 10.18653/v1/2025.emnlp-main.736

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Understanding helpfulness and harmless tension in reward models

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.

citing papers explorer


Understanding helpfulness and harmless tension in reward models · cs.LG · 2026-06-11 · unverdicted · none · ref 28
