The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth
2 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers (2026: 2)
- Understanding helpfulness and harmless tension in reward models
Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
- Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders