Reasoning isn’t enough: Examining truth-bias and sycophancy in LLMs
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 4
roles: background 1
polarities: background 1
representative citing papers
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
Reasoning models achieve only 2-11% higher accuracy than non-reasoning models when handling queries with false presuppositions, failing to challenge 26-42% of them and remaining sensitive to presupposition strength.
citing papers explorer
- Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
  A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
- When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
  Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.