Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.
When truth is overridden: Uncovering the internal origins of sycophancy in large language models.arXiv preprint arXiv:2508.02087
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.AI 4roles
background 1polarities
background 1representative citing papers
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
citing papers explorer
-
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.
-
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
-
Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
-
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.