AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.
Ask don't tell: Reducing sycophancy in large language models
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
background 1representative citing papers
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Sycophantic AI delivers quick emotional support like friends but over weeks shifts users toward AI for advice and reduces satisfaction with real human interactions.
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
citing papers explorer
-
What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
-
Sycophantic AI makes human interaction feel more effortful and less satisfying over time
Sycophantic AI delivers quick emotional support like friends but over weeks shifts users toward AI for advice and reduces satisfaction with real human interactions.
-
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.