Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
hub Canonical reference
Simple synthetic data reduces sycophancy in large language models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
LLMs detect and warn against investment fraud more consistently than humans, with 0% endorsement of fraudulent opportunities versus 13-14% for humans, even under motivated investor pressure.
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
BASIL is a Bayesian probabilistic framework that separates sycophantic belief shifts from rational updating in LLMs and demonstrates its use on uncertainty-driven tasks along with mitigation via calibration and fine-tuning.
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total elimination.
Deception in generative AI is subtle and normalized through defaults and interactions, with users often complicit, calling for friction, awareness, and regulatory approaches to protect users.
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
citing papers explorer
-
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
Playing games with knowledge: AI-Induced delusions need game theoretic interventions
AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
-
User-Assistant Bias in LLMs
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
-
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
-
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
-
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
-
Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
-
TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
-
How Large Language Models Balance Internal Knowledge with User and Document Assertions
LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
-
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
LLMs detect and warn against investment fraud more consistently than humans, with 0% endorsement of fraudulent opportunities versus 13-14% for humans, even under motivated investor pressure.
-
Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
-
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
-
BASIL: Bayesian Assessment of Sycophancy in LLMs
BASIL is a Bayesian probabilistic framework that separates sycophantic belief shifts from rational updating in LLMs and demonstrates its use on uncertainty-driven tasks along with mitigation via calibration and fine-tuning.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
-
User Detection and Response Patterns of Sycophantic Behavior in Conversational AI
Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total elimination.
-
Exploring the "Banality" of Deception in Generative AI
Deception in generative AI is subtle and normalized through defaults and interactions, with users often complicit, calling for friction, awareness, and regulatory approaches to protect users.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs