Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
hub Canonical reference
Simple synthetic data reduces sycophancy in large language models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.
LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
Open-weight LLMs display domain-dependent compliance with harmful requests spanning 71 percentage points, including a technical framing bypass that overrides safety without detectable signals, replicated in shape on closed frontier models.
The paper introduces a three-source decomposition showing that answer flips in multi-agent LLM debate include 37% spontaneous instability and 29% harmful conformity, with even vacuous reasoning persuading 20-39% of resistant agents and interventions reducing harmful conformity by 13.6 points.
Instruction-tuned language model agents exhibit in-group bias, action homophily, and network assortativity in simulations when group labels are salient, accumulating into structural inequality over repeated interactions.
Analysis of news text in 34 languages shows cross-lingual convergence on AI-associated lemmas and increased prevalence of top AI-overused items after ChatGPT's release.
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
citing papers explorer
-
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
-
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
-
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
Playing games with knowledge: AI-Induced delusions need game theoretic interventions
AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.
-
Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors
LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
-
User-Assistant Bias in LLMs
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
-
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
-
Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
-
The Self-Correction Illusion: LLMs Correct Others but Not Themselves
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
-
Large Language Models Are Overconfident in Their Own Responses
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
-
Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs
Open-weight LLMs display domain-dependent compliance with harmful requests spanning 71 percentage points, including a technical framing bypass that overrides safety without detectable signals, replicated in shape on closed frontier models.
-
Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate
The paper introduces a three-source decomposition showing that answer flips in multi-agent LLM debate include 37% spontaneous instability and 29% harmful conformity, with even vacuous reasoning persuading 20-39% of resistant agents and interventions reducing harmful conformity by 13.6 points.
-
Human-like in-group bias in instruction-tuned language model agents
Instruction-tuned language model agents exhibit in-group bias, action homophily, and network assortativity in simulations when group labels are salient, accumulating into structural inequality over repeated interactions.
-
AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing
Analysis of news text in 34 languages shows cross-lingual convergence on AI-associated lemmas and increased prevalence of top AI-overused items after ChatGPT's release.
-
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
-
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
-
Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
-
TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
-
Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
-
How Large Language Models Balance Internal Knowledge with User and Document Assertions
LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
-
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
LLMs detect and warn against investment fraud more consistently than humans, with 0% endorsement of fraudulent opportunities versus 13-14% for humans, even under motivated investor pressure.
-
Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
-
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
-
BASIL: Bayesian Assessment of Sycophancy in LLMs
BASIL is a Bayesian probabilistic framework that separates sycophantic belief shifts from rational updating in LLMs and demonstrates its use on uncertainty-driven tasks along with mitigation via calibration and fine-tuning.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
-
User Detection and Response Patterns of Sycophantic Behavior in Conversational AI
Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total elimination.
-
Exploring the "Banality" of Deception in Generative AI
Deception in generative AI is subtle and normalized through defaults and interactions, with users often complicit, calling for friction, awareness, and regulatory approaches to protect users.
-
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
Proposes Knowledge Objects to externalize implicit AI knowledge for human verification, addressing a gap in current reliability methods.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs