
Simple synthetic data reduces sycophancy in large language models

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le · 2023 · cs.CL · arXiv 2308.03958

Canonical reference. 100% of citing Pith papers cite this work as background.

38 Pith papers citing it



abstract

Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
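The abstract describes the intervention only at a high level. As a minimal sketch, assuming a labeled classification example and a hypothetical prompt template (the wording, the `make_example` helper, and the field names are illustrative assumptions, not taken from the paper or its repository), one way to build a training example in which the user's stated opinion is independent of the correct answer might look like this:

```python
import random

# Hypothetical template: the wording is an illustrative assumption.
PROMPT = (
    "Human: Hello, my name is {name}. I {stance} with the claim that "
    "the sentence \"{text}\" is {claimed}. Do you agree or disagree with "
    "the claim that the sentence \"{text}\" is {claimed}?\n"
    "Assistant: I"
)

def make_example(text, true_label, labels, rng, name="Alex"):
    """Build one prompt/target pair whose target ignores the user's stance."""
    claimed = rng.choice(labels)                # label asserted in the claim
    stance = rng.choice(["agree", "disagree"])  # user's opinion, drawn independently
    # The target depends only on whether the claim is true, never on the stance.
    target = "agree" if claimed == true_label else "disagree"
    return {
        "prompt": PROMPT.format(name=name, stance=stance, text=text, claimed=claimed),
        "target": target,
        "stance": stance,
        "claimed": claimed,
    }

if __name__ == "__main__":
    rng = random.Random(0)
    example = make_example(
        "The movie was wonderful.", "positive", ["positive", "negative"], rng
    )
    print(example["prompt"], example["target"])
```

Because the stance is sampled separately from the claim, a model finetuned on such pairs gets no signal from agreeing with the user; only the truth of the claim predicts the target.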


citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

Playing games with knowledge: AI-Induced delusions need game theoretic interventions

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

cs.AI · 2026-04-28 · unverdicted · novelty 7.0

LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

cs.AI · 2026-04-07 · unverdicted · novelty 7.0

A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning

cs.AI · 2025-10-08 · unverdicted · novelty 7.0

Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.

User-Assistant Bias in LLMs

cs.CL · 2025-08-16 · unverdicted · novelty 7.0

LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

cs.CL · 2025-06-08 · unverdicted · novelty 7.0

VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

Large Language Models Are Overconfident in Their Own Responses

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.

Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

cs.SE · 2026-06-01 · unverdicted · novelty 6.0

Open-weight LLMs display domain-dependent compliance with harmful requests spanning 71 percentage points, including a technical framing bypass that overrides safety without detectable signals, replicated in shape on closed frontier models.

Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

The paper introduces a three-source decomposition showing that answer flips in multi-agent LLM debate include 37% spontaneous instability and 29% harmful conformity, with even vacuous reasoning persuading 20-39% of resistant agents and interventions reducing harmful conformity by 13.6 points.

Human-like in-group bias in instruction-tuned language model agents

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

Instruction-tuned language model agents exhibit in-group bias, action homophily, and network assortativity in simulations when group labels are salient, accumulating into structural inequality over repeated interactions.

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

Analysis of news text in 34 languages shows cross-lingual convergence on AI-associated lemmas and increased prevalence of top AI-overused items after ChatGPT's release.

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

cs.LG · 2026-05-20 · conditional · novelty 6.0

On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.

Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.

Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.

TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.

Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.

citing papers explorer


Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm cs.CL · 2026-05-27 · unverdicted · none · ref 125 · internal anchor
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 85 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
How LLMs Are Persuaded: A Few Attention Heads, Rerouted cs.AI · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
ProactBench: Beyond What The User Asked For cs.LG · 2026-05-09 · unverdicted · none · ref 153 · internal anchor
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Playing games with knowledge: AI-Induced delusions need game theoretic interventions cs.AI · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
AI sycophancy creates belief spirals modeled as cheap talk games, mitigated by an Epistemic Mediator that introduces costly signals for type revelation and Belief Versioning for epistemic safety.
Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors cs.AI · 2026-04-28 · unverdicted · none · ref 35 · internal anchor
LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 62 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models cs.CL · 2026-04-12 · unverdicted · none · ref 50 · internal anchor
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition cs.AI · 2026-04-07 · unverdicted · none · ref 15 · internal anchor
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning cs.AI · 2025-10-08 · unverdicted · none · ref 47 · internal anchor
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
User-Assistant Bias in LLMs cs.CL · 2025-08-16 · unverdicted · none · ref 18 · internal anchor
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs cs.CL · 2025-06-08 · unverdicted · none · ref 41 · internal anchor
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision cs.CL · 2026-06-30 · unverdicted · none · ref 47 · internal anchor
Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI · 2026-06-04 · conditional · none · ref 43 · internal anchor
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
Large Language Models Are Overconfident in Their Own Responses cs.CL · 2026-06-02 · unverdicted · none · ref 29 · internal anchor
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs cs.SE · 2026-06-01 · unverdicted · none · ref 7 · internal anchor
Open-weight LLMs display domain-dependent compliance with harmful requests spanning 71 percentage points, including a technical framing bypass that overrides safety without detectable signals, replicated in shape on closed frontier models.
Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate cs.CL · 2026-05-30 · unverdicted · none · ref 12 · internal anchor
The paper introduces a three-source decomposition showing that answer flips in multi-agent LLM debate include 37% spontaneous instability and 29% harmful conformity, with even vacuous reasoning persuading 20-39% of resistant agents and interventions reducing harmful conformity by 13.6 points.
Human-like in-group bias in instruction-tuned language model agents cs.AI · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
Instruction-tuned language model agents exhibit in-group bias, action homophily, and network assortativity in simulations when group labels are salient, accumulating into structural inequality over repeated interactions.
AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing cs.CL · 2026-05-25 · unverdicted · none · ref 21 · internal anchor
Analysis of news text in 34 languages shows cross-lingual convergence on AI-associated lemmas and increased prevalence of top AI-overused items after ChatGPT's release.
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation cs.LG · 2026-05-20 · conditional · none · ref 31 · internal anchor
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation cs.CL · 2026-05-02 · unverdicted · none · ref 74 · internal anchor
CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination cs.LG · 2026-05-01 · unverdicted · none · ref 44 · internal anchor
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training cs.CL · 2026-04-30 · unverdicted · none · ref 18 · internal anchor
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
How Large Language Models Balance Internal Knowledge with User and Document Assertions cs.CL · 2026-04-24 · unverdicted · none · ref 9 · internal anchor
LLMs prefer document assertions over user assertions, are impressionable to external information, and gain better discrimination after fine-tuning on diverse source-interaction data.
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure cs.AI · 2026-04-22 · conditional · none · ref 10 · internal anchor
LLMs detect and warn against investment fraud more consistently than humans, with 0% endorsement of fraudulent opportunities versus 13-14% for humans, even under motivated investor pressure.
Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models cs.AI · 2026-04-13 · unverdicted · none · ref 9 · internal anchor
Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy cs.CL · 2026-04-02 · unverdicted · none · ref 26 · internal anchor
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
BASIL: Bayesian Assessment of Sycophancy in LLMs cs.AI · 2025-08-23 · unverdicted · none · ref 12 · internal anchor
BASIL is a Bayesian probabilistic framework that separates sycophantic belief shifts from rational updating in LLMs and demonstrates its use on uncertainty-driven tasks along with mitigation via calibration and fine-tuning.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 59 · internal anchor
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement cs.AI · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 53 · internal anchor
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior cs.LG · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
User Detection and Response Patterns of Sycophantic Behavior in Conversational AI cs.HC · 2026-01-15 · unverdicted · none · ref 13 · internal anchor
Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total elimination.
Exploring the "Banality" of Deception in Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 19 · internal anchor
Deception in generative AI is subtle and normalized through defaults and interactions, with users often complicit, calling for friction, awareness, and regulatory approaches to protect users.
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective cs.AI · 2026-05-03 · unverdicted · none · ref 17 · internal anchor
Proposes Knowledge Objects to externalize implicit AI knowledge for human verification, addressing a gap in current reliability methods.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 163 · internal anchor
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs cs.CV · 2026-03-19 · unreviewed · ref 32 · internal anchor
