Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 1polarities
background 1representative citing papers
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.
citing papers explorer
-
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
-
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
-
Rhetorical Questions in LLM Representations: A Linear Probing Study
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
-
Exploring Concreteness Through a Figurative Lens
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.