For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
arXiv preprint arXiv:2509.21305 , year=
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 9roles
background 1polarities
background 1representative citing papers
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad domains.
AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.
Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.
citing papers explorer
-
Comparing Linear Probes with Mahalanobis Cosine Similarity
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating
Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad domains.
-
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
-
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
-
Rhetorical Questions in LLM Representations: A Linear Probing Study
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
-
Exploring Concreteness Through a Figurative Lens
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.