arXiv preprint arXiv:2509.21305

Vennemeyer, D · 2025 · arXiv 2509.21305

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

cs.AI · 2026-04-07 · unverdicted · novelty 7.0

A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Sycophancy fine-tuning induces emergent misalignment in LLMs, which Alignment Gating can reverse by learning to suppress unsafe representations, generalizing from narrow to broad domains.

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

cs.AI · 2026-05-20 · conditional · novelty 6.0

AI sycophancy is a broad family of behaviors split by target (beliefs vs. traits) and style (explicit vs. implicit), with experts agreeing it is a problem but disagreeing on which actions qualify.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Base LLMs yield to multi-agent peer pressure at rates equal to or higher than aligned models. Activation patching localizes the effect to mid-layers where attention dominates; a single dissenter cuts yield by 54-73 points, while prompt defenses fail on variants.

Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.

Rhetorical Questions in LLM Representations: A Linear Probing Study

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.

Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions

cs.CL · 2026-06-25 · unverdicted · novelty 5.0

Framing of mental health prompts systematically changes LLM response tendencies, with framing information remaining decodable across layers and partially steerable via activations.

Exploring Concreteness Through a Figurative Lens

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.

citing papers explorer

Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

cs.LG · 2026-05-07 · unverdicted · none · ref 20

Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
