pith. sign in

arxiv: 2510.16727 · v2 · pith:TKKL3KV2new · submitted 2025-10-19 · 💻 cs.CL · cs.AI

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

classification 💻 cs.CL cs.AI
keywords sycophancybeaconbiasmodelsalignmentlanguagelargelatent
0
0 comments X
read the original abstract

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...