SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism to identify how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark six models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy that teaches models to consider what the answer would be if the opposite assumptions were suggested. While a baseline mitigation that explicitly instructs models to be anti-sycophantic yields moderate reductions and can backfire, our counterfactual chain-of-thought (CoT) mitigation drives sycophancy to near zero across models, commitment levels, and clause types, without suppressing responsiveness to genuine evidence. Overall, we contribute a metric for benchmarking sycophancy and a mitigation informed by it.
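
The abstract does not state SWAY's formula, so the following is only a minimal sketch of the general idea it describes: comparing a model's agreement with the same claims under positive versus negative framing. The function name `framing_shift`, the assumption that agreement is scored in [0, 1], and the sample scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def framing_shift(agree_pos, agree_neg):
    """Mean agreement under positive framing minus mean under negative framing.

    Hypothetical illustration only; this is NOT the paper's SWAY definition.
    Both inputs are lists of agreement scores, assumed to lie in [0, 1],
    collected for the same underlying claims under the two framings.
    """
    if not agree_pos or not agree_neg:
        raise ValueError("both framings need at least one agreement score")
    return mean(agree_pos) - mean(agree_neg)


# Made-up scores for three claims, each posed with positive and negative pressure.
pos = [0.9, 0.8, 0.7]
neg = [0.3, 0.4, 0.5]
shift = framing_shift(pos, neg)
print(f"framing shift: {shift:.2f}")
```

Under this toy definition, a value near zero would mean that agreement does not depend on which stance the prompt suggests, while a large positive value would mean the model tends to follow the suggested stance.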

representative citing papers

Detecting and Controlling Sycophancy with Cascading Linear Features

cs.AI · 2026-06-23 · conditional · novelty 6.0

Cascading linear features extracted from graded sycophancy samples form separable subspaces that enable detection, scoring, and steering of sycophantic behavior in LLMs, matching or exceeding LLM-judge and prompting baselines.

citing papers explorer

  • Detecting and Controlling Sycophancy with Cascading Linear Features cs.AI · 2026-06-23 · conditional · none · ref 1 · internal anchor
