Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

· 2026 · cs.AI · arXiv 2605.21006

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

We study the effect of different persona on \textbf{sycophancy}: model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately $68\%$ and $98\%$ of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

representative citing papers

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 5.0

Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.

citing papers explorer

Showing 1 of 1 citing paper.

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation cs.LG · 2026-06-09 · unverdicted · none · ref 7 · internal anchor
Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

fields

years

verdicts

representative citing papers

citing papers explorer