Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Pith reviewed 2026-05-21 04:29 UTC · model grok-4.3
The pith
Off-the-shelf persona vectors reduce sycophancy nearly as much as targeted steering while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The persona vector is largely independent of the direction of sycophancy in activation space. Steering toward agreeable personas does not produce a mirror increase in sycophancy. These results indicate that sycophancy is better understood as a persona-level property rather than a single steerable direction.
What carries the argument
Off-the-shelf persona steering vectors for doubt or scrutiny roles applied in activation space without any sycophancy-specific training data.
If this is right
- Steering with scrutiny personas reduces sycophancy without the accuracy penalty observed under CAA.
- The reduction is asymmetric: agreeable personas do not symmetrically increase sycophancy.
- Sycophancy is better treated as a persona-level property than as a single fixed direction in activation space.
Where Pith is reading between the lines
- General role-playing vectors might serve as ready-made controls for other unwanted model behaviors without new labelled data.
- A small library of persona vectors could address multiple alignment problems at once if the orthogonality pattern holds across behaviors.
- The geometric independence finding invites direct tests of whether combining several persona vectors produces additive or interfering effects on sycophancy.
Load-bearing premise
Off-the-shelf persona vectors developed for general role-playing transfer effectively to sycophancy reduction without post-hoc selection or extra fitting that would make the comparison circular.
What would settle it
Apply the doubt or scrutiny persona vector to a new set of questions where users are sometimes correct and sometimes incorrect; if sycophancy does not drop to near CAA levels while accuracy on correct-user cases stays flat, the central claim fails.
Figures
read the original abstract
We study the effect of different persona on \textbf{sycophancy}: model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately $68\%$ and $98\%$ of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines whether off-the-shelf persona steering vectors, originally developed for general role-playing and not derived from sycophancy data, can reduce sycophancy in LLMs comparably to Contrastive Activation Addition (CAA). It reports that steering toward doubt or scrutiny personas achieves approximately 68% and 98% of CAA's sycophancy reduction in two instruction-tuned models while preserving accuracy when the user is correct (unlike CAA), notes an asymmetry where agreeable personas do not symmetrically increase sycophancy, and finds the persona vector largely independent of the sycophancy direction in activation space. The work concludes that sycophancy is better understood as a persona-level property and releases code for reproducibility.
Significance. If the off-the-shelf selection and transfer claims hold, the result would be significant for AI alignment research by showing that reusable, general-purpose persona vectors can match or exceed task-specific steering methods without needing sycophancy-labeled data. The code release and geometric analysis are strengths that support reproducibility and deeper mechanistic insight. This could encourage broader adoption of persona-based interventions over narrowly fitted directions.
major comments (2)
- [Methods / Experimental Setup] The headline claim that off-the-shelf persona vectors rival CAA depends on the personas (doubt/scrutiny) having been fixed in advance from the general role-play set without post-hoc filtering or evaluation on the sycophancy benchmark. The manuscript must clarify in the methods or experimental setup section exactly how these specific personas were selected, including the full list of candidates considered and confirmation that no sycophancy metric was used in the selection process. Absent this detail, the comparison risks circularity as noted in the stress-test concern.
- [Results] §4 or results section: the reported reductions (68% and 98% of CAA effect) and accuracy preservation claims require explicit statistical controls, error bars, and details on dataset construction or number of trials. Without these, it is difficult to assess whether the differences from CAA are robust or sensitive to post-hoc choices in evaluation.
minor comments (2)
- [Abstract / Introduction] The abstract states results for 'two instruction-tuned models' but does not name the models or provide citations; this information should appear in the introduction or methods for clarity.
- [Geometric Analysis] The geometric independence claim would benefit from a specific metric (e.g., cosine similarity value or projection onto the CAA direction) reported in the relevant figure or section.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify key aspects of our experimental design and results presentation. We address each major comment below and will revise the manuscript accordingly to improve transparency and robustness.
read point-by-point responses
-
Referee: [Methods / Experimental Setup] The headline claim that off-the-shelf persona vectors rival CAA depends on the personas (doubt/scrutiny) having been fixed in advance from the general role-play set without post-hoc filtering or evaluation on the sycophancy benchmark. The manuscript must clarify in the methods or experimental setup section exactly how these specific personas were selected, including the full list of candidates considered and confirmation that no sycophancy metric was used in the selection process. Absent this detail, the comparison risks circularity as noted in the stress-test concern.
Authors: We agree that explicit documentation of the persona selection process is necessary to rule out any appearance of circularity. The doubt and scrutiny personas were drawn from a pre-existing collection of general role-play vectors developed independently for broad persona steering tasks, with no reference to sycophancy benchmarks or metrics at any stage of selection. To strengthen the manuscript, we will revise the Methods section to list all candidate personas from the original role-play set, describe the fixed selection criteria used prior to any sycophancy evaluation, and include a direct statement confirming that no sycophancy-related data or metrics influenced the choice of these two vectors. revision: yes
-
Referee: [Results] §4 or results section: the reported reductions (68% and 98% of CAA effect) and accuracy preservation claims require explicit statistical controls, error bars, and details on dataset construction or number of trials. Without these, it is difficult to assess whether the differences from CAA are robust or sensitive to post-hoc choices in evaluation.
Authors: We acknowledge that additional statistical detail would aid assessment of robustness. Our current results aggregate performance across the evaluation sets described in Section 3, but we will expand the Results section (and associated figures) to report standard error bars computed over multiple independent trials, provide the exact number of trials and dataset sizes used for each model, and include a brief description of how the test prompts were constructed and sampled. These additions will make the reported effect sizes and accuracy preservation claims easier to evaluate for sensitivity. revision: yes
Circularity Check
No circularity: empirical results on pre-existing persona vectors
full rationale
The paper reports experimental steering outcomes on instruction-tuned models using persona vectors described as off-the-shelf and developed for general role-playing without training on sycophancy data. No equations, derivations, or fitted parameters are presented that reduce the reported sycophancy reduction percentages (68% and 98% of CAA) to quantities chosen or selected inside the same experiment to match the target metric. The comparison to CAA is based on direct empirical measurements rather than self-referential definitions or post-experiment tuning that would force the result by construction. The work is self-contained against external benchmarks of steering effectiveness and does not rely on load-bearing self-citations or uniqueness theorems for its central claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Persona vectors developed for general role-playing transfer to sycophancy reduction without behavior-specific retraining.
Reference graph
Works this paper leans on
-
[1]
Feng, X. and others , title =. arXiv preprint arXiv:2602.15669 , year =
-
[2]
Goral, G. and Winkels, M. and Basart, S. , title =. arXiv preprint arXiv:2512.07667 , year =
-
[3]
arXiv preprint arXiv:2408.00118 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Lee, B. W. and others , title =. International Conference on Learning Representations (ICLR Spotlight) , year =
-
[5]
Li, K. and others , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
-
[6]
Lu, C. and Gallagher, J. and Michala, J. and Fish, K. and Lindsey, J. , title =. arXiv preprint arXiv:2601.10387 , year =
-
[7]
Pai, T.-M. and others , title =. European Chapter of the Association for Computational Linguistics (EACL) , year =
-
[8]
Discovering Language Model Behaviors with Model-Written Evaluations
Perez, E. and others , title =. arXiv preprint arXiv:2212.09251 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
arXiv preprint arXiv:2505.09388 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Rimsky, N. and others , title =. Association for Computational Linguistics (ACL) , year =
-
[11]
Shah, A. and Mishra, D. and Silpasuwanchai, C. , title =. arXiv preprint arXiv:2604.10733 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Towards Understanding Sycophancy in Language Models
Sharma, M. and others , title =. arXiv preprint arXiv:2310.13548 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Steering Language Models With Activation Engineering
Turner, A. and others , title =. arXiv preprint arXiv:2308.10248 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Wang, K.; Li, J.; Yang, S.; Zhang, Z.; and Wang, D
Vennemeyer, D. and Duong, P. A. and Zhan, T. and Jiang, T. , title =. arXiv preprint arXiv:2509.21305 , year =
-
[15]
Wang, K. and others , title =. arXiv preprint arXiv:2508.02087 , year =
-
[16]
arXiv preprint arXiv:2506.19823 , year =
Wang, M. and others , title =. arXiv preprint arXiv:2506.19823 , year =
-
[17]
Representation Engineering: A Top-Down Approach to AI Transparency
Zou, A. and others , title =. arXiv preprint arXiv:2310.01405 , year =
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.