arxiv: 2605.21006 · v1 · pith:XWXF2MK2new · submitted 2026-05-20 · 💻 cs.AI · cs.CL · cs.LG

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Pith reviewed 2026-05-21 04:29 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords sycophancy · persona steering · activation addition · large language models · contrastive activation addition · alignment

The pith

Off-the-shelf persona vectors reduce sycophancy nearly as much as targeted steering while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether general persona steering vectors, built for ordinary role-playing rather than sycophancy, can substitute for the standard Contrastive Activation Addition (CAA) method, which requires labelled pairs of sycophantic and honest responses. In two instruction-tuned models, vectors for doubt or scrutiny personas achieve roughly 68 percent and 98 percent of CAA's reduction in sycophantic agreement. Unlike CAA, these vectors leave model accuracy intact when the user is actually correct. The effect is asymmetric: steering toward agreeable personas does not produce a matching rise in sycophancy. Geometrically, the persona vectors sit largely orthogonal to the sycophancy direction in activation space (all role–CAA cosines are below 0.17), which leads the authors to conclude that sycophancy behaves more like a persona property than a single fixed direction.

Core claim

Steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The persona vector is largely independent of the direction of sycophancy in activation space. Steering toward agreeable personas does not produce a mirror increase in sycophancy. These results indicate that sycophancy is better understood as a persona-level property rather than a single steerable direction.
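One plausible reading of the "68% and 98% of CAA's effect" figures is a ratio of sycophancy reductions, each measured against the unsteered baseline. The sketch below assumes that reading; the function name `relative_effect` and the example rates are hypothetical and are not taken from the paper.

```python
def relative_effect(baseline: float, persona: float, caa: float) -> float:
    """Fraction of CAA's sycophancy reduction achieved by a persona vector.

    All arguments are sycophancy rates in [0, 1]. The ratio is only
    defined when CAA actually lowers sycophancy (caa < baseline).
    """
    caa_reduction = baseline - caa
    if caa_reduction <= 0:
        raise ValueError("CAA must reduce sycophancy for the ratio to be defined")
    return (baseline - persona) / caa_reduction


# Hypothetical rates, chosen only to illustrate the arithmetic.
ratio = relative_effect(baseline=0.80, persona=0.63, caa=0.55)
print(f"{ratio:.2f}")  # prints 0.68
```

Under this reading, a value near 1.0 means the persona vector removes about as much sycophancy as CAA does, and a value above 1.0 would mean it removes more.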

What carries the argument

Off-the-shelf persona steering vectors for doubt or scrutiny roles applied in activation space without any sycophancy-specific training data.
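Applying a steering vector in activation space usually means adding a scaled direction to a hidden state at some layer. The following is a minimal sketch of that arithmetic on plain Python lists; the function `steer`, the toy vectors, and the coefficient are all hypothetical, and the paper's actual layers, scales, and hooks are not reproduced here.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], coefficient: float) -> List[float]:
    """Return hidden + coefficient * direction for one hidden-state vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering vector must have the same width")
    return [h + coefficient * d for h, d in zip(hidden, direction)]


# Toy values for illustration only.
hidden_state = [0.5, -1.0, 2.0]      # stand-in for a residual-stream activation
skeptic_vector = [0.1, 0.0, -0.2]    # hypothetical persona direction
steered = steer(hidden_state, skeptic_vector, coefficient=2.0)
print(steered)
```

A positive coefficient pushes the activation toward the persona and a negative one pushes away from it, which is what a coefficient sweep varies.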

If this is right

  • Steering with scrutiny personas reduces sycophancy without the accuracy penalty observed under CAA.
  • The reduction is asymmetric: agreeable personas do not symmetrically increase sycophancy.
  • Sycophancy is better treated as a persona-level property than as a single fixed direction in activation space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General role-playing vectors might serve as ready-made controls for other unwanted model behaviors without new labelled data.
  • A small library of persona vectors could address multiple alignment problems at once if the orthogonality pattern holds across behaviors.
  • The geometric independence finding invites direct tests of whether combining several persona vectors produces additive or interfering effects on sycophancy.

Load-bearing premise

Off-the-shelf persona vectors developed for general role-playing transfer effectively to sycophancy reduction without post-hoc selection or extra fitting that would make the comparison circular.

What would settle it

Apply the doubt or scrutiny persona vector to a new set of questions where users are sometimes correct and sometimes incorrect; if sycophancy does not drop to near CAA levels, or if accuracy on correct-user cases falls, the central claim fails.
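The test described above needs two numbers per steering condition: how often the model sides with a wrong user, and how often it sides with a right one. A minimal sketch of that scoring follows, assuming each trial is reduced to two booleans and that agreeing with a correct user counts as accurate; the `Trial` class and `score` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Trial:
    user_is_correct: bool  # ground truth: was the user's claim right?
    model_agreed: bool     # did the model side with the user?


def score(trials: Iterable[Trial]) -> Tuple[float, float]:
    """Return (sycophancy rate on user-wrong trials, accuracy on user-right trials)."""
    trials = list(trials)
    wrong = [t for t in trials if not t.user_is_correct]
    right = [t for t in trials if t.user_is_correct]
    if not wrong or not right:
        raise ValueError("need at least one trial of each kind")
    sycophancy_rate = sum(t.model_agreed for t in wrong) / len(wrong)
    # Assumption: agreeing with a correct user is the accurate response.
    accuracy_when_user_correct = sum(t.model_agreed for t in right) / len(right)
    return sycophancy_rate, accuracy_when_user_correct


# Toy data for illustration only.
example = [Trial(False, True), Trial(False, False), Trial(True, True), Trial(True, True)]
print(score(example))  # prints (0.5, 1.0)
```

Running the same scoring for the baseline, the persona vector, and CAA would give the comparison the paragraph calls for.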

Figures

Figures reproduced from arXiv: 2605.21006 by Ishaan Kelkar, Madhur Panwar, Maheep Chaudhary, Nebras Alam, Vasu Sharma, Vikram Kakaria.

Figure 1. Cosine similarity heatmap. Critical roles cluster (cos ≈ 0.6–0.7); conformist roles cluster separately (cos ≈ 0.8). All role–CAA cosines < 0.17, but signs differ across models. On Qwen (baseline 84%), interpretation is further complicated by ceiling effects and degradation. Pacifist at +500 produces model collapse (repetitive loops: “the truth that is the truth...”) and is flagged as degraded. Peacekee…
Figure 2. Figure 2 shows family-averaged steering curves across the full coefficient sweep. Critical roles produce a monotonic dose-response on both models: sycophancy decreases with increasing positive coefficient until degradation onset at the sweep extremes. On Gemma, the critical-family curve separates from the random null band by coefficient +1000 and achieves maximum separation at +2000 (tune-locked value). CAA shows the…
Figure 3. Per-seed ∆ logit. Each dot is one test seed (42, 7, 123); horizontal bars mark per-condition means. Degraded cells excluded. Baseline: “Mr. Locke, it’s an honor to converse with such an influential mind... my understanding aligns with your empiricist perspective. I believe that all knowledge ultimately derives from sensory experience.” The model opens with flattery and immediately agrees. Skeptic +2000: “…
Figure 4. ∆ sycophancy logit at tune-locked coefficient (3 seeds, degraded cells excluded). Error bars show 95% CIs; — = Holm-significant on all 3 seeds. sycophancy vectors may over-correct on simple factual claims. Pacifist (degraded) and Random (hedging without clear answers) serve as negative controls. These probes are limited in scope (16 questions, single model) and should be interpreted as suggestive rather th…
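The caption above marks conditions that are "Holm-significant". For readers unfamiliar with the term, here is a minimal sketch of the standard Holm-Bonferroni step-down procedure; the function name `holm_reject` and the example p-values are hypothetical, and how the paper groups its comparisons is not stated in this excerpt.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni: return, per hypothesis, whether it is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# Hypothetical p-values for three steering conditions.
print(holm_reject([0.01, 0.04, 0.03]))  # prints [True, False, False]
```

The procedure controls the family-wise error rate while being less conservative than a plain Bonferroni correction.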
Figure 5. Cosine similarity heatmap. Critical roles cluster (cos ≈ 0.6–0.7); conformist roles cluster separately (cos ≈ 0.8). All role–CAA cosines are < 0.17, but signs differ across models (see text). G. Reproducibility: All role vectors are sourced from lu-christina/assistant-axis-vectors on HuggingFace. CAA vectors are extracted following Rimsky et al. (2024) from disjoint datasets (nlp survey and political typolo…
Figure 6. ∆ sycophancy logit at tune-locked coefficient (3 seeds, degraded cells excluded). Error bars show 95% CIs; — = Holm-significant on all 3 seeds. [Plot residue: panel "Qwen Sycophancy Rate" (x-axis: steering coefficient; y-axis: sycophancy rate (%)); panel "Qwen Sycophancy Logit" (x-axis: steering coefficient; y-axis: mean sycophancy logit, log p(syc) − log p(hon)); Qwen Response to Steeri…]
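The axis label in this figure defines the sycophancy logit as log p(syc) − log p(hon). A minimal sketch of that quantity and of a steered-minus-baseline ∆ follows; treating ∆ as steered minus baseline is an assumption, and the function names and probabilities are hypothetical.

```python
import math
from typing import Sequence, Tuple


def sycophancy_logit(p_syc: float, p_hon: float) -> float:
    """log p(syc) - log p(hon); positive values favour the sycophantic answer."""
    return math.log(p_syc) - math.log(p_hon)


def mean_delta_logit(
    baseline: Sequence[Tuple[float, float]],
    steered: Sequence[Tuple[float, float]],
) -> float:
    """Mean (steered - baseline) logit over matched prompts.

    Each item is a (p_syc, p_hon) pair for one prompt.
    """
    if len(baseline) != len(steered) or not baseline:
        raise ValueError("need matched, non-empty prompt lists")
    deltas = [
        sycophancy_logit(*s) - sycophancy_logit(*b)
        for b, s in zip(baseline, steered)
    ]
    return sum(deltas) / len(deltas)


# Toy probabilities for illustration only.
delta = mean_delta_logit(baseline=[(0.8, 0.2)], steered=[(0.5, 0.5)])
print(round(delta, 3))  # prints -1.386
```

A negative ∆ means steering shifted probability mass away from the sycophantic answer.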
Figure 7. Per-condition steering curves (kept conditions only). Rows = metric; columns = model; shaded bands = random-control mean ± std. Each line is a single condition; degraded cells excluded.
Figure 8. Per-condition steering curves (kept conditions only). Rows = metric; columns = model; shaded bands = random-control mean ± std. Each line is a single condition; degraded cells excluded.
Original abstract

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
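The abstract says CAA derives its steering direction from labelled pairs of sycophantic and honest responses. A common way to do this is to average the activation difference across pairs; the sketch below illustrates that mean-difference idea on toy lists. The function name `caa_direction` and the activations are hypothetical, and the paper's layer choice and token positions are not specified here.

```python
from typing import List, Sequence


def caa_direction(
    syc_acts: Sequence[Sequence[float]],
    hon_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Mean of (sycophantic - honest) activations over matched pairs."""
    if len(syc_acts) != len(hon_acts) or not syc_acts:
        raise ValueError("need matched, non-empty activation lists")
    n = len(syc_acts)
    width = len(syc_acts[0])
    direction = [0.0] * width
    for syc, hon in zip(syc_acts, hon_acts):
        for i in range(width):
            direction[i] += (syc[i] - hon[i]) / n
    return direction


# Toy activations for two labelled pairs, for illustration only.
syc = [[1.0, 2.0], [3.0, 2.0]]
hon = [[0.0, 2.0], [1.0, 0.0]]
print(caa_direction(syc, hon))  # prints [1.5, 1.0]
```

Subtracting a multiple of this direction is what steers away from sycophancy; persona vectors skip this step because they are built without sycophancy labels.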

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines whether off-the-shelf persona steering vectors, originally developed for general role-playing and not derived from sycophancy data, can reduce sycophancy in LLMs comparably to Contrastive Activation Addition (CAA). It reports that steering toward doubt or scrutiny personas achieves approximately 68% and 98% of CAA's sycophancy reduction in two instruction-tuned models while preserving accuracy when the user is correct (unlike CAA), notes an asymmetry where agreeable personas do not symmetrically increase sycophancy, and finds the persona vector largely independent of the sycophancy direction in activation space. The work concludes that sycophancy is better understood as a persona-level property and releases code for reproducibility.

Significance. If the off-the-shelf selection and transfer claims hold, the result would be significant for AI alignment research by showing that reusable, general-purpose persona vectors can match or exceed task-specific steering methods without needing sycophancy-labeled data. The code release and geometric analysis are strengths that support reproducibility and deeper mechanistic insight. This could encourage broader adoption of persona-based interventions over narrowly fitted directions.

major comments (2)
  1. [Methods / Experimental Setup] The headline claim that off-the-shelf persona vectors rival CAA depends on the personas (doubt/scrutiny) having been fixed in advance from the general role-play set without post-hoc filtering or evaluation on the sycophancy benchmark. The manuscript must clarify in the methods or experimental setup section exactly how these specific personas were selected, including the full list of candidates considered and confirmation that no sycophancy metric was used in the selection process. Absent this detail, the comparison risks circularity as noted in the stress-test concern.
  2. [Results] §4 or results section: the reported reductions (68% and 98% of CAA effect) and accuracy preservation claims require explicit statistical controls, error bars, and details on dataset construction or number of trials. Without these, it is difficult to assess whether the differences from CAA are robust or sensitive to post-hoc choices in evaluation.
minor comments (2)
  1. [Abstract / Introduction] The abstract states results for 'two instruction-tuned models' but does not name the models or provide citations; this information should appear in the introduction or methods for clarity.
  2. [Geometric Analysis] The geometric independence claim would benefit from a specific metric (e.g., cosine similarity value or projection onto the CAA direction) reported in the relevant figure or section.
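The metric requested in the last minor comment can be computed directly from two vectors. The following is a minimal sketch of cosine similarity and scalar projection, written in plain Python; the function names and the toy vectors are hypothetical and stand in for a persona vector and a CAA direction.

```python
import math
from typing import Sequence


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0 means orthogonal."""
    nu, nv = _norm(u), _norm(v)
    if nu == 0 or nv == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def scalar_projection(u: Sequence[float], v: Sequence[float]) -> float:
    """Length of u's component along the direction of v."""
    nv = _norm(v)
    if nv == 0:
        raise ValueError("cannot project onto a zero vector")
    return sum(a * b for a, b in zip(u, v)) / nv


# Toy vectors for illustration only.
persona = [1.0, 0.0, 0.0]
caa = [0.1, 1.0, 0.0]
print(round(cosine(persona, caa), 3))
```

A cosine near zero, as in this toy case, is what "largely independent" directions would look like numerically.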

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our experimental design and results presentation. We address each major comment below and will revise the manuscript accordingly to improve transparency and robustness.

Point-by-point responses
  1. Referee: [Methods / Experimental Setup] The headline claim that off-the-shelf persona vectors rival CAA depends on the personas (doubt/scrutiny) having been fixed in advance from the general role-play set without post-hoc filtering or evaluation on the sycophancy benchmark. The manuscript must clarify in the methods or experimental setup section exactly how these specific personas were selected, including the full list of candidates considered and confirmation that no sycophancy metric was used in the selection process. Absent this detail, the comparison risks circularity as noted in the stress-test concern.

    Authors: We agree that explicit documentation of the persona selection process is necessary to rule out any appearance of circularity. The doubt and scrutiny personas were drawn from a pre-existing collection of general role-play vectors developed independently for broad persona steering tasks, with no reference to sycophancy benchmarks or metrics at any stage of selection. To strengthen the manuscript, we will revise the Methods section to list all candidate personas from the original role-play set, describe the fixed selection criteria used prior to any sycophancy evaluation, and include a direct statement confirming that no sycophancy-related data or metrics influenced the choice of these two vectors. revision: yes

  2. Referee: [Results] §4 or results section: the reported reductions (68% and 98% of CAA effect) and accuracy preservation claims require explicit statistical controls, error bars, and details on dataset construction or number of trials. Without these, it is difficult to assess whether the differences from CAA are robust or sensitive to post-hoc choices in evaluation.

    Authors: We acknowledge that additional statistical detail would aid assessment of robustness. Our current results aggregate performance across the evaluation sets described in Section 3, but we will expand the Results section (and associated figures) to report standard error bars computed over multiple independent trials, provide the exact number of trials and dataset sizes used for each model, and include a brief description of how the test prompts were constructed and sampled. These additions will make the reported effect sizes and accuracy preservation claims easier to evaluate for sensitivity. revision: yes
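The error bars promised in this response could be computed as a mean and standard error over independent seeds. A minimal sketch follows, using the standard library only; the function name `mean_and_sem` and the per-seed values are hypothetical, with three entries to mirror the three test seeds (42, 7, 123) mentioned in the figure captions.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) using the sample std."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical per-seed sycophancy rates for one steering condition.
per_seed = [0.60, 0.64, 0.62]
mean, sem = mean_and_sem(per_seed)
print(f"{mean:.3f} ± {sem:.3f}")
```

With only three seeds the standard error is itself noisy, so reporting the individual seed values alongside it would help readers judge robustness.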

Circularity Check

0 steps flagged

No circularity: empirical results on pre-existing persona vectors

full rationale

The paper reports experimental steering outcomes on instruction-tuned models using persona vectors described as off-the-shelf and developed for general role-playing without training on sycophancy data. No equations, derivations, or fitted parameters are presented that reduce the reported sycophancy reduction percentages (68% and 98% of CAA) to quantities chosen or selected inside the same experiment to match the target metric. The comparison to CAA is based on direct empirical measurements rather than self-referential definitions or post-experiment tuning that would force the result by construction. The work is self-contained against external benchmarks of steering effectiveness and does not rely on load-bearing self-citations or uniqueness theorems for its central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the transferability of general persona vectors to sycophancy mitigation and on the assumption that the reported geometric independence is not an artifact of the chosen evaluation prompts or models.

axioms (1)
  • domain assumption Persona vectors developed for general role-playing transfer to sycophancy reduction without behavior-specific retraining.
    Invoked when the paper treats the off-the-shelf vectors as ready alternatives to CAA-derived directions.

pith-pipeline@v0.9.0 · 5760 in / 1324 out tokens · 39120 ms · 2026-05-21T04:29:29.505292+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1] Feng, X. and others. arXiv preprint arXiv:2602.15669.

  2. [2] Goral, G., Winkels, M., and Basart, S. arXiv preprint arXiv:2512.07667.

  3. [3] arXiv preprint arXiv:2408.00118.

  4. [4] Lee, B. W. and others. International Conference on Learning Representations (ICLR Spotlight).

  5. [5] Li, K. and others. Advances in Neural Information Processing Systems (NeurIPS).

  6. [6] Lu, C., Gallagher, J., Michala, J., Fish, K., and Lindsey, J. arXiv preprint arXiv:2601.10387.

  7. [7] Pai, T.-M. and others. European Chapter of the Association for Computational Linguistics (EACL).

  8. [8] Perez, E. and others. Discovering Language Model Behaviors with Model-Written Evaluations. arXiv preprint arXiv:2212.09251.

  9. [9] arXiv preprint arXiv:2505.09388.

  10. [10] Rimsky, N. and others. Association for Computational Linguistics (ACL).

  11. [11] Shah, A., Mishra, D., and Silpasuwanchai, C. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models. arXiv preprint arXiv:2604.10733.

  12. [12] Sharma, M. and others. Towards Understanding Sycophancy in Language Models. arXiv preprint arXiv:2310.13548.

  13. [13] Turner, A. and others. Steering Language Models With Activation Engineering. arXiv preprint arXiv:2308.10248.

  14. [14] Wang, K.; Li, J.; Yang, S.; Zhang, Z.; and Wang, D
      Vennemeyer, D., Duong, P. A., Zhan, T., and Jiang, T. arXiv preprint arXiv:2509.21305.

  15. [15]

  16. [16] Wang, M. and others. arXiv preprint arXiv:2506.19823.

  17. [17] Zou, A. and others. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.