
arxiv: 2604.08986 · v1 · submitted 2026-04-10 · 💻 cs.CL · cs.AI

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords persona prompting · RLVR · persona expressivity · reinforcement learning · model alignment · robustness trade-off · LLM training

The pith

Mixing personas into the RLVR training process lets models keep both task robustness and the ability to adopt characters faithfully.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning with verifiable rewards improves how well models ignore unhelpful persona prompts on math and similar tasks, yet it also makes them less able to follow a requested character when the task calls for it. The paper introduces PerMix-RLVR, which mixes multiple personas into the same training episodes so the model learns to switch behaviors on demand without losing the robustness gains. This removes the need to hunt for the right prompt at inference time. A sympathetic reader would care because it turns persona prompting from an unreliable add-on into a stable property of the trained model.

Core claim

Reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts and thereby improves robustness on tasks with verifiable goals, yet it degrades persona expressivity when role-playing is required. PerMix-RLVR counters this by applying a persona-mixed RLVR strategy that preserves strong robustness to harmful persona variation while enabling faithful persona adoption when needed, yielding a +21.2 percent gain in persona stability score on MATH500 and a +11.4 percent gain in persona fidelity on PersonaGym.

What carries the argument

PerMix-RLVR, a persona-mixed RLVR strategy that interleaves diverse persona instructions within the same verifiable-reward training episodes to balance robustness against expressivity.
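
The mixing idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the persona pool, the uniform sampling, the `####` final-answer marker, and the helper names `build_prompt`, `verifiable_reward`, and `persona_mixed_batch` are hypothetical and are not taken from the paper, which does not specify sampling ratios or schedules on this page.

```python
import random

# Hypothetical persona pool (assumption; the paper's actual pool and ratios are not given here).
PERSONAS = [
    None,  # no persona
    "You are a math professor.",
    "You are a kindergartener who just learned their numbers.",
    "You are a carpenter.",
]


def build_prompt(question, persona):
    """Prepend the persona instruction when one is given."""
    if persona is None:
        return question
    return f"{persona}\n\n{question}"


def verifiable_reward(response, gold_answer):
    """Binary outcome reward: 1.0 if the text after the last '####' equals the gold answer.

    The '####' answer marker is an assumed output convention.
    """
    if "####" not in response:
        return 0.0
    predicted = response.rsplit("####", 1)[1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0


def persona_mixed_batch(problems, rng, personas=PERSONAS):
    """Pair every (question, answer) with a uniformly sampled persona (assumed ratio)."""
    batch = []
    for question, answer in problems:
        persona = rng.choice(personas)
        batch.append(
            {"prompt": build_prompt(question, persona), "persona": persona, "answer": answer}
        )
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    for item in persona_mixed_batch([("What is 2 + 2?", "4")], rng):
        print(item["prompt"])
```

In this sketch the reward depends only on the final answer, so the persona changes the prompt distribution the policy is trained on without changing what counts as correct.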

If this is right

  • Models trained this way can answer verifiable math problems correctly even when given an unhelpful persona prompt.
  • The same model can still produce in-character responses on role-play benchmarks without extra prompt engineering at test time.
  • Outcome-based optimization no longer forces a strict choice between robustness and expressivity.
  • Training-time mixing removes the repeated inference cost of searching for effective persona prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixing idea could be tested on other alignment methods that use outcome rewards, such as direct preference optimization variants.
  • If the mixing ratio can be learned automatically, the method might scale to very large numbers of personas without manual design.
  • This training change might also reduce the variance seen across different prompt formats for the same underlying task.

Load-bearing premise

The measured gains on both stability and fidelity are caused by the persona-mixing step rather than other training choices, and the two chosen benchmarks together capture the full robustness-fidelity trade-off.

What would settle it

An ablation that removes the persona-mixing component while keeping every other training detail fixed and then observes that both the persona stability score on MATH500 and the persona fidelity score on PersonaGym fall back to the levels of plain RLVR.

Figures

Figures reproduced from arXiv: 2604.08986 by Jihwan Oh, Minchan Jeong, Murad Aghazada, Se-Young Yun, Soowon Oh, Sungnyun Kim.

Figure 1: Persona lottery and the cost of prompt search. (a) MATH500 and (b) AIME2024 show Llama3 and Gemma3 performances under seven different personas (four math experts, kindergartener, carpenter, and no persona). Each dot denotes persona; ⋆ and ◦ mark the best and worst persona per model, and black bars show mean ± std across personas, revealing large persona-induced variance. (c) Expected best-of-k. We report t…

Figure 2: PSS comparison across model families and scales. RLVR-trained families consistently exhibit higher PSS, indicating greater robustness to persona-induced variations. Full results are provided in Appendix C. Qwen3-32B has been post-trained with RLVR, and the other Qwen3 models have been distilled from Qwen3-32B/235B-A22B without additional post-training, as highlighted in Yang et al. (2025). Additionally, t…

Figure 3: Standard RLVR vs. PerMix-RLVR. (a) RLVR reweights πref(s|x) toward competent styles, yielding πRL(s|x). A test-time persona shifts the prior to πref(s|x, p), causing support mismatch for standard RLVR. (b) PerMix-RLVR is trained on diverse persona-conditioned priors to produce πPerMix-RL(s|x, p) aligned with persona-shifted support. This indicates better coverage under the persona-shifted prior πref(s|x, p…

Figure 4: Graphical model of the generation process. Shaded nodes are observed; the unshaded node
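
The per-model summary described in the Figure 1 caption (best persona, worst persona, mean ± std across personas) can be computed in a few lines. This is a sketch only: the function name `persona_spread` is hypothetical, the accuracies in the usage example are placeholders rather than results from the paper, and the use of the population standard deviation is an assumption since the caption does not say which estimator is used.

```python
from statistics import mean, pstdev


def persona_spread(acc_by_persona):
    """Summarize accuracy variation across personas for one model.

    acc_by_persona maps a persona label to an accuracy in [0, 1].
    Uses the population standard deviation (assumption).
    """
    if not acc_by_persona:
        raise ValueError("need at least one persona")
    values = list(acc_by_persona.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "best": max(acc_by_persona, key=acc_by_persona.get),
        "worst": min(acc_by_persona, key=acc_by_persona.get),
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not taken from the paper.
    placeholder = {"no persona": 0.50, "math expert": 0.75, "kindergartener": 0.25}
    print(persona_spread(placeholder))
```

A large `std` relative to `mean` is what the caption calls persona-induced variance: the outcome depends heavily on which persona happened to be chosen.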
Original abstract

Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RLVR improves robustness to persona prompts in LLMs but introduces a robustness-fidelity trade-off that degrades persona expressivity. It proposes PerMix-RLVR, which incorporates persona mixing during RLVR training to mitigate this trade-off while preserving task performance. The method yields concrete gains: +21.2% on persona stability score (PSS) on MATH500 relative to standard RLVR, and +11.4% on persona fidelity on PersonaGym.

Significance. If the empirical results hold under detailed scrutiny, this would be a useful contribution to LLM alignment and controllable generation. The work clearly articulates an outcome-based optimization trade-off and offers a training-time persona-mixing intervention that avoids extra inference cost. The reported benchmark improvements on MATH500 and PersonaGym provide a falsifiable starting point for further study of persona robustness in verifiable-reward settings.

major comments (3)
  1. [Experiments] Experiments section: the reported +21.2% PSS and +11.4% fidelity gains are presented without error bars, standard deviations across runs, or statistical significance tests. This directly affects the reliability of the central claim that PerMix-RLVR resolves the trade-off.
  2. [Method] Method section: the persona-mixing procedure is described at a high level but lacks concrete specification of sampling ratios, mixing schedule, or how the mixed objective is combined with the verifiable reward. Without these details, attribution of gains to the mixing component versus other training choices cannot be verified.
  3. [Evaluation] Evaluation section: no ablation studies isolate the persona-mixing component from the base RLVR pipeline or alternative mixing strategies. This is load-bearing because the weakest assumption in the argument is that observed improvements stem specifically from persona mixing rather than unspecified implementation details.
minor comments (2)
  1. [Abstract] Abstract: the terms PSS and PersonaGym are used without a one-sentence definition, which would improve accessibility for readers unfamiliar with the benchmarks.
  2. [Introduction] Introduction: several references to prior persona-prompting work could be expanded with more recent citations to strengthen the literature positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for improving the clarity and rigor of our work. We agree that additional statistical reporting, implementation details, and ablations will strengthen the manuscript and plan to incorporate them in the revision.

  1. Referee: [Experiments] Experiments section: the reported +21.2% PSS and +11.4% fidelity gains are presented without error bars, standard deviations across runs, or statistical significance tests. This directly affects the reliability of the central claim that PerMix-RLVR resolves the trade-off.

    Authors: We agree that the absence of variability measures and significance testing limits the strength of the empirical claims. In the revised manuscript, we will conduct additional training runs with multiple random seeds, report standard deviations and error bars for all key metrics including the PSS and fidelity scores, and include statistical significance tests (such as paired t-tests or Wilcoxon tests) to confirm the reported improvements over standard RLVR.
    revision: yes

  2. Referee: [Method] Method section: the persona-mixing procedure is described at a high level but lacks concrete specification of sampling ratios, mixing schedule, or how the mixed objective is combined with the verifiable reward. Without these details, attribution of gains to the mixing component versus other training choices cannot be verified.

    Authors: We acknowledge that the current Method section provides only a high-level overview. In the revision, we will expand this section with explicit implementation details, including the exact persona sampling ratios used during mixing, the training schedule for applying the mixed objective, and the precise formulation for combining it with the verifiable reward loss (e.g., as a weighted sum with the specific coefficient). These specifications will enable full reproducibility and clearer attribution of results.
    revision: yes

  3. Referee: [Evaluation] Evaluation section: no ablation studies isolate the persona-mixing component from the base RLVR pipeline or alternative mixing strategies. This is load-bearing because the weakest assumption in the argument is that observed improvements stem specifically from persona mixing rather than unspecified implementation details.

    Authors: We recognize that ablations are necessary to isolate the contribution of persona mixing. In the revised manuscript, we will add a dedicated ablation study section that compares PerMix-RLVR against the base RLVR pipeline, variants with alternative mixing strategies (e.g., random or uniform mixing), and ablations on key hyperparameters such as mixing ratio and schedule. These experiments will directly test whether the observed gains on PSS and fidelity are attributable to the proposed persona-mixing approach.
    revision: yes
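
The multi-seed significance check promised in the first response could take several forms. The sketch below uses an exact sign-flip permutation test on paired per-seed differences, which is a swapped-in alternative to the paired t-test and Wilcoxon test named in the rebuttal, chosen here only because it needs nothing beyond the standard library. The function name `paired_sign_flip_pvalue` is hypothetical, and the scores in the usage example are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(baseline, treated):
    """Exact two-sided sign-flip permutation test on paired score differences.

    baseline[i] and treated[i] are scores from the same seed i.
    Enumerates all 2**n sign assignments, so keep n small (a handful of seeds).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Placeholder per-seed scores for illustration only; not taken from the paper.
    rlvr_scores = [0.60, 0.62, 0.61, 0.59, 0.63]
    permix_scores = [0.70, 0.69, 0.72, 0.68, 0.71]
    print(paired_sign_flip_pvalue(rlvr_scores, permix_scores))
```

With n seeds the smallest attainable two-sided p-value is 2 / 2**n, so five seeds cannot go below 0.0625; the number of seeds therefore bounds how strong a claim this kind of test can support.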

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmarks


The paper proposes PerMix-RLVR as an empirical training strategy to mitigate an observed robustness-fidelity trade-off in RLVR for persona-controlled LLMs. Central claims consist of measured improvements on external benchmarks (PSS on MATH500, fidelity on PersonaGym) rather than any derivation, equation, or first-principles prediction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the argument is self-contained via direct experimental comparison without circular reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced in the abstract; the work relies on standard domain assumptions about RLVR effects on persona sensitivity.

axioms (1)
  • domain assumption: RLVR systematically reduces sensitivity to persona prompts while improving task robustness on verifiable goals.
    Stated as an observed finding that motivates the new method.

pith-pipeline@v0.9.0 · 5545 in / 1132 out tokens · 43145 ms · 2026-05-10T17:57:46.360396+00:00 · methodology

