PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3
The pith
Mixing personas into the RLVR training process lets models keep both task robustness and the ability to adopt characters faithfully.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts and thereby improves robustness on tasks with verifiable goals, yet it degrades persona expressivity when role-playing is required. PerMix-RLVR counters this by applying a persona-mixed RLVR strategy that preserves strong robustness to harmful persona variation while enabling faithful persona adoption when needed, yielding a +21.2 percent gain in persona stability score on MATH500 and a +11.4 percent gain in persona fidelity on PersonaGym.
What carries the argument
PerMix-RLVR, a persona-mixed RLVR strategy that interleaves diverse persona instructions within the same verifiable-reward training episodes to balance robustness against expressivity.
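A minimal sketch of what interleaving personas into verifiable-reward episodes could look like. The persona pool, the `mix_ratio` parameter, the `####` answer marker, and both function names are assumptions made for illustration; this review does not state the paper's actual sampling ratios or reward code.

```python
import random

# Hypothetical persona pool; the paper's real persona set is not given in this review.
PERSONAS = [
    "You are a kindergartener.",
    "You are a medieval knight.",
    "You are a careful mathematician.",
]


def mix_personas(questions, personas, mix_ratio, rng):
    """Pair each question with a system prompt.

    With probability mix_ratio (an assumed hyperparameter) the prompt is a
    randomly chosen persona; otherwise it is empty, i.e. a plain episode.
    """
    batch = []
    for question in questions:
        if rng.random() < mix_ratio:
            system_prompt = rng.choice(personas)
        else:
            system_prompt = ""
        batch.append((system_prompt, question))
    return batch


def verifiable_reward(response, gold_answer):
    """Outcome-only reward: 1.0 if the text after the last '####' equals the gold answer.

    The '####' marker is an assumed answer format. Persona style is ignored,
    so in-character and plain responses are scored the same way.
    """
    if "####" not in response:
        return 0.0
    final_answer = response.rsplit("####", 1)[-1].strip()
    return 1.0 if final_answer == gold_answer.strip() else 0.0
```

Because the reward looks only at the final answer, any persona-related effect in such a setup would have to come from the prompt distribution seen during training, which is the component the ablation discussed later is meant to isolate.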
If this is right
- Models trained this way can answer verifiable math problems correctly even when given an unhelpful persona prompt.
- The same model can still produce in-character responses on role-play benchmarks without extra prompt engineering at test time.
- Outcome-based optimization no longer forces a strict choice between robustness and expressivity.
- Training-time mixing removes the repeated inference cost of searching for effective persona prompts.
Where Pith is reading between the lines
- The same mixing idea could be tested on other alignment methods that use outcome rewards, such as direct preference optimization variants.
- If the mixing ratio can be learned automatically, the method might scale to very large numbers of personas without manual design.
- This training change might also reduce the variance seen across different prompt formats for the same underlying task.
Load-bearing premise
The measured gains on both stability and fidelity are caused by the persona-mixing step rather than other training choices, and the two chosen benchmarks together capture the full robustness-fidelity trade-off.
What would settle it
An ablation that removes the persona-mixing component while keeping every other training detail fixed and then observes that both the persona stability score on MATH500 and the persona fidelity score on PersonaGym fall back to the levels of plain RLVR.
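The settling condition above can be written as a simple check. This is a sketch under stated assumptions: the function name, the metric keys, and the tolerance `tol` are hypothetical, and the numbers in the usage lines are placeholders, not results from the paper.

```python
def ablation_supports_mixing(rlvr, no_mix, permix, tol):
    """Return True if, for every metric, the ablated run (mixing removed)
    lands within tol of plain RLVR while the full method exceeds RLVR by more than tol.

    All three arguments are dicts mapping a metric name to a score.
    """
    return all(
        abs(no_mix[metric] - rlvr[metric]) <= tol
        and permix[metric] - rlvr[metric] > tol
        for metric in rlvr
    )


# Placeholder scores for illustration only.
rlvr = {"pss_math500": 0.50, "fidelity_personagym": 0.40}
no_mix = {"pss_math500": 0.51, "fidelity_personagym": 0.41}
permix = {"pss_math500": 0.60, "fidelity_personagym": 0.50}
supported = ablation_supports_mixing(rlvr, no_mix, permix, tol=0.02)
```

A sensible value for `tol` would come from the seed-to-seed spread of each metric, which the paper does not currently report.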
Original abstract
Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when it is needed, e.g., in in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLVR improves robustness to persona prompts in LLMs but introduces a robustness-fidelity trade-off that degrades persona expressivity. It proposes PerMix-RLVR, which incorporates persona mixing during RLVR training to mitigate this trade-off while preserving task performance. The method yields concrete gains: +21.2% on persona stability score (PSS) on MATH500 relative to standard RLVR, and +11.4% on persona fidelity on PersonaGym.
Significance. If the empirical results hold under detailed scrutiny, this would be a useful contribution to LLM alignment and controllable generation. The work clearly articulates an outcome-based optimization trade-off and offers a training-time persona-mixing intervention that avoids extra inference cost. The reported benchmark improvements on MATH500 and PersonaGym provide a falsifiable starting point for further study of persona robustness in verifiable-reward settings.
major comments (3)
- [Experiments] Experiments section: the reported +21.2% PSS and +11.4% fidelity gains are presented without error bars, standard deviations across runs, or statistical significance tests. This directly affects the reliability of the central claim that PerMix-RLVR resolves the trade-off.
- [Method] Method section: the persona-mixing procedure is described at a high level but lacks concrete specification of sampling ratios, mixing schedule, or how the mixed objective is combined with the verifiable reward. Without these details, attribution of gains to the mixing component versus other training choices cannot be verified.
- [Evaluation] Evaluation section: no ablation studies isolate the persona-mixing component from the base RLVR pipeline or alternative mixing strategies. This is load-bearing because the weakest assumption in the argument is that observed improvements stem specifically from persona mixing rather than unspecified implementation details.
minor comments (2)
- [Abstract] Abstract: the terms PSS and PersonaGym are used without a one-sentence definition, which would improve accessibility for readers unfamiliar with the benchmarks.
- [Introduction] Introduction: several references to prior persona-prompting work could be expanded with more recent citations to strengthen the literature positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects for improving the clarity and rigor of our work. We agree that additional statistical reporting, implementation details, and ablations will strengthen the manuscript and plan to incorporate them in the revision.
- Referee: [Experiments] Experiments section: the reported +21.2% PSS and +11.4% fidelity gains are presented without error bars, standard deviations across runs, or statistical significance tests. This directly affects the reliability of the central claim that PerMix-RLVR resolves the trade-off.
  Authors: We agree that the absence of variability measures and significance testing limits the strength of the empirical claims. In the revised manuscript, we will conduct additional training runs with multiple random seeds, report standard deviations and error bars for all key metrics including the PSS and fidelity scores, and include statistical significance tests (such as paired t-tests or Wilcoxon tests) to confirm the reported improvements over standard RLVR.
  Revision: yes
- Referee: [Method] Method section: the persona-mixing procedure is described at a high level but lacks concrete specification of sampling ratios, mixing schedule, or how the mixed objective is combined with the verifiable reward. Without these details, attribution of gains to the mixing component versus other training choices cannot be verified.
  Authors: We acknowledge that the current Method section provides only a high-level overview. In the revision, we will expand this section with explicit implementation details, including the exact persona sampling ratios used during mixing, the training schedule for applying the mixed objective, and the precise formulation for combining it with the verifiable reward loss (e.g., as a weighted sum with the specific coefficient). These specifications will enable full reproducibility and clearer attribution of results.
  Revision: yes
- Referee: [Evaluation] Evaluation section: no ablation studies isolate the persona-mixing component from the base RLVR pipeline or alternative mixing strategies. This is load-bearing because the weakest assumption in the argument is that observed improvements stem specifically from persona mixing rather than unspecified implementation details.
  Authors: We recognize that ablations are necessary to isolate the contribution of persona mixing. In the revised manuscript, we will add a dedicated ablation study section that compares PerMix-RLVR against the base RLVR pipeline, variants with alternative mixing strategies (e.g., random or uniform mixing), and ablations on key hyperparameters such as mixing ratio and schedule. These experiments will directly test whether the observed gains on PSS and fidelity are attributable to the proposed persona-mixing approach.
  Revision: yes
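As an illustration of the multi-seed significance testing promised in the first response, here is a sketch of an exact paired sign-flip permutation test using only the Python standard library. This is a swapped-in alternative to the paired t-test and Wilcoxon test the authors name, chosen because it needs no external packages; the function name is hypothetical and the scores are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(baseline, treated):
    """Exact two-sided sign-flip permutation test on per-seed paired differences.

    Returns (mean difference, p-value). Under the null hypothesis that the two
    methods are interchangeable, each paired difference is equally likely to
    have either sign, so all 2**n sign patterns are enumerated.
    """
    if len(baseline) != len(treated):
        raise ValueError("need one score per seed for both methods")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance so that ties with the observed value are counted.
        if flipped >= observed - 1e-12:
            extreme += 1
        total += 1
    return mean(diffs), extreme / total


# Placeholder per-seed scores for illustration only.
rlvr_pss = [0.50, 0.52, 0.49, 0.51, 0.50]
permix_pss = [0.60, 0.63, 0.58, 0.61, 0.62]
gain, p_value = paired_sign_flip_test(rlvr_pss, permix_pss)
```

With five seeds the smallest achievable two-sided p-value is 2/32 = 0.0625, so at least six seeds are needed before this test can fall below the conventional 0.05 threshold. That constraint is worth keeping in mind when deciding how many additional runs to report.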
Circularity Check
No significant circularity; empirical claims rest on benchmarks
The paper proposes PerMix-RLVR as an empirical training strategy to mitigate an observed robustness-fidelity trade-off in RLVR for persona-controlled LLMs. Central claims consist of measured improvements on external benchmarks (PSS on MATH500, fidelity on PersonaGym) rather than any derivation, equation, or first-principles prediction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the argument is self-contained via direct experimental comparison without circular reduction of outputs to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RLVR systematically reduces sensitivity to persona prompts while improving task robustness on verifiable goals.