VSPO: Vector-Steered Policy Optimization for Behavioral Control
Pith reviewed 2026-05-20 19:35 UTC · model grok-4.3
The pith
VSPO samples GRPO rollouts at varying steering-vector intensities and, under a bandit abstraction with sufficiently aligned steering distributions, provably improves iteration complexity over reward-shaped GRPO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior.
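The sampling pattern in this claim can be sketched in a toy bandit setting. This is only an illustration under assumptions: the softmax policy, the additive logit steering `logits + alpha * steer_dir`, the intensity grid, and the helper name `sample_steered_group` are hypothetical and not taken from the paper, and the policy update itself is omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_steered_group(logits, steer_dir, alphas, rewards, rng):
    """Draw one arm per steering intensity and compute group-relative advantages.

    Assumption: steering is modeled as adding alpha * steer_dir to the logits.
    """
    arms = np.array([
        rng.choice(len(logits), p=softmax(logits + a * steer_dir))
        for a in alphas
    ])
    r = rewards[arms]
    # GRPO-style normalization within the group; eps guards the case
    # where every rollout in the group receives the same reward.
    adv = (r - r.mean()) / (r.std() + 1e-8)
    return arms, adv

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0, -2.0])     # arm 3 is rare under the base policy
steer_dir = np.array([0.0, 0.0, 0.0, 1.0])   # hypothetical direction toward arm 3
rewards = np.array([0.0, 0.0, 0.0, 1.0])     # only the rare arm is rewarded
alphas = [0.0, 1.0, 2.0, 4.0]                # hypothetical intensity grid
arms, adv = sample_steered_group(logits, steer_dir, alphas, rewards, rng)
```

Here arm 3 stands in for a rare target behavior: it is seldom drawn at `alpha = 0`, and larger intensities raise its probability, which makes it likelier that the group contains both rewarded and unrewarded rollouts.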
What carries the argument
Steering vector associated with the target behavior, used to control intensity during rollout sampling and enable on-policy self-distillation.
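As a rough picture of what an intensity knob can look like, here is a minimal sketch of activation addition, where a fixed direction is added to hidden states with a scalar coefficient. The shapes, the `steer` helper, the unit-norm convention, and the intensity values are assumptions for illustration; this summary does not specify the layer, the vector extraction method, or the intensity schedule the paper uses.

```python
import numpy as np

def steer(hidden, vector, alpha):
    """Add a scaled steering direction to every position's hidden state.

    hidden: (seq_len, d_model) array; vector: (d_model,) array.
    """
    return hidden + alpha * vector

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))      # toy hidden states, not real activations
vector = rng.normal(size=8)
vector /= np.linalg.norm(vector)      # unit-norm direction (assumed convention)

# Hypothetical intensities, e.g. one per rollout in a group.
alphas = [0.0, 0.5, 1.0, 2.0]

# Mean projection of the steered states onto the steering direction.
projections = [float((steer(hidden, vector, a) @ vector).mean()) for a in alphas]
```

Because the direction has unit norm, each projection exceeds the unsteered one by exactly `alpha`, so the coefficient acts as a direct dial on how far the states move along the behavior direction.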
If this is right
- VSPO improves control over target behaviors such as explanation expertise, confidence expression, robustness to misleading context, and response verbosity while maintaining or improving accuracy on MATH and MMLU-Pro.
- VSPO achieves better iteration complexity than reward-shaped GRPO when steering distributions align with the target.
- VSPO outperforms reward shaping, teacher-trace distillation, and guidance-based baselines on behavioral control.
- Varying steering intensities enriches rollout diversity and thereby addresses the sparse behavioral reward bottleneck.
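The sparse-reward bottleneck behind these points can be made concrete with a small calculation. With a binary behavioral reward, a group whose rollouts all receive the same reward yields zero group-relative advantages, so only mixed groups carry a learning signal. The sketch assumes independent rollouts with a common success probability `p`; the probabilities and group size are made-up illustrations, not figures from the paper.

```python
def prob_informative_group(p, group_size):
    """Probability that a group of i.i.d. binary-reward rollouts is mixed.

    A mixed group has at least one success and at least one failure,
    so its group-relative advantages are not all zero.
    """
    all_success = p ** group_size
    all_failure = (1.0 - p) ** group_size
    return 1.0 - all_success - all_failure

rare = prob_informative_group(0.01, 8)      # behavior seen in 1% of rollouts
upsampled = prob_informative_group(0.30, 8) # hypothetical rate after steering
```

Under these assumptions, a behavior that appears in 1% of rollouts produces a mixed group of 8 only about 7.7% of the time, whereas a 30% rate produces one about 94% of the time.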
Where Pith is reading between the lines
- The self-distillation view suggests VSPO could be adapted to other on-policy algorithms beyond GRPO.
- Practical deployment would benefit from diagnostics that check alignment between steering-induced and target distributions.
- The approach may generalize to multi-objective settings where several behaviors must be controlled simultaneously.
Load-bearing premise
The steering-induced distributions must be sufficiently aligned with the target behavior for the iteration-complexity improvement to hold.
What would settle it
A direct comparison in the bandit abstraction that shows VSPO has equal or worse iteration complexity once the steering distributions are misaligned with the target behavior.
Original abstract
Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Vector-Steered Policy Optimization (VSPO), which modifies GRPO by sampling rollouts at varying steering intensities derived from behavior-specific vectors. This is framed as on-policy latent self-distillation that upsamples rare behaviors to alleviate sparse-reward bottlenecks in multi-objective LLM alignment. Under a bandit abstraction, the authors claim VSPO provably attains better iteration complexity than reward-shaped GRPO whenever the steering-induced distributions are sufficiently aligned with the target behavior. Experiments on MATH and MMLU-Pro for four behaviors (explanation expertise, confidence expression, robustness to misleading context, verbosity) report improved behavioral control while preserving or increasing task accuracy relative to reward shaping, teacher-trace distillation, and guidance baselines.
Significance. If the alignment condition can be made quantitative and verified, the bandit analysis would supply a concrete complexity advantage for steering-based upsampling over standard reward shaping, addressing a common practical bottleneck in behavioral control of LLMs. The empirical section already demonstrates consistent gains across two reasoning benchmarks and four distinct behaviors against multiple baselines, which would constitute a useful practical contribution even without the theoretical acceleration result.
major comments (2)
- [Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.
- [Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.
minor comments (2)
- [Method] Notation for steering intensity and the resulting policy is introduced without a compact summary table relating the symbols to the GRPO update rule.
- [Experiments] The experimental tables would benefit from reporting standard deviations over multiple random seeds and from an explicit statement of the number of rollouts per prompt.
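The seed-level reporting requested in the second minor comment amounts to a mean and sample standard deviation per table cell. A minimal sketch follows; the `summarize` helper and the three scores are placeholders, not results from the paper.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder accuracies from three hypothetical seeds.
per_seed_accuracy = [0.50, 0.52, 0.54]
mean_acc, std_acc = summarize(per_seed_accuracy)
cell = f"{mean_acc:.3f} ± {std_acc:.3f}"
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when the seeds are treated as a sample of possible runs.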
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to clarify the theoretical claims and better connect them to the experimental results.
Point-by-point responses
- Referee: [Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.
  Authors: We agree that the alignment condition requires an explicit quantitative definition to rigorously support the provable iteration-complexity advantage. In the revised manuscript, we will introduce a concrete metric based on total variation distance between the steering-induced rollout distribution and the target behavior distribution. We will derive an explicit threshold on this metric (in terms of the behavior reward gap) under which the complexity bound improves over reward-shaped GRPO. For the four behaviors, we will add proxy calculations using observed behavior frequencies and intensity scores from the steered rollouts on MATH and MMLU-Pro to verify that the condition holds in the reported experiments.
  Revision: yes
- Referee: [Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.
  Authors: We acknowledge the value of a tighter link between the bandit analysis and the LLM experiments. In the revision, we will insert a dedicated discussion subsection that maps the bandit assumptions (e.g., alignment of steered distributions) to the experimental setup by reporting empirical proxies such as the increase in target-behavior token probabilities under varying steering intensities. This will show consistency with the upsampling mechanism analyzed in the bandit model. While the current results demonstrate improved behavioral control and maintained accuracy, which align with the predicted benefits, a direct empirical check of iteration complexity would require fitting and simulating the bandit with parameters extracted from the LLM runs; we will note this as a limitation and direction for future work.
  Revision: partial
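The total-variation metric named in the first response is straightforward to compute for discrete distributions. The sketch below shows what such an alignment diagnostic might look like. The two-outcome distributions (behavior absent, behavior present) are invented for illustration, and no threshold is applied because the review does not state one.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

# Hypothetical distributions over (behavior absent, behavior present).
target = [0.10, 0.90]    # what the target behavior would look like
base = [0.95, 0.05]      # unsteered policy rarely shows the behavior
steered = [0.30, 0.70]   # policy under some steering intensity

tv_base = total_variation(base, target)
tv_steered = total_variation(steered, target)
```

In this made-up case the steered distribution sits much closer to the target (0.20) than the base policy does (0.85). Whether a gap of that size would satisfy the paper's alignment condition depends on the threshold the authors have yet to derive.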
Circularity Check
No significant circularity in VSPO derivation chain.
The paper's central theoretical claim of improved iteration complexity for VSPO versus reward-shaped GRPO is developed inside an explicit bandit abstraction that is presented separately from the main LLM experiments. The result is conditioned on the external assumption that steering-induced distributions are sufficiently aligned with target behaviors, but this assumption does not make the bound tautological or reduce it to a fitted parameter by construction. No load-bearing step relies on self-citation chains, ansatz smuggling, or renaming of known results; the bandit analysis supplies independent content that is not forced by the empirical inputs or definitions used elsewhere in the paper.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "VSPO can be viewed as a form of distribution shaping during sampling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.