VSPO: Vector-Steered Policy Optimization for Behavioral Control

Jiasi Chen; Kai Yang; Samet Oymak; Weijia Zhang; Xuechen Zhang; Zijian Huang

arxiv: 2605.15604 · v1 · pith:DPHWVMUPnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

Pith reviewed 2026-05-20 19:35 UTC · model grok-4.3

classification cs.LG · cs.CL

keywords vector-steered policy optimization · steering vectors · behavioral control · policy optimization · language models · sparse rewards · reinforcement learning

The pith

VSPO samples GRPO rollouts at varying steering-vector intensities and, under a bandit abstraction with well-aligned steering distributions, provably needs fewer iterations than reward-shaped GRPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models often need to optimize accuracy while also producing specific behaviors like expertise or verbosity, but these appear too rarely to provide useful reward signals. The paper introduces Vector-Steered Policy Optimization, which adds a steering vector to GRPO so that rollouts are generated at multiple behavior intensities. This sampling strategy is presented as on-policy latent self-distillation that lets the model internalize the vector. By upsampling rare behaviors and increasing diversity, VSPO reduces the sparse-reward bottleneck. Theory under a bandit abstraction shows improved iteration complexity compared with reward-shaped GRPO whenever the steering distributions stay aligned with the target behavior, and experiments on MATH and MMLU-Pro confirm stronger behavioral control without loss of accuracy.

Core claim

VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior.

What carries the argument

A steering vector associated with the target behavior, which sets the behavior intensity of each sampled rollout and enables on-policy latent self-distillation.
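To make the mechanism concrete, here is a minimal NumPy sketch of one common way to build and apply a steering vector. It assumes a difference-of-means construction over contrastive activations and a plain additive intervention `hidden + alpha * v`; the paper only states that activation differences are used, so the normalization, the synthetic data, and the names `build_steering_vector`, `steer`, and `intensities` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def build_steering_vector(pos_acts, neg_acts):
    """Unit-norm difference of mean activations (assumed construction)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, v, alpha):
    """Add alpha * v to every hidden state; (n, d) + (d,) broadcasts row-wise."""
    return hidden + alpha * v

# Synthetic stand-ins for activations of contrastive positive/negative traces.
rng = np.random.default_rng(0)
d = 8
behavior_axis = np.zeros(d)
behavior_axis[0] = 1.0
pos_acts = rng.normal(size=(32, d)) + 2.0 * behavior_axis
neg_acts = rng.normal(size=(32, d)) - 2.0 * behavior_axis
v = build_steering_vector(pos_acts, neg_acts)

# Hypothetical intensity grid: one steering strength per rollout.
hidden = rng.normal(size=(4, d))
intensities = [0.0, 1.0, 2.0, 4.0]
projections = [float((steer(hidden, v, a) @ v).mean()) for a in intensities]
print(projections)  # mean projection onto v grows linearly with alpha
```

Because `v` has unit norm, each unit of `alpha` shifts the projection onto `v` by exactly one, which is what makes the intensity a usable control knob in this toy setting.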

If this is right

VSPO improves control over target behaviors such as explanation expertise, confidence expression, robustness to misleading context, and response verbosity while maintaining or improving accuracy on MATH and MMLU-Pro.
VSPO achieves better iteration complexity than reward-shaped GRPO when steering distributions align with the target.
VSPO outperforms reward shaping, teacher-trace distillation, and guidance-based baselines on behavioral control.
Varying steering intensities enriches rollout diversity and thereby addresses the sparse behavioral reward bottleneck.
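The sparse-reward bottleneck behind these claims can be illustrated with a small sketch. It uses the standard group-normalized advantage from GRPO (reward minus group mean, divided by group standard deviation) and a simplified model in which every rollout in a group shows the behavior independently with the same probability `p`. The rates 0.01 and 0.30, the group size 8, the `eps` term, and the helper names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def prob_mixed_group(p, group_size):
    """P(group contains both behavior-positive and behavior-negative rollouts)."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size

# A group where no rollout shows the behavior carries no learning signal.
flat = group_advantages([0, 0, 0, 0, 0, 0, 0, 0])
# A group with a few positive rollouts yields non-zero advantages.
mixed = group_advantages([0, 0, 1, 0, 0, 1, 0, 0])

rare = prob_mixed_group(0.01, 8)     # hypothetical unsteered behavior rate
steered = prob_mixed_group(0.30, 8)  # hypothetical rate after upsampling
print(flat, mixed, rare, steered)
```

In this toy model a group is informative only when it mixes both outcomes, so raising the behavior rate from a rare value toward the middle of the range sharply increases the fraction of groups that produce a gradient. VSPO's actual groups mix several intensities, which this single-rate model does not capture.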

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The self-distillation view suggests VSPO could be adapted to other on-policy algorithms beyond GRPO.
Practical deployment would benefit from diagnostics that check alignment between steering-induced and target distributions.
The approach may generalize to multi-objective settings where several behaviors must be controlled simultaneously.

Load-bearing premise

The steering-induced distributions must be sufficiently aligned with the target behavior for the iteration-complexity improvement to hold.

What would settle it

A direct comparison in the bandit abstraction that shows VSPO has equal or worse iteration complexity once the steering distributions are misaligned with the target behavior.

Figures

Figures reproduced from arXiv: 2605.15604 by Jiasi Chen, Kai Yang, Samet Oymak, Weijia Zhang, Xuechen Zhang, Zijian Huang.

**Figure 1.** Motivation and overview of VSPO. The goal is to improve task accuracy while inducing a desired reasoning behavior. The figure illustrates VSPO and alternative methods for the goal of improving both task accuracy and a target behavior. Check marks and crosses indicate whether each generated trace is correct or incorrect. The vertical blue arrow denotes the target-behavior direction: traces higher along the …

**Figure 2.** Overview of VSPO algorithm. In Stage 1, prompts are sampled from the current policy, a teacher rewrites the responses into contrastive positive and negative directions, and their activation differences are used to construct a steering vector. In Stage 2, the current policy generates on-policy rollouts under different steering intensities, receives rewards summing task correctness and steering-dependent pre…

**Figure 3.** Results on MMLU-Pro for confident and cautious target behaviors. Higher confidence-expression scores indicate a more confident style; thus, upper-right is preferred for confident control, while upper-left is preferred for cautious control. Results on Preference-Aligned Reasoning Behavior. We evaluate two preference-aligned reasoning characteristics: expertise level and confidence expression. As shown in …

**Figure 4.** Results on robustness to misleading context. Left: task accuracy. Middle: agreement rate …

**Figure 5.** Accuracy-length trade-off for concise reasoning target behavior, on MATH and MMLU-Pro.

**Figure 6.** On-policy NLL by trace source. We report the mean token-level negative log-likelihood of …

**Figure 7.** Training dynamics and behavior-space coverage induced by vector steering. Left: expertise …

**Figure 8.** Layer selection for steering-vector construction. For each target behavior, we evaluate …

**Figure 9.** Effect of the number of contrastive pairs on steering-vector quality. The expertise score …

Original abstract

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vector-Steered Policy Optimization (VSPO), which modifies GRPO by sampling rollouts at varying steering intensities derived from behavior-specific vectors. This is framed as on-policy latent self-distillation that upsamples rare behaviors to alleviate sparse-reward bottlenecks in multi-objective LLM alignment. Under a bandit abstraction, the authors claim VSPO provably attains better iteration complexity than reward-shaped GRPO whenever the steering-induced distributions are sufficiently aligned with the target behavior. Experiments on MATH and MMLU-Pro for four behaviors (explanation expertise, confidence expression, robustness to misleading context, verbosity) report improved behavioral control while preserving or increasing task accuracy relative to reward shaping, teacher-trace distillation, and guidance baselines.

Significance. If the alignment condition can be made quantitative and verified, the bandit analysis would supply a concrete complexity advantage for steering-based upsampling over standard reward shaping, addressing a common practical bottleneck in behavioral control of LLMs. The empirical section already demonstrates consistent gains across two reasoning benchmarks and four distinct behaviors against multiple baselines, which would constitute a useful practical contribution even without the theoretical acceleration result.

major comments (2)

[Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.
[Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.

minor comments (2)

[Method] Notation for steering intensity and the resulting policy is introduced without a compact summary table relating the symbols to the GRPO update rule.
[Experiments] The experimental tables would benefit from reporting standard deviations over multiple random seeds and from an explicit statement of the number of rollouts per prompt.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to clarify the theoretical claims and better connect them to the experimental results.

Referee: [Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.

Authors: We agree that the alignment condition requires an explicit quantitative definition to rigorously support the provable iteration-complexity advantage. In the revised manuscript, we will introduce a concrete metric based on total variation distance between the steering-induced rollout distribution and the target behavior distribution. We will derive an explicit threshold on this metric (in terms of the behavior reward gap) under which the complexity bound improves over reward-shaped GRPO. For the four behaviors, we will add proxy calculations using observed behavior frequencies and intensity scores from the steered rollouts on MATH and MMLU-Pro to verify that the condition holds in the reported experiments. revision: yes
Referee: [Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.

Authors: We acknowledge the value of a tighter link between the bandit analysis and the LLM experiments. In the revision, we will insert a dedicated discussion subsection that maps the bandit assumptions (e.g., alignment of steered distributions) to the experimental setup by reporting empirical proxies such as the increase in target-behavior token probabilities under varying steering intensities. This will show consistency with the upsampling mechanism analyzed in the bandit model. While the current results demonstrate improved behavioral control and maintained accuracy, which align with the predicted benefits, a direct empirical check of iteration complexity would require fitting and simulating the bandit with parameters extracted from the LLM runs; we will note this as a limitation and direction for future work. revision: partial
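The total variation check proposed in the first response could be prototyped along the following lines. This is a sketch under stated assumptions: behavior scores are binned into histograms, the counts shown are made-up placeholders rather than measurements from the paper, and no threshold is applied because the manuscript does not yet specify one. The function name `total_variation` and the five-bin layout are hypothetical.

```python
import numpy as np

def total_variation(p_counts, q_counts):
    """Total variation distance between two histograms over the same bins."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    if p.sum() <= 0 or q.sum() <= 0:
        raise ValueError("each histogram needs at least one sample")
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())

# Placeholder counts per behavior-score bin (low to high); not real data.
target = [0, 0, 1, 3, 6]     # desired behavior distribution
steered = [0, 1, 2, 3, 4]    # rollouts sampled with steering
unsteered = [6, 3, 1, 0, 0]  # rollouts sampled without steering

tv_steered = total_variation(steered, target)
tv_unsteered = total_variation(unsteered, target)
print(tv_steered, tv_unsteered)
```

A smaller distance for the steered histogram than for the unsteered one would be consistent with the alignment premise, but it would not by itself establish the threshold the referee asks for.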

Circularity Check

0 steps flagged

No significant circularity in VSPO derivation chain.

The paper's central theoretical claim of improved iteration complexity for VSPO versus reward-shaped GRPO is developed inside an explicit bandit abstraction that is presented separately from the main LLM experiments. The result is conditioned on the external assumption that steering-induced distributions are sufficiently aligned with target behaviors, but this assumption does not make the bound tautological or reduce it to a fitted parameter by construction. No load-bearing step relies on self-citation chains, ansatz smuggling, or renaming of known results; the bandit analysis supplies independent content that is not forced by the empirical inputs or definitions used elsewhere in the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method implicitly relies on the existence of effective steering vectors for target behaviors and the utility of intensity variation for diversity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "VSPO can be viewed as a form of distribution shaping during sampling"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

L1: Controlling how long a reasoning model thinks with reinforcement learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. InSecond Conference on Language Modeling

work page

[2] Anthropic. Claude Sonnet 4.6 system card. https://anthropic.com/claude-sonnet-4-6-system-card, 2026.

[3] Seyedarmin Azizi, Erfan Baghaei Potraghloo, Souvik Kundu, and Massoud Pedram. Activation steering for chain-of-thought compression. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025.

[4] Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. Understanding (un)reliability of steering vectors in language models. In ICLR 2025 Workshop on Foundation Models in the Wild.

[5] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

[6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[7] Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online AI feedback. arXiv preprint arXiv:2402.04792, 2024.

[8] Michael Hassid, Gabriel Synnaeve, Yossi Adi, and Roy Schwartz. Don’t overthink it. Preferring shorter thinking chains for improved LLM reasoning. arXiv preprint arXiv:2505.17813, 2025.

[9] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

[10] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.

[11] Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, and Samet Oymak. Learning to correct: Calibrated reinforcement learning for multi-attempt chain-of-thought. International Conference on Machine Learning, 2026.

[12] Shawn Im and Sharon Li. A unified understanding and evaluation of steering methods. arXiv preprint arXiv:2502.02716, 2025.

[13] Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3oT: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025.

[14] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. In Forty-second International Conference on Machine Learning.

[15] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In International Conference on Machine Learning, pages 26874–26901. PMLR, 2024.

[16] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025.

[17] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[18] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.

[19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[20] Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, and Thomas Kollar. A critical evaluation of AI feedback for aligning large language models. Advances in Neural Information Processing Systems, 37:29166–29190, 2024.

[21] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026.

[22] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189, 2022.

[23] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.

[24] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

[25] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024.

[26] Nuoya Xiong and Aarti Singh. Projection optimization: A general framework for multi-objective and multi-group RLHF. In Forty-second International Conference on Machine Learning.

[27] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.

[28] Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment. In International Conference on Machine Learning, pages 56276–56297. PMLR, 2024.

[29] Zhe Yang, Yudong Wang, Rang Li, and Zhifang Sui. Towards better RL training data utilization via second-order rollout. arXiv preprint arXiv:2602.22765, 2026.

[30] Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, Wangjie You, Jie Tang, Qingsong Liu, Yuhang Guo, and Yangyang Kang. Incorporating self-rewriting into large language model reasoning reinforcement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34405–34413, 2026.

[31] Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, and Samet Oymak. BREAD: Branched rollouts from expert anchors bridge SFT & RL for reasoning. Advances in Neural Information Processing Systems, 38:96726–96752, 2026.

[32] Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, and Samet Oymak. Making small language models efficient reasoners: Intervention, supervision, reinforcement. arXiv preprint arXiv:2505.07961, 2025.

[33] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

[34] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.

Appendix

The appendix is organized as follows:
• Appendix A describes our usage of LLMs.
• Appendix...

[35] Final answer

Output requirements:
- Be concrete and slow.
- Prefer examples and intuitive explanation.
- Use minimal notation.
- End with a very short final answer sentence.

Most important rule: Write so that a beginner with very low math background can follow every line.

Task requirements:
- Solve the problem directly from the prompt.
- Follow the exp...

[36] Language complexity
- 0: very simple, conversational, beginner-friendly wording
- 100: precise, technical, formal language

[37] Step granularity
- 0: every tiny step is explicitly explained
- 100: routine steps are omitted and compressed

[38] Use of notation and domain vocabulary
- 0: minimal notation and technical terms
- 100: standard notation and field-specific terminology used freely

[39] Abstraction level
- 0: concrete, intuitive, example-driven explanation
- 100: concise, abstract, expert-facing reasoning

[40] Pedagogical tone
- 0: teaching-oriented, repetitive, hand-holding
- 100: compact, assumes background knowledge, no hand-holding

Problem:
[PROBLEM START] {problem} [PROBLEM END]

Model response:
[RESPONSE START] {generated_solution} [RESPONSE END]

Output format:
Score: <integer from 0 to 100>
Rationale: <one or two sentences explaining the style signals>
...

[41] Appetitive phase: This is the motivated, goal-directed part of the behavior. It reflects the initial desire or drive toward the reward.

[42] Consummatory phase: This is the actual execution of the behavior. It is the action that leads to reward or satisfaction.

[43] Quiescence phase: This is the resting or post-behavior phase, after the behavior is complete.

Now consider the options:
- Option A: Appetitive behavior, exploratory behavior, quiescence. Incorrect, because exploratory behavior is not part of the standard three-phase model.
- Option B: Termination, appetitive behavior, exploratory behavior. Incorrect, b...

[44] Add 0.23 to all values of the x-variable: This is a shift of the x-values.
- Adding a constant to all values of a variable does not affect the correlation.
- Reason: Correlation is based on the relationship between the variables, not their absolute values. Adding a constant to one variable does not change the pattern of the relationship.

[45] Double every value of the y-variable: This is a scaling of the y-values.
- Scaling a variable by a positive constant, here 2, also does not affect the correlation.
- Reason: Correlation is scale-invariant. Multiplying a variable by a positive constant does not change the correlation.

[46] Interchange the two variables: This swaps the roles of x and y.
- Correlation is symmetric in its variables. That is, Corr(x, y) = Corr(y, x).
- So, swapping the variables does not change the correlation.

Conclusion: All three transformations, adding a constant to x, scaling y, and swapping variables, do not change the correlation. Therefore, the new correla...