
arxiv: 2605.15604 · v1 · pith:DPHWVMUPnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

VSPO: Vector-Steered Policy Optimization for Behavioral Control

Pith reviewed 2026-05-20 19:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords vector-steered policy optimization · steering vectors · behavioral control · policy optimization · language models · sparse rewards · reinforcement learning

The pith

VSPO uses a steering vector to sample rollouts at varying behavior intensities, which provably accelerates policy optimization relative to reward-shaped GRPO under a bandit abstraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models often need to optimize accuracy while also producing specific behaviors like expertise or verbosity, but these appear too rarely to provide useful reward signals. The paper introduces Vector-Steered Policy Optimization, which adds a steering vector to GRPO so that rollouts are generated at multiple behavior intensities. This sampling strategy is presented as on-policy latent self-distillation that lets the model internalize the vector. By upsampling rare behaviors and increasing diversity, VSPO reduces the sparse-reward bottleneck. Theory under a bandit abstraction shows improved iteration complexity compared with reward-shaped GRPO whenever the steering distributions stay aligned with the target behavior, and experiments on MATH and MMLU-Pro confirm stronger behavioral control without loss of accuracy.

Core claim

VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior.
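
To make the sampling idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical rollout model in which the behavior score equals the steering intensity plus small Gaussian noise, and it uses that score directly as the reward; the names `sample_behavior_score` and `rollout_group` and the intensity grid are illustrative assumptions. The advantage computation follows the usual GRPO-style group normalization (center by the group mean, scale by the group standard deviation).

```python
import random
import statistics


def sample_behavior_score(alpha, rng):
    """Toy stand-in for one steered rollout (hypothetical model, not from the paper).

    The behavior score is the steering intensity plus Gaussian noise,
    clipped to the interval [0, 1].
    """
    return min(1.0, max(0.0, alpha + rng.gauss(0.0, 0.05)))


def rollout_group(alphas, rng):
    """Sample one rollout per steering intensity in `alphas`."""
    return [sample_behavior_score(a, rng) for a in alphas]


def group_relative_advantages(rewards):
    """GRPO-style advantages: center by group mean, scale by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Identical rewards in a group carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
unsteered = rollout_group([0.0, 0.0, 0.0, 0.0], rng)  # all rollouts at intensity 0
steered = rollout_group([0.0, 0.3, 0.6, 0.9], rng)    # varying intensities

print("unsteered reward spread:", statistics.pstdev(unsteered))
print("steered reward spread:  ", statistics.pstdev(steered))
print("steered advantages:     ", group_relative_advantages(steered))
```

In this toy setting the steered group spans a much wider range of behavior scores, so the normalized advantages separate high-behavior rollouts from low-behavior ones; a group whose rewards are all identical yields zero advantages everywhere.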

What carries the argument

Steering vector associated with the target behavior, used to control intensity during rollout sampling and enable on-policy self-distillation.
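
As a rough illustration of what such a vector can look like, the sketch below builds one as the difference between mean activations of positive and negative examples and adds it to a hidden state with a scalar intensity. The mean-difference construction is an assumption borrowed from common activation-steering practice; the paper's exact recipe, layer choice, and scaling may differ, and the activations here are made-up numbers.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations (assumed construction, see lead-in)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, vector, alpha):
    """Add the steering vector to a hidden state, scaled by intensity alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Made-up activations for two positive and two negative contrastive examples.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

v = steering_vector(pos, neg)
print(v)                                     # [2.0, 0.0, -1.0]
print(steer([0.5, 0.5, 0.5], v, alpha=0.5))  # [1.5, 0.5, 0.0]
```

Setting `alpha` to zero recovers the unsteered hidden state, and larger values push further along the behavior direction, which is the knob that rollout sampling varies.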

If this is right

  • VSPO improves control over target behaviors such as explanation expertise, confidence expression, robustness to misleading context, and response verbosity while maintaining or improving accuracy on MATH and MMLU-Pro.
  • VSPO achieves better iteration complexity than reward-shaped GRPO when steering distributions align with the target.
  • VSPO outperforms reward shaping, teacher-trace distillation, and guidance-based baselines on behavioral control.
  • Varying steering intensities enriches rollout diversity and thereby addresses the sparse behavioral reward bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-distillation view suggests VSPO could be adapted to other on-policy algorithms beyond GRPO.
  • Practical deployment would benefit from diagnostics that check alignment between steering-induced and target distributions.
  • The approach may generalize to multi-objective settings where several behaviors must be controlled simultaneously.

Load-bearing premise

The steering-induced distributions must be sufficiently aligned with the target behavior for the iteration-complexity improvement to hold.
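
A back-of-the-envelope calculation shows why alignment matters. The sketch below is not the paper's bound; it assumes a binary behavior reward and independent rollouts, and the per-rollout success probabilities are invented for illustration. With group-normalized advantages, a group whose rewards are all 0 or all 1 gives no learning signal, so the probability of an informative group is one minus the probability of those two outcomes.

```python
import math


def informative_group_probability(success_probs):
    """Probability that a group contains both a success and a failure.

    Assumes independent rollouts with binary rewards; `success_probs[i]`
    is the chance that rollout i exhibits the target behavior.
    """
    all_fail = math.prod(1.0 - p for p in success_probs)
    all_succeed = math.prod(success_probs)
    return 1.0 - all_fail - all_succeed


def expected_groups_until_signal(success_probs):
    """Mean number of groups sampled before the first informative one."""
    return 1.0 / informative_group_probability(success_probs)


# Hypothetical numbers: the base policy shows the behavior 1% of the time.
base = [0.01] * 8
# Aligned steering: higher intensities raise the success probability.
aligned = [0.01, 0.01, 0.1, 0.1, 0.3, 0.3, 0.6, 0.6]
# Misaligned steering: intensities change the rollouts but not the behavior rate.
misaligned = [0.01] * 8

for name, probs in [("base", base), ("aligned", aligned), ("misaligned", misaligned)]:
    print(name, round(expected_groups_until_signal(probs), 2))
```

Under these assumed numbers, aligned steering makes almost every group informative, while misaligned steering leaves the expected wait unchanged from the base policy, which is the qualitative content of the premise.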

What would settle it

A direct comparison in the bandit abstraction that shows VSPO has equal or worse iteration complexity once the steering distributions are misaligned with the target behavior.

Figures

Figures reproduced from arXiv: 2605.15604 by Jiasi Chen, Kai Yang, Samet Oymak, Weijia Zhang, Xuechen Zhang, Zijian Huang.

Figure 1: Motivation and overview of VSPO. The goal is to improve task accuracy while inducing a desired reasoning behavior. The figure illustrates VSPO and alternative methods for the goal of improving both task accuracy and a target behavior. Check marks and crosses indicate whether each generated trace is correct or incorrect. The vertical blue arrow denotes the target-behavior direction: traces higher along the …
Figure 2: Overview of VSPO algorithm. In Stage 1, prompts are sampled from the current policy, a teacher rewrites the responses into contrastive positive and negative directions, and their activation differences are used to construct a steering vector. In Stage 2, the current policy generates on-policy rollouts under different steering intensities, receives rewards summing task correctness and steering-dependent pre…
Figure 3: Results on MMLU-Pro for confident and cautious target behaviors. Higher confidence-expression scores indicate a more confident style; thus, upper-right is preferred for confident control, while upper-left is preferred for cautious control. Results on Preference-Aligned Reasoning Behavior. We evaluate two preference-aligned reasoning characteristics: expertise level and confidence expression. As shown in …
Figure 4: Results on robustness to misleading context. Left: task accuracy. Middle: agreement rate …
Figure 5: Accuracy-length trade-off for concise reasoning target behavior, on MATH and MMLU-Pro …
Figure 6: On-policy NLL by trace source. We report the mean token-level negative log-likelihood of …
Figure 7: Training dynamics and behavior-space coverage induced by vector steering. Left: expertise …
Figure 8: Layer selection for steering-vector construction. For each target behavior, we evaluate …
Figure 9: Effect of the number of contrastive pairs on steering-vector quality. The expertise score …
Original abstract

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in their responses. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control along the target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vector-Steered Policy Optimization (VSPO), which modifies GRPO by sampling rollouts at varying steering intensities derived from behavior-specific vectors. This is framed as on-policy latent self-distillation that upsamples rare behaviors to alleviate sparse-reward bottlenecks in multi-objective LLM alignment. Under a bandit abstraction, the authors claim VSPO provably attains better iteration complexity than reward-shaped GRPO whenever the steering-induced distributions are sufficiently aligned with the target behavior. Experiments on MATH and MMLU-Pro for four behaviors (explanation expertise, confidence expression, robustness to misleading context, verbosity) report improved behavioral control while preserving or increasing task accuracy relative to reward shaping, teacher-trace distillation, and guidance baselines.

Significance. If the alignment condition can be made quantitative and verified, the bandit analysis would supply a concrete complexity advantage for steering-based upsampling over standard reward shaping, addressing a common practical bottleneck in behavioral control of LLMs. The empirical section already demonstrates consistent gains across two reasoning benchmarks and four distinct behaviors against multiple baselines, which would constitute a useful practical contribution even without the theoretical acceleration result.

major comments (2)
  1. [Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.
  2. [Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.
minor comments (2)
  1. [Method] Notation for steering intensity and the resulting policy is introduced without a compact summary table relating the symbols to the GRPO update rule.
  2. [Experiments] The experimental tables would benefit from reporting standard deviations over multiple random seeds and from an explicit statement of the number of rollouts per prompt.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to clarify the theoretical claims and better connect them to the experimental results.

Point-by-point responses
  1. Referee: [Theoretical analysis (bandit abstraction)] Bandit abstraction analysis: the iteration-complexity claim is conditioned on steering-induced distributions being 'sufficiently aligned' with the target behavior, yet the manuscript supplies neither an explicit metric (KL, total variation, or expectation gap) nor a numerical threshold, nor any calculation confirming that the four steering vectors satisfy the condition on MATH or MMLU-Pro. Without this, the 'provably' qualifier does not follow from the stated assumptions.

    Authors: We agree that the alignment condition requires an explicit quantitative definition to rigorously support the provable iteration-complexity advantage. In the revised manuscript, we will introduce a concrete metric based on total variation distance between the steering-induced rollout distribution and the target behavior distribution. We will derive an explicit threshold on this metric (in terms of the behavior reward gap) under which the complexity bound improves over reward-shaped GRPO. For the four behaviors, we will add proxy calculations using observed behavior frequencies and intensity scores from the steered rollouts on MATH and MMLU-Pro to verify that the condition holds in the reported experiments. revision: yes

  2. Referee: [Experiments section] The bandit abstraction and the LLM experiments remain disconnected: the complexity bound is derived in a separate abstraction whose assumptions are not checked against the actual steering vectors or rollout distributions used in the MATH/MMLU-Pro runs, so the experimental results do not corroborate the theoretical acceleration.

    Authors: We acknowledge the value of a tighter link between the bandit analysis and the LLM experiments. In the revision, we will insert a dedicated discussion subsection that maps the bandit assumptions (e.g., alignment of steered distributions) to the experimental setup by reporting empirical proxies such as the increase in target-behavior token probabilities under varying steering intensities. This will show consistency with the upsampling mechanism analyzed in the bandit model. While the current results demonstrate improved behavioral control and maintained accuracy, which align with the predicted benefits, a direct empirical check of iteration complexity would require fitting and simulating the bandit with parameters extracted from the LLM runs; we will note this as a limitation and direction for future work. revision: partial
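
The total variation metric proposed in the first response can be sketched for discrete distributions as follows. This is only an illustration of the metric itself: the binning of behavior scores into four buckets and all counts are hypothetical, and the threshold the authors say they will derive is not reproduced here.

```python
def normalize(counts):
    """Turn non-negative bucket counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical counts of rollouts per behavior-score bucket (low to high).
target = normalize([0, 1, 3, 6])
steered = normalize([1, 2, 3, 4])
base = normalize([8, 1, 1, 0])

print("TV(steered, target):", total_variation(steered, target))
print("TV(base, target):   ", total_variation(base, target))
```

With these made-up counts the steered rollouts sit much closer to the target distribution than the base rollouts do, which is the kind of proxy check the response describes.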

Circularity Check

0 steps flagged

No significant circularity in VSPO derivation chain.

Full rationale

The paper's central theoretical claim of improved iteration complexity for VSPO versus reward-shaped GRPO is developed inside an explicit bandit abstraction that is presented separately from the main LLM experiments. The result is conditioned on the external assumption that steering-induced distributions are sufficiently aligned with target behaviors, but this assumption does not make the bound tautological or reduce it to a fitted parameter by construction. No load-bearing step relies on self-citation chains, ansatz smuggling, or renaming of known results; the bandit analysis supplies independent content that is not forced by the empirical inputs or definitions used elsewhere in the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; the method implicitly relies on the existence of effective steering vectors for the target behaviors and on intensity variation being useful for rollout diversity.



