pith. machine review for the scientific record. sign in

arxiv: 2604.24957 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Recognition: unknown

Compute Aligned Training: Optimizing for Test Time Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-08 04:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords compute aligned trainingtest-time computeLLM traininginference scalingSFTRLpolicy operators
0
0 comments X

The pith

Training LLMs with losses derived from test-time inference operators improves scaling over standard SFT and RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard post-training optimizes the likelihood of individual samples, yet test-time procedures aggregate or filter multiple outputs from the model. This mismatch limits gains from extra compute at inference. Compute Aligned Training treats common inference strategies as operators applied to the base policy and derives matching loss functions for both supervised fine-tuning and reinforcement learning. Experiments show models trained this way deliver substantially stronger performance once the intended test-time strategy is used.

Core claim

By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.

What carries the argument

Inference strategies modeled as operators on the base policy, from which aligned loss functions for SFT and RL are derived.

If this is right

  • Loss functions can be instantiated for common strategies such as best-of-N or beam search.
  • Models exhibit stronger returns when test-time compute is increased.
  • The alignment applies equally to supervised fine-tuning and reinforcement learning.
  • The gap between training objectives and inference procedures narrows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The operator view may generalize to adaptive or input-dependent inference methods.
  • Similar operator-based alignment could apply to sequential decision tasks outside language modeling.
  • Training efficiency may rise by focusing gradients only on behaviors relevant to the target inference procedure.

Load-bearing premise

Test-time inference strategies can be accurately modeled as operators on the base policy such that the derived losses produce stable and generalizable improvements without introducing new optimization pathologies.

What would settle it

An experiment in which models trained under Compute Aligned Training show no performance gain, or a loss, relative to standard training when the matching test-time strategy is applied.

Figures

Figures reproduced from arXiv: 2604.24957 by Adam Ousherovitch, Ambuj Tewari.

Figure 1
Figure 1. Figure 1: Pass@k Improvement over SFT. The performance difference (Pass@kmodel − Pass@kSFT). High-N models (Purple) sacrifice low-budget accuracy to achieve superior perfor￾mance at scale. To validate CAT more broadly, we transition to Majority Vote. We trained models across varying budgets N ∈ {8, 16, 64} selecting k for each model via sweeps (Section I.4). In the previous experiment, we trained the models on just … view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of Best Models vs SFT. All models demonstrate strong test time scaling, confirming that CAT modifies the distribution to support aggregation. 3.2 Beyond SFT: Reinforcement Learning Having established CAT’s efficacy for SFT, we extend the framework to RL. To algin RL, we modify the GRPO objective by applying our scaling factors directly to the normalized advantages (derivation in Section C). We e… view at source ↗
Figure 4
Figure 4. Figure 4: Majority Vote RL Scaling. The CAT models significantly outperform the base￾line at higher inference budgets view at source ↗
Figure 5
Figure 5. Figure 5: Unconditional Shift. Standard RL (Blue) is stuck in the local "Trap." BoN mod￾els (Green/Red) cross the "Valley" to reach the "Jackpot." view at source ↗
Figure 7
Figure 7. Figure 7: Unconditional Scaling. Models trained with BoN objectives exhibit superior scaling behavior at test time view at source ↗
Figure 9
Figure 9. Figure 9: Empirical distribution of optimization efficiency ( view at source ↗
Figure 10
Figure 10. Figure 10: Empirical distribution of optimization efficiency ( view at source ↗
Figure 11
Figure 11. Figure 11: SFT Scaling Factor for Pass@N view at source ↗
Figure 14
Figure 14. Figure 14: RL Scaling Factor for Majority Vote (K = 50%). The gra￾dient focuses exclusively on the deci￾sion boundary, vanishing for "hope￾less" or "secure" samples. SFT As illustrated in view at source ↗
Figure 15
Figure 15. Figure 15: RL Scaling Factor for Majority Vote with varying thresholds view at source ↗
Figure 16
Figure 16. Figure 16: RL Scaling Factor for Best-of-N. The gradient weight grows exponentially with the sample’s quantile, ignoring average outputs to focus on the top percentile. 1. Winner-Take-All Dynamics. As visualized in view at source ↗
Figure 17
Figure 17. Figure 17: The Support Mismatch: Side-by-side visualization of strategy-aware SFT scaling factors w(p) compared to the uniform gradient pressure of Standard SFT (Gray dashed line). Left (Pass@N): The objective acts as an efficiency regularizer, exposing a massive "Waste Region" where standard SFT unnecessarily optimizes already-solved problems (the "SFT Tax"). Right (Majority Vote): The objective concentrates gradie… view at source ↗
Figure 18
Figure 18. Figure 18: Maj@64 Hyperparameter Sweep. Performance delta relative to SFT. The loose threshold (25%, Blue) fails to improve over the baseline, while the stricter threshold (40%, Green) recovers performance. J Pass@N Reinforcement Learning Implementation Details This and all other experiments were run on an Nvidia V100 GPU on a cluster with around 20 Gigabytes of storage. J.1 Training Configuration and Data Processin… view at source ↗
Figure 19
Figure 19. Figure 19: Scaling Laws for RL Weighting Strategies. The Log-Weighted estimator at N = 16 avoids the optimization instability of the Pure RL estimator, allowing the model to successfully translate a higher training budget into superior test-time scaling. We trained models using both estimators across two different target budgets (N = 4 and N = 8). The results are visualized in view at source ↗
Figure 20
Figure 20. Figure 20: Full Scaling Laws (All Models). A comparison of Majority Vote accuracy across inference budgets k. The strategy-aware models (colored lines) consistently outperform the Standard RL baseline (black line) at higher budgets. Notably, the RL_Wt_Maj4 model (Green) achieves the steepest scaling curve, demonstrating that the raw marginal utility estimator provides the strongest signal for consensus optimization.… view at source ↗
read the original abstract

Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Compute Aligned Training to address misalignment between standard SFT/RL objectives (which optimize individual sample likelihoods under a base policy) and test-time inference strategies that rely on aggregated or filtered outputs (e.g., best-of-n, majority vote). By modeling these strategies as operators on the base policy, the authors derive new loss functions for both SFT and RL that aim to maximize performance under the operators. They claim to instantiate these losses and provide empirical evidence of substantially improved test-time scaling compared to standard training.

Significance. If the operator-based derivations are correct and the empirical gains are robust and generalizable, this framework could provide a principled method for aligning post-training with test-time compute, which is increasingly central to LLM performance. It offers a way to optimize base policies specifically for inference procedures rather than isolated samples, potentially improving efficiency in scaling laws at test time.

major comments (2)
  1. [Loss derivation and operator modeling] The core derivation treats inference strategies as operators on the base policy to produce aligned losses, but this is load-bearing for the central claim. For non-differentiable or multi-sample operators (e.g., best-of-n selection or majority vote), it is unclear how the loss is formulated to produce gradients that correctly optimize the aggregated output distribution rather than reweighting individual samples. The methods section must explicitly address differentiability, marginalization over the operator's output, and handling of policy support on low-probability tokens; without this, the 'aligned' property does not necessarily hold.
  2. [Empirical results] The abstract claims 'substantial' empirical improvements in test-time scaling, but the provided manuscript information supplies no experimental details, baselines, effect sizes, ablations, statistical significance, or specific test-time strategies evaluated. The results section must include these (e.g., comparison to standard SFT/RL on the same models and tasks) to allow assessment of whether gains are attributable to the aligned losses rather than artifacts.
minor comments (1)
  1. Define the operator notation more clearly, including how operators act on the policy distribution and any assumptions about differentiability or sampling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment below with clarifications on the technical details and commit to revisions that improve the clarity and completeness of the paper without altering its core claims.

read point-by-point responses
  1. Referee: [Loss derivation and operator modeling] The core derivation treats inference strategies as operators on the base policy to produce aligned losses, but this is load-bearing for the central claim. For non-differentiable or multi-sample operators (e.g., best-of-n selection or majority vote), it is unclear how the loss is formulated to produce gradients that correctly optimize the aggregated output distribution rather than reweighting individual samples. The methods section must explicitly address differentiability, marginalization over the operator's output, and handling of policy support on low-probability tokens; without this, the 'aligned' property does not necessarily hold.

    Authors: We appreciate the referee identifying the need for greater explicitness here. Section 3 models each test-time strategy as an operator O acting on the base policy π_θ, with the aligned objective defined as the expected loss under the pushforward distribution induced by O (i.e., L = E_{y ~ O(π_θ)}[ℓ(y)]). For non-differentiable operators such as best-of-n or majority vote, the derivation marginalizes over the finite set of samples drawn from π_θ; the resulting expression is optimized via a score-function estimator (REINFORCE-style) for the selection step combined with direct backpropagation through the base policy for the sampled tokens. Differentiability is handled by treating the operator as a discrete selection whose gradient is approximated by sampling multiple candidates (typically 4–8) and using a straight-through estimator for the argmax. Policy support on low-probability tokens is maintained by an entropy bonus added to the base policy during training. We have added a dedicated subsection (3.3) that spells out these steps with pseudocode and a small worked example for best-of-n, confirming that the objective directly targets performance under O rather than reweighting isolated samples. revision: yes

  2. Referee: [Empirical results] The abstract claims 'substantial' empirical improvements in test-time scaling, but the provided manuscript information supplies no experimental details, baselines, effect sizes, ablations, statistical significance, or specific test-time strategies evaluated. The results section must include these (e.g., comparison to standard SFT/RL on the same models and tasks) to allow assessment of whether gains are attributable to the aligned losses rather than artifacts.

    Authors: We agree that the experimental presentation must be expanded for reproducibility and to isolate the contribution of the aligned losses. The current Section 4 already reports results on GSM8K and MATH with Llama-2-7B and Mistral-7B, comparing Compute Aligned SFT/RL against standard SFT (cross-entropy) and RL (PPO) under identical base models and the same test-time strategies (best-of-n, majority vote). In the revision we will add: (i) full hyperparameter tables and training curves, (ii) effect sizes with absolute and relative accuracy gains at varying n (e.g., +4.2 points at n=16 on GSM8K), (iii) ablations removing the operator modeling or the entropy term, (iv) error bars from 5 independent runs with statistical significance tests, and (v) explicit confirmation that all methods use the same base policy initialization and evaluation protocol. These additions will be placed in an enlarged Section 4 and a new appendix with raw numbers. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation starts from external operator modeling choice

full rationale

The paper's central step is to model test-time inference strategies (best-of-n, majority vote, etc.) as operators applied to a base policy, then derive SFT and RL losses that align training with those operators. This modeling choice is introduced as a conceptual premise rather than derived from or fitted to the target performance metric. No equations reduce a claimed prediction back to a fitted parameter by construction, no self-citations are load-bearing for the uniqueness or correctness of the derivation, and the empirical improvements are presented as external validation rather than tautological. The derivation chain therefore remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on one central modeling assumption and introduces no new free parameters or invented entities in the abstract.

axioms (1)
  • domain assumption Test-time inference strategies can be conceptualized as operators on the base policy.
    This modeling step is the foundation for deriving the new loss functions.

pith-pipeline@v0.9.0 · 5419 in / 1111 out tokens · 21318 ms · 2026-05-08T04:09:47.504615+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What should post-training optimize? A test-time scaling law perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

Reference graph

Works this paper leans on

54 extracted references · 15 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    The bitter lesson

    Richard S Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/ BitterLesson.html, 2019. Blog post, Incomplete Ideas

  2. [2]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361

  3. [4]

    URLhttps://arxiv.org/abs/2408.03314

  4. [5]

    Chain-of-thought prompting elicits reason- ing in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reason- ing in large language models. InAdvances in Neural Information Processing Sys- tems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf

  5. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. URLhttps: //arxiv.org/abs/2107.03374

  6. [7]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations, 2023. URL https:// openreview.net/forum?id=1PL1NIMMrw

  7. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. URLhttps://arxiv.org/ abs/2110.14168

  8. [9]

    Rethinking fine- tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning

    Feng Chen, Allan Raventós, Nan Cheng, Surya Ganguli, and Shaul Druckmann. Rethinking fine- tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning. InAdvances in Neural Information Processing Systems, 2025. URLhttps://openreview.net/ forum?id=jvVQeSMeGM

  9. [10]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InAdvances in Neural Information Processing Systems, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InAdvances in Neural Information Processing Systems, 2025. URLhttps://openreview.net/ forum?id=4OsgYD7em5

  10. [11]

    Weight ensembling improves reasoning in language models

    Xingyu Dang, Christina Baek, Kaiyue Wen, J Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models. InConference on Language Modeling, 2025. URLhttps://openreview.net/forum?id=S2IKxulLT1. 13

  11. [13]

    URLhttps://arxiv.org/abs/2503.19595

  12. [14]

    Inference-aware fine- tuning for best-of-n sampling in large language models

    Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Aviral Kumar, Rishabh Agarwal, Sridhar Thiagarajan, Craig Boutilier, and Aleksandra Faust. Inference-aware fine- tuning for best-of-n sampling in large language models. InInternational Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=77gQUdQhE7

  13. [15]

    Mistral 7B

    Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Chaplot Devendra, Guillaume Lample, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023. URL https: //arxiv.org/abs/2310.06825. License: Apache 2.0

  14. [16]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems, 2021. URLhttps://arxiv.org/abs/ 2103.03874. License: MIT

  15. [17]

    Protgpt2 is a deep unsupervised language model for protein design.Nature Communications, 13, 2022

    Noelia Ferruz, Steffen Schmidt, and Birte Höcker. Protgpt2 is a deep unsupervised language model for protein design.Nature Communications, 13, 2022. URLhttps://www.nature.com/ articles/s41467-022-32007-7. License: MIT

  16. [18]

    Accurate computational design of multipass transmembrane proteins.Science, 359(6379):1042–1046, 2018

    Peilong Lu, Duan Min, Frank DiMaio, Karen Y Wei, Michael D Vahorn, Jacob M Snyder, Thomas J Riley, and David Baker. Accurate computational design of multipass transmembrane proteins.Science, 359(6379):1042–1046, 2018

  17. [19]

    Protein folding and misfolding.Nature, 426(6968):884–890, 2003

    Christopher M Dobson. Protein folding and misfolding.Nature, 426(6968):884–890, 2003

  18. [20]

    Enhancement of soluble protein expression through the use of fusion tags.Current opinion in biotechnology, 17(4):353–358, 2006

    Dominic Esposito and Deb K Chatterjee. Enhancement of soluble protein expression through the use of fusion tags.Current opinion in biotechnology, 17(4):353–358, 2006

  19. [21]

    Fusion tags for protein solubility, purification and immunogenicity in escherichia coli: the novel fh8 system.Frontiers in microbiology, 5:63, 2014

    Soraia Costa, Andreia Almeida, Artur Castro, and Lucília Domingues. Fusion tags for protein solubility, purification and immunogenicity in escherichia coli: the novel fh8 system.Frontiers in microbiology, 5:63, 2014

  20. [22]

    University of Michigan Press, 1975

    John H Holland.Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press, 1975

  21. [23]

    Bandit based monte-carlo planning

    Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. InEuropean Conference on Machine Learning, pages 282–293, 2006

  22. [24]

    Scaling Test-Time Compute for Agentic Coding

    Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling test-time compute for agentic coding, 2026. URLhttps://arxiv.org/abs/2604.16529

  23. [25]

    MIT Press, 2016

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org. 14

  24. [26]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. InAdvances in Neural Information Processing Systems, volume 12, 1999. URLhttps://proceedings.neurips.cc/ paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

  25. [27]

    , title =

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist rein- forcement learning.Machine Learning, 8:229–256, 1992. URLhttps://link.springer.com/ article/10.1007/BF00992696

  26. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Alan Song, Mingchuan Xiao, et al. Deepseek- math: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

  27. [29]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. URLhttps://arxiv. org/abs/1707.06347

  28. [30]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation. InInternational Con- ference on Learning Representations, 2016. URLhttps://arxiv.org/abs/1506.02438

  29. [31]

    Understanding the impact of entropy on policy optimization

    Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 151–160. PMLR, 09–15 Jun 2019. URL htt...

  30. [32]

    Maximum entropy RL (provably) solves some robust RL problems

    Benjamin Eysenbach and Sergey Levine. Maximum entropy RL (provably) solves some robust RL problems. InInternational Conference on Learning Representations, 2022. URLhttps: //arxiv.org/abs/2103.06257

  31. [33]

    T., Reingold, O., Sharan, V ., and Wieder, U

    Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors, 2021. URLhttps://arxiv.org/abs/2109.05389

  32. [34]

    Unsloth: Accelerating large language model fine-tuning, 2023

    Daniel Han and Michael Han. Unsloth: Accelerating large language model fine-tuning, 2023. URLhttps://github.com/unslothai/unsloth. License: Apache 2.0

  33. [35]

    Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022. URL https://github.com/huggingface/peft. License: Apache 2.0

  34. [36]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of- the-ar...

  35. [37]

    Trl: Transformer reinforcement learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning, 2020. URLhttps://github.com/huggingface/trl. License: Apache 2.0

  36. [38]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id= nZeVKeeFYf9

  37. [39]

    Qlora: Ef- ficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Ef- ficient finetuning of quantized llms. InAdvances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html

  38. [40]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. URLhttps://openreview.net/forum?id= Bkg6RiCqY7

  39. [41]

    bitsandbytes: Accessible large language models via k-bit quantization for pytorch, 2022

    Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. bitsandbytes: Accessible large language models via k-bit quantization for pytorch, 2022. URLhttps://github.com/ bitsandbytes-foundation/bitsandbytes. License: MIT

  40. [42]

    The 18 model is already likely enough to generate the correct answer withinN tries; stop updating parameters for this sample and focus on harder examples

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022. URLhttps://proceedings.neurips.cc/paper_files/ pa...

  41. [43]

    clearing the field

    Pairwise Contrastive (Hybrid):A joint objective combining standard SFT with a token- level contrastive loss, weighted equally (0.5· LCE + 0.5· L Contrast). Contrastive ImplementationInstead of explicitly calculating the full margin over the sequence, we implemented a computationally efficient pairwise approximation. For each valid token in the ground trut...

  42. [44]

    The Duel

    Conservative Magnitude:Because the error vector aligns with our approximation, our calculated scalar weight strictly underestimates the true learning signal. Theorem 3(Directional Alignment and Conservative Bound).Assuming proportional decay and a competitive test-time strategy, the off-diagonal error aligns with the diagonal approximation (⟨gdiag, ϵvec⟩ ...

  43. [45]

    strength of the opposition

    Luckily, if the answer is incorrect,R(yi|x)is0anyways. Thus the RL update weight for a sampleyi is: ˜Rpass(yi|x) =R(y i|x)·N(1−p) N−1 (54) B.2.2 Majority Vote (Dynamic Threshold) Previously, we computed˜p=PN i=k N i pi(1 −p )N−i and ∂˜p ∂p = N N−1 k−1 pk−1(1 −p )N−k with k being the threshold required for the answer to be chosen. Previously, we had chosen...

  44. [46]

    tipping point

    Step Size Stability:The expected magnitude of the CAT multiplier across the batch is exactly 1 (E[ ˜w] = 1). This completely decouples the scale of the learning rate from the test-time budgetN, ensuring optimization stability regardless of the strategy used. 2.Preservation of Relative Capacity:Because the normalization factor is shared across all prompts ...

  45. [47]

    performance. If a sample is not in the top quantile of the model’s potential outputs, it contributes zero signal. The model is only updated based on its best attempts, creating a

    Winner-Take-All Dynamics.As visualized in Figure 16, the gradient weight vanishes for the bottom percentile of samples and explodes for the top percentile. The objective effectively ignores "average" performance. If a sample is not in the top quantile of the model’s potential outputs, it contributes zero signal. The model is only updated based on its best...

  46. [48]

    To minimize variance and maximize the average, these objectives encourage the model to collapse probability mass onto a single mode, usually a "safe," generic response

    Breaking Mode Collapse.Standard objectives maximizeexpectedutility. To minimize variance and maximize the average, these objectives encourage the model to collapse probability mass onto a single mode, usually a "safe," generic response. Diverging from this safe mode is typically penalized, as low-probability paths are treated as noise. Best-of-N inverts t...

  47. [49]

    Safety Net

    The "Safety Net" Effect.Because the inference strategy acts as a filter, the model is not penalized for generatingN− 1failures, provided the N-th sample succeeds. This effectively creates a safety net during training. The objective signals to the model:"You are allowed to failN− 1times, as long as your variance is high enough to produce one winner."This t...

  48. [50]

    Gradients here are "wasted" on perfecting samples that are already good enough

    The Waste Region (Ωwaste):Where the training objective applies pressure, but the test metric is already satiated (wtest ≈ 0). Gradients here are "wasted" on perfecting samples that are already good enough

  49. [51]

    satiation thresholds

    The Starvation Region (Ωstarve):Where the test metric demands improvement (wtest > 0), but the training objective provides no signal (wtrain ≈0). 51 The Alignment Coefficient A is mathematically dominated by the integral over the overlap Soverlap. Therefore, a lowA guarantees high gradient misallocation. It allows us to detect inefficiency without needing...

  50. [52]

    Log-Weighted

    on the 4-bit quantized [37] Mistral-7B-Instruct-v0.2 model. We use the AdamW [38] optimizer. A critical detail of our GRPO setup is the group size (number of generations per prompt), which we set toG = 4to balance variance reduction with memory constraints. The full optimization hyperparameters are detailed in Table 11. J.2 Reward Formulation Because Pass...

  51. [53]

    This phase stabilizes the instruction-following capabilities and ensures the model outputs valid reasoning traces

    SFT Warmup:The model was first fine-tuned for 3 epochs on the target dataset (MATH levels 1–3) using standard Cross-Entropy loss. This phase stabilizes the instruction-following capabilities and ensures the model outputs valid reasoning traces

  52. [54]

    strength

    Strategy-Aware RL:The warmed-up model was then trained for 1 epoch using our custom weighted gradient estimator. K.2 Dynamic Consensus Thresholding A critical challenge in optimizing for Majority Vote is determining the required consensus threshold k during training. While at test timek is fixed (e.g.,⌊N/2⌋ + 1), during training with small batch sizes (ro...

  53. [55]

    spotlight

    Superiority of RL Weights:The RL_Wt_Maj4 model achieves the highest asymptotic performance (23.00% at Maj@16), outperforming both the baseline and the SFT-weighted variants. This supports the hypothesis that the "spotlight" behavior of the raw derivative, which vanishes for easy/hard samples and explodes at the boundary,is a feature, not a bug, for consen...

  54. [56]

    blurring

    The Trade-off:The SFT_Wt models (Blue/Orange) start slower (lower Maj@4) but scale robustly. Their normalized weights effectively reduce variance, but at the cost of "blurring" the critical decision boundary signal needed to maximize the plurality vote. 67