
arxiv: 2604.24957 · v2 · pith:BI7PFZK3new · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Compute Aligned Training: Optimizing for Test Time Inference

Pith reviewed 2026-05-21 08:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords compute aligned training · test-time scaling · large language models · inference operators · loss function derivation · supervised fine-tuning · reinforcement learning · policy optimization

The pith

Training language models with losses that anticipate test-time strategies improves results when those strategies are later applied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are usually trained to maximize the likelihood of single outputs, yet real inference often applies extra computation through strategies that aggregate, filter, or select among multiple outputs. This paper introduces Compute Aligned Training to close the gap by treating common test-time strategies as operators that act on the base policy. From that modeling choice the authors derive modified loss functions for both supervised fine-tuning and reinforcement learning. Experiments show the resulting models improve more rapidly as test-time compute is increased than models trained under standard objectives. A reader would conclude that training should be designed with the eventual inference procedure already in view rather than optimized in isolation.

Core claim

By conceptualizing inference strategies as operators on the base policy, new loss functions are derived that maximize performance when said strategies are applied. The approach is instantiated for SFT and RL across common test-time strategies, and empirical results indicate substantial gains in test-time scaling over standard training.

What carries the argument

Modeling inference strategies as operators on the base policy in order to derive loss functions aligned with their use at test time.
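To make the operator idea concrete, here is a minimal sketch under a simplifying assumption that is not taken from the paper's text on this page: each sample is independently correct with probability p, and correctness is binary. Under that assumption, Pass@N and a threshold-style majority vote become simple functions of p, and their derivatives indicate how strongly a change in p moves the test-time metric. The function names are hypothetical.

```python
from math import comb


def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples is correct."""
    return 1.0 - (1.0 - p) ** n


def majority_vote(p: float, n: int, k: int) -> float:
    """Probability that at least k of n i.i.d. samples are correct.

    Assumption: the vote succeeds once k correct samples appear
    (a binomial tail); real plurality voting over many distinct
    wrong answers is more permissive than this.
    """
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def d_pass_at_n(p: float, n: int) -> float:
    """Derivative of pass_at_n with respect to p."""
    return n * (1.0 - p) ** (n - 1)


def d_majority_vote(p: float, n: int, k: int) -> float:
    """Derivative of the binomial tail with respect to p."""
    return n * comb(n - 1, k - 1) * p ** (k - 1) * (1.0 - p) ** (n - k)


if __name__ == "__main__":
    for p in (0.05, 0.3, 0.5, 0.9):
        print(p, pass_at_n(p, 16), d_pass_at_n(p, 16), d_majority_vote(p, 16, 9))
```

Under this toy model the Pass@N derivative is largest for low p and vanishes as p approaches 1, while the majority-vote derivative peaks near the threshold k/n. This is consistent with the figure captions below, which describe gradient weight concentrating on the decision boundary.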

If this is right

  • Models trained with the derived losses exhibit higher final accuracy when the corresponding test-time operator is applied.
  • The alignment method extends to both supervised fine-tuning and reinforcement learning training regimes.
  • Performance improves more steeply as additional compute is allocated at inference time compared with conventionally trained models.
  • New loss functions become available for any inference strategy once a suitable operator is defined.
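For the RL case, the Figure 2 caption reproduced further down says the GRPO objective is modified by applying scaling factors directly to the normalized advantages. The sketch below shows one way that could look. It rests on assumptions not confirmed by this page: rewards are binary, the group mean estimates the per-sample success probability, and the scaling factor is the Pass@N derivative N(1-p)^(N-1). All function names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_n_scale(p: float, n: int) -> float:
    """Assumed scaling factor: derivative of 1 - (1 - p)^n with respect to p."""
    return n * (1.0 - p) ** (n - 1)


def scaled_advantages(rewards, n: int):
    """Multiply every advantage in the group by a prompt-level scaling factor.

    Assumption: rewards are 0/1, so their mean estimates the success rate p.
    """
    p_hat = mean(rewards)
    scale = pass_at_n_scale(p_hat, n)
    return [scale * a for a in grpo_advantages(rewards)]


if __name__ == "__main__":
    # One correct answer out of four rollouts, target budget N = 4.
    print(scaled_advantages([1.0, 0.0, 0.0, 0.0], n=4))
```

With this choice, prompts the model already solves reliably get a scale near zero, so their rollouts contribute little to the update. Prompts with low but nonzero success rates are weighted up.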

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same operator-based alignment could be applied to training pipelines that use search or planning at test time.
  • Developers might begin by designing the inference procedure and then derive the training losses that procedure requires.
  • Models could be trained under explicit assumptions about the inference budget rather than pure next-token likelihood.
  • Hybrid systems that jointly optimize training losses and test-time operators may emerge as a next step.

Load-bearing premise

Test-time strategies can be modeled as operators on the base policy such that the resulting losses improve actual inference procedures without creating new failure modes.

What would settle it

A direct comparison in which a model trained with the new loss for a chosen strategy, such as best-of-N, shows no accuracy gain over a standard model when best-of-N is actually run at test time on new tasks.
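A comparison of this kind could be scored with the standard unbiased Pass@k estimator, which computes 1 - C(n-c, k)/C(n, k) from n samples per problem of which c are correct. The sketch below is illustrative only. The per-problem correct counts are placeholders, not results from the paper, and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average Pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def pass_at_k_delta(counts_model, counts_baseline, n: int, k: int) -> float:
    """Pass@k of the candidate model minus Pass@k of the baseline."""
    return mean_pass_at_k(counts_model, n, k) - mean_pass_at_k(counts_baseline, n, k)


if __name__ == "__main__":
    # Placeholder counts for three held-out problems, 16 samples each.
    aligned = [2, 0, 5]
    baseline = [1, 0, 9]
    for k in (1, 4, 16):
        print(k, pass_at_k_delta(aligned, baseline, n=16, k=k))
```

If the delta stays at or below zero across budgets on held-out tasks, with the same sampling settings for both models, that would count against the claim.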

Figures

Figures reproduced from arXiv: 2604.24957 by Adam Ousherovitch, Ambuj Tewari.

Figure 1: Pass@k Improvement over SFT. The performance difference (Pass@k_model − Pass@k_SFT). High-N models (Purple) sacrifice low-budget accuracy to achieve superior performance at scale. To validate CAT more broadly, we transition to Majority Vote. We trained models across varying budgets N ∈ {8, 16, 64}, selecting k for each model via sweeps (Section I.4). In the previous experiment, we trained the models on just …
Figure 2: Comparison of Best Models vs SFT. All models demonstrate strong test time scaling, confirming that CAT modifies the distribution to support aggregation. 3.2 Beyond SFT: Reinforcement Learning Having established CAT’s efficacy for SFT, we extend the framework to RL. To align RL, we modify the GRPO objective by applying our scaling factors directly to the normalized advantages (derivation in Section C). We e…
Figure 4: Majority Vote RL Scaling. The CAT models significantly outperform the baseline at higher inference budgets.
Figure 5: Unconditional Shift. Standard RL (Blue) is stuck in the local "Trap." BoN models (Green/Red) cross the "Valley" to reach the "Jackpot."
Figure 7: Unconditional Scaling. Models trained with BoN objectives exhibit superior scaling behavior at test time.
Figure 9: Empirical distribution of optimization efficiency (
Figure 10: Empirical distribution of optimization efficiency (
Figure 11: SFT Scaling Factor for Pass@N
Figure 14: RL Scaling Factor for Majority Vote (K = 50%). The gradient focuses exclusively on the decision boundary, vanishing for "hopeless" or "secure" samples. SFT As illustrated in
Figure 15: RL Scaling Factor for Majority Vote with varying thresholds
Figure 16: RL Scaling Factor for Best-of-N. The gradient weight grows exponentially with the sample’s quantile, ignoring average outputs to focus on the top percentile. 1. Winner-Take-All Dynamics. As visualized in
Figure 17: The Support Mismatch: Side-by-side visualization of strategy-aware SFT scaling factors w(p) compared to the uniform gradient pressure of Standard SFT (Gray dashed line). Left (Pass@N): The objective acts as an efficiency regularizer, exposing a massive "Waste Region" where standard SFT unnecessarily optimizes already-solved problems (the "SFT Tax"). Right (Majority Vote): The objective concentrates gradie…
Figure 18: Maj@64 Hyperparameter Sweep. Performance delta relative to SFT. The loose threshold (25%, Blue) fails to improve over the baseline, while the stricter threshold (40%, Green) recovers performance. J Pass@N Reinforcement Learning Implementation Details This and all other experiments were run on an Nvidia V100 GPU on a cluster with around 20 Gigabytes of storage. J.1 Training Configuration and Data Processin…
Figure 19: Scaling Laws for RL Weighting Strategies. The Log-Weighted estimator at N = 16 avoids the optimization instability of the Pure RL estimator, allowing the model to successfully translate a higher training budget into superior test-time scaling. We trained models using both estimators across two different target budgets (N = 4 and N = 8). The results are visualized in
Figure 20: Full Scaling Laws (All Models). A comparison of Majority Vote accuracy across inference budgets k. The strategy-aware models (colored lines) consistently outperform the Standard RL baseline (black line) at higher budgets. Notably, the RL_Wt_Maj4 model (Green) achieves the steepest scaling curve, demonstrating that the raw marginal utility estimator provides the strongest signal for consensus optimization.…
read the original abstract

Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.
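One way to read "operators on the base policy" is through the chain rule. The notation here is introduced for illustration and is an assumption, not the paper's: let p_θ(x) be the probability that a single sample for prompt x is correct, and let f_O map it to the success probability after the test-time strategy O is applied. Then, as a sketch:

```latex
\begin{align*}
J_O(\theta) &= \mathbb{E}_x\big[\tilde p_\theta(x)\big],
  \qquad \tilde p_\theta(x) = f_O\big(p_\theta(x)\big), \\
\nabla_\theta J_O(\theta)
  &= \mathbb{E}_x\big[f_O'\big(p_\theta(x)\big)\,\nabla_\theta p_\theta(x)\big]
   = \mathbb{E}_x\big[f_O'\big(p_\theta(x)\big)\,p_\theta(x)\,\nabla_\theta \log p_\theta(x)\big], \\
\text{Pass@}N:\quad f_O(p) &= 1-(1-p)^N,
  \qquad f_O'(p) = N(1-p)^{N-1}.
\end{align*}
```

Under this reading, the factor f_O'(p) reweights the usual likelihood gradient according to how much the test-time metric still stands to gain from a change in p. Whether the paper normalizes or otherwise modifies this weight is not visible in the text reproduced here.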

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Compute Aligned Training for LLMs, framing test-time inference strategies (e.g., self-consistency, best-of-N) as operators O acting on a base policy π. It derives new loss functions for both SFT and RL that optimize expected performance under O(π) rather than under the base policy alone, and reports empirical results showing improved test-time scaling compared to standard SFT/RL baselines.

Significance. If the operator-based losses are correctly derived and the reported gains are robust, the work would offer a principled route to close the train-test mismatch that currently limits returns from test-time compute. The framing could generalize across multiple aggregation strategies and provide a template for future alignment of training objectives with inference procedures.

major comments (3)
  1. [§3] §3 (Loss Derivation): For non-linear operators such as majority vote or self-consistency, the effective operator O(π) changes as π is updated during training. The gradient of the proposed loss therefore depends on the current policy; the manuscript does not show that the final model is a stationary point of the intended test-time objective or provide a convergence argument that accounts for this non-stationarity.
  2. [§4] §4 (Empirical Evaluation): The abstract and results claim substantial improvements in test-time scaling, yet no equations, exact loss formulations, baseline details, or error bars are visible in the provided text. Without these, it is impossible to verify whether the gains survive the non-stationarity concern or simply reflect higher base likelihoods.
  3. [§2] §2 (Operator Definition): The claim that any test-time strategy can be usefully modeled as a fixed operator on the base policy is load-bearing. For strategies that involve sampling multiple trajectories and then aggregating, the mapping is stochastic and policy-dependent; the paper does not demonstrate that the derived losses remain well-defined or produce models that generalize to real inference pipelines without additional hyperparameter tuning.
minor comments (2)
  1. [§3] Notation for the operator O and the resulting loss should be introduced with an explicit equation early in §3 rather than described only in prose.
  2. [§3] The manuscript should include a short table comparing the proposed losses to standard SFT/RL objectives side-by-side.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach and indicating where revisions have been made to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Loss Derivation): For non-linear operators such as majority vote or self-consistency, the effective operator O(π) changes as π is updated during training. The gradient of the proposed loss therefore depends on the current policy; the manuscript does not show that the final model is a stationary point of the intended test-time objective or provide a convergence argument that accounts for this non-stationarity.

    Authors: We agree that non-linear operators introduce non-stationarity because O(π) depends on the evolving policy. Our derivation computes the gradient with respect to the current policy at each step, treating the operator application as fixed for that gradient computation—an approximation common in policy optimization. In the revised manuscript we have added a paragraph in §3 discussing this dependence, along with a simple argument that under sufficiently slow policy updates the procedure reaches an approximate stationary point of the test-time objective. We also report additional training curves showing convergence to improved test-time performance. A fully rigorous convergence proof for arbitrary non-linear operators is left for future work. revision: partial

  2. Referee: [§4] §4 (Empirical Evaluation): The abstract and results claim substantial improvements in test-time scaling, yet no equations, exact loss formulations, baseline details, or error bars are visible in the provided text. Without these, it is impossible to verify whether the gains survive the non-stationarity concern or simply reflect higher base likelihoods.

    Authors: The complete manuscript contains the exact loss derivations (Equations 3–7 in §3) for both the SFT and RL cases, together with the operator definitions and baseline implementations. Standard SFT and RLHF objectives serve as the baselines. We have revised §4 to include error bars computed over five random seeds for all scaling curves and have added a table listing the precise loss formulations and hyper-parameters. These additions allow direct verification that the reported gains arise from alignment with the test-time operator rather than from increased base likelihood alone. revision: yes

  3. Referee: [§2] §2 (Operator Definition): The claim that any test-time strategy can be usefully modeled as a fixed operator on the base policy is load-bearing. For strategies that involve sampling multiple trajectories and then aggregating, the mapping is stochastic and policy-dependent; the paper does not demonstrate that the derived losses remain well-defined or produce models that generalize to real inference pipelines without additional hyperparameter tuning.

    Authors: We define each test-time strategy as a (possibly stochastic) operator O that maps the base policy π to an effective output distribution or selection process; the training loss is the expected performance under samples drawn from O(π). Although the mapping is stochastic and policy-dependent for multi-sample aggregation, the loss remains well-defined as an expectation that can be estimated by sampling during training. Experiments in §4 demonstrate that models trained with these losses improve scaling curves when evaluated in standard inference pipelines using the same hyper-parameters reported in the paper, without requiring additional tuning. We have added a clarifying sentence in §2 on the stochastic nature of the operator and how it is handled in the Monte-Carlo estimate of the loss. revision: partial
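The Monte-Carlo estimate mentioned in the last response can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: the policy is replaced by a fixed categorical distribution over final answers, and the operator is plurality voting over N independent draws. The names `mc_majority_accuracy` and `answer_probs` are hypothetical.

```python
import random
from collections import Counter


def majority_vote_answer(samples):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def mc_majority_accuracy(answer_probs, correct, n, trials=2000, seed=0):
    """Monte-Carlo estimate of accuracy after majority vote over n samples.

    answer_probs: dict mapping each candidate answer to its probability
                  under the (stand-in) base policy.
    correct:      the ground-truth answer.
    """
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    hits = 0
    for _ in range(trials):
        draws = rng.choices(answers, weights=weights, k=n)
        hits += majority_vote_answer(draws) == correct
    return hits / trials


if __name__ == "__main__":
    policy = {"A": 0.6, "B": 0.4}  # toy stand-in for pi(. | x)
    for n in (1, 5, 15):
        print(n, mc_majority_accuracy(policy, correct="A", n=n))
```

Because the estimate is an average over sampled groups, it stays well-defined even though the operator's output depends on the current policy. The referee's concern about non-stationarity is about what happens to this quantity as the policy changes between updates.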

Circularity Check

0 steps flagged

Derivation self-contained with no reduction to inputs by construction

full rationale

The paper models test-time strategies as operators on the base policy and derives aligned loss functions for SFT and RL to optimize performance under those operators. No equations, self-citations, or fitted parameters are presented that reduce the claimed losses or predictions directly back to the same test-time behaviors by definition or statistical forcing. The conceptual framing introduces an independent alignment objective rather than renaming or tautologically re-expressing existing quantities, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the operator conceptualization is presented as a modeling choice but not detailed.

pith-pipeline@v0.9.0 · 5650 in / 974 out tokens · 42141 ms · 2026-05-21T08:37:51.917085+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What should post-training optimize? A test-time scaling law perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
