pith. machine review for the scientific record.

arxiv: 2605.11299 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.CL · cs.SE

Recognition: no theorem link

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.SE
keywords self-training · test-time scaling · code generation · reinforcement learning · ranking · GRPO · LiveCodeBench · dual judgment

The pith

A model trained solely to rank its own code attempts generates better programs without direct correctness rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that test-time scaling produces comparative information across multiple generated programs that can be turned into a training signal instead of being discarded. DuST samples candidates from the model itself, executes them in a sandbox, keeps groups with both successes and failures, and applies on-policy RL to teach the model to rank them by correctness. This dual judgment training improves the model's ability to discriminate good attempts from bad ones and, surprisingly, also improves its ability to generate correct programs on the first try. Across five models the gains appear consistently on LiveCodeBench: judgment quality, single-sample accuracy, and Best-of-4 performance all rise, with the trained single rollout matching the base model's Best-of-4 level. A reader would care because this recycles an existing inference technique into a self-improvement loop for code generation.
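A minimal, runnable sketch of the data-construction step this describes: candidates sampled per problem, labeled by execution, then filtered so that only groups mixing successes and failures survive. The record layout and function names are illustrative assumptions, not the paper's code.

    def build_ranking_groups(problems):
        """Keep only candidate groups with both successes and failures;
        uniform groups carry no comparative ranking signal."""
        groups = []
        for prob in problems:
            labels = [c["passed"] for c in prob["candidates"]]
            if any(labels) and not all(labels):
                # Target order: correct programs ranked ahead of incorrect ones.
                order = sorted(range(len(labels)), key=lambda i: labels[i], reverse=True)
                groups.append({"problem": prob["statement"],
                               "candidates": prob["candidates"],
                               "target_order": order})
        return groups

    # Toy data: the first problem (mixed outcomes) is kept, the second (all failures) is dropped.
    problems = [
        {"statement": "sum two ints",
         "candidates": [{"code": "return a + b", "passed": True},
                        {"code": "return a - b", "passed": False}]},
        {"statement": "reverse a string",
         "candidates": [{"code": "return s", "passed": False},
                        {"code": "return s[:]", "passed": False}]},
    ]
    print(len(build_ranking_groups(problems)))  # 1

The on-policy GRPO ranking step then consumes these groups; that part depends on the paper's training stack and is not sketched here.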

Core claim

DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales, DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance.
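Judgment quality is reported in NDCG. A minimal sketch of scoring a predicted ranking against binary execution labels with standard binary-relevance NDCG; the paper's +6.2 figure suggests a 0-100 scale, and the exact variant it uses is an assumption.

    import math

    def ndcg(ranked_labels):
        """ranked_labels: pass (1) / fail (0) of each candidate in the model's
        predicted order, best first."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_labels))
        ideal = sorted(ranked_labels, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0

    print(round(ndcg([1, 0, 0, 0]), 3))  # 1.0   -- ideal: the one correct candidate ranked first
    print(round(ndcg([0, 1, 0, 0]), 3))  # 0.631 -- the correct candidate ranked second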

What carries the argument

DuST (Dual Self-Training), a framework that converts comparative correctness labels from sandbox execution of multiple self-generated candidates into on-policy RL training for ranking by correctness.

If this is right

  • Single-sample pass@1 accuracy rises even though the training objective never rewards correct programs directly.
  • Best-of-4 test-time scaling performance improves consistently across model families and sizes from 4B to 30B.
  • Supervised fine-tuning on the same ranking data improves judgment quality but leaves generation unchanged, confirming that on-policy RL is required for the transfer.
  • The trained model's single rollout matches the base model's Best-of-4 accuracy on LiveCodeBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Discriminative ranking on self-generated data may serve as a general bootstrap for generative improvement in other domains that already use test-time sampling.
  • The approach could lessen dependence on external verifiers by strengthening the model's internal ability to both judge and generate.
  • Applying the same dual-space loop to multi-step reasoning tasks might compound gains because richer candidate sets supply denser comparative signals.

Load-bearing premise

The comparative ranking information obtained from sandbox execution of multiple candidates provides a training signal that transfers via on-policy RL back into improved primal generation rather than only improving discrimination.

What would settle it

If DuST training produces no gain in single-sample pass@1 accuracy on LiveCodeBench or if the trained model's single rollout no longer reaches the base model's Best-of-4 performance level.
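Both quantities in this test reduce to counts over a fixed problem set: single-sample pass@1 checks one unassisted rollout, while Best-of-4 selects the candidate the model's own judgment scores highest and checks only that one. A hedged sketch follows; the judge_score interface is an assumption, not the paper's implementation.

    def best_of_n(candidates, judge_score):
        """Return the candidate the judge ranks highest; correctness is then
        checked on that single selected program."""
        return max(candidates, key=judge_score)

    # Toy example: judge scores loosely tracking correctness, as DuST training intends.
    candidates = [
        {"code": "def f(a, b): return a - b", "passed": False},
        {"code": "def f(a, b): return a + b", "passed": True},
        {"code": "def f(a, b): return b",     "passed": False},
        {"code": "def f(a, b): return a * b", "passed": False},
    ]
    scores = {c["code"]: s for c, s in zip(candidates, [0.2, 0.9, 0.1, 0.3])}
    chosen = best_of_n(candidates, lambda c: scores[c["code"]])
    print(chosen["passed"])  # True: Best-of-4 credits the problem even though 3 of 4 samples fail

The headline comparison is then the trained model's single-rollout accuracy against the base model's Best-of-4 rate on the same LiveCodeBench problems.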

read the original abstract

Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DuST (Dual Self-Training), a self-training framework that generates multiple candidate programs from the model, labels them via sandbox execution to obtain comparative ranking signals, and applies GRPO to train the model discriminatively on these rankings. It reports that this improves both judgment (NDCG) and generation (pass@1 and Best-of-4) on LiveCodeBench across five models, with the trained single rollout matching the base model's Best-of-4; an SFT control on identical data isolates on-policy RL as the transfer mechanism from dual judgment to primal generation gains.

Significance. If the results hold, the work is significant for demonstrating that relative correctness information from test-time scaling can be recycled via on-policy RL to improve both discrimination and generation without direct correctness rewards, with the SFT ablation providing a clean isolation of the RL component. This offers a scalable self-improvement path for code and reasoning models that leverages existing inference-time compute.

major comments (1)
  1. [Empirical results] The reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates with no error bars, standard deviations, or multi-seed statistics. These numbers are load-bearing for the central claim of consistent improvements across model families and scales, so without variance estimates the empirical support remains only moderate.
minor comments (2)
  1. [Experimental setup] The manuscript would benefit from explicit discussion of potential confounds such as data contamination on LiveCodeBench or sandbox execution reliability, even if briefly addressed in the experimental setup.
  2. Clarify the exact composition of the five models (specific names and parameter counts beyond the 4B-30B range) and the precise LiveCodeBench version used for all reported numbers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the empirical presentation concern below.

read point-by-point responses
  1. Referee: [Empirical results] The reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates with no error bars, standard deviations, or multi-seed statistics. These numbers are load-bearing for the central claim of consistent improvements across model families and scales, so without variance estimates the empirical support remains only moderate.

    Authors: We agree that single-point estimates limit the strength of claims about consistency. In the revision we will add multi-seed statistics: we have already begun rerunning the primary LiveCodeBench evaluations (Qwen3-30B-Thinking and the other four models) with three independent random seeds each, reporting means and standard deviations for NDCG, pass@1, and Best-of-4. These will appear in updated tables and the main results figure, with a short methods paragraph describing the seed protocol. The additional runs are computationally modest and do not alter the experimental design. revision: yes
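The proposed seed protocol amounts to reporting a mean and spread per metric and model. A sketch of that aggregation; the values below are placeholders for illustration only, not results from the paper or from the promised reruns.

    import statistics

    def aggregate(seed_values):
        """Mean and sample standard deviation over independent training/evaluation seeds."""
        return statistics.mean(seed_values), statistics.stdev(seed_values)

    # Placeholder values: three seeds for one metric on one model.
    pass_at_1_by_seed = [41.2, 40.8, 41.9]
    mean, std = aggregate(pass_at_1_by_seed)
    print(f"pass@1 = {mean:.1f} ± {std:.1f}")  # pass@1 = 41.3 ± 0.6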

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claim is an empirical result: on-policy GRPO training on sandbox-derived ranking labels from the model's own samples improves both judgment (NDCG) and generation (pass@1, Best-of-4) on held-out LiveCodeBench problems. This is isolated by an SFT control on identical data that improves only judgment, confirming the transfer effect is not forced by the discriminative objective or by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided derivation chain; the performance gains are externally measured rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that sandbox execution yields reliable binary correctness labels and that GRPO can propagate ranking information into generation policy improvement; no free parameters or new entities are introduced beyond standard RL components.

axioms (1)
  • domain assumption Sandbox execution provides accurate pass/fail labels for generated code candidates
    Used to label successes and failures in mixed groups before GRPO training.
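As a concrete reading of this axiom, a minimal sketch of pass/fail labeling by running a candidate against stdin/stdout test cases in a subprocess. A real sandbox would add isolation and resource limits; the paper's actual harness is not specified here, and a python3 interpreter on the path is assumed.

    import subprocess

    def passes_tests(code, tests, timeout=2.0):
        """Return True only if the candidate program produces the expected output
        on every test case without error or timeout."""
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    ["python3", "-c", code],
                    input=stdin_text, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    candidate = "a, b = map(int, input().split()); print(a + b)"
    print(passes_tests(candidate, [("2 3", "5"), ("10 -4", "6")]))  # True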

pith-pipeline@v0.9.0 · 5623 in / 1375 out tokens · 33518 ms · 2026-05-14T20:42:56.216161+00:00 · methodology

discussion (0)

