pith. machine review for the scientific record.

arxiv: 2605.11299 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.CL · cs.SE

Recognition: no theorem link

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.SE
keywords self-training · test-time scaling · code generation · reinforcement learning · ranking · GRPO · LiveCodeBench · dual judgment

The pith

A model trained solely to rank its own code attempts generates better programs without direct correctness rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that test-time scaling produces comparative information across multiple generated programs that can be turned into a training signal instead of being discarded. DuST samples candidates from the model itself, executes them in a sandbox, keeps groups with both successes and failures, and applies on-policy RL to teach the model to rank them by correctness. This dual judgment training improves the model's ability to discriminate good attempts from bad ones and, surprisingly, also improves its ability to generate correct programs on the first try. Across five models the gains appear consistently on LiveCodeBench: judgment quality, single-sample accuracy, and Best-of-4 performance all rise, with the trained single rollout matching the base model's Best-of-4 level. A reader would care because this recycles an existing inference technique into a self-improvement loop for code generation.
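A minimal, runnable sketch of the data-construction step this describes: candidates sampled per problem, labeled by execution, then filtered so that only groups mixing successes and failures survive. The record layout and function names are illustrative assumptions, not the paper's code.

    def build_ranking_groups(problems):
        """Keep only candidate groups with both successes and failures;
        uniform groups carry no comparative ranking signal."""
        groups = []
        for prob in problems:
            labels = [c["passed"] for c in prob["candidates"]]
            if any(labels) and not all(labels):
                # Target order: correct programs ranked ahead of incorrect ones.
                order = sorted(range(len(labels)), key=lambda i: labels[i], reverse=True)
                groups.append({"problem": prob["statement"],
                               "candidates": prob["candidates"],
                               "target_order": order})
        return groups

    # Toy data: the first problem (mixed outcomes) is kept, the second (all failures) is dropped.
    problems = [
        {"statement": "sum two ints",
         "candidates": [{"code": "return a + b", "passed": True},
                        {"code": "return a - b", "passed": False}]},
        {"statement": "reverse a string",
         "candidates": [{"code": "return s", "passed": False},
                        {"code": "return s[:]", "passed": False}]},
    ]
    print(len(build_ranking_groups(problems)))  # 1

The on-policy GRPO ranking step then consumes these groups; that part depends on the paper's training stack and is not sketched here.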

Core claim

DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales, DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance.
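Judgment quality is reported in NDCG. A minimal sketch of scoring a predicted ranking against binary execution labels with standard binary-relevance NDCG; the paper's +6.2 figure suggests a 0-100 scale, and the exact variant it uses is an assumption.

    import math

    def ndcg(ranked_labels):
        """ranked_labels: pass (1) / fail (0) of each candidate in the model's
        predicted order, best first."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_labels))
        ideal = sorted(ranked_labels, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0

    print(round(ndcg([1, 0, 0, 0]), 3))  # 1.0   -- ideal: the one correct candidate ranked first
    print(round(ndcg([0, 1, 0, 0]), 3))  # 0.631 -- the correct candidate ranked second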

What carries the argument

DuST (Dual Self-Training), a framework that converts comparative correctness labels from sandbox execution of multiple self-generated candidates into on-policy RL training for ranking by correctness.

If this is right

  • Single-sample pass@1 accuracy rises even though the training objective never rewards correct programs directly.
  • Best-of-4 test-time scaling performance improves consistently across model families and sizes from 4B to 30B.
  • Supervised fine-tuning on the same ranking data improves judgment quality but leaves generation unchanged, confirming that on-policy RL is required for the transfer.
  • The trained model's single rollout matches the base model's Best-of-4 accuracy on LiveCodeBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Discriminative ranking on self-generated data may serve as a general bootstrap for generative improvement in other domains that already use test-time sampling.
  • The approach could lessen dependence on external verifiers by strengthening the model's internal ability to both judge and generate.
  • Applying the same dual-space loop to multi-step reasoning tasks might compound gains because richer candidate sets supply denser comparative signals.

Load-bearing premise

The comparative ranking information obtained from sandbox execution of multiple candidates provides a training signal that transfers via on-policy RL back into improved primal generation rather than only improving discrimination.

What would settle it

If DuST training produces no gain in single-sample pass@1 accuracy on LiveCodeBench or if the trained model's single rollout no longer reaches the base model's Best-of-4 performance level.
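Both quantities in this test reduce to counts over a fixed problem set: single-sample pass@1 checks one unassisted rollout, while Best-of-4 selects the candidate the model's own judgment scores highest and checks only that one. A hedged sketch follows; the judge_score interface is an assumption, not the paper's implementation.

    def best_of_n(candidates, judge_score):
        """Return the candidate the judge ranks highest; correctness is then
        checked on that single selected program."""
        return max(candidates, key=judge_score)

    # Toy example: judge scores loosely tracking correctness, as DuST training intends.
    candidates = [
        {"code": "def f(a, b): return a - b", "passed": False},
        {"code": "def f(a, b): return a + b", "passed": True},
        {"code": "def f(a, b): return b",     "passed": False},
        {"code": "def f(a, b): return a * b", "passed": False},
    ]
    scores = {c["code"]: s for c, s in zip(candidates, [0.2, 0.9, 0.1, 0.3])}
    chosen = best_of_n(candidates, lambda c: scores[c["code"]])
    print(chosen["passed"])  # True: Best-of-4 credits the problem even though 3 of 4 samples fail

The headline comparison is then the trained model's single-rollout accuracy against the base model's Best-of-4 rate on the same LiveCodeBench problems.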

read the original abstract

Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DuST (Dual Self-Training), a self-training framework that generates multiple candidate programs from the model, labels them via sandbox execution to obtain comparative ranking signals, and applies GRPO to train the model discriminatively on these rankings. It reports that this improves both judgment (NDCG) and generation (pass@1 and Best-of-4) on LiveCodeBench across five models, with the trained single rollout matching the base model's Best-of-4; an SFT control on identical data isolates on-policy RL as the transfer mechanism from dual judgment to primal generation gains.

Significance. If the results hold, the work is significant for demonstrating that relative correctness information from test-time scaling can be recycled via on-policy RL to improve both discrimination and generation without direct correctness rewards, with the SFT ablation providing a clean isolation of the RL component. This offers a scalable self-improvement path for code and reasoning models that leverages existing inference-time compute.

major comments (1)
  1. [Empirical results] The reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates with no error bars, standard deviations, or multi-seed statistics. These numbers are load-bearing for the central claim of consistent improvements across model families and scales, so without variance estimates the empirical support remains only moderate.
minor comments (2)
  1. [Experimental setup] The manuscript would benefit from explicit discussion of potential confounds such as data contamination on LiveCodeBench or sandbox execution reliability, even if briefly addressed in the experimental setup.
  2. Clarify the exact composition of the five models (specific names and parameter counts beyond the 4B-30B range) and the precise LiveCodeBench version used for all reported numbers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the empirical presentation concern below.

read point-by-point responses
  1. Referee: [Empirical results] The reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates with no error bars, standard deviations, or multi-seed statistics. These numbers are load-bearing for the central claim of consistent improvements across model families and scales, so without variance estimates the empirical support remains only moderate.

    Authors: We agree that single-point estimates limit the strength of claims about consistency. In the revision we will add multi-seed statistics: we have already begun rerunning the primary LiveCodeBench evaluations (Qwen3-30B-Thinking and the other four models) with three independent random seeds each, reporting means and standard deviations for NDCG, pass@1, and Best-of-4. These will appear in updated tables and the main results figure, with a short methods paragraph describing the seed protocol. The additional runs are computationally modest and do not alter the experimental design. revision: yes
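The proposed seed protocol amounts to reporting a mean and spread per metric and model. A sketch of that aggregation; the values below are placeholders for illustration only, not results from the paper or from the promised reruns.

    import statistics

    def aggregate(seed_values):
        """Mean and sample standard deviation over independent training/evaluation seeds."""
        return statistics.mean(seed_values), statistics.stdev(seed_values)

    # Placeholder values: three seeds for one metric on one model.
    pass_at_1_by_seed = [41.2, 40.8, 41.9]
    mean, std = aggregate(pass_at_1_by_seed)
    print(f"pass@1 = {mean:.1f} ± {std:.1f}")  # pass@1 = 41.3 ± 0.6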

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claim is an empirical result: on-policy GRPO training on sandbox-derived ranking labels from the model's own samples improves both judgment (NDCG) and generation (pass@1, Best-of-4) on held-out LiveCodeBench problems. This is isolated by an SFT control on identical data that improves only judgment, confirming the transfer effect is not forced by the discriminative objective or by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided derivation chain; the performance gains are externally measured rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that sandbox execution yields reliable binary correctness labels and that GRPO can propagate ranking information into generation policy improvement; no free parameters or new entities are introduced beyond standard RL components.

axioms (1)
  • domain assumption Sandbox execution provides accurate pass/fail labels for generated code candidates
    Used to label successes and failures in mixed groups before GRPO training.
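As a concrete reading of this axiom, a minimal sketch of pass/fail labeling by running a candidate against stdin/stdout test cases in a subprocess. A real sandbox would add isolation and resource limits; the paper's actual harness is not specified here, and a python3 interpreter on the path is assumed.

    import subprocess

    def passes_tests(code, tests, timeout=2.0):
        """Return True only if the candidate program produces the expected output
        on every test case without error or timeout."""
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    ["python3", "-c", code],
                    input=stdin_text, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    candidate = "a, b = map(int, input().split()); print(a + b)"
    print(passes_tests(candidate, [("2 3", "5"), ("10 -4", "6")]))  # True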

pith-pipeline@v0.9.0 · 5623 in / 1375 out tokens · 33518 ms · 2026-05-14T20:42:56.216161+00:00 · methodology

discussion (0)

