Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
Pith reviewed 2026-05-14 20:42 UTC · model grok-4.3
The pith
A model trained solely to rank its own code attempts generates better programs without direct correctness rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales, DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance.
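To make the test-time scaling setup concrete, here is a minimal sketch of Best-of-4 selection in the self-judging regime the claim describes. The `model.sample` and `model.rank` interfaces are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of Best-of-4 test-time scaling with self-judgment:
# sample several candidate programs, then let the same model rank them
# and return the one it judges best. Interfaces are illustrative only.
def best_of_n(model, problem: str, n: int = 4) -> str:
    candidates = [model.sample(problem) for _ in range(n)]
    # The model acts as its own judge over its own candidates;
    # ranking[0] is the index of the candidate it considers most likely correct.
    ranking = model.rank(problem, candidates)
    return candidates[ranking[0]]
```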
What carries the argument
DuST (Dual Self-Training), a framework that converts comparative correctness labels from sandbox execution of multiple self-generated candidates into on-policy RL training for ranking by correctness.
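A minimal sketch of that data-construction loop, assuming hypothetical `model.sample` and `run_in_sandbox` helpers (a sandbox sketch appears later under the axiom ledger); the paper's actual implementation may differ.

```python
from dataclasses import dataclass

@dataclass
class RankingGroup:
    problem: str
    candidates: list[str]  # self-generated candidate programs
    passed: list[bool]     # sandbox execution outcome per candidate

def build_ranking_groups(model, problems, tests, n_candidates: int = 4):
    """Collect on-policy ranking data: only groups that mix successes and
    failures carry a usable comparative signal, so all-pass and all-fail
    groups are discarded."""
    groups = []
    for problem in problems:
        candidates = [model.sample(problem) for _ in range(n_candidates)]
        passed = [run_in_sandbox(code, tests[problem]) for code in candidates]
        if any(passed) and not all(passed):
            groups.append(RankingGroup(problem, candidates, passed))
    return groups
```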
If this is right
- Single-sample pass@1 accuracy rises even though the training objective never rewards correct programs directly.
- Best-of-4 test-time scaling performance improves consistently across model families and sizes from 4B to 30B.
- Supervised fine-tuning on the same ranking data improves judgment quality but leaves generation unchanged, confirming that on-policy RL is required for the transfer.
- The trained model's single rollout matches the base model's Best-of-4 accuracy on LiveCodeBench.
Where Pith is reading between the lines
- Discriminative ranking on self-generated data may serve as a general bootstrap for generative improvement in other domains that already use test-time sampling.
- The approach could lessen dependence on external verifiers by strengthening the model's internal ability to both judge and generate.
- Applying the same dual-space loop to multi-step reasoning tasks might compound gains because richer candidate sets supply denser comparative signals.
Load-bearing premise
The comparative ranking information obtained from sandbox execution of multiple candidates provides a training signal that transfers via on-policy RL back into improved primal generation rather than only improving discrimination.
What would settle it
If DuST training produces no gain in single-sample pass@1 accuracy on LiveCodeBench or if the trained model's single rollout no longer reaches the base model's Best-of-4 performance level.
read the original abstract
Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
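The abstract specifies only that the objective is discriminative ranking trained with GRPO. As a hedged illustration of what such a reward could look like, the sketch below scores a model-produced ordering of candidates with NDCG against the sandbox labels; the exact reward used in the paper may differ.

```python
import math

def ndcg(predicted_order: list[int], passed: list[bool]) -> float:
    """NDCG of a predicted candidate ordering, with relevance 1 for
    candidates that passed sandbox execution and 0 otherwise."""
    relevance = [1.0 if p else 0.0 for p in passed]
    dcg = sum(relevance[idx] / math.log2(rank + 2)
              for rank, idx in enumerate(predicted_order))
    ideal = sum(rel / math.log2(rank + 2)
                for rank, rel in enumerate(sorted(relevance, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

def ranking_reward(predicted_order: list[int], passed: list[bool]) -> float:
    # Purely discriminative: the reward depends only on how well the model
    # orders its own candidates, never on whether it generated a correct one.
    return ndcg(predicted_order, passed)
```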
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DuST (Dual Self-Training), a self-training framework that generates multiple candidate programs from the model, labels them via sandbox execution to obtain comparative ranking signals, and applies GRPO to train the model discriminatively on these rankings. It reports that this improves both judgment (NDCG) and generation (pass@1 and Best-of-4) on LiveCodeBench across five models, with the trained single rollout matching the base model's Best-of-4; an SFT control on identical data isolates on-policy RL as the transfer mechanism from dual judgment to primal generation gains.
Significance. If the results hold, the work is significant for demonstrating that relative correctness information from test-time scaling can be recycled via on-policy RL to improve both discrimination and generation without direct correctness rewards, with the SFT ablation providing a clean isolation of the RL component. This offers a scalable self-improvement path for code and reasoning models that leverages existing inference-time compute.
major comments (1)
- [Empirical results] Empirical results section: the reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates, with no error bars, standard deviations, or multi-seed statistics. This is load-bearing for the central claim of consistent improvements across model families and scales, given the otherwise moderate empirical support.
minor comments (2)
- [Experimental setup] The manuscript would benefit from explicit discussion of potential confounds such as data contamination on LiveCodeBench or sandbox execution reliability, even if briefly addressed in the experimental setup.
- Clarify the exact composition of the five models (specific names and parameter counts beyond the 4B-30B range) and the precise LiveCodeBench version used for all reported numbers.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the empirical presentation concern below.
read point-by-point responses
- Referee: [Empirical results] Empirical results section: the reported gains (e.g., +6.2 NDCG, +3.1 pass@1, +4.1 Best-of-4 for Qwen3-30B-Thinking) are presented as single point estimates, with no error bars, standard deviations, or multi-seed statistics. This is load-bearing for the central claim of consistent improvements across model families and scales, given the otherwise moderate empirical support.
  Authors: We agree that single-point estimates limit the strength of claims about consistency. In the revision we will add multi-seed statistics: we have already begun rerunning the primary LiveCodeBench evaluations (Qwen3-30B-Thinking and the other four models) with three independent random seeds each, reporting means and standard deviations for NDCG, pass@1, and Best-of-4. These will appear in updated tables and the main results figure, with a short methods paragraph describing the seed protocol. The additional runs are computationally modest and do not alter the experimental design. Revision: yes.
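For concreteness, a trivial sketch of the proposed reporting: rerun each evaluation under independent seeds and summarize as mean ± standard deviation. The numbers in the usage comment are placeholders, not results from the paper.

```python
import statistics

def summarize_seeds(metric_by_seed: dict[int, float]) -> str:
    """Aggregate one metric (e.g., pass@1) measured under several seeds."""
    values = list(metric_by_seed.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.1f} ± {std:.1f} (n={len(values)} seeds)"

# Hypothetical usage with made-up pass@1 values across three seeds:
# summarize_seeds({0: 63.2, 1: 62.8, 2: 63.5}) -> "63.2 ± 0.4 (n=3 seeds)"
```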
Circularity Check
No significant circularity
full rationale
The paper's core claim is an empirical result: on-policy GRPO training on sandbox-derived ranking labels from the model's own samples improves both judgment (NDCG) and generation (pass@1, Best-of-4) on held-out LiveCodeBench problems. This is isolated by an SFT control on identical data that improves only judgment, confirming the transfer effect is not forced by the discriminative objective or by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided derivation chain; the performance gains are externally measured rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sandbox execution provides accurate pass/fail labels for generated code candidates.
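As a hedged illustration of what this assumption amounts to in practice, here is a minimal subprocess-based pass/fail check; a production sandbox would add memory limits, filesystem isolation, and network restrictions.

```python
import subprocess

def run_in_sandbox(code: str, test_cases, timeout_s: float = 5.0) -> bool:
    """Return True only if the candidate program produces the expected stdout
    for every (stdin, expected_stdout) test case within the time limit."""
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected_stdout.strip():
            return False
    return True
```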