On the Generalization Gap in Self-Evolving Language Model Reasoning

Andrew Tomkins; Cyrus Rashtchian; Da-Cheng Juan; Kan Yuan; Stefanie Anna Baby; Susanna Maria Baby; Tu Vu; Zhenting Qi

arxiv: 2606.01075 · v2 · pith:DZCIQQEJnew · submitted 2026-05-31 · 💻 cs.CL


Pith reviewed 2026-06-28 17:16 UTC · model grok-4.3


keywords: self-evolution · language models · reasoning · generalization gap · closed-loop training · oracle supervision · Knights and Knaves


The pith

Closed-loop self-evolution improves language model reasoning but plateaus short of oracle-supervised performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can improve their own reasoning using only internally generated signals, in a strict closed-loop setup with no external labels. Four strategies are compared in one framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. On controlled logical reasoning tasks, self-evolution raises accuracy above the base model, but the gains plateau as training compute increases and still trail training on oracle (ground-truth) labels. Multi-turn critic-revision with larger models narrows this gap the most. The same pattern of modest gains appears when the methods are applied to real-world reasoning benchmarks.
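To make the closed-loop idea concrete, here is a minimal sketch of how multi-turn critic-revision could produce training data without external labels. It is an illustration under stated assumptions, not the paper's implementation: the function names (`critic_revision`, `build_training_set`), the three callables, the acceptance rule, and the `max_turns` parameter are all hypothetical, and the toy stand-ins replace real model calls so the code runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in a closed loop, all three roles would be
# played by the same base model under different prompts.
Generate = Callable[[str], str]
Critique = Callable[[str, str], Tuple[bool, str]]
Revise = Callable[[str, str, str], str]


def critic_revision(prompt: str, generate: Generate, critique: Critique,
                    revise: Revise, max_turns: int = 3) -> Tuple[str, bool]:
    """Run generate -> critique -> revise for at most max_turns revisions."""
    answer = generate(prompt)
    for _ in range(max_turns):
        accepted, feedback = critique(prompt, answer)
        if accepted:
            return answer, True
        answer = revise(prompt, answer, feedback)
    # Check the last revision once more before giving up.
    accepted, _ = critique(prompt, answer)
    return answer, accepted


def build_training_set(prompts: List[str], generate: Generate,
                       critique: Critique, revise: Revise,
                       max_turns: int = 3) -> List[Tuple[str, str]]:
    """Keep only (prompt, answer) pairs that the model's own critic accepts."""
    data = []
    for prompt in prompts:
        answer, accepted = critic_revision(prompt, generate, critique,
                                           revise, max_turns)
        if accepted:
            data.append((prompt, answer))
    return data


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_generate(prompt: str) -> str:
    return "draft"


def toy_critique(prompt: str, answer: str) -> Tuple[bool, str]:
    accepted = answer.endswith("!!")
    return accepted, ("" if accepted else "needs more emphasis")


def toy_revise(prompt: str, answer: str, feedback: str) -> str:
    return answer + "!"


dataset = build_training_set(["p1", "p2"], toy_generate, toy_critique, toy_revise)
```

The point of the sketch is that the only filter on the data is the model's own critic. If that critic accepts wrong answers, the errors flow straight into the training set, which is one plausible reading of why such a loop could plateau below oracle supervision.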

Core claim

Under a minimal closed-loop self-evolution setup that uses only an unlabeled prompt set and the base model itself, internally generated supervision produces consistent gains over the starting model, yet these gains plateau with additional training compute and leave a non-trivial performance gap relative to oracle-supervised training; multi-turn critic-revision with large models such as Gemma 12B comes closest to closing that gap.

What carries the argument

The unified offline self-evolution framework that evaluates four representative strategies on Knights and Knaves logical reasoning tasks with controlled difficulty.
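Knights and Knaves puzzles have deterministic solutions because every puzzle can be checked by enumeration: a knight's statement must be true and a knave's must be false. The sketch below is a generic brute-force solver written for illustration. The function name `solve_kk`, the puzzle encoding, and the example puzzle are assumptions and do not come from the paper.

```python
from itertools import product


def solve_kk(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    `statements` maps each speaker to a function that takes an assignment
    (name -> True if knight, False if knave) and returns the truth value
    of what that speaker said.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's statement must be true; a knave's must be false.
        if all(world[speaker] == claim(world)
               for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions


# Toy puzzle (illustrative only):
#   A says "B is a knave."  B says "A and I are the same type."
puzzle = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] == w["B"],
}
solutions = solve_kk(["A", "B"], puzzle)
# The only consistent world is A = knight, B = knave.
```

Because the search space is 2^n for n characters, the number of characters is one natural difficulty knob, although this page does not say how the paper defines its difficulty levels.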

If this is right

Self-evolution raises accuracy over the base model without any external labels.
Further increases in training compute after the plateau produce no additional benefit.
Multi-turn critic-revision with larger models narrows the gap to oracle performance more than the other three strategies.
Gains remain modest when the same methods are run on standard real-world reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed plateau suggests that purely internal feedback loops may require additional mechanisms such as external verification to keep improving.
Model scale appears more effective than additional training iterations at reducing the supervision gap.
The modest real-benchmark results imply that the gap could widen on noisier or open-ended tasks.

Load-bearing premise

That the easy-to-hard generalization behavior observed on Knights and Knaves tasks also holds for real-world reasoning problems.

What would settle it

A replication experiment on a different reasoning domain in which one of the self-evolution strategies reaches or exceeds oracle-supervised accuracy would falsify the reported gap.
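One way to make "reaches or exceeds oracle-supervised accuracy" checkable is to express self-evolution gains as a fraction of the base-to-oracle headroom. The metric below is a hypothetical convenience, not one defined in the paper, and the numbers are placeholders rather than reported results.

```python
def gap_closed(base_acc: float, self_evolved_acc: float, oracle_acc: float) -> float:
    """Fraction of the base-to-oracle gap recovered by self-evolution.

    0.0 means no improvement over the base model; 1.0 or more means the
    self-evolved model matches or exceeds oracle-supervised training.
    """
    headroom = oracle_acc - base_acc
    if headroom <= 0:
        raise ValueError("oracle accuracy must exceed base accuracy")
    return (self_evolved_acc - base_acc) / headroom


# Placeholder accuracies for illustration only, not results from the paper.
fraction = gap_closed(base_acc=0.40, self_evolved_acc=0.55, oracle_acc=0.70)
```

Under this framing, a replication would contradict the reported gap if `gap_closed` reached 1.0 or higher for any self-evolution strategy on the new domain.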

Original abstract

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but plateaus after excessive training compute is invested, and eventually still leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are also modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Self-evolution improves but plateaus below oracle on KK tasks with a clean comparison; real-world results stay too thin to support the general claim.


The core result is that closed-loop self-evolution on Knights and Knaves raises performance over the base model yet plateaus short of oracle supervision, and multi-turn critic-revision with larger models closes most of that gap. The paper sets up a single offline framework that runs four strategies head-to-head on the same prompt set and base models, which keeps the comparison controlled and makes the plateau observation easy to inspect.

The KK testbed is a reasonable choice for this question because correctness is deterministic and difficulty can be graded, so the easy-to-hard generalization test is straightforward. The finding that Gemma 12B nearly matches oracle under multi-turn revision is the most concrete takeaway.

The weaker part is the real-world section. The abstract reports only modest gains there and does not give the corresponding oracle gap or show the same plateau pattern, so the claim that internal supervision remains insufficient in general rests mainly on the KK evidence. If the verifiable structure and graded difficulty of KK do not transfer, the gap size could be task-dependent rather than diagnostic of the minimal closed-loop setup.

Implementation details, data splits, and statistical tests are not visible in the abstract, but the overall design looks reproducible enough to check. This is the kind of empirical comparison that people building self-training loops will want to see, even if they end up disagreeing with how far the KK results generalize. It deserves a serious referee.

Referee Report

2 major / 0 minor

Summary. The manuscript examines closed-loop self-evolution of LLMs under a minimal setup with only unlabeled prompts and the base model. It evaluates four strategies (single-round verification, multi-turn critic-revision, iterative training, curriculum learning) primarily on Knights and Knaves logical reasoning tasks, reporting consistent gains over the base model that plateau with added compute and leave a non-trivial gap to oracle-supervised training; multi-turn revision with larger models (e.g., Gemma 12B) nearly closes the gap on KK. Modest gains are also noted on real-world reasoning benchmarks.

Significance. If the plateau-and-gap pattern holds beyond the controlled testbed, the work usefully bounds the capabilities of minimal self-evolution and indicates that internally generated supervision alone is insufficient to match oracle training. The choice of KK as a deterministic, graded-difficulty testbed enables clean analysis of easy-to-hard generalization and is a methodological strength for isolating the effect of the self-evolution loop.

major comments (2)

[Abstract / real-world experiments] Abstract and real-world evaluation: the claim that internally generated supervision 'remains insufficient under this minimal formulation' rests primarily on the KK results showing a non-trivial oracle gap after plateau; the real-world section reports only 'modest' gains without quantifying the corresponding oracle gap size or confirming the same plateau-and-gap pattern, weakening the generalization of the insufficiency result beyond the KK testbed.
[Experiments] Experiments section: the soundness of the plateauing and gap claims cannot be assessed because the manuscript provides no details on implementation, data splits, hyperparameter choices, training compute measurement, or statistical significance testing, which are load-bearing for interpreting the reported performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the value of the Knights and Knaves testbed. We address the two major comments below and describe the targeted revisions.


Referee: [Abstract / real-world experiments] Abstract and real-world evaluation: the claim that internally generated supervision 'remains insufficient under this minimal formulation' rests primarily on the KK results showing a non-trivial oracle gap after plateau; the real-world section reports only 'modest' gains without quantifying the corresponding oracle gap size or confirming the same plateau-and-gap pattern, weakening the generalization of the insufficiency result beyond the KK testbed.

Authors: We agree that the primary demonstration of the plateau-and-gap pattern, and thus the insufficiency of internally generated supervision, is provided by the controlled KK experiments. The real-world results are presented as supplementary evidence of modest gains rather than a complete replication of the KK analysis, because real-world benchmarks lack deterministic oracles that would allow direct gap measurement. We will revise the abstract to state the insufficiency conclusion more precisely as being supported by the KK testbed, with real-world tasks providing additional but secondary evidence of limited improvement. This qualification will be added without overstating the real-world findings.

revision: partial

Referee: [Experiments] Experiments section: the soundness of the plateauing and gap claims cannot be assessed because the manuscript provides no details on implementation, data splits, hyperparameter choices, training compute measurement, or statistical significance testing, which are load-bearing for interpreting the reported performance differences.

Authors: We acknowledge that the main text currently provides insufficient detail for readers to fully assess the plateauing and gap claims. Although an appendix contains some implementation information, we will expand the Experiments section to explicitly describe the KK data splits, hyperparameter choices for both training and inference, the precise definition and measurement of training compute, and statistical significance (including standard deviations across multiple random seeds). These additions will directly support evaluation of the reported differences.

revision: yes
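The promised reporting of standard deviations across seeds can be sketched in a few lines. This is a rough illustration, not the authors' analysis: the helper names, the threshold `k`, and the accuracy values are hypothetical, and the comparison rule is a crude heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize(accuracies_by_seed):
    """Mean and sample standard deviation over per-seed accuracies."""
    return mean(accuracies_by_seed), stdev(accuracies_by_seed)


def gap_exceeds_noise(self_evolved, oracle, k=2.0):
    """Crude check: is the oracle gap larger than k times the seed noise?"""
    se_mean, se_std = summarize(self_evolved)
    or_mean, or_std = summarize(oracle)
    return (or_mean - se_mean) > k * max(se_std, or_std)


# Placeholder per-seed accuracies for illustration only, not paper results.
clear_gap = gap_exceeds_noise([0.60, 0.62, 0.61], [0.70, 0.71, 0.69])
noisy_gap = gap_exceeds_noise([0.60, 0.70, 0.65], [0.66, 0.68, 0.64])
```

With only a handful of seeds, a check like this mainly guards against reading a plateau or gap into run-to-run variance. A proper test, for example a paired comparison across seeds, would be needed to support the paper's quantitative claims.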

Circularity Check

0 steps flagged

Empirical study with no derivation chain or self-referential reductions


The paper presents an empirical comparison of self-evolution strategies against oracle-supervised training on Knights and Knaves tasks plus real-world benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains are used to derive the central claims; results are measured directly from experiments. The generalization-gap observation is an experimental outcome, not a quantity forced by internal definitions or prior self-citations. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described. The central claims rest on the unstated assumption that the chosen testbed and four strategies are representative.




[1] [1]

R3: Robust rubric-agnostic reward models

David Anugraha, Zilu Tang, Lester James V Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, and Genta Indra Winata. R3: Robust rubric-agnostic reward models. arXiv preprint arXiv:2505.13388,

work page arXiv

[2] [2]

Annotating the annotators: Analysis, insightsandmodellingfromanannotationcampaignonpersuasiontechniques detection

Davide Bassi, Dimitar Iliyanov Dimitrov, Bernardo D’Auria, Firoj Alam, Maram Hasanain, Christian Moro, Luisa Orrù, Gian Piero Turchi, Preslav Nakov, and Giovanni Da San Martino. Annotating the annotators: Analysis, insightsandmodellingfromanannotationcampaignonpersuasiontechniques detection. InACL 2025, pages 17918–17929, July

2025

[3] [3]

Avrim Blum, Daniel Hsu, Cyrus Rashtchian, and Donya Saless

URLhttps://aclanthology.org/ 2025.findings-acl.922/. Avrim Blum, Daniel Hsu, Cyrus Rashtchian, and Donya Saless. Prior makes it possible: From sublinear graph algorithms to llm test-time methods. In29th Conference on Artificial Intelligence and Statistics (AISTATS),

2025

[4] [4]

Training Verifiers to Solve Math Word Problems

14 On the Generalization Gap in Self-Evolving Language Model Reasoning Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Apratim Dey and David Donoho. Universality of the𝜋2/6pathway in avoiding model collapse.arXiv preprint arXiv:2410.22812,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

On designing effective rl reward at training time for llm reasoning.arXiv preprint arXiv:2410.15115,

Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning. arXiv preprint arXiv:2410.15115,

[8] [9]

Gemma 3 Technical Report

URL https://arxiv.org/abs/2503.19786. Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. Human and llm biases in hate speech annotations: A socio-demographic analysis of annotators and targets. In Proceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 653–670,

[9] [10]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178,

[10] [11]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746,

[11] [12]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

[12] [13]

Self-improvement in language models: The sharpening mechanism

Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism. arXiv preprint arXiv:2412.01951,

[13] [14]

Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment

Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Akshay Krishnamurthy, and Dylan J Foster. Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment. arXiv preprint arXiv:2503.21878, 2025a. Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self...

[14] [15]

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. Compute as teacher: Turning inference compute into reference-free supervision. arXiv preprint arXiv:2509.14234,

[15] [16]

Llms could autonomously learn without external supervision

Ke Ji, Junying Chen, Anningzhe Gao, Wenya Xie, Xiang Wan, and Benyou Wang. Llms could autonomously learn without external supervision. arXiv preprint arXiv:2406.00606,

[16] [17]

Demystifying synthetic data in llm pre-training: A systematic study of scaling laws, benefits, and pitfalls

Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, and Carole-Jean Wu. Demystifying synthetic data in llm pre-training: A systematic study of scaling laws, benefits, and pitfalls. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Pro...

2025

[17] [18]

Search-based correction of reasoning chains for language models

Minsu Kim, Jean-Pierre Falet, Oliver E Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, and Yoshua Bengio. Search-based correction of reasoning chains for language models. arXiv preprint arXiv:2505.11824,

[18] [19]

Language self-play for data-free training

Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, and Vijai Mohan. Language self-play for data-free training. arXiv preprint arXiv:2509.07414,

[19] [20]

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917,

[20] [21]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,

[21] [22]

Revise: Learning to refine at test-time via intrinsic self-verification

Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, and Jihoon Tack. Revise: Learning to refine at test-time via intrinsic self-verification. arXiv preprint arXiv:2502.14565,

[22] [23]

Multi-agent verification: Scaling test-time compute with multiple verifiers

Shalev Lifshitz, Sheila A McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379,

[23] [24]

Self-improving vlm judges without human annotations

Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh, Luke Zettlemoyer, Tim Althoff, and Marjan Ghazvininejad. Self-improving vlm judges without human annotations. arXiv preprint arXiv:2512.05145,

[24] [25]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864,

[25] [26]

Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610,

[26] [27]

The “problem” of human label variation: On ground truth in data, modeling and evaluation

Barbara Plank. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682,

2022

[27] [28]

Shrinking the generation-verification gap with weak verifiers

Jon Saad-Falcon, E Kelly Buchanan, Mayee F Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, et al. Shrinking the generation-verification gap with weak verifiers. arXiv preprint arXiv:2506.18203,

[28] [29]

Scaling test-time compute without verification or rl is suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025.

[29] [30]

Can large reasoning models self-train?

Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, and Andrea Zanette. Can large reasoning models self-train? arXiv preprint arXiv:2505.21444,

[30] [31]

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, and Ranjay Krishna. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947,

[31] [32]

Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents. arXiv preprint arXiv:2511.07685,

[32] [33]

Satori: Reinforcement learning with chain-of-action-thought enhances llm reasoning via autoregressive search

Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, and Chuang Gan. Satori: Reinforcement learning with chain-of-action-thought enhances llm reasoning via autoregressive search. arXiv preprint arXiv:2502.02508,

[33] [34]

Mind the gap: Examining the self-improvement capabilities of large language models

Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large language models. arXiv preprint arXiv:2412.02674,

[34] [35]

Theoretical modeling of llm self-improvement training dynamics through solver-verifier gap. arXiv preprint arXiv:2507.00075, 2025

Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd Van Steenkiste, Ranjay Krishna, and Cyrus Rashtchian. DreamSync: Aligning text-to-image generation with image understanding feedback. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas...

[35] [36]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Tianduo Wang, Shichen Li, and Wei Lu. Self-training with direct preference optimization improves chain-of-thought reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11917–11928, Bangkok, Thailand, August 2024a. Associati...

doi:10.18653/v1/2024.acl-long.643

[36] [37]

Cream: Consistency regularized self-rewarding language models

Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, and Huaxiu Yao. Cream: Consistency regularized self-rewarding language models. arXiv preprint arXiv:2410.12735, 2024b. Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Fen...

[37] [38]

The invisible leash: Why rlvr may or may not escape its origin, 2026

Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may not escape its origin. arXiv preprint arXiv:2507.14843,

[38] [39]

On memorization of large language models in logical reasoning

Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. arXiv preprint arXiv:2410.23123,

[39] [40]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[40] [41]

Cot-self-instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, and Jing Xu. Cot-self-instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks. arXiv preprint arXiv:2507.23751,

[41] [42]

Wisdom of the crowd: Reinforcement learning from coevolutionary collective feedback

Wenzhen Yuan, Shengji Tang, Weihao Lin, Jiacheng Ruan, Ganqu Cui, Bo Zhang, Tao Chen, Ting Liu, Yuzhuo Fu, and Peng Ye. Wisdom of the crowd: Reinforcement learning from coevolutionary collective feedback. arXiv preprint arXiv:2508.12338,

[42] [43]

Learning to Discover at Test Time

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175,

[43] [44]

Aries: Evolving llms’ self-refinement capability via iterative preference optimization

Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Guoqing Liu, Zexu Sun, Dong Li, Ning Yang, Jianye Hao, Haifeng Zhang, and Jun Wang. Aries: Evolving llms’ self-refinement capability via iterative preference optimization. arXiv preprint arXiv:2502.05605,

[44] [45]

On the limits of self-improving in llms and why agi, asi and the singularity are not near without symbolic model synthesis

Hector Zenil. On the limits of self-improving in llms and why agi, asi and the singularity are not near without symbolic model synthesis. arXiv preprint arXiv:2601.05280,

[45] [46]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025a. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, et al. A survey on test-ti...

[46] [47]

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, and Ermo Hua. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084,