VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

Masashi Sugiyama; Xin-Qiang Cai


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

VI-CuRL stabilizes verifier-independent RL for LLM reasoning by using intrinsic model confidence to reduce gradient variance while preserving asymptotic unbiasedness.

2026-05-25 07:24 UTC pith:3NJ3S6G5

Load-bearing objection: VI-CuRL proposes using model confidence for curriculum selection to stabilize verifier-free RL in LLMs and claims an asymptotic unbiasedness proof, but the adaptive sampling setup looks like it could undermine that guarantee.

arxiv 2602.12579 v2 pith:3NJ3S6G5 submitted 2026-02-13 cs.LG cs.AI


keywords reinforcement learning · large language models · reasoning · curriculum learning · variance reduction · verifier-independent · policy optimization · asymptotic unbiasedness

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops VI-CuRL to remove the need for external verifiers in reinforcement learning applied to large language model reasoning. Standard policy optimization methods encounter destructive variance that collapses training when no verifier is present. VI-CuRL builds a curriculum by prioritizing samples where the model already shows high confidence, which targets both action-level and problem-level variance. A theoretical analysis establishes that the resulting estimator stays asymptotically unbiased. Experiments show the method delivers stable training and higher performance than both verifier-dependent and verifier-independent baselines on math and general reasoning tasks.

Core claim

VI-CuRL introduces a verifier-independent curriculum that ranks training samples by the model's own confidence scores and applies this ordering inside Group Relative Policy Optimization. The curriculum reduces variance in the policy gradient estimates without introducing external reward signals. Rigorous analysis proves the modified estimator converges to the same unbiased limit as the original method. On both math and general reasoning benchmarks the approach yields more stable training curves and higher final accuracy whether or not an external verifier is available.

What carries the argument

The confidence-guided curriculum inside VI-CuRL, which orders samples by the model's intrinsic confidence to reduce action and problem variance in verifier-free policy optimization.
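A minimal sketch of how such a confidence-guided selection step could look, assuming a hard top-fraction rule: prompts are ranked by a scalar confidence score, the top `beta_t` fraction is kept through a 0/1 mask `w`, and the batch gradient is rescaled by `1 / beta_t`, mirroring the mask $w_t(x)$ and retention rate $\beta_t$ notation that appears in the paper's variance analysis. The function names, the top-fraction rule, and the toy arrays are illustrative assumptions, not the authors' implementation; the paper's actual confidence measure and schedule are not given in this review.

```python
import numpy as np

def curriculum_mask(confidence, beta_t):
    """Return a 0/1 mask keeping the top beta_t fraction of prompts by confidence.

    Hypothetical selection rule, for illustration only.
    """
    confidence = np.asarray(confidence, dtype=float)
    n = len(confidence)
    k = max(1, int(np.ceil(beta_t * n)))           # number of prompts retained
    keep = np.argsort(-confidence, kind="stable")[:k]
    mask = np.zeros(n)
    mask[keep] = 1.0
    return mask

def masked_gradient(per_prompt_grads, confidence, beta_t):
    """Batch average of (w / beta_t) * g over prompts."""
    grads = np.asarray(per_prompt_grads, dtype=float)
    w = curriculum_mask(confidence, beta_t)
    return (w[:, None] * grads).sum(axis=0) / (beta_t * len(w))

# Toy batch: 4 prompts with 2-dimensional per-prompt gradients (made-up numbers).
grads = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
conf = [0.9, 0.1, 0.8, 0.2]
early = masked_gradient(grads, conf, beta_t=0.5)   # only the 2 most confident prompts
late = masked_gradient(grads, conf, beta_t=1.0)    # full batch, plain mean
print(early, late)
```

With `beta_t = 1.0` the mask keeps every prompt and the estimate reduces to the ordinary batch mean, which is the regime in which the selected set coincides with the full dataset.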

Load-bearing premise

The model's intrinsic confidence provides a reliable signal for sample ordering that reduces variance without adding bias that would invalidate the asymptotic unbiasedness guarantee.

What would settle it

Training runs on standard math-reasoning benchmarks in which the confidence curriculum produces either biased final policy estimates or recurrent collapse would falsify the central claims.


If this is right

Training of LLM reasoning policies can proceed without access to external verifiers.
The estimator remains asymptotically unbiased while variance is actively lowered.
Performance gains appear on both math-specific and general reasoning tasks.
The same framework works in hybrid settings that sometimes include and sometimes omit verifiers.
Destructive gradient variance that previously caused collapse is measurably suppressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same confidence signal might be reused to adapt learning rates or batch sizes dynamically during training.
If confidence correlates with latent capability, the curriculum could serve as a diagnostic for when a model has internalized a reasoning pattern.
Extending the approach to multi-turn dialogue or agentic tasks would test whether problem-level variance reduction generalizes beyond single-step reasoning.
Combining the curriculum with other internal signals such as entropy or uncertainty estimates could further tighten the bias-variance trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

VI-CuRL proposes using model confidence for curriculum selection to stabilize verifier-free RL in LLMs and claims an asymptotic unbiasedness proof, but the adaptive sampling setup looks like it could undermine that guarantee.


The main point is that this work targets the variance collapse in verifier-free RL methods like GRPO by building a curriculum that prioritizes high-confidence samples from the model itself. That removes the external verifier dependency, which is a practical bottleneck for scaling reasoning RL on LLMs. The abstract frames it as managing bias-variance through this prioritization, with a proof that the resulting estimator stays asymptotically unbiased and some empirical wins on math and general reasoning tasks both with and without verifiers. The curriculum idea itself is a direct response to the variance issue they identify, and if the experiments are clean it could be a useful practical tweak for post-training setups that want to avoid verifier overhead.

The soft spot sits in the theory. The stress-test concern is on target: confidence scores are generated by the current policy, so the sampling distribution shifts with training in a policy-dependent way. Standard unbiasedness arguments for policy gradient estimators usually require the sampling probabilities to be either fixed or to cancel cleanly with the reward term; an adaptive rule tied to the same signals used for training can leave a residual bias that does not obviously vanish at the right rate. The abstract asserts a rigorous proof but gives no derivation or handling of the time-varying probabilities, so it is impossible to tell whether the claim holds. The empirical side is also thin on details here.

This is aimed at people working on verifier-light RL for LLMs who need stable training signals. A reader already thinking about curriculum or variance reduction in RLVR might pick up the selection mechanism, but the paper needs the full proof and experimental protocol checked before it can be relied on. It is worth sending to referees to examine the derivation and the actual runs.

Referee Report

1 major / 1 minor

Summary. The paper introduces VI-CuRL, a verifier-independent curriculum RL framework for LLM reasoning that uses the model's intrinsic confidence scores to prioritize samples and reduce action/problem variance in settings like Group Relative Policy Optimization. It claims a rigorous proof that the resulting estimator is asymptotically unbiased and reports consistent empirical outperformance over verifier-dependent and verifier-independent baselines on math and general reasoning benchmarks, both with and without external verifiers.

Significance. If the asymptotic unbiasedness guarantee holds under the adaptive confidence-based sampling, the work would meaningfully advance scalable RLVR without external verifiers by addressing destructive gradient variance. The empirical stability claims across benchmarks add practical value. The attempt at a theoretical analysis is noted as a strength, though its validity under policy-dependent sampling remains to be verified.

major comments (1)

[Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.
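The concern can be made concrete with a toy calculation. The sketch below assumes a hard top-fraction mask and a scalar per-prompt quantity standing in for the gradient; both are illustrative assumptions rather than the paper's estimator. When confidence correlates with the quantity being averaged, the rescaled masked mean equals the mean over the kept subset, which generally differs from the full mean until the retention rate reaches 1.

```python
import numpy as np

def masked_mean_bias(values, confidence, beta):
    """Bias of the (w / beta)-weighted batch mean relative to the full mean.

    Hypothetical helper: keeps the top-beta fraction of items by confidence.
    """
    values = np.asarray(values, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n = len(values)
    k = max(1, int(np.ceil(beta * n)))
    keep = np.argsort(-confidence, kind="stable")[:k]
    masked_mean = values[keep].sum() / (beta * n)
    return masked_mean - values.mean()

values = [1.0, 2.0, 3.0, 4.0]        # stand-in for per-prompt gradients
confidence = [0.9, 0.8, 0.2, 0.1]    # confidence anti-correlated with value
for beta in (0.5, 0.75, 1.0):
    print(beta, masked_mean_bias(values, confidence, beta))
```

In this toy case the bias is -1.0, -0.5 and 0.0 for the three retention rates. A fixed-parameter example like this says nothing about how fast the bias decays along an actual training trajectory, which is the rate question the comment raises.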

minor comments (1)

[Abstract] Abstract: the statement that the method 'specifically target[s] the reduction of action and problem variance' would benefit from a brief parenthetical reference to the relevant variance decomposition or estimator form used in the analysis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the theoretical analysis. We address the single major comment below.


Referee: [Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.

Authors: We agree that the current proof presentation does not explicitly derive or bound the transient bias arising from the adaptive, policy-dependent sampling rule. The original argument relies on the fact that confidence scores converge as training progresses, which would make the sampling distribution asymptotically fixed, but this step is not formalized with a rate or coupling argument. In the revised manuscript we will add a new lemma that bounds the total variation distance between the time-varying sampling distribution and its limiting distribution, showing that the resulting bias term is o(1/T) and therefore does not affect asymptotic unbiasedness of the estimator. revision: yes
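The kind of bound described in the rebuttal can be sketched as follows. This is a generic inequality under stated assumptions, not the lemma the authors promise: $q_t$ denotes the time-varying sampling distribution over prompts, $q_\infty$ its assumed limit, $\bar g(x)$ the expected per-prompt gradient at fixed parameters, and $M$ an assumed uniform bound on $\|\bar g(x)\|$.

```latex
\left\| \mathbb{E}_{x \sim q_t}[\bar g(x)] - \mathbb{E}_{x \sim q_\infty}[\bar g(x)] \right\|
  = \left\| \int \bar g(x)\, \bigl(q_t - q_\infty\bigr)(\mathrm{d}x) \right\|
  \le M \, \| q_t - q_\infty \|_{1}
  = 2M \, \mathrm{TV}(q_t, q_\infty).
```

Such a bound controls only the drift of the sampling distribution at fixed parameters. Whether the limiting distribution yields the target gradient, and at what rate the total variation distance shrinks, would still need separate arguments.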

Circularity Check

0 steps flagged

No circularity exhibited; derivation claims independent proof


The abstract asserts a rigorous theoretical analysis proving asymptotic unbiasedness of the VI-CuRL estimator, with curriculum construction based on intrinsic confidence. However, the visible text contains no equations, proof steps, or self-citations that reduce any claimed prediction or uniqueness result to fitted inputs or prior self-work by construction. Per hard rules, circularity requires explicit quotes showing Eq. X equivalent to input by definition or fitted parameter renamed as prediction; none are present. The central claim therefore remains self-contained against external benchmarks in the provided material, warranting score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Ledger constructed from abstract only; full paper likely contains additional modeling choices and assumptions not visible here.

free parameters (1)

confidence prioritization parameters
Used to select high-confidence samples for the curriculum; specific values or fitting procedure not stated.

axioms (1)

Domain assumption: the model's intrinsic confidence is a valid proxy for constructing an effective training curriculum independent of external verifiers.
Central premise enabling verifier-independent operation and variance reduction.

pith-pipeline@v0.9.0 · 5706 in / 996 out tokens · 28499 ms · 2026-05-25T07:24:47.784135+00:00 · methodology


Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.

Figures

Figures reproduced from arXiv: 2602.12579 by Masashi Sugiyama, Xin-Qiang Cai.

**Figure 2.** The learning curves comparing VI-CuRL against the baseline No Curriculum across verifier-based (Oracle) and verifier-free (Majority Vote and Entropy) settings. Rows alternate between Pass@1 and Pass@8 for each setting.

**Figure 3.** Variance Ratio Analysis. Variance ratio (kept/full) for Action Variance ($\sigma^2_{g,t}$) and Problem Variance ($V_{\mathrm{prob},t}$) alongside the retention rate ($\beta_t$, right axis). When $\beta_t < 1$, both ratios are consistently below 1, confirming that curriculum selection reduces variance as predicted by Theorem 4.3. As $\beta_t \to 1$, the ratios approach 1 since the selected set converges to the full dataset.

**Figure 4.** Curriculum Difficulty Analysis on two 1.5B models. Rows represent algorithms (Oracle…

**Figure 5.** Absolute variance values for the kept and full subsets. The kept subset (blue) maintained consistently lower variance than the full dataset (red), with the gap most pronounced during early training when aggressive curriculum selection was applied. This directly validated the mechanism underlying Theorem 4.3: by selecting high-confidence prompts, VI-CuRL created a more homogeneous training distri…


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
cs.LG 2026-04 unverdicted novelty 6.0

PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pp. 41–48, 2009.
[2] Cai, X.-Q., Wang, W., Liu, F., Liu, T., Niu, G., and Sugiyama, M. Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers. arXiv preprint arXiv:2510.00915, 2025.
[3] Chen, H., Feng, Y., Liu, Z., Yao, W., Prabhakar, A., Heinecke, S., Ho, R., Mùi, P. T., Savarese, S., Xiong, C., and Wang, H. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024.
[4] Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H.-S., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
[5] Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
[6] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H... doi:10.1038/s41586-025-09422-z, 2025.
[7] He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Associatio...
[8] Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, 2024.
[9] HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024
[10] Jiang, G., Feng, W., Quan, G., Hao, C., Zhang, Y., Liu, G., and Wang, H. Vcrl: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025.
[11] Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. G. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[12] Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, D., Perez, E., Schiefer, N., Dodds, Z., Dassarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod...
[13] Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 23, 2010.
[14] Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Informat...
[15] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[16] Liu, W., Qi, S., Wang, X., Qian, C., Du, Y., and He, Y. Nover: Incentive training for language models via verifier-free reinforcement learning. arXiv preprint arXiv:2505.16022, 2025.
[17] Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Li, L. E., Popa, R. A., and Stoica, I. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog.
[18] math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23
[19] OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025
[20] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L. E., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. J. Training language models to follow instructions with human feedback. Advances in neural information p...
[21] Prabhudesai, M., Chen, L., Ippoliti, A., Fragkiadaki, K., Liu, H., and Pathak, D. Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660, 2025.
[22] Ren, J., Zhao, Y., Vu, T., Liu, P. J., and Lakshminarayanan, B. Self-evaluation improves selective generation in large language models. In Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, pp. 49–64, 2023.
[23] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.
[24] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[25] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[26] Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y., Wang, Z., Ebrahimi, S., and Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.
[27] Shi, T., Wu, Y., Song, L., Zhou, T., and Zhao, J. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025.
[28] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
[29] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
[30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Xia, F., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[31] Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025.
[32] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
[33] Yao, J., Hao, Y., Zhang, H., Dong, H., Xiong, W., Jiang, N., and Zhang, T. Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and rl. arXiv preprint arXiv:2505.02391, 2025.
[34] Ye, K., Zhou, H., Zhu, J., Quinzan, F., and Shi, C. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025.
[35] Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
[36] Zhang, Q., Wu, H., Zhang, C., Zhao, P., and Bian, Y. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.
[37] Zhang, Y., Zhang, Z., Guan, H., Cheng, Y., Duan, Y., Wang, C., Wang, Y., Zheng, S., and He, J. No free lunch: Rethinking internal feedback for llm reasoning. arXiv preprint arXiv:2506.17219, 2025.
[38] Zhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493, 2025.
[39] Zuo, Y., Zhang, K., Qu, S., Sheng, L., Zhu, X., Zhang, Y., Qi, B., Sun, Y., Cui, G., Ding, N., and Zhou, B. Ttrl: Test-time reinforcement learning. ArXiv, abs/2504.16084, 2025.

A. Related Work. LLM Reasoning and RLVR. Recent advances in LLM reasoning have been significantly propelled by RL. Chain-of-thought prompting [30] and self-consistency [29] esta...
[40] leverage entropy minimization as a fully unsupervised reward signal, while the Entropy Mechanism [4] stabilizes training by managing policy collapse. Test-Time Reinforcement Learning (TTRL) [39] further explores converting test-time compute (e.g., majority voting) into training signals. However, most of these methods fundamentally rely on access to ground...

[41] The Expected Action Variance Term: Conditioned on a specific prompt $x$, the mask $w_t(x)$ and scalar $\beta_t$ are constants. Thus: $\mathrm{Var}_{y_{1:G}}(\hat g_t \mid x) = \mathrm{Var}_{y_{1:G}}\!\left(\frac{w_t(x)}{\beta_t}\, g(x, y_{1:G}; \theta) \,\middle|\, x\right) = \left(\frac{w_t(x)}{\beta_t}\right)^{2} \mathrm{Var}_{y_{1:G}}(g(x, y_{1:G}; \theta) \mid x)$ (by quadratic scaling of scalar variance, applied to trace). Taking the expectation over $x$: $\mathbb{E}_x[\mathrm{Var}_{y_{1:G}}(\hat g_t \mid x)] = \mathbb{E}_x\!\left[\frac{w_t(x)^2}{\beta_t^2}\, \mathrm{Var}_{y_{1:G}}(g(x, y_{1:G}; \theta) \mid x)\right]$ ...

[42] The Problem Variance Term: First, consider the inner expectation $\mathbb{E}_{y_{1:G}}[\hat g_t \mid x]$. Let $\bar g(x) = \mathbb{E}_{y_{1:G} \mid \pi_{\theta_{\mathrm{old}}}}[g \mid x]$ be the expected gradient for prompt $x$. Then: $\mathbb{E}_{y_{1:G}}[\hat g_t \mid x] = \frac{w_t(x)}{\beta_t}\, \bar g(x)$. The variance of this quantity over $x$ is: $\mathrm{Var}_x\!\left(\frac{w_t(x)}{\beta_t}\, \bar g(x)\right) = \mathbb{E}_x\!\left[\left\| \frac{w_t(x)}{\beta_t}\, \bar g(x) \right\|^2\right] - \left\| \mathbb{E}_x\!\left[\frac{w_t(x)}{\beta_t}\, \bar g(x)\right] \right\|^2$. The second part is simply the squared norm of the true gradient of ...
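The two terms in these fragments are the pieces of the law of total variance, $\mathrm{Var}(\hat g_t) = \mathbb{E}_x[\mathrm{Var}(\hat g_t \mid x)] + \mathrm{Var}_x(\mathbb{E}[\hat g_t \mid x])$. A small numerical check, assuming a scalar gradient, uniformly weighted prompts, and a two-point toy distribution for $g$ given $x$ (all assumptions made purely for illustration), confirms that the action and problem terms add up to the total variance of $\hat g_t = (w_t(x)/\beta_t)\, g$.

```python
import numpy as np

def variance_terms(mean_g, std_g, mask, beta):
    """Return (total, action_term, problem_term) for g_hat = (w / beta) * g.

    Toy model (hypothetical): prompts are equally likely, and g | x takes the
    values mean_g[x] + std_g[x] and mean_g[x] - std_g[x] with probability 1/2.
    """
    mean_g, std_g, mask = (np.asarray(a, dtype=float) for a in (mean_g, std_g, mask))
    scale = mask / beta
    # Enumerate every equally likely (prompt, sign) outcome of g_hat.
    outcomes = np.concatenate([scale * (mean_g + std_g), scale * (mean_g - std_g)])
    total = outcomes.var()
    action = np.mean(scale**2 * std_g**2)   # E_x[ Var(g_hat | x) ]
    problem = np.var(scale * mean_g)        # Var_x( E[g_hat | x] )
    return total, action, problem

total, action, problem = variance_terms(
    mean_g=[1.0, 2.0, 3.0, 4.0],
    std_g=[0.5, 1.0, 1.5, 2.0],
    mask=[1, 1, 0, 0],   # keep the first two prompts
    beta=0.5,            # retention rate matching the mask
)
print(total, action, problem)
```

The check verifies the decomposition only. It makes no claim about whether either term is smaller under the curriculum than under full-batch sampling, which depends on how confidence relates to the per-prompt statistics.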

[1] [1]

Curriculum learning

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. InProceedings of the 26th International Conference on Machine Learning, pp. 41–48, 2009. 14

work page 2009

[2] [2]

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Cai, X.-Q., Wang, W., Liu, F., Liu, T., Niu, G., and Sugiyama, M. Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers.arXiv preprint arXiv:2510.00915, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Chen, H., Feng, Y., Liu, Z., Yao, W., Prabhakar, A., Heinecke, S., Ho, R., M` ui, P. T., Savarese, S., Xiong, C., and Wang, H. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding.arXiv preprint arXiv:2411.04282, 2024

work page Pith review arXiv 2024

[4] [4]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H.-S., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Neural networks and the bias/variance dilemma

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992

work page 1992

[6] [6]

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H...

work page doi:10.1038/s41586-025-09422-z 2025

[7] [7]

L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Ku, L., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Associatio...

work page 2024

[8] [8]

S., Yu, A

Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. Large language models cannot self-correct reasoning yet. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[9] [9]

Aime 2024 (dataset card)

HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingfac e.co/datasets/HuggingFaceH4/aime_2024

work page 2024

[10]

Jiang, G., Feng, W., Quan, G., Hao, C., Zhang, Y., Liu, G., and Wang, H. VCRL: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025.

[11]

Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. G. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

[12]

Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, D., Perez, E., Schiefer, N., Dodds, Z., Dassarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod... arXiv, 2022.

[13]

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 23, 2010.

[14]

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Informat... 2022.

[15]

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

[16]

Liu, W., Qi, S., Wang, X., Qian, C., Du, Y., and He, Y. NOVER: Incentive training for language models via verifier-free reinforcement learning. arXiv preprint arXiv:2505.16022, 2025.

[17]

Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Li, L. E., Popa, R. A., and Stoica, I. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog.

[18]

math-ai. AMC 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23

[19]

OpenCompass. AIME 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025

[20]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L. E., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. J. Training language models to follow instructions with human feedback. Advances in neural information p... 2022.

[21]

Prabhudesai, M., Chen, L., Ippoliti, A., Fragkiadaki, K., Liu, H., and Pathak, D. Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660, 2025.

[22]

Ren, J., Zhao, Y., Vu, T., Liu, P. J., and Lakshminarayanan, B. Self-evaluation improves selective generation in large language models. In Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, pp. 49–64, 2023.

[23]

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.

[24]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[25]

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

[26]

Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y., Wang, Z., Ebrahimi, S., and Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.

[27]

Shi, T., Wu, Y., Song, L., Zhou, T., and Zhao, J. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025.

[28]

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

[29]

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.

[30]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Xia, F., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[31]

Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025.

[32]

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[33]

Yao, J., Hao, Y., Zhang, H., Dong, H., Xiong, W., Jiang, N., and Zhang, T. Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and RL. arXiv preprint arXiv:2505.02391, 2025.

[34]

Ye, K., Zhou, H., Zhu, J., Quinzan, F., and Shi, C. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025.

[35]

Zelikman, E., Wu, Y., Mu, J., and Goodman, N. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.

[36]

Zhang, Q., Wu, H., Zhang, C., Zhao, P., and Bian, Y. Right question is already half the answer: Fully unsupervised LLM reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.

[37]

Zhang, Y., Zhang, Z., Guan, H., Cheng, Y., Duan, Y., Wang, C., Wang, Y., Zheng, S., and He, J. No free lunch: Rethinking internal feedback for LLM reasoning. arXiv preprint arXiv:2506.17219, 2025.

[38]

Zhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493, 2025.

[39]

Zuo, Y., Zhang, K., Qu, S., Sheng, L., Zhu, X., Zhang, Y., Qi, B., Sun, Y., Cui, G., Ding, N., and Zhou, B. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.

A. Related Work

LLM Reasoning and RLVR. Recent advances in LLM reasoning have been significantly propelled by RL. Chain-of-thought prompting [30] and self-consistency [29] esta...


kept”, blue) consistently exhibits lower absolute variance than the full dataset (“full

leverage entropy minimization as a fully unsupervised reward signal, while the Entropy Mechanism [4] stabilizes training by managing policy collapse. Test-Time Reinforcement Learning (TTRL) [39] further explores converting test-time compute (e.g., majority voting) into training signals. However, most of these methods fundamentally rely on access to ground...



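The Expected Action Variance Term: Conditioned on a specific prompt $x$, the mask $w_t(x)$ and the scalar $\beta_t$ are constants. Thus:

$$\mathrm{Var}_{y_{1:G}}(\hat{g}_t \mid x) = \mathrm{Var}_{y_{1:G}}\!\left(\frac{w_t(x)}{\beta_t}\, g(x, y_{1:G}; \theta) \,\middle|\, x\right) = \left(\frac{w_t(x)}{\beta_t}\right)^{2} \mathrm{Var}_{y_{1:G}}\big(g(x, y_{1:G}; \theta) \mid x\big)$$

(by quadratic scaling of scalar variance, applied to the trace). Taking the expectation over $x$:

$$\mathbb{E}_x\big[\mathrm{Var}_{y_{1:G}}(\hat{g}_t \mid x)\big] = \mathbb{E}_x\!\left[\frac{w_t(x)^2}{\beta_t^2}\, \mathrm{Var}_{y_{1:G}}\big(g(x, y_{1:G}; \theta) \mid x\big)\right] ....$$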

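The quadratic scaling used in the action variance term can be checked numerically. The snippet below is a minimal NumPy sketch under stated assumptions, not the authors' code: the array `g` stands in for samples of $g(x, y_{1:G}; \theta)$ at one fixed prompt, `w` and `beta` are hypothetical values of $w_t(x)$ and $\beta_t$, and `trace_variance` is an illustrative helper name.

```python
import numpy as np


def trace_variance(samples: np.ndarray) -> float:
    """Sum of per-coordinate variances, i.e. the trace of the covariance matrix."""
    return float(samples.var(axis=0).sum())


rng = np.random.default_rng(0)

# Hypothetical gradient samples for ONE fixed prompt x (10,000 draws, 4 dimensions).
g = rng.normal(size=(10_000, 4))

# Hypothetical mask value and retention scalar; both are constants given x.
w, beta = 1.0, 0.25

# Rescaled estimator for this prompt: (w / beta) * g.
g_hat = (w / beta) * g

lhs = trace_variance(g_hat)                  # Var(g_hat | x), trace form
rhs = (w / beta) ** 2 * trace_variance(g)    # (w / beta)^2 * Var(g | x)
print(lhs, rhs)
```

With `w = 1` and `beta = 0.25` the conditional variance is inflated by a factor of 16, which is why the retained fraction matters for the size of this term.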


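The Problem Variance Term: First, consider the inner expectation $\mathbb{E}_{y_{1:G}}[\hat{g}_t \mid x]$. Let $\bar{g}(x) = \mathbb{E}_{y_{1:G} \mid \pi_{\theta_{\text{old}}}}[g \mid x]$ be the expected gradient for prompt $x$. Then:

$$\mathbb{E}_{y_{1:G}}[\hat{g}_t \mid x] = \frac{w_t(x)}{\beta_t}\, \bar{g}(x).$$

The variance of this quantity over $x$ is:

$$\mathrm{Var}_x\!\left(\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right) = \mathbb{E}_x\!\left[\left\|\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right\|^{2}\right] - \left\|\mathbb{E}_x\!\left[\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right]\right\|^{2}.$$

The second part is simply the squared norm of the true gradient of ...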

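The split into an expected action variance term and a problem variance term is an instance of the law of total variance, and it can be illustrated with a small Monte Carlo experiment. This is a sketch under loudly stated assumptions, not the paper's setup: the prompts are a toy set of `K` items drawn uniformly, the gradients are Gaussian, and the mask is assumed to keep the prompts whose hypothetical confidence score `conf` falls in the top `beta` fraction. All names (`g_bar`, `sigma`, `conf`, `w`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, N = 8, 3, 200_000                   # toy prompts, gradient dim, samples

g_bar = rng.normal(size=(K, d))           # per-prompt mean gradient, g_bar(x)
sigma = rng.uniform(0.5, 2.0, size=K)     # per-prompt gradient noise scale
conf = rng.uniform(size=K)                # ASSUMED confidence score per prompt

beta = 0.5                                # retained fraction
threshold = np.quantile(conf, 1.0 - beta)
w = (conf >= threshold).astype(float)     # ASSUMED mask: keep high-confidence prompts

# Sample prompts uniformly, then gradients, then apply the (w / beta) rescaling.
x = rng.integers(K, size=N)
g = g_bar[x] + sigma[x, None] * rng.normal(size=(N, d))
g_hat = (w[x] / beta)[:, None] * g

# Empirical total variance (trace of the covariance of g_hat).
total = g_hat.var(axis=0).sum()

# Action term: E_x[(w/beta)^2 * tr Var(g | x)], with tr Var(g | x) = d * sigma^2.
action_term = np.mean((w / beta) ** 2 * d * sigma**2)

# Problem term: trace variance over prompts of (w/beta) * g_bar(x).
problem_term = ((w / beta)[:, None] * g_bar).var(axis=0).sum()

print(total, action_term + problem_term)
```

The two printed numbers should agree up to sampling noise. The sketch says nothing about bias; it only illustrates how the two variance terms add up for a fixed mask.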