VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3
The pith
VI-CuRL stabilizes verifier-independent RL for LLM reasoning by using intrinsic model confidence to reduce gradient variance while preserving asymptotic unbiasedness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VI-CuRL introduces a verifier-independent curriculum that ranks training samples by the model's own confidence scores and applies this ordering inside Group Relative Policy Optimization. The curriculum reduces variance in the policy gradient estimates without introducing external reward signals. Rigorous analysis proves the modified estimator converges to the same unbiased limit as the original method. On both math and general reasoning benchmarks the approach yields more stable training curves and higher final accuracy whether or not an external verifier is available.
What carries the argument
The confidence-guided curriculum inside VI-CuRL, which orders samples by the model's intrinsic confidence to reduce action and problem variance in verifier-free policy optimization.
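A minimal sketch of how such a confidence-ordered curriculum could act on a batch, assuming the masked estimator takes the form w(x) / beta * g(x) that appears in the variance excerpts further down this page. How the confidence score is computed is not stated here, so it is passed in as an array; the function names, the top-fraction selection rule, and `keep_fraction` are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def confidence_mask(confidence, keep_fraction):
    """Return a 0/1 mask over prompts that keeps the most confident ones."""
    confidence = np.asarray(confidence, dtype=float)
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    n_keep = max(1, int(np.ceil(keep_fraction * confidence.size)))
    # Sort from most to least confident; a stable sort keeps ties in input order.
    order = np.argsort(-confidence, kind="stable")
    mask = np.zeros(confidence.size)
    mask[order[:n_keep]] = 1.0
    return mask


def masked_gradient(per_prompt_grads, confidence, keep_fraction):
    """Average w(x) / beta * g(x) over the batch, with beta the realised keep rate."""
    grads = np.asarray(per_prompt_grads, dtype=float)  # shape (n_prompts, dim)
    mask = confidence_mask(confidence, keep_fraction)
    beta = mask.mean()  # fraction of prompts actually kept (assumed role of beta)
    return (mask[:, None] * grads).mean(axis=0) / beta
```

With `keep_fraction=1.0` the mask keeps every prompt and the function reduces to the plain batch mean, which is the sense in which a curriculum that eventually retains everything would recover the unmasked estimator.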
If this is right
- Training of LLM reasoning policies can proceed without access to external verifiers.
- The estimator remains asymptotically unbiased while variance is actively lowered.
- Performance gains appear on both math-specific and general reasoning tasks.
- The same framework works in hybrid settings that sometimes include and sometimes omit verifiers.
- Destructive gradient variance that previously caused collapse is measurably suppressed.
Where Pith is reading between the lines
- The same confidence signal might be reused to adapt learning rates or batch sizes dynamically during training.
- If confidence correlates with latent capability, the curriculum could serve as a diagnostic for when a model has internalized a reasoning pattern.
- Extending the approach to multi-turn dialogue or agentic tasks would test whether problem-level variance reduction generalizes beyond single-step reasoning.
- Combining the curriculum with other internal signals such as entropy or uncertainty estimates could further tighten the bias-variance trade-off.
Load-bearing premise
The model's intrinsic confidence provides a reliable signal for sample ordering that reduces variance without adding bias that would invalidate the asymptotic unbiasedness guarantee.
What would settle it
Training runs on standard math-reasoning benchmarks in which the confidence curriculum produces either biased final policy estimates or recurrent collapse would falsify the central claims.
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VI-CuRL, a verifier-independent curriculum RL framework for LLM reasoning that uses the model's intrinsic confidence scores to prioritize samples and reduce action/problem variance in settings like Group Relative Policy Optimization. It claims a rigorous proof that the resulting estimator is asymptotically unbiased and reports consistent empirical outperformance over verifier-dependent and verifier-independent baselines on math and general reasoning benchmarks, both with and without external verifiers.
Significance. If the asymptotic unbiasedness guarantee holds under the adaptive confidence-based sampling, the work would meaningfully advance scalable RLVR without external verifiers by addressing destructive gradient variance. The empirical stability claims across benchmarks add practical value. The attempt at a theoretical analysis is noted as a strength, though its validity under policy-dependent sampling remains to be verified.
major comments (1)
- [Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.
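The concern in this comment can be made concrete with a toy simulation. The sketch below assumes a synthetic population in which each prompt's expected gradient is correlated with its confidence (the linear relation and all constants are invented for illustration), keeps the most confident fraction, rescales by the realised keep rate, and reports the gap to the unmasked mean. It is not a reproduction of the paper's setting or proof.

```python
import numpy as np


def selection_bias(keep_fraction, n_prompts=200_000, seed=0):
    """Bias of a confidence-masked mean relative to the full-population mean."""
    rng = np.random.default_rng(seed)
    confidence = rng.normal(size=n_prompts)
    # Hypothetical coupling: the expected gradient grows with confidence.
    mean_grad = 1.0 + 0.5 * confidence + rng.normal(scale=0.1, size=n_prompts)
    # Keep prompts whose confidence is in the top `keep_fraction` of the batch.
    threshold = np.quantile(confidence, 1.0 - keep_fraction)
    mask = confidence >= threshold
    beta = mask.mean()  # realised keep rate
    masked_estimate = (mask * mean_grad).mean() / beta
    return masked_estimate - mean_grad.mean()


for fraction in (0.2, 0.8, 1.0):
    print(fraction, selection_bias(fraction))
```

In this toy the gap is positive whenever the keep rate is below 1 and shrinks as the keep rate grows, which is why an unbiasedness argument for an adaptive rule needs either a retention schedule that approaches full coverage or an explicit bound on the selection-induced term.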
minor comments (1)
- [Abstract] Abstract: the statement that the method 'specifically target[s] the reduction of action and problem variance' would benefit from a brief parenthetical reference to the relevant variance decomposition or estimator form used in the analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify the theoretical analysis. We address the single major comment below.
- Referee: [Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.
Authors: We agree that the current proof presentation does not explicitly derive or bound the transient bias arising from the adaptive, policy-dependent sampling rule. The original argument relies on the fact that confidence scores converge as training progresses, which would make the sampling distribution asymptotically fixed, but this step is not formalized with a rate or coupling argument. In the revised manuscript we will add a new lemma that bounds the total variation distance between the time-varying sampling distribution and its limiting distribution, showing that the resulting bias term is o(1/T) and therefore does not affect asymptotic unbiasedness of the estimator. revision: yes
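For orientation, the kind of bound such a lemma would rely on is the standard inequality relating a difference of expectations to total variation distance. The statement below is a generic fact under an assumed boundedness condition, written in notation introduced here for illustration ($p_t$, $p_\infty$, $f$, $M$); it is not the authors' lemma, and it treats $f$ as fixed even though in training it also changes with the policy.

```latex
% Assumed notation (not from the paper): p_t is the sampling distribution at step t,
% p_\infty its limit, and f a per-sample gradient contribution with \|f(x)\| \le M.
\left\| \mathbb{E}_{x \sim p_t}[f(x)] - \mathbb{E}_{x \sim p_\infty}[f(x)] \right\|
  \le \int \|f(x)\| \, \left| p_t(x) - p_\infty(x) \right| \, dx
  \le M \, \| p_t - p_\infty \|_1
  = 2M \, \mathrm{TV}(p_t, p_\infty)
```

Under these assumptions, any rate established for $\mathrm{TV}(p_t, p_\infty)$ transfers directly to the bias term, so the proposed revision stands or falls on proving that rate.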
Circularity Check
No circularity exhibited; derivation claims independent proof
The abstract asserts a rigorous theoretical analysis proving asymptotic unbiasedness of the VI-CuRL estimator, with curriculum construction based on intrinsic confidence. However, the visible text contains no equations, proof steps, or self-citations that reduce any claimed prediction or uniqueness result to fitted inputs or prior self-work by construction. Per hard rules, circularity requires explicit quotes showing Eq. X equivalent to input by definition or fitted parameter renamed as prediction; none are present. The central claim therefore remains self-contained against external benchmarks in the provided material, warranting score 0.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence prioritization parameters
axioms (1)
- domain assumption: the model's intrinsic confidence is a valid proxy for constructing an effective training curriculum independent of external verifiers.
Forward citations
Cited by 1 Pith paper
- From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
Reference graph
Works this paper leans on
- [1] Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pp. 41–48, 2009.
- [2] Cai, X.-Q., Wang, W., Liu, F., Liu, T., Niu, G., and Sugiyama, M. Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers. arXiv preprint arXiv:2510.00915, 2025.
- [3] Chen, H., Feng, Y., Liu, Z., Yao, W., Prabhakar, A., Heinecke, S., Ho, R., Mùi, P. T., Savarese, S., Xiong, C., and Wang, H. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024.
- [4] Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H.-S., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
- [5] Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
- [6] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H...
- [7] He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Associatio...
- [8]
- [9] HuggingFaceH4. AIME 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024
- [10] Jiang, G., Feng, W., Quan, G., Hao, C., Zhang, Y., Liu, G., and Wang, H. VCRL: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025.
- [11] Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. G. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
- [12] Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, D., Perez, E., Schiefer, N., Dodds, Z., Dassarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod...
- [13] Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 23, 2010.
- [14] Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Informat...
- [15] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [16] Liu, W., Qi, S., Wang, X., Qian, C., Du, Y., and He, Y. NOVER: Incentive training for language models via verifier-free reinforcement learning. arXiv preprint arXiv:2505.16022, 2025.
- [17] Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Li, L. E., Popa, R. A., and Stoica, I. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog.
- [18] math-ai. AMC 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23
- [19] OpenCompass. AIME 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025
- [20] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L. E., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. J. Training language models to follow instructions with human feedback. Advances in neural information p...
- [21] Prabhudesai, M., Chen, L., Ippoliti, A., Fragkiadaki, K., Liu, H., and Pathak, D. Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660, 2025.
- [22] Ren, J., Zhao, Y., Vu, T., Liu, P. J., and Lakshminarayanan, B. Self-evaluation improves selective generation in large language models. In Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, pp. 49–64, 2023.
- [23] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.
- [24] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [25] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.
- [26] Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y., Wang, Z., Ebrahimi, S., and Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.
- [27] Shi, T., Wu, Y., Song, L., Zhou, T., and Zhao, J. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025.
- [28] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
- [29] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
- [30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Xia, F., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- [31] Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025.
- [32] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
- [33] Yao, J., Hao, Y., Zhang, H., Dong, H., Xiong, W., Jiang, N., and Zhang, T. Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and RL. arXiv preprint arXiv:2505.02391, 2025.
- [34] Ye, K., Zhou, H., Zhu, J., Quinzan, F., and Shi, C. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025.
- [35] Zelikman, E., Wu, Y., Mu, J., and Goodman, N. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- [36] Zhang, Q., Wu, H., Zhang, C., Zhao, P., and Bian, Y. Right question is already half the answer: Fully unsupervised LLM reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.
- [37] Zhang, Y., Zhang, Z., Guan, H., Cheng, Y., Duan, Y., Wang, C., Wang, Y., Zheng, S., and He, J. No free lunch: Rethinking internal feedback for LLM reasoning. arXiv preprint arXiv:2506.17219, 2025.
- [38] Zhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493, 2025.
- [39] Zuo, Y., Zhang, K., Qu, S., Sheng, L., Zhu, X., Zhang, Y., Qi, B., Sun, Y., Cui, G., Ding, N., and Zhou, B. TTRL: Test-time reinforcement learning. ArXiv, abs/2504.16084, 2025.
A. Related Work
LLM Reasoning and RLVR. Recent advances in LLM reasoning have been significantly propelled by RL. Chain-of-thought prompting [30] and self-consistency [29] esta...
- [40] “kept”, blue) consistently exhibits lower absolute variance than the full dataset (“full
leverage entropy minimization as a fully unsupervised reward signal, while the Entropy Mechanism [4] stabilizes training by managing policy collapse. Test-Time Reinforcement Learning (TTRL) [39] further explores converting test-time compute (e.g., majority voting) into training signals. However, most of these methods fundamentally rely on access to ground...
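- [41] The Expected Action Variance Term: Conditioned on a specific prompt $x$, the mask $w_t(x)$ and scalar $\beta_t$ are constants. Thus:
$$\mathrm{Var}_{y_{1:G}}(\hat{g}_t \mid x) = \mathrm{Var}_{y_{1:G}}\!\left(\frac{w_t(x)}{\beta_t}\, g(x, y_{1:G}; \theta) \,\middle|\, x\right) = \left(\frac{w_t(x)}{\beta_t}\right)^{2} \mathrm{Var}_{y_{1:G}}\big(g(x, y_{1:G}; \theta) \mid x\big)$$
(by quadratic scaling of scalar variance, applied to the trace). Taking the expectation over $x$:
$$\mathbb{E}_x\big[\mathrm{Var}_{y_{1:G}}(\hat{g}_t \mid x)\big] = \mathbb{E}_x\!\left[\frac{w_t(x)^{2}}{\beta_t^{2}}\, \mathrm{Var}_{y_{1:G}}\big(g(x, y_{1:G}; \theta) \mid x\big)\right]$$
...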
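- [42] The Problem Variance Term: First, consider the inner expectation $\mathbb{E}_{y_{1:G}}[\hat{g}_t \mid x]$. Let $\bar{g}(x) = \mathbb{E}_{y_{1:G} \mid \pi_{\theta_{\mathrm{old}}}}[g \mid x]$ be the expected gradient for prompt $x$. Then:
$$\mathbb{E}_{y_{1:G}}[\hat{g}_t \mid x] = \frac{w_t(x)}{\beta_t}\, \bar{g}(x).$$
The variance of this quantity over $x$ is:
$$\mathrm{Var}_x\!\left(\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right) = \mathbb{E}_x\!\left[\left\|\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right\|^{2}\right] - \left\|\mathbb{E}_x\!\left[\frac{w_t(x)}{\beta_t}\, \bar{g}(x)\right]\right\|^{2}.$$
The second part is simply the squared norm of the true gradient of ...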
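The two excerpts above are the two halves of the law of total variance applied to the masked estimator: an expected within-prompt ("action") term plus a between-prompt ("problem") term. The sketch below checks that split numerically on synthetic data. It is a simplified illustration under stated assumptions: gradients are scalars rather than vectors, each column stands for one independent draw of the group gradient, and the random mask, function name, and constants are hypothetical stand-ins for $w_t(x)$ and $\beta_t$.

```python
import numpy as np


def variance_terms(n_prompts=500, n_draws=64, keep_prob=0.6, seed=0):
    """Return (total, action, problem, scaled_action) variances of the masked estimator."""
    rng = np.random.default_rng(seed)
    prompt_mean = rng.normal(size=n_prompts)                   # stand-in for g-bar(x)
    mask = (rng.random(n_prompts) < keep_prob).astype(float)   # stand-in for w_t(x)
    beta = mask.mean()                                         # stand-in for beta_t
    # Scalar "gradients": per-prompt mean plus sampling noise across draws.
    g = prompt_mean[:, None] + rng.normal(size=(n_prompts, n_draws))
    g_hat = (mask / beta)[:, None] * g

    total = g_hat.var()                   # variance over prompts and draws jointly
    action = g_hat.var(axis=1).mean()     # E_x[ Var_y(g_hat | x) ]
    problem = g_hat.mean(axis=1).var()    # Var_x( E_y[g_hat | x] )
    # Quadratic scaling: Var_y(g_hat | x) = (w / beta)^2 * Var_y(g | x).
    scaled_action = ((mask / beta) ** 2 * g.var(axis=1)).mean()
    return total, action, problem, scaled_action


total, action, problem, scaled_action = variance_terms()
print(total, action + problem, scaled_action)
```

Because every prompt has the same number of draws and population variances are used, the identity holds exactly on the sample (up to floating-point error) rather than only in expectation.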