
arxiv: 2602.12579 · v2 · pith:3NJ3S6G5new · submitted 2026-02-13 · 💻 cs.LG · cs.AI

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · large language models · reasoning · curriculum learning · variance reduction · verifier-independent · policy optimization · asymptotic unbiasedness

The pith

VI-CuRL stabilizes verifier-independent RL for LLM reasoning by using intrinsic model confidence to reduce gradient variance while preserving asymptotic unbiasedness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops VI-CuRL to remove the need for external verifiers in reinforcement learning applied to large language model reasoning. Standard policy optimization methods encounter destructive variance that collapses training when no verifier is present. VI-CuRL builds a curriculum by prioritizing samples where the model already shows high confidence, which targets both action-level and problem-level variance. A theoretical analysis establishes that the resulting estimator stays asymptotically unbiased. Experiments show the method delivers stable training and higher performance than both verifier-dependent and verifier-independent baselines on math and general reasoning tasks.

Core claim

VI-CuRL introduces a verifier-independent curriculum that ranks training samples by the model's own confidence scores and applies this ordering inside Group Relative Policy Optimization. The curriculum reduces variance in the policy gradient estimates without introducing external reward signals. Rigorous analysis proves the modified estimator converges to the same unbiased limit as the original method. On both math and general reasoning benchmarks the approach yields more stable training curves and higher final accuracy whether or not an external verifier is available.
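
The curriculum step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the top-fraction selection rule, and the use of the realized retention rate are assumptions made here. Only the general shape (a per-prompt mask scaled by the inverse retention rate, applied on top of group-relative advantages) is taken from the proof fragments quoted further down this page.

```python
from math import ceil
from statistics import fmean, pstdev


def group_relative_advantages(group_rewards, eps=1e-8):
    """Standardize one prompt's rewards within its own group (GRPO-style)."""
    mean = fmean(group_rewards)
    std = pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


def curriculum_weights(confidences, beta):
    """Per-prompt weights for a top-`beta` confidence mask (hypothetical rule).

    The most confident prompts are kept and weighted by 1 / beta_hat, where
    beta_hat = kept / total is the realized retention rate. All other prompts
    get weight 0, so the weights average to exactly 1 over the batch.
    """
    n = len(confidences)
    kept = max(1, ceil(beta * n))
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    keep = set(order[:kept])
    beta_hat = kept / n
    return [1.0 / beta_hat if i in keep else 0.0 for i in range(n)]


# Toy inputs: four prompts with made-up confidence scores.
confidences = [0.9, 0.2, 0.7, 0.4]
weights = curriculum_weights(confidences, beta=0.5)

# One prompt's group of four sampled answers with made-up 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With a retention rate of 1.0 every weight equals 1 and the sketch reduces to ordinary group-relative advantages, which mirrors the page's remark that the selected set converges to the full dataset as the retention rate approaches 1.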

What carries the argument

The confidence-guided curriculum inside VI-CuRL, which orders samples by the model's intrinsic confidence to reduce action and problem variance in verifier-free policy optimization.

If this is right

  • Training of LLM reasoning policies can proceed without access to external verifiers.
  • The estimator remains asymptotically unbiased while variance is actively lowered.
  • Performance gains appear on both math-specific and general reasoning tasks.
  • The same framework works in hybrid settings that sometimes include and sometimes omit verifiers.
  • Destructive gradient variance that previously caused collapse is measurably suppressed.
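
One way to make "measurably suppressed" concrete is the kept/full variance ratio that Figure 3 reports. The sketch below computes that ratio for a scalar per-prompt statistic. The helper name `variance_ratio` and the numbers are hypothetical, hand-written for illustration, and are not results from the paper.

```python
from statistics import pvariance


def variance_ratio(values, kept_mask):
    """Population variance of the kept values divided by that of all values."""
    kept = [v for v, keep in zip(values, kept_mask) if keep]
    if not kept:
        raise ValueError("kept_mask selects no samples")
    full_var = pvariance(values)
    if full_var == 0:
        raise ValueError("full set has zero variance; ratio is undefined")
    return pvariance(kept) / full_var


# Toy data (assumption): the two dropped prompts are the outliers.
values = [1.0, 1.2, 0.8, 3.0, -2.0, 1.1]
kept_mask = [True, True, True, False, False, True]
ratio = variance_ratio(values, kept_mask)
```

A ratio below 1 means the kept subset is more homogeneous than the full set. Keeping everything gives a ratio of exactly 1.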

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same confidence signal might be reused to adapt learning rates or batch sizes dynamically during training.
  • If confidence correlates with latent capability, the curriculum could serve as a diagnostic for when a model has internalized a reasoning pattern.
  • Extending the approach to multi-turn dialogue or agentic tasks would test whether problem-level variance reduction generalizes beyond single-step reasoning.
  • Combining the curriculum with other internal signals such as entropy or uncertainty estimates could further tighten the bias-variance trade-off.

Load-bearing premise

The model's intrinsic confidence provides a reliable signal for sample ordering that reduces variance without adding bias that would invalidate the asymptotic unbiasedness guarantee.

What would settle it

Training runs on standard math-reasoning benchmarks in which the confidence curriculum produces either biased final policy estimates or recurrent collapse would falsify the central claims.

Figures

Figures reproduced from arXiv: 2602.12579 by Masashi Sugiyama, Xin-Qiang Cai.

Figure 1: Conceptual overview of VI-CuRL. Unlike standard RL that treats all samples equally, ...
Figure 2: The learning curves comparing VI-CuRL against the baseline No Curriculum across verifier-based (Oracle) and verifier-free (Majority Vote and Entropy) settings. Rows alternate between Pass@1 and Pass@8 for each setting.
Figure 3: Variance Ratio Analysis. Variance ratio (kept/full) for Action Variance ($\sigma^2_{g,t}$) and Problem Variance ($V_{\mathrm{prob},t}$) alongside the retention rate ($\beta_t$, right axis). When $\beta_t < 1$, both ratios are consistently below 1, confirming that curriculum selection reduces variance as predicted by Theorem 4.3. As $\beta_t \to 1$, the ratios approach 1 since the selected set converges to the full dataset.
Figure 4: Curriculum Difficulty Analysis on two 1.5B models. Rows represent algorithms (Oracle ...
Figure 5: Absolute variance values for the kept and full subsets. The kept subset (blue) maintained consistently lower variance than the full dataset (red), with the gap most pronounced during early training when aggressive curriculum selection was applied. This directly validated the mechanism underlying Theorem 4.3: by selecting high-confidence prompts, VI-CuRL created a more homogeneous training distri…
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces VI-CuRL, a verifier-independent curriculum RL framework for LLM reasoning that uses the model's intrinsic confidence scores to prioritize samples and reduce action/problem variance in settings like Group Relative Policy Optimization. It claims a rigorous proof that the resulting estimator is asymptotically unbiased and reports consistent empirical outperformance over verifier-dependent and verifier-independent baselines on math and general reasoning benchmarks, both with and without external verifiers.

Significance. If the asymptotic unbiasedness guarantee holds under the adaptive confidence-based sampling, the work would meaningfully advance scalable RLVR without external verifiers by addressing destructive gradient variance. The empirical stability claims across benchmarks add practical value. The attempt at a theoretical analysis is noted as a strength, though its validity under policy-dependent sampling remains to be verified.

major comments (1)
  1. [Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.
minor comments (1)
  1. [Abstract] Abstract: the statement that the method 'specifically target[s] the reduction of action and problem variance' would benefit from a brief parenthetical reference to the relevant variance decomposition or estimator form used in the analysis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the theoretical analysis. We address the single major comment below.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis (the section containing the asymptotic unbiasedness proof): the proof does not derive or bound the bias term arising from the time-varying, policy-dependent sampling probabilities induced by prioritizing samples according to the model's own evolving confidence scores. Standard REINFORCE-style unbiasedness arguments require either fixed sampling distributions or explicit cancellation of policy dependence in the limit; no such derivation or rate argument is shown for the adaptive curriculum rule.

    Authors: We agree that the current proof presentation does not explicitly derive or bound the transient bias arising from the adaptive, policy-dependent sampling rule. The original argument relies on the fact that confidence scores converge as training progresses, which would make the sampling distribution asymptotically fixed, but this step is not formalized with a rate or coupling argument. In the revised manuscript we will add a new lemma that bounds the total variation distance between the time-varying sampling distribution and its limiting distribution, showing that the resulting bias term is o(1/T) and therefore does not affect asymptotic unbiasedness of the estimator. revision: yes
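
The kind of bound the rebuttal promises could take the following shape. This is a sketch under assumptions not stated on this page: $q_t$ and $q_\infty$ denote the time-$t$ and limiting sampling distributions over a countable prompt set, $\bar g(x)$ is the expected per-prompt gradient, and $\|\bar g(x)\| \le M$ for all $x$.

```latex
\begin{aligned}
\left\| \mathbb{E}_{x \sim q_t}[\bar g(x)] - \mathbb{E}_{x \sim q_\infty}[\bar g(x)] \right\|
  &= \left\| \sum_x \bigl(q_t(x) - q_\infty(x)\bigr)\, \bar g(x) \right\| \\
  &\le M \sum_x \bigl| q_t(x) - q_\infty(x) \bigr| \\
  &= 2M \,\mathrm{TV}(q_t, q_\infty),
  \qquad \mathrm{TV}(p, q) = \tfrac{1}{2} \sum_x |p(x) - q(x)| .
\end{aligned}
```

Under these assumptions the sampling-induced bias vanishes whenever the total variation distance does. The specific rate quoted in the rebuttal would additionally need a matching rate on $\mathrm{TV}(q_t, q_\infty)$, which this sketch does not establish.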

Circularity Check

0 steps flagged

No circularity exhibited; derivation claims independent proof

full rationale

The abstract asserts a rigorous theoretical analysis proving asymptotic unbiasedness of the VI-CuRL estimator, with curriculum construction based on intrinsic confidence. However, the visible text contains no equations, proof steps, or self-citations that reduce any claimed prediction or uniqueness result to fitted inputs or prior self-work by construction. Per hard rules, circularity requires explicit quotes showing Eq. X equivalent to input by definition or fitted parameter renamed as prediction; none are present. The central claim therefore remains self-contained against external benchmarks in the provided material, warranting score 0.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Ledger constructed from abstract only; full paper likely contains additional modeling choices and assumptions not visible here.

free parameters (1)
  • confidence prioritization parameters
    Used to select high-confidence samples for the curriculum; specific values or fitting procedure not stated.
axioms (1)
  • domain assumption Model intrinsic confidence is a valid proxy for constructing an effective training curriculum independent of external verifiers.
    Central premise enabling verifier-independent operation and variance reduction.

pith-pipeline@v0.9.0 · 5706 in / 996 out tokens · 28499 ms · 2026-05-25T07:24:47.784135+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Curriculum learning

    Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pp. 41–48, 2009.

  2. [2]

    Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

    Cai, X.-Q., Wang, W., Liu, F., Liu, T., Niu, G., and Sugiyama, M. Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers. arXiv preprint arXiv:2510.00915, 2025

  3. [3]

    Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

    Chen, H., Feng, Y., Liu, Z., Yao, W., Prabhakar, A., Heinecke, S., Ho, R., Mùi, P. T., Savarese, S., Xiong, C., and Wang, H. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024

  4. [4]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H.-S., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

  5. [5]

    Neural networks and the bias/variance dilemma

    Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992

  6. [6]

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H...

  7. [7]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Associatio...

  8. [8]

    Large language models cannot self-correct reasoning yet

    Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, 2024

  9. [9]

    Aime 2024 (dataset card)

    HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024

  10. [10]

    Vcrl: Variance-based curriculum reinforcement learning for large language models

    Jiang, G., Feng, W., Quan, G., Hao, C., Zhang, Y., Liu, G., and Wang, H. Vcrl: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025

  11. [11]

    Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. G. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015

  12. [12]

    Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, D., Perez, E., Schiefer, N., Dodds, Z., Dassarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod...

  13. [13]

    Self-paced learning for latent variable models

    Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 23, 2010.

  14. [14]

    Solving quantitative reasoning problems with language models

    Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Informat...

  15. [15]

    Let’s verify step by step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  16. [16]

    Nover: Incentive training for language models via verifier-free reinforcement learning

    Liu, W., Qi, S., Wang, X., Qian, C., Du, Y., and He, Y. Nover: Incentive training for language models via verifier-free reinforcement learning. arXiv preprint arXiv:2505.16022, 2025

  17. [17]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

    Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Li, L. E., Popa, R. A., and Stoica, I. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog

  18. [18]

    Amc 2023 (dataset card)

    math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23

  19. [19]

    Aime 2025 (dataset card)

    OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025

  20. [20]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L. E., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. J. Training language models to follow instructions with human feedback. Advances in neural information p...

  21. [21]

    Maximizing confidence alone improves reasoning

    Prabhudesai, M., Chen, L., Ippoliti, A., Fragkiadaki, K., Liu, H., and Pathak, D. Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660, 2025

  22. [22]

    Self-evaluation improves selective generation in large language models

    Ren, J., Zhao, Y., Vu, T., Liu, P. J., and Lakshminarayanan, B. Self-evaluation improves selective generation in large language models. In Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, pp. 49–64, 2023

  23. [23]

    High-dimensional continuous control using generalized advantage estimation

    Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  25. [25]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  26. [26]

    Continual learning of large language models: A comprehensive survey

    Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y., Wang, Z., Ebrahimi, S., and Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025

  27. [27]

    Efficient reinforcement finetuning via adaptive curriculum learning

    Shi, T., Wu, Y., Song, L., Zhou, T., and Zhao, J. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025

  28. [28]

    Policy gradient methods for reinforcement learning with function approximation

    Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

  29. [29]

    Self-consistency improves chain of thought reasoning in language models

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

  30. [30]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Xia, F., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

  31. [31]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025

  32. [32]

    Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

  33. [33]

    Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and rl

    Yao, J., Hao, Y., Zhang, H., Dong, H., Xiong, W., Jiang, N., and Zhang, T. Optimizing chain-of-thought reasoners via gradient variance minimization in rejection sampling and rl. arXiv preprint arXiv:2505.02391, 2025

  34. [34]

    Robust reinforcement learning from human feedback for large language models fine-tuning

    Ye, K., Zhou, H., Zhu, J., Quinzan, F., and Shi, C. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025

  35. [35]

    Star: Bootstrapping reasoning with reasoning

    Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  36. [36]

    Right question is already half the answer: Fully unsupervised llm reasoning incentivization

    Zhang, Q., Wu, H., Zhang, C., Zhao, P., and Bian, Y. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025

  37. [37]

    No free lunch: Rethinking internal feedback for llm reasoning

    Zhang, Y., Zhang, Z., Guan, H., Cheng, Y., Duan, Y., Wang, C., Wang, Y., Zheng, S., and He, J. No free lunch: Rethinking internal feedback for llm reasoning. arXiv preprint arXiv:2506.17219, 2025

  38. [38]

    Reinforcing general reasoning without verifiers

    Zhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493, 2025

  39. [39]

    TTRL: Test-Time Reinforcement Learning

    Zuo, Y., Zhang, K., Qu, S., Sheng, L., Zhu, X., Zhang, Y., Qi, B., Sun, Y., Cui, G., Ding, N., and Zhou, B. Ttrl: Test-time reinforcement learning. ArXiv, abs/2504.16084, 2025.

    A. Related Work. LLM Reasoning and RLVR. Recent advances in LLM reasoning have been significantly propelled by RL. Chain-of-thought prompting [30] and self-consistency [29] esta...

  40. [40]

    kept”, blue) consistently exhibits lower absolute variance than the full dataset (“full

    leverage entropy minimization as a fully unsupervised reward signal, while the Entropy Mechanism [4] stabilizes training by managing policy collapse. Test-Time Reinforcement Learning (TTRL) [39] further explores converting test-time compute (e.g., majority voting) into training signals. However, most of these methods fundamentally rely on access to ground...

  41. [41]

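    The Expected Action Variance Term: Conditioned on a specific prompt $x$, the mask $w_t(x)$ and scalar $\beta_t$ are constants. Thus: $\operatorname{Var}_{y_{1:G}}(\hat g_t \mid x) = \operatorname{Var}_{y_{1:G}}\!\left(\frac{w_t(x)}{\beta_t}\, g(x, y_{1:G}; \theta) \,\middle|\, x\right) = \left(\frac{w_t(x)}{\beta_t}\right)^2 \operatorname{Var}_{y_{1:G}}(g(x, y_{1:G}; \theta) \mid x)$ (by quadratic scaling of scalar variance, applied to trace). Taking the expectation over $x$: $\mathbb{E}_x\!\left[\operatorname{Var}_{y_{1:G}}(\hat g_t \mid x)\right] = \mathbb{E}_x\!\left[\frac{w_t(x)^2}{\beta_t^2}\, \operatorname{Var}_{y_{1:G}}(g(x, y_{1:G}; \theta) \mid x)\right]$. ...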

  42. [42]

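    Let $\bar g(x) = \mathbb{E}_{y_{1:G} \mid \pi_{\theta_{\mathrm{old}}}}[g \mid x]$ be the expected gradient for prompt $x$

    The Problem Variance Term: First, consider the inner expectation $\mathbb{E}_{y_{1:G}}[\hat g_t \mid x]$. Let $\bar g(x) = \mathbb{E}_{y_{1:G} \mid \pi_{\theta_{\mathrm{old}}}}[g \mid x]$ be the expected gradient for prompt $x$. Then: $\mathbb{E}_{y_{1:G}}[\hat g_t \mid x] = \frac{w_t(x)}{\beta_t}\, \bar g(x)$. The variance of this quantity over $x$ is: $\operatorname{Var}_x\!\left(\frac{w_t(x)}{\beta_t}\, \bar g(x)\right) = \mathbb{E}_x\!\left[\left\|\frac{w_t(x)}{\beta_t}\, \bar g(x)\right\|^2\right] - \left\|\mathbb{E}_x\!\left[\frac{w_t(x)}{\beta_t}\, \bar g(x)\right]\right\|^2$. The second part is simply the squared norm of the true gradient of ...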
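
The two fragments above read as the two halves of the law of total variance applied to the masked estimator. The block below is a hedged restatement, not text recovered from the paper: it assumes $\hat g_t = \frac{w_t(x)}{\beta_t}\, g(x, y_{1:G}; \theta)$, a binary mask $w_t(x) \in \{0, 1\}$, and $\mathbb{E}_x[w_t(x)] = \beta_t$.

```latex
\begin{aligned}
\operatorname{Var}(\hat g_t)
  &= \underbrace{\mathbb{E}_x\!\left[\operatorname{Var}_{y_{1:G}}(\hat g_t \mid x)\right]}_{\text{action variance}}
   + \underbrace{\operatorname{Var}_x\!\left(\mathbb{E}_{y_{1:G}}[\hat g_t \mid x]\right)}_{\text{problem variance}}, \\[4pt]
\mathbb{E}_x\!\left[\operatorname{Var}_{y_{1:G}}(\hat g_t \mid x)\right]
  &= \frac{1}{\beta_t^{2}}\, \mathbb{E}_x\!\left[w_t(x)\, \operatorname{Var}_{y_{1:G}}(g \mid x)\right]
   && \text{since } w_t(x)^2 = w_t(x) \\
  &= \frac{1}{\beta_t}\, \mathbb{E}_x\!\left[\operatorname{Var}_{y_{1:G}}(g \mid x) \,\middle|\, w_t(x) = 1\right]
   && \text{since } \mathbb{E}_x[w_t(x)] = \beta_t .
\end{aligned}
```

Read this way, the action term is the average within-prompt variance over the kept prompts, inflated by $1/\beta_t$. Whether selection lowers it overall depends on how much lower the within-prompt variance is on the kept set, which this sketch leaves open.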