When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

Guojun Yin; Jiajun Chai; Li Wang; Tianhao Peng; Wei Lin; Xiaodong Lu; Xiaohan Wang; Yikun Ban

arxiv: 2605.25864 · v1 · pith:4KTDL4BXnew · submitted 2026-05-25 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-29 22:56 UTC · model grok-4.3

keywords: Reinforcement Learning · Active Learning · Pseudo-labels · Verifiable Rewards · Training Stability · Label Acquisition · Corrective Advantage Gap

The pith

Active selection of ground-truth labels via Corrective Advantage Gap stabilizes RL training on pseudo-labels with small annotation budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RLAVR, which acquires true labels for a small number of samples chosen by the Corrective Advantage Gap metric and merges them with pseudo-labels. This combination prevents the training collapse that occurs in unsupervised RLVR when models rely only on self-generated pseudo-labels. A reader would care because ground-truth labels are costly in practice, yet pure pseudo-label approaches fail to maintain stable dynamics. The method further supplies CARE, a practical policy that approximates the ideal selection criterion without requiring oracle information upfront. Experiments across domains, model families, and scales show the approach improves both stability and final performance under limited budgets.

Core claim

RLAVR actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. The Corrective Advantage Gap metric identifies samples whose labeling carries high supervision value, and CARE translates this oracle criterion into a usable pre-query acquisition policy.
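The acquisition step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `acquire_and_merge`, `scores`, and `query_oracle` are hypothetical names, the per-sample scores are taken as given (in the paper, CAG or CARE would supply them), and the budget is counted as a number of queried samples.

```python
from typing import Callable, List, Tuple


def acquire_and_merge(
    pseudo_labels: List[str],
    scores: List[float],
    query_oracle: Callable[[int], str],
    budget: int,
) -> Tuple[List[str], List[int]]:
    """Replace the pseudo-labels of the `budget` highest-scoring samples
    with labels returned by `query_oracle` (a hypothetical annotator)."""
    if len(pseudo_labels) != len(scores):
        raise ValueError("pseudo_labels and scores must have the same length")
    budget = max(0, min(budget, len(scores)))
    # Rank indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    selected = sorted(ranked[:budget])
    merged = list(pseudo_labels)
    for i in selected:
        merged[i] = query_oracle(i)  # one unit of annotation budget is spent here
    return merged, selected


# Toy usage: two of four pseudo-labels are wrong, and the scores happen to flag them.
truth = ["4", "8", "9", "10"]
merged, selected = acquire_and_merge(
    ["4", "7", "9", "12"], [0.1, 0.9, 0.2, 0.7], lambda i: truth[i], budget=2
)
```

Unselected samples keep their pseudo-labels, so the training set stays the same size and only the reward targets of the queried samples change.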

What carries the argument

The Corrective Advantage Gap (CAG) metric, which quantifies sample-level supervision value to decide which examples merit ground-truth labels.
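To make the idea of an "advantage gap" concrete, here is a hedged toy sketch. It is not the paper's definition of CAG, which this page does not reproduce. It assumes majority-vote pseudo-labels, binary exact-match rewards, and group-normalized advantages, and measures how far the advantages computed under the pseudo-label sit from those computed under the true label. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps). Assumed form."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def majority_vote(answers: List[str]) -> str:
    """Pseudo-label = most frequent sampled answer (an assumption here)."""
    return Counter(answers).most_common(1)[0][0]


def advantage_gap(answers: List[str], true_answer: str) -> float:
    """Mean absolute difference between advantages under the pseudo-label
    and advantages under the ground-truth label for one prompt."""
    pseudo = majority_vote(answers)
    adv_pseudo = group_advantages([1.0 if a == pseudo else 0.0 for a in answers])
    adv_true = group_advantages([1.0 if a == true_answer else 0.0 for a in answers])
    return mean([abs(p - t) for p, t in zip(adv_pseudo, adv_true)])


gap_when_vote_is_right = advantage_gap(["8", "8", "7", "8"], "8")
gap_when_vote_is_wrong = advantage_gap(["7", "7", "8", "7"], "8")
```

In this toy version the gap is exactly zero whenever the vote matches the true label and positive when the vote is wrong, which is the qualitative behaviour a supervision-value score would need.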

If this is right

Training remains stable rather than collapsing when a small fraction of pseudo-labels is replaced by CAG-selected ground-truth labels.
Final task performance rises across model scales and domains when annotation budgets are constrained.
CARE supplies a deployable policy that approximates the ideal CAG criterion without needing ground-truth information at query time.
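As an illustration of a score that needs no ground truth at query time, the sketch below computes Shannon entropy over clusters of sampled answers. It is not CARE: entropy over answer clusters appears in the paper's Figure 4 text among the active-learning selection criteria, and the version here makes the simplifying assumption that a cluster is a set of identical answer strings. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import List


def answer_cluster_entropy(answers: List[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over
    distinct final answers sampled for one prompt."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


unanimous = answer_cluster_entropy(["8", "8", "8", "8"])  # all rollouts agree
split = answer_cluster_entropy(["7", "8"])                # rollouts disagree evenly
```

Low entropy means the rollouts agree, but agreement alone does not imply correctness, which is the failure mode the paper's title points at.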

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selection logic could be tested in other reward-sparse RL settings where pseudo-labels are the default but costly verification is available for a few cases.
One could measure whether CAG scores correlate with downstream annotation efficiency in new task families beyond those studied.
Dynamic adjustment of the annotation budget based on running CAG estimates might further reduce total labeling cost while preserving gains.

Load-bearing premise

The Corrective Advantage Gap metric can identify samples whose labeling will meaningfully improve training stability and final performance.

What would settle it

An experiment in which random sample selection for ground-truth labels yields equal or better stability and performance than CAG-guided selection would show the metric adds no value.

Figures

Figures reproduced from arXiv: 2605.25864 by Guojun Yin, Jiajun Chai, Li Wang, Tianhao Peng, Wei Lin, Xiaodong Lu, Xiaohan Wang, Yikun Ban.

**Figure 2.** Comparison of label acquisition strategies across three Qwen3 models on math tasks.

**Figure 3.** Overall pipeline of CARE. The Stage-I classifier predicts sample reliability, the Stage-II …

**Figure 4.** (Left) Comparison of training curves among baseline methods on math tasks. (Right) Comparison of baseline training performance on math tasks after removing unsupervised samples.

… differs from semi-RLVR, we adapt semi-RLVR [44, 26] to the RLAVR setting by randomly selecting samples for ground-truth annotation. (3) Active learning methods. We consider three selection criteria: Entropy over the answer-cluster …

**Figure 5.** (a) Comparison of step training time across different baselines on math tasks. (b) Ablation study of CARE on math tasks with Qwen3-4B-Base. (c) Hyperparameter analysis of p2 on math tasks with Qwen3-4B-Base. (d) Hyperparameter analysis of p on math tasks with Qwen3-1.7B-Base.

8 Conclusion. In this paper, we propose RLAVR, a new setting that actively acquires ground-truth annotations for a subset of samples …

**Figure 6.** (Left) Training dynamics of Phi4-mini-instruct on math tasks. Comparison of label acquisition strategies on math-task pseudo-labeled samples without masking (Middle) and with masking (Right).

… $\bar{A}_{i,g} = A^{\star}_{i,g} \times \exp(-100 \cdot s_i)$ (36), denoted as Oracle-decay. With this large decay coefficient, the advantage of incorrect samples approaches 0, while for unsupervised samples, where the voting is correct, CAG is 0 …

**Figure 7.** Training dynamics across three Qwen3 models on K&K tasks.

**Figure 8.** (Left) Training dynamics of Phi4-mini-instruct on K&K tasks. Comparison of label acquisition strategies on K&K pseudo-labeled samples without masking (Middle) and with masking (Right).

C.7 Main Experimental Results on K&K Tasks. To further verify the effectiveness of CARE on other tasks, we evaluate it in the logical reasoning domain by conducting experiments on the K&K dataset [41], following the same hype…

**Figure 9.** Comparison of label acquisition strategies on K&K pseudo-labeled samples using the …

**Figure 10.** Comparison of label acquisition strategies across three Qwen3 model scales on math tasks.

**Figure 11.** (Left and Middle) CARE component ablation on Qwen3-1.7B-Base and Qwen3-8B-Base on math tasks. (Right) Comparison of training performance without unsupervised samples across baseline methods on Phi4-mini-instruct on math tasks.

[Plot residue: x-axis p2 with ticks 0.10, 0.15, 0.20, 0.25; y-axis Avg. per. (%); panels Qwen3-1.7B-Base (25.7, 24.9, 24.6, 23.7), Qwen3-4B-Base (36.0, 36.0, 36.4, 37.1), Qwen3-8B-Base …]

**Figure 12.** Hyperparameter analysis of p2 on Qwen3-Base (left) and Phi4 (right) models on math tasks.

**Figure 13.** Comparison of training performance among baselines without unsupervised samples on …

**Figure 14.** Example 1 of the CARE method using the Qwen3-4B-Base model.

**Figure 15.** CAG calculation example.
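The Oracle-decay expression quoted alongside Figure 6 (Eq. 36) scales an advantage by `exp(-100 * s_i)`. A small numeric sketch shows why the text says the advantage "approaches 0" under this coefficient. The excerpt does not define `s_i`; the code assumes it is a non-negative per-sample score, and the function name is hypothetical.

```python
import math


def oracle_decay(advantage: float, s: float, coeff: float = 100.0) -> float:
    """Scale an advantage by exp(-coeff * s), following the form of Eq. (36).
    `s` is assumed to be a non-negative per-sample score (not defined in the excerpt)."""
    return advantage * math.exp(-coeff * s)


unchanged = oracle_decay(1.5, 0.0)   # s = 0 leaves the advantage as it is
suppressed = oracle_decay(1.5, 1.0)  # s = 1 multiplies it by exp(-100)
```

With the coefficient fixed at 100, even a small score such as `s = 0.05` multiplies the advantage by `exp(-5)`, so the decay behaves almost like a hard mask.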

Original abstract

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces CAG and CARE for active ground-truth selection in RLVR to cut annotation costs, with experiments suggesting gains, but the self-belief approximation is the weakest link.

The main point is that RLAVR adds active selection of a few samples for real labels on top of pseudo-label RLVR, using a Corrective Advantage Gap metric turned into the CARE policy, and their tests show better stability and performance when annotation budgets are tight.

What is new is the specific CAG metric for measuring sample supervision value in this setting plus the CARE pre-query policy that approximates it without ground truth. The experiments run across domains, model families, and scales, and the code release helps. That empirical coverage is the strongest part.

The soft spot sits in the approximation step. CARE has to rely on model outputs to estimate CAG before querying, yet the title itself flags that self-belief misleads. If that mapping from estimated gap to actual corrective value is noisy, the claimed stabilization does not automatically follow even if the oracle version is sound. The abstract gives no equations, so circularity with the same pseudo-labels used in training cannot be ruled out from what is visible.

This is for researchers working on scaling RLVR for LLMs under real annotation limits. A reader who cares about active learning applied to verifiable-reward training will find the setup and results worth looking at.

It deserves a serious referee. The problem is practical, the experiments are broad, and the idea is a direct response to a known failure mode, even though the approximation needs checking in review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RLAVR, an active-label-acquisition framework for Reinforcement Learning with Verifiable Rewards (RLVR). It introduces the Corrective Advantage Gap (CAG) metric to quantify the value of obtaining a ground-truth label for a given sample and the CARE policy to translate an oracle CAG into a practical pre-query selection rule that operates on model outputs alone. The selected ground-truth labels are then mixed with pseudo-labels to stabilize training and improve final performance under constrained annotation budgets. The central empirical claim is that this approach outperforms random or heuristic selection across multiple domains, model families, and scales.

Significance. If the central claim holds, the work supplies a concrete, budget-aware mechanism for mitigating the well-known collapse of unsupervised RLVR while keeping annotation costs low. The public release of code is a clear strength that supports reproducibility and follow-on work.

major comments (2)

[Methods (CARE derivation and pre-query policy)] The mapping from the oracle CAG to the practical CARE policy is the load-bearing step for the central claim. Because CARE must approximate CAG without ground-truth rewards, any circular dependence on the same pseudo-labels used for policy training would invalidate the reported gains; the manuscript does not supply a derivation or ablation that isolates this approximation from the training signal.
[Experiments (ablation and selection-quality analysis)] The title itself flags that self-belief can mislead, yet the experimental section does not report a controlled test (e.g., oracle-CAG vs. CARE-CAG selection on the same seed) that would demonstrate the approximation remains corrective rather than merely reinforcing existing model errors.

minor comments (2)

[Abstract and §3] Notation for the CAG metric is introduced without an explicit equation reference in the abstract or early sections, making it difficult to trace how the metric is computed from advantage estimates.
[Experiments] The claim of 'extensive experiments across diverse domains, model families, and model scales' would be strengthened by a table that enumerates the exact datasets, model sizes, and annotation budgets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. Where the manuscript is missing requested elements, we commit to adding them in revision.

Referee: [Methods (CARE derivation and pre-query policy)] The mapping from the oracle CAG to the practical CARE policy is the load-bearing step for the central claim. Because CARE must approximate CAG without ground-truth rewards, any circular dependence on the same pseudo-labels used for policy training would invalidate the reported gains; the manuscript does not supply a derivation or ablation that isolates this approximation from the training signal.

Authors: We agree that a clear isolation of the CARE approximation is essential. CARE is constructed exclusively from model outputs (token probabilities and response consistency) and does not ingest the pseudo-labels or reward signals used inside the RLVR training loop; the oracle CAG is defined as the expected advantage correction under a verified label, and CARE approximates the sign and magnitude of this gap via an entropy-based reliability score. Nevertheless, the current manuscript presents only a high-level description. We will add a dedicated appendix containing the full step-by-step derivation together with an ablation that replaces CARE with a variant that leaks training-signal information, thereby quantifying any circularity effect. revision: yes
Referee: [Experiments (ablation and selection-quality analysis)] The title itself flags that self-belief can mislead, yet the experimental section does not report a controlled test (e.g., oracle-CAG vs. CARE-CAG selection on the same seed) that would demonstrate the approximation remains corrective rather than merely reinforcing existing model errors.

Authors: A direct oracle-CAG versus CARE-CAG comparison on identical seeds would indeed be the cleanest demonstration. Because oracle CAG requires ground-truth labels for the entire candidate pool, a full-scale version is incompatible with the limited-budget regime studied in the paper. We will nevertheless add a controlled post-hoc analysis on a fully labeled subset of each benchmark: we compute both oracle and CARE rankings on the same seed, measure the overlap of selected samples, and report the downstream performance gap when the two policies are used for label acquisition. This will quantify how closely CARE tracks the corrective signal without reinforcing model errors. revision: yes
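The overlap measurement proposed in this response can be sketched as follows. This is an illustrative helper under stated assumptions, not code from the paper: both policies are assumed to emit one score per candidate sample, selection is top-k by score, and `top_k` and `selection_overlap` are hypothetical names.

```python
from typing import Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest scores; ties keep the lower index first."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


def selection_overlap(
    oracle_scores: Sequence[float], proxy_scores: Sequence[float], k: int
) -> float:
    """Fraction of the oracle's top-k selections that the proxy also selects."""
    if len(oracle_scores) != len(proxy_scores):
        raise ValueError("score lists must cover the same candidate pool")
    if k <= 0:
        raise ValueError("k must be positive")
    oracle_set = top_k(oracle_scores, k)
    proxy_set = top_k(proxy_scores, k)
    return len(oracle_set & proxy_set) / len(oracle_set)


# Toy pool of four samples with a budget of two queries.
overlap = selection_overlap([0.9, 0.1, 0.8, 0.2], [0.7, 0.6, 0.1, 0.2], k=2)
```

A high overlap would indicate that the pre-query policy picks roughly the same samples as the oracle ranking; it says nothing by itself about downstream performance, which the response proposes to report separately.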

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external experiments

The provided abstract and description introduce CAG as a metric and CARE as a translation of an oracle criterion into a policy, but contain no equations, no fitted parameters renamed as predictions, and no self-citation chains that reduce the central claim to its own inputs by construction. Experiments across domains, models, and scales are presented as validation, making the work self-contained against external benchmarks. No load-bearing self-definitional or fitted-input steps are quotable from the text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based on abstract only; the approach rests on the domain assumption that pseudo-labels remain useful when mixed with a small number of ground-truth labels and that sample selection via approximated CAG is feasible pre-query.

axioms (1)

Domain assumption: Pseudo-labels generated by the model can be productively combined with a small number of ground-truth labels without causing training collapse.
Stated as the core motivation and solution in the abstract.

invented entities (2)

Corrective Advantage Gap (CAG) · no independent evidence
purpose: Metric to quantify sample-level supervision value for active acquisition
Newly proposed in the paper; no independent evidence provided in abstract

CARE acquisition policy · no independent evidence
purpose: Practical pre-query policy approximating the oracle CAG criterion
Newly introduced; no independent evidence in abstract

pith-pipeline@v0.9.1-grok · 5765 in / 1319 out tokens · 27762 ms · 2026-06-29T22:56:18.465745+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 38 canonical work pages · 17 internal anchors

[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024
[2] Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025
[3] Charu C Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and Philip S Yu. Active learning: A survey. In Data classification, pages 599–634. Chapman and Hall/CRC, 2014
[4] Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, 2023
[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
[6] Markus Bayer, Justin Lutz, and Christian Reuter. Activellm: Large language model-based active learning for textual few-shot scenarios. Transactions of the Association for Computational Linguistics, 14:1–22, 2026
[7] Wenbin Cai, Ya Zhang, and Jun Zhou. Maximizing expected model change for active learning in regression. In 2013 IEEE 13th international conference on data mining, pages 51–60. IEEE, 2013
[8] Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025
[9] Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, and Longxu Dou. Efficient process reward model training via active learning. arXiv preprint arXiv:2504.10559, 2025
[10] Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Puchen Xu, Eyke Hüllermeier, Paul Weng, and Yifei Zhu. Duo: Diverse, uncertain, on-policy query generation and selection for reinforcement learning from human feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 16604–16612, 2025
[11] Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941, 2017
[12] Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu, and Xuan Zhu. Prism: A unified framework for post-training llms without verifiable rewards. arXiv preprint arXiv:2601.04700, 2026
[13] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. Rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025
[14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
[15] Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. How far can unsupervised rlvr scale llm training? arXiv preprint arXiv:2603.08660, 2026
[16] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper... 2024
[17] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021
[18] Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401, 2024
[19] Seho Kee, Enrique Del Castillo, and George Runger. Query-by-committee improvement with diversity and density in batch active learning. Information Sciences, 454:401–418, 2018
[20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023
[21] Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, and Serena Yeung-Levy. Tool verification for test-time reinforcement learning. arXiv preprint arXiv:2603.02203, 2026
[22] Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, and Ran He. Rest: Reshaping token-level policy gradients for tool-use large language models. arXiv preprint arXiv:2509.21826, 2025
[23] Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, and Ran He. Awpo: Enhancing tool-use of large language models through adaptive integration of reasoning rewards. arXiv preprint arXiv:2512.19126, 2025
[24] Pangpang Liu, Chengchun Shi, and Will Wei Sun. Dual active learning for reinforcement learning from human feedback. arXiv preprint arXiv:2410.02504, 2024
[25] Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, and Deqing Wang. Contextual rollout bandits for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2602.08499, 2026
[26] Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, and Jiaxuan You. Memreward: Graph-based experience memory for llm reward prediction with limited labels. arXiv preprint arXiv:2603.19310, 2026
[27] Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–9017, 2023
[28] Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. 2023
[29] Stephen Mussmann, Julia Reisler, Daniel Tsai, Ehsan Mousavi, Shayne O’Brien, and Moises Goldszmidt. Active learning with expected error reduction. arXiv preprint arXiv:2211.09283, 2022
[30] Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, and Yongliang Shen. Coverrl: Breaking the consensus trap in label-free reasoning via generator-verifier co-evolution. arXiv preprint arXiv:2603.17775, 2026
[31] Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660, 2025
[32] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
[33] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[34] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
[35] Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025
[36] Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, et al. Diagnosing and mitigating system bias in self-rewarding rl. arXiv preprint arXiv:2510.08977, 2025 (work page title: Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL)
[37] Dan Wang and Yi Shang. A new active labeling method for deep learning. In 2014 International joint conference on neural networks (IJCNN), pages 112–119. IEEE, 2014
[38] Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Computing Surveys, 58(8):1–35, 2026
[39] Tingting Wang, Xufeng Zhao, Qiujian Lv, Bo Hu, and Degang Sun. Density weighted diversity based query strategy for active learning. In 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 156–161. IEEE, 2021
[40] Weiqin Wang, Yile Wang, Kehao Chen, and Hui Huang. Beyond majority voting: Towards fine-grained and more reliable reward signal for test-time reinforcement learning. arXiv preprint arXiv:2512.15146, 2025
[41] Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Li... 2025
[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[43] Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2586–2616, 2025.
[44] Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan Ma, Zhongqi Chen, Bowen Song, Weiqiang Wang, Junbo Zhao, Gang Chen, and Haobo Wang. Trapo: A semi-supervised reinforcement learning framework for boosting LLM reasoning. arXiv preprint arXiv:2512.13106, 2025.
[45] Hao Yi, Yulan Hu, Xin Li, Sheng Ouyang, Lizhong Ding, and Yong Liu. Learn more with less: Uncertainty consistency guided query selection for RLVR. arXiv preprint arXiv:2601.22595, 2026.
[46] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[47] Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, et al. Restrain: From spurious votes to signals – self-driven RL with self-penalization. arXiv preprint arXiv:2510.02172, 2025.
[48] Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025.
[49] Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, and Dacheng Tao. Consistent paths lead to truth: Self-rewarding reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.08745, 2025.
[50] Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised LLM reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.
[51] Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, and Bo Han. Co-rewarding: Stable self-supervised RL for eliciting reasoning in large language models. arXiv preprint arXiv:2508.00410, 2025.
[52] Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. arXiv preprint arXiv:2505.19590, 2025.
[53] Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, and Dong Yu. Evolving language models without labels: Majority drives selection, novelty promotes variation. arXiv preprint arXiv:2509.15194, 2025.
[54] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.

A Proofs

A.1 Proof Lemma 6.1

Proof. Under the strict on-policy setting, the gradients induced by the ground-truth and pseudo-reward advantages can be w...
[55]

20'\",(9.5,11),N);\ndraw((10.5,9)--(11,9));\ndraw((10.5,10)--(11,10));\ndraw((10.75,9)--(10.75,10));\nlabel(\

**Scoring**: Provide a score based on the criteria. The score should be a decimal from {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.
3. **Strict Output Requirement**:
* **ONLY** output the score inside the \\box{} format (e.g., \\box{0.8}).
* **DO NOT** provide any reasoning, justification, or preamble.
* **DO NOT** includ...
[56]

Figure 15: CAG calculation example. The figure ends with an advantage vector whose last entry is -0.3779644730 and reports CAG: $\lVert A_{\text{pseudo}} - A_{\text{true}} \rVert_2 = 4.4149163349$.
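The CAG expression in the Figure 15 caption can be illustrated with a small sketch. It assumes, since this text does not state it, that advantages are group-normalized as (reward minus group mean) divided by the population standard deviation, and that CAG is the L2 norm of the difference between the two advantage vectors. The function names and the two reward vectors are hypothetical and chosen only for illustration. With these inputs the result is close to the value in the caption, but the actual inputs behind the figure are not recoverable from this text.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / population std.

    Assumption: this normalization is not stated in the surrounding text.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts received the same reward: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def corrective_advantage_gap(pseudo_rewards, true_rewards):
    """L2 distance between advantages under pseudo and ground-truth rewards."""
    a_pseudo = group_advantages(pseudo_rewards)
    a_true = group_advantages(true_rewards)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(a_pseudo, a_true)))


# Hypothetical group of 8 rollouts for one prompt.
# Ground truth: only rollout 6 is correct.
true_rewards = [0, 0, 0, 0, 0, 0, 1, 0]
# Pseudo-label: rollouts 0 and 1 agree with the (wrong) pseudo-answer.
pseudo_rewards = [1, 1, 0, 0, 0, 0, 0, 0]

cag = corrective_advantage_gap(pseudo_rewards, true_rewards)
print(round(cag, 6))  # 4.414916
```

When the pseudo-label agrees with the ground truth on every rollout, the two advantage vectors coincide and the gap is zero, which is why a large gap marks a sample where acquiring the true label changes the update the most.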
