pith. machine review for the scientific record. sign in

arxiv: 2604.21327 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.AI· cs.CL

Recognition: unknown

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Authors on Pith no claims yet

Pith reviewed 2026-05-09 22:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords test-time reinforcement learningmath reasoningspurious signalslabel noiselarge language modelsadvantage estimationpseudo-labelingdebiased reinforcement learning
0
0 comments X

The pith

DDRL mitigates spurious signal amplification in test-time RL for math reasoning by excluding ambiguous responses and using fixed advantages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Test-time reinforcement learning adapts large language models to math problems at inference time by treating the model's own sampled answers as pseudo-labels. This process creates reward noise because responses that the model agrees on only some of the time form an ambiguity region and produce unreliable training signals. The paper finds that group-relative advantage estimation actually magnifies these errors rather than averaging them out. The proposed DDRL framework counters the problem in three steps: frequency-based sampling keeps only high- or low-consistency examples while balancing positive and negative cases, fixed advantages replace relative scoring to remove bias, and a consensus stage refines the model off-policy using the filtered data. Experiments across three models and several math benchmarks show consistent gains over prior test-time RL methods.

Core claim

Responses with medium consistency constitute the main source of reward noise in test-time RL, and group-relative advantage estimation amplifies the resulting spurious signals; DDRL counters this by applying frequency-based sampling to drop ambiguous examples while preserving balance, switching to fixed advantages for debiased estimation, and adding a consensus-based off-policy refinement stage that uses the cleaned dataset for stable updates.

What carries the argument

The DDRL framework, which uses frequency-based sampling to exclude medium-consistency responses, fixed advantages instead of group-relative ones, and consensus-based off-policy refinement on the resulting dataset.

If this is right

  • Test-time adaptation becomes more stable because ambiguous samples are removed before advantage calculation.
  • Fixed advantages eliminate the bias that group-relative scoring introduces when pseudo-labels are noisy.
  • The resulting model updates are both more efficient and less prone to overfitting to label errors.
  • Performance improves consistently across different large language models on multiple mathematical reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar frequency-based filtering could be tested in other pseudo-labeling settings such as code generation or open-ended question answering.
  • If relative advantage estimation amplifies noise in this domain, the same pattern may appear in other on-policy RL loops that rely on self-generated labels.
  • The approach suggests that simple agreement counts among samples can serve as a practical proxy for identifying low-confidence outputs without extra models.

Load-bearing premise

Medium-consistency responses are the dominant noise source and frequency sampling combined with fixed advantages can remove that noise without discarding useful learning signal.

What would settle it

An experiment in which group-relative advantage estimation shows no amplification of noise from medium-consistency responses, or in which DDRL fails to outperform standard TTRL baselines on the same models and benchmarks.

Figures

Figures reproduced from arXiv: 2604.21327 by Jian Liang, Kuangpu Guo, Lingxiao He, Meng Wang, Qianlong Xie, Ran He, Xingxing Wang, Yongcan Yu.

Figure 1
Figure 1. Figure 1: Overview of test-time reinforcement learning [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Answer frequency as an imperfect proxy for reliability. We analyze the relationship between answer [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Behavior of group-relative advantage estimation under limited positive samples. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of training dynamics between TTRL and DDRL via mean advantage. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Conditional answer distributions reveal ambiguous frequency regions in self-consistency in the Figure [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Answer frequency as an imperfect proxy for reliability on AIME (1985-2024). [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
read the original abstract

Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper empirically studies spurious signal amplification in test-time reinforcement learning (TTRL) for mathematical reasoning. It identifies responses with medium consistency as forming an 'ambiguity region' that is the primary source of reward noise, which can be amplified by group-relative advantage estimation. To address this, the authors propose Debiased and Denoised test-time Reinforcement Learning (DDRL), consisting of frequency-based sampling to exclude ambiguous samples while maintaining balance between positive and negative examples, debiased advantage estimation using fixed advantages, and a consensus-based off-policy refinement stage. Experiments across three large language models and multiple math reasoning benchmarks show that DDRL consistently outperforms existing TTRL baselines.

Significance. If the results hold and the performance gains can be robustly attributed to the proposed mitigations rather than other factors, this work could provide a practical framework for improving the reliability of test-time adaptation in reasoning tasks by reducing the impact of label noise from ambiguous responses. The empirical identification of the ambiguity region and the unified framework are valuable contributions to the growing area of test-time RL for LLMs.

major comments (3)
  1. [Experiments] The experiments report consistent outperformance but do not include ablations that isolate the contribution of the frequency-based sampling strategy from the debiased advantage estimation and consensus refinement. Without such controls, it is difficult to confirm that the gains stem specifically from excluding the medium-consistency ambiguity region rather than the other components or implementation details.
  2. [Method (DDRL framework)] The assumption that medium-consistency responses primarily contain noise without recoverable useful signal is central but not directly tested. An analysis or experiment showing the performance impact of including vs. excluding these samples (e.g., via oracle or additional metrics) would strengthen the claim that exclusion mitigates spurious signals without discarding value.
  3. [Abstract and §3] Definitions of 'medium consistency', 'frequency-based sampling', and 'balanced set of positive and negative examples' are referenced but their precise operationalization (e.g., thresholds, sampling probabilities) needs to be detailed with pseudocode or equations to allow reproduction and verification of the noise exclusion.
minor comments (2)
  1. [Abstract] The abstract mentions 'the code will soon be released' but does not provide a link or timeline; consider updating with the actual repository URL once available.
  2. [Throughout] Ensure all acronyms (TTRL, DDRL) are defined on first use in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical validation and reproducibility of DDRL. We address each major comment point by point below and will incorporate revisions to improve the paper.

read point-by-point responses
  1. Referee: [Experiments] The experiments report consistent outperformance but do not include ablations that isolate the contribution of the frequency-based sampling strategy from the debiased advantage estimation and consensus refinement. Without such controls, it is difficult to confirm that the gains stem specifically from excluding the medium-consistency ambiguity region rather than the other components or implementation details.

    Authors: We agree that isolating the individual contributions would provide clearer attribution of the performance gains. In the revised manuscript, we will add dedicated ablation studies that evaluate the frequency-based sampling strategy in isolation (e.g., DDRL without sampling but with the other two components) as well as incremental combinations. These controls will directly test whether excluding the ambiguity region is the primary driver of improvement. revision: yes

  2. Referee: [Method (DDRL framework)] The assumption that medium-consistency responses primarily contain noise without recoverable useful signal is central but not directly tested. An analysis or experiment showing the performance impact of including vs. excluding these samples (e.g., via oracle or additional metrics) would strengthen the claim that exclusion mitigates spurious signals without discarding value.

    Authors: Section 3 presents an empirical analysis linking medium-consistency responses to elevated reward noise via consistency histograms and variance measurements. To directly test the exclusion assumption, we will add a new experiment in the revision comparing performance with and without these samples, using additional metrics such as estimated signal-to-noise ratios and accuracy on a small oracle-labeled subset where feasible. This will quantify any trade-off between noise reduction and potential signal loss. revision: yes

  3. Referee: [Abstract and §3] Definitions of 'medium consistency', 'frequency-based sampling', and 'balanced set of positive and negative examples' are referenced but their precise operationalization (e.g., thresholds, sampling probabilities) needs to be detailed with pseudocode or equations to allow reproduction and verification of the noise exclusion.

    Authors: We concur that precise operational details are necessary for reproducibility. In the revision, we will expand Section 3 with explicit definitions (e.g., medium consistency as responses with consistency scores in [0.3, 0.7]), the exact sampling probability formula for frequency-based exclusion, the balance constraint equations, and pseudocode for the full DDRL pipeline including all three stages. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework

full rationale

The paper reports an empirical observation that medium-consistency responses constitute the primary source of reward noise in TTRL and can be amplified by group-relative advantage estimation. It then introduces DDRL as a set of direct design choices (frequency-based sampling for balanced exclusion, fixed-advantage debiased estimation, and consensus refinement) motivated by those observations. These steps are presented as engineering interventions validated by experiments on three LLMs, not as a derivation or prediction that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that medium-consistency responses are the dominant noise source and that the three proposed mitigations address it without side effects; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1186 out tokens · 27646 ms · 2026-05-09T22:49:48.898557+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 18 canonical work pages · 9 internal anchors

  1. [1]

    AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400,

    Acereason-nemotron: Advanc- ing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others

  2. [2]

    The Llama 3 Herd of Models

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others

  3. [3]

    OpenAI o1 System Card

    Openai o1 system card.arXiv preprint arXiv:2412.16720. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, and 1 others

  4. [4]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Tulu 3: Pushing fron- tiers in open language model post-training.arXiv preprint arXiv:2411.15124. Jia Li, Edward Beeching, Lewis Tunstall, Ben Lip- kin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, and 1 others. 2024a. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math prob- lems ...

  5. [5]

    Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, and ShaoGuo Liu

    Ettrl: Balancing exploration and exploitation in llm test- time reinforcement learning via entropy mechanism. arXiv preprint arXiv:2508.11356. 9 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xi- ang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto

  6. [6]

    arXiv preprint arXiv:2505.22660

    Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 oth- ers

  7. [7]

    Qwen2.5 Technical Report

    Qwen2.5 technical report.Preprint, arXiv:2412.15115. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu

  9. [9]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256. Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt

  10. [10]

    Associated with the WaltonFuture GeoQA-8K-direct-synthesizing dataset release

    Unsu- pervised post-training for multi-modal llm reasoning via grpo.arXiv preprint arXiv:2505.22453. Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F Schmidt, and Jianfei Cai

  11. [11]

    Dong Yan, Gaochen Wu, and Bowen Zhou

    Spine: Token-selective test-time reinforcement learning with entropy-band regularization.arXiv preprint arXiv:2511.17938. Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, and Tieniu Tan

  12. [12]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Qwen2. 5-math technical report: Toward mathe- matical expert model via self-improvement.arXiv preprint arXiv:2409.12122. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gao- hong Liu, Lingjun Liu, and 1 others. 2025a. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:250...

  13. [13]

    Yongcan Yu, Lijun Sheng, Ran He, and Jian Liang

    Benchmarking test-time adaptation against distribu- tion shifts in image classification.arXiv preprint arXiv:2307.03133. Yongcan Yu, Lijun Sheng, Ran He, and Jian Liang

  14. [14]

    Stamp: Outlier-aware test-time adaptation with stable memory replay. InProc. ECCV. Yongcan Yu, Yanbo Wang, Ran He, and Jian Liang. 2025c. Test-time immunization: A universal defense framework against jailbreaks for (multimodal) large language models.arXiv preprint arXiv:2505.22271. Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, J...

  15. [15]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Does re- inforcement learning really incentivize reasoning ca- pacity in llms beyond the base model?arXiv preprint arXiv:2504.13837. Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, and Dacheng Tao. 2025a. Consistent paths lead to truth: Self- rewarding reinforcement learning for llm reasoning. arXiv preprint arXiv...

  16. [16]

    Cumulative reasoning with large language models

    Cumulative reason- ing with large language models.arXiv preprint arXiv:2308.04371. Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song

  17. [17]

    Learning to reason without external rewards.arXiv preprint arXiv:2505.19590, 2025

    Learning to rea- son without external rewards.arXiv preprint arXiv:2505.19590. Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xin- wei Long, Ermo Hua, and 1 others

  18. [18]

    Ttrl: Test-time reinforcement learning.arXiv preprint arXiv:2504.16084. 10 A Implementation Detail on Input Prompt Format Following TTRL (Zuo et al., 2025), we add the reasoning prompt to the model: Reasoning Cue Please reason step by step, and put your final answer within \boxed{}. For Math-specialized models, we use it as a sys- tem prompt; for other mo...

  19. [19]

    derivative d dx p(C|x) = p(C|x) 1−p(C|x) d dx log pc(x) pi(x) , (15) which explains why the posterior changes most rapidly in the ambiguous region while flattening at both extremes. Overall, this figure demonstrates that sampling frequency provides reliable confi- dence signals only in low- and high-frequency regimes, whereas medium-frequency answers in- ...

  20. [20]

    This confirms that the key assump- tion behind BCS is not specific to MATH-500 but generalizes to other datasets

    The same frequency–correctness pattern emerges: answers in the medium-frequency regime exhibit substantially higher uncertainty, while low-(<=5) and high-(>=45) frequency regions remain rela- tively reliable. This confirms that the key assump- tion behind BCS is not specific to MATH-500 but generalizes to other datasets. Method AIME 2024 AMC MATH-500Avg Q...

  21. [21]

    To investigate the impact of the refinement size on model performance, we conduct a sensitivity analysis by scaling the original setting (1x) to 2x and 4x Table

    DDRL continues to show consistent improvements over TTRL at this larger scale Refinement Size (compared to original setting) AIME 2024 AMC MATH-500 1x (original setting) 25.0 52.9 80.72x 25.2 53.0 80.74x 25.3 53.0 80.8 Table 6: Sensitivity analysis of refinement size on math- ematical reasoning benchmarks. To investigate the impact of the refinement size ...

  22. [22]

    Scaling up the refinement size yields highly marginal improvements across all benchmarks

    As shown in the table, our method demonstrates remarkable robustness to this hyperparameter. Scaling up the refinement size yields highly marginal improvements across all benchmarks. Specifically, the accuracy on AIME 2024 only increases slightly from 25.0 to 25.3, while performance on AMC and MATH-500 satu- rates almost immediately, showing negligible ga...