Distribution-aware reward estimation for test-time reinforcement learning. arXiv preprint arXiv:2601.21804

Bodong Du, Xuanqi Huang, Xiaomeng Li · arXiv 2601.21804

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

cs.LG · 2026-06-02 · unverdicted · none · ref 15
