TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
arXiv preprint arXiv:2508.00410 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 4years
2026 4verdicts
UNVERDICTED 4representative citing papers
GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.
citing papers explorer
-
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
-
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
-
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.