arXiv preprint arXiv:2508.00410 , year=

Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models , author= · 2025 · arXiv 2508.00410

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

representative citing papers

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

citing papers explorer

Showing 4 of 4 citing papers.

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification cs.LG · 2026-06-02 · unverdicted · none · ref 12
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling cs.LG · 2026-06-03 · unverdicted · none · ref 15
GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 41
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-25 · unverdicted · none · ref 51
RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

arXiv preprint arXiv:2508.00410 , year=

fields

years

verdicts

representative citing papers

citing papers explorer