Prism: A unified framework for post-training LLMs without verifiable rewards. arXiv preprint arXiv:2601.04700, 2026

Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu, Xuan Zhu · 2026 · arXiv 2601.04700

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

citing papers explorer

Showing 1 of 1 citing paper.

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · none · ref 12

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.
