Despite their promise, PRMs are learned models trained on imperfect supervision and are therefore vulnerable to bias and distributional mismatch

have been shown to improve credit assignment, learning efficiency · 2025 · arXiv 0015.6661

1 Pith paper cites this work.


representative citing papers

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

VeriGate adds verifier-gated step-level supervision to GRPO via cumulative PRM rewards and group-normalized token advantages, raising accuracy by 20% and 12% for 1.5B and 7B models, respectively, on MATH and six benchmarks.
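The one-line summary above names two mechanisms: cumulative PRM rewards and group-normalized token advantages. VeriGate's exact formulation is not given here, so the following is only a minimal sketch under stated assumptions. It follows the process-supervision variant of GRPO: step rewards are normalized across all steps in a sampled group, and each token's advantage is the sum of normalized rewards for steps ending at or after that token. The "gate" is a hypothetical stand-in: PRM step scores are kept only for responses a verifier accepts. The function name `verifier_gated_step_advantages` and its parameters are illustrative, not VeriGate's API.

```python
import statistics
from typing import List


def verifier_gated_step_advantages(
    step_scores: List[List[float]],  # PRM score for each step of each response in the group
    step_ends: List[List[int]],      # index of the last token (inclusive) of each step
    num_tokens: List[int],           # token length of each response
    verified: List[bool],            # hypothetical verifier verdict per response
    eps: float = 1e-6,
) -> List[List[float]]:
    """Sketch of group-normalized, cumulative step-level token advantages.

    Assumption: the gate zeroes PRM step scores for responses the verifier rejects.
    """
    # Hypothetical verifier gate: keep step scores only for accepted responses.
    gated = [
        [s if ok else 0.0 for s in scores]
        for scores, ok in zip(step_scores, verified)
    ]

    # Normalize every step reward against the mean and std of all steps in the group.
    flat = [s for scores in gated for s in scores]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    normed = [[(s - mean) / (std + eps) for s in scores] for scores in gated]

    advantages = []
    for scores, ends, n in zip(normed, step_ends, num_tokens):
        # Cumulative (reward-to-go) advantage: token t receives the sum of
        # normalized rewards of all steps whose last token is at or after t.
        adv = [sum(r for r, e in zip(scores, ends) if e >= t) for t in range(n)]
        advantages.append(adv)
    return advantages


if __name__ == "__main__":
    # Two sampled responses of 4 tokens each, with two reasoning steps apiece.
    adv = verifier_gated_step_advantages(
        step_scores=[[0.8, 0.6], [0.9, 0.2]],
        step_ends=[[1, 3], [2, 3]],
        num_tokens=[4, 4],
        verified=[True, False],
    )
    print(adv)
```

Summing step rewards from each token onward gives early tokens credit for every later step they lead into. Normalizing over the whole group keeps advantages zero-centred across sampled responses, as in outcome-level GRPO.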

citing papers explorer


VeriGate: Verifier-Gated Step-Level Supervision for GRPO cs.LG · 2026-05-28 · unverdicted · none · ref 30