arXiv preprint arXiv:2504.00891

GenPRM: Scaling test-time compute of process reward models via generative reasoning · 2025 · arXiv 2504.00891

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Unsupervised Process Reward Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

OpenClaw-RL: Train Any Agent Simply by Talking

cs.CL · 2026-03-10 · unverdicted · novelty 6.0

OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

cs.LG · 2025-09-03 · unverdicted · novelty 6.0

PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

cs.CL · 2026-04-27

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

cs.CL · 2026-04-17

VRPRM: Process Reward Modeling via Visual Reasoning

cs.LG · 2025-08-05

citing papers explorer

Unsupervised Process Reward Models · cs.LG · 2026-05-11 · unverdicted · none · ref 33
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning · cs.LG · 2026-04-08 · unverdicted · none · ref 180
OpenClaw-RL: Train Any Agent Simply by Talking · cs.CL · 2026-03-10 · unverdicted · none · ref 3
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training · cs.LG · 2025-09-03 · unverdicted · none · ref 41
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR · cs.CL · 2025-07-21 · unverdicted · none · ref 54
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning · cs.CL · 2025-07-21 · unverdicted · none · ref 26
