{"total":13,"items":[{"citing_arxiv_id":"2606.09078","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-08T06:22:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08231","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-06T15:39:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01249","ref_index":196,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trust Region On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-31T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25381","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not only where, But when: Temporal Scheduling for RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-25T03:10:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10158","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unsupervised Process Reward Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"addition to these approaches, majority voting is a reward-model-free method that selects the most frequent answer. One major concern with PRMs in TTS is the effective use of the assigned rewards to select the final response. Current selection methods do not achieve similar performance to the pass@N metric, where a single-correct answer is sufficient, and have led to recent exploration on improving PRMs [33, 22, 34, 35]. In our work, we observe that uPRM performs on par with existing supervised counterparts despite being fully unsupervised. 
Reinforcement Learning with Process Reward Models. RL has been widely adopted to incentivize reasoning abilities in LLMs, particularly to solve mathematical problems [4, 36]. Most popular frameworks assign a sparse outcome reward for the entire response generated by the policy model."},{"citing_arxiv_id":"2604.24198","ref_index":77,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis","primary_cat":"cs.CL","submitted_at":"2026-04-27T09:00:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2025. SWIFT: A Scalable Lightweight Infrastructure for Fine-Tuning. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, Toby Walsh, Julie Shah, and Zico Kolter (Eds.). AAAI Press, 29733-29735. doi:10.1609/AAAI.V39I28.35383 [77] Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, and Weinan Zhang. 2025. A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models. CoRR abs/2510.08049 (2025). arXiv:2510.08049 doi:10.48550/ARXIV.2510.08049 [78] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao"},{"citing_arxiv_id":"2604.16029","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cut Your Losses! 
Learning to Prune Paths Early for Efficient Parallel Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-17T13:00:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02913","ref_index":180,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.10165","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenClaw-RL: Train Any Agent Simply by Talking","primary_cat":"cs.CL","submitted_at":"2026-03-10T18:59:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.03403","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Correctness: Harmonizing Process and Outcome Rewards through RL 
Training","primary_cat":"cs.LG","submitted_at":"2025-09-03T15:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.03556","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VRPRM: Process Reward Modeling via Visual Reasoning","primary_cat":"cs.LG","submitted_at":"2025-08-05T15:25:24+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15778","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR","primary_cat":"cs.CL","submitted_at":"2025-07-21T16:34:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15698","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical 
Reasoning","primary_cat":"cs.CL","submitted_at":"2025-07-21T15:07:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}