Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

Anurag Beniwal; Chenlu Ye; Hao Chen; Jing Huang; Narayanan Sadagopan; Tong Zhang; Zhou Yu; Ziji Zhang

arxiv: 2509.03403 · v2 · pith:QGWRCRHCnew · submitted 2025-09-03 · 💻 cs.LG · cs.AI

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal

keywords: process, outcome, reasoning, reward, rewards, prms, prof, strong

Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under distribution shift during RL training. We introduce PRocess cOnsistency Filter (PROF), a data curation method that uses PRM-ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs.
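
The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rollout` type, the single aggregated `prm_score` per response, the `keep_per_group` parameter, and the balancing rule (keeping an equal number of correct and incorrect responses) are all assumptions made for illustration, since the abstract does not specify how step-wise scores are aggregated or how the training ratio is balanced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str
    correct: bool      # outcome reward: did the final answer verify?
    prm_score: float   # hypothetical aggregate of step-wise PRM scores


def consistency_filter(rollouts: List[Rollout], keep_per_group: int) -> List[Rollout]:
    """Keep rollouts whose process score agrees with their outcome label."""
    # Correct responses, strongest process support first.
    correct = sorted((r for r in rollouts if r.correct),
                     key=lambda r: r.prm_score, reverse=True)
    # Incorrect responses, weakest process support first.
    incorrect = sorted((r for r in rollouts if not r.correct),
                       key=lambda r: r.prm_score)
    # Assumed balancing rule: keep the same number from each group.
    n = min(keep_per_group, len(correct), len(incorrect))
    return correct[:n] + incorrect[:n]


group = [
    Rollout("a", correct=True, prm_score=0.9),   # sound reasoning, right answer
    Rollout("b", correct=True, prm_score=0.2),   # flawed reasoning, lucky answer
    Rollout("c", correct=False, prm_score=0.8),  # mostly sound, wrong answer
    Rollout("d", correct=False, prm_score=0.1),  # flawed reasoning, wrong answer
]
kept = consistency_filter(group, keep_per_group=1)
print([r.text for r in kept])
```

In this toy group, the filter drops the lucky correct response `b` and the mostly sound incorrect response `c`, which are the cases where the outcome label and the process score disagree. The surviving samples would then be passed to an ordinary outcome-reward RL update, so the PRM influences which samples are trained on rather than the reward value itself.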

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
cs.CL · 2025-10 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
cs.LG · 2026-05 · unverdicted · novelty 6.0

DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
cs.LG · 2026-05 · unverdicted · novelty 6.0

DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
cs.LG · 2026-04 · unverdicted · novelty 5.0

PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

LLM Reasoning with Process Rewards for Outcome-Guided Steps
cs.LG · 2026-02 · unverdicted · novelty 5.0

PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.