Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Hailei Gong; Jianping Zhang; Shengjie Ma; Shenyang Tong; Teng Wang; Wenhan Yang; Yanan Zheng; Zewen Ye; Zeyu Li; Zhangyi Jiang

arxiv: 2503.13551 · v5 · submitted 2025-03-16 · 💻 cs.CL · cs.AI

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Teng Wang , Zhangyi Jiang , Zhenqi He , Shenyang Tong , Wenhan Yang , Yanan Zheng , Zeyu Li , Zifan He

show 4 more authors

Hailei Gong Zewen Ye Shengjie Ma Jianping Zhang

This is my paper

Pith reviewed 2026-05-22 23:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Hierarchical Reward ModelProcess Reward Modelreward hackingmulti-step reasoningdata augmentationMCTSLarge Language Modelsreasoning evaluation

0 comments

The pith

Hierarchical Reward Models assess both single steps and consecutive pairs to stabilize evaluations of LLM reasoning paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Process Reward Models often suffer from reward hacking, where they latch onto misleading intermediate signals, and they demand costly step-by-step human labels. The paper proposes the Hierarchical Reward Model to score both individual reasoning steps and pairs of consecutive steps, capturing local accuracy and longer-range coherence including cases where an error is later fixed by self-correction. It further introduces Hierarchical Node Compression, which merges adjacent steps inside Monte Carlo Tree Search trees to produce training data with added diversity and controlled noise at low cost. If these changes hold, reward signals become more reliable for guiding large language models through extended reasoning sequences without requiring proportionally more labeled data.

Core claim

The Hierarchical Reward Model jointly evaluates reasoning at the level of single steps and at the level of merged consecutive steps; when trained on trajectories augmented by Hierarchical Node Compression on MCTS search trees, it produces more stable and reliable step-wise assessments than standard Process Reward Models and shows strong generalization on MATH500 and GSM8K.

What carries the argument

The Hierarchical Reward Model that scores both fine-grained individual steps and coarse-grained consecutive step pairs, together with Hierarchical Node Compression that merges adjacent nodes in reasoning trees to generate augmented training data.

If this is right

HRM can correctly penalize paths that contain early errors even when later steps reflect self-correction.
HNC lowers the annotation burden by turning existing MCTS trajectories into multiple training examples without new human labels.
The resulting models exhibit lower variance and higher reliability than PRM when used to select among candidate reasoning chains.
Performance gains transfer across different math reasoning distributions including GSM8K and MATH500.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical scoring could be applied to code-generation or scientific reasoning traces where intermediate correctness is hard to judge.
Controlled noise from step merging may prove useful in other reward-modeling settings that currently overfit to surface patterns.
If HNC scales, reward models could be trained on orders of magnitude more synthetic trajectories while keeping labeling cost fixed.

Load-bearing premise

Merging consecutive reasoning steps via Hierarchical Node Compression produces training data whose controlled noise improves the model's ability to detect flawed multi-step paths rather than harming it.

What would settle it

On the PRM800K dataset, the Hierarchical Reward Model trained with Hierarchical Node Compression shows higher variance across repeated evaluations or lower accuracy at ranking correct versus flawed paths than a baseline Process Reward Model.

read the original abstract

Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HRM with HNC claims stabler evaluations than PRM but the abstract gives no numbers, baselines, or ablations, leaving the central claims unevaluable.

read the letter

The one or two things worth knowing right away are that this work proposes a Hierarchical Reward Model that scores both individual steps and groups of consecutive steps, paired with a Hierarchical Node Compression trick that merges steps in MCTS trajectories to create training examples with some built-in noise. The authors say this combo beats plain PRMs for stability and generalizes across datasets. The new element is really the multi-level scoring plus the specific way they augment data by compression rather than just generating more trajectories. It tries to solve the high cost of labeling every step and the problem of reward models failing when a bad step is followed by a correction. They lay out the motivation clearly enough, pointing to reward hacking and annotation expense as the core issues. The hierarchical angle is a reasonable response to needing to judge overall coherence in longer chains. The soft spots are in the evidence. The abstract reports better performance on PRM800K for stability and on MATH500 plus GSM8K for generalization, but supplies zero implementation details, no baseline descriptions, no statistical significance, and no ablation that isolates the effect of the compression. The assumption that the merged nodes create controlled noise that helps rather than obscures flaw detection is left unexamined. That makes the main claim hard to trust based on what's written. This is the kind of paper that might interest the small group working on process supervision for LLM reasoning. A reader could get an idea for their own hierarchical setup, but without the experimental backing it is not something I would cite or build on yet. I would not send it to peer review in this form. The work needs the full results and controls before it is ready for serious refereeing.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a Hierarchical Reward Model (HRM) that assesses both individual and consecutive reasoning steps at fine-grained and coarse-grained levels to mitigate reward hacking in Process Reward Models (PRMs). It further introduces Hierarchical Node Compression (HNC), a data-augmentation method that merges consecutive steps within MCTS-generated trajectories to reduce annotation costs while adding controlled noise. The central empirical claim is that HRM combined with HNC yields more stable and reliable evaluations than PRM on PRM800K and exhibits strong cross-domain generalization on MATH500 and GSM8K.

Significance. If the empirical superiority and generalization claims hold after proper validation, the work could meaningfully improve the reliability of process-level supervision for LLM reasoning by addressing reward hacking and lowering data-generation costs. The hierarchical evaluation of coherence across corrected flawed steps is a potentially useful direction, but the current manuscript supplies no quantitative results, implementation details, or ablations, so the practical significance cannot yet be assessed.

major comments (2)

[Abstract] Abstract: The claim that 'HRM, together with HNC, provides more stable and reliable evaluations than PRM' on PRM800K (and strong generalization on MATH500/GSM8K) is presented without any reported metrics, baseline comparisons, statistical tests, ablation results, or implementation details, rendering the central empirical contribution impossible to evaluate.
[Abstract] Abstract: The weakest assumption—that merging consecutive steps via HNC produces training data whose 'controlled noise' improves rather than harms the model's ability to localize the first error—is unsupported. Merging a flawed step with a subsequent correct step can create an ambiguous node label; the manuscript provides no analysis, ablation, or evidence showing that this operation enhances stability instead of diluting error signals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that 'HRM, together with HNC, provides more stable and reliable evaluations than PRM' on PRM800K (and strong generalization on MATH500/GSM8K) is presented without any reported metrics, baseline comparisons, statistical tests, ablation results, or implementation details, rendering the central empirical contribution impossible to evaluate.

Authors: The abstract provides a high-level summary. The full manuscript reports quantitative results in Section 4 and the appendix, including accuracy and stability metrics on PRM800K with baseline comparisons to standard PRMs, cross-domain results on MATH500 and GSM8K, ablations on hierarchical components, and implementation details for HRM and HNC. We will revise the abstract to reference these specific metrics and point to the relevant sections/tables for easier evaluation. revision: yes
Referee: [Abstract] Abstract: The weakest assumption—that merging consecutive steps via HNC produces training data whose 'controlled noise' improves rather than harms the model's ability to localize the first error—is unsupported. Merging a flawed step with a subsequent correct step can create an ambiguous node label; the manuscript provides no analysis, ablation, or evidence showing that this operation enhances stability instead of diluting error signals.

Authors: We acknowledge the concern that HNC could introduce label ambiguity when merging flawed and correct steps. The manuscript presents HNC as adding controlled noise via MCTS trajectories to improve robustness, supported by the overall empirical gains in stability and generalization. However, a targeted ablation isolating this merging effect is not explicitly detailed. We will add such an analysis in the revision to directly demonstrate the net benefit for error localization. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external dataset comparisons

full rationale

The paper introduces HRM and HNC as new modeling and data-augmentation techniques for reward modeling in LLM reasoning. Its central claims are supported solely by reported empirical performance on the external PRM800K, MATH500, and GSM8K benchmarks, with no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations described. Because the evaluation metrics and datasets are independent of the method definition itself, the derivation chain contains no self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5802 in / 1096 out tokens · 29634 ms · 2026-05-22T23:39:04.221003+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Reasoning with Process Rewards for Outcome-Guided Steps
cs.LG 2026-02 unverdicted novelty 5.0

PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.