Process Reward Agents for Steering Knowledge-Intensive Reasoning

Jiwoong Sohn; Kenneth Styppa; Michael Moor; Tomasz Sternal; Torsten Hoefler

arxiv: 2604.09482 · v1 · submitted 2026-04-10 · 💻 cs.AI

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords: process reward agents, knowledge-intensive reasoning, medical question answering, test-time steering, frozen policy models, retrieval-augmented rewards, step-wise search decoding

The pith

Process Reward Agents supply online step-wise rewards from external knowledge to steer reasoning in frozen language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Process Reward Agents as a test-time technique that evaluates partial reasoning traces on the fly by retrieving and synthesizing clues from large external knowledge bases. This addresses the problem that intermediate steps in knowledge-intensive domains like medicine cannot be verified locally and errors propagate undetected. By enabling search-based decoding that ranks and prunes trajectories at every generation step, PRA improves performance on medical benchmarks and works with any frozen policy model without retraining it. A reader would care because the approach decouples the core reasoner from domain-specific evaluation, allowing smaller or newer models to tackle complex tasks more effectively.

Core claim

Process Reward Agents (PRA) provide domain-grounded, online, step-wise rewards to a frozen policy by using retrieval-augmented agents that score candidate trajectories at each generation step, allowing search methods to prune incorrect paths dynamically. Unlike post-hoc process reward models, PRA integrates directly into inference to prevent error propagation in knowledge-intensive reasoning. Experiments show it reaches 80.8% accuracy on MedQA with a 4B model and improves unseen frozen policies from 0.5B to 8B parameters by up to 25.7% without any updates to the policy itself.

What carries the argument

Process Reward Agents, external modules that retrieve relevant knowledge at each step and compute process rewards to rank and prune partial trajectories during generation for an unmodified policy model.
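As a rough illustration of how step-wise rewards can rank and prune partial trajectories, here is a minimal sketch of reward-guided beam search. It is not the authors' implementation: `propose`, `score`, and `finished` are hypothetical stand-ins for the frozen policy, the reward agent, and a stopping check, and ranking by mean step reward is an assumption made for this example.

```python
from typing import Callable, List, Tuple

Steps = Tuple[str, ...]  # one partial reasoning trace


def mean_reward(total: float, steps: Steps) -> float:
    # Rank by mean step reward so short finished traces are not penalized.
    return total / max(len(steps), 1)


def guided_beam_search(
    question: str,
    propose: Callable[[str, Steps], List[str]],  # hypothetical frozen policy
    score: Callable[[str, Steps], float],        # hypothetical reward agent
    finished: Callable[[Steps], bool],           # hypothetical stopping check
    beam_width: int = 2,
    max_steps: int = 8,
) -> Steps:
    beam: List[Tuple[float, Steps]] = [(0.0, ())]
    for _ in range(max_steps):
        candidates: List[Tuple[float, Steps]] = []
        for total, steps in beam:
            if finished(steps):
                candidates.append((total, steps))  # carry finished traces forward
                continue
            for step in propose(question, steps):
                extended = steps + (step,)
                # The reward is computed online, on the partial trace.
                candidates.append((total + score(question, extended), extended))
        if not candidates:
            break
        # Rank, then prune to the beam width before generating further.
        candidates.sort(key=lambda c: mean_reward(*c), reverse=True)
        beam = candidates[:beam_width]
        if all(finished(steps) for _, steps in beam):
            break
    return beam[0][1]


# Toy stand-ins, invented purely for illustration.
def toy_propose(question: str, steps: Steps) -> List[str]:
    if not steps:
        return ["claim supported by evidence", "claim contradicted by evidence"]
    return ["answer: A"] if "supported" in steps[0] else ["answer: B"]


def toy_score(question: str, steps: Steps) -> float:
    return 0.0 if "contradicted" in steps[-1] else 1.0


def toy_finished(steps: Steps) -> bool:
    return bool(steps) and steps[-1].startswith("answer:")


best = guided_beam_search("toy question", toy_propose, toy_score, toy_finished, beam_width=1)
```

With `beam_width=1`, the low-reward first step is discarded before the policy spends any tokens completing it, which is the early-pruning behavior the summary describes.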

If this is right

A frozen 4B model guided by PRA reaches a new state of the art at its scale on MedQA.
The same reward agents improve accuracy across a wide range of unseen policy sizes without any policy retraining.
Search-based decoding becomes practical for knowledge-intensive tasks because bad paths can be discarded before full generation.
New model backbones can be deployed in specialized domains by swapping in appropriate reward agents rather than fine-tuning the entire model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

PRA suggests a modular design where domain reward agents can be updated independently of the underlying language model.
The same online retrieval-reward pattern could apply to other fields with large external corpora, such as legal or scientific reasoning.
Combining PRA with occasional light policy fine-tuning might yield further gains while still avoiding full retraining for each domain.

Load-bearing premise

Retrieval from external knowledge sources yields accurate, low-latency rewards for every partial reasoning step without adding undetected errors that would cancel the benefits of early pruning.

What would settle it

A controlled run on MedQA where PRA-guided search produces lower accuracy than strong baselines or where the added retrieval latency exceeds the gains from pruning.

Figures

Figures reproduced from arXiv: 2604.09482 by Jiwoong Sohn, Kenneth Styppa, Michael Moor, Tomasz Sternal, Torsten Hoefler.

**Figure 2.** Performance on MedQA under inference-time scaling. PRA continues to benefit from additional compute, while Self-Consistency (SC) saturates quickly. For SC, we estimate per-question expected accuracy via Monte Carlo sampling (1,000 trials); shaded regions show ±1 SE computed via bootstrap resampling over questions.

Generalization to Unseen Datasets. PRA demonstrates strong generalization to medical reasoning ben…
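The estimation procedure in the Figure 2 caption can be sketched as follows. This is a guess at the mechanics, not the authors' code: the function names are hypothetical, votes are drawn with replacement from a pool of sampled answers, and ties go to the answer seen first. All three choices are assumptions.

```python
import random
import statistics
from collections import Counter
from typing import List, Sequence


def sc_expected_accuracy(
    pool: Sequence[str], gold: str, k: int, trials: int = 1000, seed: int = 0
) -> float:
    """Monte Carlo estimate of majority-vote accuracy for one question."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = rng.choices(pool, k=k)  # assumption: sample with replacement
        majority = Counter(votes).most_common(1)[0][0]  # ties: first seen wins
        hits += int(majority == gold)
    return hits / trials


def bootstrap_se(per_question: Sequence[float], n_boot: int = 1000, seed: int = 0) -> float:
    """Standard error of mean accuracy, resampling questions with replacement."""
    rng = random.Random(seed)
    n = len(per_question)
    means: List[float] = [
        statistics.fmean(rng.choices(per_question, k=n)) for _ in range(n_boot)
    ]
    return statistics.stdev(means)
```

A curve like the one in the figure would call `sc_expected_accuracy` once per question and per sample budget `k`, average over questions, and use `bootstrap_se` for the shaded band.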

**Figure 3.** Search–accuracy trade-off on MedQA. We sweep the search threshold and report accuracy versus search frequency; the Pareto frontier highlights the best operating points for a given search budget.

5.3 Analysis on Margin Shift. We analyze how margin shift varies across reasoning traces on MedQA. Specifically, we compute ∆m, which quantifies how the inclusion of retrieved evidence changes the teacher model’s …
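For the threshold sweep in the Figure 3 caption, the Pareto frontier can be extracted with a single sorted pass. The sketch below assumes each operating point is a `(search_frequency, accuracy)` pair, where lower frequency and higher accuracy are both preferred. The numbers are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (search_frequency, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points for which no other point searches no more often and scores higher."""
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Sort by frequency ascending; on ties, put the most accurate point first.
    for freq, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than every cheaper point
            frontier.append((freq, acc))
            best_acc = acc
    return frontier


# Hypothetical sweep results, invented for this example.
sweep = [(0.1, 0.70), (0.2, 0.68), (0.3, 0.75), (0.3, 0.74), (0.5, 0.75)]
frontier = pareto_frontier(sweep)
```

Here `frontier` keeps only `(0.1, 0.70)` and `(0.3, 0.75)`: every other point either searches more often for no gain or is less accurate at the same search budget.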

**Figure 4.** Margin shift across step positions in reason…

**Figure 5.** Mean absolute margin shift over reasoning…

**Figure 6.** Simplified pseudocode for PRA-guided beam search. Each question is managed by a Trace object that maintains a beam of partial reasoning traces and a stage tag ∈ {reason, reward, search, done}. At each iteration, the global queue is drained, traces are partitioned by stage, and each partition is dispatched as a single batched operation to π, µϕ, or ρ. PRA Beam Search Pseudocode: class Trace: stage: …
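Since the pseudocode in Figure 6 is truncated here, the following is a hedged reconstruction of only the scheduling idea the caption describes: drain a global queue, partition traces by stage, and make one batched call per stage. Everything beyond the `Trace` name and the four stage tags is an assumption, including the `docs` field, the handler names, the stage transitions, and the mapping of handlers to π, µϕ, and ρ.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Trace:
    question: str
    beam: List[List[str]] = field(default_factory=lambda: [[]])  # partial traces
    docs: List[str] = field(default_factory=list)                # hypothetical field
    stage: str = "reason"  # one of: reason, reward, search, done


def drain_and_dispatch(
    traces: List[Trace],
    handlers: Dict[str, Callable[[List[Trace]], None]],
    max_iters: int = 50,
) -> List[Tuple[str, int]]:
    queue: Deque[Trace] = deque(traces)
    batch_log: List[Tuple[str, int]] = []
    for _ in range(max_iters):
        if not queue:
            break
        # Drain the global queue and partition traces by stage.
        buckets: Dict[str, List[Trace]] = defaultdict(list)
        while queue:
            trace = queue.popleft()
            buckets[trace.stage].append(trace)
        # One batched call per stage; handlers set each trace's next stage.
        for stage, batch in buckets.items():
            if stage == "done":
                continue
            handlers[stage](batch)
            batch_log.append((stage, len(batch)))
            queue.extend(t for t in batch if t.stage != "done")
    return batch_log


# Toy handlers standing in for the policy, reward model, and retriever.
def toy_reason(batch: List[Trace]) -> None:
    for t in batch:
        t.beam = [steps + [f"step {len(steps) + 1}"] for steps in t.beam]
        t.stage = "reward"


def toy_reward(batch: List[Trace]) -> None:
    for t in batch:
        if not t.docs:
            t.stage = "search"  # pretend the reward step asked for evidence
        elif len(t.beam[0]) >= 2:
            t.stage = "done"
        else:
            t.stage = "reason"


def toy_search(batch: List[Trace]) -> None:
    for t in batch:
        t.docs.append("hypothetical retrieved passage")
        t.stage = "reward"


HANDLERS = {"reason": toy_reason, "reward": toy_reward, "search": toy_search}
traces = [Trace(q) for q in ("q1", "q2", "q3")]
log = drain_and_dispatch(traces, HANDLERS)
```

The point of the design is that questions at different stages never block each other: every iteration issues at most one call per stage, each covering all traces currently waiting on that stage.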

**Figure 7.** Policy prompt used for all PRA experiments. The prompt instructs explicit step-wise reasoning for…

**Figure 8.** Teacher prompt used for all PRA experiments. The prompt evaluates the last reasoning step given…

**Figure 9.** PRA prompt used in all experiments. The documents section appears only when search is triggered…

Original abstract

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRA shows a workable online way to steer frozen models with retrieval-based step rewards on medical reasoning, but the gains depend on reward accuracy that still needs checking.

The main point is that Process Reward Agents attach online, step-wise rewards drawn from external knowledge to any frozen model during search-based decoding, which improves accuracy on medical reasoning tasks. The work stands out for moving process rewards from post-generation scoring to active pruning of trajectories mid-generation, so errors can be caught earlier in knowledge-intensive chains where steps are not locally verifiable. The authors report consistent outperformance over baselines: 80.8% accuracy on MedQA with Qwen3-4B, a new high at the 4B scale, and gains of up to 25.7% on frozen models from 0.5B to 8B parameters without policy updates. The generalization across model sizes is a clear positive, as it supports swapping in new backbones while keeping the domain-specific reward module fixed.

The soft spots center on the online reward mechanism. Computing retrieval-augmented rewards at every step requires the external sources to deliver accurate, non-conflicting information quickly. Without reported retrieval error rates, synthesis quality, or per-step overhead, undetected errors in the rewards could undermine the claimed benefits. Judging from the abstract alone, the stress-test concern about reward reliability stands. The paper is empirical rather than theoretical, with no load-bearing assumptions that reduce to fitted parameters.

The audience is people working on test-time scaling or retrieval-augmented reasoning in specialized domains; readers who need to apply new models to complex tasks without retraining will get value from the approach. The results are specific enough to merit a serious referee. I would recommend peer review, with attention to the reward computation pipeline and any efficiency measurements.

Referee Report

3 major / 2 minor

Summary. The paper introduces Process Reward Agents (PRA), a test-time method that supplies online, step-wise, retrieval-augmented rewards to frozen policy models for search-based decoding in knowledge-intensive reasoning. Unlike prior post-hoc process reward models, PRA ranks and prunes trajectories at every generation step. On medical benchmarks it reports 80.8% accuracy on MedQA with Qwen3-4B (new SOTA at 4B scale) and up to 25.7% accuracy gains on unseen frozen policies ranging from 0.5B to 8B parameters, without any policy updates.

Significance. If the online reward mechanism proves reliable and low-latency, the work would meaningfully advance inference-time scaling for domains where intermediate steps require external knowledge synthesis. The decoupling of policy and reward modules is a practical strength that could allow rapid deployment of new backbones without retraining.

major comments (3)

[Methods (PRA reward computation)] Methods section describing PRA reward computation: no quantitative bounds or error analysis are given for retrieval accuracy, conflicting passages, or hallucinated synthesis at each generation step. This directly affects attribution of the headline 80.8% MedQA accuracy and 25.7% generalization gains, as undetected reward errors could still propagate.
[Experiments and Results] Experiments section (results tables): generalization claims across 0.5B–8B models and the MedQA SOTA number are presented without reported variance, number of runs, statistical tests, or explicit data-split details. This is load-bearing for the robustness of the central empirical claims.
[Abstract and Introduction] Abstract and §1: the claim that PRA 'enables search-based decoding to rank and prune candidate trajectories at every generation step' is not supported by any latency or throughput measurements, leaving open whether the per-step overhead negates the reported accuracy benefits.

minor comments (2)

[Related Work] A comparison table contrasting PRA with prior post-hoc retrieval-augmented PRMs would improve clarity on the online vs. offline distinction.
[Figures] Figure captions and axis labels in the architecture diagram could more explicitly annotate the retrieval and synthesis modules.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript to strengthen the presentation of the PRA method and its empirical results.

Referee: Methods section describing PRA reward computation: no quantitative bounds or error analysis are given for retrieval accuracy, conflicting passages, or hallucinated synthesis at each generation step. This directly affects attribution of the headline 80.8% MedQA accuracy and 25.7% generalization gains, as undetected reward errors could still propagate.

Authors: We agree that the absence of quantitative error analysis for the retrieval-augmented reward computation limits the ability to fully attribute performance gains. The current manuscript describes the PRA mechanism but does not report retrieval precision, recall on conflicting passages, or rates of hallucinated synthesis. In the revised version, we will add a dedicated subsection under Methods that includes: (1) retrieval accuracy metrics evaluated on a held-out set of medical queries, (2) analysis of how conflicting passages are handled during synthesis, and (3) manual inspection of a sample of generated rewards for hallucination. These additions will clarify the reliability of the online rewards and better support the reported accuracy improvements.

Revision: yes

Referee: Experiments section (results tables): generalization claims across 0.5B–8B models and the MedQA SOTA number are presented without reported variance, number of runs, statistical tests, or explicit data-split details. This is load-bearing for the robustness of the central empirical claims.

Authors: The referee correctly notes that the experimental results lack variance estimates, run counts, statistical tests, and explicit data-split information. This omission weakens the robustness claims for the 80.8% MedQA result and the generalization across model sizes. We will revise the Experiments section and associated tables to report: results averaged over at least five independent runs with different random seeds, standard deviations, paired statistical significance tests against baselines, and precise descriptions of the train/validation/test splits used for each benchmark. These details will be added to both the main text and the appendix.

Revision: yes

Referee: Abstract and §1: the claim that PRA 'enables search-based decoding to rank and prune candidate trajectories at every generation step' is not supported by any latency or throughput measurements, leaving open whether the per-step overhead negates the reported accuracy benefits.

Authors: The core claim in the abstract and introduction is that PRA enables online, step-wise ranking and pruning within search-based decoding, which follows directly from the method's design of providing rewards at each generation step rather than post hoc. However, we acknowledge that without empirical latency or throughput data, it remains unclear whether the per-step overhead offsets the accuracy gains in practice. To address this concern, we will add latency and throughput measurements in the Experiments section (with details in the appendix), comparing PRA-augmented decoding against standard beam search and other baselines on the same hardware. This will quantify the overhead and demonstrate that the accuracy improvements justify the additional computation.

Revision: yes
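
The second response promises standard deviations over seeds and paired significance tests but does not name a test. One common choice for paired per-question correctness is an exact McNemar test. The sketch below uses it purely as an illustrative assumption, with hypothetical function names and no data from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over runs (needs at least two runs)."""
    return statistics.fmean(accuracies), statistics.stdev(accuracies)


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two systems on the same questions."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    # Only questions where the two systems disagree carry information.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability at the smaller discordant count.
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

For instance, if system A is right on five questions that system B misses and B never wins, `mcnemar_exact` returns 2 × (1/32) = 0.0625.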

Circularity Check

0 steps flagged

No circularity: purely empirical method with no derivation chain

full rationale

The paper introduces Process Reward Agents as a test-time empirical technique for online step-wise rewards during search-based decoding on knowledge-intensive tasks. Performance claims (e.g., 80.8% MedQA accuracy, up to 25.7% gains on unseen models) rest on benchmark experiments rather than any closed-form derivation, equations, or first-principles results. No steps reduce predictions to fitted inputs by construction, no self-definitional loops, no load-bearing self-citations for uniqueness theorems, and no ansatzes smuggled via prior work. The method is self-contained as an engineering contribution decoupling frozen policies from domain rewards.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that external knowledge retrieval can supply accurate step-level supervision in real time; no free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: External knowledge sources can be queried to produce reliable, non-local step-wise correctness signals during generation.
The method depends on this to enable online rewards without local verifiability.

invented entities (1)

Process Reward Agent (no independent evidence)
purpose: Online, domain-grounded reward provider that steers frozen policies at test time.
New agent abstraction introduced to decouple reward logic from the policy.
