Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Haohan Wang; Kaidi Xu; Peng Kuang; Xiaoyu Han; Yanli Wang; Yaowenqi Liu

arxiv: 2510.13918 · v2 · submitted 2025-10-15 · 💻 cs.CL

Peng Kuang, Yanli Wang, Xiaoyu Han, Yaowenqi Liu, Kaidi Xu, Haohan Wang

Pith reviewed 2026-05-18 07:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords test-time scaling · process reward models · LLM · weighted aggregation · calibration · majority voting · signal combination · efficiency

The pith

Calibrating weights to combine LLM and PRM signals improves test-time scaling while using far less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework showing that the best way to use process reward model verification alongside LLM outputs is through a weighted aggregation of candidate responses. It demonstrates that the weights capturing the interplay between the two signals vary by model pair and often need to be negative to correct for biases. Efficient pre-computation methods are introduced to estimate these weights without repeated heavy computation. Experiments across five LLMs and seven PRMs confirm that the calibrated approach outperforms standard weighted majority voting while consuming only 21.3 percent of the computation. A sympathetic reader cares because the work indicates that refining how existing signals are merged can deliver gains without simply running more test-time samples.

Core claim

We develop a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only 21.3% of the computation.

What carries the argument

Optimal weighted aggregation of responses, where weights for LLM and PRM signals are calibrated via pre-computation to account for their specific interplay including negative values.
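
To make the aggregation rule concrete, here is a minimal Python sketch; it is not the paper's implementation. It sums a per-response weight for each distinct final answer and returns the answer with the largest total. The toy answers and PRM scores are invented, and using the logit of the PRM score as the "calibrated" weight is only an illustrative stand-in for the paper's calibrated weighting function. The point is that weights are allowed to be negative.

```python
import math
from collections import defaultdict

def aggregate(answers, weights):
    """Sum the weight of every response per distinct answer; return (winner, totals)."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

def logit(p, eps=1e-6):
    """Log-odds of a score in [0, 1], clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

# Hypothetical toy data: five sampled responses and their PRM scores.
answers = ["42", "42", "42", "42", "17"]
prm_scores = [0.30, 0.30, 0.25, 0.30, 0.90]

# Plain majority voting: every response counts 1.
majority, _ = aggregate(answers, [1.0] * len(answers))

# Vanilla weighted majority voting: weight = raw PRM score (always >= 0).
vanilla, _ = aggregate(answers, prm_scores)

# Illustrative signed weights (assumption): logit of the PRM score, so that
# low-scored responses count *against* their answer.
signed_weights = [logit(s) for s in prm_scores]
calibrated, totals = aggregate(answers, signed_weights)

print(majority, vanilla, calibrated)
```

In this toy case both majority voting and score-weighted voting pick "42", because non-negative weights can only add support. With signed weights the four low-scored votes subtract from "42", and the single high-scored response wins.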

If this is right

Optimal weights for the aggregation vary significantly across different LLM-PRM pairs.
Negative weights frequently improve results by correcting biases present in the individual signals.
Pre-computation of the weights allows the method to achieve gains with only 21.3 percent of the computation required by vanilla weighted majority voting.
Better aggregation strategies offer a path to performance improvements that does not rely on increasing the volume of test-time computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The calibration technique might extend naturally to other verification signals or ensembles of reward models.
Negative weights could be inspected to reveal systematic error patterns in either the LLM or the PRM.
Dynamic re-calibration during inference on new domains could further reduce reliance on static pre-computation.

Load-bearing premise

Optimal weights that capture the interplay between LLM and PRM signals exist and can be accurately estimated by pre-computation methods that generalize beyond the calibration data.

What would settle it

Apply the pre-computed weights from one set of LLM-PRM pairs and calibration problems to an entirely new benchmark or model pair and observe whether performance still exceeds vanilla weighted majority voting.
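
The transfer test described above can be sketched as a small evaluation harness. This is a hedged illustration: the problem format (`answers`, `scores`, `gold`), the function names, and the stand-in weight function `s - 0.5` are all hypothetical and not taken from the paper. The harness reports the accuracy of fixed, pre-computed weights minus the accuracy of vanilla weighted majority voting on held-out problems.

```python
from collections import defaultdict

def vote(answers, weights):
    """Return the answer with the largest summed weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)

def accuracy(problems, weight_fn):
    """Fraction of problems whose aggregated answer equals the gold answer."""
    hits = 0
    for problem in problems:
        weights = [weight_fn(s) for s in problem["scores"]]
        hits += vote(problem["answers"], weights) == problem["gold"]
    return hits / len(problems)

def transfer_gap(heldout_problems, calibrated_fn, vanilla_fn=lambda s: s):
    """Calibrated accuracy minus vanilla accuracy on problems unseen during calibration."""
    return accuracy(heldout_problems, calibrated_fn) - accuracy(heldout_problems, vanilla_fn)

# Hypothetical held-out problems (invented for illustration only).
heldout = [
    {"answers": ["a", "a", "a", "b"], "scores": [0.30, 0.30, 0.35, 0.90], "gold": "b"},
    {"answers": ["x", "x", "y"], "scores": [0.80, 0.70, 0.20], "gold": "x"},
]

# Stand-in for weights frozen after calibration on a different benchmark or model pair.
frozen_weight_fn = lambda s: s - 0.5

gap = transfer_gap(heldout, frozen_weight_fn)
print(f"accuracy gap on held-out data: {gap:+.2f}")
```

A positive gap on a genuinely new benchmark or model pair would support the transfer claim; a gap near zero or below would count against it.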

Original abstract

Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows pre-calibrated weights for mixing LLM and PRM signals can beat simple voting at much lower compute, but the weights vary by pair and may not transfer if test data differs from calibration.

The main point is that pre-computing weights to combine LLM outputs with PRM scores, including negative weights, can deliver better test-time scaling efficiency than majority voting or basic PRM selection, using only 21.3% of the compute in their tests. They back this with a theoretical framework for optimal weighted aggregation that captures model interplay, then show the weights differ a lot across pairs and often go negative. From there they introduce efficient pre-computation for calibration and run experiments on five LLMs paired with seven PRMs, reporting clear gains over vanilla weighted voting. That experimental breadth and the concrete efficiency number are the strongest parts. The framework itself is new in how it derives the need for these specific weights rather than defaulting to selection or voting. The soft spot is generalization. Since the weights shift significantly by LLM-PRM pair, the fixed values from calibration could lose their edge or even hurt performance if the test distribution has different response qualities or error patterns than the calibration set. That risk is real given how the paper itself highlights the variation. The math and data look solid enough on the surface for the claims made, with standard citations for the TTS and PRM area. This is aimed at people building practical test-time scaling systems who want smarter aggregation instead of just more sampling. A reader focused on efficient inference would get usable ideas from the experiments. It deserves a serious referee because the core idea addresses a real gap and the multi-model results give something concrete to evaluate.

Referee Report

3 major / 2 minor

Summary. The paper develops a theoretical framework showing that optimal aggregation of LLM and PRM signals for test-time scaling is achieved via weighted response aggregation whose weights capture model interplay and can be negative. It proposes efficient pre-computation methods to calibrate these weights for each LLM-PRM pair and reports that the resulting method improves TTS efficiency over vanilla weighted majority voting across 5 LLMs and 7 PRMs while using only 21.3% of the computation.

Significance. If the results hold, the work would usefully redirect attention in TTS research from simply increasing test-time compute toward more intelligent signal aggregation. The breadth of the empirical evaluation (5 LLMs, 7 PRMs) and the concrete efficiency number supply a practical benchmark, while the observation that optimal weights are frequently negative offers a clear conceptual contribution.

major comments (3)

[Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.
[Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.
[§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.
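
The disjointness requirement in the third comment can be checked mechanically. The sketch below is one simple way to do it, assuming problems are available as plain strings; the helper names and the sample problems are hypothetical. It only catches near-verbatim duplicates (same text up to case and whitespace), so paraphrased or templated overlaps would need a stronger check.

```python
import hashlib

def fingerprint(problem_text):
    """Hash a case- and whitespace-normalized problem statement."""
    normalized = " ".join(problem_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def overlap(calibration_problems, test_problems):
    """Return the test problems whose normalized text also occurs in the calibration set."""
    seen = {fingerprint(p) for p in calibration_problems}
    return [p for p in test_problems if fingerprint(p) in seen]

# Hypothetical data for illustration.
calibration = ["What is 2 + 2?", "Solve x^2 = 9."]
test = ["what is  2 + 2?", "Find the derivative of x^3."]

leaked = overlap(calibration, test)
print(f"{len(leaked)} overlapping problem(s): {leaked}")
```

An empty result is necessary but not sufficient for a clean split, since the two sets can still share sources or problem templates.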

minor comments (2)

[Abstract and §2] The abstract and introduction refer to 'vanilla weighted majority voting' without a precise definition or citation to the exact baseline implementation; this should be clarified in §2 or §4.
[Notation and calibration section] Notation for the weighting functions is introduced as free parameters but their functional form and the exact pre-computation procedure are not summarized in a single equation or algorithm box; adding this would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive feedback, which has helped us strengthen the clarity and rigor of our theoretical and empirical contributions. We address each major comment in detail below, providing clarifications and indicating the revisions made to the manuscript.

Referee: [Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.

Authors: We agree that explicit statements of the assumptions and objective are necessary for verifying the optimality claim. The framework assumes that the LLM's response correctness and the PRM's reward signal are conditionally independent given the true answer, and we derive the optimal weights by minimizing the expected 0-1 loss (i.e., probability of selecting an incorrect response). We have revised the Theoretical Framework section to include a new subsection detailing these assumptions, the loss function, and the step-by-step derivation. revision: yes
Referee: [Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.

Authors: We acknowledge the importance of checking for distribution shift, especially given the observed variability in weights across pairs. In the revised manuscript, we have added experiments in §5 that calibrate weights on one set of problems (e.g., from a math-focused dataset) and evaluate on a shifted test set (e.g., coding or general reasoning benchmarks). The results indicate that performance remains superior to baselines with only minor degradation, thereby supporting the practical utility of pre-computed weights. revision: yes
Referee: [§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.

Authors: We confirm that all calibration data used to fit the weights for each LLM-PRM pair was drawn from sources strictly disjoint from the test benchmarks reported in §5. This includes using separate validation splits or problems not present in the evaluation sets. We have updated the experimental details in §5 to explicitly state this disjointness and added a note on how the calibration sets were constructed to eliminate any potential circularity. revision: yes
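
The assumptions named in the first response (conditional independence given the true answer, expected 0-1 loss) are enough to sketch why a sum of log-ratio weights would be optimal. The derivation below is a reconstruction under explicitly labeled assumptions, written to agree with the Theorem 3.2 statement quoted later on this page; it is not copied from the paper, and the uniform prior and uniform-wrong-answer assumptions in particular are ours.

```latex
% Sketch under labeled assumptions (not taken verbatim from the paper):
%  - m candidate answers, uniform prior over the true answer a^*
%  - responses i = 1..n are conditionally independent given a^*
%  - response i is correct (c_i = 1) with probability q_M; if it is wrong,
%    its answer a_i is uniform over the m-1 incorrect answers
%  - the PRM score p_i depends on everything else only through c_i (and V)
\begin{align*}
P(a^* = \alpha_k \mid \{a_i, p_i\})
  &\propto \prod_{i:\, a_i = \alpha_k} q_M \, P(p_i \mid c_i = 1, V)
     \prod_{i:\, a_i \neq \alpha_k} \frac{1 - q_M}{m - 1} \, P(p_i \mid c_i = 0, V) \\
  &\propto \prod_{i:\, a_i = \alpha_k}
     \frac{q_M (m-1) \, P(p_i \mid c_i = 1, V)}{(1 - q_M) \, P(p_i \mid c_i = 0, V)} \\
\log P(a^* = \alpha_k \mid \{a_i, p_i\})
  &= \text{const} + \sum_{i:\, a_i = \alpha_k}
     \left[ \log \frac{P(p_i \mid c_i = 1, V)}{P(p_i \mid c_i = 0, V)}
          + \log \frac{q_M (m-1)}{1 - q_M} \right]
\end{align*}
```

The second line divides by the factor $\prod_i \frac{1-q_M}{m-1} P(p_i \mid c_i = 0, V)$, which does not depend on $\alpha_k$. Choosing the answer with the largest sum is the maximum a posteriori decision, which minimizes the expected 0-1 loss. Under this sketch a response's weight is negative whenever its likelihood ratio falls below $(1 - q_M) / (q_M (m-1))$, which is one way to read the negative weights the paper reports.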

Circularity Check

0 steps flagged

Theoretical framework and calibration are independent; no reduction to inputs by construction

The paper begins with a theoretical derivation of optimal weighted aggregation for LLM-PRM signals, then introduces separate pre-computation calibration methods whose weights are estimated from data and validated across 5 LLMs and 7 PRMs. No equations or claims reduce the final performance result to the calibration inputs by definition, nor is there load-bearing self-citation or ansatz smuggling. Experiments provide external benchmarks, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of optimal weights that can be pre-computed from the interplay between LLM and PRM signals; these weights function as fitted parameters calibrated on data.

free parameters (1)

weighting functions
Weights estimated to capture complex interplay between LLM and PRM, often negative, calibrated via pre-computation methods.

axioms (1)

domain assumption: An optimal aggregation strategy exists as a weighted combination of LLM and PRM signals.
Invoked in the theoretical framework development described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear


Theorem 3.2 (Optimal Aggregation Score): $\mathrm{Score}(\alpha_k) = \sum_i w_i$, where $w_i = \log \frac{P(p_i \mid c_i = 1, V)}{P(p_i \mid c_i = 0, V)} + \log \frac{q_M (m-1)}{1 - q_M}$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear


Calibration via KDE on logit-transformed PRM scores and grid search for offset b yielding negative weights
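
The calibration passage quoted above (KDE on logit-transformed PRM scores plus a grid search for an offset b) can be illustrated with a dependency-free Python sketch. Several choices here are assumptions made for illustration and are not confirmed by this page: the fixed kernel bandwidth, the density floor, choosing b by calibration accuracy, and all function names and toy data. The sketch estimates the score density separately for correct and incorrect responses and uses their log-ratio as the weight.

```python
import math
from collections import defaultdict

def logit(p, eps=1e-6):
    """Log-odds of a score in [0, 1], clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def gaussian_kde(samples, bandwidth):
    """Return a density estimate made of Gaussian kernels centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def fit_weight_fn(scores, correct, bandwidth=0.5, floor=1e-12):
    """Fit w(p) = log f1(z) - log f0(z) with z = logit(p); bandwidth is an assumed value."""
    f1 = gaussian_kde([logit(s) for s, c in zip(scores, correct) if c], bandwidth)
    f0 = gaussian_kde([logit(s) for s, c in zip(scores, correct) if not c], bandwidth)
    def weight(p):
        z = logit(p)
        # The floor keeps the logarithm finite far away from every sample.
        return math.log(max(f1(z), floor)) - math.log(max(f0(z), floor))
    return weight

def vote(answers, weights):
    """Return the answer with the largest summed weight."""
    totals = defaultdict(float)
    for answer, w in zip(answers, weights):
        totals[answer] += w
    return max(totals, key=totals.get)

def grid_search_offset(problems, weight, grid):
    """Pick the offset b with the best calibration accuracy (assumed selection criterion)."""
    def acc(b):
        return sum(
            vote(p["answers"], [weight(s) + b for s in p["scores"]]) == p["gold"]
            for p in problems
        ) / len(problems)
    return max(grid, key=acc)

# Hypothetical calibration data: PRM scores with correctness labels.
scores = [0.90, 0.80, 0.85, 0.20, 0.30, 0.25]
correct = [1, 1, 1, 0, 0, 0]
weight = fit_weight_fn(scores, correct)

calib_problems = [{"answers": ["a", "a", "b"], "scores": [0.30, 0.25, 0.90], "gold": "b"}]
grid = [-1.0, 0.0, 1.0]
best_b = grid_search_offset(calib_problems, weight, grid)
```

Working in logit space does not change the ratio being estimated, because the change-of-variables factor is the same in numerator and denominator. Both steps run once per LLM-PRM pair, which is consistent with the pre-computation framing in the abstract.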

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.