Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Pith reviewed 2026-05-18 07:31 UTC · model grok-4.3
The pith
Calibrating weights to combine LLM and PRM signals improves test-time scaling while using far less computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only 21.3% of the computation.
What carries the argument
Optimal weighted aggregation of responses, where weights for LLM and PRM signals are calibrated via pre-computation to account for their specific interplay including negative values.
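To make the idea of weighted aggregation concrete, here is a minimal Python sketch. All names are illustrative and not taken from the paper: `vanilla_weight` assumes the baseline weights each vote by the raw PRM score (the page does not define the baseline precisely), and `shifted_weight` with its offset of 0.5 is a hypothetical stand-in for a calibrated weighting function that can go negative.

```python
from collections import defaultdict


def aggregate(answers, prm_scores, weight_fn):
    """Return the answer with the highest summed vote weight.

    answers: final answer extracted from each sampled response
    prm_scores: PRM score in (0, 1) for each response
    weight_fn: maps a PRM score to a real-valued weight (may be negative)
    """
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


def majority_weight(score):
    # Plain majority voting: every vote counts equally, PRM ignored.
    return 1.0


def vanilla_weight(score):
    # Assumed baseline: each vote is weighted by the raw PRM score.
    return score


def shifted_weight(score, offset=0.5):
    # Hypothetical calibrated weight: low-scoring votes count against an answer.
    return score - offset


# Toy data: three low-confidence votes for "A", one high-confidence vote for "B".
answers = ["A", "A", "A", "B"]
scores = [0.3, 0.3, 0.3, 0.8]

print(aggregate(answers, scores, majority_weight))
print(aggregate(answers, scores, vanilla_weight))
print(aggregate(answers, scores, shifted_weight))
```

In this toy case, strictly positive weights let three weak votes outweigh one strong vote, while the shifted weights turn the weak votes into evidence against their answer. That is one way negative weights can change the selected response.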
If this is right
- Optimal weights for the aggregation vary significantly across different LLM-PRM pairs.
- Negative weights frequently improve results by correcting biases present in the individual signals.
- Pre-computation of the weights allows the method to achieve gains with only 21.3 percent of the computation required by vanilla weighted majority voting.
- Better aggregation strategies offer a path to performance improvements that does not rely on increasing the volume of test-time computation.
Where Pith is reading between the lines
- The calibration technique might extend naturally to other verification signals or ensembles of reward models.
- Negative weights could be inspected to reveal systematic error patterns in either the LLM or the PRM.
- Dynamic re-calibration during inference on new domains could further reduce reliance on static pre-computation.
Load-bearing premise
Optimal weights that capture the interplay between LLM and PRM signals exist and can be accurately estimated by pre-computation methods that generalize beyond the calibration data.
What would settle it
Apply the pre-computed weights from one set of LLM-PRM pairs and calibration problems to an entirely new benchmark or model pair and observe whether performance still exceeds vanilla weighted majority voting.
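The transfer test described above could be organized along the following lines. This is a sketch under stated assumptions: the weight functions, the helper names, and the two toy problems are made up for illustration, and the vanilla baseline is assumed to weight votes by the raw PRM score.

```python
from collections import defaultdict


def weighted_vote(answers, scores, weight_fn):
    """Pick the answer with the largest summed weight."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


def accuracy(problems, weight_fn):
    """problems: list of (answers, prm_scores, gold_answer) tuples."""
    hits = sum(weighted_vote(a, s, weight_fn) == gold for a, s, gold in problems)
    return hits / len(problems)


def transfer_gap(held_out, frozen_weight, baseline_weight):
    """Accuracy of frozen, pre-computed weights minus the baseline on unseen data."""
    return accuracy(held_out, frozen_weight) - accuracy(held_out, baseline_weight)


# Hypothetical frozen weights, fitted elsewhere and not re-tuned here.
def frozen_weight(score):
    return score - 0.5


# Assumed vanilla baseline: raw PRM score as the vote weight.
def baseline_weight(score):
    return score


# Toy held-out problems (made-up numbers, not results from the paper).
held_out = [
    (["A", "A", "A", "B"], [0.3, 0.3, 0.3, 0.8], "B"),
    (["C", "D", "D"], [0.9, 0.4, 0.45], "C"),
]

gap = transfer_gap(held_out, frozen_weight, baseline_weight)
print(gap)
```

A positive gap on a benchmark or model pair that played no part in calibration would support the premise; a gap near zero or below would count against it.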
read the original abstract
Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework showing that optimal aggregation of LLM and PRM signals for test-time scaling is achieved via weighted response aggregation whose weights capture model interplay and can be negative. It proposes efficient pre-computation methods to calibrate these weights for each LLM-PRM pair and reports that the resulting method improves TTS efficiency over vanilla weighted majority voting across 5 LLMs and 7 PRMs while using only 21.3% of the computation.
Significance. If the results hold, the work would usefully redirect attention in TTS research from simply increasing test-time compute toward more intelligent signal aggregation. The breadth of the empirical evaluation (5 LLMs, 7 PRMs) and the concrete efficiency number supply a practical benchmark, while the observation that optimal weights are frequently negative offers a clear conceptual contribution.
major comments (3)
- [Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.
- [Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.
- [§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.
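The disjointness condition in the last major comment can be checked mechanically. The sketch below is one simple way to do it, assuming problems are available as plain strings; the `normalize` and `overlap` helpers are hypothetical, and exact matching after normalization is a crude test that misses paraphrased duplicates.

```python
import re


def normalize(problem_text):
    """Lowercase and collapse punctuation/whitespace so trivial edits do not hide duplicates."""
    return re.sub(r"[^a-z0-9]+", " ", problem_text.lower()).strip()


def overlap(calibration_problems, test_problems):
    """Return normalized problems that appear in both sets (empty list means disjoint)."""
    calibration = {normalize(p) for p in calibration_problems}
    test = {normalize(p) for p in test_problems}
    return sorted(calibration & test)


calibration_set = ["What is 2+2?", "Solve x^2 = 4."]
test_set = ["what is  2 + 2 ?", "Find the derivative of x^3."]

shared = overlap(calibration_set, test_set)
print(shared)
```

Any non-empty result would indicate leakage between the calibration set and the evaluation benchmarks.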
minor comments (2)
- [Abstract and §2] The abstract and introduction refer to 'vanilla weighted majority voting' without a precise definition or citation to the exact baseline implementation; this should be clarified in §2 or §4.
- [Notation and calibration section] Notation for the weighting functions is introduced as free parameters but their functional form and the exact pre-computation procedure are not summarized in a single equation or algorithm box; adding this would improve readability.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback, which has helped us strengthen the clarity and rigor of our theoretical and empirical contributions. We address each major comment in detail below, providing clarifications and indicating the revisions made to the manuscript.
point-by-point responses
- Referee: [Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.
  Authors: We agree that explicit statements of the assumptions and objective are necessary for verifying the optimality claim. The framework assumes that the LLM's response correctness and the PRM's reward signal are conditionally independent given the true answer, and we derive the optimal weights by minimizing the expected 0-1 loss (i.e., probability of selecting an incorrect response). We have revised the Theoretical Framework section to include a new subsection detailing these assumptions, the loss function, and the step-by-step derivation.
  Revision: yes
- Referee: [Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.
  Authors: We acknowledge the importance of checking for distribution shift, especially given the observed variability in weights across pairs. In the revised manuscript, we have added experiments in §5 that calibrate weights on one set of problems (e.g., from a math-focused dataset) and evaluate on a shifted test set (e.g., coding or general reasoning benchmarks). The results indicate that performance remains superior to baselines with only minor degradation, thereby supporting the practical utility of pre-computed weights.
  Revision: yes
- Referee: [§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.
  Authors: We confirm that all calibration data used to fit the weights for each LLM-PRM pair was drawn from sources strictly disjoint from the test benchmarks reported in §5. This includes using separate validation splits or problems not present in the evaluation sets. We have updated the experimental details in §5 to explicitly state this disjointness and added a note on how the calibration sets were constructed to eliminate any potential circularity.
  Revision: yes
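The first response names conditional independence and 0-1 loss as the ingredients of the optimality claim. The following derivation sketches how those two ingredients could lead to a score of the form quoted as Theorem 3.2 further down this page. It is a reconstruction, not the paper's proof, and it adds assumptions the page does not state: a uniform prior over $m$ candidate answers, an LLM that is correct with probability $q_M$ and otherwise uniform over the $m-1$ wrong answers, and the notation $a_i$ for the answer given by response $i$. The symbol $V$ is kept as written in the theorem statement.

```latex
% Under 0-1 loss the Bayes-optimal choice is the answer with the highest posterior.
% With a uniform prior this is the answer with the highest likelihood.
\begin{align*}
P(\text{data} \mid \alpha_k \text{ correct})
  &= \prod_{i:\, a_i = \alpha_k} q_M \, P(p_i \mid c_i = 1, V)
     \prod_{i:\, a_i \neq \alpha_k} \frac{1 - q_M}{m - 1} \, P(p_i \mid c_i = 0, V).
\end{align*}

% Divide by a constant that does not depend on k:
%   C = \prod_i \frac{1 - q_M}{m - 1} P(p_i \mid c_i = 0, V).
\begin{align*}
\log \frac{P(\text{data} \mid \alpha_k \text{ correct})}{C}
  &= \sum_{i:\, a_i = \alpha_k}
     \left[ \log \frac{P(p_i \mid c_i = 1, V)}{P(p_i \mid c_i = 0, V)}
          + \log \frac{q_M (m - 1)}{1 - q_M} \right]
   = \sum_{i:\, a_i = \alpha_k} w_i .
\end{align*}
```

Under these assumptions a weight $w_i$ is negative whenever the PRM likelihood ratio for response $i$ is small enough to outweigh the LLM term, which is consistent with the negative weights reported in the abstract.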
Circularity Check
Theoretical framework and calibration are independent; no reduction to inputs by construction
full rationale
The paper begins with a theoretical derivation of optimal weighted aggregation for LLM-PRM signals, then introduces separate pre-computation calibration methods whose weights are estimated from data and validated across 5 LLMs and 7 PRMs. No equations or claims reduce the final performance result to the calibration inputs by definition, nor is there load-bearing self-citation or ansatz smuggling. Experiments provide external benchmarks, keeping the chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- weighting functions
axioms (1)
- domain assumption: Optimal aggregation strategy exists as a weighted combination of LLM and PRM signals.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Theorem 3.2 (Optimal Aggregation Score): $\mathrm{Score}(\alpha_k) = \sum_i w_i$, where $w_i = \log \frac{P(p_i \mid c_i = 1, V)}{P(p_i \mid c_i = 0, V)} + \log \frac{q_M (m - 1)}{1 - q_M}$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Calibration via KDE on logit-transformed PRM scores and grid search for offset b yielding negative weights
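The calibration passage quoted above (KDE on logit-transformed PRM scores, then a grid search over an offset b) could look roughly like the following. This is a standard-library sketch, not the authors' procedure: the hand-rolled Gaussian KDE, the normal-reference bandwidth, the density floor, and all function names are assumptions, and b is simply added to every vote's log-density ratio because the page does not say exactly which term it replaces.

```python
import math
from collections import defaultdict


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid infinities at 0 and 1
    return math.log(p / (1.0 - p))


def gaussian_kde(samples):
    """Gaussian KDE with a normal-reference bandwidth (1.06 * std * n^(-1/5))."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / max(n - 1, 1))
    h = 1.06 * max(std, 1e-3) * n ** (-0.2)
    norm = n * h * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm

    return density


def fit_log_ratio(scores, labels):
    """Estimate log density ratio (correct vs. incorrect) of logit PRM scores."""
    pos = gaussian_kde([logit(s) for s, c in zip(scores, labels) if c])
    neg = gaussian_kde([logit(s) for s, c in zip(scores, labels) if not c])

    def log_ratio(score, floor=1e-12):
        z = logit(score)
        return math.log(max(pos(z), floor)) - math.log(max(neg(z), floor))

    return log_ratio


def vote(answers, scores, log_ratio, b):
    """Weighted vote where each response contributes log_ratio(score) + b."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += log_ratio(score) + b
    return max(totals, key=totals.get)


def calibrate_offset(problems, log_ratio, grid):
    """Grid-search b to maximize accuracy on (answers, scores, gold) tuples."""
    def accuracy(b):
        return sum(vote(a, s, log_ratio, b) == g for a, s, g in problems) / len(problems)

    return max(grid, key=accuracy)


# Toy calibration data (made up): PRM scores with correctness labels.
cal_scores = [0.9, 0.8, 0.85, 0.7, 0.2, 0.3, 0.1, 0.25]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0]
log_ratio = fit_log_ratio(cal_scores, cal_labels)

problems = [
    (["A", "A", "A", "B"], [0.2, 0.2, 0.2, 0.9], "B"),
    (["C", "C", "D"], [0.8, 0.85, 0.3], "C"),
]
best_b = calibrate_offset(problems, log_ratio, grid=[100.0, 0.0])
print(best_b)
```

Because both densities are estimated on the same logit scale, the change-of-variables factor cancels in the ratio, so the log-ratio equals the one on the original score scale. A very large positive offset makes every weight positive and reduces the rule to something close to majority voting, which is why the toy grid search prefers the smaller value here.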
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.