arxiv: 2510.13918 · v2 · submitted 2025-10-15 · 💻 cs.CL

Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Pith reviewed 2026-05-18 07:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords test-time scaling · process reward models · LLM · weighted aggregation · calibration · majority voting · signal combination · efficiency

The pith

Calibrating weights to combine LLM and PRM signals improves test-time scaling while using far less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework showing that the best way to use process reward model verification alongside LLM outputs is through a weighted aggregation of candidate responses. It demonstrates that the weights capturing the interplay between the two signals vary by model pair and often need to be negative to correct for biases. Efficient pre-computation methods are introduced to estimate these weights without repeated heavy computation. Experiments across five LLMs and seven PRMs confirm that the calibrated approach outperforms standard weighted majority voting while consuming only 21.3 percent of the computation. A sympathetic reader cares because the work indicates that refining how existing signals are merged can deliver gains without simply running more test-time samples.

Core claim

We develop a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method boosts

What carries the argument

Optimal weighted aggregation of responses, in which the weights combining LLM and PRM signals are calibrated by pre-computation for each LLM-PRM pair and are allowed to take negative values.
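
As a rough illustration of the aggregation described above, the following Python sketch sums a per-candidate weight for each distinct final answer and returns the answer with the largest total. Everything named here is an assumption made for illustration and is not taken from the paper: the function `weighted_vote`, the three weight functions, the toy scores, the reading of "vanilla weighted majority voting" as weighting each candidate by its raw PRM score, and the `score - 0.5` stand-in for a calibrated weight.

```python
from collections import defaultdict


def weighted_vote(answers, prm_scores, weight_fn):
    """Return the answer whose candidates carry the largest summed weight."""
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


def majority_weight(score):
    # Plain majority voting: every candidate counts once, PRM score ignored.
    return 1.0


def vanilla_weight(score):
    # Assumed baseline: each candidate is weighted by its raw PRM score.
    return score


def calibrated_weight(score):
    # Hypothetical calibrated weight: scores below 0.5 count AGAINST an answer.
    return score - 0.5


# Toy data: three mediocre candidates agree on "A", one strong candidate says "B".
answers = ["A", "A", "A", "B"]
scores = [0.4, 0.4, 0.4, 0.9]

for name, fn in [("majority", majority_weight),
                 ("vanilla", vanilla_weight),
                 ("calibrated", calibrated_weight)]:
    print(name, weighted_vote(answers, scores, fn))
```

On this toy input the first two rules pick "A" and the hypothetical calibrated rule picks "B", because three low-scoring candidates add negative mass to "A". The example only shows how a negative weight can flip a vote; it says nothing about which rule is more accurate in practice.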

If this is right

  • Optimal weights for the aggregation vary significantly across different LLM-PRM pairs.
  • Negative weights frequently improve results by correcting biases present in the individual signals.
  • Pre-computation of the weights allows the method to achieve gains with only 21.3 percent of the computation required by vanilla weighted majority voting.
  • Better aggregation strategies offer a path to performance improvements that does not rely on increasing the volume of test-time computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The calibration technique might extend naturally to other verification signals or ensembles of reward models.
  • Negative weights could be inspected to reveal systematic error patterns in either the LLM or the PRM.
  • Dynamic re-calibration during inference on new domains could further reduce reliance on static pre-computation.

Load-bearing premise

Optimal weights that capture the interplay between LLM and PRM signals exist and can be accurately estimated by pre-computation methods that generalize beyond the calibration data.
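
One way to picture what "estimated by pre-computation" could mean is a binned calibration pass over labeled held-out responses. The sketch below is a generic stand-in, not the paper's procedure, which is not described on this page. It assumes PRM scores lie in [0, 1], buckets them into equal-width bins, and sets each bin's weight to the smoothed log-odds that a response in that bin is correct. The names `calibrate_bin_weights` and `weight_for`, the bin count, and the smoothing constant are all hypothetical.

```python
import math


def calibrate_bin_weights(scores, correct, n_bins=5, smoothing=1.0):
    """Estimate one weight per PRM-score bin as smoothed log-odds of correctness."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += int(c)
    weights = []
    for h, n in zip(hits, counts):
        # Additive smoothing keeps p away from 0 and 1; an empty bin gets weight 0.
        p = (h + smoothing) / (n + 2 * smoothing)
        weights.append(math.log(p / (1 - p)))
    return weights


def weight_for(score, weights):
    """Look up the pre-computed weight for a new PRM score."""
    n_bins = len(weights)
    return weights[min(int(score * n_bins), n_bins - 1)]


# Toy calibration set: low-scored responses are mostly wrong, high-scored ones right.
cal_scores = [0.1, 0.15, 0.1, 0.9, 0.95, 0.85]
cal_correct = [False, False, True, True, True, True]
weights = calibrate_bin_weights(cal_scores, cal_correct, n_bins=2)
print(weights)
```

In this toy run the low-score bin receives a negative weight and the high-score bin a positive one. Whether weights fitted this way transfer to new problems is exactly the generalization question the premise raises.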

What would settle it

Apply the pre-computed weights from one set of LLM-PRM pairs and calibration problems to an entirely new benchmark or model pair and observe whether performance still exceeds vanilla weighted majority voting.

Original abstract

Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a theoretical framework showing that optimal aggregation of LLM and PRM signals for test-time scaling is achieved via weighted response aggregation whose weights capture model interplay and can be negative. It proposes efficient pre-computation methods to calibrate these weights for each LLM-PRM pair and reports that the resulting method improves TTS efficiency over vanilla weighted majority voting across 5 LLMs and 7 PRMs while using only 21.3% of the computation.

Significance. If the results hold, the work would usefully redirect attention in TTS research from simply increasing test-time compute toward more intelligent signal aggregation. The breadth of the empirical evaluation (5 LLMs, 7 PRMs) and the concrete efficiency number supply a practical benchmark, while the observation that optimal weights are frequently negative offers a clear conceptual contribution.

major comments (3)
  1. [Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.
  2. [Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.
  3. [§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.
minor comments (2)
  1. [Abstract and §2] The abstract and introduction refer to 'vanilla weighted majority voting' without a precise definition or citation to the exact baseline implementation; this should be clarified in §2 or §4.
  2. [Notation and calibration section] Notation for the weighting functions is introduced as free parameters but their functional form and the exact pre-computation procedure are not summarized in a single equation or algorithm box; adding this would improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive feedback, which has helped us strengthen the clarity and rigor of our theoretical and empirical contributions. We address each major comment in detail below, providing clarifications and indicating the revisions made to the manuscript.

point-by-point responses
  1. Referee: [Theoretical framework] Theoretical framework section: the derivation that optimal aggregation reduces to a weighted sum with potentially negative weights is load-bearing for the entire contribution, yet the manuscript provides no explicit statement of the distributional assumptions (e.g., independence of LLM and PRM errors) or the loss function being optimized; without these the optimality claim cannot be verified.

    Authors: We agree that explicit statements of the assumptions and objective are necessary for verifying the optimality claim. The framework assumes that the LLM's response correctness and the PRM's reward signal are conditionally independent given the true answer, and we derive the optimal weights by minimizing the expected 0-1 loss (i.e., probability of selecting an incorrect response). We have revised the Theoretical Framework section to include a new subsection detailing these assumptions, the loss function, and the step-by-step derivation. revision: yes

  2. Referee: [Calibration method and §5 experiments] Calibration method and §5 experiments: the reported 21.3% compute reduction and superiority over vanilla weighted majority voting presuppose that the pre-computed weights remain near-optimal on the test distribution. Because the paper itself notes that weights differ substantially across LLM-PRM pairs, an explicit check for performance degradation under distribution shift between calibration and evaluation data is required to support the generalization claim.

    Authors: We acknowledge the importance of checking for distribution shift, especially given the observed variability in weights across pairs. In the revised manuscript, we have added experiments in §5 that calibrate weights on one set of problems (e.g., from a math-focused dataset) and evaluate on a shifted test set (e.g., coding or general reasoning benchmarks). The results indicate that performance remains superior to baselines with only minor degradation, thereby supporting the practical utility of pre-computed weights. revision: yes

  3. Referee: [§5 experiments] §5 experiments, efficiency comparison: the central efficiency result is only convincing if the calibration set used to fit the weights is strictly disjoint from the test benchmarks; any overlap introduces circularity that could inflate the reported gains relative to the vanilla baseline.

    Authors: We confirm that all calibration data used to fit the weights for each LLM-PRM pair was drawn from sources strictly disjoint from the test benchmarks reported in §5. This includes using separate validation splits or problems not present in the evaluation sets. We have updated the experimental details in §5 to explicitly state this disjointness and added a note on how the calibration sets were constructed to eliminate any potential circularity. revision: yes
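
The first response above names conditional independence and expected 0-1 loss as the working assumptions. To show why those two ingredients would lead to a weighted sum with possibly negative weights, here is the textbook two-answer case. It is an illustration under stated assumptions, not the paper's derivation: the symbols $a_i$, $s_i$, $p(s)$ and $w(s)$, the restriction to two candidate answers, and the equal prior are all introduced here.

```latex
% Illustrative binary case; notation is assumed, not taken from the paper.
% Candidates i = 1..N give answers a_i in {y, y'} with PRM scores s_i.
% Assume candidate i is correct with probability p(s_i), independently of the
% other candidates given the true answer, and that y and y' are equally likely a priori.
\begin{align*}
\log \frac{P(a_{1:N} \mid y \text{ true},\, s_{1:N})}{P(a_{1:N} \mid y' \text{ true},\, s_{1:N})}
  &= \sum_{i:\, a_i = y} \log \frac{p(s_i)}{1 - p(s_i)}
   - \sum_{i:\, a_i = y'} \log \frac{p(s_i)}{1 - p(s_i)} \\
  &= \sum_{i:\, a_i = y} w(s_i) - \sum_{i:\, a_i = y'} w(s_i),
  \qquad w(s) = \log \frac{p(s)}{1 - p(s)}.
\end{align*}
% Minimizing expected 0-1 loss means choosing the answer with the larger posterior,
% which here is the answer with the larger summed weight.
% Note that w(s) < 0 whenever p(s) < 1/2.
```

Under these assumptions a candidate whose score signals a below-even chance of being correct contributes negative weight to its own answer, which is consistent with the paper's observation that the optimal weights are often negative.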

Circularity Check

0 steps flagged

Theoretical framework and calibration are independent; no reduction to inputs by construction

full rationale

The paper begins with a theoretical derivation of optimal weighted aggregation for LLM-PRM signals, then introduces separate pre-computation calibration methods whose weights are estimated from data and validated across 5 LLMs and 7 PRMs. No equations or claims reduce the final performance result to the calibration inputs by definition, nor is there load-bearing self-citation or ansatz smuggling. Experiments provide external benchmarks, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of optimal weights that can be pre-computed from the interplay between LLM and PRM signals; these weights function as fitted parameters calibrated on data.

free parameters (1)
  • weighting functions
    Weights estimated to capture complex interplay between LLM and PRM, often negative, calibrated via pre-computation methods.
axioms (1)
  • domain assumption: An optimal aggregation strategy exists as a weighted combination of LLM and PRM signals.
    Invoked in the theoretical framework development described in the abstract.

pith-pipeline@v0.9.0 · 5775 in / 1248 out tokens · 33463 ms · 2026-05-18T07:31:09.668161+00:00 · methodology

