Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information
Pith reviewed 2026-05-21 20:43 UTC · model grok-4.3
The pith
Optimal Weight and Inverse Surprising Popularity algorithms improve multi-LLM aggregation by using first-order and second-order model information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors design Optimal Weight (OW) and Inverse Surprising Popularity (ISP) algorithms that leverage both first-order and second-order information to aggregate answers from multiple LLMs, provably mitigating inherent limitations of majority voting under mild assumptions on latent heterogeneity and correlation across models.
What carries the argument
Optimal Weight (OW) and Inverse Surprising Popularity (ISP) algorithms, which use first-order answer frequencies and second-order model correlations or surprise measures to produce weighted or adjusted collective decisions.
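To make the contrast with majority voting concrete, here is a minimal sketch of a classical accuracy-weighted vote. It is not the paper's OW or ISP algorithm: it uses only first-order information (an assumed per-model accuracy), treats models as independent, and so ignores the correlations that the second-order terms are meant to capture. The function names, the example answers, and the accuracy values are all hypothetical.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)


def log_odds_weighted_vote(answers, accuracies):
    """Weight each vote by log(p / (1 - p)), where p is the model's assumed accuracy.

    This is the textbook weighting for independent voters on a binary task,
    used here only as an illustration (an assumption, not the paper's method).
    """
    scores = defaultdict(float)
    for a, p in zip(answers, accuracies):
        p = min(max(p, 1e-6), 1 - 1e-6)  # clip to keep the logarithm finite
        scores[a] += math.log(p / (1 - p))
    return max(scores, key=scores.get)


# Hypothetical ensemble: one strong model disagrees with two weak ones.
answers = ["B", "A", "A"]
accuracies = [0.95, 0.6, 0.6]

print(majority_vote(answers))                       # the two weak models win
print(log_odds_weighted_vote(answers, accuracies))  # the strong model wins
```

In this toy case the two rules disagree, which is the kind of heterogeneity that equal-weight voting cannot account for.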
If this is right
- More reliable collective decisions from groups of LLMs in reasoning tasks.
- Consistent gains over standard baselines on fine-tuning benchmarks and real-world applications.
- A training-free aggregation framework that avoids the need for additional model tuning or data.
Where Pith is reading between the lines
- The approach may apply to other multi-agent systems where agents have varying expertise and correlations.
- Further work could test integration with uncertainty estimates from individual LLMs.
- Relaxing the assumptions might allow guarantees in settings with stronger model dependencies.
Load-bearing premise
Mild assumptions on latent heterogeneity and correlation across models that enable the theoretical guarantees for OW and ISP.
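One way to probe such a premise on a labelled validation set is to estimate the first-order statistics (per-model accuracy) and second-order statistics (pairwise correlation of correctness) directly. The sketch below is an assumption about how that check could look, not the paper's procedure; the `correct` matrix and the function name are hypothetical.

```python
import math


def accuracy_and_correlation(correct):
    """Estimate per-model accuracy and pairwise correlation of correctness.

    correct[i][q] is 1 if model i answered question q correctly, else 0.
    Returns (acc, corr), where corr[i][j] is the Pearson correlation between
    the correctness indicators of models i and j (NaN if either is constant).
    """
    n_models = len(correct)
    n_q = len(correct[0])
    acc = [sum(row) / n_q for row in correct]
    corr = [[0.0] * n_models for _ in range(n_models)]
    for i in range(n_models):
        for j in range(n_models):
            cov = sum(
                (correct[i][q] - acc[i]) * (correct[j][q] - acc[j])
                for q in range(n_q)
            ) / n_q
            # Population variance of a 0/1 indicator with mean p is p * (1 - p).
            denom = math.sqrt(acc[i] * (1 - acc[i]) * acc[j] * (1 - acc[j]))
            corr[i][j] = cov / denom if denom > 0 else float("nan")
    return acc, corr


# Hypothetical data: models 0 and 1 make identical errors, model 2 differs.
correct = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
acc, corr = accuracy_and_correlation(correct)
print(acc)
print(corr[0][1], corr[0][2])
```

Strong positive off-diagonal entries would indicate that several models fail together, which is the situation in which counting their votes separately overstates the evidence.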
What would settle it
An empirical evaluation on a new dataset where OW and ISP show no improvement or worse performance than majority voting, or a counterexample where the mild assumptions hold but the provable mitigation fails.
Original abstract
With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), for combining answers from multiple LLMs. These methods incorporate first-order and second-order information to account for latent heterogeneity and correlations among models, in contrast to standard majority voting. The authors present a theoretical analysis claiming that the methods provably mitigate limitations of majority voting under mild assumptions, and report empirical results showing consistent outperformance on synthetic datasets, UltraFeedback, MMLU, and the ARMMAN healthcare task.
Significance. If the theoretical guarantees hold under assumptions that are both precisely stated and plausibly satisfied by real LLM ensembles, the work offers a training-free framework that could meaningfully improve reliability in multi-agent LLM reasoning systems. The inclusion of a real-world healthcare application strengthens the practical case. The absence of detailed proof sketches or assumption checks in the provided abstract limits assessment of how non-vacuous the claimed improvements are.
major comments (1)
- [§3 (Theoretical Analysis)] The central claim that OW and ISP provably mitigate inherent limitations of majority voting rests on mild assumptions about latent heterogeneity and pairwise correlations. These assumptions must be stated with precise mathematical conditions (e.g., explicit bounds on response probabilities or conditions on the correlation matrix), and the manuscript should verify or discuss whether they hold for the LLMs and datasets used in the UltraFeedback/MMLU/ARMMAN experiments; without this, the performance gap over majority voting is not guaranteed to be non-vacuous.
minor comments (2)
- [Abstract] The abstract asserts theoretical mitigation and consistent outperformance but provides no equations, proof ideas, dataset sizes, or error bars, making the strength of the claims difficult to evaluate from the summary.
- [Experiments] Results on benchmarks should include error bars or statistical significance tests to support the claim of consistent outperformance over baselines.
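The error bars requested in the last comment could be produced with a paired bootstrap over questions. The following is a minimal sketch under stated assumptions: the per-question correctness vectors are hypothetical toy data, the function name is illustrative, and a percentile interval is only one of several reasonable choices.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a[q] and correct_b[q] are 0/1 correctness indicators for the same
    question q, so resampling question indices keeps the comparison paired.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    mean = sum(a - b for a, b in zip(correct_a, correct_b)) / n
    return mean, lo, hi


# Hypothetical results on 100 questions: an aggregator vs. majority voting.
aggregator = [1] * 80 + [0] * 20
majority = [1] * 70 + [0] * 30

mean, lo, hi = paired_bootstrap_diff(aggregator, majority)
print(f"accuracy gain = {mean:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support a claim of improvement on that benchmark; one that straddles zero would not.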
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of our training-free aggregation framework. We address the major comment point by point below.
Point-by-point responses
- Referee: [§3 (Theoretical Analysis)] The central claim that OW and ISP provably mitigate inherent limitations of majority voting rests on mild assumptions about latent heterogeneity and pairwise correlations. These assumptions must be stated with precise mathematical conditions (e.g., explicit bounds on response probabilities or conditions on the correlation matrix), and the manuscript should verify or discuss whether they hold for the LLMs and datasets used in the UltraFeedback/MMLU/ARMMAN experiments; without this, the performance gap over majority voting is not guaranteed to be non-vacuous.
Authors: We appreciate the referee highlighting the need for clarity on our theoretical assumptions. Section 3 already formulates the assumptions with precise mathematical conditions, including explicit bounds on response probabilities under latent heterogeneity and conditions on the pairwise correlation matrix that enable the second-order information to yield provable gains over majority voting. The guarantees are therefore conditional on these assumptions, which we describe as mild because they allow for realistic levels of model heterogeneity and correlation without requiring independence or identical accuracy. We agree that a direct discussion of their plausibility for the UltraFeedback, MMLU, and ARMMAN settings would strengthen the link between theory and experiments. In the revised manuscript we will add a short discussion subsection that qualitatively argues why the assumptions are likely to hold in these LLM ensembles, drawing on the consistent empirical improvements we observe and the task characteristics.
Revision: yes
Circularity Check
No significant circularity in theoretical guarantees or algorithm derivation
The paper derives OW and ISP aggregation methods by leveraging first-order and second-order information from LLM responses, then provides a theoretical analysis showing they mitigate majority voting limitations under explicitly stated mild assumptions on latent heterogeneity and model correlations. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the guarantees are presented as following from the assumptions rather than tautologically equivalent to them. Empirical results on UltraFeedback, MMLU, and ARMMAN are reported separately as validation and do not serve as the basis for the theoretical claims. The derivation chain is therefore self-contained with independent mathematical content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: mild assumptions on latent heterogeneity and correlation across models.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel. Relation to the paper passage: unclear.
Paper passage: "We design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, theorem absolute_floor_iff_bare_distinguishability. Relation to the paper passage: unclear.
Paper passage: "Theorem 1. Under Assumption 1, f_OW ... is the Bayesian optimal aggregator for any P."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Efficient Ensemble Selection from Binary and Pairwise Feedback
The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...
- Prior-Agnostic Robust Forecast Aggregation
A log-odds pooling rule achieves minimax regret of 0.0255 for conditionally independent signals under prior-agnostic robust forecast aggregation with unknown state space.