Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information

David Simchi-Levi; Haifeng Xu; Milind Tambe; Rui Ai; Yuqi Pan

arxiv: 2510.01499 · v2 · pith:TCNSXX7Jnew · submitted 2025-10-01 · 💻 cs.LG · cs.AI · cs.GT

Pith reviewed 2026-05-21 20:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.GT

keywords LLM aggregation · majority voting · multi-agent reasoning · Optimal Weight · Inverse Surprising Popularity · higher-order information · collective decision making

The pith

Optimal Weight and Inverse Surprising Popularity algorithms improve multi-LLM aggregation by using first-order and second-order model information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two aggregation methods called Optimal Weight and Inverse Surprising Popularity for combining answers from multiple large language models. These approaches move beyond simple majority voting by incorporating information about how models differ and correlate with each other. Theoretical results show the methods reduce known weaknesses of majority voting when mild assumptions about model behavior hold. Experiments across synthetic data, standard benchmarks like UltraFeedback and MMLU, and a real healthcare application demonstrate better performance than common baselines. The work offers a training-free way to reach more dependable group decisions from LLM ensembles.

Core claim

The authors design Optimal Weight (OW) and Inverse Surprising Popularity (ISP) algorithms that leverage both first-order and second-order information to aggregate answers from multiple LLMs, provably mitigating inherent limitations of majority voting under mild assumptions on latent heterogeneity and correlation across models.

What carries the argument

Optimal Weight (OW) and Inverse Surprising Popularity (ISP) algorithms, which use first-order answer frequencies and second-order model correlations or surprise measures to produce weighted or adjusted collective decisions.
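
The gap between majority voting and a rule that uses per-model reliability can be illustrated with a minimal sketch. This is not the paper's OW algorithm, whose exact form is not given on this page; it uses the classical log-odds weighting for independent binary voters. The accuracies 0.9, 0.6, 0.6 and all function names are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


def log_odds_weights(accuracies):
    """Classical weights log(p / (1 - p)); assumes independent voters with 0.5 < p < 1."""
    return [math.log(p / (1.0 - p)) for p in accuracies]


def weighted_vote(votes, weights):
    """Return 1 if the weighted sum of +/-1 votes is positive, else 0."""
    score = sum(w if v == 1 else -w for v, w in zip(votes, weights))
    return 1 if score > 0 else 0


def exact_accuracy(aggregate, accuracies):
    """Probability that `aggregate` recovers the truth, by enumerating all outcomes.

    The true label is fixed to 1 for simplicity, and each model is assumed to be
    correct independently with its own accuracy (an illustrative assumption).
    """
    total = 0.0
    for correct in product([0, 1], repeat=len(accuracies)):
        prob = 1.0
        for c, p in zip(correct, accuracies):
            prob *= p if c else 1.0 - p
        votes = list(correct)  # truth is 1, so a correct model votes 1
        if aggregate(votes) == 1:
            total += prob
    return total


accuracies = [0.9, 0.6, 0.6]  # hypothetical per-model accuracies
weights = log_odds_weights(accuracies)
acc_majority = exact_accuracy(majority_vote, accuracies)
acc_weighted = exact_accuracy(lambda v: weighted_vote(v, weights), accuracies)
print(f"majority: {acc_majority:.3f}, weighted: {acc_weighted:.3f}")
```

Under these assumptions the script prints `majority: 0.792, weighted: 0.900`. Two weaker models can outvote the strongest one under majority voting, while the log-odds weights let the 0.9-accuracy model decide.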

If this is right

More reliable collective decisions from groups of LLMs in reasoning tasks.
Consistent gains over standard baselines on fine-tuning benchmarks and real-world applications.
A training-free aggregation framework that avoids the need for additional model tuning or data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may apply to other multi-agent systems where agents have varying expertise and correlations.
Further work could test integration with uncertainty estimates from individual LLMs.
Relaxing the assumptions might allow guarantees in settings with stronger model dependencies.

Load-bearing premise

Mild assumptions on latent heterogeneity and correlation across models that enable the theoretical guarantees for OW and ISP.

What would settle it

An empirical evaluation on a new dataset where OW and ISP show no improvement or worse performance than majority voting, or a counterexample where the mild assumptions hold but the provable mitigation fails.

Original abstract

With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.
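
The abstract does not define ISP. As background on what second-order information can add, the sketch below implements the classical surprisingly popular rule from the crowd-wisdom literature, which the name ISP appears to allude to. It should not be read as the paper's ISP. The function name, the input format, and the example numbers are all hypothetical.

```python
from collections import Counter


def surprisingly_popular(votes, predicted_shares):
    """Pick the answer whose actual vote share most exceeds its predicted share.

    votes: one answer per respondent (first-order information).
    predicted_shares: one dict per respondent mapping each answer to the share
        of respondents that respondent expects to choose it (second-order
        information).
    """
    n = len(votes)
    actual = {answer: count / n for answer, count in Counter(votes).items()}
    options = set(actual)
    for prediction in predicted_shares:
        options.update(prediction)
    surplus = {}
    for answer in options:
        mean_predicted = sum(p.get(answer, 0.0) for p in predicted_shares) / len(predicted_shares)
        surplus[answer] = actual.get(answer, 0.0) - mean_predicted
    # Sort first so that ties are broken deterministically.
    return max(sorted(surplus), key=surplus.get)


# Hypothetical example: three respondents answer "yes" and expect almost
# everyone to agree; two answer "no" but still expect "yes" to be popular.
votes = ["yes", "yes", "yes", "no", "no"]
predicted_shares = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.7, "no": 0.3},
    {"yes": 0.7, "no": 0.3},
]
majority_answer = Counter(votes).most_common(1)[0][0]
sp_answer = surprisingly_popular(votes, predicted_shares)
print(majority_answer, sp_answer)
```

In this toy case "no" receives 40% of the votes against a mean predicted share of 18%, so it is selected even though "yes" wins the majority vote.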

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces OW and ISP as two new ways to aggregate LLM outputs using second-order information, with theory under mild assumptions and tests on benchmarks plus a healthcare task.

The main takeaway is that this paper defines two aggregation procedures, Optimal Weight and Inverse Surprising Popularity, that go beyond simple majority voting by folding in information about how the different LLMs relate to one another. They treat the models as having latent differences in accuracy and some pairwise correlations, then build weights or adjustments from those quantities. The claim is that this setup provably reduces error under mild assumptions on heterogeneity and dependence, and the experiments back it up with gains on synthetic data, UltraFeedback, MMLU, and the ARMMAN healthcare dataset.

That combination of a training-free method plus a real-world test is the useful part. Practitioners who already run several LLMs in parallel could plug these rules in without much overhead, and the healthcare example shows the idea is not limited to academic benchmarks. The work is honest about starting from the limitations of majority voting and trying to fix them directly with first- and second-order statistics.

The soft spots sit mostly in the theory-to-practice link. The guarantees rest on those mild assumptions about latent heterogeneity and correlations; if the paper states them cleanly but does not check whether the actual response patterns in the MMLU or UltraFeedback runs satisfy the conditions, then the provable improvement does not automatically explain the reported wins. LLM outputs are often correlated because of shared pretraining or fine-tuning data, so it matters whether the assumptions still deliver a non-vacuous gap in that regime. The abstract also leaves out error bars and run counts, which makes it harder to judge how stable the improvements are.

Overall this is aimed at people building multi-agent LLM systems who want a lightweight improvement over voting. A reader who needs concrete algorithms with some empirical backing will find it worth reading. It deserves a serious referee because the problem is timely, the methods are new, and the experiments reach outside pure benchmarks, even if the assumption checks need tightening.
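
One way to probe the concern about correlated LLM outputs is to measure how strongly the models' errors co-occur on a labeled evaluation set. The sketch below is a minimal, hypothetical check and not a procedure from the paper: it assumes ground-truth labels are available, encodes each model's per-question correctness as 0/1, and reports the pairwise Pearson (phi) correlation. The data and names are made up for illustration.

```python
import math


def error_correlation(correct_a, correct_b):
    """Pearson (phi) correlation between two 0/1 correctness vectors."""
    n = len(correct_a)
    mean_a = sum(correct_a) / n
    mean_b = sum(correct_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(correct_a, correct_b)) / n
    var_a = mean_a * (1.0 - mean_a)  # population variance of a 0/1 variable
    var_b = mean_b * (1.0 - mean_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined if a model is always right or always wrong
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(correct_by_model):
    """Pairwise correlations for a dict {model_name: correctness vector}."""
    names = sorted(correct_by_model)
    return {
        (a, b): error_correlation(correct_by_model[a], correct_by_model[b])
        for a in names
        for b in names
    }


# Hypothetical correctness of three models on eight questions.
correct_by_model = {
    "model_a": [1, 1, 1, 1, 0, 0, 0, 0],
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],  # makes exactly the same mistakes as model_a
    "model_c": [1, 1, 0, 0, 1, 1, 0, 0],  # mistakes unrelated to model_a
}
matrix = correlation_matrix(correct_by_model)
for pair, value in sorted(matrix.items()):
    print(pair, round(value, 3))
```

Here `model_a` and `model_b` are perfectly correlated, so a second vote from `model_b` adds no information, while `model_c` is uncorrelated with both. Whether a measured correlation level is compatible with the paper's assumptions depends on conditions that this page does not state.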

Referee Report

1 major / 2 minor

Summary. The manuscript proposes two aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), for combining answers from multiple LLMs. These methods incorporate first-order and second-order information to account for latent heterogeneity and correlations among models, in contrast to standard majority voting. The authors present a theoretical analysis claiming that the methods provably mitigate limitations of majority voting under mild assumptions, and report empirical results showing consistent outperformance on synthetic datasets, UltraFeedback, MMLU, and the ARMMAN healthcare task.

Significance. If the theoretical guarantees hold under assumptions that are both precisely stated and plausibly satisfied by real LLM ensembles, the work offers a training-free framework that could meaningfully improve reliability in multi-agent LLM reasoning systems. The inclusion of a real-world healthcare application strengthens the practical case. The absence of detailed proof sketches or assumption checks in the provided abstract limits assessment of how non-vacuous the claimed improvements are.

major comments (1)

[§3 (Theoretical Analysis)] The central claim that OW and ISP provably mitigate inherent limitations of majority voting rests on mild assumptions about latent heterogeneity and pairwise correlations. These assumptions must be stated with precise mathematical conditions (e.g., explicit bounds on response probabilities or conditions on the correlation matrix), and the manuscript should verify or discuss whether they hold for the LLMs and datasets used in the UltraFeedback/MMLU/ARMMAN experiments; without this, the performance gap over majority voting is not guaranteed to be non-vacuous.

minor comments (2)

[Abstract] The abstract asserts theoretical mitigation and consistent outperformance but provides no equations, proof ideas, dataset sizes, or error bars, making the strength of the claims difficult to evaluate from the summary.
[Experiments] Results on benchmarks should include error bars or statistical significance tests to support the claim of consistent outperformance over baselines.
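
The request for error bars can be met in several ways. One common option, sketched here as an assumption rather than as the authors' protocol, is a paired percentile bootstrap over questions for the accuracy difference between two aggregators scored on the same items. The correctness vectors and the function name are hypothetical.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same questions."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same questions")
    rng = random.Random(seed)  # local generator so results are reproducible
    n = len(correct_a)
    per_question_diff = [a - b for a, b in zip(correct_a, correct_b)]
    stats = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [per_question_diff[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical 0/1 correctness of two aggregators on the same ten questions.
correct_weighted = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
correct_majority = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
lower, upper = paired_bootstrap_ci(correct_weighted, correct_majority)
print(f"95% CI for the accuracy difference: [{lower:.2f}, {upper:.2f}]")
```

Pairing matters because both aggregators see the same questions; resampling the per-question differences keeps that dependence intact. An interval that excludes zero would support a claim of improvement on that benchmark, under the usual bootstrap caveats for small samples.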

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of our training-free aggregation framework. We address the major comment point by point below.

Referee: [§3 (Theoretical Analysis)] The central claim that OW and ISP provably mitigate inherent limitations of majority voting rests on mild assumptions about latent heterogeneity and pairwise correlations. These assumptions must be stated with precise mathematical conditions (e.g., explicit bounds on response probabilities or conditions on the correlation matrix), and the manuscript should verify or discuss whether they hold for the LLMs and datasets used in the UltraFeedback/MMLU/ARMMAN experiments; without this, the performance gap over majority voting is not guaranteed to be non-vacuous.

Authors: We appreciate the referee highlighting the need for clarity on our theoretical assumptions. Section 3 already formulates the assumptions with precise mathematical conditions, including explicit bounds on response probabilities under latent heterogeneity and conditions on the pairwise correlation matrix that enable the second-order information to yield provable gains over majority voting. The guarantees are therefore conditional on these assumptions, which we describe as mild because they allow for realistic levels of model heterogeneity and correlation without requiring independence or identical accuracy. We agree that a direct discussion of their plausibility for the UltraFeedback, MMLU, and ARMMAN settings would strengthen the link between theory and experiments. In the revised manuscript we will add a short discussion subsection that qualitatively argues why the assumptions are likely to hold in these LLM ensembles, drawing on the consistent empirical improvements we observe and the task characteristics.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in theoretical guarantees or algorithm derivation

The paper derives OW and ISP aggregation methods by leveraging first-order and second-order information from LLM responses, then provides a theoretical analysis showing they mitigate majority voting limitations under explicitly stated mild assumptions on latent heterogeneity and model correlations. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the guarantees are presented as following from the assumptions rather than tautologically equivalent to them. Empirical results on UltraFeedback, MMLU, and ARMMAN are reported separately as validation and do not serve as the basis for the theoretical claims. The derivation chain is therefore self-contained with independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of mild assumptions about model heterogeneity and correlations plus the availability of second-order information in the answer distributions.

axioms (1)

domain assumption: Mild assumptions on latent heterogeneity and correlation across models
Required for the theoretical analysis that shows OW and ISP mitigate limitations of majority voting.

pith-pipeline@v0.9.0 · 5678 in / 1095 out tokens · 38018 ms · 2026-05-21T20:43:58.602741+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "Theorem 1. Under Assumption 1, f_OW ... is the Bayesian optimal aggregator for any P."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Ensemble Selection from Binary and Pairwise Feedback
cs.GT · 2026-05 · unverdicted · novelty 7.0

The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...

Prior-Agnostic Robust Forecast Aggregation
cs.LG · 2026-04 · unverdicted · novelty 7.0

A log-odds pooling rule achieves minimax regret of 0.0255 for conditionally independent signals under prior-agnostic robust forecast aggregation with unknown state space.