Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Hung Le; Manh Nguyen; Sunil Gupta

arXiv: 2604.12196 · v1 · submitted 2026-04-14 · 💻 cs.CL

Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords radial consensus score · best-of-N selection · LLM inference · semantic consensus · Fréchet mean · majority voting · answer selection · geometric aggregation

The pith

Radial Consensus Score selects LLM answers by radial distance to the weighted Fréchet mean of embeddings instead of vote counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Radial Consensus Score (RCS) to improve best-of-N selection from multiple LLM generations. RCS forms a semantic center as the weighted Fréchet mean of answer embeddings and ranks each candidate by its radial distance to that center. It supports flexible weighting schemes that combine frequency, probability, and geometry without any training. Experiments across seven benchmarks and five models show consistent gains over majority voting and other baselines, especially as the number of samples increases. The method also replaces voting in multi-agent debates and works in black-box access settings.

Core claim

Radial Consensus Score computes a weighted Fréchet mean of answer embeddings to form a semantic center and ranks candidates by their radial distance to this center, providing a general framework with variants for different weighting schemes that outperforms discrete voting methods in best-of-N selection across multiple tasks and models.
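The selection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes squared Euclidean distance, in which case the weighted Fréchet mean is the weighted average of the embeddings (consistent with the closed form z = Σ p_i u_i quoted from the paper's appendix in the reference excerpts on this page), and it takes answer embeddings as precomputed NumPy arrays. The names `rcs_select` and `frequency_weights` are hypothetical, and the toy vectors are made up.

```python
import numpy as np
from collections import Counter

def rcs_select(embeddings, weights=None):
    """Return (best_index, distances) for candidate embeddings of shape (N, d).

    Assumption: squared Euclidean geometry, so the weighted Frechet mean
    is the weighted average of the rows.
    """
    U = np.asarray(embeddings, dtype=float)
    n = U.shape[0]
    if weights is None:
        p = np.full(n, 1.0 / n)            # uniform variant
    else:
        p = np.asarray(weights, dtype=float)
        p = p / p.sum()                    # normalise so the weights sum to 1
    center = p @ U                         # semantic center
    dist = np.linalg.norm(U - center, axis=1)  # radial distance of each candidate
    return int(np.argmin(dist)), dist

def frequency_weights(answers):
    """Weight each candidate by how often its answer string was sampled."""
    counts = Counter(answers)
    return np.array([counts[a] for a in answers], dtype=float)

# Toy input: three samples agree, one disagrees (made-up 2-D "embeddings").
answers = ["4", "4", "4", "5"]
emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

best, dist = rcs_select(emb, frequency_weights(answers))
print(answers[best], dist.round(3))
```

Passing probability-derived weights instead of `frequency_weights(answers)` would give a probability-based variant; omitting the weights gives the uniform one, which needs nothing beyond the generated answers.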

What carries the argument

Radial Consensus Score, which ranks answers by radial distance to the weighted Fréchet mean of their embeddings to measure semantic consensus.

If this is right

RCS variants achieve higher accuracy than majority voting and probability-based methods, with larger gains at higher sampling budgets.
RCS functions as a direct replacement for majority voting inside multi-agent debate frameworks.
The approach maintains effectiveness under black-box model access where only generated answers are available.
Performance improvements appear across both short-form QA and long-form reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Embedding geometry may capture agreement in cases where surface votes fail due to diverse but semantically aligned answers.
RCS could combine with uncertainty estimates from the model to further refine selection without extra training.
The radial-distance principle might extend to other generative domains where quality aligns with proximity in latent space.

Load-bearing premise

The weighted Fréchet mean of answer embeddings reliably captures semantic consensus and radial distance to this center correlates with correctness better than frequency or probability alone.

What would settle it

A controlled test on a benchmark with labeled embeddings where RCS selects incorrect answers more often than majority voting when high-quality responses lie far from the embedding center.
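A rough sketch of how such a test could be tallied, assuming each case supplies the sampled answer strings, their embeddings, and a gold label. The helper names (`majority_vote`, `rcs_uniform`, `count_outcomes`) and the single synthetic case are hypothetical and only illustrate the failure mode; they are not data from the paper. In the synthetic case the correct answer wins the vote, but three distinct wrong answers sit close together in embedding space and pull the uniform center toward themselves.

```python
import numpy as np
from collections import Counter

def majority_vote(answers):
    # Most frequent answer string.
    return Counter(answers).most_common(1)[0][0]

def rcs_uniform(answers, embeddings):
    # Uniform-weight center (plain mean), then the candidate closest to it.
    U = np.asarray(embeddings, dtype=float)
    center = U.mean(axis=0)
    return answers[int(np.argmin(np.linalg.norm(U - center, axis=1)))]

def count_outcomes(cases):
    """cases: iterable of (answers, embeddings, gold). Returns a tally of who was right."""
    tally = {"mv_only": 0, "rcs_only": 0, "both": 0, "neither": 0}
    for answers, emb, gold in cases:
        mv_ok = majority_vote(answers) == gold
        rcs_ok = rcs_uniform(answers, emb) == gold
        if mv_ok and rcs_ok:
            tally["both"] += 1
        elif mv_ok:
            tally["mv_only"] += 1
        elif rcs_ok:
            tally["rcs_only"] += 1
        else:
            tally["neither"] += 1
    return tally

# Synthetic case: "A" is correct and has the most votes, but lies far from the center.
case = (
    ["A", "A", "B", "C", "D"],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]],
    "A",
)
tally = count_outcomes([case])
print(tally)
```

A benchmark on which `mv_only` clearly exceeds `rcs_only` would count against the load-bearing premise; the reverse would support it.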

Figures

Figures reproduced from arXiv: 2604.12196 by Hung Le, Manh Nguyen, Sunil Gupta.

**Figure 2.** Average performance over five benchmarks for different numbers of sampling responses.

**Figure 3.** Qualitative examples showing RCS recovers correct answers more reliably than SC.

**Figure 4.** (a) Effect of the sentence embedding model on Arithmetics and Form.Log. using Llama3.2-3B. (b) …

**Figure 5.** Effect of reasoning paths on Arithmetics and Form.Log. using Qwen2.5-3B. Similar results for …

**Figure 6.** Effect of the sentence embedding model on Arithmetics and Form.Log. using Qwen2.5-3B.

**Figure 7.** Performance on SciQ and GPQA when varying correctness threshold ( …

**Figure 8.** Effect of reasoning paths on Arithmetics and Form.Log. using Llama3.2-3B.

Original abstract

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RCS gives a clean geometric way to score LLM answers via distance to a weighted embedding mean, but the reported gains may mostly reflect the weighting choices rather than the radial signal.

The paper's main contribution is Radial Consensus Score, which computes a weighted Fréchet mean of answer embeddings and ranks candidates by their radial distance to that center. It claims consistent wins over majority voting and probability baselines on seven benchmarks with five models, and the edge widens with larger sampling budgets. The authors also show it works as a drop-in for multi-agent debate and stays effective in black-box settings. The method stays training-free and simple, which is a plus for practical use. The experiments span short QA and longer reasoning tasks, giving decent coverage. What stands out is the set of flexible weighting options (uniform, frequency, probability) inside the same geometric frame.

The soft spot is that the central claim rests on the geometry adding value beyond the weights already baked in. If errors tend to cluster in embedding space, or strong but infrequent answers sit far from the mean, the radial ranking could be redundant or even harmful. The abstract does not isolate this with ablations that disable the distance component or test whether correct answers reliably sit closer to the center. No error bars or significance tests are mentioned either.

This is the kind of incremental inference tweak that people running best-of-N pipelines or self-consistency setups would want to try. A reader focused on reliable LLM output selection gets the most out of it. I would send it to peer review because the idea is straightforward, the scope is reasonable, and the open question about what actually drives the gains is fixable with tighter controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces Radial Consensus Score (RCS), a training-free best-of-N selection method that computes a weighted Fréchet mean of answer embeddings to define a semantic center and ranks candidates by their radial distance to this center. It supports multiple weighting schemes (uniform, frequency-based, probability-based) and claims that RCS variants consistently outperform strong baselines across seven benchmarks and five open-weight models, with gains increasing at higher sampling budgets. Additional claims include effective replacement for majority voting in multi-agent debate and robustness in black-box settings.

Significance. If the empirical claims hold after addressing statistical and ablation gaps, RCS would offer a simple, geometry-aware alternative to discrete voting or pure probability methods for LLM answer selection. Its training-free nature, flexibility across weighting schemes, and applicability to black-box models represent practical strengths for scalable inference. The multi-benchmark evaluation and multi-agent extension add potential impact, though the core geometric contribution requires clearer isolation from existing signals.

major comments (2)

[Experimental results] Experimental results section: The abstract and reported results claim consistent outperformance with gains becoming more pronounced as N increases, but provide no error bars, standard deviations across multiple runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This undermines assessment of whether the observed improvements exceed variance, particularly for the scaling claim.
[Method] Method section (RCS definition): The central claim that radial distance to the weighted Fréchet mean captures semantic consensus better than frequency or probability signals alone is not supported by an ablation that isolates the radial component (e.g., comparing RCS to its weighting scheme without radial ranking, or analyzing embedding-space separation of correct vs. incorrect answers). Without this, it remains unclear whether the geometric signal adds non-redundant value or could degrade selection when rare correct answers lie far from the mean.
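The first major comment asks for uncertainty estimates such as bootstrap confidence intervals. A minimal sketch of a paired bootstrap over questions follows, assuming each method has a 0/1 correctness score on the same set of questions. The function name `paired_bootstrap_ci` is hypothetical, and the score vectors are made-up placeholders, not results from the paper.

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(correct_a) - mean(correct_b), resampling questions with replacement."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both methods must be scored on the same questions")
    rng = np.random.default_rng(seed)
    diff = a - b                                   # per-question paired difference
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    boot = diff[idx].mean(axis=1)                  # accuracy gap in each resample
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)

# Placeholder per-question outcomes (1 = correct), purely illustrative.
rcs_correct = [1, 1, 0, 1, 1, 0, 1, 1]
mv_correct  = [1, 0, 0, 1, 1, 0, 1, 0]

gap, lo, hi = paired_bootstrap_ci(rcs_correct, mv_correct)
print(f"accuracy gap = {gap:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling questions rather than methods keeps the pairing intact, which matters because both methods see the same sampled candidates. An interval that includes zero would mean the observed gap is within resampling noise at that sample size.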

minor comments (2)

[Abstract] The abstract states results on 'seven benchmarks' but does not list them explicitly; a table in the experiments section summarizing tasks, models, and metrics would improve readability.
[Method] Notation for the weighted Fréchet mean and radial distance computation could be clarified with short pseudocode or an explicit formula to aid reproducibility across different embedding models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions where appropriate to strengthen the manuscript.

Referee: [Experimental results] Experimental results section: The abstract and reported results claim consistent outperformance with gains becoming more pronounced as N increases, but provide no error bars, standard deviations across multiple runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This undermines assessment of whether the observed improvements exceed variance, particularly for the scaling claim.

Authors: We agree with this observation. The current manuscript does not include error bars or statistical tests, which limits the ability to assess the robustness of the results. In the revised version, we will include standard deviations computed over multiple independent runs and conduct statistical significance tests (such as paired t-tests) to validate the improvements, especially the scaling behavior with increasing N. revision: yes

Referee: [Method] Method section (RCS definition): The central claim that radial distance to the weighted Fréchet mean captures semantic consensus better than frequency or probability signals alone is not supported by an ablation that isolates the radial component (e.g., comparing RCS to its weighting scheme without radial ranking, or analyzing embedding-space separation of correct vs. incorrect answers). Without this, it remains unclear whether the geometric signal adds non-redundant value or could degrade selection when rare correct answers lie far from the mean.

Authors: We recognize the importance of isolating the radial component to substantiate the geometric contribution. The manuscript currently presents RCS as a combination of weighting and radial ranking but lacks a dedicated ablation removing the radial aspect. We will add this ablation in the revision by comparing full RCS to versions that use only the weighting schemes for selection. We will also include analysis of embedding distances for correct versus incorrect answers to show the separation provided by the radial metric. revision: yes
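The two analyses promised in this response could look roughly like the sketch below. It is one possible reading, not the authors' protocol: "weights only" is taken to mean picking the candidate with the largest weight, the geometry is assumed Euclidean, and `radial_distances`, `ablation_picks`, and `separation` are hypothetical names. The embeddings, weights, and correctness labels are made up.

```python
import numpy as np

def radial_distances(embeddings, weights):
    """Distance of every candidate to the weighted mean of the embeddings."""
    U = np.asarray(embeddings, dtype=float)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    center = p @ U
    return np.linalg.norm(U - center, axis=1)

def ablation_picks(embeddings, weights):
    """Index chosen by the weights alone vs. by the weights plus radial ranking."""
    weight_only = int(np.argmax(weights))
    with_radial = int(np.argmin(radial_distances(embeddings, weights)))
    return weight_only, with_radial

def separation(embeddings, weights, is_correct):
    """Mean radial distance of correct minus incorrect candidates.

    A negative value means correct answers sit closer to the center.
    """
    d = radial_distances(embeddings, weights)
    mask = np.asarray(is_correct, dtype=bool)
    return float(d[mask].mean() - d[~mask].mean())

# Made-up example: the single highest-weight candidate is an outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
weights = [0.25, 0.25, 0.30, 0.20]
is_correct = [True, True, False, True]

picks = ablation_picks(emb, weights)
gap = separation(emb, weights, is_correct)
print(picks, round(gap, 3))
```

Aggregated over a benchmark, the fraction of questions where the two picks differ, and which pick is right when they do, would show whether the radial step adds signal beyond the weights.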

Circularity Check

0 steps flagged

No significant circularity; RCS is a direct, non-reductive definition

The paper defines RCS explicitly as the radial distance to a weighted Fréchet mean computed on answer embeddings, with variants using uniform, frequency, or probability weights. This is a constructive algorithm rather than a derivation that reduces to its inputs by construction. No self-citations are load-bearing for the core method, no parameters are fitted and then relabeled as predictions, and no uniqueness theorems or ansatzes are smuggled in. Experimental claims rest on external benchmarks rather than tautological reuse of the same signals. The method is therefore self-contained against the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach assumes standard embedding spaces and Fréchet mean properties from prior math without new postulates.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Taking the derivative with respect to $z$ and setting it to zero:

$$-2 \sum_{i=1}^{N} p_i u_i + 2z = 0, \tag{10}$$

which yields:

$$z = \sum_{i=1}^{N} p_i u_i. \tag{11}$$

Since the objective is strictly convex in $z$, this solution is unique.

A.2 Implementation Details

We use 5-shot prompting (Brown et al., 2020) for short-form QA and Chain-of-Thought prompting (Wei et al.,

work page 2020
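The closed form in Equations (10) and (11) of the excerpt above can be checked numerically. The excerpt does not show the objective itself, so the sketch below assumes it is the weighted sum of squared Euclidean distances, Σ p_i ‖u_i − z‖², with weights that sum to 1; that assumption reproduces Equation (10) as its gradient condition. The function name `objective` and the random test data are illustrative.

```python
import numpy as np

def objective(z, U, p):
    # Assumed objective: sum_i p_i * ||u_i - z||^2, with the p_i summing to 1.
    return float(np.sum(p * np.sum((U - z) ** 2, axis=1)))

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))       # six random "embeddings" in four dimensions
p = rng.random(6)
p /= p.sum()                      # normalised weights

z_star = p @ U                    # Eq. (11): weighted average of the u_i
grad = -2 * (p @ U) + 2 * z_star  # left-hand side of Eq. (10) evaluated at z_star

# Any perturbation of z_star should increase the objective (strict convexity).
perturbed = [objective(z_star + rng.normal(scale=0.1, size=4), U, p) for _ in range(100)]
is_minimum = all(objective(z_star, U, p) < v for v in perturbed)
print(np.allclose(grad, 0.0), is_minimum)
```

Under this assumption, computing the semantic center costs one weighted average over N embeddings, with no iterative optimisation.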
[2]

for long-form tasks. For CE, we follow the original setup and set p=0.3. We summarize the evaluation benchmarks, including the number of evaluation samples and representative examples, in Table 7. For MMLU-Pro, we sample up to 10 questions per category (105 samples across 14 categories) to ensure broad coverage. We observe no differences between breaking ...

work page 2025
