Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score
Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3
The pith
Radial Consensus Score selects LLM answers by radial distance to the weighted Fréchet mean of embeddings instead of vote counts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Radial Consensus Score (RCS) computes a weighted Fréchet mean of answer embeddings to form a semantic center and ranks candidates by their radial distance to that center. The paper presents this as a general framework whose variants differ only in the weighting scheme, and claims that these variants outperform discrete voting methods in best-of-N selection across multiple tasks and models.
What carries the argument
Radial Consensus Score, which ranks answers by radial distance to the weighted Fréchet mean of their embeddings to measure semantic consensus.
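A minimal sketch of this selection rule, assuming a Euclidean embedding space (where the weighted Fréchet mean reduces to the weighted average) and assuming the candidate closest to the center is the one selected. The function name `radial_consensus_select` and the toy vectors are illustrative, not taken from the paper.

```python
import numpy as np

def radial_consensus_select(embeddings, weights=None):
    """Pick the candidate nearest the weighted mean of the embeddings.

    Assumption: Euclidean distance, so the weighted Frechet mean is the
    weighted average. Returns (index of selected candidate, all distances).
    """
    U = np.asarray(embeddings, dtype=float)            # shape (N, d)
    n = U.shape[0]
    if weights is None:
        w = np.full(n, 1.0 / n)                        # uniform weighting
    else:
        w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                    # normalise to sum to 1
    center = w @ U                                     # semantic center, shape (d,)
    distances = np.linalg.norm(U - center, axis=1)     # radial distance per candidate
    return int(np.argmin(distances)), distances

# Made-up 2-D "embeddings": three nearby answers and one outlier.
toy = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.1], [5.0, 5.0]]
best, dist = radial_consensus_select(toy)
```

In this toy case the outlier receives the largest distance, but it also drags the center toward itself, which is one reason the choice of weights matters.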
If this is right
- RCS variants achieve higher accuracy than majority voting and probability-based methods, with larger gains at higher sampling budgets.
- RCS functions as a direct replacement for majority voting inside multi-agent debate frameworks.
- The approach maintains effectiveness under black-box model access where only generated answers are available.
- Performance improvements appear across both short-form QA and long-form reasoning tasks.
Where Pith is reading between the lines
- Embedding geometry may capture agreement in cases where surface votes fail due to diverse but semantically aligned answers.
- RCS could combine with uncertainty estimates from the model to further refine selection without extra training.
- The radial-distance principle might extend to other generative domains where quality aligns with proximity in latent space.
Load-bearing premise
The weighted Fréchet mean of answer embeddings reliably captures semantic consensus and radial distance to this center correlates with correctness better than frequency or probability alone.
What would settle it
A controlled test on a benchmark with labeled embeddings where RCS selects incorrect answers more often than majority voting when high-quality responses lie far from the embedding center.
Original abstract
Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
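The abstract names three weighting schemes but this page does not give their exact definitions. The sketch below shows one plausible reading of each; the exact-string matching used for frequency and the softmax over sequence log-probabilities are assumptions, and all function names are hypothetical.

```python
from collections import Counter
import numpy as np

def uniform_weights(answers):
    """Every candidate gets the same weight."""
    n = len(answers)
    return np.full(n, 1.0 / n)

def frequency_weights(answers):
    """Weight each candidate by how often its answer string occurs.

    Assumption: answers are compared by exact string equality.
    """
    counts = Counter(answers)
    raw = np.array([counts[a] for a in answers], dtype=float)
    return raw / raw.sum()

def probability_weights(log_probs):
    """Weight each candidate by a softmax over its sequence log-probability.

    Assumption: log_probs are available, i.e. this is not the black-box setting.
    """
    lp = np.asarray(log_probs, dtype=float)
    z = np.exp(lp - lp.max())          # subtract the max for numerical stability
    return z / z.sum()

# Made-up candidate answers for illustration.
answers = ["Paris", "Paris", "Lyon", "Paris"]
w_uniform = uniform_weights(answers)
w_freq = frequency_weights(answers)
w_prob = probability_weights([-1.0, -2.0, -3.0, -1.5])
```

Only the uniform and frequency variants need nothing beyond the generated text, which is consistent with the abstract's black-box claim.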
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Radial Consensus Score (RCS), a training-free best-of-N selection method that computes a weighted Fréchet mean of answer embeddings to define a semantic center and ranks candidates by their radial distance to this center. It supports multiple weighting schemes (uniform, frequency-based, probability-based) and claims that RCS variants consistently outperform strong baselines across seven benchmarks and five open-weight models, with gains increasing at higher sampling budgets. Additional claims include effective replacement for majority voting in multi-agent debate and robustness in black-box settings.
Significance. If the empirical claims hold after addressing statistical and ablation gaps, RCS would offer a simple, geometry-aware alternative to discrete voting or pure probability methods for LLM answer selection. Its training-free nature, flexibility across weighting schemes, and applicability to black-box models represent practical strengths for scalable inference. The multi-benchmark evaluation and multi-agent extension add potential impact, though the core geometric contribution requires clearer isolation from existing signals.
Major comments (2)
- [Experimental results] Experimental results section: The abstract and reported results claim consistent outperformance with gains becoming more pronounced as N increases, but provide no error bars, standard deviations across multiple runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This undermines assessment of whether the observed improvements exceed variance, particularly for the scaling claim.
- [Method] Method section (RCS definition): The central claim that radial distance to the weighted Fréchet mean captures semantic consensus better than frequency or probability signals alone is not supported by an ablation that isolates the radial component (e.g., comparing RCS to its weighting scheme without radial ranking, or analyzing embedding-space separation of correct vs. incorrect answers). Without this, it remains unclear whether the geometric signal adds non-redundant value or could degrade selection when rare correct answers lie far from the mean.
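One way to obtain the bootstrap confidence intervals mentioned in the first comment is a paired bootstrap over questions. This is a sketch under stated assumptions: each method yields one 0/1 correctness value per question on the same question set, and the function name and parameters are illustrative rather than anything the paper specifies.

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (A minus B).

    correct_a, correct_b: per-question 0/1 correctness for two selection
    methods evaluated on the same questions (paired design).
    Returns (observed difference, lower bound, upper bound).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both methods must be scored on the same questions")
    rng = np.random.default_rng(seed)
    n = a.size
    # Resample question indices with replacement, keeping the pairing intact.
    idx = rng.integers(0, n, size=(n_resamples, n))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float((a - b).mean()), float(lo), float(hi)

# Made-up correctness vectors, purely to show the call shape.
rcs_correct = [1, 1, 0, 1, 1, 0, 1, 1]
vote_correct = [1, 0, 0, 1, 1, 0, 1, 0]
diff, lower, upper = paired_bootstrap_ci(rcs_correct, vote_correct)
```

If the interval excludes zero at the chosen `alpha`, the observed gap is unlikely to be explained by question-level sampling noise alone. Repeating the procedure at each sampling budget N would speak to the scaling claim.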
Minor comments (2)
- [Abstract] The abstract states results on 'seven benchmarks' but does not list them explicitly; a table in the experiments section summarizing tasks, models, and metrics would improve readability.
- [Method] Notation for the weighted Fréchet mean and radial distance computation could be clarified with a short pseudocode or explicit formula to aid reproducibility across different embedding models.
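An explicit formula of the kind requested might read as follows. The symbols $u_i$ (answer embeddings), $p_i$ (weights) and $z$ (center) follow the appendix fragment quoted in the reference graph on this page; the squared-Euclidean objective, the normalisation of the weights, and the rule of selecting the smallest radial distance are assumptions.

```latex
% Weighted Frechet mean under squared Euclidean distance (assumed objective),
% with weights p_i >= 0 and \sum_i p_i = 1:
z^{\star} \;=\; \arg\min_{z} \sum_{i=1}^{N} p_i \,\lVert z - u_i \rVert^{2}
          \;=\; \sum_{i=1}^{N} p_i \, u_i .

% Radial distance of candidate i and the selected index (assumed rule):
r_i \;=\; \lVert u_i - z^{\star} \rVert ,
\qquad
\hat{\imath} \;=\; \arg\min_{1 \le i \le N} r_i .
```

With a non-Euclidean metric (for example geodesic distance on the unit sphere) the Fréchet mean generally has no closed form and would need an iterative solver.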
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and commit to revisions where appropriate to strengthen the manuscript.
- Referee: [Experimental results] Experimental results section: The abstract and reported results claim consistent outperformance with gains becoming more pronounced as N increases, but provide no error bars, standard deviations across multiple runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This undermines assessment of whether the observed improvements exceed variance, particularly for the scaling claim.
  Authors: We agree with this observation. The current manuscript does not include error bars or statistical tests, which limits the ability to assess the robustness of the results. In the revised version, we will include standard deviations computed over multiple independent runs and conduct statistical significance tests (such as paired t-tests) to validate the improvements, especially the scaling behavior with increasing N.
  Revision: yes
- Referee: [Method] Method section (RCS definition): The central claim that radial distance to the weighted Fréchet mean captures semantic consensus better than frequency or probability signals alone is not supported by an ablation that isolates the radial component (e.g., comparing RCS to its weighting scheme without radial ranking, or analyzing embedding-space separation of correct vs. incorrect answers). Without this, it remains unclear whether the geometric signal adds non-redundant value or could degrade selection when rare correct answers lie far from the mean.
  Authors: We recognize the importance of isolating the radial component to substantiate the geometric contribution. The manuscript currently presents RCS as a combination of weighting and radial ranking but lacks a dedicated ablation removing the radial aspect. We will add this ablation in the revision by comparing full RCS to versions that use only the weighting schemes for selection. We will also include analysis of embedding distances for correct versus incorrect answers to show the separation provided by the radial metric.
  Revision: yes
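The embedding-distance analysis promised in the second response could start from something like the sketch below, which compares the mean radial distance of correct and incorrect candidates for one question. It assumes Euclidean distance and per-candidate correctness labels; `radial_separation` and the toy data are hypothetical.

```python
import numpy as np

def radial_separation(embeddings, is_correct, weights=None):
    """Mean radial distance of correct vs. incorrect candidates.

    Returns (mean distance of correct, mean distance of incorrect).
    A smaller first value supports the premise that correct answers
    sit nearer the center; a larger one flags the failure case.
    """
    U = np.asarray(embeddings, dtype=float)
    mask = np.asarray(is_correct, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("need at least one correct and one incorrect candidate")
    n = U.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    center = (w / w.sum()) @ U                      # weighted mean as the center
    r = np.linalg.norm(U - center, axis=1)          # radial distances
    return float(r[mask].mean()), float(r[~mask].mean())

# Made-up embeddings: a tight cluster plus one far-away candidate.
toy = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.1], [5.0, 5.0]]
# Case 1: the cluster is correct, the outlier is wrong.
near_c, near_i = radial_separation(toy, [True, True, True, False])
# Case 2: the rare outlier is the correct answer (the referee's concern).
far_c, far_i = radial_separation(toy, [False, False, False, True])
```

Aggregating these two numbers over a benchmark, split by whether majority voting was right, would indicate how often the second case occurs in practice.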
Circularity Check
No significant circularity; RCS is a direct, non-reductive definition
The paper defines RCS explicitly as the radial distance to a weighted Fréchet mean computed on answer embeddings, with variants using uniform, frequency, or probability weights. This is a constructive algorithm rather than a derivation that reduces to its inputs by construction. No self-citations are load-bearing for the core method, no parameters are fitted and then relabeled as predictions, and no uniqueness theorems or ansatzes are smuggled in. Experimental claims rest on external benchmarks rather than tautological reuse of the same signals. The method is therefore self-contained against the provided text.
Reference graph
Works this paper leans on
-
[1]
(11) Since the objective is strictly convex in $z$, this solution is unique
(9) Taking the derivative with respect to $z$ and setting it to zero:
$$-2 \sum_{i=1}^{N} p_i u_i + 2z = 0, \tag{10}$$
which yields:
$$z = \sum_{i=1}^{N} p_i u_i. \tag{11}$$
Since the objective is strictly convex in $z$, this solution is unique.
A.2 Implementation Details
We use 5-shot prompting (Brown et al., 2020) for short-form QA and Chain-of-Thought prompting (Wei et al.,
work page 2020
-
[2]
for long-form tasks. For CE, we follow the original setup and set p=0.3. We summarize the evaluation benchmarks, including the number of evaluation samples and representative examples, in Table 7. For MMLU-Pro, we sample up to 10 questions per category (105 samples across 14 categories) to ensure broad coverage. We observe no differences between breaking ...
work page 2025