Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Pith reviewed 2026-05-21 09:43 UTC · model grok-4.3
The pith
Averaging LLM factuality judgments across multiple candidate permutations raises selection accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that candidate-order sensitivity constitutes a real source of error in listwise factuality judging, and that a consensus aggregate over seven permutations of the candidate set improves top-1 selection accuracy on the RewardBench 2 Factuality benchmark from 86.00 percent to 91.33 percent using GPT-5.4 and from 86.33 percent to 89.67 percent using Claude Sonnet 4.6.
What carries the argument
PCFJudge, the procedure that applies an identical factuality-first listwise prompt to multiple random permutations of the answer candidates and then merges the individual scores, ranks, and uncertainty estimates into one final decision.
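The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `judge` callable, its signature, and the plain mean-of-scores aggregation are assumptions made for the example, and the paper's actual rule also merges ranks and uncertainty estimates.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical judge signature: takes the candidates in presentation order
# and returns one factuality score per presented position.
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(candidates: Sequence[str], judge: Judge,
                          k: int = 7, seed: int = 0) -> int:
    """Return the index (in the original order) of the consensus top-1 candidate."""
    rng = random.Random(seed)
    n = len(candidates)
    per_candidate: Dict[int, List[float]] = {i: [] for i in range(n)}
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)  # one random presentation order (repeats are possible)
        scores = judge([candidates[i] for i in order])
        # Map each positional score back to the candidate it belongs to.
        for position, original_index in enumerate(order):
            per_candidate[original_index].append(scores[position])
    # Simplified aggregation (assumption): plain mean of scores per candidate.
    consensus = {i: mean(v) for i, v in per_candidate.items()}
    return max(consensus, key=consensus.get)
```

The key step is mapping scores back to the original candidate indices before aggregating, so that any position-dependent bias is spread across candidates instead of always favoring the same one.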
If this is right
- Order sensitivity can be reduced at inference time by explicit marginalization over permutations.
- LLM-based factuality evaluation becomes more reliable when consensus across orderings is used.
- Existing benchmarks may contain hidden order-induced variance that the consensus method mitigates.
- The same principle could apply to other listwise tasks where presentation order affects outcomes.
Where Pith is reading between the lines
- This method might extend naturally to preference tuning or safety judging where order effects have been observed.
- Testing the approach on larger candidate sets or different prompt styles would reveal how broadly order marginalization helps.
- Combining permutation consensus with other ensemble techniques could yield further gains in judge robustness.
Load-bearing premise
The accuracy gains result specifically from marginalizing over different candidate orders rather than from running the judge multiple times or from other aspects of the aggregation.
What would settle it
Running the judge the same number of times but always keeping the original candidate order and observing no accuracy improvement would indicate that the benefit does not come from order variation.
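A toy version of this control experiment might look like the sketch below. Everything in it is hypothetical: the `QUALITY` table, the `biased_toy_judge` with a fixed bonus for the first position, and the mean-score aggregation are illustrative assumptions, not the paper's setup. The same aggregation rule is applied to two arms that differ only in the orderings they see.

```python
from itertools import permutations
from statistics import mean
from typing import Callable, List, Sequence

Judge = Callable[[Sequence[str]], List[float]]


def top1_from_orders(candidates: Sequence[str], judge: Judge,
                     orders: Sequence[Sequence[int]]) -> int:
    """Aggregate mean scores over the given orderings; return the top-1 index."""
    collected: List[List[float]] = [[] for _ in candidates]
    for order in orders:
        scores = judge([candidates[i] for i in order])
        for position, original_index in enumerate(order):
            collected[original_index].append(scores[position])
    means = [mean(s) for s in collected]
    return means.index(max(means))


# Hypothetical toy judge: true quality plus a bonus for the first position.
QUALITY = {"good": 0.9, "ok": 0.8, "bad": 0.1}


def biased_toy_judge(presented: Sequence[str]) -> List[float]:
    return [QUALITY[c] + (0.3 if pos == 0 else 0.0)
            for pos, c in enumerate(presented)]


candidates = ["ok", "good", "bad"]
identity = tuple(range(len(candidates)))
all_orders = list(permutations(range(len(candidates))))  # 6 distinct orders

# Arm 1: same number of judge calls, original order every time.
fixed_arm = top1_from_orders(candidates, biased_toy_judge,
                             [identity] * len(all_orders))
# Arm 2: same number of judge calls, a different order each time.
permuted_arm = top1_from_orders(candidates, biased_toy_judge, all_orders)

print(fixed_arm, permuted_arm)  # prints "0 1": fixed picks "ok", permuted picks "good"
```

Because the toy judge is deterministic, repeating the fixed order cannot change its answer. A real LLM judge sampled at nonzero temperature could still benefit from repeated identical-order calls, which is exactly what the control would have to measure.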
Original abstract
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PCFJudge, an inference-time method for listwise factuality evaluation that reruns the same prompt over K distinct permutations of the candidate set and aggregates the resulting scores, ranks, and uncertainty signals. On RewardBench 2 Factuality, the K=7 aggregate is reported to raise top-1 selection accuracy from 86.00% to 91.33% (GPT-5.4) and from 86.33% to 89.67% (Claude Sonnet 4.6), attributing the gains to marginalization of candidate-order sensitivity.
Significance. If the observed lifts are shown to arise specifically from order marginalization rather than generic multi-sample robustness, the method would offer a lightweight, training-free way to improve the reliability of LLM-as-judge pipelines for factuality tasks. The concrete benchmark numbers and the focus on a known nuisance variable constitute a practical contribution, though the current evidence leaves the attribution open.
major comments (3)
- [Abstract and §4] The central claim attributes the accuracy gains specifically to marginalizing candidate-order sensitivity, yet the reported experiments contain no ablation that compares the K-permutation aggregate against repeated identical-order judgments, temperature sampling, or prompt paraphrasing under the same aggregation rule. Without this isolation, the lift cannot be confidently ascribed to order rather than generic ensembling.
- [§4] The abstract states concrete accuracy figures (86.00% → 91.33%, 86.33% → 89.67%) but supplies neither error bars, number of independent runs, nor statistical significance tests. This omission makes it impossible to judge whether the reported improvements exceed run-to-run variance on RewardBench 2 Factuality.
- [§3, Method] The aggregation procedure for combining scores, ranks, and uncertainty signals across permutations is described at a high level; the precise weighting or voting rule is not given in sufficient detail to allow exact reproduction or to verify that the procedure itself does not introduce additional bias.
minor comments (2)
- [§4] The paper should include the full experimental protocol (exact prompt templates, temperature settings, and candidate-set construction details) in an appendix or supplementary material to support reproducibility.
- [§3] Notation for the consensus aggregation (e.g., how ranks and uncertainty are combined) should be formalized with equations rather than prose to improve clarity.
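The request for significance testing in the major comments could be met in several ways. One option, shown here only as a sketch, is an exact McNemar test on paired per-item outcomes (baseline judge versus consensus judge on the same benchmark items). The choice of McNemar's test and the function name are assumptions for illustration; the source does not say which paired test would be used, and no per-item outcomes are reported.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline_correct: Sequence[bool],
                  consensus_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(baseline_correct) != len(consensus_correct):
        raise ValueError("both outcome lists must cover the same items")
    pairs = list(zip(baseline_correct, consensus_correct))
    # Only discordant pairs carry information about a difference.
    b = sum(1 for base, cons in pairs if base and not cons)  # baseline right only
    c = sum(1 for base, cons in pairs if cons and not base)  # consensus right only
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, discordant pairs split like fair coin flips.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A paired test is a natural fit because both judges are evaluated on the same fixed items, so only the items where they disagree are informative.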
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.
- Referee: [Abstract and §4] The central claim attributes the accuracy gains specifically to marginalizing candidate-order sensitivity, yet the reported experiments contain no ablation that compares the K-permutation aggregate against repeated identical-order judgments, temperature sampling, or prompt paraphrasing under the same aggregation rule. Without this isolation, the lift cannot be confidently ascribed to order rather than generic ensembling.
  Authors: We agree that an explicit ablation isolating order marginalization from generic multi-sample ensembling would strengthen the attribution. The original experiments focused on the permutation-consensus procedure as a targeted response to the known nuisance of candidate ordering in listwise factuality prompts. In the revised manuscript we will add a controlled ablation that applies the identical aggregation rule to (i) repeated identical-order judgments and (ii) temperature-sampled judgments with fixed order, allowing direct comparison of the resulting accuracy lifts.
  Revision: yes
- Referee: [§4] The abstract states concrete accuracy figures (86.00% → 91.33%, 86.33% → 89.67%) but supplies neither error bars, number of independent runs, nor statistical significance tests. This omission makes it impossible to judge whether the reported improvements exceed run-to-run variance on RewardBench 2 Factuality.
  Authors: The referee correctly notes the absence of variance estimates. The reported point estimates reflect single evaluations on the fixed benchmark split. We will rerun the full evaluation pipeline across five independent trials (varying only the random seed for any stochastic components) and report mean accuracy with standard deviation together with a paired significance test against the single-order baseline in the revised §4 and, if space permits, the abstract.
  Revision: yes
- Referee: [§3, Method] The aggregation procedure for combining scores, ranks, and uncertainty signals across permutations is described at a high level; the precise weighting or voting rule is not given in sufficient detail to allow exact reproduction or to verify that the procedure itself does not introduce additional bias.
  Authors: We appreciate the call for greater reproducibility. We will expand §3 with the exact aggregation equations: normalized score averaging, rank aggregation via a weighted Borda count that incorporates the per-permutation uncertainty signal as a weight, and the final consensus selection rule. A concise algorithm box and reference implementation sketch will also be added so that the procedure can be reproduced without ambiguity.
  Revision: yes
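To make the weighted Borda count mentioned in the last response concrete, here is one possible reading of it. The weight `1 - uncertainty`, the normalization by total weight, and the function name are assumptions for illustration; the rebuttal does not specify how the uncertainty signal enters the weight.

```python
from typing import Dict, Sequence


def weighted_borda(rankings: Sequence[Sequence[str]],
                   uncertainties: Sequence[float]) -> Dict[str, float]:
    """Weighted Borda scores per candidate id.

    rankings[k] lists candidate ids best-first for permutation k.
    uncertainties[k] is assumed to lie in [0, 1]; the run's weight is
    1 - uncertainties[k] (an illustrative choice, not the paper's rule).
    """
    if len(rankings) != len(uncertainties):
        raise ValueError("need one uncertainty value per ranking")
    totals: Dict[str, float] = {}
    total_weight = 0.0
    for ranking, uncertainty in zip(rankings, uncertainties):
        weight = 1.0 - uncertainty
        total_weight += weight
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Classic Borda points: n-1 for first place down to 0 for last.
            totals[candidate] = totals.get(candidate, 0.0) + weight * (n - 1 - position)
    if total_weight == 0:
        return {candidate: 0.0 for candidate in totals}
    return {candidate: pts / total_weight for candidate, pts in totals.items()}


scores = weighted_borda(
    [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]],
    [0.0, 0.5, 0.0],
)
winner = max(scores, key=scores.get)  # "A" in this hand-made example
```

In this hand-made example the second run is half as influential as the others because of its higher uncertainty, so its preference for "B" does not overturn the two confident runs that rank "A" first.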
Circularity Check
No circularity: purely empirical aggregation with external benchmark measurements
The paper introduces PCFJudge as an inference-time procedure that reruns a fixed listwise prompt over K permutations of candidates and aggregates the outputs. The central results are direct accuracy measurements on the external RewardBench 2 Factuality benchmark (86.00% to 91.33% for GPT-5.4; 86.33% to 89.67% for Claude Sonnet 4.6). No equations, fitted parameters, or self-citations are used to derive these numbers; the reported lifts are observed outcomes rather than predictions forced by construction. The method contains no self-definitional loops, no renaming of known results as novel derivations, and no load-bearing uniqueness theorems imported from the authors' prior work. The evaluation is therefore self-contained against the stated benchmark.
Axiom & Free-Parameter Ledger
free parameters (1)
- K (number of permutations)
axioms (1)
- domain assumption: Candidate order is a source of instability in listwise factuality evaluation that can be mitigated by aggregation across permutations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: PCFJudge runs the same prompt over K orderings... aggregate the runs into four summary statistics... final consensus score C_i = 0.50 s̄_i + 0.25 B_i + ...
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Proposition 1... majority vote over the top-choice identities already reduces error exponentially in K... Hoeffding’s inequality
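For orientation, the standard Hoeffding argument behind a claim of this shape can be written out as below. This is a textbook reconstruction under stated assumptions, not the paper's Proposition 1, whose exact statement is elided in the quoted passage. The symbols X_k and p are introduced here for illustration: X_k = 1 if run k places the factual answer first and 0 otherwise, with the X_k assumed independent and E[X_k] = p > 1/2.

```latex
% Assumptions (not from the source): X_1, ..., X_K are independent,
% X_k \in \{0, 1\}, and \mathbb{E}[X_k] = p > 1/2.
% A strict majority for the correct answer fails only if the sample mean
% falls to 1/2 or below, i.e. deviates from p by at least p - 1/2.
\[
\Pr\!\left[\frac{1}{K}\sum_{k=1}^{K} X_k \le \frac{1}{2}\right]
\;\le\;
\exp\!\left(-2K\left(p - \tfrac{1}{2}\right)^{2}\right)
\]
```

The independence assumption is strong for repeated calls to the same model with the same prompt, so the bound is best read as an idealized motivation for consensus rather than a guarantee about any particular judge.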
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.