Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Elsa Fan; Justin Tang; Nathan Huang; Tianyi Huang; Wenqian Chen

arxiv: 2603.20562 · v3 · pith:Y6C2V5CY · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 09:43 UTC · model grok-4.3

keywords LLM judges · factuality evaluation · listwise judging · permutation consensus · order sensitivity · consensus aggregation · RewardBench

The pith

Averaging LLM factuality judgments across multiple candidate permutations raises selection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models used as factuality judges change their rankings when the order of candidate answers is rearranged, even though the candidates themselves are unchanged. To address this, it proposes running the same listwise prompt several times with different orderings of the same candidates and then combining the scores and ranks into a single consensus result. The approach matters because it targets a controllable source of inconsistency in automated evaluation without changing the underlying model. A sympathetic reader would care because more stable LLM judges could make benchmark results more trustworthy when comparing models on truthfulness. If correct, the method shows that part of current judging error comes from presentation order rather than from genuine limits in the judge's understanding.

Core claim

The paper establishes that candidate-order sensitivity constitutes a real source of error in listwise factuality judging, and that a consensus aggregate over seven permutations of the candidate set improves top-1 selection accuracy on the RewardBench 2 Factuality benchmark from 86.00 percent to 91.33 percent using GPT-5.4 and from 86.33 percent to 89.67 percent using Claude Sonnet 4.6.

What carries the argument

PCFJudge, the procedure that applies an identical factuality-first listwise prompt to multiple random permutations of the answer candidates and then merges the individual scores, ranks, and uncertainty estimates into one final decision.
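
The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Judge` interface, the `toy_judge` with its first-position bonus, and the plain mean used for aggregation are all assumptions made for the example (the paper's actual consensus rule also uses ranks and uncertainty signals and is not reproduced here).

```python
import math
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface (not from the paper): given candidates in
# presentation order, return one factuality score per position.
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(
    candidates: Sequence[str], judge: Judge, k: int = 7, seed: int = 0
) -> Tuple[int, Dict[int, float]]:
    """Score candidates under up to k distinct orderings; average per candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    n = len(candidates)
    target = min(k, math.factorial(n))  # at most n! distinct orderings exist
    seen = set()
    per_candidate: Dict[int, List[float]] = {i: [] for i in range(n)}
    while len(seen) < target:
        order = tuple(rng.sample(range(n), n))
        if order in seen:
            continue
        seen.add(order)
        scores = judge([candidates[i] for i in order])
        # Map each positional score back to the candidate's original index.
        for position, original_index in enumerate(order):
            per_candidate[original_index].append(scores[position])
    # Assumption: a simple mean stands in for the paper's consensus rule.
    consensus = {i: mean(v) for i, v in per_candidate.items()}
    best = max(consensus, key=consensus.get)
    return best, consensus


# Toy judge with a built-in first-position bonus, for illustration only.
QUALITY = {"A": 0.6, "B": 0.7, "C": 0.5}


def toy_judge(ordered: Sequence[str]) -> List[float]:
    return [QUALITY[c] + (0.3 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


single_pass = toy_judge(["A", "B", "C"])  # one ordering: the bonus favors "A"
best, consensus = permutation_consensus(["A", "B", "C"], toy_judge, k=6)
```

With three candidates, `k=6` covers every ordering, so the position bonus is spread evenly and the toy consensus selects "B" (index 1), whereas the single pass selects "A". With fewer orderings than n!, the bonus is only approximately balanced.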

If this is right

Order sensitivity can be reduced at inference time by explicit marginalization over permutations.
LLM-based factuality evaluation becomes more reliable when consensus across orderings is used.
Existing benchmarks may contain hidden order-induced variance that the consensus method mitigates.
The same principle could apply to other listwise tasks where presentation order affects outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might extend naturally to preference tuning or safety judging where order effects have been observed.
Testing the approach on larger candidate sets or different prompt styles would reveal how broadly order marginalization helps.
Combining permutation consensus with other ensemble techniques could yield further gains in judge robustness.

Load-bearing premise

The accuracy gains result specifically from marginalizing over different candidate orders rather than from running the judge multiple times or from other aspects of the aggregation.

What would settle it

Running the judge the same number of times but always keeping the original candidate order and observing no accuracy improvement would indicate that the benefit does not come from order variation.
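
The shape of that control can be illustrated with a toy simulation. Everything here is hypothetical: the judge is a stand-in with a hard-coded first-position bonus and Gaussian noise, the candidate names and quality values are invented, and summed scores replace the paper's aggregation rule. Because the bias is built in by construction, the outcome says nothing about real LLM judges; the sketch only shows how the two arms of the comparison would be set up with an equal number of calls.

```python
import random

# Hypothetical toy judge: true quality + first-position bonus + noise.
def toy_judge(order, quality, rng, first_bonus=0.15, noise=0.05):
    return [
        quality[c] + (first_bonus if pos == 0 else 0.0) + rng.gauss(0.0, noise)
        for pos, c in enumerate(order)
    ]


def pick(candidates, quality, k, rng, shuffle):
    """Run the judge k times and return the candidate with the highest total."""
    totals = {c: 0.0 for c in candidates}
    for _ in range(k):
        order = list(candidates)
        if shuffle:
            rng.shuffle(order)  # treatment arm: a fresh ordering per run
        for c, s in zip(order, toy_judge(order, quality, rng)):
            totals[c] += s
    return max(totals, key=totals.get)


def accuracy(shuffle, trials=500, k=7, seed=0):
    rng = random.Random(seed)
    # Invented qualities; the weaker "decoy" always sits first in the
    # original order, so the position bonus works against "best".
    quality = {"decoy": 0.60, "best": 0.70, "other": 0.50}
    candidates = ["decoy", "best", "other"]
    hits = sum(
        pick(candidates, quality, k, rng, shuffle) == "best" for _ in range(trials)
    )
    return hits / trials


fixed_order_acc = accuracy(shuffle=False)  # control: same order, k repeats
permuted_acc = accuracy(shuffle=True)      # treatment: k shuffled orders
```

In this toy setup, repeating the fixed order averages away noise but not the position bonus, so only the permuted arm recovers the better candidate reliably. A real ablation would replace `toy_judge` with actual judge calls and keep the aggregation rule identical across arms.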

Original abstract

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PCFJudge, an inference-time method for listwise factuality evaluation that reruns the same prompt over K distinct permutations of the candidate set and aggregates the resulting scores, ranks, and uncertainty signals. On RewardBench 2 Factuality, the K=7 aggregate is reported to raise top-1 selection accuracy from 86.00% to 91.33% (GPT-5.4) and from 86.33% to 89.67% (Claude Sonnet 4.6), attributing the gains to marginalization of candidate-order sensitivity.

Significance. If the observed lifts are shown to arise specifically from order marginalization rather than generic multi-sample robustness, the method would offer a lightweight, training-free way to improve the reliability of LLM-as-judge pipelines for factuality tasks. The concrete benchmark numbers and the focus on a known nuisance variable constitute a practical contribution, though the current evidence leaves the attribution open.

major comments (3)

[Abstract and §4] The central claim attributes the accuracy gains specifically to marginalizing candidate-order sensitivity, yet the reported experiments contain no ablation that compares the K-permutation aggregate against repeated identical-order judgments, temperature sampling, or prompt paraphrasing under the same aggregation rule. Without this isolation, the lift cannot be confidently ascribed to order rather than generic ensembling.
[§4] The abstract states concrete accuracy figures (86.00% → 91.33%, 86.33% → 89.67%) but supplies neither error bars, number of independent runs, nor statistical significance tests. This omission makes it impossible to judge whether the reported improvements exceed run-to-run variance on RewardBench 2 Factuality.
[§3, Method] The aggregation procedure for combining scores, ranks, and uncertainty signals across permutations is described at a high level; the precise weighting or voting rule is not given in sufficient detail to allow exact reproduction or to verify that the procedure itself does not introduce additional bias.
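
For the second comment, the kind of uncertainty reporting being asked for could look like the sketch below: a Wilson score interval for a single accuracy figure and an exact two-sided McNemar test for a paired baseline-versus-consensus comparison. These are standard choices swapped in for illustration; the source does not say which test the authors would use. `N_ITEMS = 300` is a hypothetical item count chosen only so that 86.00% maps to a whole number of items, and the discordant counts passed to `mcnemar_exact_p` are invented.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items the baseline got right and the consensus got wrong.
    c: items the consensus got right and the baseline got wrong.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


N_ITEMS = 300  # hypothetical; the source does not state the item count
low, high = wilson_interval(correct=258, n=N_ITEMS)  # 258 / 300 = 86.00%
p_value = mcnemar_exact_p(b=4, c=20)  # invented discordant counts
```

The paired test uses only the items on which the two judging procedures disagree, which is why per-item predictions, not just aggregate accuracies, would need to be released.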

minor comments (2)

[§4] The paper should include the full experimental protocol (exact prompt templates, temperature settings, and candidate-set construction details) in an appendix or supplementary material to support reproducibility.
[§3] Notation for the consensus aggregation (e.g., how ranks and uncertainty are combined) should be formalized with equations rather than prose to improve clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

Referee: [Abstract and §4] The central claim attributes the accuracy gains specifically to marginalizing candidate-order sensitivity, yet the reported experiments contain no ablation that compares the K-permutation aggregate against repeated identical-order judgments, temperature sampling, or prompt paraphrasing under the same aggregation rule. Without this isolation, the lift cannot be confidently ascribed to order rather than generic ensembling.

Authors: We agree that an explicit ablation isolating order marginalization from generic multi-sample ensembling would strengthen the attribution. The original experiments focused on the permutation-consensus procedure as a targeted response to the known nuisance of candidate ordering in listwise factuality prompts. In the revised manuscript we will add a controlled ablation that applies the identical aggregation rule to (i) repeated identical-order judgments and (ii) temperature-sampled judgments with fixed order, allowing direct comparison of the resulting accuracy lifts.
revision: yes

Referee: [§4] The abstract states concrete accuracy figures (86.00% → 91.33%, 86.33% → 89.67%) but supplies neither error bars, number of independent runs, nor statistical significance tests. This omission makes it impossible to judge whether the reported improvements exceed run-to-run variance on RewardBench 2 Factuality.

Authors: The referee correctly notes the absence of variance estimates. The reported point estimates reflect single evaluations on the fixed benchmark split. We will rerun the full evaluation pipeline across five independent trials (varying only the random seed for any stochastic components) and report mean accuracy with standard deviation together with a paired significance test against the single-order baseline in the revised §4 and, if space permits, the abstract.
revision: yes

Referee: [§3, Method] The aggregation procedure for combining scores, ranks, and uncertainty signals across permutations is described at a high level; the precise weighting or voting rule is not given in sufficient detail to allow exact reproduction or to verify that the procedure itself does not introduce additional bias.

Authors: We appreciate the call for greater reproducibility. We will expand §3 with the exact aggregation equations: normalized score averaging, rank aggregation via a weighted Borda count that incorporates the per-permutation uncertainty signal as a weight, and the final consensus selection rule. A concise algorithm box and reference implementation sketch will also be added so that the procedure can be reproduced without ambiguity.
revision: yes
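
As a rough picture of what a weighted Borda count of this kind might look like, consider the sketch below. It is one plausible reading, not the authors' rule: the `(ranking, weight)` input format, the use of a per-run confidence as the weight, and the normalization by total weight are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical input format: one (ranking, weight) pair per permutation run.
# ranking lists candidate ids best-first; weight is an assumed per-run
# confidence, for example 1 minus the judge's reported uncertainty.
Run = Tuple[Sequence[str], float]


def weighted_borda(runs: List[Run]) -> Dict[str, float]:
    """Weight-normalized Borda points per candidate across runs."""
    total_weight = sum(weight for _, weight in runs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    totals: Dict[str, float] = {}
    for ranking, weight in runs:
        n = len(ranking)
        for rank, candidate in enumerate(ranking):
            # Standard Borda points: n-1 for first place down to 0 for last.
            points = n - 1 - rank
            totals[candidate] = totals.get(candidate, 0.0) + weight * points
    return {c: t / total_weight for c, t in totals.items()}


runs = [
    (["a", "b", "c"], 1.0),  # confident run that prefers "a"
    (["b", "a", "c"], 0.5),  # two less confident runs that prefer "b"
    (["b", "c", "a"], 0.5),
]
borda = weighted_borda(runs)
winner = max(borda, key=borda.get)
```

In this invented example "b" wins with 1.5 normalized points against 1.25 for "a", because consistent second-or-better placement outweighs a single confident first place. How ties are broken and how these points combine with averaged scores are exactly the details the referee asks the authors to specify.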

Circularity Check

0 steps flagged

No circularity: purely empirical aggregation with external benchmark measurements

The paper introduces PCFJudge as an inference-time procedure that reruns a fixed listwise prompt over K permutations of candidates and aggregates the outputs. The central results are direct accuracy measurements on the external RewardBench 2 Factuality benchmark (86.00% to 91.33% for GPT-5.4; 86.33% to 89.67% for Claude Sonnet 4.6). No equations, fitted parameters, or self-citations are used to derive these numbers; the reported lifts are observed outcomes rather than predictions forced by construction. The method contains no self-definitional loops, no renaming of known results as novel derivations, and no load-bearing uniqueness theorems imported from the authors' prior work. The evaluation is therefore self-contained against the stated benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that order sensitivity is a meaningful and correctable source of error in listwise factuality judging; no free parameters or invented entities are introduced beyond the choice of K=7.

free parameters (1)

K (number of permutations)
The paper selects K=7 for the final reported aggregate; this hyperparameter is chosen rather than derived.

axioms (1)

domain assumption: Candidate order is a source of instability in listwise factuality evaluation that can be mitigated by aggregation across permutations.
Stated directly in the abstract as the motivation for PCFJudge.

pith-pipeline@v0.9.0 · 5712 in / 1318 out tokens · 58592 ms · 2026-05-21T09:43:23.560095+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

PCFJudge runs the same prompt over K orderings... aggregate the runs into four summary statistics... final consensus score Ci = 0.50 s̄i + 0.25 Bi + ...

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Proposition 1... majority vote over the top-choice identities already reduces error exponentially in K... Hoeffding’s inequality
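
The quoted Proposition 1 is truncated, so its exact statement is not reproduced here. As a point of reference only, the textbook form of such a bound is shown below. It assumes that the top choices of the K runs are independent and that each is correct with the same probability p > 1/2; the excerpt does not confirm either assumption, and runs of one judge on the same item may well be correlated.

```latex
\[
X_k \in \{0,1\}, \qquad \Pr[X_k = 1] = p > \tfrac{1}{2}, \qquad
\bar{X} = \frac{1}{K}\sum_{k=1}^{K} X_k
\]
\[
\Pr[\text{majority vote is wrong}]
\;\le\; \Pr\!\left[\bar{X} \le \tfrac{1}{2}\right]
\;=\; \Pr\!\left[\bar{X} - p \le -\left(p - \tfrac{1}{2}\right)\right]
\;\le\; \exp\!\left(-2K\left(p - \tfrac{1}{2}\right)^{2}\right)
\]
```

Here X_k indicates whether run k's top choice is the correct candidate. The first inequality holds for any number of candidates, since a candidate that receives more than half of the votes necessarily wins; the last step is Hoeffding's inequality applied to the mean of K bounded independent variables.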

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.