Recognition: 2 theorem links · Lean Theorem
Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3
The pith
A training method using permutations reduces selection bias in LLMs by promoting consistent semantic reasoning across option orders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PA-GRPO mitigates selection bias by enforcing permutation-consistent semantic reasoning. It constructs a permutation group for each instance and optimizes with two signals: a cross-permutation advantage, computed relative to the mean reward over all permutations of that instance, and a consistency-aware reward that encourages the same decision across permutations.
What carries the argument
Two mechanisms within PA-GRPO: the cross-permutation advantage, which computes advantages relative to the mean reward across permutations of the same instance, and the consistency-aware reward, which rewards consistent outputs for that instance.
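The two mechanisms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 0/1 correctness reward, the agreement-fraction form of the consistency bonus, the `weight` parameter, and both function names are hypothetical. The standard-deviation normalization commonly used in GRPO is omitted because the description here mentions only the mean.

```python
from collections import Counter
from typing import List


def consistency_bonus(choices: List[str], weight: float = 0.5) -> List[float]:
    """Hypothetical consistency-aware term.

    Each rollout earns `weight` times the fraction of rollouts in the
    permutation group that selected the same option content.
    """
    counts = Counter(choices)
    n = len(choices)
    return [weight * counts[c] / n for c in choices]


def cross_permutation_advantages(
    choices: List[str], correct: str, weight: float = 0.5
) -> List[float]:
    """Advantage of each rollout relative to the group's mean reward.

    choices[i] is the option *content* (not the label or position) that the
    model selected under permutation i of one instance.
    """
    # Assumed base reward: 1 for the correct content, 0 otherwise.
    correctness = [1.0 if c == correct else 0.0 for c in choices]
    bonus = consistency_bonus(choices, weight)
    rewards = [r + b for r, b in zip(correctness, bonus)]
    # Baseline is the mean reward over all permutations of the same instance.
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


# Four permutations of one question; the third rollout picks a different option.
adv = cross_permutation_advantages(["Paris", "Paris", "Lyon", "Paris"], "Paris")
print(adv)
```

In this toy group the dissenting rollout receives a negative advantage and the three agreeing, correct rollouts receive a positive one; the advantages sum to zero because the baseline is the group mean.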
Load-bearing premise
That making the model consistent across permutations of the same question actually captures better semantic reasoning, rather than merely averaging biases or creating new inconsistencies.
What would settle it
An experiment in which a PA-GRPO-trained model is tested on permuted questions and shows persistent selection bias, or lower accuracy than a model trained in the standard way on non-permuted inputs.
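One way such a test could be scored is sketched below. The metric definitions, the `Rollout` layout, and the function name `permutation_report` are assumptions made for illustration; the paper's own evaluation protocol is not described on this page.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One rollout: (position label the model picked, option content at that position).
Rollout = Tuple[str, str]


def permutation_report(
    groups: List[List[Rollout]], answers: List[str]
) -> Dict[str, object]:
    """Summarize accuracy, cross-permutation consistency, and position usage.

    groups[i] holds the rollouts for every permutation of instance i, and
    answers[i] is the correct option content for that instance.
    """
    total = 0
    correct = 0
    consistent_groups = 0
    position_counts: Counter = Counter()
    for rollouts, answer in zip(groups, answers):
        contents = [content for _, content in rollouts]
        position_counts.update(pos for pos, _ in rollouts)
        correct += sum(1 for c in contents if c == answer)
        total += len(rollouts)
        # A group is consistent when every permutation yields the same content.
        consistent_groups += int(len(set(contents)) == 1)
    return {
        "accuracy": correct / total,
        "consistency": consistent_groups / len(groups),
        # A share far from uniform hints at a positional preference.
        "position_share": {p: n / total for p, n in position_counts.items()},
    }


report = permutation_report(
    groups=[[("A", "x"), ("B", "x")], [("A", "y"), ("A", "z")]],
    answers=["x", "y"],
)
print(report)
```

Comparing this report for a PA-GRPO-trained model and a conventionally trained one, on the same permuted questions, would show whether bias persists or accuracy drops.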
Original abstract
Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code is available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) to mitigate selection bias in LLMs on multiple-choice and pairwise tasks. It constructs a permutation group per instance, computes advantages relative to the mean reward across those permutations, and adds a consistency-aware reward term to encourage identical decisions across permutations, claiming superior performance and bias reduction on seven benchmarks compared to strong baselines.
Significance. If the empirical results hold under rigorous verification, PA-GRPO would represent a practical training-time debiasing technique that avoids the inference overhead of existing methods while preserving reasoning performance, addressing a persistent reliability issue in LLM-based evaluation.
major comments (2)
- The central claim that PA-GRPO enforces 'permutation-consistent semantic reasoning' is not supported by evidence that the policy conditions on question content rather than simply producing fixed outputs across permutations. No ablations (e.g., content-altered permutations or attention analysis) or comparison to a pure consistency baseline are described, leaving open the possibility that bias reduction occurs by construction without semantic improvement.
- Experimental results are reported without details on setup: number of permutations per instance, exact baseline implementations, reward model specification, statistical tests, or controls for post-hoc selection. This absence prevents verification of the claimed outperformance and bias reduction across the seven benchmarks.
minor comments (1)
- The GitHub code link is a positive step for reproducibility; ensure the released code matches the exact experimental configuration described.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each major comment below, providing clarifications and indicating where revisions will be made to improve the paper.
Point-by-point responses
- Referee: The central claim that PA-GRPO enforces 'permutation-consistent semantic reasoning' is not supported by evidence that the policy conditions on question content rather than simply producing fixed outputs across permutations. No ablations (e.g., content-altered permutations or attention analysis) or comparison to a pure consistency baseline are described, leaving open the possibility that bias reduction occurs by construction without semantic improvement.
  Authors: We respectfully disagree with the interpretation that bias reduction occurs merely by construction without semantic improvement. The rewards in PA-GRPO are derived from the correctness of the selected answer, which is inherently semantic. The cross-permutation advantage compares the reward of a particular permutation to the average across all permutations of the same instance, incentivizing the model to select the semantically correct option consistently rather than defaulting to a fixed position or output. A model that produces fixed position outputs (e.g., always choosing the first option) would receive low average rewards when the correct answer is not in that position, leading to negative advantages for such choices. The consistency-aware reward further reinforces selecting the same semantic answer across permutations. However, we acknowledge the value of additional evidence and will include an ablation study comparing PA-GRPO to a pure consistency baseline (without the advantage mechanism) in the revised manuscript to demonstrate the contribution of the semantic component.
  Revision: partial
- Referee: Experimental results are reported without details on setup: number of permutations per instance, exact baseline implementations, reward model specification, statistical tests, or controls for post-hoc selection. This absence prevents verification of the claimed outperformance and bias reduction across the seven benchmarks.
  Authors: We agree that the experimental setup details were insufficiently documented in the manuscript. In the revised version, we will add a dedicated section detailing: the number of permutations generated per instance, the exact implementations and hyperparameters of all baselines (including references to their original papers and our re-implementations), the reward model specification (using the same preference model as in the original GRPO setup), the statistical tests performed (we will report results with statistical significance tests such as paired t-tests), and controls for post-hoc selection (no post-hoc selection was performed; all reported results are from the final models evaluated on the benchmarks). These details are present in our released code repository, but we will explicitly document them in the paper to ensure reproducibility.
  Revision: yes
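The fixed-position argument in the first response can be checked on a toy case. The sketch below assumes a 0/1 correctness reward, a group made of the cyclic rotations of four options, and mean-only baselines; the policies and helper names are hypothetical and chosen only to make the arithmetic visible.

```python
from typing import Callable, List


def rotations(options: List[str]) -> List[List[str]]:
    """All cyclic rotations, so each option occupies each position once."""
    return [options[i:] + options[:i] for i in range(len(options))]


def group_rewards(
    policy: Callable[[List[str]], int], options: List[str], answer: str
) -> List[float]:
    """Assumed 0/1 reward: 1 when the chosen position holds the correct content."""
    return [1.0 if perm[policy(perm)] == answer else 0.0 for perm in rotations(options)]


def advantages(rewards: List[float]) -> List[float]:
    """Reward minus the mean reward over the permutation group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def always_first(perm: List[str]) -> int:
    """Position-biased policy: ignores content and picks position 0."""
    return 0


def track_lyon(perm: List[str]) -> int:
    """Content-tracking policy: finds the correct option wherever it sits."""
    return perm.index("Lyon")


options = ["Paris", "Lyon", "Nice", "Lille"]
fixed = group_rewards(always_first, options, "Lyon")
tracking = group_rewards(track_lyon, options, "Lyon")
print(sum(fixed) / len(fixed), advantages(fixed))
print(sum(tracking) / len(tracking), advantages(tracking))
```

Under these assumptions the position-biased policy averages a reward of 0.25 and gets a negative advantage on three of four permutations, while the content-tracking policy averages 1.0. With a mean-only baseline and no consistency term, the content-tracking group's advantages are all zero.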
Circularity Check
No significant circularity; derivation uses defined internal baselines
Full rationale
The paper introduces PA-GRPO by constructing permutation groups per instance and defining two mechanisms: cross-permutation advantage (advantages relative to mean reward over permutations of the same instance) and consistency-aware reward. These are explicit constructions within the optimization objective rather than quantities fitted to data and then relabeled as predictions. No equations reduce the claimed semantic-reasoning benefit to the input mean by construction, no self-citation chains justify uniqueness or ansatzes, and no renaming of known results occurs. The central claim rests on the training procedure plus external benchmark results, which are independent of the internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Enforcing answer consistency across permutations of the same question improves semantic reasoning without introducing new biases or degrading performance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
  Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.