arxiv: 2603.21016 · v2 · submitted 2026-03-22 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He

Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: selection bias · large language models · permutation consistency · policy optimization · debiasing · consistency reward · multiple choice tasks

The pith

A training method using permutations reduces selection bias in LLMs by promoting consistent semantic reasoning across option orders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models frequently show selection bias in multiple-choice tasks, favoring certain positions or labels over actual content. The paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) to address this during training. It generates multiple permutations of each question and optimizes the model with two signals: a cross-permutation advantage, computed relative to the mean reward over all permutations of the same question, and a consistency-aware reward that encourages the same decision across permutations. According to the abstract, this lowers bias while keeping performance high on seven benchmarks, offering a training-time alternative to costly inference-time fixes.

Core claim

PA-GRPO mitigates selection bias by enforcing permutation-consistent semantic reasoning. It constructs a permutation group for each instance and optimizes with two mechanisms: a cross-permutation advantage, computed relative to the mean reward over all permutations of the instance, and a consistency-aware reward that encourages consistent decisions across permutations.
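
One way to write the two mechanisms down, as a sketch only. The notation and the exact form of the consistency term are assumptions made here for illustration; the abstract does not give formulas.

```latex
% Assumed notation (not taken from the paper):
%   K       : number of permutations in the group of one instance
%   y_k     : decision on permutation k, mapped back to a canonical option id
%   r_k     : task reward on permutation k (e.g. 1 if y_k is correct, else 0)
%   \lambda : weight of the consistency term
\begin{align}
  c_k &= \mathbb{1}\!\left[\, y_k = \operatorname{mode}(y_1, \dots, y_K) \,\right] \\
  R_k &= r_k + \lambda\, c_k \\
  A_k &= R_k - \frac{1}{K} \sum_{j=1}^{K} R_j
\end{align}
```

Under this form the advantages of a group sum to zero. Standard GRPO additionally divides by the group standard deviation; the abstract does not say whether PA-GRPO keeps that normalization, so it is left out here.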

What carries the argument

The cross-permutation advantage and consistency-aware reward mechanisms within PA-GRPO, which compute advantages relative to the mean reward across permutations and reward consistent outputs for the same instance.
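
A minimal Python sketch of how these two quantities could be computed for one permutation group. It assumes a binary correctness reward, a majority-agreement consistency bonus, and a weight `lam`; all three, and both function names, are hypothetical choices for illustration rather than the paper's implementation.

```python
from collections import Counter
from statistics import mean


def consistency_rewards(decisions):
    """1.0 for decisions that match the group's most common answer, else 0.0.

    `decisions` holds canonical option ids, i.e. answers already mapped back
    from the permuted labels to the original options (assumed preprocessing).
    """
    majority, _ = Counter(decisions).most_common(1)[0]
    return [1.0 if d == majority else 0.0 for d in decisions]


def cross_permutation_advantages(decisions, gold, lam=0.5):
    """Reward of each permutation minus the mean reward of its group."""
    task = [1.0 if d == gold else 0.0 for d in decisions]  # assumed reward
    cons = consistency_rewards(decisions)
    total = [t + lam * c for t, c in zip(task, cons)]
    baseline = mean(total)
    return [r - baseline for r in total]


# Canonical answers chosen on four permutations of one question.
advantages = cross_permutation_advantages(["B", "B", "A", "B"], gold="B")
print(advantages)
```

The single permutation on which the model deviates gets a negative advantage, and the three agreeing, correct permutations share a small positive one.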

Load-bearing premise

That making the model consistent across permutations of the same question actually captures better semantic reasoning instead of just averaging biases or creating new inconsistencies.

What would settle it

An experiment in which a PA-GRPO-trained model is tested on permuted questions and shows persistent bias or reduced accuracy compared to standard training on non-permuted inputs.
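
Such a test could be set up roughly as follows. This is a sketch under stated assumptions: `answer_fn` is a hypothetical stand-in for the model under test, the metrics are illustrative choices, and the two toy policies exist only to show what biased and robust behaviour look like in the report.

```python
from collections import Counter
from itertools import permutations


def permutation_report(answer_fn, question, options, gold):
    """Query `answer_fn` on every ordering of `options` and summarise.

    `answer_fn(question, options)` must return the index of the chosen option.
    """
    by_content, by_position = Counter(), Counter()
    correct = 0
    orders = list(permutations(options))
    for order in orders:
        idx = answer_fn(question, list(order))
        by_position[idx] += 1
        by_content[order[idx]] += 1
        correct += order[idx] == gold
    n = len(orders)
    return {
        "accuracy": correct / n,
        # share of permutations agreeing with the most common content choice
        "consistency": max(by_content.values()) / n,
        "position_share": {p: c / n for p, c in sorted(by_position.items())},
    }


def always_first(question, options):
    return 0  # toy position-biased policy


def content_aware(question, options):
    return options.index("Paris")  # toy policy that follows content


opts = ["Paris", "Rome", "Oslo"]
biased = permutation_report(always_first, "Capital of France?", opts, "Paris")
robust = permutation_report(content_aware, "Capital of France?", opts, "Paris")
print(biased)
print(robust)
```

Enumerating all orderings is only practical for a handful of options; with more options a random sample of permutations would be needed.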

Original abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code is available on github (https://github.com/ECNU-Text-Computing/PA-GRPO).
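
The group-construction step described in the abstract could look roughly like the sketch below. It shuffles option positions under fixed letter labels and keeps the bookkeeping needed to map an answer back to the original option. The function names, the default `k`, and the decision to leave label symbols unchanged are assumptions for illustration; the abstract does not specify how permutations are sampled.

```python
import random
import string


def build_permutation_group(options, gold_index, k=4, seed=0):
    """Sample up to `k` distinct orderings of `options` for one instance."""
    rng = random.Random(seed)
    n = len(options)
    labels = string.ascii_uppercase[:n]
    seen, group = set(), []
    attempts = 0
    # The attempt cap lets the loop end when k exceeds the number of orderings.
    while len(group) < k and attempts < 100 * k:
        attempts += 1
        order = tuple(rng.sample(range(n), n))
        if order in seen:
            continue
        seen.add(order)
        group.append({
            "order": order,  # order[pos] = original index shown at position pos
            "options": [f"{labels[pos]}. {options[src]}"
                        for pos, src in enumerate(order)],
            "gold_label": labels[order.index(gold_index)],
        })
    return group


def to_canonical(member, chosen_label):
    """Map a chosen letter back to the index of the original option."""
    return member["order"][string.ascii_uppercase.index(chosen_label)]


group = build_permutation_group(["Paris", "Rome", "Oslo", "Bern"], gold_index=0)
for member in group:
    print(member["options"], "gold:", member["gold_label"])
```

Mapping answers back with `to_canonical` is what allows decisions from different permutations to be compared when checking consistency.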

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PA-GRPO adds permutation groups and a consistency term to GRPO training to reduce position and label bias in LLM multiple-choice tasks, but the abstract gives no evidence that the model still uses semantic content rather than locking onto fixed outputs.

The main point is a training procedure that groups multiple permutations of each question, computes advantages relative to the mean reward across the group, and adds a consistency reward to push the model toward the same answer every time. This is the concrete new piece: applying group-relative policy optimization inside permutation sets rather than across unrelated instances.

The paper does a clear job naming the practical problem of selection bias in benchmarks and showing why inference-time fixes can be costly. The two mechanisms are described plainly enough that someone could implement the core loop from the abstract and the linked code. Reporting results across seven benchmarks with bias reduction and maintained performance is useful as a starting point. The code release is a real plus for checking details.

The soft spot is the missing link between consistency and actual semantic reasoning. Nothing in the description rules out a policy that simply outputs the same choice for every permutation of a given question, which would score well on both the cross-permutation advantage and the consistency term even if it ignores the question text. The stress-test concern lands on the current write-up: if the reward model itself contains positional cues, the mean baseline will not remove them, and the method can reduce measured bias by construction. No ablations on content-altered permutations or attention checks are mentioned, so it is not possible to tell whether reasoning improved or the model just became more rigid.

This paper is for researchers working on RLHF variants and reliable LLM evaluation protocols. Anyone running or designing multiple-choice benchmarks would find the setup relevant and easy to test. It deserves peer review because the idea is straightforward to replicate and the practical stakes are clear, even though the current evidence is preliminary and needs controls on whether semantics remain in use.
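
The rigidity concern can be made concrete with a toy group. The sketch below assumes a binary correctness reward plus a majority-agreement bonus weighted by `lam`, and one deterministic decision per permutation; these are illustrative assumptions, not the paper's reward definition. It compares a fixed-position policy with two content-rigid policies, one wrong and one right.

```python
from collections import Counter
from itertools import permutations
from statistics import mean


def group_outcome(policy, options, gold, lam=0.5):
    """Score one deterministic policy on every ordering of `options`.

    `policy(order)` returns the position it picks in that ordering.
    """
    picks = [order[policy(order)] for order in permutations(options)]
    majority = Counter(picks).most_common(1)[0][0]
    rewards = [
        (1.0 if p == gold else 0.0) + lam * (1.0 if p == majority else 0.0)
        for p in picks
    ]
    baseline = mean(rewards)
    return {"mean_reward": baseline,
            "advantages": [r - baseline for r in rewards]}


options, gold = ("Paris", "Rome", "Oslo"), "Paris"
fixed_position = group_outcome(lambda order: 0, options, gold)
fixed_wrong = group_outcome(lambda order: order.index("Rome"), options, gold)
always_right = group_outcome(lambda order: order.index("Paris"), options, gold)

for name, result in [("fixed position", fixed_position),
                     ("fixed wrong content", fixed_wrong),
                     ("always right", always_right)]:
    print(name, result)
```

In this toy setting both content-rigid policies receive all-zero advantages, because every permutation in their group earns the same reward; only the fixed-position policy gets a nonzero signal. Any push away from a rigid wrong answer would therefore have to come from variation inside the group, for example from sampled rather than deterministic decisions.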

Referee Report

2 major / 1 minor

Summary. The paper proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) to mitigate selection bias in LLMs on multiple-choice and pairwise tasks. It constructs a permutation group per instance, computes advantages relative to the mean reward across those permutations, and adds a consistency-aware reward term to encourage identical decisions across permutations, claiming superior performance and bias reduction on seven benchmarks compared to strong baselines.

Significance. If the empirical results hold under rigorous verification, PA-GRPO would represent a practical training-time debiasing technique that avoids the inference overhead of existing methods while preserving reasoning performance, addressing a persistent reliability issue in LLM-based evaluation.

major comments (2)

The central claim that PA-GRPO enforces 'permutation-consistent semantic reasoning' is not supported by evidence that the policy conditions on question content rather than simply producing fixed outputs across permutations. No ablations (e.g., content-altered permutations or attention analysis) or comparison to a pure consistency baseline are described, leaving open the possibility that bias reduction occurs by construction without semantic improvement.
Experimental results are reported without details on setup: number of permutations per instance, exact baseline implementations, reward model specification, statistical tests, or controls for post-hoc selection. This absence prevents verification of the claimed outperformance and bias reduction across the seven benchmarks.

minor comments (1)

The GitHub code link is a positive step for reproducibility; ensure the released code matches the exact experimental configuration described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each major comment below, providing clarifications and indicating where revisions will be made to improve the paper.

Referee: The central claim that PA-GRPO enforces 'permutation-consistent semantic reasoning' is not supported by evidence that the policy conditions on question content rather than simply producing fixed outputs across permutations. No ablations (e.g., content-altered permutations or attention analysis) or comparison to a pure consistency baseline are described, leaving open the possibility that bias reduction occurs by construction without semantic improvement.

Authors: We respectfully disagree with the interpretation that bias reduction occurs merely by construction without semantic improvement. The rewards in PA-GRPO are derived from the correctness of the selected answer, which is inherently semantic. The cross-permutation advantage compares the reward of a particular permutation to the average across all permutations of the same instance, incentivizing the model to select the semantically correct option consistently rather than defaulting to a fixed position or output. A model that produces fixed position outputs (e.g., always choosing the first option) would receive low average rewards when the correct answer is not in that position, leading to negative advantages for such choices. The consistency-aware reward further reinforces selecting the same semantic answer across permutations. However, we acknowledge the value of additional evidence and will include an ablation study comparing PA-GRPO to a pure consistency baseline (without the advantage mechanism) in the revised manuscript to demonstrate the contribution of the semantic component.
revision: partial

Referee: Experimental results are reported without details on setup: number of permutations per instance, exact baseline implementations, reward model specification, statistical tests, or controls for post-hoc selection. This absence prevents verification of the claimed outperformance and bias reduction across the seven benchmarks.

Authors: We agree that the experimental setup details were insufficiently documented in the manuscript. In the revised version, we will add a dedicated section detailing: the number of permutations generated per instance, the exact implementations and hyperparameters of all baselines (including references to their original papers and our re-implementations), the reward model specification (using the same preference model as in the original GRPO setup), the statistical tests performed (we will report results with statistical significance tests such as paired t-tests), and controls for post-hoc selection (no post-hoc selection was performed; all reported results are from the final models evaluated on the benchmarks). These details are present in our released code repository, but we will explicitly document them in the paper to ensure reproducibility.
revision: yes
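
For reference, the paired t-test mentioned in the response can be computed with the standard library alone. The sketch below returns the t statistic and degrees of freedom for paired scores; the score lists are placeholder numbers chosen for easy arithmetic and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Placeholder paired scores (e.g. one pair per benchmark or per seed).
method_scores = [3.0, 5.0, 4.0, 6.0]
baseline_scores = [2.0, 3.0, 3.0, 4.0]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(t_stat, dof)
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call if SciPy is available.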

Circularity Check

0 steps flagged

No significant circularity; derivation uses defined internal baselines

The paper introduces PA-GRPO by constructing permutation groups per instance and defining two mechanisms: cross-permutation advantage (advantages relative to mean reward over permutations of the same instance) and consistency-aware reward. These are explicit constructions within the optimization objective rather than quantities fitted to data and then relabeled as predictions. No equations reduce the claimed semantic-reasoning benefit to the input mean by construction, no self-citation chains justify uniqueness or ansatzes, and no renaming of known results occurs. The central claim rests on the training procedure plus external benchmark results, which are independent of the internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that semantic reasoning can be isolated and improved by enforcing answer consistency across surface-form permutations. No free parameters or invented entities are explicitly introduced in the abstract description.

axioms (1)

domain assumption: Enforcing answer consistency across permutations of the same question improves semantic reasoning without introducing new biases or degrading performance.
This premise underpins both the cross-permutation advantage and consistency-aware reward mechanisms.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
cs.CL · 2026-04 · unverdicted · novelty 6.0

Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.