Recognition: 2 theorem links
Making Bias Non-Predictive: Training Robust LLM Reasoning via Reinforcement Learning
Pith reviewed 2026-05-16 08:47 UTC · model grok-4.3
The pith
Reinforcement learning trains LLMs to treat cognitive bias cues as non-predictive of reward, producing reasoning that generalizes to unseen biases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By placing bias signals in balanced conflict—so that they support right answers exactly as often as wrong ones—and applying a reward that subtracts credit for bias-following behavior, the training process renders those signals non-predictive of reward. Models therefore learn to ignore them. Experiments show this produces both higher accuracy and robustness on adversarial bias tests, and that training exclusively on bandwagon bias confers resistance to authority and distraction biases. The same procedure scales across model sizes and families and outperforms distribution-shift baselines that require explicit environment labels.
What carries the argument
Epistemic Independence Training (EIT): a reinforcement-learning objective that enforces balanced conflict between bias cues and ground-truth labels while subtracting reward for any answer that follows the bias cue.
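The balanced-conflict construction can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' code; the function and field names are assumptions.

```python
import random

def make_balanced_conflict_example(question, correct, incorrect, bias_template):
    """Attach a bias cue that points at the correct answer exactly half
    the time, so the cue carries zero information about the reward."""
    cue_supports_correct = random.random() < 0.5
    target = correct if cue_supports_correct else incorrect
    cue = bias_template.format(answer=target)
    return {"prompt": f"{cue}\n{question}",
            "correct": correct,
            "bias_target": target}

# Example with a bandwagon-style cue (the wording is hypothetical):
example = make_balanced_conflict_example(
    question="What is 2 + 2?",
    correct="4",
    incorrect="5",
    bias_template="Most people answered: {answer}.",
)
```

Because the cue agrees with the truth with probability exactly 0.5, an agent that keys on the cue gains no expected reward over chance.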
If this is right
- Models trained only on bandwagon bias resist authority and distraction biases without further training.
- Accuracy rises on bias-adversarial versions of MedQA and HellaSwag while normal performance is preserved.
- The method works across Qwen3-4B, Qwen3-8B, and Llama-3.2-3B without environment labels.
- It outperforms GroupDRO and IRM on the same robustness metrics.
Where Pith is reading between the lines
- The same balanced-conflict idea could be applied to other spurious correlations such as format preferences or demographic stereotypes.
- If the independence is real, models might require fewer bias-specific datasets for reliable deployment in high-stakes evaluation tasks.
- A natural next test is whether EIT reduces hallucination rates when factual claims are framed with authoritative language.
Load-bearing premise
That forcing bias cues to be equally likely to support correct and incorrect answers, together with a penalty on bias-following, will produce genuine independence rather than a narrow pattern of avoidance that only works inside the tested distributions.
What would settle it
A held-out test set that introduces a new bias type whose statistical relationship to correctness differs from the balanced 50-50 training distribution; if accuracy drops sharply on that set, the claim of transferable epistemic independence fails.
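Such a stress test is easy to simulate in miniature. Below is a toy sketch (the policy interface and item format are hypothetical) in which the cue supports the correct answer with a probability other than the 0.5 used in training:

```python
import random

def shifted_bias_accuracy(policy, items, p_cue_correct, seed=0):
    """Accuracy on a held-out set where the bias cue supports the
    correct answer with probability p_cue_correct (vs. 0.5 in training)."""
    rng = random.Random(seed)
    hits = 0
    for question, correct, incorrect in items:
        cue = correct if rng.random() < p_cue_correct else incorrect
        hits += policy(question, cue) == correct
    return hits / len(items)

# A cue-following policy tracks p_cue_correct as it shifts; a genuinely
# cue-independent policy should be flat across the shift.
follow_cue = lambda question, cue: cue
```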
read the original abstract
Large language models (LLMs) increasingly serve as reasoners and automated evaluators, yet they remain susceptible to cognitive biases -- often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues attractive. We propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. EIT further generalizes across benchmarks (MedQA, HellaSwag), model families (Llama-3.2-3B), and scales (Qwen3-8B), and outperforms distribution-shift methods (GroupDRO, IRM) without requiring environment labels. Code and data are available at https://anonymous.4open.science/r/bias-mitigation-with-rl-BC47
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Epistemic Independence Training (EIT), a reinforcement learning framework that operationalizes the principle of making bias cues non-predictive of reward. It employs a balanced conflict strategy (bias signals equally likely to support correct/incorrect answers) paired with a reward that penalizes bias-following without rewarding agreement. Experiments on Qwen3-4B report gains in accuracy and robustness to adversarial biases while preserving performance when bias aligns with truth; models trained only on bandwagon bias generalize to authority and distraction biases, and the method extends to other model families (Llama-3.2-3B), scales (Qwen3-8B), and benchmarks (MedQA, HellaSwag), outperforming GroupDRO and IRM without requiring environment labels.
Significance. If the reported generalization and robustness gains are substantiated, EIT would constitute a substantive contribution by targeting the underlying optimization incentives that make bias cues attractive, rather than relying on surface-level prompting or supervised fine-tuning. The cross-bias transferability claim, if verified through appropriate controls, could inform more reliable LLM reasoning systems in high-stakes domains.
major comments (3)
- [Abstract] The claim that training solely on bandwagon bias induces transferable epistemic independence to authority and distraction biases is load-bearing for the central contribution, yet no ablation is described that isolates lexical or syntactic overlap between training and test prompts; without such controls, the generalization could arise from surface pattern avoidance rather than learned independence from non-truth-tracking features.
- [Experiments] (implied by abstract results) The abstract reports accuracy and robustness gains but supplies no details on training dynamics, number of runs, statistical significance testing, or explicit controls for confounding factors such as prompt length or token distribution shifts, preventing assessment of whether the data support the robustness claims.
- [Method] The balanced conflict strategy and reward formulation are described at a high level but lack a precise mathematical definition (e.g., how the conflict ratio is sampled and how the penalty term is computed), which is necessary to evaluate whether bias cues are rendered non-predictive by construction.
minor comments (2)
- [Abstract] The revised sentence on generalization across benchmarks and model families would benefit from explicit numerical deltas (e.g., accuracy improvements on MedQA) to allow readers to gauge effect sizes.
- [Abstract] The anonymous code link is noted; upon acceptance the repository should be made public to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The feedback highlights important areas for clarification and strengthening, particularly around generalization controls, experimental rigor, and formal definitions. We have revised the paper accordingly and provide point-by-point responses below.
read point-by-point responses
Referee: [Abstract] The claim that training solely on bandwagon bias induces transferable epistemic independence to authority and distraction biases is load-bearing for the central contribution, yet no ablation is described that isolates lexical or syntactic overlap between training and test prompts; without such controls, the generalization could arise from surface pattern avoidance rather than learned independence from non-truth-tracking features.
Authors: We agree that ruling out surface-level pattern matching is essential for substantiating the epistemic independence claim. In the revised manuscript, we have added a new ablation study (Section 4.4) that controls for lexical and syntactic overlap. Specifically, we generated paraphrased test prompts for authority and distraction biases that share no n-gram overlap with bandwagon training templates while preserving semantic structure. The generalization gains persist under these controls (accuracy drop of <2% vs. original test set), indicating that the effect is not reducible to surface pattern avoidance. We also report cosine similarity statistics between training and test embeddings to quantify the separation. revision: yes
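The "no n-gram overlap" control described in this response could be checked with something like the following (a sketch; the paper's exact metric and choice of n are not specified beyond "no n-gram overlap"):

```python
def ngram_overlap(a, b, n=3):
    """Jaccard overlap of word n-grams between two prompts; 0.0 means
    the paraphrased test prompt shares no n-gram with the training template."""
    def grams(s):
        words = s.lower().split()
        return {tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0
```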
Referee: [Experiments] (implied by abstract results) The abstract reports accuracy and robustness gains but supplies no details on training dynamics, number of runs, statistical significance testing, or explicit controls for confounding factors such as prompt length or token distribution shifts, preventing assessment of whether the data support the robustness claims.
Authors: We acknowledge the need for greater transparency in experimental reporting. The revised manuscript now includes: (i) training curves showing reward and accuracy trajectories over 5000 steps for all conditions; (ii) results averaged over 5 independent random seeds with standard deviations; (iii) paired t-tests with p-values for all reported improvements (all p < 0.01); and (iv) explicit controls matching prompt length distributions and token frequency histograms between biased and unbiased conditions. These additions appear in the new Appendix C and updated Section 4. revision: yes
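For reference, the paired t-statistic over seeds mentioned in this response is a one-liner with the standard library; the example values in the test of significance below are invented for illustration:

```python
import math
import statistics

def paired_t_stat(xs, ys):
    """t-statistic for paired samples, e.g. per-seed accuracies of the
    treated model (xs) vs. a baseline (ys); df = len(xs) - 1."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# With 5 seeds (df = 4), |t| > 4.604 corresponds to p < 0.01 two-tailed.
```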
Referee: [Method] The balanced conflict strategy and reward formulation are described at a high level but lack a precise mathematical definition (e.g., how the conflict ratio is sampled and how the penalty term is computed), which is necessary to evaluate whether bias cues are rendered non-predictive by construction.
Authors: We agree that a formal specification is required. In the revised Section 3, we now provide the exact definitions: Let B be the bias cue and A the answer. The balanced conflict samples P(B supports correct) = P(B supports incorrect) = 0.5. The reward is r = r_correct - λ * I(follow_bias), where I(follow_bias) = 1 if the model selects the answer indicated by B and 0 otherwise, with λ = 1.0. This ensures bias cues have zero expected correlation with reward by construction. The full sampling procedure and pseudocode are included in the updated Method section and new Appendix B. revision: yes
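The reward in this response translates directly into code. A minimal sketch (exact-match answer comparison is an assumption):

```python
def eit_reward(model_answer, correct_answer, bias_target, lam=1.0):
    """r = r_correct - lambda * I(follow_bias), per the rebuttal's definition."""
    r_correct = 1.0 if model_answer == correct_answer else 0.0
    follows_bias = 1.0 if model_answer == bias_target else 0.0
    return r_correct - lam * follows_bias
```

Under the 50-50 balanced conflict with lam = 1.0, always following the cue earns an expected reward of -0.5, while always answering correctly earns +0.5 (the penalty still fires on the half of examples where the cue happens to agree with truth), so the optimum is to ignore the cue entirely.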
Circularity Check
No significant circularity in EIT derivation
full rationale
The paper grounds its EIT framework in an external principle that bias cues must be made non-predictive of reward, then operationalizes this via a balanced conflict strategy and a reward that penalizes bias-following. Generalization from bandwagon training to authority and distraction biases is presented as an empirical result across models, benchmarks, and scales, with comparisons to GroupDRO and IRM. No equations, fitted parameters, or self-citations reduce the claimed transferable epistemic independence to a definitional equivalence or input by construction. The derivation remains self-contained against the reported experiments.
Axiom & Free-Parameter Ledger
free parameters (1)
- conflict balance ratio
axioms (1)
- Domain assumption: making bias cues non-predictive of reward produces transferable epistemic independence.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (echoes?)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "to learn independence, bias cues must be made non-predictive of reward... balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement"
- IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean : SatisfiesLawsOfLogic (echoes?)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "the optimal policy must ignore b entirely"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.