pith. machine review for the scientific record.

arxiv: 2605.11134 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords preference optimization · spurious correlations · distribution shift · direct preference optimization · tie training · causal learning · sycophancy · length bias

The pith

Preference optimization induces spurious feature reliance through mean bias and correlation leakage, creating a vulnerability to distribution shift that more training data cannot fix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard objectives such as direct preference optimization cause models to depend on spurious features at the population level. This dependence arises from two mechanisms: mean spurious bias and causal-spurious correlation leakage. The resulting reliance produces an irreducible vulnerability because additional samples drawn from the same training distribution do not reduce the model's use of spurious features. The authors introduce tie training, which augments the data with equal-utility preference pairs to impose data-driven regularization that selectively suppresses spurious learning while leaving causal learning intact. The analysis is derived for log-linear policies and then confirmed empirically on neural networks and large language models.

Core claim

In log-linear policies, standard preference-learning objectives induce reliance on spurious features through mean spurious bias and causal-spurious correlation leakage. This reliance produces an irreducible vulnerability to distribution shift because more data drawn from the identical training distribution fails to reduce dependence on the spurious features. Tie training, which augments the dataset with ties (equal-utility preference pairs), supplies data-driven regularization that reduces spurious learning without degrading causal learning.
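The leakage channel can be illustrated with a toy ridge-regression analogue (an illustrative sketch, not the paper's DPO objective): the target depends only on a causal feature, but shrinkage regularization, standing in for the KL/β term, shifts weight onto a correlated spurious feature, and the effect does not shrink with more in-distribution data.

```python
import numpy as np

def fit_ridge(n, rho=0.8, lam=1.0, seed=0):
    """Toy analogue of causal-spurious correlation leakage: the target
    depends only on the causal feature, but ridge shrinkage (a stand-in
    for DPO's KL/beta regularization) leaks weight onto a spurious
    feature that is merely correlated with it."""
    rng = np.random.default_rng(seed)
    x_c = rng.normal(size=n)                    # causal feature
    x_s = rho * x_c + 0.5 * rng.normal(size=n)  # spurious, correlated
    y = x_c + 0.1 * rng.normal(size=n)          # utility uses x_c only
    X = np.column_stack([x_c, x_s])
    # penalty scales with n, so the leaked weight is a population effect
    return np.linalg.solve(X.T @ X + lam * n * np.eye(2), X.T @ y)

theta_20k = fit_ridge(20_000)
theta_200k = fit_ridge(200_000)  # 10x more in-distribution data
# theta[1], the spurious weight, stays well away from zero at both sizes
```

Setting `rho=0` collapses the spurious weight toward zero, which is the sense in which this leakage is driven by correlation rather than by the spurious feature itself.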

What carries the argument

Tie training, a data augmentation method that inserts equal-utility preference pairs to create data-driven regularization against spurious correlations.
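A minimal numerical sketch of the augmentation idea, in a ridge toy model rather than the paper's actual objective (the feature layout and tie construction below are illustrative assumptions): a tie is an equal-utility pair, so its target difference is zero while its spurious feature still varies, injecting spurious variance with no signal and selectively shrinking the spurious weight.

```python
import numpy as np

def fit_with_ties(n_strict, n_tie, lam=1.0, seed=1):
    """Sketch of tie training in a ridge toy model: strict pairs carry
    the causal signal; tie rows have a zero target (equal utility) but a
    varying spurious feature, acting as data-driven regularization."""
    rng = np.random.default_rng(seed)
    x_c = rng.normal(size=n_strict)
    x_s = 0.8 * x_c + 0.5 * rng.normal(size=n_strict)  # correlated spurious
    y = x_c + 0.1 * rng.normal(size=n_strict)          # causal utility
    X = np.column_stack([x_c, x_s])
    if n_tie:
        # ties: equal utility => zero causal difference and zero target,
        # while the spurious feature still differs within the pair
        ties = np.column_stack([np.zeros(n_tie),
                                2.0 * rng.normal(size=n_tie)])
        X = np.vstack([X, ties])
        y = np.concatenate([y, np.zeros(n_tie)])
    n = len(y)
    return np.linalg.solve(X.T @ X + lam * n * np.eye(2), X.T @ y)

strict_only = fit_with_ties(20_000, 0)
with_ties = fit_with_ties(20_000, 20_000)  # strict fraction alpha = 0.5
# ties cut the spurious weight sharply; the causal weight stays comparable
```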

Load-bearing premise

The mechanisms and mitigation identified for log-linear policies extend to neural networks and large language models without significant degradation of causal learning.

What would settle it

An experiment that adds increasing volumes of in-distribution preference data and measures a corresponding drop in the model's reliance on known spurious features would falsify the claim of irreducible vulnerability.

Figures

Figures reproduced from arXiv: 2605.11134 by Alex Semendinger, Christian Moya, Elliott Thornley, Guang Lin.

Figure 2
Figure 2: Neural network validation. Left: Spurious gap (accuracy on aligned minus misaligned spurious conditions) decreases monotonically with tie mixing fraction α. Right: Strict training exhibits a persistent adversarial accuracy plateau despite increasing data; tie training breaks this plateau, improving robustness from ≈ 0.18 to ≈ 0.7.
Figure 1
Figure 1: Quantitative validation of linear theory. (a) Norm of learned spurious parameters against theoretical prediction (Theorem 4.1) (top). Second-order corrections restore agreement when the local regime is violated (bottom). (b) Deployment suboptimality decomposition: estimation error decays as O(1/n) while shift error persists, demonstrating irreducibility. As predicted by Theorem 5.3, empirical deployment…
Figure 3
Figure 3: Population scaling of spurious parameters in DPO. We compare empirical spurious parameter norms with the population prediction from Theorem 4.1. Left: Including curvature yields accurate predictions across β. Right: Ignoring curvature systematically underestimates spurious reliance, leading to large relative error even with infinite data. This confirms that curvature is necessary for correct population sca…
Figure 4
Figure 4: Empirical deployment suboptimality and its decomposition under distribution shift. The figure shows four quantities as a function of the number of training samples n: (i) empirical deployment suboptimality SubOpt_Q(θ̂); (ii) shift error estimate; (iii) estimation error estimate; (iv) estimated upper bound. As n grows, estimation error decays while the shift error persists, demonstrating that deployment err…
Figure 5
Figure 5: Theoretical prediction for spurious reliance under tie training. The curve shows the reduction factor r_th(α) = αλ₀ / (αλ₀ + (1−α)σ²) as a function of the strict-preference fraction α, for different spurious variance ratios σ²/λ₀. Increasing the proportion of tie examples (1 − α) monotonically suppresses reliance on spurious features, with stronger suppression when ties inject higher spurious variance. This bou…
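The caption's reduction factor r_th(α) = αλ₀/(αλ₀ + (1−α)σ²) can be sanity-checked numerically; the α grid and variance ratio below are illustrative, not values from the paper.

```python
def reduction_factor(alpha, var_ratio):
    """r_th(alpha) = alpha*lambda0 / (alpha*lambda0 + (1-alpha)*sigma^2),
    expressed via the spurious variance ratio sigma^2/lambda0."""
    return alpha / (alpha + (1.0 - alpha) * var_ratio)

alphas = [1.0, 0.75, 0.5, 0.25]  # strict-preference fraction
curve = [reduction_factor(a, var_ratio=4.0) for a in alphas]
# no ties (alpha = 1) gives r_th = 1, i.e. no suppression; adding ties
# lowers r_th, and a larger variance ratio lowers it faster
```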
Figure 6
Figure 6: Spurious gap (accuracy difference between aligned and misaligned spurious conditions) as a function of the fraction of strict preferences α. Note that as α decreases, the number of ties increases. Thus, tie training reduces spurious reliance despite hidden representations.
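The spurious-gap metric described in the caption can be sketched directly; the array names and toy values below are hypothetical.

```python
import numpy as np

def spurious_gap(correct, aligned):
    """Accuracy on pairs where the spurious feature is aligned with the
    preference label, minus accuracy where it is misaligned."""
    correct = np.asarray(correct, dtype=bool)
    aligned = np.asarray(aligned, dtype=bool)
    return correct[aligned].mean() - correct[~aligned].mean()

# a model leaning on the spurious cue: perfect when aligned, near-chance
# when misaligned (toy labels)
gap = spurious_gap(correct=[1, 1, 1, 1, 0, 1, 0, 0],
                   aligned=[1, 1, 1, 1, 0, 0, 0, 0])
```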
Figure 7
Figure 7: Tie training reduces spurious reliance. Counterfactual margin E[|r(φ) − r(φ_cf)|] as a function of the fraction of strict preferences α. As α decreases (more tie comparisons), the counterfactual margin drops sharply, indicating reduced sensitivity of the learned model to spurious features.
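The counterfactual margin E[|r(φ) − r(φ_cf)|] can be sketched as a mean absolute reward change when only the spurious component of the features is flipped; the linear reward and feature layout below are hypothetical.

```python
import numpy as np

def counterfactual_margin(reward, feats, feats_cf):
    """E[|r(phi) - r(phi_cf)|]: mean absolute reward shift between the
    features and a counterfactual copy with the spurious part altered;
    near zero means the reward ignores the spurious feature."""
    return np.mean(np.abs(reward(feats) - reward(feats_cf)))

# hypothetical linear reward: weight 1.0 on the causal column, 0.3
# leaked onto the spurious column
reward = lambda phi: phi @ np.array([1.0, 0.3])
feats = np.array([[0.5, 1.0], [-0.2, -1.0]])
feats_cf = feats * np.array([1.0, -1.0])  # flip only the spurious column
margin = counterfactual_margin(reward, feats, feats_cf)
```

A reward with zero weight on the spurious column has zero margin, which is the behavior tie training pushes toward.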
Figure 8
Figure 8: Strict-only training plateaus under distribution shift; tie training improves robustness. Adversarial accuracy on Q_adv, where spurious correlations flip, as a function of the number of training samples. Strict-only training (α = 1.0) exhibits a persistent accuracy plateau despite increasing data. In contrast, tie training (α = 0.75) improves adversarial accuracy, breaking the plateau. Dataset: Synthetic Ho…
Figure 9
Figure 9: Under standard RLHF reward learning, the learned policy exhibits nonzero reliance on spurious features (θ_s ≠ 0), and this reliance does not vanish with additional data drawn from the training distribution P. Tie training explicitly counteracts this effect, driving spurious reliance toward zero.
Figure 10
Figure 10: As the number of training samples increases, estimation error, defined as the weighted norm ‖θ̂ − θ*‖_Σ, decreases at comparable rates for strict MLE training and tie training, showing that tie training reduces spurious reliance without sacrificing estimation accuracy.
Figure 11
Figure 11: Greedy decoding of a log-linear RLHF policy does not introduce additional error mechanisms, but exposes spurious reward learning under shift: performance, measured as SubOpt_Q(π) := V⋆(π⋆) − V⋆(π), degrades in both adversarial and suppression settings, and this error does not vanish with more data from P. Tie training reduces this shift-induced error.
read the original abstract

Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal--spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard preference optimization objectives (e.g., DPO) induce spurious feature reliance in log-linear policies through two population-level mechanisms—mean spurious bias and causal-spurious correlation leakage—creating an irreducible vulnerability to distribution shift that additional in-distribution data cannot resolve. It proposes tie training (augmenting with equal-utility preference pairs) as a data-driven regularizer that selectively mitigates spurious learning without harming causal learning, validates the theory on log-linear models, and provides empirical support that the mechanisms and mitigation benefits extend to neural networks and LLMs.

Significance. If the central claims hold, the work supplies a concrete mechanistic account of why preference learning produces sycophancy and length bias, together with a simple, data-augmentation-based fix that is provably effective under log-linear assumptions. The demonstration that more training data from the same distribution cannot eliminate the spurious dependence is a useful negative result for alignment research. The empirical extension to LLMs, if reproducible, would directly inform practical mitigation strategies for current models.

major comments (3)
  1. [Abstract / Theoretical Analysis] Abstract and theoretical sections: the characterization of spurious learning via mean spurious bias and causal-spurious correlation leakage, as well as the proof that tie training is a selective regularizer, are derived exclusively under log-linear policy assumptions; the manuscript provides no formal argument or population-level analysis showing that the same two channels dominate in over-parameterized neural networks, where different optimization dynamics can exploit correlations.
  2. [Empirical Validation] Empirical validation section: the claim that 'both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models' rests on experiments whose data exclusion rules, spurious-feature construction, and controls for causal-feature preservation are not fully specified, preventing assessment of whether the reported gains are robust or sensitive to implementation details.
  3. [Consequences on Deployment] Consequences section: the assertion of an 'irreducible vulnerability to distribution shift' is shown only for log-linear policies; without a corresponding analysis or counter-example for neural policies, the load-bearing claim that more data from the training distribution cannot reduce spurious dependence does not yet extend to the motivating LLM setting.
minor comments (2)
  1. [Tie Training Definition] Notation for tie training could be formalized with an explicit objective or augmentation rule to make the method reproducible from the text alone.
  2. [Figures] Several figures comparing spurious vs. causal accuracy under increasing data would benefit from error bars or multiple random seeds to support the 'irreducible' claim.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the scope of our theoretical results and committing to improvements in the empirical details and discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] Abstract and theoretical sections: the characterization of spurious learning via mean spurious bias and causal-spurious correlation leakage, as well as the proof that tie training is a selective regularizer, are derived exclusively under log-linear policy assumptions; the manuscript provides no formal argument or population-level analysis showing that the same two channels dominate in over-parameterized neural networks, where different optimization dynamics can exploit correlations.

    Authors: We agree that the formal proofs and population-level analysis are derived exclusively under log-linear policy assumptions, as stated throughout the manuscript. This choice enables exact characterization of the two mechanisms and the selective regularization property of tie training. For over-parameterized neural networks we provide only empirical evidence that the mechanisms and mitigation benefits persist. In revision we will expand the discussion section to explicitly note the absence of a formal extension and to articulate why the population-level mechanisms are expected to remain relevant despite differing optimization dynamics. revision: partial

  2. Referee: [Empirical Validation] Empirical validation section: the claim that 'both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models' rests on experiments whose data exclusion rules, spurious-feature construction, and controls for causal-feature preservation are not fully specified, preventing assessment of whether the reported gains are robust or sensitive to implementation details.

    Authors: We acknowledge that the current manuscript does not provide sufficient implementation detail for full reproducibility. In the revised version we will add an expanded experimental appendix that fully specifies the data exclusion rules, the precise construction of spurious features, the controls used to preserve causal features, and all hyper-parameter choices. These additions will allow readers to assess robustness directly. revision: yes

  3. Referee: [Consequences on Deployment] Consequences section: the assertion of an 'irreducible vulnerability to distribution shift' is shown only for log-linear policies; without a corresponding analysis or counter-example for neural policies, the load-bearing claim that more data from the training distribution cannot reduce spurious dependence does not yet extend to the motivating LLM setting.

    Authors: The formal proof of irreducible vulnerability is indeed limited to log-linear policies. For neural networks and LLMs we report only empirical observations that additional in-distribution data fails to eliminate spurious dependence. In revision we will revise the consequences section to clearly separate the proven log-linear result from the supporting empirical findings and to state that a formal extension to neural policies remains an open question. revision: partial

standing simulated objections (unresolved)
  • A formal population-level analysis demonstrating that the two identified channels dominate in over-parameterized neural networks

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained for log-linear case with empirical extension

full rationale

The paper's central derivation characterizes spurious learning mechanisms (mean spurious bias and causal-spurious correlation leakage) explicitly under log-linear policy assumptions via population-level analysis of preference objectives, without reducing to fitted parameters or self-definitions. Tie training is introduced as a new data-augmentation strategy and analyzed for its selective regularization effect. Extension to neural networks and LLMs is framed as empirical validation rather than a theoretical claim, with no load-bearing self-citations, ansatz smuggling, or renaming of known results. The derivation chain remains independent of its inputs and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full set of modeling assumptions, fitted quantities, and any invented constructs cannot be audited. The abstract invokes log-linear policies as the analytic setting and introduces tie training as a new procedure.

axioms (1)
  • domain assumption Analysis restricted to log-linear policies
    Explicitly stated as the focus for deriving the two spurious-learning channels.
invented entities (1)
  • Tie training no independent evidence
    purpose: Data augmentation via equal-utility preference pairs to introduce regularization against spurious features
    New strategy proposed to selectively reduce spurious learning

pith-pipeline@v0.9.0 · 5504 in / 1347 out tokens · 109032 ms · 2026-05-13T06:26:36.426012+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 6 internal anchors
