When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

Anh Tuan Luu; Cong-Duy Nguyen; Khoi Le; Miao Chunyan; Phong Nguyen; See-kiong Ng; Thong Nguyen; Tri Cao

arxiv: 2605.25629 · v2 · pith:SAD3VLBWnew · submitted 2026-05-25 · 💻 cs.CL · cs.LG


Pith reviewed 2026-06-29 21:21 UTC · model grok-4.3


keywords: weak-to-strong generalization · reward modeling · preference learning · distribution shift · representation anchoring · scalable oversight · language model alignment


The pith

Strong reward models trained on weak preference labels succeed within the training distribution yet fail to transfer to shifted preference datasets due to representational drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies weak-to-strong preference learning for reward models when the test distribution differs from training. Strong models can score well on the weak labels seen during training but then perform poorly on new preference datasets. The authors trace this to fine-tuning that shifts the model's internal features toward source-specific patterns instead of preserving general preference understanding. They introduce Representation Anchoring, a regularizer that keeps representations close to the original pretrained space while still allowing useful adaptation. Experiments across domains, datasets, and model sizes show that anchoring improves transfer without harming in-distribution results.

Core claim

Strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. This stems from a representational failure mode where weak-supervised fine-tuning pulls the strong model toward source-domain features instead of maintaining broadly transferable preference representations. Representation Anchoring constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation, and thereby improves out-of-distribution transfer across preference domains, datasets, and model families.

What carries the argument

Representation Anchoring (Anchor), a regularizer that constrains the distance of fine-tuned representations from the pretrained strong model's representation space.
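
A minimal sketch of how such an objective could be assembled, based only on the description on this page (a weak-to-strong preference loss plus a penalty tying response-token hidden states to a frozen pretrained reference). The squared-L2 distance, the soft-label Bradley-Terry loss, the default `lam`, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def w2s_preference_loss(r_a, r_b, weak_prob):
    """Bradley-Terry cross-entropy against weak-teacher soft labels.

    weak_prob[i] is the weak teacher's probability that response A beats B
    (assumed label format; the paper's exact loss is not shown on this page).
    """
    margin = np.asarray(r_a, dtype=float) - np.asarray(r_b, dtype=float)
    weak_prob = np.asarray(weak_prob, dtype=float)
    log_p = -np.logaddexp(0.0, -margin)      # log sigmoid(margin), stable
    log_not_p = -np.logaddexp(0.0, margin)   # log sigmoid(-margin), stable
    return float(np.mean(-(weak_prob * log_p + (1.0 - weak_prob) * log_not_p)))


def anchor_penalty(h_student, h_reference, response_mask):
    """Mean squared distance to the frozen reference over response tokens.

    h_student, h_reference: (batch, seq, dim) hidden states.
    response_mask: (batch, seq), 1.0 on response tokens and 0.0 elsewhere.
    Squared L2 is an assumed choice of distance.
    """
    sq_dist = np.sum((h_student - h_reference) ** 2, axis=-1)
    n_tokens = np.maximum(np.sum(response_mask), 1.0)
    return float(np.sum(sq_dist * response_mask) / n_tokens)


def anchored_loss(r_a, r_b, weak_prob, h_student, h_reference, response_mask, lam=0.1):
    """Total objective: preference loss plus lam times the anchoring penalty."""
    return w2s_preference_loss(r_a, r_b, weak_prob) + lam * anchor_penalty(
        h_student, h_reference, response_mask
    )
```

In this sketch the reference hidden states enter only the training loss, which is consistent with the reference model being discarded at inference; setting `lam=0` recovers plain weak-to-strong fine-tuning.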

If this is right

In-distribution performance remains competitive under the anchoring regularizer.
Out-of-distribution transfer improves consistently across preference domains, datasets, and model families.
Transfer-aware metrics are required to detect brittleness that standard in-distribution evaluations miss.
Current weak-to-strong reward modeling approaches exhibit hidden limitations under preference shift.
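
One way to make the in-distribution versus out-of-distribution comparison concrete is sketched below. It assumes each evaluation set provides reward scores for the preferred and dispreferred response of every pair; the `transfer_gap` quantity and both function names are hypothetical illustrations, not the paper's transfer-aware metrics, which this page does not define.

```python
import numpy as np


def pairwise_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the preferred response gets the higher reward."""
    r_chosen = np.asarray(r_chosen, dtype=float)
    r_rejected = np.asarray(r_rejected, dtype=float)
    return float(np.mean(r_chosen > r_rejected))


def transfer_report(scores_by_dataset, source):
    """Compare accuracy on the training source with accuracy on unseen datasets.

    scores_by_dataset maps a dataset name to (r_chosen, r_rejected) arrays.
    source is the name of the dataset the student was trained on.
    """
    acc = {
        name: pairwise_accuracy(chosen, rejected)
        for name, (chosen, rejected) in scores_by_dataset.items()
    }
    ood = [a for name, a in acc.items() if name != source]
    if not ood:
        raise ValueError("need at least one dataset other than the source")
    id_acc = acc[source]
    ood_acc = float(np.mean(ood))
    # A large positive gap signals in-distribution success hiding OOD failure.
    return {"id_acc": id_acc, "ood_acc": ood_acc, "transfer_gap": id_acc - ood_acc}
```

Reporting the gap alongside the two accuracies keeps a model that is uniformly weak from looking robust merely because its gap is small.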

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Anchoring techniques may prove useful in other weak-to-strong alignment tasks that involve representation stability.
Preference dataset construction could benefit from deliberate diversity to reduce dependence on post-hoc anchoring.
Scalable oversight protocols should incorporate explicit zero-shot shift tests rather than relying solely on matched train-test splits.

Load-bearing premise

The observed transfer failures are caused by excessive representational drift toward source-domain features that can be mitigated by constraining distance from the pretrained representation space.

What would settle it

If applying Representation Anchoring produces no measurable reduction in representational drift or no corresponding gain in out-of-distribution transfer on held-out preference datasets, the proposed mechanism would be falsified.
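
Representational drift of this kind can be quantified with a similarity index between fine-tuned and pretrained features. The sketch below uses linear CKA, a standard choice; whether the paper uses the linear or a kernel variant, and which layer or pooling it applies, is not stated on this page, so those details are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear centered kernel alignment between two feature matrices.

    x: (n, d1) and y: (n, d2) hold features for the same n examples,
    e.g. pooled hidden states from the base and the fine-tuned model.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_distance(x, y):
    """Drift measure: 0 means identical up to rotation and isotropic scaling."""
    return 1.0 - linear_cka(x, y)
```

Under the proposed mechanism, one would expect an anchored checkpoint to show a smaller `cka_distance` to the base model than an unanchored one trained on the same weak labels, together with better out-of-distribution accuracy.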

Figures

Figures reproduced from arXiv: 2605.25629 by Anh Tuan Luu, Cong-Duy Nguyen, Khoi Le, Miao Chunyan, Phong Nguyen, See-kiong Ng, Thong Nguyen, Tri Cao.

**Figure 1.** ID success can hide OOD failure. Although the resulting student can perform well on in-domain evaluation, its performance can degrade substantially when tested on unseen preference domains. This motivates evaluating W2S reward models under zero-shot preference-domain shift rather than relying only on in-domain accuracy.

**Figure 2.** Overview of ANCHOR under zero-shot preference-domain shift. ANCHOR trains a strong reward model with the standard weak-to-strong preference loss L_w2s, while regularizing its response-token hidden states toward a frozen pretrained reference model through L_anchor. At inference, the reference model is discarded and only the learned scalar reward model is used for both in-distribution and zero-shot out-of-dist…

**Figure 3.** Effect of anchoring strength λ on in-distribution and out-of-distribution transfer performance. Larger x-axis values correspond to smaller anchoring coefficients, i.e., weaker representation anchoring.

**Figure 4.** CKA and CCA distances between HelpSteer3-trained LoRA checkpoints and the frozen Llama-8B base.

Original abstract

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper identifies a clear failure mode where W2S reward models succeed in-distribution but drop on preference shifts, and shows that a simple anchoring regularizer helps keep representations more transferable.


The main takeaway is that standard weak-to-strong fine-tuning on preferences can overfit to source-domain features, so the student looks competent on the training set but transfers poorly when the preference distribution changes. They test this with zero-shot shifts across datasets and model families, and introduce Representation Anchoring to limit how far the fine-tuned model drifts from the original pretrained space.

What stands out is the evaluation protocol itself. Most W2S work stays in-distribution; running the same student across mismatched preference sources is a useful stress test and matches real deployment concerns in scalable oversight. The Anchor regularizer is straightforward (no new architecture, just a distance penalty during training), and the abstract claims it improves OOD transfer without hurting in-distribution scores. That combination of diagnosis plus a practical mitigation is the useful part.

The soft spot is isolation of the mechanism. The central story attributes the OOD drop to representational drift toward source features, but the abstract gives no sign they matched label noise, dataset size, or construction artifacts between in- and out-of-distribution sets, or compared Anchor against generic regularizers like stronger weight decay. Without those controls it is hard to know whether the gains come from preserved transferable features or simply from less aggressive fitting to weak-label quirks. The numbers and ablations are also missing from what is visible, so the size of the effect and its robustness remain open.

This is worth a serious referee for groups working on reward modeling and alignment. Readers who care about preference transfer under distribution shift will get concrete ideas to try; the work is not yet at the stage where it changes practice on its own. I would send it out for review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper studies weak-to-strong preference learning under zero-shot distribution shift across preference datasets. It claims that strong students trained on weak labels can succeed in-distribution yet fail to transfer, due to a representational failure mode in which fine-tuning pulls models toward source-domain features rather than maintaining transferable preference representations. It proposes Representation Anchoring (Anchor), a regularizer that constrains drift from the pretrained representation space, and reports that Anchor improves OOD transfer while preserving competitive ID performance across domains, datasets, and model families.

Significance. If the empirical results and controls hold, the work would be significant for scalable oversight research: it supplies a concrete evaluation protocol and transfer-aware metrics that expose brittleness in existing W2S reward modeling, together with a simple, practical regularizer that demonstrably improves robustness. The emphasis on out-of-distribution transfer rather than matched train-test evaluation is a useful corrective to current practice.

major comments (2)

[Abstract] The central claims (in-distribution success with OOD failure, representational drift as the cause, and consistent gains from Anchor) are stated without any quantitative results, error bars, dataset sizes, model scales, or ablation numbers, so the magnitude, statistical reliability, and reproducibility of the findings cannot be assessed from the manuscript text.
[Method / Experiments, as implied by the abstract's description of Anchor] The attribution of transfer failure specifically to excessive representational drift toward source features is not isolated from alternative explanations such as label noise levels or dataset-construction artifacts; without matched noise controls or head-to-head comparisons against other regularizers of comparable strength (e.g., weight decay or label smoothing), the reported OOD gains from Anchor could arise from generic overfitting reduction rather than preservation of transferable representations.
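
As an illustration of the kind of generic-regularizer baseline requested in the second major comment, a label-smoothed pairwise loss could look like the sketch below. The smoothing form, the `eps` default, and the function name are assumptions made for illustration; the page does not show how the authors configured their comparison.

```python
import numpy as np


def smoothed_pairwise_loss(r_chosen, r_rejected, eps=0.1):
    """Pairwise logistic loss with label smoothing.

    The weak-preferred response is given target probability 1 - eps instead
    of 1, which discourages arbitrarily large reward margins on weak labels.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    log_p = -np.logaddexp(0.0, -margin)      # log sigmoid(margin)
    log_not_p = -np.logaddexp(0.0, margin)   # log sigmoid(-margin)
    return float(np.mean(-((1.0 - eps) * log_p + eps * log_not_p)))
```

Unlike a representation-space penalty, this baseline acts only on the output margin, so a comparison at matched strength would help separate generic overfitting reduction from preservation of pretrained features.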

minor comments (1)

[Abstract] The phrase 'zero-shot distribution shift' is used without a precise definition of what is held fixed versus varied between source and target preference datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and rigor of our presentation.


Referee: [Abstract] The central claims (in-distribution success with OOD failure, representational drift as the cause, and consistent gains from Anchor) are stated without any quantitative results, error bars, dataset sizes, model scales, or ablation numbers, so the magnitude, statistical reliability, and reproducibility of the findings cannot be assessed from the manuscript text.

Authors: We agree with this observation. The original abstract was written to be concise, but we recognize that including quantitative highlights would better convey the strength of the results. In the revised version, we have updated the abstract to include specific numbers on in-distribution and out-of-distribution performance gains from Anchor, along with details on the datasets, model scales, and number of runs for error bars.

revision: yes

Referee: [Method / Experiments, as implied by the abstract's description of Anchor] The attribution of transfer failure specifically to excessive representational drift toward source features is not isolated from alternative explanations such as label noise levels or dataset-construction artifacts; without matched noise controls or head-to-head comparisons against other regularizers of comparable strength (e.g., weight decay or label smoothing), the reported OOD gains from Anchor could arise from generic overfitting reduction rather than preservation of transferable representations.

Authors: This is a valid concern. Our manuscript provides evidence through representation probing and similarity metrics showing that standard W2S fine-tuning leads to greater drift from the pretrained space, correlating with OOD failure. However, to address potential confounds from generic regularization, we have conducted additional experiments comparing Anchor to weight decay and label smoothing at matched regularization strengths. These results, now included in the revised paper, demonstrate that Anchor yields superior OOD performance, suggesting the benefit is not solely from overfitting reduction. For label noise and dataset artifacts, we used consistent weak label generation procedures across domains, but we have added a discussion of these factors and note that future work could explore explicit noise matching.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental observations, not self-referential derivations or fitted inputs.


The paper presents an empirical study of weak-to-strong preference learning under distribution shift, proposing Representation Anchoring as a regularizer based on observed transfer failures. No equations, derivations, or first-principles results are described in the abstract or reader summary. Claims about representational drift and mitigation are framed as experimental findings rather than mathematical reductions to inputs. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The central argument relies on external benchmarks (preference datasets, model families) and does not reduce by construction to fitted parameters renamed as predictions. This is a standard non-finding for an evaluation-focused ML paper without visible theoretical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or invented entities; full paper required for ledger construction.



