Weight-Space Geometry of Offline Reasoning Training

Aleksandr Nikolich; Igor Kiselev; Karina Romanova; Vladimir Platonov

arxiv: 2606.23740 · v1 · pith:3OSFOGQDnew · submitted 2026-06-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 10:24 UTC · model grok-4.3

keywords: offline reinforcement learning · weight space geometry · DPO · mode connectivity · CKA · reasoning distillation · GSM8K

The pith

DPO produces weight deltas in a near-orthogonal subspace to other offline losses, crosses a mode-connectivity barrier, and reaches the highest accuracy on GSM8K and AIME26.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains six offline losses (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from one base model using attention-only LoRA. It then compares the resulting weight deltas with cosine similarity, principal-angle analysis, linear mode connectivity, and CKA. SFT, RFT, and RIFT produce nearly colinear deltas and similar accuracy around 87-88 percent. Offline GRPO adds an orthogonal component while staying in the same basin. DPO occupies a near-orthogonal subspace, shows a connectivity barrier, and drops late-layer CKA to about 0.46, yet records 93.5 percent on GSM8K and 30 percent on AIME26. The comparison treats the standard 10x smaller learning rate for DPO as part of the joint loss-plus-optimizer choice.
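
As a rough illustration of the first of those comparisons, the sketch below computes a global cosine between two methods' LoRA deltas. It assumes each adapter is materialized as a dense update ΔW = B·A per module and that all modules are flattened into one long vector; the module names, shapes, and random data are hypothetical and are not taken from the paper.

```python
import numpy as np

def lora_delta(B: np.ndarray, A: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Dense update for one module, assuming delta_W = scale * B @ A."""
    return scale * (B @ A)

def global_cosine(deltas_x: dict, deltas_y: dict) -> float:
    """Cosine between two updates after flattening and concatenating all modules."""
    keys = sorted(deltas_x)  # same module order for both methods
    x = np.concatenate([deltas_x[k].ravel() for k in keys])
    y = np.concatenate([deltas_y[k].ravel() for k in keys])
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
modules = ["layer0.q_proj", "layer0.v_proj"]  # hypothetical module names

def random_adapter() -> dict:
    """One synthetic rank-r update per module."""
    return {m: lora_delta(rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
            for m in modules}

sft = random_adapter()
# Stand-in for a nearly colinear method: the same update plus small noise.
rft = {m: sft[m] + 0.1 * rng.normal(size=(d_out, d_in)) for m in modules}
# Stand-in for an unrelated method: an independent random update.
other = random_adapter()

print("near-colinear pair:", global_cosine(sft, rft))
print("unrelated pair:    ", global_cosine(sft, other))
```

A per-layer variant would apply the same function to the modules of one layer at a time instead of the full concatenation.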

Core claim

DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46 while reaching the highest accuracy (93.5 percent on GSM8K, 30.0 percent on AIME26); SFT, RFT, and RIFT remain nearly colinear with cosine similarity at least 0.97 and comparable accuracy, whereas Offline GRPO introduces a substantial orthogonal component yet stays inside the SFT loss basin.

What carries the argument

Geometry of weight deltas under different losses, quantified by cosine similarity, principal angles between subspaces, linear mode connectivity, and centered kernel alignment (CKA).
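
One of these measures, the principal angles between subspaces, can be sketched as follows. This is a minimal version that assumes the subspace of a delta is spanned by its top-k left singular vectors; whether the paper uses left or right singular directions, and which k, is not stated in this summary, so both are assumptions, and the matrices below are synthetic.

```python
import numpy as np

def top_left_subspace(delta: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (as columns) for the top-k left singular directions."""
    U, _, _ = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k]

def principal_angles_deg(U1: np.ndarray, U2: np.ndarray) -> np.ndarray:
    """Principal angles in degrees, smallest first, between two column spaces.

    Both inputs must have orthonormal columns; the cosines of the angles are
    the singular values of U1.T @ U2.
    """
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, 0.0, 1.0)))

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))  # synthetic rank-4 update
near = base + 0.05 * rng.normal(size=(64, 64))              # small perturbation of it
far = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))   # unrelated rank-4 update

k = 4
U_base = top_left_subspace(base, k)
angles_near = principal_angles_deg(U_base, top_left_subspace(near, k))
angles_far = principal_angles_deg(U_base, top_left_subspace(far, k))

print("smallest angle, perturbed copy:  ", angles_near[0])
print("smallest angle, unrelated update:", angles_far[0])
```

Running this per module and taking the median of the smallest angle would give a statistic of the same shape as the "top-1 principal angle, median over modules" figure quoted in the abstract.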

If this is right

SFT, RFT, and RIFT converge to nearly the same weight updates and downstream accuracy.
DFT produces weight deltas that diverge in direction more than any reward-weighted method despite identical data.
Offline GRPO adds an orthogonal direction to the SFT update while remaining inside the same loss basin.
DPO's distinct geometry coincides with the largest accuracy gains on both GSM8K and AIME26 under the reported protocol.
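
The "orthogonal direction" language above can be made concrete by splitting one delta into a part parallel to a reference delta and a remainder. The sketch below is an assumption about how such a split might be computed; this summary does not say whether the paper's orthogonal fraction is a ratio of norms or of squared norms, so both are returned. The vectors are toy values.

```python
import numpy as np

def decompose_against(delta, ref) -> dict:
    """Split `delta` into components parallel and orthogonal to `ref`."""
    d = np.asarray(delta, dtype=float).ravel()
    s = np.asarray(ref, dtype=float).ravel()
    parallel = (d @ s) / (s @ s) * s   # projection of d onto s
    orth = d - parallel                # what the reference direction cannot explain
    d_norm = np.linalg.norm(d)
    return {
        "cosine": float(d @ s / (d_norm * np.linalg.norm(s))),
        "orth_norm_ratio": float(np.linalg.norm(orth) / d_norm),
        "orth_energy_ratio": float(orth @ orth / (d @ d)),
    }

ref = np.array([1.0, 0.0, 0.0])    # toy reference direction (e.g. an SFT delta)
delta = np.array([3.0, 4.0, 0.0])  # toy delta with a component off that direction
out = decompose_against(delta, ref)
print(out)
```

The two ratios are tied to the cosine: the energy ratio equals 1 - cosine², and the norm ratio is its square root, so a cosine well below 1 already implies a sizeable orthogonal share.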

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A learning-rate-matched DPO run would clarify how much of the geometric and accuracy differences stem from the loss alone versus the optimizer scale.
The orthogonality observed for DPO may mark a route to basins that generalize better on mathematical reasoning.
Similar weight-space diagnostics could be applied to other domains to test whether loss-induced geometric separation predicts performance.

Load-bearing premise

The observed differences in weight-space geometry and accuracy can be attributed primarily to the choice of loss function, with the 10x smaller learning rate for DPO treated as a joint factor.

What would settle it

Re-train DPO at the same learning rate as the other five methods and test whether the near-orthogonality, mode-connectivity barrier, reduced CKA, and accuracy advantage remain.
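
The barrier part of that check could be scripted along the following lines. This is a toy sketch: the paper evaluates masked-answer cross-entropy on GSM8K, whereas `toy_loss` here is a hypothetical two-basin function, and whether the paper interpolates dense deltas or LoRA factors is not stated in this summary.

```python
import numpy as np

def interpolate(theta_a: np.ndarray, theta_b: np.ndarray, alpha: float) -> np.ndarray:
    """Point on the straight line between two parameter vectors."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

def lmc_barrier(loss_fn, theta_a, theta_b, num: int = 21):
    """Loss along the linear path and its maximum excess over the endpoint chord."""
    alphas = np.linspace(0.0, 1.0, num)
    losses = np.array([loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    chord = (1.0 - alphas) * losses[0] + alphas * losses[-1]
    return alphas, losses, float(np.max(losses - chord))

def toy_loss(theta: np.ndarray) -> float:
    """Hypothetical loss with two basins, at theta[0] = -1 and theta[0] = +1."""
    return float((theta[0] ** 2 - 1.0) ** 2 + np.sum(theta[1:] ** 2))

same_basin_a = np.array([1.0, 0.0])
same_basin_b = np.array([0.9, 0.1])
other_basin = np.array([-1.0, 0.0])

_, _, barrier_same = lmc_barrier(toy_loss, same_basin_a, same_basin_b)
_, _, barrier_cross = lmc_barrier(toy_loss, same_basin_a, other_basin)
print("barrier within one basin:", barrier_same)
print("barrier across basins:   ", barrier_cross)
```

In a real run, `loss_fn` would load the interpolated weights into the model and return the evaluation loss, and the comparison of interest would be the barrier for a matched-LR DPO endpoint against the one reported under the standard protocol.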

Figures

Figures reproduced from arXiv: 2606.23740 by Aleksandr Nikolich, Igor Kiselev, Karina Romanova, Vladimir Platonov.

**Figure 1.** Global ∆W cosine across all eight losses (Qwen3-4B, seed 42; all adapters trained in one consistent space). Reward-weighted SFT/RFT/RIFT cluster (0.94–0.98); DFT intermediate (∼0.55); Offline GRPO at 0.71–0.80 to the cluster; DPO near-orthogonal (≤ 0.13). The two on-policy methods, Online GRPO and Online DAPO, are each near-orthogonal to every offline loss and to each other (−0.16); orthogonal-fraction of…

**Figure 2.** Seed and learning-rate sensitivity (SFT, Qwen3-4B). Left: across two seeds the output direction u1 stays aligned (∼0.99) while the input direction v1 and full cosine are low at small LR and rise with LR; dashed shows SFT–RFT at a fixed seed. Middle: a 10× LR step rotates ∆W (cosine ≈ 0.55) and grows its norm sub-linearly; LR is not a pure rescaling. Right: interpolating the two seeds' deltas shows no loss…

**Figure 3.** Greedy pass@1 with Wilson 95% CI bars on GSM8K (n=1319) and AIME26 (n=30). Dark bars: Qwen3-4B-Instruct. Light bars: Llama-3.2-3B-Instruct. On both architectures, DPO sits noticeably above the SFT/RFT/DFT/RIFT/Offline GRPO cluster on GSM8K (Qwen3: McNemar p < 10^-9 vs. each other method); Llama-3.2-3B AIME26 floors near zero at this model scale.

**Figure 4.** Per-layer cosine similarity of LoRA deltas to SFT, on Qwen3-4B (left, 36 layers) and Llama-3.2-3B (right, 28 layers). SFT/RFT/RIFT track each other across all layers; Offline GRPO, DFT, and especially DPO diverge in deeper layers, with the same qualitative pattern on both architectures.

**Figure 5.** Linear mode connectivity (masked-answer CE on GSM8K) on Qwen3-4B (left) and Llama-3.2-3B (right). Same picture: SFT/Off.GRPO/RIFT/DFT pairs are barrier-free; RIFT→DPO shows a sharp barrier above α=0.5 on both architectures (DPO endpoint loss 8.64 Qwen3, 8.96 Llama32).

**Figure 6.** Linear CKA of hidden states across all blocks for selected method pairs on 100 GSM8K prompts: Qwen3-4B (left, 36 blocks), Llama-3.2-3B (right, 28 blocks). On both architectures: SFT/RIFT indistinguishable (> 0.99), Off.GRPO diverges in output-facing layers, and DPO collapses in the final third (Qwen3 ∼0.45, Llama32 ∼0.62).
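
For readers unfamiliar with the quantity plotted in Figure 6, the sketch below implements linear CKA in its standard feature-space form (the one given in "Similarity of Neural Network Representations Revisited"). It assumes hidden states are arranged as one row per example and one column per feature; how the paper pools token positions into rows is not stated in this summary, and the data here is random.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices with matching rows (examples)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                  # e.g. 100 prompts, 32 features
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal matrix
X_rot = 3.0 * X @ Q                             # rotated and rescaled copy of X
Y_ind = rng.normal(size=(100, 32))              # unrelated activations

print("same representation, rotated:", linear_cka(X, X_rot))
print("unrelated representation:    ", linear_cka(X, Y_ind))
```

Because the score ignores rotations and isotropic rescaling of the feature space, a drop well below 1 in late blocks indicates a change in representation that is not just a change of basis.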

Original abstract

Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, we analyze the resulting deltas via cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. We observe: (i) SFT, RFT, and RIFT have nearly colinear weight deltas (cosine >= 0.97, top-1 principal angle ~7 deg median over 144 modules) and comparable GSM8K accuracy (87-88%, n=1319; pairwise McNemar p >= 0.15); (ii) DFT diverges further in direction than any reward-weighted method despite using the same data; (iii) Offline GRPO adds a substantial component orthogonal to the SFT direction (~67% globally, up to ~86% in late layers) while staying in the SFT loss basin; (iv) DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46. DPO also reaches the highest accuracy in our protocol on both GSM8K (93.5%, McNemar p < 10^-9 vs. each other method) and AIME26 (30.0% vs. 3.3-10.0%); its training uses a 10x smaller learning rate than the others (the standard convention), so the update-norm and accuracy gaps reflect loss-function and optimizer choices jointly, and a learning-rate-matched DPO comparison is left for future work.
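
The accuracy statistics quoted here (McNemar tests, and the Wilson intervals used in Figure 3) can be reproduced in outline with the standard library alone. This is a sketch under assumptions: the abstract does not say whether the exact or the chi-square form of McNemar's test was used, so the exact binomial form is shown, and the discordant-pair counts in the example are hypothetical because per-item results are not given on this page.

```python
from math import comb, sqrt

def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items method 1 solves and method 2 misses; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# 30.0% on AIME26 with n=30 corresponds to 9 solved problems.
aime_lo, aime_hi = wilson_interval(9, 30)
print(f"AIME26 9/30: [{aime_lo:.3f}, {aime_hi:.3f}]")

# 93.5% of 1319 is roughly 1233 correct; the count is inferred from the rounded rate.
gsm_lo, gsm_hi = wilson_interval(1233, 1319)
print(f"GSM8K 1233/1319: [{gsm_lo:.3f}, {gsm_hi:.3f}]")

# Hypothetical discordant counts, for illustration only.
print("McNemar p, balanced disagreements:", mcnemar_exact(5, 5))
print("McNemar p, one-sided disagreements:", mcnemar_exact(0, 10))
```

The width of the AIME26 interval shows why differences on a 30-problem benchmark warrant more caution than the GSM8K gaps.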

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The controlled geometry comparison across six methods is the real contribution, but DPO's reported distinctions are confounded by its 10x smaller learning rate.

This paper runs six offline methods (SFT, RFT, DFT, RIFT, GRPO, DPO) on the exact same math rollouts from Qwen3-4B using attention-only LoRA, then measures how the resulting weight deltas differ via cosine similarity, principal angles, linear mode connectivity, and CKA, alongside accuracy on GSM8K and AIME26 with McNemar tests.

What stands out is that SFT, RFT, and RIFT produce nearly collinear updates (cosine >=0.97, small angles), while DFT, GRPO, and DPO move in more distinct directions, with DPO showing the largest separation, a mode-connectivity barrier, and the highest accuracy (93.5% GSM8K, 30% AIME26). The side-by-side geometry metrics on shared data are new relative to prior single-method work.

The setup is controlled enough to make the measurements credible within its scope. The authors also flag the key limitation themselves: DPO used a 10x smaller learning rate, so the orthogonality, barrier, CKA drop, and accuracy edge reflect both the loss and the optimizer scale together. A matched-LR control is left for future work, which means the claim that DPO is mechanistically distinct cannot yet be isolated to the objective alone.

The paper stays narrow (one base model, one task family, LoRA only), but the direct measurements avoid circularity and the citation pattern is appropriate. Readers working on reasoning distillation or weight-space analysis of fine-tuning would find the metrics useful for deciding whether loss choice actually matters beyond final accuracy.

It deserves peer review. The design is tight enough that referees can evaluate the concrete numbers and the acknowledged LR caveat without needing major new experiments.

Referee Report

1 major / 2 minor

Summary. The manuscript examines whether six offline reasoning training methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) applied to identical math rollouts from Qwen3-4B with attention-only LoRA produce distinct weight-space geometries. Using cosine similarity, principal-angle analysis, linear mode connectivity, and CKA on the resulting deltas, it reports that SFT/RFT/RIFT deltas are nearly collinear (cosine >=0.97, ~7 deg angles) with comparable GSM8K accuracy (~87-88%), DFT diverges more, GRPO adds an orthogonal component while remaining in the SFT basin, and DPO is near-orthogonal with a mode-connectivity barrier and late-layer CKA ~0.46, also achieving the highest accuracy (93.5% GSM8K, 30% AIME26). The abstract notes DPO used a 10x smaller learning rate (standard convention), so its update-norm and accuracy gaps reflect loss and optimizer choices jointly.

Significance. If the reported geometric distinctions hold after isolating loss effects, the work supplies concrete evidence that offline RL losses induce mechanistically different weight updates rather than converging to equivalent directions, supported by multiple metrics and McNemar-tested accuracy differences on a controlled single-base-model setup. This could guide loss selection for reasoning distillation beyond accuracy tables alone.

major comments (1)

[Abstract] The central observations that DPO occupies a near-orthogonal subspace, exhibits a mode-connectivity barrier, collapses late-layer CKA to ~0.46, and attains the highest accuracy are obtained under a 10x smaller learning rate than SFT/RFT/DFT/RIFT/GRPO. The text explicitly states that the resulting gaps reflect loss-function and optimizer choices jointly and leaves a learning-rate-matched comparison for future work. Without that control, the geometric distinctions cannot be attributed primarily to the DPO objective, which is load-bearing for the claim that the methods are mechanistically distinct due to loss choice.

minor comments (2)

The manuscript should provide explicit details on data exclusion rules, exact rollout generation procedure, and complete hyperparameter tables (including LoRA rank, batch size, and optimizer settings) to support reproducibility of the reported metrics.
Clarify the precise aggregation method for the 144-module principal-angle and CKA statistics (e.g., median vs. mean, per-layer vs. global) in the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for identifying this important point about confounding factors in the DPO comparison. We respond to the major comment below.

Referee: [Abstract] The central observations that DPO occupies a near-orthogonal subspace, exhibits a mode-connectivity barrier, collapses late-layer CKA to ~0.46, and attains the highest accuracy are obtained under a 10x smaller learning rate than SFT/RFT/DFT/RIFT/GRPO. The text explicitly states that the resulting gaps reflect loss-function and optimizer choices jointly and leaves a learning-rate-matched comparison for future work. Without that control, the geometric distinctions cannot be attributed primarily to the DPO objective, which is load-bearing for the claim that the methods are mechanistically distinct due to loss choice.

Authors: We agree that the 10× smaller learning rate for DPO (standard convention) means the observed geometry and accuracy cannot be attributed solely to the loss function. The manuscript already states this qualification explicitly. The results nevertheless demonstrate that, when each method is run under its conventional hyperparameter protocol on identical data, the resulting weight deltas occupy distinct geometries, which matters for how these methods are applied in practice. We will revise the abstract to foreground this qualification and to clarify that the reported distinctions hold under standard training settings, without claiming to isolate the effect of the loss alone.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

The manuscript performs direct training runs of six methods on identical data, then measures weight deltas via cosine similarity, principal angles, mode connectivity, and CKA, plus downstream accuracies with McNemar tests. No equations derive a 'prediction' from fitted parameters; no self-citations support load-bearing uniqueness claims; the LR difference for DPO is explicitly flagged as a joint factor with future matched-LR work noted. All central claims rest on observable quantities independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Empirical comparative study; no new mathematical axioms or postulated entities. Relies on standard assumptions of LoRA fine-tuning, evaluation benchmarks (GSM8K, AIME), and similarity metrics.

free parameters (1)

DPO learning rate = 10x smaller than other methods
Chosen 10x smaller per standard convention; jointly affects update norm and accuracy gaps with the loss function.
