Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Ruilin Tang; Yu Hu; Zhong Ye

arxiv: 2605.27469 · v1 · pith:QQM7U4YUnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Zhong Ye , Yu Hu , Ruilin Tang This is my paper

Pith reviewed 2026-06-29 19:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords continual learninglogit shiftarchitecture selectionmodel selectionplasticity-stabilityneural network widthsgradient scaling

0 comments

The pith

Architecture-driven Shift (ADS) predicts logit shift trends from architecture properties and few samples in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decouples logit shift into an architecture-dependent part and a data-dependent part, then combines them into Architecture-driven Shift (ADS) to forecast how much logits will change when a pre-trained model learns a new task. This matters because full logit-shift measurement is too expensive for choosing among many candidate architectures, while ADS needs only a small data sample and the model's widths and depths. The relation is derived from three parts: how gradient spectral norms scale with layer width, how far the optimizer travels on the new task, and how wide networks create asymptotic task conflicts. Experiments across more than 175 heterogeneous architectures show monotonic correlation with observed logit shift, with the weakest Spearman's rank correlation at 0.731. ADS also serves as a cheap stand-in for expected calibration error when selecting reliable continual-learning models on three datasets.

Core claim

We decouple logit shift into architecture dependency and data dependency to establish our framework, which reveals that the combination of two dependency, defined as Architecture-driven Shift (ADS), that can capture the logit shift tendency well computable with few data samples. Specifically, for a well-optimized model on prior tasks, higher ADS is associated with a larger logit shift after training on the current task, which derived based on three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, (3) the asymptotic task conflict in wide networks.

What carries the argument

Architecture-driven Shift (ADS), the combination of architecture dependency (from spectral-norm gradient scaling, optimization path length, and asymptotic task conflict) and data dependency.

If this is right

ADS ranks architectures by expected logit shift without retraining each one on the full sequence.
ADS supplies a lightweight proxy for expected calibration error when choosing models for continual learning.
The three mechanistic components together explain why wider or deeper layers produce larger shifts on subsequent tasks.
ADS remains computable from a small subset of the new-task data.
The framework applies to real-world networks whose layer widths vary, unlike earlier uniform-width analyses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

ADS could be used to guide architecture search toward lower-shift designs before any continual-learning training begins.
The same decoupling might apply to other shift-sensitive metrics such as feature drift or decision-boundary movement.
If the three mechanistic components dominate, then controlling layer widths alone could reduce unwanted logit shift even when data order changes.

Load-bearing premise

The logit shift can be cleanly decoupled into an architecture dependency and a data dependency such that their combination captures the tendency of logit shift using only a few data samples, independent of specific optimization details or task ordering.

What would settle it

Observing a collection of new heterogeneous architectures where the Spearman's rank correlation between ADS and measured logit shift falls below 0.7 would falsify the central relation.

Figures

Figures reproduced from arXiv: 2605.27469 by Ruilin Tang, Yu Hu, Zhong Ye.

**Figure 2.** Figure 2: Empirical Validation of the Middle-Layer Vulnerability. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Quantitative evaluation of the ADS-based selector: Calibration efficacy, filtering trade [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Empirical Validation of the Middle-Layer Vulnerability across several scenarios: cali [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

**Figure 5.** Figure 5: Correlation comparison between empirically observed logit shift and theoretical metrics [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗

**Figure 6.** Figure 6: Quantitative evaluation of the ADS-based selector across all benchmarks: calibration [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

read the original abstract

Continual Learning (CL) is a practical paradigm to utilize power of deep pre-trained neural networks, but which pre-trained model has a better ability to balance ``Plasticity-Stability", deserving to be chosen? The logit shift serves as a natural proxy because it represents the logit shift in CL scenarios. However, obtaining the logit shift requires huge computational cost, which hinders large-scale model selection. Existing theoretical analyses fail to offer an efficient alternative because of the assumption of uniform hidden layer widths, which ignores the structural heterogeneity (variable width and depth) of real-world architectures. This raises a critical question: what theoretically relationship can be identified between heterogeneous architecture and logit shift on prior tasks (that the model has been trained on)? To answer the question, we decouple logit shift into architecture dependency and data dependency to establish our framework, which reveals that the combination of two dependency, defined as Architecture-driven Shift (ADS), that can capture the logit shift tendency well computable with few data samples. Specifically, for a well-optimized model on prior tasks, higher ADS is associated with a larger logit shift after training on the current task, which derived based on three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, and (3) the asymptotic task conflict in wide networks. Extensive empirical results across more than 175 diverse architectures demonstrate a strong monotonic correlation (the weakest Spearman's $r_s=0.731$) between ADS and logit shift. Practically, we demonstrate that ADS can serve as a lightweight proxy of the expected calibration error, which is a widely used metric for reliable CL model selection, on three datasets across six scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ADS gives a practical cheap proxy for logit shift across heterogeneous nets with decent correlation numbers, but the steps from the three components to the explicit formula are not shown.

read the letter

The paper's main contribution is a composite score ADS that combines architecture statistics with limited data to track logit shift tendency in continual learning. It drops the uniform-width assumption from earlier work and tests the idea on more than 175 real architectures, reporting a minimum Spearman's rs of 0.731 plus some utility as a stand-in for expected calibration error on three datasets.

The empirical side is straightforward and addresses a real pain point: screening many pre-trained models without running full CL training each time. The correlation holds across the reported setups, and the practical demonstration on model selection is the part that could actually get used.

The soft spot is the missing derivation. The abstract lists three mechanistic pieces—spectral-norm scaling with width, optimization path length, and asymptotic task conflict—but does not show how they are combined into the final ADS expression or whether any free parameters were fitted. Without those steps it is difficult to judge whether the claimed clean split between architecture and data dependencies leaves residual dependence on optimizer schedule or task order, especially once layer widths and depths vary.

This is for researchers who need fast architecture screening inside continual learning pipelines. Readers focused on efficient CL deployment will find the correlation results and the proxy use case worth checking. It deserves a serious referee because the empirical numbers are concrete and the underlying problem is practical, even though the theory side needs the explicit formulas and controls filled in.

I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes Architecture-driven Shift (ADS) as a lightweight proxy for logit shift in continual learning (CL) model selection. It decouples logit shift into architecture and data dependencies, with ADS defined as their combination and derived from three components: spectral norm scaling of weight matrix gradients with layer width, optimization path length of the new task, and asymptotic task conflict in wide networks. For well-optimized models, higher ADS is claimed to associate with larger logit shift after new-task training. Empirical results report a minimum Spearman's rs of 0.731 across 175 architectures and utility as a proxy for expected calibration error (ECE) on three datasets across six scenarios.

Significance. If the decoupling and derivation hold without residual optimizer or ordering dependencies, ADS could enable efficient pre-trained model selection for CL by avoiding full logit-shift computation, with the scale of the 175-architecture study providing notable empirical grounding for practical use in architecture search.

major comments (2)

[Abstract] Abstract / derivation: No explicit formula or combination steps are given for how the three mechanistic components (spectral norm scaling, path length, asymptotic conflict) aggregate into ADS. This leaves open whether residual terms remain that depend on learning-rate schedule, momentum, or task permutation, directly undermining the claimed clean decoupling into architecture vs. data dependencies that is load-bearing for the independence from optimization details.
[Abstract] Abstract / heterogeneous architectures: The spectral-norm scaling component is invoked for variable widths and depths, yet the original arguments typically assume uniform layers; without the extension shown, it is unclear whether the ADS expression remains valid for the heterogeneous real-world architectures tested, which is central to the generalizability claim across 175 models.

minor comments (2)

[Abstract] The abstract reports the weakest rs=0.731 but does not identify the dataset/scenario yielding this minimum, reducing clarity on robustness.
Notation for the final ADS expression should be introduced with an equation number once the derivation is supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed comments on our manuscript. We address each major comment below and have updated the abstract and relevant sections to improve clarity on the derivation and applicability to heterogeneous architectures.

read point-by-point responses

Referee: [Abstract] Abstract / derivation: No explicit formula or combination steps are given for how the three mechanistic components (spectral norm scaling, path length, asymptotic conflict) aggregate into ADS. This leaves open whether residual terms remain that depend on learning-rate schedule, momentum, or task permutation, directly undermining the claimed clean decoupling into architecture vs. data dependencies that is load-bearing for the independence from optimization details.

Authors: We agree that the abstract would benefit from an explicit statement of how the components combine into ADS. The full derivation in Section 3 defines ADS as the product of the three terms (spectral norm scaling of gradients, optimization path length, and asymptotic task conflict), which isolates the architecture-dependent factors under fixed optimization settings. We will revise the abstract to include this formula and note the assumptions regarding optimizer independence for the decoupling. revision: yes
Referee: [Abstract] Abstract / heterogeneous architectures: The spectral-norm scaling component is invoked for variable widths and depths, yet the original arguments typically assume uniform layers; without the extension shown, it is unclear whether the ADS expression remains valid for the heterogeneous real-world architectures tested, which is central to the generalizability claim across 175 models.

Authors: Our derivation extends the spectral norm scaling to heterogeneous architectures by computing it layer-wise and aggregating across layers of varying widths and depths. This is presented in the theoretical framework and empirically validated with the 175 architectures. To address the concern, we will add a brief mention of this layer-wise extension in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation claims independence via three components without equations reducing ADS to logit shift by construction

full rationale

The abstract decouples logit shift into architecture and data dependencies whose combination is defined as ADS, then states that higher ADS associates with larger logit shift derived from spectral norm scaling, optimization path length, and asymptotic task conflict. No equations are supplied showing that the ADS expression equals or is fitted to the logit shift quantity itself. Empirical correlation (Spearman's r_s >=0.731) is reported separately as validation. No self-citation, ansatz smuggling, or renaming of known results is present in the text. The derivation is therefore treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the decoupling step and the assertion that the three listed components suffice to capture architecture-driven logit shift; no explicit free parameters are named, but the construction of ADS itself functions as an invented composite quantity whose independent grounding is the reported correlation.

axioms (2)

domain assumption Logit shift factors cleanly into architecture dependency and data dependency
Invoked to establish the ADS framework
ad hoc to paper Spectral norm scaling, optimization path length, and asymptotic task conflict together determine the architecture-driven component
Listed as the basis for the derivation

invented entities (1)

Architecture-driven Shift (ADS) no independent evidence
purpose: Lightweight computable proxy for logit-shift tendency
Newly defined combination of the two dependencies

pith-pipeline@v0.9.1-grok · 5846 in / 1511 out tokens · 40563 ms · 2026-06-29T19:10:39.713505+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 5 canonical work pages · 2 internal anchors

[1]

A., Doan, T., and Sugiyama, M

URLhttps: //openreview.net/forum?id=SkMwpiR9Y7. Mehdi Abbana Bennani, Thang Doan, and Masashi Sugiyama. Generalisation guarantees for con- tinual learning with orthogonal gradient descent.arXiv preprint arXiv:2006.11942,

work page arXiv 2006
[2]

Qualitatively characterizing neural network optimization problems

11 Published as a conference paper at ICLR 2026 Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems.arXiv preprint arXiv:1412.6544,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace.arXiv preprint arXiv:1812.04754,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Weight initialization and variance dynamics in deep neural networks and large lan- guage models.arXiv preprint arXiv:2510.09423,

Yankun Han. Weight initialization and variance dynamics in deep neural networks and large lan- guage models.arXiv preprint arXiv:2510.09423,

work page arXiv
[5]

Michael McCloskey and Neal J Cohen

URLhttps: //www.ijcai.org/proceedings/2024/514. Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pp. 109–165. Elsevier,

2024
[6]

Hallmarks of optimiza- tion trajectories in neural networks: Directional exploration and redundancy.arXiv preprint arXiv:2403.07379,

Sidak Pal Singh, Bobby He, Thomas Hofmann, and Bernhard Sch ¨olkopf. Hallmarks of optimiza- tion trajectories in neural networks: Directional exploration and redundancy.arXiv preprint arXiv:2403.07379,

work page arXiv
[7]

12 Published as a conference paper at ICLR 2026 Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume

2026
[8]

PathLen(l) ≈C traj ·Disp (l),(23) whereC traj is constant

A DETAILEDTHEORETICALFRAMEWORK Assumption A.1(Trajectory Regularity).The optimization trajectory in parameter space is free from sharp directional reversals, implying that the ratio of path length to displacement remains bounded. PathLen(l) ≈C traj ·Disp (l),(23) whereC traj is constant. There are a growing body of theoretical and empirical work supports ...

2021
[9]

Global Success

and zero biases. The network is initialized in a trainable regime where gradients do not vanish exponentially with depth. Specifically, the layer-wise weightsΘ (l) follow a non-degenerate distribution with variance satisfying: VarΘ(l) ≫0,∀l= 1, . . . , L⇐ ⇒E[Θ (l)2]≫ E[Θ(l)] 2 .(24) This assumption is both theoretically grounded and standard in practice, ...

2015
[10]

width” and “depth

Theorem A.10(Scaling Law of Output Shift).Letf i(x)andf i+1(x)denote the model output before and after learning taski+ 1, respectively. Under the assumption that second-order terms are negligible (dominating<20%of the variation), the shift in the output function for a previous task samplexscales as: ∥fi+1(x)−f i(x)∥2 ∝ LX l=1 (w(l−1))α+ 1 2 (w(l))β · |lbe...

2018
[11]

20 Published as a conference paper at ICLR 2026 (a) Validation on fully-connected neural network (FNN)

B.3 ADDITIONALEVIDENCE OFMIDDLE-LAYERVULNERABILITY As demonstrated in Figure 4, the middle-layer vulnerability exist across scenarios (different combi- nation of various datasets) and model families (fully-connected neural networks and convolutional neural networks). 20 Published as a conference paper at ICLR 2026 (a) Validation on fully-connected neural ...

2026
[12]

22 Published as a conference paper at ICLR 2026 B.6 EXPERIMENTALRESULTS The empirical validation of the ADS-based selector’s performance in reliable continual learning model selection is shown in Figure

2026
[13]

As illustrated in the mid- dle column of Figure 6, the ADS-based selector serves as an effective coarse-grained selector

Although the selected architectures do not achieve the ideal lowest Expected Calibration Error (ECE), they are more reliable than both the vanilla architecture and those calibrated with a post-hoc temperature scaling mechanism. As illustrated in the mid- dle column of Figure 6, the ADS-based selector serves as an effective coarse-grained selector. It achi...

2026

[1] [1]

A., Doan, T., and Sugiyama, M

URLhttps: //openreview.net/forum?id=SkMwpiR9Y7. Mehdi Abbana Bennani, Thang Doan, and Masashi Sugiyama. Generalisation guarantees for con- tinual learning with orthogonal gradient descent.arXiv preprint arXiv:2006.11942,

work page arXiv 2006

[2] [2]

Qualitatively characterizing neural network optimization problems

11 Published as a conference paper at ICLR 2026 Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems.arXiv preprint arXiv:1412.6544,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace.arXiv preprint arXiv:1812.04754,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Weight initialization and variance dynamics in deep neural networks and large lan- guage models.arXiv preprint arXiv:2510.09423,

Yankun Han. Weight initialization and variance dynamics in deep neural networks and large lan- guage models.arXiv preprint arXiv:2510.09423,

work page arXiv

[5] [5]

Michael McCloskey and Neal J Cohen

URLhttps: //www.ijcai.org/proceedings/2024/514. Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pp. 109–165. Elsevier,

2024

[6] [6]

Hallmarks of optimiza- tion trajectories in neural networks: Directional exploration and redundancy.arXiv preprint arXiv:2403.07379,

Sidak Pal Singh, Bobby He, Thomas Hofmann, and Bernhard Sch ¨olkopf. Hallmarks of optimiza- tion trajectories in neural networks: Directional exploration and redundancy.arXiv preprint arXiv:2403.07379,

work page arXiv

[7] [7]

12 Published as a conference paper at ICLR 2026 Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume

2026

[8] [8]

PathLen(l) ≈C traj ·Disp (l),(23) whereC traj is constant

A DETAILEDTHEORETICALFRAMEWORK Assumption A.1(Trajectory Regularity).The optimization trajectory in parameter space is free from sharp directional reversals, implying that the ratio of path length to displacement remains bounded. PathLen(l) ≈C traj ·Disp (l),(23) whereC traj is constant. There are a growing body of theoretical and empirical work supports ...

2021

[9] [9]

Global Success

and zero biases. The network is initialized in a trainable regime where gradients do not vanish exponentially with depth. Specifically, the layer-wise weightsΘ (l) follow a non-degenerate distribution with variance satisfying: VarΘ(l) ≫0,∀l= 1, . . . , L⇐ ⇒E[Θ (l)2]≫ E[Θ(l)] 2 .(24) This assumption is both theoretically grounded and standard in practice, ...

2015

[10] [10]

width” and “depth

Theorem A.10(Scaling Law of Output Shift).Letf i(x)andf i+1(x)denote the model output before and after learning taski+ 1, respectively. Under the assumption that second-order terms are negligible (dominating<20%of the variation), the shift in the output function for a previous task samplexscales as: ∥fi+1(x)−f i(x)∥2 ∝ LX l=1 (w(l−1))α+ 1 2 (w(l))β · |lbe...

2018

[11] [11]

20 Published as a conference paper at ICLR 2026 (a) Validation on fully-connected neural network (FNN)

B.3 ADDITIONALEVIDENCE OFMIDDLE-LAYERVULNERABILITY As demonstrated in Figure 4, the middle-layer vulnerability exist across scenarios (different combi- nation of various datasets) and model families (fully-connected neural networks and convolutional neural networks). 20 Published as a conference paper at ICLR 2026 (a) Validation on fully-connected neural ...

2026

[12] [12]

22 Published as a conference paper at ICLR 2026 B.6 EXPERIMENTALRESULTS The empirical validation of the ADS-based selector’s performance in reliable continual learning model selection is shown in Figure

2026

[13] [13]

As illustrated in the mid- dle column of Figure 6, the ADS-based selector serves as an effective coarse-grained selector

Although the selected architectures do not achieve the ideal lowest Expected Calibration Error (ECE), they are more reliable than both the vanilla architecture and those calibrated with a post-hoc temperature scaling mechanism. As illustrated in the mid- dle column of Figure 6, the ADS-based selector serves as an effective coarse-grained selector. It achi...

2026