Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization

Anton Sugolov; Haoming Meng; Vardan Papyan

arxiv: 2606.30813 · v1 · pith:D6S6PTC7new · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-07-01 06:34 UTC · model grok-4.3

keywords Gradient Smoothing · Depth-wise Gradient Augmentation · Window Smoothing · Optimization · Transformers · Deep Learning · Preconditioning · Layer-wise Updates

The pith

Transforming optimizer updates along network depth via smoothing improves training and generalization in repeated-block architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep neural networks with repeated blocks develop structured layer relationships during training. The paper shows these can be exploited by a depth-wise transformation of block-wise updates from any base optimizer. Gradient Smoothing, using a simple local window operator, is the concrete method studied. It requires no architecture or objective changes and adds little cost. Experiments across language pretraining, RL post-training, diffusion, and vision transformers show consistent gains in optimization and generalization, plus more structured representation evolution across depth.

Core claim

Depth-wise Gradient Augmentation obtains each layer's update by transforming the full collection of block-wise optimizer updates along the depth dimension. Gradient Smoothing instantiates this with a local Window Smoothing operator. The resulting updates act as structured depth-wise preconditioning, yielding better optimization trajectories and final performance than the base optimizer alone while preserving compatibility with existing pipelines.
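A minimal sketch of what a local window operator over block-wise updates could look like. The exact operator is not given in the text available here, so the three-point form below, the function name `window_smooth`, and the boundary handling are assumptions made for illustration; only the smoothing strength α (with α = 0 as the baseline) comes from the figure captions.

```python
from typing import List

Vector = List[float]


def window_smooth(updates: List[Vector], alpha: float) -> List[Vector]:
    """Blend each block's update with the mean of its immediate depth neighbours.

    Assumed form (illustrative, not taken from the paper):
        smoothed[i] = (1 - alpha) * updates[i] + alpha * mean(neighbours of i)
    Boundary blocks have a single neighbour, so every output stays a convex
    combination of the inputs. alpha = 0 returns the updates unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    depth = len(updates)
    smoothed = []
    for i, own in enumerate(updates):
        neighbours = [updates[j] for j in (i - 1, i + 1) if 0 <= j < depth]
        if not neighbours:  # single-block network: nothing to couple
            smoothed.append(list(own))
            continue
        mean = [sum(vals) / len(neighbours) for vals in zip(*neighbours)]
        smoothed.append([(1 - alpha) * u + alpha * m for u, m in zip(own, mean)])
    return smoothed


# Toy example: 4 blocks with 2 parameters each; the third block is an outlier.
updates = [[1.0, 0.0], [1.0, 0.0], [5.0, 4.0], [1.0, 0.0]]
smoothed = window_smooth(updates, alpha=0.2)
```

In this toy case the outlier update is pulled toward its neighbours and the neighbours are pulled toward it, which is the coupling across depth that the core claim describes.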

What carries the argument

The Window Smoothing operator inside the Depth-wise Gradient Augmentation framework, which couples block-wise updates by local smoothing along the depth dimension.

If this is right

The method works on top of arbitrary base optimizers such as SGD, Adam, or Muon.
Performance gains appear in language-model pretraining, RL post-training for reasoning, diffusion modeling, and ViT image classification.
Representation evolution across depth becomes more structured.
No changes to model architecture or training objective are required.
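The first and last points above suggest a simple composition: run the base optimizer per block, transform the resulting updates along depth, then apply them. The sketch below illustrates that ordering with plain SGD as a stand-in base optimizer. All names (`sgd_updates`, `neighbour_blend`, `training_step`) and the neighbour-averaging transform are hypothetical, chosen only to make the example runnable.

```python
from typing import Callable, List

Vector = List[float]
Blocks = List[Vector]  # one flat parameter (or update) vector per block


def sgd_updates(grads: Blocks, lr: float) -> Blocks:
    """Stand-in base optimizer (plain SGD): one update vector per block."""
    return [[-lr * g for g in block] for block in grads]


def neighbour_blend(updates: Blocks, alpha: float) -> Blocks:
    """Assumed depth transform: mix each update with the mean of its neighbours."""
    out = []
    for i, own in enumerate(updates):
        near = [updates[j] for j in (i - 1, i + 1) if 0 <= j < len(updates)]
        mean = [sum(v) / len(near) for v in zip(*near)] if near else own
        out.append([(1 - alpha) * u + alpha * m for u, m in zip(own, mean)])
    return out


def training_step(params: Blocks, grads: Blocks,
                  base: Callable[[Blocks], Blocks],
                  transform: Callable[[Blocks], Blocks]) -> Blocks:
    """Base optimizer first, depth-wise transform second, then apply."""
    updates = transform(base(grads))
    return [[p + u for p, u in zip(pb, ub)] for pb, ub in zip(params, updates)]


# Three blocks with one parameter each; the model and loss are never touched.
params = [[0.0], [0.0], [0.0]]
grads = [[1.0], [1.0], [4.0]]
new_params = training_step(
    params, grads,
    base=lambda g: sgd_updates(g, lr=0.1),
    transform=lambda u: neighbour_blend(u, alpha=0.5),
)
```

Because the transform only sees the list of per-block updates, swapping `sgd_updates` for another base optimizer would leave `training_step` unchanged in this sketch.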

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same depth-wise coupling idea could be tested on architectures that develop repeated-block structure only late in training.
Fixed window smoothing might be replaced by a learned operator that adapts to observed layer correlations.
Depth could be treated as an explicit dimension for preconditioning in the same way width or batch size already are.
The approach may combine naturally with existing techniques such as gradient clipping or adaptive learning-rate schedules.

Load-bearing premise

Deep neural networks with repeated architectural blocks exhibit structured relationships across layers that emerge during training and can be usefully exploited by transforming the collection of block-wise optimizer updates along the depth dimension.

What would settle it

Running a standard transformer language-model pretraining experiment with Gradient Smoothing applied on top of Adam or SGD and finding no improvement or a degradation in training loss or downstream metrics relative to the unsmoothed baseline.

Figures

Figures reproduced from arXiv: 2606.30813 by Anton Sugolov, Haoming Meng, Vardan Papyan.

**Figure 1.** Gradient Smoothing. Representation of the gradient augmentation scheme applied across depth to the updates in a deep network after backpropagation. For a deep network with L identical architectural blocks (but with different parameters), the gradient updates in each parameter block θ_l are reweighted across depth to stabilize information propagation.

**Figure 2.** Nanochat pretraining with Gradient Smoothing. Validation loss, BPB, and DCLM CORE metric during nanochat pretraining of GPT2 with the default Adam + NorMuon optimizer setup (…

**Figure 3.** Test accuracy improvement with smoothing. Test accuracy of ViT-B trained on CIFAR-100 with DeiT training recipe with data augmentations for 1700 epochs. We compare baseline training (α = 0, red) against window smoothing with α = 0.1 (blue) and α = 0.2 (purple). Solid lines show running average trendlines (25 epochs) for clarity. Both smoothing configurations outperform baseline, with α = 0.2 achieving a f…

**Figure 4.** Microbatch gradient variance with window smoothing. Total microbatch variance V_mb (Section 4.4) during nanochat depth 24 pretraining, comparing the baseline (red) against window smoothing with α = 0.05 (green) and α = 0.1 (blue). After the initial warmup, both smoothing runs maintain consistently lower microbatch gradient variance than baseline, with the gap widening later in training.

**Figure 5.** Layer contributions similarity with increased smoothing. Average cosine similarity of layer differences d_ℓ = x_{ℓ+1} − x_ℓ for ViT-B CLS token trained on CIFAR-100 (1700 epochs). Means are taken across all layer pairs (d_i, d_j) for 1 ≤ i, j ≤ L−1, i ≠ j. Each point shows the median across 100 images from the validation set (rhombus) and training set (circle); error bars indicate interquartile range (Q1–Q3). …

**Figure 6.** Gradient smoothing increases layer contribution alignment. CLS token trajectories of ViT-B trained on CIFAR-100 for 1700 epochs, comparing baseline (α = 0, red) against heavy smoothing (α = 0.4, navy). For each layer difference d_ℓ = x_{ℓ+1} − x_ℓ, we compute the mean cosine similarity with all other layer differences d_j (j ≠ ℓ). Across both validation and training sets, the smoothed model exhibits consistentl…

**Figure 7.** Greater linearity with increased smoothing. Line Shape Score (LSS) of CLS token trajectories in ViT-B trained on CIFAR-100 (1700 epochs). Each point shows the median across 100 validation (rhombus) and training (circle) images, while error bars indicate interquartile range (Q1–Q3). As the window smoothing parameter α ∈ {0.1, . . . , 0.4} increases from the baseline (α = 0), median LSS decreases monotonical…

**Figure 8.** Lower depth gradient variance with smoothing. Total depth variance V_total^depth during nanochat d24 pretraining (7000 optimization steps, logged every 50 steps), comparing the baseline (blue) against window smoothing with α = 0.05 (red) and α = 0.1 (green). Both smoothing runs show consistently lower depth-wise gradient dispersion than baseline once training leaves the warmup phase, with α = 0.05 and α = …
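The alignment statistic described in the captions of Figures 5 and 6 can be sketched directly from its definition: take consecutive layer differences d_ℓ = x_{ℓ+1} − x_ℓ and average the cosine similarity over all pairs with i ≠ j. The function names and the toy trajectories below are illustrative assumptions; the paper's own evaluation code is not available here.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_pairwise_alignment(states: List[Vector]) -> float:
    """Average cosine similarity over all ordered pairs (d_i, d_j), i != j,
    where d_l = x_{l+1} - x_l. Needs at least three states (two differences)."""
    diffs = [[b - a for a, b in zip(x, x_next)]
             for x, x_next in zip(states, states[1:])]
    sims = [cosine(di, dj)
            for i, di in enumerate(diffs)
            for j, dj in enumerate(diffs) if i != j]
    return sum(sims) / len(sims)


# A representation that moves in a straight line has perfectly aligned steps.
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
# A representation that turns 90 degrees at every layer is far less aligned.
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]

straight_score = mean_pairwise_alignment(straight)
zigzag_score = mean_pairwise_alignment(zigzag)
```

Higher values mean the per-layer contributions point in more similar directions, which is the sense in which the captions speak of increased layer contribution alignment.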

Original abstract

Deep neural networks with repeated architectural blocks, such as transformers, often exhibit structured relationships across layers that emerge during training. Motivated by this observation, we introduce Depth-wise Gradient Augmentation, a general optimization paradigm in which the update applied to each layer is obtained by transforming the collection of block-wise optimizer updates along the depth dimension. Within this framework, we study Gradient Smoothing, a family of depth-wise smoothing methods, and instantiate it with a simple local Window Smoothing operator. The resulting method operates directly on block-wise updates produced by arbitrary base optimizers (e.g., SGD, Adam, Muon), incurs minimal computational overhead, and is compatible with existing optimization pipelines. We evaluate Gradient Smoothing across a diverse set of architectures and training regimes, including language model pretraining, RL post-training of LLMs for reasoning, diffusion modeling, and image classification with Vision Transformers. Across these settings, Gradient Smoothing consistently improves optimization and generalization performance without modifying model architectures or training objectives. We further show that it promotes more structured representation evolution across depth, consistent with its interpretation as a structured depth-wise preconditioning method. Together, these results establish Depth-wise Gradient Augmentation as a promising framework for exploiting cross-depth structure in optimization and demonstrate Gradient Smoothing as a simple and broadly applicable instantiation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Gradient Smoothing applies a window average to layer-wise optimizer updates and claims consistent gains across LM pretraining, RL, diffusion, and ViTs, but the size of those gains and the distance from prior layer-coupling ideas remain unclear without the numbers.

The main takeaway is that this paper takes updates from any base optimizer and smooths them locally across depth with a simple window operator. They report that the change improves both optimization and generalization in language model pretraining, RL post-training for reasoning, diffusion modeling, and Vision Transformer classification, all without touching the architecture or loss.

What the work actually adds is the framing of Depth-wise Gradient Augmentation as a general paradigm and the concrete Gradient Smoothing family with the Window Smoothing instantiation. The method is deliberately modular, runs on top of SGD, Adam, or Muon, and adds little compute. The experiments span four distinct regimes, which is a reasonable test of generality, and the claim that it produces more structured representation evolution across layers follows directly from the depth-wise coupling idea.

The soft spots are straightforward. The abstract supplies no quantitative deltas, no baseline tables, and no statistical details, so it is impossible to judge whether the gains are large enough to matter in practice or whether they survive standard ablations. Without those numbers it is also hard to tell how much the method overlaps with earlier structured preconditioners or layer-wise coupling schemes. The full paper will need to show the raw results and the citation positioning before the contribution can be sized.

This paper is for people who train large repeated-block models and want a low-friction add-on to existing optimizers. A reader who needs a plug-in improvement with minimal code change could get value if the experiments check out.

I would send it to peer review. The core construction is clean and the evaluation domains are relevant; referees can sort out the magnitude and the prior-art questions once the numbers are on the table.

Referee Report

2 major / 1 minor

Summary. The paper introduces Depth-wise Gradient Augmentation as a general paradigm for obtaining layer updates by transforming collections of block-wise optimizer outputs along the depth dimension. It focuses on the Gradient Smoothing family, instantiated via a local Window Smoothing operator, and claims this yields consistent gains in optimization and generalization on language-model pretraining, RL post-training of LLMs, diffusion modeling, and ViT image classification. The method is presented as modular, compatible with arbitrary base optimizers (SGD, Adam, Muon), incurring minimal overhead, and additionally promoting more structured representation evolution across depth.

Significance. If the empirical claims are substantiated with rigorous, reproducible experiments, the contribution would be significant: a simple, architecture-agnostic operator that exploits emergent cross-layer structure in repeated-block networks and can be dropped into existing pipelines. The framing as structured depth-wise preconditioning and the breadth of evaluated domains (pretraining, RL, diffusion, classification) would position it as a broadly applicable optimization technique.

major comments (2)

[Abstract] The central claim that 'Gradient Smoothing consistently improves optimization and generalization performance' is asserted without any reported metrics, baselines, effect sizes, number of runs, or statistical tests. This absence is load-bearing for an empirical paper whose primary contribution is performance improvement.
[Abstract / method description] The description of the Window Smoothing operator and its interaction with base-optimizer outputs (e.g., how the depth-wise transformation is exactly defined and whether it preserves unbiasedness or introduces new hyperparameters) is not accompanied by any equation or pseudocode in the provided text. This prevents verification of the stated minimal overhead and leaves open how many hyperparameters the method adds.

minor comments (1)

[Abstract] The abstract mentions compatibility with Muon but provides no citation or brief description of this optimizer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Referee: [Abstract] The central claim that 'Gradient Smoothing consistently improves optimization and generalization performance' is asserted without any reported metrics, baselines, effect sizes, number of runs, or statistical tests. This absence is load-bearing for an empirical paper whose primary contribution is performance improvement.

Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. In the revised version, we will incorporate key results such as average relative improvements across tasks (with specific effect sizes), the number of independent runs, and mention of statistical significance where applicable, while preserving brevity. revision: yes
Referee: [Abstract / method description] The description of the Window Smoothing operator and its interaction with base-optimizer outputs (e.g., how the depth-wise transformation is exactly defined and whether it preserves unbiasedness or introduces new hyperparameters) is not accompanied by any equation or pseudocode in the provided text. This prevents verification of the stated minimal overhead and leaves open how many hyperparameters the method adds.

Authors: The Window Smoothing operator is formally defined in Section 3 with the exact depth-wise transformation (a local convex combination of block-wise updates) and pseudocode in Algorithm 1. It preserves unbiasedness as a linear operator with weights summing to one. Its one new hyperparameter is the smoothing strength α (the figure captions report values from 0.05 to 0.4, with α = 0 as the baseline); the window size is fixed at a default value with no tuning required. Overhead is strictly O(L) for L layers. We will revise the abstract to include a brief reference to these properties or a short equation for improved self-containment. revision: partial
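For concreteness, a local convex combination of this kind can be written as a banded matrix acting along depth. The three-point bandwidth and the symbols W, u, and ũ below are illustrative assumptions and are not taken from the paper's Algorithm 1.

```latex
\tilde{u}_\ell \;=\; \sum_{j=1}^{L} W_{\ell j}\, u_j,
\qquad W_{\ell j} \ge 0,
\qquad \sum_{j=1}^{L} W_{\ell j} = 1,
\qquad W_{\ell j} = 0 \ \text{ for } |\ell - j| > 1 .
```

Under this assumed form each row of W has at most three nonzero entries, so applying it touches every block update a constant number of times, which is where an O(L) cost in block operations would come from. Rows summing to one mean that an update shared by all blocks passes through unchanged; more generally, the expected smoothed update is the same weighted average of the expected block updates, E[ũ_ℓ] = Σ_j W_ℓj E[u_j].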

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces Depth-wise Gradient Augmentation and Gradient Smoothing as a modular operator applied to block-wise updates from arbitrary base optimizers. The abstract and description contain no equations, fitted parameters, or derivation steps that reduce to self-definition or input data by construction. Claims rest on empirical evaluation across domains rather than any mathematical reduction or self-citation chain. No load-bearing premises invoke prior author work as a uniqueness theorem or ansatz. The method is presented as compatible with existing pipelines without architecture changes, making the central contribution self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the single domain assumption below is the explicit motivation stated in the text. No free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: Deep neural networks with repeated architectural blocks exhibit structured relationships across layers that emerge during training.
This observation is presented as the direct motivation for introducing Depth-wise Gradient Augmentation.

pith-pipeline@v0.9.1-grok · 5769 in / 1152 out tokens · 29866 ms · 2026-07-01T06:34:28.853342+00:00 · methodology


[1] [1]

Bao, F., Nie, S., Xue, K., Cao, Y ., Li, C., Su, H., and Zhu, J

URL https://arxiv.org/ abs/2407.07810. Bao, F., Nie, S., Xue, K., Cao, Y ., Li, C., Su, H., and Zhu, J. All are worth words: A vit backbone for diffusion mod- els,

work page arXiv

[2] [2]

URL https://arxiv.org/abs/2209. 12152. Ben-Shaul, I. and Dekel, S. Nearest class-center simplifica- tion through intermediate layers. In Cloninger, A., Doster, T., Emerson, T., Kaul, M., Ktena, I., Kvinge, H., Mi- olane, N., Rieck, B., Tymochko, S., and Wolf, G. (eds.), Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, volume 1...

2022

[3] [3]

Reinforcement learning for reasoning in small llms: What works and what doesn’t.arXiv preprint arXiv:2503.16219, 2025

URLhttps://arxiv.org/abs/2503.16219. Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., and Hennig, P. Kronecker-factored approximate curvature for modern neural network architectures,

work page arXiv

[4] [4]

Fisher, Q., Meng, H., and Papyan, V

URL https://arxiv.org/abs/2311.00636. Fisher, Q., Meng, H., and Papyan, V . Pushing bound- aries: Mixup’s influence on neural collapse,

work page arXiv

[5] [5]

URL https://arxiv.org/abs/2402.06171. Gai, K. and Zhang, S. A mathematical principle of deep learning: Learn the geodesic curve in the wasser- stein space,

work page arXiv

[6] [6]

Deep Residual Networks Learn the Geodesic Curve in the Wasserstein Space

URL https://arxiv.org/abs/ 2102.09235. Garrod, C. and Keating, J. P. Unifying low dimensional observations in deep learning through the deep linear unconstrained feature model,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., and Roberts, D

URL https: //arxiv.org/abs/1806.03884. Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., and Roberts, D. A. The unreasonable ineffectiveness of the deeper layers,

work page arXiv

[8] [8]

& Roberts, D

URL https://arxiv.org/ abs/2403.17887. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y ., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao...

work page arXiv

[9] [9]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z. Gupta, V ., Koren, T., and Singer, Y . Shampoo: Precon- ditioned stochastic tensor optimization,

work page doi:10.1038/s41586-025-09422-z

[10] [10]

Shampoo: Preconditioned Stochastic Tensor Optimization

URL https://arxiv.org/abs/1802.09568. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

URL https:// arxiv.org/abs/1512.03385. Hoyt, C. R. and Owen, A. B. Probing neural networks with t-sne, class-specific projections and a guided tour,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

On layer-wise repre- sentation similarity: Application for multi-exit models with a single classifier

Jiang, J., Zhou, J., and Zhu, Z. On layer-wise repre- sentation similarity: Application for multi-exit models with a single classifier. InNeurIPS 2024 Workshop 11 Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization on Symmetry and Geometry in Neural Representations, 2025a. URL https://openreview.net/forum? id=YanMgtZhfY. Jiang, J., Z...

2024

[13] [13]

Karpathy, A

URLhttps://arxiv.org/abs/2512.08819. Karpathy, A. nanochat: The best chatgpt that $100 can buy,

work page arXiv

[14] [14]

Adam: A Method for Stochastic Optimization

URL https://arxiv.org/abs/ 1412.6980. Krause, F., Phan, T., Gui, M., Baumann, S. A., Hu, V . T., and Ommer, B. Tread: Token routing for efficient architecture-agnostic diffusion training,

[15]

URL https://arxiv.org/abs/2501.04765. Lad, V., Lee, J. H., Gurnee, W., and Tegmark, M. The remarkable robustness of llms: Stages of inference?,

[16]

URL https://arxiv.org/abs/2406.19384. Li, J. and Papyan, V. Residual alignment: Uncovering the mechanisms of residual networks,

[17]

URL https://arxiv.org/abs/2401.09018. Li, Z., Liu, L., Liang, C., Chen, W., and Zhao, T. Normuon: Making muon more efficient and scalable,

[18]

URL https://arxiv.org/abs/2510.05491. Lin, W., Dangel, F., Eschenhagen, R., Neklyudov, K., Kristiadi, A., Turner, R. E., and Makhzani, A. Structured inverse-free natural gradient: Memory-efficient & numerically-stable kfac,

[19]

URL https://arxiv.org/abs/2312.05705. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization,

[20]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Martens, J. and Grosse, R. Optimizing neural networks with kronecker-factored approximate curvature,

[21]

URL https://arxiv.org/abs/1503.05671. Men, X., Xu, M., Zhang, Q., Wang, B., Lin, H., Lu, Y., Han, X., and Chen, W. Shortgpt: Layers in large language models are more redundant than you expect,

[22]

Xiang Meng, Kayhan Behdin, Haoyue Wang, and Rahul Mazumder

URL https://arxiv.org/abs/2403.03853. Nagwekar, A. Towards guided descent: Optimization algorithms for training neural networks at scale,

[23]

URL https://arxiv.org/abs/2512.18373. Papyan, V. Traces of class/cross-class structure pervade deep learning spectra,

[24]

doi: 10.1073/pnas.2015509117. URL https://www.pnas.org/doi/abs/10.1073/pnas.2015509117. Parker, L., Onal, E., Stengel, A., and Intrater, J. Neural collapse in the intermediate hidden layers of classification neural networks,

[25]

URL https://arxiv.org/abs/2502.01954. Sarfati, R., Liu, T. J. B., Boullé, N., and Earls, C. J. Lines of thought in large language models,

[26]

URL https://arxiv.org/abs/2410.01545. Shai, A. S., Marzen, S. E., Teixeira, L., Oldenziel, A. G., and Riechers, P. M. Transformers represent belief state geometry in their residual stream,

[27]

URL https://arxiv.org/abs/2405.15943. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

[28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://arxiv.org/abs/2402.03300. Skean, O., Arefin, M. R., LeCun, Y., and Shwartz-Ziv, R. Does representation matter? exploring intermediate layers in large language models,

[29]

URL https://arxiv.org/abs/2412.09563. Skean, O., Arefin, M. R., Zhao, D., Patel, N., Naghiyev, J., LeCun, Y., and Shwartz-Ziv, R. Layer by layer: Uncovering hidden representations in language models,

[30]

Layer by Layer: Uncovering Hidden Representations in Language Models

URL https://arxiv.org/abs/2502.02013. Song, Z.-Y., Li, Z., Cao, Q.-H., xing Luo, M., and Zhu, H. X. Bridging the dimensional chasm: Uncover layer-wise dimensional reduction in transformers through token correlation,

[31]

URL https://arxiv.org/abs/2503.22547. Súkeník, P., Mondelli, M., and Lampert, C. Deep neural collapse is provably optimal for the deep unconstrained features model,

[32]

ICML, 2021

URL https://arxiv.org/abs/2012.12877. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need,

[33]

Attention Is All You Need

URL https://arxiv.org/abs/1706.03762. Vyas, N., Morwani, D., Zhao, R., Kwun, M., Shapira, I., Brandfonbrener, D., Janson, L., and Kakade, S. Soap: Improving and stabilizing shampoo using adam,

[34]

SOAP: Improving and Stabilizing Shampoo using Adam

URL https://arxiv.org/abs/2409.11321. Wang, J., Fan, L., Zhang, D., Jing, W., Di, D., Song, Y., Liu, S., and Cong, C. Visual prompt-agnostic evolution,

[35]

URL https://arxiv.org/abs/2601.20232. Wang, P., Li, X., Yaras, C., Zhu, Z., Balzano, L., Hu, W., and Qu, Q. Understanding deep representation learning via layerwise feature compression and discrimination, 2024a. Wang, S., Gai, K., and Zhang, S. Progressive feedforward collapse of resnet training, 2024b. Wolfram, C. and Schein, A. Layers at similar depths g...

[36]

URL https://arxiv.org/abs/2504.08775. Wu, R. and Papyan, V. Linguistic collapse: Neural collapse in (large) language models,

[37]

URL https://arxiv.org/abs/2405.17767. Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Yang, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., P...

[38]

Qwen2 Technical Report

URL https://arxiv.org/abs/2407.10671. Zangrando, E., Deidda, P., Brugiapaglia, S., Guglielmi, N., and Tudisco, F. Provable emergence of deep neural collapse and low-rank bias in l2-regularized nonlinear networks,

[39]

URL https://arxiv.org/abs/2402.03991. Zarka, J., Guth, F., and Mallat, S. Separation and concentration in deep networks,

[40]

URL https://arxiv.org/abs/2012.10424. Zhou, N., Chen, J., and Huang, D. Sharing task-relevant information in visual prompt tuning by cross-layer dynamic connection. IEEE Transactions on Image Processing, 34:4527–4540,

[41]

ISSN 1941-0042. doi: 10.1109/tip.2025.3587587. URL http://dx.doi.org/10.1109/tip.2025.3587587.

A. Appendix

A.1. Specialization of Smoothing to Adam, AdamW, and Muon

In our experiments, the base optimizer U (t) is typically Adam (Kingma & Ba, 2017), AdamW (Loshchilov & Hutter, 20...
