arxiv: 2605.10161 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

OUIDecay: Adaptive Layer-wise Weight Decay for CNNs Using Online Activation Patterns

Alberto Fern\'andez-Hern\'andez , Jose I. Mestre , Cristian P\'erez-Corral , Manuel F. Dolz , Jose Duato , Enrique S. Quintana-Ort\'i

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords weight decayadaptive regularizationconvolutional neural networksactivation patternsoverfitting underfitting indicatorlayer-wise schedulingonline adaptation

0 comments

The pith

OUIDecay adapts weight decay per layer and over time using an online activation-based indicator to improve CNN regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard weight decay applies one fixed coefficient to all layers for the entire training run. This paper introduces OUIDecay to rescale each layer's coefficient individually and periodically according to its own Overfitting-Underfitting Indicator computed from activation patterns. The indicator runs on training batches only and requires no validation data or extra gradient tracking. If the adaptation works as described, networks could reach lower validation loss with less manual tuning of regularization strength. Experiments on four CNN architectures and four datasets show the method records the lowest mean best-validation-loss in seven of eight settings.

Core claim

OUIDecay is an adaptive scheduler that monitors each convolutional layer's structural behavior through a lightweight batch formulation of the Overfitting-Underfitting Indicator and periodically rescales its weight decay coefficient relative to the rest of the network. This activation-driven process produces the best mean best-validation-loss in seven of the eight evaluated model-dataset combinations while remaining suitable for online training.

What carries the argument

The Overfitting-Underfitting Indicator (OUI), a metric extracted from each layer's activation patterns that drives periodic, relative rescaling of per-layer weight decay coefficients.

Load-bearing premise

Early activation patterns in each layer supply a reliable signal of whether that layer currently needs stronger or weaker regularization, and rescaling decay on that signal will not introduce instability.

What would settle it

Re-running the four reported experiments and observing that OUIDecay fails to achieve the lowest mean best-validation-loss in at least five of the eight settings would falsify the performance claim.

read the original abstract

Weight decay remains one of the most widely used regularization mechanisms for training convolutional neural networks, yet it is still commonly applied as a fixed coefficient shared by all layers throughout training. This uniform treatment ignores that different layers may follow different structural dynamics and therefore may require different regularization strengths. In this work, we propose OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs driven by the Overfitting-Underfitting Indicator (OUI), an activation-based metric previously shown to provide early information about regularization quality. OUIDecay uses a lightweight batch-based formulation of OUI to monitor the structural behavior of each layer online and periodically rescales its weight decay relative to the other layers in the network. Unlike gradient-based adaptive decay methods, our approach relies on functional information extracted from activation patterns and does not require validation data. Experiments on EfficientNet-B0 with Stanford Cars, ResNet50 with Food101, DenseNet121 with CIFAR100, and MobileNetV2 with CIFAR10 show that OUIDecay achieves the best mean best-validation-loss in 7 out of 8 evaluated settings. These results indicate that activation-driven weight decay adaptation is a practical and effective alternative to fixed decay and gradient-based adaptive decay, while keeping the method lightweight and suitable for online use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OUIDecay adapts weight decay via the OUI activation metric but the 7/8 superiority claim lacks any supporting numbers or variability data.

read the letter

The headline takeaway is that OUIDecay adapts weight decay per layer using online activation patterns from the OUI metric, and the authors claim it beats fixed and other adaptive methods in most of their CNN experiments. What stands out as new is the specific way they turn the OUI score into a periodic rescaling of the decay coefficient for each layer. It stays lightweight by working on batches and avoids needing a separate validation set, unlike some gradient-based alternatives. That functional, activation-driven angle is a reasonable distinction from prior adaptive decay work. The paper does a decent job framing the problem: fixed weight decay treats all layers the same even though their dynamics differ. Testing across EfficientNet, ResNet, DenseNet, and MobileNet on four datasets gives the claim some coverage. If the full text has the exact rescaling formula and how they compute OUI on the fly, that would be the useful part for someone trying to implement it. The soft spots are mostly around the evidence. The abstract says it gets the best mean best-validation-loss in seven of eight settings, but there are no actual loss values, no standard deviations, no count of independent runs, and no significance tests. The stress-test note is on point here; if the gaps are small relative to run-to-run variation, the ranking could shift easily. The method also inherits the OUI formulation from earlier work without showing the full derivation or sensitivity analysis in this paper, so a reader has to go back to the prior reference to understand the foundation. Minor implementation details like the exact rescaling schedule and relative factors are listed as free parameters, which is honest but leaves room for tuning questions. This paper is aimed at people who train CNNs for image tasks and want a more automatic way to set layer-wise regularization. It could interest practitioners looking to cut down on hyperparameter search, provided the gains survive closer scrutiny. I would bring it to a reading group for the discussion on activation metrics in regularization. I would not cite it yet because the results lack the supporting statistics. It deserves peer review because the core idea is practical and the experiments span multiple models, even if heavy revision on the empirical section is likely needed.

Referee Report

2 major / 2 minor

Summary. The paper proposes OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs. It monitors each layer online via a lightweight batch-based Overfitting-Underfitting Indicator (OUI) derived from activation patterns and periodically rescales per-layer weight decay coefficients relative to the rest of the network. The central empirical claim is that this yields the best mean best-validation-loss in 7 of 8 settings across EfficientNet-B0 on Stanford Cars, ResNet50 on Food101, DenseNet121 on CIFAR100, and MobileNetV2 on CIFAR10, outperforming fixed decay and gradient-based adaptive baselines while remaining lightweight and validation-free.

Significance. If the superiority claim is substantiated with variability statistics and reproducible implementation details, the work would supply a practical activation-driven alternative to uniform or gradient-based weight decay that avoids validation data and extra gradient computations. The online, per-layer adaptation based on functional activation patterns is a distinct direction from existing schedulers and could influence regularization practice in CNN training if shown to be robust.

major comments (2)

[Abstract] Abstract: The assertion that OUIDecay obtains the best mean best-validation-loss in 7 out of 8 settings supplies no numerical values, standard deviations, number of independent seeds, or hypothesis tests. Because the entire contribution is empirical, this omission makes the headline ranking unverifiable and leaves open the possibility that observed gaps are artifacts of under-sampling rather than a reliable effect of the OUI-driven rescaling.
[Method] Method section (OUI and rescaling rule): The scheduler is defined in terms of the previously introduced OUI metric, yet the formulation is not re-derived or shown to reduce to an internal fitted quantity; the rescaling rule itself is described only at a high level. This creates a circularity that prevents independent verification of how activation patterns translate into per-layer decay adjustments.

minor comments (2)

[Experiments] The eight evaluated settings (four model–dataset pairs) should be enumerated explicitly with the precise metric and comparison baselines used in each.
[Method] Implementation details such as the exact batch-based OUI computation, rescaling frequency, and relative scaling factors are referenced but not provided as equations or pseudocode, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of clarity and verifiability in our empirical claims and methodological presentation. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] The assertion that OUIDecay obtains the best mean best-validation-loss in 7 out of 8 settings supplies no numerical values, standard deviations, number of independent seeds, or hypothesis tests. Because the entire contribution is empirical, this omission makes the headline ranking unverifiable and leaves open the possibility that observed gaps are artifacts of under-sampling rather than a reliable effect of the OUI-driven rescaling.

Authors: We agree that the abstract would be strengthened by including quantitative support for the ranking claim. In the revised version we will expand the abstract to report the specific mean best-validation-loss values achieved by OUIDecay and the competing methods, together with the standard deviations observed across the five independent random seeds used for each of the eight settings. The full per-setting tables already appear in Section 4; adding the summary statistics to the abstract will make the empirical superiority directly verifiable without requiring the reader to consult the body of the paper. revision: yes
Referee: [Method] The scheduler is defined in terms of the previously introduced OUI metric, yet the formulation is not re-derived or shown to reduce to an internal fitted quantity; the rescaling rule itself is described only at a high level. This creates a circularity that prevents independent verification of how activation patterns translate into per-layer decay adjustments.

Authors: We acknowledge that a self-contained presentation of the rescaling rule is necessary for independent verification. Although the OUI itself was defined in our prior work, the current manuscript will be revised to include a concise re-derivation of the batch-based OUI from layer activation statistics, followed by the explicit mathematical form of the periodic rescaling step that maps the per-layer OUI values to relative weight-decay coefficients. This addition will remove any circularity and allow readers to trace the mapping from activation patterns to decay adjustments without external references. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical validation stands independently

full rationale

The paper introduces OUIDecay as a new scheduler that periodically rescales per-layer weight decay using the OUI metric (cited as previously shown in prior work). The abstract and provided text contain no equations, derivations, or self-referential definitions that reduce the proposed method or its claims to inputs by construction. The central results are experimental comparisons across four model-dataset pairs, reporting mean best-validation-loss rankings. These outcomes are falsifiable via independent runs and do not rely on any fitted parameter being renamed as a prediction, nor on a self-citation chain that forbids alternatives. The OUI citation supplies an external building block rather than a load-bearing uniqueness theorem internal to this manuscript. No self-definitional, fitted-input, or ansatz-smuggling patterns are exhibited in the given text.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on the prior OUI metric as a domain assumption and on unspecified parameters that control the periodic rescaling.

free parameters (1)

Rescaling schedule and relative factors
The method periodically rescales weight decay per layer, but no specific values, update frequency, or selection procedure are stated.

axioms (1)

domain assumption The Overfitting-Underfitting Indicator supplies early information about regularization quality
The abstract states that OUI was previously shown to provide such information and uses it as the driver for adaptation.

pith-pipeline@v0.9.0 · 5564 in / 1536 out tokens · 84823 ms · 2026-05-12T03:46:02.495979+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

In: European Conference on Computer Vision (ECCV)

Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – Mining Discriminative Components with Random Forests. In: European Conference on Computer Vision (ECCV). pp. 446–461 (2014)

work page 2014
[2]

D’Angelo, F., Andriushchenko, M., Varre, A., Flammarion, N.: Why Do We Need Weight Decay in Modern Deep Learning? (2024), _eprint: 2310.04415

work page arXiv 2024
[3]

and Dolz, Manuel F

Fernández-Hernández, A., Mestre, J.I., Dolz, M.F., Duato, J., Quintana-Ortí, E.S.: OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection. In: 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS). pp. 96–105 (Jul 2025).https://doi.org/10.1109/AMLDS63918.2025. 11159348,https://ieeexplore.ieee.org/docu...

work page doi:10.1109/amlds63918.2025 2025
[4]

Fernández-Hernández, A., Pérez-Corral, C., Mestre, J.I., Dolz, M.F., Duato, J., Quintana-Ortí, E.S.: When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic (Mar 2026),https://arxiv.org/abs/2603.09950v1

work page arXiv 2026
[5]

He, D., Tu, S., Jaiswal, A., Shen, L., Yuan, G., Liu, S., Yin, L.: AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs (Oct 2025),https: //openreview.net/forum?id=MKEDsVWHd0

work page 2025
[6]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016).https://doi.org/10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[7]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2261–2269 (2017).https://doi.org/10.1109/CVPR.201 7.243

work page doi:10.1109/cvpr.201 2017
[8]

In: 2013 IEEE International Conference on Computer Vision Workshops

Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D Object Representations for Fine- Grained Categorization. In: 2013 IEEE International Conference on Computer Vision Workshops. pp. 554–561. IEEE, Sydney, Australia (Dec 2013).https: //doi.org/10.1109/ICCVW.2013.77, http://ieeexplore.ieee.org/document/6 755945/

work page doi:10.1109/iccvw.2013.77 2013
[9]

Technical Report, University of Toronto (2009),https://www.cs.toronto.edu/~kriz/lear ning-features-2009-TR.pdf

Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto (2009),https://www.cs.toronto.edu/~kriz/lear ning-features-2009-TR.pdf

work page 2009
[10]

In: Moody, J., Hanson, S., Lippmann, R.P

Krogh, A., Hertz, J.: A Simple Weight Decay Can Improve Generalization. In: Moody, J., Hanson, S., Lippmann, R.P. (eds.) Advances in Neural Information Processing Systems. vol. 4. Morgan-Kaufmann (1991)

work page 1991
[11]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: International Conference on Learning Representations (ICLR) (2019), _eprint: 1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019
[12]

Ismail, Mohammad Mehedi Hassan, and Hessah A

Nakamura, K., Hong, B.W.: Adaptive Weight Decay for Deep Neural Networks. IEEE Access7, 118857–118865 (2019).https://doi.org/10.1109/ACCESS.2019. 2937139

work page doi:10.1109/access.2019 2019
[13]

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted Residuals and Linear Bottlenecks. pp. 4510–4520 (2018),https://open access.thecvf.com/content_cvpr_2018/html/Sandler_MobileNetV2_Inverted_ Residuals_CVPR_2018_paper.html

work page 2018
[14]

In: Chaudhuri, K., Salakhutdinov, R

Tan, M., Le, Q.: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (Jun 2019)

work page 2019