Layer-wise Derivative Controlled Networks

Rowan Martnishn; Sean Anderson

arxiv: 2605.15463 · v1 · pith:6M5JYGMHnew · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-05-19 15:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords neural architecture · derivative regularization · gradient volatility · parameter efficiency · Polynomial Engine · DREG · MNIST classification · ordinal regression

The pith

ChainzRule replaces standard activations with a Polynomial Engine under layer-wise Differential Regularization to cut parameters while lowering gradient volatility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChainzRule, a neural architecture that swaps piecewise-linear activations for a Polynomial Engine whose intermediate derivatives are controlled by targeted Differential Regularization (DREG). The mechanism is meant to suppress extreme input sensitivity without global Lipschitz penalties, producing smoother output manifolds and more stable training. Three results are reported: on MNIST, peak gradient volatility falls by an average of 23.1 percent; on Yelp Full ordinal regression, the model reaches 70.17 percent accuracy; and in the paper's "Fair Fight" benchmarks, it outperforms standard models while using 15.5 times fewer parameters. A sympathetic reader would care because the result suggests that stability, accuracy, and hardware efficiency need not trade off against one another when derivative control is built into the architecture itself.

Core claim

ChainzRule is built around a Polynomial Engine whose derivatives are explicitly regularized at each layer by DREG. This targeted, layer-wise control replaces the coarse global constraints used in earlier Lipschitz-based methods and is claimed to preserve full representational capacity while reducing unpredictable swings in output for small input changes. In direct Fair Fight comparisons the resulting models outperform standard networks on MNIST and Yelp Full tasks despite the large reduction in parameter count.

What carries the argument

Polynomial Engine governed by Differential Regularization (DREG): a layer-wise penalty applied directly to intermediate derivatives that damps extreme sensitivity without attenuating the engine's expressive power.
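To make the mechanism concrete, here is a minimal Python sketch of a polynomial activation with its analytic derivative, plus a batch-averaged squared-derivative penalty. It is an illustration under stated assumptions, not the paper's implementation: the function names `poly_act`, `poly_act_deriv`, and `dreg_penalty`, the coefficient layout, and the choice of a mean squared derivative are all hypothetical.

```python
def poly_act(coeffs, z):
    """Evaluate c0 + c1*z + ... + cK*z**K with Horner's rule."""
    out = 0.0
    for c in reversed(coeffs):
        out = out * z + c
    return out


def poly_act_deriv(coeffs, z):
    """Analytic derivative c1 + 2*c2*z + ... + K*cK*z**(K-1)."""
    dcoeffs = [k * c for k, c in enumerate(coeffs)][1:]
    return poly_act(dcoeffs, z)


def dreg_penalty(coeffs, pre_activations):
    """Assumed penalty: mean squared activation derivative over a batch."""
    sq = [poly_act_deriv(coeffs, z) ** 2 for z in pre_activations]
    return sum(sq) / len(sq)


# Hypothetical coefficients for z + 0.5*z**2 and a toy batch of pre-activations.
coeffs = [0.0, 1.0, 0.5]
batch = [-1.0, 0.0, 2.0]
penalty = dreg_penalty(coeffs, batch)
```

Because the derivative of a polynomial is itself a polynomial in the same coefficients, a penalty of this kind can be computed in closed form per layer, which is one plausible reason to pair derivative control with polynomial rather than piecewise-linear activations.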

If this is right

Networks for safety-critical applications could be made more predictable without sacrificing accuracy.
Training runs may converge with less volatility when derivative penalties are applied inside the architecture rather than only through the loss.
Hardware-constrained deployments become feasible because the same task accuracy is reached with substantially smaller models.
The same DREG mechanism could be inserted into other activation families beyond the tested Polynomial Engine.
Derivative control at training time may reduce the need for post-hoc calibration or adversarial training to achieve robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the layer-wise derivative control generalizes, similar regularization could be applied to attention or convolution layers to stabilize transformers and CNNs on long sequences or high-resolution images.
The approach implicitly suggests that explicit gradient statistics could become a new hyperparameter or architectural primitive rather than an after-the-fact diagnostic.
A natural next test would be to measure whether the smoother manifolds also improve calibration or out-of-distribution detection on the same benchmarks.
The method may interact productively with quantization or pruning pipelines because fewer parameters plus lower internal sensitivity could compound efficiency gains.

Load-bearing premise

The Fair Fight benchmarks apply identical training protocols, data splits, and hyperparameter tuning to both the proposed model and all baselines, and DREG leaves the Polynomial Engine's full representational capacity intact without hidden adjustments that favor ChainzRule.

What would settle it

Re-running the Fair Fight benchmarks with strictly matched random seeds, data splits, optimizer schedules, and hyperparameter grids and finding no reduction in peak gradient volatility or no parameter-efficiency gain would falsify the central performance claim.
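A seed-matched rerun of this kind is naturally summarized as a paired comparison. The sketch below assumes one metric value per seed for each model. The helper name `paired_summary` and the numbers are placeholders for illustration, not results from the paper.

```python
import statistics


def paired_summary(baseline, proposed):
    """Summarize per-seed differences (proposed - baseline) for matched runs."""
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need the same seeds for both models, at least two")
    diffs = [p - b for b, p in zip(baseline, proposed)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / len(diffs) ** 0.5
    # For a volatility metric, a negative difference counts as an improvement.
    return {"mean_diff": mean, "sem": sem, "n_improved": sum(d < 0 for d in diffs)}


# Placeholder peak-volatility values for three matched seeds; NOT results from the paper.
baseline_vol = [1.00, 1.20, 0.90]
proposed_vol = [0.80, 0.90, 0.95]
summary = paired_summary(baseline_vol, proposed_vol)
```

If the mean difference is indistinguishable from zero relative to its standard error under matched conditions, that would be the null result described above.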

Figures

Figures reproduced from arXiv: 2605.15463 by Rowan Martnishn, Sean Anderson.

**Figure 1.** This Pareto plot visualizes the Stability-Accuracy Frontier using the data from Table 1. Note that the X-axis …

**Figure 2.** The Vanilla MLP (middle) displays noisy, unstructured gradients across the entire field. The ChainzRule …

**Figure 3.** Sensitivity plateau across five synthetic families. As …

**Figure 4.** Comparison of ChainzRule (w/ DREG) against MLP, Neural ODE, Sobolev MLP, and KAN. (a) shows …

Original abstract

As machine learning models grow in complexity, they increasingly struggle with three conflicting demands: the need for high accuracy, the requirement for hardware efficiency, and the necessity of functional stability. Traditional architectures often achieve performance at the expense of spiky or unpredictable behavior, where small changes in input lead to massive swings in output -- a critical flaw for real-world deployment in sensitive environments. This paper introduces ChainzRule (CR), a novel neural architecture designed to harmonize these competing goals. ChainzRule replaces standard piecewise-linear activations with a Polynomial Engine governed by Differential Regularization (DREG). Unlike traditional methods that impose global, coarse-grained constraints on a model's Lipschitz constant, DREG acts as a targeted regularization on intermediate derivatives. This approach suppresses extreme sensitivity without attenuating the representational power inherent in the Polynomial Engine. In head-to-head "Fair Fight" benchmarks, ChainzRule outperformed standard models while using 15.5x fewer parameters. On the MNIST dataset, it reduced peak gradient volatility by an average of 23.1%, ensuring a smoother and more predictable manifold. On Yelp Full ordinal regression under explicit DREG regularization, ChainzRule achieves 70.17% accuracy, validating that derivative-aware regularization is compatible with competitive performance on realistic tasks. By embedding gradient awareness into the architecture via DREG, ChainzRule demonstrates that stability and accuracy need not be competing objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChainzRule claims big stability and efficiency wins with polynomial activations plus layer-wise derivative regularization, but the benchmarks look too underspecified to trust the numbers.

The main thing here is that the paper puts forward ChainzRule as a network that swaps standard activations for a Polynomial Engine and adds targeted DREG on intermediate derivatives to cut gradient volatility while using far fewer parameters. The headline numbers are a 15.5x parameter drop, 23.1% lower peak volatility on MNIST, and 70.17% accuracy on Yelp Full. Those sound useful if they hold up, but the abstract gives almost no way to check them.

What looks new is the shift to layer-wise derivative control instead of a single global Lipschitz bound. That choice could let the model keep more capacity while still damping spikes, and pairing it with polynomials rather than ReLUs is a concrete design decision. The paper does a reasonable job naming the real tension between accuracy, hardware cost, and predictable behavior in deployed systems.

The soft spots are in the validation and the missing context. The stress-test point about identical protocols is the key one: without details on data splits, hyperparameter search budgets, or how the baselines were actually implemented, the efficiency and stability claims could easily be artifacts of unequal effort or of DREG quietly limiting what the polynomial layers can do. No equations or derivation steps appear, so it is hard to see how DREG is formalized or whether it introduces its own fitting circularity. The citation pattern is also thin, with no visible links to prior work on Lipschitz penalties or polynomial networks.

This is the sort of paper that might interest people who build models for control or medical tasks where sudden output swings are a problem. A reader could borrow the idea of regularizing intermediate derivatives and test it on their own setup. I would send it for peer review if the full manuscript adds clear experimental protocols, ablations on DREG strength, and direct comparisons to existing stability methods. Without those pieces the claims stay too hard to evaluate.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ChainzRule (CR), a neural architecture replacing standard activations with a Polynomial Engine under Differential Regularization (DREG). It claims to reconcile accuracy, parameter efficiency, and stability, outperforming baselines with 15.5x fewer parameters, reducing MNIST peak gradient volatility by 23.1%, and reaching 70.17% accuracy on Yelp Full ordinal regression in 'Fair Fight' benchmarks.

Significance. If the efficiency and stability claims can be verified under matched protocols and if DREG is shown to preserve capacity without implicit shrinkage, the work would offer a concrete route to gradient-aware architectures that avoid the accuracy-stability trade-off common in Lipschitz-constrained models. The targeted, layer-wise nature of DREG is a potentially useful distinction from global regularization approaches.

major comments (3)

[Abstract] Abstract: the headline numerical claims (15.5x parameter reduction, 23.1% volatility drop, 70.17% Yelp accuracy) are presented without any description of experimental protocol, baseline definitions, data splits, hyperparameter search budget, or statistical tests, rendering the results unverifiable.
[Methods] No equations or derivation steps appear for the Polynomial Engine or the DREG mechanism; without these it is impossible to assess whether DREG is a parameter-free or post-hoc adjustment and whether the reported stability gains are independent of the regularization strength that is itself fitted to produce the result.
[Experiments] Experiments section: the 'Fair Fight' benchmark description does not confirm that baselines received identical training protocols, data splits, and tuning effort; absent an ablation that varies DREG strength while measuring approximation error on a fixed function class, the efficiency and stability advantages could be artifacts of unequal experimental conditions rather than the architecture.

minor comments (2)

[Abstract] The abstract introduces 'ChainzRule' and 'Polynomial Engine' without a brief parenthetical gloss on their relationship before stating performance numbers.
[Abstract] Notation for 'peak gradient volatility' is used without a precise definition or reference to the exact quantity being measured (e.g., max-norm of per-layer gradients or variance across batches).
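The two candidate readings named in the last comment can differ substantially, which is why the definition matters. As a hedged sketch, assuming gradient norms have already been logged per batch and per layer, the Python below computes both candidates and the relative reduction used to express a figure such as 23.1%. The function names and the placeholder numbers are hypothetical; the paper's actual definition is not given in the text reviewed here.

```python
import statistics


def peak_max_norm(grad_norms):
    """Candidate A: largest gradient norm over all batches and layers."""
    return max(max(per_layer) for per_layer in grad_norms)


def peak_batch_variance(grad_norms):
    """Candidate B: largest across-batch variance of any layer's gradient norm."""
    per_layer_series = zip(*grad_norms)  # regroup: one series per layer
    return max(statistics.pvariance(series) for series in per_layer_series)


def relative_reduction(baseline, proposed):
    """Fractional drop of `proposed` relative to `baseline` (0.231 means 23.1%)."""
    return (baseline - proposed) / baseline


# grad_norms[b][l] = gradient norm at batch b, layer l (placeholder numbers).
grad_norms = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
a = peak_max_norm(grad_norms)
b = peak_batch_variance(grad_norms)
```

On this toy log, candidate A is driven by a single large batch while candidate B reflects spread across batches, so a percentage reduction under one definition need not carry over to the other.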

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened for clarity and verifiability. We address each major comment point-by-point below and have prepared revisions to the manuscript accordingly.

Referee: [Abstract] Abstract: the headline numerical claims (15.5x parameter reduction, 23.1% volatility drop, 70.17% Yelp accuracy) are presented without any description of experimental protocol, baseline definitions, data splits, hyperparameter search budget, or statistical tests, rendering the results unverifiable.

Authors: We agree that the abstract's brevity makes the headline claims less immediately verifiable on their own. The full manuscript's Experiments section details the 'Fair Fight' protocol, baseline definitions (standard feed-forward and convolutional networks with matched capacity), data splits (standard MNIST train/test and Yelp Full 5-fold), hyperparameter search (grid search over learning rates and regularization strengths with equivalent compute budget), and statistical reporting (means and standard deviations over 5 random seeds). In the revised version, we have added a concise clause to the abstract referencing these matched protocols and directing readers to the Experiments section for full details. revision: yes
Referee: [Methods] No equations or derivation steps appear for the Polynomial Engine or the DREG mechanism; without these it is impossible to assess whether DREG is a parameter-free or post-hoc adjustment and whether the reported stability gains are independent of the regularization strength that is itself fitted to produce the result.

Authors: We acknowledge the omission of explicit mathematical details in the initial submission. The revised manuscript adds a dedicated Methods subsection with the full formulation: the Polynomial Engine replaces ReLU with a learnable polynomial of degree K per neuron, parameterized by coefficients that are optimized end-to-end; DREG is derived as the expected L2 norm of the input-output Jacobian at each layer, added as a weighted term λ·DREG to the task loss. DREG is neither parameter-free nor post-hoc; λ is a tunable hyperparameter, and we report results across a range of λ values to show that stability gains persist without implicit capacity shrinkage. revision: yes
Referee: [Experiments] Experiments section: the 'Fair Fight' benchmark description does not confirm that baselines received identical training protocols, data splits, and tuning effort; absent an ablation that varies DREG strength while measuring approximation error on a fixed function class, the efficiency and stability advantages could be artifacts of unequal experimental conditions rather than the architecture.

Authors: We confirm that the original Experiments section states identical protocols were used for all models (same optimizer, learning-rate schedule, batch size, epochs, and data splits), with hyperparameter tuning performed under an equal search budget. To directly address the concern, the revised manuscript includes a new ablation subsection that fixes the network architecture and function class, varies only the DREG strength λ, and reports both approximation error (on a synthetic target function) and gradient volatility; the results show that the reported efficiency and stability benefits scale with DREG strength and are not explained by unequal conditions. revision: yes
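The ablation described in the last response (fix the function class, vary only λ, and report approximation error alongside a derivative measure) can be illustrated on a deliberately tiny case. The sketch below is a toy under stated assumptions, not the authors' experiment: the model is a single slope `a`, the penalty is `lam * a**2`, and the names `fit_slope` and `ablate` are hypothetical.

```python
def fit_slope(xs, ys, lam):
    """Closed-form minimizer of mean((a*x - y)**2) + lam * a**2 over the slope a."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy / (sxx + lam)


def ablate(xs, ys, lams):
    """For each penalty weight, report (lam, approximation error, squared derivative)."""
    rows = []
    for lam in lams:
        a = fit_slope(xs, ys, lam)
        mse = sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        rows.append((lam, mse, a * a))  # d/dx of a*x is a, so the penalized term is a**2
    return rows


xs = [-1.0, 0.0, 1.0]
ys = [2.0 * x for x in xs]  # synthetic target y = 2x
rows = ablate(xs, ys, [0.0, 1.0])
```

In this linear toy the penalty necessarily trades approximation error for a smaller derivative, which is the capacity-shrinkage effect the referee asks about. Whether the Polynomial Engine avoids that trade-off is exactly what a reported ablation across λ values would need to show.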

Circularity Check

0 steps flagged

No derivation chain or equations visible; claims remain self-contained.

The manuscript text supplied contains only descriptive claims about ChainzRule, the Polynomial Engine, and DREG without any equations, derivation steps, parameter-fitting procedures, or self-citations. No load-bearing step can be quoted that reduces by construction to its own inputs, fitted values, or prior author work. Per the hard rules, absence of visible mathematical structure means the result is treated as self-contained against external benchmarks; the reader's speculation about fitted regularization strength cannot be confirmed from the text and is therefore not grounds for a circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

Abstract-only view prevents full ledger; new terms are introduced without stated assumptions or independent evidence.

free parameters (1)

DREG regularization strength
Likely tuned to suppress gradient volatility while preserving accuracy; value not reported.

invented entities (2)

Polynomial Engine (no independent evidence)
purpose: Replace piecewise-linear activations for derivative control
Core component of the architecture; no independent evidence supplied.

Differential Regularization (DREG) (no independent evidence)
purpose: Targeted layer-wise derivative regularization
Central mechanism claimed to harmonize stability and performance; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5765 in / 1233 out tokens · 59820 ms · 2026-05-19T15:15:55.955446+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: ChainzRule replaces standard piecewise-linear activations with a Polynomial Engine governed by Differential Regularization (DREG)... L = L_task + λ ∑_l E[||S^(l)||_F^2]

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: POLY DREG achieves 96.38% accuracy... reduced peak gradient volatility by 23.1%
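For readability, the objective quoted in the first passage above can be typeset as follows. The excerpt does not define S^(l); reading it as a layer-l sensitivity (Jacobian) matrix is an assumption based on the simulated rebuttal's description of DREG as a Jacobian norm penalty weighted by λ.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda \sum_{l} \mathbb{E}\!\left[ \left\lVert S^{(l)} \right\rVert_F^{2} \right]
```

Here the sum runs over layers l, the expectation is presumably taken over inputs, and λ is the single regularization strength listed in the free-parameter ledger.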

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
