Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3
The pith
Which neurons become hubs in sparse networks matters more than overall connectivity variance: random hub placement offers no accuracy gain, while optimization-driven placement does.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Static heterogeneous fan-in profiles, defined by eight parametric families plus lognormal and power-law functions, produce no accuracy advantage over uniform random connectivity at sparsities from 80 to 99.9 percent when hub locations remain fixed and arbitrary. Structured profiles do create 2-5 times higher gradient concentration at hub neurons, with the strength of this hierarchy scaling directly with the fan-in coefficient of variation. Initializing RigL with lognormal profiles matched to its observed equilibrium distribution consistently outperforms standard ERK initialization, delivering gains that grow with task difficulty and allowing the optimizer to refine weights instead of rearranging topology.
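The profile construction can be sketched in code. The function below is an illustrative assumption, not the paper's exact parametric form: it builds a deterministic lognormal fan-in profile (quantiles over the neuron index, no sampling) and rescales it to hit a target sparsity, then reports the coefficient of variation that the gradient-hierarchy result keys on.

```python
from statistics import NormalDist
import numpy as np

def deterministic_lognormal_profile(n_neurons, n_inputs, sparsity, sigma=1.0):
    # Deterministic lognormal quantiles over the neuron index (no sampling),
    # rescaled so the total connection count matches the target sparsity.
    nd = NormalDist(mu=0.0, sigma=sigma)
    q = [(i + 0.5) / n_neurons for i in range(n_neurons)]
    raw = np.exp([nd.inv_cdf(p) for p in q])           # lognormal quantiles
    budget = (1.0 - sparsity) * n_neurons * n_inputs   # connections to keep
    fanin = np.clip(np.round(raw / raw.sum() * budget), 1, n_inputs).astype(int)
    cv = float(fanin.std() / fanin.mean())             # fan-in coefficient of variation
    return fanin, cv

fanin, cv = deterministic_lognormal_profile(256, 784, sparsity=0.9, sigma=1.0)
```

Varying `sigma` sweeps the CV, which is how a profile family like this one could cover the 0 to 2.5 CV range the paper reports.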
What carries the argument
Profiled Sparse Networks (PSN) that replace uniform fan-in with deterministic heterogeneous profiles generated by continuous nonlinear functions, together with the convergence of RigL dynamic sparse training to a stable characteristic fan-in distribution independent of starting initialization.
If this is right
- At 90 percent sparsity all static profiles including uniform random stay within 0.2 to 0.6 percent of dense baseline accuracy on every dataset tested.
- Gradient magnitude concentrates 2-5 times more at hub neurons under structured profiles than under uniform random connectivity.
- Lognormal initialization matched to RigL equilibrium improves final accuracy by 0.16 to 0.49 percent over ERK, with larger gains on harder tasks.
- RigL reaches the same equilibrium fan-in distribution regardless of whether training begins from uniform, ERK, or profiled initializations.
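The gradient-concentration claim in the second bullet can be illustrated with a toy measurement (not the paper's protocol): build one sparse ReLU layer whose mask gives a few neurons a much larger fan-in, backpropagate a softmax cross-entropy loss on random data, and compare total gradient mass at hub neurons against the rest. All sizes and the 48-versus-4 fan-in split below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, batch = 64, 32, 10, 128

# Hypothetical profile: 4 hub neurons with fan-in 48, the rest with fan-in 4.
fanin = np.full(n_hid, 4)
fanin[:4] = 48
mask = np.zeros((n_in, n_hid))
for j, k in enumerate(fanin):
    mask[rng.choice(n_in, size=k, replace=False), j] = 1.0

W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_hid)) * mask
W2 = rng.normal(0, 1 / np.sqrt(n_hid), (n_hid, n_out))
x = rng.normal(size=(batch, n_in))
y = rng.integers(0, n_out, size=batch)

h = np.maximum(x @ W1, 0.0)                      # ReLU hidden activations
logits = h @ W2
p = np.exp(logits - logits.max(1, keepdims=True))
p /= p.sum(1, keepdims=True)
p[np.arange(batch), y] -= 1.0                    # softmax cross-entropy gradient
g_h = (p @ W2.T) * (h > 0)                       # gradient at the hidden layer
gW1 = (x.T @ g_h) * mask                         # masked first-layer weight gradient

per_neuron = np.abs(gW1).sum(axis=0)             # total gradient mass per neuron
concentration = per_neuron[:4].mean() / per_neuron[4:].mean()
```

In this toy the ratio mostly tracks the fan-in ratio itself; the paper's 2-5x figure presumably reflects its own normalization and architectures.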
Where Pith is reading between the lines
- Future sparse training algorithms could benefit from directly optimizing the identity of hub neurons rather than only their degree distribution.
- The equilibrium fan-in profile may reflect an intrinsic property of gradient flow under magnitude-based pruning that is independent of the specific pruning schedule.
- If the equilibrium distribution proves stable across deeper and wider networks, it could serve as a parameter-free target for initializing any dynamic sparse method.
- The finding separates the effect of variance in connectivity from the effect of which specific neurons receive that variance, suggesting topology selection is the active ingredient in dynamic sparsity.
Load-bearing premise
The observed convergence of RigL to one characteristic fan-in distribution, and the lack of benefit from static heterogeneous profiles, hold beyond the four tested datasets, two-to-three-layer architectures, and specific hyper-parameters examined.
What would settle it
An experiment in which RigL is run on a new architecture or dataset and converges to a markedly different fan-in distribution, or a static profile whose arbitrary hub placement produces accuracy gains exceeding 1 percent over random baselines at 90 percent sparsity.
Figures
Original abstract
Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions, creating neurons with both dense and sparse receptive fields. We benchmark PSN across four classification datasets spanning vision and tabular domains, input dimensions from 54 to 784, and network depths of 2--3 hidden layers. At 90% sparsity, all static profiles, including the uniform random baseline, achieve accuracy within 0.2-0.6% of dense baselines on every dataset, demonstrating that heterogeneous connectivity provides no accuracy advantage when hub placement is arbitrary rather than task-aligned. This result holds across sparsity levels (80-99.9%), profile shapes (eight parametric families, lognormal, and power-law), and fan-in coefficients of variation from 0 to 2.5. Internal gradient analysis reveals that structured profiles create a 2-5x gradient concentration at hub neurons compared to the ~1x uniform distribution in random baselines, with the hierarchy strength predicted by fan-in coefficient of variation ($r = 0.93$). When PSN fan-in distributions are used to initialise RigL dynamic sparse training, lognormal profiles matched to the equilibrium fan-in distribution consistently outperform standard ERK initialisation, with advantages growing on harder tasks, achieving +0.16% on Fashion-MNIST ($p = 0.036$, $d = 1.07$), +0.43% on EMNIST, and +0.49% on Forest Cover. RigL converges to a characteristic fan-in distribution regardless of initialisation. Starting at this equilibrium allows the optimiser to refine weights rather than rearrange topology. Which neurons become hubs matters more than the degree of connectivity variance, i.e., random hub placement provides no advantage, while optimisation-driven placement does.
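ERK in the abstract refers to the Erdős-Rényi-Kernel allocation used by RigL (Evci et al.), which makes each layer's density proportional to (fan-in + fan-out) / (fan-in × fan-out) under a global sparsity budget. A simplified sketch follows; unlike the full algorithm, it clips densities above 1 instead of redistributing the excess, so achieved global sparsity can drift slightly.

```python
import numpy as np

def erk_layer_densities(layer_shapes, global_sparsity):
    # Erdos-Renyi-Kernel allocation: per-layer density scales with
    # (fan_in + fan_out) / (fan_in * fan_out), rescaled so the total number
    # of kept weights matches the global sparsity budget.
    shapes = np.array(layer_shapes, dtype=float)
    n_params = shapes.prod(axis=1)
    raw = shapes.sum(axis=1) / shapes.prod(axis=1)       # ER score per layer
    budget = (1.0 - global_sparsity) * n_params.sum()
    scale = budget / (raw * n_params).sum()
    return np.minimum(raw * scale, 1.0)                  # clip instead of redistributing

# A 784-256-256-10 MLP at 90% global sparsity (shapes chosen for illustration).
dens = erk_layer_densities([(784, 256), (256, 256), (256, 10)], 0.9)
```

Note how the tiny output layer ends up densest, which is the usual ERK behaviour of protecting narrow layers.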
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Profiled Sparse Networks (PSN) that use deterministic heterogeneous fan-in profiles defined by continuous nonlinear functions. Across four classification datasets (input dims 54–784), 2–3 hidden layer networks, and sparsity levels 80–99.9%, it reports that all static PSN profiles (eight parametric families plus lognormal/power-law, CV 0–2.5) achieve accuracy within 0.2–0.6% of dense baselines and show no advantage over uniform random connectivity when hub placement is arbitrary. Gradient analysis shows 2–5× concentration at hubs predicted by CV (r=0.93). Initializing RigL with lognormal profiles matched to the observed equilibrium distribution yields small but statistically significant gains over ERK (+0.16% Fashion-MNIST p=0.036 d=1.07; larger on EMNIST and Forest Cover), while RigL converges to a characteristic fan-in distribution independent of initialization. The central claim is that optimization-driven hub placement matters more than the degree of connectivity variance.
Significance. If the central empirical findings hold, the work provides concrete evidence that topology initialization can improve dynamic sparse training and that arbitrary heterogeneous connectivity confers little benefit. Strengths include consistent accuracy and gradient results across four datasets and multiple sparsity levels, use of statistical tests, and the observation that RigL reaches an equilibrium fan-in distribution. The practical suggestion of matching initial profiles to this equilibrium is a modest but actionable contribution to sparse training literature.
Major comments (2)
- [RigL convergence and initialization experiments] The claim that RigL converges to a characteristic fan-in distribution 'regardless of initialisation' and that this equilibrium is task-aligned rests on experiments limited to 2–3 hidden layers and four datasets (Section on RigL results and initialization experiments). If the equilibrium distribution or the benefit of starting at it changes with depth, width, or task difficulty, the contrast between arbitrary static profiles and optimization-driven placement does not support the broader conclusion that 'which neurons become hubs matters more than the degree of connectivity variance'.
- [RigL initialization results] The reported accuracy gains from equilibrium-matched initialization are small (+0.16% on Fashion-MNIST, +0.43% EMNIST, +0.49% Forest Cover) with moderate effect sizes; combined with the absence of full hyper-parameter search details and ablation on whether the advantage persists under different RigL schedules or deeper architectures, this weakens the load-bearing assertion that starting at equilibrium allows the optimizer to 'refine weights rather than rearrange topology'.
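For context on the quoted effect sizes, Cohen's d with a pooled standard deviation is computed as below. The per-seed accuracies are invented for illustration only; they are not the paper's data and do not reproduce its d = 1.07.

```python
import numpy as np

def cohens_d(a, b):
    # Cohen's d for two independent samples, pooled standard deviation.
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Hypothetical per-seed test accuracies (%), purely illustrative.
lognormal = [90.31, 90.28, 90.40, 90.35, 90.33]
erk       = [90.12, 90.20, 90.15, 90.22, 90.10]
d = cohens_d(lognormal, erk)
```

With only ~5 seeds per condition, a d near 1 corresponds to mean gaps of roughly one between-seed standard deviation, which is why the referee flags the absolute gains as small despite significance.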
Minor comments (3)
- [PSN definition] The definition of the eight parametric profile families and the exact mapping from CV to the nonlinear functions could be stated more explicitly (e.g., with equations) to allow exact reproduction.
- [Figures] Figure captions and legends should clarify which curves correspond to which profile families and whether error bars represent standard deviation or standard error across the reported runs.
- [Static profile benchmarks] A short discussion of why the tested static profiles (CV up to 2.5) are considered representative of 'arbitrary' heterogeneous connectivity would strengthen the interpretation of the null result for static PSN.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting important limitations in scope and experimental detail. We have revised the manuscript to qualify claims, add hyperparameter documentation, and expand the limitations discussion while preserving the core empirical findings. Point-by-point responses to the major comments follow.
Point-by-point responses
- Referee: The claim that RigL converges to a characteristic fan-in distribution 'regardless of initialisation' and that this equilibrium is task-aligned rests on experiments limited to 2–3 hidden layers and four datasets (Section on RigL results and initialization experiments). If the equilibrium distribution or the benefit of starting at it changes with depth, width, or task difficulty, the contrast between arbitrary static profiles and optimization-driven placement does not support the broader conclusion that 'which neurons become hubs matters more than the degree of connectivity variance'.
Authors: We agree the experiments are restricted to 2–3 hidden layers on the four datasets. Within these regimes the convergence to a characteristic fan-in distribution occurred consistently across initializations, and the initialization benefit scaled with task difficulty. We have added an explicit limitations paragraph in the discussion stating that the equilibrium may shift with greater depth or width and that the current evidence supports the conclusion only for the tested architectures. The central claim is now scoped accordingly, emphasizing that optimization-driven placement outperformed arbitrary heterogeneity in the studied settings. Revision: partial.
- Referee: The reported accuracy gains from equilibrium-matched initialization are small (+0.16% on Fashion-MNIST, +0.43% EMNIST, +0.49% Forest Cover) with moderate effect sizes; combined with the absence of full hyper-parameter search details and ablation on whether the advantage persists under different RigL schedules or deeper architectures, this weakens the load-bearing assertion that starting at equilibrium allows the optimizer to 'refine weights rather than rearrange topology'.
Authors: The gains are modest yet statistically significant with the reported p-values and effect sizes. We have added a full hyperparameter appendix detailing the grid search, RigL growth rate (0.1), update interval (every 1000 steps), and all other schedule parameters used. Exhaustive ablations on every schedule variant were not performed owing to the computational cost of dynamic sparse training; however, the advantage held across all four datasets and multiple sparsity levels. The manuscript text has been revised to state that, in the evaluated settings, equilibrium-matched initialization permits greater focus on weight refinement rather than topology rearrangement. Revision: yes.
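The schedule parameters quoted in the rebuttal suggest the standard RigL update. A minimal sketch, assuming magnitude-based drop and gradient-based growth as in the original algorithm (the full method also anneals the drop fraction with a cosine schedule, omitted here):

```python
import numpy as np

def rigl_update(W, mask, grad, drop_frac=0.1):
    # One RigL topology step: drop the smallest-magnitude active weights,
    # then grow the same number of inactive connections with the largest
    # gradient magnitude; grown weights start at zero.
    active = mask == 1
    k = int(drop_frac * active.sum())
    if k == 0:
        return W, mask
    # Drop: the k smallest |W| among active connections (flat indices).
    drop_idx = np.argsort(np.where(active, np.abs(W), np.inf), axis=None)[:k]
    # Grow: the k largest |grad| among inactive connections.
    grow_idx = np.argsort(np.where(active, -np.inf, np.abs(grad)), axis=None)[-k:]
    W, mask = W.copy(), mask.copy()
    np.put(mask, drop_idx, 0)
    np.put(W, drop_idx, 0.0)
    np.put(mask, grow_idx, 1)
    np.put(W, grow_idx, 0.0)           # new connections initialised at zero
    return W, mask
```

Each step preserves the sparsity level exactly, which is why repeated steps can reshape the fan-in distribution toward an equilibrium without changing the parameter budget.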
Circularity Check
No circularity; purely empirical benchmarks with independent experimental support
Full rationale
The manuscript reports experimental results on PSN static profiles and RigL dynamic training across four datasets, multiple sparsity levels, and eight profile families. All central claims—including convergence of RigL to a characteristic fan-in distribution, gradient hierarchy scaling with CV (r=0.93), and accuracy gains from equilibrium initialization—are direct outcomes of the reported runs rather than derivations, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Profiled Sparse Networks (PSN): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Constants.phi_golden_ratio (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Matched passage: φ_i = ⌊i · φ · n⌋ mod n, with golden ratio φ ≈ 1.618.
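The matched formula is a low-discrepancy index map. A direct transcription, assuming n is an index range and i a running counter (the excerpt does not specify either):

```python
import math

def golden_indices(n, count):
    # phi_i = floor(i * phi * n) mod n, with phi the golden ratio.
    phi = (1 + math.sqrt(5)) / 2
    return [math.floor(i * phi * n) % n for i in range(count)]
```

For n = 10 the first five indices are 0, 6, 2, 8, 4: successive multiples of φ·n wrap around the index range fairly evenly, which is the "same mathematical shape" the echo tag refers to.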
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.