
arxiv: 2602.04832 · v2 · submitted 2026-02-04 · 💻 cs.LG · cs.AI · cs.CV · cs.NE

It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Pith reviewed 2026-05-16 07:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.NE
keywords gradient descent · neural network capacity · lottery ticket conjecture · ReLU networks · network pruning · neuron dynamics · mutual alignment · weight norms

The pith

Gradient descent reduces a neural network's theoretical capacity to an effective capacity that fits the task, through the mutual alignment, unlocking, and racing of neurons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how gradient descent training in single-hidden-layer ReLU networks reduces the network's theoretical capacity to an effective capacity that fits the task at hand. It identifies three dynamical principles (mutual alignment, unlocking, and racing) that govern neuron behavior during training. These principles explain why equivalent neurons can often be merged and low-norm weights pruned after training without harming performance. They also provide a mechanism for the lottery ticket conjecture, showing how neurons with advantageous initial conditions gain higher weight norms. This matters because it offers a dynamical explanation for the success of overparameterized networks and post-training compression techniques.

Core claim

In single-hidden-layer ReLU networks, gradient descent induces mutual alignment among neurons, unlocking of certain neurons from poor initial conditions, and a racing dynamic where some neurons achieve higher weight norms. Together these processes adapt the network's capacity to the task, allowing for the merging of equivalent neurons or pruning of low-norm weights after training. This mechanism accounts for the lottery ticket conjecture by demonstrating that beneficial initial conditions enable specific neurons to win the race for higher norms.

What carries the argument

The three dynamical principles of mutual alignment, unlocking, and racing that govern individual neuron weight updates during gradient descent.
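To make the neuron-level view concrete, here is a minimal sketch of how per-neuron weight norms could be tracked during training. It is a toy setup and not the paper's experiment: the bias-free architecture, the squared loss, the norm proxy |a_j| * ||w_j||, and every name and hyperparameter (`train_relu_net`, `init_scale`, `lr`, and so on) are assumptions chosen for illustration.

```python
import numpy as np

def train_relu_net(X, y, hidden=20, lr=0.05, steps=2000, init_scale=0.1, seed=0):
    """Full-batch gradient descent on half the mean squared error for
    f(x) = sum_j a_j * relu(w_j . x)  (hypothetical toy model, no biases).

    Returns the final weights, a (steps + 1, hidden) history of the
    per-neuron norm proxy |a_j| * ||w_j||, and the loss at every step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = init_scale * rng.normal(size=(hidden, d))   # input weights, one row per neuron
    a = init_scale * rng.normal(size=hidden)        # output weights
    history = np.empty((steps + 1, hidden))
    losses = np.empty(steps + 1)
    for t in range(steps + 1):
        pre = X @ W.T                    # (n, hidden) pre-activations
        act = np.maximum(pre, 0.0)       # ReLU
        resid = act @ a - y              # (n,) residuals
        history[t] = np.abs(a) * np.linalg.norm(W, axis=1)
        losses[t] = 0.5 * np.mean(resid ** 2)
        if t == steps:
            break
        # Gradients of the loss with respect to a and W.
        grad_a = act.T @ resid / n
        grad_W = ((resid[:, None] * (pre > 0)) * a).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W
    return W, a, history, losses
```

Calling `train_relu_net(X, y)` on a small synthetic regression task and plotting the columns of `history` against the step index gives one way to check whether a few neurons pull ahead in norm while the rest stay small.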

If this is right

  • Equivalent neurons can be merged post-training to reduce network size while preserving function.
  • Low-norm weights can be pruned after training with minimal impact on performance.
  • Neurons with specific beneficial initial conditions obtain higher weight norms through the racing dynamic.
  • Overparameterized networks succeed because excess neurons are effectively neutralized or pruned via these dynamics.
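The first two consequences above can be sketched as follows for a bias-free single-hidden-layer ReLU network. This is an illustration under stated assumptions, not the paper's procedure: the function names, the norm proxy |a_j| * ||w_j||, and the cosine tolerance are all hypothetical. Merging relies on the positive homogeneity of ReLU, which makes neurons with parallel input weights exactly interchangeable.

```python
import numpy as np

def forward(W, a, X):
    """f(x) = sum_j a_j * relu(w_j . x), with W of shape (hidden, d)."""
    return np.maximum(X @ W.T, 0.0) @ a

def neuron_norms(W, a):
    """Per-neuron norm proxy |a_j| * ||w_j|| (an assumed definition)."""
    return np.abs(a) * np.linalg.norm(W, axis=1)

def prune_low_norm(W, a, threshold):
    """Drop neurons whose norm proxy falls below `threshold`."""
    keep = neuron_norms(W, a) >= threshold
    return W[keep], a[keep]

def merge_aligned(W, a, cos_tol=1e-8):
    """Merge neurons whose input weights point in the same direction.

    Since relu(c * z) = c * relu(z) for c > 0, a group of neurons sharing
    a unit direction u collapses to one neuron with input weight u and
    output weight sum_k a_k * ||w_k||. Assumes no w_j is exactly zero.
    """
    norms = np.linalg.norm(W, axis=1)
    units = W / norms[:, None]
    used = np.zeros(len(a), dtype=bool)
    merged_W, merged_a = [], []
    for j in range(len(a)):
        if used[j]:
            continue
        # Unmerged neurons whose cosine similarity to neuron j is ~1.
        same = (units @ units[j] >= 1.0 - cos_tol) & ~used
        used |= same
        merged_W.append(units[j])
        merged_a.append(np.sum(a[same] * norms[same]))
    return np.array(merged_W), np.array(merged_a)
```

Merging exactly parallel neurons leaves the network function unchanged up to floating-point error, whereas pruning is only approximately function-preserving, with an error that shrinks with the pruned neurons' norms.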

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These principles may extend to deeper networks if similar alignment occurs between layers.
  • Initialization schemes could be optimized to increase the number of winning neurons in the race.
  • Similar capacity adaptation might occur in other architectures like CNNs or transformers.
  • The racing dynamic suggests that training time or learning rate schedules could influence which neurons dominate.

Load-bearing premise

The dynamical principles identified in single-hidden-layer ReLU networks also operate as the essential mechanisms in deeper and more complex neural network architectures.

What would settle it

Training a multi-layer network on the same tasks and checking whether low-norm weights can still be pruned without performance loss, or directly observing whether mutual alignment and racing dynamics appear across hidden layers.
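One possible shape for the pruning half of such a test is sketched below. It uses global unstructured magnitude pruning on a bias-free ReLU multilayer perceptron, which is an assumed protocol rather than one taken from the paper; the helper names are hypothetical, and training the network is left out.

```python
import numpy as np

def mlp_forward(weights, X):
    """Bias-free ReLU MLP; `weights` is a list of (out, in) matrices."""
    h = X
    for W in weights[:-1]:
        h = np.maximum(h @ W.T, 0.0)
    return h @ weights[-1].T            # linear output layer

def magnitude_prune(weights, fraction):
    """Zero the `fraction` of entries with the smallest absolute value,
    pooled over all layers. Ties at the cutoff are pruned as well."""
    flat = np.concatenate([np.abs(W).ravel() for W in weights])
    k = int(fraction * flat.size)
    if k == 0:
        return [W.copy() for W in weights]
    cutoff = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return [np.where(np.abs(W) > cutoff, W, 0.0) for W in weights]

def pruning_curve(weights, X, y, fractions):
    """Mean squared error of the pruned network at each pruning fraction."""
    return [
        float(np.mean((mlp_forward(magnitude_prune(weights, f), X) - y) ** 2))
        for f in fractions
    ]
```

A curve that stays flat up to a large pruning fraction on a trained deep network would be consistent with the premise; a curve that rises immediately would count against it. Observing alignment and racing across layers directly would need per-neuron tracking in each hidden layer, which this sketch does not cover.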

Original abstract

Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles, namely mutual alignment, unlocking and racing, that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes learning dynamics at the level of individual neurons in single-hidden-layer ReLU networks trained with gradient descent. It identifies three dynamical principles—mutual alignment, unlocking, and racing—that together explain post-training capacity reduction via neuron merging or low-norm weight pruning, and provides a mechanistic account of the lottery ticket conjecture by linking beneficial initial conditions to higher final weight norms.

Significance. If the principles hold and extend beyond the single-layer setting, the work would supply a dynamical explanation for why pruning and merging succeed and why certain initializations win the lottery ticket, strengthening the theoretical basis for capacity adaptation in neural networks.

major comments (2)
  1. [Abstract and theoretical analysis] The three dynamical principles and all derivations are developed exclusively for single-hidden-layer ReLU networks. The central claim that these principles explain the lottery ticket conjecture and capacity reduction in general therefore rests on an untested transfer to deeper architectures, where interlayer feature competition or residual paths could alter or eliminate the racing dynamic that produces norm disparity.
  2. [Discussion of lottery ticket conjecture] No analysis or discussion is provided on how the racing mechanism would be modified by depth-dependent norm scaling or multi-layer interactions, yet the paper presents the single-layer results as explanatory for the lottery ticket phenomenon observed in deeper models.
minor comments (1)
  1. [Theoretical analysis] The definitions of mutual alignment, unlocking, and racing would benefit from explicit equations or pseudocode to make the dynamical principles reproducible from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of our dynamical analysis. We agree that the scope of the theoretical results is limited to single-hidden-layer ReLU networks and that the manuscript would benefit from clearer caveats and additional discussion on generalizability. We respond to each major comment below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract and theoretical analysis] The three dynamical principles and all derivations are developed exclusively for single-hidden-layer ReLU networks. The central claim that these principles explain the lottery ticket conjecture and capacity reduction in general therefore rests on an untested transfer to deeper architectures, where interlayer feature competition or residual paths could alter or eliminate the racing dynamic that produces norm disparity.

    Authors: We acknowledge that all formal derivations and the three dynamical principles (mutual alignment, unlocking, and racing) are developed and proven only for single-hidden-layer ReLU networks, as stated throughout the manuscript. The central claim is that these principles supply a mechanistic explanation for why beneficial initial conditions produce higher final weight norms, thereby accounting for the emergence of winning tickets and post-training capacity reduction via pruning or merging. While we do not claim a rigorous transfer to deeper networks, the core racing dynamic—differential norm growth driven by alignment—addresses a key ingredient of the lottery ticket conjecture that has been observed across architectures. We will revise the abstract to explicitly delimit the scope to single-hidden-layer networks and add a new subsection in the Discussion that outlines how interlayer competition or residual connections might modulate the racing mechanism, without overstating the current results. revision: partial

  2. Referee: [Discussion of lottery ticket conjecture] No analysis or discussion is provided on how the racing mechanism would be modified by depth-dependent norm scaling or multi-layer interactions, yet the paper presents the single-layer results as explanatory for the lottery ticket phenomenon observed in deeper models.

    Authors: We agree that the manuscript lacks explicit discussion of depth-dependent effects. The racing mechanism arises from the interaction between forward alignment and backward gradient scaling, which in deeper networks would be further influenced by layer-wise norm propagation and feature competition. Nevertheless, the single-layer analysis isolates the fundamental process by which initial alignment advantages are amplified into norm disparities, providing a building block for understanding lottery tickets in deeper models. We will add a concise paragraph in the Discussion section that (i) notes the absence of multi-layer analysis, (ii) speculates on how depth-dependent scaling could preserve or alter the racing dynamic, and (iii) identifies this as an important direction for future work. revision: yes

Circularity Check

0 steps flagged

Dynamical analysis derives principles without reduction to inputs by construction

Full rationale

The paper performs a dynamical analysis of gradient descent on single-hidden-layer ReLU networks to identify the principles of mutual alignment, unlocking, and racing. These are obtained from the learning dynamics equations rather than by fitting parameters to the final norms or by defining quantities in terms of the target observations (capacity reduction or lottery-ticket neuron selection). No self-citation chain, ansatz smuggling, or renaming of known results is required for the central claims; the derivation remains self-contained against the single-layer model assumptions. The single-layer scope is an explicit modeling choice, not a circularity.
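For orientation, the learning dynamics of a single-hidden-layer ReLU network can be written in a standard textbook form. The version below assumes a bias-free network, squared loss, and gradient flow (the continuous-time limit of gradient descent); the symbols a_j, w_j, and r_i are introduced here for illustration, and the paper's own notation and assumptions may differ.

```latex
% Model, loss, and residuals (assumed setup)
f(x) = \sum_{j} a_j \, \sigma(w_j^\top x), \qquad
\sigma(z) = \max(z, 0), \qquad
\mathcal{L} = \frac{1}{2n} \sum_{i=1}^{n} r_i^2, \qquad
r_i = f(x_i) - y_i

% Gradient flow for neuron j
\dot{a}_j = -\frac{1}{n} \sum_{i=1}^{n} r_i \, \sigma(w_j^\top x_i), \qquad
\dot{w}_j = -\frac{a_j}{n} \sum_{i=1}^{n} r_i \, \mathbb{1}[w_j^\top x_i > 0] \, x_i

% Because \sigma(z) = \mathbb{1}[z > 0] \, z, the two layers stay balanced:
\frac{d}{dt}\left( a_j^2 - \lVert w_j \rVert^2 \right)
  = 2 a_j \dot{a}_j - 2 w_j^\top \dot{w}_j = 0
```

Under these assumptions each neuron's input and output weight norms grow together, so the final norm of a neuron depends on the trajectory of the dynamics rather than on any quantity fitted to the outcome.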

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that single-hidden-layer ReLU dynamics are representative and that the three principles can be derived without additional fitted parameters beyond standard gradient descent.

axioms (1)
  • domain assumption: Single-hidden-layer ReLU networks capture the essential capacity-adaptation mechanisms of deeper networks.
    The paper restricts its analysis to this architecture to derive the three principles.

pith-pipeline@v0.9.0 · 5432 in / 1164 out tokens · 31607 ms · 2026-05-16T07:10:03.861409+00:00 · methodology
