arxiv: 2504.09484 · v2 · submitted 2025-04-13 · 💻 cs.LG

An overview of condensation phenomenon in deep learning

Pith reviewed 2026-05-22 19:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords condensation phenomenon, neural network training, generalization, training dynamics, loss landscape, dropout, transformers, deep learning

The pith

Neurons in the same layer of neural networks condense into groups with similar outputs during nonlinear training, with the number of clusters increasing monotonically over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This overview describes how neurons during nonlinear training form clusters where outputs within each group become similar. The process typically produces more such clusters as training continues. Small initial weights and dropout both encourage the clustering to happen. The review connects these observations to training dynamics and the shape of the loss landscape. A reader would care because the clustering appears linked to how networks generalize and to stronger reasoning performance in transformers.

Core claim

During the nonlinear training of neural networks, neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Small weight initializations or Dropout can facilitate this condensation process. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates with stronger reasoning abilities in transformer-based language models.

What carries the argument

The condensation phenomenon: neurons grouping into clusters that share similar output values.

If this is right

  • Small weight initializations facilitate condensation.
  • Dropout optimization facilitates the condensation process.
  • Condensation provides insights into generalization abilities of neural networks.
  • Condensation correlates with stronger reasoning abilities in transformer-based language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If condensation supports generalization, then deliberately promoting it through initialization or regularization choices could produce more efficient networks.
  • The reported correlation with reasoning in transformers suggests checking whether cluster count predicts capability gains in other sequence models.
  • Stable condensed states may act as attractors in the loss landscape whose geometry could be studied directly to explain training trajectories.

Load-bearing premise

Condensation is a general and reproducible feature that appears across different neural network architectures, tasks, and training setups.

What would settle it

A training run on a standard benchmark in which neurons fail to form output-similar clusters or in which the number of clusters does not increase monotonically while the network still trains successfully.
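
The second half of this test can be sketched in a few lines of Python. This is only an illustration: the function name `monotonicity_violations` and the input format (one cluster count per saved checkpoint) are assumptions made here, not anything the paper specifies, and "increases monotonically" is read as "never decreases".

```python
def monotonicity_violations(cluster_counts):
    """Return (checkpoint_index, previous_count, current_count) for every drop.

    cluster_counts: cluster counts measured at successive checkpoints
    (hypothetical input; how the counts are obtained is left open here).
    An empty result means the sequence never decreases.
    """
    counts = list(cluster_counts)
    pairs = zip(counts, counts[1:])
    return [
        (t, prev, cur)
        for t, (prev, cur) in enumerate(pairs, start=1)
        if cur < prev
    ]


# A run whose count drops at checkpoints 2 and 4 would count against the claim.
print(monotonicity_violations([1, 2, 1, 3, 2]))  # [(2, 2, 1), (4, 3, 2)]
```

Any reported violation depends on how clusters were counted in the first place, so the counting rule should be fixed before such a check is run.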

Figures

Figures reproduced from arXiv: 2504.09484 by Yaoyu Zhang, Zhangchen Zhou, Zhi-Qin John Xu.

Figure 1: The feature maps {(θ_k, A_k)}_k of a two-layer ReLU neural network. The red dots and the gray dots are the features of the active and the static neurons respectively and the blue solid lines are the trajectories of the active neurons during the training. The epochs are described in subcaptions.
Figure 2: The feature map of two-layer Tanh neural networks. The red dots are the features of neurons
Figure 3: Small initialization (convolutional and fully connected layers initially follow
Figure 4: Condensation phenomenon in a ResNet-18 model pre-trained on ImageNet. (a) and (b) show
Figure 5: Phase diagram of two-layer ReLU NNs at infinite-width limit. The marked examples are
Figure 6: The heatmap of the cosine similarity of neurons of two-layer NNs at the initial training
Figure 7: The loss distribution during the training among two-layer ReLU NNs with different widths.
Figure 8: Tanh NNs outputs and features under different dropout rates. The width of the hidden
Figure 9: Comparison of loss and output between the model trained by gradient descent with small
Figure 10: Cosine similarity matrices of neuron input weights (

Original abstract

In this paper, we provide an overview of a common phenomenon, condensation, observed during the nonlinear training of neural networks: During the nonlinear training of neural networks, neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Neural networks with small weight initializations or Dropout optimization can facilitate this condensation process. We also examine the underlying mechanisms of condensation from the perspectives of training dynamics and the structure of the loss landscape. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates to stronger reasoning abilities in transformer-based language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript provides an overview of the condensation phenomenon in deep neural networks: during nonlinear training, neurons within the same layer form groups with similar outputs, and the number of such condensed clusters typically increases monotonically with training progress. It discusses facilitating conditions (small weight initializations, Dropout), mechanisms viewed through training dynamics and loss-landscape structure, and implications for generalization and reasoning capabilities in transformer-based models.

Significance. If the described condensation behavior proves robust and general, the overview could help explain emergent properties of trained networks and suggest practical training heuristics. As a synthesis of empirical observations rather than a source of new derivations or controlled experiments, its primary value lies in consolidating prior findings for the community; however, the absence of a fixed, reproducible definition of condensation limits the strength of any broader claims about monotonicity or universality.

major comments (2)
  1. [Abstract] The central empirical claim that 'the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses' is load-bearing yet rests on unspecified similarity measures and cluster-counting procedures. Without an invariant definition (e.g., fixed output-correlation threshold, cosine similarity on activations, or parameter-free k-means protocol), different post-hoc choices can alter both detected clusters and the observed trend, undermining reproducibility across the cited setups.
  2. [Abstract / mechanisms discussion] The examination of underlying mechanisms 'from the perspectives of training dynamics and the structure of the loss landscape' is presented at a high level without specific derivations, equations, or falsifiable predictions that would allow independent verification or extension of the described condensation process.
minor comments (2)
  1. The manuscript would benefit from an explicit 'Definition' subsection early in the text that fixes the similarity metric and cluster-counting rule used throughout the overview.
  2. References to prior empirical studies should include a brief note on the architectures, datasets, and exact clustering hyperparameters employed in each cited observation to aid readers in assessing generality.
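
To make the reproducibility concern concrete, here is a minimal sketch of one possible cluster-counting rule. Everything specific in it is an assumption for illustration only: the function name `count_condensed_clusters`, the cosine-similarity cutoff of 0.95, and the choice to merge neurons transitively (connected components of the thresholded similarity graph). It is not the procedure used in the paper or in the works it cites.

```python
import numpy as np


def count_condensed_clusters(features, threshold=0.95):
    """Count groups of neurons whose feature vectors are nearly parallel.

    features: array of shape (n_neurons, d), one row per neuron, e.g. input
        weights or activations on a fixed probe batch (assumed input format).
    threshold: hypothetical cosine-similarity cutoff; changing it can change
        the number of clusters that is reported.
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    keep = norms[:, 0] > 1e-12            # skip neurons with a zero feature vector
    unit = features[keep] / norms[keep]   # rows normalized to unit length
    adjacent = (unit @ unit.T) >= threshold

    n = unit.shape[0]
    labels = -np.ones(n, dtype=int)       # -1 marks "not yet assigned"
    n_clusters = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill one connected component of the similarity graph.
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & (labels == -1)):
                labels[j] = n_clusters
                stack.append(j)
        n_clusters += 1
    return n_clusters


# Two near-parallel pairs plus one neuron pointing the opposite way.
demo = np.array([[1.0, 0.0], [2.0, 0.01], [0.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])
print(count_condensed_clusters(demo))  # 3
```

Rerunning the same count with a different `threshold`, or with activations instead of weights as `features`, is a quick way to see how sensitive a reported trend is to these post-hoc choices.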

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of this overview paper. We address the two major comments point by point below.

  1. Referee: [Abstract] The central empirical claim that 'the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses' is load-bearing yet rests on unspecified similarity measures and cluster-counting procedures. Without an invariant definition (e.g., fixed output-correlation threshold, cosine similarity on activations, or parameter-free k-means protocol), different post-hoc choices can alter both detected clusters and the observed trend, undermining reproducibility across the cited setups.

    Authors: We agree that the lack of a single fixed definition limits the strength of broad claims about monotonicity. The manuscript is an overview that reports empirical observations as described in the cited literature, where condensation is commonly identified via output similarity (e.g., cosine similarity of activations or weight vectors) followed by clustering. In the revised manuscript we will insert a new subsection (likely in Section 2) that explicitly catalogs the similarity measures and cluster-counting procedures used across the primary references, together with a brief discussion of how the monotonic trend has been observed under those choices. This addition will improve reproducibility without altering the overview character of the work. revision: yes

  2. Referee: [Abstract / mechanisms discussion] The examination of underlying mechanisms 'from the perspectives of training dynamics and the structure of the loss landscape' is presented at a high level without specific derivations, equations, or falsifiable predictions that would allow independent verification or extension of the described condensation process.

    Authors: As an overview whose primary contribution is synthesis rather than new theoretical derivations, the mechanisms section summarizes insights already present in the referenced studies. We will revise the text to include more explicit pointers to the key equations and analyses in those works (e.g., the mean-field or gradient-flow perspectives in the cited papers) and to note which predictions have been empirically tested. Because the manuscript does not aim to supply original derivations, we will not add new equations, but the expanded citations and cross-references should make verification and extension easier for readers. revision: partial

Circularity Check

0 steps flagged

No circularity: overview summarizes empirical observations without derivation chain

The paper is explicitly an overview of an observed phenomenon in neural network training, presenting claims as summaries of empirical findings rather than first-principles derivations or predictions. No mathematical steps, equations, or parameter-fitting procedures are described that could reduce to self-definition or fitted inputs by construction. Self-citations (if present in the full text) serve to reference prior empirical work but do not constitute a load-bearing uniqueness theorem or ansatz that forces the central claim. The monotonic cluster increase is stated as a typical observation across setups, not a result derived from the paper's own inputs. This is the standard honest outcome for a survey-style manuscript with no claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The overview rests on the empirical regularity of condensation and its correlation with generalization; no new free parameters, invented entities, or ad-hoc axioms are introduced in the provided abstract.

axioms (1)
  • Domain assumption: Neurons in the same layer tend to condense into groups with similar outputs during nonlinear training of neural networks.
    This is the core observed regularity that the overview takes as given and then discusses mechanisms and implications for.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformers show a sharp, task-specific critical window for weight decay application that determines reasoning versus memorization, with middle placement optimal and boundaries as narrow as 100 steps.

  2. WebSailor: Navigating Super-human Reasoning for Web Agent

    cs.CL 2025-07 conditional novelty 6.0

    WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 2 Pith papers · 6 internal anchors

  1. [1]

    A Closer Look at Memorization in Deep Networks

    [AJB+17] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394 ,

  2. [2]

    Early alignment in two-layer networks training is a two-edged sword

    [BF24] Etienne Boursier and Nicolas Flammarion. Early alignment in two-layer networks training is a two-edged sword. arXiv preprint arXiv:2401.10791 ,

  3. [3]

    On the dynamics of three-layer neural networks: initial condensation

    [CL24] Zheng-an Chen and Tao Luo. On the dynamics of three-layer neural networks: initial condensation. arXiv preprint arXiv:2402.15958 ,

  4. [4]

    A phase shift deep neural network for high frequency wave equations in inhomogeneous media

    [CLL19] Wei Cai, Xiaoguang Li, and Lizuo Liu. A phase shift deep neural network for high frequency wave equations in inhomogeneous media. arXiv preprint arXiv:1909.11759 ,

  5. [5]

    Phase diagram of initial condensation for two-layer neural networks

    [CLL+23] Zhengan Chen, Yuqing Li, Tao Luo, Zhangchen Zhou, and Zhi-Qin John Xu. Phase diagram of initial condensation for two-layer neural networks. arXiv preprint arXiv:2303.06561,

  6. [6]

    Analyzing multi-stage loss curve: Plateau and descent mechanisms in neural networks

    [CLW24] Zheng-An Chen, Tao Luo, and GuiHong Wang. Analyzing multi-stage loss curve: Plateau and descent mechanisms in neural networks. arXiv preprint arXiv:2410.20119 ,

  7. [7]

    Directional convergence near small initializations and saddles in two-homogeneous neural networks

    [KH24a] Akshay Kumar and Jarvis Haupt. Directional convergence near small initializations and saddles in two-homogeneous neural networks. arXiv preprint arXiv:2402.09226 ,

  8. [8]

    Early directional convergence in deep homogeneous neural networks for small initializations

    [KH24b] Akshay Kumar and Jarvis Haupt. Early directional convergence in deep homogeneous neural networks for small initializations. arXiv preprint arXiv:2403.08121 ,

  9. [9]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 ,

  10. [10]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    [KMN+16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 ,

  11. [11]

    Multi-scale deep neural network (mscalednn) for solving poisson-boltzmann equation in complex domains

    [LCX20] Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (mscalednn) for solving poisson-boltzmann equation in complex domains. Communications in Computational Physics, 28(5):1970–2001,

  12. [12]

    An upper limit of decaying rate with respect to frequency in deep neural network

    [LMW+21] Tao Luo, Zheng Ma, Zhiwei Wang, Zhi-Qin John Xu, and Yaoyu Zhang. An upper limit of decaying rate with respect to frequency in deep neural network. arXiv preprint arXiv:2105.11675,

  13. [13]

    A multi-scale dnn algorithm for nonlinear elliptic equations with multiple scales

    [LXZ20] Xi-An Li, Zhi-Qin John Xu, and Lei Zhang. A multi-scale dnn algorithm for nonlinear elliptic equations with multiple scales. Communications in Computational Physics, 28(5):1886–1906,

  14. [14]

    Gradient Descent Quantizes ReLU Network Features

    [MBG18] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes relu network features. arXiv preprint arXiv:1803.08367 ,

  15. [15]

    Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

    [MMM19] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015,

  16. [16]

    Dropout: a simple way to prevent neural networks from overfitting

    [SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958,

  17. [17]

    An analysis for reasoning bias of language models with small initialization

    [YZX25] Junjie Yao, Zhongwang Zhang, and Zhi-Qin John Xu. An analysis for reasoning bias of language models with small initialization. arXiv preprint arXiv:2502.04375 ,

  18. [18]

    Complexity control facilitates reasoning-based compositional generalization in transformers

    [ZLW+25] Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, and Zhi-Qin John Xu. Complexity control facilitates reasoning-based compositional generalization in transformers. arXiv preprint arXiv:2501.08537 ,

  19. [19]

    The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

    [ZWY+18] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195 ,

  20. [20]

    Embedding principle of loss landscape of deep neural networks

    [ZZLX21] Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle of loss landscape of deep neural networks. arXiv preprint arXiv:2105.14573 ,

  21. [21]

    Understanding the initial condensation of convolutional neural networks

    [ZZLX23] Zhangchen Zhou, Hanxu Zhou, Yuqing Li, and Zhi-Qin John Xu. Understanding the initial condensation of convolutional neural networks. arXiv preprint arXiv:2305.09947 ,

  22. [22]

    Optimistic estimate uncovers the potential of nonlinear models

    [ZZZ+23] Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, and Zhi-Qin John Xu. Optimistic estimate uncovers the potential of nonlinear models. arXiv preprint arXiv:2307.08921,