Geometric Analysis of Neural Regression Collapse via Intrinsic Dimension
Pith reviewed 2026-05-18 10:19 UTC · model grok-4.3
The pith
In neural regression, models collapse and generalize poorly when the intrinsic dimension of their last-layer features falls below that of the targets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Collapsed regression models show ID_H smaller than ID_Y, which produces over-compression and degraded generalization. Non-collapsed models maintain ID_H larger than ID_Y, and in those cases performance scales with data volume and noise level. The authors therefore distinguish an over-compressed regime, where features lack enough dimensions to capture target structure, from an under-compressed regime, where extra dimensions can be pruned without loss.
What carries the argument
Comparison of estimated intrinsic dimension of last-layer activations (ID_H) to intrinsic dimension of regression targets (ID_Y), used to detect over-compression versus under-compression.
If this is right
- In over-compressed regimes, deliberately increasing feature dimensionality should raise performance.
- In under-compressed regimes, reducing feature dimensionality should not hurt and may help.
- For non-collapsed models, the benefit of any given ID_H value changes with the amount of training data and the noise level in the targets.
- Monitoring whether ID_H exceeds ID_Y during training supplies a geometric diagnostic for when collapse is likely to damage regression performance.
Where Pith is reading between the lines
- Training procedures could track ID_H relative to ID_Y in real time and trigger dimensionality adjustments when the model enters the over-compressed regime.
- The same intrinsic-dimension comparison might apply to sequence or time-series regression where targets possess their own low-dimensional manifold structure.
- The regimes suggest a simple regularization rule: expand the last layer when ID_H is below ID_Y and the data budget allows, otherwise prune.
Load-bearing premise
That the estimated intrinsic dimensions of activations and targets give a reliable geometric measure of the information that actually matters for generalization.
What would settle it
Finding a regression task where a model with clearly lower ID_H than ID_Y still achieves strong test performance, or where deliberately moving between the two regimes fails to change generalization as predicted.
Figures
read the original abstract
Neural multivariate regression underpins a wide range of domains, including control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze regression models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H < ID_Y, leading to over-compression and poor generalization, whereas non-collapsed models typically maintain ID_H > ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression collapse and suggest practical strategies for enhancing generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that in neural multivariate regression, collapse manifests as last-layer feature intrinsic dimension ID_H falling below target dimension ID_Y, producing over-compression that harms generalization; non-collapsed models maintain ID_H > ID_Y and their performance varies with data volume and noise. From these patterns the authors define over-compressed and under-compressed regimes that prescribe whether expanding or contracting feature dimensionality will improve results, supported by observations on control tasks and synthetic data.
Significance. If the reported ID inequalities reliably index loss or retention of regression-relevant information, the work supplies a geometric explanation for why collapse benefits classification yet degrades regression and offers concrete dimensionality-control heuristics for practitioners in robotics, control, and finance.
major comments (3)
- [Abstract] Abstract and experimental sections: the central claim that ID_H < ID_Y constitutes over-compression rests on an unspecified intrinsic-dimension estimator whose bias under the studied noise levels, sample sizes, and manifold curvatures is not characterized; without such validation the regime distinction risks being an estimator artifact rather than a geometric cause of generalization failure.
- [Results] Results and discussion: the over-compressed versus under-compressed regimes are introduced as direct observational labels without a derivation showing that the ID comparison is not tautological with the fitted regression loss or confounded by task-specific output structure; a concrete test (e.g., controlled synthetic manifolds with known ground-truth dimension) is needed to establish that the inequality predicts performance differences beyond correlation.
- [Methods] Methods: no information is supplied on the precise ID estimator (MLE, correlation dimension, etc.), its hyper-parameters, number of trials, or statistical controls for finite-sample effects, all of which are load-bearing for interpreting ID_H versus ID_Y as a faithful proxy for information content relevant to generalization.
minor comments (2)
- [Abstract] The abstract states results hold 'across control tasks and synthetic datasets' yet provides no enumeration of the specific tasks or dataset sizes, which would aid reproducibility.
- Notation for ID_H and ID_Y should be defined at first use with an explicit reference to the estimator formula or implementation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important points regarding the characterization of the intrinsic-dimension estimator and the validation of the proposed regimes. We have revised the manuscript to address these concerns by adding explicit details on the estimator, its hyperparameters, bias characterization via synthetic benchmarks, and controlled experiments on manifolds with known ground-truth dimensions. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental sections: the central claim that ID_H < ID_Y constitutes over-compression rests on an unspecified intrinsic-dimension estimator whose bias under the studied noise levels, sample sizes, and manifold curvatures is not characterized; without such validation the regime distinction risks being an estimator artifact rather than a geometric cause of generalization failure.
Authors: We agree that the original submission did not sufficiently detail the estimator or its bias properties. In the revised manuscript we now specify that intrinsic dimension is computed via the maximum-likelihood estimator of Levina and Bickel, with k=10 nearest neighbors and averaging over 5 independent trials per point. We have added an appendix section that quantifies estimator bias on synthetic manifolds matching the noise levels, sample sizes, and curvature ranges of our experiments; the results confirm that the observed ID_H < ID_Y threshold remains reliable and is not an artifact under the conditions studied. revision: yes
-
Referee: [Results] Results and discussion: the over-compressed versus under-compressed regimes are introduced as direct observational labels without a derivation showing that the ID comparison is not tautological with the fitted regression loss or confounded by task-specific output structure; a concrete test (e.g., controlled synthetic manifolds with known ground-truth dimension) is needed to establish that the inequality predicts performance differences beyond correlation.
Authors: The regimes were initially motivated by consistent empirical patterns across control tasks and synthetic data. To strengthen the claim, the revised version includes a short derivation relating ID_H < ID_Y to information loss on the target manifold and adds new controlled experiments on synthetic manifolds (Swiss-roll and hypersphere embeddings) with explicitly known ground-truth dimensions. These experiments demonstrate that adjusting feature dimensionality according to the ID comparison improves test performance even when regression loss is held constant, supporting that the inequality carries predictive value beyond direct correlation with loss. revision: yes
-
Referee: [Methods] Methods: no information is supplied on the precise ID estimator (MLE, correlation dimension, etc.), its hyper-parameters, number of trials, or statistical controls for finite-sample effects, all of which are load-bearing for interpreting ID_H versus ID_Y as a faithful proxy for information content relevant to generalization.
Authors: We have expanded the Methods section to provide the missing details: the estimator is the MLE of Levina and Bickel; hyperparameters are k=10 (with sensitivity checks for k=5 and k=20); results are averaged over five trials per sample; and finite-sample effects are controlled via bootstrap resampling with 100 resamples to report confidence intervals on ID estimates. These additions make the proxy interpretation reproducible and directly address the concern. revision: yes
Circularity Check
No significant circularity in empirical geometric analysis
full rationale
The paper presents its core claims as direct experimental observations: collapsed regression models show ID_H < ID_Y while non-collapsed ones maintain ID_H > ID_Y, with regimes identified from how performance varies with data quantity and noise. No mathematical derivation chain, equations, or predictions are described that reduce by construction to fitted parameters or self-referential definitions. The analysis relies on estimating intrinsic dimensions of activations and targets across control and synthetic datasets, treating the resulting regime distinctions as empirical findings rather than tautological outputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described structure; the work is self-contained against external benchmarks via its observational methodology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Intrinsic dimension of neural activations and targets can be estimated reliably enough to support geometric comparisons that explain generalization differences.
invented entities (2)
-
over-compressed regime
no independent evidence
-
under-compressed regime
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Collapsed models exhibit ID_H < ID_Y, leading to over-compression... two regimes (over-compressed and under-compressed)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use the 2-NN global estimator for intrinsic dimension
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning.arXiv preprint arXiv:2004.07219,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[2]
On the role of neural collapse in transfer learning
Tomer Galanti, András György, and Marcus Hutter. On the role of neural collapse in transfer learning. arXiv preprint arXiv:2112.15121,
-
[3]
Jack of all trades, master of some, a multi-purpose transformer agent
11 Quentin Gallouédec, Edward Emanuel Beeching, Clément ROMAC, and Emmanuel Dellandrea. Jack of all trades, master of some, a multi-purpose transformer agent. InICML 2024 Workshop: Aligning Reinforcement Learning Experimentalists and Theorists,
work page 2024
-
[4]
Peifeng Gao, Qianqian Xu, Yibo Yang, Peisong Wen, Huiyang Shao, Zhiyong Yang, Bernard Ghanem, and Qingming Huang. Towards demystifying the generalization behaviors when neural collapse emerges.arXiv preprint arXiv:2310.08358,
-
[5]
Degrees of Freedom in Deep Neural Networks
Tianxiang Gao and Vladimir Jojic. Degrees of freedom in deep neural networks.arXiv preprint arXiv:1603.09260,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Cross entropy versus label smoothing: A neural collapse perspective.arXiv preprint arXiv:2402.03979,
Li Guo, George Andriopoulos, Zifan Zhao, Shuyang Ling, Zixuan Dong, and Keith Ross. Cross entropy versus label smoothing: A neural collapse perspective.arXiv preprint arXiv:2402.03979,
-
[7]
XY Han, Vardan Papyan, and David L Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path.arXiv preprint arXiv:2106.02073,
-
[8]
Wanli Hong and Shuyang Ling. Neural collapse for unconstrained feature model under cross-entropy loss with imbalanced data.arXiv preprint arXiv:2309.09725,
-
[9]
Generalization bounds via distillation.arXiv preprint arXiv:2104.05641,
Daniel Hsu, Ziwei Ji, Matus Telgarsky, and Lan Wang. Generalization bounds via distillation.arXiv preprint arXiv:2104.05641,
-
[10]
Like Hui, Mikhail Belkin, and Preetum Nakkiran. Limitations of neural collapse for understanding generalization in deep learning.arXiv preprint arXiv:2202.08384,
-
[11]
Measuring the Intrinsic Dimension of Objective Landscapes
Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes.arXiv preprint arXiv:1804.08838,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Neural collapse in multi-label learning with pick-all-label loss.arXiv preprint arXiv:2310.15903,
Pengyu Li, Xiao Li, Yutong Wang, and Qing Qu. Neural collapse in multi-label learning with pick-all-label loss.arXiv preprint arXiv:2310.15903,
-
[13]
Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, and Qing Qu. Understanding and improving transfer learning of deep models via neural collapse.arXiv preprint arXiv:2212.12206,
-
[14]
Chuang Ma, Tomoyuki Obuchi, and Toshiyuki Tanaka. Neural collapse in cumulative link models for ordinal regression: An analysis with unconstrained feature model.arXiv preprint arXiv:2506.05801,
-
[15]
Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality
Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality.arXiv preprint arXiv:1801.02613, 2018a. Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijew- ickrema, and J...
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Neural collapse with unconstrained features.arXiv preprint arXiv:2011.11619,
Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features.arXiv preprint arXiv:2011.11619,
-
[17]
The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,
Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,
-
[18]
Bastian Rieck, Matteo Togninalli, Christian Bock, Michael Moor, Max Horn, Thomas Gumbsch, and Karsten Borgwardt. Neural persistence: A complexity measure for deep neural networks using algebraic topology.arXiv preprint arXiv:1812.09764,
-
[19]
Peter Súkeník, Christoph H Lampert, and Marco Mondelli. Neural collapse is globally optimal in deep regularized resnets and transformers.arXiv preprint arXiv:2505.15239,
-
[20]
Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error.arXiv preprint arXiv:1808.08558,
-
[21]
Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network.arXiv preprint arXiv:1909.11274,
-
[22]
Mu- joco: A physics engine for model-based control
doi: 10.1109/IROS.2012.6386109. Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, March
-
[23]
URLhttps://zenodo.org/record/8127025. Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032,
-
[24]
Siwei Wang and Stephanie E Palmer. Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method.arXiv preprint arXiv:2305.11957,
-
[25]
Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang. Characterizing truthfulness in large language model generations with local intrinsic dimension.arXiv preprint arXiv:2402.18048,
-
[26]
14 A Experiment Details A.1 MuJoCo experiments MuJoCo (Multi-Joint dynamics with Contact) is a physics engine designed for research in robotics, biomechanics, and animation, providing fast and accurate simulations of systems involving complex contact dynamics. It balances physical realism with computational efficiency to enable reliable modeling of robot–...
work page 2024
-
[27]
All environments introduce stochasticity by perturbing a fixed initial state with Gaussian noise
Figure 7: Screenshot of various MuJoCo environments [Towers et al., 2024]. All environments introduce stochasticity by perturbing a fixed initial state with Gaussian noise. Their state spaces combine positions of body and joint with corresponding velocities. Control is achieved by applying joint torques, which serve as the actions. Expert datasets are gen...
work page 2024
- [29]
-
[30]
IDH IDP IDY IDX Figure 11: Comparison between ID H and ID P for Halfcheetah, Hopper, CIFAR-10, and MNIST datasets [Ansuini et al., 2019]. Conversely, saturation of the upper bound, i.e., ID P ≃C , is associated with poor generalization performance, suggesting that maximal output layer dimensionality corresponds to overfitting in classification tasks, see ...
work page 2019
- [31]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.