The Geometry of Projection Heads: Conditioning, Invariance, and Collapse
Pith reviewed 2026-05-20 14:09 UTC · model grok-4.3
The pith
Nonlinear projection heads make collapsed states unstable by inducing negative eigenvalues in the Hessian.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling the projection head as a trainable Riemannian metric on the backbone representation manifold, the analysis establishes that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, rendering those states unstable. Linear and ReLU heads lack this native negative curvature under continuous-time gradient flow and instead rely on discrete-time dynamics or BatchNorm to escape. The same metric view characterizes how degeneracy controls the information-invariance trade-off and directly accounts for why the head must be removed after training.
What carries the argument
The trainable Riemannian metric induced by the projection head on the backbone representation manifold, which adapts local geometry to loss constraints and generates curvature at collapse points.
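The metric view can be made concrete in a few lines. The sketch below assumes the pullback form G(z) = J_h(z)^T J_h(z), which matches the construction the rebuttal describes (a Jacobian pullback of the Euclidean metric on the output space). The two-layer Swish head, its dimensions, and the names `head_jacobian` and `pullback_metric` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def swish(x):
    """Swish activation x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swish_grad(x):
    """First derivative of Swish: s + x * s * (1 - s), with s = sigmoid(x)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s + x * s * (1.0 - s)

def head(z, W1, W2):
    """Hypothetical two-layer head h(z) = W2 @ swish(W1 @ z)."""
    return W2 @ swish(W1 @ z)

def head_jacobian(z, W1, W2):
    """Jacobian of the head with respect to the backbone representation z."""
    pre = W1 @ z
    return W2 @ (swish_grad(pre)[:, None] * W1)

def pullback_metric(z, W1, W2):
    """Pullback of the Euclidean output metric: G(z) = J(z)^T J(z)."""
    J = head_jacobian(z, W1, W2)
    return J.T @ J

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 6, 8, 3          # assumed toy sizes
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
z = rng.normal(size=d_in)

G = pullback_metric(z, W1, W2)
eigvals = np.linalg.eigvalsh(G)           # ascending order
print("metric eigenvalues:", eigvals)
```

Because the assumed output dimension (3) is smaller than the input dimension (6), G has at most three nonzero eigenvalues. Directions in the kernel of the Jacobian have zero length under this metric, which is the kind of degeneracy the review ties to the information-invariance trade-off.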
If this is right
- Linear heads implicitly perform subspace whitening.
- Nonlinear head depth increases the capacity to adapt local metrics to the loss's topological constraints.
- Smooth activations such as Swish generate explicit negative curvature that enables escape from collapse.
- Metric degeneracy directly governs the information-invariance trade-off and necessitates discarding the head.
- The head functions as a universal geometric buffer that decouples the semantic backbone from pretraining constraints.
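One ingredient behind the activation-dependent claims above is the activation's second derivative: it is identically zero for the identity, zero almost everywhere for ReLU, and nonzero with a changing sign for Swish. The sketch below checks this numerically. The function names and sample points are assumptions chosen for illustration, and the snippet does not reproduce the paper's Hessian computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def relu(x):
    return np.maximum(x, 0.0)

def identity(x):
    return x

def swish_second(x):
    """Closed-form second derivative of Swish: s(1-s) * (2 + x(1-2s))."""
    s = sigmoid(x)
    return s * (1.0 - s) * (2.0 + x * (1.0 - 2.0 * s))

def second_derivative_fd(f, x, eps=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

# Sample points chosen away from ReLU's kink at zero.
xs = np.array([-4.0, -1.0, 0.5, 1.0, 4.0])
for name, f in [("identity", identity), ("relu", relu), ("swish", swish)]:
    print(name, np.round(second_derivative_fd(f, xs), 4))
```

Note that the second derivative of Swish is positive near zero and negative for larger |x|, so the sign of any curvature it contributes depends on where the pre-activations sit as well as on the loss and the weights.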
Where Pith is reading between the lines
- Designers could select activations to tune the sign of curvature and reduce reliance on BatchNorm for stability.
- Continuous tracking of Hessian eigenvalues during training offers a practical diagnostic for early collapse risk.
- The Riemannian-metric framing may extend to other representation-learning components to reveal analogous conditioning effects.
- Keeping a lightweight nonlinear head at inference time could preserve some invariance benefits without harming downstream tasks.
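The eigenvalue-tracking diagnostic suggested above does not require forming the full Hessian. A minimal sketch, assuming only access to a gradient function: Hessian-vector products come from finite differences of the gradient, and a shifted power iteration recovers the smallest eigenvalue. The helper names, iteration counts, and the quadratic test problem are all hypothetical; a real training loop would substitute its own gradient oracle and would likely prefer Lanczos with automatic differentiation.

```python
import numpy as np

def hvp_fd(grad_fn, w, v, eps=1e-5):
    """Hessian-vector product via central differences of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def power_iteration(matvec, dim, iters, rng):
    """Dominant (largest-magnitude) eigenpair of a symmetric operator."""
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = matvec(v)
        v = u / np.linalg.norm(u)
    return float(v @ matvec(v)), v

def min_hessian_eigenvalue(grad_fn, w, iters=300, seed=0):
    """Smallest Hessian eigenvalue at w, using only gradient evaluations."""
    rng = np.random.default_rng(seed)
    hv = lambda v: hvp_fd(grad_fn, w, v)
    lam_big, _ = power_iteration(hv, w.size, iters, rng)
    shift = abs(lam_big)
    # shift*I - H has non-negative spectrum; its top eigenvalue is shift - lambda_min.
    lam_shifted, v = power_iteration(lambda x: shift * x - hv(x), w.size, iters, rng)
    return shift - lam_shifted, v

# Toy check: quadratic loss 0.5 * w^T A w with a known indefinite Hessian A.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag([3.0, 1.0, 0.5, -0.7]) @ Q.T
grad_fn = lambda w: A @ w

lam_min, direction = min_hessian_eigenvalue(grad_fn, np.zeros(4))
print("estimated smallest eigenvalue:", lam_min)
```

Logging this value every few hundred steps would give the kind of early-warning signal described above: a persistently non-negative minimum near a low-rank state would suggest no curvature-driven escape is available.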
Load-bearing premise
The projection head can be modeled as a trainable Riemannian metric on the backbone representation manifold.
What would settle it
A computation of the Hessian at a collapsed equilibrium under a smooth nonlinear head that shows all eigenvalues are nonnegative, or a continuous-time simulation in which such heads remain trapped in collapse without discrete steps or BatchNorm.
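For a model small enough to form the Hessian densely, the first test can be sketched as a two-step check: confirm the point is an equilibrium, then inspect the sign of the smallest eigenvalue with an explicit tolerance. The two toy functions below are hypothetical stand-ins, not the paper's self-supervised loss, and the tolerances are assumptions that would need tuning in practice.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

def numerical_hessian(f, w, eps=1e-4):
    """Central-difference Hessian of f at w, symmetrized."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * eps**2)
    return 0.5 * (H + H.T)

def classify_equilibrium(f, w, grad_tol=1e-6, eig_tol=1e-4):
    """Return a verdict and the Hessian eigenvalues (None if not an equilibrium)."""
    if np.linalg.norm(numerical_gradient(f, w)) > grad_tol:
        return "not an equilibrium", None
    eigvals = np.linalg.eigvalsh(numerical_hessian(f, w))
    if eigvals.min() < -eig_tol:
        return "unstable (negative curvature)", eigvals
    return "no negative curvature detected", eigvals

# Hypothetical toy losses with a critical point at the origin.
saddle = lambda w: w[0] ** 2 - 0.5 * w[1] ** 2   # one negative eigenvalue
flat = lambda w: w[0] ** 2 + w[1] ** 4           # one flat direction, no negative curvature

origin = np.zeros(2)
print(classify_equilibrium(saddle, origin))
print(classify_equilibrium(flat, origin))
```

A verdict of "no negative curvature detected" at a collapsed equilibrium of a smooth nonlinear head would be the counterexample described above, provided the tolerance is small relative to the eigenvalue scale of the problem.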
Original abstract
We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. It claims that linear heads perform implicit subspace whitening while nonlinear heads adapt local metrics to the loss's topological constraints (with depth controlling capacity), proves that smooth nonlinear heads induce negative Hessian eigenvalues at collapsed equilibria (rendering them unstable), and empirically tracks optimization geometry to show that activations like Swish generate explicit negative curvature to escape collapse under continuous-time flow (unlike linear/ReLU heads, which rely on discrete dynamics and BatchNorm). The work further characterizes metric degeneracy in the information-invariance trade-off and positions the head as a universal geometric buffer, with evaluations across contrastive and decorrelation objectives on foundation models.
Significance. If the geometric modeling and derivations hold, the paper supplies a coherent framework explaining the functional role of projection heads, their necessity for avoiding destructive constraints during pretraining, and the reason they are discarded afterward. The empirical tracking of curvature across activations and the cross-objective validation on foundation models are concrete strengths that could inform practical design choices for mitigating dimensional collapse.
major comments (2)
- [Geometric modeling (opening claim and Hessian analysis)] The central claim that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria rests on modeling the projection head as a trainable Riemannian metric. The explicit map from head weights to the metric tensor, the coordinate chart on the backbone manifold, and the precise dependence of the metric on head parameters are left implicit, so it is unclear whether the eigenvalue sign follows from the architecture or from an auxiliary choice in the geometric construction.
- [Proof of Hessian negativity and continuous-time analysis] The abstract asserts proofs of negative eigenvalues together with continuous-time analysis, yet the manuscript provides neither the full derivations nor the explicit connection between the continuous-time gradient flow and the discrete SGD dynamics actually used in training. This gap is load-bearing for the instability result and the claim that only smooth nonlinear heads can escape collapse via curvature.
minor comments (2)
- [Notation and definitions] Notation for the Riemannian metric tensor and its dependence on head depth should be introduced with an explicit equation early in the geometric-modeling section to improve readability.
- [Empirical validation] The empirical section would benefit from a table summarizing the tracked curvature values (or eigenvalue signs) for Swish, ReLU, and linear heads across the reported objectives.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of the geometric framework. We address the two major comments point by point below, clarifying the modeling and committing to explicit additions that strengthen the derivations without altering the core claims.
- Referee: [Geometric modeling (opening claim and Hessian analysis)] The central claim that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria rests on modeling the projection head as a trainable Riemannian metric. The explicit map from head weights to the metric tensor, the coordinate chart on the backbone manifold, and the precise dependence of the metric on head parameters are left implicit, so it is unclear whether the eigenvalue sign follows from the architecture or from an auxiliary choice in the geometric construction.
  Authors: We agree that the geometric construction benefits from greater explicitness. Section 3 defines the head as inducing a trainable metric g_θ via the Jacobian pullback of the Euclidean metric on the output space, with the backbone manifold equipped with the standard coordinate chart induced by the representation embedding. The dependence on head parameters θ enters through the Jacobian of the nonlinear head h_θ. In revision we will insert a new subsection that writes the map θ ↦ g_θ explicitly, shows that the sign of the Hessian eigenvalues at collapse is determined solely by the second derivative of the activation (negative for smooth nonlinearities such as Swish), and confirms no auxiliary choices are required. This addition will make the architectural origin of the negativity unambiguous.
  Revision: yes
- Referee: [Proof of Hessian negativity and continuous-time analysis] The abstract asserts proofs of negative eigenvalues together with continuous-time analysis, yet the manuscript provides neither the full derivations nor the explicit connection between the continuous-time gradient flow and the discrete SGD dynamics actually used in training. This gap is load-bearing for the instability result and the claim that only smooth nonlinear heads can escape collapse via curvature.
  Authors: The manuscript contains sketches of the Hessian computation and continuous-time flow in Sections 4–5 together with empirical curvature tracking, but we acknowledge that complete derivations and the discrete-to-continuous link are not expanded. In the revision we will add a self-contained appendix with the full Hessian derivation at collapsed equilibria, showing that smoothness of the activation produces at least one negative eigenvalue. We will also include a paragraph relating the continuous gradient flow to discrete SGD under small learning-rate regimes, noting that the instability induced by negative curvature persists in the discrete setting and is observed in our actual training runs. These additions directly address the load-bearing gap while preserving the original empirical findings.
  Revision: yes
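The promised link between gradient flow and small-step gradient descent can be illustrated with the standard linearization around an equilibrium. Along a Hessian eigen-direction with eigenvalue λ, gradient flow scales a perturbation by exp(-λt), while gradient descent with learning rate η scales it by (1 - ηλ) per step. The sketch below compares the two over a fixed horizon. The eigenvalue, horizon, and learning rates are arbitrary assumptions, and the linearization ignores gradient noise and BatchNorm, so it illustrates only the limiting behavior, not the paper's analysis.

```python
import numpy as np

def flow_growth(lam, t):
    """Gradient-flow amplification exp(-lam * t) along an eigen-direction."""
    return np.exp(-lam * t)

def gd_growth(lam, t, lr):
    """Gradient-descent amplification (1 - lr * lam) ** (t / lr) over the same horizon."""
    steps = int(round(t / lr))
    return (1.0 - lr * lam) ** steps

lam = -0.5        # assumed negative Hessian eigenvalue at the equilibrium
horizon = 4.0     # assumed total "time" = learning rate * number of steps
learning_rates = (0.5, 0.1, 0.01, 0.001)

target = flow_growth(lam, horizon)
for lr in learning_rates:
    print(f"lr={lr:<6} gd={gd_growth(lam, horizon, lr):.4f} flow={target:.4f}")
```

For a negative eigenvalue both factors exceed one, so the perturbation grows in either setting, and the discrete factor approaches the continuous one as the learning rate shrinks.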
Circularity Check
No significant circularity; geometric modeling is independent foundation
The paper opens by adopting the modeling choice that the projection head is a trainable Riemannian metric on the backbone representation manifold. All subsequent claims (implicit whitening for linear heads, adaptation of local metrics by nonlinear heads, and the proof of negative Hessian eigenvalues at collapsed equilibria for smooth nonlinear heads) are presented as consequences derived inside this framework. The supplied text shows no equations, self-citations, or fitted parameters that would reduce any of these results to the modeling assumption by construction. Empirical tracking of optimization geometry is described separately from the theoretical derivation. The chain therefore remains self-contained and does not exhibit the explicit reduction that a circularity finding requires.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The projection head can be modeled as a trainable Riemannian metric on the backbone representation manifold.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel / Jcost_pos_of_ne_one (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable"
- IndisputableMonolith/Foundation/BranchSelection.lean: interactionDefect_RCLCombiner / IsCouplingCombiner (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the pullback metric G(z) = Jh(z)⊤Jh(z) is singular ... v⊤G(z)v = 0 for v in Vaug"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.