Backpropagation-Friendly Eigendecomposition

Mathieu Salzmann; Pascal Fua; Wei Wang; Yinlin Hu; Zheng Dang

arxiv: 1906.09023 · v2 · pith:5YREE2MWnew · submitted 2019-06-21 · 💻 cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-25 19:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML

keywords eigendecomposition · backpropagation · deep networks · ZCA whitening · PCA denoising · numerical stability · normalization · covariance matrix

The pith

A new eigendecomposition approach makes backpropagation stable for large matrices in deep networks without splitting the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard eigendecomposition and its power-iteration approximation produce unstable gradients when used inside neural networks, particularly with large matrices. Partitioning matrices into small arbitrary groups avoids the instability but has no theoretical justification and prevents full exploitation of eigendecomposition. The paper presents a differentiable formulation that computes stable gradients for eigenvectors and eigenvalues directly on large matrices. The method is demonstrated on ZCA whitening as a batch-normalization alternative and on a new PCA-based feature denoising step. If the approach works as claimed, networks can incorporate full eigendecomposition operations during training without custom partitioning or gradient clipping.
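One common way to see where this instability comes from: the textbook analytic gradient of a symmetric eigendecomposition (as used in the matrix backpropagation literature, e.g. reference [4]) contains coefficients of the form 1/(λi − λj), which grow without bound as two eigenvalues approach each other. The sketch below only illustrates that standard coefficient matrix; it is not the paper's method, and the helper name `gap_coefficients` is hypothetical.

```python
import numpy as np

def gap_coefficients(eigvals):
    """Hypothetical helper: K[i, j] = 1 / (eigvals[i] - eigvals[j]) for i != j, 0 on the diagonal."""
    lam = np.asarray(eigvals, dtype=np.float64)
    diff = lam[:, None] - lam[None, :]
    off_diag = ~np.eye(lam.size, dtype=bool)
    K = np.zeros_like(diff)
    K[off_diag] = 1.0 / diff[off_diag]
    return K

# Well-separated spectrum: coefficients stay of order one.
k_separated = gap_coefficients([3.0, 2.0, 1.0])
# Two nearly equal eigenvalues: the matching coefficients blow up.
k_close = gap_coefficients([3.0, 2.0, 2.0 - 1e-10])

print(np.abs(k_separated).max())
print(np.abs(k_close).max())
```

Larger covariance matrices have more eigenvalue pairs, so a small gap somewhere in the spectrum becomes more likely, which is consistent with the summary's remark that the problem is worst for large matrices.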

Core claim

The authors introduce a numerically stable and differentiable eigendecomposition procedure that operates on large matrices without partitioning and yields reliable gradients for backpropagation. They report that it is more robust than both direct eigendecomposition and power iteration on ZCA whitening and on the newly introduced PCA denoising normalization.

What carries the argument

Numerically stable differentiable eigendecomposition that computes eigenvector gradients without matrix partitioning.

If this is right

Networks can apply eigendecomposition directly to full-size feature covariance matrices rather than arbitrary partitions.
ZCA whitening layers become more robust to numerical issues than those using standard ED or power iteration.
PCA denoising can serve as a new feature normalization layer that further reduces noise beyond batch normalization.
The full expressive power of eigenvector-based operations becomes available inside differentiable pipelines.
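For readers unfamiliar with ZCA whitening, the forward computation on a full covariance matrix can be sketched as follows. This is a minimal NumPy illustration of the standard definition W = V diag(λ)^(-1/2) V^T, assuming a small damping term `eps` (an assumption of this sketch); it covers the forward pass only and says nothing about how the paper stabilizes the gradients.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten features X of shape (n_samples, d) with the full d x d covariance.

    Returns the whitened features and the symmetric whitening matrix W.
    `eps` is an illustrative damping term, not a value taken from the paper.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    lam, V = np.linalg.eigh(cov)           # eigenvalues ascending, orthonormal columns
    W = (V / np.sqrt(lam + eps)) @ V.T     # V diag((lam + eps)^-1/2) V^T
    return Xc @ W, W

rng = np.random.default_rng(0)
mixing = np.eye(4) + 0.5                   # well-conditioned correlation between channels
X = rng.normal(size=(512, 4)) @ mixing
Xw, W = zca_whiten(X)
cov_w = Xw.T @ Xw / (Xw.shape[0] - 1)      # close to the identity after whitening
```

Because W is built from all eigenvectors of the covariance matrix, backpropagating through this layer requires eigenvector gradients, which is where the stability question arises.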

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same stability technique could be reused for other spectral layers that rely on eigenvectors, such as certain graph convolutions or whitening-based regularizers.
Training dynamics might change in models that currently avoid eigendecomposition because of gradient instability.
The method might allow direct differentiation through operations like orthogonalization or principal-component projection without auxiliary losses.

Load-bearing premise

The proposed eigendecomposition remains numerically stable and produces accurate gradients when applied to the matrix sizes that arise during end-to-end training of deep networks.

What would settle it

Applying the method to a covariance matrix whose size exceeds the largest matrix tested in the ZCA or PCA experiments and observing NaN values or exploding gradients during backpropagation would falsify the stability claim.
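A check of this kind could be scripted as a sweep over matrix sizes. The sketch below is a hypothetical harness: `first_unstable_dim` accepts any function that maps a covariance matrix to a gradient-like array and reports the first dimension at which the result is non-finite or exceeds a bound. The stand-in `naive_gap_coefficients` is only the textbook 1/(λi − λj) term, not the paper's procedure, and the batch size and bound are assumptions.

```python
import numpy as np

def naive_gap_coefficients(cov):
    """Stand-in for a gradient routine: the 1 / (lam_i - lam_j) terms of the textbook ED gradient."""
    lam = np.linalg.eigvalsh(cov)
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)         # 1 / inf = 0 on the diagonal
    with np.errstate(divide="ignore"):
        return 1.0 / diff

def first_unstable_dim(grad_fn, dims, n_samples=32, bound=1e6, seed=0):
    """Return the first d in dims where grad_fn(cov) is non-finite or larger than bound, else None."""
    rng = np.random.default_rng(seed)
    for d in dims:
        X = rng.normal(size=(n_samples, d))
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / (n_samples - 1)
        g = grad_fn(cov)
        if not np.all(np.isfinite(g)) or np.abs(g).max() > bound:
            return d
    return None

print(first_unstable_dim(naive_gap_coefficients, [4, 8, 64]))
```

With 32 samples, a 64-dimensional covariance is rank-deficient, so many eigenvalues coincide near zero and the naive coefficients become huge. A stable method plugged in as `grad_fn` would be expected to pass the same sweep.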

Figures

Figures reproduced from arXiv: 1906.09023 by Mathieu Salzmann, Pascal Fua, Wei Wang, Yinlin Hu, Zheng Dang.

**Figure 1.** (a) shows how the value of (λk/λ1)^k changes w.r.t. the eigenvalue ratio λk/λ1 and the iteration number k. (b) shows the contour of the curved surface in (a).

Number of iterations k = ⌈ln(0.01)/ln(λi/λ1)⌉ for each eigenvalue ratio:

| λi/λ1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.85 | 0.9 | 0.95 | 0.99 | 0.995 | 0.999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| k | 2 | 2 | 3 | 4 | 5 | 6 | 9 | 14 | 19 | 29 | 59 | 299 | 598 | 2995 |
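The iteration counts listed with Figure 1 follow from solving (λi/λ1)^k ≤ threshold for k. The sketch below takes the threshold as a parameter, because the values listed with the figure are reproduced by a threshold of 0.05, whereas the formula as extracted reads 0.01 (which would give larger counts, for example 7 rather than 5 at a ratio of 0.5). Which of the two the paper intends cannot be settled from this page, so treat the threshold as an assumption; the function name `iterations_needed` is hypothetical.

```python
import math

def iterations_needed(ratio, threshold):
    """Smallest integer k with ratio**k <= threshold, for 0 < ratio < 1 and 0 < threshold < 1."""
    return math.ceil(math.log(threshold) / math.log(ratio))

ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99, 0.995, 0.999]
counts = [iterations_needed(r, 0.05) for r in ratios]
print(counts)  # [2, 2, 3, 4, 5, 6, 9, 14, 19, 29, 59, 299, 598, 2995]
```

The steep growth as the ratio approaches 1 is the practical point: power iteration needs thousands of steps to separate eigenvalues that differ by 0.1%.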

**Figure 3.** Performance of our PCA denoising layer as a function of (a) the percentage of preserved information, (b) the number e of eigenvectors retained, (c) the number of power iterations.

From the accompanying text: for the proposed algorithm, no training failure was observed for any matrix dimension from d = 4 to d = 64.

**Figure 4.** Training loss as a function of the number of epochs.

| Methods | Min error | Mean error |
| --- | --- | --- |
| BN | 4.66 | 4.81 ± 0.19 |
| PCA(PI) | 5.05 | 5.35 ± 0.25 |
| PCA(SVD) | NaN | NaN |
| PCA(Ours) | 4.58 | 4.63 ± 0.11 |

Original abstract

Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration (PI) method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data into small and arbitrary groups, doing so has no theoretical basis and makes it impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring them to be split. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.
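For reference, the Power Iteration method mentioned in the abstract can be sketched as below. This is the standard algorithm with deflation for a symmetric positive semi-definite matrix, shown as a forward computation only; the function names, iteration count, and seed are assumptions of this sketch, and it does not reproduce the paper's gradient computation.

```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Estimate the dominant eigenpair of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v                    # Rayleigh quotient and unit eigenvector

def top_eigenpairs(A, k, num_iters=100):
    """Estimate the k leading eigenpairs by repeated power iteration with deflation."""
    A = A.copy()
    pairs = []
    for _ in range(k):
        lam, v = power_iteration(A, num_iters)
        pairs.append((lam, v))
        A -= lam * np.outer(v, v)          # remove the component just found
    return pairs

A = np.diag([4.0, 2.0, 1.0])
pairs = top_eigenpairs(A, 2)
print([round(lam, 6) for lam, _ in pairs])
```

Convergence depends on the ratio between consecutive eigenvalues, so a fixed `num_iters` that is ample here (ratio 0.5) would be far too small for ratios close to 1.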

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical way to run eigendecomposition inside networks on full-size matrices without forced splits, and shows it on ZCA and a new PCA-based normalization.

The main contribution is a differentiable eigendecomposition routine that stays stable on large matrices and does not rely on arbitrary partitioning. That removes a real practical barrier for anyone who wants to use eigenvectors directly in layers for whitening or similar operations. The authors also introduce PCA denoising as a normalization variant and compare it against batch-norm-style methods.

On the positive side, the abstract makes clear that the approach targets the known instability of both direct ED and power iteration during backprop, and the applications to ZCA whitening and the new denoiser give concrete test cases. The claim that it is more robust than the baselines is stated directly. The soft spot is that the abstract supplies no equations, no derivation sketch, and no mention of how the stability is achieved or proved, so it is impossible to judge whether the fix is general or tied to particular matrix properties. Experiments are only summarized at the level of the two applications, with no numbers or ablation details visible here.

If the full paper contains reproducible code or a clear algorithmic description that matches the stability claim, the work is worth a referee's time. It is aimed at people building custom normalization or spectral layers rather than a broad audience. I would send it to review because the problem is genuine and the scope is focused, even if the method needs close checking.

Referee Report

0 major / 2 minor

Summary. The paper claims to introduce a numerically stable, differentiable eigendecomposition procedure suitable for backpropagation through large matrices in deep networks, without requiring arbitrary partitioning. It applies the method to ZCA whitening (as an alternative to batch normalization) and introduces PCA denoising as a new normalization layer, reporting improved robustness over direct ED and Power Iteration in these settings.

Significance. If the stability and differentiability claims hold for large matrices, the work would enable fuller use of spectral methods inside end-to-end training without ad-hoc splits, which could strengthen normalization and feature-processing modules that rely on eigenvectors.

minor comments (2)

[Abstract] The claim of 'better robustness' is stated without any quantitative metrics, dataset names, or matrix sizes; moving even one concrete result into the abstract would strengthen the summary.
The manuscript should clarify whether the new PCA-denoising normalization requires any additional hyperparameters beyond the standard PCA rank or eigenvalue threshold.
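To make the hyperparameter question concrete, here is a minimal sketch of a generic PCA denoising step in which the only knob is the fraction of variance to preserve (Figure 3 refers to a "percentage of preserved information"). It is an illustration of ordinary PCA truncation, not the paper's layer; the function name `pca_denoise`, the default `keep_fraction`, and the synthetic data are assumptions.

```python
import numpy as np

def pca_denoise(X, keep_fraction=0.95):
    """Project features X (n_samples, d) onto the leading principal directions.

    Keeps the smallest number k of eigenvectors whose eigenvalues account for
    at least `keep_fraction` of the total variance, then reconstructs.
    """
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    lam, V = np.linalg.eigh(cov)
    lam, V = lam[::-1], V[:, ::-1]         # sort in descending order
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative round-off
    explained = np.cumsum(lam) / lam.sum()
    k = min(int(np.searchsorted(explained, keep_fraction)) + 1, lam.size)
    Vk = V[:, :k]
    return Xc @ Vk @ Vk.T + mean, k

rng = np.random.default_rng(0)
clean = np.zeros((256, 6))
clean[:, :2] = 3.0 * rng.normal(size=(256, 2))     # signal lives in two directions
noisy = clean + 0.01 * rng.normal(size=(256, 6))   # small isotropic noise
denoised, k = pca_denoise(noisy, keep_fraction=0.95)
print(k)
```

Specifying a rank directly instead of a variance fraction would replace the `searchsorted` line with a fixed k; either way the step needs the leading eigenvectors of the full covariance matrix.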

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the work and the recommendation for minor revision. We are glad that the potential to enable fuller use of spectral methods in end-to-end training is recognized.

Circularity Check

0 steps flagged

No significant circularity

The paper presents a new differentiable eigendecomposition procedure claimed to be numerically stable for large matrices. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain. The derivation of the method (whatever its explicit form) is offered as independent content rather than a renaming or tautological restatement of its inputs. The empirical demonstrations on ZCA and PCA denoising are presented as validation, not as the source of the claimed properties. This is the common case of a self-contained technical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities identifiable from the abstract alone.

pith-pipeline@v0.9.0 · 5673 in / 942 out tokens · 21586 ms · 2026-05-25T19:12:00.877531+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.

[2] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. 2015.

[3] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.

[4] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In CVPR, 2015.

[5] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. 2018.

[6] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of Latent 3D Keypoints via End-To-End Geometric Reasoning. In NIPS, 2018.

[7] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to Find Good Correspondences. In CVPR, 2018.

[8] R. Ranftl and V. Koltun. Deep Fundamental Matrix Estimation. In ECCV, 2018.

[9] Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In ECCV, 2018.

[10] T. Papadopoulo and M. Lourakis. Estimating the jacobian of the singular value decomposition: Theory and applications. In ECCV, pages 554–570, 2000.

[11] Andrei Zanfir and Cristian Sminchisescu. Deep Learning of Graph Matching. In CVPR, 2018.

[12] Yuji Nakatsukasa and Nicholas J Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013.

[13] Richard L. Burden and J. Douglas Faires. Numerical Analysis. Ninth edition, 1989.

[14] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.

[15] Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327–3338, 1997.

[16] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.

[17] Y Murali Mohan Babu, M Venkata Subramanyam, and MN Giri Prasad. Pca based image denoising. Signal & Image Processing, 3(2):236, 2012.

[18] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C Yuen. Dynamic label graph matching for unsupervised video re-identification. In ICCV, 2017.

[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.

[20] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
