arXiv: 1907.07739 · v1 · pith:XS6X37TBnew · submitted 2019-07-17 · 💻 cs.LG · cs.CV · stat.ML

Deep Multi-View Learning via Task-Optimal CCA

Pith reviewed 2026-05-24 20:17 UTC · model grok-4.3

classification: cs.LG · cs.CV · stat.ML
keywords: multi-view learning · canonical correlation analysis · deep CCA · supervised CCA · cross-view classification · semi-supervised learning

The pith

Simultaneously optimizing CCA correlation and a task objective in one deep network produces a shared latent space that is both highly correlated across views and more discriminative.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard CCA ignores class labels and that prior fixes either fail to optimize the projection and the downstream task jointly or remain linear. By training a deep network end-to-end on both the CCA correlation loss and a task loss, the method learns non-linear projections whose latent space supports better cross-view classification, view-based regularization, and semi-supervised learning. A sympathetic reader would care because multi-view data appear in many practical settings where one view is expensive or missing at test time.

Core claim

Simultaneously optimizing a CCA-based objective and a task objective in an end-to-end manner learns a non-linear CCA projection to a shared latent space that is highly correlated across views and discriminative for the task.

What carries the argument

Joint end-to-end optimization of the CCA correlation objective together with a task-specific objective inside a single deep network.
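
As a minimal sketch of what such a joint objective can look like, the Python below adds a task loss (softmax cross-entropy) to a weighted cross-view term. The source states only that a CCA-based objective and a task objective are optimized together; the additive form, the weight `lam`, the function names, and the use of a mean squared ℓ2 distance between paired projections as the correlation surrogate are assumptions made for illustration (the Figure 1 caption describes the ℓ2 distance as equivalent to sum correlation only when activations are normalized to unit variance).

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, stabilized by subtracting the max."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def l2_distance(z1, z2):
    """Mean squared distance between paired projections of the two views."""
    n = len(z1)
    return sum(
        sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(z1, z2)
    ) / n


def joint_loss(z1, z2, logits, labels, lam=1.0):
    """Hypothetical joint objective: task loss + lam * cross-view term.

    z1, z2 : lists of projected samples, one row per paired sample
    logits : classifier outputs computed from the shared latent space
    labels : integer class labels
    lam    : assumed scalar weight balancing the two terms
    """
    task = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    corr = l2_distance(z1, z2)
    return task + lam * corr, task, corr
```

Setting `lam=0` reduces this sketch to a purely supervised loss; in a real network both terms would be backpropagated through the two view encoders.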

If this is right

  • The approach yields measurable gains in cross-view classification accuracy over prior state-of-the-art methods including deep supervised baselines.
  • The same joint objective supplies effective regularization when a second view is available at training time.
  • Performance improves in semi-supervised regimes where labels are scarce for one or both views.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-training pattern could be applied to other multi-modal fusion problems that currently treat alignment and prediction as separate stages.
  • Stability of the combined loss could be tested by scaling the method to larger or noisier view pairs where correlation and discrimination objectives conflict more strongly.

Load-bearing premise

Jointly optimizing the two objectives will keep the latent space highly correlated while making it more discriminative, without the combined loss introducing instability or overfitting that cancels the gain.
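
One way to monitor the first half of this premise is to track sum correlation on held-out pairs while the task loss is trained. The sketch below is a proxy under stated assumptions, not the paper's evaluation code: it sums the Pearson correlation of matched output dimensions, which matches the sum of canonical correlations only under additional conditions (for example, whitened components within each view and no correlation between unmatched dimensions). Otherwise it should be treated as a rough indicator. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sum_correlation(z1, z2):
    """Sum of per-dimension correlations between paired projections.

    z1, z2: lists of rows, one row per paired sample, same dimensionality.
    """
    d = len(z1[0])
    return sum(
        pearson([row[k] for row in z1], [row[k] for row in z2])
        for k in range(d)
    )
```

For a d-dimensional latent space the value lies in [-d, d]; a drop in this quantity as the task weight grows would indicate that discrimination is being bought at the cost of cross-view agreement.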

What would settle it

A direct comparison on the same real multi-view datasets in which the jointly trained model fails to outperform either linear supervised CCA or a deep network that trains the CCA projection and the task head separately.

Figures

Figures reproduced from arXiv: 1907.07739 by Charles M. Perou, Heather D. Couture, J.S. Marron, Marc Niethammer, Melissa Troester, Roland Kwitt.

Figure 1: Deep CCA architectures: (a) DCCA maximizes the sum correlation in projection space by optimizing an equivalent loss, the trace norm objective (TNO) [3]; (b) SoftCCA relaxes the orthogonality constraints by regularizing with soft decorrelation (Decorr) and optimizes the ℓ2 distance in the projection space (equivalent to sum correlation with activations normalized to unit variance) [8]. Our TOCCA methods add…
Figure 2: Left: Sum correlation vs. cross-view classification accuracy (on MNIST) across different hyperparameter settings on a training set size of 10,000 for DCCA [3], SoftCCA [8], TOCCA-W, and TOCCA-SD. For unsupervised methods (DCCA and SoftCCA), large correlations do not necessarily imply good accuracy. Right: The effect of batch size on classification accuracy for each TOCCA method on MNIST (training set size…
Figure 3: t-SNE plots for CCA methods on our variation of MNIST. Each method was used to compute projections for the two views (left and right sides of the images) using 10,000 training examples. The plots show a visualization of the projection for the left view with each digit colored differently. TOCCA-SD and TOCCA-ND (not shown) produced similar results to TOCCA-W.
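
The soft decorrelation regularizer named in the Figure 1 caption can be sketched as a penalty on off-diagonal covariance. The version below is a simplified illustration under stated assumptions: it uses the covariance of a single batch and an absolute-value penalty, and the function name is hypothetical. The cited SoftCCA work [8] should be consulted for the exact formulation.

```python
def decorrelation_penalty(z):
    """Sum of absolute off-diagonal covariances of a batch.

    z: list of rows, one row per sample, each row a list of d activations.
    Returns 0.0 when all pairs of output dimensions are uncorrelated.
    """
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                # Biased (divide-by-n) covariance between dimensions i and j.
                cov = sum(r[i] * r[j] for r in centered) / n
                penalty += abs(cov)
    return penalty
```

Because the penalty is soft, it discourages rather than forbids correlated output dimensions, which is what lets it be added to a loss and trained with mini-batches.
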
Abstract

Canonical Correlation Analysis (CCA) is widely used for multimodal data analysis and, more recently, for discriminative tasks such as multi-view learning; however, it makes no use of class labels. Recent CCA methods have started to address this weakness but are limited in that they do not simultaneously optimize the CCA projection for discrimination and the CCA projection itself, or they are linear only. We address these deficiencies by simultaneously optimizing a CCA-based and a task objective in an end-to-end manner. Together, these two objectives learn a non-linear CCA projection to a shared latent space that is highly correlated and discriminative. Our method shows a significant improvement over previous state-of-the-art (including deep supervised approaches) for cross-view classification, regularization with a second view, and semi-supervised learning on real data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Task-Optimal CCA, a deep multi-view learning approach that jointly optimizes a CCA correlation objective together with a task-specific loss inside a single end-to-end network. This produces a non-linear projection onto a shared latent space that is both highly correlated across views and more discriminative than those obtained by separately trained or purely supervised baselines. Real-data experiments are reported to show significant gains over prior state-of-the-art (including deep supervised methods) on cross-view classification, view-regularized supervised learning, and semi-supervised settings.

Significance. If the empirical gains are reproducible and not artifacts of hyper-parameter tuning or dataset choice, the work would constitute a useful practical advance in multi-view representation learning by closing the gap between correlation-maximizing and task-driven objectives inside the same differentiable pipeline. The method directly tests whether end-to-end joint optimization can simultaneously satisfy the two desiderata that standard CCA and earlier deep CCA variants address only sequentially.

major comments (2)
  1. The central empirical claim rests on the joint loss producing a latent space that remains both correlated and task-discriminative; the manuscript should report the sensitivity of the reported gains to the relative weighting of the CCA and task terms (a free parameter listed in the axiom ledger) and include an ablation that isolates the contribution of each term.
  2. Because the method is evaluated on cross-view classification, regularization with a second view, and semi-supervised regimes, the experimental section must clarify whether the same network architecture, loss weighting schedule, and optimization protocol are used across all three settings or whether task-specific tuning undermines the claim of a single unified approach.
minor comments (2)
  1. Notation for the combined objective should be introduced once with explicit symbols for the CCA term, the task term, and the weighting hyper-parameter rather than being redefined inline in each experimental subsection.
  2. The abstract states 'significant improvement' without numerical deltas; the results section should include a table that directly compares the proposed method against the cited baselines on each task with the same metrics and error bars.
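
To make minor comment 1 and the ablation requested in major comment 1 concrete, one possible notation is sketched below. The symbols are assumptions chosen for illustration, not the paper's own: $\theta$ denotes all network parameters, $\mathcal{L}_{\mathrm{task}}$ the task loss, $\mathcal{L}_{\mathrm{CCA}}$ the correlation term, and $\lambda \ge 0$ the weighting hyper-parameter.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{task}}(\theta) \;+\; \lambda\,\mathcal{L}_{\mathrm{CCA}}(\theta),
\qquad \lambda \ge 0 .
```

Under this notation, $\lambda = 0$ gives the task-only ablation, the CCA-only ablation trains on $\mathcal{L}_{\mathrm{CCA}}$ alone and fits the classifier afterwards, and the sensitivity analysis sweeps $\lambda$ over a grid.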

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the two major comments below and will revise the manuscript accordingly to strengthen the empirical presentation.

  1. Referee: The central empirical claim rests on the joint loss producing a latent space that remains both correlated and task-discriminative; the manuscript should report the sensitivity of the reported gains to the relative weighting of the CCA and task terms (a free parameter listed in the axiom ledger) and include an ablation that isolates the contribution of each term.

    Authors: We agree that sensitivity analysis and ablation are valuable for validating the joint objective. In the revised manuscript we will add (i) an ablation isolating the CCA term versus the task term and (ii) results across a range of weighting values for the CCA loss to demonstrate that the reported gains are not artifacts of a single hyper-parameter choice. revision: yes

  2. Referee: Because the method is evaluated on cross-view classification, regularization with a second view, and semi-supervised regimes, the experimental section must clarify whether the same network architecture, loss weighting schedule, and optimization protocol are used across all three settings or whether task-specific tuning undermines the claim of a single unified approach.

    Authors: The same network architecture, loss weighting schedule, and optimization protocol are used in all three regimes; only the supervision signal (labels or lack thereof) changes according to the task. We will add an explicit statement and a summary table in the experimental section of the revision to remove any ambiguity about the unified protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical method for jointly optimizing a CCA correlation objective and a task-specific loss in an end-to-end deep network to produce a shared latent space. All central claims concern measurable performance gains on real-data tasks (cross-view classification, view regularization, semi-supervised learning) relative to baselines; these are validated experimentally rather than derived from a closed mathematical chain. No equations reduce a claimed prediction to a fitted parameter by construction, no uniqueness theorems are imported via self-citation, and no ansatz or renaming of known results is presented as a derivation. The work is therefore self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only review; implementation details, network architectures, loss weighting, and optimization choices are unknown, so the ledger is necessarily incomplete.

free parameters (2)
  • network depth and width
    Deep non-linear CCA requires choosing architecture hyperparameters that are fitted or selected on data.
  • loss weighting between CCA and task terms
    Joint optimization requires a scalar that balances the two objectives and is chosen to achieve reported performance.
axioms (1)
  • domain assumption: paired multi-view samples with class labels are available for training
    The method is described for supervised and semi-supervised multi-view settings that presuppose such paired labeled data.


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 6 internal anchors

  [1] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, Dec 1936.
  [2] Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–167. Springer Berlin Heidelberg, 2005.
  [3] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep Canonical Correlation Analysis. In Proc. ICML, 2013.
  [4] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In Proc. ICML, 2015.
  [5] Weiran Wang, Raman Arora, Karen Livescu, and Nathan Srebro. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In Proc. Allerton Conference on Communication, Control, and Computing, 2016.
  [6] Meina Kan, Shiguang Shan, Haihong Zhang, Shihong Lao, and Xilin Chen. Multi-view Discriminant Analysis. IEEE PAMI, 2015.
  [7] Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational Neural Networks. Neural Computation, 28(2):257–285, Feb 2016.
  [8] Xiaobin Chang, Tao Xiang, and Timothy M. Hospedales. Scalable and Effective Deep CCA via Soft Decorrelation. In Proc. CVPR, 2018.
  [9] Mehmet Emre Sargin, Yücel Yemez, Engin Erzin, and A Murat Tekalp. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7):1396–1403, 2007.
  [10] Matthias Dorfer and Gerhard Widmer. Towards Deep and Discriminative Canonical Correlation Analysis. In Proc. ICML Workshop on Multi-view Representation Learning, 2016.
  [11] Raman Arora and Karen Livescu. Kernel CCA for multi-view learning of acoustic features using articulatory measurements. In Symposium on Machine Learning in Speech and Language Processing, 2012.
  [12] George Lee, Asha Singanamalli, Haibo Wang, Michael D Feldman, Stephen R Master, Natalie N C Shih, Elaine Spangler, Timothy Rebbeck, John E Tomaszewski, and Anant Madabhushi. Supervised multi-view canonical correlation analysis (sMVCCA): integrating histologic and proteomic features for predicting recurrent prostate cancer. IEEE Transactions on Medical Ima...
  [13] Asha Singanamalli, Haibo Wang, George Lee, Natalie Shih, Mark Rosen, Stephen Master, John Tomaszewski, Michael Feldman, and Anant Madabhushi. Supervised multi-view canonical correlation analysis: fused multimodal prediction of disease diagnosis and prognosis. In Proc. SPIE Medical Imaging, 2014.
  [14] Kanghong Duan, Hongxin Zhang, and Jim Jing Yan Wang. Joint learning of cross-modal classifier and factor analysis for multimedia data classification. Neural Computing and Applications, 27(2):459–468, Feb 2016.
  [15] Matthias Dorfer, Jan Schlüter, Andreu Vall, Filip Korzeniowski, and Gerhard Widmer. End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss. International Journal of Multimedia Information Retrieval, 7(2):117–128, Jun 2018.
  [16] Sumit Shekhar, Vishal M Patel, Nasser M Nasrabadi, and Rama Chellappa. Joint sparse representation for robust multimodal biometrics recognition. IEEE PAMI, 36(1):113–26, Jan 2014.
  [17] Xing Xu, Atsushi Shimada, Rin-ichiro Taniguchi, and Li He. Coupled dictionary learning and feature mapping for cross-modal retrieval. In Proc. International Conference on Multimedia and Expo, 2015.
  [18] Miriam Cha, Youngjune Gwon, and H. T. Kung. Multimodal sparse representation learning and applications. arXiv preprint: 1511.06238, 2015.
  [19] Soheil Bahrampour, Nasser M. Nasrabadi, Asok Ray, and W. Kenneth Jenkins. Multimodal Task-Driven Dictionary Learning for Image Classification. arXiv preprint: 1502.01094, 2015.
  [20] Gaurav Bhatt, Piyush Jha, and Balasubramanian Raman. Common Representation Learning Using Step-based Correlation Multi-Modal CNN. arXiv preprint: 1711.00003, 2017.
  [21] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated Batch Normalization. In Proc. CVPR, 2018.
  [22] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  [23] Natalia Y. Bilenko and Jack L. Gallant. Pyrcca: regularized kernel canonical correlation analysis in Python and its applications to neuroimaging. Frontiers in Neuroinformatics, 10, Nov 2016.
  [24] Dongge Li, Nevenka Dimitrova, Mingkun Li, and Ishwar K. Sethi. Multimedia content processing through cross-modal association. In Proc. ACM International Conference on Multimedia, 2003.
  [25] Dominic Masters and Carlo Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv preprint: 1804.07612, 2018.
  [26] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. ICML, 2015.
  [27] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.
  [28] Matthias Dorfer, Rainer Kelz, and Gerhard Widmer. Deep linear discriminant analysis. In Proc. ICLR, 2016.
  [29] Jared Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep Survival: A Deep Cox Proportional Hazards Network. arXiv preprint: 1606.00931, 2016.
  [30] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 2018.
  [31] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. arXiv preprint: 1512.00809, 2015.
  [32] L Van Der Maaten and G Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:26, 2008.
  [33] MA Troester, Xuezheng Sun, Emma H. Allott, Joseph Geradts, Stephanie M Cohen, Chui Kit Tse, Erin L. Kirk, Leigh B Thorne, Michelle Matthews, Yan Li, Zhiyuan Hu, Whitney R. Robinson, Katherine A. Hoadley, Olufunmilayo I. Olopade, Katherine E. Reeder-Hayes, H. Shelton Earp, Andrew F. Olshan, LA Carey, and Charles M. Perou. Racial differences in PAM50 subtyp...
  [34] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR, 2015.
  [35] Joel S Parker, Michael Mullins, Maggie CU Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, Christiane Fauron, Xiaping He, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, 27(8):1160–1167, 2009.
  [36] Weiran Wang, Raman Arora, Karen Livescu, and Jeff A. Bilmes. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Proc. ICASSP, 2015.
  [37] Eric F Lock, Katherine A Hoadley, J S Marron, and Andrew B Nobel. Joint and Individual Variation Explained (JIVE) for Integrated Analysis of Multiple Data Types. The Annals of Applied Statistics, 7(1):523–542, Mar 2013.
  [38] Priyadip Ray, Lingling Zheng, Joseph Lucas, and Lawrence Carin. Bayesian joint analysis of heterogeneous genomics data. Bioinformatics, 30(10):1370–6, May 2014.
  [39] Qing Feng, Meilei Jiang, Jan Hannig, and JS Marron. Angle-based joint and individual variation explained. Journal of Multivariate Analysis, 166:241–265, 2018.