Task-Driven Common Representation Learning via Bridge Neural Network

Meiyu Huang; Xueshuang Xiang; Yao Xu

arxiv: 1906.10897 · v1 · pith:BMJDS3ZNnew · submitted 2019-06-26 · 💻 cs.LG · stat.ML

Task-Driven Common Representation Learning via Bridge Neural Network

Yao Xu , Xueshuang Xiang , Meiyu Huang This is my paper

Pith reviewed 2026-05-25 15:37 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords bridge neural networkcommon representation learningnegative samplestotal correlationconvolutional neural networkspair matchingtransfer learningcanonical correlation analysis

0 comments

The pith

Bridge neural network learns common representations between two data sources by training with artificial negative samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the bridge neural network to extract task-specific common representations from two separate data sources. Two convolutional neural networks map each source into a shared feature space. The training objective relies on artificial negative samples, which supports mini-batch training and is shown through analysis to be asymptotically equivalent to maximizing the total correlation between the sources. Experiments on pair matching, canonical correlation analysis, transfer learning, and reconstruction tasks report state-of-the-art results.

Core claim

The bridge neural network consists of two convolutional neural networks that project two given data sources into a common feature space. Its training uses artificial negative samples in a manner that permits mini-batch optimization, and the paper establishes that this objective is asymptotically equivalent to maximizing the total correlation of the two data sources. The resulting common representations deliver state-of-the-art performance on pair matching, canonical correlation analysis, transfer learning, and reconstruction.

What carries the argument

The bridge neural network formed by two convolutional neural networks that map paired data sources into a shared feature space, trained via artificial negative samples to capture dependence.

If this is right

The negative-sample objective enables efficient training on large paired datasets without explicit pairwise correlation computation.
The learned common representation transfers directly to downstream tasks such as matching or cross-domain prediction.
The asymptotic link to total correlation supplies a theoretical justification for why the method recovers shared structure.
The same architecture applies across multiple tasks without task-specific redesign of the correlation term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The negative-sample approach may connect to contrastive methods and could be adapted to unpaired or multi-view settings.
Finite-sample behavior may deviate from the asymptotic equivalence, suggesting value in studying sample-size effects.
The framework could extend to more than two sources by chaining multiple bridges.
Reconstruction performance indicates the representation preserves information from both sources, which might aid generative tasks.

Load-bearing premise

That two separate convolutional networks can map the data sources into a feature space in which negative-sample training both succeeds in learning a common representation and produces one that is useful for the target task.

What would settle it

A paired dataset on which the negative-sample training yields low measured total correlation between the projected sources, or on which the bridge network fails to match or exceed standard baselines on the reported tasks.

Figures

Figures reproduced from arXiv: 1906.10897 by Meiyu Huang, Xueshuang Xiang, Yao Xu.

**Figure 2.** Figure 2: A framework of using Bridge Neural Network [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: A schematic of bridge neural network, which employs two convolutional neural networks that project two given data [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: (a) Top 100 false positive samples. Samples out [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: A schematic of bridge neural network with recon [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Reconstruction results of BNN for MNIST. (a) Left [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

This paper introduces a novel deep learning based method, named bridge neural network (BNN) to dig the potential relationship between two given data sources task by task. The proposed approach employs two convolutional neural networks that project the two data sources into a feature space to learn the desired common representation required by the specific task. The training objective with artificial negative samples is introduced with the ability of mini-batch training and it's asymptotically equivalent to maximizing the total correlation of the two data sources, which is verified by the theoretical analysis. The experiments on the tasks, including pair matching, canonical correlation analysis, transfer learning, and reconstruction demonstrate the state-of-the-art performance of BNN, which may provide new insights into the aspect of common representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BNN proposes dual CNN projections plus negative-sample objective claimed asymptotically equivalent to total correlation, but the abstract supplies no derivations or experimental details to check either claim.

read the letter

The core idea here is a Bridge Neural Network that runs two separate CNNs on two data sources, projects them into a shared feature space, and trains with artificial negative samples so the objective can run in mini-batches. The authors say this objective is asymptotically equivalent to maximizing total correlation between the sources, and they report state-of-the-art numbers on pair matching, CCA, transfer learning, and reconstruction. That combination of architecture and objective looks like the actual technical step beyond the deep CCA baselines they cite. The task-driven framing is also a reasonable way to motivate the work for applied multi-view settings. The practical upside is that the negative-sample trick lets them avoid full-batch computations, which matters for real data sizes. On the other side, the abstract states both the theoretical equivalence and the SOTA results without showing the derivation steps, the exact loss, the baselines, or any error bars. Without those pieces it is not possible to tell whether the equivalence is independent or circular, or whether the experiments actually control for standard multi-view methods. The circularity burden noted in the reader report is the main open question. This is squarely for people already working on multi-view or multi-modal representation learning who need a new training trick for paired data. A reader in that subfield could extract the architecture and objective and test them, but the paper does not resolve broader questions about common representations. I would bring it to a reading group focused on recent contrastive or multi-view objectives, but not as a general-interest item. I would not cite it on the basis of the abstract alone. The work is coherent enough on its own terms to deserve referee time so the theory and numbers can be checked.

Referee Report

2 major / 0 minor

Summary. The paper introduces the Bridge Neural Network (BNN), which employs two separate CNNs to project two data sources into a shared feature space and learn task-driven common representations. It proposes a training objective based on artificial negative samples that supports mini-batch training and claims this objective is asymptotically equivalent to maximizing the total correlation between the sources, with the equivalence verified by theoretical analysis. Experiments on pair matching, canonical correlation analysis, transfer learning, and reconstruction are asserted to demonstrate state-of-the-art performance.

Significance. If the claimed asymptotic equivalence can be shown to hold independently and the empirical results prove robust with proper controls, the work would offer a practical bridge between negative-sampling objectives and information-theoretic measures of dependence, with potential value for multi-view and multi-modal representation learning.

major comments (2)

[Abstract] Abstract: the central claim that the negative-sample objective is asymptotically equivalent to total correlation is presented as verified by theoretical analysis, yet no derivation steps, intermediate results, or conditions are supplied, preventing verification that the analysis is non-circular and independent of the method's own definitions.
[Abstract] Abstract / Experiments: the assertion of state-of-the-art performance on pair matching, CCA, transfer learning, and reconstruction lacks any mention of baselines, error bars, statistical significance, or dataset details, so the empirical support for the central claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address each major comment below. Where revisions are needed to improve clarity in the abstract, we will incorporate them in the revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the negative-sample objective is asymptotically equivalent to total correlation is presented as verified by theoretical analysis, yet no derivation steps, intermediate results, or conditions are supplied, preventing verification that the analysis is non-circular and independent of the method's own definitions.

Authors: We agree that the abstract is too terse on this point. The full manuscript (Section 3) derives the equivalence by showing that the negative-sampling loss converges to the total correlation as the number of negative samples tends to infinity, using the definition of total correlation as the sum of mutual informations and standard limit arguments on the softmax normalization. The derivation is independent of the BNN architecture itself. To make this verifiable from the abstract, we will add a concise clause stating the key limit condition and the information-theoretic connection. revision: yes
Referee: [Abstract] Abstract / Experiments: the assertion of state-of-the-art performance on pair matching, CCA, transfer learning, and reconstruction lacks any mention of baselines, error bars, statistical significance, or dataset details, so the empirical support for the central claim cannot be evaluated.

Authors: The abstract is intentionally brief, but the referee is correct that it should indicate the nature of the empirical support. Sections 4–5 of the manuscript report comparisons against standard CCA, DCCA, DCCAE, and other baselines on MNIST, CIFAR-10, and multi-view datasets, with means and standard deviations over 5–10 runs and paired t-tests for significance. We will revise the abstract to mention that results are reported against established baselines with statistical controls on standard benchmark datasets. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's central claim is an asymptotic equivalence (verified by theoretical analysis) between a negative-sample objective and total correlation of two sources, plus separate empirical SOTA results on downstream tasks. No quoted equations, self-citations, or steps in the supplied text reduce this equivalence to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. The derivation is presented as independent theoretical work, and the method is not forced by its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that CNN projections can produce a task-useful common space and on the paper-specific claim that the negative-sample objective is asymptotically equivalent to total correlation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption Two convolutional neural networks can project the two data sources into a feature space where a common representation useful for the specific task exists.
This premise is required for the dual-CNN architecture to learn the desired common representation.
ad hoc to paper The training objective with artificial negative samples is asymptotically equivalent to maximizing total correlation.
This equivalence is the key theoretical justification supplied by the paper's analysis.

pith-pipeline@v0.9.0 · 5645 in / 1368 out tokens · 36228 ms · 2026-05-25T15:37:06.570311+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page
[2]

Akaho, S. 2006. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071

work page internal anchor Pith review Pith/arXiv arXiv 2006
[3]

Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning , 1247--1255

work page 2013
[4]

Arora, R., and Livescu, K. 2013. Multi-view cca-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on , 7135--7139. IEEE

work page 2013
[5]

R., and Jordan, M

Bach, F. R., and Jordan, M. I. 2002. Kernel independent component analysis. Journal of machine learning research 3(Jul):1--48

work page 2002
[6]

M.; Larochelle, H.; and Ravindran, B

Chandar, S.; Khapra, M. M.; Larochelle, H.; and Ravindran, B. 2016. Correlational neural networks. Neural computation 28(2):257--285

work page 2016
[7]

Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , volume 1, 539--546. IEEE

work page 2005
[8]

P.; and Ungar, L

Dhillon, P.; Foster, D. P.; and Ungar, L. H. 2011. Multi-view learning of word embeddings via cca. In Advances in neural information processing systems , 199--207

work page 2011
[9]

Eisenschtat, A., and Wolf, L. 2017. Linking image and text with 2-way nets. arXiv preprint

work page 2017
[10]

R.; Mourao-Miranda, J.; Brammer, M.; and Shawe-Taylor, J

Hardoon, D. R.; Mourao-Miranda, J.; Brammer, M.; and Shawe-Taylor, J. 2007. Unsupervised analysis of fmri data using kernel canonical correlation. NeuroImage 37(4):1250--1259

work page 2007
[11]

R.; Szedmak, S.; and Shawe-Taylor, J

Hardoon, D. R.; Szedmak, S.; and Shawe-Taylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation 16(12):2639--2664

work page 2004
[12]

Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321--377

work page 1936
[13]

Kim, T.-K.; Wong, S.-F.; and Cipolla, R. 2007. Tensor canonical correlation analysis for action classification. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on , 1--8. IEEE

work page 2007
[14]

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278--2324

work page 1998
[15]

Melzer, T.; Reiter, M.; and Bischof, H. 2001. Nonlinear feature extraction using generalized canonical correlation analysis. In International Conference on Artificial Neural Networks , 353--360. Springer

work page 2001
[16]

Michaeli, T.; Wang, W.; and Livescu, K. 2016. Nonparametric canonical correlation analysis. In International Conference on Machine Learning , 1967--1976

work page 2016
[17]

Mineiro, P., and Karampatziakis, N. 2014. A randomized algorithm for cca. arXiv preprint arXiv:1411.3409

work page internal anchor Pith review Pith/arXiv arXiv 2014
[18]

Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11) , 689--696

work page 2011
[19]

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in python. Journal of machine learning research 12(Oct):2825--2830

work page 2011
[20]

Vinod, H. D. 1976. Canonical ridge and econometrics of joint production. Journal of econometrics 4(2):147--166

work page 1976
[21]

Wang, W.; Arora, R.; Livescu, K.; and Bilmes, J. 2015. On deep multi-view representation learning. In International Conference on Machine Learning , 1083--1092

work page 2015
[22]

Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition , 5005--5013

work page 2016
[23]

Westbury, J. 1994. X-ray microbeam speech production database user’s handbook: Madison. WI: Waisman Center, University of Wisconsin

work page 1994
[24]

Xu, C.; Tao, D.; and Xu, C. 2013. A survey on multi-view learning. Computer Science

work page 2013
[25]

Yan, F., and Mikolajczyk, K. 2015. Deep correlation for matching images and text. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on , 3441--3450. IEEE

work page 2015
[26]

Zhao, J.; Xie, X.; Xu, X.; and Sun, S. 2017. Multi-view learning overview: Recent progress and new challenges. Information Fusion 38:43--54

work page 2017

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page

[2] [2]

Akaho, S. 2006. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071

work page internal anchor Pith review Pith/arXiv arXiv 2006

[3] [3]

Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning , 1247--1255

work page 2013

[4] [4]

Arora, R., and Livescu, K. 2013. Multi-view cca-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on , 7135--7139. IEEE

work page 2013

[5] [5]

R., and Jordan, M

Bach, F. R., and Jordan, M. I. 2002. Kernel independent component analysis. Journal of machine learning research 3(Jul):1--48

work page 2002

[6] [6]

M.; Larochelle, H.; and Ravindran, B

Chandar, S.; Khapra, M. M.; Larochelle, H.; and Ravindran, B. 2016. Correlational neural networks. Neural computation 28(2):257--285

work page 2016

[7] [7]

Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , volume 1, 539--546. IEEE

work page 2005

[8] [8]

P.; and Ungar, L

Dhillon, P.; Foster, D. P.; and Ungar, L. H. 2011. Multi-view learning of word embeddings via cca. In Advances in neural information processing systems , 199--207

work page 2011

[9] [9]

Eisenschtat, A., and Wolf, L. 2017. Linking image and text with 2-way nets. arXiv preprint

work page 2017

[10] [10]

R.; Mourao-Miranda, J.; Brammer, M.; and Shawe-Taylor, J

Hardoon, D. R.; Mourao-Miranda, J.; Brammer, M.; and Shawe-Taylor, J. 2007. Unsupervised analysis of fmri data using kernel canonical correlation. NeuroImage 37(4):1250--1259

work page 2007

[11] [11]

R.; Szedmak, S.; and Shawe-Taylor, J

Hardoon, D. R.; Szedmak, S.; and Shawe-Taylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation 16(12):2639--2664

work page 2004

[12] [12]

Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321--377

work page 1936

[13] [13]

Kim, T.-K.; Wong, S.-F.; and Cipolla, R. 2007. Tensor canonical correlation analysis for action classification. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on , 1--8. IEEE

work page 2007

[14] [14]

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278--2324

work page 1998

[15] [15]

Melzer, T.; Reiter, M.; and Bischof, H. 2001. Nonlinear feature extraction using generalized canonical correlation analysis. In International Conference on Artificial Neural Networks , 353--360. Springer

work page 2001

[16] [16]

Michaeli, T.; Wang, W.; and Livescu, K. 2016. Nonparametric canonical correlation analysis. In International Conference on Machine Learning , 1967--1976

work page 2016

[17] [17]

Mineiro, P., and Karampatziakis, N. 2014. A randomized algorithm for cca. arXiv preprint arXiv:1411.3409

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11) , 689--696

work page 2011

[19] [19]

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in python. Journal of machine learning research 12(Oct):2825--2830

work page 2011

[20] [20]

Vinod, H. D. 1976. Canonical ridge and econometrics of joint production. Journal of econometrics 4(2):147--166

work page 1976

[21] [21]

Wang, W.; Arora, R.; Livescu, K.; and Bilmes, J. 2015. On deep multi-view representation learning. In International Conference on Machine Learning , 1083--1092

work page 2015

[22] [22]

Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition , 5005--5013

work page 2016

[23] [23]

Westbury, J. 1994. X-ray microbeam speech production database user’s handbook: Madison. WI: Waisman Center, University of Wisconsin

work page 1994

[24] [24]

Xu, C.; Tao, D.; and Xu, C. 2013. A survey on multi-view learning. Computer Science

work page 2013

[25] [25]

Yan, F., and Mikolajczyk, K. 2015. Deep correlation for matching images and text. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on , 3441--3450. IEEE

work page 2015

[26] [26]

Zhao, J.; Xie, X.; Xu, X.; and Sun, S. 2017. Multi-view learning overview: Recent progress and new challenges. Information Fusion 38:43--54

work page 2017