Hierarchical Contrastive Learning for Multimodal Data

Doudou Zhou; Huichao Li; Junhan Yu

arXiv: 2604.05462 · v1 · submitted 2026-04-07 · stat.ML · cs.LG · math.ST · stat.TH

Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3

classification stat.ML · cs.LG · math.ST · stat.TH

keywords hierarchical contrastive learning · multimodal data · latent variable models · identifiability · contrastive learning · electronic health records

The pith

Hierarchical Contrastive Learning decomposes multimodal data into globally shared, partially shared, and modality-specific latent factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal representation learning typically relies on a binary shared-private decomposition of latent information, but this overlooks factors shared only among subsets of modalities. The paper proposes Hierarchical Contrastive Learning (HCL) as a unified framework that captures globally shared, partially shared, and modality-specific representations using a hierarchical latent-variable model with structural sparsity. It employs a structure-aware contrastive objective to align modalities only when they truly share a factor. Under the assumption of uncorrelated latent variables, the authors prove identifiability of this hierarchical decomposition and provide recovery guarantees along with estimation and risk bounds. This matters because it prevents over-aligning unrelated signals and allows better use of complementary information, as validated in simulations and on electronic health records data.

Core claim

HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction.
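To make the "align only where a factor is shared" idea concrete, here is a minimal NumPy sketch. It is not the paper's objective: it applies a standard symmetric InfoNCE loss to each pair of modalities listed as carrying a given factor and skips every other pair. The function names, the `sharing` dictionary, the temperature and the toy data are all assumptions made for illustration.

```python
import numpy as np


def info_nce(za: np.ndarray, zb: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss between two (n, k) embedding matrices.

    Row i of za and row i of zb come from the same sample (a positive pair);
    every other row in the batch acts as a negative.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (n, n) scaled cosine similarities

    def row_loss(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_softmax)))

    return 0.5 * (row_loss(logits) + row_loss(logits.T))


def structure_aware_loss(embeddings, sharing, temperature=0.1):
    """Sum InfoNCE terms only over modality pairs that share a factor.

    embeddings: dict mapping (modality, factor) -> (n, k) array (hypothetical layout).
    sharing: dict mapping factor -> tuple of modalities that carry it.
    """
    total = 0.0
    for factor, modalities in sharing.items():
        for i, a in enumerate(modalities):
            for b in modalities[i + 1:]:
                total += info_nce(
                    embeddings[(a, factor)], embeddings[(b, factor)], temperature
                )
    return total


# Toy data: one global factor and one factor shared by modalities 0 and 1 only.
rng = np.random.default_rng(0)
n, k = 64, 8
g, p = rng.normal(size=(n, k)), rng.normal(size=(n, k))
sharing = {"global": (0, 1, 2), "partial_01": (0, 1)}
emb = {(m, "global"): g + 0.05 * rng.normal(size=(n, k)) for m in (0, 1, 2)}
emb.update({(m, "partial_01"): p + 0.05 * rng.normal(size=(n, k)) for m in (0, 1)})

loss_aligned = structure_aware_loss(emb, sharing)

# Break the sample correspondence for modality 1 and recompute.
broken = dict(emb)
broken[(1, "global")] = np.roll(emb[(1, "global")], 1, axis=0)
broken[(1, "partial_01")] = np.roll(emb[(1, "partial_01")], 1, axis=0)
loss_broken = structure_aware_loss(broken, sharing)
```

Modality 2 has no `partial_01` embedding in this toy setup, and the loss never asks for one, so it is never pulled toward a factor it does not carry. That is the design choice the sketch is meant to show.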

What carries the argument

A hierarchical latent-variable formulation with structural sparsity, paired with a structure-aware contrastive objective.
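One way to write such a formulation down is sketched here in notation that is assumed for illustration and not taken from the paper. It treats the model as linear, which is suggested by the abstract's reference to loading matrices, and it additionally assumes that the noise terms are uncorrelated across modalities and with the latents.

```latex
% Assumed notation (not the paper's): M modalities, observation x^{(m)},
% latent block z_S attached to a nonempty subset S of modalities,
% loading matrix A^{(m)}_S, noise \varepsilon^{(m)}.
\begin{aligned}
x^{(m)} &= \sum_{S} A^{(m)}_{S}\, z_{S} + \varepsilon^{(m)}, \qquad m = 1, \dots, M, \\
A^{(m)}_{S} &= 0 \quad \text{whenever } m \notin S \quad \text{(structural sparsity)}, \\
\operatorname{Cov}(z_S, z_{S'}) &= 0 \quad \text{for } S \neq S' \quad \text{(uncorrelated latents)}, \\
\operatorname{Cov}\bigl(x^{(m)}, x^{(m')}\bigr) &= \sum_{S \ni m,\, m'} A^{(m)}_{S} \operatorname{Cov}(z_S) \bigl(A^{(m')}_{S}\bigr)^{\top}, \qquad m \neq m'.
\end{aligned}
```

In this notation S = {1, ..., M} is a globally shared factor, a singleton S = {m} is modality-specific, and anything in between is partially shared. The last line follows from the first three: under uncorrelated latents, the cross-covariance of two modalities only sees the factors they both carry.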

If this is right

Simulations demonstrate accurate recovery of hierarchical structure and selection of task-relevant components
On multimodal electronic health records, HCL produces more informative representations
Predictive performance improves consistently compared to standard approaches

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to other multimodal domains like vision and language where partial sharing is common
Inspecting the learned hierarchy might reveal interpretable data structures in applications
Relaxing the uncorrelated assumption could lead to extensions for more realistic correlated latents

Load-bearing premise

The latent variables are uncorrelated.

What would settle it

A counterexample dataset with correlated latent factors where the hierarchical decomposition cannot be recovered would falsify the identifiability claims.
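The ambiguity such a counterexample would exploit can be shown at the population level with a small sketch. This is an illustration under assumed settings, not the paper's argument: a toy linear model with one global factor `g`, one factor `p` shared by modalities 1 and 2, and one factor `s3` specific to modality 3. The dimensions, the random loadings and the correlation value are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # observed dimension of each modality (illustrative)

# Latent order: [g, p, s3]. Modality 1 loads on g and p; modality 3 loads on g and s3.
A1 = np.column_stack([rng.normal(size=d), rng.normal(size=d), np.zeros(d)])
A3 = np.column_stack([rng.normal(size=d), np.zeros(d), rng.normal(size=d)])


def cross_cov(rho: float) -> np.ndarray:
    """Population Cov(x1, x3) when corr(p, s3) = rho and all else is uncorrelated."""
    sigma = np.eye(3)
    sigma[1, 2] = sigma[2, 1] = rho
    return A1 @ sigma @ A3.T


rank_uncorrelated = np.linalg.matrix_rank(cross_cov(0.0))
rank_correlated = np.linalg.matrix_rank(cross_cov(0.3))
print(rank_uncorrelated, rank_correlated)
```

With uncorrelated latents the cross-covariance of modalities 1 and 3 has rank one, matching the single factor they share. Once `p` and `s3` are correlated the rank rises to two, so second moments alone can no longer distinguish "a partial factor correlated with a specific one" from "an additional globally shared factor".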

Figures

Figures reproduced from arXiv: 2604.05462 by Doudou Zhou, Huichao Li, Junhan Yu.

**Figure 2.** Hierarchical decomposition for multimodal data.

**Figure 3.** Block matrices error of the hierarchical contrastive learning (HCL) framework as …

**Figure 4.** Global matrix error of the hierarchical contrastive learning (HCL) framework …

**Figure 5.** Downstream prediction performance of different methods under varying sample …

**Figure 6.** Comparison of methods for 30-day readmission prediction under joint training …

**Figure 7.** Comparison of methods for next-visit length-of-stay prediction under joint training …

**Figure 8.** Comparison of methods for one-year mortality prediction under joint training …

**Figure 9.** Normalized importance weights of latent structures for the three downstream …

Original abstract

Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HCL brings hierarchical structure to multimodal contrastive learning for partial sharing, with theory that requires uncorrelated latents.

The main thing here is that HCL adds a hierarchical latent structure to contrastive multimodal learning so it can capture partial sharing between subsets of modalities rather than just all-or-nothing. This goes beyond the usual shared-private split by defining global, partial, and specific factors, then using a structure-aware contrastive loss that aligns only when factors are actually shared, plus sparsity to pick relevant parts.

The abstract says they prove identifiability of the decomposition, recovery of the loading matrices, and some estimation and risk bounds when latents are uncorrelated. Simulations recover the structure well, and the EHR experiments give better predictions than baselines.

The main limitation is that everything theoretical depends on the latents being uncorrelated. That's a restrictive assumption for real data, especially in health records where factors often correlate, so the guarantees may not apply broadly. Without the full paper it's hard to see how robust the proofs are or how the experiments handle partial sharing validation. The citation pattern in the abstract seems light on direct competitors to the hierarchical setup.

This paper is for people working on multimodal representations who need to model incomplete overlaps, like in medical data. It has enough formal claims and empirical hints to merit peer review, even if the assumption needs scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hierarchical Contrastive Learning (HCL), a framework for multimodal representation learning that models globally shared, partially shared, and modality-specific latent factors via a hierarchical latent-variable model combined with structural sparsity and a structure-aware contrastive objective. Under the assumption of uncorrelated latent variables, the authors claim to prove identifiability of the hierarchical decomposition, recovery guarantees for the loading matrices, and bounds on parameter estimation and excess risk for downstream prediction. Support is provided through simulations demonstrating structure recovery and experiments on multimodal electronic health records showing improved predictive performance.

Significance. If the identifiability and recovery results hold, the work would meaningfully advance multimodal learning by moving beyond binary shared-private decompositions to capture partial sharing across modality subsets, with potential benefits for interpretability in domains such as healthcare. The explicit conditioning of all theoretical claims on uncorrelated latents, together with the empirical validation, represents a clear contribution if the proofs are tight and the simulations reproducible.

major comments (2)

[Abstract and theoretical analysis section] All central claims (identifiability of the hierarchical decomposition, recovery guarantees for loading matrices, and excess-risk bounds) are derived under the external assumption of uncorrelated latent variables. The manuscript should include a dedicated discussion or sensitivity analysis showing how the results degrade under mild correlations, as this assumption is load-bearing for the uniqueness of the global/partial/modality-specific factorization.
[Simulation section] The claim of 'accurate recovery of hierarchical structure' is central to validating the recovery guarantees, yet no quantitative metrics (e.g., Frobenius error on loading matrices, precision/recall on factor selection) or variability measures across repeated runs are referenced; without these, it is impossible to assess whether the empirical results corroborate the theoretical rates.
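The metrics named in the second comment could be computed along these lines. This is a minimal sketch with hypothetical helper names. It assumes the estimated loading matrix has already been aligned with the truth (loadings are often identified only up to sign, permutation or rotation, and the abstract does not say which invariance applies here), and it treats "factor selection" as recovery of the nonzero pattern.

```python
import numpy as np


def relative_frobenius_error(A_hat: np.ndarray, A_true: np.ndarray) -> float:
    """||A_hat - A_true||_F / ||A_true||_F; assumes columns are already aligned."""
    return float(np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true))


def support_precision_recall(A_hat: np.ndarray, A_true: np.ndarray, tol: float = 1e-8):
    """Precision and recall of the nonzero pattern of A_hat against A_true."""
    est = np.abs(A_hat) > tol
    true = np.abs(A_true) > tol
    tp = np.sum(est & true)
    precision = tp / est.sum() if est.sum() else 1.0
    recall = tp / true.sum() if true.sum() else 1.0
    return float(precision), float(recall)


def summarize(values):
    """Mean and sample standard deviation across repeated runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))


# Tiny example: one spurious nonzero entry in the estimate.
A_true = np.array([[1.0, 0.0], [0.0, 2.0]])
A_hat = np.array([[1.0, 0.5], [0.0, 2.0]])
err = relative_frobenius_error(A_hat, A_true)
precision, recall = support_precision_recall(A_hat, A_true)
```

Reporting `summarize` over repeated runs alongside the point estimates would supply the variability measure the comment asks for.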

minor comments (2)

[Notation] Notation for the three levels of latent factors (global, partial, modality-specific) should be introduced once with explicit symbols and used consistently in all equations and figures.
[EHR experiments] The EHR experiment description would benefit from a table summarizing the number of modalities, sample size, prediction tasks, and baseline methods to allow direct comparison of the reported performance gains.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential contribution of our hierarchical framework for multimodal representation learning. We address each major comment in detail below and commit to targeted revisions that strengthen the presentation of our theoretical and empirical results.

Referee: [Abstract and theoretical analysis section] All central claims (identifiability of the hierarchical decomposition, recovery guarantees for loading matrices, and excess-risk bounds) are derived under the external assumption of uncorrelated latent variables. The manuscript should include a dedicated discussion or sensitivity analysis showing how the results degrade under mild correlations, as this assumption is load-bearing for the uniqueness of the global/partial/modality-specific factorization.

Authors: We agree that the uncorrelated latent variables assumption is load-bearing for the identifiability of the global/partial/modality-specific factorization and the associated recovery guarantees. In the revised manuscript we will insert a new subsection 'Role and Sensitivity of the Uncorrelated Latent Assumption' immediately after the main identifiability theorem. This subsection will (i) restate why uncorrelatedness is required for uniqueness of the hierarchical decomposition, (ii) present a controlled simulation study that introduces mild pairwise correlations (coefficients 0.1–0.3) among latent factors and reports the resulting degradation in loading-matrix recovery error and factor-selection accuracy, and (iii) discuss practical regimes (e.g., pre-whitened EHR features) in which the assumption holds approximately. These additions will be simulation-based and will not alter the formal statements of the theorems. Revision: yes.

Referee: [Simulation section] The claim of 'accurate recovery of hierarchical structure' is central to validating the recovery guarantees, yet no quantitative metrics (e.g., Frobenius error on loading matrices, precision/recall on factor selection) or variability measures across repeated runs are referenced; without these, it is impossible to assess whether the empirical results corroborate the theoretical rates.

Authors: We acknowledge that the current simulation section relies primarily on qualitative visualizations. In the revision we will augment this section with a new table (and accompanying text) that reports, over 20 independent replications: (a) average Frobenius norm error and standard deviation for each estimated loading matrix, (b) precision and recall for recovering the hierarchical structure (global, partial, and modality-specific factors), and (c) the same metrics under the mild-correlation regimes introduced in the new theoretical subsection. These quantitative results will be directly compared against the rates predicted by the recovery theorems. Revision: yes.
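A replication study of the kind promised in these responses might be organized as in the sketch below. Everything in it is an assumption made for illustration: the toy linear model, the dimensions, the noise level of 0.1, and above all the estimator, which is a plain SVD of a sample cross-covariance standing in for HCL. It only shows the shape of a sweep over correlations 0.0 to 0.3 with 20 replications each.

```python
import numpy as np


def simulate(n: int, rho: float, rng: np.random.Generator):
    """Draw modalities 1 and 3 from a toy linear model with corr(p, s3) = rho."""
    d = 4
    g = rng.normal(size=n)  # global factor
    p = rng.normal(size=n)  # factor shared by modalities 1 and 2 only
    s3 = rho * p + np.sqrt(1 - rho**2) * rng.normal(size=n)  # modality-3 specific
    e1, e2 = np.eye(d)[0], np.eye(d)[1]
    a3s = (e1 + e2) / np.sqrt(2)  # modality 3's loading on s3
    x1 = np.outer(g, e1) + np.outer(p, e2) + 0.1 * rng.normal(size=(n, d))
    x3 = np.outer(g, e1) + np.outer(s3, a3s) + 0.1 * rng.normal(size=(n, d))
    return x1, x3, e1  # e1 is modality 1's true global loading direction


def global_direction_error(x1, x3, truth) -> float:
    """sin(angle) between the top left singular vector of the sample
    cross-covariance of (x1, x3) and the true global loading direction."""
    c = (x1 - x1.mean(0)).T @ (x3 - x3.mean(0)) / len(x1)
    u = np.linalg.svd(c)[0][:, 0]
    cos = abs(u @ truth) / np.linalg.norm(truth)
    return float(np.sqrt(max(0.0, 1.0 - cos**2)))


def sweep(rhos=(0.0, 0.1, 0.2, 0.3), n=5000, replications=20, seed=0):
    """Mean and sample standard deviation of the error for each correlation."""
    rng = np.random.default_rng(seed)
    out = {}
    for rho in rhos:
        errs = [
            global_direction_error(*simulate(n, rho, rng))
            for _ in range(replications)
        ]
        out[rho] = (float(np.mean(errs)), float(np.std(errs, ddof=1)))
    return out


if __name__ == "__main__":
    for rho, (mean, sd) in sweep().items():
        print(f"rho={rho:.1f}  error={mean:.3f} +/- {sd:.3f}")
```

In this toy setup the stand-in estimator is biased as soon as `rho` is nonzero, because the correlated pair of factors leaks into the cross-covariance that is supposed to carry only the global factor. Whether HCL degrades in a similar way is exactly what the promised sensitivity study would have to show.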

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims consist of identifiability proofs, recovery guarantees, and excess-risk bounds that are explicitly conditioned on the external assumption of uncorrelated latent variables. No load-bearing step reduces a derived quantity to a fitted parameter or self-citation by construction; the hierarchical decomposition, structural sparsity, and contrastive objective are introduced as modeling choices whose theoretical properties are then proven under the stated assumption rather than being tautological with the inputs. The provided abstract and reader summary contain no equations or citations that exhibit self-definition, fitted-input renaming, or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework introduces a hierarchical latent structure whose identifiability depends on an external statistical assumption; no free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption: Latent variables are uncorrelated.
Invoked to prove identifiability of the hierarchical decomposition.

invented entities (1)

Hierarchical latent factors (global, partial, modality-specific): no independent evidence.
Purpose: to represent different levels of sharing across modalities.
Core modeling construct introduced by the framework.
