RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities

Dorina Thanou; Pascal Frossard; Vasiliki Rizou

arxiv: 2606.05109 · v1 · pith:75FNVY2Rnew · submitted 2026-06-03 · 💻 cs.LG


Pith reviewed 2026-06-28 07:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords disentangled representations · multimodal learning · representation learning · self-supervised learning · pairwise disentanglement · plug-and-play architecture


The pith

RePercENT enables scalable disentangled representations for any number of modalities by operating on pre-extracted embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that disentangled representations separating shared and unique factors can be learned for three or more modalities without the scalability limits that confine prior methods to pairs. It does this through a plug-and-play architecture that takes pre-extracted embeddings from any foundation models and applies a joint optimization objective to recover both shared and modality-specific components. Formal guarantees are given that characterize when the recovered solution is optimal. A sympathetic reader would care because the approach removes the need for joint pre-training across modalities while preserving modality-specific information that alignment-only methods discard. If correct, the result is a practical way to exploit cross-modal interactions in richer multimodal datasets at lower computational cost.

Core claim

RePercENT is a self-supervised framework that unlocks scalable pairwise disentanglement beyond two modalities through a multimodal plug-and-play architecture that operates directly on pre-extracted embeddings, introduces a joint optimization objective for simultaneously deriving the shared and unique components, and provides formal theoretical guarantees that characterize the optimality of the solution.

What carries the argument

The multimodal plug-and-play architecture that jointly optimizes shared and unique components directly from pre-extracted embeddings without modality-specific assumptions.
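
As a rough illustration of the plug-and-play layout described above, the following minimal sketch gives each modality one module that maps a frozen embedding to a pair of components (u_ij, s_ij) for every partner modality j. Everything concrete here is an assumption for illustration: the embedding sizes, the component width, and the single untrained linear map standing in for the paper's disentanglement module, whose internals are not given on this page.

```python
import numpy as np

def make_module(rng, in_dim, num_modalities, comp_dim):
    """Hypothetical stand-in for a per-modality module D_i: one linear map
    whose output holds a (unique, shared) pair for each partner modality."""
    num_partners = num_modalities - 1
    out_dim = 2 * num_partners * comp_dim
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

def disentangle(weights, embeddings, modality, num_modalities, comp_dim):
    """Split the module output into {partner j: (u_ij, s_ij)}."""
    out = embeddings @ weights
    partners = [j for j in range(num_modalities) if j != modality]
    out = out.reshape(len(embeddings), len(partners), 2, comp_dim)
    return {j: (out[:, k, 0], out[:, k, 1]) for k, j in enumerate(partners)}

rng = np.random.default_rng(0)
embed_dims = [512, 768, 1024]  # assumed sizes of the pre-extracted embeddings
M, comp_dim, batch = len(embed_dims), 16, 4

# One module per modality, so the module count grows linearly in M.
modules = [make_module(rng, d, M, comp_dim) for d in embed_dims]
# Random arrays stand in for embeddings produced by frozen foundation models.
frozen = [rng.normal(size=(batch, d)) for d in embed_dims]
components = [disentangle(modules[i], frozen[i], i, M, comp_dim) for i in range(M)]

u_01, s_01 = components[0][1]
print(len(modules), sorted(components[0]), u_01.shape, s_01.shape)
```

The sketch only fixes shapes and bookkeeping. No training objective is applied, so the outputs are not disentangled in any meaningful sense.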

If this is right

Disentangled components are recovered across diverse modalities and tasks while maintaining competitive performance.
Computational complexity is significantly reduced compared with methods that require joint pre-training.
No assumptions are needed on the underlying modalities or the foundation model backbones used to produce the embeddings.
The optimality of the recovered solution is characterized by formal theoretical guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same plug-and-play structure could allow new modalities to be added after initial training without retraining the full model.
Downstream multimodal fusion or generation tasks might benefit from using the recovered shared and unique factors as cleaner inputs.
The joint optimization could be combined with existing contrastive losses to handle cases where some modalities are missing at test time.

Load-bearing premise

Pre-extracted embeddings from arbitrary foundation models already contain enough information for the joint optimization to recover both shared and unique factors.

What would settle it

A synthetic three-modality dataset with known ground-truth shared and unique factors on which the method fails to recover the correct disentangled components.
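
One way such a benchmark could be set up is sketched below. It is a hypothetical construction, not the paper's synthetic protocol: the factor sizes, the linear mixing, and the noise level are all assumptions. Each of the three views mixes its own unique factor with the two pairwise shared factors it participates in, and a held-out linear probe then checks which ground-truth factors are linearly recoverable from a view.

```python
import numpy as np

def make_synthetic(n=2000, k=4, obs_dim=32, noise=0.01, seed=0):
    """Three views built from known unique and pairwise shared factors."""
    rng = np.random.default_rng(seed)
    M = 3
    unique = {i: rng.normal(size=(n, k)) for i in range(M)}
    shared = {(i, j): rng.normal(size=(n, k))
              for i in range(M) for j in range(i + 1, M)}
    views = []
    for i in range(M):
        # View i sees its unique factor plus every shared factor involving i.
        parts = [unique[i]] + [shared[p] for p in sorted(shared) if i in p]
        latent = np.concatenate(parts, axis=1)
        mixing = rng.normal(size=(latent.shape[1], obs_dim))
        views.append(latent @ mixing + noise * rng.normal(size=(n, obs_dim)))
    return views, unique, shared

def probe_r2(x, y, train_frac=0.8):
    """Held-out R^2 of a least-squares linear probe from x to y."""
    split = int(train_frac * len(x))
    w, *_ = np.linalg.lstsq(x[:split], y[:split], rcond=None)
    resid = y[split:] - x[split:] @ w
    total = y[split:] - y[split:].mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

views, unique, shared = make_synthetic()
r2_present = probe_r2(views[0], shared[(0, 1)])  # factor that view 0 contains
r2_absent = probe_r2(views[0], shared[(1, 2)])   # factor that view 0 never sees
print(f"present: {r2_present:.3f}, absent: {r2_absent:.3f}")
```

To score a method on this data, the same probe would be applied to its recovered components rather than to the raw views: a shared component for the pair (0, 1) should predict the (0, 1) factor well and the unique factors poorly.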

Figures

Figures reproduced from arXiv: 2606.05109 by Dorina Thanou, Pascal Frossard, Vasiliki Rizou.

**Figure 1.** Left: Example in oncology when multi-view redundancy is limited. While both modalities capture shared structural morphology, WSI resolves fine-grained cellular features, whereas ST reveals underlying molecular variations that are invisible in histology. Right: Information Venn diagram for three modalities, along with their pairwise shared and unique component visualizations. what is shared across modaliti…

**Figure 2.** Figure 2 illustrates the atomic representation subspaces through a Venn decomposition for the three-modality case. Notice that each modality-specific latent representation Zi can then be interpreted as a composition of atomic representations. Definition 2.2 (Composite representation): For i ∈ M, let Ai = {A ⊆ M : i ∈ A}. The composite representation Zi is the combination of all atomic representations whose subsets co…

**Figure 3.** Model overview. Each modality Xi is first encoded using modality-specific FMs. Afterwards, each encoded representation is processed through its dedicated disentanglement module Di. We propose a scalable framework for multimodal disentanglement that extracts the desired information-theoretic representations while requiring a single encoder per modality, yielding linear scaling in M. Specifically, from ea…

**Figure 4.** Top: Synthetic performance across modality count and parameter budgets. RePercENT achieves competitive scores at substantially lower model complexity. Bottom: Linear-probe confusion matrix for M = 2.

**Figure 5.** We reduce WSI availability while preserving Molecular data. Fusion baselines degrade sharply, whereas RePercENT remains robust. This indicates the need for explicitly modeling the complementary information present among these modalities.

**Figure 6.** Illustration of the bijective mapping ϕi and Group Slot Attention. Left: the mapping defined in Definition G.1, where each latent slot hik is assigned to a specific unique or shared component. Right: the grouping mechanism, where each pair (uij, sij) belongs to a separate group Gj. G.2 Group slot attention: In grouped slot attention, slots are partitioned into disjoi…

**Figure 7.** Adapted from Yosef et al. [2023]. Examples of the multimodal figurative language detection task for idiom, metaphor, and simile. The input is a figurative phrase and four candidate images (for idiom, we also show the definition). The correct answer is marked with an orange square. We evaluate all models in a zero-shot manner. Let ztext denote the text query representation, derived from either the Caption a…

**Figure 8.** Example of two image augmentation variants used for the figurative language task.

**Figure 9.** Cancer type patient count of the HONeYBEE extracted embeddings.

**Figure 10.** (M = 2) Pairwise confusion matrices for the synthetic setting with two modalities, shown for MLP, GRU, gMLP, and RePercENT. While all models largely separate unique and shared information, sequence-aware models yield stronger intra-component accuracy, as reflected by higher main-diagonal values.

**Figure 11.** (M = 3) Pairwise confusion matrices for the synthetic setting with three modalities, shown for MLP, GRU, gMLP, and RePercENT. The performance of GRU and especially MLP degrades, while gMLP and RePercENT show robust disentanglement performance.

**Figure 12.** (M = 4) Pairwise confusion matrices for the synthetic setting with four modalities, shown for MLP and GRU. The MLP is unable to recover the desired representations, as it exhibits substantial cross-component leakage and weak intra-component prediction, while the GRU preserves strong shared representations but weakly encodes unique components.

**Figure 13.** (M = 4) Pairwise confusion matrices for the synthetic setting with four modalities, shown for gMLP and RePercENT. Both models achieve strong disentanglement, with RePercENT yielding slightly higher intra-component accuracy, especially for the unique components, while gMLP exhibits marginally lower cross-component leakage.

**Figure 14.** (M = 5) Pairwise confusion matrices for the synthetic setting with five modalities, shown for MLP and GRU. As in the case of M = 4, the MLP fails to recover the desired representations, while the GRU successfully captures only the shared components.

**Figure 15.** (M = 5) Pairwise confusion matrices for the synthetic setting with five modalities, shown for gMLP and RePercENT. Despite the increased number of modality pairs, both models successfully encode pairwise unique and shared representations, yielding similar disentanglement performance.

**Figure 16.** Sweep of parameter α across different M, when λ is fixed to 1.

**Figure 17.** Evaluation of the angular distance between the studied cancer types, across the extracted

Original abstract

To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify these underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to their inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework designed to surpass these limitations and unlock scalable pairwise disentanglement beyond two modalities. Through a multimodal 'plug-and-play' architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RePercENT scales pairwise disentanglement past two modalities with a plug-and-play setup on frozen embeddings and a joint objective, but the whole thing rests on those embeddings already containing separable shared and unique factors.


RePercENT claims to remove the two-modality limit in disentangled multimodal learning by running a joint optimization directly on pre-extracted embeddings from any foundation models. It adds claimed optimality guarantees and avoids joint pre-training or modality-specific designs.

The approach is new in its explicit plug-and-play framing for three-plus modalities and the focus on computational reduction. That could matter for anyone fusing vision, language, and audio without retraining large backbones. The abstract positions the work cleanly against the existing scalability bottleneck.

The paper does a reasonable job stating the problem and the intended generality. If the experiments actually show recovery of components across tasks while cutting compute, that would be useful for practitioners.

The soft spot is exactly the one in the stress-test note. The method makes no assumptions about the modalities or backbones, which means it assumes the input embeddings already encode the factors in a recoverable way. Foundation-model embeddings are trained for their original objectives and can easily discard or entangle cross-modal signals; nothing in the downstream objective can recover what is not there. Without evidence on when this holds, the guarantees risk being limited to the cases where the premise is already true.

The theoretical claims are stated at a high level in the abstract, so it is hard to judge if the optimality results are substantive or restate the optimization. Experiments are summarized but lack detail on disentanglement metrics and failure cases.

This is for multimodal researchers who already work with disentanglement and need something that scales without heavy retraining. It deserves a serious referee because the problem is concrete and the design is specified enough to review, even if the embedding assumption and proofs will need close checking.

Referee Report

2 major / 2 minor

Summary. The paper proposes RePercENT, a self-supervised multimodal framework for disentangled representation learning that scales pairwise disentanglement beyond two modalities. It introduces a plug-and-play architecture operating directly on pre-extracted embeddings from arbitrary foundation models (with no modality or backbone assumptions), a joint optimization objective to derive shared and unique components, and formal theoretical guarantees characterizing solution optimality. Experiments claim successful recovery of disentangled factors across diverse modalities and tasks while maintaining competitive performance and lowering computational cost relative to prior methods.

Significance. If the optimality guarantees are non-circular and the method recovers factors from frozen embeddings without hidden modality-specific assumptions, the result would be significant for enabling scalable multimodal disentanglement in a practical, foundation-model-compatible setting. The plug-and-play design and reduction in joint pre-training are practical strengths; however, the load-bearing premise that pre-extracted embeddings already encode recoverable shared/unique factors remains unverified in the provided description.

major comments (2)

[Abstract] Abstract (and architecture description): the claim of 'formal theoretical guarantees that characterize the optimality of our solution' and 'no assumptions regarding the underlying modalities or foundation model backbones' is load-bearing for the central contribution, yet the abstract supplies no equations, proof sketches, or recovery conditions; without these it is impossible to assess whether the guarantees are non-circular or whether the joint objective reduces to a self-referential definition of shared/unique factors.
[Architecture] Architecture and weakest-assumption paragraph: the premise that pre-extracted embeddings from arbitrary foundation models already contain linearly or nonlinearly separable information about shared and unique factors (without modality-specific assumptions or joint pre-training) is the least secured element; foundation-model embeddings are optimized for their original objectives and may discard or entangle cross-modal factors in ways the downstream objective cannot recover. The manuscript should supply a concrete test or counter-example showing when this premise holds.

minor comments (2)

Provide the explicit form of the joint optimization objective (including any regularization or orthogonality terms) and contrast it with existing two-modality disentanglement losses.
Clarify the precise notion of 'pairwise disentanglement' when extending beyond two modalities and how the method avoids combinatorial explosion in the number of modality pairs.
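
The scaling concern in the last comment can be made concrete with a small piece of arithmetic. The sketch below assumes, following the Figure 3 caption, one module per modality, and assumes one (u_ij, s_ij) pair per ordered modality pair. How the paper bounds the cost of those pairwise components is not stated on this page, so the numbers only show how the counts grow.

```python
from math import comb

def scaling_table(max_modalities=5):
    """Count modules and pairwise components for M = 2 .. max_modalities."""
    rows = []
    for m in range(2, max_modalities + 1):
        rows.append({
            "M": m,
            "modules": m,                          # one module per modality
            "unordered_pairs": comb(m, 2),         # M(M-1)/2 modality pairs
            "component_vectors": 2 * m * (m - 1),  # u_ij and s_ij for each ordered pair
        })
    return rows

for row in scaling_table():
    print(row)
```

For M = 5 this gives 5 modules but 10 modality pairs and 40 component vectors, which is the gap the referee asks the authors to address.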

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the placement of theoretical details and the empirical support for the embedding premise.


Referee: [Abstract] Abstract (and architecture description): the claim of 'formal theoretical guarantees that characterize the optimality of our solution' and 'no assumptions regarding the underlying modalities or foundation model backbones' is load-bearing for the central contribution, yet the abstract supplies no equations, proof sketches, or recovery conditions; without these it is impossible to assess whether the guarantees are non-circular or whether the joint objective reduces to a self-referential definition of shared/unique factors.

Authors: The abstract is space-constrained and omits equations and proof sketches, which appear in Section 3 (Theoretical Analysis). The guarantees characterize optimality of the joint objective via identifiability results derived from the pairwise contrastive formulation; they are non-circular because the shared/unique decomposition is uniquely determined by the fixed-point conditions of the optimization rather than by definition. A reference to Section 3 can be added to the abstract.
revision: partial
Referee: [Architecture] Architecture and weakest-assumption paragraph: the premise that pre-extracted embeddings from arbitrary foundation models already contain linearly or nonlinearly separable information about shared and unique factors (without modality-specific assumptions or joint pre-training) is the least secured element; foundation-model embeddings are optimized for their original objectives and may discard or entangle cross-modal factors in ways the downstream objective cannot recover. The manuscript should supply a concrete test or counter-example showing when this premise holds.

Authors: Section 4 reports experiments on frozen embeddings from multiple foundation models across vision, language, and audio modalities, showing consistent recovery of shared and unique factors without joint pre-training or modality-specific tuning. These results serve as empirical validation of the premise under the tested conditions. An explicit counter-example is not provided, but the breadth of successful cases indicates the premise holds for standard foundation-model embeddings; we can add a limitations paragraph if requested.
revision: no

Circularity Check

0 steps flagged

No circularity: derivation self-contained on pre-extracted embeddings and joint objective


The provided abstract and context describe a plug-and-play architecture on frozen embeddings, a joint optimization for shared/unique factors, and optimality guarantees, but contain no equations, self-citations, or fitted-parameter renamings that reduce the claimed results to inputs by construction. The central premise (recoverability from arbitrary foundation-model embeddings) is an assumption rather than a derived claim that loops back on itself. No load-bearing step matches any enumerated circularity pattern; the framework is presented as independent of modality-specific pre-training.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, assumptions, or experimental protocols are provided, so the ledger cannot be populated with concrete entries.

pith-pipeline@v0.9.1-grok · 5721 in / 1242 out tokens · 18849 ms · 2026-06-28T07:08:17.318032+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 10 canonical work pages · 2 internal anchors

[1]

Multimodal machine learning: A survey and taxonomy

doi: 10.1109/TPAMI.2018.2798607. Felix Krones, Umar Marikkar, Guy Parsons, Adam Szmul, and Adam Mahdi. Review of multimodal machine learning approaches in healthcare. An International Journal on Information Fusion, 114,

work page doi:10.1109/tpami.2018.2798607 2018
[2]

Representation Learning with Contrastive Predictive Coding

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Hanxiao Liu, Zihang Dai, David So, and Quoc V Le

doi: 10.1038/s41746-025-02003-4. Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to MLPs. In Advances in Neural Information Processing Systems,

work page doi:10.1038/s41746-025-02003-4
[4]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

arXiv:1412.3555 [cs]. Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56:1–42, 2022a. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie M...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Flava: A foundational language and vision alignment model. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

2022
[6]

Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, and Maksims Volkovs

Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, and Maksims Volkovs. Data-efficient multimodal fusion on a single gpu. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

2024
[7]

Towards a General-Purpose Foundation Model for Computational Pathology,

doi: 10.1038/s41591-024-02857-3. Yingxue Xu and Hao Chen. Multimodal optimal transport-based co-attention transformer with global structure consistency for survival prediction. 2023 IEEE/CVF International Conference on Computer Vision (ICCV),

work page doi:10.1038/s41591-024-02857-3 2023
[8]

Imagebind one embedding space to bind them all. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind one embedding space to bind them all. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

2023
[9]

Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara Pires de Oliveira, Sanne Abeln, and Wilson Silva

doi: 10.1109/TPAMI.2024.3420937. Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara Pires de Oliveira, Sanne Abeln, and Wilson Silva. Disentangled and interpretable multimodal attention fusion for cancer survival prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention,

work page doi:10.1109/tpami.2024.3420937 2024
[10]

Lucas Robinet, Ahmad Berjaoui, Ziad Kheil, and Elizabeth Cohen-Jonathan Moyal

doi: 10.1609/aaai.v38i15.29578. Lucas Robinet, Ahmad Berjaoui, Ziad Kheil, and Elizabeth Cohen-Jonathan Moyal. Drim: Learning disentangled representations from incomplete multimodal healthcare data. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer,

work page doi:10.1609/aaai.v38i15.29578
[11]

Alessandro Achille and Stefano Soatto

doi: 10.48550/arXiv.physics/0004057. Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 2897–2905, December

work page doi:10.48550/arxiv.physics/
[12]

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola

doi: 10.1109/TPAMI.2017.2784440. Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems,

work page doi:10.1109/tpami.2017.2784440 2017
[13]

Senmo: A self-normalizing deep learning model for enhanced multi-omics data analysis in oncology. arXiv preprint arXiv:2405.08226,

Asim Waqas, Aakash Tripathi, Sabeen Ahmed, Ashwin Mukund, Hamza Farooq, Matthew B Schabath, Paul Stewart, Mia Naeini, and Ghulam Rasool. Senmo: A self-normalizing deep learning model for enhanced multi-omics data analysis in oncology. arXiv preprint arXiv:2405.08226,

work page arXiv
[14]

Appendix Contents

Appendix Contents: A Useful Notation; B Supplementary Related Work; C Limitations and future directions; D Impact Statement; E Information Theory Background; E.1 Basic quantities and identities; E.2 Multi-view redundancy...

2024
[15]

propose information-theoretic criteria for controlled two-modality disentanglement. C Limitations and future directions Although RePercENT provides a scalable and theoretically grounded framework for high-modality disentanglement, it also reveals several promising directions for future extensions. Our formulation focuses on pairwise unique and shared comp...

2001
[16]

Minimizing $I(s_{ij}; X_i \mid X_j)$ penalizes information retained by $s_{ij}$ about $X_i$ that is not explained by $X_j$

propose the following definition for extracting the optimal shared representations $s_{ij}$ and $s_{ji}$:

$$
\begin{aligned}
s_{ij}^{*} &\in \arg\min_{s_{ij}} I(s_{ij}; X_i \mid X_j) \quad \text{s.t.} \quad I(X_i; X_j) - I(s_{ij}; X_j) \le \delta_c, \\
s_{ji}^{*} &\in \arg\min_{s_{ji}} I(s_{ji}; X_j \mid X_i) \quad \text{s.t.} \quad I(X_i; X_j) - I(s_{ji}; X_i) \le \delta_c.
\end{aligned}
\tag{9}
$$

Minimizing $I(s_{ij}; X_i \mid X_j)$ penalizes information retained by $s_{ij}$ about $X_i$ that is not explained by $X_j$. The constraint, o...

2018
[17]

It follows that all inequalities above become equalities when $(s_{ij}, s_{ji}) = (s_{ij}^{*}, s_{ji}^{*})$

implies that, in the attainable-MNI regime,

$$
I(u_{ij}, s_{ji}^{*}; X_i) = I(u_{ij}, X_j; X_i), \qquad I(u_{ji}, s_{ij}^{*}; X_j) = I(u_{ji}, X_i; X_j).
$$

It follows that all inequalities above become equalities when $(s_{ij}, s_{ji}) = (s_{ij}^{*}, s_{ji}^{*})$. Hence

$$
J = 2(\alpha - \lambda)\, I(X_i; X_j) + B_{u_i} + B_{u_j} + 2c = 2\alpha\, I(X_i; X_j) + B_{u_i} + B_{u_j}, \qquad c = \lambda\, I(X_i; X_j). \tag{22}
$$

Since the term $2\alpha\, I(X_i; X_j)$ is constant with respect to u...

2025
[18]

All models are trained with five independent random seeds, and we report detection accuracy as the mean ± standard deviation across runs

Note that the two text modalities, Caption and Definition, share the same encoder architecture but are encoded by separate encoder instances with independent parameters. All models are trained with five independent random seeds, and we report detection accuracy as the mean ± standard deviation across runs. Table 5: Architecture and training specifications u...

2013
[19]

Table 6: HONeYBEE TCGA modality embeddings

Table 6: HONeYBEE TCGA modality embeddings used in the oncology experiments.
Modality | Dim. | Encoder | Description
Clinical | 1024 | Qwen3 [Yang et al., 2025] | Patient-level clinical information, including structured and unstructured records such as demographics, laboratory values, medications, and clinical narratives.
Pathology | 1024 | Qwen3 [Yang et al., 2025] | Free-tex...

2025
