arxiv: 2605.21059 · v1 · pith:C4Y7AZSJnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Multimodal LLMs under Pairwise Modalities

Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords multimodal LLMs · pairwise modalities · latent alignment · cross-modal recomposition · contrastive learning · representation identifiability · 3D point clouds · tactile data

The pith

Pairwise modality observations suffice to learn identifiable shared representations for multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multimodal large language models can be trained without fully aligned multi-way datasets, using only multiple pairwise modality observations. It first derives theoretical conditions under which latent representations remain identifiable from pairwise data alone. Building on that, the authors introduce a two-stage method. Stage one aligns a shared latent space through self-modal reconstruction and pairwise contrastive learning, with partial alignment and minimal latent specification as inductive biases. Stage two, cross-modal recomposition, integrates encoders of new modalities with decoders of pre-trained ones. Experiments that add 3D point clouds and tactile data to existing MLLMs are reported to yield strong cross-modal transfer and generation. If the claim holds, the approach removes a major data-curation bottleneck for extending multimodal models to new domains.

Core claim

By observing only pairwise modalities and applying an inductive bias of partial alignment together with minimal latent specification during contrastive learning, the model recovers an identifiable shared latent space; encoders of newly introduced modalities can then be integrated with decoders of pre-trained modalities to enable effective cross-modal recomposition and transfer without requiring joint multi-way data.

What carries the argument

Two-stage framework of latent representation alignment (self-modal reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification) followed by cross-modal recomposition via encoder-decoder integration.
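
The first stage described above can be sketched as a single loss that sums self-modal reconstruction and a pairwise contrastive term. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `info_nce`, `stage1_loss`, `d_shared`, and `lam` are hypothetical, and "partial alignment" is read here as contrasting only the first `d_shared` latent coordinates while leaving the remaining coordinates modality-specific.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: row i of z_a is the positive for row i of z_b."""
    z_a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + 1e-8)
    z_b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + 1e-8)
    logits = z_a @ z_b.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def stage1_loss(x_a, x_b, enc_a, dec_a, enc_b, dec_b, d_shared, lam=1.0):
    """Reconstruction on each modality plus contrastive alignment of the shared block.

    Assumption: only the first d_shared latent coordinates are aligned across
    modalities; the rest are treated as modality-specific.
    """
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    recon = np.mean((dec_a(z_a) - x_a) ** 2) + np.mean((dec_b(z_b) - x_b) ** 2)
    align = info_nce(z_a[:, :d_shared], z_b[:, :d_shared])
    return recon + lam * align
```

In this reading, the reconstruction term keeps each latent informative about its own modality, while the contrastive term only constrains the shared block, so one pair of modalities can be trained at a time.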

If this is right

  • New modalities such as 3D point clouds or tactile data can be added to pre-trained MLLMs using only pairwise pairings with existing modalities.
  • Cross-modal performance remains strong after alignment without access to the full joint multimodal distribution.
  • Data-curation cost drops because multi-way aligned datasets no longer need to be assembled for every combination of modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise-only recipe might allow incremental addition of further modalities beyond the two demonstrated in the paper.
  • If the identifiability conditions hold in practice, similar alignment stages could be inserted into other contrastive or reconstruction-based multimodal pipelines.
  • The framework suggests a route to modality expansion in settings where collecting full multi-way alignments is logistically impossible.

Load-bearing premise

The inductive biases of partial alignment and minimal latent specification in pairwise contrastive learning are enough to recover identifiable shared representations that support cross-modal recomposition.

What would settle it

A controlled experiment that removes the partial-alignment bias during contrastive learning and measures whether cross-modal generation or reconstruction quality collapses on held-out modality pairs.

Figures

Figures reproduced from arXiv: 2605.21059 by Gongxu Luo, Guangyi Chen, Kun Zhang, Yan Li, Yuewen Sun, Yunlong Deng.

Figure 1: Comparison between jointly-aligned and pairwise-aligned multimodal
Figure 2: Data generation process. The shared latent space $c$ (blue) contains factors that influence multiple modalities, while modality-specific variables $s$ (red) affect only their corresponding modality. Each observation $x^{(m)}$ is generated from its shared latent block $z_c^{(m)}$ and modality-specific block $z_s^{(m)}$. Data Generating Process. We consider $x := [x^{(1)}, \dots, x^{(M)}]$ to denote a collection of observat…
Figure 3: Two-stage framework for multimodal learning from pairwise-only
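
The data generating process described in the Figure 2 caption can be written compactly as follows. This is a hedged restatement, not an equation copied from the paper: the mixing function $g_m$ and the pair set $\mathcal{E}$ are illustrative names introduced here, and the second expression only encodes the pairwise-only observation setting described in the abstract.

```latex
% g_m and \mathcal{E} are illustrative names, not notation taken from the paper.
x^{(m)} = g_m\!\left(z_c^{(m)},\, z_s^{(m)}\right), \qquad m = 1, \dots, M,
\qquad
\text{observed: } \left\{ \left(x^{(i)}, x^{(j)}\right) : (i, j) \in \mathcal{E} \right\}
```

Under this notation, the full tuple $(x^{(1)}, \dots, x^{(M)})$ is never observed jointly; only the pairs listed in $\mathcal{E}$ are.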

Original abstract

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.
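
Stage two in the abstract, integrating a new modality's encoder with a pre-trained modality's decoder, can be illustrated with a toy linear example. This is a sketch under strong assumptions, not the paper's architecture: `recompose`, `W_new`, and `W_pre` are hypothetical names, both modalities are assumed to be linear functions of the same shared latent, and the encoder is assumed to recover that latent exactly.

```python
import numpy as np

def recompose(encoder_new, decoder_pretrained, d_shared):
    """Route the shared block of a new-modality latent into a pre-trained decoder."""
    def transfer(x_new):
        z = encoder_new(x_new)
        return decoder_pretrained(z[:, :d_shared])
    return transfer

# Toy setup: both modalities are linear views of the same shared latent z_c.
rng = np.random.default_rng(0)
d_shared = 3
z_c = rng.normal(size=(5, d_shared))
W_new = rng.normal(size=(d_shared, d_shared))  # hypothetical mixing for the new modality
W_pre = rng.normal(size=(d_shared, 4))         # hypothetical mixing for the pre-trained modality
x_new = z_c @ W_new
x_pre = z_c @ W_pre

# An idealized encoder that inverts the new modality's mixing exactly.
encoder_new = lambda x: x @ np.linalg.inv(W_new)
decoder_pretrained = lambda z: z @ W_pre

transfer = recompose(encoder_new, decoder_pretrained, d_shared)
x_pre_hat = transfer(x_new)  # pre-trained modality generated from the new one
```

The example only works because the two components agree on the shared latent space, which is the property the alignment stage is meant to provide.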

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs can be trained scalably using only pairwise modality data rather than fully joint multimodal alignments. It first derives theoretical conditions for identifiability of shared latent representations from pairwise observations alone, then introduces a two-stage framework: (1) latent alignment via self-modal reconstruction plus pairwise contrastive learning that incorporates partial alignment and minimal latent specification as inductive biases, and (2) cross-modal recomposition by grafting encoders of new modalities (3D point clouds, tactile) onto pre-trained decoders. Experiments adding these modalities to existing MLLMs are reported to yield strong cross-modal transfer and generation performance.

Significance. If the identifiability result is rigorous and the inductive biases provably enforce a shared latent space in practice, the work would meaningfully lower the data-collection barrier for multimodal models, enabling cheaper extension to 3D and tactile modalities without requiring expensive three-way alignments.

major comments (2)
  1. [§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.
  2. [§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.
minor comments (2)
  1. [§4.1] The abstract and §4.1 refer to 'minimal latent specification' without a precise definition or hyper-parameter schedule; a short paragraph clarifying the exact constraint (e.g., dimension or sparsity) would improve reproducibility.
  2. [Figure 3] Figure 3 (cross-modal generation examples) lacks quantitative metrics alongside the qualitative samples; adding FID or retrieval accuracy numbers would strengthen the performance claims.
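
The retrieval accuracy suggested in the second minor comment could be computed along these lines. This is a generic recall@k sketch, not a metric taken from the paper; `recall_at_k` is a hypothetical helper, and it assumes row i of the query latents and row i of the gallery latents come from the same underlying sample.

```python
import numpy as np

def recall_at_k(z_query, z_gallery, k=1):
    """Fraction of queries whose true match is among the k most similar gallery items."""
    q = z_query / (np.linalg.norm(z_query, axis=1, keepdims=True) + 1e-8)
    g = z_gallery / (np.linalg.norm(z_gallery, axis=1, keepdims=True) + 1e-8)
    sims = q @ g.T                                # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best matches per query
    hits = (top_k == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())
```

Reporting this number for each modality pair, including pairs never seen together during training, would complement the qualitative samples.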

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we plan to make to strengthen the paper.

Point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.

    Authors: The identifiability theorem in Section 3 provides the theoretical foundation for recovering shared latent representations from pairwise modality observations under the specified axioms. The contrastive learning objective, augmented with partial alignment and minimal latent specification, is intended to operationalize these conditions by promoting invariance across modalities while the self-reconstruction term helps prevent dimensional collapse. We acknowledge that the manuscript does not include an explicit analysis or verification showing that this objective eliminates all non-identifiable solutions, such as modality-specific rotations in higher latent dimensions. To address this point, we will add a subsection in the revised version that discusses how the inductive biases interact with the identifiability conditions, including a sketch of why rotations are discouraged by the partial alignment term and collapse is mitigated by reconstruction. revision: yes

  2. Referee: [§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.

    Authors: We agree that providing a quantitative ablation on the impact of the partial-alignment term would better support the claim. Currently, the experiments focus on end-to-end performance in cross-modal tasks with the full framework. We will include an additional ablation study in the revised manuscript that compares the proposed contrastive objective (with partial alignment) against standard InfoNCE, using diagnostics such as the variance of cross-modal latent distances or the consistency of representations under random rotations to quantify the reduction in admissible transformations. revision: yes
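
The diagnostics named in the second response can be made concrete in a few lines. This is one possible reading, not the authors' planned ablation: `paired_distance_variance`, `best_orthogonal_map`, and `rotation_gap` are hypothetical helpers, and the orthogonal Procrustes solution is used here as an assumed way to measure how much residual rotation (or reflection) separates two modalities' latents.

```python
import numpy as np

def paired_distance_variance(z_a, z_b):
    """Variance of the distances between paired cross-modal latents."""
    return np.var(np.linalg.norm(z_a - z_b, axis=1))

def best_orthogonal_map(z_a, z_b):
    """Orthogonal Procrustes: orthogonal R minimizing ||z_a @ R - z_b||_F."""
    u, _, vt = np.linalg.svd(z_a.T @ z_b)
    return u @ vt

def rotation_gap(z_a, z_b):
    """Frobenius distance of the best orthogonal map from the identity.

    A value near zero suggests the two latent spaces already share axes;
    a large value suggests an unresolved rotation or reflection.
    """
    r = best_orthogonal_map(z_a, z_b)
    return np.linalg.norm(r - np.eye(r.shape[0]))
```

Comparing `rotation_gap` between a model trained with the partial-alignment term and one trained with standard InfoNCE would give a single scalar for the ablation the referee asks for.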

Circularity Check

0 steps flagged

Theoretical identifiability analysis stands as independent foundation; no reduction of results to inputs by construction.

full rationale

The paper first derives conditions for identifiability from pairwise modality observations as a standalone theoretical step, then constructs a two-stage framework (latent alignment via self-reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification biases, followed by encoder-decoder recomposition) that applies those conditions in practice. No equations or claims reduce the identifiability result or cross-modal performance predictions to fitted parameters, self-definitions, or self-citation chains; the inductive biases are presented as implementation choices rather than tautological restatements of the theory. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of identifiability conditions from pairwise observations and the practical effectiveness of the proposed inductive biases without requiring full joint multimodal data.

free parameters (1)
  • latent space dimension
    The minimal latent specification implies a choice of shared representation dimension that balances information retention and simplicity, likely tuned during training.
axioms (1)
  • domain assumption: Representations from different modalities share an identifiable common latent structure recoverable from pairwise observations under suitable conditions.
    This is invoked as the foundation for the theoretical analysis that justifies the framework.

pith-pipeline@v0.9.0 · 5761 in / 1535 out tokens · 76153 ms · 2026-05-21T05:35:04.582460+00:00 · methodology



Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida,D.,Altenschmidt,J.,Altman,S.,Anadkat,S.,etal.:Gpt-4technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Advances in neural information processing systems 34, 24206–24221 (2021)

    Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in neural information processing systems 34, 24206–24221 (2021)

  3. [3]

    Advances in neural information processing systems35, 23716–23736 (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716–23736 (2022)

  4. [4]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  5. [5]

    Advances in neural information processing systems 33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Cai, M., Liu, H., Mustikovela, S.K., Meyer, G.P., Chai, Y., Park, D., Lee, Y.J.: Vip-llava: Making large multimodal models understand arbitrary visual prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12914–12923 (2024)

  7. [7]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  8. [8]

    In: European conference on computer vision

    Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)

  9. [9]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  10. [10]

    Qwen2-Audio Technical Report

    Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024) Multimodal LLMs under Pairwise Modalities 17

  11. [11]

    Advances in neural information processing systems36, 49250–49267 (2023)

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

  13. [13]

    arXiv preprint arXiv:2510.10487 (2025)

    Deng, Y., Chen, G., Gu, T., Kong, L., Li, Y., Tang, Z., Zhang, K.: Towards self-refinement of vision-language models with triangular consistency. arXiv preprint arXiv:2510.10487 (2025)

  14. [14]

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., et al.: Palm-e: An embodied multimodal language model (2023)

  15. [15]

    In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning au- dio concepts from natural language supervision. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

  16. [16]

    A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

    Fu, L., Datta, G., Huang, H., Panitch, W.C.H., Drake, J., Ortiz, J., Mukadam, M., Lambeta, M., Calandra, R., Goldberg, K.: A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232 (2024)

  17. [17]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., et al.: Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

  18. [18]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15180–15190 (2023)

  19. [19]

    In: Conference on Uncertainty in Artificial Intelligence

    Hälvä, H., Hyvarinen, A.: Hidden markov nonlinear ica: Unsupervised learn- ing from nonstationary time series. In: Conference on Uncertainty in Artificial Intelligence. pp. 939–948. PMLR (2020)

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: Onellm: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26584–26595 (2024)

  21. [21]

    Iclr 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr 1(2), 3 (2022)

  22. [22]

    Advances in Neural Information Processing Systems36, 72096–72109 (2023)

    Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O.K., Patra, B., et al.: Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems36, 72096–72109 (2023)

  23. [23]

    In: The 22nd International Conference on Artificial Intelligence and Statistics

    Hyvarinen, A., Sasaki, H., Turner, R.: Nonlinear ica using auxiliary variables and generalized contrastive learning. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 859–868. PMLR (2019) 18 Y. Li et al

  24. [24]

    In: International conference on machine learning

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  26. [26]

    In: International conference on machine learning

    Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: International conference on machine learning. pp. 5583–5594. PMLR (2021)

  27. [27]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Kong, L., Ma, M.Q., Chen, G., Xing, E.P., Chi, Y., Morency, L.P., Zhang, K.: Understanding masked autoencoders via hierarchical latent variable models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7918–7928 (2023)

  28. [28]

    In: International conference on machine learning

    Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., Zhang, K.: Partial disentanglement for domain adaptation. In: International conference on machine learning. pp. 11455–11472. PMLR (2022)

  29. [29]

    arXiv preprint arXiv:2207.07732 (2022)

    Lachapelle, S., Lacoste-Julien, S.: Partial disentanglement via mechanism sparsity. arXiv preprint arXiv:2207.07732 (2022)

  30. [30]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  31. [31]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  32. [32]

    In: Inter- national conference on machine learning

    Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. In: Inter- national conference on machine learning. pp. 12888–12900. PMLR (2022)

  33. [33]

    Advances in neural information processing systems34, 9694– 9705 (2021)

    Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems34, 9694– 9705 (2021)

  34. [34]

    Advances in Neural Information Processing Systems36, 34504–34518 (2023)

    Li, Z., Cai, R., Chen, G., Sun, B., Hao, Z., Zhang, K.: Subspace identifica- tion for multi-source domain adaptation. Advances in Neural Information Processing Systems36, 34504–34518 (2023)

  35. [36]

    Advances in neural information processing systems36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

  36. [37]

    In: international conference on machine learning

    Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., Bachem, O.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: international conference on machine learning. pp. 4114–4124. PMLR (2019)

  37. [38]

    Advances in neural information processing systems32(2019) Multimodal LLMs under Pairwise Modalities 19

    Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems32(2019) Multimodal LLMs under Pairwise Modalities 19

  38. [39]

    In: Proceedings of the AAAI conference on artificial intelligence

    Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., Peng, X.: Smil: Multi- modal learning with severely missing modality. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2302–2310 (2021)

  39. [40]

    Cambridge university press (2009)

    Pearl, J.: Causality. Cambridge university press (2009)

  40. [41]

    The MIT press (2017)

    Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: founda- tions and learning algorithms. The MIT press (2017)

  41. [42]

    arXiv preprint arXiv:2402.17766 (2024)

    Qi, Z., Dong, R., Zhang, S., Geng, H., Han, C., Ge, Z., Wang, H., Yi, L., Ma, K.: Shapellm: Universal 3d object understanding for embodied interaction. arXiv preprint arXiv:2402.17766 (2024)

  42. [43]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  43. [44]

    Proceedings of the IEEE109(5), 612–634 (2021)

    Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Toward causal representation learning. Proceedings of the IEEE109(5), 612–634 (2021)

  44. [45]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: Flava: A foundational language and vision alignment model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15638–15650 (2022)

  45. [46]

    MIT press (2000)

    Spirtes, P., Glymour, C.N., Scheines, R.: Causation, prediction, and search. MIT press (2000)

  46. [47]

    arXiv preprint arXiv:2411.06518 (2024)

    Sun, Y., Kong, L., Chen, G., Li, L., Luo, G., Li, Z., Zhang, Y., Zheng, Y., Yang, M., Stojanov, P., et al.: Causal representation learning from multimodal biological observations. arXiv preprint arXiv:2411.06518 (2024)

  47. [48]

    Lxmert: Learning cross- modality encoder representations from transformers

    Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representa- tions from transformers. arXiv preprint arXiv:1908.07490 (2019)

  48. [49]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalk- wyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  49. [50]

    Advances in neural information processing systems30(2017)

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems30(2017)

  50. [51]

    Advances in neural information processing systems34, 16451–16467 (2021)

    Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., Locatello, F.: Self-supervised learning with data augmenta- tions provably isolates content from style. Advances in neural information processing systems34, 16451–16467 (2021)

  51. [52]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L., Carneiro, G.: Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15878–15887 (2023)

  52. [53]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 20 Y. Li et al

  53. [54]

    In: International conference on machine learning

    Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International conference on machine learning. pp. 23318–23340. PMLR (2022)

  54. [55]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  55. [56]

    Advances in Neural Information Processing Systems36, 22099–22114 (2023)

    Wang, Z., Zhao, Y., Huang, H., Liu, J., Yin, A., Tang, L., Li, L., Wang, Y., Zhang, Z., Zhao, Z.: Connecting multi-modal contrastive representations. Advances in Neural Information Processing Systems36, 22099–22114 (2023)

  56. [57]

    arXiv preprint arXiv:2410.15319 (2024)

    Wu, A., Kuang, K., Zhu, M., Wang, Y., Zheng, Y., Han, K., Li, B., Chen, G., Wu, F., Zhang, K.: Causality for large language models. arXiv preprint arXiv:2410.15319 (2024)

  57. [58]

    Advances in neural information processing systems31 (2018)

    Wu, M., Goodman, N.: Multimodal generative models for scalable weakly- supervised learning. Advances in neural information processing systems31 (2018)

  58. [59]

    In: Forty-first International Conference on Machine Learning (2024)

    Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: Next-gpt: Any-to-any multimodal llm. In: Forty-first International Conference on Machine Learning (2024)

  59. [60]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1912–1920 (2015)

  60. [61]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21469–21480 (2025)

  61. [62]

    VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

    Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C.: Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021)

  62. [63]

    Qwen2.5-Omni Technical Report

    Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., et al.: Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215 (2025)

  63. [64]

    Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., Lv, Y., Wang, Y., Guo, D., Wang, H., Ma, L., Zhang, P., Zhang, X., Hao, H., Guo, Z., Yang, B., Zhang, B., Ma, Z., Wei, X., Bai, S., Chen, K., Liu, X., Wang, P., Yang, M., Liu, D., Ren, X., Zheng, B., Men, R., Zhou, F., Yu, B., Yang, J., Yu, L., Zhou, J., Lin, J.: Qwen3...

  64. [65]

    PointLLM: Empowering Large Language Models to Understand Point Clouds

    Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: Pointllm: Empowering large language models to understand point clouds. In: European Conference on Computer Vision. pp. 131–147. Springer (2024)

  65. [66]

    LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

    Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: Lavt: Language-aware vision transformer for referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18155–18165 (2022)

  66. [67]

    Multi-View Causal Representation Learning with Partial Observability

    Yao, D., Xu, D., Lachapelle, S., Magliacane, S., Taslakian, P., Martius, G., von Kügelgen, J., Locatello, F.: Multi-view causal representation learning with partial observability. arXiv preprint arXiv:2311.04056 (2023)

  67. [68]

    Temporally Disentangled Representation Learning

    Yao, W., Chen, G., Zhang, K.: Temporally disentangled representation learning. In: Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=Vi-sZWNA_Ue

  68. [69]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)

  69. [70]

    Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling

    Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19313–19322 (2022)

  70. [71]

    Sigmoid Loss for Language Image Pre-training

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

  71. [72]

    LiT: Zero-Shot Transfer with Locked-image Text Tuning

    Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

  72. [73]

    Causal representation learning from multiple distributions: A general setting

    Zhang, K., Xie, S., Ng, I., Zheng, Y.: Causal representation learning from multiple distributions: A general setting. arXiv preprint arXiv:2402.05052 (2024)

  73. [74]

    Extending Multi-modal Contrastive Representations

    Zhang, Z., Wang, Z., Liu, L., Huang, R., Cheng, X., Ye, Z., Lin, W., Liu, H., Huang, H., Zhao, Y., et al.: Extending multi-modal contrastive representations. Advances in Neural Information Processing Systems 37, 91880–91903 (2024)

  74. [75]

    On the Identifiability of Nonlinear ICA: Sparsity and Beyond

    Zheng, Y., Ng, I., Zhang, K.: On the identifiability of nonlinear ica: Sparsity and beyond. Advances in neural information processing systems 35, 16411–16422 (2022)

  75. [76]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

    Zhu, B., Lin, B., Ning, M., Yan, Y., Cui, J., Wang, H., Pang, Y., Jiang, W., Zhang, J., Li, Z., et al.: Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852 (2023)

  76. [77]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  77. [78]

    Contrastive Learning Inverts the Data Generating Process

    Zimmermann, R.S., Sharma, Y., Schneider, S., Bethge, M., Brendel, W.: Contrastive learning inverts the data generating process. In: International conference on machine learning. pp. 12979–12990. PMLR (2021)

  78. [79]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

Table A1: Key notations in this paper. Notation Meaning M, [M] Number of modalities; index se...