Multimodal LLMs under Pairwise Modalities

Gongxu Luo; Guangyi Chen; Kun Zhang; Yan Li; Yuewen Sun; Yunlong Deng

arxiv: 2605.21059 · v1 · pith:C4Y7AZSJnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3

keywords multimodal LLMs · pairwise modalities · latent alignment · cross-modal recomposition · contrastive learning · representation identifiability · 3D point clouds · tactile data

The pith

Pairwise modality observations suffice to learn identifiable shared representations for multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models can be trained without fully aligned multi-way datasets by instead using only multiple pairwise modality observations. It first derives theoretical conditions under which latent representations remain identifiable from pairwise data alone. Building on that, the authors introduce a two-stage method: alignment of a shared latent space through self-reconstruction and pairwise contrastive learning that incorporates partial alignment and minimal latent specification, followed by cross-modal recomposition that swaps encoders from new modalities with decoders from pre-trained ones. Experiments demonstrate that adding 3D point clouds and tactile modalities to existing MLLMs yields strong cross-modal transfer and generation. If the claim holds, the approach removes a major data-curation bottleneck for expanding multimodal models across new domains.

Core claim

By observing only pairwise modalities and applying an inductive bias of partial alignment together with minimal latent specification during contrastive learning, the model recovers an identifiable shared latent space; encoders of newly introduced modalities can then be integrated with decoders of pre-trained modalities to enable effective cross-modal recomposition and transfer without requiring joint multi-way data.
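The alignment step named in the core claim can be sketched as a symmetric InfoNCE loss applied only to a shared block of each latent vector. This is a minimal NumPy sketch under stated assumptions, not the authors' objective: the function names, the `shared_dim` split, and the temperature are hypothetical, and the paper's actual partial-alignment and minimal-latent-specification terms are not given in this summary.

```python
import numpy as np

def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def partial_info_nce(z_a, z_b, shared_dim, temperature=0.1):
    """Symmetric InfoNCE over the first `shared_dim` latent coordinates.

    z_a, z_b: (batch, latent_dim) latents for paired samples from two
    modalities; row i of z_a is paired with row i of z_b.
    Coordinates beyond `shared_dim` are treated as modality-specific and
    are left out of the contrastive term (a hypothetical reading of
    "partial alignment", not the paper's definition).
    """
    c_a = z_a[:, :shared_dim]
    c_b = z_b[:, :shared_dim]
    # L2-normalise so the similarity is a cosine.
    c_a = c_a / np.linalg.norm(c_a, axis=1, keepdims=True)
    c_b = c_b / np.linalg.norm(c_b, axis=1, keepdims=True)
    logits = c_a @ c_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -_log_softmax(logits)[idx, idx].mean()
    loss_ba = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

# Toy paired batch: a common block plus modality-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 4))
z_img = np.hstack([shared, rng.normal(size=(8, 3))])  # shared + image-specific
z_pts = np.hstack([shared, rng.normal(size=(8, 3))])  # shared + point-cloud-specific

aligned = partial_info_nce(z_img, z_pts, shared_dim=4)
shuffled = partial_info_nce(z_img, z_pts[::-1], shared_dim=4)  # pairs broken
```

In this toy setup `aligned` comes out lower than `shuffled`, because only correctly paired rows agree on their first four coordinates; the modality-specific tail never enters the loss.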

What carries the argument

Two-stage framework of latent representation alignment (self-modal reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification) followed by cross-modal recomposition via encoder-decoder integration.
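The recomposition stage can be pictured with a toy linear model in which two modalities are generated from one shared latent. The sketch below is illustrative only: the linear mixing matrices, the pseudo-inverse encoder, and every name in it are assumptions made for the example, whereas the paper works with learned encoders and pre-trained MLLM decoders.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, dim_new, dim_old = 4, 10, 7

# Toy generative maps from a shared latent to each modality (assumed linear).
mix_new = rng.normal(size=(latent_dim, dim_new))  # stands in for a new modality
mix_old = rng.normal(size=(latent_dim, dim_old))  # stands in for a pre-trained modality

def encode_new(x):
    # Encoder of the newly added modality: least-squares inverse of mix_new.
    return x @ np.linalg.pinv(mix_new)

def decode_old(z):
    # Decoder of the pre-trained modality.
    return z @ mix_old

def recompose(x_new):
    # Cross-modal recomposition: new-modality encoder, then old-modality decoder.
    return decode_old(encode_new(x_new))

shared = rng.normal(size=(5, latent_dim))
x_new = shared @ mix_new          # observations in the new modality
x_old = shared @ mix_old          # what the old modality would have shown
x_old_hat = recompose(x_new)      # generated without any (new, old) training pair
```

Here `x_old_hat` matches `x_old` up to floating-point error because both maps read from the same latent coordinates. The identifiability argument in the paper is about when learned encoders can be expected to land in such a common coordinate system.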

If this is right

New modalities such as 3D point clouds or tactile data can be added to pre-trained MLLMs using only pairwise pairings with existing modalities.
Cross-modal performance remains strong after alignment without access to the full joint multimodal distribution.
Training cost drops because multi-way aligned datasets no longer need to be curated for every combination of modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairwise-only recipe might allow incremental addition of further modalities beyond the two demonstrated in the paper.
If the identifiability conditions hold in practice, similar alignment stages could be inserted into other contrastive or reconstruction-based multimodal pipelines.
The framework suggests a route to modality expansion in settings where collecting full multi-way alignments is logistically impossible.

Load-bearing premise

The inductive biases of partial alignment and minimal latent specification in pairwise contrastive learning are enough to recover identifiable shared representations that support cross-modal recomposition.

What would settle it

A controlled experiment that removes the partial-alignment bias during contrastive learning and measures whether cross-modal generation or reconstruction quality collapses on held-out modality pairs.

Figures

Figures reproduced from arXiv: 2605.21059 by Gongxu Luo, Guangyi Chen, Kun Zhang, Yan Li, Yuewen Sun, Yunlong Deng.

**Figure 2.** Data generation process. The shared latent space $c$ (blue) contains factors that influence multiple modalities, while modality-specific variables $s$ (red) affect only their corresponding modality. Each observation $x^{(m)}$ is generated from its shared latent block $z_c^{(m)}$ and modality-specific block $z_s^{(m)}$. Data Generating Process. We consider $x := [x^{(1)}, \dots, x^{(M)}]$ to denote a collection of observat…
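
The generative picture in the caption can be written compactly. The mixing function $g_m$ is an assumption introduced here for illustration; only the symbols $x^{(m)}$, $z_c^{(m)}$, $z_s^{(m)}$, and $M$ come from the caption.

```latex
x^{(m)} = g_m\bigl(z_c^{(m)},\, z_s^{(m)}\bigr), \qquad m = 1, \dots, M,
\qquad x := \bigl[x^{(1)}, \dots, x^{(M)}\bigr]
```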

**Figure 3.** Two-stage framework for multimodal learning from pairwise-only

Original abstract

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a workable path to train MLLMs on pairwise data with identifiability theory and a two-stage alignment-recomposition method, but the practical enforcement of shared latents looks under-supported.

The main point is that this work gives a concrete way to sidestep full joint multimodal datasets by training on pairwise observations only. They start with a theoretical analysis of identifiability conditions for representations when you see only pairs, then build a two-stage framework: first align latents across modalities using self-reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification as inductive biases, then attach new encoders to existing decoders for cross-modal transfer. The experiments add 3D point clouds and tactile data to pre-trained MLLMs and report decent cross-modal performance, which directly tackles the scalability headache of curating three-way or more aligned data.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs can be trained scalably using only pairwise modality data rather than fully joint multimodal alignments. It first derives theoretical conditions for identifiability of shared latent representations from pairwise observations alone, then introduces a two-stage framework: (1) latent alignment via self-modal reconstruction plus pairwise contrastive learning that incorporates partial alignment and minimal latent specification as inductive biases, and (2) cross-modal recomposition by grafting encoders of new modalities (3D point clouds, tactile) onto pre-trained decoders. Experiments adding these modalities to existing MLLMs are reported to yield strong cross-modal transfer and generation performance.

Significance. If the identifiability result is rigorous and the inductive biases provably enforce a shared latent space in practice, the work would meaningfully lower the data-collection barrier for multimodal models, enabling cheaper extension to 3D and tactile modalities without requiring expensive three-way alignments.

major comments (2)

[§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.
[§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.
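
One form of the rotation ambiguity raised in the major comments can be checked numerically. The sketch below is an illustration, not taken from the paper: it uses a plain dot-product InfoNCE loss (the helper `info_nce` and all variable names are hypothetical) and shows that applying the same orthogonal matrix to both modalities' latents leaves the loss unchanged, so this loss by itself cannot fix the latent axes.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    # One-directional InfoNCE with dot-product similarity; row i of z_b
    # is the positive for row i of z_a.
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()

rng = np.random.default_rng(0)
z_a = rng.normal(size=(16, 5))
z_b = z_a + 0.1 * rng.normal(size=(16, 5))

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

loss_original = info_nce(z_a, z_b)
loss_rotated = info_nce(z_a @ q, z_b @ q)  # same rotation on both sides
```

The two losses agree because `(z_a @ q) @ (z_b @ q).T` equals `z_a @ z_b.T` for any orthogonal `q`. Whether the partial-alignment term removes this freedom is exactly what the referee asks the authors to show.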

minor comments (2)

[§4.1] The abstract and §4.1 refer to 'minimal latent specification' without a precise definition or hyper-parameter schedule; a short paragraph clarifying the exact constraint (e.g., dimension or sparsity) would improve reproducibility.
[Figure 3] Figure 3 (cross-modal generation examples) lacks quantitative metrics alongside the qualitative samples; adding FID or retrieval accuracy numbers would strengthen the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we plan to make to strengthen the paper.

Referee: [§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.

Authors: The identifiability theorem in Section 3 provides the theoretical foundation for recovering shared latent representations from pairwise modality observations under the specified axioms. The contrastive learning objective, augmented with partial alignment and minimal latent specification, is intended to operationalize these conditions by promoting invariance across modalities while the self-reconstruction term helps prevent dimensional collapse. We acknowledge that the manuscript does not include an explicit analysis or verification showing that this objective eliminates all non-identifiable solutions, such as modality-specific rotations in higher latent dimensions. To address this point, we will add a subsection in the revised version that discusses how the inductive biases interact with the identifiability conditions, including a sketch of why rotations are discouraged by the partial alignment term and collapse is mitigated by reconstruction.

revision: yes

Referee: [§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.

Authors: We agree that providing a quantitative ablation on the impact of the partial-alignment term would better support the claim. Currently, the experiments focus on end-to-end performance in cross-modal tasks with the full framework. We will include an additional ablation study in the revised manuscript that compares the proposed contrastive objective (with partial alignment) against standard InfoNCE, using diagnostics such as the variance of cross-modal latent distances or the consistency of representations under random rotations to quantify the reduction in admissible transformations.

revision: yes
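
A rotation-consistency diagnostic of the kind mentioned in this response could be built on the orthogonal Procrustes problem. The sketch below is one possible reading, not the authors' planned ablation: `procrustes_residual`, `raw_gap`, and the synthetic latents are hypothetical. It compares the gap between paired latents before and after fitting the best orthogonal map from one modality to the other.

```python
import numpy as np

def procrustes_residual(z_a, z_b):
    """Relative error left after the best orthogonal map from z_a to z_b.

    Solves min_R ||z_a @ R - z_b||_F over orthogonal R using the SVD of
    z_a.T @ z_b (the classical orthogonal Procrustes solution).
    """
    u, _, vt = np.linalg.svd(z_a.T @ z_b)
    rotation = u @ vt
    return np.linalg.norm(z_a @ rotation - z_b) / np.linalg.norm(z_b)

def raw_gap(z_a, z_b):
    # Relative distance between paired latents with no alignment applied.
    return np.linalg.norm(z_a - z_b) / np.linalg.norm(z_b)

rng = np.random.default_rng(0)
z_text = rng.normal(size=(32, 6))
q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
z_touch = z_text @ q  # same content expressed in a rotated basis

gap_before = raw_gap(z_text, z_touch)
gap_after = procrustes_residual(z_text, z_touch)
```

A large `gap_before` together with a near-zero `gap_after` indicates latents that agree only up to a rotation. A shared space that supports swapping encoders and decoders directly would be expected to show a small gap even before the Procrustes fit.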

Circularity Check

0 steps flagged

Theoretical identifiability analysis stands as independent foundation; no reduction of results to inputs by construction.

full rationale

The paper first derives conditions for identifiability from pairwise modality observations as a standalone theoretical step, then constructs a two-stage framework (latent alignment via self-reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification biases, followed by encoder-decoder recomposition) that applies those conditions in practice. No equations or claims reduce the identifiability result or cross-modal performance predictions to fitted parameters, self-definitions, or self-citation chains; the inductive biases are presented as implementation choices rather than tautological restatements of the theory. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of identifiability conditions from pairwise observations and the practical effectiveness of the proposed inductive biases without requiring full joint multimodal data.

free parameters (1)

latent space dimension
The minimal latent specification implies a choice of shared representation dimension that balances information retention and simplicity, likely tuned during training.

axioms (1)

domain assumption: Representations from different modalities share an identifiable common latent structure recoverable from pairwise observations under suitable conditions.
This is invoked as the foundation for the theoretical analysis that justifies the framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities... partial Jacobian A_{j←i}... Assumption 1(A2) ... X L_{ij} A_{j←i} = I

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

self-modal reconstruction and pair-wise contrastive learning... inductive bias... partial alignment and minimal latent specification

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 15 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida,D.,Altenschmidt,J.,Altman,S.,Anadkat,S.,etal.:Gpt-4technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Advances in neural information processing systems 34, 24206–24221 (2021)

Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in neural information processing systems 34, 24206–24221 (2021)

work page 2021
[3]

Advances in neural information processing systems35, 23716–23736 (2022)

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716–23736 (2022)

work page 2022
[4]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Advances in neural information processing systems 33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

work page 1901
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cai, M., Liu, H., Mustikovela, S.K., Meyer, G.P., Chai, Y., Park, D., Lee, Y.J.: Vip-llava: Making large multimodal models understand arbitrary visual prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12914–12923 (2024)

work page 2024
[7]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

In: European conference on computer vision

Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)

work page 2020
[9]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

work page 2024
[10]

Qwen2-Audio Technical Report

Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024) Multimodal LLMs under Pairwise Modalities 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Advances in neural information processing systems36, 49250–49267 (2023)

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

work page 2023
[12]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

work page 2023
[13]

arXiv preprint arXiv:2510.10487 (2025)

Deng, Y., Chen, G., Gu, T., Kong, L., Li, Y., Tang, Z., Zhang, K.: Towards self-refinement of vision-language models with triangular consistency. arXiv preprint arXiv:2510.10487 (2025)

work page arXiv 2025
[14]

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., et al.: Palm-e: An embodied multimodal language model (2023)

work page 2023
[15]

In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning au- dio concepts from natural language supervision. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

work page 2023
[16]

A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

Fu, L., Datta, G., Huang, H., Panitch, W.C.H., Drake, J., Ortiz, J., Mukadam, M., Lambeta, M., Calandra, R., Goldberg, K.: A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232 (2024)

work page arXiv 2024
[17]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., et al.: Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15180–15190 (2023)

work page 2023
[19]

In: Conference on Uncertainty in Artificial Intelligence

Hälvä, H., Hyvarinen, A.: Hidden markov nonlinear ica: Unsupervised learn- ing from nonstationary time series. In: Conference on Uncertainty in Artificial Intelligence. pp. 939–948. PMLR (2020)

work page 2020
[20]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: Onellm: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26584–26595 (2024)

work page 2024
[21]

Iclr 1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr 1(2), 3 (2022)

work page 2022
[22]

Advances in Neural Information Processing Systems36, 72096–72109 (2023)

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O.K., Patra, B., et al.: Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems36, 72096–72109 (2023)

work page 2023
[23]

In: The 22nd International Conference on Artificial Intelligence and Statistics

Hyvarinen, A., Sasaki, H., Turner, R.: Nonlinear ica using auxiliary variables and generalized contrastive learning. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 859–868. PMLR (2019) 18 Y. Li et al

work page 2019
[24]

In: International conference on machine learning

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

work page 2021
[25]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

In: International conference on machine learning

Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: International conference on machine learning. pp. 5583–5594. PMLR (2021)

work page 2021
[27]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Kong, L., Ma, M.Q., Chen, G., Xing, E.P., Chi, Y., Morency, L.P., Zhang, K.: Understanding masked autoencoders via hierarchical latent variable models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7918–7928 (2023)

work page 2023
[28]

In: International conference on machine learning

Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., Zhang, K.: Partial disentanglement for domain adaptation. In: International conference on machine learning. pp. 11455–11472. PMLR (2022)

work page 2022
[29]

arXiv preprint arXiv:2207.07732 (2022)

Lachapelle, S., Lacoste-Julien, S.: Partial disentanglement via mechanism sparsity. arXiv preprint arXiv:2207.07732 (2022)

work page arXiv 2022
[30]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

work page 2023
[32]

In: Inter- national conference on machine learning

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. In: Inter- national conference on machine learning. pp. 12888–12900. PMLR (2022)

work page 2022
[33]

Advances in neural information processing systems34, 9694– 9705 (2021)

Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems34, 9694– 9705 (2021)

work page 2021
[34]

Advances in Neural Information Processing Systems36, 34504–34518 (2023)

Li, Z., Cai, R., Chen, G., Sun, B., Hao, Z., Zhang, K.: Subspace identifica- tion for multi-source domain adaptation. Advances in Neural Information Processing Systems36, 34504–34518 (2023)

work page 2023
[36]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

work page 2023
[37]

In: international conference on machine learning

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., Bachem, O.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: international conference on machine learning. pp. 4114–4124. PMLR (2019)

work page 2019
[38]

Advances in neural information processing systems32(2019) Multimodal LLMs under Pairwise Modalities 19

Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems32(2019) Multimodal LLMs under Pairwise Modalities 19

work page 2019
[39]

In: Proceedings of the AAAI conference on artificial intelligence

Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., Peng, X.: Smil: Multi- modal learning with severely missing modality. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2302–2310 (2021)

work page 2021
[40]

Cambridge university press (2009)

Pearl, J.: Causality. Cambridge university press (2009)

work page 2009
[41]

The MIT press (2017)

Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: founda- tions and learning algorithms. The MIT press (2017)

work page 2017
[42]

arXiv preprint arXiv:2402.17766 (2024)

Qi, Z., Dong, R., Zhang, S., Geng, H., Han, C., Ge, Z., Wang, H., Yi, L., Ma, K.: Shapellm: Universal 3d object understanding for embodied interaction. arXiv preprint arXiv:2402.17766 (2024)

work page arXiv 2024
[43]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

work page 2021
[44]

Proceedings of the IEEE109(5), 612–634 (2021)

Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Toward causal representation learning. Proceedings of the IEEE109(5), 612–634 (2021)

work page 2021
[45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: Flava: A foundational language and vision alignment model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15638–15650 (2022)

work page 2022
[46]

MIT press (2000)

Spirtes, P., Glymour, C.N., Scheines, R.: Causation, prediction, and search. MIT press (2000)

work page 2000
[47]

arXiv preprint arXiv:2411.06518 (2024)

Sun, Y., Kong, L., Chen, G., Li, L., Luo, G., Li, Z., Zhang, Y., Zheng, Y., Yang, M., Stojanov, P., et al.: Causal representation learning from multimodal biological observations. arXiv preprint arXiv:2411.06518 (2024)

work page arXiv 2024
[48]

Lxmert: Learning cross- modality encoder representations from transformers

Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representa- tions from transformers. arXiv preprint arXiv:1908.07490 (2019)

work page arXiv 1908
[49]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalk- wyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Advances in neural information processing systems30(2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems30(2017)

work page 2017
[51]

Advances in neural information processing systems34, 16451–16467 (2021)

Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., Locatello, F.: Self-supervised learning with data augmenta- tions provably isolates content from style. Advances in neural information processing systems34, 16451–16467 (2021)

work page 2021
[52]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L., Carneiro, G.: Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15878–15887 (2023)

work page 2023
[53]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 20 Y. Li et al

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International conference on machine learning. pp. 23318–23340. PMLR (2022)

[55]

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

[56]

Wang, Z., Zhao, Y., Huang, H., Liu, J., Yin, A., Tang, L., Li, L., Wang, Y., Zhang, Z., Zhao, Z.: Connecting multi-modal contrastive representations. Advances in Neural Information Processing Systems 36, 22099–22114 (2023)

[57]

Wu, A., Kuang, K., Zhu, M., Wang, Y., Zheng, Y., Han, K., Li, B., Chen, G., Wu, F., Zhang, K.: Causality for large language models. arXiv preprint arXiv:2410.15319 (2024)

[58]

Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. Advances in neural information processing systems 31 (2018)

[59]

Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: Next-gpt: Any-to-any multimodal llm. In: Forty-first International Conference on Machine Learning (2024)

[60]

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1912–1920 (2015)

[61]

Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21469–21480 (2025)

[62]

Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C.: Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021)

[63]

Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., et al.: Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215 (2025)

[64]

Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., Lv, Y., Wang, Y., Guo, D., Wang, H., Ma, L., Zhang, P., Zhang, X., Hao, H., Guo, Z., Yang, B., Zhang, B., Ma, Z., Wei, X., Bai, S., Chen, K., Liu, X., Wang, P., Yang, M., Liu, D., Ren, X., Zheng, B., Men, R., Zhou, F., Yu, B., Yang, J., Yu, L., Zhou, J., Lin, J.: Qwen3...

[65]

Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: Pointllm: Empowering large language models to understand point clouds. In: European Conference on Computer Vision. pp. 131–147. Springer (2024)

[66]

Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: Lavt: Language-aware vision transformer for referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18155–18165 (2022)

[67]

Yao, D., Xu, D., Lachapelle, S., Magliacane, S., Taslakian, P., Martius, G., von Kügelgen, J., Locatello, F.: Multi-view causal representation learning with partial observability. arXiv preprint arXiv:2311.04056 (2023)

[68]

Yao, W., Chen, G., Zhang, K.: Temporally disentangled representation learning. In: Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=Vi-sZWNA_Ue

[69]

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)

[70]

Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19313–19322 (2022)

[71]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

[72]

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

[73]

Zhang, K., Xie, S., Ng, I., Zheng, Y.: Causal representation learning from multiple distributions: A general setting. arXiv preprint arXiv:2402.05052 (2024)

[74]

Zhang, Z., Wang, Z., Liu, L., Huang, R., Cheng, X., Ye, Z., Lin, W., Liu, H., Huang, H., Zhao, Y., et al.: Extending multi-modal contrastive representations. Advances in Neural Information Processing Systems 37, 91880–91903 (2024)

[75]

Zheng, Y., Ng, I., Zhang, K.: On the identifiability of nonlinear ica: Sparsity and beyond. Advances in neural information processing systems 35, 16411–16422 (2022)

[76]

Zhu, B., Lin, B., Ning, M., Yan, Y., Cui, J., Wang, H., Pang, Y., Jiang, W., Zhang, J., Li, Z., et al.: Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852 (2023)

[77]

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

[78]

Zimmermann, R.S., Sharma, Y., Schneider, S., Bethge, M., Brendel, W.: Contrastive learning inverts the data generating process. In: International conference on machine learning. pp. 12979–12990. PMLR (2021)

[79]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

Table A1: Key notations in this paper. Notation Meaning M,[M]Number of modalities; index se...

[1]

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2]

Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in neural information processing systems 34, 24206–24221 (2021)

[3]

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)

[4]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

[5]

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

[6]

Cai, M., Liu, H., Mustikovela, S.K., Meyer, G.P., Chai, Y., Park, D., Lee, Y.J.: Vip-llava: Making large multimodal models understand arbitrary visual prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12914–12923 (2024)

[7]

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

[8]

Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)

[9]

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

[10]

Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)

[11]

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)

[12]

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

[13]

Deng, Y., Chen, G., Gu, T., Kong, L., Li, Y., Tang, Z., Zhang, K.: Towards self-refinement of vision-language models with triangular consistency. arXiv preprint arXiv:2510.10487 (2025)

[14]

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., et al.: Palm-e: An embodied multimodal language model (2023)

[15]

Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning audio concepts from natural language supervision. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

[16]

Fu, L., Datta, G., Huang, H., Panitch, W.C.H., Drake, J., Ortiz, J., Mukadam, M., Lambeta, M., Calandra, R., Goldberg, K.: A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232 (2024)

[17]

Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., et al.: Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

[18]

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15180–15190 (2023)

[19]

Hälvä, H., Hyvarinen, A.: Hidden markov nonlinear ica: Unsupervised learning from nonstationary time series. In: Conference on Uncertainty in Artificial Intelligence. pp. 939–948. PMLR (2020)

[20]

Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: Onellm: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26584–26595 (2024)

[21]

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[22]

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O.K., Patra, B., et al.: Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems 36, 72096–72109 (2023)

[23]

Hyvarinen, A., Sasaki, H., Turner, R.: Nonlinear ica using auxiliary variables and generalized contrastive learning. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 859–868. PMLR (2019)

[24]

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

[25]

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[26]

Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: International conference on machine learning. pp. 5583–5594. PMLR (2021)

[27]

Kong, L., Ma, M.Q., Chen, G., Xing, E.P., Chi, Y., Morency, L.P., Zhang, K.: Understanding masked autoencoders via hierarchical latent variable models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7918–7928 (2023)

[28]

Kong, L., Xie, S., Yao, W., Zheng, Y., Chen, G., Stojanov, P., Akinwande, V., Zhang, K.: Partial disentanglement for domain adaptation. In: International conference on machine learning. pp. 11455–11472. PMLR (2022)

[29]

Lachapelle, S., Lacoste-Julien, S.: Partial disentanglement via mechanism sparsity. arXiv preprint arXiv:2207.07732 (2022)

[30]

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

[31]

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

[32]

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. pp. 12888–12900. PMLR (2022)

[33]

Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)

[34]

Li, Z., Cai, R., Chen, G., Sun, B., Hao, Z., Zhang, K.: Subspace identification for multi-source domain adaptation. Advances in Neural Information Processing Systems 36, 34504–34518 (2023)

[36]

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)

[37]

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., Bachem, O.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: International conference on machine learning. pp. 4114–4124. PMLR (2019)

[38]

Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)

[39]

Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., Peng, X.: Smil: Multi-modal learning with severely missing modality. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2302–2310 (2021)

[40]

Pearl, J.: Causality. Cambridge university press (2009)

[41]

Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: foundations and learning algorithms. The MIT press (2017)

[42]

Qi, Z., Dong, R., Zhang, S., Geng, H., Han, C., Ge, Z., Wang, H., Yi, L., Ma, K.: Shapellm: Universal 3d object understanding for embodied interaction. arXiv preprint arXiv:2402.17766 (2024)

[43]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[44]

Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Toward causal representation learning. Proceedings of the IEEE 109(5), 612–634 (2021)

[45]

Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: Flava: A foundational language and vision alignment model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15638–15650 (2022)

[46]

Spirtes, P., Glymour, C.N., Scheines, R.: Causation, prediction, and search. MIT press (2000)

[47]

Sun, Y., Kong, L., Chen, G., Li, L., Luo, G., Li, Z., Zhang, Y., Zheng, Y., Yang, M., Stojanov, P., et al.: Causal representation learning from multimodal biological observations. arXiv preprint arXiv:2411.06518 (2024)
