Multimodal LLMs under Pairwise Modalities
Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3
The pith
Pairwise modality observations suffice to learn identifiable shared representations for multimodal LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By observing only pairwise modalities and applying an inductive bias of partial alignment together with minimal latent specification during contrastive learning, the model recovers an identifiable shared latent space; encoders of newly introduced modalities can then be integrated with decoders of pre-trained modalities to enable effective cross-modal recomposition and transfer without requiring joint multi-way data.
What carries the argument
Two-stage framework of latent representation alignment (self-modal reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification) followed by cross-modal recomposition via encoder-decoder integration.
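The stage-one objective described above can be sketched as a reconstruction term plus a pairwise contrastive term. This is a minimal illustration, not the paper's implementation: the function names, the `weight` and `temperature` parameters, and the choice to model "partial alignment" as contrasting only the first `shared_dims` latent coordinates are all assumptions made for this example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: z_a[i] and z_b[i] form the positive pair,
    every other z_b[j] acts as a negative for z_a[i]."""
    total = 0.0
    for i, anchor in enumerate(z_a):
        logits = [dot(anchor, z) / temperature for z in z_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(z_a)

def stage_one_loss(x, x_hat, z_a, z_b, shared_dims, weight=1.0):
    """Self-modal reconstruction plus a contrastive term applied only to
    the first `shared_dims` coordinates (hypothetical stand-in for the
    paper's partial alignment)."""
    recon = sum(mse(a, b) for a, b in zip(x, x_hat)) / len(x)
    shared_a = [z[:shared_dims] for z in z_a]
    shared_b = [z[:shared_dims] for z in z_b]
    return recon + weight * info_nce(shared_a, shared_b)

# Toy batch: the third latent coordinate is modality-specific and ignored.
x = [[0.2, 0.4], [0.6, 0.8]]
z_a = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]]
z_b = [[1.0, 0.0, -7.0], [0.0, 1.0, 2.0]]
aligned = stage_one_loss(x, x, z_a, z_b, shared_dims=2)
shuffled = stage_one_loss(x, x, z_a, z_b[::-1], shared_dims=2)
```

In this toy batch the correctly paired latents give a lower loss than the shuffled ones, and changing the unshared third coordinate leaves the loss untouched, which is the behavior a partial-alignment term of this assumed form would have.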
If this is right
- New modalities such as 3D point clouds or tactile data can be added to pre-trained MLLMs using only pairwise pairings with existing modalities.
- Cross-modal performance remains strong after alignment without access to the full joint multimodal distribution.
- Training cost drops because multi-way aligned datasets no longer need to be curated for every combination of modalities.
Where Pith is reading between the lines
- The same pairwise-only recipe might allow incremental addition of further modalities beyond the two demonstrated in the paper.
- If the identifiability conditions hold in practice, similar alignment stages could be inserted into other contrastive or reconstruction-based multimodal pipelines.
- The framework suggests a route to modality expansion in settings where collecting full multi-way alignments is logistically impossible.
Load-bearing premise
The inductive biases of partial alignment and minimal latent specification in pairwise contrastive learning are enough to recover identifiable shared representations that support cross-modal recomposition.
What would settle it
A controlled experiment that removes the partial-alignment bias during contrastive learning and measures whether cross-modal generation or reconstruction quality collapses on held-out modality pairs.
Original abstract
Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.
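Stage two of the abstract, pairing a new modality's encoder with a pre-trained modality's decoder, amounts to composing two maps through the shared latent space. The sketch below uses toy linear maps to show only the wiring; `make_linear`, `recompose`, and the matrices are hypothetical and stand in for trained networks.

```python
def make_linear(matrix):
    """Return a function that applies a fixed matrix to a vector."""
    def apply(v):
        return [sum(w * x for w, x in zip(row, v)) for row in matrix]
    return apply

def recompose(new_encoder, pretrained_decoder):
    """Chain a new-modality encoder into a pre-trained decoder.
    This is only meaningful if both were aligned to the same latent space."""
    def transfer(x):
        return pretrained_decoder(new_encoder(x))
    return transfer

# Hypothetical toy maps: a 3-dim "tactile" input goes to a 2-dim latent,
# and a pre-trained decoder maps that latent to a 2-dim output.
tactile_encoder = make_linear([[1, 0, 0], [0, 1, 0]])
text_decoder = make_linear([[2, 0], [0, 3]])
tactile_to_text = recompose(tactile_encoder, text_decoder)
output = tactile_to_text([1.0, 2.0, 3.0])
```

The composition itself is trivial; whether it produces useful output depends entirely on the stage-one alignment, which is what the paper's identifiability analysis is meant to justify.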
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MLLMs can be trained scalably using only pairwise modality data rather than fully joint multimodal alignments. It first derives theoretical conditions for identifiability of shared latent representations from pairwise observations alone, then introduces a two-stage framework: (1) latent alignment via self-modal reconstruction plus pairwise contrastive learning that incorporates partial alignment and minimal latent specification as inductive biases, and (2) cross-modal recomposition by grafting encoders of new modalities (3D point clouds, tactile) onto pre-trained decoders. Experiments adding these modalities to existing MLLMs are reported to yield strong cross-modal transfer and generation performance.
Significance. If the identifiability result is rigorous and the inductive biases provably enforce a shared latent space in practice, the work would meaningfully lower the data-collection barrier for multimodal models, enabling cheaper extension to 3D and tactile modalities without requiring expensive three-way alignments.
major comments (2)
- [§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.
- [§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.
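The rotation concern in these comments can be made concrete with a toy check. A dot-product InfoNCE loss depends only on inner products, so applying the same rotation to both modalities' latents leaves it unchanged. The sketch below assumes a plain InfoNCE with a hypothetical temperature of 0.5 and hand-picked 2-D latents; it says nothing about the paper's actual objective or how far its partial-alignment term narrows this ambiguity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(z_a, z_b, temperature=0.5):
    """Plain InfoNCE: z_a[i] and z_b[i] are positives, the rest negatives."""
    total = 0.0
    for i, anchor in enumerate(z_a):
        logits = [dot(anchor, z) / temperature for z in z_b]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(z_a)

def rotate2d(z, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1]]

z_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
z_b = [[0.9, 0.1], [0.1, 0.8], [-0.7, 0.6]]

base = info_nce(z_a, z_b)
# Same rotation on both modalities: inner products, and the loss, are preserved.
shared = info_nce([rotate2d(z, 1.1) for z in z_a],
                  [rotate2d(z, 1.1) for z in z_b])
# Rotation on one modality only: pairs are misaligned and the loss changes.
one_sided = info_nce([rotate2d(z, 1.1) for z in z_a], z_b)
```

Here `shared` matches `base` up to floating-point error while `one_sided` is larger, so in this toy setting the loss cannot distinguish a latent space from a jointly rotated copy of it.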
minor comments (2)
- [§4.1] The abstract and §4.1 refer to 'minimal latent specification' without a precise definition or hyper-parameter schedule; a short paragraph clarifying the exact constraint (e.g., dimension or sparsity) would improve reproducibility.
- [Figure 3] Figure 3 (cross-modal generation examples) lacks quantitative metrics alongside the qualitative samples; adding FID or retrieval accuracy numbers would strengthen the performance claims.
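The retrieval accuracy the referee asks for could be reported as recall@k over paired embeddings. The following is a minimal sketch assuming dot-product similarity and index-aligned pairs (query `i` matches gallery item `i`); the function name and the toy vectors are illustrative and not taken from the paper.

```python
def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose paired gallery item (same index)
    appears among the k highest dot-product scores."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy embeddings: the second query is out-scored by a distractor at k=1.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[0.9, 0.1], [0.2, 0.8], [0.1, 1.0]]
r1 = recall_at_k(queries, gallery, k=1)
r2 = recall_at_k(queries, gallery, k=2)
```

In this example two of three queries retrieve their pair at rank one, and all three do within the top two.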
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we plan to make to strengthen the paper.
- Referee: [§3] §3 (Theoretical Analysis): the identifiability theorem is stated to hold under pairwise observations, yet the manuscript does not explicitly verify that the subsequent contrastive objective with only partial alignment and minimal latent specification is strong enough to rule out non-identifiable solutions (e.g., modality-specific rotations or dimensional collapse) once the latent dimension exceeds the free parameter listed in the axiom ledger.
  Authors: The identifiability theorem in Section 3 provides the theoretical foundation for recovering shared latent representations from pairwise modality observations under the specified axioms. The contrastive learning objective, augmented with partial alignment and minimal latent specification, is intended to operationalize these conditions by promoting invariance across modalities while the self-reconstruction term helps prevent dimensional collapse. We acknowledge that the manuscript does not include an explicit analysis or verification showing that this objective eliminates all non-identifiable solutions, such as modality-specific rotations in higher latent dimensions. To address this point, we will add a subsection in the revised version that discusses how the inductive biases interact with the identifiability conditions, including a sketch of why rotations are discouraged by the partial alignment term and collapse is mitigated by reconstruction.
  Revision: yes
- Referee: [§4.2] §4.2 (Contrastive Learning Stage): the claim that the two-stage procedure recovers an identifiable shared space rests on the inductive biases being sufficient; however, no ablation or diagnostic is shown that quantifies how much the partial-alignment term reduces the set of admissible rotations relative to standard InfoNCE, leaving the central empirical claim under-supported.
  Authors: We agree that providing a quantitative ablation on the impact of the partial-alignment term would better support the claim. Currently, the experiments focus on end-to-end performance in cross-modal tasks with the full framework. We will include an additional ablation study in the revised manuscript that compares the proposed contrastive objective (with partial alignment) against standard InfoNCE, using diagnostics such as the variance of cross-modal latent distances or the consistency of representations under random rotations to quantify the reduction in admissible transformations.
  Revision: yes
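One of the diagnostics named in this response, the variance of cross-modal latent distances, could be computed along the following lines. This is a sketch under assumptions: Euclidean distance, index-aligned pairs, and population variance are choices made here, and the rebuttal does not specify any of them. The rotation-consistency diagnostic is not sketched because the rebuttal does not define it.

```python
import math
from statistics import mean, pvariance

def paired_distance_stats(z_a, z_b):
    """Mean and population variance of the Euclidean distance between
    paired latents z_a[i] and z_b[i] from two modalities."""
    dists = [math.dist(a, b) for a, b in zip(z_a, z_b)]
    return mean(dists), pvariance(dists)

# Toy latents: one pair is far apart, the other coincides.
z_a = [[0.0, 0.0], [1.0, 1.0]]
z_b = [[3.0, 4.0], [1.0, 1.0]]
avg_dist, var_dist = paired_distance_stats(z_a, z_b)
```

A lower mean and variance under the proposed objective than under standard InfoNCE would be consistent with tighter alignment, though on its own it would not establish identifiability.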
Circularity Check
Theoretical identifiability analysis stands as independent foundation; no reduction of results to inputs by construction.
full rationale
The paper first derives conditions for identifiability from pairwise modality observations as a standalone theoretical step, then constructs a two-stage framework (latent alignment via self-reconstruction plus pairwise contrastive learning with partial alignment and minimal latent specification biases, followed by encoder-decoder recomposition) that applies those conditions in practice. No equations or claims reduce the identifiability result or cross-modal performance predictions to fitted parameters, self-definitions, or self-citation chains; the inductive biases are presented as implementation choices rather than tautological restatements of the theory. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent space dimension
axioms (1)
- domain assumption: Representations from different modalities share an identifiable common latent structure recoverable from pairwise observations under suitable conditions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities... partial Jacobian A_{j←i}... Assumption 1(A2) ... X L_{ij} A_{j←i} = I
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  self-modal reconstruction and pair-wise contrastive learning... inductive bias... partial alignment and minimal latent specification
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.