Recognition: 2 theorem links
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3
The pith
Amplifying redundant multimodal interactions reduces visual errors in vision-language models by 38.3%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that modern instruction datasets eliminate redundancies in multimodal interactions to prioritize visual grounding, leaving models unable to compensate for impaired modalities. By introducing a self-captioning workflow with a Multimodal Interaction Gate that converts unique interactions into redundant ones, the model gains exploitable shared information. This reduces visual-induced errors by 38.3% and improves consistency by 16.8%.
What carries the argument
The Multimodal Interaction Gate: a mechanism in the self-captioning workflow that converts unique interactions into redundant interactions to increase exploitable shared information between modalities.
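The paper does not specify how the gate is implemented; as a minimal sketch of the idea, the gate can be viewed as selecting image-only (unique) facts for the caption to restate as text, so that the same facts become available in both modalities (redundant). The function names and set-based abstraction here are illustrative, not the paper's method.

```python
# Hypothetical sketch: a "gate" that picks out image-only facts, and a
# self-captioning step that restates them in the text modality, turning
# unique interactions into redundant ones.

def interaction_gate(image_facts: set, text_facts: set) -> set:
    """Return facts unique to the image that the caption should restate."""
    return image_facts - text_facts

def self_caption(image_facts: set, text_facts: set) -> set:
    """Augment the text modality with the gated image-only facts."""
    return text_facts | interaction_gate(image_facts, text_facts)

image_facts = {"red car", "wet road", "two pedestrians"}
text_facts = {"red car"}
augmented = self_caption(image_facts, text_facts)
# After captioning, every image fact is shared with the text modality,
# so a corrupted image can in principle be compensated by the text.
assert image_facts <= augmented
```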
If this is right
- Vision-language models can use shared redundant information to resolve ambiguities when one modality is corrupted or missing.
- Self-captioning enables robustness improvements on existing models and datasets without requiring new instruction data.
- Response consistency increases because redundant signals reinforce correct interpretations across modalities.
- Hallucination rates drop as the model relies more on overlapping information rather than modality-specific guesses.
- The method bridges the gap between training for precise visual grounding and the need for real-world robustness.
Where Pith is reading between the lines
- The same redundancy-amplification idea could apply to other multimodal settings such as audio-visual or text-audio models facing noise.
- Future instruction dataset design might intentionally retain some redundancy to build robustness in from the start rather than removing it.
- The interaction analysis framework could yield new metrics for quantifying how much shared information a training set provides.
- Running the gate at inference time instead of only during tuning might allow dynamic compensation for changing input quality.
Load-bearing premise
The assumption that modern instruction datasets eliminate redundancies to prioritize visual grounding and that converting unique interactions to redundant ones via the Multimodal Interaction Gate will reliably compensate for impaired modalities without introducing new failure modes or losing synergistic information.
What would settle it
Apply the self-captioning tuning with the Multimodal Interaction Gate to a standard vision-language model, introduce controlled visual corruptions on a benchmark, and measure whether visual-induced errors fall by about 38% and consistency rises by about 17% relative to the baseline model.
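A sketch of such an evaluation harness, under the assumption of some benchmark with (image, question, answer) items and a corruption function; `model` and `corrupt` are placeholders, since the paper excerpt names no specific benchmark or corruption type. The illustrative numbers are chosen only to match the reported 38.3% relative reduction.

```python
# Hypothetical harness: measure visual-induced error under corruption for a
# baseline and a tuned model, then compute the relative reduction.

def visual_error_rate(model, dataset, corrupt) -> float:
    """Fraction of corrupted-image questions the model answers incorrectly."""
    wrong = sum(
        1 for ex in dataset
        if model(corrupt(ex["image"]), ex["question"]) != ex["answer"]
    )
    return wrong / len(dataset)

def relative_reduction(baseline: float, tuned: float) -> float:
    """Relative drop in error rate, e.g. 0.383 for a 38.3% reduction."""
    return (baseline - tuned) / baseline

# Illustrative only: a baseline error rate of 30.0% falling to 18.5%
# corresponds to the reported 38.3% relative reduction.
print(round(relative_reduction(0.300, 0.185), 3))  # 0.383
```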
Original abstract
Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions -- redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities -- to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visual induced errors by 38.3% and improve consistency by 16.8%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modern instruction datasets eliminate redundancies in favor of visual grounding, leading to VLM hallucination and robustness issues under ambiguous or corrupted modalities. It proposes a self-captioning workflow with a Multimodal Interaction Gate to convert unique interactions into redundant ones, thereby amplifying shared information to compensate for impaired modalities. Empirical results are reported as a 38.3% reduction in visual-induced errors and 16.8% improvement in consistency.
Significance. If the attribution to redundancy amplification holds after isolating confounding factors, the work could provide a principled way to improve VLM reliability using concepts from partial information decomposition. The introduction of the gate as a mechanism to explicitly tune interaction types is a potentially useful direction, though the current evidence does not yet establish this over simpler augmentation effects.
Major comments (2)
- [Abstract / Experimental Results] The 38.3% reduction in visual-induced errors and the 16.8% consistency gain are presented as outcomes of amplifying redundant interactions via the Multimodal Interaction Gate, yet no ablation is described that compares self-captioning alone against self-captioning plus the gate. This leaves open the possibility that the gains arise from additional training signal rather than from the redundancy conversion, directly undermining the central causal claim.
- [Methods] The mechanism for converting unique interactions to redundant ones is introduced without quantitative verification that redundancy (as opposed to the unique or synergistic terms) has measurably increased, and without controls confirming that synergistic information is preserved and that no new failure modes are introduced. This is load-bearing for the hypothesis that redundancy amplification compensates for impaired modalities.
Minor comments (2)
- [Abstract / Methods] The abstract and methods would benefit from explicit definitions or a diagram of how the gate operates on interaction terms (redundant/unique/synergistic) to improve clarity for readers unfamiliar with partial information decomposition.
- [Experimental Results] Reporting of results should include error bars, number of runs, and details on data splits and baselines to allow assessment of the reported percentages.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important gaps in establishing the causal role of the Multimodal Interaction Gate and in verifying the underlying information-theoretic changes. We address each point below and will revise the manuscript to incorporate additional experiments and analysis.
Point-by-point responses
Referee: [Abstract / Experimental Results] The 38.3% reduction in visual-induced errors and the 16.8% consistency gain are presented as outcomes of amplifying redundant interactions via the Multimodal Interaction Gate, yet no ablation is described that compares self-captioning alone against self-captioning plus the gate. This leaves open the possibility that the gains arise from additional training signal rather than from the redundancy conversion, directly undermining the central causal claim.
Authors: We agree that the absence of this ablation weakens the ability to attribute gains specifically to redundancy amplification rather than the self-captioning process itself. In the revised manuscript we will add a controlled ablation that trains identical models on the self-captioning workflow both with and without the Multimodal Interaction Gate, reporting the same error and consistency metrics to isolate the gate's contribution. revision: yes
Referee: [Methods] The mechanism for converting unique interactions to redundant ones is introduced without quantitative verification that redundancy (as opposed to the unique or synergistic terms) has measurably increased, and without controls confirming that synergistic information is preserved and that no new failure modes are introduced. This is load-bearing for the hypothesis that redundancy amplification compensates for impaired modalities.
Authors: The current manuscript relies on downstream performance improvements to support the redundancy hypothesis but does not include direct quantification of changes in redundant, unique, or synergistic information. We will revise the Methods section to incorporate partial information decomposition measurements before and after the gate, together with explicit checks that synergistic terms remain stable and that no additional failure modes appear on held-out corrupted-modality test sets. revision: yes
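The rebuttal promises partial information decomposition measurements before and after the gate. The paper excerpt does not say which PID redundancy measure would be used; as one standard and easily computed choice, the minimum-mutual-information (MMI) redundancy R(T; X, Y) = min(I(X;T), I(Y;T)) can be estimated from discrete samples. The sketch below is illustrative, not the authors' planned analysis.

```python
# Sketch: estimate MMI redundancy between two modalities X, Y and a target T
# from discrete (x, y, t) samples. MMI is one standard PID redundancy measure;
# the paper does not specify which measure it would use.
from collections import Counter
from math import log2

def mutual_info(pairs) -> float:
    """I(A;B) in bits, estimated from a list of (a, b) samples."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

def mmi_redundancy(samples) -> float:
    """R(T; X, Y) = min(I(X;T), I(Y;T)) from (x, y, t) samples."""
    i_xt = mutual_info([(x, t) for x, _, t in samples])
    i_yt = mutual_info([(y, t) for _, y, t in samples])
    return min(i_xt, i_yt)

# When both modalities carry the binary target, redundancy is the full 1 bit;
# when one modality is independent of the target, redundancy is 0.
print(mmi_redundancy([(0, 0, 0), (1, 1, 1)] * 5))                    # 1.0
print(mmi_redundancy([(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]))  # 0.0
```

Comparing this quantity on model representations before and after gating would directly test whether redundancy, rather than the unique or synergistic terms, is what increases.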
Circularity Check
No significant circularity; empirical claims remain independent of method definition
Full rationale
The paper states a hypothesis on multimodal interactions (redundant, unique, synergistic), describes a self-captioning workflow plus Multimodal Interaction Gate to convert unique to redundant interactions, and reports measured outcomes (38.3% error reduction, 16.8% consistency gain) as experimental results. No equations, fitted parameters, or derivations appear in the provided text that reduce the reported gains to the method by construction. The improvements are framed as empirical findings rather than tautological outputs of the gate definition itself. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming patterns are present. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: shared information between modalities can compensate for impaired ones to resolve hallucination and robustness issues.
Invented entities (1)
- Multimodal Interaction Gate (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · Jcost_unit0 (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"amplifying redundant interactions would increase this exploitable shared information to resolve these issues... increasing redundancy can reduce visual induced errors by 38.3% and improve consistency by 16.8%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.