arXiv: 2605.12517 · v1 · submitted 2026-04-03 · cs.CL · cs.AI · cs.CV

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee (Kim Jaechul Graduate School of AI, KAIST)

Pith reviewed 2026-05-14 21:39 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CV

keywords vision-language models · missing modality · calibration · latent embeddings · cross-attention · text-only inference

The pith

A lightweight cross-attention module can predict missing visual embeddings from text to restore accuracy and calibration in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models lose substantial accuracy and produce unreliable confidence scores when images are absent, even if the accompanying text captures the key content. This gap persists beyond simple information loss and is only partially fixed by generating full images. The paper introduces the Latent Imagination Module, a small cross-attention network that generates latent visual embeddings directly from text and inserts them into a frozen VLM backbone. Experiments show consistent gains in accuracy and better calibration on text-only benchmarks, unseen tasks, and missing-image cases without any pixel synthesis. The work positions latent-level modality completion as a practical route to reliable inference when visual input is unavailable.

Core claim

The Latent Imagination Module predicts imagined latent visual embeddings from textual input via cross-attention and supplies them to a frozen VLM backbone, yielding higher accuracy and lower calibration error than text-only prompting or generated-image baselines across text-only benchmarks, unseen tasks, and missing-image scenarios.

What carries the argument

Latent Imagination Module (LIM): a lightweight cross-attention module that predicts imagined latent embeddings from text for direct input to a frozen VLM.
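
To make the mechanism concrete, here is a minimal sketch of one way a cross-attention module could turn text embeddings into a fixed number of imagined visual tokens. It is not the authors' architecture: the learnable query tokens, the single attention head, the random initialization, and every name and dimension (`LatentImaginationSketch`, `num_visual_tokens`, `text_dim`, `visual_dim`) are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LatentImaginationSketch:
    """Hypothetical single-head cross-attention from learnable queries to text.

    Assumption: one learnable query per imagined visual token. The source
    only states that the module is a lightweight cross-attention network.
    """

    def __init__(self, num_visual_tokens, text_dim, visual_dim, seed=0):
        rng = random.Random(seed)
        self.queries = rand_matrix(num_visual_tokens, visual_dim, rng)
        self.w_k = rand_matrix(text_dim, visual_dim, rng)  # key projection
        self.w_v = rand_matrix(text_dim, visual_dim, rng)  # value projection
        self.scale = 1.0 / math.sqrt(visual_dim)

    def attention_weights(self, text_embeddings):
        """Return an (N x T) matrix: each query's distribution over text tokens."""
        keys = matmul(text_embeddings, self.w_k)            # (T x d_v)
        keys_t = [list(col) for col in zip(*keys)]          # (d_v x T)
        scores = matmul(self.queries, keys_t)               # (N x T)
        return [softmax([s * self.scale for s in row]) for row in scores]

    def __call__(self, text_embeddings):
        """Return N imagined visual tokens of width visual_dim."""
        weights = self.attention_weights(text_embeddings)
        values = matmul(text_embeddings, self.w_v)          # (T x d_v)
        return matmul(weights, values)                      # (N x d_v)
```

In a real setup the returned tokens would be placed where the frozen VLM expects image features, and only the module's own weights would be trained; the training loss is not described in the material summarized here.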

If this is right

Accuracy rises on text-only benchmarks and unseen tasks when LIM supplies the predicted embeddings.
Calibration error drops in missing-image scenarios without requiring image generation.
The frozen backbone remains unchanged, preserving original capabilities when images are present.
The method scales to practical deployments where visual data is intermittently missing.
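
The calibration claims above depend on how calibration error is measured. The sketch below computes the expected calibration error (ECE) with equal-width confidence bins, a common choice; whether the paper uses exactly this estimator or this bin count is an assumption, and the function name and `num_bins` default are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that is over-confident, as the text-only setting is reported to be, has mean confidence well above accuracy in the high-confidence bins, which raises this value.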

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be adapted to other multimodal models that encounter modality dropout, such as audio-text or video-text systems.
Deployment pipelines might use LIM to maintain performance in edge cases like low-bandwidth or privacy-restricted settings.
Training LIM on broader paired data could improve robustness to out-of-distribution visual concepts not seen in the original VLM pretraining.

Load-bearing premise

Cross-attention predictions of latent visual embeddings from text will be accurate and compatible enough to substitute for real visual input inside the frozen VLM without creating new systematic errors.

What would settle it

A controlled test in which the LIM-augmented model shows lower accuracy or higher calibration error than the plain text-only baseline on a held-out task would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.12517 by Mingyeong Kim, Jungwon Choi, Chaeyun Jang, and Juho Lee (Kim Jaechul Graduate School of AI, KAIST).

**Figure 1.** Reliability diagrams on VQA-V1 (4,500 test samples) using LLaVA. Replacing the image with a text description induces severe over-confidence, while re-introducing an image-modality signal via text-to-image regeneration partially restores calibration. Bar intensity indicates the number of samples in each confidence bin. Re-generated images from descriptions. The degradation under Description+Question could…

**Figure 2.** Approach overview. (a) Standard VLM inference with image and text inputs. (b) Text-only inference with LIM, which predicts and injects latent embeddings. Additional model family. To verify that this missing-modality behavior is not unique to the Vicuna/LLaVA family, we also examine a Qwen-based VLM family under the same text-only prompting setup using our primary MSP confidence estimator. This sanity check…

**Figure 3.** Architecture of the Latent Imagination Module. To infer the missing visual information, we introduce the Latent Imagination Module (LIM). Unlike generative approaches that synthesize raw pixels (e.g., diffusion models), LIM predicts high-dimensional feature embeddings directly compatible with the VLM’s vision tower, which comprises an image encoder followed by a projection layer. This bypasses the compu…

**Figure 4.** Reliability diagrams for text-only QA.

**Figure 5.** (a) ScienceQA (image subset): robustness to missing images. We evaluate on 2,017 ScienceQA test instances with images and simulate missing modality by dropping images with probability p. For dropped instances, drop-and-text-only queries LLaVA with text only, while drop-and-LIM injects LIM-predicted latent embeddings in place of the missing image features. (b) Ablations on filling missing visual signal with ar…

**Figure 6.** Example of description generation. (Left) Input image. (Right) LLaVA-generated description produced by the prompt in Appendix B.1.

**Figure 7.** Reliability diagrams for Qwen-based models under text-only prompting. We compare Qwen2.5-7B and Qwen2.5-VL-7B-Instruct using MSP confidence. The Qwen-based VLM shows a similar over-confidence trend under missing vision modality as observed for LLaVA in Section 2.2. B.3 PROMPTING DETAILS FOR DIFFUSION-GENERATED “HELPFUL IMAGES” In Section 2.3, we generate an auxiliary image for each text-only QA instance an…
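
The captions refer to MSP confidence, that is, the maximum softmax probability. As a rough illustration, the sketch below takes one logit per candidate answer, applies a softmax, and reports the top option with its probability. Scoring at the level of answer options is an assumption here; the paper may compute MSP over tokens or in some other way, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_confidence(option_logits):
    """Return (index of the predicted option, its softmax probability).

    Assumption: option_logits holds one score per candidate answer,
    for example the logits of the tokens "A", "B", "C", "D".
    """
    probs = softmax(option_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

Collecting these confidences together with per-example correctness gives the inputs needed to draw a reliability diagram like the ones in the figures.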

Original abstract

Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical latent-level fix for text-only VLM drops in accuracy and calibration, but the experiments skip the key check on whether predicted embeddings match real ones.

The main point is that VLMs lose reliability on text alone even when the text describes the scene, and the authors offer a lightweight cross-attention module called LIM to predict the missing visual latent and plug it into a frozen backbone. This avoids generating pixels and targets a common deployment case where images are absent or optional. They report gains on text-only benchmarks, unseen tasks, and missing-image settings for both accuracy and calibration error.

That framing of the problem as more than simple missing semantics is useful, and keeping the backbone frozen plus skipping image synthesis makes the idea deployable without heavy compute. The module itself is a straightforward addition that does not redefine the model architecture.

The soft spot is the missing evidence on whether the predicted latents actually land in the right place. No cosine similarities, MMD scores, or layer-wise activation comparisons to real vision-encoder outputs are described, so it is unclear if the improvements come from faithful modality completion or from the extra parameters providing some other regularization effect. The abstract summarizes results without error bars, full baseline tables, or statistical tests, which leaves the central claim only lightly supported. The stress-test concern about distribution shift holds up on the given information; without those diagnostics, new miscalibration modes on out-of-distribution text remain possible.

This work is aimed at practitioners who need better text-only behavior from existing VLMs rather than full retraining. A reader working on calibration or modality robustness would get a concrete idea to try. It deserves peer review because the problem is real and the proposal is efficient, though any referee will want the latent-space diagnostics and expanded experimental details before the claims can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper identifies that vision-language models suffer large accuracy drops and severe miscalibration on text-only inputs, even when text preserves semantic content. It proposes the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent visual embeddings from text and injects them into a frozen VLM backbone without generating pixels. Experiments across text-only benchmarks, unseen tasks, and missing-image scenarios report that LIM improves accuracy and reduces calibration error, positioning latent modality completion as a practical alternative to image synthesis.

Significance. If the results hold under closer scrutiny, the work would be significant for reliable VLM deployment in modality-missing settings. It offers an efficient, non-generative approach to modality bridging that avoids the cost of pixel synthesis while targeting calibration, a key practical failure mode. The lightweight cross-attention design and evaluation on held-out tasks are strengths that could influence follow-on work on frozen-backbone adaptation.

major comments (2)

[Abstract] The central claim of accuracy and calibration gains is stated without quantitative values, baselines, number of runs, or statistical tests, so the magnitude and reliability of LIM's benefit are only weakly supported by the information provided.
[Experimental section, around the LIM evaluation] No latent-space diagnostics (cosine similarity, MMD, or per-layer activation statistics) are reported comparing LIM-predicted embeddings against real vision-encoder outputs on matched inputs. This is load-bearing for the claim that the gains arise from faithful modality completion rather than from parameter addition or regularization, because the skeptic's concern about out-of-distribution latents remains unaddressed.
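
The diagnostics requested in the second major comment can be sketched as follows: a mean cosine similarity between matched predicted and real embeddings, and a squared maximum mean discrepancy (MMD) between the two sets. The RBF kernel, the fixed `bandwidth`, the biased estimator, and all function names are assumptions chosen for brevity; they are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_paired_cosine(predicted, real):
    """Average cosine similarity over matched (predicted, real) embedding pairs."""
    sims = [cosine_similarity(p, r) for p, r in zip(predicted, real)]
    return sum(sims) / len(sims)


def rbf_kernel(u, v, bandwidth):
    """Gaussian kernel exp(-||u - v||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

High paired cosine similarity together with a small MMD would support the reading that the predicted latents lie close to real vision-encoder outputs. In practice the bandwidth is often set from the data, for instance with a median-distance heuristic, rather than fixed at 1.0.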

minor comments (2)

Add explicit description of the loss used to train the LIM cross-attention parameters and any regularization applied to keep predictions in-distribution.
Clarify whether the reported improvements are consistent across multiple random seeds or VLM backbones; include error bars or significance tests in the main results table.
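
One simple way to report the seed-level variability asked for above is a mean with sample standard deviation per method, plus a paired bootstrap interval on the per-seed difference. This is only a sketch under stated assumptions: the percentile bootstrap, the resample count, and the function names are illustrative, and any numbers passed in would be the reader's own measurements, not results from the paper.

```python
import random
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(values), statistics.stdev(values)


def paired_bootstrap_ci(baseline, treated, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of (treated - baseline).

    baseline and treated must be aligned: entry i of each comes from the
    same seed (or the same backbone), so the differences are paired.
    """
    if len(baseline) != len(treated):
        raise ValueError("baseline and treated must be the same length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]

    # Resample the paired differences with replacement and record each mean.
    means = []
    for _ in range(num_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.mean(sample))
    means.sort()

    lower = means[int((alpha / 2.0) * num_resamples)]
    upper = means[int((1.0 - alpha / 2.0) * num_resamples) - 1]
    return lower, upper
```

If the interval for a calibration-error difference excludes zero, the reported improvement is less likely to be a seed artifact. With only a handful of seeds the interval is coarse, so it complements the raw per-seed table and does not replace it.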

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on the abstract and the need for additional diagnostics. Below, we provide point-by-point responses and indicate the revisions we will make to the manuscript.

Referee: [Abstract] The central claim of accuracy and calibration gains is stated without quantitative values, baselines, number of runs, or statistical tests, so the magnitude and reliability of LIM's benefit are only weakly supported by the information provided.

Authors: We agree that the abstract lacks specific quantitative support. In the revised version, we will incorporate concrete numbers for accuracy and calibration improvements, specify the baselines, and include details on the number of runs and statistical tests to better substantiate the claims.
Revision: yes

Referee: [Experimental section, around the LIM evaluation] No latent-space diagnostics (cosine similarity, MMD, or per-layer activation statistics) are reported comparing LIM-predicted embeddings against real vision-encoder outputs on matched inputs. This is load-bearing for the claim that the gains arise from faithful modality completion rather than from parameter addition or regularization, because the skeptic's concern about out-of-distribution latents remains unaddressed.

Authors: We recognize the importance of validating the quality of the predicted latent embeddings. We will add latent-space diagnostics, including cosine similarity and MMD metrics, comparing the LIM outputs to real vision embeddings in the updated experimental section. This will provide evidence on whether the performance gains are due to effective modality completion.
Revision: yes

Circularity Check

0 steps flagged

No circularity: LIM is a new trainable module whose outputs are evaluated on external benchmarks

The paper introduces the Latent Imagination Module (LIM) as a lightweight cross-attention component trained to predict latent visual embeddings from text, which are then inserted into a frozen VLM backbone. Performance is measured via standard accuracy and calibration metrics on text-only benchmarks, unseen tasks, and missing-image scenarios. No derivation step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or uniqueness theorems from the same authors. The central result is an empirical demonstration that the added module yields measurable gains, which is self-contained against external data rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a cross-attention module can infer useful visual latents from text and that these latents integrate productively with a frozen backbone. The module itself introduces new trainable parameters whose values are fitted to data.

free parameters (1)

LIM cross-attention parameters
Trainable weights in the lightweight module that are fitted during training to map text to visual latents.

axioms (1)

domain assumption · Cross-attention can map textual features to visual latent space in a way that benefits the frozen VLM
Invoked in the design and claimed effectiveness of LIM.

invented entities (1)

Latent Imagination Module (LIM) · no independent evidence
purpose: predict imagined latent visual embeddings from text for input to the frozen VLM
Newly introduced module without independent external validation cited in the abstract.
