Detect Before You Leap: Mirage Detection in Vision-Language Models

Md. Shaown Miah; Sayeed Shafayet Chowdhury

arxiv: 2606.00435 · v1 · pith:TEAKTN7Mnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI


Pith reviewed 2026-06-28 22:23 UTC · model grok-4.3


keywords: mirage detection · vision-language models · visual question answering · layer-wise alignment · CLIP · abstention · visual grounding · ensemble detection


The pith

Tracking patch-to-question alignment across CLIP vision layers detects when VLMs lack evidence for an answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that mirage responses, in which VLMs give confident answers without supporting visual evidence, can be caught before generation by monitoring whether question-relevant visual information emerges across successive layers of the vision encoder. This matters because such ungrounded answers are especially risky in medical or document VQA, where they may be treated as image-based facts. The proposed TC-LIA method projects layer-wise patch tokens into the final CLIP embedding space, measures their similarity to the question embedding, and summarizes the resulting trajectory with four numeric features. These features are then combined with pixel statistics, zero-shot domain routing, and VLM self-assessment in an ensemble classifier. Across five domains, three input conditions, and twelve backbones, the ensemble reaches 94.6-94.7 percent three-class accuracy while keeping mirage rates below 3 percent.

Core claim

TC-LIA projects layer-wise image patch tokens from a CLIP ViT-H/14 encoder into the final embedding space and computes their similarity to the question embedding, producing an alignment trajectory whose summary statistics (final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope) serve as reliable indicators of whether question-relevant visual evidence is present.

What carries the argument

Text-Conditioned Layer-wise Internal Alignment (TC-LIA), which extracts an alignment trajectory by projecting successive-layer patch tokens into the final CLIP space and comparing them to the question embedding.
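The four trajectory features can be illustrated with a minimal NumPy sketch. It assumes the layer-wise patch tokens have already been projected into the shared CLIP space, and it is not the authors' code: the function name `trajectory_features`, the per-layer top-k mean, and the `early_frac`/`late_frac` split are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def trajectory_features(patch_emb, image_emb, text_emb, k=5,
                        early_frac=0.25, late_frac=0.25):
    """Summarize a layer-wise patch-to-question alignment trajectory.

    patch_emb: (L, P, D) patch tokens per layer, assumed already projected
               into the final CLIP embedding space.
    image_emb: (D,) final global image embedding.
    text_emb:  (D,) question embedding.
    """
    text = l2_normalize(text_emb)
    sims = l2_normalize(patch_emb) @ text          # (L, P) cosine per patch
    k = min(k, sims.shape[1])
    topk = np.sort(sims, axis=1)[:, -k:]           # k best-aligned patches per layer
    curve = topk.mean(axis=1)                      # (L,) alignment trajectory

    n_layers = curve.shape[0]
    n_early = max(1, int(round(early_frac * n_layers)))
    n_late = max(1, int(round(late_frac * n_layers)))
    early, late = curve[:n_early].mean(), curve[-n_late:].mean()
    # Least-squares slope of alignment against layer index (needs L >= 2).
    slope = np.polyfit(np.arange(n_layers), curve, deg=1)[0]

    return {
        "final_cos": float(l2_normalize(image_emb) @ text),
        "late_topk": float(late),
        "gain": float(late - early),
        "slope": float(slope),
        "curve": curve,
    }
```

In this sketch a related image-question pair would be expected to show a rising curve (positive gain and slope), while an unrelated pair would stay flat; the thresholds that separate the two are left to the downstream classifier.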

If this is right

The ensemble reaches 94.6-94.7 percent three-class detection accuracy across the tested settings.
Mirage rates fall below 3 percent while baseline rates range from 21.7 percent to 66.6 percent.
The same detector works across five VQA domains, three input conditions, and twelve VLM backbones.
The method is model-agnostic and operates before any answer is generated.
Pixel-statistic blank/noise detection and zero-shot domain routing further improve the ensemble.
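The pixel-statistic stage is not specified in this summary, so the following is only a sketch of how a blank/noise check of that kind might look. The statistics chosen (standard deviation, histogram entropy, mean neighbor difference) and the thresholds `std_min` and `diff_max` are hypothetical values for illustration, not numbers from the paper.

```python
import numpy as np

def pixel_stats(image):
    """Simple intensity statistics for an (H, W) or (H, W, C) image.

    Integer images are assumed to be 8-bit; float images are assumed
    to lie in [0, 1].
    """
    arr = np.asarray(image)
    img = arr.astype(np.float64)
    if np.issubdtype(arr.dtype, np.integer):
        img = img / 255.0
    if img.ndim == 3:
        img = img.mean(axis=2)                     # collapse channels to grayscale
    hist, _ = np.histogram(img, bins=32, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return {
        "std": float(img.std()),
        "entropy": float(-(p * np.log2(p)).sum()),
        # Mean absolute difference between horizontal neighbors:
        # near zero for flat images, large for pixel-level noise.
        "neighbor_diff": float(np.abs(np.diff(img, axis=1)).mean()),
    }

def looks_blank_or_noise(image, std_min=0.02, diff_max=0.25):
    """Flag images that are nearly constant or dominated by pixel noise."""
    s = pixel_stats(image)
    return bool(s["std"] < std_min or s["neighbor_diff"] > diff_max)
```

A check like this is cheap enough to run before any encoder forward pass, which is the usual reason to place it first in a staged detector.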

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Integration of the detector into production VLM pipelines could allow automatic abstention on evidence-poor queries.
The same layer-wise alignment idea might be tested on vision encoders other than CLIP ViT-H/14.
Extending the trajectory features to video or multi-image inputs could address related grounding failures.
Collecting human labels on the same trajectory features could calibrate the abstention threshold for specific risk tolerances.

Load-bearing premise

The alignment trajectory features extracted from CLIP ViT-H/14 layers supply a signal for mirage that remains reliable when the domain or VLM backbone changes.

What would settle it

A new VQA domain or additional VLM backbone on which the three-class accuracy drops below 85 percent or the mirage rate rises above 10 percent would falsify the claim of generalizability.
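The falsification test above can be written down as a small check. This sketch assumes a definition the summary does not state explicitly: the mirage rate is taken to be the fraction of unrelated-real or blank/noise inputs that the detector labels as related (and would therefore let the VLM answer). The class codes and function names are illustrative.

```python
import numpy as np

# Hypothetical integer codes for the three input conditions.
RELATED, UNRELATED, BLANK = 0, 1, 2

def three_class_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted condition matches the label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mirage_rate(y_true, y_pred):
    """Assumed definition: share of non-related inputs predicted RELATED."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ungrounded = y_true != RELATED
    if not ungrounded.any():
        return 0.0
    return float((y_pred[ungrounded] == RELATED).mean())

def generalization_holds(y_true, y_pred, min_acc=0.85, max_mirage=0.10):
    """True if neither falsification threshold from the text is crossed."""
    return (three_class_accuracy(y_true, y_pred) >= min_acc
            and mirage_rate(y_true, y_pred) <= max_mirage)
```

Running `generalization_holds` on predictions for a held-out domain or backbone gives a direct pass/fail reading of the criterion, under the stated definition of mirage rate.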

Figures

Figures reproduced from arXiv: 2606.00435 by Md. Shaown Miah, Sayeed Shafayet Chowdhury.

**Figure 2.** TC-LIA computes text-conditioned layer-wise patch alignment from CLIP ViT-H/14 features. Late …

**Figure 2.** The theoretical motivation in Section 5 formalizes why late-layer alignment, early-to-late gain, and …

**Figure 1.** a lightweight high-recall blank/noise stage reduces the first term, while …

**Figure 3.** TC-LIA alignment behavior across conditions. (a) Mean top-…

**Figure 4.** Mirage rate reduction: Base Prompt → TC-LIA Only → Ensemble across twelve VLMs.

| Score / feature | Interpretation | AUROC ↑ |
| --- | --- | --- |
| Final cosine (ViT-H/14) | global image–text match | 0.921 |
| Final cosine (adaptive) | routed CLIP/BioMedCLIP match | 0.931 |
| Late top-k mean | late local evidence | 0.822 |
| Gain | early-to-late growth | 0.876 |
| Slope | layer-wise trend | 0.882 |
| IAS | weighted composite | 0.938 |
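The AUROC values reported for the individual scores above measure how well a single score ranks related pairs above unrelated ones. A minimal sketch of that computation, using the pairwise (Mann-Whitney) form with ties counted as one half, is shown here; the function name and the toy inputs are illustrative and the quadratic pairwise comparison is only suitable for small samples.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a positive example outscores a negative one.

    scores: per-example detector scores (higher means "more related").
    labels: 1/True for related pairs, 0/False for unrelated pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

An AUROC of 0.5 corresponds to chance-level ranking, which is why the GradCAM figure of 0.543 quoted later in this page is read as a near failure.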

**Figure 5.** Qualitative result cards for related and unrelated-real inputs using the same medical question and …

**Figure 6.** 3-class accuracy vs. mirage rate across twelve complete VLM backbones. Upper-left corner is …

**Figure 7.** Answer quality of the VLM backbones (BLEU, ROUGE-L, BERTScore F1).

**Figure 8.** Structured deterministic prompt used for VLM evaluation.

**Figure 9.** Per-domain TC-LIA layerwise alignment curves for all three input conditions. Each panel shows …

**Figure 10.** Per-domain distributions of the TC-LIA Internal Alignment Score (IAS) across the three input …

**Figure 11.** Per-condition answer quality across twelve complete VLM backbones. Rows: …

**Figure 12.** XGBoost feature importance. The plot is diagnostic and should be interpreted as evidence that …

**Figure 13.** Left: 3-class accuracy on the held-out domain. Right: Mirage rate on the held-out domain. ChestVQA is the most challenging held-out domain (MR = 6.7%); InfoVQA is the easiest (MR = 1.4%, Acc = 96.3%). Similarly, for an unrelated pair, an error occurs when $a_L > \tau$. Thus, $P(\hat{y} \neq U \mid U) = P(a_L > \tau \mid U) = P(a_L - \mu_U > \tau - \mu_U \mid U)$. Since $\tau - \mu_U = \frac{\mu_R + \mu_U}{2} - \mu_U = \frac{\Delta}{2}$, we obtain $P(\hat{y} \neq U \mid U) = P(a_L - \mu_U > \Delta/\dots$
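The derivation fragment in the Figure 13 entry is consistent with a midpoint-threshold rule on the late alignment score $a_L$. The block below is a sketch of the symmetric related-pair case under assumptions read off that fragment: $\mu_R$ and $\mu_U$ are the class-conditional means of $a_L$, $\Delta = \mu_R - \mu_U > 0$, $\tau = (\mu_R + \mu_U)/2$, and a pair is predicted related when $a_L > \tau$. The final sub-Gaussian bound uses an extra assumption (a common variance proxy $\sigma^2$) that this page does not confirm.

```latex
\begin{align}
P(\hat{y} \neq R \mid R)
  &= P(a_L \le \tau \mid R)
   = P(a_L - \mu_R \le \tau - \mu_R \mid R), \\
\tau - \mu_R
  &= \frac{\mu_R + \mu_U}{2} - \mu_R = -\frac{\Delta}{2}, \\
P(\hat{y} \neq R \mid R)
  &= P\!\left(a_L - \mu_R \le -\tfrac{\Delta}{2} \,\middle|\, R\right)
   \le \exp\!\left(-\frac{\Delta^2}{8\sigma^2}\right)
   \quad \text{(if } a_L - \mu_R \text{ is } \sigma\text{-sub-Gaussian).}
\end{align}
```

Under these assumptions both error types shrink as the separation $\Delta$ between the two class means grows relative to $\sigma$.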

**Figure 14.** Left: 3-class accuracy on the held-out VLM. Right: Mirage rate on the held-out VLM. Two variants are shown: full features (with vlm_class, dark) and features excluding the VLM class encoding (no vlm_class, light). Accuracy is stable across all nine held-out VLMs; removing vlm_class_enc increases mirage rate most for smaller, less capable models. Substituting the decomposition of $a_\ell$ gives early $= c(x, q) + \dots$

**Figure 15.** Text-conditioned GradCAM maps across three input conditions (columns: related, unrelated-real, blank). For unrelated inputs, GradCAM fires on spurious visually salient regions that bear no relation to the question. For blank inputs, activations are near-uniform or randomly scattered. GradCAM cannot separate related from unrelated inputs (AUROC 0.543), motivating the move to patch–text cosine alignment in …

**Figure 16.** Baseline ROUGE-L by condition across representative datasets. The plot shows that answer …

**Figure 17.** Layer-wise answer-quality changes under attention knockout, grouped by condition. The effect …

**Figure 18.** Condition-wise RAPT shifts relative to the matched image-question setting. The perturbations …

**Figure 19.** RAPT_image cascade under single-layer knockout (real inputs). Each line shows the change in image-attention RAPT relative to baseline when layer k is knocked out (k ∈ {0, 8, 17, 25, 33}). The knocked-out layer shows a sharp local drop (diagonal dip), but all other observed layers return to near-zero deviation, confirming that the network compensates for the blocked layer by redistributing image attention d…

**Figure 20.** ∆RAPT_image heatmap under attention knockout (signed-log scale, real inputs). Rows are knocked-out layers; columns are RAPT measurement layers. The dark-blue diagonal marks the local suppression at the intervened layer. The red upper-triangle shows forward compensation: downstream layers increase their image-attention allocation to recover the blocked signal. Early knockouts (rows 0–10) produce the widest …

**Figure 21.** Per-layer RAPT curves for Gemma-3-4B-IT across four input conditions. Left: Image RAPT by layer. Right: Question RAPT by layer. The dotted vertical line marks the early/late split at layer 17. In the image RAPT panel, the unrelated image condition (red) tracks closely with the matched condition (blue) across all layers, confirming that the decoder allocates similar image attention regardless of whether th…

**Figure 22.** SAM3-style prompt-conditioned grounding sanity check. Related examples may not always …

**Figure 23.** Qualitative result cards for all three input conditions (Related, Unrelated-Real, Blank/Noise) on …

**Figure 24.** Qualitative result cards for all three input conditions (Related, Unrelated-Real, Blank/Noise) on …

**Figure 25.** Qualitative result cards for all three input conditions (Related, Unrelated-Real, Blank/Noise) on …

**Figure 26.** Qualitative result cards for all three input conditions (…

**Figure 27.** Qualitative result cards for all three input conditions (…

**Figure 28.** Qualitative result cards for all three input conditions (…

**Figure 29.** Qualitative result cards for all three input conditions (…

**Figure 30.** Qualitative result cards for all three input conditions (…

**Figure 31.** Confusion matrices: TC-LIA Only and best ensemble per VLM on the held-out test set.

**Figure 32.** Accuracy and mirage rate for all five ensemble classifiers across all VLM families.

**Figure 33.** 5-fold cross-validation accuracy for all classifiers across all VLM families.

**Figure 34.** Per-domain AUROC for binary RELATED vs. UNRELATED-REAL detection using final_cos (ViT-H/14 global cosine) alone versus the full Internal Alignment Score (IAS). IAS matches or exceeds final_cos across all domains. The largest gains appear in infovqa and docvqa, where a single global embedding is a weaker discriminator than the layer-wise patch-level alignment summary captured by IAS.

**Figure 35.** XGBoost feature importance across all VLM families.

**Figure 36.** Per-domain 3-class accuracy for all VLM families (best ensemble per model).

**Figure 37.** Comparison between VLM structured-prompt accuracy and full ensemble accuracy. The perfor…

Original abstract

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces TC-LIA to track question-relevant patch alignment across CLIP layers as a signal for mirage in VLMs, but the high reported accuracy rests on a fixed encoder whose transfer to other vision backbones is the weakest part of the claim.


The main point is that they extract four alignment-trajectory features from a CLIP ViT-H/14 by projecting intermediate patch tokens into the final space and measuring similarity to the question embedding. These are then ensembled with pixel statistics, zero-shot routing, and VLM self-assessment to decide whether the model should answer or abstain.

The approach is new in its explicit layer-wise tracking for this task and in the way it tries to catch the failure before generation. Running the probe on twelve different VLM backbones and five domains is a reasonable attempt at breadth.

The soft spot is exactly the one the stress-test flags. All the alignment features come from one fixed CLIP encoder. When the target VLM uses a different vision tower, there is no direct evidence that the same trajectory statistics remain predictive. The abstract gives 94.6-94.7% three-class accuracy and sub-3% mirage rates, yet supplies no information on baseline implementations, feature-selection procedure, or statistical tests. Without those, the numbers are hard to interpret.

The work is aimed at groups that need practical pre-generation filters for medical or document VQA. A reader already working on internal probing or abstention methods will find the concrete feature definitions useful. The problem is real and the method is distinct enough from prior mirage papers that it should go to referees rather than be desk-rejected.

Referee Report

3 major / 2 minor

Summary. The paper introduces Text-Conditioned Layer-wise Internal Alignment (TC-LIA), which extracts alignment trajectory features (final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope) by projecting intermediate patch tokens from a fixed CLIP ViT-H/14 encoder into the final embedding space and comparing them to the question embedding. These are ensembled with pixel statistics, zero-shot domain routing, and VLM self-assessment for pre-release three-class mirage detection over the three input conditions (related, unrelated-real, blank/noise). The central empirical claim is that the best systems reach 94.6-94.7% accuracy with mirage rates below 3% across five VQA domains, three input conditions, and twelve VLM backbones, versus baseline mirage rates of 21.7-66.6%.

Significance. If the generalizability of the TC-LIA trajectory features holds, the work would be significant for safe VLM deployment in medical and document VQA by enabling abstention before ungrounded responses are generated. The scale of the evaluation (multiple domains, conditions, and backbones) and the model-agnostic framing using internal states from a single fixed encoder are strengths that, if substantiated with full experimental controls, could influence practical mitigation strategies.

major comments (3)

[Experimental results (likely §4-5)] The headline result (94.6-94.7% three-class accuracy and <3% mirage rate) rests on the assumption that the four TC-LIA alignment features extracted from a fixed CLIP ViT-H/14 remain predictive for VLMs whose vision encoders differ from CLIP; no ablation or transfer analysis is provided to test this when the target backbone uses a different vision tower, which directly undermines the model-agnostic claim across twelve backbones.
[Abstract and Experimental Evaluation] The abstract and results sections report strong aggregate numbers but supply no details on experimental controls, baseline re-implementations, statistical significance tests, cross-validation procedure, or whether the ensemble features were selected post-hoc on the test domains; these omissions are load-bearing for verifying the reported improvement over the 21.7-66.6% baseline mirage rates.
[Results tables] Table reporting per-domain and per-backbone accuracies (presumably Table 2 or 3): the absence of per-condition breakdowns, confidence intervals, or failure-case analysis for the three input conditions leaves open whether the <3% mirage rate holds uniformly or is driven by easier subsets.
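The request for confidence intervals on the mirage rate can be made concrete with a Wilson score interval, which behaves better than the normal approximation when the rate is close to zero. This is a generic statistical sketch, not something the paper reports; the counts used in the usage note are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of mirage cases observed.
    n:         number of ungrounded inputs evaluated.
    z:         normal quantile (1.96 for a two-sided 95% interval).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For example, `wilson_interval(3, 100)` bounds a 3 percent observed rate measured on 100 inputs; with so few inputs the upper bound sits well above 3 percent, which illustrates why the referee asks for intervals per condition rather than a single aggregate figure.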

minor comments (2)

[Method (§3)] The description of how layer-wise patch tokens are projected into the final embedding space could be clarified with an equation or pseudocode in the method section.
[Introduction] Citation to Asadi et al. 2026 for the mirage definition should be checked for consistency with the 2026 date.
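One plausible form of the equation requested in the first minor comment is sketched below. It assumes a standard CLIP ViT in which patch token $h_{\ell,i}$ at layer $\ell$ is passed through the encoder's final layer norm and visual projection matrix $W_{\mathrm{proj}}$ before comparison with the question embedding $t_q$; the symbols, the top-k averaging, and the early/late grouping are assumptions, not notation taken from the paper.

```latex
\begin{align}
z_{\ell,i} &= W_{\mathrm{proj}}^{\top}\,\mathrm{LN}_{\mathrm{post}}(h_{\ell,i}), \\
a_{\ell,i} &= \frac{z_{\ell,i}^{\top} t_q}{\lVert z_{\ell,i}\rVert \,\lVert t_q\rVert}, \\
\bar{a}_{\ell} &= \frac{1}{k} \sum_{i \in \mathrm{TopK}_{\ell}} a_{\ell,i}, \\
\mathrm{gain} &= \bar{a}_{\mathrm{late}} - \bar{a}_{\mathrm{early}}, \qquad
\mathrm{slope} = \frac{\sum_{\ell} (\ell - \bar{\ell})(\bar{a}_{\ell} - \bar{a})}{\sum_{\ell} (\ell - \bar{\ell})^{2}}.
\end{align}
```

Here $\mathrm{TopK}_{\ell}$ denotes the $k$ patches with the highest alignment at layer $\ell$, and $\bar{\ell}$ and $\bar{a}$ are the means of the layer indices and of the per-layer scores.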

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for clarifying the experimental design and strengthening the presentation of results. We address each major comment point-by-point below and outline the revisions we will make.


Referee: [Experimental results (likely §4-5)] The headline result (94.6-94.7% three-class accuracy and <3% mirage rate) rests on the assumption that the four TC-LIA alignment features extracted from a fixed CLIP ViT-H/14 remain predictive for VLMs whose vision encoders differ from CLIP; no ablation or transfer analysis is provided to test this when the target backbone uses a different vision tower, which directly undermines the model-agnostic claim across twelve backbones.

Authors: TC-LIA is designed to be model-agnostic precisely because it relies on a fixed external CLIP ViT-H/14 encoder rather than any internal states of the target VLM. The evaluation already spans twelve backbones whose vision encoders vary (including non-CLIP towers), and the reported performance holds across them. That said, an explicit per-backbone vision-tower breakdown and a dedicated transfer analysis would make the generalizability claim more transparent. We will add both to the revised manuscript. revision: yes
Referee: [Abstract and Experimental Evaluation] The abstract and results sections report strong aggregate numbers but supply no details on experimental controls, baseline re-implementations, statistical significance tests, cross-validation procedure, or whether the ensemble features were selected post-hoc on the test domains; these omissions are load-bearing for verifying the reported improvement over the 21.7-66.6% baseline mirage rates.

Authors: We agree that fuller documentation of the experimental protocol is necessary. The current manuscript contains the core numbers but omits several procedural details. In the revision we will expand the experimental setup and evaluation sections (and add an appendix) to describe baseline re-implementations, the cross-validation scheme, and statistical tests, and to confirm that feature selection and hyper-parameters were determined on held-out validation splits, not test domains. revision: yes
Referee: [Results tables] Table reporting per-domain and per-backbone accuracies (presumably Table 2 or 3): the absence of per-condition breakdowns, confidence intervals, or failure-case analysis for the three input conditions leaves open whether the <3% mirage rate holds uniformly or is driven by easier subsets.

Authors: We will revise the main results tables to include per-input-condition breakdowns, report confidence intervals, and add a short failure-case analysis section that examines whether the low mirage rate is uniform across the three conditions or concentrated in particular subsets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; TC-LIA features are extracted from fixed CLIP internals without reduction to fitted mirage labels.


The paper defines TC-LIA by projecting CLIP ViT-H/14 patch tokens into the final embedding space and computing cosine similarities, gains, and slopes against the question embedding. These are combined with independent pixel statistics and VLM self-assessment. No equation or step shows a fitted parameter on mirage labels being renamed as a prediction, nor any self-citation chain that bears the central claim. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit modeling assumptions or new entities are named.


Reference graph

Works this paper leans on

13 extracted references · 9 canonical work pages · 8 internal anchors

[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
[2] Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026.
[3] Jinze Bai, Shu Xie, Yawen Li, Zhibo Chen, Yunshan Zhang, Jun Wang, Yike Su, and Xiaohui Shen. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
[5] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
[6] URL https://github.com/mlfoundations/open_clip. Saurav Kadavath et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
[7] Yifan Li, Yifan Du, Kun Kuang, Wayne Xin Zhao, Hong Xie, Dawei Yin, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
[8] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jing Yang, Chao Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
[9] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
[10] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hao Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, and Hoifung Poon. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915.
[11] Appendix A (Reproducibility Details), A.1 Dataset Composition. Table 4 provides the detailed dataset composition used in our mirage detection experiments. We evaluate across five visually diverse VQA domains: Chest VQA, PathVQA, TextVQA, DocVQA, and InfoVQA. For each domain, examples are organized into three input conditions: RELATED, where the image is se...
[12] seeing but not believing. The result shows that the overall mirage risk can be reduced by separately controlling blank/noise failures through $g_B$ and semantic mismatch failures through $g_N$, which matches the staged design of the proposed detector. E Negative and Developmental Experiments. In this section, we describe orthogonal approaches to TC-LIA that were attempted before TC-LIA. ...
[13] Seeing but not believing. In the image RAPT panel, the unrelated image condition (red) tracks closely with the matched condition (blue) across all layers, confirming that the decoder allocates similar image attention regardless of whether the image is semantically relevant. The blank image condition (orange) shows suppressed but non-zero image RAPT, while no question (green) produces spur...
