hub

Mirage the illusion of visual understanding

URLhttps://arxiv · 2026 · arXiv 2603.21687

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0 · 2 refs

Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.

CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

Medical Context Distorts Decisions in Clinical Vision Language Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.

Do multimodal models imagine electric sheep?

cs.CV · 2026-05-10 · conditional · novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

Towards Conversational Medical AI with Eyes, Ears and a Voice

cs.AI · 2026-05-10 · conditional · novelty 6.0

AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials but lagging in physical exam and disease-specific tasks.

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

CIR benchmarks contain many unimodal shortcuts and noisy queries, leading to overestimation of models' multimodal composition capabilities.

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

cs.CV · 2026-05-10 · unverdicted · novelty 5.0

LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

citing papers explorer

Showing 13 of 13 citing papers.

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding cs.CV · 2026-05-19 · accept · none · ref 12
NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation cs.SE · 2026-04-30 · unverdicted · none · ref 14
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation cs.CV · 2026-05-16 · unverdicted · none · ref 53 · 2 links
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation cs.CV · 2026-04-29 · unverdicted · none · ref 5
CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models cs.LG · 2026-04-03 · unverdicted · none · ref 1
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
Medical Context Distorts Decisions in Clinical Vision Language Models cs.CV · 2026-05-17 · unverdicted · none · ref 10
Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.
Do multimodal models imagine electric sheep? cs.CV · 2026-05-10 · conditional · none · ref 25
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
Towards Conversational Medical AI with Eyes, Ears and a Voice cs.AI · 2026-05-10 · conditional · none · ref 1
AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials but lagging in physical exam and disease-specific tasks.
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence cs.CV · 2026-05-08 · unverdicted · none · ref 2 · 2 links
MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time cs.LG · 2026-05-03 · unverdicted · none · ref 2
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 1
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
Do Composed Image Retrieval Benchmarks Require Multimodal Composition? cs.CV · 2026-05-14 · unverdicted · none · ref 7
CIR benchmarks contain many unimodal shortcuts and noisy queries, leading to overestimation of models' multimodal composition capabilities.
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering cs.CV · 2026-05-10 · unverdicted · none · ref 50
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

Mirage the illusion of visual understanding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer