Mirage: The Illusion of Visual Understanding
13 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (13)
roles: background (4)
polarities: background (4)
citing papers explorer
- NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
  NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
- From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
  MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
- UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
  Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
- CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
  CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.
- Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
  RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
- Medical Context Distorts Decisions in Clinical Vision Language Models
  Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.
- Do multimodal models imagine electric sheep?
  Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
- Towards Conversational Medical AI with Eyes, Ears and a Voice
  AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials but lagging in physical exam and disease-specific tasks.
- MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
  MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.
- Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
  LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
- The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
  Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
- Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
  CIR benchmarks contain many unimodal shortcuts and noisy queries, leading to overestimation of models' multimodal composition capabilities.
- LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
  LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.