MAIRA-2: Grounded Radiology Report Generation
Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of large language models (LLMs) to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.
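The abstract describes RadFact as scoring report correctness and completeness sentence by sentence using LLM-based logical inference. A minimal sketch of that two-directional idea follows. It is not the authors' implementation: the function names, the score names, and the `toy_entails` stand-in (normalised exact match instead of an LLM call) are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

# Signature of an entailment check: (premise sentences, hypothesis) -> bool.
EntailFn = Callable[[List[str], str], bool]


def _normalise(sentence: str) -> str:
    """Lowercase, trim surrounding spaces/full stops, collapse whitespace."""
    return " ".join(sentence.lower().strip(" .").split())


def toy_entails(premises: List[str], hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM entailment call.

    Treats a hypothesis as entailed only if it matches a premise exactly
    after normalisation. A real system would ask an LLM instead.
    """
    return _normalise(hypothesis) in {_normalise(p) for p in premises}


def fraction_entailed(hypotheses: List[str], premises: List[str],
                      entails: EntailFn) -> float:
    """Fraction of hypothesis sentences entailed by the premise report."""
    if not hypotheses:
        return 0.0  # assumption: an empty report scores zero
    return sum(entails(premises, h) for h in hypotheses) / len(hypotheses)


def sentence_level_scores(generated: List[str], reference: List[str],
                          entails: EntailFn = toy_entails) -> Dict[str, float]:
    """Score a generated report against a reference in both directions."""
    return {
        # How many generated sentences are supported by the reference.
        "correctness": fraction_entailed(generated, reference, entails),
        # How many reference sentences are covered by the generated report.
        "completeness": fraction_entailed(reference, generated, entails),
    }


generated = [
    "No pleural effusion.",
    "The heart size is normal.",
    "Right lower lobe consolidation.",
]
reference = ["The heart size is normal.", "No pleural effusion."]
scores = sentence_level_scores(generated, reference)
print(scores)
```

In this toy case two of the three generated sentences are supported by the reference, and both reference sentences are covered. Swapping `toy_entails` for a real inference model changes only the `entails` argument.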
Forward citations
Cited by 25 Pith papers
-
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
-
SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation
SHOVIR is a benchmark extending MIMIC-CXR and PadChest-GR with per-box labels and occlusion tests to isolate direct and contextual vision shortcuts in VLMs for radiology report generation.
-
Transition-Aware best-of-N sampling for Longitudinal Chest X-ray Reports
Transition-aware best-of-N sampling embeds report sentences as sets, computes directional transition vectors via set-to-set distances, and scores candidates by proximity to ground-truth training transitions.
-
CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays
CheXpercept is a sequential multi-level perception benchmark showing VLMs perform adequately only on coarse lesion detection in chest X-rays while degrading sharply on finer tasks, with medical VLMs offering no advant...
-
A Vision-language Framework for Comparative Reasoning in Radiology
Introduces MedReCo-DB dataset of 690k+ images and entity-aware models MedReCo/MedReCo-VLM that improve reference retrieval and comparative change interpretation in radiology across multiple centers and modalities.
-
Discrete Diffusion Language Models for Interactive Radiology Report Drafting
A diffusion language model matches autoregressive (AR) performance on medical VQA, runs 3.5-4.4x faster, and enables bidirectional infilling for interactive radiology report drafting.
-
MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias
MLLMs show late-layer textual override of correct visual predictions, with a directional signature enabling a simple inference-time recovery method that improves conflict benchmarks by up to 9.4%.
-
Astra: a generalizable report generation foundation model for 3D computed tomography
Astra is a 3D CT vision-language foundation model trained on 90,678 thoracoabdominal scans that claims 44.1% better diagnostic metrics on internal and six external cohorts plus 29.6% faster chest reporting in real workflows.
-
CCS: Clinical Consensus Selection for Radiology Report Generation
CCS selects the best radiology report from multiple MLLM candidates by measuring clinical consensus with combined text and multimodal embedding utilities, yielding gains over single-path and Best-of-N baselines on cli...
-
Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
CoNNS uses an LLM-built concept ontology and cross-patient relabeling to filter noisy negatives, improving zero-shot classification and grounding of chest X-ray findings over prior methods.
-
Medical Context Distorts Decisions in Clinical Vision Language Models
Clinical VLMs over-rely on the text modality, irrelevant clinical history, and prompt wording when making chest X-ray decisions on MIMIC-CXR data.
-
Spectral Vision Transformer for Efficient Tokenization with Limited Data
A spectral vision transformer achieves comparable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
-
Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.
-
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
-
RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction
RA-RRG extracts key phrases with LLMs, retrieves them via multimodal similarity, and conditions report generation on them to achieve SOTA CheXbert scores and competitive RadGraph F1 on MIMIC-CXR and IU X-ray while sup...
-
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
Lightweight metrics trained on Qwen3-8B and MedGemma-4B using synthetic pairs outperform larger medical LLMs at distinguishing clinical significance in radiology reports while balancing discrimination and robustness.
-
Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification
CoEV is a plug-and-play bidirectional verification method that maps text statements to visual evidence regions, assigns them to a four-quadrant factuality-grounding map, and uses this to detect and correct hallucinati...
-
PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining
PMC-InterCPT builds a context-grounded biomedical interleaved corpus from PMC literature and shows it improves multimodal performance on Qwen3.5-4B-Base after CPT and SFT while using fewer tokens.
-
RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection
RadGenome-Anatomy is a large-scale chest radiograph dataset with anatomy labels obtained by projecting 3D CT masks into 2D radiographic space for 210 structures in 25,692 studies.
-
MedMIX: Modality-Internal Expert Fusion for Multimodal Medical Diagnosis
MedMIX combines intra-modality expert fusion, learned inter-modality fusion, and training-only large-small collaboration to deliver robust multimodal medical prediction under incomplete modalities across three benchmarks.
-
Beyond Masks: The Case for Medical Image Parsing
Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationship...
-
LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
LoFi adds location-aware captioning loss to jointly optimize fine-grained representations, yielding better retrieval and grounding on MIMIC-CXR and PadChest-GR.
-
RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray inter...
-
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
M4CXR is a multi-modal large language model that performs multiple tasks in chest X-ray analysis including report generation with claimed SOTA clinical accuracy using chain-of-thought prompting.
-
A unified multi-task framework enables interpretable chest radiograph analysis
A unified transformer performs four clinical tasks on chest X-rays and generates reports that radiologists rated comparable to human-written ones in 66% of cases.