MAIRA-2: Grounded Radiology Report Generation

Anja Thieme; Anton Schwaighofer; Daniel C. Castro; Fabian Falck; Felix Meissen; Fernando Pérez-García; Harshita Sharma; Javier Alvarez-Valle; Julia Gong; Kenza Bouzid

arxiv: 2406.04449 · v2 · pith:LI2STNZRnew · submitted 2024-06-06 · 💻 cs.CL · cs.CV

Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland

keywords: generation, report, grounded, task, maira-2, models, reporting, image

Abstract

Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of large language models (LLMs) to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.
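
The abstract does not spell out how RadFact computes its scores. As a rough illustration only, the idea of measuring correctness and completeness at the level of individual sentences can be sketched as a two-directional entailment check. Everything named here (`sentence_level_scores`, `EntailmentJudge`, `exact_match_judge`) is hypothetical and not taken from the paper; the toy judge stands in for the LLM-based inference step, and the grounding (localisation) side of the evaluation is not covered.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (hypothesis sentence, premise sentences) -> bool.
# In a real setting this role would be played by an LLM; here it is pluggable.
EntailmentJudge = Callable[[str, List[str]], bool]


def sentence_level_scores(
    generated: List[str],
    reference: List[str],
    is_entailed: EntailmentJudge,
) -> Tuple[float, float]:
    """Return (correctness, completeness) as fractions of entailed sentences."""
    # Correctness: share of generated sentences supported by the reference report.
    supported = sum(is_entailed(s, reference) for s in generated)
    # Completeness: share of reference sentences supported by the generated report.
    covered = sum(is_entailed(s, generated) for s in reference)
    correctness = supported / len(generated) if generated else 0.0
    completeness = covered / len(reference) if reference else 0.0
    return correctness, completeness


def exact_match_judge(hypothesis: str, premises: List[str]) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive exact match only."""
    normalised = {p.strip().lower() for p in premises}
    return hypothesis.strip().lower() in normalised
```

With this toy judge, a generated report whose two sentences both appear in a three-sentence reference would score full correctness but only two-thirds completeness, which is the asymmetry a sentence-level metric is meant to expose.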

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV 2026-05 accept novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation
cs.CV 2026-06 unverdicted novelty 7.0

SHOVIR is a benchmark extending MIMIC-CXR and PadChest-GR with per-box labels and occlusion tests to isolate direct and contextual vision shortcuts in VLMs for radiology report generation.
Transition-Aware best-of-N sampling for Longitudinal Chest X-ray Reports
cs.CV 2026-06 unverdicted novelty 7.0

Transition-aware best-of-N sampling embeds report sentences as sets, computes directional transition vectors via set-to-set distances, and scores candidates by proximity to ground-truth training transitions.
CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays
cs.CV 2026-06 unverdicted novelty 7.0

CheXpercept is a sequential multi-level perception benchmark showing VLMs perform adequately only on coarse lesion detection in chest X-rays while degrading sharply on finer tasks, with medical VLMs offering no advant...
A Vision-language Framework for Comparative Reasoning in Radiology
cs.CV 2026-06 unverdicted novelty 7.0

Introduces MedReCo-DB dataset of 690k+ images and entity-aware models MedReCo/MedReCo-VLM that improve reference retrieval and comparative change interpretation in radiology across multiple centers and modalities.
Discrete Diffusion Language Models for Interactive Radiology Report Drafting
cs.AI 2026-07 unverdicted novelty 6.0

Diffusion LM matches AR performance on medical VQA, runs 3.5-4.4x faster, and enables bidirectional infilling for interactive radiology report drafting.
MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias
cs.CV 2026-06 unverdicted novelty 6.0

MLLMs show late-layer textual override of correct visual predictions, with a directional signature enabling a simple inference-time recovery method that improves conflict benchmarks by up to 9.4%.
Astra: a generalizable report generation foundation model for 3D computed tomography
cs.CV 2026-05 unverdicted novelty 6.0

Astra is a 3D CT vision-language foundation model trained on 90,678 thoracoabdominal scans that claims 44.1% better diagnostic metrics on internal and six external cohorts plus 29.6% faster chest reporting in real workflows.
CCS: Clinical Consensus Selection for Radiology Report Generation
cs.CL 2026-05 unverdicted novelty 6.0

CCS selects the best radiology report from multiple MLLM candidates by measuring clinical consensus with combined text and multimodal embedding utilities, yielding gains over single-path and Best-of-N baselines on cli...
Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
cs.CV 2026-05 unverdicted novelty 6.0

CoNNS uses an LLM-built concept ontology and cross-patient relabeling to filter noisy negatives, improving zero-shot classification and grounding of chest X-ray findings over prior methods.
Medical Context Distorts Decisions in Clinical Vision Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.
Spectral Vision Transformer for Efficient Tokenization with Limited Data
cs.CV 2026-05 unverdicted novelty 6.0

A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
cs.CV 2026-04 unverdicted novelty 6.0

DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
cs.CV 2026-02 unverdicted novelty 6.0

MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction
cs.CV 2025-04 unverdicted novelty 6.0

RA-RRG extracts key phrases with LLMs, retrieves them via multimodal similarity, and conditions report generation on them to achieve SOTA CheXbert scores and competitive RadGraph F1 on MIMIC-CXR and IU X-ray while sup...
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
cs.CL 2026-06 unverdicted novelty 5.0

Lightweight metrics trained on Qwen3-8B and MedGemma-4B using synthetic pairs outperform larger medical LLMs at distinguishing clinical significance in radiology reports while balancing discrimination and robustness.
Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification
cs.CV 2026-06 unverdicted novelty 5.0

CoEV is a plug-and-play bidirectional verification method that maps text statements to visual evidence regions, assigns them to a four-quadrant factuality-grounding map, and uses this to detect and correct hallucinati...
PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining
cs.CL 2026-05 unverdicted novelty 5.0

PMC-InterCPT builds a context-grounded biomedical interleaved corpus from PMC literature and shows it improves multimodal performance on Qwen3.5-4B-Base after CPT and SFT while using fewer tokens.
RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection
cs.CV 2026-05 unverdicted novelty 5.0

RadGenome-Anatomy is a large-scale chest radiograph dataset with anatomy labels obtained by projecting 3D CT masks into 2D radiographic space for 210 structures in 25,692 studies.
MedMIX: Modality-Internal Expert Fusion for Multimodal Medical Diagnosis
cs.LG 2026-05 unverdicted novelty 5.0

MedMIX combines intra-modality expert fusion, learned inter-modality fusion, and training-only large-small collaboration to deliver robust multimodal medical prediction under incomplete modalities across three benchmarks.
Beyond Masks: The Case for Medical Image Parsing
cs.CV 2026-05 unverdicted novelty 5.0

Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationship...
LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
cs.CV 2026-03 unverdicted novelty 5.0

LoFi adds location-aware captioning loss to jointly optimize fine-grained representations, yielding better retrieval and grounding on MIMIC-CXR and PadChest-GR.
RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
cs.MA 2025-09 unverdicted novelty 5.0

RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray inter...
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
cs.CV 2024-08 unverdicted novelty 5.0

M4CXR is a multi-modal large language model that performs multiple tasks in chest X-ray analysis including report generation with claimed SOTA clinical accuracy using chain-of-thought prompting.
A unified multi-task framework enables interpretable chest radiograph analysis
cs.CV 2026-06 unverdicted novelty 4.0

A unified transformer performs four clinical tasks on chest X-rays and generates reports rated comparable to human ones in 66% of cases by radiologists.