arxiv: 1901.07042 · v5 · submitted 2019-01-21 · 💻 cs.CV · cs.LG · eess.IV

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, Steven Horng

Pith reviewed 2026-05-17 04:11 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV

keywords: chest x-ray, medical dataset, labeled radiographs, computer vision, NLP labels, public database, radiology reports

The pith

A large dataset of 377,110 labeled chest x-rays is now publicly available for medical computer vision research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the creation and release of MIMIC-CXR-JPG, a processed version of the MIMIC-CXR database with 377,110 chest x-ray images from 227,827 studies. Each image comes with 14 labels obtained by applying natural language processing tools to the free-text radiology reports. This addresses the shortage of large labeled datasets needed to train high-performance computer vision algorithms for interpreting chest radiographs. By providing de-identified images and standardized labels, the work allows researchers to focus on algorithm development rather than data acquisition and privacy compliance.

Core claim

MIMIC-CXR-JPG v2.0.0 is a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. The dataset is derived entirely from the MIMIC-CXR database and provides a convenient processed version along with a standard reference for data splits and image labels.

What carries the argument

The MIMIC-CXR-JPG dataset, a collection of de-identified chest radiograph images paired with 14 labels extracted automatically from radiology reports using two NLP tools.

If this is right

Automated analysis of chest radiographs can be advanced by training models on this extensive collection of real clinical images.
The standardized labels and data splits enable consistent benchmarking across different research efforts.
Wider access to such data encourages diverse applications in medical imaging without individual researchers needing to source their own datasets.
Privacy-protected release supports ethical research practices in healthcare AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained on these labels might be tested for performance on detecting specific conditions like atelectasis or pleural effusion.
Combining this dataset with other public x-ray collections could allow for larger scale training and better generalization.
Improvements in NLP tools could be evaluated by their agreement with these existing labels on the same reports.

Load-bearing premise

The 14 labels from the two NLP tools accurately capture the clinical content of the radiology reports and match verifiable findings in the images.

What would settle it

Independent radiologists reviewing a random sample of the reports and images to check if the assigned labels correctly identify the described findings.
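
A check of this kind can be sketched in a few lines. The example below is a minimal illustration, not a procedure taken from the paper: it draws a reproducible random sample of study identifiers for review and computes Cohen's kappa between the NLP-derived labels and the radiologist reads for one finding. The function names (`sample_for_review`, `cohen_kappa`) and the toy label vectors are hypothetical.

```python
import random
from collections import Counter


def sample_for_review(study_ids, k, seed=0):
    """Draw a reproducible random sample of study IDs for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(study_ids), k))


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data (made up): 1 = finding present, 0 = finding absent.
nlp_labels = [1, 1, 0, 0, 1, 0, 1, 0]
radiologist_labels = [1, 0, 0, 0, 1, 0, 1, 1]

print(round(cohen_kappa(nlp_labels, radiologist_labels), 3))  # prints 0.5
```

Kappa is used here rather than raw agreement because findings in chest radiographs are often rare, and two raters who both answer "absent" most of the time would agree frequently by chance alone.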

Original abstract

Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 - 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. MIMIC-CXR-JPG is derived entirely from the MIMIC-CXR database, and aims to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward data release that packages existing MIMIC-CXR data into a convenient JPG format with 14 NLP-derived labels and fixed splits.

This paper is mainly a data release for MIMIC-CXR-JPG v2.0.0. It converts the original chest radiographs into de-identified JPG files, attaches 14 labels generated by two established NLP tools on the reports, and supplies recommended data splits to reduce leakage between studies. The numbers are large: roughly 377,000 images from roughly 228,000 studies collected at Beth Israel Deaconess between 2011 and 2016.

The central value is convenience and standardization rather than any new clinical finding or model. Researchers can now grab the data and start training without first wrestling with DICOM headers or building their own labeling pipeline. The sourcing, de-identification steps, and processing choices are described clearly enough that others can understand exactly what they are getting. The work sits on top of the prior MIMIC-CXR release, so the incremental contribution is the packaged version with a fixed label set and splits. That packaging still matters for reproducibility across different groups.

One soft spot is that the paper does not run fresh validation of the NLP labels against the images or against radiologist reads on this corpus. It simply reports what the tools produced. For a data-release manuscript this is acceptable, because the authors do not claim the labels are perfect ground truth. They present them as derived labels that come with the dataset. The citation pattern is appropriate and points back to the source MIMIC-CXR work and the NLP tools. There is no modeling, no fitted parameters, and no circular claims, so the paper stays internally consistent.

This is aimed at the medical computer vision community that needs large labeled chest X-ray collections for benchmarking or pre-training. Groups working on automated radiograph interpretation will get immediate practical value from the standardized format and splits. It deserves peer review because the resource addresses a real bottleneck and the documentation is solid enough to let others use it responsibly.
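
To make the leakage concern concrete, here is a minimal sketch of one common way to build leakage-resistant splits: assign each patient, rather than each image or study, to a split, so that every study from the same patient lands on the same side. This is an illustration under stated assumptions, not the procedure used to produce the dataset's recommended splits. The field names (`patient_id`, `study_id`), the split names, and the fractions are all hypothetical.

```python
import hashlib


def assign_split(patient_id, val_frac=0.1, test_frac=0.1):
    """Map a patient ID to a split deterministically via a hash.

    Hashing the patient ID (not the study or image ID) keeps all of a
    patient's studies in one split. Fractions are illustrative defaults.
    """
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validate"
    return "train"


# Toy records (made up): two studies share patient p001.
records = [
    {"patient_id": "p001", "study_id": "s1"},
    {"patient_id": "p001", "study_id": "s2"},
    {"patient_id": "p002", "study_id": "s3"},
    {"patient_id": "p003", "study_id": "s4"},
]

splits = {r["study_id"]: assign_split(r["patient_id"]) for r in records}
for study_id, split in splits.items():
    print(study_id, split)
```

In practice a shared, published split file is preferable to regenerating splits locally, since it is the fixed assignment that makes results comparable across groups.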

Referee Report

0 major / 2 minor

Summary. The manuscript describes the public release of MIMIC-CXR-JPG v2.0.0, a processed dataset of 377,110 de-identified chest radiographs associated with 227,827 studies from the Beth Israel Deaconess Medical Center (2011-2016). Images are supplied with 14 labels obtained by applying two documented NLP tools to the corresponding free-text radiology reports. The work positions the release as a convenient, standardized version of the source MIMIC-CXR database that includes fixed data splits and serves as a reference resource for medical computer vision research.

Significance. If released as described, the dataset supplies a large-scale, publicly accessible collection of labeled chest radiographs that directly addresses the data scarcity noted in the abstract. By providing de-identified images together with pre-computed labels and recommended splits, the release lowers barriers to entry for algorithm development and supports reproducible benchmarking. The explicit sourcing, de-identification, and processing pipeline documentation adds practical value for downstream users.

minor comments (2)

[Abstract] The abstract and introduction would benefit from naming the two specific NLP tools (e.g., their citations or versions) rather than referring to them generically, so readers can immediately locate the label-generation methodology.
[Dataset Description] A short table or paragraph summarizing the distribution of the 14 labels across the full dataset would help users assess class imbalance before downloading the data.
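
Until such a table is available, users could tabulate the label distribution themselves. The sketch below is a minimal example on toy data. It assumes a CSV with one row per study and one column per label, and it assumes the value encoding 1.0 = positive, 0.0 = negative, -1.0 = uncertain, blank = not mentioned; neither the file layout nor the encoding is stated on this page, so both should be checked against the dataset documentation. The column names and the function `label_distribution` are hypothetical.

```python
import csv
import io
from collections import Counter

# Toy stand-in for a label file (made up); real column names may differ.
TOY_CSV = """study_id,Atelectasis,Pleural Effusion
s1,1.0,
s2,0.0,1.0
s3,-1.0,1.0
s4,,0.0
"""

# Assumed value encoding; verify against the dataset documentation.
VALUE_NAMES = {
    "1.0": "positive",
    "0.0": "negative",
    "-1.0": "uncertain",
    "": "not mentioned",
}


def label_distribution(csv_file, id_columns=("study_id",)):
    """Count label values per column, skipping identifier columns."""
    counts = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column is None or column in id_columns:
                continue
            name = VALUE_NAMES.get((value or "").strip(), "other")
            counts.setdefault(column, Counter())[name] += 1
    return counts


dist = label_distribution(io.StringIO(TOY_CSV))
for label, counter in dist.items():
    total = sum(counter.values())
    shares = {name: f"{n / total:.0%}" for name, n in counter.items()}
    print(label, shares)
```

Counting "uncertain" and "not mentioned" separately matters for class-imbalance estimates, because treating either as negative changes the apparent prevalence of a finding.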

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition that MIMIC-CXR-JPG provides a convenient, standardized resource with de-identified images, NLP-derived labels, and fixed splits to support reproducible research in medical computer vision.

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a data-release paper whose contribution consists of describing the public distribution of a processed version of the existing MIMIC-CXR collection together with 14 labels obtained by applying two documented NLP tools. No equations, predictions, fitted parameters, or derivations are present; the text simply reports dataset statistics, sourcing, de-identification steps, and label-generation procedures. All claims are externally verifiable by inspecting the released files and the cited source database, so no load-bearing step reduces to a self-definition or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a processed dataset rather than new theory or methods, relying on the existing MIMIC-CXR collection and off-the-shelf NLP labelers without introducing new free parameters or entities.

axioms (1)

domain assumption: NLP tools produce labels that reflect the clinical findings described in the radiology reports
The 14 labels are generated solely by applying two existing NLP tools to the reports; no independent image-based validation is described.

pith-pipeline@v0.9.0 · 5548 in / 1155 out tokens · 28626 ms · 2026-05-17T04:11:25.530810+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV 2026-05 accept novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation
cs.CV 2026-04 unverdicted novelty 6.0

RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.

CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
cs.CV 2026-04 unverdicted novelty 6.0

CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
cs.LG 2026-04 unverdicted novelty 6.0

ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution
cs.CV 2026-04 accept novelty 6.0

Replacing the generic Stable Diffusion VAE with domain-specific MedVAE pretrained on 1.6M medical images improves diffusion-based SR PSNR by 2.91-3.29 dB on knee/brain MRI and chest X-ray, with gains in fine details a...

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

LLM-based semantic encoding of tabular variables creates schema-adaptive embeddings that support zero-shot transfer and improve multimodal dementia diagnosis on NACC and ADNI datasets.

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
cs.CV 2026-04 unverdicted novelty 6.0

Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

Gaze2Report: Radiology Report Generation via Visual-Gaze Prompt Tuning of LLMs
q-bio.TO 2026-04 unverdicted novelty 6.0

Gaze2Report combines predicted eye-gaze scanpaths and graph neural networks with LoRA-tuned LLMs to generate radiology reports that incorporate human visual attention without requiring gaze data at inference time.

Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models
cs.CV 2026-03 unverdicted novelty 6.0

CARPA generates anatomically faithful synthetic chest X-rays with controlled clinical concept insertions and deletions to expand training coverage and improve model precision, calibration, and reliability on real benchmarks.

NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
cs.CV 2026-03 unverdicted novelty 6.0

NeuroSymb-MRG uses differentiable logic chains and uncertainty-driven sampling to produce more factually consistent radiology reports than standard encoder-decoder or retrieval methods.

Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook
cs.CV 2025-12 accept novelty 6.0

LNMBench shows existing noisy-label methods degrade sharply under high and realistic noise in medical images due to class imbalance and domain shifts, and proposes a simple robustness fix.

Capabilities of Gemini Models in Medicine
cs.AI 2024-04 unverdicted novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

SparseContrast: Dynamic Sparse Attention for Efficient and Accurate Contrastive Learning in Medical Imaging
cs.CV 2026-04 unverdicted novelty 5.0

SparseContrast uses a saliency-guided dynamic sparse attention mechanism inside contrastive learning to cut training and inference time by up to 40% while matching or exceeding accuracy on chest X-ray tasks.

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
cs.AI 2026-04 unverdicted novelty 5.0

MARCH is a multi-agent system mimicking radiology department hierarchy that generates more clinically accurate and linguistically correct CT reports than prior single-model approaches.

M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
cs.CV 2026-04 unverdicted novelty 5.0

M-IDoL learns modality-specific and diverse representations by maximizing inter-modality entropy and minimizing intra-modality uncertainty through information decomposition in MoE subspaces.

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
cs.CL 2026-02 unverdicted novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 17 Pith papers

[1] Rosenkrantz AB, Hughes DR, Duszak Jr R. The US radiologist workforce: an analysis of temporal and geographic variation by using large national datasets. Radiology. 2015;279(1):175–184

[2] Rosenkrantz AB, Wang W, Hughes DR, Duszak Jr R. A county-level analysis of the US radiologist workforce: physician supply and subspecialty characteristics. Journal of the American College of Radiology. 2018;15(4):601–606

[3] Rimmer A. Radiologist shortage leaves patient care at risk, warns royal college. BMJ: British Medical Journal (Online). 2017;359

[4] Bastawrous S, Carney B. Improving Patient Safety: Avoiding Unread Imaging Exams in the National VA Enterprise Electronic Health Record. Journal of digital imaging. 2017;30(3):309–313

[5] Rosman DA, Nshizirungu JJ, Rudakemwa E, Moshi C, Tuyisenge JdD, Uwimana E, et al. Imaging in the land of 1000 hills: Rwanda radiology country report. Journal of Global Radiology. 2015;1(1):5

[6] Ali FS, Harrington SG, Kennedy SB, Hussain S. Diagnostic Radiology in Liberia: a country report. Journal of Global Radiology. 2015;1(2):6

[7] Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intelligent Systems. 2009;24(2):8–12

[8] Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In: Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE; 2017. p. 843–852

[9] Shiraishi J, Katsuragawa S, Ikezoe J, Matsumoto T, Kobayashi T, Komatsu Ki, et al. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. American Journal of Roentgenology. 2000;174(1):71–74

[10] Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, et al. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association. 2015;23(2):304–310

[11] Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE; 2017. p. 3462–3471

[12] Peng Y, Wang X, Lu L, Bagheri M, Summers R, Lu Z. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings. 2018;2017:188

[13] Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Thirty-Third AAAI Conference on Artificial Intelligence; 2019

[14] Mason D, pydicom contributors. pydicom v1.3.0. Zenodo; 2019. Available from: https://doi.org/10.5281/zenodo.3333768

[15] Johnson AEW, Lungren M, Peng Y, Lu Z, Mark RG, et al. MIMIC-CXR-JPG Database. PhysioNet; 2019. Available from: https://doi.org/10.13026/8360-t248

[16] Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215–e220

[17] Johnson AEW, Pollard TJ, Mark RG, Berkowitz SG, Horng S. MIMIC-CXR Database. PhysioNet; 2019. Available from: https://doi.org/10.13026/C2JT1Q

[18] Johnson AEW, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3:160035

[19] Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific data. 2018;5

[20] Johnson AEW, Pollard TJ. The MIMIC-CXR Code Repository v2.0.0. Zenodo; 2019. Available from: https://doi.org/10.5281/zenodo.3539363