VIDS: A Verified Imaging Dataset Standard for Medical AI

Joan S. Muthu; John Shalen

arxiv: 2604.17525 · v1 · submitted 2026-04-19 · 📡 eess.IV · cs.CV

Pith reviewed 2026-05-10 04:56 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords VIDS · medical imaging datasets · dataset provenance · DICOM metadata · NIfTI format · AI dataset validation · annotation quality · medical AI

The pith

VIDS introduces a machine-enforceable standard for medical imaging datasets that tracks annotation provenance and quality documentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VIDS as an open specification that defines consistent folder layouts, file naming, annotation history schemas, and quality records for medical imaging data used in AI. It fills the gap left by DICOM, which handles individual studies, and BIDS, which focuses on research organization, by adding curation-level checks that current datasets often lack. VIDS keeps NIfTI as the working image format while storing full original DICOM metadata in sidecar files to maintain traceability, and it includes 21 validation rules plus export paths to common ML tools. Benchmarking of four widely used public datasets shows they meet only 20 to 39 percent of the 22 defined compliance dimensions, with the biggest shortfalls in provenance and quality documentation. The authors release LIDC-Hybrid-100, a 100-subject CT dataset with consensus annotations that passes all 21 rules on the full compliance profile.

Core claim

VIDS establishes folder layouts, naming rules, provenance schemas for who annotated what and with what tool, plus quality documentation requirements, all backed by 21 machine-enforceable validation rules across two compliance profiles. It adopts NIfTI as the canonical working format while preserving complete DICOM metadata in sidecars, and it supports direct export to frameworks such as nnU-Net, MONAI, COCO, or flat NIfTI without loss of traceability information. Benchmarking reveals that even major public datasets like LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon satisfy only 20-39 percent of the 22 compliance dimensions, primarily failing on provenance and quality. A

What carries the argument

The 22 compliance dimensions and 21 validation rules that check dataset structure, annotation provenance, quality documentation, and ML export readiness while preserving DICOM metadata in sidecars.
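As an illustration of what a machine-enforceable provenance rule could look like, the sketch below checks one annotation sidecar for required fields. The field names (`annotator`, `tool`, `date`, `qc`) are assumptions loosely based on the Figure 2 caption, and `check_provenance` is a hypothetical helper; neither is taken from the VIDS specification or from the `vids-validator` package.

```python
import json

# Hypothetical field names, loosely following the Figure 2 caption
# (annotator, tool, date, QC). The real VIDS schema may differ.
REQUIRED_PROVENANCE_FIELDS = ("annotator", "tool", "date", "qc")


def check_provenance(sidecar_text: str) -> list[str]:
    """Return human-readable violations for one annotation sidecar."""
    try:
        record = json.loads(sidecar_text)
    except json.JSONDecodeError as exc:
        return [f"sidecar is not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["sidecar must be a JSON object"]

    violations = []
    for field in REQUIRED_PROVENANCE_FIELDS:
        value = record.get(field)
        # Treat absent and empty values the same way: the record is incomplete.
        if value is None or value == "":
            violations.append(f"missing provenance field: {field}")
    return violations


# Illustrative records only; the values are made up.
complete = json.dumps(
    {"annotator": "reader_01", "tool": "example-tool 1.0",
     "date": "2026-01-15", "qc": "reviewed"}
)
incomplete = json.dumps({"annotator": "reader_01", "tool": ""})

print(check_provenance(complete))
print(check_provenance(incomplete))
```

Returning a list of violations rather than a single boolean lets a curator see every gap in one pass, which is the kind of feedback an automated rule set would need to give.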

If this is right

Existing public datasets can be converted to VIDS format to add missing provenance records and pass automated validation.
Dataset creators can apply the rules from the start so that every annotation step remains traceable for later review.
ML pipelines can import VIDS datasets into nnU-Net or MONAI while retaining full metadata and quality flags.
Validators can be run automatically during dataset curation to catch gaps before release.
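The last point, running checks automatically during curation, can be sketched as a directory scan that flags annotation masks with no accompanying sidecar. The layout used here (a `seg.nii.gz` mask with a `seg.json` sidecar beside it, one folder per subject) is an assumption for illustration and is not the folder layout defined by VIDS.

```python
import tempfile
from pathlib import Path

MASK_SUFFIX = ".nii.gz"


def find_masks_without_sidecar(root: Path) -> list[str]:
    """List masks (relative POSIX paths) that have no JSON sidecar beside them."""
    missing = []
    for mask in sorted(root.rglob("*" + MASK_SUFFIX)):
        # Hypothetical convention: same stem, ".json" extension.
        sidecar = mask.with_name(mask.name[: -len(MASK_SUFFIX)] + ".json")
        if not sidecar.is_file():
            missing.append(mask.relative_to(root).as_posix())
    return missing


def demo() -> list[str]:
    """Build a two-subject toy dataset and report masks lacking a sidecar."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for subject, has_sidecar in (("sub-001", True), ("sub-002", False)):
            folder = root / subject
            folder.mkdir()
            (folder / "seg.nii.gz").touch()  # empty placeholder, not a real image
            if has_sidecar:
                (folder / "seg.json").write_text("{}", encoding="utf-8")
        return find_masks_without_sidecar(root)


print(demo())
```

A check like this could run in continuous integration before a dataset release, so that a missing provenance record blocks publication instead of being discovered later.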

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption could make it easier to combine data from multiple sources without losing track of how each label was created.
The sidecar approach for DICOM metadata may reduce the need to store duplicate full DICOM archives alongside working NIfTI files.
Future extensions could apply the same provenance rules to non-imaging medical data such as electronic health records.

Load-bearing premise

That the 22 dimensions and 21 rules capture the essential requirements for dataset quality and provenance without becoming either too narrow or too burdensome for everyday use.

What would settle it

A large-scale comparison in which models trained on VIDS-converted versions of existing datasets show no measurable improvement in accuracy or robustness over models trained on the original versions.

Figures

Figures reproduced from arXiv: 2604.17525 by Joan S. Muthu, John Shalen.

**Figure 2.** VIDS annotation sidecar with provenance, documenting the annotator, tool, date, and QC

**Figure 3.** Average compliance across four public datasets by category. Provenance and quality documen

Original abstract

Medical imaging AI development is fundamentally dependent on annotated datasets, yet no existing standard provides machine-enforceable validation across dataset structure, annotation provenance, quality documentation, and ML readiness within a single framework. DICOM standardizes image acquisition, storage, and communication at the individual study level. BIDS organizes neuroimaging research datasets with consistent naming conventions. Neither addresses the curation layer, viz., who annotated what, when, with what tool, and to what quality standard. This paper presents VIDS (Verified Imaging Dataset Standard), an open specification that defines folder layout, file naming, annotation provenance schemas, quality documentation, and 21 machine-enforceable validation rules across two compliance profiles. VIDS uses NIfTI as a canonical working format while preserving full DICOM metadata in sidecars for traceability, and supports export to any downstream ML framework (nnU-Net, MONAI, COCO, flat NIfTI) without loss of provenance. Twenty-two compliance dimensions are defined and four major public datasets -- LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon -- are benchmarked against these dimensions. Even widely used datasets satisfy only 20--39% of these dimensions, with provenance and quality documentation as the largest systematic gaps. LIDC-Hybrid-100 is released as a 100-subject VIDS-compliant reference CT dataset with consensus segmentation masks from four radiologist annotations (mean pairwise Dice 0.7765), validating 21/21 on the Full compliance profile. VIDS is fully open source: the specification is CC BY 4.0, all tools are Apache 2.0, the reference validator is available on PyPI (pip install vids-validator), and LIDC-Hybrid-100 is published on Zenodo (https://doi.org/10.5281/zenodo.19582717).
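The abstract reports a mean pairwise Dice across four radiologist annotations. The sketch below shows how such a figure is commonly computed: the Dice coefficient for every pair of readers, averaged over all pairs. It is not the authors' code, the toy masks are made up, and scoring two empty masks as 1.0 is an assumption of this sketch.

```python
from itertools import combinations


def dice(a: list[int], b: list[int]) -> float:
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_pairwise_dice(masks: list[list[int]]) -> float:
    """Average Dice over every unordered pair of annotators."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("at least two masks are required")
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Four toy readers over six voxels; four readers give six pairs.
readers = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
print(round(mean_pairwise_dice(readers), 4))
```

For real volumes the same arithmetic would usually be done on array data loaded from the NIfTI masks; plain lists are used here only to keep the example dependency-free.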

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VIDS layers provenance tracking and validation rules onto NIfTI/DICOM for medical imaging datasets and shows popular ones fall short, but the 22 dimensions rest on author choices without clear derivation.

VIDS defines a standard that adds provenance schemas and machine-enforceable validation rules to medical imaging datasets built on NIfTI and DICOM. The paper benchmarks four major public datasets against 22 compliance dimensions and finds they meet only 20-39 percent, with provenance and quality documentation as the largest gaps. It also releases a validator, the full spec, and a 100-subject reference CT dataset from LIDC with consensus radiologist annotations that passes 21 out of 21 rules on the full profile. The format choice keeps DICOM metadata in sidecars while using NIfTI for working files and supports export to nnU-Net, MONAI, and similar tools without losing traceability. Everything ships open source with a pip-installable validator and Zenodo data release. This makes the proposal immediately testable and usable for dataset curation. The practical bridge between existing standards and ML pipelines is the strongest element.

The main limitation is that the work gives no account of how the authors selected exactly these 22 dimensions and 21 rules. No literature synthesis, expert Delphi process, or explicit mapping from DICOM and BIDS gaps is described. The compliance percentages therefore depend on the authors' judgment about what counts as essential. If the set omits key items like inter-rater reliability or regulatory traceability, or includes low-value checks, the gap claims become harder to generalize. The reference dataset shows the rules can be met in practice, which is useful, but does not validate that these are the right dimensions.

This paper is for medical AI groups that curate or reuse annotated imaging data and want machine-checkable standards for reproducibility. Readers working on training pipelines would get concrete tools and a starting point for better documentation. The implementation and data release show clear thinking, so the paper deserves peer review where referees can test the dimensions against real workflows and suggest adjustments.

Referee Report

1 major / 1 minor

Summary. The paper introduces VIDS (Verified Imaging Dataset Standard), an open specification defining folder layout, file naming, annotation provenance schemas, quality documentation, and 21 machine-enforceable validation rules across two compliance profiles for medical imaging datasets. VIDS adopts NIfTI as the canonical working format while retaining full DICOM metadata in sidecars and supports lossless export to frameworks such as nnU-Net, MONAI, and COCO. Twenty-two compliance dimensions are defined; four public datasets (LIDC-IDRI, BraTS, CheXpert, Medical Segmentation Decathlon) are benchmarked and found to satisfy only 20-39% of the dimensions, with provenance and quality documentation as the largest gaps. The authors release LIDC-Hybrid-100, a 100-subject VIDS-compliant CT reference dataset with consensus segmentations (mean pairwise Dice 0.7765) that passes 21/21 rules on the Full profile. The specification (CC BY 4.0), validator (Apache 2.0, available via pip), and dataset (Zenodo) are fully open.

Significance. If the 22 dimensions and 21 rules are shown to be appropriately scoped and derived, VIDS could provide a practical, machine-checkable framework that improves traceability and quality assurance for medical imaging datasets used in AI. The open release of the validator tool and the LIDC-Hybrid-100 reference dataset with multi-radiologist consensus annotations constitute concrete, immediately usable contributions that support reproducibility.

major comments (1)

[Definition of the 22 compliance dimensions] The manuscript does not describe any systematic process (literature synthesis, expert Delphi, or explicit gap mapping from DICOM/BIDS) used to arrive at the precise set of 22 compliance dimensions and 21 validation rules. This is load-bearing for the central claim: the reported 20-39% compliance rates for LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon, and the conclusion that provenance/quality gaps are systematic, depend directly on these author-chosen criteria being both essential and non-burdensome. Without such justification it remains possible that the dimensions are incomplete (e.g., omitting regulatory traceability or inter-rater metrics) or include low-value items, rendering the benchmarking percentages non-generalizable. (See the section defining the 22 compliance dimensions and the subsequent benchmarking results.)

minor comments (1)

[Abstract and benchmarking results] The abstract reports that datasets satisfy 'only 20-39% of these dimensions'; a table or explicit breakdown in the results section showing per-dataset and per-dimension compliance would make the gap analysis more transparent and reproducible.
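A per-dataset, per-dimension breakdown of the kind requested here could be generated mechanically. The sketch below is illustrative only: the dataset names, dimension names, and statuses are hypothetical and are not the paper's results, and giving half credit to a partially met dimension is an assumption of this sketch, since the scoring scheme is not described in this summary.

```python
# Credit per status. Half credit for "partial" is an assumption of this sketch.
CREDIT = {"met": 1.0, "partial": 0.5, "not_met": 0.0}


def compliance_percent(statuses: dict[str, str]) -> float:
    """Percentage of dimensions satisfied, with partial credit."""
    if not statuses:
        raise ValueError("at least one dimension is required")
    return 100.0 * sum(CREDIT[s] for s in statuses.values()) / len(statuses)


def breakdown_table(datasets: dict[str, dict[str, str]]) -> str:
    """Render one row per dataset and one column per dimension."""
    dimensions = sorted({dim for statuses in datasets.values() for dim in statuses})
    lines = [" | ".join(["dataset", *dimensions, "compliance"])]
    for name, statuses in datasets.items():
        # A dimension with no recorded status is counted as not met.
        cells = [statuses.get(dim, "not_met") for dim in dimensions]
        percent = compliance_percent(dict(zip(dimensions, cells)))
        lines.append(" | ".join([name, *cells, f"{percent:.1f}%"]))
    return "\n".join(lines)


# Hypothetical data: four made-up dimensions, two made-up datasets.
EXAMPLE = {
    "dataset_a": {"folder_layout": "met", "file_naming": "partial",
                  "annotator_identity": "not_met", "qc_record": "not_met"},
    "dataset_b": {"folder_layout": "met", "file_naming": "met",
                  "annotator_identity": "partial", "qc_record": "not_met"},
}
print(breakdown_table(EXAMPLE))
```

Publishing the underlying status matrix alongside the headline percentages would let readers recompute the figures under a different weighting or a different choice of dimensions.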

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of VIDS and for the constructive major comment. We agree that explicit justification of the compliance dimensions is necessary to support the benchmarking claims and will revise the manuscript accordingly.

Referee: [Definition of the 22 compliance dimensions] The manuscript does not describe any systematic process (literature synthesis, expert Delphi, or explicit gap mapping from DICOM/BIDS) used to arrive at the precise set of 22 compliance dimensions and 21 validation rules. This is load-bearing for the central claim: the reported 20-39% compliance rates for LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon, and the conclusion that provenance/quality gaps are systematic, depend directly on these author-chosen criteria being both essential and non-burdensome. Without such justification it remains possible that the dimensions are incomplete (e.g., omitting regulatory traceability or inter-rater metrics) or include low-value items, rendering the benchmarking percentages non-generalizable. (See the section defining the 22 compliance dimensions and the subsequent benchmarking results.)

Authors: We acknowledge that the submitted manuscript does not include an explicit description of the derivation process for the 22 compliance dimensions and 21 rules. These were obtained by (1) cataloguing recurring dataset deficiencies reported across medical imaging AI literature (e.g., missing annotation provenance, undocumented quality controls), (2) identifying the specific gaps left by DICOM (study-level) and BIDS (research neuroimaging) when applied to curated ML datasets, and (3) retaining only those attributes that can be expressed as machine-enforceable rules. We will add a dedicated subsection (new Section 3.1) that presents this rationale, supplies supporting citations, and provides a gap-mapping table. We maintain that the current dimensions are both essential and practical, as shown by the reference dataset achieving full compliance and by the systematic shortfalls observed in four widely used public collections. We agree, however, that additional discussion of scope limitations (e.g., dataset-specific inter-rater statistics) will further strengthen the paper.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; VIDS is a definitional standard proposal with direct benchmarking

The paper defines 22 compliance dimensions and 21 validation rules as an open specification, then directly counts how many are met by four public datasets (reporting 20-39% compliance) and shows one constructed reference dataset meeting 21/21. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain. The benchmarking is a straightforward application of the proposed criteria rather than any reduction of outputs to inputs by construction. The central claim—that existing datasets exhibit gaps relative to the new standard—is self-contained and does not rely on external uniqueness theorems or smuggled ansatzes. This matches the expected non-finding for a standards proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new specification rather than deriving results from equations; it relies on existing image formats as foundations.

axioms (1)

Domain assumption: NIfTI is an appropriate canonical working format that can preserve full DICOM metadata via sidecars.
Stated as the basis for the standard in the abstract.

invented entities (1)

VIDS compliance profiles and 21 validation rules (no independent evidence)
Purpose: to provide machine-enforceable checks for dataset structure, provenance, and quality.
Newly defined by the authors.
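The domain assumption above, that a NIfTI working file can carry its DICOM metadata in a sidecar, can be illustrated with a small round trip. This is a sketch under stated assumptions: the `<stem>.json` naming, the helper functions, and the metadata values are hypothetical, and a real pipeline would read the attributes from the source DICOM files instead of a hand-written dictionary.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path_for(image_path: Path) -> Path:
    """Map 'x.nii.gz' or 'x.nii' to 'x.json' (hypothetical naming convention)."""
    name = image_path.name
    for suffix in (".nii.gz", ".nii"):
        if name.endswith(suffix):
            return image_path.with_name(name[: -len(suffix)] + ".json")
    raise ValueError(f"not a NIfTI file name: {name}")


def write_sidecar(image_path: Path, dicom_metadata: dict) -> Path:
    """Store DICOM attributes next to the working image as JSON."""
    path = sidecar_path_for(image_path)
    path.write_text(json.dumps(dicom_metadata, indent=2, sort_keys=True),
                    encoding="utf-8")
    return path


def read_sidecar(image_path: Path) -> dict:
    """Load the attributes stored beside the working image."""
    return json.loads(sidecar_path_for(image_path).read_text(encoding="utf-8"))


def demo_round_trip() -> bool:
    """Write made-up metadata beside a placeholder image and read it back."""
    # Attribute keywords follow DICOM naming; the values are invented.
    metadata = {"Modality": "CT", "SliceThickness": 1.25,
                "StudyInstanceUID": "1.2.3.4"}
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "sub-001_ct.nii.gz"
        image.touch()  # empty placeholder, not a real NIfTI volume
        write_sidecar(image, metadata)
        return read_sidecar(image) == metadata


print(demo_round_trip())
```

Whether "full" DICOM metadata survives in practice depends on how binary and nested attributes are serialized, which this sketch does not address.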

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] National Electrical Manufacturers Association. DICOM — Digital Imaging and Communications in Medicine, 2024. Accessed: 2026-04-14.

[2] Krzysztof J Gorgolewski, Tibor Auer, Vince D Calhoun, et al. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3(1):1–9, 2016.

[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.

[4] Mark Everingham, Luc Van Gool, Christopher K I Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[5] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, et al. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.

[6] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI. In ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 1776–1826, 2022.

[7] Fabian Isensee, Paul F Jaeger, Simon A A Kohl, Jens Petersen, and Klaus H Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.

[8] MONAI Consortium. MONAI: Medical open network for AI, 2020. Accessed: 2026-04-14.

[9] Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915–931, 2011.

[10] Bjoern H Menze, Andras Jakab, Stefan Bauer, et al. The BraTS challenge: a comprehensive benchmark for brain tumor segmentation. IEEE Transactions on Medical Imaging, 34(10):1993–2024, 2015. BraTS challenge ongoing; 2023 iteration used for analysis.

[11] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI Conference on Artificial Intelligence, 33:590–597, 2019.

[12] Michela Antonelli, Annika Reinke, Spyridon Bakas, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022.