pith. machine review for the scientific record.

arxiv: 2605.05761 · v1 · submitted 2026-05-07 · 💻 cs.CV


iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models


Pith reviewed 2026-05-08 14:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords virtual lesion trials · lung CT evaluation · model benchmarking · synthetic data · ControlNet synthesis · nodule profiling · medical imaging AI · shortcut detection

The pith

A programmable framework generates controlled virtual lesion trials on real lung CTs, producing model performance rankings that match real clinical data (Spearman ρ = 0.93).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard benchmarks entangle lesion size, lobe location, anatomy, and scan context, so it is hard to isolate what actually drives model accuracy on lung CT tasks. iTRIALSPACE builds virtual trials by profiling nodules across multiple datasets, specifying explicit trial conditions, inserting anatomy-aware masks, and synthesizing final images with ControlNet. The resulting 55,469-sample study shows that rankings obtained on these synthetic trials transfer strongly to real held-out data. The same controlled modes also surface specific behaviors such as size-prediction collapse when lobe distributions are equalized and large host-to-donor variance in twin-cross tests. This gives researchers a way to run falsifiable, auditable experiments that static retrospective collections cannot provide.

Core claim

iTRIALSPACE composes real clinical CTs and lesion profiles into programmable virtual lesion trials via a four-stage pipeline of multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned synthesis. Across 13 trial modes the synthetic substrate stays within the real-to-real FID baseline, and model performance rankings on the virtual trials transfer to real clinical data with Spearman ρ = 0.93. The controlled modes additionally expose shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis.
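Mechanically, the transfer claim is a rank correlation between two score vectors, one per substrate. A minimal sketch of that check, with invented per-configuration accuracies (not the paper's data) and a no-ties simplification:

```python
# Sketch of the rank-transfer check behind rho = 0.93. The scores are invented
# placeholders for (model, task, guidance) configurations, not the paper's data.

def rankdata(xs):
    # ranks 1..n by value; assumes no tied scores for simplicity
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

synthetic = [0.81, 0.74, 0.62, 0.55, 0.48, 0.33]  # accuracy on virtual trials
real      = [0.79, 0.71, 0.65, 0.50, 0.52, 0.31]  # accuracy on held-out real CTs

print(f"rho = {spearman(synthetic, real):.2f}")  # one adjacent swap -> rho = 0.94
```

In practice `scipy.stats.spearmanr` would report both the coefficient and the p-value; a single rank inversion between the two substrates is what keeps the illustrative rho just below 1.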

What carries the argument

The four-stage pipeline that turns real CTs and 54-attribute nodule profiles into auditable virtual lesion trials through ControlNet-conditioned synthesis.
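As a reading aid, the "explicit trial specification" stage can be pictured as a small, auditable data structure plus a deterministic sampler. This is a hedged sketch: the field names (`n_cases`, `prevalence`, `nodule_filter`, `seed`) are illustrative stand-ins for the paper's (n, π, ϕ, σ) trial tuple, not its actual schema.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TrialSpec:
    """Illustrative stand-in for a trial tuple (n, pi, phi, sigma)."""
    n_cases: int                           # cohort size n
    prevalence: float                      # target malignant fraction pi
    nodule_filter: Callable[[dict], bool]  # phi_nod: eligibility predicate
    seed: int = 0                          # sigma: makes sampling reproducible

def sample_cohort(spec: TrialSpec, profiles: list) -> list:
    """Deterministically sample an eligible cohort from profiled nodules."""
    rng = random.Random(spec.seed)
    eligible = [p for p in profiles if spec.nodule_filter(p)]
    return rng.sample(eligible, min(spec.n_cases, len(eligible)))

# An M2-style size-stratified condition: nodules with diameter in [6, 10) mm.
profiles = [{"id": i, "diameter_mm": d}
            for i, d in enumerate([3, 5, 6, 8, 9, 12, 25])]
spec = TrialSpec(n_cases=3, prevalence=0.04,
                 nodule_filter=lambda p: 6 <= p["diameter_mm"] < 10, seed=7)
cohort = sample_cohort(spec, profiles)
assert all(6 <= p["diameter_mm"] < 10 for p in cohort)
assert sample_cohort(spec, profiles) == cohort  # same seed, same manifest
```

The property mirrored here is auditability: two runs with the same specification yield the same manifest, so a trial can be re-executed and checked rather than only reported.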

If this is right

  • Synthetic rankings can be used to screen or compare models before large-scale real-data testing.
  • Lobe-equalized sampling trials can diagnose shortcut learning in size or location prediction tasks.
  • Twin-cross host-donor swaps quantify how much model output depends on surrounding anatomy versus lesion features.
  • The framework supports systematic, hypothesis-driven ablation of factors that remain entangled in fixed benchmarks.
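To make the twin-cross numbers concrete, here is a hedged sketch of how a host-to-donor variance ratio can be read off an M13-style transfer matrix. The accuracy values are invented so the ratio lands near the paper's scale; the paper's exact estimator may differ.

```python
# Sketch: how much does accuracy vary with host anatomy vs. donor lesion?
# acc[i][j] = accuracy when donor nodule j is transplanted into host anatomy i.
# All values are invented for illustration.

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

acc = [
    [0.82, 0.76, 0.79],
    [0.64, 0.58, 0.61],
    [0.73, 0.67, 0.70],
]
host_means  = [mean(row) for row in acc]        # marginal over donors
donor_means = [mean(col) for col in zip(*acc)]  # marginal over hosts
ratio = var(host_means) / var(donor_means)
print(f"host-to-donor variance ratio = {ratio:.1f}")  # prints 9.0 here
```

A ratio well above 1, as in the paper's 8.9x figure, says the surrounding anatomy moves accuracy far more than which lesion was transplanted in.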

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same profiling-plus-synthesis approach could be adapted to other organs or modalities once corresponding multi-attribute lesion databases exist.
  • High transfer correlation suggests the framework could reduce the volume of real patient data needed for initial model validation.
  • Trial modes that isolate specific anatomical or acquisition variables offer a route to more targeted regulatory or clinical deployment audits.

Load-bearing premise

Synthesized images produced by ControlNet do not introduce model-specific artifacts or biases that are absent from real clinical CT scans.

What would settle it

A new set of models evaluated on both iTRIALSPACE trials and independent real clinical CT data showing low or negative correlation in performance rankings would falsify the transfer claim.

Figures

Figures reproduced from arXiv: 2605.05761 by Fakrul Islam Tushar, Geoffrey D. Rubin, Joseph Y. Lo, Umme Hafsa Momy.

Figure 1. iTRIALSPACE overview. Real clinical CTs and anatomy segmentations are converted into structured nodule profiles (Stage 1), instantiated as explicit trial specifications and manifests (Stage 2), composed through blueprint-guided donor-to-host mask insertion (Stage 3), and synthesized into CT volumes with NodMAISI (Stage 4).
Figure 2. Virtual Lesion Study (VLS) overview. iTRIALSPACE-generated synthetic CTs are combined with real clinical CTs to form the evaluation corpus used in the VLS. The study compares three medical VLMs across three clinically motivated tasks (presence detection, lobe localisation, and size classification) and four spatial-guidance conditions: plain, bounding box, contour, and bounding box + contour.
Figure 3. Spatial-guidance effects across models, tasks, and domains. Real and synthetic accuracies for BiomedCLIP, LLaVA-Med, and MedGemma under plain input, bounding-box guidance, contour guidance, and combined guidance. Guidance substantially improves presence detection and lobe localisation for BiomedCLIP and MedGemma, whereas size classification remains the most resistant task across domains.
Figure 4. Controlled interventions reveal shortcut and host effects in size classification. (a) Size accuracy across M1–M3; the dashed line marks 25% chance. BiomedCLIP and LLaVA-Med collapse to near chance by M3, while MedGemma remains above chance. (b) M13 host–donor transfer matrices. Accuracy varies more across host anatomy than donor source.
Original abstract

We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data (ρ = 0.93, p < 10^{-15}). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces iTRIALSPACE, a programmable framework for controlled virtual lesion trials in lung CT. It composes real clinical CTs with profiled nodules via a four-stage pipeline (multidataset profiling of 13,140 nodules, explicit trial specification, anatomy-aware mask insertion, ControlNet-conditioned synthesis) to instantiate 13 trial modes. A 55,469-sample study across three medical VLMs, four spatial-guidance conditions, and three tasks shows synthetic FID within real-to-real baselines and strong transfer of performance rankings to real clinical data (ρ = 0.93, p < 10^{-15}), while controlled modes reveal shortcut-driven size-prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9× and 3.3× in twin-cross analysis.

Significance. If the reported transfer holds, the work supplies a valuable, auditable infrastructure for disentangling factors (lesion size, lobe prevalence, anatomy, acquisition) that static retrospective benchmarks entangle. The scale (55k samples), use of public datasets, explicit trial construction, and demonstration that controlled modes expose findings unavailable to fixed-distribution benchmarks are concrete strengths. The high rank correlation and variance-ratio results provide falsifiable, quantitative support for the framework's utility beyond conventional evaluation.

major comments (1)
  1. [Results / Virtual Lesion Study] The central transfer claim (ρ = 0.93) is load-bearing; while the abstract states that synthetic FID lies inside the real-to-real baseline and that the observed rank transfer itself evidences absence of differential artifacts, a dedicated subsection should report lesion-specific metrics (e.g., size, texture, or density histograms) comparing synthetic vs. real lesions within each trial mode to rule out model-specific synthesis biases.
minor comments (3)
  1. [Abstract] Abstract: specify the exact statistical test underlying p < 10^{-15} (Spearman rank correlation significance test?) and report the number of models/tasks contributing to the pooled ρ.
  2. [Methods] Methods: the four-stage pipeline is described at high level; add a figure or table enumerating the 13 trial modes with their explicit specifications (e.g., lobe-equalized sampling parameters, twin-cross host/donor definitions).
  3. [Experimental Setup] The three clinical tasks and four spatial-guidance conditions are referenced but not itemized; list them explicitly with one-sentence definitions to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion to strengthen the validation of the central transfer claim. We will incorporate the requested lesion-specific analysis as a dedicated subsection in the revised manuscript.

read point-by-point responses
  1. Referee: [Results / Virtual Lesion Study] The central transfer claim (ρ = 0.93) is load-bearing; while the abstract states that synthetic FID lies inside the real-to-real baseline and that the observed rank transfer itself evidences absence of differential artifacts, a dedicated subsection should report lesion-specific metrics (e.g., size, texture, or density histograms) comparing synthetic vs. real lesions within each trial mode to rule out model-specific synthesis biases.

    Authors: We agree that explicit lesion-level distributional comparisons would provide additional, direct evidence against model-specific synthesis biases and complement the existing FID and rank-correlation results. In the revision we will add a new subsection (placed after the current FID analysis) that reports, for each of the 13 trial modes: (i) histograms and summary statistics of lesion size (diameter), (ii) texture features (e.g., Haralick, LBP), and (iii) density (HU) distributions, together with quantitative similarity measures (Kolmogorov-Smirnov tests and Wasserstein distances) between the synthetic lesions and their real counterparts drawn from the same trial-mode specification. These metrics will be computed on the 55,469-sample study set and will be presented alongside the existing performance tables. revision: yes
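The distributional checks the rebuttal promises are standard one-dimensional comparisons. A hedged sketch with invented diameter samples: in practice `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` would do this; the hand-rolled versions below only make the definitions explicit, and the equal-size Wasserstein shortcut is a simplification of the sketch.

```python
# Compare synthetic vs. real lesion-diameter samples, as the rebuttal proposes.
# The diameter values are invented for illustration, not the paper's data.

def ks_statistic(a, b):
    # Kolmogorov-Smirnov statistic: max vertical gap between empirical CDFs
    pooled = sorted(set(a) | set(b))
    def ecdf(xs, t):
        return sum(x <= t for x in xs) / len(xs)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in pooled)

def wasserstein_1d(a, b):
    # 1-D Wasserstein distance; for equal-size samples this reduces to the
    # mean absolute difference of the sorted values
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

real_mm      = [4.1, 5.0, 6.3, 8.2, 9.7, 12.5, 15.0, 18.4]
synthetic_mm = [4.3, 5.2, 6.0, 8.5, 9.4, 12.9, 14.6, 18.0]

ks = ks_statistic(real_mm, synthetic_mm)
w1 = wasserstein_1d(real_mm, synthetic_mm)
print(ks, w1)
```

Small values of both statistics, per trial mode, would be the direct evidence against synthesis-specific distribution shift that the referee asks for.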

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents iTRIALSPACE as an empirical evaluation framework constructed from public CT datasets and ControlNet synthesis. Its central claim—that synthetic performance rankings transfer to real data with measured ρ = 0.93—is an observed correlation from a 55,469-sample study across explicit trial modes, not a quantity derived by construction from the framework's inputs or definitions. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the controlled modes (lobe-equalized sampling, twin-cross analysis) and FID comparisons are specified independently and validated against external real-data benchmarks. The pipeline remains self-contained without reducing claims to internal tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest primarily on the domain assumption of synthesis fidelity; no explicit free parameters or invented entities are described.

axioms (1)
  • domain assumption ControlNet-conditioned CT synthesis produces images distributionally close enough to real clinical CTs that model performance rankings transfer without introducing spurious features.
    This assumption is required for the claim that synthetic trials are valid proxies and that rankings transfer to real data.

pith-pipeline@v0.9.0 · 5570 in / 1334 out tokens · 80968 ms · 2026-05-08T14:55:23.631188+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1] S. G. Armato et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI). Medical Physics, 38(2):915–931, 2011.
  2. [2] S. G. Armato et al. LUNGx Challenge for computerized lung nodule classification. Journal of Medical Imaging, 3(4):044506, 2016.
  3. [3] A. A. A. Setio et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical Image Analysis, 42:1–13, 2017.
  4. [4] D. Peeters et al. LUNA25: Lung nodule malignancy risk estimation challenge. In MICCAI, 2025.
  5. [5] A. J. Wang et al. The Duke Lung Cancer Screening (DLCS) dataset: a reference dataset of annotated low-dose screening thoracic CT. Radiology: Artificial Intelligence, 7(4):e240248, 2025.
  6. [6] J. Pedrosa et al. LNDb: A lung nodule database on computed tomography. Medical Image Analysis, 70:102027, 2021.
  7. [7] H. J. W. L. Aerts et al. Data from NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive.
  8. [9] M. Zhao et al. Integrated multiomics signatures to optimize the accurate diagnosis of lung cancer. Nature Communications, 16(1):84, 2025.
  9. [10] D. R. Aberle et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. NEJM, 365(5):395–409, 2011.
  10. [11] H. J. de Koning et al. Reduced lung-cancer mortality with volume CT screening in the NELSON trial. NEJM, 382(6):503–513, 2020.
  11. [12] P. Guo et al. MAISI: Medical AI for Synthetic Imaging. In WACV, 2025. arXiv:2409.11169.
  12. [13] F. I. Tushar et al. NodMAISI: Controlled lung nodule synthesis for multi-dataset evaluation. arXiv:2512.18038, 2025.
  13. [14] F. I. Tushar et al. Virtual lung screening trial (VLST): An in silico replication of the NLST. Medical Image Analysis, 103:103576, 2025.
  14. [15] F. I. Tushar et al. Utility of the virtual imaging trials methodology for objective characterization of AI systems and training data. Journal of Medical Imaging, 13(1):014506, 2026.
  15. [16] F. I. Tushar et al. SYN-LUNGS: Synthetic lung data for CAD training augmentation. arXiv:2502.21187, 2025.
  16. [17] F. I. Tushar. PiNS: Point-driven nodule segmentation toolkit. Software, version 1.0.0, 2025. DOI: 10.5281/zenodo.17171571. License: CC-BY-NC-4.0.
  17. [18] A. Badano et al. Evaluation of digital breast tomosynthesis as replacement of full-field digital mammography using an in silico imaging trial (VICTRE). JAMA Network Open, 1(7):e185474, 2018.
  18. [19] E. Sizikova et al. M-SYNTH: A VICTRE-inspired synthetic breast imaging dataset. In NeurIPS, 2023.
  19. [20] A. Pezeshk, N. Petrick, and B. Sahiner. Seamless lesion insertion for data augmentation in CAD training. IEEE Transactions on Medical Imaging, 36(5):1005–1015, 2017.
  20. [21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, volume 30, 2017. https://arxiv.org/abs/1706.08500
  21. [22] J. Wasserthal et al. TotalSegmentator: Robust segmentation of 104 anatomical structures in CT images. Radiology: Artificial Intelligence, 5(5):e230024, 2023.
  22. [23] Q. Zhu et al. CT-Bench: A benchmark for multimodal lesion understanding in computed tomography. arXiv:2602.14879, 2026.
  23. [24] S. Zhang et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI, 2024. arXiv:2303.00915.
  24. [25] C. Li et al. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. In NeurIPS Datasets & Benchmarks, 2023.
  25. [26] A. Sellergren et al. MedGemma: Health AI developer foundations. arXiv:2507.05201, 2025.
  26. [27] P. Liang et al. Holistic evaluation of language models. arXiv:2211.09110, 2022.
  27. [28] Y. He et al. VISTA3D: Versatile imaging segmentation and annotation model for 3D computed tomography. arXiv:2406.05285, 2024.

  28. [29] Prevalence control (M1). Establishes the baseline reference cohort calibrated to a published screening prior (π = 0.04 under the NLST target, i.e. 4% of sampled cases are malignant), against which all other modes are compared.
  29. [30] Size control (M2). Detectability is expected to have non-monotonic sensitivity to nodule diameter; this mode stratifies into six size-bin sub-cohorts ([0,4), [4,6), [6,10), [10,20), [20,30), [30,∞) mm) while holding all other iTRIALSPACE components fixed, enabling a direct measurement of size effect.
  30. [31] Location isolation (M3). Enables controlled evaluation of lobe-specific model behaviour by holding nodule size fixed (6–15 mm) across all five lobes, a cohort design that cannot be constructed from real registries, where lobe and size are naturally correlated.
  31. [32] Demographic stratification (M4). Lung CT texture, airway anatomy, and parenchymal density differ across age and sex, which can drive spurious accuracy differences; M4 constructs matched sub-cohorts per demographic stratum from the source dataset with complete demographic metadata (DLCS24).
  32. [33] Counterfactual prevalence (M5). Apparent model accuracy shifts with class prevalence even when the model is unchanged, because threshold-based metrics depend on the positive:negative ratio; M5 separates this statistical artifact from true model behaviour by holding host anatomy constant while varying prevalence (e.g., π ∈ {0.01, 0.02, 0.05, ...}).
  33. [34] Cross-dataset transfer (M6). Nodule appearance (texture, boundary sharpness, attenuation) varies with acquisition scanner and reconstruction kernel across datasets; M6 tests whether a model's performance is confounded by nodule source independent of anatomy, with one sub-cohort per source dataset.
  34. [35] Bootstrap CI (M7). A single cohort of size n yields a point estimate of accuracy; M7 quantifies the sampling uncertainty of that estimate by rebuilding the specification with B different seeds and taking percentile confidence intervals over the replicates.
  35. [36] Algorithm comparison (M8). Head-to-head model comparisons are confounded if each model is tested on a different cohort realisation; M8 eliminates this by fixing a single held-out manifest shared across all models, excluding the union of each model's documented pre-training datasets.
  36. [37] Screening simulation (M9). In longitudinal screening, most prevalent nodules are detected in round 1, so subsequent cohorts are enriched for harder, incident lesions at progressively lower prevalence; M9 replicates this dynamic with three rounds of geometrically decaying prevalence.
  37. [38] Multi-nodule context (M10). Real lung CTs often contain multiple nodules, and visual models may attend to the most salient lesion rather than the designated target; M10 measures how concurrent nodule presence affects per-target accuracy via a mixed manifest of single- and multi-nodule cases.
  38. [39] Twin isolation (M11). The digital twin modes (M11–M13) preserve the identity relationship between donor and host. M11 obtains per-lesion performance under a patient's own anatomy, removing cross-patient anatomical confounds: each nodule is re-inserted into its original scan as if placed anew, one specification per annotated nodule.
  39. [40] Twin complete (M12). A patient's clinical CT contains all of their nodules simultaneously; M12 reconstructs this complete scenario to evaluate multi-nodule CT quality and per-patient accuracy, producing one synthetic CT per patient rather than one per nodule.
  40. [41] Twin cross (M13). To isolate the contribution of host anatomy to per-nodule accuracy independent of nodule properties, M13 transplants a nodule into a different patient's anatomy, enabling direct anatomy ablation; an assignment matrix governs donor-to-host pairing under three policies.
