The stochastic digital human is now enrolling for in silico imaging trials -- Methods and tools for generating digital cohorts

A Badano; B Sahiner; E Sizikova; JG Delfino; MA Anastasio; M Lago; S Guan

arxiv: 2301.08719 · v1 · submitted 2023-01-20 · 💻 cs.AI · physics.med-ph

Pith reviewed 2026-05-06 19:55 UTC · model claude-opus-4-7

classification 💻 cs.AI physics.med-ph

keywords in silico imaging trials · digital human models · virtual clinical trials · cohort sampling · medical device evaluation · computational phantoms · selection bias · data augmentation

The pith

Digital human cohorts are ready to substitute for real patients in imaging device trials, if cohort-sampling bias is handled deliberately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Randomized clinical trials are expensive, slow, risky for participants, and often under-represent subgroups. The paper argues that imaging device evaluation can increasingly be moved into computer simulation, where the "patients" are digital human models with anatomy, physiology, and disease states rendered in enough detail to interact with a simulated scanner. To make that transition usable in practice, the authors give a classification of digital human models, survey the current methods for generating both healthy and diseased instances (including data augmentation as a generation tool), and compare four ways of drawing a cohort from a population of such models. Each sampling strategy comes with its own bias, and the contribution is to make those trade-offs explicit so a trial designer can pick deliberately. A sympathetic reader should care because if this works, new imaging technologies—including ones that cannot yet be built physically—can be screened, optimized, and shown to serve specific subpopulations before any human is exposed.

Core claim

The paper argues that in silico imaging trials—where a medical imaging device is evaluated entirely against simulated patients rather than recruited ones—are now methodologically mature enough to be run, provided the digital cohort is assembled with care. It organises the field by introducing a terminology and classification for digital human models, surveying the methods that generate healthy and diseased anatomies (including augmentation), and laying out four distinct strategies for sampling a cohort from those models. The central practical claim is that each sampling strategy carries its own bias profile, and that an investigator who chooses among them deliberately can run trials that sav

What carries the argument

A taxonomy of digital human models paired with a comparison of four cohort-sampling strategies. The taxonomy fixes what counts as a digital patient (anatomical scope, healthy vs. diseased, generation method), and the sampling comparison maps each strategy to the kind of selection bias it induces in the resulting in silico trial.

If this is right

Imaging devices that cannot be physically prototyped, or whose configurations span too large a space to test in patients, can be screened computationally before any hardware exists.
Subgroup representation can be enforced by construction rather than recruited for, addressing demographic gaps that physical trials struggle to close.
Regulatory submissions for imaging devices could include in silico evidence as a recognised component, shifting some of the evaluation burden away from human enrollment.
Trial designers gain a vocabulary for declaring which sampling strategy was used and which biases were therefore accepted, making in silico studies auditable and comparable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The four-way taxonomy of sampling strategies invites a quantitative bias budget: each strategy could be tagged with an expected effect size on the trial endpoint, turning the qualitative trade-off into a calculable one.
If diseased-anatomy generators improve faster than healthy-anatomy generators (or vice versa), in silico trials will systematically over- or under-estimate device performance on the lagging side; tracking this asymmetry over time would be informative.
Augmentation methods blur the line between 'sampled' and 'synthesised' patients, and the same digital body re-augmented many times is not the same as many independent bodies—an effective sample size, not a nominal one, is what should govern statistical claims.
The framework is most likely to land first in modalities where physics-based simulators are mature (CT, mammography, ultrasound) and to lag in modalities where the imaging physics or the disease signature are harder to render faithfully.
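
The effective-sample-size point above can be made concrete with a minimal sketch. It assumes the standard design-effect approximation for equally sized clusters, n_eff = n / (1 + (m - 1) * rho), where m is the number of augmented variants per base digital human and rho is the intra-cluster correlation of the trial endpoint. Both inputs and the function name are hypothetical illustrations, not quantities or methods reported by the paper.

```python
def effective_sample_size(n_base: int, variants_per_base: int, rho: float) -> float:
    """Approximate effective sample size for a cohort of augmented siblings.

    n_base            -- number of independent base digital humans (assumed)
    variants_per_base -- augmented variants generated from each base (assumed)
    rho               -- intra-cluster correlation of the endpoint, in [0, 1]
    """
    if n_base <= 0 or variants_per_base <= 0:
        raise ValueError("cohort sizes must be positive")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    n_nominal = n_base * variants_per_base
    # Design effect for equal-sized clusters: 1 + (m - 1) * rho.
    design_effect = 1.0 + (variants_per_base - 1) * rho
    return n_nominal / design_effect


# 100 base models, 10 variants each: nominal size is 1000.
fully_correlated = effective_sample_size(100, 10, 1.0)  # collapses to the 100 bases
independent = effective_sample_size(100, 10, 0.0)       # keeps the nominal 1000
```

Under this approximation, a cohort whose augmented siblings are perfectly correlated carries no more information than its base models, which is the distinction between nominal and effective size drawn above.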

Load-bearing premise

That the digital humans available today resemble real patients closely enough—in anatomy, disease expression, and variability—that a device's measured performance on the simulated cohort actually predicts its performance on people.

What would settle it

A head-to-head study where the same imaging device is evaluated in an in silico trial built with these methods and in a matched physical clinical trial, and the two yield materially different performance estimates (sensitivity, specificity, or subgroup-specific accuracy) outside reported uncertainty. Convergence would support the claim; divergence would refute it.
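
As a rough illustration of what "materially different outside reported uncertainty" could mean for a single endpoint, the sketch below compares two sensitivity estimates with a pooled two-proportion z-test. The function name and the counts are hypothetical, and the test treats the two trials as independent samples, which is an assumption. Note also that failing to find a difference is not the same as demonstrating agreement; an equivalence test with a pre-specified margin would be the more appropriate tool for claiming convergence.

```python
import math


def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: lesions detected out of lesions present.
z_same, p_same = two_proportion_z(90, 100, 90, 100)  # identical sensitivities
z_diff, p_diff = two_proportion_z(90, 100, 70, 100)  # 0.90 versus 0.70
```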

Figures

Figures reproduced from arXiv: 2301.08719 by A Badano, B Sahiner, E Sizikova, JG Delfino, MA Anastasio, M Lago, S Guan.

**Figure 1.** Classification of methods to generate digital humans for in silico clinical trials.

**Figure 2.** Effect of sampling strategies on performance assessment. Sampling is from a bimodal distribution of subjects (seen in 3D insert in the second panel from the left) described by 2 random parameters: (from left to right) uniform, matched, simpler, and narrow. Only 20 samples are shown here for ease of visualization. The gray shading depicts the distribution from which samples are taken in each of the 4 cases. …
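
The setup in the Figure 2 caption can be sketched in a few lines. The caption names the four strategies but the excerpt here does not define them, so every concrete choice below is an assumption made for illustration: the population is a two-component Gaussian mixture, "matched" draws from that population, "uniform" draws evenly over a bounding box, "simpler" replaces the mixture with one broad Gaussian, and "narrow" draws tightly around a single mode. All names and parameter values are hypothetical.

```python
import random

# Hypothetical bimodal population over 2 parameters: (mode centre, std dev).
MODES = [((-2.0, -2.0), 0.7), ((2.0, 2.0), 0.7)]


def sample_matched(rng):
    """Draw from the population itself (assumed meaning of 'matched')."""
    (cx, cy), s = rng.choice(MODES)
    return rng.gauss(cx, s), rng.gauss(cy, s)


def sample_uniform(rng, lo=-4.0, hi=4.0):
    """Draw evenly over a box covering both modes."""
    return rng.uniform(lo, hi), rng.uniform(lo, hi)


def sample_simpler(rng):
    """Draw from one broad Gaussian that ignores the bimodal structure."""
    return rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)


def sample_narrow(rng):
    """Draw tightly around a single mode, under-covering the population."""
    (cx, cy), s = MODES[0]
    return rng.gauss(cx, 0.5 * s), rng.gauss(cy, 0.5 * s)


def draw_cohort(sampler, n=20, seed=0):
    """Return n subjects as (parameter_1, parameter_2) tuples."""
    rng = random.Random(seed)
    return [sampler(rng) for _ in range(n)]


cohorts = {
    name: draw_cohort(fn)
    for name, fn in [("uniform", sample_uniform), ("matched", sample_matched),
                     ("simpler", sample_simpler), ("narrow", sample_narrow)]
}
```

Comparing a device metric across the four cohorts is then a matter of evaluating it on each list; differences between the "matched" cohort and the others are one way to read the sampling bias the caption refers to.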

read the original abstract

Randomized clinical trials, while often viewed as the highest evidentiary bar by which to judge the quality of a medical intervention, are far from perfect. In silico imaging trials are computational studies that seek to ascertain the performance of a medical device by collecting this information entirely via computer simulations. The benefits of in silico trials for evaluating new technology include significant resource and time savings, minimization of subject risk, the ability to study devices that are not achievable in the physical world, allow for the rapid and effective investigation of new technologies and ensure representation from all relevant subgroups. To conduct in silico trials, digital representations of humans are needed. We review the latest developments in methods and tools for obtaining digital humans for in silico imaging studies. First, we introduce terminology and a classification of digital human models. Second, we survey available methodologies for generating digital humans with healthy and diseased status and examine briefly the role of augmentation methods. Finally, we discuss the trade-offs of four approaches for sampling digital cohorts and the associated potential for study bias with selecting specific patient distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Useful organizing review for the in silico imaging trials community; judge it as a taxonomy, not a validation study.

Quick note on Badano et al. on digital cohorts for in silico imaging trials.

This is a review paper, and you should read it as one. The contribution is a vocabulary and a classification: terminology for digital human models, a survey of generation methods (healthy and diseased), a quick word on augmentation, and a discussion of four cohort-sampling strategies with their bias trade-offs. The author list is exactly the group you'd expect — FDA/CDRH plus Anastasio — and they're writing for a community (VICTRE, regulatory science, virtual imaging trials) that is actively trying to coalesce. That gives the paper a fair shot at being useful infrastructure if the taxonomy lands.

What it does well, based on the abstract: it explicitly tackles the cohort-sampling-bias question, which is the part of in silico trial design that most often gets hand-waved. Naming the four sampling regimes and their attendant biases is the kind of move that lets a reviewer or trial designer ask sharper questions. Coming from FDA authors, it also has a chance of becoming the de facto reference, which matters for a paper of this type.

On the reader's stress points: I agree with the stress-test note. The "are the digital humans faithful?" objection is a real concern for the field but not really what this paper is claiming. A taxonomy doesn't have to validate the underlying models; it has to make the validation question askable in standardized terms. The reader's weakest-assumption framing misfires here.

The genuine soft spots are the ones the stress-test flags and that I can't resolve from the abstract alone: (i) whether the four-strategy partition is actually exhaustive or whether hybrid regimes (conditional generative models trained on registry data, importance-weighted resampling, GAN-augmented stratified sampling) get awkwardly bucketed; (ii) whether the bias discussion is operational enough that an FDA reviewer can apply it ex ante, or whether it stays qualitative; (iii) whether the proposed terms collide with already-circulating VICTRE-adjacent vocabulary. A referee should push on all three. None are dealbreakers; they're the right questions for this kind of paper.

Who it's for: anyone doing virtual imaging trials, regulatory scientists, and people building digital phantoms or generative anatomical models. Not for someone looking for new mechanism or measurement.

Recommendation: send it to peer review. It's the right authors writing the right paper at the right moment, and the soft spots are revisable. I'd probably cite it if I touch this area in the next year. Worth a reading group slot if anyone in the room works on simulation-based evaluation.

Referee Report

4 major / 4 minor

Summary. The manuscript is a review of methods and tools for generating "digital humans" — computational anatomical/pathological representations of patients — to support in silico imaging trials of medical devices. The paper (i) proposes terminology and a classification of digital human models, (ii) surveys methodologies for generating both healthy and diseased digital humans and briefly addresses augmentation methods, and (iii) discusses four approaches for sampling digital cohorts, identifying the trade-offs and potential study biases associated with each. The implicit central contribution is organizational: a vocabulary and a sampling-strategy taxonomy intended to make in silico imaging trials more rigorous and comparable.

Significance. If the taxonomy and bias analysis are usable as stated, this is a timely and useful contribution. In silico imaging trials (e.g., VICTRE-style studies) are gaining regulatory traction, and the field currently lacks a shared vocabulary and a shared account of cohort-sampling biases. A review that cleanly partitions cohort-sampling strategies and names the biases each strategy induces gives trial designers, reviewers, and regulators a common reference. The four-strategy partition, if exhaustive and operationalizable, is itself the deliverable; the paper does not need to validate any particular digital human model to be valuable, and should not be held to that bar. The contribution would be strengthened materially if the bias discussion is presented in terms that a study designer can act on ex ante (e.g., diagnostic checks, applicability conditions) rather than only as qualitative caveats.

major comments (4)

[Cohort-sampling taxonomy (four approaches)] The load-bearing claim of the paper is that four approaches partition the space of digital-cohort sampling strategies. The abstract does not state whether these four are intended to be exhaustive, mutually exclusive, or merely representative. The manuscript should state the partition criterion explicitly and address hybrid regimes that arise in practice — e.g., conditional generative models trained on registry priors, importance-weighted resampling to match a target subgroup distribution, and rejection sampling against an external prevalence target. If such regimes are intended to fall inside one of the four categories, the mapping should be made explicit; if they sit outside, the taxonomy claim should be softened.
[Bias analysis of sampling strategies] A taxonomy of biases is most useful when it is operational. The abstract promises a discussion of 'potential for study bias' but does not indicate whether biases are characterized qualitatively, via worst-case bounds, via simulation diagnostics, or via a checklist a reviewer can apply. For each of the four sampling approaches, the manuscript should associate (i) the bias mechanism, (ii) an a priori detectability or boundedness criterion, and (iii) a recommended mitigation or reporting requirement. Without this, the bias section risks being descriptive only, which limits its utility for trial design and regulatory review — the audience the paper appears to target.
[Terminology and classification of digital human models] Because terminology proposals only accrue value through adoption, the manuscript should explicitly reconcile its terms with already-circulating usage in the VICTRE, virtual imaging trial (e.g., XCAT/4D-XCAT phantom), and digital twin literatures. Where the proposed terms differ from established usage, the rationale for the change should be given, and a crosswalk table is advisable. If such reconciliation is already present in the full text, this comment can be discharged; from the abstract alone it cannot be assessed.
[Healthy vs. diseased generation; augmentation] The abstract states augmentation is treated 'briefly.' Augmentation is the principal route by which present-day cohorts achieve disease diversity, and it interacts directly with the sampling-bias analysis (augmented samples are not exchangeable with naturally generated ones). The manuscript should clarify whether augmented samples are treated as a fifth class of sample, as a subroutine inside one of the four strategies, or as out of scope, and should comment on how the bias account changes when augmented samples are mixed into a cohort.
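
One of the hybrid regimes named in the first major comment, importance-weighted resampling to match a target subgroup distribution, can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the manuscript: subgroup membership is a single categorical label, the target is a dictionary of subgroup proportions, and the helper names are hypothetical. The Kish effective sample size, (sum of weights)^2 / (sum of squared weights), is included as a simple check on how much information the reweighting costs.

```python
import random
from collections import Counter


def importance_weights(labels, target):
    """Weight each subject by target proportion / observed proportion."""
    counts = Counter(labels)
    n = len(labels)
    return [target.get(g, 0.0) / (counts[g] / n) for g in labels]


def kish_ess(weights):
    """Kish effective sample size of a weighted cohort."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0


def resample(items, weights, k, seed=0):
    """Draw k items with replacement, proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=weights, k=k)


# Hypothetical source cohort: 80% subgroup A, 20% subgroup B; target is 50/50.
labels = ["A"] * 80 + ["B"] * 20
weights = importance_weights(labels, {"A": 0.5, "B": 0.5})
ess = kish_ess(weights)  # about 64 of the nominal 100
balanced = resample(labels, weights, k=100)
```

Reweighting can only rebalance subgroups that already appear in the source cohort; a target subgroup with no source members gets no weight at all, which is the support-overlap limitation any such scheme inherits.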

minor comments (4)

[Abstract] The phrase 'ensure representation from all relevant subgroups' is stronger than the body can plausibly support — in silico cohorts can in principle include underrepresented subgroups, but 'ensuring' representation depends on the underlying generative model's coverage. Recommend softening to 'enable' or 'facilitate.'
[Abstract] The opening sentence ('Randomized clinical trials ... are far from perfect') is rhetorical scaffolding rather than a claim the paper supports. Consider replacing with a sentence that names the specific limitations of RCTs that in silico trials are positioned to address (cost, rare subgroups, counterfactual device variants).
[Scope statement] It would help readers if the abstract or introduction stated explicitly which imaging modalities are in scope (e.g., CT, mammography, MRI) and which are not, since the maturity of digital human models varies sharply by modality.
[Terminology] Clarify the relationship between 'digital human,' 'digital twin,' 'computational phantom,' and 'virtual patient' in the introduction; these are used inconsistently across communities and the review's value depends on disambiguating them up front.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for a constructive report and for correctly identifying the organizational contribution — terminology, a sampling-strategy taxonomy, and an associated bias account — as the deliverable against which the manuscript should be judged. We accept all four major comments as actionable and propose revisions accordingly. Specifically, we will (1) state the partition criterion underlying the four sampling approaches explicitly, soften the exhaustiveness claim to 'primitive strategies that compose to cover practical regimes,' and add an explicit mapping of hybrid regimes (conditional generative models with registry priors, importance-weighted resampling, rejection sampling) onto compositions of the four primitives; (2) reorganize the bias section into a per-strategy operational table giving bias mechanism, ex ante diagnostic, and recommended mitigation/reporting item, in the form a trial designer or reviewer can apply; (3) add a crosswalk table reconciling our terminology with VICTRE, XCAT/4D-XCAT, virtual-imaging-trial, and digital-twin usage, with rationale wherever our terms depart from established ones; and (4) clarify that augmentation is treated as a post-sampling operator inherited by any of the four primitives, not as a fifth class, and expand the discussion of how mixed cohorts perturb the bias account. We do not claim formal worst-case bias bounds and will state this limitation explicitly. We believe these revisions are within the scope of a minor revisio

Referee: Cohort-sampling taxonomy: state whether the four approaches are exhaustive, mutually exclusive, or representative; address hybrid regimes (conditional generative models with registry priors, importance-weighted resampling, rejection sampling against prevalence targets).

Authors: We agree this is the load-bearing claim and the manuscript currently understates the assumption. In revision we will (i) state the partition criterion explicitly — the four approaches are organized by how the joint distribution over phenotypic and pathological attributes is induced (drawn directly from a reference population, drawn from an analytic/parametric prior, drawn from a learned generative prior, or drawn from a fixed enumerated set). We do not claim mutual exclusivity at the level of implementation; rather, we claim the four are exhaustive at the level of the sampling primitive used, and any practical pipeline is a composition of these primitives. (ii) We will add a subsection on hybrid regimes giving the explicit mapping the referee requests: conditional generative models trained on registry priors are a composition of a learned-prior sampler conditioned on a reference-population sampler; importance-weighted resampling and rejection sampling against an external prevalence target are post-hoc reweightings layered on any of the four primitives, and we will discuss them as such. (iii) Where the abstract currently reads as a strong partition claim, we will soften the language to 'four primitive sampling strategies that, in combination, span the regimes encountered in practice.' revision: yes
Referee: Bias analysis should be operational: for each strategy give (i) bias mechanism, (ii) a priori detectability/boundedness criterion, (iii) recommended mitigation or reporting requirement.

Authors: This is a fair and constructive criticism. The current bias discussion is largely qualitative, and we agree that for the regulatory and trial-design audience we target, a more actionable presentation is needed. In the revised manuscript we will reorganize the bias section as a per-strategy table with three columns matching the referee's structure: (i) the dominant bias mechanism (e.g., reference-population coverage gaps for direct sampling; mis-specification and tail extrapolation for parametric priors; mode collapse and training-set leakage for learned priors; combinatorial under-coverage for enumerated cohorts); (ii) an ex ante diagnostic — for example, support-overlap and effective-sample-size diagnostics for reweighted cohorts, two-sample tests against a target marginal for generative cohorts, and coverage audits over protected subgroups for all four; and (iii) recommended mitigations and reporting items (pre-registration of the target distribution, disclosure of training data provenance, subgroup-stratified performance reporting). We will not claim formal worst-case bounds, as these are not generally available in this setting, and we will say so explicitly. revision: yes
Referee: Reconcile proposed terminology with existing usage in VICTRE, virtual imaging trial / XCAT-4D-XCAT phantom, and digital-twin literatures; provide rationale for departures and a crosswalk table.

Authors: Partially addressed in the current full text but not as a single consolidated crosswalk. The body of the manuscript references VICTRE, the XCAT/4D-XCAT family, and the digital-twin literature and adopts established terms where they exist. However, we agree a dedicated crosswalk table makes the reconciliation auditable rather than implicit. In revision we will add a table mapping our terms to the corresponding terms used in (a) the VICTRE program, (b) the XCAT/4D-XCAT phantom literature, (c) the broader virtual-imaging-trial literature, and (d) the digital-twin literature, with a brief justification wherever our preferred term differs (typically because the existing term is overloaded across communities). Where our usage is identical to established usage, the table will make that equally explicit so that no novelty is implied. revision: yes
Referee: Clarify the status of augmentation: a fifth class of sample, a subroutine within one of the four strategies, or out of scope; and discuss how the bias account changes when augmented samples are mixed into a cohort.

Authors: We agree the abstract's 'briefly' understates the importance of this question and that augmented samples are not exchangeable with natively generated ones. Our position, which we will state explicitly in the revision, is that augmentation is not a fifth sampling primitive but a transformation applied to samples produced by any of the four primitives; it therefore inherits the bias profile of its source and adds its own (label-preservation assumptions, off-manifold extrapolation, correlation inflation between augmented siblings). We will expand the augmentation subsection to (i) place augmentation formally as a post-sampling operator, (ii) describe how the per-strategy bias entries change when augmented samples are admixed (notably effective-sample-size deflation and loss of independence across the cohort), and (iii) recommend reporting the augmentation ratio and the augmentation operator alongside the base sampling strategy. We acknowledge that a fully quantitative treatment of mixed cohorts is beyond the scope of a review and will say so. revision: partial
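
The responses above mention two-sample tests against a target marginal as one possible ex ante diagnostic for generated cohorts. One way such a check could look is sketched below: a two-sample Kolmogorov-Smirnov statistic on a single scalar attribute, with a permutation p-value. The choice of statistic, the function names, and the toy data are all assumptions for illustration; the rebuttal does not specify a particular test.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def permutation_p_value(a, b, n_perm=500, seed=0):
    """Share of label shufflings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Toy attribute values: a generated cohort versus a target reference sample.
generated = [float(v) for v in range(50)]
reference = [float(v) for v in range(100, 150)]
p = permutation_p_value(generated, reference)
```

A small p-value flags a mismatch on that one marginal only; agreement on marginals does not establish agreement on the joint distribution, so a check like this complements rather than replaces the subgroup coverage audits mentioned alongside it.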

Circularity Check

0 steps flagged

No circularity: review/taxonomy paper makes no derivational claims that could reduce to their inputs.

This is a survey/taxonomy paper. Its claims, as visible in the abstract, are organizational: (a) introduce terminology and a classification of digital human models, (b) survey methodologies for generating digital humans, and (c) discuss trade-offs of four cohort-sampling approaches and associated bias. None of these are first-principles derivations or quantitative predictions whose outputs could equal their inputs by construction. There is no fitted parameter being relabeled as a prediction, no self-citation chain being used to forbid alternatives, no uniqueness theorem being imported from the authors' prior work, and no ansatz being smuggled in. The reader's concern about digital-human fidelity is a correctness/scope concern about the underlying models being surveyed, not a circularity in the survey's own argument; the skeptic note correctly identifies this as a category error. With only the abstract available, no load-bearing step can be exhibited that reduces to its own input. The honest finding is no significant circularity. Genuine risks for a paper of this type — exhaustiveness of the four-strategy partition, operationalizability of the bias account, terminological conflict with the existing VICTRE community — are correctness/utility risks, not circularity, per the rubric's hard rule 5.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities
