Virtual Dummies: Enabling Scalable FDR-Controlled Variable Selection via Sequential Sampling of Null Features

Daniel P. Palomar; Jasin Machkour; Michael Muma; Taulant Koka

arxiv: 2604.07464 · v1 · submitted 2026-04-08 · 📊 stat.ME · stat.ML


Taulant Koka, Jasin Machkour, Daniel P. Palomar, Michael Muma

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3


keywords: variable selection · FDR control · high-dimensional statistics · dummy variables · T-Rex selector · LARS algorithm · genome-wide association studies


The pith

Sequential sampling of virtual dummy projections enables exact FDR-controlled variable selection at scales requiring terabytes of memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-dimensional variable selection with FDR control, as in the T-Rex selector, normally augments data with millions of i.i.d. synthetic null variables, but this creates impossible memory demands at biobank scales. The paper formalizes forward selection through an adaptive filtration and shows that the process interacts with unselected dummies only via their projections onto a low-dimensional evolving subspace. For rotationally invariant dummy distributions, it derives an adaptive stick-breaking sampler that draws these projections exactly from the conditional law given the selection history, eliminating any need to store the dummy matrix. A pathwise universality theorem then guarantees that selection paths converge to the same Gaussian limit under mild delocalization, so the virtual version inherits the exact selection law and FDR guarantees of the original method. Experiments on realistic GWAS data confirm that the resulting VD-T-Rex procedure controls FDR and retains power where explicit-dummy competitors fail or time out.

Core claim

Under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge pathwise to the same Gaussian limit; for rotationally invariant distributions this limit is realized exactly by sampling dummy projections via an adaptive stick-breaking construction that conditions only on the selection history, thereby replacing materialization of the dummy matrix with sequential sampling while preserving the exact selection law and FDR guarantees of the T-Rex selector.

What carries the argument

The adaptive stick-breaking construction that draws dummy projections exactly from their conditional distribution given the filtration of the selection process.
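
To make the stick-breaking idea concrete, the following is a minimal sketch for the simplest case only: a single dummy distributed uniformly on the unit sphere in R^n, projected onto a fixed sequence of orthonormal directions. It relies on the standard fact that the squared coordinates of such a vector follow a Dirichlet(1/2, …, 1/2) law, which admits a Beta stick-breaking representation. The function name `sample_projections` and its parameters are hypothetical, and the sketch does not reproduce the paper's adaptive conditioning on the selection history.

```python
import math
import random


def sample_projections(n, k, rng):
    """Draw the first k coordinates of a uniform point on the unit sphere
    in R^n by stick-breaking, without generating the other n - k coordinates.

    Assumption (not from the paper): fixed orthonormal directions and a
    unit-norm spherical dummy. Returns (coords, remaining), where `remaining`
    is the squared norm left for the n - k directions not yet revealed.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    remaining = 1.0  # squared norm not yet allocated to any direction
    coords = []
    for i in range(1, k + 1):
        if i == n:
            frac = 1.0  # the last direction takes whatever stick is left
        else:
            # Fraction of the remaining stick given to direction i.
            frac = rng.betavariate(0.5, (n - i) / 2.0)
        squared = remaining * frac
        sign = 1.0 if rng.random() < 0.5 else -1.0  # symmetric random sign
        coords.append(sign * math.sqrt(squared))
        remaining -= squared
    return coords, remaining


if __name__ == "__main__":
    rng = random.Random(0)
    coords, remaining = sample_projections(n=1_000_000, k=5, rng=rng)
    print(coords, remaining)
```

The point of the sketch is that the cost depends on k, the number of revealed directions, and not on n: the remaining n - k coordinates are summarized by a single scalar, the leftover squared norm.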

If this is right

FDR control remains exact at predictor counts where storing the dummy matrix exceeds available memory.
Memory and runtime drop by several orders of magnitude while the selection law is unchanged.
The procedure applies directly to genome-wide association studies involving millions of predictors.
Any forward-selection method compatible with the T-Rex aggregation framework inherits the same scaling benefit.
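
A back-of-the-envelope calculation illustrates the memory claim. The sample size, dummy count, and subspace dimension below are hypothetical values chosen for illustration, not figures reported in the paper, and the assumption that the virtual approach keeps one stored coefficient per dummy per revealed subspace dimension is a simplification of whatever the actual implementation stores.

```python
def dense_dummy_bytes(n_samples, n_dummies, bytes_per_entry=8):
    """Bytes needed to hold an explicit n_samples x n_dummies dummy matrix."""
    return n_samples * n_dummies * bytes_per_entry


def projection_bytes(n_dummies, subspace_dim, bytes_per_entry=8):
    """Bytes needed if only subspace_dim projection coefficients are kept
    per dummy (an illustrative assumption, see the lead-in)."""
    return n_dummies * subspace_dim * bytes_per_entry


if __name__ == "__main__":
    # Hypothetical biobank-scale setting with double-precision entries.
    n_samples, n_dummies, subspace_dim = 500_000, 5_000_000, 50
    dense = dense_dummy_bytes(n_samples, n_dummies)
    virtual = projection_bytes(n_dummies, subspace_dim)
    print(f"explicit dummies: {dense / 1e12:.1f} TB")
    print(f"projections only: {virtual / 1e9:.1f} GB")
    print(f"reduction factor: {dense // virtual}")
```

Under these assumed sizes the reduction factor equals n_samples / subspace_dim, which is why the saving grows with the number of observations while the revealed subspace stays small.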

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same virtual-projection idea could be ported to other dummy-augmented selectors beyond LARS.
Because the limit law is universal, the precise marginal distribution of the dummies may matter less than their standardization and independence.
In streaming or online settings the stick-breaking sampler could be updated incrementally as new observations arrive.

Load-bearing premise

The synthetic null variables are drawn from a rotationally invariant distribution so that their projections onto the adaptively chosen subspace admit exact conditional sampling.

What would settle it

On a large-scale GWAS dataset, compare the variables selected and the empirical FDR achieved by VD-T-Rex against those of the original T-Rex selector run on a down-scaled instance where dummy materialization remains feasible; any systematic discrepancy would falsify the claim of exact preservation.
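
The comparison described above needs an empirical FDR and power estimate for each method. A minimal sketch of that bookkeeping, assuming a simulation where the true support is known, follows. The function names are hypothetical; the false discovery proportion is defined with the usual convention that an empty selection has proportion zero.

```python
def fdp_tpp(selected, true_support):
    """False discovery proportion and true positive proportion of one run."""
    selected = set(selected)
    true_support = set(true_support)
    false_discoveries = len(selected - true_support)
    true_discoveries = len(selected & true_support)
    fdp = false_discoveries / max(len(selected), 1)   # 0 if nothing selected
    tpp = true_discoveries / max(len(true_support), 1)
    return fdp, tpp


def empirical_fdr_tpr(runs, true_support):
    """Average FDP and TPP over repeated runs (Monte Carlo FDR and power)."""
    results = [fdp_tpp(selected, true_support) for selected in runs]
    n_runs = len(results)
    fdr = sum(fdp for fdp, _ in results) / n_runs
    tpr = sum(tpp for _, tpp in results) / n_runs
    return fdr, tpr


if __name__ == "__main__":
    truth = [1, 2, 3, 4, 5]
    runs = [[1, 2, 3, 7], [1, 2], []]  # hypothetical selections
    print(empirical_fdr_tpr(runs, truth))
```

Running the same function on the selections of both procedures over matched replicates gives the two quantities whose systematic disagreement would count against exact preservation.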

Figures

Figures reproduced from arXiv: 2604.07464 by Daniel P. Palomar, Jasin Machkour, Michael Muma, Taulant Koka.

**Figure 1.** Exemplary timeline of the filtration (F_k). At step k, the selection rule ϕ_k is applied using only the information in F_k. If a real variable is selected (j⋆ ≤ p), the σ-algebra does not grow, hence F_k^+ = F_k. If a dummy is selected (j⋆ > p), it is realized, yielding the intermediate σ-algebra F_k^+ = σ(F_k, d_{ℓ⋆}). The next projections {α_{k+1,ℓ}}_{τ_ℓ > k} are then drawn conditional on F_k^+, yielding F_{k+1}. beco…

**Figure 2.** Illustration of the stick-breaking construction for spherical dummies. From left to right: Initially, the dummy […]

**Figure 3.** Distributional equivalence of VD–LARS and explicitly augmented AD–LARS. Top row: median and […]

**Figure 4.** FDR control and power of VD–T–Rex versus AD–T–Rex. Top row: empirical averaged false discovery […]

**Figure 5.** Universality diagnostics for Lemma 6 (conditional CLT for fresh projections). Columns correspond to […]

**Figure 6.** Finite-sample effect of Gaussian norm fluctuations in T–Rex. Heatmaps show differences, reported as […]

**Figure 7.** Benchmark of AD–LARS and VD–LARS. Columns correspond to […]

Original abstract

High-dimensional variable selection, particularly in genomics, requires error-controlling procedures that scale to millions of predictors. The Terminating-Random Experiments (T-Rex) selector achieves false discovery rate (FDR) control by aggregating results of early terminated random experiments, each combining original predictors with i.i.d. synthetic null variables (dummies). At biobank scales, however, explicit dummy augmentation requires terabytes of memory. We demonstrate that this bottleneck is not fundamental. Formalizing the information flow of forward selection through a filtration, we show that compatible selectors interact with unselected dummies solely through projections onto an adaptively evolving low-dimensional subspace. For rotationally invariant dummy distributions, we derive an adaptive stick-breaking construction sampling these projections from their exact conditional distribution given the selection history, thereby eliminating dummy matrix materialization. We prove a pathwise universality theorem: under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. We instantiate the theory through Virtual Dummy LARS (VD-LARS), reducing memory and runtime by several orders of magnitude while preserving the exact selection law and FDR guarantees of the T-Rex selector. Experiments on realistic genome-wide association study data confirm that VD-T-Rex controls FDR and achieves power at scales where all competing methods either fail or time out.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They figured out a way to run T-Rex without storing the dummy matrix by sampling projections adaptively.


The paper shows how to keep the exact T-Rex selection law and FDR control without ever materializing a large dummy matrix. Instead of generating millions of synthetic null features, they sample only the projections that the forward selection actually uses. The new pieces are the adaptive stick-breaking sampler for those projections and the pathwise universality theorem that says the selection paths converge to the same limit for generic standardized i.i.d. dummies. They turn this into VD-LARS and demonstrate it on genome-wide association data at scales where competing methods run out of memory or time. The approach is clean in principle. The memory and runtime savings are real if the sampler works as claimed. What needs checking is whether the stick-breaking construction truly draws from the exact conditional distribution at every step of the adaptive process. The stress on rotational invariance and delocalization conditions is standard, but small errors in the dependence structure would invalidate the exact equivalence to T-Rex. This is for statisticians and computational biologists working on high-dimensional selection with strict error control. It is worth a serious referee because the central claim is a concrete algorithmic improvement backed by a formal argument rather than just heuristics. The experiments confirm the practical gains, which strengthens the case for further scrutiny.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Virtual Dummy LARS (VD-LARS) to scale the T-Rex selector for FDR-controlled high-dimensional variable selection. By modeling forward selection via a filtration on the information flow, it derives an adaptive stick-breaking construction that samples projections of rotationally invariant dummy variables exactly from their conditional distribution given the selection history, avoiding explicit materialization of large dummy matrices. A pathwise universality theorem is proved showing that, under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. The method is instantiated as VD-T-Rex, which is claimed to preserve the exact selection law and FDR guarantees of T-Rex while reducing memory and runtime by orders of magnitude; this is supported by experiments on realistic GWAS data demonstrating FDR control and power at scales where competitors fail.

Significance. If the exact equivalence via the stick-breaking sampler and the universality theorem hold, the work removes a fundamental scalability barrier for FDR-controlling selectors in genomics and similar domains, enabling analysis at biobank scales (millions of predictors) with terabyte-scale memory savings. The formalization of the filtration and the pathwise convergence result provide theoretical value beyond the implementation, and the explicit preservation of T-Rex properties (rather than approximation) is a notable strength. Empirical confirmation on GWAS data further supports practical utility.

major comments (2)

[§4.2] (Adaptive Stick-Breaking Construction): The central claim of exact (not approximate) preservation of the T-Rex selection law rests on this construction sampling projections exactly from the conditional law given the filtration. The exposition should explicitly verify that the adaptive updates to the stick-breaking parameters fully capture the dependence induced by each selection decision on the evolving orthogonal complement; without a low-dimensional worked example or auxiliary lemma confirming rotational invariance is maintained step-by-step, the exchangeability with real i.i.d. dummies (and thus exact FDR control) remains difficult to confirm.
[Theorem 5.1] (Pathwise Universality): While the mild delocalization conditions are stated, the proof should clarify how the Gaussian limit is obtained pathwise rather than in distribution only, particularly the role of the generic standardized i.i.d. dummies in controlling the remainder terms after each projection sampling step. This is load-bearing for the claim that VD-LARS inherits the exact selection law.
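
As an indication of what a low-dimensional worked example for the §4.2 comment could look like, consider the simplest case of one dummy d distributed uniformly on the unit sphere in R^3 and a fixed orthonormal basis v_1, v_2, v_3. This is a sketch based on standard properties of the uniform distribution on the sphere; the symbols B_i and ε_i are notation introduced here, and the paper's construction for adaptively chosen directions is not reproduced.

```latex
% Projections \alpha_i = v_i^\top d of a uniform point d on S^{2} \subset \mathbb{R}^3.
% \varepsilon_1, \varepsilon_2, \varepsilon_3 are i.i.d. symmetric random signs,
% independent of the Beta variables B_1, B_2.
\begin{align*}
  \alpha_1 &= \varepsilon_1 \sqrt{B_1},
    & B_1 &\sim \mathrm{Beta}\!\left(\tfrac12,\, 1\right), \\
  \alpha_2 \mid \alpha_1 &= \varepsilon_2 \sqrt{(1-\alpha_1^2)\, B_2},
    & B_2 &\sim \mathrm{Beta}\!\left(\tfrac12,\, \tfrac12\right), \\
  \alpha_3 \mid \alpha_1, \alpha_2 &= \varepsilon_3 \sqrt{1-\alpha_1^2-\alpha_2^2}.
\end{align*}
% General pattern in \mathbb{R}^n, for i < n:
%   \alpha_i \mid \alpha_1,\dots,\alpha_{i-1}
%     = \varepsilon_i \sqrt{\Bigl(1-\sum_{j<i}\alpha_j^2\Bigr) B_i},
%   \qquad B_i \sim \mathrm{Beta}\!\left(\tfrac12,\, \tfrac{n-i}{2}\right).
```

Two sanity checks are available in this case: the first line implies that α_1 is uniform on [-1, 1], the classical result for a coordinate of a uniform point on the 2-sphere, and the second line says that, after removing the v_1 component, the residual is uniform on a circle, so its squared cosine follows the arcsine law Beta(1/2, 1/2).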

minor comments (3)

[§2] The notation for the filtration (F_k) and the projection operators could be introduced with a small illustrative diagram in §2 to aid readability for readers unfamiliar with the T-Rex framework.
[Table 2] (GWAS runtime comparison): The reported speedups lack error bars or details on the number of replicates; adding these would strengthen the empirical claims.
[Introduction] A few instances of undefined acronyms (e.g., first use of 'LARS' in the introduction) should be expanded on first appearance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify opportunities to strengthen the exposition of the adaptive stick-breaking sampler and the pathwise convergence argument. We address each point below and will revise the manuscript to incorporate the requested clarifications while preserving the original claims.


Referee: [§4.2] (Adaptive Stick-Breaking Construction): The central claim of exact (not approximate) preservation of the T-Rex selection law rests on this construction sampling projections exactly from the conditional law given the filtration. The exposition should explicitly verify that the adaptive updates to the stick-breaking parameters fully capture the dependence induced by each selection decision on the evolving orthogonal complement; without a low-dimensional worked example or auxiliary lemma confirming rotational invariance is maintained step-by-step, the exchangeability with real i.i.d. dummies (and thus exact FDR control) remains difficult to confirm.

Authors: We agree that the current exposition would benefit from an explicit verification of the step-by-step preservation of rotational invariance. Section 4.2 derives the adaptive stick-breaking parameters directly from the conditional distribution of the dummy projections given the filtration generated by prior selections; because the dummies are rotationally invariant, each projection onto the orthogonal complement of the selected subspace leaves the remaining unselected dummies exchangeable, with the stick lengths rescaled by the residual norm in that complement. This ensures the sampled projections are exactly distributed as those from explicit i.i.d. dummies. To address the concern, we will insert a low-dimensional worked example (p=4, two selection steps) and an auxiliary lemma that tracks the invariance of the conditional law after each update. These additions will make the exact equivalence transparent without altering the theoretical claims. revision: yes
Referee: [Theorem 5.1] (Pathwise Universality): While the mild delocalization conditions are stated, the proof should clarify how the Gaussian limit is obtained pathwise rather than in distribution only, particularly the role of the generic standardized i.i.d. dummies in controlling the remainder terms after each projection sampling step. This is load-bearing for the claim that VD-LARS inherits the exact selection law.

Authors: We appreciate the request for greater clarity on the pathwise aspect. The proof of Theorem 5.1 establishes almost-sure convergence of the entire selection path by representing the discrepancy between the VD-LARS path and the limiting Gaussian process as a martingale with respect to the filtration of selection decisions. The quadratic variation of this martingale is bounded using the delocalization conditions, which ensure that the generic standardized i.i.d. dummies make the remainder terms after each adaptive projection sampling step vanish almost surely (via a conditional strong law of large numbers). We will expand the proof appendix to include this martingale construction and the explicit control of remainders, thereby confirming that the pathwise limit underpins the exact inheritance of the T-Rex selection law. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation relies on original filtration formalization, derived stick-breaking sampler, and proved universality theorem

full rationale

The paper formalizes forward selection via an adaptively evolving filtration and derives an adaptive stick-breaking process to sample projections exactly from the conditional law under rotational invariance, thereby replacing explicit dummy matrices. It separately proves a pathwise universality theorem showing convergence of selection paths to a common Gaussian limit under delocalization. These elements are presented as new derivations that establish equivalence to the T-Rex selector's law without reducing to fitted parameters from the target data or to unverified self-citations. The central claim of exact FDR preservation follows directly from the constructed equivalence rather than from any tautological renaming or input-output identity. No load-bearing step collapses by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on domain assumptions about dummy distributions and data delocalization rather than on fitted parameters or newly invented entities with independent evidence.

axioms (2)

Domain assumption: Dummy distributions are rotationally invariant.
Invoked to derive the exact conditional sampling of projections via adaptive stick-breaking.
Domain assumption: Mild delocalization conditions on the data.
Required for the pathwise universality theorem that selection paths converge to a Gaussian limit.
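
The role of delocalization can be illustrated with a small simulation. The sketch below assumes one common form of the condition, namely unit-norm weights whose largest entry is small, which may differ from the paper's precise statement. It projects i.i.d. Rademacher (±1) entries, an illustrative non-Gaussian choice, onto a delocalized and a localized weight vector and compares the kurtosis of the result with the Gaussian value of 3. All names are hypothetical.

```python
import math
import random


def weighted_rademacher_sum(weights, rng):
    """One draw of sum_j q_j * delta_j with i.i.d. symmetric signs delta_j."""
    return sum(q if rng.random() < 0.5 else -q for q in weights)


def second_moment_and_kurtosis(weights, n_draws, rng):
    """Monte Carlo second moment and kurtosis (about zero, since the mean is 0)."""
    draws = [weighted_rademacher_sum(weights, rng) for _ in range(n_draws)]
    m2 = sum(s ** 2 for s in draws) / n_draws
    m4 = sum(s ** 4 for s in draws) / n_draws
    return m2, m4 / m2 ** 2


if __name__ == "__main__":
    rng = random.Random(1)
    dim = 100
    delocalized = [1.0 / math.sqrt(dim)] * dim  # max |q_j| is small
    localized = [1.0] + [0.0] * (dim - 1)       # all mass on one entry
    print("delocalized:", second_moment_and_kurtosis(delocalized, 10_000, rng))
    print("localized:  ", second_moment_and_kurtosis(localized, 10_000, rng))
```

Both weight vectors have unit norm, so the projections have the same variance, but only the delocalized one yields a near-Gaussian projection; the localized one simply returns a single ±1 entry, whose kurtosis is 1.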

invented entities (1)

Virtual dummies realized via sequential projection sampling (no independent evidence)
Purpose: To replicate the statistical behavior of explicit dummy augmentation without materializing the full matrix.
New sampling mechanism introduced to eliminate the memory bottleneck while preserving the exact selection law.


