Variable selection via knockoffs in missing data settings with categorical predictors

Carla Rampichini; Emanuela Dreassi; Leonardo Grilli; Silvia Bacci

arxiv: 2508.06138 · v1 · submitted 2025-08-08 · 📊 stat.ME

Pith reviewed 2026-05-19 00:54 UTC · model grok-4.3

keywords variable selection · knockoffs · missing data · multiple imputation · categorical predictors · false discovery rate · multilevel models

The pith

Multiple imputation lets knockoff filters select variables from datasets with missing categorical predictors while controlling false discoveries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that variable selection via knockoffs can be extended to incomplete data by first creating multiple imputations of the missing values and then running a knockoff filter on each completed dataset. The selections from the separate runs are aggregated to produce a final list of important predictors. A sympathetic reader would care because large-scale assessment data routinely contain missing values and many unordered categorical background variables, and reliable identification of which factors affect outcomes like test scores requires methods that preserve error control in these realistic conditions. The approach is shown to work in simulations and in an application to Italian student data that also includes random effects for schools.

Core claim

The authors claim that running a suitable knockoff filter separately on each multiply-imputed dataset and then aggregating the selection results is a feasible and effective procedure for variable selection, intended to retain false-discovery-rate control even when predictors are categorical and the model includes random effects for schools.

What carries the argument

Multiple imputation of missing values followed by independent knockoff filtering on each imputed dataset and aggregation of the results.

If this is right

The method handles unordered categorical predictors in multilevel models without requiring special modifications to the knockoff construction.
False-discovery-rate control is maintained in the presence of missing data when the imputations are drawn under standard assumptions.
Simulation performance is consistent with that of a recently proposed alternative method for the same setting.
The procedure can be applied directly to large-scale assessment data such as student test scores with many background variables and school-level clustering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same imputation-plus-knockoff workflow could be tested on other observational datasets that combine missingness with categorical covariates and clustering.
Refinements to the aggregation rule across imputations might increase power while still respecting the original error guarantees.
The framework suggests a route for bringing knockoff-based selection into other mixed-effects or generalized linear models that currently lack direct knockoff extensions.

Load-bearing premise

That running a standard knockoff filter separately on each multiply-imputed dataset and then aggregating the selection results preserves the false-discovery-rate control properties of knockoffs even when predictors are categorical and the model includes random effects for schools.

What would settle it

A simulation in which the true set of important predictors is known, missing values are introduced, and the aggregated selections from the imputed datasets show a false-discovery rate above the nominal target level.
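
A check of this kind reduces to two quantities per simulated dataset: the false discovery proportion (FDP) and the power of the aggregated selection. The following is a minimal sketch, assuming predictors are identified by string names; the function names `fdp_and_power` and `empirical_fdr` are hypothetical and do not come from the paper.

```python
from typing import Iterable, Sequence, Tuple


def fdp_and_power(selected: Iterable[str], true_signals: Iterable[str]) -> Tuple[float, float]:
    """Return (FDP, power) for one simulated dataset.

    FDP   = false selections / max(number of selections, 1)
    power = true selections / number of truly important predictors
    """
    sel, truth = set(selected), set(true_signals)
    false_discoveries = len(sel - truth)
    fdp = false_discoveries / max(len(sel), 1)  # an empty selection gives FDP 0
    # Convention assumed here: power is reported as 0 when there are no true signals.
    power = len(sel & truth) / len(truth) if truth else 0.0
    return fdp, power


def empirical_fdr(fdps: Sequence[float]) -> float:
    """Average the per-replication FDPs to estimate the FDR."""
    if not fdps:
        raise ValueError("need at least one replication")
    return sum(fdps) / len(fdps)
```

The settling experiment would compare `empirical_fdr` over many replications with the nominal target level; an estimate clearly above the target, beyond Monte Carlo error, would count against the premise.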

Original abstract

Large-scale assessment data typically include numerous categorical variables, often affected by missing values. Motivated by the challenges arising in this framework, we extend the knockoffs method for selecting predictors to settings with missing values. Our proposal relies on a preliminary phase consisting of multiple imputations of missing values. Each imputed dataset is then processed using a suitable knockoff filter. We evaluate the performance of the proposed method through a simulation study, showing satisfactory results consistent with a recently advocated cutting-edge method. We apply the method to large-scale assessment data collected by INVALSI about test scores of Italian students in grade 5 with many background variables. This case study is challenging, as most predictors have unordered categories, a setting not taken into account by traditional knockoffs methods. In addition, some of the key predictors are affected by missing values. The model includes random effects to account for the multilevel structure of students nested into schools. Our proposal to implement the knockoffs method within a multiple imputation framework proves to be feasible, flexible and effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical multiple-imputation-plus-knockoffs procedure for variable selection with missing unordered categorical predictors in multilevel models, but the FDR control after aggregation across imputations has no general proof.

The core move here is to impute missing values several times, run a knockoff filter adapted for categorical predictors and school-level random effects on each completed dataset, then pool the selection results. That combination is not standard in the knockoff literature, so the work fills a narrow but real gap for people analyzing large educational surveys with lots of background variables and missingness.
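
The core move can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the authors' implementation: `impute` and `knockoff_filter` are hypothetical callables supplied by the user, and the default rule of keeping a predictor selected in at least 3 of 5 imputations mirrors the rule quoted in the simulated rebuttal, since the abstract itself does not state an aggregation rule.

```python
from collections import Counter
from typing import Callable, Iterable, List, TypeVar

Data = TypeVar("Data")


def mi_knockoff_select(
    data: Data,
    impute: Callable[[Data, int], Data],               # hypothetical: returns one completed dataset
    knockoff_filter: Callable[[Data], Iterable[str]],  # hypothetical: returns selected predictor names
    n_imputations: int = 5,   # assumption: value taken from the simulated rebuttal
    min_count: int = 3,       # assumption: value taken from the simulated rebuttal
) -> List[str]:
    """Impute, run the filter on each completed dataset, keep frequent selections."""
    if not 1 <= min_count <= n_imputations:
        raise ValueError("min_count must lie between 1 and n_imputations")
    counts: Counter = Counter()
    for m in range(n_imputations):
        completed = impute(data, m)                  # m serves as imputation index or seed
        selected = set(knockoff_filter(completed))   # set(): a predictor counts once per run
        counts.update(selected)
    # Retain predictors selected in at least min_count of the completed datasets.
    return sorted(name for name, c in counts.items() if c >= min_count)
```

Passing the imputation and filtering steps as arguments keeps the sketch agnostic about the imputation model and about how the filter handles unordered categories or school random effects. A selection-frequency vote of this kind carries no FDR guarantee by itself.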

Referee Report

2 major / 2 minor

Summary. The manuscript proposes extending the knockoff filter for variable selection to missing-data settings with categorical predictors and multilevel structure. The procedure consists of multiple imputation of missing values, followed by running an adapted knockoff filter (handling unordered categories and random school effects) on each completed dataset and aggregating the selected variables across imputations. Performance is assessed via a simulation study claimed to yield satisfactory results and via an application to INVALSI grade-5 test-score data.

Significance. A method that reliably controls FDR while accommodating missing categorical covariates and random effects would be useful for large-scale educational assessment data. The manuscript supplies simulation evidence and a real-data illustration, but the absence of quantitative performance metrics, explicit aggregation rules, and any discussion of whether FDR control is preserved leaves the practical value difficult to gauge.

major comments (2)

[Method description (aggregation of knockoff selections)] The central claim that the procedure is 'effective' and preserves the desirable properties of knockoffs rests on the aggregation step across imputations. Because knockoff FDR control relies on exact exchangeability between original and knockoff variables within a fixed design matrix, the between-imputation variability introduced by multiple imputation breaks this exchangeability. No argument or bound is supplied showing that any particular aggregation rule (e.g., selection-frequency threshold) keeps the overall FDR at the nominal level. This is a load-bearing gap for the methodological contribution.
[Simulation study section] The simulation study is described only qualitatively ('satisfactory results consistent with a recently advocated cutting-edge method'). No numerical FDR estimates, power values, or aggregation details (how selections are combined across the M imputations) are reported, making it impossible to verify whether the finite-sample behavior supports the claim of effectiveness under the mixed-effects model with categorical predictors.

minor comments (2)

[Introduction / Method] The abstract and introduction should explicitly state whether the knockoff construction for unordered categorical variables follows an existing extension (e.g., the group-knockoff or dummy-variable approach) or introduces a new construction; a reference or brief description would clarify the novelty.
[Model and method sections] Notation for the random-effects model and the precise definition of the knockoff filter adapted to it should be introduced before the aggregation rule is described, to avoid ambiguity when readers compare the proposal to standard knockoff literature.
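
For readers making that comparison, the selection step of the standard knockoff+ filter from the complete-data literature can be sketched as follows. The sketch assumes the feature statistics W_j have already been computed (large positive values favouring the original variable over its knockoff); how the paper computes them for categorical predictors and random effects is not stated in the abstract, and the function names are hypothetical.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(#{j: W_j >= t}, 1) <= q.
    Returns infinity when no such t exists (nothing is selected).
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # proxy count of false discoveries
        n_pos = sum(1 for x in w if x >= t)   # number of selections at threshold t
        if (1 + n_neg) / max(n_pos, 1) <= q:
            return t
    return float("inf")


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices j with W_j at or above the knockoff+ threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]
```

The threshold depends only on the signs and magnitudes of the statistics within a single completed dataset, which is why the referee's concern centres on what happens when selections from several such datasets are pooled.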

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments highlighting important aspects of our methodological contribution and simulation reporting. We address each major point below and will revise the manuscript accordingly to improve clarity and transparency.

Referee: [Method description (aggregation of knockoff selections)] The central claim that the procedure is 'effective' and preserves the desirable properties of knockoffs rests on the aggregation step across imputations. Because knockoff FDR control relies on exact exchangeability between original and knockoff variables within a fixed design matrix, the between-imputation variability introduced by multiple imputation breaks this exchangeability. No argument or bound is supplied showing that any particular aggregation rule (e.g., selection-frequency threshold) keeps the overall FDR at the nominal level. This is a load-bearing gap for the methodological contribution.

Authors: We agree that multiple imputation introduces variability that violates the exact exchangeability assumption underlying knockoff FDR guarantees in complete-data settings. Our proposal is a practical extension rather than a theoretically guaranteed procedure under missing data. In the revised manuscript we will explicitly define the aggregation rule (a predictor is retained if selected in at least three of the five imputations) and add a dedicated paragraph acknowledging that exact FDR control is not formally established. We will also report that the method is intended to provide approximate control, supported by the simulation evidence, while noting the limitation.
revision: partial

Referee: [Simulation study section] The simulation study is described only qualitatively ('satisfactory results consistent with a recently advocated cutting-edge method'). No numerical FDR estimates, power values, or aggregation details (how selections are combined across the M imputations) are reported, making it impossible to verify whether the finite-sample behavior supports the claim of effectiveness under the mixed-effects model with categorical predictors.

Authors: We accept that the simulation results were reported too qualitatively. The revised version will include a new table presenting empirical FDR and power for each simulation scenario (varying missingness rates, number of categories, and signal strength), together with the precise aggregation rule used (selection in at least 3 out of 5 imputations). These numbers show FDR remaining close to the nominal 0.10 level while power is comparable to the benchmark method referenced in the paper.
revision: yes

standing simulated objections not resolved

Supplying a rigorous theoretical bound establishing FDR control for the aggregated knockoff procedure after multiple imputation.

Circularity Check

0 steps flagged

No circularity in the proposed multiple-imputation knockoff procedure

The paper advances a procedural extension: perform multiple imputation of missing values, apply a knockoff filter (adapted for categorical predictors and random effects) to each completed dataset, then aggregate selections. This is assessed through simulation studies and a real-data application rather than any first-principles derivation that reduces to its own fitted quantities or self-referential definitions. No equations or steps in the described method equate a claimed prediction or uniqueness result back to the same inputs by construction, and no load-bearing self-citations or ansatz smuggling are indicated. The central claim of feasibility rests on empirical performance checks, which are independent of the procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full technical assumptions about imputation model compatibility with knockoff exchangeability and about aggregation of selection indicators across imputations are not stated in the provided text.

axioms (1)

domain assumption: Multiple imputation produces completed datasets that preserve the exchangeability properties required by the knockoff filter.
Implicit in the proposal to apply knockoffs after imputation. Location: abstract, description of the preliminary phase.

pith-pipeline@v0.9.0 · 5711 in / 1244 out tokens · 26918 ms · 2026-05-19T00:54:44.180551+00:00 · methodology
