
arxiv: 1907.03153 · v1 · pith:MLKL4ZWYnew · submitted 2019-07-06 · 📊 stat.ME · stat.AP

The revisited knockoffs method for variable selection in L1-penalised regressions

Pith reviewed 2026-05-25 01:34 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords variable selection · L1-penalized regression · knockoffs · penalty parameter · high-dimensional data · covariate ranking

The pith

A revisited knockoffs method determines the penalty parameter for variable selection in L1-penalized regressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a new method using knockoffs to choose the penalty in L1 regressions for selecting relevant covariates. This approach works for different types of response variables and when the number of observations is less than the number of covariates. It also provides an ordering of covariate importance. A sympathetic reader would care because it offers a general way to handle variable selection in high-dimensional settings without relying on specific model assumptions beyond the knockoffs framework.

Core claim

We develop a new method based on the knockoffs idea to handle the choice of the penalty parameter in L1-penalised regression models. This revisited knockoffs method is general and suitable for a wide range of regressions with various types of response variables. It works when the number of observations is smaller than the number of covariates and gives an order of importance of the covariates.

What carries the argument

The revisited knockoffs method, which adapts the knockoffs framework to select the penalty parameter and rank covariates in L1-penalized regressions.

If this is right

  • It enables variable selection in regressions with more covariates than observations.
  • It applies to various response variable types beyond standard linear models.
  • It provides a ranking of covariate importance rather than just selection.
  • It can be compared to other variable selection methods through experimental results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend to other penalized regression types if the knockoffs adaptation generalizes.
  • It could reduce reliance on cross-validation for choosing the penalty in high-dimensional settings.
  • It connects to broader uses of knockoffs for controlling false discoveries in variable selection.

Load-bearing premise

The knockoffs framework can be adapted to L1-penalized regressions without needing extra assumptions on the data distribution or model specifics.

What would settle it

An experiment showing that the method fails to correctly identify relevant variables or select the penalty in a controlled simulation with known ground truth when n is less than p.
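A minimal sketch of how such a controlled check could be scored, assuming a linear Gaussian model with independent standard normal covariates and unit coefficients on the first k covariates. The function names `simulate` and `score`, the sizes n = 50 and p = 200, and the example selection are hypothetical and are not taken from the paper.

```python
import random

def simulate(n, p, k, seed=0):
    """Linear Gaussian data y = X beta + noise, with beta_j = 1 for the first k covariates."""
    rng = random.Random(seed)
    beta = [1.0] * k + [0.0] * (p - k)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y, set(range(k))  # the known support is {0, ..., k-1}

def score(selected, support):
    """Return (detection rate, false discovery proportion) of a selected index set."""
    selected = set(selected)
    true_pos = len(selected & support)
    detection_rate = true_pos / len(support)
    fdp = (len(selected) - true_pos) / max(len(selected), 1)
    return detection_rate, fdp

X, y, support = simulate(n=50, p=200, k=5)  # n < p with known ground truth
print(score({0, 1, 2, 7}, support))         # hypothetical selection -> (0.6, 0.25)
```

Any selection procedure can be plugged in where the hypothetical set `{0, 1, 2, 7}` appears; repeating the simulation over many seeds would give the kind of per-covariate detection rates shown in the paper's figures.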

Figures

Figures reproduced from arXiv: 1907.03153 by Anne Gégout-Petit, Aurélie Gueudin-Muller, Clémence Karmann.

Figure 1: Example of positive statistics Wi sorted in ascending order. Linear Gaussian regression model with n = 500 observations of p = 20 covariates. Only covariates X1, X2, X3, X4 and X5 belong to the model (regression coefficients are set to β = (1, 1, 1, 1, 1, 0, …, 0)). Wi, i = 1, …, 20, which implies that X3 is the covariate the most likely to belong to the model. We can clearly observe a breakdo…
Figure 2: Detection rates of each covariate for the four meth…
Figure 3: Detection rates of each covariate for the three met…
Figure 4: Detection rates of each covariate for the three met…
Figure 5: Comparing the…
Figure 5: Boxplots of detection rates of each covariate acco…
Figure 6: Boxplots of detection rates of each covariate acco…
Figure 7: Boxplots of detection rates of each covariate acco…
Figure 8: Detection rates of each covariate for the three met…
Original abstract

We consider the problem of variable selection in regression models. In particular, we are interested in selecting explanatory covariates linked with the response variable and we want to determine which covariates are relevant, that is which covariates are involved in the model. In this framework, we deal with L1-penalised regression models. To handle the choice of the penalty parameter to perform variable selection, we develop a new method based on the knockoffs idea. This revisited knockoffs method is general, suitable for a wide range of regressions with various types of response variables. Besides, it also works when the number of observations is smaller than the number of covariates and gives an order of importance of the covariates. Finally, we provide many experimental results to corroborate our method and compare it with other variable selection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a 'revisited knockoffs' procedure to select the L1 penalty parameter λ for variable selection in penalized regression. It claims the method applies to a wide range of response distributions, remains valid when n < p, produces an ordering of covariate importance, and is supported by experimental comparisons against other selection methods.

Significance. A procedure that extends knockoff-based selection to arbitrary GLMs and the high-dimensional regime without Gaussian assumptions would address a practical gap; however, the abstract provides no indication that such an extension is achieved, limiting assessment of potential impact.

major comments (2)
  1. [Abstract] Abstract: the assertion that the method is 'suitable for a wide range of regressions with various types of response variables' and 'works when the number of observations is smaller than the number of covariates' is load-bearing for the central claim, yet no construction of knockoff variables X̃ is supplied that preserves the joint exchangeability (X, X̃) under the null for non-Gaussian responses or rank-deficient designs.
  2. [Abstract] Abstract and method description: the procedure is said to 'give an order of importance of the covariates' via the λ path, but no derivation shows how the knockoff statistics are extracted or why the resulting ordering inherits FDR control (or an analogous guarantee) once the response distribution is arbitrary.
minor comments (1)
  1. [Abstract] The abstract states that 'many experimental results' corroborate the method; a brief indication of the simulation settings, response types, and performance metrics would help readers evaluate the scope of the reported corroboration.
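For reference, the exchangeability requirement raised in major comment 1 can be written out as follows. This is a sketch of the standard conditions from the knockoffs literature (the distributional model-X form and the fixed-design Gram-matrix form of Barber and Candès); whether the paper's revisited construction satisfies either form is not established by the abstract. The symbols S, s and Σ are notation introduced here for illustration.

```latex
% Distributional (model-X) form: a condition on the covariates only
(X, \tilde X)_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde X)
  \quad \text{for every } S \subseteq \{1, \dots, p\},
\qquad \tilde X \perp Y \mid X .

% Fixed-design form: a Gram-matrix condition, with \Sigma = X^\top X
\tilde X^\top \tilde X = \Sigma,
\qquad X^\top \tilde X = \Sigma - \operatorname{diag}(s), \quad s \ge 0 .
```

Here swap(S) exchanges column j of X with column j of X̃ for each j in S. In Barber and Candès (2015) the fixed-design construction is presented for n ≥ 2p, with an extension to p ≤ n < 2p, which is why the n < p claim needs an explicit construction.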

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the method is 'suitable for a wide range of regressions with various types of response variables' and 'works when the number of observations is smaller than the number of covariates' is load-bearing for the central claim, yet no construction of knockoff variables X̃ is supplied that preserves the joint exchangeability (X, X̃) under the null for non-Gaussian responses or rank-deficient designs.

    Authors: Knockoff construction operates solely on the covariate matrix X and is independent of the response distribution Y; exchangeability of (X, X̃) therefore holds regardless of whether the response is Gaussian or belongs to another GLM family. For the n < p regime we rely on existing high-dimensional knockoff constructions (e.g., those based on approximate exchangeability or SDP relaxations) that accommodate rank-deficient designs. We will revise the abstract and add an explicit paragraph in Section 2 describing the precise construction employed. revision: yes

  2. Referee: [Abstract] Abstract and method description: the procedure is said to 'give an order of importance of the covariates' via the λ path, but no derivation shows how the knockoff statistics are extracted or why the resulting ordering inherits FDR control (or an analogous guarantee) once the response distribution is arbitrary.

    Authors: The importance ordering is induced by the sequence of λ values at which each original variable enters the L1 path; knockoff statistics are formed by comparing entry λ’s of originals versus knockoffs, and the threshold is chosen to guarantee FDR control. Because the exchangeability property is a property of the design only, the control argument carries over to arbitrary response distributions. We will insert a short derivation subsection (new Section 3.2) that extracts the statistics explicitly and states the FDR guarantee under the stated assumptions. revision: yes
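The comparison of entry values described in the second response can be illustrated with the signed-max statistic and the knockoff+ threshold of Barber and Candès (2015). This is a hedged sketch of that standard construction, not a confirmed description of the paper's revisited procedure; the function names `knockoff_statistics` and `knockoff_plus_threshold`, the target level q = 0.2, and the entry values are hypothetical.

```python
def knockoff_statistics(z, z_tilde):
    """Signed-max statistic W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j)."""
    w = []
    for zj, ztj in zip(z, z_tilde):
        sign = (zj > ztj) - (zj < ztj)  # +1, 0 or -1
        w.append(max(zj, ztj) * sign)
    return w

def knockoff_plus_threshold(w, q):
    """Smallest t in {|W_j| > 0} with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in sorted({abs(wj) for wj in w if wj != 0}):
        neg = sum(1 for wj in w if wj <= -t)
        pos = sum(1 for wj in w if wj >= t)
        if (1 + neg) / max(1, pos) <= q:
            return t
    return float("inf")  # no threshold meets the target level: select nothing

# Hypothetical entry values: largest lambda at which each variable enters the L1 path.
z = [3.0, 2.5, 2.2, 2.0, 1.8, 0.3, 0.2, 0.5]        # original covariates
z_tilde = [0.4, 0.1, 0.3, 0.2, 0.6, 0.4, 0.1, 0.2]  # their knockoffs

w = knockoff_statistics(z, z_tilde)
t = knockoff_plus_threshold(w, q=0.2)
selected = [j for j, wj in enumerate(w) if wj >= t]
ranking = sorted(range(len(w)), key=lambda j: -w[j])  # importance order by W_j
print(t, selected, ranking)
```

With these inputs the threshold is 0.5 and covariates 0, 1, 2, 3, 4 and 7 are selected. Sorting by W_j gives an importance ordering as a by-product, which is the sense in which a knockoff statistic can both select and rank.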

Circularity Check

0 steps flagged

No circularity: method adapts knockoffs without reducing claims to fitted inputs or self-citations by construction

Full rationale

The paper introduces a revisited knockoffs procedure for choosing the L1 penalty in regressions, claiming generality across response types and n < p regimes. No quoted equations or sections exhibit self-definitional loops (e.g., defining a quantity in terms of itself), fitted parameters renamed as predictions, or load-bearing self-citations that substitute for independent justification. The central adaptation is presented as an extension supported by experiments rather than forced by prior author results or ansatz smuggling. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

With only the abstract available, no specific free parameters, axioms, or invented entities can be identified from the provided text.


