Scalable Subset Selection in Linear Mixed Models

Joanna J. J. Wang; Matt P. Wand; Ryan Thompson

arxiv: 2506.20425 · v3 · pith:UIUPDZLMnew · submitted 2025-06-25 · 📊 stat.ML · cs.LG· stat.CO· stat.ME

Scalable Subset Selection in Linear Mixed Models

Ryan Thompson , Matt P. Wand , Joanna J. J. Wang This is my paper

Pith reviewed 2026-05-19 08:07 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.COstat.ME

keywords linear mixed modelssubset selectionl0 regularizationcoordinate descenthigh-dimensional datasparsitygeneralized linear mixed modelspenalized quasi-likelihood

0 comments

The pith

A new ℓ0-regularized method scales subset selection in linear mixed models to thousands of predictors in seconds to minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear mixed models incorporate both fixed and random effects to analyze heterogeneous data such as in personalized medicine, but existing sparse methods cannot handle more than a few hundred candidate predictors. This paper introduces an ℓ0-regularized approach that closes the scalability gap with ordinary linear models. It relies on a coordinate descent algorithm with a convergence guarantee as the main solver and adds a local search step to navigate the non-convex objective surface. The same framework extends to generalized linear mixed models through a penalized quasi-likelihood approximation. A finite-sample bound on Kullback-Leibler divergence and strong empirical results on synthetic and real data complete the contribution.

Core claim

The paper presents a new ℓ0 regularized method for LMM subset selection that can run on datasets containing thousands of predictors in seconds to minutes. On the computational front, a coordinate descent algorithm serves as the main workhorse with a provided convergence guarantee, while a local search algorithm helps traverse the nonconvex optimization surface. Both algorithms extend readily to subset selection in generalized LMMs via a penalized quasi-likelihood approximation. On the statistical front, a finite-sample bound on the Kullback-Leibler divergence of the new method is derived, and experiments demonstrate excellent performance on synthetic and real datasets.

What carries the argument

ℓ0-regularized penalized quasi-likelihood objective for linear mixed models, solved by coordinate descent with convergence guarantee plus local search to handle non-convexity.

If this is right

Subset selection in LMMs becomes practical for wide datasets that previously required dropping random effects or using only convex relaxations.
The coordinate descent procedure converges for the given non-convex objective.
The same computational and statistical framework applies directly to generalized linear mixed models.
A finite-sample KL divergence bound quantifies how well the fitted sparse model approximates the true data-generating process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners analyzing clinical or genomic data with individual-level heterogeneity could now perform high-dimensional feature selection without sacrificing the random-effects structure.
The local-search component suggests that direct attack on non-convex ℓ0 problems can be more effective than convex surrogates when paired with a good initialization strategy.
Screening rules or warm-starting from convex solutions might further reduce the already modest runtimes for even larger predictor counts.

Load-bearing premise

Coordinate descent combined with local search reliably finds good solutions on the non-convex ℓ0-regularized objective, and the penalized quasi-likelihood approximation stays accurate when extending to generalized linear mixed models.

What would settle it

Running the algorithm on a dataset with 2000 predictors and measuring whether runtime stays under a few minutes while the selected model outperforms a dense baseline in cross-validated prediction error would directly test the central scalability and performance claims.

read the original abstract

Linear mixed models (LMMs), which incorporate fixed and random effects, are key tools for analyzing heterogeneous data, such as in personalized medicine. Nowadays, this type of data is increasingly wide, sometimes containing thousands of candidate predictors, necessitating sparsity for prediction and interpretation. However, existing sparse learning methods for LMMs do not scale well beyond tens or hundreds of predictors, leaving a large gap compared with sparse methods for linear models, which ignore random effects. This paper closes the gap with a new $\ell_0$ regularized method for LMM subset selection that can run on datasets containing thousands of predictors in seconds to minutes. On the computational front, we develop a coordinate descent algorithm as our main workhorse and provide a guarantee of its convergence. We also develop a local search algorithm to help traverse the nonconvex optimization surface. Both algorithms readily extend to subset selection in generalized LMMs via a penalized quasi-likelihood approximation. On the statistical front, we provide a finite-sample bound on the Kullback-Leibler divergence of the new method. We then demonstrate its excellent performance in experiments involving synthetic and real datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable l0-regularized coordinate descent method for subset selection in LMMs that scales to thousands of predictors, plus a finite-sample KL bound, though the non-convex solver's ability to find good supports remains the main open question.

read the letter

The central advance here is taking the l0 penalty, already standard for ordinary linear models, and making it computationally feasible inside linear mixed models without restricting to small numbers of predictors. The coordinate descent routine is the main engine, they state a convergence result for it, and they add a local search step to move around the non-convex surface. The extension to generalized LMMs through penalized quasi-likelihood is direct, and the finite-sample KL bound supplies a statistical guarantee that most algorithmic papers skip. These pieces together address the scalability gap the abstract identifies between sparse linear models and sparse LMMs.

Referee Report

2 major / 2 minor

Summary. The paper proposes a scalable ℓ0-regularized method for subset selection in linear mixed models (LMMs) that handles thousands of predictors. It introduces a coordinate descent algorithm with a stated convergence guarantee, augments it with local search to address non-convexity, extends the approach to generalized LMMs via penalized quasi-likelihood, derives a finite-sample bound on the Kullback-Leibler divergence, and reports strong empirical performance on synthetic and real data.

Significance. If the algorithmic reliability and statistical bound hold, the work would meaningfully close the scalability gap between sparse linear regression and LMMs, enabling practical sparse modeling for high-dimensional heterogeneous data in areas such as personalized medicine. The explicit convergence guarantee and finite-sample KL bound are positive theoretical contributions that go beyond purely empirical claims.

major comments (2)

[Algorithm and convergence guarantee (abstract and main algorithmic section)] The convergence guarantee for coordinate descent is stated to apply to the non-convex ℓ0-penalized objective, yet on such surfaces it only reaches coordinate-wise stationary points. Because the central scalability and performance claims (thousands of predictors in seconds to minutes with excellent recovery) rest on the combination of coordinate descent plus local search reliably locating high-quality supports, the manuscript should provide either a stronger guarantee or targeted experiments quantifying how often local search escapes poor stationary points as p grows (e.g., in the synthetic experiments section).
[Statistical bound section] The finite-sample KL bound is presented as an independent statistical result, but the practical performance claims rely on data-driven choice of the regularization strength λ and other hyperparameters. The manuscript should clarify whether the bound remains valid or informative under such tuning, or whether an additional oracle inequality or stability argument is needed to connect the bound to the reported empirical results.

minor comments (2)

[Generalized LMM extension] The extension to generalized LMMs via penalized quasi-likelihood is mentioned but would benefit from an explicit statement of the approximation error or conditions under which the quasi-likelihood remains accurate for the subset selection task.
[Experiments] Figure captions and experimental tables should include the exact values of p, n, and number of random effects used in each synthetic setting to allow direct comparison with prior LMM sparse methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical results and the role of the local search procedure. We address each major comment below, indicating the revisions we plan to incorporate.

read point-by-point responses

Referee: The convergence guarantee for coordinate descent is stated to apply to the non-convex ℓ0-penalized objective, yet on such surfaces it only reaches coordinate-wise stationary points. Because the central scalability and performance claims (thousands of predictors in seconds to minutes with excellent recovery) rest on the combination of coordinate descent plus local search reliably locating high-quality supports, the manuscript should provide either a stronger guarantee or targeted experiments quantifying how often local search escapes poor stationary points as p grows (e.g., in the synthetic experiments section).

Authors: We agree that the stated convergence result for coordinate descent establishes convergence to a coordinate-wise stationary point of the non-convex objective, consistent with standard analysis for block coordinate methods on non-convex problems. The local search step is explicitly introduced to move beyond such points by evaluating neighboring supports. To directly address the concern, we will add a targeted set of synthetic experiments that report, for increasing p, the fraction of instances in which local search improves the objective value or support recovery relative to coordinate descent alone. These results will be placed in the synthetic experiments section. revision: yes
Referee: The finite-sample KL bound is presented as an independent statistical result, but the practical performance claims rely on data-driven choice of the regularization strength λ and other hyperparameters. The manuscript should clarify whether the bound remains valid or informative under such tuning, or whether an additional oracle inequality or stability argument is needed to connect the bound to the reported empirical results.

Authors: The finite-sample KL bound is derived for the estimator minimizing the penalized objective at a fixed λ. In the reported experiments λ is selected by cross-validation. The bound therefore applies directly only to fixed regularization; its informativeness for the tuned estimator would indeed require an additional stability or oracle argument. We will revise the statistical bound section to state this distinction explicitly and to note that a full oracle inequality connecting the bound to cross-validated λ is left for future work, while preserving the bound as a guarantee for any fixed λ. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivations and bounds presented as independent contributions

full rationale

The paper develops a coordinate descent algorithm with an explicit convergence guarantee, augments it with local search for the non-convex ℓ0 objective, extends the approach to GLMMs via penalized quasi-likelihood, and derives a finite-sample KL divergence bound. None of these steps reduce by construction to fitted inputs, self-citations, or renamed empirical patterns within the provided sections. The algorithmic guarantees and statistical bound are framed as new results rather than tautological restatements of the method's own tuning or prior self-referential claims. Performance demonstrations rely on experiments but do not substitute for the core derivations.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear mixed model assumptions plus new algorithmic components for non-convex optimization; no new physical entities are introduced.

free parameters (1)

regularization strength lambda
The ℓ0 penalty parameter controls sparsity and must be chosen or tuned for each dataset.

axioms (2)

domain assumption Data are generated from a linear mixed model with fixed and random effects
The entire subset selection framework is derived under this modeling assumption.
ad hoc to paper Coordinate descent converges for the non-convex ℓ0-penalized objective
The paper supplies a convergence guarantee specific to the proposed algorithm.

pith-pipeline@v0.9.0 · 5734 in / 1186 out tokens · 61110 ms · 2026-05-19T08:07:34.786631+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop a coordinate descent algorithm as our main workhorse and provide a guarantee of its convergence. We also develop a local search algorithm to help traverse the nonconvex optimization surface.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

finite-sample bound on the Kullback-Leibler divergence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.