Revisiting Bayesian Variable Selection via Optimization
Pith reviewed 2026-05-09 23:21 UTC · model grok-4.3
The pith
The marginal likelihood in Bayesian variable selection can be globally optimized by a difference-of-convex algorithm despite lacking log-concavity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating the negative log-marginal likelihood as a loss function of the latent precision parameters, the paper rewrites it as the difference of two convex functions. The DC algorithm applied to this loss converges to the global minimizer at a linear rate whenever the feasible set is compact. The same convergence holds for the type-II maximum-likelihood estimator and for the maximum marginal posterior under priors that preserve the DC structure.
What carries the argument
Difference-of-convex (DC) algorithm applied to the negative log-marginal likelihood written as the difference of two convex functions of the latent precision parameters.
If this is right
- Type-II maximum-likelihood variable selection can be solved to global optimality without risk of spurious local modes.
- The same global guarantee applies to maximum marginal posterior modes under priors that keep the DC structure intact.
- The algorithm supplies a tuning-free, fast alternative or warm start for collapsed Gibbs samplers in linear regression.
- The approach extends directly to models with structured sparsity penalties.
- Numerical experiments on simulated data and on aftershock-risk mapping after the 2019 Ridgecrest earthquakes confirm practical performance.
Where Pith is reading between the lines
- Similar DC decompositions may exist for marginal likelihoods in generalized linear models or non-linear regression, allowing the same convergence argument.
- Practitioners could run the DC optimizer first to locate a high-quality mode before launching full posterior sampling, reducing the chance that MCMC chains start far from the main mass.
- The linear convergence rate suggests that the number of iterations needed scales modestly with dimension when the compact-set radius is fixed.
Load-bearing premise
The negative log-marginal likelihood must admit an expression as the difference of two convex functions, and the domain of the latent precision parameters must satisfy mild compactness.
What would settle it
A low-dimensional linear regression example in which the global mode is known by exhaustive enumeration, yet the DC algorithm returns a strictly inferior point when the compactness condition is removed.
Figures
read the original abstract
Variable selection in linear regression has been a central topic in statistical research for decades. Bayesian variable selection methods, which account for uncertainty in both the regression coefficients and the noise variance, have achieved broad success through the use of discrete or continuous shrinkage priors and efficient collapsed Gibbs samplers. Despite their popularity and strong empirical performance, an enigma remains: the marginal likelihood, obtained by integrating out the regression coefficients and noise variance, is not log-concave; therefore, there is no guarantee of reliably finding its global optimum. In this article, we study this problem from an optimization perspective. Taking the negative log-marginal likelihood as a loss function of the latent precision parameters, we can rewrite it as a difference of convex functions (DC), and then optimize it via a simple iterative algorithm. Under mild compact set conditions, the DC algorithm converges to the global optimum at a linear rate. The positive finding applies to type-II maximum likelihood and extends to maximum marginal posterior under suitable priors, indicating that the problem of mode finding in Bayesian variable selection is much more benign than the lack of log-concavity might suggest. Besides the theoretical insight, the proposed algorithm is easy to implement, free of tuning, and extensible to structured sparsity, and thus can serve as an efficient alternative or warm-start for traditional Markov chain Monte Carlo solutions. The method is illustrated through numerical studies and a spatial data application for quantifying the aftershock risk following the 2019 Ridgecrest earthquakes. The source code for the algorithm is publicly available at https://github.com/leoduan/dca_optimization_variable_selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the negative log-marginal likelihood for Bayesian variable selection in linear regression, expressed as a function of latent precision parameters, can be rewritten as a difference of convex functions. It then proposes a DC algorithm that converges to the global optimum at a linear rate under mild compact set conditions. The result is stated to apply to type-II maximum likelihood and to extend to maximum marginal posterior under suitable priors. The approach is presented as tuning-free and easy to implement, serving as an alternative or warm-start for MCMC, with illustrations on numerical studies and a spatial data application; source code is publicly released.
Significance. If the DC decomposition and convergence result hold and extend to the boundary regimes of interest, the work would provide useful theoretical insight that mode-finding in Bayesian variable selection is more tractable than the absence of log-concavity suggests, together with a practical algorithm. The public availability of the source code is a positive feature that supports reproducibility and potential adoption as a complement to existing MCMC methods.
major comments (1)
- [Convergence theorem and surrounding discussion of the DC algorithm] The convergence theorem (the statement that the DC algorithm converges to the global optimum at linear rate under mild compact set conditions) is load-bearing for the central claim. The relevant modes in variable selection routinely lie at the boundary of the domain, with some precision parameters τ_j → 0 (inclusion) or τ_j → ∞ (exclusion). Any compact set is closed and bounded and therefore excludes at least one of these extremes. The manuscript provides no explicit argument that the global minimizer remains interior to the chosen compact set or that boundary solutions can be recovered by a limiting argument. Without this, the guarantee does not transfer to the statistical regimes the method is intended to address.
minor comments (2)
- [Numerical studies] The numerical studies section would benefit from explicit side-by-side reporting of variable recovery rates, runtime, and sensitivity to initialization when comparing the DC algorithm against collapsed Gibbs sampling.
- [Method section] Notation for the latent precision vector τ and the DC decomposition could be introduced with a short table or explicit listing of the convex and concave parts to improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying a key theoretical point regarding the scope of the convergence result. We address this comment directly below and outline the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [Convergence theorem and surrounding discussion of the DC algorithm] The convergence theorem (the statement that the DC algorithm converges to the global optimum at linear rate under mild compact set conditions) is load-bearing for the central claim. The relevant modes in variable selection routinely lie at the boundary of the domain, with some precision parameters τ_j → 0 (inclusion) or τ_j → ∞ (exclusion). Any compact set is closed and bounded and therefore excludes at least one of these extremes. The manuscript provides no explicit argument that the global minimizer remains interior to the chosen compact set or that boundary solutions can be recovered by a limiting argument. Without this, the guarantee does not transfer to the statistical regimes the method is intended to address.
Authors: We thank the referee for highlighting this important limitation in the current statement of the theorem. The convergence result is indeed established only on compact sets, which necessarily exclude the boundary regimes τ_j → 0 and τ_j → ∞ that arise in variable selection. The manuscript does not supply an explicit interiority argument or a limiting procedure to recover boundary solutions. To address this, we will revise the manuscript by adding a dedicated remark that considers a sequence of expanding compact sets K_n whose union is the positive orthant. Under the same mild conditions used in the original proof, we will show that the sequence of DC iterates on K_n converges to a point that is a global minimizer on the closure whenever such a minimizer exists in the extended reals. We will also include additional numerical experiments that initialize the algorithm near the boundaries and confirm recovery of the expected modes. These additions will be incorporated in the revised version. revision: yes
Circularity Check
No significant circularity: DC form and convergence derived from standard marginal likelihood
full rationale
The paper starts from the standard negative log-marginal likelihood obtained by integrating out regression coefficients and noise variance in the linear model. It then performs an algebraic rewrite of this expression into a difference of convex functions, which is a direct derivation rather than a self-referential definition or fitted construction. The DC algorithm is applied to this derived objective, and the linear-rate global convergence is asserted under explicitly stated mild compact-set conditions drawn from general DC programming theory. No step reduces the target result to a parameter fitted on the same data, a self-citation chain, or a renamed empirical pattern. The central claim therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The negative log-marginal likelihood is a difference of convex functions.
Reference graph
Works this paper leans on
-
[1]
Yao, Chaorui and Jiang, Xin , journal=
- [2]
-
[3]
Guyon, Isabelle and Gunn, Steve and Ben-Hur, Asa and Dror, Gideon , journal=
-
[4]
George, Edward I. and McCulloch, Robert E. , title =. Journal of the American Statistical Association , year =
-
[5]
Polson, Nicholas G. and Scott, James G. , title =. Bayesian Statistics 9 , editor =. 2011 , doi =
work page 2011
-
[6]
Carvalho, Carlos M. and Polson, Nicholas G. and Scott, James G. , title =. Biometrika , year =
-
[7]
Bhattacharya, Anirban and Pati, Debdeep and Pillai, Natesh S. and Dunson, David B. , title =. Journal of the American Statistical Association , year =
-
[8]
Armagan, Artin and Dunson, David B. and Lee, Jaeyong , title =. Statistica Sinica , year =
-
[9]
Bai, Ray and Ghosh, Malay , title =. Statistica Sinica , year =
-
[10]
Journal of the American Statistical Association , year =
Ro. Journal of the American Statistical Association , year =
-
[11]
Journal of the Royal Statistical Society: Series B , year =
Tibshirani, Robert , title =. Journal of the Royal Statistical Society: Series B , year =
-
[12]
Johnson, Alicia A. and Jones, Galin L. , title =. Electronic Journal of Statistics , volume =
-
[13]
Linear Algebra and Its Applications , volume =
Rom. Linear Algebra and Its Applications , volume =
-
[14]
Nishimura, Akihiko and Suchard, Marc A , journal=. 2023 , publisher=
work page 2023
- [15]
-
[16]
and Chakravarti, Nilotpal , title =
Best, Michael J. and Chakravarti, Nilotpal , title =. Mathematical Programming , volume =. 1990 , doi =
work page 1990
-
[17]
De Leeuw, Jan and Hornik, Kurt and Mair, Patrick , journal=
- [18]
- [19]
-
[20]
Tipping, Michael E , journal=
- [21]
-
[22]
Faul, Anita and Tipping, Michael , journal=
-
[23]
Polson, Nicholas G and Scott, James G and Windle, Jesse , journal=. 2013 , publisher=
work page 2013
-
[24]
Griffin, Jim E. and Brown, Philip J , title =. Bayesian Analysis , year =
-
[25]
Electronic Journal of Statistics , volume =
Vats, Dootika , title =. Electronic Journal of Statistics , volume =
- [26]
-
[27]
Bhattacharya, Anirban and Chakraborty, Antik and Mallick, Bani K , journal=. 2016 , publisher=
work page 2016
-
[28]
Johndrow, James and Orenstein, Paulo and Bhattacharya, Anirban , journal=
-
[29]
Rajaratnam, Bala and Sparks, Doug and Khare, Kshitij and Zhang, Liyuan , journal=
-
[30]
Journal of the American Statistical Association , year =
Fan, Jianqing and Li, Runze , title =. Journal of the American Statistical Association , year =
-
[31]
Journal of the American Statistical Association , year =
Park, Trevor and Casella, George , title =. Journal of the American Statistical Association , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.