From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs
Pith reviewed 2026-05-22 07:36 UTC · model grok-4.3
The pith
Batching branch-and-bound nodes on GPUs speeds up exact solutions for sparse generalized linear models by one to two orders of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework uses padding together with lightweight custom kernels to handle irregular node data structures, achieving one to two orders of magnitude speedups and zero optimality gap on challenging instances.
What carries the argument
The CPU-GPU framework that batches multiple branch-and-bound nodes for parallel GPU processing, using padding to manage varying data sizes across nodes.
If this is right
- The method delivers exact optimal solutions without sacrificing optimality.
- Speedups of 10 to 100 times allow solving larger or harder instances of cardinality-constrained GLMs.
- The framework can be extended to collect the full Rashomon set of near-optimal models for additional statistical analysis.
- Modular design allows integration with existing branch-and-bound solvers for other combinatorial problems.
Where Pith is reading between the lines
- If the batching overhead stays low, similar GPU acceleration could apply to branch-and-bound in other domains like integer programming or scheduling.
- Collecting the Rashomon set opens the door to model selection based on user-defined secondary criteria beyond just sparsity and fit.
- Future work might test the framework on very large-scale problems to see where GPU memory limits the batch size.
Load-bearing premise
Padding the irregular data structures of different branch-and-bound nodes combined with custom kernels adds low enough overhead to produce real speedups on GPUs rather than slowing things down.
What would settle it
Running the framework on the same challenging instances and finding that it either takes more time than a pure CPU version or returns solutions with a positive optimality gap.
Figures
read the original abstract
GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a CPU-GPU framework for parallel branch-and-bound (BnB) to compute optimal k-sparse generalized linear models. Multiple BnB nodes are batched and processed on the GPU using padding combined with lightweight custom kernels to accommodate irregular node data (varying active sets, depths, and subproblem sizes). The framework is described as simple, generic, and modular. Experiments are reported to achieve one to two orders of magnitude speedups over sequential CPU baselines while maintaining zero optimality gap on challenging instances; the approach is also extended to enumerate the full Rashomon set for downstream statistical tasks such as variable-importance analysis.
Significance. If the reported speedups are robust, the work would meaningfully improve the practicality of exact solvers for cardinality-constrained GLMs, a class of problems that remains computationally demanding. The modular CPU-GPU design and the Rashomon-set extension are concrete strengths that could support follow-on statistical applications. The absence of parameter fitting or circular derivations in the algorithmic contribution keeps the burden of proof on empirical validation rather than on mathematical self-consistency.
major comments (2)
- [§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.
- [§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.
minor comments (2)
- [Figure 2] Figure 2 caption does not state the exact GPU kernel launch configuration or the memory layout after padding; adding these details would improve reproducibility.
- [§2.1] The GLM loss and gradient expressions in §2.1 are given in prose; an explicit equation block would clarify the precise form passed to the GPU kernels.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify the experimental and implementation details in our manuscript. We address each major comment below.
read point-by-point responses
-
Referee: [§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.
Authors: We agree that these details are necessary for reproducibility and to allow readers to assess robustness. In the revised manuscript we will expand §4 with a table (or expanded text) that reports the exact (n, p, k) values for every instance, the total number of test problems, the precise versions of all baseline solvers, the CPU and GPU hardware models together with memory capacities, and run-to-run variability (standard deviations or ranges over repeated executions). These additions will be placed in the main text or a dedicated appendix without changing any numerical results. revision: yes
-
Referee: [§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.
Authors: We acknowledge that quantitative implementation metrics would strengthen the presentation. We will revise §3.3 (and add a short appendix if needed) to report representative padding ratios observed across our node batches, estimated GPU occupancy for the custom kernels, and a coarse timing breakdown separating data-transfer, padding, and kernel execution. If full low-level profiler traces were not collected in the original experiments, we will state this limitation explicitly while supplying the best available figures from our runs. revision: partial
Circularity Check
No circularity: empirical framework proposal with experimental validation
full rationale
The paper presents a modular CPU-GPU batching framework for parallel branch-and-bound on k-sparse GLM instances, using padding and lightweight custom kernels to manage irregular node data. Claims rest on experimental speedups (1-2 orders of magnitude) and zero optimality gap rather than any mathematical derivation, first-principles prediction, or equation that reduces outputs to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing steps. The work is implementation-driven and externally falsifiable via the reported benchmarks, satisfying the criteria for a self-contained non-circular contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009
Tobias Achterberg. SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009. doi: 10.1007/s12532-008-0001-1
-
[2]
Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. doi: 10.1016/j.orl.2004.04.002
-
[3]
Practical large-scale linear programming using primal-dual hybrid gradient
David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, 11 and Warren Schudy. Practical large-scale linear programming using primal-dual hybrid gradient. InAdvances in Neural Information Processing Systems, pages 20243–20257, 2021
work page 2021
-
[4]
The UCI Machine Learning Repository, 2007
Arthur Asuncion and David Newman. The UCI Machine Learning Repository, 2007
work page 2007
-
[5]
Safe screening rules forℓ0-regression from perspective relaxations
Alper Atamturk and Andrés Gómez. Safe screening rules forℓ0-regression from perspective relaxations. InProceedings of the 37th International Conference on Machine Learning, pages 421–430, 2020
work page 2020
-
[6]
Alper Atamtürk and Andrés Gómez. Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023
work page 2023
-
[7]
Alper Atamtürk, Andrés Gómez, and Shaoning Han. Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021
work page 2021
-
[8]
Near-optimal decision trees in a SPLIT second
Varun Babbar, Hayden McTavish, Cynthia Rudin, and Margo Seltzer. Near-optimal decision trees in a SPLIT second. InInternational Conference on Machine Learning, 2025
work page 2025
- [9]
-
[10]
Dimitris Bertsimas and Wes Gurnee. Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023
work page 2023
-
[11]
Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020
work page 2020
-
[12]
Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020
work page 2020
-
[13]
Batched first-order methods for parallel LP solving in MIP, 2026
Nicolas Blin, Stefano Gualandi, Christopher Maes, Andrea Lodi, and Bartolomeo Stellato. Batched first-order methods for parallel LP solving in MIP, 2026. URLhttps://arxiv. org/abs/2601.21990
-
[14]
Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001
work page 2001
-
[15]
Monotone regression: A simple and fast O (n) PAVA implementation
Frank MTA Busing. Monotone regression: A simple and fast O (n) PAVA implementation. Journal of Statistical Software, 102(Code Snippet 1):1–25, 2022
work page 2022
-
[16]
Convex programming for disjunctive convex optimization
Sebastián Ceria and João Soares. Convex programming for disjunctive convex optimization. Mathematical Programming, 86(3):595–614, 1999
work page 1999
-
[17]
GPU-accelerated primal heuristics for mixed integer programming, 2025
Akif Çördük, Piotr Sielski, Alice Boucher, and Kumar Aatish. GPU-accelerated primal heuristics for mixed integer programming, 2025. URLhttps://arxiv.org/abs/2510.20499
-
[18]
On the power of linear programming for K-means clustering
Antonio De Rosa, Aida Khajavirad, and Yakun Wang. On the power of linear programming for K-means clustering, 2024. URLhttps://arxiv.org/abs/2402.01061
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021
work page 2021
-
[20]
Jiayun Dong and Cynthia Rudin. Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020
work page 2020
-
[21]
Jon Donnelly, Srikar Katta, Cynthia Rudin, and Edward P. Browne. The Rashomon importance distribution: Getting RID of unstable, single model-based variable importance. InAdvances in Neural Information Processing Systems, volume 36, 2023. 12
work page 2023
-
[22]
Rashomon sets for prototypical-part networks: Editing interpretable models in real-time
Jon Donnelly, Zhicheng Guo, Alina Jade Barnett, Hayden McTavish, Chaofan Chen, and Cynthia Rudin. Rashomon sets for prototypical-part networks: Editing interpretable models in real-time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[23]
Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously.Journal of Machine Learning Research, 20(177):1–81, 2019
work page 2019
-
[24]
Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006
work page 2006
-
[25]
Oktay Günlük and Jeff Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010
work page 2010
-
[26]
Gurobi Optimizer Reference Manual, 2025
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2025
work page 2025
-
[27]
A new branch-and-bound pruning framework forℓ0-regularized problems
Théo Guyard, Cédric Herzet, Clément Elvira, and Ayse-Nur Arslan. A new branch-and-bound pruning framework forℓ0-regularized problems. InProceedings of the 41st International Conference on Machine Learning, pages 48077–48096, 2024
work page 2024
-
[28]
Resolving Predictive Multiplicity for the Rashomon Set
Parian Haghighat, Hadis Anahideh, and Cynthia Rudin. Resolving predictive multiplicity for the Rashomon set, 2026. URLhttps://arxiv.org/abs/2601.09071
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,
Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,
- [30]
-
[31]
Hussein Hazimeh and Rahul Mazumder. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020
work page 2020
-
[32]
Hussein Hazimeh, Rahul Mazumder, and Ali Saab. Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022
work page 2022
-
[33]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, volume 25, pages 1097–1105, 2012
work page 2012
-
[34]
A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025
Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025. URLhttps://arxiv.org/ abs/2505.00311
-
[35]
J. T. Linderoth and M. W. P. Savelsbergh. A computational study of search strategies for mixed integer programming.INFORMS Journal on Computing, 11(2):173–187, 1999. doi: 10.1287/ijoc.11.2.173
-
[36]
FasterRisk: Fast and accurate interpretable risk scores
Jiachang Liu, Chudi Zhong, Boxuan Li, Margo Seltzer, and Cynthia Rudin. FasterRisk: Fast and accurate interpretable risk scores. InAdvances in Neural Information Processing Systems, volume 35, 2022
work page 2022
-
[37]
Fast sparse classification for generalized linear and additive models
Jiachang Liu, Chudi Zhong, Margo Seltzer, and Cynthia Rudin. Fast sparse classification for generalized linear and additive models. InProceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 9304–9333, 2022
work page 2022
-
[38]
OKRidge: Scalable optimal k-sparse ridge regression
Jiachang Liu, Sam Rosen, Chudi Zhong, and Cynthia Rudin. OKRidge: Scalable optimal k-sparse ridge regression. InAdvances in Neural Information Processing Systems, pages 41076–41258, 2023. 13
work page 2023
-
[39]
Scalable first-order method for certifying optimal k-sparse GLMs
Jiachang Liu, Soroosh Shafiee, and Andrea Lodi. Scalable first-order method for certifying optimal k-sparse GLMs. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Le...
work page 2025
-
[40]
Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026
Jiachang Liu, Andrea Lodi, and Soroosh Shafiee. Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026. URLhttps://arxiv.org/ abs/2603.01306
-
[41]
arXiv preprint arXiv:2311.07710
Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming, 2025. URLhttps://arxiv.org/abs/2311.07710
-
[42]
cuPDLP-C : A strengthened implementation of cuPDLP for linear programming by C language
Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language, 2024. URLhttps://arxiv.org/abs/2312.14832
-
[43]
Leveraging predictive equivalence in decision trees
Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, and Cynthia Rudin. Leveraging predictive equivalence in decision trees. InInternational Conference on Machine Learning, 2025
work page 2025
-
[44]
A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026
Xiang Meng, Ryan Lucas, and Rahul Mazumder. A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026. URLhttps://arxiv.org/abs/2602. 04551
work page 2026
-
[45]
David R. Morrison, Sheldon H. Jacobson, Jason J. Sauppe, and Edward C. Sewell. Branch- and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016. doi: 10.1016/j.disopt.2016.01.005
-
[46]
MOSEK ApS.The MOSEK optimization toolbox for MATLAB manual. Version 11.0.4, 2025
work page 2025
-
[47]
Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H
Simon D. Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H. McCormick. REALITrees: Rashomon ensemble active learning for interpretable trees, 2026. URLhttps://arxiv.org/abs/2603.22750
-
[48]
Santander cus- tomer transaction prediction
Mercedes Piedra, Sohier Dane, and Soraya Jimenez. Santander cus- tomer transaction prediction. https://kaggle.com/competitions/ santander-customer-transaction-prediction, 2019. Kaggle competition
work page 2019
-
[49]
Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. InProceedings of the 26th Annual International Conference on Machine Learning, pages 873–880, 2009
work page 2009
-
[50]
Tyrrell Rockafellar.Convex Analysis
R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970
work page 1970
-
[51]
Position: Amazing things come from having many good models
Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo Seltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Boner. Position: Amazing things come from having many good models. InProceedings of the 41st International Conference on Machine Learning, pages 42783–42795, 2024
work page 2024
-
[52]
Soroosh Shafiee and Fatma Kılınç-Karzan. Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024
work page 2024
-
[53]
Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016. 14
work page 2016
-
[54]
Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019
Berk Ustun and Cynthia Rudin. Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019
work page 2019
-
[55]
On the convexification of constrained quadratic optimization problems with indicator variables
Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. On the convexification of constrained quadratic optimization problems with indicator variables. InInteger Programming and Combinatorial Optimization, pages 433–447, 2020
work page 2020
-
[56]
Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022
work page 2022
-
[57]
Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020
work page 2020
-
[58]
Exploring the whole Rashomon set of sparse decision trees
Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, and Cynthia Rudin. Exploring the whole Rashomon set of sparse decision trees. InAdvances in Neural Information Processing Systems, volume 35, 2022
work page 2022
-
[59]
Exploring and interacting with the set of good sparse generalized additive models
Chudi Zhong, Zhi Chen, Jiachang Liu, Margo Seltzer, and Cynthia Rudin. Exploring and interacting with the set of good sparse generalized additive models. InAdvances in Neural Information Processing Systems, volume 36, pages 56673–56699, 2023. Appendix 7 Primer on Branch and Bound A branch-and-bound tree stores partial fixing decisions on the binary suppor...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.