From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

Andrea Lodi; Jiachang Liu

arxiv: 2605.22188 · v1 · pith:TKGSQGB4new · submitted 2026-05-21 · 💻 cs.LG · math.OC· stat.ML

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs

Jiachang Liu , Andrea Lodi This is my paper

Pith reviewed 2026-05-22 07:36 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords branch and boundGPU parallelizationsparse generalized linear modelscardinality constraintsexact optimizationRashomon set

0 comments

The pith

Batching branch-and-bound nodes on GPUs speeds up exact solutions for sparse generalized linear models by one to two orders of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that moves multiple nodes from the branch-and-bound search tree to the GPU for simultaneous processing. This addresses the usual bottleneck where nodes must be handled one after another on the CPU, especially for problems that mix discrete choices with nonlinear objectives like finding the best sparse model. A sympathetic reader would care because exact optimal solutions for these models have been computationally expensive, limiting their use on larger datasets or more complex problems. The approach keeps the exactness of branch and bound while gaining speed from GPU parallelism through careful data handling.

Core claim

We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework uses padding together with lightweight custom kernels to handle irregular node data structures, achieving one to two orders of magnitude speedups and zero optimality gap on challenging instances.

What carries the argument

The CPU-GPU framework that batches multiple branch-and-bound nodes for parallel GPU processing, using padding to manage varying data sizes across nodes.

If this is right

The method delivers exact optimal solutions without sacrificing optimality.
Speedups of 10 to 100 times allow solving larger or harder instances of cardinality-constrained GLMs.
The framework can be extended to collect the full Rashomon set of near-optimal models for additional statistical analysis.
Modular design allows integration with existing branch-and-bound solvers for other combinatorial problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the batching overhead stays low, similar GPU acceleration could apply to branch-and-bound in other domains like integer programming or scheduling.
Collecting the Rashomon set opens the door to model selection based on user-defined secondary criteria beyond just sparsity and fit.
Future work might test the framework on very large-scale problems to see where GPU memory limits the batch size.

Load-bearing premise

Padding the irregular data structures of different branch-and-bound nodes combined with custom kernels adds low enough overhead to produce real speedups on GPUs rather than slowing things down.

What would settle it

Running the framework on the same challenging instances and finding that it either takes more time than a pure CPU version or returns solutions with a positive optimality gap.

Figures

Figures reproduced from arXiv: 2605.22188 by Andrea Lodi, Jiachang Liu.

**Figure 2.** Figure 2: Example of padding and column sorting for a batched proximal evaluation. Non-free [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Effect of batch-size for GPU-parallel BnB for our method. Results are on synthetic [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Primer on branch and bound. Each open node stores partial fixing decisions on z. A lower bound prunes a node (with red crosses) when it cannot beat the incumbent; otherwise the node is branched into two children by fixing one free variable, either with zj = 0 or zj = 1 The green node finds an incumbent, which becomes optimal when there are no open nodes. In this work, we process multiple nodes in batches… view at source ↗

**Figure 5.** Figure 5: Support-frequency summary for the saved top- [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: Model-reliance summary for the saved top- [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗

**Figure 7.** Figure 7: Primary sparse-logistic objective versus AUC over the saved top- [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗

**Figure 8.** Figure 8: Primary sparse-logistic objective versus accuracy over the saved top- [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 9.** Figure 9: Accuracy versus AUC over the saved top-1000 Dorothea Rashomon set with k = 5. This plot shows whether the models preferred by a threshold-dependent metric are also preferred by a ranking metric. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

**Figure 10.** Figure 10: The BnB search tree and the support trie are different objects. BnB edges are [PITH_FULL_IMAGE:figures/full_fig_p032_10.png] view at source ↗

read the original abstract

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a modular GPU batching framework for branch-and-bound on k-sparse GLMs that claims large speedups, but the experimental support is still thin on details.

read the letter

The core idea is a CPU-GPU setup that batches heterogeneous branch-and-bound nodes for cardinality-constrained GLMs, using padding and lightweight custom kernels to run them in parallel on the GPU. This targets the irregular, discrete search structure that most prior GPU work has avoided in favor of continuous first-order methods. The framework is presented as simple and modular, with an added note on collecting the full Rashomon set for later statistical work like variable importance or secondary metric selection. If the speedups hold, it could push exact solvers into larger problem sizes where they were previously too slow. That part is worth attention for anyone building practical exact methods in sparse modeling. The engineering choices around batching and kernel design show honest engagement with the data movement and irregularity problems that usually block GPU use in BnB. Credit is due for keeping the approach generic rather than tying it to one specific solver. The reported one-to-two order speedups and zero optimality gaps on challenging instances are the headline results. At the same time, the abstract supplies almost no numbers on instance dimensions, baseline solvers, hardware, or run-to-run variability. Without those, it is difficult to tell whether the gains are robust or tied to particular distributions. The padding strategy for uneven node sizes is the obvious place to look for hidden costs. If active-set sizes or depths differ a lot across nodes, the dummy work could cut into net throughput, and the paper would benefit from utilization numbers or kernel profiles to address that directly. This work is aimed at researchers who need scalable exact solvers for sparse GLMs or who want to explore combinatorial optimization on GPUs. A reader already familiar with BnB implementations will get the most from the framework description and the Rashomon-set extension. The central algorithmic claim is coherent on its own terms and addresses a genuine bottleneck, so the paper deserves a serious referee even if the empirical section needs expansion.

Referee Report

2 major / 2 minor

Summary. The paper proposes a CPU-GPU framework for parallel branch-and-bound (BnB) to compute optimal k-sparse generalized linear models. Multiple BnB nodes are batched and processed on the GPU using padding combined with lightweight custom kernels to accommodate irregular node data (varying active sets, depths, and subproblem sizes). The framework is described as simple, generic, and modular. Experiments are reported to achieve one to two orders of magnitude speedups over sequential CPU baselines while maintaining zero optimality gap on challenging instances; the approach is also extended to enumerate the full Rashomon set for downstream statistical tasks such as variable-importance analysis.

Significance. If the reported speedups are robust, the work would meaningfully improve the practicality of exact solvers for cardinality-constrained GLMs, a class of problems that remains computationally demanding. The modular CPU-GPU design and the Rashomon-set extension are concrete strengths that could support follow-on statistical applications. The absence of parameter fitting or circular derivations in the algorithmic contribution keeps the burden of proof on empirical validation rather than on mathematical self-consistency.

major comments (2)

[§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.
[§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.

minor comments (2)

[Figure 2] Figure 2 caption does not state the exact GPU kernel launch configuration or the memory layout after padding; adding these details would improve reproducibility.
[§2.1] The GLM loss and gradient expressions in §2.1 are given in prose; an explicit equation block would clarify the precise form passed to the GPU kernels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the experimental and implementation details in our manuscript. We address each major comment below.

read point-by-point responses

Referee: [§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.

Authors: We agree that these details are necessary for reproducibility and to allow readers to assess robustness. In the revised manuscript we will expand §4 with a table (or expanded text) that reports the exact (n, p, k) values for every instance, the total number of test problems, the precise versions of all baseline solvers, the CPU and GPU hardware models together with memory capacities, and run-to-run variability (standard deviations or ranges over repeated executions). These additions will be placed in the main text or a dedicated appendix without changing any numerical results. revision: yes
Referee: [§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.

Authors: We acknowledge that quantitative implementation metrics would strengthen the presentation. We will revise §3.3 (and add a short appendix if needed) to report representative padding ratios observed across our node batches, estimated GPU occupancy for the custom kernels, and a coarse timing breakdown separating data-transfer, padding, and kernel execution. If full low-level profiler traces were not collected in the original experiments, we will state this limitation explicitly while supplying the best available figures from our runs. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical framework proposal with experimental validation

full rationale

The paper presents a modular CPU-GPU batching framework for parallel branch-and-bound on k-sparse GLM instances, using padding and lightweight custom kernels to manage irregular node data. Claims rest on experimental speedups (1-2 orders of magnitude) and zero optimality gap rather than any mathematical derivation, first-principles prediction, or equation that reduces outputs to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing steps. The work is implementation-driven and externally falsifiable via the reported benchmarks, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions of GPU memory coalescing and the correctness of branch-and-bound pruning rules; no new mathematical axioms or free parameters are introduced in the abstract.

pith-pipeline@v0.9.0 · 5701 in / 1102 out tokens · 40175 ms · 2026-05-22T07:36:05.363424+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

[1]

SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009

Tobias Achterberg. SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009. doi: 10.1007/s12532-008-0001-1

work page doi:10.1007/s12532-008-0001-1 2009
[2]

Achterberg, T

Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. doi: 10.1016/j.orl.2004.04.002

work page doi:10.1016/j.orl.2004.04.002 2005
[3]

Practical large-scale linear programming using primal-dual hybrid gradient

David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, 11 and Warren Schudy. Practical large-scale linear programming using primal-dual hybrid gradient. InAdvances in Neural Information Processing Systems, pages 20243–20257, 2021

work page 2021
[4]

The UCI Machine Learning Repository, 2007

Arthur Asuncion and David Newman. The UCI Machine Learning Repository, 2007

work page 2007
[5]

Safe screening rules forℓ0-regression from perspective relaxations

Alper Atamturk and Andrés Gómez. Safe screening rules forℓ0-regression from perspective relaxations. InProceedings of the 37th International Conference on Machine Learning, pages 421–430, 2020

work page 2020
[6]

Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

Alper Atamtürk and Andrés Gómez. Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

work page 2023
[7]

Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

Alper Atamtürk, Andrés Gómez, and Shaoning Han. Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

work page 2021
[8]

Near-optimal decision trees in a SPLIT second

Varun Babbar, Hayden McTavish, Cynthia Rudin, and Margo Seltzer. Near-optimal decision trees in a SPLIT second. InInternational Conference on Machine Learning, 2025

work page 2025
[9]

SIAM, 2017

Amir Beck.First-Order Methods in Optimization. SIAM, 2017

work page 2017
[10]

Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

Dimitris Bertsimas and Wes Gurnee. Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

work page 2023
[11]

Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

work page 2020
[12]

Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

work page 2020
[13]

Batched first-order methods for parallel LP solving in MIP, 2026

Nicolas Blin, Stefano Gualandi, Christopher Maes, Andrea Lodi, and Bartolomeo Stellato. Batched first-order methods for parallel LP solving in MIP, 2026. URLhttps://arxiv. org/abs/2601.21990

work page arXiv 2026
[14]

Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

work page 2001
[15]

Monotone regression: A simple and fast O (n) PAVA implementation

Frank MTA Busing. Monotone regression: A simple and fast O (n) PAVA implementation. Journal of Statistical Software, 102(Code Snippet 1):1–25, 2022

work page 2022
[16]

Convex programming for disjunctive convex optimization

Sebastián Ceria and João Soares. Convex programming for disjunctive convex optimization. Mathematical Programming, 86(3):595–614, 1999

work page 1999
[17]

GPU-accelerated primal heuristics for mixed integer programming, 2025

Akif Çördük, Piotr Sielski, Alice Boucher, and Kumar Aatish. GPU-accelerated primal heuristics for mixed integer programming, 2025. URLhttps://arxiv.org/abs/2510.20499

work page arXiv 2025
[18]

On the power of linear programming for K-means clustering

Antonio De Rosa, Aida Khajavirad, and Yakun Wang. On the power of linear programming for K-means clustering, 2024. URLhttps://arxiv.org/abs/2402.01061

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

work page 2021
[20]

Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

Jiayun Dong and Cynthia Rudin. Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

work page 2020
[21]

Jon Donnelly, Srikar Katta, Cynthia Rudin, and Edward P. Browne. The Rashomon importance distribution: Getting RID of unstable, single model-based variable importance. InAdvances in Neural Information Processing Systems, volume 36, 2023. 12

work page 2023
[22]

Rashomon sets for prototypical-part networks: Editing interpretable models in real-time

Jon Donnelly, Zhicheng Guo, Alina Jade Barnett, Hayden McTavish, Chaofan Chen, and Cynthia Rudin. Rashomon sets for prototypical-part networks: Editing interpretable models in real-time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[23]

Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously.Journal of Machine Learning Research, 20(177):1–81, 2019

work page 2019
[24]

Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

work page 2006
[25]

Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

Oktay Günlük and Jeff Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

work page 2010
[26]

Gurobi Optimizer Reference Manual, 2025

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2025

work page 2025
[27]

A new branch-and-bound pruning framework forℓ0-regularized problems

Théo Guyard, Cédric Herzet, Clément Elvira, and Ayse-Nur Arslan. A new branch-and-bound pruning framework forℓ0-regularized problems. InProceedings of the 41st International Conference on Machine Learning, pages 48077–48096, 2024

work page 2024
[28]

Resolving Predictive Multiplicity for the Rashomon Set

Parian Haghighat, Hadis Anahideh, and Cynthia Rudin. Resolving predictive multiplicity for the Rashomon set, 2026. URLhttps://arxiv.org/abs/2601.09071

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

work page
[30]

URLhttps://arxiv.org/abs/2407.15049

work page arXiv
[31]

Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

Hussein Hazimeh and Rahul Mazumder. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

work page 2020
[32]

Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

Hussein Hazimeh, Rahul Mazumder, and Ali Saab. Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

work page 2022
[33]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, volume 25, pages 1097–1105, 2012

work page 2012
[34]

A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025

Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025. URLhttps://arxiv.org/ abs/2505.00311

work page arXiv 2025
[35]

J. T. Linderoth and M. W. P. Savelsbergh. A computational study of search strategies for mixed integer programming.INFORMS Journal on Computing, 11(2):173–187, 1999. doi: 10.1287/ijoc.11.2.173

work page doi:10.1287/ijoc.11.2.173 1999
[36]

FasterRisk: Fast and accurate interpretable risk scores

Jiachang Liu, Chudi Zhong, Boxuan Li, Margo Seltzer, and Cynthia Rudin. FasterRisk: Fast and accurate interpretable risk scores. InAdvances in Neural Information Processing Systems, volume 35, 2022

work page 2022
[37]

Fast sparse classification for generalized linear and additive models

Jiachang Liu, Chudi Zhong, Margo Seltzer, and Cynthia Rudin. Fast sparse classification for generalized linear and additive models. InProceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 9304–9333, 2022

work page 2022
[38]

OKRidge: Scalable optimal k-sparse ridge regression

Jiachang Liu, Sam Rosen, Chudi Zhong, and Cynthia Rudin. OKRidge: Scalable optimal k-sparse ridge regression. InAdvances in Neural Information Processing Systems, pages 41076–41258, 2023. 13

work page 2023
[39]

Scalable first-order method for certifying optimal k-sparse GLMs

Jiachang Liu, Soroosh Shafiee, and Andrea Lodi. Scalable first-order method for certifying optimal k-sparse GLMs. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Le...

work page 2025
[40]

Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026

Jiachang Liu, Andrea Lodi, and Soroosh Shafiee. Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026. URLhttps://arxiv.org/ abs/2603.01306

work page arXiv 2026
[41]

arXiv preprint arXiv:2311.07710

Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming, 2025. URLhttps://arxiv.org/abs/2311.07710

work page arXiv 2025
[42]

cuPDLP-C : A strengthened implementation of cuPDLP for linear programming by C language

Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language, 2024. URLhttps://arxiv.org/abs/2312.14832

work page arXiv 2024
[43]

Leveraging predictive equivalence in decision trees

Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, and Cynthia Rudin. Leveraging predictive equivalence in decision trees. InInternational Conference on Machine Learning, 2025

work page 2025
[44]

A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026

Xiang Meng, Ryan Lucas, and Rahul Mazumder. A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026. URLhttps://arxiv.org/abs/2602. 04551

work page 2026
[45]

Morrison, Sheldon H

David R. Morrison, Sheldon H. Jacobson, Jason J. Sauppe, and Edward C. Sewell. Branch- and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016. doi: 10.1016/j.disopt.2016.01.005

work page doi:10.1016/j.disopt.2016.01.005 2016
[46]

Version 11.0.4, 2025

MOSEK ApS.The MOSEK optimization toolbox for MATLAB manual. Version 11.0.4, 2025

work page 2025
[47]

Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H

Simon D. Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H. McCormick. REALITrees: Rashomon ensemble active learning for interpretable trees, 2026. URLhttps://arxiv.org/abs/2603.22750

work page arXiv 2026
[48]

Santander cus- tomer transaction prediction

Mercedes Piedra, Sohier Dane, and Soraya Jimenez. Santander cus- tomer transaction prediction. https://kaggle.com/competitions/ santander-customer-transaction-prediction, 2019. Kaggle competition

work page 2019
[49]

Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. InProceedings of the 26th Annual International Conference on Machine Learning, pages 873–880, 2009

work page 2009
[50]

Tyrrell Rockafellar.Convex Analysis

R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970

work page 1970
[51]

Position: Amazing things come from having many good models

Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo Seltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Boner. Position: Amazing things come from having many good models. InProceedings of the 41st International Conference on Machine Learning, pages 42783–42795, 2024

work page 2024
[52]

Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

Soroosh Shafiee and Fatma Kılınç-Karzan. Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

work page 2024
[53]

Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016

Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016. 14

work page 2016
[54]

Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

Berk Ustun and Cynthia Rudin. Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

work page 2019
[55]

On the convexification of constrained quadratic optimization problems with indicator variables

Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. On the convexification of constrained quadratic optimization problems with indicator variables. InInteger Programming and Combinatorial Optimization, pages 433–447, 2020

work page 2020
[56]

Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

work page 2022
[57]

Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

work page 2020
[58]

Exploring the whole Rashomon set of sparse decision trees

Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, and Cynthia Rudin. Exploring the whole Rashomon set of sparse decision trees. InAdvances in Neural Information Processing Systems, volume 35, 2022

work page 2022
[59]

Exploring and interacting with the set of good sparse generalized additive models

Chudi Zhong, Zhi Chen, Jiachang Liu, Margo Seltzer, and Cynthia Rudin. Exploring and interacting with the set of good sparse generalized additive models. InAdvances in Neural Information Processing Systems, volume 36, pages 56673–56699, 2023. Appendix 7 Primer on Branch and Bound A branch-and-bound tree stores partial fixing decisions on the binary suppor...

work page 2023

[1] [1]

SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009

Tobias Achterberg. SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009. doi: 10.1007/s12532-008-0001-1

work page doi:10.1007/s12532-008-0001-1 2009

[2] [2]

Achterberg, T

Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. doi: 10.1016/j.orl.2004.04.002

work page doi:10.1016/j.orl.2004.04.002 2005

[3] [3]

Practical large-scale linear programming using primal-dual hybrid gradient

David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, 11 and Warren Schudy. Practical large-scale linear programming using primal-dual hybrid gradient. InAdvances in Neural Information Processing Systems, pages 20243–20257, 2021

work page 2021

[4] [4]

The UCI Machine Learning Repository, 2007

Arthur Asuncion and David Newman. The UCI Machine Learning Repository, 2007

work page 2007

[5] [5]

Safe screening rules forℓ0-regression from perspective relaxations

Alper Atamturk and Andrés Gómez. Safe screening rules forℓ0-regression from perspective relaxations. InProceedings of the 37th International Conference on Machine Learning, pages 421–430, 2020

work page 2020

[6] [6]

Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

Alper Atamtürk and Andrés Gómez. Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

work page 2023

[7] [7]

Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

Alper Atamtürk, Andrés Gómez, and Shaoning Han. Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

work page 2021

[8] [8]

Near-optimal decision trees in a SPLIT second

Varun Babbar, Hayden McTavish, Cynthia Rudin, and Margo Seltzer. Near-optimal decision trees in a SPLIT second. InInternational Conference on Machine Learning, 2025

work page 2025

[9] [9]

SIAM, 2017

Amir Beck.First-Order Methods in Optimization. SIAM, 2017

work page 2017

[10] [10]

Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

Dimitris Bertsimas and Wes Gurnee. Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

work page 2023

[11] [11]

Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

work page 2020

[12] [12]

Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

work page 2020

[13] [13]

Batched first-order methods for parallel LP solving in MIP, 2026

Nicolas Blin, Stefano Gualandi, Christopher Maes, Andrea Lodi, and Bartolomeo Stellato. Batched first-order methods for parallel LP solving in MIP, 2026. URLhttps://arxiv. org/abs/2601.21990

work page arXiv 2026

[14] [14]

Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

work page 2001

[15] [15]

Monotone regression: A simple and fast O (n) PAVA implementation

Frank MTA Busing. Monotone regression: A simple and fast O (n) PAVA implementation. Journal of Statistical Software, 102(Code Snippet 1):1–25, 2022

work page 2022

[16] [16]

Convex programming for disjunctive convex optimization

Sebastián Ceria and João Soares. Convex programming for disjunctive convex optimization. Mathematical Programming, 86(3):595–614, 1999

work page 1999

[17] [17]

GPU-accelerated primal heuristics for mixed integer programming, 2025

Akif Çördük, Piotr Sielski, Alice Boucher, and Kumar Aatish. GPU-accelerated primal heuristics for mixed integer programming, 2025. URLhttps://arxiv.org/abs/2510.20499

work page arXiv 2025

[18] [18]

On the power of linear programming for K-means clustering

Antonio De Rosa, Aida Khajavirad, and Yakun Wang. On the power of linear programming for K-means clustering, 2024. URLhttps://arxiv.org/abs/2402.01061

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

work page 2021

[20] [20]

Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

Jiayun Dong and Cynthia Rudin. Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

work page 2020

[21] [21]

Jon Donnelly, Srikar Katta, Cynthia Rudin, and Edward P. Browne. The Rashomon importance distribution: Getting RID of unstable, single model-based variable importance. InAdvances in Neural Information Processing Systems, volume 36, 2023. 12

work page 2023

[22] [22]

Rashomon sets for prototypical-part networks: Editing interpretable models in real-time

Jon Donnelly, Zhicheng Guo, Alina Jade Barnett, Hayden McTavish, Chaofan Chen, and Cynthia Rudin. Rashomon sets for prototypical-part networks: Editing interpretable models in real-time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[23] [23]

Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously.Journal of Machine Learning Research, 20(177):1–81, 2019

work page 2019

[24] [24]

Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

work page 2006

[25] [25]

Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

Oktay Günlük and Jeff Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

work page 2010

[26] [26]

Gurobi Optimizer Reference Manual, 2025

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2025

work page 2025

[27] [27]

A new branch-and-bound pruning framework forℓ0-regularized problems

Théo Guyard, Cédric Herzet, Clément Elvira, and Ayse-Nur Arslan. A new branch-and-bound pruning framework forℓ0-regularized problems. InProceedings of the 41st International Conference on Machine Learning, pages 48077–48096, 2024

work page 2024

[28] [28]

Resolving Predictive Multiplicity for the Rashomon Set

Parian Haghighat, Hadis Anahideh, and Cynthia Rudin. Resolving predictive multiplicity for the Rashomon set, 2026. URLhttps://arxiv.org/abs/2601.09071

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

work page

[30] [30]

URLhttps://arxiv.org/abs/2407.15049

work page arXiv

[31] [31]

Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

Hussein Hazimeh and Rahul Mazumder. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

work page 2020

[32] [32]

Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

Hussein Hazimeh, Rahul Mazumder, and Ali Saab. Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

work page 2022

[33] [33]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, volume 25, pages 1097–1105, 2012

work page 2012

[34] [34]

A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025

Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025. URLhttps://arxiv.org/ abs/2505.00311

work page arXiv 2025

[35] [35]

J. T. Linderoth and M. W. P. Savelsbergh. A computational study of search strategies for mixed integer programming.INFORMS Journal on Computing, 11(2):173–187, 1999. doi: 10.1287/ijoc.11.2.173

work page doi:10.1287/ijoc.11.2.173 1999

[36] [36]

FasterRisk: Fast and accurate interpretable risk scores

Jiachang Liu, Chudi Zhong, Boxuan Li, Margo Seltzer, and Cynthia Rudin. FasterRisk: Fast and accurate interpretable risk scores. InAdvances in Neural Information Processing Systems, volume 35, 2022

work page 2022

[37] [37]

Fast sparse classification for generalized linear and additive models

Jiachang Liu, Chudi Zhong, Margo Seltzer, and Cynthia Rudin. Fast sparse classification for generalized linear and additive models. InProceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 9304–9333, 2022

work page 2022

[38] [38]

OKRidge: Scalable optimal k-sparse ridge regression

Jiachang Liu, Sam Rosen, Chudi Zhong, and Cynthia Rudin. OKRidge: Scalable optimal k-sparse ridge regression. InAdvances in Neural Information Processing Systems, pages 41076–41258, 2023. 13

work page 2023

[39] [39]

Scalable first-order method for certifying optimal k-sparse GLMs

Jiachang Liu, Soroosh Shafiee, and Andrea Lodi. Scalable first-order method for certifying optimal k-sparse GLMs. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Le...

work page 2025

[40] [40]

Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026

Jiachang Liu, Andrea Lodi, and Soroosh Shafiee. Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026. URLhttps://arxiv.org/ abs/2603.01306

work page arXiv 2026

[41] [41]

arXiv preprint arXiv:2311.07710

Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming, 2025. URLhttps://arxiv.org/abs/2311.07710

work page arXiv 2025

[42] [42]

cuPDLP-C : A strengthened implementation of cuPDLP for linear programming by C language

Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language, 2024. URLhttps://arxiv.org/abs/2312.14832

work page arXiv 2024

[43] [43]

Leveraging predictive equivalence in decision trees

Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, and Cynthia Rudin. Leveraging predictive equivalence in decision trees. InInternational Conference on Machine Learning, 2025

work page 2025

[44] [44]

A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026

Xiang Meng, Ryan Lucas, and Rahul Mazumder. A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026. URLhttps://arxiv.org/abs/2602. 04551

work page 2026

[45] [45]

Morrison, Sheldon H

David R. Morrison, Sheldon H. Jacobson, Jason J. Sauppe, and Edward C. Sewell. Branch- and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016. doi: 10.1016/j.disopt.2016.01.005

work page doi:10.1016/j.disopt.2016.01.005 2016

[46] [46]

Version 11.0.4, 2025

MOSEK ApS.The MOSEK optimization toolbox for MATLAB manual. Version 11.0.4, 2025

work page 2025

[47] [47]

Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H

Simon D. Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H. McCormick. REALITrees: Rashomon ensemble active learning for interpretable trees, 2026. URLhttps://arxiv.org/abs/2603.22750

work page arXiv 2026

[48] [48]

Santander cus- tomer transaction prediction

Mercedes Piedra, Sohier Dane, and Soraya Jimenez. Santander cus- tomer transaction prediction. https://kaggle.com/competitions/ santander-customer-transaction-prediction, 2019. Kaggle competition

work page 2019

[49] [49]

Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. InProceedings of the 26th Annual International Conference on Machine Learning, pages 873–880, 2009

work page 2009

[50] [50]

Tyrrell Rockafellar.Convex Analysis

R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970

work page 1970

[51] [51]

Position: Amazing things come from having many good models

Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo Seltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Boner. Position: Amazing things come from having many good models. InProceedings of the 41st International Conference on Machine Learning, pages 42783–42795, 2024

work page 2024

[52] [52]

Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

Soroosh Shafiee and Fatma Kılınç-Karzan. Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

work page 2024

[53] [53]

Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016

Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016. 14

work page 2016

[54] [54]

Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

Berk Ustun and Cynthia Rudin. Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

work page 2019

[55] [55]

On the convexification of constrained quadratic optimization problems with indicator variables

Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. On the convexification of constrained quadratic optimization problems with indicator variables. InInteger Programming and Combinatorial Optimization, pages 433–447, 2020

work page 2020

[56] [56]

Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

work page 2022

[57] [57]

Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

work page 2020

[58] [58]

Exploring the whole Rashomon set of sparse decision trees

Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, and Cynthia Rudin. Exploring the whole Rashomon set of sparse decision trees. InAdvances in Neural Information Processing Systems, volume 35, 2022

work page 2022

[59] [59]

Exploring and interacting with the set of good sparse generalized additive models

Chudi Zhong, Zhi Chen, Jiachang Liu, Margo Seltzer, and Cynthia Rudin. Exploring and interacting with the set of good sparse generalized additive models. InAdvances in Neural Information Processing Systems, volume 36, pages 56673–56699, 2023. Appendix 7 Primer on Branch and Bound A branch-and-bound tree stores partial fixing decisions on the binary suppor...

work page 2023