pith. sign in

arxiv: 2605.22188 · v1 · pith:TKGSQGB4new · submitted 2026-05-21 · 💻 cs.LG · math.OC· stat.ML

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs

Pith reviewed 2026-05-22 07:36 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords branch and boundGPU parallelizationsparse generalized linear modelscardinality constraintsexact optimizationRashomon set
0
0 comments X

The pith

Batching branch-and-bound nodes on GPUs speeds up exact solutions for sparse generalized linear models by one to two orders of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that moves multiple nodes from the branch-and-bound search tree to the GPU for simultaneous processing. This addresses the usual bottleneck where nodes must be handled one after another on the CPU, especially for problems that mix discrete choices with nonlinear objectives like finding the best sparse model. A sympathetic reader would care because exact optimal solutions for these models have been computationally expensive, limiting their use on larger datasets or more complex problems. The approach keeps the exactness of branch and bound while gaining speed from GPU parallelism through careful data handling.

Core claim

We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework uses padding together with lightweight custom kernels to handle irregular node data structures, achieving one to two orders of magnitude speedups and zero optimality gap on challenging instances.

What carries the argument

The CPU-GPU framework that batches multiple branch-and-bound nodes for parallel GPU processing, using padding to manage varying data sizes across nodes.

If this is right

  • The method delivers exact optimal solutions without sacrificing optimality.
  • Speedups of 10 to 100 times allow solving larger or harder instances of cardinality-constrained GLMs.
  • The framework can be extended to collect the full Rashomon set of near-optimal models for additional statistical analysis.
  • Modular design allows integration with existing branch-and-bound solvers for other combinatorial problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the batching overhead stays low, similar GPU acceleration could apply to branch-and-bound in other domains like integer programming or scheduling.
  • Collecting the Rashomon set opens the door to model selection based on user-defined secondary criteria beyond just sparsity and fit.
  • Future work might test the framework on very large-scale problems to see where GPU memory limits the batch size.

Load-bearing premise

Padding the irregular data structures of different branch-and-bound nodes combined with custom kernels adds low enough overhead to produce real speedups on GPUs rather than slowing things down.

What would settle it

Running the framework on the same challenging instances and finding that it either takes more time than a pure CPU version or returns solutions with a positive optimality gap.

Figures

Figures reproduced from arXiv: 2605.22188 by Andrea Lodi, Jiachang Liu.

Figure 1
Figure 1. Figure 1: Hybrid CPU–GPU BnB framework for mixed-integer nonlinear programs, where [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of padding and column sorting for a batched proximal evaluation. Non-free [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of batch-size for GPU-parallel BnB for our method. Results are on synthetic [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Primer on branch and bound. Each open node stores partial fixing de￾cisions on z. A lower bound prunes a node (with red crosses) when it cannot beat the incumbent; otherwise the node is branched into two children by fixing one free variable, either with zj = 0 or zj = 1 The green node finds an incum￾bent, which becomes optimal when there are no open nodes. In this work, we process multiple nodes in batches… view at source ↗
Figure 5
Figure 5. Figure 5: Support-frequency summary for the saved top- [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Model-reliance summary for the saved top- [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Primary sparse-logistic objective versus AUC over the saved top- [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Primary sparse-logistic objective versus accuracy over the saved top- [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Accuracy versus AUC over the saved top-1000 Dorothea Rashomon set with k = 5. This plot shows whether the models preferred by a threshold-dependent metric are also preferred by a ranking metric. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The BnB search tree and the support trie are different objects. BnB edges are [PITH_FULL_IMAGE:figures/full_fig_p032_10.png] view at source ↗
read the original abstract

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU--GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a CPU-GPU framework for parallel branch-and-bound (BnB) to compute optimal k-sparse generalized linear models. Multiple BnB nodes are batched and processed on the GPU using padding combined with lightweight custom kernels to accommodate irregular node data (varying active sets, depths, and subproblem sizes). The framework is described as simple, generic, and modular. Experiments are reported to achieve one to two orders of magnitude speedups over sequential CPU baselines while maintaining zero optimality gap on challenging instances; the approach is also extended to enumerate the full Rashomon set for downstream statistical tasks such as variable-importance analysis.

Significance. If the reported speedups are robust, the work would meaningfully improve the practicality of exact solvers for cardinality-constrained GLMs, a class of problems that remains computationally demanding. The modular CPU-GPU design and the Rashomon-set extension are concrete strengths that could support follow-on statistical applications. The absence of parameter fitting or circular derivations in the algorithmic contribution keeps the burden of proof on empirical validation rather than on mathematical self-consistency.

major comments (2)
  1. [§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.
  2. [§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.
minor comments (2)
  1. [Figure 2] Figure 2 caption does not state the exact GPU kernel launch configuration or the memory layout after padding; adding these details would improve reproducibility.
  2. [§2.1] The GLM loss and gradient expressions in §2.1 are given in prose; an explicit equation block would clarify the precise form passed to the GPU kernels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the experimental and implementation details in our manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [§4 Experiments] §4 Experiments: the central claim of one-to-two-orders-of-magnitude speedups is presented without reporting instance dimensions (n, p, k), number of test problems, baseline solver versions, hardware specifications (CPU/GPU models and memory), or any measure of run-to-run variability. These omissions make it impossible to judge whether the gains are stable or sensitive to problem distribution.

    Authors: We agree that these details are necessary for reproducibility and to allow readers to assess robustness. In the revised manuscript we will expand §4 with a table (or expanded text) that reports the exact (n, p, k) values for every instance, the total number of test problems, the precise versions of all baseline solvers, the CPU and GPU hardware models together with memory capacities, and run-to-run variability (standard deviations or ranges over repeated executions). These additions will be placed in the main text or a dedicated appendix without changing any numerical results. revision: yes

  2. Referee: [§3.3 Implementation] §3.3 Implementation: the padding-plus-custom-kernel strategy for heterogeneous BnB nodes is described at a high level, yet no padding ratios, GPU occupancy figures, or kernel-level timing breakdowns are supplied. Without these quantities it is unclear whether the overhead of dummy elements and masking negates the expected parallelism on more varied node-size distributions, directly affecting the net-speedup claim.

    Authors: We acknowledge that quantitative implementation metrics would strengthen the presentation. We will revise §3.3 (and add a short appendix if needed) to report representative padding ratios observed across our node batches, estimated GPU occupancy for the custom kernels, and a coarse timing breakdown separating data-transfer, padding, and kernel execution. If full low-level profiler traces were not collected in the original experiments, we will state this limitation explicitly while supplying the best available figures from our runs. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical framework proposal with experimental validation

full rationale

The paper presents a modular CPU-GPU batching framework for parallel branch-and-bound on k-sparse GLM instances, using padding and lightweight custom kernels to manage irregular node data. Claims rest on experimental speedups (1-2 orders of magnitude) and zero optimality gap rather than any mathematical derivation, first-principles prediction, or equation that reduces outputs to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing steps. The work is implementation-driven and externally falsifiable via the reported benchmarks, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions of GPU memory coalescing and the correctness of branch-and-bound pruning rules; no new mathematical axioms or free parameters are introduced in the abstract.

pith-pipeline@v0.9.0 · 5701 in / 1102 out tokens · 40175 ms · 2026-05-22T07:36:05.363424+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009

    Tobias Achterberg. SCIP: solving constraint integer programs.Mathematical Programming Computation, 1(1):1–41, 2009. doi: 10.1007/s12532-008-0001-1

  2. [2]

    Achterberg, T

    Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. doi: 10.1016/j.orl.2004.04.002

  3. [3]

    Practical large-scale linear programming using primal-dual hybrid gradient

    David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, 11 and Warren Schudy. Practical large-scale linear programming using primal-dual hybrid gradient. InAdvances in Neural Information Processing Systems, pages 20243–20257, 2021

  4. [4]

    The UCI Machine Learning Repository, 2007

    Arthur Asuncion and David Newman. The UCI Machine Learning Repository, 2007

  5. [5]

    Safe screening rules forℓ0-regression from perspective relaxations

    Alper Atamturk and Andrés Gómez. Safe screening rules forℓ0-regression from perspective relaxations. InProceedings of the 37th International Conference on Machine Learning, pages 421–430, 2020

  6. [6]

    Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

    Alper Atamtürk and Andrés Gómez. Supermodularity and valid inequalities for quadratic optimization with indicators.Mathematical Programming, 201(1–2):295–338, 2023

  7. [7]

    Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

    Alper Atamtürk, Andrés Gómez, and Shaoning Han. Sparse and smooth signal estimation: Convexification of ℓ0-formulations.Journal of Machine Learning Research, 22(52):1–43, 2021

  8. [8]

    Near-optimal decision trees in a SPLIT second

    Varun Babbar, Hayden McTavish, Cynthia Rudin, and Margo Seltzer. Near-optimal decision trees in a SPLIT second. InInternational Conference on Machine Learning, 2025

  9. [9]

    SIAM, 2017

    Amir Beck.First-Order Methods in Optimization. SIAM, 2017

  10. [10]

    Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

    Dimitris Bertsimas and Wes Gurnee. Learning sparse nonlinear dynamics via mixed-integer optimization.Nonlinear Dynamics, 111(7):6585–6604, 2023

  11. [11]

    Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

    Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions.The Annals of Statistics, 48(1):300–323, 2020

  12. [12]

    Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

    Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse regression: Scalable algorithms and empirical performance.Statistical Science, 35(4):555–578, 2020

  13. [13]

    Batched first-order methods for parallel LP solving in MIP, 2026

    Nicolas Blin, Stefano Gualandi, Christopher Maes, Andrea Lodi, and Bartolomeo Stellato. Batched first-order methods for parallel LP solving in MIP, 2026. URLhttps://arxiv. org/abs/2601.21990

  14. [14]

    Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

    Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical Science, 16(3):199–231, 2001

  15. [15]

    Monotone regression: A simple and fast O (n) PAVA implementation

    Frank MTA Busing. Monotone regression: A simple and fast O (n) PAVA implementation. Journal of Statistical Software, 102(Code Snippet 1):1–25, 2022

  16. [16]

    Convex programming for disjunctive convex optimization

    Sebastián Ceria and João Soares. Convex programming for disjunctive convex optimization. Mathematical Programming, 86(3):595–614, 1999

  17. [17]

    GPU-accelerated primal heuristics for mixed integer programming, 2025

    Akif Çördük, Piotr Sielski, Alice Boucher, and Kumar Aatish. GPU-accelerated primal heuristics for mixed integer programming, 2025. URLhttps://arxiv.org/abs/2510.20499

  18. [18]

    On the power of linear programming for K-means clustering

    Antonio De Rosa, Aida Khajavirad, and Yakun Wang. On the power of linear programming for K-means clustering, 2024. URLhttps://arxiv.org/abs/2402.01061

  19. [19]

    Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

    Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning sparse classifiers: Continuous and mixed integer optimization perspectives.Journal of Machine Learning Research, 22(135):1–47, 2021

  20. [20]

    Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

    Jiayun Dong and Cynthia Rudin. Exploring the cloud of variable importance for the set of all good models.Nature Machine Intelligence, 2:810–824, 2020

  21. [21]

    Jon Donnelly, Srikar Katta, Cynthia Rudin, and Edward P. Browne. The Rashomon importance distribution: Getting RID of unstable, single model-based variable importance. InAdvances in Neural Information Processing Systems, volume 36, 2023. 12

  22. [22]

    Rashomon sets for prototypical-part networks: Editing interpretable models in real-time

    Jon Donnelly, Zhicheng Guo, Alina Jade Barnett, Hayden McTavish, Chaofan Chen, and Cynthia Rudin. Rashomon sets for prototypical-part networks: Editing interpretable models in real-time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  23. [23]

    Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously.Journal of Machine Learning Research, 20(177):1–81, 2019

  24. [24]

    Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

    Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs.Mathematical Programming, 106(2):225–236, 2006

  25. [25]

    Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

    Oktay Günlük and Jeff Linderoth. Perspective reformulations of mixed integer nonlinear programs with indicator variables.Mathematical Programming, 124(1–2):183–205, 2010

  26. [26]

    Gurobi Optimizer Reference Manual, 2025

    Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2025

  27. [27]

    A new branch-and-bound pruning framework forℓ0-regularized problems

    Théo Guyard, Cédric Herzet, Clément Elvira, and Ayse-Nur Arslan. A new branch-and-bound pruning framework forℓ0-regularized problems. InProceedings of the 41st International Conference on Machine Learning, pages 48077–48096, 2024

  28. [28]

    Resolving Predictive Multiplicity for the Rashomon Set

    Parian Haghighat, Hadis Anahideh, and Cynthia Rudin. Resolving predictive multiplicity for the Rashomon set, 2026. URLhttps://arxiv.org/abs/2601.09071

  29. [29]

    Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

    Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming algorithms on GPU,

  30. [30]

    URLhttps://arxiv.org/abs/2407.15049

  31. [31]

    Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

    Hussein Hazimeh and Rahul Mazumder. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms.Operations Research, 68(5):1517–1537, 2020

  32. [32]

    Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

    Hussein Hazimeh, Rahul Mazumder, and Ali Saab. Sparse regression at scale: Branch- and-bound rooted in first-order optimization.Mathematical Programming, 196(1):347–388, 2022

  33. [33]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, volume 25, pages 1097–1105, 2012

  34. [34]

    A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025

    Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025. URLhttps://arxiv.org/ abs/2505.00311

  35. [35]

    J. T. Linderoth and M. W. P. Savelsbergh. A computational study of search strategies for mixed integer programming.INFORMS Journal on Computing, 11(2):173–187, 1999. doi: 10.1287/ijoc.11.2.173

  36. [36]

    FasterRisk: Fast and accurate interpretable risk scores

    Jiachang Liu, Chudi Zhong, Boxuan Li, Margo Seltzer, and Cynthia Rudin. FasterRisk: Fast and accurate interpretable risk scores. InAdvances in Neural Information Processing Systems, volume 35, 2022

  37. [37]

    Fast sparse classification for generalized linear and additive models

    Jiachang Liu, Chudi Zhong, Margo Seltzer, and Cynthia Rudin. Fast sparse classification for generalized linear and additive models. InProceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 9304–9333, 2022

  38. [38]

    OKRidge: Scalable optimal k-sparse ridge regression

    Jiachang Liu, Sam Rosen, Chudi Zhong, and Cynthia Rudin. OKRidge: Scalable optimal k-sparse ridge regression. InAdvances in Neural Information Processing Systems, pages 41076–41258, 2023. 13

  39. [39]

    Scalable first-order method for certifying optimal k-sparse GLMs

    Jiachang Liu, Soroosh Shafiee, and Andrea Lodi. Scalable first-order method for certifying optimal k-sparse GLMs. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Le...

  40. [40]

    Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026

    Jiachang Liu, Andrea Lodi, and Soroosh Shafiee. Gpu-friendly and linearly convergent first-order methods for certifying optimalk-sparse glms, 2026. URLhttps://arxiv.org/ abs/2603.01306

  41. [41]

    arXiv preprint arXiv:2311.07710

    Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming, 2025. URLhttps://arxiv.org/abs/2311.07710

  42. [42]

    cuPDLP-C : A strengthened implementation of cuPDLP for linear programming by C language

    Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language, 2024. URLhttps://arxiv.org/abs/2312.14832

  43. [43]

    Leveraging predictive equivalence in decision trees

    Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, and Cynthia Rudin. Leveraging predictive equivalence in decision trees. InInternational Conference on Machine Learning, 2025

  44. [44]

    A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026

    Xiang Meng, Ryan Lucas, and Rahul Mazumder. A GPU-accelerated nonlinear branch- and-bound framework for sparse linear models, 2026. URLhttps://arxiv.org/abs/2602. 04551

  45. [45]

    Morrison, Sheldon H

    David R. Morrison, Sheldon H. Jacobson, Jason J. Sauppe, and Edward C. Sewell. Branch- and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016. doi: 10.1016/j.disopt.2016.01.005

  46. [46]

    Version 11.0.4, 2025

    MOSEK ApS.The MOSEK optimization toolbox for MATLAB manual. Version 11.0.4, 2025

  47. [47]

    Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H

    Simon D. Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, and Tyler H. McCormick. REALITrees: Rashomon ensemble active learning for interpretable trees, 2026. URLhttps://arxiv.org/abs/2603.22750

  48. [48]

    Santander cus- tomer transaction prediction

    Mercedes Piedra, Sohier Dane, and Soraya Jimenez. Santander cus- tomer transaction prediction. https://kaggle.com/competitions/ santander-customer-transaction-prediction, 2019. Kaggle competition

  49. [49]

    Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. InProceedings of the 26th Annual International Conference on Machine Learning, pages 873–880, 2009

  50. [50]

    Tyrrell Rockafellar.Convex Analysis

    R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970

  51. [51]

    Position: Amazing things come from having many good models

    Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo Seltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Boner. Position: Amazing things come from having many good models. InProceedings of the 41st International Conference on Machine Learning, pages 42783–42795, 2024

  52. [52]

    Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

    Soroosh Shafiee and Fatma Kılınç-Karzan. Constrained optimization of rank-one functions with indicator variables.Mathematical Programming, 208(1–2):533–579, 2024

  53. [53]

    Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016

    Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems.Machine Learning, 102(3):349–391, 2016. 14

  54. [54]

    Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

    Berk Ustun and Cynthia Rudin. Learning optimized risk scores.Journal of Machine Learning Research, 20(150):1–75, 2019

  55. [55]

    On the convexification of constrained quadratic optimization problems with indicator variables

    Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. On the convexification of constrained quadratic optimization problems with indicator variables. InInteger Programming and Combinatorial Optimization, pages 433–447, 2020

  56. [56]

    Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

    Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables.Mathematical Programming, 192(1): 57–88, 2022

  57. [57]

    Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

    Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression.SIAM Journal on Optimization, 30(4):3359–3386, 2020

  58. [58]

    Exploring the whole Rashomon set of sparse decision trees

    Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, and Cynthia Rudin. Exploring the whole Rashomon set of sparse decision trees. InAdvances in Neural Information Processing Systems, volume 35, 2022

  59. [59]

    Exploring and interacting with the set of good sparse generalized additive models

    Chudi Zhong, Zhi Chen, Jiachang Liu, Margo Seltzer, and Cynthia Rudin. Exploring and interacting with the set of good sparse generalized additive models. InAdvances in Neural Information Processing Systems, volume 36, pages 56673–56699, 2023. Appendix 7 Primer on Branch and Bound A branch-and-bound tree stores partial fixing decisions on the binary suppor...