pith. sign in

arxiv: 1907.05912 · v2 · pith:7LW5NAOEnew · submitted 2019-07-12 · 💻 cs.LG · cs.AI

MIPaaL: Mixed Integer Program as a Layer

Pith reviewed 2026-05-24 22:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords decision-focused learningmixed integer programmingdifferentiable optimizationcutting planesend-to-end learningcombinatorial optimizationmachine learning for optimization
0
0 comments X

The pith

Differentiating through mixed integer programs allows predictive models to be trained directly on downstream decision quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how to embed mixed integer linear programs as differentiable layers inside machine learning pipelines. This extends decision-focused learning from linear programs to the wider class of MIPs that include discrete variables and arbitrary linear constraints. The key step is to differentiate through an exact cutting-planes solver that iteratively tightens a continuous relaxation until an integral solution appears. On several real-world domains the resulting end-to-end models produce better decisions than either a two-stage predict-then-optimize pipeline or a decision-focused model trained only on the LP relaxation.

Core claim

By employing a cutting planes solution approach, which is an exact algorithm that iteratively adds constraints to a continuous relaxation of the problem until an integral solution is found, it is possible to differentiate through a MIP. This enables decision-focused learning for the broad class of problems encoded as mixed integer linear programs, supporting arbitrary linear constraints over discrete and continuous variables, and yields higher-quality decisions than treating prediction and prescription separately or than applying decision-focused learning only to the LP relaxation.

What carries the argument

Cutting-planes solution approach that iteratively adds constraints to the continuous relaxation until integrality is reached, thereby permitting gradient flow back through the MIP solver.

If this is right

  • MIPs encoding discrete choices can now participate directly in end-to-end gradient training.
  • Predictive models can be optimized for the quality of the MIP solutions they induce rather than for prediction accuracy alone.
  • The approach applies to any problem expressible with arbitrary linear constraints on mixed discrete and continuous variables.
  • On evaluated domains the method outperforms both separate prediction-plus-optimization and decision-focused learning on the LP relaxation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may allow decision-focused training on larger combinatorial problems once more efficient cutting-plane implementations are used inside the layer.
  • Similar differentiation could be attempted with other exact MIP solvers such as branch-and-bound if their solution paths can be made differentiable.
  • The method opens a route to jointly learning prediction models and the structure of the MIP constraints themselves when both are uncertain.

Load-bearing premise

Gradients computed through the iterative cutting-planes process remain stable and informative enough to train the upstream predictive model effectively.

What would settle it

If training with the MIP layer produces worse final decisions than the LP-relaxation baseline on the same real-world instances, the claimed advantage of the differentiable MIP approach would not hold.

Figures

Figures reproduced from arXiv: 1907.05912 by Aaron Ferber, Bistra Dilkina, Bryan Wilder, Milind Tambe.

Figure 1
Figure 1. Figure 1: Scatter plots of predicted coefficients vs ground truth returns. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
read the original abstract

Machine learning components commonly appear in larger decision-making pipelines; however, the model training process typically focuses only on a loss that measures accuracy between predicted values and ground truth values. Decision-focused learning explicitly integrates the downstream decision problem when training the predictive model, in order to optimize the quality of decisions induced by the predictions. It has been successfully applied to several limited combinatorial problem classes, such as those that can be expressed as linear programs (LP), and submodular optimization. However, these previous applications have uniformly focused on problems from specific classes with simple constraints. Here, we enable decision-focused learning for the broad class of problems that can be encoded as a Mixed Integer Linear Program (MIP), hence supporting arbitrary linear constraints over discrete and continuous variables. We show how to differentiate through a MIP by employing a cutting planes solution approach, which is an exact algorithm that iteratively adds constraints to a continuous relaxation of the problem until an integral solution is found. We evaluate our new end-to-end approach on several real world domains and show that it outperforms the standard two phase approaches that treat prediction and prescription separately, as well as a baseline approach of simply applying decision-focused learning to the LP relaxation of the MIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents MIPaaL, a technique to embed mixed-integer linear programs as differentiable layers within neural networks for decision-focused learning. It differentiates through an exact cutting-planes solver by back-propagating through the sequence of LP relaxations solved during cut generation, and reports that the resulting end-to-end model outperforms both separate prediction-plus-optimization pipelines and a decision-focused LP-relaxation baseline on several real-world tasks.

Significance. If the differentiation is rigorously defined and the gradients remain informative, the work would meaningfully extend decision-focused learning beyond LPs and submodular problems to the much larger class of MIPs. The empirical claim of outperformance over both two-stage and LP-relaxation baselines, if reproducible, would constitute a practical contribution.

major comments (1)
  1. [Abstract (method description)] The central technical claim—that gradients can be obtained by back-propagation through the iterative cutting-plane process—does not address the dependence of cut generation on the current fractional vertex, which itself depends on the upstream predictions. Because the sequence of active cuts can change discontinuously, the mapping from predictions to the final integral solution is piecewise and the pieces can switch abruptly; no argument is supplied that the resulting subgradient remains well-defined or useful for training across these switches.
minor comments (1)
  1. [Abstract] The abstract refers to evaluation on 'several real world domains' without naming the domains or providing any quantitative results, making it impossible to assess the strength of the empirical support from the provided text alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting an important subtlety in the differentiability argument. We address the concern directly below and will incorporate clarifications into a revised manuscript.

read point-by-point responses
  1. Referee: [Abstract (method description)] The central technical claim—that gradients can be obtained by back-propagation through the iterative cutting-plane process—does not address the dependence of cut generation on the current fractional vertex, which itself depends on the upstream predictions. Because the sequence of active cuts can change discontinuously, the mapping from predictions to the final integral solution is piecewise and the pieces can switch abruptly; no argument is supplied that the resulting subgradient remains well-defined or useful for training across these switches.

    Authors: We agree that cut selection depends on the current fractional vertex and that this dependence can cause the active-cut sequence to change discontinuously with the predictions. Our differentiation procedure unrolls the cutting-plane loop and applies the standard LP differentiation operator (via the KKT conditions of each successive relaxation) to the sequence of LPs that were actually solved for a given prediction. For any fixed sequence of cuts the end-to-end map is piecewise linear and therefore admits a well-defined subgradient almost everywhere. At the (measure-zero) points where the sequence jumps we simply retain the subgradient corresponding to the cuts that were active in the forward pass; this is the natural extension of the subgradient used by other decision-focused methods that also encounter combinatorial choices (e.g., active-set changes in quadratic programs). While a fully rigorous proof that these subgradients remain informative for every possible MIP is beyond the scope of the paper, the empirical results on multiple real-world domains demonstrate that the resulting gradients are sufficiently informative to outperform both two-stage and LP-relaxation baselines. We will add an expanded discussion of this subgradient construction, its relation to existing work on non-smooth optimization, and the practical evidence of its utility in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation introduces independent differentiation technique

full rationale

The paper's core contribution is a novel method to differentiate through MIPs by backpropagating through the cutting-planes algorithm's sequence of LP relaxations until integrality. This is presented as an extension of prior decision-focused learning work to the MIP setting, with the technical steps (iterative constraint addition and gradient computation) described directly rather than defined in terms of the target predictions or fitted parameters. No load-bearing step reduces by construction to a self-citation, ansatz, or renaming; the evaluations compare against baselines including LP relaxation, confirming the claim rests on the new algorithmic construction rather than circular reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable; the contribution is a differentiation technique rather than new modeling assumptions.

pith-pipeline@v0.9.0 · 5744 in / 1183 out tokens · 27363 ms · 2026-05-24T22:11:18.243956+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

  1. [1]

    A supervised machine learning approach to variable branching in branch-and-bound

    Alejandro Marcos Alvarez, Quentin Louveaux, and Louis Wehenkel. A supervised machine learning approach to variable branching in branch-and-bound. In ECML, 2014

  2. [2]

    Amos and J

    B. Amos and J. Z. Kolter. Optnet: Differentiable optimization as a layer in neural networks. In ICML, 2017

  3. [3]

    Gomory cuts revisited

    Egon Balas, Sebastian Ceria, Gérard Cornuéjols, and N Natraj. Gomory cuts revisited. Opera- tions Research Letters, 19(1):1–9, 1996

  4. [4]

    Learning to branch

    Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In ICML, 2018

  5. [5]

    Diver- sity constraints in public housing allocation

    Nawal Benabbou, Mithun Chakraborty, Xuan-Vinh Ho, Jakub Sliwinski, and Yair Zick. Diver- sity constraints in public housing allocation. In AAMAS, 2018

  6. [6]

    Using a financial training criterion rather than a prediction criterion

    Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. Interna- tional Journal of Neural Systems, 8(04):433–443, 1997

  7. [7]

    Machine learning for combinatorial optimization: a methodological tour d’horizon

    Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv preprint arXiv:1811.06128, 2018

  8. [8]

    Portfolio construction through mixed-integer programming at grantham, mayo, van otterloo and company.Interfaces, 29(1):49– 66, 1999

    Dimitris Bertsimas, Christopher Darnell, and Robert Soucy. Portfolio construction through mixed-integer programming at grantham, mayo, van otterloo and company.Interfaces, 29(1):49– 66, 1999

  9. [9]

    Cutting planes from extended lp formulations

    Merve Bodur, Sanjeeb Dash, and Oktay Günlük. Cutting planes from extended lp formulations. Mathematical Programming, 161(1-2):159–192, 2017

  10. [10]

    Optimizing intensive care unit discharge decisions with patient readmissions

    Carri W Chan, Vivek F Farias, Nicholas Bambos, and Gabriel J Escobar. Optimizing intensive care unit discharge decisions with patient readmissions. Operations research, 60(6):1323–1341, 2012

  11. [11]

    An anatomy of trading strategies

    Jennifer Conrad and Gautam Kaul. An anatomy of trading strategies. The Review of Financial Studies, 11(3):489–519, 1998

  12. [12]

    Dobbs, Oktay Günlük, Tomasz J

    Sanjeeb Dash, Neil B. Dobbs, Oktay Günlük, Tomasz J. Nowicki, and Grzegorz M. ´Swirszcz. Lattice-free sets, multi-branch split disjunctions, and mixed-integer programming.Mathematical Programming, 145(1):483–508, 2014

  13. [13]

    End-to-end differentiable physics for learning and control

    Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, pages 7178–7189, 2018

  14. [14]

    Effect of the accuracy of price forecasting on profit in a price based unit commitment

    Erik Delarue, Pieterjan Van Den Bosch, and William D’haeseleer. Effect of the accuracy of price forecasting on profit in a price based unit commitment. Electric power systems research, 80(10):1306–1313, 2010

  15. [15]

    Position-indexed formulations for kidney exchange

    John P Dickerson, David F Manlove, Benjamin Plaut, Tuomas Sandholm, and James Trim- ble. Position-indexed formulations for kidney exchange. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 25–42. ACM, 2016

  16. [16]

    predict, then optimize

    Adam N Elmachtoub and Paul Grigas. Smart" predict, then optimize". arXiv preprint arXiv:1710.08005, 2017

  17. [17]

    An algorithm for the mixed integer problem

    Ralph Gomory. An algorithm for the mixed integer problem. Technical report, RAND CORP SANTA MONICA CA, 1960

  18. [18]

    Roy, and Xiang Yan

    Yi hao Kao, Benjamin V . Roy, and Xiang Yan. Directed regression. In Y . Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, pages 889–897. Curran Associates, Inc., 2009

  19. [19]

    Learning to search in branch and bound algorithms

    He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Advances in Neural Information Processing Systems, pages 3293–3301, 2014. 10

  20. [20]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37 , ICML’15, pages 448–456. JMLR.org, 2015

  21. [21]

    Minima of functions of several variables with inequalities as side constraints

    William Karush. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, 1939

  22. [22]

    A fast and high quality multilevel scheme for partitioning irregular graphs

    George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998

  23. [23]

    Learning combinatorial optimization algorithms over graphs

    Elias B Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017

  24. [24]

    Learning to run heuristics in tree search

    Elias B Khalil, Bistra Dilkina, George L Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In IJCAI, pages 659–666, 2017

  25. [25]

    Learning to branch in mixed integer programming

    Elias B Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  27. [27]

    Attention, learn to solve routing problems! In International Conference on Learning Representations, 2019

    Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2019

  28. [28]

    What game are we playing? end-to-end learning in normal and extensive form games

    Chun Kai Ling, Fei Fang, and J Zico Kolter. What game are we playing? end-to-end learning in normal and extensive form games. IJCAI, 2018

  29. [29]

    On learning and branching: a survey

    Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25(2):207–236, 2017

  30. [30]

    Rectifier nonlinearities improve neural network acoustic models

    Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013

  31. [31]

    Portfolio selection

    Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952

  32. [32]

    Optimal residential load control with price prediction in real-time electricity pricing environments

    Amir-Hamed Mohsenian-Rad and Alberto Leon-Garcia. Optimal residential load control with price prediction in real-time electricity pricing environments. IEEE Trans. Smart Grid, 1(2):120–133, 2010

  33. [33]

    Reinforcement learning for solving the vehicle routing problem

    Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9839–9849, 2018

  34. [34]

    Integer programming: The global impact

    George L Nemhauser. Integer programming: The global impact. ISyE DOS Optimization Seminar presentation at Georgia Tech http://hdl.handle.net/1853/49829, 2013

  35. [35]

    Data analysis and optimization for (citi) bike sharing

    Eoin O’Mahony and David B Shmoys. Data analysis and optimization for (citi) bike sharing. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015

  36. [36]

    Automatic differentiation in PyTorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017

  37. [37]

    Decision-focused learning of adversary behavior in security games

    Andrew Perrault, Bryan Wilder, Eric Ewing, Aditya Mate, Bistra Dilkina, and Milind Tambe. Decision-focused learning of adversary behavior in security games. arXiv preprint arXiv:1903.00958, 2019

  38. [38]

    Various end-of-day data

    Quandl. Various end-of-day data. https://www.quandl.com/, 2019. 11

  39. [39]

    Guiding combinatorial optimization with uct

    Ashish Sabharwal, Horst Samulowitz, and Chandra Reddy. Guiding combinatorial optimization with uct. InInternational Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 356–361. Springer, 2012

  40. [40]

    Ipknot: fast and accurate prediction of rna secondary structures with pseudoknots using integer programming

    Kengo Sato, Yuki Kato, Michiaki Hamada, Tatsuya Akutsu, and Kiyoshi Asai. Ipknot: fast and accurate prediction of rna secondary structures with pseudoknots using integer programming. Bioinformatics, 27(13):i85–i93, 2011

  41. [41]

    Collective classification in network data

    Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi- Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008

  42. [42]

    Dropout: a simple way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1):1929–1958, 2014

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1):1929–1958, 2014

  43. [43]

    Nurse scheduling using integer linear programming and constraint programming

    Lorraine Trilling, Alain Guinet, and Dominiue Le Magny. Nurse scheduling using integer linear programming and constraint programming. IFAC Proceedings Volumes, 39(3):671–676, 2006

  44. [44]

    Pointer networks

    Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015

  45. [45]

    Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver

    Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pages 6545–6554, 2019

  46. [46]

    Scheduling nursing personnel according to nursing preference: A mathe- matical programming approach

    D Michael Warner. Scheduling nursing personnel according to nursing preference: A mathe- matical programming approach. Operations Research, 24(5):842–856, 1976

  47. [47]

    Melding the data-decisions pipeline: Decision- focused learning for combinatorial optimization

    Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions pipeline: Decision- focused learning for combinatorial optimization. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019

  48. [48]

    Integer and combinatorial optimization

    Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization. John Wiley & Sons, 2014

  49. [49]

    Deep reinforcement learning for green security game with online information

    Lantao Yu, Yi Wu, Rohit Singh, Lucas Joppa, and Fei Fang. Deep reinforcement learning for green security game with online information. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019. A Portfolio Optimization Specifics A.1 Data specification We simulate our approach on this problem setting we use historical price and volume data d...