Relaxation-Informed Training of Neural Network Surrogate Models
Pith reviewed 2026-05-08 10:53 UTC · model grok-4.3
The pith
Penalizing LP relaxation gaps and big-M constants during ReLU network training makes embedded MILPs solve up to four orders of magnitude faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adding regularizers that penalize the LP relaxation gap of the MILP encoding and the associated big-M constants at training points produces ReLU surrogate networks whose MILP encodings remain accurate yet admit far tighter continuous relaxations, cutting solve times by up to four orders of magnitude relative to unregularized baselines.
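To make the object of the claim concrete, here is a minimal sketch of the standard big-M encoding of a single ReLU neuron, in the form common in the NN-MILP literature (the paper's exact notation may differ). Given finite pre-activation bounds $L \le w^\top x + b \le U$ with $L < 0 < U$ (an unstable neuron), the output $y = \max(0, w^\top x + b)$ is modeled with one binary variable $z$:
\[
y \ge w^\top x + b, \qquad y \ge 0, \qquad y \le w^\top x + b - L\,(1 - z), \qquad y \le U z, \qquad z \in \{0, 1\}.
\]
The LP relaxation replaces $z \in \{0,1\}$ by $z \in [0,1]$; the looser the bounds $L$ and $U$ (the big-M constants), the weaker that relaxation, which is exactly what the proposed regularizers act on during training.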
What carries the argument
The LP relaxation gap regularizer, whose gradient with respect to network parameters is obtained directly from the dual solution of the LP relaxation at each training point.
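A hedged, self-contained illustration of that dual-based gradient on a single relaxed ReLU neuron, using SciPy's HiGHS interface. This sketches the general LP-sensitivity idea rather than the paper's implementation: it assumes a recent SciPy in which linprog(..., method="highs") exposes constraint duals via res.ineqlin.marginals, and it only differentiates through right-hand-side dependence (parameters that enter the constraint matrix pick up an extra term involving the primal solution).

import numpy as np
from scipy.optimize import linprog

# Toy single neuron (not the paper's code): pre-activation h = w^T x + b with
# bounds L <= h <= U, L < 0 < U (an "unstable" neuron); exact output is max(0, h).
h, L, U = -1.0, -2.0, 3.0

def relaxed_neuron(h):
    # LP relaxation of the big-M encoding, with the binary z relaxed to [0, 1].
    # Variables [y, z]; maximize y  <=>  minimize -y.
    c = [-1.0, 0.0]
    A_ub = [[-1.0, 0.0],   # -y       <= -h      (y >= h)
            [ 1.0,  -L],   #  y - L z <= h - L   (y <= h - L(1 - z))
            [ 1.0,  -U]]   #  y - U z <= 0       (y <= U z)
    b_ub = [-h, h - L, 0.0]
    return linprog(c, A_ub=A_ub, b_ub=b_ub,
                   bounds=[(0.0, None), (0.0, 1.0)], method="highs")

res = relaxed_neuron(h)
ub = -res.fun                          # LP upper bound on the neuron output
gap = ub - max(0.0, h)                 # per-sample relaxation gap at this input
# marginals = d(optimal value)/d(rhs); chain through b_ub = [-h, h - L, 0]:
dub_dh = -(res.ineqlin.marginals @ np.array([-1.0, 1.0, 0.0]))
# Finite-difference check of the dual-based derivative.
eps = 1e-6
fd = (-relaxed_neuron(h + eps).fun + relaxed_neuron(h - eps).fun) / (2 * eps)
print(f"gap = {gap:.3f}, dual-based d(ub)/dh = {dub_dh:.3f}, finite diff = {fd:.3f}")
# Chain rule then yields gradients w.r.t. the weights and bias, since dh/dw = x and dh/db = 1.

For this instance the gap is 0.6 and both derivative estimates agree at 0.6, so the dual solution alone recovers the sensitivity without differentiating through the solver.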
If this is right
- MILP solve times drop by up to four orders of magnitude on non-convex benchmark functions.
- Competitive surrogate accuracy is retained on the original prediction task.
- The approach succeeds on quantile neural network surrogates inside two-stage stochastic programs.
- Combining the big-M, unstable-neuron, and gap regularizers approximates the full total derivative of the relaxation gap (see the sketch after this list).
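One plausible reading of that total-derivative claim, with notation reconstructed from the abstract rather than quoted from the paper: write the per-sample gap $g$ as depending on the parameters $\theta$ both directly and through the activation bounds $(L, U)$ they induce, so that
\[
\frac{\mathrm{d} g}{\mathrm{d} \theta}
= \underbrace{\frac{\partial g}{\partial \theta}}_{\text{direct, from the LP duals}}
+ \underbrace{\frac{\partial g}{\partial (L, U)}\,\frac{\mathrm{d} (L, U)}{\mathrm{d} \theta}}_{\text{indirect, via the big-M bounds}},
\]
with the gap regularizer supplying the direct term and the bound-based and unstable-neuron regularizers standing in for the indirect term.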
Where Pith is reading between the lines
- The same regularizers might be applied when embedding networks inside other discrete optimization frameworks beyond standard MILP encodings.
- If the training distribution is too narrow, the tractability gains may vanish on out-of-sample optimization queries.
- The method could be paired with architecture search that also minimizes the number of binary variables in the encoding.
- Scaling the approach to deeper or wider networks may require efficient warm-starting of the per-sample LP solves.
Load-bearing premise
Regularizing relaxation properties only at the finite set of training points will produce networks whose MILP encodings stay tractable for new points encountered during later optimization.
What would settle it
Train the regularized models, then solve the MILPs on a fresh set of input points drawn from the same distribution, and observe that the solve-time reductions relative to the unregularized baseline persist while predictive accuracy does not degrade.
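A minimal sketch of the comparison such an experiment would report, operating on solve times collected at fresh query points; the arrays below are synthetic placeholders, not the paper's data.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder solve times (seconds) at out-of-sample query points; in the real
# experiment these would come from solving the actual MILP instances.
t_regularized = rng.lognormal(mean=-2.0, sigma=0.5, size=50)
t_baseline = rng.lognormal(mean=4.0, sigma=0.8, size=50)

speedup = t_baseline / t_regularized
print(f"median speedup: {np.median(speedup):.1e}x "
      f"(IQR {np.percentile(speedup, 25):.1e}x to {np.percentile(speedup, 75):.1e}x)")

# The claim survives if this out-of-sample speedup stays near the in-sample figure
# (up to ~1e4x in the paper) and the surrogate's held-out accuracy does not degrade.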
Original abstract
ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, i.e., the number of binary variables in associated formulations and the tightness of the continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to directly control them. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of MILP formulations and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its associated gradient and provide an implementation from LP dual variables without custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes training regularizers for ReLU neural networks used as surrogate models to improve tractability when the networks are encoded exactly as MILPs. Bound-based regularizers penalize large big-M constants and unstable neurons; an LP relaxation gap regularizer is introduced whose gradient is obtained from LP dual variables without custom autodiff. Experiments on non-convex benchmark functions and a two-stage stochastic program with quantile surrogates report MILP solve-time reductions of up to four orders of magnitude relative to an unregularized baseline while preserving predictive accuracy.
Significance. If the reported speedups prove robust, the work would be significant for optimization over learned models, directly targeting the computational bottleneck of NN-MILP embeddings by shaping the network during training rather than post-processing. The derivation of the LP-gap gradient from dual variables is a clean technical contribution that avoids custom differentiation machinery and could transfer to other bilevel or relaxation-based training settings. The paper supplies explicit motivation, derivations, and reproducible experimental protocols on standard benchmarks.
major comments (3)
- [Methods (regularizer definitions) and Experiments] The bound-based and LP-gap regularizers are defined and minimized exclusively on the finite training set (methods section on regularizer definitions and gradient computation). The central claim that these yield tractable MILPs rests on the untested assumption that improved relaxation tightness and stable activation patterns generalize to the (unknown) points visited by branch-and-bound. No analysis, additional sampling, or out-of-sample gap measurements are provided to support this extrapolation, which directly affects the reported four-order-of-magnitude speedups.
- [Experiments] Experimental results report speedups “up to four orders of magnitude” on benchmarks and the stochastic program, yet supply no statistical significance tests, variance across random seeds or instances, sensitivity analysis with respect to the regularization coefficients, or comparison against stronger baselines (e.g., post-training bound tightening or alternative big-M formulations). These omissions make it impossible to judge whether the gains are reliable or merely artifacts of particular hyper-parameter choices.
- [Methods (LP-gap regularizer and total-derivative approximation)] The claim that combining the proposed regularizers “approximates the full total derivative of the LP gap” is stated without a quantitative validation (e.g., comparison of the approximated gradient against an exact total-derivative computation on a small network). Because the approximation is central to the method’s justification, a concrete error metric or small-scale verification would strengthen the argument.
minor comments (3)
- [Methods] Notation for the LP dual variables and the per-sample gap expression should be made fully explicit so that the gradient implementation can be reproduced from the text alone.
- [Experiments] The stochastic-programming experiment uses quantile neural network surrogates; additional detail on how the quantile loss and the resulting piecewise-linear encoding interact with the MILP formulation would aid clarity.
- [Experiments] Figures showing solve-time distributions would benefit from consistent log-scale axes and error bars or box plots to convey variability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below, indicating the revisions we plan to incorporate.
Point-by-point responses
Referee: [Methods (regularizer definitions) and Experiments] The bound-based and LP-gap regularizers are defined and minimized exclusively on the finite training set (methods section on regularizer definitions and gradient computation). The central claim that these yield tractable MILPs rests on the untested assumption that improved relaxation tightness and stable activation patterns generalize to the (unknown) points visited by branch-and-bound. No analysis, additional sampling, or out-of-sample gap measurements are provided to support this extrapolation, which directly affects the reported four-order-of-magnitude speedups.
Authors: We agree that the regularizers are computed solely on the training set and that direct evidence of generalization to branch-and-bound nodes is valuable. The reported speedups are nevertheless measured on the actual MILP instances arising in optimization, which involve out-of-sample evaluations. In the revision we will add out-of-sample LP-gap and unstable-neuron statistics on a held-out validation set drawn from the optimization domain, together with a brief discussion of how training-set regularization influences the points visited during search. revision: partial
Referee: [Experiments] Experimental results report speedups “up to four orders of magnitude” on benchmarks and the stochastic program, yet supply no statistical significance tests, variance across random seeds or instances, sensitivity analysis with respect to the regularization coefficients, or comparison against stronger baselines (e.g., post-training bound tightening or alternative big-M formulations). These omissions make it impossible to judge whether the gains are reliable or merely artifacts of particular hyper-parameter choices.
Authors: We accept that the experimental presentation lacks statistical rigor and sensitivity analysis. The revised manuscript will report mean and standard deviation of solve times across multiple random seeds for both training and MILP solving. We will also include sensitivity plots with respect to the regularization coefficients. For baselines we will add a discussion of post-training bound tightening as a complementary technique and, on the smaller benchmark instances, provide direct numerical comparisons where computationally feasible. revision: yes
Referee: [Methods (LP-gap regularizer and total-derivative approximation)] The claim that combining the proposed regularizers “approximates the full total derivative of the LP gap” is stated without a quantitative validation (e.g., comparison of the approximated gradient against an exact total-derivative computation on a small network). Because the approximation is central to the method’s justification, a concrete error metric or small-scale verification would strengthen the argument.
Authors: We thank the referee for this observation. The bound-based terms are intended to capture indirect sensitivities while the LP-gap term captures the direct effect. In the revision we will add a small-scale verification subsection: for a toy network we compute the exact total derivative of the LP gap via finite differences and compare it to the gradient supplied by the combined regularizers, reporting the relative error as a quantitative metric. revision: yes
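A minimal, generic version of the proposed verification, hedged because the actual gap function and combined-regularizer gradient belong to the paper; gap_fn and approx_grad_fn below are stand-ins for the per-sample LP gap as a function of the flattened parameters and for the gradient assembled from the regularizers.

import numpy as np

def relative_gradient_error(gap_fn, approx_grad_fn, theta, eps=1e-5):
    # Compare an approximate gradient against central finite differences of gap_fn.
    theta = np.asarray(theta, dtype=float)
    fd = np.empty_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        fd[i] = (gap_fn(theta + e) - gap_fn(theta - e)) / (2 * eps)
    g = np.asarray(approx_grad_fn(theta), dtype=float)
    return np.linalg.norm(g - fd) / max(np.linalg.norm(fd), 1e-12)

# Sanity check on a toy quadratic, where the supplied gradient is exact:
f = lambda th: 0.5 * th @ th
err = relative_gradient_error(f, lambda th: th, np.array([0.3, -1.2, 2.0]))
print(f"relative error: {err:.2e}")   # near zero for an exact gradient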
Circularity Check
Derivation of regularizers is self-contained from MILP structure and duals
full rationale
The paper derives bound-based regularizers and the LP-gap regularizer directly from the standard ReLU MILP encoding and LP dual variables at training points. The total-derivative approximation is constructed explicitly by summing the individual regularizer gradients, without any reduction to fitted parameters, self-citations, or ansatzes imported from prior work. No load-bearing step equates a claimed prediction or uniqueness result to its own inputs by construction. The reported speedups are empirical outcomes, not definitional.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization coefficients (weights on the penalty terms; see the sketch after this ledger)
axioms (2)
- standard math: ReLU networks admit an exact MILP encoding via big-M formulations
- domain assumption: the LP relaxation gap evaluated at training points is a useful proxy for overall MILP tractability
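A plausible form of the combined training objective in which those coefficients appear, reconstructed from the abstract rather than quoted from the paper (the weights $\lambda_M$, $\lambda_u$, $\lambda_g$ are assumed names for the three penalty coefficients):
\[
\min_{\theta}\; \mathcal{L}_{\mathrm{pred}}(\theta)
\;+\; \lambda_{M}\, R_{\text{big-}M}(\theta)
\;+\; \lambda_{u}\, R_{\mathrm{unstable}}(\theta)
\;+\; \lambda_{g}\, \frac{1}{N}\sum_{i=1}^{N} g_i(\theta),
\]
where $g_i(\theta)$ is the per-sample LP relaxation gap at training point $i$; setting all three coefficients to zero recovers the unregularized baseline.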