Scalable Decision-Focused Learning through Cost-Sensitive Regression
Pith reviewed 2026-05-20 12:43 UTC · model grok-4.3
The pith
Decision-focused learning can be reframed as cost-sensitive regression to train with at most one solve per instance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reframing the predict-then-optimize problem as cost-sensitive multi-output regression, the authors derive loss components including cost-insensitive normalization, decision-aware asymmetric penalization, and instance-based costs. These allow the learning process to approximate the downstream combinatorial loss with at most one solve per training data instance and no additional solves during training, achieving comparable task performance to existing decision-focused learning methods but with significantly improved scalability.
What carries the argument
The set of loss function components for cost-sensitive regression: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs mimicking the true task loss locally.
If this is right
- Training becomes feasible for combinatorial problems too large for repeated-solve DFL methods.
- The number of optimization solves during training is reduced to at most one per data instance.
- Standard regression training techniques can be applied more directly to decision-aware prediction.
- Downstream task quality remains close to that of more computationally intensive approaches.
Where Pith is reading between the lines
- This reframing might enable the use of off-the-shelf regression models and optimizers in DFL pipelines.
- Future work could explore how these loss components perform when the underlying optimization problem has more complex structure.
- Scaling to very large instances could test the limits of the local approximation provided by instance-based costs.
Load-bearing premise
The proposed loss components sufficiently approximate the true downstream combinatorial task loss using at most one solve per training instance without needing further solves or post-hoc adjustments during training.
What would settle it
A direct comparison on problem sizes where both the new method and a repeated-solve baseline can complete training, checking if the final optimization costs on test data are statistically equivalent.
Figures
read the original abstract
Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These `predict-then-optimize' or `contextual optimization' problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reframes decision-focused learning for contextual combinatorial optimization as cost-sensitive multi-output regression. It introduces three loss components—cost-insensitive normalization, decision-aware asymmetric penalization of over- and under-predictions, and instance-based costs—that are designed to locally approximate the downstream task loss. These components require at most one combinatorial solve per training instance and none during training. The central empirical claim is that the resulting method achieves downstream task quality comparable to existing DFL approaches while being substantially more efficient, thereby scaling to problem sizes previously intractable for DFL.
Significance. If the approximation quality and scaling claims are substantiated, the work would meaningfully advance the DFL literature by removing the repeated-solve bottleneck that currently restricts end-to-end training to small instances. The regression reframing itself is a clean conceptual contribution that could be reused beyond the specific loss components proposed.
major comments (3)
- [Abstract and §4] Abstract and experimental section: the claim that the method achieves 'comparable downstream task quality' and 'enables scaling to problem sizes that have not been tackled before' is presented without any reported baselines, metrics, instance counts, solver times, or statistical tests. This absence prevents verification of the central efficiency-quality tradeoff.
- [§3.2–3.3] §3.2–3.3 (loss construction): the instance-based costs and decision-aware asymmetric penalization are asserted to mimic the true task loss 'locally' with a single solve. For problems whose feasible set couples the uncertain parameters (shortest-path, assignment, etc.), no argument or derivation is given showing that the per-coordinate surrogate gradient aligns with the subgradient of the downstream cost; compensation or amplification across coordinates is not addressed.
- [§3.1] §3.1 (reframing): the reduction to cost-sensitive regression is presented as parameter-free, yet the instance-based cost term appears to require an a-priori choice of how to map the single solve outcome into per-coordinate penalties. The manuscript does not state whether this mapping is fixed by the problem structure or introduces tunable hyperparameters.
minor comments (2)
- [Figures and Tables] Figure captions and experimental tables should explicitly list the combinatorial problem sizes (number of nodes/edges/variables) and the exact number of training instances used for each method.
- [§2–3] Notation for the multi-output regression target should be introduced once and used consistently; the distinction between the predicted vector ŷ and the cost vector c is occasionally blurred in the loss derivations.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's potential impact and for the constructive major comments. We address each point below and will revise the manuscript accordingly where appropriate.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and experimental section: the claim that the method achieves 'comparable downstream task quality' and 'enables scaling to problem sizes that have not been tackled before' is presented without any reported baselines, metrics, instance counts, solver times, or statistical tests. This absence prevents verification of the central efficiency-quality tradeoff.
Authors: We agree that the presentation of the empirical results can be strengthened by including more explicit details. In the revised manuscript, we will add comprehensive tables in §4 reporting all baselines (such as standard DFL methods like those based on SPO or direct optimization), specific metrics including downstream task costs or regrets, the number of instances and their sizes, solver runtimes for both training and inference, and results of statistical significance tests. This will substantiate the claims of comparable quality and improved scalability. revision: yes
-
Referee: [§3.2–3.3] §3.2–3.3 (loss construction): the instance-based costs and decision-aware asymmetric penalization are asserted to mimic the true task loss 'locally' with a single solve. For problems whose feasible set couples the uncertain parameters (shortest-path, assignment, etc.), no argument or derivation is given showing that the per-coordinate surrogate gradient aligns with the subgradient of the downstream cost; compensation or amplification across coordinates is not addressed.
Authors: The design of the loss components aims to provide a local approximation by using the outcome of a single solve to inform the penalties, with asymmetric penalization capturing the directional impact on the decision. For problems with coupled parameters, the instance-based costs are intended to reflect the marginal contribution of each parameter to the overall cost, which implicitly handles interactions through the optimization. We will include an expanded discussion and a derivation sketch in the revised §3.2–3.3 to better illustrate the alignment of the surrogate gradient with the true subgradient, addressing potential compensation effects across coordinates. revision: yes
-
Referee: [§3.1] §3.1 (reframing): the reduction to cost-sensitive regression is presented as parameter-free, yet the instance-based cost term appears to require an a-priori choice of how to map the single solve outcome into per-coordinate penalties. The manuscript does not state whether this mapping is fixed by the problem structure or introduces tunable hyperparameters.
Authors: The instance-based cost mapping is determined solely by the structure of the underlying combinatorial optimization problem and does not introduce additional tunable hyperparameters. It is computed based on the single solve result, such as through sensitivity analysis or dual information specific to the problem (e.g., for shortest paths or assignments). We will revise §3.1 to explicitly state this and provide the general mapping procedure along with examples from the experimental settings to clarify that it remains parameter-free. revision: yes
Circularity Check
No circularity: losses derived from reframing, not by construction or self-reference
full rationale
The paper reframes predict-then-optimize as cost-sensitive multi-output regression and explicitly constructs three loss components (cost-insensitive normalization, decision-aware asymmetric penalization, instance-based costs) to locally mimic the downstream task loss. These components are presented as following directly from the reframing and require at most one solve per instance by design. No step reduces a claimed prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose validity is internal to the present work. The central efficiency and quality claims rest on experimental comparison to prior DFL methods rather than on definitional equivalence. This is the normal case of a self-contained proposal of a surrogate loss.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Tuning data mining methods for cost- sensitive regression: A study in loan charge-off forecasting
[Bansalet al., 2008 ] Gaurav Bansal, Atish P Sinha, and Huimin Zhao. Tuning data mining methods for cost- sensitive regression: A study in loan charge-off forecasting. Journal of Management Information Systems, 25(3):315– 336,
work page 2008
-
[2]
Solver-free decision-focused learning for linear optimization problems
[Berdenet al., 2025 ] Senne Berden, Ali ˙Irfan Mah- muto˘gulları, Dimos Tsouros, and Tias Guns. Solver-free decision-focused learning for linear optimization problems. arXiv preprint arXiv:2505.22224,
-
[3]
Learning with differentiable pertubed optimizers
[Berthetet al., 2020 ] Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Fran- cis Bach. Learning with differentiable pertubed optimizers. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, pages 9508–9519. Curran Associates, Inc.,
work page 2020
-
[4]
[Decorteet al., 2023 ] Thomas Decorte, Jakob Raymaekers, and Tim Verdonck. Interpretable cost-sensitive regres- sion through one-step boosting.Decision Support Systems, 175:114024,
work page 2023
-
[5]
The foundations of cost- sensitive learning
[Elkan, 2001] Charles Elkan. The foundations of cost- sensitive learning. InInternational Joint Conference on Artificial Intelligence, pages 973–978,
work page 2001
-
[6]
[Elmachtoub and Grigas, 2022] Adam N. Elmachtoub and Paul Grigas. Smart ‘predict, then optimize’.Management Science, 68(1):9–26,
work page 2022
-
[7]
MIPaaL: Mixed integer program as a layer
[Ferberet al., 2020 ] Aaron Ferber, Bryan Wilder, Bistra Dilk- ina, and Milind Tambe. MIPaaL: Mixed integer program as a layer. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1504–1511,
work page 2020
-
[8]
[Granger, 1969] Clive WJ Granger. Prediction with a gener- alized cost of error function.Journal of the Operational Research Society, 20(2):199–207,
work page 1969
-
[9]
Learning joint models of prediction and optimiza- tion
[Kotaryet al., 2024 ] James Kotary, Vincenzo Di Vito, Ja- cob Christopher, Pascal Van Hentenryck, and Ferdinando Fioretto. Learning joint models of prediction and optimiza- tion. InProceedings of ECAI, pages 2476–2483,
work page 2024
-
[10]
[Lawless and Zhou, 2022] Connor Lawless and Angela Zhou. A note on task-aware loss via reweighing prediction loss by decision-regret.arXiv preprint arXiv:2211.05116,
-
[11]
Smart predict-and-optimize for hard combina- torial optimization problems
[Mandiet al., 2020 ] Jayanta Mandi, Peter J Stuckey, Tias Guns, et al. Smart predict-and-optimize for hard combina- torial optimization problems. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1603–1610,
work page 2020
-
[12]
Decision-focused learning: Foun- dations, state of the art, benchmark and future opportunities
[Mandiet al., 2024 ] Jayanta Mandi, James Kotary, Senne Berden, Maxime Mulamba, Victor Bucarey, Tias Guns, and Ferdinando Fioretto. Decision-focused learning: Foun- dations, state of the art, benchmark and future opportunities. Journal of Artificial Intelligence Research, 80:1623–1701,
work page 2024
-
[13]
[McKenzieet al., 2024 ] Daniel McKenzie, Howard Heaton, and Samy Wu Fung. Differentiating through integer lin- ear programs with quadratic regularization and Davis-Yin splitting.Trans. Mach. Learn. Res., 2024,
work page 2024
-
[14]
Contrastive losses and solution caching for predict-and-optimize
[Mulambaet al., 2021 ] Maxime Mulamba, Jayanta Mandi, Michelangelo Diligenti, Michele Lombardi, Victor Bucarey, Tias Guns, et al. Contrastive losses and solution caching for predict-and-optimize. InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 2833–2840,
work page 2021
-
[15]
Multi-output regression for imbal- anced data stream.Expert Systems, 40(10):e13417,
[Penget al., 2023 ] Tao Peng, Sana Sellami, Omar Boucelma, and Richard Chbeir. Multi-output regression for imbal- anced data stream.Expert Systems, 40(10):e13417,
work page 2023
-
[16]
[Sadanaet al., 2025 ] Utsav Sadana, Abhilash Chenreddy, Er- ick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision-making under uncertainty.European Journal of Operational Research, 320(2):271–289,
work page 2025
-
[17]
Sufficient de- cision proxies for decision-focused learning.arXiv preprint arXiv:2505.03953,
[Schutteet al., 2025 ] Noah Schutte, Grigorii Veviurko, Krzysztof Postek, and Neil Yorke-Smith. Sufficient de- cision proxies for decision-focused learning.arXiv preprint arXiv:2505.03953,
-
[18]
[Schutteet al., 2026 ] Noah Schutte, Kim van den Houten, and Grigorii Veviurko. PyDFLT: A Python-based decision- focused learning toolbox. https://doi.org/10.5281/zenodo. 20177131,
-
[19]
Decision-focused learning without decision-making: Learning locally opti- mized decision losses
[Shahet al., 2022 ] Sanket Shah, Kai Wang, Bryan Wilder, Andrew Perrault, and Milind Tambe. Decision-focused learning without decision-making: Learning locally opti- mized decision losses. InProceedings of NeuIPS,
work page 2022
-
[20]
Leaving the nest: Going be- yond local loss functions for predict-then-optimize
[Shahet al., 2024 ] Sanket Shah, Bryan Wilder, Andrew Per- rault, and Milind Tambe. Leaving the nest: Going be- yond local loss functions for predict-then-optimize. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 14902–14909,
work page 2024
-
[21]
Pyepo: A pytorch-based end-to-end predict-then-optimize library with linear objective function
[Tang and Khalil, 2022] Bo Tang and Elias B Khalil. Pyepo: A pytorch-based end-to-end predict-then-optimize library with linear objective function. InOPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop),
work page 2022
-
[22]
[Tang and Khalil, 2024] Bo Tang and Elias B. Khalil. Cave: A cone-aligned approach for fast predict-then-optimize with binary linear programs. In Bistra Dilkina, editor,Pro- cedings of CPAIOR, volume 14743 ofLecture Notes in Computer Science, pages 193–210. Springer,
work page 2024
-
[23]
One- sided support vector regression for multiclass cost-sensitive classification
[Tu and Lin, 2010] Han-Hsing Tu and Hsuan-Tien Lin. One- sided support vector regression for multiclass cost-sensitive classification. InICML, number 4, page 5,
work page 2010
-
[24]
[Vanderschuerenet al., 2022 ] Toon Vanderschueren, Tim Verdonck, Bart Baesens, and Wouter Verbeke. Predict-then- optimize or predict-and-optimize? An empirical evaluation of cost-sensitive learning strategies.Information Sciences, 594:400–415,
work page 2022
-
[25]
Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization
[Wilderet al., 2019 ] Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. Proceedings of the AAAI Conference on Artificial Intelli- gence, 33(01):1658–1665,
work page 2019
-
[26]
[Zellner, 1986] Arnold Zellner. Bayesian estimation and pre- diction using asymmetric loss functions.Journal of the American Statistical Association, 81(394):446–451,
work page 1986
-
[27]
[Zhaoet al., 2011 ] Huimin Zhao, Atish P Sinha, and Gau- rav Bansal. An extended tuning method for cost-sensitive regression and forecasting.Decision Support Systems, 51(3):372–383,
work page 2011
-
[28]
Demand is generated uniformly from 1 to 10, as in Tang and Khalil [2024]. One adjustment is that we scale demand such that the problem is always feasible, by making total demand 70% of the total capacity of all vehicles together (this is about the average ratio in CVRP20, while the larger sizes have a high likelihood to be infeasible without scaling). For...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.