pith. sign in

arxiv: 2605.18005 · v1 · pith:YHWTHPPHnew · submitted 2026-05-18 · 💻 cs.LG · stat.ML

Scalable Decision-Focused Learning through Cost-Sensitive Regression

Pith reviewed 2026-05-20 12:43 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords decision-focused learningpredict-then-optimizecost-sensitive regressioncombinatorial optimizationend-to-end trainingscalable machine learning
0
0 comments X

The pith

Decision-focused learning can be reframed as cost-sensitive regression to train with at most one solve per instance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that decision-focused learning, which trains predictors to minimize downstream optimization costs rather than prediction error, can be made scalable by recasting it as a cost-sensitive regression task. Instead of repeatedly solving the combinatorial problem during training, the authors introduce loss components that require only zero or one solve per training instance. These include normalizing costs without considering decisions, asymmetrically penalizing over- and under-predictions in a decision-aware way, and using instance-specific costs that locally mimic the true task loss. If effective, this approach matches the performance of prior methods but runs much faster, opening the door to larger problem sizes previously out of reach for end-to-end DFL methods.

Core claim

By reframing the predict-then-optimize problem as cost-sensitive multi-output regression, the authors derive loss components including cost-insensitive normalization, decision-aware asymmetric penalization, and instance-based costs. These allow the learning process to approximate the downstream combinatorial loss with at most one solve per training data instance and no additional solves during training, achieving comparable task performance to existing decision-focused learning methods but with significantly improved scalability.

What carries the argument

The set of loss function components for cost-sensitive regression: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs mimicking the true task loss locally.

If this is right

  • Training becomes feasible for combinatorial problems too large for repeated-solve DFL methods.
  • The number of optimization solves during training is reduced to at most one per data instance.
  • Standard regression training techniques can be applied more directly to decision-aware prediction.
  • Downstream task quality remains close to that of more computationally intensive approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This reframing might enable the use of off-the-shelf regression models and optimizers in DFL pipelines.
  • Future work could explore how these loss components perform when the underlying optimization problem has more complex structure.
  • Scaling to very large instances could test the limits of the local approximation provided by instance-based costs.

Load-bearing premise

The proposed loss components sufficiently approximate the true downstream combinatorial task loss using at most one solve per training instance without needing further solves or post-hoc adjustments during training.

What would settle it

A direct comparison on problem sizes where both the new method and a repeated-solve baseline can complete training, checking if the final optimization costs on test data are statistically equivalent.

Figures

Figures reproduced from arXiv: 2605.18005 by Krzysztof Postek, Neil Yorke-Smith, Noah Schutte, Senne Berden, Tias Guns.

Figure 1
Figure 1. Figure 1: Example of one-sided loss on a linear problem [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of scale-invariant loss on a linear problem [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Test absolute regret normalized on ℓMSE: Comparing COS, COS and ˜ O, with ˜ ℓMAE and ℓMSE. Property (1) and is a commonly used accuracy metric. Here we evaluate the performance of mean absolute error (MAE) ℓMAE(ˆc, c) = 1 d Pd j=1 |cˆ− c|, which is another common loss function that penalizes deviations from the true value linearly, instead of quadratically with MSE, penalizing outliers more. However, it is… view at source ↗
Figure 4
Figure 4. Figure 4: Test absolute regret normalized on SPO+ for all loss com [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Test absolute regret (y-axis) and runtime in seconds (x-axis) based on 3 seeds. Losses with regret [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparing our instance-based costs loss ℓ C MSE, iterative instance-based costs ℓ IC MSE, ensemble iterative instance-based costs ℓ EIC MSE and ℓ W(w) MSE for w ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}. Test absolute regret normalized on ℓMSE(= ℓ W(0.0) MSE ). Based on 20 seeds. Error bars denote stdev. by boosting, where an ensemble of predictive models is used, we combine each predictive model obtained at each ite… view at source ↗
read the original abstract

Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These `predict-then-optimize' or `contextual optimization' problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reframes decision-focused learning for contextual combinatorial optimization as cost-sensitive multi-output regression. It introduces three loss components—cost-insensitive normalization, decision-aware asymmetric penalization of over- and under-predictions, and instance-based costs—that are designed to locally approximate the downstream task loss. These components require at most one combinatorial solve per training instance and none during training. The central empirical claim is that the resulting method achieves downstream task quality comparable to existing DFL approaches while being substantially more efficient, thereby scaling to problem sizes previously intractable for DFL.

Significance. If the approximation quality and scaling claims are substantiated, the work would meaningfully advance the DFL literature by removing the repeated-solve bottleneck that currently restricts end-to-end training to small instances. The regression reframing itself is a clean conceptual contribution that could be reused beyond the specific loss components proposed.

major comments (3)
  1. [Abstract and §4] Abstract and experimental section: the claim that the method achieves 'comparable downstream task quality' and 'enables scaling to problem sizes that have not been tackled before' is presented without any reported baselines, metrics, instance counts, solver times, or statistical tests. This absence prevents verification of the central efficiency-quality tradeoff.
  2. [§3.2–3.3] §3.2–3.3 (loss construction): the instance-based costs and decision-aware asymmetric penalization are asserted to mimic the true task loss 'locally' with a single solve. For problems whose feasible set couples the uncertain parameters (shortest-path, assignment, etc.), no argument or derivation is given showing that the per-coordinate surrogate gradient aligns with the subgradient of the downstream cost; compensation or amplification across coordinates is not addressed.
  3. [§3.1] §3.1 (reframing): the reduction to cost-sensitive regression is presented as parameter-free, yet the instance-based cost term appears to require an a-priori choice of how to map the single solve outcome into per-coordinate penalties. The manuscript does not state whether this mapping is fixed by the problem structure or introduces tunable hyperparameters.
minor comments (2)
  1. [Figures and Tables] Figure captions and experimental tables should explicitly list the combinatorial problem sizes (number of nodes/edges/variables) and the exact number of training instances used for each method.
  2. [§2–3] Notation for the multi-output regression target should be introduced once and used consistently; the distinction between the predicted vector ŷ and the cost vector c is occasionally blurred in the loss derivations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of the work's potential impact and for the constructive major comments. We address each point below and will revise the manuscript accordingly where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and experimental section: the claim that the method achieves 'comparable downstream task quality' and 'enables scaling to problem sizes that have not been tackled before' is presented without any reported baselines, metrics, instance counts, solver times, or statistical tests. This absence prevents verification of the central efficiency-quality tradeoff.

    Authors: We agree that the presentation of the empirical results can be strengthened by including more explicit details. In the revised manuscript, we will add comprehensive tables in §4 reporting all baselines (such as standard DFL methods like those based on SPO or direct optimization), specific metrics including downstream task costs or regrets, the number of instances and their sizes, solver runtimes for both training and inference, and results of statistical significance tests. This will substantiate the claims of comparable quality and improved scalability. revision: yes

  2. Referee: [§3.2–3.3] §3.2–3.3 (loss construction): the instance-based costs and decision-aware asymmetric penalization are asserted to mimic the true task loss 'locally' with a single solve. For problems whose feasible set couples the uncertain parameters (shortest-path, assignment, etc.), no argument or derivation is given showing that the per-coordinate surrogate gradient aligns with the subgradient of the downstream cost; compensation or amplification across coordinates is not addressed.

    Authors: The design of the loss components aims to provide a local approximation by using the outcome of a single solve to inform the penalties, with asymmetric penalization capturing the directional impact on the decision. For problems with coupled parameters, the instance-based costs are intended to reflect the marginal contribution of each parameter to the overall cost, which implicitly handles interactions through the optimization. We will include an expanded discussion and a derivation sketch in the revised §3.2–3.3 to better illustrate the alignment of the surrogate gradient with the true subgradient, addressing potential compensation effects across coordinates. revision: yes

  3. Referee: [§3.1] §3.1 (reframing): the reduction to cost-sensitive regression is presented as parameter-free, yet the instance-based cost term appears to require an a-priori choice of how to map the single solve outcome into per-coordinate penalties. The manuscript does not state whether this mapping is fixed by the problem structure or introduces tunable hyperparameters.

    Authors: The instance-based cost mapping is determined solely by the structure of the underlying combinatorial optimization problem and does not introduce additional tunable hyperparameters. It is computed based on the single solve result, such as through sensitivity analysis or dual information specific to the problem (e.g., for shortest paths or assignments). We will revise §3.1 to explicitly state this and provide the general mapping procedure along with examples from the experimental settings to clarify that it remains parameter-free. revision: yes

Circularity Check

0 steps flagged

No circularity: losses derived from reframing, not by construction or self-reference

full rationale

The paper reframes predict-then-optimize as cost-sensitive multi-output regression and explicitly constructs three loss components (cost-insensitive normalization, decision-aware asymmetric penalization, instance-based costs) to locally mimic the downstream task loss. These components are presented as following directly from the reframing and require at most one solve per instance by design. No step reduces a claimed prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose validity is internal to the present work. The central efficiency and quality claims rest on experimental comparison to prior DFL methods rather than on definitional equivalence. This is the normal case of a self-contained proposal of a surrogate loss.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5758 in / 1091 out tokens · 32746 ms · 2026-05-20T12:43:48.434783+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    Tuning data mining methods for cost- sensitive regression: A study in loan charge-off forecasting

    [Bansalet al., 2008 ] Gaurav Bansal, Atish P Sinha, and Huimin Zhao. Tuning data mining methods for cost- sensitive regression: A study in loan charge-off forecasting. Journal of Management Information Systems, 25(3):315– 336,

  2. [2]

    Solver-free decision-focused learning for linear optimization problems

    [Berdenet al., 2025 ] Senne Berden, Ali ˙Irfan Mah- muto˘gulları, Dimos Tsouros, and Tias Guns. Solver-free decision-focused learning for linear optimization problems. arXiv preprint arXiv:2505.22224,

  3. [3]

    Learning with differentiable pertubed optimizers

    [Berthetet al., 2020 ] Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Fran- cis Bach. Learning with differentiable pertubed optimizers. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, pages 9508–9519. Curran Associates, Inc.,

  4. [4]

    Interpretable cost-sensitive regres- sion through one-step boosting.Decision Support Systems, 175:114024,

    [Decorteet al., 2023 ] Thomas Decorte, Jakob Raymaekers, and Tim Verdonck. Interpretable cost-sensitive regres- sion through one-step boosting.Decision Support Systems, 175:114024,

  5. [5]

    The foundations of cost- sensitive learning

    [Elkan, 2001] Charles Elkan. The foundations of cost- sensitive learning. InInternational Joint Conference on Artificial Intelligence, pages 973–978,

  6. [6]

    Elmachtoub and Paul Grigas

    [Elmachtoub and Grigas, 2022] Adam N. Elmachtoub and Paul Grigas. Smart ‘predict, then optimize’.Management Science, 68(1):9–26,

  7. [7]

    MIPaaL: Mixed integer program as a layer

    [Ferberet al., 2020 ] Aaron Ferber, Bryan Wilder, Bistra Dilk- ina, and Milind Tambe. MIPaaL: Mixed integer program as a layer. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1504–1511,

  8. [8]

    Prediction with a gener- alized cost of error function.Journal of the Operational Research Society, 20(2):199–207,

    [Granger, 1969] Clive WJ Granger. Prediction with a gener- alized cost of error function.Journal of the Operational Research Society, 20(2):199–207,

  9. [9]

    Learning joint models of prediction and optimiza- tion

    [Kotaryet al., 2024 ] James Kotary, Vincenzo Di Vito, Ja- cob Christopher, Pascal Van Hentenryck, and Ferdinando Fioretto. Learning joint models of prediction and optimiza- tion. InProceedings of ECAI, pages 2476–2483,

  10. [10]

    A note on task-aware loss via reweighing prediction loss by decision-regret.arXiv preprint arXiv:2211.05116,

    [Lawless and Zhou, 2022] Connor Lawless and Angela Zhou. A note on task-aware loss via reweighing prediction loss by decision-regret.arXiv preprint arXiv:2211.05116,

  11. [11]

    Smart predict-and-optimize for hard combina- torial optimization problems

    [Mandiet al., 2020 ] Jayanta Mandi, Peter J Stuckey, Tias Guns, et al. Smart predict-and-optimize for hard combina- torial optimization problems. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1603–1610,

  12. [12]

    Decision-focused learning: Foun- dations, state of the art, benchmark and future opportunities

    [Mandiet al., 2024 ] Jayanta Mandi, James Kotary, Senne Berden, Maxime Mulamba, Victor Bucarey, Tias Guns, and Ferdinando Fioretto. Decision-focused learning: Foun- dations, state of the art, benchmark and future opportunities. Journal of Artificial Intelligence Research, 80:1623–1701,

  13. [13]

    Differentiating through integer lin- ear programs with quadratic regularization and Davis-Yin splitting.Trans

    [McKenzieet al., 2024 ] Daniel McKenzie, Howard Heaton, and Samy Wu Fung. Differentiating through integer lin- ear programs with quadratic regularization and Davis-Yin splitting.Trans. Mach. Learn. Res., 2024,

  14. [14]

    Contrastive losses and solution caching for predict-and-optimize

    [Mulambaet al., 2021 ] Maxime Mulamba, Jayanta Mandi, Michelangelo Diligenti, Michele Lombardi, Victor Bucarey, Tias Guns, et al. Contrastive losses and solution caching for predict-and-optimize. InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 2833–2840,

  15. [15]

    Multi-output regression for imbal- anced data stream.Expert Systems, 40(10):e13417,

    [Penget al., 2023 ] Tao Peng, Sana Sellami, Omar Boucelma, and Richard Chbeir. Multi-output regression for imbal- anced data stream.Expert Systems, 40(10):e13417,

  16. [16]

    A survey of contextual optimization methods for decision-making under uncertainty.European Journal of Operational Research, 320(2):271–289,

    [Sadanaet al., 2025 ] Utsav Sadana, Abhilash Chenreddy, Er- ick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision-making under uncertainty.European Journal of Operational Research, 320(2):271–289,

  17. [17]

    Sufficient de- cision proxies for decision-focused learning.arXiv preprint arXiv:2505.03953,

    [Schutteet al., 2025 ] Noah Schutte, Grigorii Veviurko, Krzysztof Postek, and Neil Yorke-Smith. Sufficient de- cision proxies for decision-focused learning.arXiv preprint arXiv:2505.03953,

  18. [18]

    librosa/librosa: 0.6.3,

    [Schutteet al., 2026 ] Noah Schutte, Kim van den Houten, and Grigorii Veviurko. PyDFLT: A Python-based decision- focused learning toolbox. https://doi.org/10.5281/zenodo. 20177131,

  19. [19]

    Decision-focused learning without decision-making: Learning locally opti- mized decision losses

    [Shahet al., 2022 ] Sanket Shah, Kai Wang, Bryan Wilder, Andrew Perrault, and Milind Tambe. Decision-focused learning without decision-making: Learning locally opti- mized decision losses. InProceedings of NeuIPS,

  20. [20]

    Leaving the nest: Going be- yond local loss functions for predict-then-optimize

    [Shahet al., 2024 ] Sanket Shah, Bryan Wilder, Andrew Per- rault, and Milind Tambe. Leaving the nest: Going be- yond local loss functions for predict-then-optimize. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 14902–14909,

  21. [21]

    Pyepo: A pytorch-based end-to-end predict-then-optimize library with linear objective function

    [Tang and Khalil, 2022] Bo Tang and Elias B Khalil. Pyepo: A pytorch-based end-to-end predict-then-optimize library with linear objective function. InOPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop),

  22. [22]

    [Tang and Khalil, 2024] Bo Tang and Elias B. Khalil. Cave: A cone-aligned approach for fast predict-then-optimize with binary linear programs. In Bistra Dilkina, editor,Pro- cedings of CPAIOR, volume 14743 ofLecture Notes in Computer Science, pages 193–210. Springer,

  23. [23]

    One- sided support vector regression for multiclass cost-sensitive classification

    [Tu and Lin, 2010] Han-Hsing Tu and Hsuan-Tien Lin. One- sided support vector regression for multiclass cost-sensitive classification. InICML, number 4, page 5,

  24. [24]

    Predict-then- optimize or predict-and-optimize? An empirical evaluation of cost-sensitive learning strategies.Information Sciences, 594:400–415,

    [Vanderschuerenet al., 2022 ] Toon Vanderschueren, Tim Verdonck, Bart Baesens, and Wouter Verbeke. Predict-then- optimize or predict-and-optimize? An empirical evaluation of cost-sensitive learning strategies.Information Sciences, 594:400–415,

  25. [25]

    Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization

    [Wilderet al., 2019 ] Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. Proceedings of the AAAI Conference on Artificial Intelli- gence, 33(01):1658–1665,

  26. [26]

    Bayesian estimation and pre- diction using asymmetric loss functions.Journal of the American Statistical Association, 81(394):446–451,

    [Zellner, 1986] Arnold Zellner. Bayesian estimation and pre- diction using asymmetric loss functions.Journal of the American Statistical Association, 81(394):446–451,

  27. [27]

    An extended tuning method for cost-sensitive regression and forecasting.Decision Support Systems, 51(3):372–383,

    [Zhaoet al., 2011 ] Huimin Zhao, Atish P Sinha, and Gau- rav Bansal. An extended tuning method for cost-sensitive regression and forecasting.Decision Support Systems, 51(3):372–383,

  28. [28]

    Demand is generated uniformly from 1 to 10, as in Tang and Khalil [2024]. One adjustment is that we scale demand such that the problem is always feasible, by making total demand 70% of the total capacity of all vehicles together (this is about the average ratio in CVRP20, while the larger sizes have a high likelihood to be infeasible without scaling). For...