pith. sign in

arxiv: 2505.20628 · v4 · submitted 2025-05-27 · 💻 cs.LG · math.OC

Position: Adopt Constraints Over Fixed Penalties in Deep Learning

Pith reviewed 2026-05-19 13:19 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords deep learningconstrained optimizationpenalty methodstrustworthy AInon-convex optimizationhard constraintsposition paper
0
0 comments X

The pith

Deep learning problems with non-negotiable requirements should start from the constrained formulation rather than fixed-penalty surrogates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed weighted-sum penalization is ill-suited when deep learning tasks include hard requirements. In non-convex landscapes the penalized objective is generally not equivalent to the original constrained problem, so solving one does not guarantee satisfying the other. Fixed penalties convert strict rules into soft trade-offs against task loss, and selecting the right coefficients requires repeated trial-and-error that can redefine the problem being solved. A sympathetic reader would care because applications such as safety-critical or fairness-aware systems need guarantees that penalization cannot reliably deliver. The position is therefore to treat the constrained problem as the primary object and select solution methods according to its structure and scale.

Core claim

When a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. This is because the penalized and constrained problems are generally not equivalent in non-convex settings, fixed penalties weaken hard requirements into weighted trade-offs, and choosing coefficients often involves costly trial and error that risks solving the wrong objective altogether. Solution strategies should then be chosen based on the problem's structure and scale.

What carries the argument

The direct constrained optimization formulation versus the fixed-penalty scalarized objective as alternative starting points for optimization.

If this is right

  • Optimization would focus on methods that preserve hard constraints rather than trading them off via weights.
  • Non-negotiable requirements would stay binding throughout training instead of becoming adjustable penalties.
  • Hyper-parameter search would target solver choice and structure rather than penalty magnitudes.
  • Applications needing strict compliance could measure success directly by constraint satisfaction rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Development of new scalable constrained solvers tailored to neural network training would become a priority.
  • Benchmarks could shift emphasis from penalized loss values to explicit rates of constraint violation.
  • Connections could emerge to projection-based or differentiable optimization techniques already used in other domains.

Load-bearing premise

That practitioners can reliably select and apply appropriate solution strategies for the constrained formulation at deep learning scale based on problem structure.

What would settle it

A concrete demonstration of a non-convex deep network training task with hard constraints where fixed penalization achieves consistent satisfaction without coefficient retuning or equivalence failures.

Figures

Figures reproduced from arXiv: 2505.20628 by Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien.

Figure 1
Figure 1. Figure 1: Solving a sparsity￾constrained training problem. Start￾ing from coefficients yielding low (25%) and high (85%) model den￾sity, five bisection search steps are needed to reach 50% sparsity. An￾notations show iterations; endpoints are labeled as 0. Reproduces [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Solving a binary classification task under a constraint that at least 70% of predictions belong [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The Pareto front of non-dominated solutions between the objective [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Recent efforts to develop trustworthy AI systems have increased interest in learning problems with explicit requirements, or constraints. In deep learning, however, such problems are often handled through fixed weighted-sum penalization: the constraints are added to the task loss with fixed coefficients, and the resulting scalarized objective is minimized. This position paper argues that fixed penalization is often ill-suited for deep learning problems with non-negotiable requirements for several reasons. First, in non-convex settings, the penalized and constrained problems are generally not equivalent, so solving the former need not solve the latter. Second, fixed penalization weakens hard requirements into soft penalties to be traded off against task performance. Third, choosing penalty coefficients to indirectly solve the constrained problem often involves costly trial and error, because changing them alters the penalized objective itself, and hence can mean solving the wrong problem altogether. We therefore argue that, when a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. The appropriate solution strategy should then be chosen based on the problem's structure and scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper is a position piece arguing that deep learning problems with non-negotiable constraints should be formulated and solved as constrained optimization problems rather than via fixed weighted-sum penalization. It gives three reasons: (1) in non-convex regimes the penalized surrogate is generally not equivalent to the original constrained problem, (2) fixed penalties convert hard requirements into soft trade-offs, and (3) tuning the penalty coefficients changes the objective itself and therefore often amounts to solving the wrong problem. The manuscript concludes that the constrained formulation should be the starting point, after which an appropriate solver is selected according to problem structure and scale.

Significance. If the core distinctions hold, the position could usefully redirect attention in trustworthy-AI research toward constrained formulations that preserve hard requirements instead of relying on surrogate penalties whose behavior is harder to control. The arguments rest on standard facts about non-convex optimization and penalty methods; the paper does not claim new theorems or empirical results.

major comments (1)
  1. [Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the single major comment below and will revise the manuscript to improve its actionability while preserving the core position.

read point-by-point responses
  1. Referee: [Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.

    Authors: We agree that the manuscript would be strengthened by explicit references to existing constrained optimization techniques and a brief discussion of their practical status at deep-learning scale. Although the paper is a position piece whose primary contribution is to argue that the constrained formulation should be the starting point (rather than a fixed-penalty surrogate), we accept that readers need some indication of how one would then proceed. In the revision we will add a short paragraph citing representative work on augmented-Lagrangian methods, constrained SGD variants, and related approaches for non-convex problems, together with a concise note on their current scalability characteristics and open challenges. This addition supports the final sentence of the abstract without altering the central claim or introducing new empirical results. revision: yes

Circularity Check

0 steps flagged

No circularity in the position paper's argument chain.

full rationale

The paper is a position statement that rests on standard, externally established facts from optimization theory: non-equivalence of penalized and constrained problems in non-convex regimes, the softening effect of fixed penalties, and the trial-and-error cost of coefficient selection. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the central recommendation. The final sentence simply states that an appropriate solver should be chosen according to problem structure; this is presented as a logical consequence rather than a reduction to any input or prior self-referential result. The argument is therefore self-contained and draws on independent mathematical properties.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central recommendation rests on standard results from non-convex optimization theory and practical observations about hyperparameter sensitivity in deep learning; no new entities or fitted parameters are introduced.

axioms (2)
  • standard math In non-convex settings, the penalized and constrained problems are generally not equivalent.
    Invoked as the first reason in the abstract.
  • domain assumption Fixed penalization weakens hard requirements into soft penalties to be traded off against task performance.
    Invoked as the second reason in the abstract.

pith-pipeline@v0.9.0 · 5729 in / 1278 out tokens · 40955 ms · 2026-05-19T13:19:15.783921+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors

  1. [1]

    Abadi et al

    M. Abadi et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016. (Cit. on p. 2, 7)

  2. [2]

    Arrow, L

    K. Arrow, L. Hurwicz, and H. Uzawa.Studies in Linear and Non-linear Programming. Stanford University Press, 1958. (Cit. on p. 2, 6, 7, 9)

  3. [3]

    M. R. Bachute and J. M. Subhedar. Autonomous Driving Architectures. Machine Learning with Applications, 2021. (Cit. on p. 1)

  4. [4]

    Bertsekas

    D. Bertsekas. On the Method of Multipliers for Convex Programming. IEEE Transactions on Automatic Control, 1975. (Cit. on p. 7, 9, 22)

  5. [5]

    Bertsekas

    D. Bertsekas. Nonlinear Programming. Athena Scientific, 2016. (Cit. on p. 3, 16, 22)

  6. [6]

    D. P. Bertsekas. On the Goldstein-Levitin-Polyak Gradient Projection Method. IEEE Transac- tions on automatic control, 1976. (Cit. on p. 9)

  7. [7]

    Bhatore et al

    S. Bhatore et al. Machine learning techniques for credit risk evaluation. Journal of Banking and Financial Technology, 2020. (Cit. on p. 1)

  8. [8]

    E. G. Birgin and J. M. Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. The SIAM series on Fundamentals of Algorithms, 2014. (Cit. on p. 3)

  9. [9]

    Boyd and L

    S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. (Cit. on p. 3, 9)

  10. [10]

    Bradbury et al

    J. Bradbury et al. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018. (Cit. on p. 7)

  11. [11]

    Brouillard, S

    P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin. Differentiable Causal Discovery from Interventional Data. In NeurIPS, 2020. (Cit. on p. 1)

  12. [12]

    Chakraborty, Y

    D. Chakraborty, Y . LeCun, T. G. Rudner, and E. Learned-Miller. Improving Pre-trained Self- Supervised Embeddings Through Effective Entropy Maximization. In AISTATS, 2025. (Cit. on p. 1)

  13. [13]

    Chamon and A

    L. Chamon and A. Ribeiro. Probably Approximately Correct Constrained Learning. In NeurIPS,

  14. [14]

    X. Chen, Y . Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Inter- pretable Representation Learning by Information Maximizing Generative Adversarial Nets. In NeurIPS, 2016. (Cit. on p. 1)

  15. [15]

    Cotter, M

    A. Cotter, M. Gupta, H. Jiang, N. Srebro, K. Sridharan, S. Wang, B. Woodworth, and S. You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Con- straints. In ICML, 2019. (Cit. on p. 10) 11

  16. [16]

    Cotter, H

    A. Cotter, H. Jiang, M. R. Gupta, S. Wang, T. Narayan, S. You, and K. Sridharan. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. JMLR, 2019. (Cit. on p. 1, 2, 3, 7, 8, 9, 19, 20)

  17. [17]

    Cotter et al

    A. Cotter et al. TensorFlow Constrained Optimization (TFCO). https://github.com/ google-research/tensorflow_constrained_optimization, 2019. (Cit. on p. 2, 7)

  18. [18]

    G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Es- chenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rab- bat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking Neural Network Training Algorithms. arXiv preprin...

  19. [19]

    J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y . Wang, and Y . Yang. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In ICLR, 2024. (Cit. on p. 1, 2, 3, 7)

  20. [20]

    Defazio, F

    A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014. (Cit. on p. 9)

  21. [21]

    Degrave and I

    J. Degrave and I. Korshunova. Why machine learning algorithms are hard to tune and how to fix it. Engraved, blog: www.engraved.blog/why-machine-learning-algorithms-are-hard-to-tune/,

  22. [22]

    (Cit. on p. 2, 4, 5, 18, 22)

  23. [23]

    Degrave and I

    J. Degrave and I. Korshunova. How we can make machine learning algorithms tunable. En- graved, blog: www.engraved.blog/how-we-can-make-machine-learning-algorithms-tunable/,

  24. [24]

    (Cit. on p. 2, 5, 18, 22)

  25. [25]

    I. I. Dikin. Iterative Solution of Problems of Linear and Quadratic Programming. In Doklady Akademii Nauk. Russian Academy of Sciences, 1967. (Cit. on p. 9)

  26. [26]

    Dilhac et al

    M.-A. Dilhac et al. Montréal Declaration for a Responsible Development of Artificial Intelli- gence, 2018. (Cit. on p. 1)

  27. [27]

    Dunion, T

    M. Dunion, T. McInroe, K. Luck, J. Hanna, and S. Albrecht. Conditional Mutual Information for Disentangled Representations in Reinforcement Learning. In NeurIPS, 2023. (Cit. on p. 1)

  28. [28]

    M. Ehrgott. Multicriteria Optimization. Springer Science & Business Media, 2005. (Cit. on p. 1, 2, 4, 5)

  29. [29]

    Elenter, N

    J. Elenter, N. NaderiAlizadeh, and A. Ribeiro. A Lagrangian Duality Approach to Active Learning. In NeurIPS, 2022. (Cit. on p. 2, 7)

  30. [30]

    Artificial Intelligence Act

    European Parliament. Artificial Intelligence Act. https://artificialintelligenceact.eu, 2024. (Cit. on p. 1)

  31. [31]

    Frank and P

    M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 1956. (Cit. on p. 9)

  32. [32]

    Gallego-Posada

    J. Gallego-Posada. Constrained Optimization for Machine Learning: Algorithms and Applica- tions. PhD Thesis, University of Montreal, 2024. (Cit. on p. 2, 4, 6)

  33. [33]

    Gallego-Posada, J

    J. Gallego-Posada, J. Ramirez, A. Erraqabi, Y . Bengio, and S. Lacoste-Julien. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. In NeurIPS, 2022. (Cit. on p. 2, 3, 4, 6, 7, 11, 19)

  34. [34]

    Gallego-Posada, J

    J. Gallego-Posada, J. Ramirez, M. Hashemizadeh, and S. Lacoste-Julien. Cooper: A Library for Constrained Optimization in Deep Learning. arXiv preprint arXiv:2504.01212, 2025. (Cit. on p. 2, 7, 18)

  35. [35]

    J. Gauvin. A Necessary and Sufficient Regularity Condition to Have Bounded Multipliers in Nonconvex Programming. Mathematical Programming, 1977. (Cit. on p. 9)

  36. [36]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. (Cit. on p. 1) 12

  37. [37]

    A. A. Goldstein. Convex Programming in Hilbert Space. University of Washington, 1964. (Cit. on p. 9)

  38. [38]

    R. M. Gower, M. Schmidt, F. Bach, and P. Richtárik. Variance-Reduced Methods for Machine Learning. Proceedings of the IEEE, 2020. (Cit. on p. 9)

  39. [39]

    S.-P. Han. A globally convergent method for nonlinear programming. Journal of optimization theory and applications, 1977. (Cit. on p. 9)

  40. [40]

    Hashemizadeh, J

    M. Hashemizadeh, J. Ramirez, R. Sukumaran, G. Farnadi, S. Lacoste-Julien, and J. Gallego- Posada. Balancing Act: Constraining Disparate Impact in Sparse Models. In ICLR, 2024. (Cit. on p. 2, 7, 9, 10)

  41. [41]

    Higgins, L

    I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerch- ner. beta-V AE: Learning Basic Visual Concepts with a Constrained Variational Framework . In ICLR, 2017. (Cit. on p. 1)

  42. [42]

    Hounie, L

    I. Hounie, L. F. Chamon, and A. Ribeiro. Automatic Data Augmentation via Invariance- Constrained Learning. In ICML, 2023. (Cit. on p. 7)

  43. [43]

    Hounie, J

    I. Hounie, J. Elenter, and A. Ribeiro. Neural Networks with Quantization Constraints. In ICASSP, 2023. (Cit. on p. 2, 7)

  44. [44]

    Hounie, A

    I. Hounie, A. Ribeiro, and L. F. Chamon. Resilient Constrained Learning. In NeurIPS, 2024. (Cit. on p. 2, 9)

  45. [45]

    Kingma and J

    D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015. (Cit. on p. 1, 4, 10, 19)

  46. [46]

    Kumar, P

    A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In ICLR, 2018. (Cit. on p. 1)

  47. [47]

    Machine Bias

    J. Larson et al. Data and analysis for “Machine Bias”. https://github.com/propublica/ compas-analysis, 2016. (Cit. on p. 1)

  48. [48]

    E. S. Levitin and B. T. Polyak. Constrained Minimization Methods. USSR Computational mathematics and mathematical physics, 1966. (Cit. on p. 9)

  49. [49]

    T. Lin, C. Jin, and M. I. Jordan. Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization. JMLR, 2025. (Cit. on p. 7)

  50. [50]

    Louizos, M

    C. Louizos, M. Welling, and D. P. Kingma. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 2018. (Cit. on p. 1, 19)

  51. [51]

    O. L. Mangasarian and S. Fromovitz. The Fritz John Necessary Optimality Conditions in the Presence of Equality and Inequality Constraints. Journal of Mathematical Analysis and Applications, 1967. (Cit. on p. 3, 9)

  52. [52]

    Miettinen

    K. Miettinen. Nonlinear Multiobjective Optimization. Springer Science & Business Media,

  53. [53]

    Narasimhan, A

    H. Narasimhan, A. Cotter, Y . Zhou, S. Wang, and W. Guo. Approximate Heavily-Constrained Learning with Lagrange Multiplier Models. In NeurIPS, 2020. (Cit. on p. 2, 7, 9)

  54. [54]

    Nocedal and S

    J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006. (Cit. on p. 9, 22)

  55. [55]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023. (Cit. on p. 1)

  56. [56]

    Paszke et al

    A. Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019. (Cit. on p. 2, 7, 18)

  57. [57]

    J. C. Platt and A. H. Barr. Constrained Differential Optimization. In NeurIPS, 1987. (Cit. on p. 4, 10, 22)

  58. [58]

    M. J. Powell. The convergence of variable metric methods for nonlinearly constrained optimiza- tion calculations. In Nonlinear programming 3. Elsevier, 1978. (Cit. on p. 9) 13

  59. [59]

    Ramirez, I

    J. Ramirez, I. Hounie, J. Elenter, J. Gallego-Posada, M. Hashemizadeh, A. Ribeiro, and S. Lacoste-Julien. Feasible Learning. In AISTATS, 2025. (Cit. on p. 3, 7)

  60. [60]

    Robey, L

    A. Robey, L. Chamon, G. J. Pappas, H. Hassani, and A. Ribeiro. Adversarial Robustness with Semi-Infinite Constrained Learning. In NeurIPS, 2021. (Cit. on p. 2, 7)

  61. [61]

    E. Shi, L. Kong, and B. Jiang. Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks. arXiv preprint arXiv:2504.06470, 2025. (Cit. on p. 1)

  62. [62]

    Sohrabi, J

    M. Sohrabi, J. Ramirez, T. H. Zhang, S. Lacoste-Julien, and J. Gallego-Posada. On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization. In ICML, 2024. (Cit. on p. 7, 8, 10, 18, 22)

  63. [63]

    Stooke, J

    A. Stooke, J. Achiam, and P. Abbeel. Responsive Safety in Reinforcement Learning by PID Lagrangian Methods. In ICML, 2020. (Cit. on p. 2, 7, 10, 22)

  64. [64]

    R. B. Wilson. A simplicial algorithm for concave programming. PhD Thesis, Graduate School of Bussiness Administration, 1963. (Cit. on p. 9)

  65. [65]

    Zhang, Y

    G. Zhang, Y . Wang, L. Lessard, and R. B. Grosse. Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. In AISTATS, 2022. (Cit. on p. 7)

  66. [66]

    J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV, 2017. (Cit. on p. 1)

  67. [67]

    Zoutendijk

    G. Zoutendijk. Methods of Feasible Directions: A Study in Linear and Non-linear Programming. Elsevier Publishing Company, 1960. (Cit. on p. 9) 14 Appendix A Proofs 16 B Experimental Details 18 B.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Sparsity Constraints . . . . . . . . . . . . . . . . . . . . . . ...

  68. [68]

    We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

    Primal feasibility. We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

  69. [69]

    Stationarity. The gradients of L with respect to the primal variables vanish at (x∗, y∗, λ∗): ∂L ∂x = −(1 + y2) sin(x) + λ(1 + y2) cos(x) (x∗,y∗,λ∗) = −ϵ + λ∗ p 1 − ϵ2 = 0, (13) ∂L ∂y = 2y cos(x) + 2λy sin(x) (x∗,y∗,λ∗) = 0. (14)

  70. [70]

    We have λ∗ [g(x∗, y∗) − ϵ] = 0

    Complementary slackness. We have λ∗ [g(x∗, y∗) − ϵ] = 0 . Since λ∗ > 0, strict comple- mentary slackness holds. We now verify that(x∗, y∗, λ∗) satisfies the second-order sufficient conditions for optimality [5, Prop. 4.3.2]. Consider the Hessian of L with respect to (x, y): ∇2 x,yL = −(1 + y2) (cos(x) + λ sin(x)) −2y sin(x) + 2λy cos(x) −2y sin(x) + 2λy c...

  71. [71]

    To construct the penalized formulation of Eq

    and use a differentiable surrogate to update the model parameters, while still using the true non-differentiable constraint to update the multipliers.9 As a surrogate, we use the hinge loss: P (ˆy = 0) = E(x,y)∼D 1 − σ(w⊤x + b) ≥ 0.7, (24) which represents the expected proportion of inputs predicted as class 0. To construct the penalized formulation of Eq...

  72. [72]

    Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs

    The constrained approach achieves the target class prediction rate across a wide range of dual step-sizes. Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs. Dual Step-size Class 0 Percentage (%) Accuracy (%) 0 49 .75 99 .75 1.00 × 10−4 49.75 99 .75 2.15 × 10−4 49.75 99 .75 4.60 × ...