Position: Adopt Constraints Over Fixed Penalties in Deep Learning

Juan Ramirez; Meraj Hashemizadeh; Simon Lacoste-Julien

arxiv: 2505.20628 · v4 · submitted 2025-05-27 · 💻 cs.LG · math.OC

Position: Adopt Constraints Over Fixed Penalties in Deep Learning

Juan Ramirez , Meraj Hashemizadeh , Simon Lacoste-Julien This is my paper

Pith reviewed 2026-05-19 13:19 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords deep learningconstrained optimizationpenalty methodstrustworthy AInon-convex optimizationhard constraintsposition paper

0 comments

The pith

Deep learning problems with non-negotiable requirements should start from the constrained formulation rather than fixed-penalty surrogates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed weighted-sum penalization is ill-suited when deep learning tasks include hard requirements. In non-convex landscapes the penalized objective is generally not equivalent to the original constrained problem, so solving one does not guarantee satisfying the other. Fixed penalties convert strict rules into soft trade-offs against task loss, and selecting the right coefficients requires repeated trial-and-error that can redefine the problem being solved. A sympathetic reader would care because applications such as safety-critical or fairness-aware systems need guarantees that penalization cannot reliably deliver. The position is therefore to treat the constrained problem as the primary object and select solution methods according to its structure and scale.

Core claim

When a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. This is because the penalized and constrained problems are generally not equivalent in non-convex settings, fixed penalties weaken hard requirements into weighted trade-offs, and choosing coefficients often involves costly trial and error that risks solving the wrong objective altogether. Solution strategies should then be chosen based on the problem's structure and scale.

What carries the argument

The direct constrained optimization formulation versus the fixed-penalty scalarized objective as alternative starting points for optimization.

If this is right

Optimization would focus on methods that preserve hard constraints rather than trading them off via weights.
Non-negotiable requirements would stay binding throughout training instead of becoming adjustable penalties.
Hyper-parameter search would target solver choice and structure rather than penalty magnitudes.
Applications needing strict compliance could measure success directly by constraint satisfaction rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Development of new scalable constrained solvers tailored to neural network training would become a priority.
Benchmarks could shift emphasis from penalized loss values to explicit rates of constraint violation.
Connections could emerge to projection-based or differentiable optimization techniques already used in other domains.

Load-bearing premise

That practitioners can reliably select and apply appropriate solution strategies for the constrained formulation at deep learning scale based on problem structure.

What would settle it

A concrete demonstration of a non-convex deep network training task with hard constraints where fixed penalization achieves consistent satisfaction without coefficient retuning or equivalence failures.

Figures

Figures reproduced from arXiv: 2505.20628 by Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien.

**Figure 1.** Figure 1: Solving a sparsityconstrained training problem. Starting from coefficients yielding low (25%) and high (85%) model density, five bisection search steps are needed to reach 50% sparsity. Annotations show iterations; endpoints are labeled as 0. Reproduces [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 2.** Figure 2: Solving a binary classification task under a constraint that at least 70% of predictions belong [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: The Pareto front of non-dominated solutions between the objective [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

read the original abstract

Recent efforts to develop trustworthy AI systems have increased interest in learning problems with explicit requirements, or constraints. In deep learning, however, such problems are often handled through fixed weighted-sum penalization: the constraints are added to the task loss with fixed coefficients, and the resulting scalarized objective is minimized. This position paper argues that fixed penalization is often ill-suited for deep learning problems with non-negotiable requirements for several reasons. First, in non-convex settings, the penalized and constrained problems are generally not equivalent, so solving the former need not solve the latter. Second, fixed penalization weakens hard requirements into soft penalties to be traded off against task performance. Third, choosing penalty coefficients to indirectly solve the constrained problem often involves costly trial and error, because changing them alters the penalized objective itself, and hence can mean solving the wrong problem altogether. We therefore argue that, when a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. The appropriate solution strategy should then be chosen based on the problem's structure and scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper is a position piece arguing that deep learning problems with non-negotiable constraints should be formulated and solved as constrained optimization problems rather than via fixed weighted-sum penalization. It gives three reasons: (1) in non-convex regimes the penalized surrogate is generally not equivalent to the original constrained problem, (2) fixed penalties convert hard requirements into soft trade-offs, and (3) tuning the penalty coefficients changes the objective itself and therefore often amounts to solving the wrong problem. The manuscript concludes that the constrained formulation should be the starting point, after which an appropriate solver is selected according to problem structure and scale.

Significance. If the core distinctions hold, the position could usefully redirect attention in trustworthy-AI research toward constrained formulations that preserve hard requirements instead of relying on surrogate penalties whose behavior is harder to control. The arguments rest on standard facts about non-convex optimization and penalty methods; the paper does not claim new theorems or empirical results.

major comments (1)

[Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the single major comment below and will revise the manuscript to improve its actionability while preserving the core position.

read point-by-point responses

Referee: [Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.

Authors: We agree that the manuscript would be strengthened by explicit references to existing constrained optimization techniques and a brief discussion of their practical status at deep-learning scale. Although the paper is a position piece whose primary contribution is to argue that the constrained formulation should be the starting point (rather than a fixed-penalty surrogate), we accept that readers need some indication of how one would then proceed. In the revision we will add a short paragraph citing representative work on augmented-Lagrangian methods, constrained SGD variants, and related approaches for non-convex problems, together with a concise note on their current scalability characteristics and open challenges. This addition supports the final sentence of the abstract without altering the central claim or introducing new empirical results. revision: yes

Circularity Check

0 steps flagged

No circularity in the position paper's argument chain.

full rationale

The paper is a position statement that rests on standard, externally established facts from optimization theory: non-equivalence of penalized and constrained problems in non-convex regimes, the softening effect of fixed penalties, and the trial-and-error cost of coefficient selection. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the central recommendation. The final sentence simply states that an appropriate solver should be chosen according to problem structure; this is presented as a logical consequence rather than a reduction to any input or prior self-referential result. The argument is therefore self-contained and draws on independent mathematical properties.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central recommendation rests on standard results from non-convex optimization theory and practical observations about hyperparameter sensitivity in deep learning; no new entities or fitted parameters are introduced.

axioms (2)

standard math In non-convex settings, the penalized and constrained problems are generally not equivalent.
Invoked as the first reason in the abstract.
domain assumption Fixed penalization weakens hard requirements into soft penalties to be traded off against task performance.
Invoked as the second reason in the abstract.

pith-pipeline@v0.9.0 · 5729 in / 1278 out tokens · 40955 ms · 2026-05-19T13:19:15.783921+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

fixed weighted-sum penalization... the penalized and constrained problems are generally not equivalent
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lagrangian approach... optimizes its coefficients—the multipliers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors

[1]

Abadi et al

M. Abadi et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016. (Cit. on p. 2, 7)

work page 2016
[2]

Arrow, L

K. Arrow, L. Hurwicz, and H. Uzawa.Studies in Linear and Non-linear Programming. Stanford University Press, 1958. (Cit. on p. 2, 6, 7, 9)

work page 1958
[3]

M. R. Bachute and J. M. Subhedar. Autonomous Driving Architectures. Machine Learning with Applications, 2021. (Cit. on p. 1)

work page 2021
[4]

Bertsekas

D. Bertsekas. On the Method of Multipliers for Convex Programming. IEEE Transactions on Automatic Control, 1975. (Cit. on p. 7, 9, 22)

work page 1975
[5]

Bertsekas

D. Bertsekas. Nonlinear Programming. Athena Scientific, 2016. (Cit. on p. 3, 16, 22)

work page 2016
[6]

D. P. Bertsekas. On the Goldstein-Levitin-Polyak Gradient Projection Method. IEEE Transac- tions on automatic control, 1976. (Cit. on p. 9)

work page 1976
[7]

Bhatore et al

S. Bhatore et al. Machine learning techniques for credit risk evaluation. Journal of Banking and Financial Technology, 2020. (Cit. on p. 1)

work page 2020
[8]

E. G. Birgin and J. M. Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. The SIAM series on Fundamentals of Algorithms, 2014. (Cit. on p. 3)

work page 2014
[9]

Boyd and L

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. (Cit. on p. 3, 9)

work page 2004
[10]

Bradbury et al

J. Bradbury et al. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018. (Cit. on p. 7)

work page 2018
[11]

Brouillard, S

P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin. Differentiable Causal Discovery from Interventional Data. In NeurIPS, 2020. (Cit. on p. 1)

work page 2020
[12]

Chakraborty, Y

D. Chakraborty, Y . LeCun, T. G. Rudner, and E. Learned-Miller. Improving Pre-trained Self- Supervised Embeddings Through Effective Entropy Maximization. In AISTATS, 2025. (Cit. on p. 1)

work page 2025
[13]

Chamon and A

L. Chamon and A. Ribeiro. Probably Approximately Correct Constrained Learning. In NeurIPS,

work page
[14]

X. Chen, Y . Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Inter- pretable Representation Learning by Information Maximizing Generative Adversarial Nets. In NeurIPS, 2016. (Cit. on p. 1)

work page 2016
[15]

Cotter, M

A. Cotter, M. Gupta, H. Jiang, N. Srebro, K. Sridharan, S. Wang, B. Woodworth, and S. You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Con- straints. In ICML, 2019. (Cit. on p. 10) 11

work page 2019
[16]

Cotter, H

A. Cotter, H. Jiang, M. R. Gupta, S. Wang, T. Narayan, S. You, and K. Sridharan. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. JMLR, 2019. (Cit. on p. 1, 2, 3, 7, 8, 9, 19, 20)

work page 2019
[17]

Cotter et al

A. Cotter et al. TensorFlow Constrained Optimization (TFCO). https://github.com/ google-research/tensorflow_constrained_optimization, 2019. (Cit. on p. 2, 7)

work page 2019
[18]

G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Es- chenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rab- bat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking Neural Network Training Algorithms. arXiv preprin...

work page arXiv 2023
[19]

J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y . Wang, and Y . Yang. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In ICLR, 2024. (Cit. on p. 1, 2, 3, 7)

work page 2024
[20]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014. (Cit. on p. 9)

work page 2014
[21]

Degrave and I

J. Degrave and I. Korshunova. Why machine learning algorithms are hard to tune and how to fix it. Engraved, blog: www.engraved.blog/why-machine-learning-algorithms-are-hard-to-tune/,

work page
[22]

(Cit. on p. 2, 4, 5, 18, 22)

work page
[23]

Degrave and I

J. Degrave and I. Korshunova. How we can make machine learning algorithms tunable. En- graved, blog: www.engraved.blog/how-we-can-make-machine-learning-algorithms-tunable/,

work page
[24]

(Cit. on p. 2, 5, 18, 22)

work page
[25]

I. I. Dikin. Iterative Solution of Problems of Linear and Quadratic Programming. In Doklady Akademii Nauk. Russian Academy of Sciences, 1967. (Cit. on p. 9)

work page 1967
[26]

Dilhac et al

M.-A. Dilhac et al. Montréal Declaration for a Responsible Development of Artificial Intelli- gence, 2018. (Cit. on p. 1)

work page 2018
[27]

Dunion, T

M. Dunion, T. McInroe, K. Luck, J. Hanna, and S. Albrecht. Conditional Mutual Information for Disentangled Representations in Reinforcement Learning. In NeurIPS, 2023. (Cit. on p. 1)

work page 2023
[28]

M. Ehrgott. Multicriteria Optimization. Springer Science & Business Media, 2005. (Cit. on p. 1, 2, 4, 5)

work page 2005
[29]

Elenter, N

J. Elenter, N. NaderiAlizadeh, and A. Ribeiro. A Lagrangian Duality Approach to Active Learning. In NeurIPS, 2022. (Cit. on p. 2, 7)

work page 2022
[30]

Artificial Intelligence Act

European Parliament. Artificial Intelligence Act. https://artificialintelligenceact.eu, 2024. (Cit. on p. 1)

work page 2024
[31]

Frank and P

M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 1956. (Cit. on p. 9)

work page 1956
[32]

Gallego-Posada

J. Gallego-Posada. Constrained Optimization for Machine Learning: Algorithms and Applica- tions. PhD Thesis, University of Montreal, 2024. (Cit. on p. 2, 4, 6)

work page 2024
[33]

Gallego-Posada, J

J. Gallego-Posada, J. Ramirez, A. Erraqabi, Y . Bengio, and S. Lacoste-Julien. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. In NeurIPS, 2022. (Cit. on p. 2, 3, 4, 6, 7, 11, 19)

work page 2022
[34]

Gallego-Posada, J

J. Gallego-Posada, J. Ramirez, M. Hashemizadeh, and S. Lacoste-Julien. Cooper: A Library for Constrained Optimization in Deep Learning. arXiv preprint arXiv:2504.01212, 2025. (Cit. on p. 2, 7, 18)

work page arXiv 2025
[35]

J. Gauvin. A Necessary and Sufficient Regularity Condition to Have Bounded Multipliers in Nonconvex Programming. Mathematical Programming, 1977. (Cit. on p. 9)

work page 1977
[36]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. (Cit. on p. 1) 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

A. A. Goldstein. Convex Programming in Hilbert Space. University of Washington, 1964. (Cit. on p. 9)

work page 1964
[38]

R. M. Gower, M. Schmidt, F. Bach, and P. Richtárik. Variance-Reduced Methods for Machine Learning. Proceedings of the IEEE, 2020. (Cit. on p. 9)

work page 2020
[39]

S.-P. Han. A globally convergent method for nonlinear programming. Journal of optimization theory and applications, 1977. (Cit. on p. 9)

work page 1977
[40]

Hashemizadeh, J

M. Hashemizadeh, J. Ramirez, R. Sukumaran, G. Farnadi, S. Lacoste-Julien, and J. Gallego- Posada. Balancing Act: Constraining Disparate Impact in Sparse Models. In ICLR, 2024. (Cit. on p. 2, 7, 9, 10)

work page 2024
[41]

Higgins, L

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerch- ner. beta-V AE: Learning Basic Visual Concepts with a Constrained Variational Framework . In ICLR, 2017. (Cit. on p. 1)

work page 2017
[42]

Hounie, L

I. Hounie, L. F. Chamon, and A. Ribeiro. Automatic Data Augmentation via Invariance- Constrained Learning. In ICML, 2023. (Cit. on p. 7)

work page 2023
[43]

Hounie, J

I. Hounie, J. Elenter, and A. Ribeiro. Neural Networks with Quantization Constraints. In ICASSP, 2023. (Cit. on p. 2, 7)

work page 2023
[44]

Hounie, A

I. Hounie, A. Ribeiro, and L. F. Chamon. Resilient Constrained Learning. In NeurIPS, 2024. (Cit. on p. 2, 9)

work page 2024
[45]

Kingma and J

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015. (Cit. on p. 1, 4, 10, 19)

work page 2015
[46]

Kumar, P

A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In ICLR, 2018. (Cit. on p. 1)

work page 2018
[47]

Machine Bias

J. Larson et al. Data and analysis for “Machine Bias”. https://github.com/propublica/ compas-analysis, 2016. (Cit. on p. 1)

work page 2016
[48]

E. S. Levitin and B. T. Polyak. Constrained Minimization Methods. USSR Computational mathematics and mathematical physics, 1966. (Cit. on p. 9)

work page 1966
[49]

T. Lin, C. Jin, and M. I. Jordan. Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization. JMLR, 2025. (Cit. on p. 7)

work page 2025
[50]

Louizos, M

C. Louizos, M. Welling, and D. P. Kingma. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 2018. (Cit. on p. 1, 19)

work page 2018
[51]

O. L. Mangasarian and S. Fromovitz. The Fritz John Necessary Optimality Conditions in the Presence of Equality and Inequality Constraints. Journal of Mathematical Analysis and Applications, 1967. (Cit. on p. 3, 9)

work page 1967
[52]

Miettinen

K. Miettinen. Nonlinear Multiobjective Optimization. Springer Science & Business Media,

work page
[53]

Narasimhan, A

H. Narasimhan, A. Cotter, Y . Zhou, S. Wang, and W. Guo. Approximate Heavily-Constrained Learning with Lagrange Multiplier Models. In NeurIPS, 2020. (Cit. on p. 2, 7, 9)

work page 2020
[54]

Nocedal and S

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006. (Cit. on p. 9, 22)

work page 2006
[55]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023. (Cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Paszke et al

A. Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019. (Cit. on p. 2, 7, 18)

work page 2019
[57]

J. C. Platt and A. H. Barr. Constrained Differential Optimization. In NeurIPS, 1987. (Cit. on p. 4, 10, 22)

work page 1987
[58]

M. J. Powell. The convergence of variable metric methods for nonlinearly constrained optimiza- tion calculations. In Nonlinear programming 3. Elsevier, 1978. (Cit. on p. 9) 13

work page 1978
[59]

Ramirez, I

J. Ramirez, I. Hounie, J. Elenter, J. Gallego-Posada, M. Hashemizadeh, A. Ribeiro, and S. Lacoste-Julien. Feasible Learning. In AISTATS, 2025. (Cit. on p. 3, 7)

work page 2025
[60]

Robey, L

A. Robey, L. Chamon, G. J. Pappas, H. Hassani, and A. Ribeiro. Adversarial Robustness with Semi-Infinite Constrained Learning. In NeurIPS, 2021. (Cit. on p. 2, 7)

work page 2021
[61]

E. Shi, L. Kong, and B. Jiang. Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks. arXiv preprint arXiv:2504.06470, 2025. (Cit. on p. 1)

work page arXiv 2025
[62]

Sohrabi, J

M. Sohrabi, J. Ramirez, T. H. Zhang, S. Lacoste-Julien, and J. Gallego-Posada. On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization. In ICML, 2024. (Cit. on p. 7, 8, 10, 18, 22)

work page 2024
[63]

Stooke, J

A. Stooke, J. Achiam, and P. Abbeel. Responsive Safety in Reinforcement Learning by PID Lagrangian Methods. In ICML, 2020. (Cit. on p. 2, 7, 10, 22)

work page 2020
[64]

R. B. Wilson. A simplicial algorithm for concave programming. PhD Thesis, Graduate School of Bussiness Administration, 1963. (Cit. on p. 9)

work page 1963
[65]

Zhang, Y

G. Zhang, Y . Wang, L. Lessard, and R. B. Grosse. Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. In AISTATS, 2022. (Cit. on p. 7)

work page 2022
[66]

J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV, 2017. (Cit. on p. 1)

work page 2017
[67]

Zoutendijk

G. Zoutendijk. Methods of Feasible Directions: A Study in Linear and Non-linear Programming. Elsevier Publishing Company, 1960. (Cit. on p. 9) 14 Appendix A Proofs 16 B Experimental Details 18 B.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Sparsity Constraints . . . . . . . . . . . . . . . . . . . . . . ...

work page 1960
[68]

We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

Primal feasibility. We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

work page
[69]

Stationarity. The gradients of L with respect to the primal variables vanish at (x∗, y∗, λ∗): ∂L ∂x = −(1 + y2) sin(x) + λ(1 + y2) cos(x) (x∗,y∗,λ∗) = −ϵ + λ∗ p 1 − ϵ2 = 0, (13) ∂L ∂y = 2y cos(x) + 2λy sin(x) (x∗,y∗,λ∗) = 0. (14)

work page
[70]

We have λ∗ [g(x∗, y∗) − ϵ] = 0

Complementary slackness. We have λ∗ [g(x∗, y∗) − ϵ] = 0 . Since λ∗ > 0, strict comple- mentary slackness holds. We now verify that(x∗, y∗, λ∗) satisfies the second-order sufficient conditions for optimality [5, Prop. 4.3.2]. Consider the Hessian of L with respect to (x, y): ∇2 x,yL = −(1 + y2) (cos(x) + λ sin(x)) −2y sin(x) + 2λy cos(x) −2y sin(x) + 2λy c...

work page
[71]

To construct the penalized formulation of Eq

and use a differentiable surrogate to update the model parameters, while still using the true non-differentiable constraint to update the multipliers.9 As a surrogate, we use the hinge loss: P (ˆy = 0) = E(x,y)∼D 1 − σ(w⊤x + b) ≥ 0.7, (24) which represents the expected proportion of inputs predicted as class 0. To construct the penalized formulation of Eq...

work page
[72]

Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs

The constrained approach achieves the target class prediction rate across a wide range of dual step-sizes. Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs. Dual Step-size Class 0 Percentage (%) Accuracy (%) 0 49 .75 99 .75 1.00 × 10−4 49.75 99 .75 2.15 × 10−4 49.75 99 .75 4.60 × ...

work page

[1] [1]

Abadi et al

M. Abadi et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016. (Cit. on p. 2, 7)

work page 2016

[2] [2]

Arrow, L

K. Arrow, L. Hurwicz, and H. Uzawa.Studies in Linear and Non-linear Programming. Stanford University Press, 1958. (Cit. on p. 2, 6, 7, 9)

work page 1958

[3] [3]

M. R. Bachute and J. M. Subhedar. Autonomous Driving Architectures. Machine Learning with Applications, 2021. (Cit. on p. 1)

work page 2021

[4] [4]

Bertsekas

D. Bertsekas. On the Method of Multipliers for Convex Programming. IEEE Transactions on Automatic Control, 1975. (Cit. on p. 7, 9, 22)

work page 1975

[5] [5]

Bertsekas

D. Bertsekas. Nonlinear Programming. Athena Scientific, 2016. (Cit. on p. 3, 16, 22)

work page 2016

[6] [6]

D. P. Bertsekas. On the Goldstein-Levitin-Polyak Gradient Projection Method. IEEE Transac- tions on automatic control, 1976. (Cit. on p. 9)

work page 1976

[7] [7]

Bhatore et al

S. Bhatore et al. Machine learning techniques for credit risk evaluation. Journal of Banking and Financial Technology, 2020. (Cit. on p. 1)

work page 2020

[8] [8]

E. G. Birgin and J. M. Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. The SIAM series on Fundamentals of Algorithms, 2014. (Cit. on p. 3)

work page 2014

[9] [9]

Boyd and L

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. (Cit. on p. 3, 9)

work page 2004

[10] [10]

Bradbury et al

J. Bradbury et al. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018. (Cit. on p. 7)

work page 2018

[11] [11]

Brouillard, S

P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin. Differentiable Causal Discovery from Interventional Data. In NeurIPS, 2020. (Cit. on p. 1)

work page 2020

[12] [12]

Chakraborty, Y

D. Chakraborty, Y . LeCun, T. G. Rudner, and E. Learned-Miller. Improving Pre-trained Self- Supervised Embeddings Through Effective Entropy Maximization. In AISTATS, 2025. (Cit. on p. 1)

work page 2025

[13] [13]

Chamon and A

L. Chamon and A. Ribeiro. Probably Approximately Correct Constrained Learning. In NeurIPS,

work page

[14] [14]

X. Chen, Y . Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Inter- pretable Representation Learning by Information Maximizing Generative Adversarial Nets. In NeurIPS, 2016. (Cit. on p. 1)

work page 2016

[15] [15]

Cotter, M

A. Cotter, M. Gupta, H. Jiang, N. Srebro, K. Sridharan, S. Wang, B. Woodworth, and S. You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Con- straints. In ICML, 2019. (Cit. on p. 10) 11

work page 2019

[16] [16]

Cotter, H

A. Cotter, H. Jiang, M. R. Gupta, S. Wang, T. Narayan, S. You, and K. Sridharan. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. JMLR, 2019. (Cit. on p. 1, 2, 3, 7, 8, 9, 19, 20)

work page 2019

[17] [17]

Cotter et al

A. Cotter et al. TensorFlow Constrained Optimization (TFCO). https://github.com/ google-research/tensorflow_constrained_optimization, 2019. (Cit. on p. 2, 7)

work page 2019

[18] [18]

G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Es- chenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rab- bat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking Neural Network Training Algorithms. arXiv preprin...

work page arXiv 2023

[19] [19]

J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y . Wang, and Y . Yang. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In ICLR, 2024. (Cit. on p. 1, 2, 3, 7)

work page 2024

[20] [20]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014. (Cit. on p. 9)

work page 2014

[21] [21]

Degrave and I

J. Degrave and I. Korshunova. Why machine learning algorithms are hard to tune and how to fix it. Engraved, blog: www.engraved.blog/why-machine-learning-algorithms-are-hard-to-tune/,

work page

[22] [22]

(Cit. on p. 2, 4, 5, 18, 22)

work page

[23] [23]

Degrave and I

J. Degrave and I. Korshunova. How we can make machine learning algorithms tunable. En- graved, blog: www.engraved.blog/how-we-can-make-machine-learning-algorithms-tunable/,

work page

[24] [24]

(Cit. on p. 2, 5, 18, 22)

work page

[25] [25]

I. I. Dikin. Iterative Solution of Problems of Linear and Quadratic Programming. In Doklady Akademii Nauk. Russian Academy of Sciences, 1967. (Cit. on p. 9)

work page 1967

[26] [26]

Dilhac et al

M.-A. Dilhac et al. Montréal Declaration for a Responsible Development of Artificial Intelli- gence, 2018. (Cit. on p. 1)

work page 2018

[27] [27]

Dunion, T

M. Dunion, T. McInroe, K. Luck, J. Hanna, and S. Albrecht. Conditional Mutual Information for Disentangled Representations in Reinforcement Learning. In NeurIPS, 2023. (Cit. on p. 1)

work page 2023

[28] [28]

M. Ehrgott. Multicriteria Optimization. Springer Science & Business Media, 2005. (Cit. on p. 1, 2, 4, 5)

work page 2005

[29] [29]

Elenter, N

J. Elenter, N. NaderiAlizadeh, and A. Ribeiro. A Lagrangian Duality Approach to Active Learning. In NeurIPS, 2022. (Cit. on p. 2, 7)

work page 2022

[30] [30]

Artificial Intelligence Act

European Parliament. Artificial Intelligence Act. https://artificialintelligenceact.eu, 2024. (Cit. on p. 1)

work page 2024

[31] [31]

Frank and P

M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 1956. (Cit. on p. 9)

work page 1956

[32] [32]

Gallego-Posada

J. Gallego-Posada. Constrained Optimization for Machine Learning: Algorithms and Applica- tions. PhD Thesis, University of Montreal, 2024. (Cit. on p. 2, 4, 6)

work page 2024

[33] [33]

Gallego-Posada, J

J. Gallego-Posada, J. Ramirez, A. Erraqabi, Y . Bengio, and S. Lacoste-Julien. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. In NeurIPS, 2022. (Cit. on p. 2, 3, 4, 6, 7, 11, 19)

work page 2022

[34] [34]

Gallego-Posada, J

J. Gallego-Posada, J. Ramirez, M. Hashemizadeh, and S. Lacoste-Julien. Cooper: A Library for Constrained Optimization in Deep Learning. arXiv preprint arXiv:2504.01212, 2025. (Cit. on p. 2, 7, 18)

work page arXiv 2025

[35] [35]

J. Gauvin. A Necessary and Sufficient Regularity Condition to Have Bounded Multipliers in Nonconvex Programming. Mathematical Programming, 1977. (Cit. on p. 9)

work page 1977

[36] [36]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. (Cit. on p. 1) 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

A. A. Goldstein. Convex Programming in Hilbert Space. University of Washington, 1964. (Cit. on p. 9)

work page 1964

[38] [38]

R. M. Gower, M. Schmidt, F. Bach, and P. Richtárik. Variance-Reduced Methods for Machine Learning. Proceedings of the IEEE, 2020. (Cit. on p. 9)

work page 2020

[39] [39]

S.-P. Han. A globally convergent method for nonlinear programming. Journal of optimization theory and applications, 1977. (Cit. on p. 9)

work page 1977

[40] [40]

Hashemizadeh, J

M. Hashemizadeh, J. Ramirez, R. Sukumaran, G. Farnadi, S. Lacoste-Julien, and J. Gallego- Posada. Balancing Act: Constraining Disparate Impact in Sparse Models. In ICLR, 2024. (Cit. on p. 2, 7, 9, 10)

work page 2024

[41] [41]

Higgins, L

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerch- ner. beta-V AE: Learning Basic Visual Concepts with a Constrained Variational Framework . In ICLR, 2017. (Cit. on p. 1)

work page 2017

[42] [42]

Hounie, L

I. Hounie, L. F. Chamon, and A. Ribeiro. Automatic Data Augmentation via Invariance- Constrained Learning. In ICML, 2023. (Cit. on p. 7)

work page 2023

[43] [43]

Hounie, J

I. Hounie, J. Elenter, and A. Ribeiro. Neural Networks with Quantization Constraints. In ICASSP, 2023. (Cit. on p. 2, 7)

work page 2023

[44] [44]

Hounie, A

I. Hounie, A. Ribeiro, and L. F. Chamon. Resilient Constrained Learning. In NeurIPS, 2024. (Cit. on p. 2, 9)

work page 2024

[45] [45]

Kingma and J

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015. (Cit. on p. 1, 4, 10, 19)

work page 2015

[46] [46]

Kumar, P

A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In ICLR, 2018. (Cit. on p. 1)

work page 2018

[47] [47]

Machine Bias

J. Larson et al. Data and analysis for “Machine Bias”. https://github.com/propublica/ compas-analysis, 2016. (Cit. on p. 1)

work page 2016

[48] [48]

E. S. Levitin and B. T. Polyak. Constrained Minimization Methods. USSR Computational mathematics and mathematical physics, 1966. (Cit. on p. 9)

work page 1966

[49] [49]

T. Lin, C. Jin, and M. I. Jordan. Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization. JMLR, 2025. (Cit. on p. 7)

work page 2025

[50] [50]

Louizos, M

C. Louizos, M. Welling, and D. P. Kingma. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 2018. (Cit. on p. 1, 19)

work page 2018

[51] [51]

O. L. Mangasarian and S. Fromovitz. The Fritz John Necessary Optimality Conditions in the Presence of Equality and Inequality Constraints. Journal of Mathematical Analysis and Applications, 1967. (Cit. on p. 3, 9)

work page 1967

[52] [52]

Miettinen

K. Miettinen. Nonlinear Multiobjective Optimization. Springer Science & Business Media,

work page

[53] [53]

Narasimhan, A

H. Narasimhan, A. Cotter, Y . Zhou, S. Wang, and W. Guo. Approximate Heavily-Constrained Learning with Lagrange Multiplier Models. In NeurIPS, 2020. (Cit. on p. 2, 7, 9)

work page 2020

[54] [54]

Nocedal and S

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006. (Cit. on p. 9, 22)

work page 2006

[55] [55]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023. (Cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Paszke et al

A. Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019. (Cit. on p. 2, 7, 18)

work page 2019

[57] [57]

J. C. Platt and A. H. Barr. Constrained Differential Optimization. In NeurIPS, 1987. (Cit. on p. 4, 10, 22)

work page 1987

[58] [58]

M. J. Powell. The convergence of variable metric methods for nonlinearly constrained optimiza- tion calculations. In Nonlinear programming 3. Elsevier, 1978. (Cit. on p. 9) 13

work page 1978

[59] [59]

Ramirez, I

J. Ramirez, I. Hounie, J. Elenter, J. Gallego-Posada, M. Hashemizadeh, A. Ribeiro, and S. Lacoste-Julien. Feasible Learning. In AISTATS, 2025. (Cit. on p. 3, 7)

work page 2025

[60] [60]

Robey, L

A. Robey, L. Chamon, G. J. Pappas, H. Hassani, and A. Ribeiro. Adversarial Robustness with Semi-Infinite Constrained Learning. In NeurIPS, 2021. (Cit. on p. 2, 7)

work page 2021

[61] [61]

E. Shi, L. Kong, and B. Jiang. Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks. arXiv preprint arXiv:2504.06470, 2025. (Cit. on p. 1)

work page arXiv 2025

[62] [62]

Sohrabi, J

M. Sohrabi, J. Ramirez, T. H. Zhang, S. Lacoste-Julien, and J. Gallego-Posada. On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization. In ICML, 2024. (Cit. on p. 7, 8, 10, 18, 22)

work page 2024

[63] [63]

Stooke, J

A. Stooke, J. Achiam, and P. Abbeel. Responsive Safety in Reinforcement Learning by PID Lagrangian Methods. In ICML, 2020. (Cit. on p. 2, 7, 10, 22)

work page 2020

[64] [64]

R. B. Wilson. A simplicial algorithm for concave programming. PhD Thesis, Graduate School of Bussiness Administration, 1963. (Cit. on p. 9)

work page 1963

[65] [65]

Zhang, Y

G. Zhang, Y . Wang, L. Lessard, and R. B. Grosse. Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. In AISTATS, 2022. (Cit. on p. 7)

work page 2022

[66] [66]

J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV, 2017. (Cit. on p. 1)

work page 2017

[67] [67]

Zoutendijk

G. Zoutendijk. Methods of Feasible Directions: A Study in Linear and Non-linear Programming. Elsevier Publishing Company, 1960. (Cit. on p. 9) 14 Appendix A Proofs 16 B Experimental Details 18 B.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Sparsity Constraints . . . . . . . . . . . . . . . . . . . . . . ...

work page 1960

[68] [68]

We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

Primal feasibility. We have g(x∗, y∗) = sin(arcsin( ϵ)) = ϵ ≤ ϵ

work page

[69] [69]

Stationarity. The gradients of L with respect to the primal variables vanish at (x∗, y∗, λ∗): ∂L ∂x = −(1 + y2) sin(x) + λ(1 + y2) cos(x) (x∗,y∗,λ∗) = −ϵ + λ∗ p 1 − ϵ2 = 0, (13) ∂L ∂y = 2y cos(x) + 2λy sin(x) (x∗,y∗,λ∗) = 0. (14)

work page

[70] [70]

We have λ∗ [g(x∗, y∗) − ϵ] = 0

Complementary slackness. We have λ∗ [g(x∗, y∗) − ϵ] = 0 . Since λ∗ > 0, strict comple- mentary slackness holds. We now verify that(x∗, y∗, λ∗) satisfies the second-order sufficient conditions for optimality [5, Prop. 4.3.2]. Consider the Hessian of L with respect to (x, y): ∇2 x,yL = −(1 + y2) (cos(x) + λ sin(x)) −2y sin(x) + 2λy cos(x) −2y sin(x) + 2λy c...

work page

[71] [71]

To construct the penalized formulation of Eq

and use a differentiable surrogate to update the model parameters, while still using the true non-differentiable constraint to update the multipliers.9 As a surrogate, we use the hinge loss: P (ˆy = 0) = E(x,y)∼D 1 − σ(w⊤x + b) ≥ 0.7, (24) which represents the expected proportion of inputs predicted as class 0. To construct the penalized formulation of Eq...

work page

[72] [72]

Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs

The constrained approach achieves the target class prediction rate across a wide range of dual step-sizes. Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs. Dual Step-size Class 0 Percentage (%) Accuracy (%) 0 49 .75 99 .75 1.00 × 10−4 49.75 99 .75 2.15 × 10−4 49.75 99 .75 4.60 × ...

work page