Position: Adopt Constraints Over Fixed Penalties in Deep Learning
Pith reviewed 2026-05-19 13:19 UTC · model grok-4.3
The pith
Deep learning problems with non-negotiable requirements should start from the constrained formulation rather than fixed-penalty surrogates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. This is because the penalized and constrained problems are generally not equivalent in non-convex settings, fixed penalties weaken hard requirements into weighted trade-offs, and choosing coefficients often involves costly trial and error that risks solving the wrong objective altogether. Solution strategies should then be chosen based on the problem's structure and scale.
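The non-equivalence claim can be made concrete with a small sketch. The one-dimensional toy problem below (minimize cos(x) subject to sin(x) <= EPS on [0, pi/2]) is an illustrative assumption, not an example taken from the reviewed page, and `EPS`, `GRID`, and the function names are hypothetical. On this interval the penalized objective cos(x) + c·sin(x) is concave for any c >= 0, so its minimizer sits at an endpoint and should never coincide with the constrained solution arcsin(EPS), whatever coefficient is chosen.

```python
import math

EPS = 0.3  # hypothetical constraint level (assumption, not from the paper)
GRID = [i * (math.pi / 2) / 10_000 for i in range(10_001)]  # x in [0, pi/2]


def f(x):
    """Toy task loss."""
    return math.cos(x)


def g(x):
    """Toy constraint function; the requirement is g(x) <= EPS."""
    return math.sin(x)


def penalized_argmin(c):
    """Grid minimizer of the fixed-penalty objective f(x) + c * g(x)."""
    return min(GRID, key=lambda x: f(x) + c * g(x))


def constrained_argmin():
    """Grid minimizer of f over the feasible set {x : g(x) <= EPS}."""
    return min((x for x in GRID if g(x) <= EPS), key=f)


# Sweep several fixed penalty coefficients and solve the constrained problem.
penalized = {c: penalized_argmin(c) for c in (0.1, 0.5, 2.0, 10.0, 100.0)}
x_con = constrained_argmin()

for c, x in penalized.items():
    print(f"c={c:>6}: x={x:.4f}, g(x)={g(x):.4f}, f(x)={f(x):.4f}")
print(f"constrained: x={x_con:.4f}, g(x)={g(x_con):.4f}, f(x)={f(x_con):.4f}")
```

In this sketch, small coefficients land on x = pi/2 (infeasible) and large ones on x = 0 (feasible but with a worse loss than the constrained solution), which is the kind of failure the core claim describes.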
What carries the argument
The direct constrained optimization formulation versus the fixed-penalty scalarized objective as alternative starting points for optimization.
If this is right
- Optimization would focus on methods that preserve hard constraints rather than trading them off via weights.
- Non-negotiable requirements would stay binding throughout training instead of becoming adjustable penalties.
- Hyper-parameter search would target solver choice and structure rather than penalty magnitudes.
- Applications needing strict compliance could measure success directly by constraint satisfaction rates.
Where Pith is reading between the lines
- Development of new scalable constrained solvers tailored to neural network training would become a priority.
- Benchmarks could shift emphasis from penalized loss values to explicit rates of constraint violation.
- Connections could emerge to projection-based or differentiable optimization techniques already used in other domains.
Load-bearing premise
That practitioners can reliably select and apply appropriate solution strategies for the constrained formulation at deep learning scale based on problem structure.
What would settle it
A concrete demonstration of a non-convex deep network training task with hard constraints where fixed penalization achieves consistent satisfaction without coefficient retuning or equivalence failures.
Original abstract
Recent efforts to develop trustworthy AI systems have increased interest in learning problems with explicit requirements, or constraints. In deep learning, however, such problems are often handled through fixed weighted-sum penalization: the constraints are added to the task loss with fixed coefficients, and the resulting scalarized objective is minimized. This position paper argues that fixed penalization is often ill-suited for deep learning problems with non-negotiable requirements for several reasons. First, in non-convex settings, the penalized and constrained problems are generally not equivalent, so solving the former need not solve the latter. Second, fixed penalization weakens hard requirements into soft penalties to be traded off against task performance. Third, choosing penalty coefficients to indirectly solve the constrained problem often involves costly trial and error, because changing them alters the penalized objective itself, and hence can mean solving the wrong problem altogether. We therefore argue that, when a deep learning problem specifies non-negotiable requirements, the constrained formulation itself should be the starting point, not the surrogate problem defined by fixed penalization. The appropriate solution strategy should then be chosen based on the problem's structure and scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a position piece arguing that deep learning problems with non-negotiable constraints should be formulated and solved as constrained optimization problems rather than via fixed weighted-sum penalization. It gives three reasons: (1) in non-convex regimes the penalized surrogate is generally not equivalent to the original constrained problem, (2) fixed penalties convert hard requirements into soft trade-offs, and (3) tuning the penalty coefficients changes the objective itself and therefore often amounts to solving the wrong problem. The manuscript concludes that the constrained formulation should be the starting point, after which an appropriate solver is selected according to problem structure and scale.
Significance. If the core distinctions hold, the position could usefully redirect attention in trustworthy-AI research toward constrained formulations that preserve hard requirements instead of relying on surrogate penalties whose behavior is harder to control. The arguments rest on standard facts about non-convex optimization and penalty methods; the paper does not claim new theorems or empirical results.
major comments (1)
- [Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address the single major comment below and will revise the manuscript to improve its actionability while preserving the core position.
- Referee: [Abstract] Abstract, final sentence: the recommendation that 'the appropriate solution strategy should then be chosen based on the problem's structure and scale' presupposes the ready availability of practical, scalable constrained solvers for high-dimensional non-convex deep-learning problems. The manuscript supplies neither citations to existing methods (e.g., augmented-Lagrangian or constrained-SGD variants) nor any discussion of their behavior at DL scale. This assumption is load-bearing for the actionability of the central claim.
Authors: We agree that the manuscript would be strengthened by explicit references to existing constrained optimization techniques and a brief discussion of their practical status at deep-learning scale. Although the paper is a position piece whose primary contribution is to argue that the constrained formulation should be the starting point (rather than a fixed-penalty surrogate), we accept that readers need some indication of how one would then proceed. In the revision we will add a short paragraph citing representative work on augmented-Lagrangian methods, constrained SGD variants, and related approaches for non-convex problems, together with a concise note on their current scalability characteristics and open challenges. This addition supports the final sentence of the abstract without altering the central claim or introducing new empirical results.
Revision: yes
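To give a sense of what one of the method families named in this exchange looks like, here is a minimal sketch of an augmented-Lagrangian (method of multipliers) loop. It is not the authors' method or any library's API. The toy problem (minimize cos(x) subject to sin(x) <= EPS on [0, pi/2]), the values of `EPS` and `RHO`, the starting point, and the step sizes are all assumptions chosen for illustration, and the loop only finds a local solution near its starting point.

```python
import math

EPS = 0.3   # hypothetical constraint level
RHO = 10.0  # hypothetical penalty parameter of the augmented Lagrangian


def grad_x(x, lam):
    """x-derivative of the augmented Lagrangian for f = cos, g = sin, g(x) <= EPS."""
    # Effective multiplier for an inequality constraint (clipped at zero).
    mu = max(0.0, lam + RHO * (math.sin(x) - EPS))
    return -math.sin(x) + mu * math.cos(x)


def method_of_multipliers(x=0.5, lam=0.0, outer=30, inner=300, lr=0.05):
    """Alternate an approximate primal minimization with a multiplier update."""
    for _ in range(outer):
        # Inner loop: projected gradient descent on x with lam held fixed.
        for _ in range(inner):
            x = min(max(x - lr * grad_x(x, lam), 0.0), math.pi / 2)
        # Outer step: update the multiplier using the constraint violation.
        lam = max(0.0, lam + RHO * (math.sin(x) - EPS))
    return x, lam


x_al, lam_al = method_of_multipliers()
print(f"x = {x_al:.6f}, g(x) = {math.sin(x_al):.6f}, lambda = {lam_al:.6f}")
```

The multiplier is adapted during the run rather than fixed in advance, which is the key difference from a fixed-penalty objective. Whether such loops behave well at deep-learning scale is exactly the open question the referee raises, and this sketch does not address it.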
Circularity Check
No circularity in the position paper's argument chain.
The paper is a position statement that rests on standard, externally established facts from optimization theory: non-equivalence of penalized and constrained problems in non-convex regimes, the softening effect of fixed penalties, and the trial-and-error cost of coefficient selection. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the central recommendation. The final sentence simply states that an appropriate solver should be chosen according to problem structure; this is presented as a logical consequence rather than a reduction to any input or prior self-referential result. The argument is therefore self-contained and draws on independent mathematical properties.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: In non-convex settings, the penalized and constrained problems are generally not equivalent.
- domain assumption: Fixed penalization weakens hard requirements into soft penalties to be traded off against task performance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
fixed weighted-sum penalization... the penalized and constrained problems are generally not equivalent
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Lagrangian approach... optimizes its coefficients—the multipliers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Abadi et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016. (Cit. on p. 2, 7)
- [2]
- [3] M. R. Bachute and J. M. Subhedar. Autonomous Driving Architectures. Machine Learning with Applications, 2021. (Cit. on p. 1)
- [4]
- [5]
- [6] D. P. Bertsekas. On the Goldstein-Levitin-Polyak Gradient Projection Method. IEEE Transactions on automatic control, 1976. (Cit. on p. 9)
- [7] S. Bhatore et al. Machine learning techniques for credit risk evaluation. Journal of Banking and Financial Technology, 2020. (Cit. on p. 1)
- [8] E. G. Birgin and J. M. Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. The SIAM series on Fundamentals of Algorithms, 2014. (Cit. on p. 3)
- [9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. (Cit. on p. 3, 9)
- [10] J. Bradbury et al. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018. (Cit. on p. 7)
- [11] P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin. Differentiable Causal Discovery from Interventional Data. In NeurIPS, 2020. (Cit. on p. 1)
- [12] D. Chakraborty, Y. LeCun, T. G. Rudner, and E. Learned-Miller. Improving Pre-trained Self-Supervised Embeddings Through Effective Entropy Maximization. In AISTATS, 2025. (Cit. on p. 1)
- [13] L. Chamon and A. Ribeiro. Probably Approximately Correct Constrained Learning. In NeurIPS,
- [14] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In NeurIPS, 2016. (Cit. on p. 1)
- [15]
- [16]
- [17] A. Cotter et al. TensorFlow Constrained Optimization (TFCO). https://github.com/google-research/tensorflow_constrained_optimization, 2019. (Cit. on p. 2, 7)
- [18] G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rabbat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking Neural Network Training Algorithms. arXiv preprin...
- [19] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In ICLR, 2024. (Cit. on p. 1, 2, 3, 7)
- [20] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014. (Cit. on p. 9)
- [21] J. Degrave and I. Korshunova. Why machine learning algorithms are hard to tune and how to fix it. Engraved, blog: www.engraved.blog/why-machine-learning-algorithms-are-hard-to-tune/,
- [22] (Cit. on p. 2, 4, 5, 18, 22)
- [23] J. Degrave and I. Korshunova. How we can make machine learning algorithms tunable. Engraved, blog: www.engraved.blog/how-we-can-make-machine-learning-algorithms-tunable/,
- [24] (Cit. on p. 2, 5, 18, 22)
- [25] I. I. Dikin. Iterative Solution of Problems of Linear and Quadratic Programming. In Doklady Akademii Nauk. Russian Academy of Sciences, 1967. (Cit. on p. 9)
- [26] M.-A. Dilhac et al. Montréal Declaration for a Responsible Development of Artificial Intelligence, 2018. (Cit. on p. 1)
- [27]
- [28] M. Ehrgott. Multicriteria Optimization. Springer Science & Business Media, 2005. (Cit. on p. 1, 2, 4, 5)
- [29] J. Elenter, N. NaderiAlizadeh, and A. Ribeiro. A Lagrangian Duality Approach to Active Learning. In NeurIPS, 2022. (Cit. on p. 2, 7)
- [30] European Parliament. Artificial Intelligence Act. https://artificialintelligenceact.eu, 2024. (Cit. on p. 1)
- [31] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 1956. (Cit. on p. 9)
- [32] J. Gallego-Posada. Constrained Optimization for Machine Learning: Algorithms and Applications. PhD Thesis, University of Montreal, 2024. (Cit. on p. 2, 4, 6)
- [33] J. Gallego-Posada, J. Ramirez, A. Erraqabi, Y. Bengio, and S. Lacoste-Julien. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. In NeurIPS, 2022. (Cit. on p. 2, 3, 4, 6, 7, 11, 19)
- [34] J. Gallego-Posada, J. Ramirez, M. Hashemizadeh, and S. Lacoste-Julien. Cooper: A Library for Constrained Optimization in Deep Learning. arXiv preprint arXiv:2504.01212, 2025. (Cit. on p. 2, 7, 18)
- [35] J. Gauvin. A Necessary and Sufficient Regularity Condition to Have Bounded Multipliers in Nonconvex Programming. Mathematical Programming, 1977. (Cit. on p. 9)
[36]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. (Cit. on p. 1) 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
A. A. Goldstein. Convex Programming in Hilbert Space. University of Washington, 1964. (Cit. on p. 9)
work page 1964
-
[38]
R. M. Gower, M. Schmidt, F. Bach, and P. Richtárik. Variance-Reduced Methods for Machine Learning. Proceedings of the IEEE, 2020. (Cit. on p. 9)
work page 2020
-
[39]
S.-P. Han. A globally convergent method for nonlinear programming. Journal of optimization theory and applications, 1977. (Cit. on p. 9)
work page 1977
-
[40]
M. Hashemizadeh, J. Ramirez, R. Sukumaran, G. Farnadi, S. Lacoste-Julien, and J. Gallego- Posada. Balancing Act: Constraining Disparate Impact in Sparse Models. In ICLR, 2024. (Cit. on p. 2, 7, 9, 10)
work page 2024
-
[41]
I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerch- ner. beta-V AE: Learning Basic Visual Concepts with a Constrained Variational Framework . In ICLR, 2017. (Cit. on p. 1)
work page 2017
- [42]
- [43]
- [44]
-
[45]
D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015. (Cit. on p. 1, 4, 10, 19)
work page 2015
- [46]
-
[47]
J. Larson et al. Data and analysis for “Machine Bias”. https://github.com/propublica/ compas-analysis, 2016. (Cit. on p. 1)
work page 2016
-
[48]
E. S. Levitin and B. T. Polyak. Constrained Minimization Methods. USSR Computational mathematics and mathematical physics, 1966. (Cit. on p. 9)
work page 1966
-
[49]
T. Lin, C. Jin, and M. I. Jordan. Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization. JMLR, 2025. (Cit. on p. 7)
work page 2025
-
[50]
C. Louizos, M. Welling, and D. P. Kingma. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 2018. (Cit. on p. 1, 19)
work page 2018
-
[51]
O. L. Mangasarian and S. Fromovitz. The Fritz John Necessary Optimality Conditions in the Presence of Equality and Inequality Constraints. Journal of Mathematical Analysis and Applications, 1967. (Cit. on p. 3, 9)
work page 1967
- [52]
- [53] H. Narasimhan, A. Cotter, Y. Zhou, S. Wang, and W. Guo. Approximate Heavily-Constrained Learning with Lagrange Multiplier Models. In NeurIPS, 2020. (Cit. on p. 2, 7, 9)
- [54] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006. (Cit. on p. 9, 22)
- [55] OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023. (Cit. on p. 1)
- [56] A. Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019. (Cit. on p. 2, 7, 18)
- [57] J. C. Platt and A. H. Barr. Constrained Differential Optimization. In NeurIPS, 1987. (Cit. on p. 4, 10, 22)
- [58] M. J. Powell. The convergence of variable metric methods for nonlinearly constrained optimization calculations. In Nonlinear programming 3. Elsevier, 1978. (Cit. on p. 9)
- [59] J. Ramirez, I. Hounie, J. Elenter, J. Gallego-Posada, M. Hashemizadeh, A. Ribeiro, and S. Lacoste-Julien. Feasible Learning. In AISTATS, 2025. (Cit. on p. 3, 7)
- [60]
- [61]
- [62] M. Sohrabi, J. Ramirez, T. H. Zhang, S. Lacoste-Julien, and J. Gallego-Posada. On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization. In ICML, 2024. (Cit. on p. 7, 8, 10, 18, 22)
- [63]
- [64] R. B. Wilson. A simplicial algorithm for concave programming. PhD Thesis, Graduate School of Business Administration, 1963. (Cit. on p. 9)
- [65]
- [66] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV, 2017. (Cit. on p. 1)
- [67] G. Zoutendijk. Methods of Feasible Directions: A Study in Linear and Non-linear Programming. Elsevier Publishing Company, 1960. (Cit. on p. 9)
- [68] Primal feasibility. We have $g(x^*, y^*) = \sin(\arcsin(\epsilon)) = \epsilon \le \epsilon$.
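- [69] Stationarity. The gradients of $L$ with respect to the primal variables vanish at $(x^*, y^*, \lambda^*)$:
$$\frac{\partial L}{\partial x} = \Big[-(1 + y^2)\sin(x) + \lambda (1 + y^2)\cos(x)\Big]_{(x^*, y^*, \lambda^*)} = -\epsilon + \lambda^* \sqrt{1 - \epsilon^2} = 0, \quad (13)$$
$$\frac{\partial L}{\partial y} = \Big[2y\cos(x) + 2\lambda y \sin(x)\Big]_{(x^*, y^*, \lambda^*)} = 0. \quad (14)$$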
- [70] Complementary slackness. We have $\lambda^* \left[g(x^*, y^*) - \epsilon\right] = 0$. Since $\lambda^* > 0$, strict complementary slackness holds. We now verify that $(x^*, y^*, \lambda^*)$ satisfies the second-order sufficient conditions for optimality [5, Prop. 4.3.2]. Consider the Hessian of $L$ with respect to $(x, y)$: $\nabla^2_{x,y} L$ has first row $\big(-(1 + y^2)(\cos(x) + \lambda \sin(x)),\; -2y\sin(x) + 2\lambda y \cos(x)\big)$ and second row beginning $-2y\sin(x) + 2\lambda y\, c$...
- [71] and use a differentiable surrogate to update the model parameters, while still using the true non-differentiable constraint to update the multipliers. As a surrogate, we use the hinge loss:
$$P(\hat{y} = 0) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[1 - \sigma(w^\top x + b)\right] \ge 0.7, \quad (24)$$
which represents the expected proportion of inputs predicted as class 0. To construct the penalized formulation of Eq...
- [72] The constrained approach achieves the target class prediction rate across a wide range of dual step-sizes. Training accuracy stabilizes at 80% once the rate constraint is met, demonstrating robust feasibility-performance trade-offs.
Dual step-size | Class 0 percentage (%) | Accuracy (%)
0 | 49.75 | 99.75
1.00 × 10^-4 | 49.75 | 99.75
2.15 × 10^-4 | 49.75 | 99.75
4.60 × ...
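The primal feasibility, stationarity, and complementary slackness conditions quoted in the appendix fragments above can be checked numerically. The sketch below assumes the Lagrangian has the form L = (1 + y^2) cos(x) + λ[(1 + y^2) sin(x) − ε], which is inferred from the partial derivatives in Eqs. (13) and (14) and is not stated on the page. It also assumes y* = 0, x* = arcsin(ε), and λ* = ε / sqrt(1 − ε^2), which is what those equations imply, and picks a hypothetical value for `EPS`.

```python
import math

EPS = 0.3  # hypothetical constraint level, any value in (0, 1)


def lagrangian(x, y, lam):
    """Assumed form, inferred from Eqs. (13)-(14); not given explicitly."""
    return (1 + y**2) * math.cos(x) + lam * ((1 + y**2) * math.sin(x) - EPS)


def grad_xy(x, y, lam):
    """Partial derivatives exactly as written in Eqs. (13) and (14)."""
    dx = -(1 + y**2) * math.sin(x) + lam * (1 + y**2) * math.cos(x)
    dy = 2 * y * math.cos(x) + 2 * lam * y * math.sin(x)
    return dx, dy


# Candidate KKT point (y* = 0 is an inferred assumption).
x_star = math.asin(EPS)
y_star = 0.0
lam_star = EPS / math.sqrt(1 - EPS**2)

g_star = (1 + y_star**2) * math.sin(x_star)   # constraint value
dx, dy = grad_xy(x_star, y_star, lam_star)    # stationarity residuals
slack = lam_star * (g_star - EPS)             # complementary slackness

# Central finite differences of the assumed Lagrangian as a cross-check.
h = 1e-6
fd_dx = (lagrangian(x_star + h, y_star, lam_star)
         - lagrangian(x_star - h, y_star, lam_star)) / (2 * h)
fd_dy = (lagrangian(x_star, y_star + h, lam_star)
         - lagrangian(x_star, y_star - h, lam_star)) / (2 * h)

print(f"g* - eps = {g_star - EPS:.2e}, dL/dx = {dx:.2e}, dL/dy = {dy:.2e}")
print(f"lambda* (g* - eps) = {slack:.2e}, FD check = ({fd_dx:.2e}, {fd_dy:.2e})")
```

All printed residuals should be zero up to floating-point and finite-difference error if the assumed form matches the paper's example.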