Black-Box Followers, White-Box Leaders: Partial Zeroth-Order Methods for MPECs
Pith reviewed 2026-05-20 15:46 UTC · model grok-4.3
The pith
PZOS algorithm combines exact leader gradients with zeroth-order follower estimates for lower variance in MPECs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that by deploying zeroth-order tools only on the followers' unknown response mapping and using exact partial gradients for the leader's cost in a chain-rule fashion, the PZOS algorithm achieves a strictly lower variance bound than black-box baselines and converges to both standard and partial Goldstein stationary points in MPECs.
What carries the argument
The PZOS algorithm that applies a chain-rule-inspired update combining exact partial gradients of the leader's cost with zeroth-order Jacobian estimates of the followers' response.
If this is right
- Convergence to partial Goldstein stationary points tailored to the composite structure.
- Improved convergence speed and lower estimator variance in numerical tests on toll optimization and security games.
- Robust performance even when limited to few queries per iteration.
- Better objective values compared to black-box methods.
Where Pith is reading between the lines
- The selective use of information could extend to other optimization problems with partial model knowledge.
- Defining a partial subdifferential may help in analyzing non-smooth bilevel problems more precisely.
- Further analysis could explore how the variance reduction affects scalability to larger problems.
Load-bearing premise
The followers' response mapping is queryable at chosen points and the composite objective admits a well-defined partial Goldstein subdifferential.
What would settle it
A calculation showing that the variance bound for PZOS is not strictly lower than the black-box method, or an experiment where the algorithm fails to converge to the defined stationary points under the stated conditions.
Figures
read the original abstract
We study mathematical programs with equilibrium constraints, in which a leader knows their own cost function, but lacks a model of the followers' response. Instead, the leader can only query this response at specific points. While this setting precludes the use of gradient-based methods, existing zeroth-order approaches treat the composed objective entirely as a black box, deploying zeroth-order tools across both the leader and follower. Such approaches are inefficient, as they discard information the leader already possesses about their own cost function. In this work we instead propose to deploy zeroth-order tools only where they are truly needed: to handle the unknown, non-smooth followers' response. Specifically, we first propose PZOS, an algorithm that combines exact partial gradients of the leader's cost with zeroth-order Jacobian estimates of the followers' response in a chain-rule-inspired manner, and establish that it achieves a strictly lower variance bound than the black-box baseline. Second, we introduce the partial Goldstein subdifferential, a stationarity notion tailored to this composite structure, and prove convergence of our algorithm to both standard and partial Goldstein stationary points. Finally, we validate our method on two application domains -- toll optimization in routing games and defense-attack investment in security games -- demonstrating consistent improvements over black-box baselines in convergence speed, objective value, and estimator variance, with robust performance even under few queries per iteration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PZOS, a partial zeroth-order algorithm for mathematical programs with equilibrium constraints (MPECs) in which the leader has exact knowledge of their own cost but only query access to the followers' response. PZOS combines exact partial gradients of the leader objective with zeroth-order Jacobian estimates of the follower mapping via a chain-rule construction, claims a strictly lower variance bound than fully black-box zeroth-order baselines, introduces the partial Goldstein subdifferential as a tailored stationarity notion, proves convergence to both standard and partial Goldstein points, and reports empirical gains on toll optimization in routing games and defense-attack investment in security games.
Significance. If the variance reduction and convergence results hold under the stated assumptions, the work offers a principled way to exploit partial white-box structure in composite zeroth-order problems, which could improve sample efficiency in bilevel and equilibrium-constrained optimization. The empirical validation on two distinct game-theoretic domains provides concrete evidence of faster convergence and lower estimator variance. The partial Goldstein subdifferential is a novel technical device whose utility, if rigorously justified, would be of interest to the mathematical programming community.
major comments (2)
- [§4, variance theorem] §4 (Variance Analysis) and the statement of the main variance theorem: the claim that PZOS achieves a strictly lower variance bound than the black-box baseline rests on a chain-rule decomposition that combines exact leader partials with the ZO Jacobian estimate. When the follower response is set-valued or non-differentiable, the partial Goldstein subdifferential may not admit the same one-sided directional properties required for the bound to remain strict; the black-box and partial methods could then target inequivalent stationarity notions, undermining the strict comparison. Please supply the precise statement of the variance lemma and the additional assumptions (if any) that guarantee the inequality holds for non-smooth responses.
- [Definition 3.2 and convergence theorem] Definition 3.2 (partial Goldstein subdifferential) and the convergence theorem that follows: the stationarity notion is introduced specifically for the composite structure, yet the proof that PZOS converges to both standard Goldstein and partial Goldstein points does not clarify whether the partial notion is strictly weaker or whether the algorithm can converge to a partial point that fails to be a standard stationary point. A concrete counter-example or equivalence result under mild conditions on the response mapping would strengthen the claim.
minor comments (2)
- [Experimental section] The abstract states that the method is validated 'with robust performance even under few queries per iteration,' but the experimental section does not report the exact number of queries per iteration or the corresponding variance reduction factor; adding a table with these quantities would improve reproducibility.
- [Notation section] Notation for the leader's partial gradient and the estimated Jacobian should be introduced once and used consistently; several places appear to reuse symbols without redefinition.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below, providing clarifications based on the paper's assumptions and indicating revisions where appropriate to improve clarity.
read point-by-point responses
-
Referee: [§4, variance theorem] §4 (Variance Analysis) and the statement of the main variance theorem: the claim that PZOS achieves a strictly lower variance bound than the black-box baseline rests on a chain-rule decomposition that combines exact leader partials with the ZO Jacobian estimate. When the follower response is set-valued or non-differentiable, the partial Goldstein subdifferential may not admit the same one-sided directional properties required for the bound to remain strict; the black-box and partial methods could then target inequivalent stationarity notions, undermining the strict comparison. Please supply the precise statement of the variance lemma and the additional assumptions (if any) that guarantee the inequality holds for non-smooth responses.
Authors: The variance bound in Theorem 4.1 is established under the manuscript's core assumptions, including Lipschitz continuity of the follower response mapping (Assumption 3.1) and the use of measurable selections for set-valued responses. The partial Goldstein subdifferential is defined via a chain-rule construction in the Clarke sense, which preserves the necessary directional properties for the variance comparison even in the non-smooth case. Because the leader partial gradient is computed exactly, its variance contribution is identically zero, yielding a strictly lower bound than the full black-box estimator regardless of non-smoothness in the follower component. We will add the explicit statement of the supporting variance lemma (currently Lemma 4.2) together with a dedicated paragraph listing all assumptions required for the strict inequality in the revised manuscript. revision: yes
-
Referee: [Definition 3.2 and convergence theorem] Definition 3.2 (partial Goldstein subdifferential) and the convergence theorem that follows: the stationarity notion is introduced specifically for the composite structure, yet the proof that PZOS converges to both standard Goldstein and partial Goldstein points does not clarify whether the partial notion is strictly weaker or whether the algorithm can converge to a partial point that fails to be a standard stationary point. A concrete counter-example or equivalence result under mild conditions on the response mapping would strengthen the claim.
Authors: Under the Lipschitz continuity assumption on the follower mapping (Assumption 3.1), the partial Goldstein subdifferential coincides with the standard Goldstein subdifferential of the composed objective. This follows from the chain rule holding in the Clarke subdifferential for Lipschitz functions, so the two stationarity notions are equivalent in the setting of the paper. Consequently, any point satisfying the partial condition also satisfies the standard condition, and the convergence theorem already guarantees both. We will insert a short proposition (or remark) in Section 3 establishing this equivalence under the stated assumptions, thereby clarifying that the partial notion is not strictly weaker here. revision: yes
Circularity Check
No circularity: variance bound and convergence derived from problem structure without reduction to self-defined inputs or self-citations.
full rationale
The paper's central claims rest on a mathematical derivation of a variance bound for PZOS that combines exact leader partials with ZO estimates of the follower Jacobian, plus convergence to partial Goldstein stationary points. These steps are presented as direct consequences of the composite objective and the stationarity definition introduced in the paper, without any quoted reduction to a fitted parameter, renamed empirical pattern, or load-bearing self-citation whose validity depends on the current result. The partial Goldstein subdifferential is defined explicitly to handle the non-smooth follower response, and the variance comparison is stated as strictly lower than the black-box baseline by construction of the update rule. No equations or sections in the provided text exhibit a self-definitional loop or a prediction that is statistically forced by prior fitting within the same work. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existence of a well-defined partial Goldstein subdifferential for the composite objective
invented entities (1)
-
partial Goldstein subdifferential
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Optnet: Differentiable optimization as a layer in neural networks
Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. InInternational conference on machine learning, pages 136–145. PMLR, 2017
work page 2017
-
[2]
Set-valued maps and viability theory
Jean-Pierre Aubin and Arrigo Cellina.Differential inclusions. Set-valued maps and viability theory. Springer, 1984
work page 1984
-
[3]
Hande Y Benson, Arun Sen, David F Shanno, and Robert J Vanderbei. Interior-point algorithms, penalty methods and equilibrium problems.Computational Optimization and Applications, 34 (2):155–182, 2006
work page 2006
-
[4]
Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization
Lesi Chen, Jing Xu, and Luo Luo. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization. InInternational Conference on Machine Learning, pages 5219–5233. PMLR, 2023
work page 2023
- [5]
-
[6]
Shisheng Cui and Uday V Shanbhag. On the computation of equilibria in monotone and potential stochastic hierarchical games.Mathematical Programming, 198(2):1227–1285, 2023
work page 2023
-
[7]
Shisheng Cui, Uday V Shanbhag, and Farzad Yousefian. Complexity guarantees for an implicit smoothing-enabled method for stochastic MPECs.Mathematical Programming, 198(2):1153– 1225, 2023
work page 2023
-
[8]
Victor DeMiguel, Michael P Friedlander, Francisco J Nogales, and Stefan Scholtes. A two-sided relaxation scheme for mathematical programs with equilibrium constraints.SIAM Journal on Optimization, 16(2):587–609, 2005
work page 2005
-
[9]
Edith Elkind, Abheek Ghosh, and Paul W Goldberg. Continuous-time best-response and related dynamics in tullock contests with convex costs.arXiv preprint arXiv:2402.08541, 2024
-
[10]
Flaxman, Adam Tauman Kalai, and H
Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. InProceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, page 385–394, USA, 2005. Society for Industrial and Applied Mathematics. ISBN 0898715857
work page 2005
-
[11]
Roger Fletcher, Sven Leyffer, Danny Ralph, and Stefan Scholtes. Local convergence of SQP methods for mathematical programs with equilibrium constraints.SIAM Journal on Optimization, 17(1):259–286, 2006
work page 2006
-
[12]
Big hype: Best intervention in games via distributed hypergradient descent
Panagiotis D Grontas, Giuseppe Belgioioso, Carlo Cenedese, Marta Fochesato, John Lygeros, and Florian Dörfler. Big hype: Best intervention in games via distributed hypergradient descent. IEEE Transactions on Automatic Control, 69(12):8338–8353, 2024
work page 2024
-
[13]
Kjell Hausken and Vicki M Bier. Defending against multiple different attackers.European Journal of Operational Research, 211(2):370–384, 2011
work page 2011
-
[14]
Tim Hoheisel, Christian Kanzow, and Alexandra Schwartz. Theoretical and numerical com- parison of relaxation methods for mathematical programs with complementarity constraints. Mathematical Programming, 137(1):257–288, 2013
work page 2013
-
[15]
David Iliaev, Sigal Oren, and Ella Segev. A tullock-contest-based approach for cyber security investments.Annals of Operations Research, 320(1):61–84, 2023. 10
work page 2023
-
[16]
Oracle complexity in nonsmooth nonconvex optimization
Guy Kornowski and Ohad Shamir. Oracle complexity in nonsmooth nonconvex optimization. Journal of Machine Learning Research, 23(314):1–44, 2022
work page 2022
-
[17]
Guy Kornowski and Ohad Shamir. An algorithm with optimal dimension-dependence for zero- order nonsmooth nonconvex stochastic optimization.Journal of Machine Learning Research, 25(122):1–14, 2024
work page 2024
-
[18]
Tianyi Lin, Zeyu Zheng, and Michael Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization.Advances in Neural Information Processing Systems, 35:26160–26175, 2022
work page 2022
-
[19]
P.O. Lindberg and Leonid Engelson. Tolled multi-class traffic equilibria and toll sensitivities. EURO Journal on Transportation and Logistics, 4(2):197–222, 2015. ISSN 2192-4376
work page 2015
-
[20]
Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization
Zhuanghua Liu, Cheng Chen, Luo Luo, and Bryan Kian Hsiang Low. Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 2...
work page 2024
-
[21]
Random gradient-free minimization of convex functions
Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017
work page 2017
-
[22]
Liqun Qi and Zengxin Wei. On the constant positive linear dependence condition and its application to SQP methods.SIAM Journal on Optimization, 10(4):963–981, 2000
work page 2000
-
[23]
Arvind U Raghunathan and Lorenz T Biegler. An interior point method for mathematical programs with complementarity constraints (mpccs).SIAM Journal on Optimization, 15(3): 720–750, 2005
work page 2005
-
[24]
Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback.Journal of Machine Learning Research, 18(52):1–11, 2017
work page 2017
-
[25]
Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for Hessian-free bilevel algorithms.Advances in Neural Information Processing Systems, 35:4136–4149, 2022
work page 2022
-
[26]
Cambridge University Press, 1996
Rangarajan K Sundaram.A first course in optimization theory. Cambridge University Press, 1996
work page 1996
-
[27]
Haochen Tao, Shisheng Cui, Zhuo Li, and Jian Sun. A zeroth-order stochastic implicit method for bilevel-structured actor-critic schemes.Science China Information Sciences, 68(5):150204, 2025
work page 2025
-
[28]
Gordon Tullock. Efficient rent seeking. In James M. Buchanan, Robert D. Tollison, and Gordon Tullock, editors,Toward a Theory of the Rent-Seeking Society, pages 97–112. Texas A&M University Press, College Station, TX, 1980
work page 1980
-
[29]
Cambridge University Press, 2nd edition, 2018
Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2nd edition, 2018
work page 2018
-
[30]
Hai Yang and Hai-Jun Huang. The multi-class, multi-criteria traffic network equilibrium and systems optimum problem.Transportation Research Part B: Methodological, 38(1):1–15, 2004
work page 2004
-
[31]
Farzad Yousefian, Angelia Nedi´c, and Uday V Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences.Automatica, 48(1):56–67, 2012
work page 2012
-
[32]
Complexity of finding stationary points of nonconvex nonsmooth functions
Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR, 2020. 11 Technical appendices and supplementary material A Supplementary algorithms and graphs Algorithm 2Zero-Order Smoothing (ZOS...
work page 2020
-
[33]
= 3 4. Goldstein subdifferential.For δ >0 , F is differentiable on B(0, δ)\ {0} with F ′(z) = 2z−1 for z >0 and F ′(z) = 2z−3 for z <0 . Collecting Clarke gradients over B(0, δ), the Goldstein subdifferential atx 0 = 0equals ∂G δ F(0) = conv ∪z∈B(x,δ) ∂F(z) = [−2δ−3,2δ−1]. This interval contains zero if and only ifδ≥ 1 2: -δ < 1 2: all elements are negati...
-
[34]
= 0, so0∈∂ G δ F(0). Hence x= 0 is not a (δ, ε)-Goldstein stationary point for δ∈(0,1) and ε <1 , nor for δ∈[1, 3 2) andε <3−2δ. Partial Goldstein subdifferential.The Jacobian ofy ∗ at differentiable pointsz /∈ {0,1}is J y∗(z) = (−1,−1) ⊤, z <0, (1,−1) ⊤,0< z <1, (1,1) ⊤, z >1, with Clarke Jacobians ∂y ∗(0) = conv{(−1,−1) ⊤,(1,−1) ⊤} and ∂y ∗(1) = c...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.