Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems

Alessandro Rudi; Axel Parmentier; Pierre-Cyril Aubin-Frankowski; Yohann De Castro

arxiv: 2407.17200 · v3 · pith:6TQE6SD7new · submitted 2024-07-24 · 📊 stat.ML · cs.LG· math.OC· stat.ME

Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems

Pierre-Cyril Aubin-Frankowski , Yohann De Castro , Axel Parmentier , Alessandro Rudi This is my paper

Pith reviewed 2026-05-23 22:42 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OCstat.ME

keywords generalization boundssurrogate policiescombinatorial optimizationperturbation biasfan-crossing probabilitystructured learningsmoothing parametersum-of-squares optimization

0 comments

The pith

Smoothed surrogate policies decompose excess risk into perturbation bias, statistical error, and optimization error for repeated combinatorial problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes generalization bounds for policies that combine a statistical model with a perturbed combinatorial oracle. It decomposes the excess risk into three terms: a perturbation bias governed by the new fan-crossing probability, a statistical estimation term of order 1 over lambda square root n, and an optimization term addressed via kernel sum-of-squares. The fan-crossing probability is bounded linearly in the smoothing parameter lambda under the Uniformly Bounded Density property and sub-linearly under the weaker Uniform Weak moment property. These conditions capture the interaction between the model distribution and the normal fan of the feasible polytope.

Core claim

The central claim is that smoothed policies admit a generalization bound separating excess risk into perturbation bias controlled by fan-crossing probability, statistical estimation error scaling as O(1/(λ√n)), and optimization error mitigated by kernel Sum-of-Squares. The fan-crossing probability is bounded by O(λ) under UBD or sub-linearly under UW, both reflecting geometric interaction between the statistical model and the normal fan of the feasible polytope.

What carries the argument

The fan-crossing probability, a geometric quantity measuring the likelihood that a random perturbation changes the oracle solution, which directly controls the perturbation bias term in the excess-risk decomposition.

If this is right

Choosing the smoothing parameter λ trades off bias against the statistical rate 1/(λ√n) to achieve overall consistency.
The Uniform Weak moment property allows sub-linear bias control even when densities are unbounded.
Kernel Sum-of-Squares methods provide a practical route to control the optimization error term in high dimensions.
The decomposition isolates the geometric contribution of the polytope's normal fan from purely statistical effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fan-crossing analysis may apply directly to other linear-oracle structured prediction settings such as ranking or matching.
Verification of the UBD or UW property on concrete polytopes like the assignment polytope would yield explicit constants for common problems.
The inverse scaling with λ suggests an optimal λ that shrinks with n, potentially yielding overall rates faster than standard unsmoothed methods.

Load-bearing premise

The statistical model and the normal fan of the feasible polytope satisfy either the Uniformly Bounded Density property or the Uniform Weak moment property.

What would settle it

A concrete counter-example polytope and model distribution where the fan-crossing probability grows faster than O(λ) while the Uniformly Bounded Density property holds would falsify the linear bias bound.

Figures

Figures reproduced from arXiv: 2407.17200 by Alessandro Rudi, Axel Parmentier, Pierre-Cyril Aubin-Frankowski, Yohann De Castro.

**Figure 2.** Figure 2: Normal cone at point y1 to the polytope (left) and normal fan with internal radius ρ at point θ (right). Example 2 (continued). In the context of stochastic vehicle scheduling, Y(x) corresponds to the partitions of the vertices V into paths. Solutions are encoded by indicator vectors y = (ya)a∈A where ya is a binary variable equal to 1 if arc a is in the solution, and to 0 otherwise. Its convex hull, C(x),… view at source ↗

read the original abstract

Many real-world decision problems require solving, again and again, combinatorial optimization instances drawn from a common distribution. A recent line of structured learning methods exploits this regularity by learning policies that pair a statistical model with a tractable combinatorial oracle, instead of solving each instance independently. Training such policies is notoriously difficult, however: the resulting empirical risk is piecewise constant in the model parameters, which hinders gradient-based optimization, and only a few theoretical guarantees have been provided so far. We address this issue by analyzing smoothed (perturbed) policies: adding controlled random perturbations to the direction used by the linear oracle yields a differentiable surrogate risk and improves generalization. Our main contribution is a generalization bound that decomposes the excess risk into $(\mathit{i})$ perturbation bias, $(\mathit{ii})$ statistical estimation error, and $(\mathit{iii})$ optimization error. The perturbation bias is controlled by the \emph{fan-crossing probability}, a new geometric quantity measuring the likelihood that a perturbation changes the oracle solution. We introduce two complementary conditions to bound it--the \emph{Uniformly Bounded Density} (UBD) property, yielding a sharp ${O}(\lambda)$ bias, and the weaker \emph{Uniform Weak moment} (UW) property, yielding a sub-linear bound--both capturing the geometric interaction between the statistical model and the normal fan of the feasible polytope. The statistical estimation error is controlled via a uniform deviation bound over the policy class, with rate ${O}(1/(\lambda\sqrt{n}))$ that scales inversely in the smoothing parameter. Concerning the optimization error, we exploit kernel Sum-of-Squares methods to mitigate the curse of dimensionality of global optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean decomposition of excess risk for smoothed surrogate policies using a new fan-crossing probability under UBD/UW geometric conditions.

read the letter

The main takeaway is a generalization bound that splits excess risk for these smoothed policies into perturbation bias (via the fan-crossing probability), statistical error, and optimization error. The new geometric quantity tracks how often a perturbation flips the oracle solution, and the UBD and UW conditions let them bound the bias by O(λ) or sublinear in the smoothing parameter λ. The statistical term scales as O(1/(λ sqrt n)), which makes the smoothing trade-off explicit. This decomposition and the two density conditions on the model-normal fan interaction look like the actual novelty; they are not just rephrasings of earlier uniform deviation results. The paper does a good job keeping the argument analytic and avoiding circularity in the bound definitions. The kernel SOS handling of optimization error is a reasonable way to address the non-convexity. Soft spots are limited. The UBD/UW assumptions are stated clearly as conditions on the geometry, but they will need verification case by case for common polytopes like matchings or TSP; they are not automatic. The optimization error section is sketched at a high level and may require more concrete rates or examples to be fully usable. No internal contradictions appear in the scaling or the decomposition. This work is aimed at people building end-to-end learned solvers that combine statistical models with combinatorial oracles. Readers working on theoretical analysis of structured prediction in operations research will find the new tools for controlling perturbation bias useful. It deserves a serious referee because the central claim is well-motivated and the rates follow from the stated assumptions without obvious gaps.

Referee Report

0 major / 3 minor

Summary. The paper develops generalization bounds for smoothed surrogate policies that pair a statistical model with a combinatorial optimization oracle. It decomposes excess risk into three terms: perturbation bias controlled by a new fan-crossing probability (bounded by O(λ) under the Uniformly Bounded Density assumption or sub-linearly under Uniform Weak moment), statistical estimation error of order O(1/(λ√n)), and optimization error handled via kernel Sum-of-Squares methods. The analysis relies on geometric properties of the normal fan of the feasible polytope interacting with the data distribution.

Significance. If the stated rates and decomposition hold under the given assumptions, the work supplies the first explicit, non-vacuous generalization guarantees for this class of structured learning methods, together with a practical smoothing mechanism that renders the surrogate differentiable. The fan-crossing probability is a novel geometric quantity that cleanly separates the bias induced by perturbation from statistical and optimization errors, and the kernel SOS treatment of the optimization term addresses a dimension-dependent difficulty that has limited prior analyses.

minor comments (3)

The abstract states the statistical rate as O(1/(λ√n)) but does not indicate whether the uniform deviation bound is taken with respect to the smoothed or unsmoothed policy class; the dependence on λ should be made explicit in the statement of the uniform convergence result.
Notation for the three error terms (perturbation bias, statistical estimation error, optimization error) is introduced in the abstract but should be given consistent symbols (e.g., B_λ, E_stat, E_opt) the first time they appear in the main text.
The manuscript would benefit from a short table or paragraph comparing the obtained rates with those of prior surrogate-based methods (e.g., the works cited in the introduction) under comparable assumptions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, recognition of the novelty of the fan-crossing probability, and recommendation for minor revision. The decomposition into perturbation bias, statistical error, and optimization error is indeed central to our contribution, and we are pleased that the geometric perspective on the normal fan is viewed as a useful addition to the literature on structured learning.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces the fan-crossing probability and the UBD/UW properties explicitly as assumptions on the interaction between the statistical model and the normal fan of the feasible polytope. The generalization bound is then derived analytically by decomposing excess risk into perturbation bias (O(λ) or sublinear under those assumptions), statistical error O(1/(λ√n)), and optimization error (via kernel SOS). No load-bearing step reduces by construction to a fitted parameter from the same data, a self-citation chain, or a renamed known result; all rates follow from the stated assumptions and standard uniform deviation arguments. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of the fan-crossing probability as a well-defined geometric measure and on the UBD and UW properties being sufficient to bound it; these are introduced in the paper rather than derived from prior literature.

axioms (1)

domain assumption The normal fan of the feasible polytope interacts with the statistical model in a way that admits the Uniformly Bounded Density or Uniform Weak moment property.
Invoked to control the fan-crossing probability; location: abstract paragraph on perturbation bias.

invented entities (1)

fan-crossing probability no independent evidence
purpose: Quantify the probability that a random perturbation changes the solution returned by the linear oracle.
New geometric quantity introduced to bound perturbation bias; independent_evidence false because no external verification mentioned.

pith-pipeline@v0.9.0 · 5854 in / 1441 out tokens · 15762 ms · 2026-05-23T22:42:20.709155+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1]

https://euro-neurips-vrp-2022.challenges.ortec.com/

EURO Meets NeurIPS 2022 Vehicle Routing Competition. https://euro-neurips-vrp-2022.challenges.ortec.com/ . Maria-Florina Balcan. Data-driven algorithm design. In Ti m Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms, chapter 28, pages 626–645. Cambridge Universi ty Press,

work page 2022
[2]

A co nsistent regularization approach for structured prediction

Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A co nsistent regularization approach for structured prediction. In Daniel D. Lee, Masashi Sugiyama, Ulrike von L uxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 20...

work page 2016
[3]

Learning with combinatorial optimization layers: a probabilistic approach

Guillaume Dalle, Léo Baty, Louis Bouvier, and Axel Parmenti er. Learning with combinatorial optimization layers: a probabilistic approach. ArXiv preprint, abs/2207.13513,

work page arXiv
[4]

Evarist Giné and Richard Nickl

ISSN 0025-1909. Evarist Giné and Richard Nickl. Mathematical foundations of inﬁnite-dimensional statistical models. Cambridge university press,

work page 1909
[5]

A pac approach to applicati on-speciﬁc algorithm selection

Rishi Gupta and Tim Roughgarden. A pac approach to applicati on-speciﬁc algorithm selection. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 123–134,

work page 2016
[6]

Bach, and Alessandro Rudi

Ulysse Marteau-Ferey, Francis R. Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Ma ria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decemb...

work page 2020
[7]

A General Theory for Structured Prediction with Smooth Convex Surrogates

Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. A gener al theory for structured prediction with smooth convex surrogates. ArXiv preprint, abs/1902.01958,

work page internal anchor Pith review Pith/arXiv arXiv 1902
[8]

Axel Parmentier

ISSN 1572-2740, 1572-2759. Axel Parmentier. Learning to Approximate Industrial Probl ems by Operations Research Classic Problems. Operations Research, 2021a. ISSN 0030-364X. Axel Parmentier. Learning structured approximations of op erations research problems. ArXiv preprint, abs/2107.04323, 2021b. Axel Parmentier and Vincent T’Kindt. Learning to solve the...

work page arXiv
[9]

Diﬀerentiation of blackbox combinatorial solvers

Marin Vlastelica Pogancic, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolinek. Diﬀerentiation of blackbox combinatorial solvers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

work page 2020
[10]

Integrated condi tional estimation-optimization

Meng Qi, Paul Grigas, and Zuo-Jun Max Shen. Integrated condi tional estimation-optimization. arXiv preprint arXiv:2110.12351,

work page arXiv
[11]

Furthermore, under the following Assumption (Orl.), ∃τ >0 s.t

{ PX (ρ(ψw (X))√ d(X) <λq ) + exp ( − 1 10λ2(1−q) )} . Furthermore, under the following Assumption (Orl.), ∃τ >0 s.t. EX [ exp ((ρ(ψw (X))√ d(X) )−τ)] < ∞, (Orl.) 1 it holds Vw (λ) = Oλ→0 ( exp(−λ− 2τ 2+τ) ) ; And under the following Assumption (Mom.), ∃τ >0 s.t. EX [(ρ(ψw (X))√ d(X) )−τ] < ∞, (Mom.) it holds Vw (λ) = Oλ→0 ( λτpolylog(λ) ) , where polylog...

work page 2000
[12]

and all q ∈(0, 1), V (λ) = EXPZ [ ∥Z∥2 ≥ρ(ψw (X)) λ ⏐ ⏐X ] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + EX [ exp ( −0. 1ρ2(ψw (X)) λ2 ) 1{ρ(ψw (X))≥λq √ d (X)} ] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + EX [ exp ( − d(X) 10λ2(1−q) )] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + exp ( − 1 10λ2(1−q) ) , the last inequality stemming from d(X) ≥1 PX -almost surely. Assume ( Orl.), by Markov’s ine...

work page 2012

[1] [1]

https://euro-neurips-vrp-2022.challenges.ortec.com/

EURO Meets NeurIPS 2022 Vehicle Routing Competition. https://euro-neurips-vrp-2022.challenges.ortec.com/ . Maria-Florina Balcan. Data-driven algorithm design. In Ti m Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms, chapter 28, pages 626–645. Cambridge Universi ty Press,

work page 2022

[2] [2]

A co nsistent regularization approach for structured prediction

Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A co nsistent regularization approach for structured prediction. In Daniel D. Lee, Masashi Sugiyama, Ulrike von L uxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 20...

work page 2016

[3] [3]

Learning with combinatorial optimization layers: a probabilistic approach

Guillaume Dalle, Léo Baty, Louis Bouvier, and Axel Parmenti er. Learning with combinatorial optimization layers: a probabilistic approach. ArXiv preprint, abs/2207.13513,

work page arXiv

[4] [4]

Evarist Giné and Richard Nickl

ISSN 0025-1909. Evarist Giné and Richard Nickl. Mathematical foundations of inﬁnite-dimensional statistical models. Cambridge university press,

work page 1909

[5] [5]

A pac approach to applicati on-speciﬁc algorithm selection

Rishi Gupta and Tim Roughgarden. A pac approach to applicati on-speciﬁc algorithm selection. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 123–134,

work page 2016

[6] [6]

Bach, and Alessandro Rudi

Ulysse Marteau-Ferey, Francis R. Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Ma ria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decemb...

work page 2020

[7] [7]

A General Theory for Structured Prediction with Smooth Convex Surrogates

Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. A gener al theory for structured prediction with smooth convex surrogates. ArXiv preprint, abs/1902.01958,

work page internal anchor Pith review Pith/arXiv arXiv 1902

[8] [8]

Axel Parmentier

ISSN 1572-2740, 1572-2759. Axel Parmentier. Learning to Approximate Industrial Probl ems by Operations Research Classic Problems. Operations Research, 2021a. ISSN 0030-364X. Axel Parmentier. Learning structured approximations of op erations research problems. ArXiv preprint, abs/2107.04323, 2021b. Axel Parmentier and Vincent T’Kindt. Learning to solve the...

work page arXiv

[9] [9]

Diﬀerentiation of blackbox combinatorial solvers

Marin Vlastelica Pogancic, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolinek. Diﬀerentiation of blackbox combinatorial solvers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

work page 2020

[10] [10]

Integrated condi tional estimation-optimization

Meng Qi, Paul Grigas, and Zuo-Jun Max Shen. Integrated condi tional estimation-optimization. arXiv preprint arXiv:2110.12351,

work page arXiv

[11] [11]

Furthermore, under the following Assumption (Orl.), ∃τ >0 s.t

{ PX (ρ(ψw (X))√ d(X) <λq ) + exp ( − 1 10λ2(1−q) )} . Furthermore, under the following Assumption (Orl.), ∃τ >0 s.t. EX [ exp ((ρ(ψw (X))√ d(X) )−τ)] < ∞, (Orl.) 1 it holds Vw (λ) = Oλ→0 ( exp(−λ− 2τ 2+τ) ) ; And under the following Assumption (Mom.), ∃τ >0 s.t. EX [(ρ(ψw (X))√ d(X) )−τ] < ∞, (Mom.) it holds Vw (λ) = Oλ→0 ( λτpolylog(λ) ) , where polylog...

work page 2000

[12] [12]

and all q ∈(0, 1), V (λ) = EXPZ [ ∥Z∥2 ≥ρ(ψw (X)) λ ⏐ ⏐X ] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + EX [ exp ( −0. 1ρ2(ψw (X)) λ2 ) 1{ρ(ψw (X))≥λq √ d (X)} ] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + EX [ exp ( − d(X) 10λ2(1−q) )] , ≤PX [ ρ(ψw (X))<λq√ d(X) ] + exp ( − 1 10λ2(1−q) ) , the last inequality stemming from d(X) ≥1 PX -almost surely. Assume ( Orl.), by Markov’s ine...

work page 2012