Online combinatorial optimization with stochastic decision sets and adversarial losses

Gergely Neu; Michal Valko

arxiv: 2604.25269 · v1 · submitted 2026-04-28 · 💻 cs.LG · stat.ML

Online combinatorial optimization with stochastic decision sets and adversarial losses

Gergely Neu , Michal Valko This is my paper

Pith reviewed 2026-05-07 16:53 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online combinatorial optimizationstochastic availabilityFollow-The-Perturbed-Leaderloss estimationsleeping banditsregret boundspartial feedback

0 comments

The pith

Follow-The-Perturbed-Leader with Counting Asleep Times estimation yields regret bounds for combinatorial optimization with stochastically available actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies sequential decision making where the set of available composite actions changes randomly each round while losses remain adversarial. It introduces algorithms built on Follow-The-Perturbed-Leader that incorporate a new loss estimation procedure called Counting Asleep Times to handle different levels of feedback. The resulting regret bounds cover full-information, semi-bandit, and a newly defined restricted-information model, with a concrete payoff being tighter guarantees for the sleeping bandit problem under stochastic availability. A reader should care because many practical problems involve unreliable or intermittently available resources such as sensors, road segments, or inventory items, and the method supplies efficient algorithms that still achieve sublinear regret despite the uncertainty in availability.

Core claim

The central claim is that Follow-The-Perturbed-Leader prediction, paired with the Counting Asleep Times loss estimator, produces sublinear regret for online combinatorial optimization problems in which decision sets are drawn stochastically each round. The estimator recovers unbiased loss information from partial observations of action availability, enabling regret bounds that hold for the full-information setting, the (semi-)bandit setting, and the intermediate restricted-information setting. As a direct consequence, the same approach improves the best known efficient regret guarantees for the sleeping bandit problem with stochastic availability.

What carries the argument

The Counting Asleep Times estimator, which tracks how often each action has been unavailable to impute losses from the observed rounds, combined with the Follow-The-Perturbed-Leader method for choosing actions.

If this is right

Regret bounds are obtained simultaneously for full information, semi-bandit feedback, and the newly introduced restricted-information feedback model.
The same algorithmic template delivers a strict improvement over prior efficient methods for the sleeping bandit problem with stochastic availability.
The Counting Asleep Times estimator can be plugged into Follow-The-Perturbed-Leader without requiring changes to the perturbation distribution.
Empirical evaluations on synthetic instances confirm lower cumulative loss than earlier approaches across the studied feedback models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique may generalize to other stochastic constraints on decision sets, such as random delays or random costs of observation, by replacing the asleep-time counter with an analogous unbiased estimator.
Because the method works with an efficient FTPL implementation, it offers a practical route to handling intermittent availability in large-scale combinatorial problems like routing or resource allocation.
The restricted-information setting could serve as a useful intermediate model for real systems where some but not all loss information leaks through when actions are unavailable.

Load-bearing premise

Action availability is generated independently of the adversarial losses, and the Counting Asleep Times procedure produces unbiased loss estimates from the partial observations.

What would settle it

Run the proposed algorithm on a sleeping bandit instance where availability is deliberately correlated with the loss vector; if the observed regret grows linearly rather than sublinearly, the independence assumption fails.

Figures

Figures reproduced from arXiv: 2604.25269 by Gergely Neu, Michal Valko.

**Figure 1.** Figure 1: The protocol of online combinatorial optimization with stochastic action availability. view at source ↗

**Figure 2.** Figure 2: Left: Multi-arm bandits - varying availabilities. Middle: Shortest paths on a 3 × 3 grid. Right: Shortest paths on a 10 × 10 grid. For the bandit case, we evaluate SLEEPINGCATBANDIT using the same empirical setting as Kanade et al. [13]. We consider an experiment with T = 104 and 5 arms, each of which are available independently of each other with probability p. At the beginning of the experiment, the loss… view at source ↗

read the original abstract

Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Asleep Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper extends FTPL to stochastic action availability with a new Counting Asleep Times estimator, delivering better sleeping-bandit bounds and a useful restricted-feedback model.

read the letter

The main point is that Neu and Valko handle online combinatorial optimization where actions become unavailable stochastically, as in sensor failures or stockouts. They apply Follow-The-Perturbed-Leader across full-information, semi-bandit, and a new restricted-information feedback setting, using their Counting Asleep Times estimator to recover losses from partial observations. This produces regret bounds and, as a direct consequence, tighter efficient guarantees for the sleeping bandit problem with stochastic availability. They also run empirical comparisons that show gains over prior methods. The restricted-information model sits between full and bandit feedback and looks like a practical addition for applications with limited but structured observations. The estimator itself supplies the key technical step, and the analysis stays within the standard FTPL perturbation framework without obvious circularity. The work is grounded in existing literature on sleeping bandits and combinatorial online learning. The main assumption is that availability is stochastic and independent of the losses; the estimator must remain unbiased under that condition. If the paper's proofs establish this cleanly, the bounds follow. The abstract calls the sleeping-bandit improvement significant, but the exact constants and comparison baselines would need checking in the full version. No internal contradictions appear at the high level. This is for researchers working on bandits or online learning with dynamic or partial action sets. A reader focused on regret analysis and modest empirical validation will find concrete value. It deserves peer review because it fills a documented practical gap with new algorithmic tools and verifiable claims.

Referee Report

2 major / 3 minor

Summary. The manuscript studies online combinatorial optimization where decision sets are stochastic (actions available with independent probability) and losses are adversarial. It develops Follow-The-Perturbed-Leader algorithms equipped with a novel Counting Asleep Times loss estimator, deriving regret bounds for full-information, semi-bandit, and a new restricted-information feedback model, plus an improved efficient guarantee for the sleeping-bandit problem with stochastic availability. Empirical results on synthetic and real data are included to illustrate performance.

Significance. If the regret analysis holds, the work supplies a unified, efficient FTPL-based approach that handles stochastic availability across feedback regimes and improves the best-known polynomial-time bound for sleeping bandits. The Counting Asleep Times estimator appears to furnish unbiased loss estimates from partial observations without circularity, which is a technical contribution. Reproducible code and explicit regret expressions are strengths.

major comments (2)

[§4.2, Theorem 4.3] §4.2, Theorem 4.3: the regret bound for the restricted-information setting relies on the estimator recovering unbiased estimates of the loss vector; the proof sketch shows unbiasedness only under the independence assumption between availability and losses, but does not quantify the bias introduced when this assumption is mildly violated (as may occur in practice).
[§5.1, Eq. (12)] §5.1, Eq. (12): the sleeping-bandit regret is stated as O(√(T log |A|)) with high probability; the derivation uses a union bound over sleeping patterns whose number grows with the support size, yet the paper does not verify that the constant remains independent of the availability probability p when p approaches 0.

minor comments (3)

[§3.2] The definition of the restricted-information feedback model in §3.2 is clear but would benefit from an explicit comparison table against full-info and semi-bandit models.
[Figure 3] Figure 3 caption does not state the number of independent runs or error bars; add this information for reproducibility.
[§4.1] Notation for the perturbation distribution in §4.1 re-uses η without re-stating its dependence on T; a single sentence reminding the reader would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. We address the two major comments point by point below, indicating planned changes to the manuscript.

read point-by-point responses

Referee: [§4.2, Theorem 4.3] §4.2, Theorem 4.3: the regret bound for the restricted-information setting relies on the estimator recovering unbiased estimates of the loss vector; the proof sketch shows unbiasedness only under the independence assumption between availability and losses, but does not quantify the bias introduced when this assumption is mildly violated (as may occur in practice).

Authors: The model throughout the paper (Section 2) assumes that action availabilities are drawn independently of the loss vectors. Under this assumption, the proof of Theorem 4.3 shows that the Counting Asleep Times estimator is unbiased. The manuscript does not claim robustness to violations of independence. We will add a short remark in Section 4.2 explicitly stating the assumption and noting that mild dependence may introduce bias, while a precise quantification would require additional modeling of the joint distribution and is beyond the current scope. revision: partial
Referee: [§5.1, Eq. (12)] §5.1, Eq. (12): the sleeping-bandit regret is stated as O(√(T log |A|)) with high probability; the derivation uses a union bound over sleeping patterns whose number grows with the support size, yet the paper does not verify that the constant remains independent of the availability probability p when p approaches 0.

Authors: The high-probability bound in Equation (12) is derived by applying a union bound over availability patterns while incorporating the pattern probabilities (which depend on p) into the concentration terms. This yields a leading constant that does not depend on p. We will revise the proof sketch in Section 5.1 to include an explicit sentence verifying this independence. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper extends the established Follow-The-Perturbed-Leader (FTPL) framework with a novel Counting Asleep Times loss estimator for stochastic availability settings. Regret bounds for full-information, semi-bandit, restricted-information, and sleeping-bandit cases are derived from the estimator's unbiasedness properties under the stated assumptions (stochastic independent availability). No step reduces by construction to a fitted input, self-definition, or self-citation chain; the estimator supplies independent grounding for the analysis. Self-citations to prior FTPL results are standard background and not load-bearing for the new contributions. The central claims remain externally falsifiable via the regret analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on standard online learning assumptions plus the validity of the new estimator; no free parameters or invented entities are explicitly fitted or postulated in the abstract.

axioms (2)

domain assumption Action availability follows a stochastic process independent of the loss sequence
Implicit in the problem statement of stochastic decision sets with adversarial losses.
domain assumption The learner receives feedback according to one of the three models (full, bandit, restricted)
Used to derive the regret bounds for each setting.

invented entities (1)

Counting Asleep Times estimator no independent evidence
purpose: Estimate losses of unavailable actions from partial observations
New technique introduced to enable the regret analysis under stochastic availability.

pith-pipeline@v0.9.0 · 5476 in / 1452 out tokens · 82863 ms · 2026-05-07T16:53:57.497129+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

Y ., Bubeck, S., and Lugosi, G

Audibert, J. Y ., Bubeck, S., and Lugosi, G. (2014). Regret in online combinatorial optimization. Mathematics of Operations Research. to appear

work page 2014
[2]

Auer, P., Cesa-Bianchi, N., Freund, Y ., and Schapire, R. E. (2002). The nonstochastic multi- armed bandit problem.SIAM J. Comput., 32(1):48–77

work page 2002
[3]

Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. InConference on Learning Theory (COLT), pages 1–14

work page 2012
[4]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. (2006).Prediction, Learning, and Games. Cambridge Univer- sity Press, New York, NY , USA

work page 2006
[5]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits.Journal of Computer and System Sciences, 78:1404–1422

work page 2012
[6]

P., and Kakade, S

Dani, V ., Hayes, T. P., and Kakade, S. (2008). The price of bandit information for online optimization. InNeural Information Processing Systems (NeurIPS), pages 345–352

work page 2008
[7]

Freund, Y ., Schapire, R., Singer, Y ., and Warmuth, M. (1997). Using and combining predictors that specialize. InACM Symposium on Theory of Computing (STOC), pages 334–343. ACM Press

work page 1997
[8]

Gy ¨orgy, A., Linder, T., Lugosi, G., and Ottucs´ak, Gy.. (2007). The on-line shortest path problem under partial monitoring.Journal of Machine Learning Research, 8:2369–2403

work page 2007
[9]

Hannan, J. (1957). Approximation to bayes risk in repeated play.Contributions to the theory of games, 3:97–139

work page 1957
[10]

and Poland, J

Hutter, M. and Poland, J. (2004). Prediction with expert advice by following the perturbed leader for general weights. InAlgorithmic Learning Theory (ALT), pages 279–293

work page 2004
[11]

(2010).Recommender Systems: An Introduction

Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010).Recommender Systems: An Introduction. Cambridge University Press

work page 2010
[12]

and Vempala, S

Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems.Journal of Computer and System Sciences, 71:291–307

work page 2005
[13]

B., and Bryan, B

Kanade, V ., McMahan, H. B., and Bryan, B. (2009). Sleeping experts and bandits with stochas- tic action availability and adversarial rewards. InInternational Conference on Artificial Intelli- gence and Statistics (AISTATS), pages 272–279

work page 2009
[14]

and Steinke, T

Kanade, V . and Steinke, T. (2012). Learning hurdles for sleeping experts. InInnovations in Theoretical Computer Science (ITCS), pages 11–18. ACM

work page 2012
[15]

D., Niculescu-Mizil, A., and Sharma, Y

Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y . (2008). Regret bounds for sleeping experts and bandits. InConference on Learning Theory (COLT), pages 425–436

work page 2008
[16]

Koshevoy, G. A. (1999). Choice functions and abstract convex geometries.Mathematical Social Sciences, 38(1):35–44

work page 1999
[17]

McMahan, H. B. and Blum, A. (2004). Online geometric optimization in the bandit setting against an adaptive adversary. InConference on Learning Theory (COLT), pages 109–123

work page 2004
[18]

and Bart ´ok, G

Neu, G. and Bart ´ok, G. (2013). An efficient algorithm for learning with semi-bandit feedback. InAlgorithmic Learning Theory (ALT), pages 234–248. 10 A Proof of Theorem 1 We begin by applying Lemma 1, exploiting that ˆℓt =ℓ t and eSis an independent copy ofS t: TX t=1 E h eV T t ℓt i − TX t=1 E[π(S t) T ℓt] =E " TX t=1 eVt −π( eS) T ˆℓt # ≤ m(logd+ 1) η ...

work page 2013

[1] [1]

Y ., Bubeck, S., and Lugosi, G

Audibert, J. Y ., Bubeck, S., and Lugosi, G. (2014). Regret in online combinatorial optimization. Mathematics of Operations Research. to appear

work page 2014

[2] [2]

Auer, P., Cesa-Bianchi, N., Freund, Y ., and Schapire, R. E. (2002). The nonstochastic multi- armed bandit problem.SIAM J. Comput., 32(1):48–77

work page 2002

[3] [3]

Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. InConference on Learning Theory (COLT), pages 1–14

work page 2012

[4] [4]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. (2006).Prediction, Learning, and Games. Cambridge Univer- sity Press, New York, NY , USA

work page 2006

[5] [5]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits.Journal of Computer and System Sciences, 78:1404–1422

work page 2012

[6] [6]

P., and Kakade, S

Dani, V ., Hayes, T. P., and Kakade, S. (2008). The price of bandit information for online optimization. InNeural Information Processing Systems (NeurIPS), pages 345–352

work page 2008

[7] [7]

Freund, Y ., Schapire, R., Singer, Y ., and Warmuth, M. (1997). Using and combining predictors that specialize. InACM Symposium on Theory of Computing (STOC), pages 334–343. ACM Press

work page 1997

[8] [8]

Gy ¨orgy, A., Linder, T., Lugosi, G., and Ottucs´ak, Gy.. (2007). The on-line shortest path problem under partial monitoring.Journal of Machine Learning Research, 8:2369–2403

work page 2007

[9] [9]

Hannan, J. (1957). Approximation to bayes risk in repeated play.Contributions to the theory of games, 3:97–139

work page 1957

[10] [10]

and Poland, J

Hutter, M. and Poland, J. (2004). Prediction with expert advice by following the perturbed leader for general weights. InAlgorithmic Learning Theory (ALT), pages 279–293

work page 2004

[11] [11]

(2010).Recommender Systems: An Introduction

Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010).Recommender Systems: An Introduction. Cambridge University Press

work page 2010

[12] [12]

and Vempala, S

Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems.Journal of Computer and System Sciences, 71:291–307

work page 2005

[13] [13]

B., and Bryan, B

Kanade, V ., McMahan, H. B., and Bryan, B. (2009). Sleeping experts and bandits with stochas- tic action availability and adversarial rewards. InInternational Conference on Artificial Intelli- gence and Statistics (AISTATS), pages 272–279

work page 2009

[14] [14]

and Steinke, T

Kanade, V . and Steinke, T. (2012). Learning hurdles for sleeping experts. InInnovations in Theoretical Computer Science (ITCS), pages 11–18. ACM

work page 2012

[15] [15]

D., Niculescu-Mizil, A., and Sharma, Y

Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y . (2008). Regret bounds for sleeping experts and bandits. InConference on Learning Theory (COLT), pages 425–436

work page 2008

[16] [16]

Koshevoy, G. A. (1999). Choice functions and abstract convex geometries.Mathematical Social Sciences, 38(1):35–44

work page 1999

[17] [17]

McMahan, H. B. and Blum, A. (2004). Online geometric optimization in the bandit setting against an adaptive adversary. InConference on Learning Theory (COLT), pages 109–123

work page 2004

[18] [18]

and Bart ´ok, G

Neu, G. and Bart ´ok, G. (2013). An efficient algorithm for learning with semi-bandit feedback. InAlgorithmic Learning Theory (ALT), pages 234–248. 10 A Proof of Theorem 1 We begin by applying Lemma 1, exploiting that ˆℓt =ℓ t and eSis an independent copy ofS t: TX t=1 E h eV T t ℓt i − TX t=1 E[π(S t) T ℓt] =E " TX t=1 eVt −π( eS) T ˆℓt # ≤ m(logd+ 1) η ...

work page 2013