Online combinatorial optimization with stochastic decision sets and adversarial losses
Pith reviewed 2026-05-07 16:53 UTC · model grok-4.3
The pith
Follow-The-Perturbed-Leader with Counting Asleep Times estimation yields regret bounds for combinatorial optimization with stochastically available actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Follow-The-Perturbed-Leader prediction, paired with the Counting Asleep Times loss estimator, produces sublinear regret for online combinatorial optimization problems in which decision sets are drawn stochastically each round. The estimator recovers unbiased loss information from partial observations of action availability, enabling regret bounds that hold for the full-information setting, the (semi-)bandit setting, and the intermediate restricted-information setting. As a direct consequence, the same approach improves the best known efficient regret guarantees for the sleeping bandit problem with stochastic availability.
What carries the argument
The Counting Asleep Times estimator, which tracks how often each action has been unavailable to impute losses from the observed rounds, combined with the Follow-The-Perturbed-Leader method for choosing actions.
If this is right
- Regret bounds are obtained simultaneously for full information, semi-bandit feedback, and the newly introduced restricted-information feedback model.
- The same algorithmic template delivers a strict improvement over prior efficient methods for the sleeping bandit problem with stochastic availability.
- The Counting Asleep Times estimator can be plugged into Follow-The-Perturbed-Leader without requiring changes to the perturbation distribution.
- Empirical evaluations on synthetic instances confirm lower cumulative loss than earlier approaches across the studied feedback models.
Where Pith is reading between the lines
- The technique may generalize to other stochastic constraints on decision sets, such as random delays or random costs of observation, by replacing the asleep-time counter with an analogous unbiased estimator.
- Because the method works with an efficient FTPL implementation, it offers a practical route to handling intermittent availability in large-scale combinatorial problems like routing or resource allocation.
- The restricted-information setting could serve as a useful intermediate model for real systems where some but not all loss information leaks through when actions are unavailable.
Load-bearing premise
Action availability is generated independently of the adversarial losses, and the Counting Asleep Times procedure produces unbiased loss estimates from the partial observations.
What would settle it
Run the proposed algorithm on a sleeping bandit instance where availability is deliberately correlated with the loss vector; if the observed regret grows linearly rather than sublinearly, the independence assumption fails.
Figures
read the original abstract
Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Asleep Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies online combinatorial optimization where decision sets are stochastic (actions available with independent probability) and losses are adversarial. It develops Follow-The-Perturbed-Leader algorithms equipped with a novel Counting Asleep Times loss estimator, deriving regret bounds for full-information, semi-bandit, and a new restricted-information feedback model, plus an improved efficient guarantee for the sleeping-bandit problem with stochastic availability. Empirical results on synthetic and real data are included to illustrate performance.
Significance. If the regret analysis holds, the work supplies a unified, efficient FTPL-based approach that handles stochastic availability across feedback regimes and improves the best-known polynomial-time bound for sleeping bandits. The Counting Asleep Times estimator appears to furnish unbiased loss estimates from partial observations without circularity, which is a technical contribution. Reproducible code and explicit regret expressions are strengths.
major comments (2)
- [§4.2, Theorem 4.3] §4.2, Theorem 4.3: the regret bound for the restricted-information setting relies on the estimator recovering unbiased estimates of the loss vector; the proof sketch shows unbiasedness only under the independence assumption between availability and losses, but does not quantify the bias introduced when this assumption is mildly violated (as may occur in practice).
- [§5.1, Eq. (12)] §5.1, Eq. (12): the sleeping-bandit regret is stated as O(√(T log |A|)) with high probability; the derivation uses a union bound over sleeping patterns whose number grows with the support size, yet the paper does not verify that the constant remains independent of the availability probability p when p approaches 0.
minor comments (3)
- [§3.2] The definition of the restricted-information feedback model in §3.2 is clear but would benefit from an explicit comparison table against full-info and semi-bandit models.
- [Figure 3] Figure 3 caption does not state the number of independent runs or error bars; add this information for reproducibility.
- [§4.1] Notation for the perturbation distribution in §4.1 re-uses η without re-stating its dependence on T; a single sentence reminding the reader would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful review and the recommendation of minor revision. We address the two major comments point by point below, indicating planned changes to the manuscript.
read point-by-point responses
-
Referee: [§4.2, Theorem 4.3] §4.2, Theorem 4.3: the regret bound for the restricted-information setting relies on the estimator recovering unbiased estimates of the loss vector; the proof sketch shows unbiasedness only under the independence assumption between availability and losses, but does not quantify the bias introduced when this assumption is mildly violated (as may occur in practice).
Authors: The model throughout the paper (Section 2) assumes that action availabilities are drawn independently of the loss vectors. Under this assumption, the proof of Theorem 4.3 shows that the Counting Asleep Times estimator is unbiased. The manuscript does not claim robustness to violations of independence. We will add a short remark in Section 4.2 explicitly stating the assumption and noting that mild dependence may introduce bias, while a precise quantification would require additional modeling of the joint distribution and is beyond the current scope. revision: partial
-
Referee: [§5.1, Eq. (12)] §5.1, Eq. (12): the sleeping-bandit regret is stated as O(√(T log |A|)) with high probability; the derivation uses a union bound over sleeping patterns whose number grows with the support size, yet the paper does not verify that the constant remains independent of the availability probability p when p approaches 0.
Authors: The high-probability bound in Equation (12) is derived by applying a union bound over availability patterns while incorporating the pattern probabilities (which depend on p) into the concentration terms. This yields a leading constant that does not depend on p. We will revise the proof sketch in Section 5.1 to include an explicit sentence verifying this independence. revision: partial
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper extends the established Follow-The-Perturbed-Leader (FTPL) framework with a novel Counting Asleep Times loss estimator for stochastic availability settings. Regret bounds for full-information, semi-bandit, restricted-information, and sleeping-bandit cases are derived from the estimator's unbiasedness properties under the stated assumptions (stochastic independent availability). No step reduces by construction to a fitted input, self-definition, or self-citation chain; the estimator supplies independent grounding for the analysis. Self-citations to prior FTPL results are standard background and not load-bearing for the new contributions. The central claims remain externally falsifiable via the regret analysis.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Action availability follows a stochastic process independent of the loss sequence
- domain assumption The learner receives feedback according to one of the three models (full, bandit, restricted)
invented entities (1)
-
Counting Asleep Times estimator
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Y ., Bubeck, S., and Lugosi, G
Audibert, J. Y ., Bubeck, S., and Lugosi, G. (2014). Regret in online combinatorial optimization. Mathematics of Operations Research. to appear
work page 2014
-
[2]
Auer, P., Cesa-Bianchi, N., Freund, Y ., and Schapire, R. E. (2002). The nonstochastic multi- armed bandit problem.SIAM J. Comput., 32(1):48–77
work page 2002
-
[3]
Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. InConference on Learning Theory (COLT), pages 1–14
work page 2012
-
[4]
Cesa-Bianchi, N. and Lugosi, G. (2006).Prediction, Learning, and Games. Cambridge Univer- sity Press, New York, NY , USA
work page 2006
-
[5]
Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits.Journal of Computer and System Sciences, 78:1404–1422
work page 2012
-
[6]
Dani, V ., Hayes, T. P., and Kakade, S. (2008). The price of bandit information for online optimization. InNeural Information Processing Systems (NeurIPS), pages 345–352
work page 2008
-
[7]
Freund, Y ., Schapire, R., Singer, Y ., and Warmuth, M. (1997). Using and combining predictors that specialize. InACM Symposium on Theory of Computing (STOC), pages 334–343. ACM Press
work page 1997
-
[8]
Gy ¨orgy, A., Linder, T., Lugosi, G., and Ottucs´ak, Gy.. (2007). The on-line shortest path problem under partial monitoring.Journal of Machine Learning Research, 8:2369–2403
work page 2007
-
[9]
Hannan, J. (1957). Approximation to bayes risk in repeated play.Contributions to the theory of games, 3:97–139
work page 1957
-
[10]
Hutter, M. and Poland, J. (2004). Prediction with expert advice by following the perturbed leader for general weights. InAlgorithmic Learning Theory (ALT), pages 279–293
work page 2004
-
[11]
(2010).Recommender Systems: An Introduction
Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010).Recommender Systems: An Introduction. Cambridge University Press
work page 2010
-
[12]
Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems.Journal of Computer and System Sciences, 71:291–307
work page 2005
-
[13]
Kanade, V ., McMahan, H. B., and Bryan, B. (2009). Sleeping experts and bandits with stochas- tic action availability and adversarial rewards. InInternational Conference on Artificial Intelli- gence and Statistics (AISTATS), pages 272–279
work page 2009
-
[14]
Kanade, V . and Steinke, T. (2012). Learning hurdles for sleeping experts. InInnovations in Theoretical Computer Science (ITCS), pages 11–18. ACM
work page 2012
-
[15]
D., Niculescu-Mizil, A., and Sharma, Y
Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y . (2008). Regret bounds for sleeping experts and bandits. InConference on Learning Theory (COLT), pages 425–436
work page 2008
-
[16]
Koshevoy, G. A. (1999). Choice functions and abstract convex geometries.Mathematical Social Sciences, 38(1):35–44
work page 1999
-
[17]
McMahan, H. B. and Blum, A. (2004). Online geometric optimization in the bandit setting against an adaptive adversary. InConference on Learning Theory (COLT), pages 109–123
work page 2004
-
[18]
Neu, G. and Bart ´ok, G. (2013). An efficient algorithm for learning with semi-bandit feedback. InAlgorithmic Learning Theory (ALT), pages 234–248. 10 A Proof of Theorem 1 We begin by applying Lemma 1, exploiting that ˆℓt =ℓ t and eSis an independent copy ofS t: TX t=1 E h eV T t ℓt i − TX t=1 E[π(S t) T ℓt] =E " TX t=1 eVt −π( eS) T ˆℓt # ≤ m(logd+ 1) η ...
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.