Online learning with ErdH{o}s-R\'enyi side-observation graphs
Pith reviewed 2026-05-07 14:57 UTC · model grok-4.3
The pith
Algorithms achieve near-optimal regret in adversarial multi-armed bandits with random side observations of unknown probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We consider adversarial multi-armed bandit problems where the learner observes losses of non-chosen arms with probability r independently. We propose two algorithms for different ranges of r. After T rounds with N arms, the expected regret is O(sqrt((T/r) log N)) for r at least (log T)/(2N), and O(sqrt((T/r) log (N+T))) for smaller r. All bounds are within logarithmic factors of the minimax performance even when r is known.
What carries the argument
The Erdős-Rényi side-observation model combined with two range-specific algorithms and an estimation procedure to select between them.
Load-bearing premise
Losses are chosen adversarially while side observations happen independently with a constant unknown probability r.
What would settle it
An adversarial bandit experiment with N arms over T rounds and independent side observations at rate r where the measured regret significantly exceeds the O(sqrt(T/r log N)) or O(sqrt(T/r log(N+T))) bounds would disprove the result.
Figures
read the original abstract
We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms beside the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability $r$, independently of each other and the action of the learner. We propose two algorithms that work for different ranges of $r$. We show that after $T$ rounds in a bandit problem with $N$ arms, the expected regret of our first algorithm is $O(\sqrt{(T /r) \log N })$ whenever $r\ge(\log T)/(2N)$, while our second algorithm achieves a regret of $O(\sqrt{(T/r) \log (N+T)})$ for smaller values of $r$. We also give a quick estimation procedure that decides the range of~$r$. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know~$r$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies adversarial multi-armed bandit problems augmented with side observations drawn from an Erdős-Rényi graph: each non-chosen arm reveals its loss independently with fixed but unknown probability r. Two algorithms are proposed, one for the regime r ≥ (log T)/(2N) achieving expected regret O(√((T/r) log N)), and a second for smaller r with regret O(√((T/r) log (N+T))). A frequency-based estimator for r is used to select the regime. The bounds are shown to be within logarithmic factors of the minimax optimal performance even when r is known to the algorithm.
Significance. If the analyses are correct, this provides a complete solution for this natural side-observation model, with regret that automatically adapts to the density of observations. The bounds recover the full-information scaling when r is not too small and the bandit scaling when r is small, and the estimation overhead is shown to be negligible. This is a solid contribution to the literature on bandits with partial monitoring.
minor comments (2)
- [Abstract] Abstract: the phrase 'within logarithmic factors of the best achievable performance' could be made more precise by explicitly stating the information-theoretic lower bound (e.g., Ω(√(T log N / r)) or the appropriate regime-dependent form) that is being matched up to logs.
- The estimation procedure for r (mentioned in the abstract) should include a brief statement of the sample complexity or failure probability in the main theorem statements to make the lower-order term claim fully transparent.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation of minor revision. We appreciate the recognition that our results provide a complete solution for this natural side-observation model, recovering the appropriate scaling in both the dense and sparse regimes while showing that the estimation overhead is negligible.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper derives its regret bounds via explicit algorithm analyses that exploit the independent Bernoulli side-observation model and standard concentration inequalities, together with a frequency-based estimator for the regime of r. These steps are self-contained, produce upper bounds that match known information-theoretic scalings up to logarithmic factors, and do not reduce any claimed result to a fitted parameter or self-referential definition. The estimation phase is shown to add only lower-order terms and does not smuggle in the target bounds.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Side observations occur independently with fixed but unknown probability r.
Reference graph
Works this paper leans on
-
[1]
Allenberg, C., Auer, P., Gy ¨orfi, L., and Ottucs ´ak, Gy. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. InAlgo- rithmic Learning Theory (ALT), pages 229–243
work page 2006
-
[2]
Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond bandits. InConference on Learning Theory (COLT)
work page 2015
-
[3]
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y . (2013). From bandits to experts: A tale of domination and independence. InNeural Information Processing Systems (NeurIPS)
work page 2013
-
[4]
Audibert, J.-Y . and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring.Journal of Machine Learning Research, 11:2785–2836
work page 2010
-
[5]
Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. (2014). Stochastic bandits with side observations on networks. InInternational Conference on Measurement and Mod- eling of Computer Systems
work page 2014
-
[6]
Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. (2012). Leveraging side observations in stochastic ban- dits. InConference on Uncertainty in Artificial Intelli- gence (UAI)
work page 2012
-
[7]
Carpentier, A. and Valko, M. (2016). Revealing graph ban- dits for maximizing local influence. InInternational Conference on Artificial Intelligence and Statistics (AIS- TATS), pages 10–18
work page 2016
-
[8]
Cesa-Bianchi, N. and Lugosi, G. (2006).Prediction, learn- ing, and games. Cambridge University Press, New York, NY
work page 2006
-
[9]
Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. (2005). Min- imizing regret with label efficient prediction.IEEE Transactions on Information Theory, 51(6):2152–2162
work page 2005
-
[10]
Cohen, A., Hazan, T., and Koren, T. (2016). Online learn- ing with feedback graphs without the graphs. InInterna- tional Conference on Machine Learning (ICML)
work page 2016
-
[11]
Freund, Y . and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55:119–139. Gy¨orfi, L. and Ottucs´ak, Gy. (2007). Sequential prediction of unbounded stationary time series.IEEE Transactions on Information Theory, 53(5):1866–1872. Koc´ak, T., Neu, G., a...
work page 1997
-
[12]
Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm.Information and Computation, 108(2):212–261
work page 1994
-
[13]
Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. InNeural Information Processing Systems (NeurIPS)
work page 2011
-
[14]
Neu, G. and Bart ´ok, G. (2013). An efficient algorithm for learning with semi-bandit feedback. InAlgorithmic Learning Theory (ALT)
work page 2013
-
[15]
Seldin, Y ., Bartlett, P., Crammer, K., and Abbasi-Yadkori, Y . (2014). Prediction with limited advice and multi- armed bandits with paid observations. InInternational Conference on Machine Learning (ICML)
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.