pith. sign in

arxiv: 2604.25271 · v1 · submitted 2026-04-28 · 📊 stat.ML · cs.LG

Online learning with ErdH{o}s-R\'enyi side-observation graphs

Pith reviewed 2026-05-07 14:57 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords adversarial multi-armed banditsside observationsErdős-Rényi graphsregret minimizationonline learningpartial feedbackadaptive algorithms
0
0 comments X

The pith

Algorithms achieve near-optimal regret in adversarial multi-armed bandits with random side observations of unknown probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines adversarial multi-armed bandit settings where each unselected arm's loss is revealed independently with fixed but unknown probability r. Two algorithms are developed to handle different ranges of this probability value. The first delivers expected regret of order square root of (T divided by r) times log N when r is sufficiently large relative to log T over N. The second handles smaller r with a bound involving log of N plus T instead. These results come close to the best possible performance even for algorithms that know the value of r.

Core claim

We consider adversarial multi-armed bandit problems where the learner observes losses of non-chosen arms with probability r independently. We propose two algorithms for different ranges of r. After T rounds with N arms, the expected regret is O(sqrt((T/r) log N)) for r at least (log T)/(2N), and O(sqrt((T/r) log (N+T))) for smaller r. All bounds are within logarithmic factors of the minimax performance even when r is known.

What carries the argument

The Erdős-Rényi side-observation model combined with two range-specific algorithms and an estimation procedure to select between them.

Load-bearing premise

Losses are chosen adversarially while side observations happen independently with a constant unknown probability r.

What would settle it

An adversarial bandit experiment with N arms over T rounds and independent side observations at rate r where the measured regret significantly exceeds the O(sqrt(T/r log N)) or O(sqrt(T/r log(N+T))) bounds would disprove the result.

Figures

Figures reproduced from arXiv: 2604.25271 by Gergely Neu, Michal Valko, Tom\'a\v{s} Koc\'ak.

Figure 1
Figure 1. Figure 1: Comparison of algorithms for different amount of side information sequences (different sequences view at source ↗
read the original abstract

We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms beside the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability $r$, independently of each other and the action of the learner. We propose two algorithms that work for different ranges of $r$. We show that after $T$ rounds in a bandit problem with $N$ arms, the expected regret of our first algorithm is $O(\sqrt{(T /r) \log N })$ whenever $r\ge(\log T)/(2N)$, while our second algorithm achieves a regret of $O(\sqrt{(T/r) \log (N+T)})$ for smaller values of $r$. We also give a quick estimation procedure that decides the range of~$r$. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know~$r$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript studies adversarial multi-armed bandit problems augmented with side observations drawn from an Erdős-Rényi graph: each non-chosen arm reveals its loss independently with fixed but unknown probability r. Two algorithms are proposed, one for the regime r ≥ (log T)/(2N) achieving expected regret O(√((T/r) log N)), and a second for smaller r with regret O(√((T/r) log (N+T))). A frequency-based estimator for r is used to select the regime. The bounds are shown to be within logarithmic factors of the minimax optimal performance even when r is known to the algorithm.

Significance. If the analyses are correct, this provides a complete solution for this natural side-observation model, with regret that automatically adapts to the density of observations. The bounds recover the full-information scaling when r is not too small and the bandit scaling when r is small, and the estimation overhead is shown to be negligible. This is a solid contribution to the literature on bandits with partial monitoring.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'within logarithmic factors of the best achievable performance' could be made more precise by explicitly stating the information-theoretic lower bound (e.g., Ω(√(T log N / r)) or the appropriate regime-dependent form) that is being matched up to logs.
  2. The estimation procedure for r (mentioned in the abstract) should include a brief statement of the sample complexity or failure probability in the main theorem statements to make the lower-order term claim fully transparent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation of minor revision. We appreciate the recognition that our results provide a complete solution for this natural side-observation model, recovering the appropriate scaling in both the dense and sparse regimes while showing that the estimation overhead is negligible.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives its regret bounds via explicit algorithm analyses that exploit the independent Bernoulli side-observation model and standard concentration inequalities, together with a frequency-based estimator for the regime of r. These steps are self-contained, produce upper bounds that match known information-theoretic scalings up to logarithmic factors, and do not reduce any claimed result to a fitted parameter or self-referential definition. The estimation phase is shown to add only lower-order terms and does not smuggle in the target bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard adversarial bandit model plus the specific random side-observation assumption; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Side observations occur independently with fixed but unknown probability r.
    Explicitly stated in the abstract as the problem setting.

pith-pipeline@v0.9.0 · 5478 in / 1145 out tokens · 40618 ms · 2026-05-07T14:57:07.873135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    Allenberg, C., Auer, P., Gy ¨orfi, L., and Ottucs ´ak, Gy. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. InAlgo- rithmic Learning Theory (ALT), pages 229–243

  2. [2]

    Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond bandits. InConference on Learning Theory (COLT)

  3. [3]

    Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y . (2013). From bandits to experts: A tale of domination and independence. InNeural Information Processing Systems (NeurIPS)

  4. [4]

    and Bubeck, S

    Audibert, J.-Y . and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring.Journal of Machine Learning Research, 11:2785–2836

  5. [5]

    Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. (2014). Stochastic bandits with side observations on networks. InInternational Conference on Measurement and Mod- eling of Computer Systems

  6. [6]

    Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. (2012). Leveraging side observations in stochastic ban- dits. InConference on Uncertainty in Artificial Intelli- gence (UAI)

  7. [7]

    and Valko, M

    Carpentier, A. and Valko, M. (2016). Revealing graph ban- dits for maximizing local influence. InInternational Conference on Artificial Intelligence and Statistics (AIS- TATS), pages 10–18

  8. [8]

    and Lugosi, G

    Cesa-Bianchi, N. and Lugosi, G. (2006).Prediction, learn- ing, and games. Cambridge University Press, New York, NY

  9. [9]

    Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. (2005). Min- imizing regret with label efficient prediction.IEEE Transactions on Information Theory, 51(6):2152–2162

  10. [10]

    Cohen, A., Hazan, T., and Koren, T. (2016). Online learn- ing with feedback graphs without the graphs. InInterna- tional Conference on Machine Learning (ICML)

  11. [11]

    and Schapire, R

    Freund, Y . and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55:119–139. Gy¨orfi, L. and Ottucs´ak, Gy. (2007). Sequential prediction of unbounded stationary time series.IEEE Transactions on Information Theory, 53(5):1866–1872. Koc´ak, T., Neu, G., a...

  12. [12]

    and Warmuth, M

    Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm.Information and Computation, 108(2):212–261

  13. [13]

    and Shamir, O

    Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. InNeural Information Processing Systems (NeurIPS)

  14. [14]

    and Bart ´ok, G

    Neu, G. and Bart ´ok, G. (2013). An efficient algorithm for learning with semi-bandit feedback. InAlgorithmic Learning Theory (ALT)

  15. [15]

    Seldin, Y ., Bartlett, P., Crammer, K., and Abbasi-Yadkori, Y . (2014). Prediction with limited advice and multi- armed bandits with paid observations. InInternational Conference on Machine Learning (ICML)