Online learning with noisy side observations

Gergely Neu; Michal Valko; Tom\'a\v{s} Koc\'ak

arxiv: 2604.13740 · v1 · submitted 2026-04-15 · 💻 cs.LG · stat.ML

Online learning with noisy side observations

Tom\'a\v{s} Koc\'ak , Gergely Neu , Michal Valko This is my paper

Pith reviewed 2026-05-10 14:27 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online learningpartial observabilitynoisy feedbackregret boundsgraph feedbackparameter-freeindependence numberside observations

0 comments

The pith

A parameter-free algorithm achieves regret of order sqrt(alpha-star T) in online learning with noisy side observations from a weighted directed graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a partial-observability model for online learning in which the learner receives its own loss plus noisy feedback about other actions, with the feedback quality governed by edge weights in a weighted directed graph. It develops an efficient algorithm that attains a regret bound of order the square root of alpha-star times the time horizon T, where alpha-star is a new graph property called the effective independence number. The algorithm requires no knowledge or estimation of alpha-star or other parameters. When edge weights are binary, the setting and bounds reduce to those of prior partial-observability models. A sympathetic reader would care because the result shows how graph structure can control regret growth even when side information is noisy and imperfectly known.

Core claim

The central claim is that there exists an efficient, parameter-free algorithm for the online learning problem with noisy side observations that guarantees a regret of order sqrt(alpha-star T) after T rounds, where alpha-star denotes the effective independence number of the underlying weighted directed graph; the algorithm achieves this without requiring prior knowledge or estimation of alpha-star, and recovers the near-optimal bounds of earlier binary-feedback models as a special case.

What carries the argument

The effective independence number alpha-star of the weighted directed graph, which captures the structural property that controls the regret rate, together with the parameter-free algorithm that attains the bound without knowing or estimating this number.

If this is right

The regret depends on the effective independence number rather than the total number of actions, allowing sublinear regret even in large action spaces when the graph is sparse or well-structured.
The algorithm works for arbitrary positive real edge weights without any parameter tuning or graph estimation step.
In the binary-weight case the bounds match the known near-optimal results for partial observability models with no additional assumptions.
The approach yields an efficient, practical method that can be deployed directly on problems whose feedback graph is only partially understood.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the effective independence number remains small for naturally occurring feedback graphs, the result suggests that online learning can scale to very large action sets without full observability.
The model could be extended to time-varying or adversarial graphs while preserving the same regret form, though this is not shown in the paper.
Empirical tests on recommendation or bandit problems with real noisy side information would provide concrete evidence for whether the theoretical alpha-star accurately predicts observed regret.
The reduction to the independence number invites further study of which graph families make alpha-star significantly smaller than the ordinary independence number.

Load-bearing premise

The problem's feedback structure can be represented by a weighted directed graph whose edge weights determine the noise level in side observations, and that an effective independence number alpha-star exists for this graph such that the stated regret bound holds.

What would settle it

Running the algorithm on a concrete weighted graph with a known effective independence number alpha-star and observing that the realized regret grows faster than sqrt(alpha-star T) by more than logarithmic factors after sufficiently large T would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.13740 by Gergely Neu, Michal Valko, Tom\'a\v{s} Koc\'ak.

**Figure 2.** Figure 2: Dependence of α ∗ on the size of the graph with random weights, 100 graphs for each size. from the fact that α ∗ ≤ α(1)/1 ≤ N. This essentially guarantees that incorporating side-observations can never be harmful to the performance of the learner: the regret of Exp3-WIX is always within logarithmic factors of the minimax regret of order √ NT for the standard multi-armed bandit problem without side observ… view at source ↗

**Figure 3.** Figure 3: Comparison of total regrets of the algorithms at time [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of $\widetilde{O}(\sqrt{\alpha^* T})$ after $T$ rounds, where $\alpha^*$ is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of $\alpha^*$. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends binary graph feedback to weighted noisy observations with a parameter-free algorithm whose regret scales by the effective independence number.

read the letter

The main takeaway is a parameter-free algorithm for online learning that handles noisy side observations whose quality varies according to a weighted directed graph. The regret bound is order sqrt(alpha* T), where alpha* is this new graph quantity they call the effective independence number, and the algorithm needs neither alpha* nor any other parameters upfront. For binary weights it recovers the known bounds from Mannor-Shamir and Alon et al., which is a good sign the extension is consistent. What stands out is the move to continuous weights that directly encode observation quality; that feels like a natural next step beyond the 0-1 edge models. The parameter-free guarantee is also useful in practice, since estimating graph properties in advance is often unrealistic. The abstract presents the model and the bound cleanly, with no obvious circularity or dependence on fitted quantities. The main soft spot is that the full proof and algorithm details are not visible here, so one still needs to verify that alpha* is well-defined for typical weighted graphs and that the algorithm remains efficient. Nothing in the description suggests a load-bearing gap, but the bound could turn out loose or the construction more involved than it first appears. This work is aimed at the online learning and bandit community, especially anyone already using graph-structured feedback. A reader who knows the cited partial-observability papers will see the value right away and can judge how much the weighted model changes the landscape. It has enough technical grounding and novelty to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper introduces a partial-observability model for online learning in which the learner receives its own loss plus noisy side observations whose quality is encoded by edge weights in a weighted directed graph. The central contribution is an efficient, parameter-free algorithm whose regret is bounded by O~(sqrt(alpha* T)), where alpha* is a new graph property termed the effective independence number; the algorithm requires neither knowledge nor estimation of alpha*. For the special case of binary edge weights the model and bound recover the results of Mannor-Shamir (2011) and Alon et al. (2013).

Significance. If the claimed regret bound and parameter-free property hold, the work meaningfully generalizes existing partial-observability frameworks to noisy weighted directed graphs while introducing a natural graph-theoretic quantity that controls the regret. The fact that the algorithm adapts without estimating alpha* is a technical strength that aligns with recent trends in adaptive online learning.

major comments (2)

The abstract states that the algorithm achieves O~(sqrt(alpha* T)) regret without knowledge of alpha*, yet the precise definition of alpha* (and the proof that the bound is non-vacuous for arbitrary weight matrices) is not visible in the provided abstract. Section 3 or the main theorem statement must explicitly define alpha* via a closed-form expression or optimization problem and show that the algorithm's internal quantities remain bounded independently of alpha*.
The reduction to the binary-edge case (Mannor-Shamir / Alon et al.) is asserted but not derived. The proof of Theorem 1 (or whichever result states the general bound) should contain an explicit specialization step showing that alpha* coincides with the independence number when all weights are 0/1.

minor comments (2)

Notation for the graph (directed, weighted) and the precise noise model on the side observations should be introduced in the first paragraph of Section 2 rather than deferred to the algorithm description.
The tilde-O notation hides logarithmic factors; the precise dependence on the number of actions n and on the maximum edge weight should be stated explicitly in the main regret theorem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the significance of our contribution and for the detailed comments. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and explicit derivations.

read point-by-point responses

Referee: The abstract states that the algorithm achieves O~(sqrt(alpha* T)) regret without knowledge of alpha*, yet the precise definition of alpha* (and the proof that the bound is non-vacuous for arbitrary weight matrices) is not visible in the provided abstract. Section 3 or the main theorem statement must explicitly define alpha* via a closed-form expression or optimization problem and show that the algorithm's internal quantities remain bounded independently of alpha*.

Authors: We agree that the abstract does not contain the full technical definition of α*, which is standard for abstracts. The definition of the effective independence number α* is already introduced in Section 3 as the solution to a specific optimization problem over the weighted graph. In the revision we will make this definition fully explicit directly in the statement of the main theorem (Theorem 1), including the precise optimization formulation. We will also add a supporting lemma (or paragraph within the proof) demonstrating that all internal quantities maintained by the algorithm (e.g., the potential function and the adaptive weights) remain bounded by quantities independent of α*, thereby establishing that the O~(sqrt(α* T)) bound is non-vacuous for arbitrary non-negative weight matrices. revision: yes
Referee: The reduction to the binary-edge case (Mannor-Shamir / Alon et al.) is asserted but not derived. The proof of Theorem 1 (or whichever result states the general bound) should contain an explicit specialization step showing that alpha* coincides with the independence number when all weights are 0/1.

Authors: We acknowledge that the manuscript currently asserts the reduction to the binary case without an explicit derivation. In the revised version we will insert a dedicated specialization paragraph immediately after the proof of the general bound. This paragraph will show that, when all edge weights belong to {0,1}, the optimization problem defining α* reduces exactly to the standard independence number α of the underlying graph, and that the algorithm and its regret bound recover the near-optimal O~(sqrt(α T)) guarantees of Mannor-Shamir (2011) and Alon et al. (2013). revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a new partial-observability model via weighted directed graphs and introduces the effective independence number alpha* as an intrinsic graph property. The central contribution is an efficient, parameter-free algorithm whose regret analysis yields the bound O~(sqrt(alpha* T)) without requiring knowledge or estimation of alpha*. This structure is consistent with standard adaptive techniques in online learning (e.g., doubling tricks or potential-based methods) that achieve bounds in terms of unknown problem parameters. No step reduces by construction to a fitted quantity, self-citation chain, or renamed input; the derivation remains self-contained against the graph model and standard regret analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim appears to rest on the existence of the effective independence number for any weighted digraph and on standard online-learning assumptions such as bounded losses.

axioms (1)

domain assumption Losses are bounded (standard assumption in online learning)
Implicit in any regret analysis for online learning; not stated explicitly in abstract but required for the O(sqrt(T)) form.

invented entities (1)

effective independence number alpha* no independent evidence
purpose: Quantifies the amount of independent information provided by the noisy side observations under the graph structure
Defined as a novel graph property; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5457 in / 1274 out tokens · 25520 ms · 2026-05-10T14:27:14.719198+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

From bandits to experts: A tale of domination and independence

Alon, Noga, Cesa-Bianchi, Nicol \` o , Gentile, Claudio, and Mansour, Yishay. From bandits to experts: A tale of domination and independence . In Neural Information Processing Systems (NeurIPS), 2013

work page 2013
[2]

Online learning with feedback graphs: Beyond bandits

Alon, Noga, Cesa-Bianchi, Nicol \` o , Dekel, Ofer, and Koren, Tomer. Online learning with feedback graphs: Beyond bandits . In Conference on Learning Theory (COLT), 2015

work page 2015
[3]

Finite-time analysis of the multiarmed bandit problem

Auer, Peter, Cesa-Bianchi, Nicol \` o , and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem . Machine Learning, 47 0 (2-3): 0 235--256, 2002 a

work page 2002
[4]

Adaptive and self-confident on-line learning algorithms

Auer, Peter, Cesa-Bianchi, Nicol \` o , and Gentile, Claudio. Adaptive and self-confident on-line learning algorithms . Journal of Computer and System Sciences, 64: 0 48--75, 2002 b

work page 2002
[5]

Minimax regret of finite partial-monitoring games in stochastic environments

Bart \' o k, G \' a bor, P \' a l, D \' a vid, and Szepesv \' a ri, Csaba. Minimax regret of finite partial-monitoring games in stochastic environments . Conference on Learning Theory (COLT), 2011

work page 2011
[6]

Partial monitoring-classification, regret bounds, and algorithms

Bart \' o k, G \' a bor, Foster, Dean P, P \' a l, D \' a vid, Rakhlin, Alexander, and Szepesv \' a ri, Csaba. Partial monitoring-classification, regret bounds, and algorithms . Mathematics of Operations Research, 39 0 (4): 0 967--997, 2014

work page 2014
[7]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Bubeck, S \' e bastien and Cesa-Bianchi, Nicol \` o . Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems . Foundations and Trends in Machine Learning, 5: 0 1--122, 2012

work page 2012
[8]

Prediction, learning, and games

Cesa-Bianchi, Nicol \` o and Lugosi, G \' a bor. Prediction, learning, and games . Cambridge University Press, New York, NY, 2006

work page 2006
[9]

Online learning of noisy data with kernels

Cesa-Bianchi, Nicol \` o , Shalev-Shwartz, Shai, and Shamir, Oha. Online learning of noisy data with kernels . In Conference on Learning Theory (COLT), 2010

work page 2010
[10]

Prediction by random-walk perturbation

Devroye, Luc, Lugosi, G \' a bor, and Neu, Gergely. Prediction by random-walk perturbation . In Conference on Learning Theory (COLT), 2013

work page 2013
[11]

o rfi, L \' a szl \' o and Ottucs \' a k, Gy \

Gy \" o rfi, L \' a szl \' o and Ottucs \' a k, Gy \" o rgy. Sequential prediction of unbounded stationary time series . IEEE Transactions on Information Theory, 53 0 (5): 0 1866--1872, 2007

work page 2007
[12]

Efficient learning by implicit exploration in bandit problems with side observations

Koc \' a k, Tom \' a s , Neu, Gergely, Valko, Michal, and Munos, R \' e mi. Efficient learning by implicit exploration in bandit problems with side observations . In Neural Information Processing Systems (NeurIPS), 2014

work page 2014
[13]

From bandits to experts: On the value of side-observations

Mannor, Shie and Shamir, Ohad. From bandits to experts: On the value of side-observations . In Neural Information Processing Systems (NeurIPS), 2011

work page 2011
[14]

Online learning with Gaussian payoffs and side observations

Wu, Yifan, Gy \" o rgy, Andr \' a s, and Szepesv \' a ri, Csaba. Online learning with Gaussian payoffs and side observations . In Neural Information Processing Systems (NeurIPS), 2015

work page 2015

[1] [1]

From bandits to experts: A tale of domination and independence

Alon, Noga, Cesa-Bianchi, Nicol \` o , Gentile, Claudio, and Mansour, Yishay. From bandits to experts: A tale of domination and independence . In Neural Information Processing Systems (NeurIPS), 2013

work page 2013

[2] [2]

Online learning with feedback graphs: Beyond bandits

Alon, Noga, Cesa-Bianchi, Nicol \` o , Dekel, Ofer, and Koren, Tomer. Online learning with feedback graphs: Beyond bandits . In Conference on Learning Theory (COLT), 2015

work page 2015

[3] [3]

Finite-time analysis of the multiarmed bandit problem

Auer, Peter, Cesa-Bianchi, Nicol \` o , and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem . Machine Learning, 47 0 (2-3): 0 235--256, 2002 a

work page 2002

[4] [4]

Adaptive and self-confident on-line learning algorithms

Auer, Peter, Cesa-Bianchi, Nicol \` o , and Gentile, Claudio. Adaptive and self-confident on-line learning algorithms . Journal of Computer and System Sciences, 64: 0 48--75, 2002 b

work page 2002

[5] [5]

Minimax regret of finite partial-monitoring games in stochastic environments

Bart \' o k, G \' a bor, P \' a l, D \' a vid, and Szepesv \' a ri, Csaba. Minimax regret of finite partial-monitoring games in stochastic environments . Conference on Learning Theory (COLT), 2011

work page 2011

[6] [6]

Partial monitoring-classification, regret bounds, and algorithms

Bart \' o k, G \' a bor, Foster, Dean P, P \' a l, D \' a vid, Rakhlin, Alexander, and Szepesv \' a ri, Csaba. Partial monitoring-classification, regret bounds, and algorithms . Mathematics of Operations Research, 39 0 (4): 0 967--997, 2014

work page 2014

[7] [7]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Bubeck, S \' e bastien and Cesa-Bianchi, Nicol \` o . Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems . Foundations and Trends in Machine Learning, 5: 0 1--122, 2012

work page 2012

[8] [8]

Prediction, learning, and games

Cesa-Bianchi, Nicol \` o and Lugosi, G \' a bor. Prediction, learning, and games . Cambridge University Press, New York, NY, 2006

work page 2006

[9] [9]

Online learning of noisy data with kernels

Cesa-Bianchi, Nicol \` o , Shalev-Shwartz, Shai, and Shamir, Oha. Online learning of noisy data with kernels . In Conference on Learning Theory (COLT), 2010

work page 2010

[10] [10]

Prediction by random-walk perturbation

Devroye, Luc, Lugosi, G \' a bor, and Neu, Gergely. Prediction by random-walk perturbation . In Conference on Learning Theory (COLT), 2013

work page 2013

[11] [11]

o rfi, L \' a szl \' o and Ottucs \' a k, Gy \

Gy \" o rfi, L \' a szl \' o and Ottucs \' a k, Gy \" o rgy. Sequential prediction of unbounded stationary time series . IEEE Transactions on Information Theory, 53 0 (5): 0 1866--1872, 2007

work page 2007

[12] [12]

Efficient learning by implicit exploration in bandit problems with side observations

Koc \' a k, Tom \' a s , Neu, Gergely, Valko, Michal, and Munos, R \' e mi. Efficient learning by implicit exploration in bandit problems with side observations . In Neural Information Processing Systems (NeurIPS), 2014

work page 2014

[13] [13]

From bandits to experts: On the value of side-observations

Mannor, Shie and Shamir, Ohad. From bandits to experts: On the value of side-observations . In Neural Information Processing Systems (NeurIPS), 2011

work page 2011

[14] [14]

Online learning with Gaussian payoffs and side observations

Wu, Yifan, Gy \" o rgy, Andr \' a s, and Szepesv \' a ri, Csaba. Online learning with Gaussian payoffs and side observations . In Neural Information Processing Systems (NeurIPS), 2015

work page 2015