Flickering Multi-Armed Bandits

Amit Kiran Rege; Claire Monteleoni; Lijun Chen; Sourav Chakraborty

arxiv: 2602.17315 · v2 · submitted 2026-02-19 · 💻 cs.LG · cs.AI

Flickering Multi-Armed Bandits

Sourav Chakraborty , Amit Kiran Rege , Claire Monteleoni , Lijun Chen This is my paper

Pith reviewed 2026-05-15 21:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords flickering multi-armed banditsregret boundslazy random walkstochastic graphslocal navigationsublinear regretinformation lower boundgraph processes

0 comments

The pith

A two-phase lazy random walk achieves near-optimal sublinear regret in flickering multi-armed bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines flickering multi-armed bandits as a model where an agent must learn action values while restricted to moving only between neighboring actions on a stochastically evolving graph. It introduces a two-phase lazy random walk algorithm that manages both statistical exploration and the overhead of physical navigation on the graph. The algorithm attains high-probability sublinear regret bounds under i.i.d. Erdős–Rényi and Edge-Markovian dynamics, with these bounds shown to be near-optimal via a matching information-theoretic lower bound. The results quantify the extra cost imposed by local-move constraints in sequential decision problems such as robotic navigation.

Core claim

Under i.i.d. Erdős–Rényi and Edge-Markovian graph processes, the two-phase lazy random walk algorithm for flickering multi-armed bandits attains high-probability sublinear regret while respecting local neighborhood constraints, and this performance is shown to be near-optimal by a matching information-theoretic lower bound that accounts for both learning and navigation overhead.

What carries the argument

The two-phase lazy random walk algorithm that alternates local random-walk exploration phases on the current graph with exploitation phases to balance information acquisition against movement cost.

If this is right

High-probability sublinear regret is guaranteed for the algorithm under the stated stochastic graph models.
The regret scales with both statistical uncertainty and the time required to navigate between actions.
A matching information-theoretic lower bound establishes that the achieved regret is near-optimal.
The robotic disaster-response simulation shows the bounds apply to practical mobility-constrained settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The local-move structure could model learning in wireless sensor networks where nodes communicate only with nearby devices.
Extending the lower bound to adversarial rather than i.i.d. graph sequences would identify when sublinear regret becomes impossible.
Similar navigation-aware algorithms may apply to distributed optimization on time-varying communication graphs.

Load-bearing premise

The analysis relies on the assumption that the underlying graph evolves according to independent and identically distributed Erdős–Rényi or Edge-Markovian processes.

What would settle it

Running the two-phase lazy random walk on graphs generated by the Erdős–Rényi process and observing cumulative regret that grows linearly with time rather than sublinearly would falsify the claimed bounds.

read the original abstract

We introduce Flickering Multi-Armed Bandits (FMAB) to model sequential decision-making in environments with changing action availability, where accessibility of the next action is restricted to a subset dependent on the agent's current choice. We formalize these constraints through stochastically evolving graphs where actions are limited to local neighborhoods. This mobility-constrained structure imposes a dual challenge: the statistical requirement of information acquisition and the physical overhead of navigation. We analyze FMAB under i.i.d. Erd\H{o}s--R'enyi and Edge-Markovian process, proposing a two-phase lazy random walk algorithm for robust exploration. We establish high-probability sublinear regret bounds and prove near-optimality via a matching information-theoretic lower bound. Our results characterize the intrinsic cost of learning under local-move constraints, complemented by a robotic disaster-response simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FMAB adds local mobility constraints via stochastic graphs to bandits and gives sublinear high-probability regret plus a matching lower bound for a lazy random walk under Erdős–Rényi and Edge-Markovian dynamics.

read the letter

The main point is that this paper defines Flickering Multi-Armed Bandits where the agent can only reach arms in the current local neighborhood of a stochastically changing graph, so learning has to pay both statistical and navigation costs. They give a two-phase lazy random walk that achieves high-probability sublinear regret under i.i.d. Erdős–Rényi and Edge-Markovian graph processes, together with an information-theoretic lower bound that matches the upper bound up to constants. The robotic disaster-response simulation is a concrete way to see the model in action. The formulation cleanly separates the dual costs and the analysis uses standard mixing-time and concentration tools, so the derivations look solid when the graph assumptions hold. The contribution is the specific combination of the mobility model, the algorithm, and the regret characterization for those two processes. The soft spot is that the guarantees rely on rapid mixing and reliable connectivity, which those two graph families provide but many other stochastic processes do not. For graphs with slower mixing or persistent disconnected components the same algorithm can incur linear regret, and the paper does not claim or prove results beyond the stated dynamics. The lower bound is correctly tailored to the model rather than overclaimed. This is useful reading for people working on constrained bandits or decision-making with physical movement limits. A reader already familiar with graph-based online learning will see the new angle quickly. I would send it to peer review because the model is well-motivated, the algorithm is simple, and the bounds are non-trivial under the given assumptions.

Referee Report

1 major / 2 minor

Summary. The paper introduces Flickering Multi-Armed Bandits (FMAB), a model for sequential decision-making where action availability is restricted to local neighborhoods in a stochastically evolving graph. Under i.i.d. Erdős–Rényi and Edge-Markovian graph processes, the authors propose a two-phase lazy random walk algorithm, derive high-probability sublinear regret bounds, and prove a matching information-theoretic lower bound. Results are illustrated via a robotic disaster-response simulation.

Significance. If the bounds hold, the work contributes by quantifying the combined statistical and navigation costs of learning under local-move constraints in dynamic graphs. The high-probability analysis and matching lower bound for standard random-graph processes provide a clean characterization of intrinsic regret in mobility-constrained settings, with direct relevance to robotics and network exploration tasks.

major comments (1)

[Regret analysis (post-algorithm section)] The central regret theorem (analysis section following the algorithm definition) establishes sublinear high-probability bounds only under the i.i.d. ER (fixed p) and Edge-Markovian (birth/death probabilities) assumptions that guarantee rapid mixing and connectivity w.h.p. The manuscript should explicitly delimit the claim to these processes in the theorem statement, as the general FMAB model allows arbitrary stochastic graphs for which the same algorithm can incur linear regret.

minor comments (2)

[Abstract] The abstract states that the simulation 'complements' the theory but provides no quantitative regret or navigation-cost numbers; add one or two key empirical observations to the abstract.
[Preliminaries] Notation for graph parameters (p for ER, birth/death rates for Edge-Markovian) should be introduced once in a dedicated notation subsection and used consistently in all subsequent equations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The observation regarding the scope of the regret bounds is valid, and we will revise the manuscript to explicitly delimit the theorem statements to the i.i.d. Erdős–Rényi and Edge-Markovian cases, as suggested. This clarification will improve the precision of the presentation without altering the core contributions.

read point-by-point responses

Referee: The central regret theorem (analysis section following the algorithm definition) establishes sublinear high-probability bounds only under the i.i.d. ER (fixed p) and Edge-Markovian (birth/death probabilities) assumptions that guarantee rapid mixing and connectivity w.h.p. The manuscript should explicitly delimit the claim to these processes in the theorem statement, as the general FMAB model allows arbitrary stochastic graphs for which the same algorithm can incur linear regret.

Authors: We agree that the high-probability sublinear regret bounds rely on the rapid mixing and connectivity properties that hold specifically for the i.i.d. Erdős–Rényi (fixed p) and Edge-Markovian processes. The general FMAB model indeed permits arbitrary stochastic graph processes where the same algorithm may incur linear regret. In the revised version, we will update the theorem statements (and surrounding text in the analysis section) to explicitly state the required assumptions on the graph process and add a clarifying remark that the bounds are not claimed to hold beyond these cases. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived directly from model assumptions

full rationale

The paper derives high-probability sublinear regret bounds for the two-phase lazy random walk under i.i.d. Erdős–Rényi and Edge-Markovian graph evolution, together with a matching information-theoretic lower bound. These follow from the stated stochastic process definitions, connectivity/mixing properties, and standard concentration arguments; no parameters are fitted to data and then relabeled as predictions, no self-citations are load-bearing for the central claims, and no ansatz or uniqueness result is smuggled in. The derivation chain is self-contained against the explicit model assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard random-graph models and the definition of local-move constraints; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption Graph evolves according to i.i.d. Erdős–Rényi or Edge-Markovian process
Explicitly stated as the setting for the regret analysis.

pith-pipeline@v0.9.0 · 5443 in / 1211 out tokens · 37771 ms · 2026-05-15T21:12:42.122056+00:00 · methodology

Flickering Multi-Armed Bandits

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)