pith. sign in

arxiv: 2602.17315 · v2 · submitted 2026-02-19 · 💻 cs.LG · cs.AI

Flickering Multi-Armed Bandits

Pith reviewed 2026-05-15 21:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords flickering multi-armed banditsregret boundslazy random walkstochastic graphslocal navigationsublinear regretinformation lower boundgraph processes
0
0 comments X

The pith

A two-phase lazy random walk achieves near-optimal sublinear regret in flickering multi-armed bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines flickering multi-armed bandits as a model where an agent must learn action values while restricted to moving only between neighboring actions on a stochastically evolving graph. It introduces a two-phase lazy random walk algorithm that manages both statistical exploration and the overhead of physical navigation on the graph. The algorithm attains high-probability sublinear regret bounds under i.i.d. Erdős–Rényi and Edge-Markovian dynamics, with these bounds shown to be near-optimal via a matching information-theoretic lower bound. The results quantify the extra cost imposed by local-move constraints in sequential decision problems such as robotic navigation.

Core claim

Under i.i.d. Erdős–Rényi and Edge-Markovian graph processes, the two-phase lazy random walk algorithm for flickering multi-armed bandits attains high-probability sublinear regret while respecting local neighborhood constraints, and this performance is shown to be near-optimal by a matching information-theoretic lower bound that accounts for both learning and navigation overhead.

What carries the argument

The two-phase lazy random walk algorithm that alternates local random-walk exploration phases on the current graph with exploitation phases to balance information acquisition against movement cost.

If this is right

  • High-probability sublinear regret is guaranteed for the algorithm under the stated stochastic graph models.
  • The regret scales with both statistical uncertainty and the time required to navigate between actions.
  • A matching information-theoretic lower bound establishes that the achieved regret is near-optimal.
  • The robotic disaster-response simulation shows the bounds apply to practical mobility-constrained settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The local-move structure could model learning in wireless sensor networks where nodes communicate only with nearby devices.
  • Extending the lower bound to adversarial rather than i.i.d. graph sequences would identify when sublinear regret becomes impossible.
  • Similar navigation-aware algorithms may apply to distributed optimization on time-varying communication graphs.

Load-bearing premise

The analysis relies on the assumption that the underlying graph evolves according to independent and identically distributed Erdős–Rényi or Edge-Markovian processes.

What would settle it

Running the two-phase lazy random walk on graphs generated by the Erdős–Rényi process and observing cumulative regret that grows linearly with time rather than sublinearly would falsify the claimed bounds.

read the original abstract

We introduce Flickering Multi-Armed Bandits (FMAB) to model sequential decision-making in environments with changing action availability, where accessibility of the next action is restricted to a subset dependent on the agent's current choice. We formalize these constraints through stochastically evolving graphs where actions are limited to local neighborhoods. This mobility-constrained structure imposes a dual challenge: the statistical requirement of information acquisition and the physical overhead of navigation. We analyze FMAB under i.i.d. Erd\H{o}s--R'enyi and Edge-Markovian process, proposing a two-phase lazy random walk algorithm for robust exploration. We establish high-probability sublinear regret bounds and prove near-optimality via a matching information-theoretic lower bound. Our results characterize the intrinsic cost of learning under local-move constraints, complemented by a robotic disaster-response simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Flickering Multi-Armed Bandits (FMAB), a model for sequential decision-making where action availability is restricted to local neighborhoods in a stochastically evolving graph. Under i.i.d. Erdős–Rényi and Edge-Markovian graph processes, the authors propose a two-phase lazy random walk algorithm, derive high-probability sublinear regret bounds, and prove a matching information-theoretic lower bound. Results are illustrated via a robotic disaster-response simulation.

Significance. If the bounds hold, the work contributes by quantifying the combined statistical and navigation costs of learning under local-move constraints in dynamic graphs. The high-probability analysis and matching lower bound for standard random-graph processes provide a clean characterization of intrinsic regret in mobility-constrained settings, with direct relevance to robotics and network exploration tasks.

major comments (1)
  1. [Regret analysis (post-algorithm section)] The central regret theorem (analysis section following the algorithm definition) establishes sublinear high-probability bounds only under the i.i.d. ER (fixed p) and Edge-Markovian (birth/death probabilities) assumptions that guarantee rapid mixing and connectivity w.h.p. The manuscript should explicitly delimit the claim to these processes in the theorem statement, as the general FMAB model allows arbitrary stochastic graphs for which the same algorithm can incur linear regret.
minor comments (2)
  1. [Abstract] The abstract states that the simulation 'complements' the theory but provides no quantitative regret or navigation-cost numbers; add one or two key empirical observations to the abstract.
  2. [Preliminaries] Notation for graph parameters (p for ER, birth/death rates for Edge-Markovian) should be introduced once in a dedicated notation subsection and used consistently in all subsequent equations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The observation regarding the scope of the regret bounds is valid, and we will revise the manuscript to explicitly delimit the theorem statements to the i.i.d. Erdős–Rényi and Edge-Markovian cases, as suggested. This clarification will improve the precision of the presentation without altering the core contributions.

read point-by-point responses
  1. Referee: The central regret theorem (analysis section following the algorithm definition) establishes sublinear high-probability bounds only under the i.i.d. ER (fixed p) and Edge-Markovian (birth/death probabilities) assumptions that guarantee rapid mixing and connectivity w.h.p. The manuscript should explicitly delimit the claim to these processes in the theorem statement, as the general FMAB model allows arbitrary stochastic graphs for which the same algorithm can incur linear regret.

    Authors: We agree that the high-probability sublinear regret bounds rely on the rapid mixing and connectivity properties that hold specifically for the i.i.d. Erdős–Rényi (fixed p) and Edge-Markovian processes. The general FMAB model indeed permits arbitrary stochastic graph processes where the same algorithm may incur linear regret. In the revised version, we will update the theorem statements (and surrounding text in the analysis section) to explicitly state the required assumptions on the graph process and add a clarifying remark that the bounds are not claimed to hold beyond these cases. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived directly from model assumptions

full rationale

The paper derives high-probability sublinear regret bounds for the two-phase lazy random walk under i.i.d. Erdős–Rényi and Edge-Markovian graph evolution, together with a matching information-theoretic lower bound. These follow from the stated stochastic process definitions, connectivity/mixing properties, and standard concentration arguments; no parameters are fitted to data and then relabeled as predictions, no self-citations are load-bearing for the central claims, and no ansatz or uniqueness result is smuggled in. The derivation chain is self-contained against the explicit model assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard random-graph models and the definition of local-move constraints; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption Graph evolves according to i.i.d. Erdős–Rényi or Edge-Markovian process
    Explicitly stated as the setting for the regret analysis.

pith-pipeline@v0.9.0 · 5443 in / 1211 out tokens · 37771 ms · 2026-05-15T21:12:42.122056+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.