Representative Action Selection for Large Action Space Bandit Families

Mark Kozdoba; Quan Zhou; Shie Mannor

arxiv: 2505.18269 · v5 · submitted 2025-05-23 · 💻 cs.LG · math.OC· math.PR· stat.ML

Representative Action Selection for Large Action Space Bandit Families

Quan Zhou , Mark Kozdoba , Shie Mannor This is my paper

Pith reviewed 2026-05-19 12:37 UTC · model grok-4.3

classification 💻 cs.LG math.OCmath.PRstat.ML

keywords bandit algorithmsaction space reductionrepresentative actionssampling methodcorrelated actionstheoretical guarantees

0 comments

The pith

Repeatedly sampling random bandit instances and collecting their optimal actions selects a small representative set that works nearly as well as the full large action space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many bandit problems share a large action space where actions turn out to be highly correlated across different environments, so keeping every action is unnecessary. The paper shows it is possible to select a much smaller subset of representative actions that still delivers close to the best possible performance for the whole family. The method draws instances at random from the family, solves each one, and keeps only the optimal action found; this process repeats until the collected actions suffice. A reader cares because the approach requires no advance knowledge of which actions are correlated or how they relate.

Core claim

The paper claims that repeatedly sampling a bandit instance at random from the family, solving it to find its optimal action, and collecting that action produces a reduced action set that achieves near-optimal performance on the entire family, with theoretical guarantees, even when the correlation structure among actions is unknown in advance.

What carries the argument

The sampling procedure that draws random instances, solves each for its optimal action, and accumulates those optima into a representative action subset.

If this is right

The selected subset can be substituted for the full action space when solving any new instance from the family with only small performance loss.
The same sampling procedure works without any prior information about how actions correlate across environments.
The reduced set remains effective for the family even as the original action space grows very large.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampling idea could be tested in other sequential decision settings where actions are redundant but the redundancy pattern is unknown.
If the family distribution is only approximately known, one could replace exact sampling with a generative model and check whether the collected optima still suffice.

Load-bearing premise

The distribution over bandit instances is such that the optimal actions of a modest number of random samples cover near-optimal choices for almost every other instance in the family.

What would settle it

After running the sampling procedure to build the reduced set, new instances drawn from the same family show substantially lower reward or higher regret when restricted to the reduced set versus the full action space.

read the original abstract

We study the problem of selecting a subset from a large action space shared by a family of bandits. In many natural situations, while the nominal set of actions is large, actions are highly correlated: many yield similar rewards across environments, making it wasteful to maintain the full set. Our aim is to understand whether it is possible -- and how -- to select a smaller set of representative actions that performs nearly as well as the full action space. Our main contribution is a surprisingly simple algorithm: repeatedly sample a bandit instance at random, solve it, and collect the optimal action. This algorithm can significantly reduce the action space when such correlations are present, without the need to know a-priori the correlation structure. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Combinatorial Bandit, Meta Learning Bandit and Zooming baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core is a simple repeated-sampling procedure that collects optimal actions from random bandit instances to shrink large action spaces without upfront correlation knowledge, but its reduction claims rest on an unstated covering property of the instance distribution.

read the letter

The main thing to know is that the authors give a straightforward procedure: draw bandit instances from the family at random, solve each one exactly, and keep the optimal action. Repeat until you have a small set that you then use for the whole family. They say this cuts the action space substantially when correlations exist and does so without needing to know the correlation structure ahead of time. They also supply theoretical guarantees and run comparisons against combinatorial bandits, meta-learning bandits, and zooming methods.

Referee Report

1 major / 2 minor

Summary. The paper studies selecting a representative subset of actions from a large shared action space across a family of bandit instances, where actions are often correlated. The main contribution is a simple algorithm that repeatedly samples bandit instances at random from an underlying distribution, solves each sampled instance to identify its optimal action, and collects these optima to form a reduced action set. The authors claim this approach significantly reduces the action space when correlations exist, without requiring prior knowledge of the correlation structure, and provide theoretical guarantees on its performance along with empirical comparisons against Combinatorial Bandit, Meta Learning Bandit, and Zooming baselines.

Significance. If the theoretical guarantees hold under the paper's assumptions and the empirical results are robust across datasets, this work offers a practical, distribution-agnostic method for exploiting correlations in large action spaces. The algorithm's simplicity is a notable strength, as is the provision of theoretical performance bounds and direct comparisons to established baselines; such results could influence large-scale bandit applications in recommendation and reinforcement learning where full action spaces are computationally prohibitive.

major comments (1)

[Abstract and §3 (Algorithm Description)] The central claim that random sampling of optimal actions yields a sufficient representative set (without a-priori correlation knowledge) depends on an implicit covering property of the instance distribution. No explicit assumption (e.g., a bound on the covering number of optimal actions or a Lipschitz condition on how optima vary across instances) is stated to guarantee that the collected set covers near-optimal performance for unseen draws; if the distribution admits disjoint clusters of optima, the reduction may fail even after many samples. This is load-bearing for the abstract's performance claims and the theoretical guarantees.

minor comments (2)

[§3] Clarify the precise sampling distribution over bandit instances and any hyperparameters in the 'solve it' step (e.g., optimization tolerance or number of samples needed for the guarantee).
[§5 (Experiments)] In the empirical section, report the exact dataset sizes, number of independent runs, and statistical significance tests for the comparisons with the three baselines.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address the major comment below.

read point-by-point responses

Referee: [Abstract and §3 (Algorithm Description)] The central claim that random sampling of optimal actions yields a sufficient representative set (without a-priori correlation knowledge) depends on an implicit covering property of the instance distribution. No explicit assumption (e.g., a bound on the covering number of optimal actions or a Lipschitz condition on how optima vary across instances) is stated to guarantee that the collected set covers near-optimal performance for unseen draws; if the distribution admits disjoint clusters of optima, the reduction may fail even after many samples. This is load-bearing for the abstract's performance claims and the theoretical guarantees.

Authors: We thank the referee for highlighting this important point. Our algorithm is distribution-agnostic in the sense that it requires no prior knowledge of the correlation structure; the sampling procedure discovers representative actions directly from the data. The theoretical analysis (Section 4) provides performance guarantees with respect to the fixed underlying distribution over bandit instances: after sufficiently many samples, the representative set yields near-optimal expected performance on new draws from the same distribution, with the suboptimality gap controlled by a term that decreases with the number of samples. This analysis implicitly relies on the distribution admitting a finite (or effectively coverable) set of relevant optimal actions for the sample size used. We agree that an explicit discussion of covering properties would strengthen the presentation, especially for distributions with disjoint clusters of optima, where sample complexity would increase to achieve high-probability coverage. We will revise the abstract and Section 3 to state that the guarantees are with respect to the instance distribution and add a clarifying remark in the theoretical section on the role of covering. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithm and guarantees are independent of inputs by construction

full rationale

The paper's core algorithm—repeatedly sampling bandit instances from the family, solving each, and collecting the optimal action—is presented as a direct procedure without any fitted parameters or self-referential definitions. Theoretical guarantees are stated to hold under the assumption that the instance distribution induces a small covering set of optima; this is an external modeling assumption on the data-generating process rather than a quantity derived from or fitted to the algorithm's own outputs. No equations reduce a claimed performance bound to a parameter defined by the same sampling procedure, and no load-bearing step relies on a self-citation chain that itself lacks independent verification. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the existence of correlations in the bandit family and on standard concentration or regret bounds from bandit theory.

axioms (1)

domain assumption Bandit family admits a distribution over instances such that optimal actions of sampled instances form a sufficient representative set
Invoked when claiming reduction is possible without knowing correlation structure

pith-pipeline@v0.9.0 · 5684 in / 1200 out tokens · 33494 ms · 2026-05-19T12:37:34.756812+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our main contribution is a surprisingly simple algorithm: repeatedly sample a bandit instance at random, solve it, and collect the optimal action.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We provide theoretical guarantees on the performance of the algorithm... under the standard assumption that the reward process is Gaussian

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.