Learning Safely Without Knowing the World:COMPASS-Hedge

Emmanouil-Vasileios Vlatakis-Gkaragkounis; Luanda Cai; Ting Hu

arxiv: 2603.22348 · v4 · pith:OGYU7U44new · submitted 2026-03-22 · 💻 cs.LG · cs.GT

Learning Safely Without Knowing the World:COMPASS-Hedge

Ting Hu , Luanda Cai , Emmanouil-Vasileios Vlatakis-Gkaragkounis This is my paper

Pith reviewed 2026-05-25 07:11 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords online learningregret minimizationadversarial environmentsstochastic environmentsbaseline safetyparameter-freefull-information feedback

0 comments

The pith

COMPASS-Hedge achieves minimax-optimal regret in adversarial settings, gap-dependent regret in stochastic settings, and constant regret to a baseline, all parameter-free.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online learning algorithms struggle to balance regret performance when facing adversarial or stochastic environments while also providing safety relative to a fixed baseline policy. Most prior methods can optimize for at most two of these goals or require knowledge of the environment type and specific parameters like gap sizes. COMPASS-Hedge introduces a combination of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing to address all three simultaneously. This yields the first full-information anytime algorithm with all three guarantees up to logarithmic factors without any such knowledge. Readers would care because it removes the usual trade-offs in designing robust online learners.

Core claim

The algorithm COMPASS-Hedge is the first full-information anytime method to simultaneously achieve, up to logarithmic factors, minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. It is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps.

What carries the argument

The integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy.

If this is right

It provides the first best-of-three-worlds guarantee in the full-information setting.
Baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
The method works without oracle access to problem-dependent parameters.
It operates in an anytime fashion suitable for unknown time horizons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could potentially be adapted to bandit settings where only partial feedback is available.
Similar mixing strategies might help in other sequential decision problems with mixed environment types.
Testing on datasets with unknown mixtures of adversarial and stochastic behavior would validate practical utility.
Extensions might allow incorporating multiple baselines without losing the guarantees.

Load-bearing premise

The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy delivers all three regret guarantees simultaneously without knowledge of the environment or gaps.

What would settle it

Observe the regret of COMPASS-Hedge in a purely stochastic environment with small known gaps and verify if it scales as the gap-dependent optimal rate rather than the minimax rate.

Figures

Figures reproduced from arXiv: 2603.22348 by Emmanouil-Vasileios Vlatakis-Gkaragkounis, Luanda Cai, Ting Hu.

**Figure 1.** Figure 1: The Best-of-Three-Worlds landscape. Each circle represents one desideratum. Prior algorithms (annotations) cover at most two regions simultaneously. Compass-Hedge (center) is the first to unify all three in a single, parameter-free algorithm. The Best-of-Three-Worlds Challenge. (BOTW) Despite the maturity of best-of-two-worlds research, the aforementioned triadic tension remains unanswered with he existing… view at source ↗

**Figure 2.** Figure 2: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert (near-zero regret). (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance (risk). Compass-Hedge maintains a flat trajectory near zero comparator regret (d), empirically confirming the O(1) safety bound. Future work. A natural next step is to tig… view at source ↗

**Figure 3.** Figure 3: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert. (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance. Compass-Hedge maintains a flat trajectory (d), confirming the O(1) safety bound. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗

read the original abstract

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. To the best of our knowledge, our algorithm is the first full-information anytime method to simultaneously achieve, up to logarithmic factors: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-world" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The claimed best-of-three-worlds result runs into an unresolved tension between minimax optimality and Õ(1) baseline regret under adversarial losses when the baseline is arbitrary.

read the letter

The main thing to know is that COMPASS-Hedge tries to deliver minimax-optimal regret adversarially, gap-dependent regret stochastically, and Õ(1) regret to a fixed baseline, all parameter-free and without knowing the regime. If the analysis works, it would be a useful unification in full-information online learning. The paper does a solid job framing the trilemma and showing how prior methods fall short on covering all three at once. The proposed ingredients—adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing—are a reasonable way to attempt the combination, and the anytime property is a practical plus. The soft spot is exactly where the stress-test points: in adversarial settings, minimax optimality requires competing with the best expert up to logs. An arbitrary designated baseline can easily incur linear regret to that expert. Achieving Õ(1) regret to the baseline at the same time would force the algorithm to avoid fully tracking the best expert, which undercuts the minimax rate. The mixing strategy is supposed to thread this needle, but nothing in the abstract or positioning shows how the bounds are derived when the baseline is suboptimal. If the full proofs do not supply a clean reconciliation without extra assumptions on the baseline, the three guarantees cannot hold simultaneously. This paper is for people working on adaptive regret bounds and best-of-all-worlds algorithms. A reader already following that literature would get value from seeing the details. It deserves a serious referee to check whether the analysis actually closes the gap or whether the baseline guarantee only works under implicit restrictions.

Referee Report

2 major / 0 minor

Summary. The paper introduces COMPASS-Hedge, a full-information anytime algorithm for online learning that simultaneously achieves (up to log factors) minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. The method is parameter-free, requires no knowledge of the environment type or gap sizes, and relies on a novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing.

Significance. If the three simultaneous guarantees can be established without hidden parameters or environment knowledge, the result would constitute a meaningful advance in online learning by resolving the trilemma of adversarial optimality, stochastic efficiency, and baseline safety in the full-information setting.

major comments (2)

[Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.
[Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and for identifying points that require clarification in the abstract. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.

Authors: We agree that the abstract does not currently display the explicit form of the adversarial regret bound or the precise rule for selecting the mixing coefficient. The full manuscript contains the relevant theorem and proof, but the abstract will be revised to include a concise statement of the bound and the adaptive rule for the mixing weight. revision: yes
Referee: [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

Authors: We agree that the abstract alone does not supply the derivation reconciling the two terms. The revised version will add a short pointer in the abstract to the theorem that contains this derivation, together with a one-sentence description of how the phase-based mechanism keeps the baseline term additive rather than multiplicative. revision: yes

standing simulated objections not resolved

Reconciling a simultaneous Õ(1) regret bound to an arbitrary fixed baseline with a minimax-optimal (sub-linear) regret bound to the best expert in fully adversarial environments, when the baseline itself suffers linear regret relative to the best expert.

Circularity Check

0 steps flagged

No circularity identified; claims rest on uninspectable integration without self-referential reductions

full rationale

The provided abstract and description contain no equations, derivations, or self-citations that can be quoted to exhibit any reduction of a claimed guarantee to a fitted input, self-definition, or prior author result by construction. The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing is asserted to deliver the three guarantees simultaneously, but without mathematical steps or load-bearing citations shown, no circular step meets the criteria for flagging (requires explicit quote and exhibited reduction). This is the normal honest non-finding for an abstract-only view; the derivation chain cannot be walked for equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified existence of the described integration of scaling, phases, and mixing.

pith-pipeline@v0.9.0 · 5746 in / 1226 out tokens · 40078 ms · 2026-05-25T07:11:58.250470+00:00 · methodology

Learning Safely Without Knowing the World:COMPASS-Hedge

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)