Learning Safely Without Knowing the World:COMPASS-Hedge

Luanda Cai; Manolis Vlatakis; Ting Hu

arxiv: 2603.22348 · v3 · pith:OGYU7U44new · submitted 2026-03-22 · 💻 cs.LG · cs.GT

Learning Safely Without Knowing the World:COMPASS-Hedge

Ting Hu , Luanda Cai , Manolis Vlatakis This is my paper

Pith reviewed 2026-05-25 07:11 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords online learningregret minimizationadversarial environmentsstochastic environmentsbaseline safetyparameter-freefull-information feedback

0 comments

The pith

COMPASS-Hedge achieves minimax-optimal regret in adversarial settings, gap-dependent regret in stochastic settings, and constant regret to a baseline, all parameter-free.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online learning algorithms struggle to balance regret performance when facing adversarial or stochastic environments while also providing safety relative to a fixed baseline policy. Most prior methods can optimize for at most two of these goals or require knowledge of the environment type and specific parameters like gap sizes. COMPASS-Hedge introduces a combination of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing to address all three simultaneously. This yields the first full-information anytime algorithm with all three guarantees up to logarithmic factors without any such knowledge. Readers would care because it removes the usual trade-offs in designing robust online learners.

Core claim

The algorithm COMPASS-Hedge is the first full-information anytime method to simultaneously achieve, up to logarithmic factors, minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. It is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps.

What carries the argument

The integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy.

If this is right

It provides the first best-of-three-worlds guarantee in the full-information setting.
Baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
The method works without oracle access to problem-dependent parameters.
It operates in an anytime fashion suitable for unknown time horizons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could potentially be adapted to bandit settings where only partial feedback is available.
Similar mixing strategies might help in other sequential decision problems with mixed environment types.
Testing on datasets with unknown mixtures of adversarial and stochastic behavior would validate practical utility.
Extensions might allow incorporating multiple baselines without losing the guarantees.

Load-bearing premise

The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy delivers all three regret guarantees simultaneously without knowledge of the environment or gaps.

What would settle it

Observe the regret of COMPASS-Hedge in a purely stochastic environment with small known gaps and verify if it scales as the gap-dependent optimal rate rather than the minimax rate.

Figures

Figures reproduced from arXiv: 2603.22348 by Luanda Cai, Manolis Vlatakis, Ting Hu.

**Figure 1.** Figure 1: The Best-of-Three-Worlds landscape. Each circle represents one desideratum. Prior algorithms (annotations) cover at most two regions simultaneously. Compass-Hedge (center) is the first to unify all three in a single, parameter-free algorithm. The Best-of-Three-Worlds Challenge. (BOTW) Despite the maturity of best-of-two-worlds research, the aforementioned triadic tension remains unanswered with he existing… view at source ↗

**Figure 2.** Figure 2: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert (near-zero regret). (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance (risk). Compass-Hedge maintains a flat trajectory near zero comparator regret (d), empirically confirming the O(1) safety bound. Future work. A natural next step is to tig… view at source ↗

**Figure 3.** Figure 3: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert. (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance. Compass-Hedge maintains a flat trajectory (d), confirming the O(1) safety bound. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗

read the original abstract

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. To the best of our knowledge, our algorithm is the first full-information anytime method to simultaneously achieve, up to logarithmic factors: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-world" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The claimed best-of-three-worlds result runs into an unresolved tension between minimax optimality and Õ(1) baseline regret under adversarial losses when the baseline is arbitrary.

read the letter

The main thing to know is that COMPASS-Hedge tries to deliver minimax-optimal regret adversarially, gap-dependent regret stochastically, and Õ(1) regret to a fixed baseline, all parameter-free and without knowing the regime. If the analysis works, it would be a useful unification in full-information online learning. The paper does a solid job framing the trilemma and showing how prior methods fall short on covering all three at once. The proposed ingredients—adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing—are a reasonable way to attempt the combination, and the anytime property is a practical plus. The soft spot is exactly where the stress-test points: in adversarial settings, minimax optimality requires competing with the best expert up to logs. An arbitrary designated baseline can easily incur linear regret to that expert. Achieving Õ(1) regret to the baseline at the same time would force the algorithm to avoid fully tracking the best expert, which undercuts the minimax rate. The mixing strategy is supposed to thread this needle, but nothing in the abstract or positioning shows how the bounds are derived when the baseline is suboptimal. If the full proofs do not supply a clean reconciliation without extra assumptions on the baseline, the three guarantees cannot hold simultaneously. This paper is for people working on adaptive regret bounds and best-of-all-worlds algorithms. A reader already following that literature would get value from seeing the details. It deserves a serious referee to check whether the analysis actually closes the gap or whether the baseline guarantee only works under implicit restrictions.

Referee Report

2 major / 0 minor

Summary. The paper introduces COMPASS-Hedge, a full-information anytime algorithm for online learning that simultaneously achieves (up to log factors) minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. The method is parameter-free, requires no knowledge of the environment type or gap sizes, and relies on a novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing.

Significance. If the three simultaneous guarantees can be established without hidden parameters or environment knowledge, the result would constitute a meaningful advance in online learning by resolving the trilemma of adversarial optimality, stochastic efficiency, and baseline safety in the full-information setting.

major comments (2)

[Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.
[Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and for identifying points that require clarification in the abstract. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.

Authors: We agree that the abstract does not currently display the explicit form of the adversarial regret bound or the precise rule for selecting the mixing coefficient. The full manuscript contains the relevant theorem and proof, but the abstract will be revised to include a concise statement of the bound and the adaptive rule for the mixing weight. revision: yes
Referee: [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

Authors: We agree that the abstract alone does not supply the derivation reconciling the two terms. The revised version will add a short pointer in the abstract to the theorem that contains this derivation, together with a one-sentence description of how the phase-based mechanism keeps the baseline term additive rather than multiplicative. revision: yes

standing simulated objections not resolved

Reconciling a simultaneous Õ(1) regret bound to an arbitrary fixed baseline with a minimax-optimal (sub-linear) regret bound to the best expert in fully adversarial environments, when the baseline itself suffers linear regret relative to the best expert.

Circularity Check

0 steps flagged

No circularity identified; claims rest on uninspectable integration without self-referential reductions

full rationale

The provided abstract and description contain no equations, derivations, or self-citations that can be quoted to exhibit any reduction of a claimed guarantee to a fitted input, self-definition, or prior author result by construction. The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing is asserted to deliver the three guarantees simultaneously, but without mathematical steps or load-bearing citations shown, no circular step meets the criteria for flagging (requires explicit quote and exhibited reduction). This is the normal honest non-finding for an abstract-only view; the derivation chain cannot be walked for equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified existence of the described integration of scaling, phases, and mixing.

pith-pipeline@v0.9.0 · 5746 in / 1226 out tokens · 40078 ms · 2026-05-25T07:11:58.250470+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1]

Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

Daron Andersonand Douglas J Leith. Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

work page 2022
[2]

Regime changes and financial markets.Annu

Andrew Ang and Allan Timmermann. Regime changes and financial markets.Annu. Rev. Financ. Econ., 4(1):313–337, 2012

work page 2012
[3]

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret.arXiv preprint arXiv:1206.6400, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[4]

Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

Angelos Assos, Yuval Dagan, and Constantinos Daskalakis. Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

work page 2024
[5]

Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

work page 2010
[6]

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

work page 1995
[7]

Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

work page 2002
[8]

Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

work page 2002
[9]

Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

work page 2018
[10]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

cambridge university press, 2005

Allan Borodin and Ran El-Yaniv.Online computation and competitive analysis. cambridge university press, 2005

work page 2005
[12]

Sellingtoano-regretbuyer

MarkBraverman,JiemingMao,JonSchneider,andMattWeinberg. Sellingtoano-regretbuyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

work page 2018
[13]

The best of both worlds: Stochastic and adversarial bandits

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. InConference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012

work page 2012
[14]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

work page 2006
[15]

Universal portfolios.Mathematical finance, 1(1):1–29, 1991

Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991

work page 1991
[16]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018. 16

work page 2018
[17]

Safely using predictions in general-sum normal form games

Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. In Proceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

work page 2017
[18]

Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

Steven De Rooĳ, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

work page 2014
[19]

Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

work page 1915
[20]

Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[21]

Strategizingagainstno-regretlearners

YuanDeng,JonSchneider,andBalasubramanianSivan. Strategizingagainstno-regretlearners. Advances in neural information processing systems, 32, 2019

work page 2019
[22]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003
[23]

Regret to the best vs

Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

work page 2008
[24]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

work page 1997
[25]

Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

work page 2015
[26]

Conser- vative exploration in reinforcement learning

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conser- vative exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

work page 2020
[27]

Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement

Aaryan Gupta, Mayank Puri, Mayank Keshan, and Varun Tiwari. Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement. InThispaperwaspresented at Global Forum for Financial Consumers (GFFC), 2024

work page 2024
[28]

Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

work page 2024
[29]

EladHazan.Introductiontoonlineconvexoptimization.FoundationsandTrends®inOptimization, 2(3-4):157–325, 2016

work page 2016
[30]

Adaptive online prediction by following the perturbed leader

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005

work page 2005
[31]

Onlinelearningunderdelayedfeedback

PooriaJoulani,AndrasGyorgy,andCsabaSzepesvári. Onlinelearningunderdelayedfeedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

work page 2013
[32]

Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011. 17

work page 2011
[33]

Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

Yoav Kolumbus, Jon Schneider, Inbal Talgam-Cohen, Emmanouil-Vasileios Vlatakis- Gkaragkounis,JoshuaRWang,andSMWeinberg. Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

work page 2024
[34]

The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

work page 2013
[35]

Second-order quantile methods for experts and combinatorial games

Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. InConference on Learning Theory, pages 1155–1175. PMLR, 2015

work page 2015
[36]

The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015
[37]

Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

Mengmeng Li, Philipp Schneider, Jelisaveta Aleksić, and Daniel Kuhn. Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

work page arXiv 2025
[38]

The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

work page 1994
[39]

Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

MingyangLiu,ChengjieWu,QihanLiu,YansenJing,JunYang,PingzhongTang,andChongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

work page 2022
[40]

Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

work page 2021
[41]

Strategizing against learners in bayesian games

Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games. InConference on Learning Theory, pages 5221–5252. PMLR, 2022

work page 2022
[42]

On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

work page 2019
[43]

Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and Volkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

work page arXiv 2025
[44]

A Modern Introduction to Online Learning

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[45]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

work page 2016
[46]

Exploiting easy data in online optimization

Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. InAdvances in Neural Information Processing Systems (NeurIPS), volume 27, 2014

work page 2014
[47]

Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

work page 2012
[48]

Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016

Wanrong Tang, Suzhi Bi, and Ying Jun Zhang. Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016. 18

work page 2016
[49]

Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

work page 2020
[50]

Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

work page 2015
[51]

Conservative bandits

Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

work page 2016
[52]

Best-of-Both-Worlds

Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits.Journal of Machine Learning Research, 22(28):1–49, 2021. 19 A Extended Related Work and Motivating Scenarios The literature on online learning is vast, spanning several decades of research in information theory, statistics, and computer science. We ...

work page 2021
[53]

Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T)

Adversarial Setting:Assumption A1 ensures the gap isO( √ T). Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T). Substituting back:R k(µ)≤4 p TlogA+O( √ T)

work page
[54]

safety valve

Stochastic Setting:Under Assumptions S1–S2, the pseudo-expected gap iso(1). Using the stochastic Hedge bound (Theorem 2 of Mourtada and Gaïffas [42]): max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ 4 logA+25 ∆ and consequently we have that E[Rhedge]≤ 4 logA+25 ∆ +o(1). Substituting back:R k(µ)≤ 16 logA+100 ∆ +O(1). Comparator Regret (µ=µ c).Finally, we evaluate the regret a...

work page 2015
[55]

We randomly sample a subset ofK=30stocks from the universe

work page
[56]

We instantiate the algorithms: Standard Hedge (Benchmark),Compass-Hedge(Oracle), and Compass-Hedge(Uniform)

work page
[57]

35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s

The algorithms process the stream sequentially. 35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s. Thedoubling parameter is set to2. No hyperparameter tuning was performed on test data to demonstrate the theory’s practicality. D.3 Regime Analysis: Why S&P 500? We classify the timeline into two regimes:

work page
[58]

Stochastic / Benign Regimes (e.g., 2015-2019):Low volatility and positive drift.Compass- Hedgeidentifies the environment is "easy" and accelerates convergence to top performing stocks

work page 2015
[59]

solved". • Comparator Regret( R(µc)): Measures safety against the baseline. The flat trajectory in Uniform experiments (Fig. 3d) signifies the empirical safety theorem. The

Adversarial/CrisisRegimes(e.g.,2020CovidCrash):Synchronizedsell-offsandbreakdown of correlations. During the March 2020 crash,Compass-Hedgedetects no expert reliably beats the baseline and "resets" or keeps mixing weightα low to effectively adhere to the1/N safety strategy. D.4 Additional Discussion on Metrics Pseudo-Regret vs. Comparator Regret. • Pseudo...

work page 2020

[1] [1]

Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

Daron Andersonand Douglas J Leith. Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

work page 2022

[2] [2]

Regime changes and financial markets.Annu

Andrew Ang and Allan Timmermann. Regime changes and financial markets.Annu. Rev. Financ. Econ., 4(1):313–337, 2012

work page 2012

[3] [3]

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret.arXiv preprint arXiv:1206.6400, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[4] [4]

Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

Angelos Assos, Yuval Dagan, and Constantinos Daskalakis. Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

work page 2024

[5] [5]

Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

work page 2010

[6] [6]

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

work page 1995

[7] [7]

Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

work page 2002

[8] [8]

Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

work page 2002

[9] [9]

Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

work page 2018

[10] [10]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

cambridge university press, 2005

Allan Borodin and Ran El-Yaniv.Online computation and competitive analysis. cambridge university press, 2005

work page 2005

[12] [12]

Sellingtoano-regretbuyer

MarkBraverman,JiemingMao,JonSchneider,andMattWeinberg. Sellingtoano-regretbuyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

work page 2018

[13] [13]

The best of both worlds: Stochastic and adversarial bandits

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. InConference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012

work page 2012

[14] [14]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

work page 2006

[15] [15]

Universal portfolios.Mathematical finance, 1(1):1–29, 1991

Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991

work page 1991

[16] [16]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018. 16

work page 2018

[17] [17]

Safely using predictions in general-sum normal form games

Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. In Proceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

work page 2017

[18] [18]

Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

Steven De Rooĳ, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

work page 2014

[19] [19]

Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

work page 1915

[20] [20]

Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[21] [21]

Strategizingagainstno-regretlearners

YuanDeng,JonSchneider,andBalasubramanianSivan. Strategizingagainstno-regretlearners. Advances in neural information processing systems, 32, 2019

work page 2019

[22] [22]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003

[23] [23]

Regret to the best vs

Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

work page 2008

[24] [24]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

work page 1997

[25] [25]

Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

work page 2015

[26] [26]

Conser- vative exploration in reinforcement learning

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conser- vative exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

work page 2020

[27] [27]

Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement

Aaryan Gupta, Mayank Puri, Mayank Keshan, and Varun Tiwari. Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement. InThispaperwaspresented at Global Forum for Financial Consumers (GFFC), 2024

work page 2024

[28] [28]

Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

work page 2024

[29] [29]

EladHazan.Introductiontoonlineconvexoptimization.FoundationsandTrends®inOptimization, 2(3-4):157–325, 2016

work page 2016

[30] [30]

Adaptive online prediction by following the perturbed leader

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005

work page 2005

[31] [31]

Onlinelearningunderdelayedfeedback

PooriaJoulani,AndrasGyorgy,andCsabaSzepesvári. Onlinelearningunderdelayedfeedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

work page 2013

[32] [32]

Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011. 17

work page 2011

[33] [33]

Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

Yoav Kolumbus, Jon Schneider, Inbal Talgam-Cohen, Emmanouil-Vasileios Vlatakis- Gkaragkounis,JoshuaRWang,andSMWeinberg. Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

work page 2024

[34] [34]

The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

work page 2013

[35] [35]

Second-order quantile methods for experts and combinatorial games

Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. InConference on Learning Theory, pages 1155–1175. PMLR, 2015

work page 2015

[36] [36]

The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015

[37] [37]

Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

Mengmeng Li, Philipp Schneider, Jelisaveta Aleksić, and Daniel Kuhn. Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

work page arXiv 2025

[38] [38]

The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

work page 1994

[39] [39]

Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

MingyangLiu,ChengjieWu,QihanLiu,YansenJing,JunYang,PingzhongTang,andChongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

work page 2022

[40] [40]

Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

work page 2021

[41] [41]

Strategizing against learners in bayesian games

Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games. InConference on Learning Theory, pages 5221–5252. PMLR, 2022

work page 2022

[42] [42]

On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

work page 2019

[43] [43]

Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and Volkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

work page arXiv 2025

[44] [44]

A Modern Introduction to Online Learning

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[45] [45]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[46] [46]

Exploiting easy data in online optimization

Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. InAdvances in Neural Information Processing Systems (NeurIPS), volume 27, 2014

work page 2014

[47] [47]

Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

work page 2012

[48] [48]

Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016

Wanrong Tang, Suzhi Bi, and Ying Jun Zhang. Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016. 18

work page 2016

[49] [49]

Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

work page 2020

[50] [50]

Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

work page 2015

[51] [51]

Conservative bandits

Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

work page 2016

[52] [52]

Best-of-Both-Worlds

Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits.Journal of Machine Learning Research, 22(28):1–49, 2021. 19 A Extended Related Work and Motivating Scenarios The literature on online learning is vast, spanning several decades of research in information theory, statistics, and computer science. We ...

work page 2021

[53] [53]

Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T)

Adversarial Setting:Assumption A1 ensures the gap isO( √ T). Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T). Substituting back:R k(µ)≤4 p TlogA+O( √ T)

work page

[54] [54]

safety valve

Stochastic Setting:Under Assumptions S1–S2, the pseudo-expected gap iso(1). Using the stochastic Hedge bound (Theorem 2 of Mourtada and Gaïffas [42]): max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ 4 logA+25 ∆ and consequently we have that E[Rhedge]≤ 4 logA+25 ∆ +o(1). Substituting back:R k(µ)≤ 16 logA+100 ∆ +O(1). Comparator Regret (µ=µ c).Finally, we evaluate the regret a...

work page 2015

[55] [55]

We randomly sample a subset ofK=30stocks from the universe

work page

[56] [56]

We instantiate the algorithms: Standard Hedge (Benchmark),Compass-Hedge(Oracle), and Compass-Hedge(Uniform)

work page

[57] [57]

35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s

The algorithms process the stream sequentially. 35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s. Thedoubling parameter is set to2. No hyperparameter tuning was performed on test data to demonstrate the theory’s practicality. D.3 Regime Analysis: Why S&P 500? We classify the timeline into two regimes:

work page

[58] [58]

Stochastic / Benign Regimes (e.g., 2015-2019):Low volatility and positive drift.Compass- Hedgeidentifies the environment is "easy" and accelerates convergence to top performing stocks

work page 2015

[59] [59]

solved". • Comparator Regret( R(µc)): Measures safety against the baseline. The flat trajectory in Uniform experiments (Fig. 3d) signifies the empirical safety theorem. The

Adversarial/CrisisRegimes(e.g.,2020CovidCrash):Synchronizedsell-offsandbreakdown of correlations. During the March 2020 crash,Compass-Hedgedetects no expert reliably beats the baseline and "resets" or keeps mixing weightα low to effectively adhere to the1/N safety strategy. D.4 Additional Discussion on Metrics Pseudo-Regret vs. Comparator Regret. • Pseudo...

work page 2020