Learning Safely Without Knowing the World:COMPASS-Hedge
Pith reviewed 2026-05-25 07:11 UTC · model grok-4.3
The pith
COMPASS-Hedge achieves minimax-optimal regret in adversarial settings, gap-dependent regret in stochastic settings, and constant regret to a baseline, all parameter-free.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The algorithm COMPASS-Hedge is the first full-information anytime method to simultaneously achieve, up to logarithmic factors, minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. It is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps.
What carries the argument
The integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy.
If this is right
- It provides the first best-of-three-worlds guarantee in the full-information setting.
- Baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
- The method works without oracle access to problem-dependent parameters.
- It operates in an anytime fashion suitable for unknown time horizons.
Where Pith is reading between the lines
- The technique could potentially be adapted to bandit settings where only partial feedback is available.
- Similar mixing strategies might help in other sequential decision problems with mixed environment types.
- Testing on datasets with unknown mixtures of adversarial and stochastic behavior would validate practical utility.
- Extensions might allow incorporating multiple baselines without losing the guarantees.
Load-bearing premise
The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy delivers all three regret guarantees simultaneously without knowledge of the environment or gaps.
What would settle it
Observe the regret of COMPASS-Hedge in a purely stochastic environment with small known gaps and verify if it scales as the gap-dependent optimal rate rather than the minimax rate.
Figures
read the original abstract
Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. To the best of our knowledge, our algorithm is the first full-information anytime method to simultaneously achieve, up to logarithmic factors: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-world" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COMPASS-Hedge, a full-information anytime algorithm for online learning that simultaneously achieves (up to log factors) minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. The method is parameter-free, requires no knowledge of the environment type or gap sizes, and relies on a novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing.
Significance. If the three simultaneous guarantees can be established without hidden parameters or environment knowledge, the result would constitute a meaningful advance in online learning by resolving the trilemma of adversarial optimality, stochastic efficiency, and baseline safety in the full-information setting.
major comments (2)
- [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.
- [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying points that require clarification in the abstract. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.
Authors: We agree that the abstract does not currently display the explicit form of the adversarial regret bound or the precise rule for selecting the mixing coefficient. The full manuscript contains the relevant theorem and proof, but the abstract will be revised to include a concise statement of the bound and the adaptive rule for the mixing weight. revision: yes
-
Referee: [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.
Authors: We agree that the abstract alone does not supply the derivation reconciling the two terms. The revised version will add a short pointer in the abstract to the theorem that contains this derivation, together with a one-sentence description of how the phase-based mechanism keeps the baseline term additive rather than multiplicative. revision: yes
- Reconciling a simultaneous Õ(1) regret bound to an arbitrary fixed baseline with a minimax-optimal (sub-linear) regret bound to the best expert in fully adversarial environments, when the baseline itself suffers linear regret relative to the best expert.
Circularity Check
No circularity identified; claims rest on uninspectable integration without self-referential reductions
full rationale
The provided abstract and description contain no equations, derivations, or self-citations that can be quoted to exhibit any reduction of a claimed guarantee to a fitted input, self-definition, or prior author result by construction. The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing is asserted to deliver the three guarantees simultaneously, but without mathematical steps or load-bearing citations shown, no circular step meets the criteria for flagging (requires explicit quote and exhibited reduction). This is the normal honest non-finding for an abstract-only view; the derivation chain cannot be walked for equivalence to inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Daron Andersonand Douglas J Leith. Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022
work page 2022
-
[2]
Regime changes and financial markets.Annu
Andrew Ang and Allan Timmermann. Regime changes and financial markets.Annu. Rev. Financ. Econ., 4(1):313–337, 2012
work page 2012
-
[3]
Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret
Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret.arXiv preprint arXiv:1206.6400, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[4]
Angelos Assos, Yuval Dagan, and Constantinos Daskalakis. Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024
work page 2024
-
[5]
Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010
work page 2010
-
[6]
Gambling in a rigged casino: The adversarial multi-armed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995
work page 1995
-
[7]
Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002
work page 2002
-
[8]
Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002
work page 2002
-
[9]
Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018
Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018
work page 2018
-
[10]
What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
cambridge university press, 2005
Allan Borodin and Ran El-Yaniv.Online computation and competitive analysis. cambridge university press, 2005
work page 2005
-
[12]
MarkBraverman,JiemingMao,JonSchneider,andMattWeinberg. Sellingtoano-regretbuyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018
work page 2018
-
[13]
The best of both worlds: Stochastic and adversarial bandits
Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. InConference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012
work page 2012
-
[14]
Cambridge university press, 2006
Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006
work page 2006
-
[15]
Universal portfolios.Mathematical finance, 1(1):1–29, 1991
Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991
work page 1991
-
[16]
Black-box reductions for parameter-free online learning in banach spaces
Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018. 16
work page 2018
-
[17]
Safely using predictions in general-sum normal form games
Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. In Proceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017
work page 2017
-
[18]
Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014
work page 2014
-
[19]
Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009
work page 1915
-
[20]
Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[21]
Strategizingagainstno-regretlearners
YuanDeng,JonSchneider,andBalasubramanianSivan. Strategizingagainstno-regretlearners. Advances in neural information processing systems, 32, 2019
work page 2019
-
[22]
Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
-
[23]
Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008
work page 2008
-
[24]
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997
work page 1997
-
[25]
Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015
Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015
work page 2015
-
[26]
Conser- vative exploration in reinforcement learning
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conser- vative exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020
work page 2020
-
[27]
Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement
Aaryan Gupta, Mayank Puri, Mayank Keshan, and Varun Tiwari. Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement. InThispaperwaspresented at Global Forum for Financial Consumers (GFFC), 2024
work page 2024
-
[28]
Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024
Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024
work page 2024
-
[29]
EladHazan.Introductiontoonlineconvexoptimization.FoundationsandTrends®inOptimization, 2(3-4):157–325, 2016
work page 2016
-
[30]
Adaptive online prediction by following the perturbed leader
Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005
work page 2005
-
[31]
Onlinelearningunderdelayedfeedback
PooriaJoulani,AndrasGyorgy,andCsabaSzepesvári. Onlinelearningunderdelayedfeedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013
work page 2013
-
[32]
Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011
Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011. 17
work page 2011
-
[33]
Yoav Kolumbus, Jon Schneider, Inbal Talgam-Cohen, Emmanouil-Vasileios Vlatakis- Gkaragkounis,JoshuaRWang,andSMWeinberg. Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024
work page 2024
-
[34]
The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013
Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013
work page 2013
-
[35]
Second-order quantile methods for experts and combinatorial games
Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. InConference on Learning Theory, pages 1155–1175. PMLR, 2015
work page 2015
-
[36]
The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015
Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015
work page 2015
-
[37]
Mengmeng Li, Philipp Schneider, Jelisaveta Aleksić, and Daniel Kuhn. Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025
-
[38]
The weighted majority algorithm.Information and computation, 108(2):212–261, 1994
Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108(2):212–261, 1994
work page 1994
-
[39]
MingyangLiu,ChengjieWu,QihanLiu,YansenJing,JunYang,PingzhongTang,andChongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022
work page 2022
-
[40]
Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021
work page 2021
-
[41]
Strategizing against learners in bayesian games
Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games. InConference on Learning Theory, pages 5221–5252. PMLR, 2022
work page 2022
-
[42]
Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019
work page 2019
-
[43]
Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025
Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and Volkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025
-
[44]
A Modern Introduction to Online Learning
Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[45]
Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016
work page 2016
-
[46]
Exploiting easy data in online optimization
Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. InAdvances in Neural Information Processing Systems (NeurIPS), volume 27, 2014
work page 2014
-
[47]
Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012
work page 2012
-
[48]
Wanrong Tang, Suzhi Bi, and Ying Jun Zhang. Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016. 18
work page 2016
-
[49]
Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020
work page 2020
-
[50]
Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015
work page 2015
-
[51]
Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016
work page 2016
-
[52]
Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits.Journal of Machine Learning Research, 22(28):1–49, 2021. 19 A Extended Related Work and Motivating Scenarios The literature on online learning is vast, spanning several decades of research in information theory, statistics, and computer science. We ...
work page 2021
-
[53]
Adversarial Setting:Assumption A1 ensures the gap isO( √ T). Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T). Substituting back:R k(µ)≤4 p TlogA+O( √ T)
-
[54]
Stochastic Setting:Under Assumptions S1–S2, the pseudo-expected gap iso(1). Using the stochastic Hedge bound (Theorem 2 of Mourtada and Gaïffas [42]): max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ 4 logA+25 ∆ and consequently we have that E[Rhedge]≤ 4 logA+25 ∆ +o(1). Substituting back:R k(µ)≤ 16 logA+100 ∆ +O(1). Comparator Regret (µ=µ c).Finally, we evaluate the regret a...
work page 2015
-
[55]
We randomly sample a subset ofK=30stocks from the universe
-
[56]
We instantiate the algorithms: Standard Hedge (Benchmark),Compass-Hedge(Oracle), and Compass-Hedge(Uniform)
-
[57]
35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s
The algorithms process the stream sequentially. 35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s. Thedoubling parameter is set to2. No hyperparameter tuning was performed on test data to demonstrate the theory’s practicality. D.3 Regime Analysis: Why S&P 500? We classify the timeline into two regimes:
-
[58]
Stochastic / Benign Regimes (e.g., 2015-2019):Low volatility and positive drift.Compass- Hedgeidentifies the environment is "easy" and accelerates convergence to top performing stocks
work page 2015
-
[59]
Adversarial/CrisisRegimes(e.g.,2020CovidCrash):Synchronizedsell-offsandbreakdown of correlations. During the March 2020 crash,Compass-Hedgedetects no expert reliably beats the baseline and "resets" or keeps mixing weightα low to effectively adhere to the1/N safety strategy. D.4 Additional Discussion on Metrics Pseudo-Regret vs. Comparator Regret. • Pseudo...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.