pith. sign in

arxiv: 2603.22348 · v3 · pith:OGYU7U44new · submitted 2026-03-22 · 💻 cs.LG · cs.GT

Learning Safely Without Knowing the World:COMPASS-Hedge

Pith reviewed 2026-05-25 07:11 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords online learningregret minimizationadversarial environmentsstochastic environmentsbaseline safetyparameter-freefull-information feedback
0
0 comments X

The pith

COMPASS-Hedge achieves minimax-optimal regret in adversarial settings, gap-dependent regret in stochastic settings, and constant regret to a baseline, all parameter-free.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online learning algorithms struggle to balance regret performance when facing adversarial or stochastic environments while also providing safety relative to a fixed baseline policy. Most prior methods can optimize for at most two of these goals or require knowledge of the environment type and specific parameters like gap sizes. COMPASS-Hedge introduces a combination of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing to address all three simultaneously. This yields the first full-information anytime algorithm with all three guarantees up to logarithmic factors without any such knowledge. Readers would care because it removes the usual trade-offs in designing robust online learners.

Core claim

The algorithm COMPASS-Hedge is the first full-information anytime method to simultaneously achieve, up to logarithmic factors, minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. It is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps.

What carries the argument

The integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy.

If this is right

  • It provides the first best-of-three-worlds guarantee in the full-information setting.
  • Baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
  • The method works without oracle access to problem-dependent parameters.
  • It operates in an anytime fashion suitable for unknown time horizons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could potentially be adapted to bandit settings where only partial feedback is available.
  • Similar mixing strategies might help in other sequential decision problems with mixed environment types.
  • Testing on datasets with unknown mixtures of adversarial and stochastic behavior would validate practical utility.
  • Extensions might allow incorporating multiple baselines without losing the guarantees.

Load-bearing premise

The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing strategy delivers all three regret guarantees simultaneously without knowledge of the environment or gaps.

What would settle it

Observe the regret of COMPASS-Hedge in a purely stochastic environment with small known gaps and verify if it scales as the gap-dependent optimal rate rather than the minimax rate.

Figures

Figures reproduced from arXiv: 2603.22348 by Luanda Cai, Manolis Vlatakis, Ting Hu.

Figure 1
Figure 1. Figure 1: The Best-of-Three-Worlds landscape. Each circle represents one desideratum. Prior algorithms (annotations) cover at most two regions simultaneously. Compass-Hedge (center) is the first to unify all three in a single, parameter-free algorithm. The Best-of-Three-Worlds Challenge. (BOTW) Despite the maturity of best-of-two-worlds research, the aforementioned triadic tension remains unanswered with he existing… view at source ↗
Figure 2
Figure 2. Figure 2: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert (near-zero regret). (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance (risk). Compass-Hedge maintains a flat trajectory near zero comparator regret (d), empirically confirming the O(1) safety bound. Future work. A natural next step is to tig… view at source ↗
Figure 3
Figure 3. Figure 3: Median Regret (500 trials). Shaded regions denote 10th–90th percentiles. (a, c) With Oracle prior, Compass-Hedge (blue) matches the best expert. (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance. Compass-Hedge maintains a flat trajectory (d), confirming the O(1) safety bound. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗
read the original abstract

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. To the best of our knowledge, our algorithm is the first full-information anytime method to simultaneously achieve, up to logarithmic factors: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-world" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces COMPASS-Hedge, a full-information anytime algorithm for online learning that simultaneously achieves (up to log factors) minimax-optimal regret in adversarial environments, instance-optimal gap-dependent regret in stochastic environments, and Õ(1) regret relative to a designated baseline policy. The method is parameter-free, requires no knowledge of the environment type or gap sizes, and relies on a novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing.

Significance. If the three simultaneous guarantees can be established without hidden parameters or environment knowledge, the result would constitute a meaningful advance in online learning by resolving the trilemma of adversarial optimality, stochastic efficiency, and baseline safety in the full-information setting.

major comments (2)
  1. [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.
  2. [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and for identifying points that require clarification in the abstract. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (approach paragraph): the central claim requires the comparator-aware mixing strategy to simultaneously deliver minimax-optimal regret against the best expert and Õ(1) regret to an arbitrary fixed baseline in adversarial regimes. When the baseline incurs linear regret to the best expert, standard mixing would either inflate the minimax rate or violate the Õ(1) baseline bound; the manuscript must exhibit explicit regret expressions (e.g., the adversarial theorem) showing how the mixing coefficient is chosen to avoid both degradations without knowledge of the baseline quality.

    Authors: We agree that the abstract does not currently display the explicit form of the adversarial regret bound or the precise rule for selecting the mixing coefficient. The full manuscript contains the relevant theorem and proof, but the abstract will be revised to include a concise statement of the bound and the adaptive rule for the mixing weight. revision: yes

  2. Referee: [Abstract] The abstract asserts that the integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing yields all three guarantees at once. Because no derivation or theorem statement is supplied that reconciles the Õ(1) baseline term with the minimax term under adversarial losses, it is impossible to verify that the bounds remain non-vacuous when the designated baseline is suboptimal.

    Authors: We agree that the abstract alone does not supply the derivation reconciling the two terms. The revised version will add a short pointer in the abstract to the theorem that contains this derivation, together with a one-sentence description of how the phase-based mechanism keeps the baseline term additive rather than multiplicative. revision: yes

standing simulated objections not resolved
  • Reconciling a simultaneous Õ(1) regret bound to an arbitrary fixed baseline with a minimax-optimal (sub-linear) regret bound to the best expert in fully adversarial environments, when the baseline itself suffers linear regret relative to the best expert.

Circularity Check

0 steps flagged

No circularity identified; claims rest on uninspectable integration without self-referential reductions

full rationale

The provided abstract and description contain no equations, derivations, or self-citations that can be quoted to exhibit any reduction of a claimed guarantee to a fitted input, self-definition, or prior author result by construction. The novel integration of adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing is asserted to deliver the three guarantees simultaneously, but without mathematical steps or load-bearing citations shown, no circular step meets the criteria for flagging (requires explicit quote and exhibited reduction). This is the normal honest non-finding for an abstract-only view; the derivation chain cannot be walked for equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified existence of the described integration of scaling, phases, and mixing.

pith-pipeline@v0.9.0 · 5746 in / 1226 out tokens · 40078 ms · 2026-05-25T07:11:58.250470+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

  1. [1]

    Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

    Daron Andersonand Douglas J Leith. Expected regret andpseudo-regretare equivalentwhen the optimal arm is unique.Journal of Machine Learning Research, 23(293):1–12, 2022

  2. [2]

    Regime changes and financial markets.Annu

    Andrew Ang and Allan Timmermann. Regime changes and financial markets.Annu. Rev. Financ. Econ., 4(1):313–337, 2012

  3. [3]

    Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

    Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret.arXiv preprint arXiv:1206.6400, 2012

  4. [4]

    Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

    Angelos Assos, Yuval Dagan, and Constantinos Daskalakis. Maximizing utility in multi-agent environments by anticipating the behavior of other learners.Advances in Neural Information Processing Systems, 37:38769–38798, 2024

  5. [5]

    Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

    Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

  6. [6]

    Gambling in a rigged casino: The adversarial multi-armed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

  7. [7]

    Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

  8. [8]

    Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

    Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms.Journal of Computer and System Sciences, 64(1):48–75, 2002

  9. [9]

    Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

    Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks.Journal of the ACM (JACM), 65(3):1–55, 2018

  10. [10]

    What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

    Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

  11. [11]

    cambridge university press, 2005

    Allan Borodin and Ran El-Yaniv.Online computation and competitive analysis. cambridge university press, 2005

  12. [12]

    Sellingtoano-regretbuyer

    MarkBraverman,JiemingMao,JonSchneider,andMattWeinberg. Sellingtoano-regretbuyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

  13. [13]

    The best of both worlds: Stochastic and adversarial bandits

    Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. InConference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012

  14. [14]

    Cambridge university press, 2006

    Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

  15. [15]

    Universal portfolios.Mathematical finance, 1(1):1–29, 1991

    Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991

  16. [16]

    Black-box reductions for parameter-free online learning in banach spaces

    Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018. 16

  17. [17]

    Safely using predictions in general-sum normal form games

    Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. In Proceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

  18. [18]

    Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

    Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must.The Journal of Machine Learning Research, 15(1):1281–1316, 2014

  19. [19]

    Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

    Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy?The review of Financial studies, 22(5):1915–1953, 2009

  20. [20]

    Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

    Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Prior-free dynamic auctions with low regret buyers.Advances in Neural Information Processing Systems, 32, 2019

  21. [21]

    Strategizingagainstno-regretlearners

    YuanDeng,JonSchneider,andBalasubramanianSivan. Strategizingagainstno-regretlearners. Advances in neural information processing systems, 32, 2019

  22. [22]

    Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

    Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

  23. [23]

    Regret to the best vs

    Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

  24. [24]

    A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

    Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

  25. [25]

    Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

    Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

  26. [26]

    Conser- vative exploration in reinforcement learning

    Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conser- vative exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

  27. [27]

    Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement

    Aaryan Gupta, Mayank Puri, Mayank Keshan, and Varun Tiwari. Ai in financial decision- making: Revolutionizinginvestmentstrategiesandriskmanagement. InThispaperwaspresented at Global Forum for Financial Consumers (GFFC), 2024

  28. [28]

    Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

    Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries.Journal of the ACM, 71(3):1–34, 2024

  29. [29]

    EladHazan.Introductiontoonlineconvexoptimization.FoundationsandTrends®inOptimization, 2(3-4):157–325, 2016

  30. [30]

    Adaptive online prediction by following the perturbed leader

    Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005

  31. [31]

    Onlinelearningunderdelayedfeedback

    PooriaJoulani,AndrasGyorgy,andCsabaSzepesvári. Onlinelearningunderdelayedfeedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

  32. [32]

    Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

    Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011. 17

  33. [33]

    Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

    Yoav Kolumbus, Jon Schneider, Inbal Talgam-Cohen, Emmanouil-Vasileios Vlatakis- Gkaragkounis,JoshuaRWang,andSMWeinberg. Contractingwithalearningagent.Advances in Neural Information Processing Systems, 37:77366–77408, 2024

  34. [34]

    The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

    Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

  35. [35]

    Second-order quantile methods for experts and combinatorial games

    Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. InConference on Learning Theory, pages 1155–1175. PMLR, 2015

  36. [36]

    The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

    Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

  37. [37]

    Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

    Mengmeng Li, Philipp Schneider, Jelisaveta Aleksić, and Daniel Kuhn. Efficient best-of-both- worlds algorithms for contextual combinatorial semi-bandits.arXiv preprint arXiv:2508.18768, 2025

  38. [38]

    The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

    Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108(2):212–261, 1994

  39. [39]

    Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

    MingyangLiu,ChengjieWu,QihanLiu,YansenJing,JunYang,PingzhongTang,andChongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

  40. [40]

    Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

    Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies withzeroorboundedconstraintviolationforconstrainedmdps.AdvancesinNeuralInformation Processing Systems, 34:17183–17193, 2021

  41. [41]

    Strategizing against learners in bayesian games

    Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games. InConference on Learning Theory, pages 5221–5252. PMLR, 2022

  42. [42]

    On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

    Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic regime.Journal of Machine Learning Research, 20(83):1–28, 2019

  43. [43]

    Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

    Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and Volkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

  44. [44]

    A Modern Introduction to Online Learning

    Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

  45. [45]

    Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

    Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

  46. [46]

    Exploiting easy data in online optimization

    Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. InAdvances in Neural Information Processing Systems (NeurIPS), volume 27, 2014

  47. [47]

    Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

    Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2012

  48. [48]

    Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016

    Wanrong Tang, Suzhi Bi, and Ying Jun Zhang. Online charging scheduling algorithms of electric vehicles in smart grid: An overview.IEEE communications Magazine, 54(12):76–83, 2016. 18

  49. [49]

    Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

    Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits.Advances in Neural Information Processing Systems, 33:19795–19804, 2020

  50. [50]

    Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

    Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal designofclinicaltrials: benefitsandchallenges.Statisticalscience: areviewjournaloftheInstitute of Mathematical Statistics, 30(2):199, 2015

  51. [51]

    Conservative bandits

    Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

  52. [52]

    Best-of-Both-Worlds

    Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits.Journal of Machine Learning Research, 22(28):1–49, 2021. 19 A Extended Related Work and Motivating Scenarios The literature on online learning is vast, spanning several decades of research in information theory, statistics, and computer science. We ...

  53. [53]

    Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T)

    Adversarial Setting:Assumption A1 ensures the gap isO( √ T). Using Proposition 1 of Mourtada and Gaïffas [42]: max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ p TlogA and consequently we have that E[Rhedge]≤max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ +O( √ T)≤ p TlogA+O( √ T). Substituting back:R k(µ)≤4 p TlogA+O( √ T)

  54. [54]

    safety valve

    Stochastic Setting:Under Assumptions S1–S2, the pseudo-expected gap iso(1). Using the stochastic Hedge bound (Theorem 2 of Mourtada and Gaïffas [42]): max µ∈∆ A E ∑⟨ct, ˆµt −µ⟩ ≤ 4 logA+25 ∆ and consequently we have that E[Rhedge]≤ 4 logA+25 ∆ +o(1). Substituting back:R k(µ)≤ 16 logA+100 ∆ +O(1). Comparator Regret (µ=µ c).Finally, we evaluate the regret a...

  55. [55]

    We randomly sample a subset ofK=30stocks from the universe

  56. [56]

    We instantiate the algorithms: Standard Hedge (Benchmark),Compass-Hedge(Oracle), and Compass-Hedge(Uniform)

  57. [57]

    35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s

    The algorithms process the stream sequentially. 35 Hyperparameters.ForCompass-Hedge,weuselearningrate ηs = q logK 2s forstage s. Thedoubling parameter is set to2. No hyperparameter tuning was performed on test data to demonstrate the theory’s practicality. D.3 Regime Analysis: Why S&P 500? We classify the timeline into two regimes:

  58. [58]

    Stochastic / Benign Regimes (e.g., 2015-2019):Low volatility and positive drift.Compass- Hedgeidentifies the environment is "easy" and accelerates convergence to top performing stocks

  59. [59]

    solved". • Comparator Regret( R(µc)): Measures safety against the baseline. The flat trajectory in Uniform experiments (Fig. 3d) signifies the empirical safety theorem. The

    Adversarial/CrisisRegimes(e.g.,2020CovidCrash):Synchronizedsell-offsandbreakdown of correlations. During the March 2020 crash,Compass-Hedgedetects no expert reliably beats the baseline and "resets" or keeps mixing weightα low to effectively adhere to the1/N safety strategy. D.4 Additional Discussion on Metrics Pseudo-Regret vs. Comparator Regret. • Pseudo...