Best-of-Both-Worlds for Heavy-Tailed Markov Decision Processes
Pith reviewed 2026-05-21 14:29 UTC · model grok-4.3
The pith
Algorithms for heavy-tailed episodic MDPs achieve sublinear regret in adversarial regimes and logarithmic regret in stochastic regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HT-FTRL-OM and HT-FTRL-UOB achieve Õ(T^{1/α}) regret in adversarial regimes and O(log T) or O(log² T) regret in stochastic regimes for heavy-tailed episodic MDPs under the stated conditions. For known transitions the first algorithm uses skipping loss estimators inside the FTRL framework over occupancy measures. For unknown transitions the second algorithm adds a pessimistic skipping estimator that works under a mild truncative nonnegativity condition on the loss distributions, incurring an extra square-root term in the adversarial bound while preserving the logarithmic stochastic bound.
What carries the argument
Follow-The-Regularized-Leader over occupancy measures equipped with skipping loss estimators and, for unknown transitions, a pessimistic skipping loss estimator.
Load-bearing premise
The unknown-transition algorithm requires that loss distributions satisfy a truncative nonnegativity condition so the pessimistic skipping estimator can control heavy-tailed estimation error.
What would settle it
Construct loss distributions that violate the truncative nonnegativity condition, run HT-FTRL-UOB, and check whether the observed regret exceeds the claimed Õ(T^{1/α} + √T) bound.
read the original abstract
We investigate episodic Markov Decision Processes with heavy-tailed losses (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{{O}}(T^{1/\alpha})$ regret bound in adversarial regimes and a ${O}(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. Under a mild truncative nonnegativity condition on the loss distributions, this algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{{O}}(T^{1/\alpha} + \sqrt{T})$ regret in adversarial regimes and a ${O}(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HT-FTRL-OM (known transitions) and HT-FTRL-UOB (unknown transitions) for episodic MDPs with heavy-tailed losses. Both achieve best-of-both-worlds regret: Õ(T^{1/α}) in adversarial regimes and O(log T) (known) or O(log² T) (unknown) in stochastic/self-bounding regimes. The algorithms use FTRL over occupancy measures with novel skipping loss estimators; the unknown-transition version adds a pessimistic skipping estimator under a mild truncative nonnegativity condition on losses, supported by a local control mechanism for shifted losses, suboptimal-mass propagation, and a regret decomposition isolating transition uncertainty from heavy-tailed errors and skipping bias.
Significance. If the central claims hold, the work is significant: it supplies the first explicit BoBW guarantees for heavy-tailed MDPs, improving on prior conservative stochastic methods and non-adaptive adversarial ones. The technical contributions—particularly the regret decomposition separating transition and heavy-tail effects, and the local control for shifted losses—are likely to be useful beyond this setting. The manuscript ships explicit regret expressions and identifies the precise assumption needed for the unknown-transition case.
major comments (1)
- [§4] §4 (analysis of HT-FTRL-UOB): The bias-variance decomposition and isolation of the Õ(T^{1/α}) term from the extra √T term rest on the truncative nonnegativity condition to ensure the pessimistic skipping estimator controls heavy-tailed estimation error separately from transition uncertainty. The manuscript does not provide a quantitative robustness statement or counter-example showing what happens when the condition holds only approximately (e.g., small negative mass after truncation or interaction with occupancy support); without this, the stochastic-regime O(log² T) bound is not guaranteed to survive.
minor comments (2)
- Notation for the truncation threshold and the precise definition of 'mild' in the truncative nonnegativity condition should be stated once in a dedicated assumption box rather than scattered across lemmas.
- The abstract claims O(log T) for the known-transition stochastic case but the full statement in the introduction should explicitly list the dependence on the suboptimality gap or variance proxy.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and for the detailed, constructive feedback. We address the major comment below.
read point-by-point responses
-
Referee: [§4] §4 (analysis of HT-FTRL-UOB): The bias-variance decomposition and isolation of the Õ(T^{1/α}) term from the extra √T term rest on the truncative nonnegativity condition to ensure the pessimistic skipping estimator controls heavy-tailed estimation error separately from transition uncertainty. The manuscript does not provide a quantitative robustness statement or counter-example showing what happens when the condition holds only approximately (e.g., small negative mass after truncation or interaction with occupancy support); without this, the stochastic-regime O(log² T) bound is not guaranteed to survive.
Authors: We agree that the truncative nonnegativity condition is essential for the bias-variance decomposition and for cleanly isolating heavy-tailed estimation error from transition uncertainty in the analysis of HT-FTRL-UOB. The current manuscript does not contain an explicit quantitative robustness statement or counter-example for approximate satisfaction of the condition. In the revised version we will add a dedicated paragraph (in §4 and the appendix) that (i) supplies a simple counter-example in which a small negative mass after truncation produces an uncontrolled additive bias that inflates the stochastic-regime regret beyond O(log² T), and (ii) derives a quantitative bound on the extra regret incurred when the violation is of size δ (showing an additive O(δ T^{1/α} + δ √T) term). We will also briefly discuss interaction with occupancy-measure support. These additions will clarify the scope of the O(log² T) guarantee without changing the main theorems under the stated condition. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper derives its BoBW regret bounds for HT-FTRL-OM and HT-FTRL-UOB from explicit technical contributions including a local control mechanism for heavy-tailed shifted losses, a suboptimal-mass propagation principle, and a novel regret decomposition isolating transition uncertainty from estimation errors. These steps are presented as independent analysis rather than reductions to input assumptions or self-citations. The truncative nonnegativity condition is stated as an assumption enabling the pessimistic estimator, not as a derived or fitted quantity renamed as a prediction. No load-bearing self-citation chains or self-definitional loops appear in the abstract or described framework; the bounds remain non-tautological outputs of the stated conditions and insights.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Truncative nonnegativity condition on loss distributions
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with a 1/α-Tsallis entropy regularizer and a novel skipping loss estimator
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.