Best-of-Both-Worlds for Heavy-Tailed Markov Decision Processes

Jiatai Huang; Longbo Huang; Yihan Du; Yu Chen; Yuhao Liu

arxiv: 2602.01295 · v3 · pith:5O6JWWL7new · submitted 2026-02-01 · 💻 cs.LG

Best-of-Both-Worlds for Heavy-Tailed Markov Decision Processes

Yu Chen , Yuhao Liu , Jiatai Huang , Yihan Du , Longbo Huang This is my paper

Pith reviewed 2026-05-21 14:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords heavy-tailed MDPsbest-of-both-worldsregret boundsFTRLepisodic MDPsadversarial stochasticoccupancy measuresskipping estimators

0 comments

The pith

Algorithms for heavy-tailed episodic MDPs achieve sublinear regret in adversarial regimes and logarithmic regret in stochastic regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two algorithms, HT-FTRL-OM and HT-FTRL-UOB, for episodic Markov Decision Processes whose losses have heavy tails. These algorithms apply the Follow-The-Regularized-Leader framework to occupancy measures together with skipping loss estimators that drop large observations to control heavy-tail effects. In adversarial environments the methods obtain regret of order T to the power one over alpha; in stochastic or other self-bounding environments they obtain only logarithmic regret. A reader would care because many practical sequential decision problems exhibit both unpredictable adversarial behavior and more regular stochastic patterns, yet prior methods were either too conservative for the stochastic case or non-adaptive to adversaries.

Core claim

HT-FTRL-OM and HT-FTRL-UOB achieve Õ(T^{1/α}) regret in adversarial regimes and O(log T) or O(log² T) regret in stochastic regimes for heavy-tailed episodic MDPs under the stated conditions. For known transitions the first algorithm uses skipping loss estimators inside the FTRL framework over occupancy measures. For unknown transitions the second algorithm adds a pessimistic skipping estimator that works under a mild truncative nonnegativity condition on the loss distributions, incurring an extra square-root term in the adversarial bound while preserving the logarithmic stochastic bound.

What carries the argument

Follow-The-Regularized-Leader over occupancy measures equipped with skipping loss estimators and, for unknown transitions, a pessimistic skipping loss estimator.

Load-bearing premise

The unknown-transition algorithm requires that loss distributions satisfy a truncative nonnegativity condition so the pessimistic skipping estimator can control heavy-tailed estimation error.

What would settle it

Construct loss distributions that violate the truncative nonnegativity condition, run HT-FTRL-UOB, and check whether the observed regret exceeds the claimed Õ(T^{1/α} + √T) bound.

read the original abstract

We investigate episodic Markov Decision Processes with heavy-tailed losses (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{{O}}(T^{1/\alpha})$ regret bound in adversarial regimes and a ${O}(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. Under a mild truncative nonnegativity condition on the loss distributions, this algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{{O}}(T^{1/\alpha} + \sqrt{T})$ regret in adversarial regimes and a ${O}(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives BoBW regret for heavy-tailed episodic MDPs using skipping estimators over occupancy measures, but the unknown-transition version rests on a truncative nonnegativity condition whose robustness is unclear.

read the letter

The core contribution is HT-FTRL-OM for known transitions and HT-FTRL-UOB for unknown ones. Both use FTRL on occupancy measures paired with skipping loss estimators to get Õ(T^{1/α}) regret under adversarial heavy-tailed losses and O(log T) or O(log² T) under stochastic regimes. The analysis introduces a local control mechanism for shifted losses, a suboptimal-mass propagation principle, and a decomposition that tries to isolate transition uncertainty from heavy-tail estimation error and skipping bias. Those pieces look new relative to prior BoBW and heavy-tailed MDP work, and they directly address the gap between conservative stochastic methods and non-adaptive adversarial ones in tabular episodic settings.

Referee Report

1 major / 2 minor

Summary. The paper proposes HT-FTRL-OM (known transitions) and HT-FTRL-UOB (unknown transitions) for episodic MDPs with heavy-tailed losses. Both achieve best-of-both-worlds regret: Õ(T^{1/α}) in adversarial regimes and O(log T) (known) or O(log² T) (unknown) in stochastic/self-bounding regimes. The algorithms use FTRL over occupancy measures with novel skipping loss estimators; the unknown-transition version adds a pessimistic skipping estimator under a mild truncative nonnegativity condition on losses, supported by a local control mechanism for shifted losses, suboptimal-mass propagation, and a regret decomposition isolating transition uncertainty from heavy-tailed errors and skipping bias.

Significance. If the central claims hold, the work is significant: it supplies the first explicit BoBW guarantees for heavy-tailed MDPs, improving on prior conservative stochastic methods and non-adaptive adversarial ones. The technical contributions—particularly the regret decomposition separating transition and heavy-tail effects, and the local control for shifted losses—are likely to be useful beyond this setting. The manuscript ships explicit regret expressions and identifies the precise assumption needed for the unknown-transition case.

major comments (1)

[§4] §4 (analysis of HT-FTRL-UOB): The bias-variance decomposition and isolation of the Õ(T^{1/α}) term from the extra √T term rest on the truncative nonnegativity condition to ensure the pessimistic skipping estimator controls heavy-tailed estimation error separately from transition uncertainty. The manuscript does not provide a quantitative robustness statement or counter-example showing what happens when the condition holds only approximately (e.g., small negative mass after truncation or interaction with occupancy support); without this, the stochastic-regime O(log² T) bound is not guaranteed to survive.

minor comments (2)

Notation for the truncation threshold and the precise definition of 'mild' in the truncative nonnegativity condition should be stated once in a dedicated assumption box rather than scattered across lemmas.
The abstract claims O(log T) for the known-transition stochastic case but the full statement in the introduction should explicitly list the dependence on the suboptimality gap or variance proxy.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of our work and for the detailed, constructive feedback. We address the major comment below.

read point-by-point responses

Referee: [§4] §4 (analysis of HT-FTRL-UOB): The bias-variance decomposition and isolation of the Õ(T^{1/α}) term from the extra √T term rest on the truncative nonnegativity condition to ensure the pessimistic skipping estimator controls heavy-tailed estimation error separately from transition uncertainty. The manuscript does not provide a quantitative robustness statement or counter-example showing what happens when the condition holds only approximately (e.g., small negative mass after truncation or interaction with occupancy support); without this, the stochastic-regime O(log² T) bound is not guaranteed to survive.

Authors: We agree that the truncative nonnegativity condition is essential for the bias-variance decomposition and for cleanly isolating heavy-tailed estimation error from transition uncertainty in the analysis of HT-FTRL-UOB. The current manuscript does not contain an explicit quantitative robustness statement or counter-example for approximate satisfaction of the condition. In the revised version we will add a dedicated paragraph (in §4 and the appendix) that (i) supplies a simple counter-example in which a small negative mass after truncation produces an uncontrolled additive bias that inflates the stochastic-regime regret beyond O(log² T), and (ii) derives a quantitative bound on the extra regret incurred when the violation is of size δ (showing an additive O(δ T^{1/α} + δ √T) term). We will also briefly discuss interaction with occupancy-measure support. These additions will clarify the scope of the O(log² T) guarantee without changing the main theorems under the stated condition. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives its BoBW regret bounds for HT-FTRL-OM and HT-FTRL-UOB from explicit technical contributions including a local control mechanism for heavy-tailed shifted losses, a suboptimal-mass propagation principle, and a novel regret decomposition isolating transition uncertainty from estimation errors. These steps are presented as independent analysis rather than reductions to input assumptions or self-citations. The truncative nonnegativity condition is stated as an assumption enabling the pessimistic estimator, not as a derived or fitted quantity renamed as a prediction. No load-bearing self-citation chains or self-definitional loops appear in the abstract or described framework; the bounds remain non-tautological outputs of the stated conditions and insights.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard MDP assumptions plus a new truncative nonnegativity condition whose independent evidence is not supplied in the abstract.

axioms (1)

domain assumption Truncative nonnegativity condition on loss distributions
Invoked for the unknown-transition algorithm to control heavy-tailed estimation error via pessimistic skipping estimator.

pith-pipeline@v0.9.0 · 5803 in / 1316 out tokens · 32591 ms · 2026-05-21T14:29:52.476346+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with a 1/α-Tsallis entropy regularizer and a novel skipping loss estimator
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.