Robust Mean Estimation for Optimization: The Impact of Heavy Tails

Bart P.G. van Parys; Bert Zwart

arxiv: 2503.21421 · v2 · submitted 2025-03-27 · 🧮 math.OC · math.PR· math.ST· stat.TH

Robust Mean Estimation for Optimization: The Impact of Heavy Tails

Bart P.G. van Parys , Bert Zwart This is my paper

Pith reviewed 2026-05-22 22:42 UTC · model grok-4.3

classification 🧮 math.OC math.PRmath.STstat.TH

keywords mean estimationheavy-tailed distributionsdistributionally robust optimizationKullback-Leibler divergencenon-negative random variablesstatistical estimationoverestimation probability

0 comments

The pith

Estimating the mean of a non-negative heavy-tailed random variable is optimally done by solving a KL-divergence distributionally robust optimization problem that bounds the probability of overestimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper focuses on constructing an estimator for the expected value of a non-negative heavy-tailed random variable such that the chance of overestimating this value remains controlled. This control is important because the estimate is intended for use in subsequent decision processes where overestimation could cause issues. The authors establish that the estimator obtained from a distributionally robust optimization formulation with Kullback-Leibler divergence is optimal under this criterion. They further compare its statistical behavior to alternatives like truncation, variance regularization, and Wasserstein-based robust optimization, finding favorable properties for the KL approach. This provides a method grounded in the specified risk control for heavy-tailed settings.

Core claim

We consider the problem of constructing a least conservative estimator of the expected value μ of a non-negative heavy-tailed random variable. We require that the probability of overestimating the expected value μ is kept appropriately small; a natural requirement if its subsequent use in a decision process is anticipated. In this setting, we show it is optimal to estimate μ by solving a distributionally robust optimization (DRO) problem using the Kullback-Leibler (KL) divergence. We further show that the statistical properties of KL-DRO compare favorably with other estimators based on truncation, variance regularization, or Wasserstein DRO.

What carries the argument

The Kullback-Leibler divergence distributionally robust optimization (KL-DRO) problem, which produces the estimator satisfying the overestimation probability bound while minimizing conservatism.

If this is right

KL-DRO yields estimators with better statistical properties than truncation methods when the overestimation probability must stay below a threshold.
Variance regularization and Wasserstein DRO produce more conservative estimates than KL-DRO under the same overestimation control.
The approach applies directly to mean estimation tasks that precede optimization or decision-making under heavy-tailed uncertainty.
The optimality holds specifically for non-negative random variables with the given tail behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The KL-DRO formulation could be extended to other divergence choices if the overestimation criterion is preserved.
In sequential decision settings, repeated use of this estimator might compound the risk control across multiple periods.
Testing the estimator on empirical heavy-tailed data from finance or insurance would check whether the theoretical advantage appears in practice.

Load-bearing premise

The definition of an optimal estimator requires keeping the probability of overestimating the mean appropriately small for non-negative heavy-tailed variables.

What would settle it

A concrete counter-example or simulation where a truncation-based estimator achieves strictly lower overestimation probability than the KL-DRO estimator for the same sample size and tail behavior would falsify the optimality result.

read the original abstract

We consider the problem of constructing a least conservative estimator of the expected value $\mu$ of a non-negative heavy-tailed random variable. We require that the probability of overestimating the expected value $\mu$ is kept appropriately small; a natural requirement if its subsequent use in a decision process is anticipated. In this setting, we show it is optimal to estimate $\mu$ by solving a distributionally robust optimization (DRO) problem using the Kullback-Leibler (KL) divergence. We further show that the statistical properties of KL-DRO compare favorably with other estimators based on truncation, variance regularization, or Wasserstein DRO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KL-DRO is optimal for keeping overestimation probability low on non-negative heavy-tailed means, with direct comparisons to truncation and Wasserstein alternatives.

read the letter

The key point is that estimating the mean of a non-negative heavy-tailed variable by solving a KL-divergence DRO problem is optimal when the goal is to keep the probability of overestimation small. The paper also shows this estimator has better statistical properties than truncation, variance regularization, or Wasserstein DRO under the same criterion. That optimality link is the main new piece, and the side-by-side comparisons help make the case concrete rather than abstract. The non-negativity assumption fits the heavy-tail setting and avoids some usual complications with signed variables. If the derivations in the full paper hold up, this gives a clean way to turn a practical requirement into a DRO formulation without extra tuning parameters. The central argument does not appear circular; it rests on the external overestimation-probability criterion rather than fitting to the data in a self-referential way. Minor soft spots include whether the optimality extends beyond the one-sided criterion or requires stronger tail conditions than stated, and whether the paper gives explicit rates or finite-sample bounds that practitioners can use. The comparisons would benefit from more detail on how the alternatives are tuned to match the same overestimation target. This paper is aimed at people working on robust optimization and risk-aware decision making with heavy tails. A reader who needs a theoretically grounded conservative estimator will get direct value from the optimality result and the comparisons. It deserves peer review because the claim is specific, the criterion is well-motivated, and the comparisons are reproducible in principle.

Referee Report

0 major / 1 minor

Summary. The paper considers estimating the mean μ of a non-negative heavy-tailed random variable while keeping the probability of overestimation appropriately small. It claims that the optimal such estimator is obtained by solving a distributionally robust optimization problem with the Kullback-Leibler divergence, and that the resulting KL-DRO estimator has favorable statistical properties relative to estimators based on truncation, variance regularization, or Wasserstein DRO.

Significance. If the optimality derivation and statistical comparisons hold, the work supplies a principled, criterion-driven justification for selecting a particular robust estimator in optimization settings with heavy tails. This could reduce unnecessary conservatism while controlling a one-sided risk that is relevant for downstream decision processes.

minor comments (1)

The abstract states the optimality result and the favorable comparisons but does not indicate where the supporting derivations appear; the manuscript should explicitly cross-reference the relevant theorem or proposition numbers for the optimality claim and for each comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript, which accurately captures our main claims regarding the optimality of KL-DRO for conservative mean estimation under heavy tails and its comparison to other approaches. We appreciate the recognition of the potential significance for decision processes. Since the report lists no specific major comments, we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper defines optimality externally as keeping the one-sided probability of overestimating μ appropriately small for non-negative heavy-tailed random variables. It then derives that a KL-divergence DRO estimator satisfies this criterion and compares its properties to truncation, variance regularization, and Wasserstein DRO alternatives. No step reduces a claimed prediction or optimality result to a fitted parameter, self-citation chain, or definitional tautology; the optimality criterion is stated independently of the estimator form and the comparisons rest on statistical properties rather than internal re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5633 in / 1050 out tokens · 47824 ms · 2026-05-22T22:42:18.629356+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

regularly varying of index ρ > 1... P[ζ > u] = L(u) u^{-ρ}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.