An Axiomatic Analysis of Distributionally Robust Optimization with $q$-Norm Ambiguity Sets for Probability Smoothing

Daiki Uchida; Hokuto Nagano; Kota Kurihara; Yoichi Izunaga

arxiv: 2511.18815 · v6 · submitted 2025-11-24 · 🧮 math.OC

Pith reviewed 2026-05-17 06:03 UTC · model grok-4.3

classification 🧮 math.OC

keywords: distributionally robust optimization · q-norm ambiguity sets · probability smoothing · axiomatic properties · positivity · symmetry · order preservation · regularized empirical loss

The pith

q-DRO probability estimators satisfy positivity and symmetry for every q in [1, ∞], plus order preservation when 1 < q < ∞, and coincide with regularized empirical loss minimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies probability estimators obtained by solving a distributionally robust optimization problem whose ambiguity set is a q-norm ball around the empirical distribution. It establishes that these estimators obey positivity and symmetry for all q at least 1, and obey order preservation when q lies strictly between 1 and infinity. The same analysis shows that the DRO problem is mathematically identical to minimizing the empirical loss plus a regularization term that depends on q. Readers may care because the zero-frequency problem in discrete data requires estimators that avoid assigning zero probability while respecting intuitive ordering of observed frequencies.

Core claim

For any q in the closed interval from 1 to infinity the q-DRO estimator satisfies positivity and symmetry; when q belongs to the open interval from 1 to infinity it additionally satisfies order preservation. The optimality conditions further establish that the q-DRO formulation is exactly equivalent to regularized empirical loss minimization.

What carries the argument

The q-norm ambiguity set, a ball of chosen radius centered at the empirical distribution measured in the q-norm, whose worst-case expectation defines the smoothed probability estimator.
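For orientation, the formulation can be written out as below. This is a reconstruction, not a quotation: the log-loss objective follows the inner problem (3a) excerpted in the reference graph, and the dual-norm term matches the expression in the paper's Assumption 1. The radius symbol $\varepsilon$ and the multiplier names $\beta$, $\boldsymbol{\lambda}$ for the sum-to-one and nonnegativity constraints are notational assumptions. The last line is a standard Lagrangian-duality step for $\varepsilon > 0$ and $\boldsymbol{x} > \boldsymbol{0}$.

```latex
\begin{align*}
\mathcal{P}_q(\hat{\boldsymbol{p}}, \varepsilon)
  &= \{\boldsymbol{p} \in \Delta_n : \|\boldsymbol{p} - \hat{\boldsymbol{p}}\|_q \le \varepsilon\}, \\
\boldsymbol{x}^\star
  &\in \arg\min_{\boldsymbol{x} \in \Delta_n} \;
     \max_{\boldsymbol{p} \in \mathcal{P}_q(\hat{\boldsymbol{p}}, \varepsilon)}
     \sum_{j=1}^{n} p_j \, (-\log x_j), \\
\max_{\boldsymbol{p} \in \mathcal{P}_q(\hat{\boldsymbol{p}}, \varepsilon)}
     \sum_{j=1}^{n} p_j \, (-\log x_j)
  &= \min_{\beta \in \mathbb{R},\; \boldsymbol{\lambda} \ge \boldsymbol{0}}
     \sum_{j=1}^{n} \hat{p}_j \, (-\log x_j + \lambda_j)
     + \varepsilon \, \bigl\| -\log(\boldsymbol{x}) - \beta \mathbf{1} + \boldsymbol{\lambda} \bigr\|_{q^*},
  \qquad \tfrac{1}{q} + \tfrac{1}{q^*} = 1.
\end{align*}
```

With $\boldsymbol{\lambda} = \boldsymbol{0}$, the first term on the right is the empirical log-loss and the second is a dual-norm penalty scaled by the radius, which is the sense in which the DRO problem reads as regularized empirical loss minimization.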

If this is right

Every outcome receives strictly positive probability.
The estimator is unchanged under any relabeling of the outcomes.
When 1 < q < ∞, a strictly higher empirical frequency implies a strictly higher estimated probability.
The DRO problem can be replaced by an ordinary regularized empirical minimization problem without changing the solution.
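The first three properties above can be read as executable checks on any estimator that maps an empirical distribution to a probability vector. The sketch below is one such reading, and the paper's formal definitions may differ in detail. Laplace (additive) smoothing with a hypothetical sample size of 20 is used only as a stand-in estimator, since computing a q-DRO estimate needs a solver. All function names are illustrative.

```python
from itertools import permutations


def laplace_smoothing(p_hat, sample_size=20, alpha=1.0):
    """Additive smoothing; a stand-in estimator, not the q-DRO estimator."""
    n = len(p_hat)
    return [(sample_size * p + alpha) / (sample_size + alpha * n) for p in p_hat]


def satisfies_positivity(estimator, p_hat):
    """Every outcome, including unseen ones, gets strictly positive mass."""
    return all(x > 0.0 for x in estimator(p_hat))


def satisfies_symmetry(estimator, p_hat, tol=1e-12):
    """Relabeling the outcomes relabels the estimate in the same way."""
    base = estimator(p_hat)
    for perm in permutations(range(len(p_hat))):
        permuted_input = [p_hat[i] for i in perm]
        expected = [base[i] for i in perm]
        actual = estimator(permuted_input)
        if any(abs(a - b) > tol for a, b in zip(actual, expected)):
            return False
    return True


def satisfies_order_preservation(estimator, p_hat):
    """p_hat[i] > p_hat[j] must imply x[i] > x[j] (strict version)."""
    x = estimator(p_hat)
    n = len(p_hat)
    return all(
        x[i] > x[j] for i in range(n) for j in range(n) if p_hat[i] > p_hat[j]
    )


p_hat = [0.0, 0.2, 0.3, 0.5]  # the first outcome was never observed
checks = {
    "positivity": satisfies_positivity(laplace_smoothing, p_hat),
    "symmetry": satisfies_symmetry(laplace_smoothing, p_hat),
    "order_preservation": satisfies_order_preservation(laplace_smoothing, p_hat),
}
print(checks)
```

The same three functions can be pointed at any other estimator with the same signature. The raw empirical distribution, for instance, fails the positivity check on this input.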

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standard convex solvers for regularized empirical risk minimization can be used directly to compute the q-DRO probabilities.
The axioms constrain the estimator itself and hold for any empirical distribution, but statistical performance may still degrade if the true distribution lies far outside the chosen q-norm ball.
Varying q continuously could trace a family of estimators that interpolate between different smoothing behaviors.
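As an illustration of the first point, the sketch below computes a 2-DRO estimate by plain gradient descent on a regularized log-loss, using only the standard library. It is not the paper's implementation (the paper reports using MOSEK) and rests on assumptions of this sketch: q = 2, the nonnegativity constraints on the worst-case distribution are dropped so that the inner maximum has a closed form, and the radius is small enough that the estimate is not uniform. The function names, step size, and iteration count are illustrative.

```python
import math


def softmax(z):
    """Map unconstrained scores to a strictly positive probability vector."""
    m = max(z)
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    return [v / s for v in w]


def two_dro_estimate(p_hat, radius, step=0.5, iters=5000):
    """Minimize  -sum_j p_hat[j]*log(x_j) + radius*||log x - mean(log x)||_2.

    With x = softmax(z) the objective is convex in z, and the centered
    log-probabilities equal the centered scores z - mean(z).
    """
    n = len(p_hat)
    z = [math.log(p + 1.0 / n) for p in p_hat]  # finite even if p_hat[j] == 0
    for _ in range(iters):
        x = softmax(z)
        mean_z = sum(z) / n
        centered = [v - mean_z for v in z]
        norm = math.sqrt(sum(c * c for c in centered))
        # At norm == 0 the penalty is not differentiable; 0 is a valid subgradient.
        reg = [radius * c / norm for c in centered] if norm > 1e-12 else [0.0] * n
        # Gradient of the objective in z: softmax(z) - p_hat + penalty gradient.
        z = [v - step * (xj - pj + r) for v, xj, pj, r in zip(z, x, p_hat, reg)]
    return softmax(z)


p_hat = [0.10, 0.20, 0.30, 0.40]
x = two_dro_estimate(p_hat, radius=0.1)
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p_hat)))
print([round(v, 4) for v in x], round(shift, 4))
```

At a stationary point the gradient condition says that x equals p_hat minus `radius` times a unit vector, so the estimate sits at 2-norm distance `radius` from the empirical distribution. The printed `shift` is a quick convergence check. For radii large enough to push the estimate to uniform, a fixed-step method like this one would chatter around the kink of the penalty, and a proper convex solver is the safer choice.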

Load-bearing premise

The ambiguity sets are exactly q-norm balls around the empirical distribution and the resulting optimization problem is solved exactly.

What would settle it

A concrete counter-example in which, for some q strictly between 1 and infinity, the solved estimator assigns a strictly lower probability to an outcome with strictly higher empirical frequency.

Figures

Figures reproduced from arXiv: 2511.18815 by Daiki Uchida, Hokuto Nagano, Kota Kurihara, Yoichi Izunaga.

**Figure 2.** Sensitivity of the 2-DRO estimator to the robustness radius.
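One reference point may help when reading a radius sweep like this. It is a side observation, not a result stated on this page, and it assumes the log-loss objective from the paper's excerpt (3a). If the q-norm ball contains the uniform distribution, then by Gibbs' inequality the uniform vector is the minimax estimate, so strict ordering can no longer be expected beyond that radius. The sketch computes this threshold for the empirical distribution quoted in the sensitivity experiment; `collapse_radius` is a hypothetical helper name.

```python
import math


def q_norm(v, q):
    """q-norm of a vector; q may be math.inf."""
    if math.isinf(q):
        return max(abs(a) for a in v)
    return sum(abs(a) ** q for a in v) ** (1.0 / q)


def collapse_radius(p_hat, q):
    """Smallest radius whose q-norm ball around p_hat contains the uniform distribution."""
    n = len(p_hat)
    return q_norm([p - 1.0 / n for p in p_hat], q)


p_hat = [0.10, 0.20, 0.30, 0.40]
for q in (1, 2, math.inf):
    print(q, round(collapse_radius(p_hat, q), 4))
```

For this input the thresholds are about 0.4 for q = 1, 0.2236 for q = 2, and 0.15 for q = ∞.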

Original abstract

We analyze the axiomatic properties of a class of probability estimators derived from Distributionally Robust Optimization (DRO) with $q$-norm ambiguity sets ($q$-DRO), a principled approach to the zero-frequency problem. While classical estimators such as Laplace smoothing are characterized by strong linearity axioms like Ratio Preservation, we show that $q$-DRO provides a flexible alternative that satisfies other desirable properties. We first prove that for any $q \in [1, \infty]$, the $q$-DRO estimator satisfies the fundamental axioms of Positivity and Symmetry. For the case of $q \in (1, \infty)$, we then prove that it also satisfies Order Preservation. Our analysis of the optimality conditions also reveals that the $q$-DRO formulation is equivalent to the regularized empirical loss minimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes the axiomatic properties of probability estimators derived from Distributionally Robust Optimization using q-norm ambiguity sets (q-DRO) as a principled approach to the zero-frequency problem. It proves that the q-DRO estimator satisfies Positivity and Symmetry for every q in [1, ∞] and additionally satisfies Order Preservation when q is in (1, ∞). Analysis of the optimality conditions further establishes that the q-DRO formulation is equivalent to regularized empirical loss minimization.

Significance. If the derivations hold, the work supplies a tunable, axiomatically grounded alternative to classical linear smoothers such as Laplace smoothing. The explicit equivalence between q-DRO and regularized loss minimization is a useful bridge between robust optimization and standard regularized estimation, potentially simplifying both theoretical analysis and numerical implementation. The paper contributes a clean axiomatic treatment within the DRO literature.

minor comments (3)

The introduction should explicitly list the three axioms (Positivity, Symmetry, Order Preservation) with precise mathematical statements and pointers to the relevant literature on axiomatic probability smoothing.
In the equivalence result, the dependence of the effective regularization parameter on both q and the radius of the ambiguity set should be stated explicitly (e.g., as a displayed formula) so that readers can immediately see how the two formulations correspond.
A short remark on whether the axiomatic guarantees continue to hold when the q-DRO problem is solved only approximately (e.g., via first-order methods) would strengthen the bridge to practical use.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the recognition of its contributions to the axiomatic characterization of q-DRO estimators, and the recommendation for minor revision. The work establishes that q-DRO satisfies Positivity and Symmetry for all q in [1, ∞] and Order Preservation for q in (1, ∞), while also showing equivalence to regularized empirical loss minimization. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; proofs are self-contained

full rationale

The paper derives its central claims through explicit mathematical proofs: it shows that the q-DRO estimator satisfies Positivity and Symmetry for any q ∈ [1, ∞] and Order Preservation for q ∈ (1, ∞), plus equivalence to regularized empirical loss minimization by analyzing optimality conditions on the q-norm ambiguity sets. These steps rest on the independent definitions of the axioms and the explicit construction of the DRO problem around the empirical distribution, using standard optimization theory rather than any self-referential fitting, post-hoc parameter choice, or load-bearing self-citation. No step reduces by construction to its own inputs, and the analysis is self-contained against external benchmarks such as the stated axioms and convex duality.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the standard definition of q-norm ambiguity sets in DRO and the chosen axioms (Positivity, Symmetry, Order Preservation). No free parameters are fitted inside the proofs themselves; q is treated as a tunable hyperparameter. No new entities are postulated.

free parameters (1)

q
The norm order q is a user-chosen parameter that defines the ambiguity set; different q values yield different estimators but the axiomatic proofs hold for ranges of q.

axioms (2)

domain assumption: The ambiguity set is a q-norm ball centered at the empirical distribution.
Invoked throughout the DRO formulation and optimality analysis.
domain assumption: The estimator is obtained by solving the DRO problem exactly.
Required for the equivalence to regularized empirical loss to hold.

pith-pipeline@v0.9.0 · 5446 in / 1445 out tokens · 28703 ms · 2026-05-17T06:03:10.168103+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1] Introduction. The estimation of probabilities from finite data is a fundamental task of machine learning, statistics, and information theory. A common and persistent challenge in this task is the zero-frequency problem: if an event is not observed in a finite sample, its probability is naively estimated as zero, leading to poor generalization and model f...
[2] "it refers to any non-empty subset of $\Delta_n$, without any additional assumptions imposed". Preliminaries: From Axiomatic Smoothing to a Distributionally Robust Formulation. 2.1. The Axiomatic Approach to Probability Smoothing. Let $N = \{1, 2, \ldots, n\}$ be a set of categories. A probability distribution is a vector $\boldsymbol{p}$ in the probability simplex $\Delta_n = \{\boldsymbol{p} \in \mathbb{R}^n \mid \sum_{j=1}^{n} p_j =$...
[3] Convex Reformulation of $q$-DRO. Using the explicit definition of the ambiguity set (2), the inner worst-case problem of the $q$-DRO formulation, for a fixed estimator $\boldsymbol{x} \in \Delta_n$, can be stated as: $\max_{\boldsymbol{e}} \sum_{j=1}^{n} (\hat{p}_j + e_j)(-\log x_j)$ (3a) s.t. $\hat{p}_j + e$...
[4] Main Results: Axiomatic Properties of the $q$-DRO Estimator. In this section, we analyze the properties of the $q$-DRO estimator $\boldsymbol{x}$ by examining the KKT conditions of the convex problem (5). 4.1. Positivity and Symmetry. Theorem 1. For any $q \in [1, \infty]$, the $q$-DRO estimator $\boldsymbol{x}$ satisfies Positivity and Symmetry. Proof. We first prov...
[5] Discussion. Our axiomatic analysis reveals that $q$-DRO estimators form a flexible class of smoothing rules. The analysis in Section 4 provides a deeper interpretation. 5.1. Validity of Assumption. Assumption 1 (i.e., $\| -\log(\boldsymbol{x}) - \beta \mathbf{1} + \boldsymbol{\lambda} \|_{q^*} > 0$) was introduced as a technical condition to ensure the gradient of the $q^*$-norm ...
[6] Numerical Examples. This section presents numerical examples to validate our theoretical findings. We demonstrate (i) the verification of the axioms for $q = 2$, and (ii) the effect of the radius parameter on the optimal solution, illustrating the interpretation as regularized empirical loss minimization. All experiments are implemented in Python using MOS...
[7] ...2332 < 0.2742). 6.2. Experiment 2: Sensitivity Analysis. Next, we analyze the effect of the robustness radius (regularization strength). We use $n = 4$ categories and a simple asymmetric empirical distribution: $\hat{\boldsymbol{p}} = (0.10, 0.20, 0.30, 0.40)^\top$. We fix $q = 2$ and vary the radius from 0.0 to 0.3. Figure 2 illustrates how each component ...
[8] Conclusion and Future Work. This paper analyzed the axiomatic properties of probability estimators derived from distributionally robust optimization with $q$-norm ambiguity sets. We established that the resulting $q$-DRO estimator satisfies Positivity and Symmetry for all $q \in [1, \infty]$, and further proved that Order Preservation holds for a...
[9] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, 2009.
[10] A. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[12] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.
[13] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
[14] D. Kuhn, S. Shafiee, and W. Wiesemann. Distributionally robust optimization. Acta Numerica, 34:579–804, 2025.
[15] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[16] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.
[17] MOSEK ApS. MOSEK Optimizer API for Python 11.0.29. 2024. URL https://docs.mosek.com/11.0/pythonapi/index.html
[18] MOSEK ApS. MOSEK Modeling Cookbook 3.3.0, 2024. URL https://docs.mosek.com/modeling-cookbook/
[19] T. Sakai. The probability smoothing problem: Characterizations of the Laplace method. Mathematical Social Sciences, 135:102409, 2025.
[20] S. Shafieezadeh Abadeh, P. M. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. Advances in Neural Information Processing Systems, 28, 2015.
[21] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094, 2002.
