arxiv: 2605.02099 · v2 · pith:6YRNMN2Enew · submitted 2026-05-03 · 🧮 math.ST · stat.TH

Entropic Strict Minimum Message Length and Its Connections to PAC-Bayes and NML

Pith reviewed 2026-05-20 23:52 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords entropic SMML · minimum message length · PAC-Bayes · normalized maximum likelihood · risk-sensitive coding · exponential families · asymptotic analysis · information theory

The pith

Entropic SMML replaces expected codelength with an exponential certainty equivalent to create a tunable family of coding rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces entropic strict minimum message length as a risk-sensitive version of strict MML. It substitutes an exponential certainty equivalent for the usual expected two-part codelength under the prior predictive distribution. This substitution yields a one-parameter family of rules that moves continuously from Bayesian average-case coding to worst-case minimax coding. The construction recovers ordinary SMML at the risk-neutral end and the normalized maximum likelihood minimax-regret rule at the high-risk end. It also supplies a variational view that connects the criterion to PAC-Bayes and supplies joint asymptotics that locate the regime transitions on a logarithmic scale in sample size and risk parameter.

Core claim

Entropic SMML replaces the expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, thereby defining a one-parameter family of coding rules that interpolates between Bayesian average-case coding and worst-case minimax coding. Ordinary SMML is recovered in the risk-neutral limit, while the extreme risk-sensitive limit yields a minimax codelength criterion that coincides with the NML minimax-regret principle after centering by the oracle maximum-likelihood codelength. The criterion admits a variational characterization as a Kullback-Leibler-regularized worst-case expected codelength and, for regular exponential families, the fixed-codebook

What carries the argument

Entropic SMML criterion formed by replacing the expected two-part codelength with its exponential certainty equivalent under the prior predictive distribution.
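
For readers who want the criterion in symbols, one standard way to write an exponential certainty equivalent is sketched below. The notation is an assumption rather than the paper's own: $r$ stands for the prior predictive distribution, $I_1(x)$ for the two-part codelength assigned to data $x$ (in nats), and $\tau > 0$ for the risk parameter. The paper may parametrize the risk level differently.

```latex
% Assumed form of the entropic criterion (illustrative notation).
\[
  \mathcal{L}_\tau \;=\; \frac{1}{\tau}\,\log \mathbb{E}_{x \sim r}\!\left[ e^{\tau I_1(x)} \right],
  \qquad \tau > 0 .
\]
% Risk-neutral limit: a first-order expansion in tau gives the expected codelength.
\[
  \lim_{\tau \to 0^{+}} \mathcal{L}_\tau \;=\; \mathbb{E}_{x \sim r}\!\left[ I_1(x) \right].
\]
% Extreme risk-sensitive limit on a finite data space: the largest codelength on the support of r.
\[
  \lim_{\tau \to \infty} \mathcal{L}_\tau \;=\; \max_{x \,:\, r(x) > 0} I_1(x).
\]
```

Under this form the criterion is nondecreasing in $\tau$, so larger values of $\tau$ weight long codelengths more heavily.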

If this is right

  • Ordinary SMML is recovered exactly when the risk parameter approaches the neutral limit.
  • The high-risk limit, after centering by the oracle MLE codelength, coincides with the NML minimax-regret principle.
  • A variational representation as KL-regularized worst-case expected codelength supplies a PAC-Bayes interpretation.
  • In regular parametric models the transition between Bayesian, robust and minimax regimes occurs on a logarithmic scale in n and the risk parameter.
  • For regular exponential families the fixed-codebook partition stays affine in sufficient-statistic space, and the codepoints satisfy a tilted moment-matching condition that can be read as a tilted Bregman centroid.
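
The limiting behavior and the KL-regularized form listed above can be checked numerically on a toy example. The sketch below assumes the certainty equivalent has the form (1/τ) log E[exp(τ L)] and uses made-up codelengths and probabilities. All function names are hypothetical and nothing here is taken from the paper.

```python
import math

def certainty_equivalent(lengths, probs, tau):
    """(1/tau) * log sum_x probs[x] * exp(tau * lengths[x]), computed stably."""
    m = max(lengths)  # factor out the largest codelength to avoid overflow
    s = sum(p * math.exp(tau * (l - m)) for p, l in zip(probs, lengths))
    return m + math.log(s) / tau

def tilted(lengths, probs, tau):
    """Exponentially tilted distribution q*(x) proportional to probs[x] * exp(tau * lengths[x])."""
    m = max(lengths)
    w = [p * math.exp(tau * (l - m)) for p, l in zip(probs, lengths)]
    z = sum(w)
    return [wi / z for wi in w]

def kl_penalized(q, lengths, probs, tau):
    """E_q[L] - KL(q || probs) / tau, the objective in the variational form."""
    mean = sum(qi * li for qi, li in zip(q, lengths))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, probs) if qi > 0)
    return mean - kl / tau

# Toy data (assumed): three outcomes with codelengths in nats.
lengths = [1.0, 2.0, 5.0]
probs = [0.6, 0.3, 0.1]

near_neutral = certainty_equivalent(lengths, probs, 1e-6)   # close to the mean, 1.7
high_risk = certainty_equivalent(lengths, probs, 200.0)     # close to the maximum, 5.0
tau = 1.0
ce = certainty_equivalent(lengths, probs, tau)
at_tilted = kl_penalized(tilted(lengths, probs, tau), lengths, probs, tau)
at_prior = kl_penalized(probs, lengths, probs, tau)
print(near_neutral, high_risk, ce, at_tilted, at_prior)
```

The tilted distribution attains the supremum of the KL-penalized objective, which is the Gibbs (Donsker-Varadhan) variational identity; any other choice of q, such as the untilted `probs`, gives a smaller value.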

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The logarithmic scaling implies that moderate risk sensitivity produces distinct behavior from both Bayesian and minimax extremes even at moderately large sample sizes.
  • The PAC-Bayes link may be used to derive new generalization bounds that incorporate explicit risk sensitivity into model selection.
  • The tilted-Bregman-centroid view suggests possible extensions to other Bregman divergences or non-exponential-family models where the affine property fails.
  • Practical coding algorithms could tune the risk parameter to trade average-case efficiency against robustness on small or heterogeneous data sets.

Load-bearing premise

The joint asymptotic theory assumes regular parametric models, and the claims of affine partitions with tilted moment-matching assume regular exponential families.
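
A short derivation under assumed notation shows why these two properties are natural for exponential families. Assume densities $p(x \mid \theta) = h(x)\exp(\theta^\top T(x) - A(\theta))$, a fixed codebook $\{(\theta_j, q_j)\}$ with codeword probabilities $q_j$ and cells $X_j$, a discrete data space with prior predictive $r$, and a criterion of the form $\tau^{-1}\log \mathbb{E}_r[e^{\tau I_1}]$. These symbols are illustrative and the paper's exact statements may differ.

```latex
% Two-part codelength of x under codebook entry j (nats):
\[
  I_1(x; j) \;=\; -\log q_j \;-\; \log h(x) \;-\; \theta_j^\top T(x) \;+\; A(\theta_j).
\]
% x prefers entry j over entry k when I_1(x; j) <= I_1(x; k).
% The h(x) term cancels, leaving a condition that is affine in T(x):
\[
  (\theta_j - \theta_k)^\top T(x) \;\ge\; A(\theta_j) - A(\theta_k) + \log\frac{q_k}{q_j}.
\]
% Stationarity of tau^{-1} log sum_x r(x) exp(tau I_1(x)) in theta_j, cells held fixed:
\[
  \nabla A(\theta_j) \;=\;
  \frac{\sum_{x \in X_j} w_\tau(x)\, T(x)}{\sum_{x \in X_j} w_\tau(x)},
  \qquad
  w_\tau(x) \;=\; r(x)\, e^{\tau I_1(x; j)}.
\]
```

Because the certainty equivalent is a monotone function of each codelength, the per-datum assignment does not depend on $\tau$, which is consistent with the partition staying affine. The weights $w_\tau$ depend on $\theta_j$ itself, so the last display is a fixed-point condition rather than a closed form; as $\tau \to 0$ the weights reduce to $r(x)$ and the condition becomes an untilted conditional mean of $T(x)$ over the cell.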

What would settle it

A concrete counter-example showing a non-affine codebook partition or a non-logarithmic transition between regimes inside a regular exponential family for large but finite n would falsify the asymptotic claims.

Figures

Figures reproduced from arXiv: 2605.02099 by Daniel F. Schmidt, Enes Makalic.

Figure 1: Binomial comparison of ordinary SMML, entropic SMML, and the worst-case codelength endpoint for …
Original abstract

We introduce entropic strict minimum message length (SMML), a risk-sensitive generalization of strict minimum message length coding. The proposed criterion replaces expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, thereby defining a one-parameter family of coding rules that interpolates between Bayesian average-case coding and worst-case minimax coding. We show that ordinary SMML is recovered in the risk-neutral limit, while the extreme risk-sensitive limit yields a minimax codelength criterion; when centered by the oracle maximum likelihood codelength, this criterion coincides with the normalized maximum likelihood (NML) minimax-regret principle. We further prove that entropic SMML admits a variational characterization as a Kullback-Leibler-regularized worst-case expected codelength, giving it a PAC-Bayes-type interpretation. We establish a joint asymptotic theory linking the sample size $n$ and the risk parameter $\tau$, showing that in regular parametric models the transition between Bayesian, robust, and minimax coding regimes occurs on a logarithmic scale. For regular exponential families, the fixed-codebook partition remains affine in sufficient-statistic space, while the codepoints satisfy a tilted moment-matching condition and admit an interpretation as tilted Bregman centroids. These results position entropic SMML as an information-theoretic bridge between MML, PAC-Bayes, and MDL.
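
To make the NML endpoint concrete, the sketch below computes the Shtarkov normalizer for a binomial model, whose logarithm is the minimax regret of the NML code. It is a standalone illustration with hypothetical function names, not the paper's experiment; the comparison value 0.5 * log(n * pi / 2) is the standard Fisher-information asymptotic for the Bernoulli family, included only to show growth on a logarithmic scale in n.

```python
import math

def log_max_likelihood(k, n):
    """log of C(n, k) * (k/n)^k * (1 - k/n)^(n - k): the binomial likelihood at its MLE."""
    log_p = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    if 0 < k < n:  # at k = 0 and k = n the maximized Bernoulli factor equals 1
        log_p += k * math.log(k / n) + (n - k) * math.log((n - k) / n)
    return log_p

def log_shtarkov_sum(n):
    """Log of the NML normalizer, i.e. the minimax regret in nats."""
    return math.log(sum(math.exp(log_max_likelihood(k, n)) for k in range(n + 1)))

def nml_codelength(k, n):
    """NML codelength (nats) for k successes in n trials: ML codelength plus constant regret."""
    return log_shtarkov_sum(n) - log_max_likelihood(k, n)

for n in (10, 100, 1000):
    exact = log_shtarkov_sum(n)
    approx = 0.5 * math.log(n * math.pi / 2)  # asymptotic value, assumed form
    print(n, round(exact, 4), round(approx, 4))
```

Subtracting the maximum-likelihood codelength from `nml_codelength(k, n)` gives the same regret for every k, which is the sense in which centering by the oracle codelength turns a worst-case codelength criterion into a minimax-regret one.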

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces entropic strict minimum message length (SMML) as a risk-sensitive generalization of strict MML coding. It replaces the expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, yielding a one-parameter family indexed by the risk parameter τ that interpolates between Bayesian average-case coding and worst-case minimax coding. The work makes five claims: (i) the risk-neutral limit recovers ordinary SMML; (ii) the extreme risk-sensitive limit yields a minimax codelength that coincides with the normalized maximum likelihood (NML) minimax-regret principle when centered by the oracle MLE codelength; (iii) the criterion admits a variational characterization as a KL-regularized worst-case expected codelength with a PAC-Bayes interpretation; (iv) a joint asymptotic theory for regular parametric models shows regime transitions on a logarithmic scale in n and τ; and (v) for regular exponential families the fixed-codebook partition is affine in sufficient-statistic space and the codepoints satisfy tilted moment-matching, interpretable as tilted Bregman centroids.

Significance. If the stated derivations and asymptotic results hold under the assumed regularity conditions (twice-differentiable densities, positive definite Fisher information), the paper supplies a tunable information-theoretic criterion that formally bridges MML, PAC-Bayes, and MDL. The explicit PAC-Bayes-type variational form, the NML coincidence, and the geometric characterizations for exponential families are potentially useful for robust coding and model selection; the joint (n,τ) asymptotics, if rigorously controlled, would clarify the transition between average-case and worst-case regimes.

major comments (2)
  1. [§4] §4 (joint asymptotic theory): the claimed regime transitions on a logarithmic scale in n and τ rely on uniform Laplace-type approximations across the codebook. When τ scales as log n the exponential tilting amplifies tail contributions; without explicit uniform error bounds on the large-deviation rate function under the stated regularity, the interpolation between Bayesian and minimax regimes and the Bregman-centroid interpretation may hold only pointwise rather than uniformly.
  2. [Abstract, §3] Abstract and the variational/NML sections: the claims that the variational form, the NML coincidence, and the asymptotic regimes are proven are central, yet the manuscript provides no explicit statement of the full regularity conditions or verification of edge cases (e.g., boundary behavior of the exponential family or non-compact parameter spaces). This leaves the support for the central claims at the level of plausibility until the derivations are inspected.
minor comments (2)
  1. Define the exponential certainty equivalent explicitly in the main text (rather than only in the abstract) so that readers can follow the transition from expected codelength to the risk-sensitive objective without external references.
  2. Add a short table or diagram illustrating the limiting regimes (τ→0, τ→∞) and the corresponding coding rules to improve readability of the one-parameter family.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. The comments raise important issues about the rigor of our asymptotic results and the clarity of our regularity assumptions. We address each point below and commit to revisions that will strengthen the paper.

Point-by-point responses
  1. Referee: [§4] §4 (joint asymptotic theory): the claimed regime transitions on a logarithmic scale in n and τ rely on uniform Laplace-type approximations across the codebook. When τ scales as log n the exponential tilting amplifies tail contributions; without explicit uniform error bounds on the large-deviation rate function under the stated regularity, the interpolation between Bayesian and minimax regimes and the Bregman-centroid interpretation may hold only pointwise rather than uniformly.

    Authors: We appreciate this observation regarding the need for uniform error bounds in the asymptotic analysis. The current manuscript uses standard Laplace approximations for the regime transitions, but we agree that explicit uniform bounds are required to rigorously justify the claims when τ is of order log n. In the revision, we will include a lemma providing uniform large-deviation bounds under the assumed regularity conditions (twice-differentiable log-densities and positive definite Fisher information). This will ensure the interpolation holds uniformly, and we will update the Bregman-centroid interpretation accordingly. We believe this addresses the concern without altering the main results. revision: yes

  2. Referee: [Abstract, §3] Abstract and the variational/NML sections: the claims that the variational form, the NML coincidence, and the asymptotic regimes are proven are central, yet the manuscript provides no explicit statement of the full regularity conditions or verification of edge cases (e.g., boundary behavior of the exponential family or non-compact parameter spaces). This leaves the support for the central claims at the level of plausibility until the derivations are inspected.

    Authors: The referee correctly identifies that the manuscript would benefit from an explicit enumeration of the regularity conditions supporting the variational characterization, NML coincidence, and asymptotic regimes. We will add a new section titled 'Regularity Conditions' that lists all assumptions, including compactness of the parameter space for the main results and interior-point assumptions for exponential families. Edge cases such as boundary behavior will be discussed with references to truncation techniques for non-compact spaces. This revision will make the proofs more transparent and verifiable, elevating the claims from plausible to fully supported. revision: yes

Circularity Check

0 steps flagged

No circularity: new definition yields derived properties under standard regularity

Full rationale

The paper defines entropic SMML by replacing the expected two-part codelength with an exponential certainty equivalent, creating a parameterized family. It then derives limit cases (risk-neutral recovers ordinary SMML; risk-sensitive yields NML after centering), a variational KL-regularized form, and joint asymptotics for regular parametric models and exponential families (affine partitions, tilted moment-matching, Bregman centroids). These steps follow directly from the definition plus standard analytic techniques and regularity assumptions (twice-differentiable densities, positive definite Fisher information), without reducing any claimed result to a fitted input, a load-bearing self-citation, or an imported uniqueness theorem. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central addition is the risk parameter τ that controls sensitivity; the remaining structure rests on standard regularity assumptions for parametric models and exponential families that are invoked for the asymptotic results.

free parameters (1)
  • risk parameter τ
    Single scalar that sets the degree of risk sensitivity in the exponential certainty equivalent and controls the transition between coding regimes.
axioms (1)
  • domain assumption: Regularity conditions on the parametric model family
    Invoked to obtain the joint asymptotic theory linking n and τ and to guarantee that the fixed-codebook partition remains affine in sufficient-statistic space.

pith-pipeline@v0.9.0 · 5774 in / 1366 out tokens · 64365 ms · 2026-05-20T23:52:29.965947+00:00 · methodology

