Entropic Strict Minimum Message Length and Its Connections to PAC-Bayes and NML
Pith reviewed 2026-05-20 23:52 UTC · model grok-4.3
The pith
Entropic SMML replaces expected codelength with an exponential certainty equivalent to create a tunable family of coding rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Entropic SMML replaces the expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, thereby defining a one-parameter family of coding rules that interpolates between Bayesian average-case coding and worst-case minimax coding. Ordinary SMML is recovered in the risk-neutral limit, while the extreme risk-sensitive limit yields a minimax codelength criterion that coincides with the NML minimax-regret principle after centering by the oracle maximum-likelihood codelength. The criterion admits a variational characterization as a Kullback-Leibler-regularized worst-case expected codelength and, for regular exponential families, the fixed-codebook
What carries the argument
Entropic SMML criterion formed by replacing the expected two-part codelength with its exponential certainty equivalent under the prior predictive distribution.
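As a rough illustration of what an exponential certainty equivalent does, the sketch below evaluates (1/τ) log E[exp(τ L)] for a small table of codelengths. It uses the standard entropic-risk formula, not the paper's full SMML objective. The codelengths, probabilities, and the function name `entropic_certainty_equivalent` are made up for illustration.

```python
import numpy as np

def entropic_certainty_equivalent(codelengths, probs, tau):
    """(1/tau) * log E[exp(tau * L)] under `probs`; tau = 0 returns the plain mean."""
    L = np.asarray(codelengths, dtype=float)
    p = np.asarray(probs, dtype=float)
    if tau == 0.0:
        return float(np.dot(p, L))
    # Stabilise the log-sum-exp by factoring out the largest exponent.
    a = tau * L
    a_max = a.max()
    return float((a_max + np.log(np.dot(p, np.exp(a - a_max)))) / tau)

# Hypothetical codelengths (in nats) for four data sets and their
# predictive probabilities; neither comes from the paper.
codelengths = [2.0, 3.0, 5.0, 9.0]
probs = [0.5, 0.3, 0.15, 0.05]

for tau in (0.0, 0.1, 1.0, 10.0, 100.0):
    print(tau, entropic_certainty_equivalent(codelengths, probs, tau))
```

With these numbers the value starts at the expected codelength for τ = 0 and climbs toward the largest codelength as τ grows, which is the average-case to worst-case interpolation described above.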
If this is right
- Ordinary SMML is recovered exactly when the risk parameter approaches the neutral limit.
- The high-risk limit, after centering by the oracle MLE codelength, coincides with the NML minimax-regret principle.
- A variational representation as KL-regularized worst-case expected codelength supplies a PAC-Bayes interpretation.
- In regular parametric models the transition between Bayesian, robust and minimax regimes occurs on a logarithmic scale in n and the risk parameter.
- For regular exponential families the fixed-codebook partition stays affine in sufficient-statistic space, and the codepoints satisfy a tilted moment-matching condition that identifies them as tilted Bregman centroids.
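The variational representation in the list above is presumably an instance of the Donsker-Varadhan (Gibbs) identity, (1/τ) log E_r[exp(τ L)] = sup_q { E_q[L] − (1/τ) KL(q‖r) }, whose maximiser is q* ∝ r·exp(τ L). The sketch below checks that identity on a toy discrete example. The arrays `L` and `r`, the value of `tau`, and all function names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def entropic_value(L, r, tau):
    """(1/tau) * log E_r[exp(tau * L)], computed with a stable log-sum-exp."""
    a = tau * L
    m = a.max()
    return float((m + np.log(np.dot(r, np.exp(a - m)))) / tau)

def kl_regularised_value(q, L, r, tau):
    """E_q[L] - (1/tau) * KL(q || r) for strictly positive q and r."""
    return float(np.dot(q, L) - np.dot(q, np.log(q / r)) / tau)

def gibbs_tilt(L, r, tau):
    """Exponentially tilted distribution q* proportional to r * exp(tau * L)."""
    a = tau * L + np.log(r)
    w = np.exp(a - a.max())
    return w / w.sum()

# Hypothetical codelengths and reference distribution (not from the paper).
L = np.array([2.0, 3.0, 5.0, 9.0])
r = np.array([0.5, 0.3, 0.15, 0.05])
tau = 0.7

q_star = gibbs_tilt(L, r, tau)
lhs = entropic_value(L, r, tau)
rhs = kl_regularised_value(q_star, L, r, tau)

# Random alternative distributions should never beat the tilted one.
rng = np.random.default_rng(0)
others = rng.dirichlet(2.0 * np.ones(4), size=1000)
best_other = max(kl_regularised_value(q, L, r, tau) for q in others)
print(lhs, rhs, best_other)
```

The penalty weight 1/τ shows why a small τ keeps the adversarial distribution q close to the reference r, while a large τ lets it concentrate on the worst case.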
Where Pith is reading between the lines
- The logarithmic scaling implies that moderate risk sensitivity produces distinct behavior from both Bayesian and minimax extremes even at moderately large sample sizes.
- The PAC-Bayes link may be used to derive new generalization bounds that incorporate explicit risk sensitivity into model selection.
- The tilted-Bregman-centroid view suggests possible extensions to other Bregman divergences or non-exponential-family models where the affine property fails.
- Practical coding algorithms could tune the risk parameter to trade average-case efficiency against robustness on small or heterogeneous data sets.
Load-bearing premise
The joint asymptotic theory and the claims of affine partitions with tilted moment-matching assume regular parametric models and regular exponential families.
What would settle it
A concrete counter-example showing a non-affine codebook partition or a non-logarithmic transition between regimes inside a regular exponential family for large but finite n would falsify the asymptotic claims.
Original abstract
We introduce entropic strict minimum message length (SMML), a risk-sensitive generalization of strict minimum message length coding. The proposed criterion replaces expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, thereby defining a one-parameter family of coding rules that interpolates between Bayesian average-case coding and worst-case minimax coding. We show that ordinary SMML is recovered in the risk-neutral limit, while the extreme risk-sensitive limit yields a minimax codelength criterion; when centered by the oracle maximum likelihood codelength, this criterion coincides with the normalized maximum likelihood (NML) minimax-regret principle. We further prove that entropic SMML admits a variational characterization as a Kullback-Leibler-regularized worst-case expected codelength, giving it a PAC-Bayes-type interpretation. We establish a joint asymptotic theory linking the sample size $n$ and the risk parameter $\tau$, showing that in regular parametric models the transition between Bayesian, robust, and minimax coding regimes occurs on a logarithmic scale. For regular exponential families, the fixed-codebook partition remains affine in sufficient-statistic space, while the codepoints satisfy a tilted moment-matching condition and admit an interpretation as tilted Bregman centroids. These results position entropic SMML as an information-theoretic bridge between MML, PAC-Bayes, and MDL.
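To make the NML minimax-regret principle mentioned in the abstract concrete, here is a minimal sketch for binary sequences of length n under a Bernoulli model. It uses the standard Shtarkov construction rather than anything specific to the paper. The function names and the choice n = 10 are assumptions for illustration.

```python
import math

def max_likelihood(k, n):
    """Probability of one particular binary sequence with k ones under its own MLE k/n."""
    p = k / n
    return (p ** k) * ((1 - p) ** (n - k))  # Python evaluates 0.0 ** 0 as 1.0

def shtarkov_sum(n):
    """Sum of maximised likelihoods over all 2^n sequences, grouped by count k."""
    return sum(math.comb(n, k) * max_likelihood(k, n) for k in range(n + 1))

def nml_regret(k, n):
    """NML codelength minus the oracle maximum-likelihood codelength, in nats."""
    nml_prob = max_likelihood(k, n) / shtarkov_sum(n)
    return -math.log(nml_prob) + math.log(max_likelihood(k, n))

n = 10
regrets = [nml_regret(k, n) for k in range(n + 1)]
print(math.log(shtarkov_sum(n)), regrets)
```

Every sequence incurs the same regret, log of the Shtarkov sum, once the oracle maximum-likelihood codelength is subtracted. This equalisation is the property that makes NML the minimax-regret code.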
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces entropic strict minimum message length (SMML) as a risk-sensitive generalization of strict MML coding. It replaces the expected two-part codelength under the prior predictive distribution with an exponential certainty equivalent, yielding a one-parameter family indexed by the risk parameter τ that interpolates between Bayesian average-case coding and worst-case minimax coding. The work makes five claims: (i) the risk-neutral limit recovers ordinary SMML; (ii) the extreme risk-sensitive limit yields a minimax codelength criterion that coincides with the normalized maximum likelihood (NML) principle when centered by the oracle MLE codelength; (iii) the criterion admits a variational characterization as a KL-regularized worst-case expected codelength, with a PAC-Bayes interpretation; (iv) a joint asymptotic theory for regular parametric models shows regime transitions on a logarithmic scale in n and τ; and (v) for regular exponential families, the fixed-codebook partition is affine in sufficient-statistic space and the codepoints satisfy tilted moment-matching, interpretable as tilted Bregman centroids.
Significance. If the stated derivations and asymptotic results hold under the assumed regularity conditions (twice-differentiable densities, positive definite Fisher information), the paper supplies a tunable information-theoretic criterion that formally bridges MML, PAC-Bayes, and MDL. The explicit PAC-Bayes-type variational form, the NML coincidence, and the geometric characterizations for exponential families are potentially useful for robust coding and model selection; the joint (n,τ) asymptotics, if rigorously controlled, would clarify the transition between average-case and worst-case regimes.
major comments (2)
- [§4] §4 (joint asymptotic theory): the claimed regime transitions on a logarithmic scale in n and τ rely on uniform Laplace-type approximations across the codebook. When τ scales as log n the exponential tilting amplifies tail contributions; without explicit uniform error bounds on the large-deviation rate function under the stated regularity, the interpolation between Bayesian and minimax regimes and the Bregman-centroid interpretation may hold only pointwise rather than uniformly.
- [Abstract, §3] Abstract and the variational/NML sections: the claims that the variational form, the NML coincidence, and the asymptotic regimes are proven are central, yet the manuscript provides no explicit statement of the full regularity conditions or verification of edge cases (e.g., boundary behavior of the exponential family or non-compact parameter spaces). This leaves the support for the central claims at the level of plausibility until the derivations are inspected.
minor comments (2)
- Define the exponential certainty equivalent explicitly in the main text (rather than only in the abstract) so that readers can follow the transition from expected codelength to the risk-sensitive objective without external references.
- Add a short table or diagram illustrating the limiting regimes (τ→0, τ→∞) and the corresponding coding rules to improve readability of the one-parameter family.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. The comments raise important issues about the rigor of our asymptotic results and the clarity of our regularity assumptions. We address each point below and commit to revisions that will strengthen the paper.
- Referee: [§4] §4 (joint asymptotic theory): the claimed regime transitions on a logarithmic scale in n and τ rely on uniform Laplace-type approximations across the codebook. When τ scales as log n the exponential tilting amplifies tail contributions; without explicit uniform error bounds on the large-deviation rate function under the stated regularity, the interpolation between Bayesian and minimax regimes and the Bregman-centroid interpretation may hold only pointwise rather than uniformly.
  Authors: We appreciate this observation regarding the need for uniform error bounds in the asymptotic analysis. The current manuscript uses standard Laplace approximations for the regime transitions, but we agree that explicit uniform bounds are required to rigorously justify the claims when τ is of order log n. In the revision, we will include a lemma providing uniform large-deviation bounds under the assumed regularity conditions (twice-differentiable log-densities and positive definite Fisher information). This will ensure the interpolation holds uniformly, and we will update the Bregman-centroid interpretation accordingly. We believe this addresses the concern without altering the main results.
  Revision: yes
- Referee: [Abstract, §3] Abstract and the variational/NML sections: the claims that the variational form, the NML coincidence, and the asymptotic regimes are proven are central, yet the manuscript provides no explicit statement of the full regularity conditions or verification of edge cases (e.g., boundary behavior of the exponential family or non-compact parameter spaces). This leaves the support for the central claims at the level of plausibility until the derivations are inspected.
  Authors: The referee correctly identifies that the manuscript would benefit from an explicit enumeration of the regularity conditions supporting the variational characterization, NML coincidence, and asymptotic regimes. We will add a new section titled 'Regularity Conditions' that lists all assumptions, including compactness of the parameter space for the main results and interior-point assumptions for exponential families. Edge cases such as boundary behavior will be discussed with references to truncation techniques for non-compact spaces. This revision will make the proofs more transparent and verifiable, elevating the claims from plausible to fully supported.
  Revision: yes
Circularity Check
No circularity: new definition yields derived properties under standard regularity
The paper defines entropic SMML by replacing the expected two-part codelength with an exponential certainty equivalent, creating a parameterized family. It then derives limit cases (risk-neutral recovers ordinary SMML; risk-sensitive yields NML after centering), a variational KL-regularized form, and joint asymptotics for regular parametric models and exponential families (affine partitions, tilted moment-matching, Bregman centroids). These steps follow directly from the definition plus standard analytic techniques and regularity assumptions (twice-differentiable densities, positive definite Fisher information). No claimed result reduces to a fitted input, rests on a load-bearing self-citation, or depends on an imported uniqueness theorem. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- risk parameter τ
axioms (1)
- domain assumption: Regularity conditions on the parametric model family
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel; costAlphaLog_high_calibrated_iff [echoes]
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: $I_{n,\tau}(P,q,\theta) = \frac{1}{\tau}\log \mathbb{E}_{r_n}\left[\exp\left(\tau\,\Lambda_{P,q,\theta}(X^n)\right)\right]$; codepoints satisfy tilted moment-matching $n\nabla A(\nu^*) = \sum w_{j,\tau}(x;\nu^*)\,T(x)$ with $w_{j,\tau} \propto r_n(x)\,p_n(x\mid\theta)^{-\tau}$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration; J_uniquely_calibrated_via_higher_derivative [refines]
  refines: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: entropic SMML codepoint is the m-projection of a τ-tilted distribution $s_{j,\tau} \propto r_n(x)\,p_n(x\mid\theta^*)^{-\tau}$ onto the model manifold
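The tilted moment-matching condition quoted above can be explored numerically in the simplest exponential family, a binomial count with T(k) = k. The sketch below assumes that the codepoint for one cell minimises (1/τ) log Σ_k r(k) p(k|ν)^{−τ} over the natural parameter ν. That objective is convex in ν, so the stationarity condition n∇A(ν*) = Σ_k w_k k can be solved by bisection. The uniform prior predictive `r`, the cell `[0, 1, 2, 3]`, and every function name are hypothetical choices, not taken from the paper.

```python
import math

def log_binom_pmf(k, n, nu):
    """log p(k | nu) for a Binomial(n, sigmoid(nu)) count, nu the natural parameter."""
    return math.log(math.comb(n, k)) + nu * k - n * math.log1p(math.exp(nu))

def mean_parameter(nu):
    """grad A(nu) for the Bernoulli family, i.e. the success probability."""
    return 1.0 / (1.0 + math.exp(-nu))

def tilted_weights(cell, r, n, nu, tau):
    """Normalised weights proportional to r(k) * p(k | nu)^(-tau) over one cell."""
    logs = [math.log(r[k]) - tau * log_binom_pmf(k, n, nu) for k in cell]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    s = sum(w)
    return [v / s for v in w]

def moment_gap(cell, r, n, nu, tau):
    """n * grad A(nu) minus the tilted average of T(k) = k; zero at the codepoint."""
    w = tilted_weights(cell, r, n, nu, tau)
    return n * mean_parameter(nu) - sum(wk * k for wk, k in zip(w, cell))

def tilted_codepoint(cell, r, n, tau, lo=-20.0, hi=20.0, iters=200):
    """Bisection on the moment gap, which is nondecreasing in nu by convexity."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_gap(cell, r, n, mid, tau) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 10
r = [1.0 / (n + 1)] * (n + 1)  # uniform prior predictive (Beta(1,1) prior), an assumption
cell = [0, 1, 2, 3]            # hypothetical codebook cell of success counts
for tau in (0.0, 0.5, 2.0):
    nu_star = tilted_codepoint(cell, r, n, tau)
    print(tau, mean_parameter(nu_star))
```

At τ = 0 the weights reduce to the prior predictive and the codepoint is the ordinary cell centroid. For τ > 0 the weights favour counts that the current codepoint explains poorly, which pulls the codepoint toward them.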
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.