Recognition: 2 theorem links
· Lean Theorem
A Global Characterization of f-Divergences Yielding PSD Mutual-Information Matrices
Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3
The pith
Mutual-information matrices from f-divergences are positive semidefinite for all finite alphabets precisely when the normalized generator expands as a power series with nonnegative coefficients that converges on all positive reals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given convex finite generators f with f(1) = 0 and finite f(0), the matrix whose (i,j) entry is the f-mutual information between X_i and X_j is positive semidefinite for every finite-alphabet family exactly when the normalized representative bar f admits the expansion sum_{m ≥ 2} a_m (t-1)^m with a_m ≥ 0 and the series converges on all of (0, ∞).
What carries the argument
The globally convergent power-series expansion of the normalized generator bar f(t) with nonnegative coefficients a_m.
If this is right
- Chi-squared divergence satisfies the condition and therefore always yields PSD matrices.
- Shannon mutual information and Jensen-Shannon divergence fail the condition and can produce non-PSD matrices.
- Total variation and ReLU divergences are excluded because they lack the required global analyticity.
- Any nonnegative mixture of qualifying monomial generators also produces PSD matrices.
- The kernel property on the variable indices is independent of whether the divergence itself is a metric.
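Why Shannon mutual information fails the test can be read off the generator f(t) = t log t directly: subtracting the linear part leaves bar f(t) = t log t − (t − 1), whose Taylor coefficients at t = 1 are a_m = (−1)^m / (m(m−1)) and therefore alternate in sign. A small sketch of that standard expansion (not code from the paper; helper names are ours):

```python
import math

# Normalized KL generator (Shannon MI): bar_f(t) = t*log(t) - (t - 1).
# Its Taylor expansion at t = 1 is sum_{m>=2} (-1)^m (t-1)^m / (m(m-1)),
# so a_3 = -1/6 < 0 and the nonnegativity condition fails. The series
# also converges only for t in (0, 2], not on all of (0, inf).
def a_kl(m):
    return (-1) ** m / (m * (m - 1))

def bar_f_kl(t):
    return t * math.log(t) - (t - 1)

assert a_kl(2) > 0 and a_kl(3) < 0           # mixed signs: condition violated

t = 1.5                                      # inside the radius of convergence
partial = sum(a_kl(m) * (t - 1) ** m for m in range(2, 30))
assert abs(partial - bar_f_kl(t)) < 1e-8     # series matches bar_f near t = 1
```

By contrast, the χ² generator (t − 1)² has the single coefficient a_2 = 1 ≥ 0, trivially convergent everywhere, which is the sense in which it "satisfies the condition" above.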
Where Pith is reading between the lines
- The same power-series condition may characterize PSD kernels when the alphabets are continuous under suitable integrability.
- The result could guide construction of new divergence-based similarity measures that automatically produce PSD Gram matrices for clustering.
- It isolates an algebraic positivity requirement separate from the usual convexity used to define divergences.
Load-bearing premise
The local nonnegativity of Taylor coefficients extracted at t=1 via three-point kernels extends to a global power series representation valid for all finite alphabets.
What would settle it
An f whose generator has a power series with at least one negative coefficient yet still produces a PSD mutual-information matrix for every finite-alphabet collection, or conversely an analytic generator with all nonnegative coefficients that yields a non-PSD matrix for some collection.
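A brute-force version of this probe is easy to set up on the χ² side: sample random three-variable joint distributions, form the matrix of pairwise χ²-mutual informations, and test positive semidefiniteness via principal minors. This is only an illustrative sketch (helper names are ours, and a random search is far weaker than the adversarial construction the question asks for); per the claimed characterization, the χ² matrices should pass every time:

```python
import random

def chi2_mi(pxy, px, py):
    # I_f(X;Y) for f(t) = (t-1)^2:
    #   sum_{a,b} px(a) py(b) * (pxy(a,b) / (px(a) py(b)) - 1)^2
    return sum(px[a] * py[b] * (pxy[a][b] / (px[a] * py[b]) - 1.0) ** 2
               for a in range(2) for b in range(2))

def mi_matrix(joint):
    # joint: {(x0, x1, x2): prob} over three binary variables
    def marg1(i):
        v = [0.0, 0.0]
        for x, p in joint.items():
            v[x[i]] += p
        return v
    def marg2(i, j):
        m = [[0.0, 0.0], [0.0, 0.0]]
        for x, p in joint.items():
            m[x[i]][x[j]] += p
        return m
    return [[chi2_mi(marg2(i, j), marg1(i), marg1(j)) for j in range(3)]
            for i in range(3)]

def is_psd3(M, tol=1e-9):
    # a symmetric 3x3 matrix is PSD iff every principal minor is nonnegative
    minor2 = lambda i, j: M[i][i] * M[j][j] - M[i][j] * M[j][i]
    det3 = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    return (all(M[i][i] >= -tol for i in range(3))
            and all(minor2(i, j) >= -tol for i, j in ((0, 1), (0, 2), (1, 2)))
            and det3 >= -tol)

random.seed(0)
for _ in range(200):
    w = [random.random() + 0.01 for _ in range(8)]
    s = sum(w)
    joint = {(a, b, c): w[4 * a + 2 * b + c] / s
             for a in range(2) for b in range(2) for c in range(2)}
    assert is_psd3(mi_matrix(joint))  # chi^2 qualifies: no random family fails
```

Swapping in the Shannon generator t log t here would only refute the claim if the sampler happened to hit a bad family; Jakobsen's counterexample (reference [5]) shows such families exist but does not make them typical.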
Original abstract
Given $n$ random variables, when does the matrix of pairwise $f$-mutual informations define a PSD kernel over variables? For convex finite generators $f:(0,\infty)\to\mathbb{R}$ with $f(1)=0$ and finite boundary value $f(0)$, we give a closed characterization up to linear transformation $f\sim f+c(t-1)$, which leaves every $f$-divergence and every $f$-mutual-information matrix unchanged. The matrix $M^{(f)}_{ij}:=I_f(X_i;X_j)$ is PSD for every finite-alphabet family if and only if the normalized representative has a globally convergent expansion $\bar f(t)=\sum_{m\ge2}a_m(t-1)^m$, with $a_m\ge0$, on all of $(0,\infty)$. Sufficiency follows from a replica embedding for monomial generators plus closure under nonnegative mixtures. Necessity first extracts the local Taylor cone at $1$ using biased three-point kernels $H_a$, the Belton--Guillot--Khare--Putinar (BGKP) low-rank Hankel positivity-preserver theorem, and then bootstraps analyticity to the divergence. This is a kernel characterization problem, not a metric one: PSD of the variable-indexed matrix is distinct from Hilbertian properties of divergences between distributions. The result explains why Shannon MI and Jensen--Shannon fail, why $\chi^2$ succeeds, and why non-analytic divergences such as total variation and ReLU are excluded.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, for convex f:(0,∞)→ℝ with f(1)=0 and finite f(0), the matrix M^{(f)}_{ij}:=I_f(X_i;X_j) is positive semidefinite for every finite-alphabet family of random variables if and only if the normalized representative admits the globally convergent expansion bar f(t)=∑_{m≥2} a_m (t-1)^m with a_m≥0 on all of (0,∞). Sufficiency is obtained by replica embeddings of monomial generators together with closure under nonnegative mixtures; necessity extracts the local Taylor cone at t=1 via biased three-point kernels H_a and the BGKP low-rank Hankel positivity-preserver theorem, then bootstraps to global analyticity.
Significance. If the characterization holds, it supplies a precise kernel-theoretic criterion distinguishing f-divergences that induce PSD mutual-information matrices (e.g., χ²) from those that do not (Shannon MI, total variation, ReLU). The replica-embedding argument and the application of the BGKP theorem constitute a technically novel route to this global characterization, separate from metric or Hilbertian properties of the divergences themselves.
major comments (2)
- [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.
- [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.
minor comments (2)
- [§2] Notation for the normalized representative bar f is introduced without an explicit formula relating it to the original f; a one-line definition would improve readability.
- [abstract] The abstract states that the result explains why Jensen–Shannon fails, but the explicit counter-example family of random variables is not displayed; adding a short numerical illustration would help readers verify the claim.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major points below and will incorporate clarifications in the revised manuscript.
Point-by-point responses
-
Referee: [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.
Authors: The bootstrap in §4 proceeds by using the PSD condition on arbitrarily large finite alphabets to construct sequences of three-point kernels that probe the function at any point t>0. Convexity ensures that any violation of global nonnegativity of coefficients or failure of convergence would produce a negative eigenvalue for a suitably chosen distribution, contradicting the hypothesis. Thus the local cone condition plus convexity and the global PSD requirement together force analyticity on (0,∞) without extra regularity assumptions. We will add a short clarifying paragraph after the application of the BGKP theorem to make this implication explicit. revision: yes
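The three-point kernels in this step can be evaluated explicitly. The paper's appendix records H_{1/3}(z) = (4/9) f(1 + (9/16) z) + (1/9) f(1 + (9/4) z) + (4/9) f(1 − (9/8) z) and checks that it collapses to |z|/2 for both total variation and ReLU, which is how those non-analytic generators are excluded. A sketch reproducing that computation (z is kept small enough that every argument of f stays in (0, ∞)):

```python
# H_{1/3} kernel from the paper's appendix: f sampled at three biased points
# around t = 1 with weights 4/9, 1/9, 4/9.
def H_13(f, z):
    return ((4 / 9) * f(1 + (9 / 16) * z)
            + (1 / 9) * f(1 + (9 / 4) * z)
            + (4 / 9) * f(1 - (9 / 8) * z))

f_tvd = lambda t: abs(t - 1.0) / 2.0    # total variation generator
f_relu = lambda t: max(t - 1.0, 0.0)    # ReLU generator

# both non-analytic generators reduce H_{1/3} to the same |z|/2 profile
for z in (-0.4, -0.1, 0.0, 0.25, 0.8):
    assert abs(H_13(f_tvd, z) - abs(z) / 2.0) < 1e-12
    assert abs(H_13(f_relu, z) - abs(z) / 2.0) < 1e-12
```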
-
Referee: [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.
Authors: We agree that an explicit verification strengthens the reduction. The linear term c(t−1) contributes zero to every f-divergence because ∫(t−1)dQ=0 whenever Q is a probability measure; consequently it vanishes in the definition I_f(X_i;X_j)=D_f(P_{X_i X_j}||P_{X_i}P_{X_j}). We will insert a one-paragraph lemma in §2 containing this short calculation. revision: yes
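The promised lemma can also be checked numerically: for any full-support P and Q, the added term contributes c · Σ_x q(x)(p(x)/q(x) − 1) = c(1 − 1) = 0 to D_f(P||Q). A minimal sketch (our helper names, positive masses assumed):

```python
import random

def d_f(f, p, q):
    # D_f(P || Q) = sum_x q(x) f(p(x)/q(x)); full support assumed
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
p = normalize([random.random() + 0.1 for _ in range(6)])
q = normalize([random.random() + 0.1 for _ in range(6)])

f = lambda t: (t - 1.0) ** 2                  # chi^2 generator
for c in (-3.0, 0.7, 42.0):
    g = lambda t, c=c: f(t) + c * (t - 1.0)   # equivalent representative f + c(t-1)
    assert abs(d_f(f, p, q) - d_f(g, p, q)) < 1e-12
```

Since every pairwise mutual information is a divergence D_f(P_{X_i X_j} || P_{X_i} P_{X_j}), the same cancellation carries over entrywise to the matrix M^{(f)}.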
Circularity Check
No circularity: derivation rests on external BGKP theorem and explicit replica/mixture constructions
Full rationale
The necessity direction invokes the external Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem on biased three-point kernels to extract the local Taylor coefficients a_m >=0 at t=1; sufficiency is shown by direct construction via monomial replica embeddings closed under nonnegative mixtures. Neither step reduces the claimed iff characterization to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation whose content is merely the target result. The local-to-global analyticity bootstrap is presented as a separate analytic continuation argument rather than an identity by construction. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: f is a convex function from (0, infinity) to the reals with f(1) = 0 and finite f(0)
- standard math: the Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem holds and applies to the extracted local Taylor cone
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean; Foundation/AlphaCoordinateFixation.lean
washburn_uniqueness_aczel; costAlphaLog_high_calibrated_iff; J_uniquely_calibrated_via_higher_derivative
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
f(t) = sum_{m>=2} a_m (t-1)^m with a_m >=0 ... iff M^{(f)} PSD (Thm II.1); cosh(t-1)-1 example
-
IndisputableMonolith/Foundation/BranchSelection.lean
branch_selection; RCLCombiner_isCoupling_iff
refines: relation between the paper passage and the cited Recognition theorem.
replica embedding turns monomial (t-1)^m into Gram matrices; nonnegative mixtures preserve PSD
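The cosh(t−1) − 1 generator cited above is a clean instance of such a mixture: it equals Σ_{even m ≥ 2} (t−1)^m / m!, a nonnegative combination of monomial generators that converges for every t > 0, unlike the KL series. A quick numerical confirmation of that standard series identity (not the paper's code):

```python
import math

def bar_f_cosh(t):
    # qualifying generator: cosh(t - 1) - 1
    return math.cosh(t - 1.0) - 1.0

def partial_sum(t, terms=60):
    # sum over even m >= 2 of (t-1)^m / m!  -- every coefficient is >= 0
    return sum((t - 1.0) ** m / math.factorial(m) for m in range(2, terms, 2))

# agreement far outside (0, 2), where the KL expansion already diverges
for t in (0.01, 1.5, 10.0):
    assert abs(partial_sum(t) - bar_f_cosh(t)) <= 1e-9 * max(1.0, bar_f_cosh(t))
```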
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The mutual information: detecting and evaluating dependencies between variables,
R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, “The mutual information: detecting and evaluating dependencies between variables,” Bioinformatics, vol. 18, no. suppl 2, pp. S231–S240, 2002
work page 2002
-
[2]
Detecting novel associations in large data sets,
D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011
work page 2011
-
[3]
G. Ver Steeg and A. Galstyan, “The information sieve,” in International Conference on Machine Learning. PMLR, 2015, pp. 164–172
work page 2015
-
[4]
How transformers learn causal structure with gradient descent,
E. Nichani, A. Damian, and J. D. Lee, “How transformers learn causal structure with gradient descent,” ArXiv, vol. abs/2402.14735, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267782571
-
[5]
Mutual information matrices are not always positive semidefinite,
S. K. Jakobsen, “Mutual information matrices are not always positive semidefinite,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2694–2696, 2014
work page 2014
-
[6]
A new metric for probability distributions,
D. M. Endres and J. E. Schindelin, “A new metric for probability distributions,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003
work page 2003
-
[7]
Properties of classical and quantum Jensen-Shannon divergence,
J. Briët and P. Harremoës, “Properties of classical and quantum Jensen-Shannon divergence,” Physical Review A: Atomic, Molecular, and Optical Physics, vol. 79, no. 5, p. 052311, 2009
work page 2009
-
[8]
Information-type measures of difference of probability distributions and indirect observations,
I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967
work page 1967
-
[9]
Information theory and statistics: A tutorial,
I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004
work page 2004
-
[10]
Positive definite functions on spheres,
I. J. Schoenberg, “Positive definite functions on spheres,” 1942
work page 1942
-
[11]
Harmonic analysis on semigroups: Theory of positive definite and related functions,
C. Berg, J. P. R. Christensen, and P. Ressel, “Harmonic analysis on semigroups: Theory of positive definite and related functions,” Graduate Texts in Mathematics, vol. 100, 1984
work page 1984
-
[12]
Multinomial goodness-of-fit tests,
N. Cressie and T. R. Read, “Multinomial goodness-of-fit tests,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 46, no. 3, pp. 440–464, 1984
work page 1984
discussion (0)