Recognition: 2 theorem links
· Lean Theorem
A Global Characterization of f-Divergences Yielding PSD Mutual-Information Matrices
Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3
The pith
Mutual-information matrices from f-divergences are positive semidefinite for all finite alphabets precisely when the normalized generator expands as a power series with nonnegative coefficients that converges on all positive reals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given convex finite generators f with f(1) = 0 and finite f(0), the matrix whose (i,j) entry is the f-mutual information between X_i and X_j is positive semidefinite for every finite-alphabet family exactly when the normalized representative bar f admits the expansion sum_{m ≥ 2} a_m (t-1)^m with a_m ≥ 0 and the series converges on all of (0, ∞).
What carries the argument
The globally convergent power-series expansion of the normalized generator bar f(t) with nonnegative coefficients a_m.
If this is right
- Chi-squared divergence satisfies the condition and therefore always yields PSD matrices.
- Shannon mutual information and Jensen-Shannon divergence fail the condition and can produce non-PSD matrices.
- Total variation and ReLU divergences are excluded because they lack the required global analyticity.
- Any nonnegative mixture of qualifying monomial generators also produces PSD matrices.
- The kernel property on the variable indices is independent of whether the divergence itself is a metric.
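Why Shannon mutual information fails the test can be read off the generator f(t) = t log t directly: subtracting the linear part leaves bar f(t) = t log t − (t − 1), whose Taylor coefficients at t = 1 are a_m = (−1)^m / (m(m−1)) and therefore alternate in sign. A small sketch of that standard expansion (not code from the paper; helper names are ours):

```python
import math

# Normalized KL generator (Shannon MI): bar_f(t) = t*log(t) - (t - 1).
# Its Taylor expansion at t = 1 is sum_{m>=2} (-1)^m (t-1)^m / (m(m-1)),
# so a_3 = -1/6 < 0 and the nonnegativity condition fails. The series
# also converges only for t in (0, 2], not on all of (0, inf).
def a_kl(m):
    return (-1) ** m / (m * (m - 1))

def bar_f_kl(t):
    return t * math.log(t) - (t - 1)

assert a_kl(2) > 0 and a_kl(3) < 0           # mixed signs: condition violated

t = 1.5                                      # inside the radius of convergence
partial = sum(a_kl(m) * (t - 1) ** m for m in range(2, 30))
assert abs(partial - bar_f_kl(t)) < 1e-8     # series matches bar_f near t = 1
```

By contrast, the χ² generator (t − 1)² has the single coefficient a_2 = 1 ≥ 0, trivially convergent everywhere, which is the sense in which it "satisfies the condition" above.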
Where Pith is reading between the lines
- The same power-series condition may characterize PSD kernels when the alphabets are continuous under suitable integrability.
- The result could guide construction of new divergence-based similarity measures that automatically produce PSD Gram matrices for clustering.
- It isolates an algebraic positivity requirement separate from the usual convexity used to define divergences.
Load-bearing premise
The local nonnegativity of Taylor coefficients extracted at t=1 via three-point kernels extends to a global power series representation valid for all finite alphabets.
What would settle it
An f whose generator has a power series with at least one negative coefficient yet still produces a PSD mutual-information matrix for every finite-alphabet collection, or conversely an analytic generator with all nonnegative coefficients that yields a non-PSD matrix for some collection.
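A brute-force version of this probe is easy to set up on the χ² side: sample random three-variable joint distributions, form the matrix of pairwise χ²-mutual informations, and test positive semidefiniteness via principal minors. This is only an illustrative sketch (helper names are ours, and a random search is far weaker than the adversarial construction the question asks for); per the claimed characterization, the χ² matrices should pass every time:

```python
import random

def chi2_mi(pxy, px, py):
    # I_f(X;Y) for f(t) = (t-1)^2:
    #   sum_{a,b} px(a) py(b) * (pxy(a,b) / (px(a) py(b)) - 1)^2
    return sum(px[a] * py[b] * (pxy[a][b] / (px[a] * py[b]) - 1.0) ** 2
               for a in range(2) for b in range(2))

def mi_matrix(joint):
    # joint: {(x0, x1, x2): prob} over three binary variables
    def marg1(i):
        v = [0.0, 0.0]
        for x, p in joint.items():
            v[x[i]] += p
        return v
    def marg2(i, j):
        m = [[0.0, 0.0], [0.0, 0.0]]
        for x, p in joint.items():
            m[x[i]][x[j]] += p
        return m
    return [[chi2_mi(marg2(i, j), marg1(i), marg1(j)) for j in range(3)]
            for i in range(3)]

def is_psd3(M, tol=1e-9):
    # a symmetric 3x3 matrix is PSD iff every principal minor is nonnegative
    minor2 = lambda i, j: M[i][i] * M[j][j] - M[i][j] * M[j][i]
    det3 = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    return (all(M[i][i] >= -tol for i in range(3))
            and all(minor2(i, j) >= -tol for i, j in ((0, 1), (0, 2), (1, 2)))
            and det3 >= -tol)

random.seed(0)
for _ in range(200):
    w = [random.random() + 0.01 for _ in range(8)]
    s = sum(w)
    joint = {(a, b, c): w[4 * a + 2 * b + c] / s
             for a in range(2) for b in range(2) for c in range(2)}
    assert is_psd3(mi_matrix(joint))  # chi^2 qualifies: no random family fails
```

Swapping in the Shannon generator t log t here would only refute the claim if the sampler happened to hit a bad family; Jakobsen's counterexample (reference [5]) shows such families exist but does not make them typical.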
Original abstract
Given $n$ random variables, when does the matrix of pairwise $f$-mutual informations define a PSD kernel over variables? For convex finite generators $f:(0,\infty)\to\mathbb{R}$ with $f(1)=0$ and finite boundary value $f(0)$, we give a closed characterization up to linear transformation $f\sim f+c(t-1)$, which leaves every $f$-divergence and every $f$-mutual-information matrix unchanged. The matrix $M^{(f)}_{ij}:=I_f(X_i;X_j)$ is PSD for every finite-alphabet family if and only if the normalized representative has a globally convergent expansion $\bar f(t)=\sum_{m\ge2}a_m(t-1)^m$, with $a_m\ge0$, on all of $(0,\infty)$. Sufficiency follows from a replica embedding for monomial generators plus closure under nonnegative mixtures. Necessity first extracts the local Taylor cone at $1$ using biased three-point kernels $H_a$, the Belton--Guillot--Khare--Putinar (BGKP) low-rank Hankel positivity-preserver theorem, and then bootstraps analyticity to the divergence. This is a kernel characterization problem, not a metric one: PSD of the variable-indexed matrix is distinct from Hilbertian properties of divergences between distributions. The result explains why Shannon MI and Jensen--Shannon fail, why $\chi^2$ succeeds, and why non-analytic divergences such as total variation and ReLU are excluded.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, for convex f:(0,∞)→ℝ with f(1)=0 and finite f(0), the matrix M^{(f)}_{ij}:=I_f(X_i;X_j) is positive semidefinite for every finite-alphabet family of random variables if and only if the normalized representative admits the globally convergent expansion bar f(t)=∑_{m≥2} a_m (t-1)^m with a_m≥0 on all of (0,∞). Sufficiency is obtained by replica embeddings of monomial generators together with closure under nonnegative mixtures; necessity extracts the local Taylor cone at t=1 via biased three-point kernels H_a and the BGKP low-rank Hankel positivity-preserver theorem, then bootstraps to global analyticity.
Significance. If the characterization holds, it supplies a precise kernel-theoretic criterion distinguishing f-divergences that induce PSD mutual-information matrices (e.g., χ²) from those that do not (Shannon MI, total variation, ReLU). The replica-embedding argument and the application of the BGKP theorem constitute a technically novel route to this global characterization, separate from metric or Hilbertian properties of the divergences themselves.
major comments (2)
- [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.
- [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.
minor comments (2)
- [§2] Notation for the normalized representative bar f is introduced without an explicit formula relating it to the original f; a one-line definition would improve readability.
- [abstract] The abstract states that the result explains why Jensen–Shannon fails, but the explicit counter-example family of random variables is not displayed; adding a short numerical illustration would help readers verify the claim.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major points below and will incorporate clarifications in the revised manuscript.
Point-by-point responses
-
Referee: [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.
Authors: The bootstrap in §4 proceeds by using the PSD condition on arbitrarily large finite alphabets to construct sequences of three-point kernels that probe the function at any point t>0. Convexity ensures that any violation of global nonnegativity of coefficients or failure of convergence would produce a negative eigenvalue for a suitably chosen distribution, contradicting the hypothesis. Thus the local cone condition plus convexity and the global PSD requirement together force analyticity on (0,∞) without extra regularity assumptions. We will add a short clarifying paragraph after the application of the BGKP theorem to make this implication explicit. revision: yes
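The three-point kernels in this step can be evaluated explicitly. The paper's appendix records H_{1/3}(z) = (4/9) f(1 + (9/16) z) + (1/9) f(1 + (9/4) z) + (4/9) f(1 − (9/8) z) and checks that it collapses to |z|/2 for both total variation and ReLU, which is how those non-analytic generators are excluded. A sketch reproducing that computation (z is kept small enough that every argument of f stays in (0, ∞)):

```python
# H_{1/3} kernel from the paper's appendix: f sampled at three biased points
# around t = 1 with weights 4/9, 1/9, 4/9.
def H_13(f, z):
    return ((4 / 9) * f(1 + (9 / 16) * z)
            + (1 / 9) * f(1 + (9 / 4) * z)
            + (4 / 9) * f(1 - (9 / 8) * z))

f_tvd = lambda t: abs(t - 1.0) / 2.0    # total variation generator
f_relu = lambda t: max(t - 1.0, 0.0)    # ReLU generator

# both non-analytic generators reduce H_{1/3} to the same |z|/2 profile
for z in (-0.4, -0.1, 0.0, 0.25, 0.8):
    assert abs(H_13(f_tvd, z) - abs(z) / 2.0) < 1e-12
    assert abs(H_13(f_relu, z) - abs(z) / 2.0) < 1e-12
```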
-
Referee: [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.
Authors: We agree that an explicit verification strengthens the reduction. The linear term c(t−1) contributes zero to every f-divergence because ∫(t−1)dQ=0 whenever Q is a probability measure; consequently it vanishes in the definition I_f(X_i;X_j)=D_f(P_{X_i X_j}||P_{X_i}P_{X_j}). We will insert a one-paragraph lemma in §2 containing this short calculation. revision: yes
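The promised lemma can also be checked numerically: for any full-support P and Q, the added term contributes c · Σ_x q(x)(p(x)/q(x) − 1) = c(1 − 1) = 0 to D_f(P||Q). A minimal sketch (our helper names, positive masses assumed):

```python
import random

def d_f(f, p, q):
    # D_f(P || Q) = sum_x q(x) f(p(x)/q(x)); full support assumed
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
p = normalize([random.random() + 0.1 for _ in range(6)])
q = normalize([random.random() + 0.1 for _ in range(6)])

f = lambda t: (t - 1.0) ** 2                  # chi^2 generator
for c in (-3.0, 0.7, 42.0):
    g = lambda t, c=c: f(t) + c * (t - 1.0)   # equivalent representative f + c(t-1)
    assert abs(d_f(f, p, q) - d_f(g, p, q)) < 1e-12
```

Since every pairwise mutual information is a divergence D_f(P_{X_i X_j} || P_{X_i} P_{X_j}), the same cancellation carries over entrywise to the matrix M^{(f)}.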
Circularity Check
No circularity: derivation rests on external BGKP theorem and explicit replica/mixture constructions
Full rationale
The necessity direction invokes the external Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem on biased three-point kernels to extract the local Taylor coefficients a_m >=0 at t=1; sufficiency is shown by direct construction via monomial replica embeddings closed under nonnegative mixtures. Neither step reduces the claimed iff characterization to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation whose content is merely the target result. The local-to-global analyticity bootstrap is presented as a separate analytic continuation argument rather than an identity by construction. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: f is a convex function from (0, infinity) to the reals with f(1) = 0 and finite f(0)
- standard math: the Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem holds and applies to the extracted local Taylor cone
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean; Foundation/AlphaCoordinateFixation.lean
washburn_uniqueness_aczel; costAlphaLog_high_calibrated_iff; J_uniquely_calibrated_via_higher_derivative
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
f(t) = sum_{m>=2} a_m (t-1)^m with a_m >=0 ... iff M^{(f)} PSD (Thm II.1); cosh(t-1)-1 example
-
IndisputableMonolith/Foundation/BranchSelection.lean
branch_selection; RCLCombiner_isCoupling_iff
refines: relation between the paper passage and the cited Recognition theorem.
replica embedding turns monomial (t-1)^m into Gram matrices; nonnegative mixtures preserve PSD
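The cosh(t−1) − 1 generator cited above is a clean instance of such a mixture: it equals Σ_{even m ≥ 2} (t−1)^m / m!, a nonnegative combination of monomial generators that converges for every t > 0, unlike the KL series. A quick numerical confirmation of that standard series identity (not the paper's code):

```python
import math

def bar_f_cosh(t):
    # qualifying generator: cosh(t - 1) - 1
    return math.cosh(t - 1.0) - 1.0

def partial_sum(t, terms=60):
    # sum over even m >= 2 of (t-1)^m / m!  -- every coefficient is >= 0
    return sum((t - 1.0) ** m / math.factorial(m) for m in range(2, terms, 2))

# agreement far outside (0, 2), where the KL expansion already diverges
for t in (0.01, 1.5, 10.0):
    assert abs(partial_sum(t) - bar_f_cosh(t)) <= 1e-9 * max(1.0, bar_f_cosh(t))
```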
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The mutual information: detecting and evaluating dependencies between variables,
R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, “The mutual information: detecting and evaluating dependencies between variables,” Bioinformatics, vol. 18, no. suppl 2, pp. S231–S240, 2002
work page 2002
-
[2]
Detecting novel associations in large data sets,
D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011
work page 2011
-
[3]
G. Ver Steeg and A. Galstyan, “The information sieve,” in International Conference on Machine Learning. PMLR, 2015, pp. 164–172
work page 2015
-
[4]
How transformers learn causal structure with gradient descent,
E. Nichani, A. Damian, and J. D. Lee, “How transformers learn causal structure with gradient descent,” ArXiv, vol. abs/2402.14735, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267782571
-
[5]
Mutual information matrices are not always positive semidefinite,
S. K. Jakobsen, “Mutual information matrices are not always positive semidefinite,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2694–2696, 2014
work page 2014
-
[6]
A new metric for probability distributions,
D. M. Endres and J. E. Schindelin, “A new metric for probability distributions,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003
work page 2003
-
[7]
Properties of classical and quantum Jensen-Shannon divergence,
J. Briët and P. Harremoës, “Properties of classical and quantum Jensen-Shannon divergence,” Physical Review A: Atomic, Molecular, and Optical Physics, vol. 79, no. 5, p. 052311, 2009
work page 2009
-
[8]
Information-type measures of difference of probability distributions and indirect observations,
I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967
work page 1967
-
[9]
Information theory and statistics: A tutorial,
I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004
work page 2004
-
[10]
Positive definite functions on spheres,
I. J. Schoenberg, “Positive definite functions on spheres,” 1942
work page 1942
-
[11]
Harmonic analysis on semigroups: Theory of positive definite and related functions,
C. Berg, J. P. R. Christensen, and P. Ressel, “Harmonic analysis on semigroups: Theory of positive definite and related functions,” Graduate Texts in Mathematics, vol. 100, 1984
work page 1984
-
[12]
Multinomial goodness-of-fit tests,
N. Cressie and T. R. Read, “Multinomial goodness-of-fit tests,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 46, no. 3, pp. 440–464, 1984
work page 1984
discussion (0)