pith. machine review for the scientific record.

arxiv: 2601.08929 · v3 · submitted 2026-01-13 · 💻 cs.IT · math.IT

Recognition: 2 theorem links · Lean Theorem

A Global Characterization of f-Divergences Yielding PSD Mutual-Information Matrices

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords f-divergences · mutual information matrices · positive semidefinite kernels · power series expansions · information theory · convex generators · analytic functions

The pith

Mutual-information matrices built from f-divergences are positive semidefinite for all finite alphabets precisely when the normalized generator expands around t = 1 as a power series in (t−1) with nonnegative coefficients, converging on all positive reals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper determines the exact class of f-divergences for which the matrix of pairwise f-mutual informations is always positive semidefinite, no matter which finite-alphabet random variables are chosen. This holds if and only if the generator, suitably normalized, can be written as an infinite sum of nonnegative multiples of powers of (t−1) that converges for every positive t. The sufficiency direction uses replica embeddings of monomial generators and closure under nonnegative mixtures. The necessity direction first applies biased three-point kernels together with a theorem on low-rank Hankel matrices to force the Taylor coefficients at t = 1 to be nonnegative, then shows the series must converge globally. The characterization explains the success of chi-squared divergence and the failure of total variation and Shannon mutual information as kernels.

Core claim

Given convex finite generators f with f(1) = 0 and finite f(0), the matrix whose (i, j) entry is the f-mutual information between X_i and X_j is positive semidefinite for every finite-alphabet family exactly when the normalized representative f̄ admits the expansion f̄(t) = ∑_{m≥2} a_m (t−1)^m with a_m ≥ 0 and the series converging on all of (0, ∞).
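Since the χ² generator (t−1)² satisfies the expansion condition trivially, the claim predicts that every χ²-mutual-information matrix is PSD. The following sketch is an illustration, not the paper's construction: it samples random joint pmfs on three binary variables, builds the pairwise χ²-MI matrix, and inspects its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi2_mi(pair):
    """Chi-squared mutual information from a strictly positive pairwise pmf."""
    px = pair.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = pair.sum(axis=0, keepdims=True)   # marginal of the column variable
    prod = px * py                         # product-of-marginals pmf
    return float(((pair - prod) ** 2 / prod).sum())

n = 3                                      # three binary random variables
min_eig = np.inf
for _ in range(200):
    # random joint pmf on {0,1}^3 (Dirichlet samples are positive a.s.)
    p = rng.dirichlet(np.ones(2 ** n)).reshape((2,) * n)
    M = np.zeros((n, n))
    for i in range(n):
        marg = p.sum(axis=tuple(k for k in range(n) if k != i))
        M[i, i] = chi2_mi(np.diag(marg))   # joint of (X_i, X_i) is diagonal
        for j in range(i + 1, n):
            pair = p.sum(axis=tuple(k for k in range(n) if k not in (i, j)))
            M[i, j] = M[j, i] = chi2_mi(pair)
    min_eig = min(min_eig, float(np.linalg.eigvalsh(M).min()))

print(f"smallest eigenvalue over 200 random joints: {min_eig:.3e}")
```

A random search like this cannot prove the theorem, of course; it only fails to find a counterexample for χ², whereas by Jakobsen's result the same search run with the Shannon generator t·log t admits families with a genuinely negative eigenvalue.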

What carries the argument

The globally convergent power-series expansion f̄(t) = ∑_{m≥2} a_m (t−1)^m of the normalized generator, with nonnegative coefficients a_m.

If this is right

  • Chi-squared divergence satisfies the condition and therefore always yields PSD matrices.
  • Shannon mutual information and Jensen-Shannon divergence fail the condition and can produce non-PSD matrices.
  • Total variation and ReLU divergences are excluded because they lack the required global analyticity.
  • Any nonnegative mixture of qualifying monomial generators also produces PSD matrices.
  • The kernel property on the variable indices is independent of whether the divergence itself is a metric.
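The failure cases in the list above can be read off the Taylor coefficients at t = 1. A minimal stdlib sketch (not from the paper) expands the KL generator t·log t as (1+u)·log(1+u) with u = t−1: the coefficients alternate in sign as (−1)^m/(m(m−1)), so Shannon mutual information violates the nonnegativity condition, while the χ² generator (t−1)² has the single coefficient a_2 = 1.

```python
from fractions import Fraction

def kl_taylor_coeffs(order=6):
    """Coefficients a_m, m = 2..order, of (1+u)*log(1+u) around u = 0,
    i.e. of the KL generator t*log(t) around t = 1."""
    # log(1+u) = sum_{k>=1} (-1)^(k+1) u^k / k
    log_c = [Fraction(0)] + [Fraction((-1) ** (k + 1), k)
                             for k in range(1, order + 1)]
    # multiplying by (1+u) shifts: coefficient of u^m is log_c[m] + log_c[m-1]
    return [log_c[m] + log_c[m - 1] for m in range(2, order + 1)]

coeffs = kl_taylor_coeffs()
print(coeffs)          # a_m = (-1)^m/(m(m-1)): 1/2, -1/6, 1/12, -1/20, 1/30
assert coeffs[1] < 0   # a_3 = -1/6 < 0: KL fails the nonnegativity condition
```

Total variation and ReLU fail for a different reason noted in the abstract: they are not analytic at t = 1, so no such expansion exists at all.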

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same power-series condition may characterize PSD kernels when the alphabets are continuous under suitable integrability.
  • The result could guide construction of new divergence-based similarity measures that automatically produce PSD Gram matrices for clustering.
  • It isolates an algebraic positivity requirement separate from the usual convexity used to define divergences.

Load-bearing premise

The local nonnegativity of Taylor coefficients extracted at t=1 via three-point kernels extends to a global power series representation valid for all finite alphabets.

What would settle it

A counterexample in either direction: a generator whose power series at t = 1 has at least one negative coefficient yet produces a PSD mutual-information matrix for every finite-alphabet collection, or conversely an analytic generator with all nonnegative coefficients that yields a non-PSD matrix for some collection.

read the original abstract

Given $n$ random variables, when does the matrix of pairwise $f$-mutual informations define a PSD kernel over variables? For convex finite generators $f:(0,\infty)\to\mathbb{R}$ with $f(1)=0$ and finite boundary value $f(0)$, we give a closed characterization up to linear transformation $f\sim f+c(t-1)$, which leaves every $f$-divergence and every $f$-mutual-information matrix unchanged. The matrix $M^{(f)}_{ij}:=I_f(X_i;X_j)$ is PSD for every finite-alphabet family if and only if the normalized representative has a globally convergent expansion $\bar f(t)=\sum_{m\ge2}a_m(t-1)^m$, with $a_m\ge0$, on all of $(0,\infty)$. Sufficiency follows from a replica embedding for monomial generators plus closure under nonnegative mixtures. Necessity first extracts the local Taylor cone at $1$ using biased three-point kernels $H_a$, the Belton--Guillot--Khare--Putinar (BGKP) low-rank Hankel positivity-preserver theorem, and then bootstraps analyticity to the divergence. This is a kernel characterization problem, not a metric one: PSD of the variable-indexed matrix is distinct from Hilbertian properties of divergences between distributions. The result explains why Shannon MI and Jensen--Shannon fail, why $\chi^2$ succeeds, and why non-analytic divergences such as total variation and ReLU are excluded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that, for convex f:(0,∞)→ℝ with f(1)=0 and finite f(0), the matrix M^{(f)}_{ij}:=I_f(X_i;X_j) is positive semidefinite for every finite-alphabet family of random variables if and only if the normalized representative admits the globally convergent expansion bar f(t)=∑_{m≥2} a_m (t-1)^m with a_m≥0 on all of (0,∞). Sufficiency is obtained by replica embeddings of monomial generators together with closure under nonnegative mixtures; necessity extracts the local Taylor cone at t=1 via biased three-point kernels H_a and the BGKP low-rank Hankel positivity-preserver theorem, then bootstraps to global analyticity.

Significance. If the characterization holds, it supplies a precise kernel-theoretic criterion distinguishing f-divergences that induce PSD mutual-information matrices (e.g., χ²) from those that do not (Shannon MI, total variation, ReLU). The replica-embedding argument and the application of the BGKP theorem constitute a technically novel route to this global characterization, separate from metric or Hilbertian properties of the divergences themselves.

major comments (2)
  1. [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.
  2. [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.
minor comments (2)
  1. [§2] Notation for the normalized representative bar f is introduced without an explicit formula relating it to the original f; a one-line definition would improve readability.
  2. [abstract] The abstract states that the result explains why Jensen–Shannon fails, but the explicit counter-example family of random variables is not displayed; adding a short numerical illustration would help readers verify the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major points below and will incorporate clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [necessity bootstrap (abstract, §4)] The necessity argument (abstract and §4 outline) extracts a_m≥0 locally via BGKP on three-point kernels but then bootstraps to global analyticity and convergence on (0,∞). It is not shown that the PSD condition on finite alphabets alone forces the required analytic continuation without an additional regularity assumption beyond convexity, f(1)=0 and finite f(0); non-analytic convex functions satisfying the local cone could therefore satisfy the hypothesis yet violate the claimed global expansion.

    Authors: The bootstrap in §4 proceeds by using the PSD condition on arbitrarily large finite alphabets to construct sequences of three-point kernels that probe the function at any point t>0. Convexity ensures that any violation of global nonnegativity of coefficients or failure of convergence would produce a negative eigenvalue for a suitably chosen distribution, contradicting the hypothesis. Thus the local cone condition plus convexity and the global PSD requirement together force analyticity on (0,∞) without extra regularity assumptions. We will add a short clarifying paragraph after the application of the BGKP theorem to make this implication explicit. revision: yes

  2. Referee: [normalization (abstract, §2)] The statement that the characterization is “up to linear transformation f∼f+c(t−1)” is used to normalize the representative, yet the proof that this equivalence class leaves every I_f matrix invariant is only sketched; a short explicit verification that the added term contributes zero to all pairwise mutual informations would strengthen the reduction step.

    Authors: We agree that an explicit verification strengthens the reduction. The linear term c(t−1) contributes zero to every f-divergence because ∫(t−1)dQ=0 whenever Q is a probability measure; consequently it vanishes in the definition I_f(X_i;X_j)=D_f(P_{X_i X_j}||P_{X_i}P_{X_j}). We will insert a one-paragraph lemma in §2 containing this short calculation. revision: yes
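The invariance computation in the authors' second response can be checked directly: for pmfs P and Q, the shifted generator f + c(t−1) adds c·∑_x q(x)(p(x)/q(x) − 1) = c(1 − 1) = 0 to D_f(P‖Q). A minimal numerical sketch (illustrative, not the paper's lemma):

```python
import numpy as np

rng = np.random.default_rng(1)

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for strictly positive pmfs."""
    return float((q * f(p / q)).sum())

f_kl = lambda t: t * np.log(t)               # any convex generator works here
p, q = rng.dirichlet(np.ones(5), size=2)     # two random pmfs on 5 symbols

base = f_divergence(f_kl, p, q)
max_shift = 0.0
for c in (-3.0, 0.7, 42.0):
    # shifted generator f(t) + c*(t-1); the linear term should vanish in D_f
    shifted = f_divergence(lambda t: f_kl(t) + c * (t - 1), p, q)
    max_shift = max(max_shift, abs(shifted - base))

print(f"max |D_(f + c(t-1)) - D_f| over shifts: {max_shift:.2e}")
```

The residual is at floating-point precision for every c, consistent with the claimed equivalence class f ∼ f + c(t−1).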

Circularity Check

0 steps flagged

No circularity: derivation rests on external BGKP theorem and explicit replica/mixture constructions

full rationale

The necessity direction invokes the external Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem on biased three-point kernels to extract the local Taylor coefficients a_m ≥ 0 at t = 1; sufficiency is shown by direct construction via monomial replica embeddings closed under nonnegative mixtures. Neither step reduces the claimed iff characterization to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation whose content is merely the target result. The local-to-global analyticity bootstrap is presented as a separate analytic continuation argument rather than an identity by construction. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result depends on standard assumptions for f-divergences and an established theorem from functional analysis; no free parameters or invented entities are needed beyond these.

axioms (2)
  • domain assumption f is a convex function from (0, ∞) to ℝ with f(1) = 0 and finite f(0)
    Required for f-divergences to be properly defined and finite.
  • standard math The Belton-Guillot-Khare-Putinar low-rank Hankel positivity-preserver theorem holds and applies to the extracted local Taylor cone
    Used to establish the necessity of nonnegative coefficients in the expansion.

pith-pipeline@v0.9.0 · 5573 in / 1373 out tokens · 40065 ms · 2026-05-16T14:19:49.410149+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    The mutual information: detecting and evaluating dependencies between variables,

    R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, “The mutual information: detecting and evaluating dependencies between variables,” Bioinformatics, vol. 18, no. suppl 2, pp. S231–S240, 2002

  2. [2]

    Detecting novel associations in large data sets,

    D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011

  3. [3]

    The information sieve,

    G. Ver Steeg and A. Galstyan, “The information sieve,” in International Conference on Machine Learning. PMLR, 2015, pp. 164–172

  4. [4]

    How transformers learn causal structure with gradient descent,

    E. Nichani, A. Damian, and J. D. Lee, “How transformers learn causal structure with gradient descent,” arXiv, vol. abs/2402.14735, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267782571

  5. [5]

    Mutual information matrices are not always positive semidefinite,

    S. K. Jakobsen, “Mutual information matrices are not always positive semidefinite,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2694–2696, 2014

  6. [6]

    A new metric for probability distributions,

    D. M. Endres and J. E. Schindelin, “A new metric for probability distributions,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003

  7. [7]

    Properties of classical and quantum Jensen-Shannon divergence,

    J. Briët and P. Harremoës, “Properties of classical and quantum Jensen-Shannon divergence,” Physical Review A: Atomic, Molecular, and Optical Physics, vol. 79, no. 5, p. 052311, 2009

  8. [8]

    Information-type measures of difference of probability distributions and indirect observations,

    I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967

  9. [9]

    Information theory and statistics: A tutorial,

    I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004

  10. [10]

    Positive definite functions on spheres,

    I. J. Schoenberg, “Positive definite functions on spheres,” 1942

  11. [11]

    Harmonic analysis on semigroups: Theory of positive definite and related functions,

    C. Berg, J. P. R. Christensen, and P. Ressel, “Harmonic analysis on semigroups: Theory of positive definite and related functions,” Graduate Texts in Mathematics, vol. 100, 1984

  12. [12]

    Multinomial goodness-of-fit tests,

    N. Cressie and T. R. Read, “Multinomial goodness-of-fit tests,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 46, no. 3, pp. 440–464, 1984
