arxiv: 2509.24728 · v2 · pith:A5K2HDWUnew · submitted 2025-09-29 · 💻 cs.LG · stat.ML

Beyond Softmax: A Natural Parameterization for Categorical Random Variables

Pith reviewed 2026-05-18 11:49 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords categorical variables · softmax · catnat · Fisher Information Matrix · gradient descent · variational autoencoders · reinforcement learning · graph structure learning

The pith

The catnat function replaces softmax with hierarchical binary splits to produce a diagonal Fisher Information Matrix that simplifies gradient descent for categorical variables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the standard softmax parameterization for categorical random variables creates a non-diagonal Fisher Information Matrix, which complicates gradient-based learning in deep models. It introduces the catnat function built from a sequence of hierarchical binary splits as a replacement. This construction yields a diagonal Fisher matrix whose off-diagonal zeros persist under reparameterization. Experiments across graph structure learning, variational autoencoders, and reinforcement learning show faster learning and higher test performance. The approach integrates directly into existing codebases and works alongside common stabilization techniques.
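For reference, the non-diagonal structure attributed to softmax follows from a standard computation. The block below is not quoted from the paper; the symbols $\theta$ (logits), $p$ (probabilities), $K$ (number of categories), and $F$ (Fisher Information Matrix) are introduced here for illustration only.

```latex
% Softmax over K logits \theta_1, ..., \theta_K
p_k = \frac{e^{\theta_k}}{\sum_{j=1}^{K} e^{\theta_j}}, \qquad
\frac{\partial \log p_k}{\partial \theta_i} = \delta_{ki} - p_i

% Fisher Information Matrix with respect to the logits
F_{ij} = \sum_{k=1}^{K} p_k \,(\delta_{ki} - p_i)(\delta_{kj} - p_j)
       = p_i \,\delta_{ij} - p_i p_j
\quad\Longleftrightarrow\quad
F = \operatorname{diag}(p) - p\,p^{\top}
```

Every off-diagonal entry equals $-p_i p_j$, which is non-zero whenever both categories have positive probability.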

Core claim

We replace the softmax with the catnat function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. Experiments including graph structure learning, variational autoencoders, and reinforcement learning empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance.

What carries the argument

The catnat function, which parameterizes categorical distributions via a sequence of hierarchical binary splits and thereby produces a diagonal Fisher Information Matrix.
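To make the idea of hierarchical binary splits concrete, here is a minimal sketch under a loudly stated assumption: the splits are arranged as a simple chain (stick-breaking style), each governed by one sigmoid. The paper's actual tree layout may differ, and the names `sigmoid` and `chain_split_probs` are hypothetical, not taken from the source.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def chain_split_probs(theta):
    """Map K-1 real parameters to K category probabilities.

    ASSUMED layout (illustrative, not necessarily the paper's catnat):
    split i selects category i with probability sigmoid(theta[i]);
    otherwise the remaining mass is passed on to split i+1.
    The last category receives whatever mass is left.
    """
    theta = np.asarray(theta, dtype=float)
    stop = sigmoid(theta)  # P(stop at split i | split i was reached)
    # reach[i] = P(split i is reached); the final entry is the leftover mass.
    reach = np.concatenate(([1.0], np.cumprod(1.0 - stop)))
    probs = reach.copy()
    probs[:-1] *= stop  # the first K-1 categories stop at their own split
    return probs


# Three splits at theta = 0 halve the remaining mass each time:
# probabilities 0.5, 0.25, 0.125, 0.125
print(chain_split_probs(np.zeros(3)))
```

Each parameter controls exactly one binary decision, so K categories need K-1 parameters, one fewer than the K logits of softmax.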

If this is right

  • Gradient descent becomes more efficient because the Fisher Information Matrix is diagonal.
  • Models reach higher test performance in tasks with categorical latent variables.
  • The parameterization integrates into existing architectures and remains compatible with standard training stabilization techniques.
  • Training converges faster in graph structure learning, VAEs, and reinforcement learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The diagonal structure could enable simpler curvature estimates when analyzing optimization in other discrete-variable settings.
  • This binary-split construction might extend to parameterizations for other discrete distributions beyond the categorical case.
  • In reinforcement learning the improved gradients could reduce variance in policy updates for large action spaces.

Load-bearing premise

The off-diagonal zeros in the Fisher Information Matrix from the hierarchical binary splits remain zero under the reparameterizations and training dynamics used in the target models.

What would settle it

Computing the Fisher Information Matrix from gradients during training with catnat and finding non-zero off-diagonal entries would show that the diagonal property does not hold in practice.
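A check of this kind can be sketched numerically. The example below is an illustration under assumptions, not the paper's procedure: it uses a chain-of-sigmoids layout for the binary splits (the paper's tree may differ), takes the exact expectation over categories rather than sampled training gradients, and approximates derivatives by central finite differences. The names `fisher_matrix`, `max_off_diagonal`, `chain_split_probs`, and `softmax_probs` are hypothetical.

```python
import numpy as np


def softmax_probs(theta):
    """Standard softmax over K logits (shifted for numerical stability)."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def chain_split_probs(theta):
    """ASSUMED chain of binary splits: K-1 parameters -> K probabilities."""
    stop = 1.0 / (1.0 + np.exp(-theta))
    reach = np.concatenate(([1.0], np.cumprod(1.0 - stop)))
    probs = reach.copy()
    probs[:-1] *= stop
    return probs


def fisher_matrix(prob_fn, theta, eps=1e-5):
    """F = sum_k p_k * grad(log p_k) grad(log p_k)^T.

    The expectation over categories is exact; the gradients of log p_k
    are approximated by central finite differences with step eps.
    """
    theta = np.asarray(theta, dtype=float)
    p = prob_fn(theta)
    jac = np.empty((p.size, theta.size))  # jac[k, i] = d log p_k / d theta_i
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        jac[:, i] = (np.log(prob_fn(theta + step))
                     - np.log(prob_fn(theta - step))) / (2 * eps)
    return jac.T @ (p[:, None] * jac)


def max_off_diagonal(mat):
    """Largest absolute entry outside the main diagonal."""
    return np.abs(mat - np.diag(np.diag(mat))).max()


theta = np.array([0.3, -1.2, 2.0])
# Expected to be near zero (up to finite-difference error) for the chain layout.
print(max_off_diagonal(fisher_matrix(chain_split_probs, theta)))
# Expected to be clearly non-zero for softmax.
print(max_off_diagonal(fisher_matrix(softmax_probs, theta)))
```

For the assumed chain layout, the diagonal entry for split i works out to P(reach split i) · σ(θ_i)(1 − σ(θ_i)). Extending the check to a trained model would mean replacing `prob_fn` with the full map from network weights to probabilities, which is where the referee's concern about dense Jacobians applies.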

Original abstract

Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcement-learning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitous $\textit{softmax}$ function and demonstrate its limitations from an information-geometric perspective; 2) replace the $\textit{softmax}$ with the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance. $\textit{Catnat}$ is simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to the $\textit{softmax}$ function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes replacing the softmax parameterization of categorical distributions with a new 'catnat' function constructed from a sequence of hierarchical binary splits. It claims to prove that this yields a diagonal Fisher Information Matrix, providing advantages for gradient-based optimization, and reports empirical gains in performance and learning efficiency across graph structure learning, variational autoencoders, and reinforcement learning tasks.

Significance. If the diagonal-FIM property is shown to persist under the reparameterizations and sampling procedures used in the target architectures, the work would provide a simple, drop-in improvement to training discrete latent-variable models. The multi-domain experimental evaluation is a positive feature, though the absence of detailed statistical reporting and full derivation details in the current version limits the ability to judge the magnitude of the practical advance.

major comments (2)
  1. [§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.
  2. [§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.
minor comments (2)
  1. The definition and implementation details of the catnat function (including how the binary splits are parameterized) would benefit from an explicit pseudocode listing or additional figure to aid reproducibility.
  2. A brief discussion of compatibility with existing stabilization techniques (e.g., temperature annealing) is mentioned in the abstract but not expanded in the main text; a short paragraph clarifying this point would strengthen the practical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.

    Authors: We appreciate the referee's careful reading of the theoretical section. The proof in §3 demonstrates that the catnat parameterization yields a diagonal Fisher Information Matrix for the categorical distribution with respect to its natural parameters. This property is intrinsic to the parameterization and is intended to improve the geometry of the optimization landscape. In the experimental sections, catnat is substituted for softmax within standard reparameterization frameworks. Although the Jacobians of the sampling estimators are generally dense, the diagonal structure of the base FIM still provides benefits by ensuring that parameter updates are less correlated, leading to more stable and efficient learning. That said, we agree that an explicit verification or bound on the off-diagonal terms in the composed setting would be valuable. In the revised manuscript, we will include a brief discussion in §3 or a new subsection addressing the interaction with reparameterization estimators and why the advantages are expected to persist. revision: partial

  2. Referee: [§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.

    Authors: We acknowledge this limitation in the current presentation of the results. To address it, the revised manuscript will include: (i) explicit description of the train/validation/test splits for the datasets used in the VAE experiments, (ii) details of the hyperparameter search procedure (including the ranges explored and selection criteria), (iii) the number of independent random seeds used for each experiment (we will report results over 5 seeds), and (iv) statistical analysis such as mean and standard deviation across runs, along with significance tests where appropriate. These additions will allow readers to better assess the robustness of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines catnat via hierarchical binary splits and separately proves the resulting diagonal Fisher Information Matrix as a geometric property of that parameterization. This is a standard derivation from the chosen coordinates rather than a self-referential loop or re-derivation by construction. No self-citations, fitted inputs renamed as predictions, or uniqueness theorems imported from prior author work appear in the abstract or described claims. The central result remains an independent information-geometric argument applied to the new function, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the geometric property of the new parameterization and on the assumption that this property survives in the training regimes tested.

axioms (1)
  • domain assumption: Categorical distributions can be equivalently represented via a sequence of hierarchical binary splits.
    Invoked when defining catnat as an alternative to softmax.
invented entities (1)
  • catnat function (no independent evidence)
    purpose: To parameterize categorical random variables with a diagonal Fisher Information Matrix.
    Newly introduced construction; no independent evidence outside the paper's derivation and experiments.

pith-pipeline@v0.9.0 · 5738 in / 1297 out tokens · 46447 ms · 2026-05-18T11:49:36.767000+00:00 · methodology

