Beyond Softmax: A Natural Parameterization for Categorical Random Variables

Alessandro Manenti; Cesare Alippi

arxiv: 2509.24728 · v2 · pith:A5K2HDWUnew · submitted 2025-09-29 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-18 11:49 UTC · model grok-4.3

keywords: categorical variables, softmax, catnat, Fisher Information Matrix, gradient descent, variational autoencoders, reinforcement learning, graph structure learning

The pith

The catnat function replaces softmax with hierarchical binary splits to produce a diagonal Fisher Information Matrix that simplifies gradient descent for categorical variables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the standard softmax parameterization for categorical random variables creates a non-diagonal Fisher Information Matrix, which complicates gradient-based learning in deep models. It introduces the catnat function built from a sequence of hierarchical binary splits as a replacement. This construction yields a diagonal Fisher matrix whose off-diagonal zeros persist under reparameterization. Experiments across graph structure learning, variational autoencoders, and reinforcement learning show faster learning and higher test performance. The approach integrates directly into existing codebases and works alongside common stabilization techniques.

Core claim

We replace the softmax with the catnat function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. Experiments including graph structure learning, variational autoencoders, and reinforcement learning empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance.

What carries the argument

The catnat function, which parameterizes categorical distributions via a sequence of hierarchical binary splits and thereby produces a diagonal Fisher Information Matrix.
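
To make "a sequence of hierarchical binary splits" concrete, here is a minimal sketch. It assumes the simplest possible tree, a chain in which each split either stops at the current category or passes the remaining mass on, with sigmoid gates. The tree shape, the gate function, and the name `stick_breaking_probs` are illustrative assumptions; the paper's catnat definition may differ in all three.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_probs(theta):
    """Map K-1 real parameters to K categorical probabilities.

    Assumption (not taken from the paper): a chain of binary splits
    with sigmoid gates, one gate per parameter.
    """
    theta = np.asarray(theta, dtype=float)
    gate = sigmoid(theta)  # P(stop at split k | split k was reached)
    # P(reach split k): product of (1 - gate) over all earlier splits.
    reach = np.concatenate(([1.0], np.cumprod(1.0 - gate)))
    probs = np.empty(theta.size + 1)
    probs[:-1] = reach[:-1] * gate  # categories that own a gate
    probs[-1] = reach[-1]           # last category takes the leftover mass
    return probs

# Three zero parameters give gates of 0.5 at every split.
probs = stick_breaking_probs([0.0, 0.0, 0.0])
print(probs)  # [0.5 0.25 0.125 0.125]
```

Each parameter here controls exactly one binary decision, which is the structural feature the summary credits for the diagonal Fisher matrix.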

If this is right

Gradient descent becomes more efficient because the Fisher Information Matrix is diagonal.
Models reach higher test performance in tasks with categorical latent variables.
The parameterization integrates into existing architectures without modification to stabilization methods.
Training converges faster in graph structure learning, VAEs, and reinforcement learning.
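
A short derivation may help show where the diagonal structure comes from. The softmax part is standard. The split part assumes a chain of binary splits with sigmoid gates $s_k = \sigma(\theta_k)$ and reach probabilities $r_k$, which is an illustrative choice and not necessarily the tree or gate function used in the paper.

```latex
% Softmax with logits z_1, ..., z_K and p_i = e^{z_i} / \sum_j e^{z_j}:
F^{\mathrm{softmax}}_{ij} = p_i\,\delta_{ij} - p_i p_j
\quad\Rightarrow\quad
F^{\mathrm{softmax}}_{ij} = -p_i p_j \neq 0 \ \text{ for } i \neq j .

% Chain of binary splits (assumed): s_k = \sigma(\theta_k), r_k = \prod_{j<k} (1 - s_j),
% p_k = r_k s_k for k < K, and p_K = r_K.
\frac{\partial \log p_x}{\partial \theta_k}
  = \mathbb{1}[x \ge k]\,\bigl(\mathbb{1}[x = k] - s_k\bigr),
  \qquad k = 1, \dots, K-1 .

% For k < l, reaching split l implies x \neq k, so the k-th score equals -s_k there:
F_{kl}
  = \mathbb{E}\bigl[\mathbb{1}[x \ge l]\,(-s_k)\,\bigl(\mathbb{1}[x = l] - s_l\bigr)\bigr]
  = -s_k\,\bigl(r_l s_l - r_l s_l\bigr) = 0 ,
\qquad
F_{kk} = r_k\, s_k\,(1 - s_k) .
```

Under these assumptions the off-diagonal entries vanish exactly, while the diagonal entries still depend on the earlier splits through $r_k$.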

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The diagonal structure could enable simpler curvature estimates when analyzing optimization in other discrete-variable settings.
This binary-split construction might extend to parameterizations for other discrete distributions beyond the categorical case.
In reinforcement learning the improved gradients could reduce variance in policy updates for large action spaces.

Load-bearing premise

The off-diagonal zeros in the Fisher Information Matrix from the hierarchical binary splits remain zero under the reparameterizations and training dynamics used in the target models.
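
The premise can be stated with the standard transformation rule for the Fisher matrix. The symbols $\phi$, $g$, $J$, and $d_k$ below are introduced only for illustration and do not come from the paper.

```latex
% Reparameterize \theta = g(\phi) with Jacobian J = \partial g / \partial \phi.
F_\phi = J^\top F_\theta\, J ,
\qquad
F_\theta = \mathrm{diag}(d_1, \dots, d_n)
\;\Rightarrow\;
[F_\phi]_{ab} = \sum_{k} d_k\, J_{ka}\, J_{kb} .

% Elementwise g (diagonal J):  [F_\phi]_{ab} = d_a\, J_{aa}^2\, \delta_{ab}, still diagonal.
% Dense J (e.g. \theta = W\phi): the sum over k generally does not vanish for a \neq b.
```

So diagonality survives elementwise maps of the split parameters, but a dense map upstream of them generally reintroduces off-diagonal terms in the upstream coordinates.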

What would settle it

Computing the Fisher Information Matrix from gradients during training with catnat and finding non-zero off-diagonal entries would show that the diagonal property does not hold in practice.
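
A minimal sketch of such a check at the level of the output parameterization only, not of a full training run. It computes the exact Fisher matrix $F = J^\top \mathrm{diag}(1/p)\, J$ from a finite-difference Jacobian. The function `split_probs` is an assumed chain-of-splits stand-in with sigmoid gates, not the paper's catnat implementation, and all function names are hypothetical.

```python
import numpy as np

def softmax_probs(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def split_probs(theta):
    # Assumed stand-in for a hierarchical split parameterization:
    # a chain of binary splits with sigmoid gates.
    theta = np.asarray(theta, dtype=float)
    gate = 1.0 / (1.0 + np.exp(-theta))
    reach = np.concatenate(([1.0], np.cumprod(1.0 - gate)))
    return np.append(reach[:-1] * gate, reach[-1])

def fisher_matrix(prob_fn, params, eps=1e-6):
    """Exact categorical FIM, with the Jacobian dp/dparams
    estimated by central finite differences."""
    params = np.asarray(params, dtype=float)
    p = prob_fn(params)
    jac = np.empty((p.size, params.size))
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        jac[:, i] = (prob_fn(params + step) - prob_fn(params - step)) / (2 * eps)
    # F_ij = sum_x (1 / p_x) * dp_x/dparams_i * dp_x/dparams_j
    return jac.T @ (jac / p[:, None])

def max_off_diagonal(matrix):
    off = matrix - np.diag(np.diag(matrix))
    return np.abs(off).max()

rng = np.random.default_rng(0)
fim_split = fisher_matrix(split_probs, rng.normal(size=4))
fim_softmax = fisher_matrix(softmax_probs, rng.normal(size=5))

print("split   max |off-diagonal|:", max_off_diagonal(fim_split))
print("softmax max |off-diagonal|:", max_off_diagonal(fim_softmax))
```

The same `fisher_matrix` helper could be pointed at a probability function that includes whatever upstream map a model applies, which is closer to the test described above.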

Original abstract

Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcement-learning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitous $\textit{softmax}$ function and demonstrate its limitations from an information-geometric perspective; 2) replace the $\textit{softmax}$ with the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance. $\textit{Catnat}$ is simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to the $\textit{softmax}$ function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Catnat replaces softmax with hierarchical binary splits to get a diagonal FIM, but the optimization benefit needs checking after standard reparameterizations.

The main takeaway is that this paper offers catnat as a drop-in replacement for softmax in categorical latent models, built from hierarchical binary splits, and proves it produces a diagonal Fisher information matrix that should aid gradient descent. They connect the information geometry directly to the parameterization choice. The softmax has a dense FIM by nature, while the tree-structured splits make the parameters independent in the Fisher sense. The construction is simple and the proof appears solid for the base case.

They back it up with experiments across three different areas: graph structure learning, VAEs, and RL. The results show improved efficiency and better final performance, and they note it works with standard tricks like gradient clipping.

The potential weakness is around how this holds up in practice. Most of the target uses involve reparameterization or sampling methods that compose additional stochastic functions on top of the probability outputs. Those compositions generally have non-diagonal Jacobians, which could reintroduce correlations in the effective gradients. The paper claims advantages to gradient descent, but if they haven't verified the property after the full training pipeline, that leaves an open question about whether the diagonal FIM is the actual driver of the gains or if something else is going on.

Readers who build models with discrete variables in deep learning would get the most out of this. It's relevant for anyone frustrated with softmax in optimization and looking for a minimal change. The work shows clear thinking on the geometric side and has reproducible experiments, so it deserves to go through peer review rather than being rejected outright. I'd recommend accepting it for review, focusing referee attention on confirming the FIM property survives the reparameterizations used in the experiments.

Referee Report

2 major / 2 minor

Summary. The paper proposes replacing the softmax parameterization of categorical distributions with a new 'catnat' function constructed from a sequence of hierarchical binary splits. It claims to prove that this yields a diagonal Fisher Information Matrix, providing advantages for gradient-based optimization, and reports empirical gains in performance and learning efficiency across graph structure learning, variational autoencoders, and reinforcement learning tasks.

Significance. If the diagonal-FIM property is shown to persist under the reparameterizations and sampling procedures used in the target architectures, the work would provide a simple, drop-in improvement to training discrete latent-variable models. The multi-domain experimental evaluation is a positive feature, though the absence of detailed statistical reporting and full derivation details in the current version limits the ability to judge the magnitude of the practical advance.

major comments (2)

[§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.
[§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.

minor comments (2)

The definition and implementation details of the catnat function (including how the binary splits are parameterized) would benefit from an explicit pseudocode listing or additional figure to aid reproducibility.
Compatibility with existing stabilization techniques (e.g., temperature annealing) is mentioned in the abstract but not expanded in the main text; a short paragraph clarifying this point would strengthen the practical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Referee: [§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.

Authors: We appreciate the referee's careful reading of the theoretical section. The proof in §3 demonstrates that the catnat parameterization yields a diagonal Fisher Information Matrix for the categorical distribution with respect to its natural parameters. This property is intrinsic to the parameterization and is intended to improve the geometry of the optimization landscape. In the experimental sections, catnat is substituted for softmax within standard reparameterization frameworks. Although the Jacobians of the sampling estimators are generally dense, the diagonal structure of the base FIM still provides benefits by ensuring that parameter updates are less correlated, leading to more stable and efficient learning. That said, we agree that an explicit verification or bound on the off-diagonal terms in the composed setting would be valuable. In the revised manuscript, we will include a brief discussion in §3 or a new subsection addressing the interaction with reparameterization estimators and why the advantages are expected to persist.
revision: partial
Referee: [§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.

Authors: We acknowledge this limitation in the current presentation of the results. To address it, the revised manuscript will include: (i) explicit description of the train/validation/test splits for the datasets used in the VAE experiments, (ii) details of the hyperparameter search procedure (including the ranges explored and selection criteria), (iii) the number of independent random seeds used for each experiment (we will report results over 5 seeds), and (iv) statistical analysis such as mean and standard deviation across runs, along with significance tests where appropriate. These additions will allow readers to better assess the robustness of the reported gains.
revision: yes
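
The reporting promised in items (iii) and (iv) could look roughly like the sketch below: mean and sample standard deviation across seeds, plus Welch's t statistic with its Welch-Satterthwaite degrees of freedom. The helper names are hypothetical, the choice of Welch's test is an assumption (the rebuttal does not name a test), and the scores are placeholders, not results from the paper.

```python
import numpy as np

def summarize_runs(scores):
    """Mean and sample standard deviation (ddof=1) across seeds."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    for two groups with possibly unequal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

# Placeholder scores for 5 seeds per method (NOT results from the paper).
baseline_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
variant_scores = [2.0, 3.0, 4.0, 5.0, 6.0]

mean, std = summarize_runs(baseline_scores)
t_stat, dof = welch_t(baseline_scores, variant_scores)
print(f"baseline: {mean:.2f} +/- {std:.2f}, t = {t_stat:.2f}, dof = {dof:.1f}")
```

With only 5 seeds per method the degrees of freedom are small, so reporting the raw per-seed values alongside the summary is a reasonable complement.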

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines catnat via hierarchical binary splits and separately proves the resulting diagonal Fisher Information Matrix as a geometric property of that parameterization. This is a standard derivation from the chosen coordinates rather than a self-referential loop or re-derivation by construction. No self-citations, fitted inputs renamed as predictions, or uniqueness theorems imported from prior author work appear in the abstract or described claims. The central result remains an independent information-geometric argument applied to the new function, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the geometric property of the new parameterization and on the assumption that this property survives in the training regimes tested.

axioms (1)

domain assumption: Categorical distributions can be equivalently represented via a sequence of hierarchical binary splits.
Invoked when defining catnat as an alternative to softmax.
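
The equivalence in this axiom can be illustrated with a round trip between category probabilities and split probabilities. The sketch assumes a chain-shaped sequence of splits and strictly positive probabilities; `probs_to_splits` and `splits_to_probs` are hypothetical names, and the paper's tree may be arranged differently.

```python
import numpy as np

def probs_to_splits(probs):
    """Conditional probability of stopping at each split.

    Assumes strictly positive probabilities so no division by zero occurs.
    """
    probs = np.asarray(probs, dtype=float)
    # Mass still unassigned before each category: 1, 1-p_1, 1-p_1-p_2, ...
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(probs[:-1])))
    return probs[:-1] / remaining[:-1]

def splits_to_probs(splits):
    """Inverse map: rebuild the categorical probabilities."""
    splits = np.asarray(splits, dtype=float)
    reach = np.concatenate(([1.0], np.cumprod(1.0 - splits)))
    return np.append(reach[:-1] * splits, reach[-1])

original = np.array([0.1, 0.2, 0.3, 0.4])
splits = probs_to_splits(original)     # K-1 split probabilities
recovered = splits_to_probs(splits)    # back to K category probabilities
print(splits, recovered)
```

A categorical distribution over K categories has K-1 degrees of freedom, which matches the K-1 split probabilities produced here.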

invented entities (1)

catnat function (no independent evidence)
purpose: To parameterize categorical random variables with a diagonal Fisher Information Matrix.
Newly introduced construction; no independent evidence outside the paper's derivation and experiments.

pith-pipeline@v0.9.0 · 5738 in / 1297 out tokens · 46447 ms · 2026-05-18T11:49:36.767000+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "catnat function, a function composed of a sequence of hierarchical binary splits"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.