Beyond Softmax: A Natural Parameterization for Categorical Random Variables
Pith reviewed 2026-05-18 11:49 UTC · model grok-4.3
The pith
The catnat function replaces softmax with hierarchical binary splits to produce a diagonal Fisher Information Matrix that simplifies gradient descent for categorical variables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We replace the softmax with the catnat function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. Experiments including graph structure learning, variational autoencoders, and reinforcement learning empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance.
What carries the argument
The catnat function, which parameterizes categorical distributions via a sequence of hierarchical binary splits and thereby produces a diagonal Fisher Information Matrix.
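To make "a sequence of hierarchical binary splits" concrete, here is a minimal sketch. It assumes the simplest possible tree, a chain in which each split either stops at the current category or continues, and it uses a logistic sigmoid for each split. The names `sigmoid` and `chain_split_probs` are hypothetical, and the paper's actual catnat definition (tree shape and split activation) may differ.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def chain_split_probs(theta):
    """Map K-1 real parameters to K category probabilities.

    Assumption: split i chooses between "stop at category i"
    (probability sigmoid(theta[i])) and "continue to the rest".
    This is an illustrative stand-in, not the paper's catnat.
    """
    probs = []
    remaining = 1.0  # probability mass not yet assigned
    for t in theta:
        s = sigmoid(t)
        probs.append(remaining * s)
        remaining *= 1.0 - s
    probs.append(remaining)  # the last category takes what is left
    return probs
```

With all parameters at zero every split is a fair coin, so `chain_split_probs([0.0, 0.0])` returns `[0.5, 0.25, 0.25]`. Note that K categories need only K-1 parameters here, whereas softmax uses K logits with one redundant direction.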
If this is right
- Gradient descent becomes more efficient because the Fisher Information Matrix is diagonal.
- Models reach higher test performance in tasks with categorical latent variables.
- The parameterization drops into existing architectures and remains compatible with standard training stabilization techniques.
- Training converges faster in graph structure learning, VAEs, and reinforcement learning.
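One way to read the first bullet, sketched from standard definitions rather than taken from the paper's own derivation: the softmax Fisher Information Matrix is dense and singular, while a diagonal matrix turns a natural-gradient step into independent per-coordinate rescalings. The step size $\eta$, the loss $\mathcal{L}$, and the diagonal entries $f_i$ are notation assumed here for illustration.

```latex
\begin{align*}
  % Softmax with logits \theta and probabilities p_k = e^{\theta_k} / \sum_j e^{\theta_j}
  F_{\mathrm{softmax}}(\theta) &= \operatorname{diag}(p) - p\,p^{\top},
  &
  F_{\mathrm{softmax}}(\theta)\,\mathbf{1} &= p - p\,(p^{\top}\mathbf{1}) = 0, \\
  % Natural-gradient step for a general, invertible FIM
  \theta &\leftarrow \theta - \eta\,F(\theta)^{-1}\,\nabla_{\theta}\mathcal{L}(\theta), \\
  % If F = diag(f_1, ..., f_n) with f_i > 0, the step decouples
  \theta_i &\leftarrow \theta_i - \frac{\eta}{f_i}\,\frac{\partial \mathcal{L}}{\partial \theta_i}.
\end{align*}
```

The first line shows that the softmax matrix has off-diagonal entries $-p_i p_j$ and a null direction along $\mathbf{1}$, so it cannot be inverted directly. The last line shows that a diagonal matrix needs no matrix inversion at all.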
Where Pith is reading between the lines
- The diagonal structure could enable simpler curvature estimates when analyzing optimization in other discrete-variable settings.
- This binary-split construction might extend to parameterizations for other discrete distributions beyond the categorical case.
- In reinforcement learning the improved gradients could reduce variance in policy updates for large action spaces.
Load-bearing premise
The off-diagonal zeros in the Fisher Information Matrix from the hierarchical binary splits remain zero under the reparameterizations and training dynamics used in the target models.
What would settle it
Computing the Fisher Information Matrix from gradients during training with catnat and finding non-zero off-diagonal entries would show that the diagonal property does not hold in practice.
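A small-scale version of this check can be sketched as follows. It computes the exact-expectation Fisher Information Matrix of a parameterization on its own, using central finite differences for the score, and compares softmax with a chain of sigmoid splits. The chain is an assumed stand-in for catnat, all function names are hypothetical, and the check does not cover the composed setting (sampling estimators, training dynamics) that the premise above is about.

```python
import math


def softmax_probs(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def chain_split_probs(theta):
    # Illustrative hierarchical binary-split map (assumption, not the paper's catnat).
    probs, remaining = [], 1.0
    for t in theta:
        s = 1.0 / (1.0 + math.exp(-t))
        probs.append(remaining * s)
        remaining *= 1.0 - s
    probs.append(remaining)
    return probs


def fisher_matrix(prob_fn, theta, eps=1e-5):
    """FIM = sum_k p_k * grad(log p_k) * grad(log p_k)^T, gradients by central differences."""
    n = len(theta)
    p = prob_fn(theta)
    grads = []  # grads[i][k] = d log p_k / d theta_i
    for i in range(n):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        pu, pd = prob_fn(up), prob_fn(dn)
        grads.append([(math.log(a) - math.log(b)) / (2 * eps) for a, b in zip(pu, pd)])
    return [
        [sum(p[k] * grads[i][k] * grads[j][k] for k in range(len(p))) for j in range(n)]
        for i in range(n)
    ]


def max_off_diagonal(mat):
    n = len(mat)
    return max(abs(mat[i][j]) for i in range(n) for j in range(n) if i != j)
```

For the chain, the largest off-diagonal entry should vanish up to finite-difference error, while for softmax it equals the largest product $p_i p_j$ of two distinct probabilities. Finding clearly non-zero values for the binary-split map in a real training run would be the kind of evidence described above.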
Original abstract
Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcement-learning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitous $\textit{softmax}$ function and demonstrate its limitations from an information-geometric perspective; 2) replace the $\textit{softmax}$ with the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance. $\textit{Catnat}$ is simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to the $\textit{softmax}$ function.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the softmax parameterization of categorical distributions with a new 'catnat' function constructed from a sequence of hierarchical binary splits. It claims to prove that this yields a diagonal Fisher Information Matrix, providing advantages for gradient-based optimization, and reports empirical gains in performance and learning efficiency across graph structure learning, variational autoencoders, and reinforcement learning tasks.
Significance. If the diagonal-FIM property is shown to persist under the reparameterizations and sampling procedures used in the target architectures, the work would provide a simple, drop-in improvement to training discrete latent-variable models. The multi-domain experimental evaluation is a positive feature, though the absence of detailed statistical reporting and full derivation details in the current version limits the ability to judge the magnitude of the practical advance.
major comments (2)
- [§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.
- [§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.
minor comments (2)
- The definition and implementation details of the catnat function (including how the binary splits are parameterized) would benefit from an explicit pseudocode listing or additional figure to aid reproducibility.
- A brief discussion of compatibility with existing stabilization techniques (e.g., temperature annealing) is mentioned in the abstract but not expanded in the main text; a short paragraph clarifying this point would strengthen the practical contribution.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
- Referee: [§3] §3 (Proof of diagonal FIM): The derivation establishes diagonality for the direct hierarchical-binary parameterization of the categorical probabilities. However, the VAE and RL experiments (described in §5.2–5.3) rely on reparameterization estimators (Gumbel-softmax, straight-through, or policy gradients) whose Jacobians are generally dense; no explicit verification is given that the off-diagonal blocks of the effective FIM remain negligible after this composition or throughout optimization.
  Authors: We appreciate the referee's careful reading of the theoretical section. The proof in §3 demonstrates that the catnat parameterization yields a diagonal Fisher Information Matrix for the categorical distribution with respect to its natural parameters. This property is intrinsic to the parameterization and is intended to improve the geometry of the optimization landscape. In the experimental sections, catnat is substituted for softmax within standard reparameterization frameworks. Although the Jacobians of the sampling estimators are generally dense, the diagonal structure of the base FIM still provides benefits by ensuring that parameter updates are less correlated, leading to more stable and efficient learning. That said, we agree that an explicit verification or bound on the off-diagonal terms in the composed setting would be valuable. In the revised manuscript, we will include a brief discussion in §3 or a new subsection addressing the interaction with reparameterization estimators and why the advantages are expected to persist.
  Revision: partial
- Referee: [§5.2] §5.2 (VAE experiments): The reported improvements in test performance and learning efficiency are presented without specification of data splits, hyperparameter search protocol, number of random seeds, or statistical significance tests. This makes it impossible to determine whether the observed gains are robust or could be explained by implementation details.
  Authors: We acknowledge this limitation in the current presentation of the results. To address it, the revised manuscript will include: (i) explicit description of the train/validation/test splits for the datasets used in the VAE experiments, (ii) details of the hyperparameter search procedure (including the ranges explored and selection criteria), (iii) the number of independent random seeds used for each experiment (we will report results over 5 seeds), and (iv) statistical analysis such as mean and standard deviation across runs, along with significance tests where appropriate. These additions will allow readers to better assess the robustness of the reported gains.
  Revision: yes
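The reporting promised in items (iii) and (iv) could look roughly like the sketch below: a per-method mean and sample standard deviation over seeds, plus Welch's t statistic for comparing two methods. The helper names `summarize` and `welch_t` are hypothetical, the choice of Welch's test is an assumption rather than the authors' stated protocol, and no scores from the paper are used.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b: per-seed scores for two methods (each needs at least 2 seeds).
    """
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard errors of the two means.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df
```

Turning `t` and `df` into a p-value requires the Student t distribution, which the standard library does not provide. With only 5 seeds per method, the resulting test has limited power and should be read alongside the raw per-seed numbers.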
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper defines catnat via hierarchical binary splits and separately proves the resulting diagonal Fisher Information Matrix as a geometric property of that parameterization. This is a standard derivation from the chosen coordinates rather than a self-referential loop or re-derivation by construction. No self-citations, fitted inputs renamed as predictions, or uniqueness theorems imported from prior author work appear in the abstract or described claims. The central result remains an independent information-geometric argument applied to the new function, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Categorical distributions can be equivalently represented via a sequence of hierarchical binary splits.
invented entities (1)
- catnat function: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "catnat function, a function composed of a sequence of hierarchical binary splits"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.