Symbolic Density Estimation for Discrete Distributions

Meng Li; Ziwen Liu

arxiv: 2605.21813 · v1 · pith:IVIWZIGYnew · submitted 2026-05-20 · 💻 cs.LG · stat.ME · stat.ML

Pith reviewed 2026-05-22 08:41 UTC · model grok-4.3

classification 💻 cs.LG · stat.ME · stat.ML

keywords symbolic density estimation · discrete distributions · evolutionary search · probability mass functions · mixture models · zero inflation · unsupervised learning · benchmark dataset

The pith

Symbolic density estimation recovers closed-form probability mass functions for discrete distributions by composing elementary operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents symbolic density estimation as a way to discover closed-form expressions for discrete probability distributions automatically, rather than deriving them by hand. It runs an evolutionary search over compositions of basic analytic functions, combined with structural priors and validity checks, to find both the functional form and the parameters. This matters for expanding the usable catalog of distributions quickly and for handling extensions such as zero-inflated or mixture models. On a new benchmark covering common families, the authors report that the method recovers every case with accurate parameters, and on real data it produces interpretable mixtures that fit better than standard choices.

Core claim

The central claim is that an unsupervised evolutionary search over a structured space of elementary analytic operations, augmented by domain-specific priors and a validity-aware inference stage, can recover the exact closed-form probability mass functions and parameter values for a broad range of discrete distributions, including zero-inflated and finite-mixture families.

What carries the argument

Evolutionary search that composes elementary analytic operations into valid probability mass functions, guided by structural priors and a validity-aware inference stage.
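To make "composing elementary analytic operations" concrete, here is a minimal sketch of a log-PMF represented as an expression tree. It is illustrative only, not the authors' implementation: the operator table `OPS`, the tuple encoding, and the helper names are assumptions, and only `logF(t) = log Γ(t+1)` is taken from the paper passage quoted further down this page. The tree encodes the Poisson log-PMF, x·log(λ) − λ − logF(x), and the final lines check that it sums to roughly one over a truncated support.

```python
import math

# Hypothetical operator set for illustration; logF(t) = log Gamma(t + 1).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "log": lambda a: math.log(a),
    "logF": lambda a: math.lgamma(a + 1.0),
}

def evaluate(node, x):
    """Evaluate an expression tree at the integer point x.

    A node is the string "x", a numeric constant, or a tuple (op, *children).
    """
    if node == "x":
        return float(x)
    if isinstance(node, (int, float)):
        return float(node)
    op, *args = node
    return OPS[op](*(evaluate(a, x) for a in args))

def poisson_log_pmf_tree(lam):
    """Tree for x * log(lam) - lam - logF(x)."""
    return ("sub", ("sub", ("mul", "x", ("log", lam)), lam), ("logF", "x"))

def total_mass(tree, support):
    """Sum exp(f(x)) over the support; a valid PMF gives a value near 1."""
    return sum(math.exp(evaluate(tree, x)) for x in support)

tree = poisson_log_pmf_tree(3.0)
mass = total_mass(tree, range(0, 60))  # support truncated at 59 for the check
```

A search procedure would mutate and recombine trees like this one; the sketch covers only representation and evaluation, not the search itself.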

If this is right

The contributed benchmark dataset enables systematic testing of future symbolic methods for discrete distributions.
The same framework extends directly to richer families such as zero-inflated distributions and finite mixtures.
On real data the recovered mixture models achieve better goodness-of-fit than conventional parametric choices while remaining concise and interpretable.
Accurate parameter recovery holds across all tested benchmark families once the correct functional form is identified.
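The zero-inflated and finite-mixture families mentioned above can both be written in log space with a single `logaddexp` operation, which the quoted paper passage lists among its operators. The sketch below is a hedged illustration with Poisson components; the function names and the choice of Poisson are assumptions made for the example, not details confirmed by the paper.

```python
import math

def logaddexp(u, v):
    """Numerically stable log(e^u + e^v)."""
    m = max(u, v)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(u - m) + math.exp(v - m))

def poisson_log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def zip_log_pmf(x, pi, lam):
    """Zero-inflated Poisson: extra mass pi at zero, (1 - pi) on the Poisson part."""
    base = math.log1p(-pi) + poisson_log_pmf(x, lam)
    return logaddexp(math.log(pi), base) if x == 0 else base

def mixture_log_pmf(x, w, lam1, lam2):
    """Two-component Poisson mixture with weight w on the first component."""
    return logaddexp(
        math.log(w) + poisson_log_pmf(x, lam1),
        math.log1p(-w) + poisson_log_pmf(x, lam2),
    )
```

Both functions require the weights `pi` and `w` to lie strictly between 0 and 1, since the logarithm of either weight or of its complement must be finite.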

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be adapted to discover symbolic expressions for continuous densities or for other statistical functionals such as hazard rates.
Combining the method with gradient-based refinement might improve numerical stability for parameters in high-dimensional mixtures.
The recovered symbolic forms could serve as interpretable building blocks inside larger probabilistic models rather than remaining stand-alone.
If the search space is enlarged, the framework might surface previously undocumented but still simple discrete distributions that fit empirical data well.

Load-bearing premise

The target distribution can be expressed exactly as a composition of elementary analytic operations inside the search space the evolutionary algorithm explores.

What would settle it

The method fails to recover the correct closed form or returns inaccurate parameter estimates when tested on synthetic data drawn from a discrete distribution whose expression lies outside the allowed search space of analytic compositions.

Figures

Figures reproduced from arXiv: 2605.21813 by Meng Li, Ziwen Liu.

**Figure 1.** Evolutionary generations to correct symbolic form (M = 50,000; log scale). Panel title: Computational Efficiency by Distribution Structure. Axis: Evolutionary Steps to Convergence (ticks at 10^2, 10^3, 10^4). Distributions shown: Geometric, Yulesimon, Logseries, Zipf, Zipfian, Boltzmann, Dlaplace, Poisson, Binomial, Negbinomial, Hypergeometric, Betanegbinomial, Betabinomial, Neghypergeometric.

**Figure 2.** PMF estimates for Beta-Binomial (left) and ZIP (right). KDE and Pyro produce non…

**Figure 3.** Real-data fit on PBMC gene 4046. Black points are the empirical log-PMF over support…

**Figure 4.** Diagnostic plots for zero-inflated models (ZIP, ZINB, ZIG). Each panel shows PMF fit,…

**Figure 5.** Comparison of PMF estimation results across the extended benchmark. Beta-Binomial and…

Original abstract

Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Symbolic Density Estimation (SDE), an unsupervised framework that composes elementary analytic operations via evolutionary search within a structured space, augmented by domain-specific structural priors and a validity-aware inference stage, to recover closed-form probability mass functions for discrete distributions. The approach extends to zero-inflated distributions and finite mixtures, contributes a benchmark dataset spanning common discrete families, asserts recovery of all benchmark families with accurate parameter estimates, and demonstrates improved goodness-of-fit via interpretable mixtures on real data.

Significance. If the recovery results prove robust, the work offers a systematic, automated route to expanding the catalog of interpretable discrete distributions beyond manual derivations, with direct utility for statistical modeling and machine learning. The benchmark dataset itself constitutes a reusable contribution for future evaluation, and the handling of richer families such as mixtures broadens applicability.

major comments (2)

[Abstract] The assertion that the algorithm 'recovers all benchmark families with accurate parameter estimates' is load-bearing for the central claim, yet the abstract provides no information on search implementation details, validity checks, error metrics, number of independent runs, or success rates. Evolutionary search is stochastic, so without these details the recovery cannot be assessed as reliable rather than trajectory-dependent.
[Benchmark evaluation] Without a reported ablation on population size and mutation rate, or explicit failure cases (even when targets lie inside the search space), the claim of consistent recovery across all families may rest on selected successful runs and cannot yet support the robustness required for the method's advertised generality.

minor comments (2)

[Method] Clarify the precise grammar or production rules defining the structured search space and list the elementary operations with their arity and domains.
[Results] Tables reporting recovered parameters should include variation or confidence intervals across runs rather than single-point estimates.
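One way to report variation across runs, as the last minor comment requests, is sketched below. The function name, the tolerance-based definition of "success", and the demo numbers are all assumptions for illustration; the demo values are made up to exercise the code and are not results from the paper.

```python
import statistics

def summarize_runs(estimates, truth, tol):
    """Summarize one parameter's estimates across independent runs.

    A run counts as a success when its estimate is within `tol` of `truth`
    (a hypothetical criterion chosen for this sketch).
    """
    n = len(estimates)
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates) if n > 1 else 0.0
    successes = sum(abs(e - truth) <= tol for e in estimates)
    return {"n_runs": n, "mean": mean, "sd": sd, "success_rate": successes / n}

# Made-up estimates, purely to show the call; not data from the paper.
demo = summarize_runs([2.0, 2.5, 1.5, 2.0], truth=2.0, tol=0.25)
```

Reporting the mean, the standard deviation, and the success rate together separates two failure modes that a single-point estimate hides: runs that find the wrong form, and runs that find the right form with noisy parameters.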

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental reporting that will strengthen the manuscript. We address each major comment point by point below and will revise the paper to incorporate the requested details and analyses.

Referee: [Abstract] The assertion that the algorithm 'recovers all benchmark families with accurate parameter estimates' is load-bearing for the central claim, yet the abstract provides no information on search implementation details, validity checks, error metrics, number of independent runs, or success rates. Evolutionary search is stochastic, so without these details the recovery cannot be assessed as reliable rather than trajectory-dependent.

Authors: We agree that the stochastic nature of evolutionary search requires explicit reporting to support the recovery claims. The full manuscript describes the overall framework and validity-aware inference, but we acknowledge that the abstract and evaluation summary would benefit from greater specificity. In the revised version we will expand the abstract with a concise statement of the experimental protocol and add a dedicated paragraph to the Benchmark evaluation section that specifies the evolutionary search hyperparameters, validity checks, error metrics (including KL divergence and parameter estimation error), number of independent runs performed, and observed success rates across the benchmark families.

Revision: yes

Referee: [Benchmark evaluation] Without a reported ablation on population size and mutation rate, or explicit failure cases (even when targets lie inside the search space), the claim of consistent recovery across all families may rest on selected successful runs and cannot yet support the robustness required for the method's advertised generality.

Authors: We concur that ablations and transparent reporting of edge cases are necessary to demonstrate robustness. The revised manuscript will include a new ablation study that systematically varies population size and mutation rate while measuring impact on recovery success and runtime. We will also add discussion and examples of failure cases or runs requiring additional iterations, even when the target distribution lies within the search space, to provide a balanced assessment rather than only successful trajectories.

Revision: yes
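The rebuttal names KL divergence as one of the error metrics to be reported. For two PMFs on a shared finite support it can be computed as in the minimal sketch below. The function name and the list-based representation are assumptions, and the page does not say which direction of the divergence the authors use.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two PMFs given as equal-length sequences over the same support.

    Terms with p_i = 0 contribute nothing; if q_i = 0 where p_i > 0,
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total
```

With `p` as the reference PMF and `q` as the recovered one, a value near zero means the recovered expression places mass where the reference does. The infinite case flags a recovered PMF that assigns zero mass to a point the reference can produce.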

Circularity Check

0 steps flagged

No circularity: algorithmic search over explicit space validated on independent benchmarks

full rationale

The paper describes an unsupervised evolutionary search procedure that composes elementary analytic operations inside a pre-defined structured space, augmented by structural priors and a validity-aware inference stage. Recovery of benchmark families is presented as an empirical test of whether the search reliably locates expressions whose parameters can then be estimated from data. This is a standard validation of a discovery algorithm rather than a derivation that reduces to its own fitted outputs or self-citations. No load-bearing step equates a claimed result to an input by construction; the benchmark dataset is contributed separately to support evaluation, and success is reported as an observed outcome of the method rather than a definitional necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the search space is expressive enough for target distributions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We seek an analytic expression for p(x) … from a hypothesis space P induced by an operator set O … logF(t) = log Γ(t+1), logC(n,k) … logaddexp(u,v) = log(e^u + e^v)

IndisputableMonolith/Foundation/Cost.lean · Jcost_pos_of_ne_one · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: weighted least-squares objective … validity-aware inference stage … approximate normalization |∑ e^{f(x)} − 1| < ε
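The second quoted passage mentions a weighted least-squares objective and an approximate-normalization check, |∑ e^{f(x)} − 1| < ε. The sketch below shows one plausible reading in which a candidate log-PMF `f` is compared with the empirical log-PMF. The weighting scheme is not specified in the excerpt, so `weights` is left to the caller, and all function names here are hypothetical.

```python
import math
from collections import Counter

def empirical_log_pmf(samples):
    """Log of the empirical frequencies, keyed by observed value."""
    counts = Counter(samples)
    n = len(samples)
    return {x: math.log(c / n) for x, c in counts.items()}

def weighted_ls_loss(f, log_p_hat, weights):
    """Sum over observed x of weights[x] * (f(x) - log p_hat(x))^2."""
    return sum(weights[x] * (f(x) - lp) ** 2 for x, lp in log_p_hat.items())

def is_approximately_normalized(f, support, eps=1e-3):
    """Check |sum_x exp(f(x)) - 1| < eps over a finite fitting support."""
    return abs(sum(math.exp(f(x)) for x in support) - 1.0) < eps
```

For example, the geometric log-PMF `f = lambda x: (x + 1) * math.log(0.5)` passes the normalization check on `range(60)`, while the same expression shifted by a constant 0.1 does not.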

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] Deaglan J
arXiv:2006.06813. Deaglan J. Bartlett, Harry Desmond, and Pedro G. Ferreira. Exhaustive symbolic regression. IEEE Transactions on Evolutionary Computation, 28(4):950–964,

[2] Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl (internal anchor)
arXiv:2305.01582. Laure Crochepierre, Lydia Boudjeloud-Assala, and Vincent Barbesant. A reinforcement learning approach to domain-knowledge inclusion using grammar guided symbolic regression,

[3] Stéphane d’Ascoli, Sören Becker, Philippe Schwaller, Alexander Mathis, and Niki Kilbertus
arXiv:2202.04367. Stéphane d’Ascoli, Sören Becker, Philippe Schwaller, Alexander Mathis, and Niki Kilbertus. ODEFormer: Symbolic regression of dynamical systems with transformers. In The Twelfth International Conference on Learning Representations,

[4] arXiv:1910.08892. N. Lloyd Johnson, Adrienne W. Kemp, and Samuel Kotz. Univariate Discrete Distributions. John Wiley & Sons, New York, 2nd edition,

[5] arXiv:2102.08351. John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA,

[6] Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan K
arXiv:2306.08506. Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan K. Reddy. Transformer-based planning for symbolic regression. In Thirty-seventh Conference on Neural Information Processing Systems,

[7] ISR: Invertible symbolic regression, 2024a
Tony Tohme, Mohammad Javad Khojasteh, Mohsen Sadr, Florian Meyer, and Kamal Youcef-Toumi. ISR: Invertible symbolic regression, 2024a. arXiv:2405.06848. Tony Tohme, Mohsen Sadr, Kamal Youcef-Toumi, and Nicolas G. Hadjiconstantinou. MESSY estimation: Maximum-entropy based stochastic and symbolic densitY estimation, 2024b. arXiv:2306.04120. F. William Townes...

[8] Symbolicgpt: A generative transformer model for symbolic regression
arXiv:2106.14131. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat,...

[9] Appendix A, Discrete Distribution Families. Table 9 presents a comprehensive glossary of the discrete probability distribution families examined in this work. Each entry provides the standardized log-probability mass function (Log-PMF) alongside the associated parameters and the specific operator sets required for symbolic reconstruction. The 14 distribu...

[10] Profile 2 reduces the same restricted space to 1.05×10^13, corresponding to an additional 96× reduction. Overall, the main pruning effect comes from the structural restrictions, while the soft complexity profiles provide a further multi-order-of-magnitude reduction in the search space explored in practice. C Implementation Details for SDE and Baselines Th...

[11] Post-search filtering and selection. After running the two profiles independently, all candidates that pass the post-search filters are pooled across profiles. We apply four fixed criteria: reconstruction loss below 10^{−3}; approximate normalization satisfying |∑_{x∈X_fit} e^{f(x)} − 1| < 10^{−3}; bounded log-mass with max_{x∈X_fit} f(x) < 10^{−3}; and operator-count limits...

[12] + logC(N−K, n−x_0) − logC(N, n); (N, K, n) = (100, 50, 40): logC(50.0065, x_0 + 10.0065) + logC(50.0060, x_0) − 64.7980, 0.0324; (N, K, n) = (200, 60, 100): −144.3657 + logC(137.2004, 104.0455 − x

[13] Beta-Binomial(n = 100, α = 2.0, β = 5.0); 10,000: 0.0013, −22.6464 − logB(1.0003, x_0 + 1.0009) − logB(101.0007 − x_0, 4.0005); 5,000: 0.0329, logB(24.2827, 113.2000 − x_0) − logB(102.6506 − x_0, 23.3376); 1,000: 0.2120, −31.2557 − logB(7.2884, logC(161.0127, x_0 + 56.6708)); 500: 0.6720, −15.8934 − logB(2.8866, 100.2188 − x

[14] Even under the reduced grammar, this already yields 400,376 candidate expressions. Direct LASSO on the full enumerated library is infeasible, so we use a two-stage pipeline: we first compute SIS scores for the enumerated features, retain the top-200 screened expressions, and then fit LASSO on this reduced design matrix. In practice, only 136,004 of the 40...
