A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

James Enouen; Mahito Sugiyama

arxiv: 2410.11964 · v2 · submitted 2024-10-15 · 💻 cs.LG · stat.ML

A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

James Enouen , Mahito Sugiyama This is my paper

Pith reviewed 2026-05-23 18:35 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords log-linear modelsKL divergencemode interactionssparse selectioninformation geometryenergy-based modelshigher-order interactionsgenerative modeling

0 comments

The pith

Log-linear models achieve a complete decomposition of KL error by selecting higher-order mode interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that log-linear models over discrete variables can be extended beyond independent and pairwise terms by incorporating higher-order mode interactions, using ideas from information geometry. This extension produces an exact decomposition of the total KL divergence error into separate contributions from each possible interaction. The decomposition turns the modeling task into a sparse selection problem, where only a subset of interactions is retained to avoid overfitting when data is limited. An algorithm called MAHGenTa implements the selection using Monte-Carlo sampling of energy-based models together with a greedy heuristic that adds statistical robustness. Experiments on synthetic and real data show higher log-likelihood for both generation and classification compared with models limited to one- and two-body terms.

Core claim

Using refined information measures from information geometry, the KL error between a target distribution and a log-linear model decomposes exactly into additive terms, one for each possible mode interaction of any order. This decomposition directly motivates a sparse combinatorial selection problem over the full set of interactions. Solving the selection problem with the MAHGenTa procedure, which combines Monte-Carlo estimation for energy-based models and a greedy heuristic, produces distributions that make more efficient use of finite training data than standard pairwise models.

What carries the argument

Complete decomposition of KL error into per-interaction terms via refined information, which converts modeling into a sparse mode-interaction selection problem solved by MAHGenTa.

If this is right

Sparse selection of higher-order interactions allows log-linear models to generalize from smaller training sets than pairwise-only models.
The same selection procedure improves performance on both generative log-likelihood and downstream classification accuracy.
Models that retain only statistically supported interactions avoid wasting capacity on irrelevant higher-order terms.
The decomposition separates the contribution of each interaction order, making it possible to quantify how much each added interaction reduces the remaining KL error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition idea could be tested on continuous-variable energy-based models by replacing discrete mode counting with suitable density estimators.
Replacing the greedy heuristic with a convex relaxation or mixed-integer solver might produce different sparsity patterns and different generalization behavior.
Because the selection operates on the interaction set rather than on parameters, the approach may transfer to other exponential-family models that admit an interaction basis.

Load-bearing premise

The Monte-Carlo sampling combined with the greedy heuristic produces interaction selections whose generalization benefit is not an artifact of particular datasets or post-hoc choices.

What would settle it

On held-out test sets across multiple datasets, the log-likelihood of models produced by MAHGenTa is no higher than that of a full pairwise Boltzmann model or a model using all interactions.

Figures

Figures reproduced from arXiv: 2410.11964 by James Enouen, Mahito Sugiyama.

**Figure 1.** Figure 1: Three types of higher-order information (HS, IS, JS) defined for each possible subset S ⊆ [3]. in the realm of information theory for the visible-only case may also support further advances in latent approaches. Nonetheless, we hereafter restrict to the visible-only log-linear model. 3 Refined Information Notation Consider the set of distributions over finite spaces of d ∈ N variable dimensions, with eac… view at source ↗

**Figure 2.** Figure 2: KL Error vs. Number of Training Samples. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Comparison of the training (dark) and vali [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: All hyperparameters of heredity strength and parameter count renormalization. Top 8: Full likelihood [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: All hyperparameters of heredity strength and parameter count renormalization. Bottom 8: Pseudo [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Model Performance vs. Number of Training Samples. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Simple causal graph to help illustrate refined information. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Example of a possible chain of mode interaction collections. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Algebraic structure of all hierarchical collections of mode interactions. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

read the original abstract

The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its large popularity in statistical mechanics and high-dimensional statistics, the majority of related energy-based models only focus on the two-variable relationships, such as Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure which exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice. We develop an algorithm called MAHGenTa which leverages a novel Monte-Carlo sampling technique for energy-based models alongside a greedy heuristic for incorporating statistical robustness. On both synthetic and real-world datasets, we demonstrate our algorithm's effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main advance is a full KL decomposition for log-linear models via higher-order mode interactions that motivates an effective sparse selection method.

read the letter

The paper gives a decomposition of the KL divergence for log-linear models that fully accounts for higher-order mode interactions, and this leads to a sparse selection problem solved by their MAHGenTa algorithm. That's the main new piece. It does a good job extending the usual pairwise focus in Boltzmann machines to include richer interactions using tools from information geometry. The motivation for sparse selection to improve generalization with finite data makes sense, and they back it with experiments on both synthetic and real datasets showing better log-likelihood. The algorithm's use of a novel Monte-Carlo sampling technique for energy-based models is a practical addition. The soft spots are mostly around the practical side: the Monte-Carlo estimator and greedy heuristic need to deliver robust selections, and it's not clear how sensitive the gains are to specific data or choices in which interactions to keep. The central decomposition itself looks clean with no obvious holes in the argument. This work is aimed at people building energy-based models for discrete variables who care about sample efficiency and higher-order dependencies. A reader interested in information geometry applications or structure learning in graphical models would get value from it. It deserves serious peer review because the idea is technically grounded and the results are presented without overstatement.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that tools from information geometry allow a complete decomposition of the KL error for log-linear models in terms of higher-order mode interactions (beyond 1-body and 2-body terms). This decomposition motivates a sparse selection problem over possible mode interactions, which is solved by the MAHGenTa algorithm using a novel Monte-Carlo sampling technique for energy-based models combined with a greedy heuristic for statistical robustness. The resulting distributions are shown to improve log-likelihood on synthetic and real-world data for generative modeling and to adapt to discriminative classification.

Significance. If the decomposition is complete and the heuristic produces robust selections, the work provides a principled extension of log-linear models to higher-order interactions, potentially improving data efficiency in high-dimensional discrete settings. The information-geometric framing of the KL decomposition is a clear strength, and the dual evaluation on generative and discriminative tasks supports practical relevance.

minor comments (3)

[Abstract] Abstract: the central decomposition is described only at a high level; adding a brief reference to its explicit form (or the section containing the derivation) would help readers immediately assess the contribution.
[MAHGenTa algorithm section] Algorithm description: the Monte-Carlo estimator is stated to be unbiased, but a short derivation or reference showing why the sampling scheme preserves unbiasedness under the energy-based model would strengthen the statistical claims.
[Experimental results] Experiments: results are presented without error bars or details on the number of independent runs; including these would make the reported log-likelihood improvements easier to interpret.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, significance assessment, and recommendation of minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives a complete KL decomposition from information-geometric identities applied to log-linear models extended to higher-order mode interactions. This decomposition is presented as motivating (rather than being defined by) the subsequent sparse mode-interaction selection problem solved via MAHGenTa. No step reduces by the paper's own equations to a fitted parameter, self-citation chain, or ansatz smuggled from prior work by the same authors. The central claim is self-contained against external information-geometry benchmarks and does not rely on the empirical heuristic for its validity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations or sections can be inspected to list free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5790 in / 1159 out tokens · 22337 ms · 2026-05-23T18:35:47.105685+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

hierarchical log-linear model... poset of mode interactions... Bregman flat manifold

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1]

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985

work page 1985
[2]

Amari and H

S. Amari and H. Nagaoka. Methods of Infor- mation Geometry . Translations of mathemati- cal monographs. American Mathematical Society, 2000

work page 2000
[3]

S.-I. Amari. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory , 47(5):1701–1711, 2001

work page 2001
[4]

S.-I. Amari. Information Geometry and Its Ap- plications. Springer Publishing Company, Incor- porated, 1st edition, 2016

work page 2016
[5]

A. Aswani. Low-rank approximation and comple- tion of positive tensors. SIAM Journal on Ma- trix Analysis and Applications , 37(3):1337–1364, 2016

work page 2016
[6]

Bennasar, Y

M. Bennasar, Y. Hicks, and R. Setchi. Feature selection using joint mutual information max- imisation. Expert Systems with Applications , 42(22):8520–8532, 2015

work page 2015
[7]

J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of statistics , 41(3):1111, 2013

work page 2013
[8]

S. L. Buhl. On the existence of maximum like- lihood estimators for graphical gaussian mod- els. Scandinavian Journal of Statistics, 20(3):263– 270, 1993

work page 1993
[9]

Z. Chen, C. Wu, Y. Zhang, Z. Huang, B. Ran, M. Zhong, and N. Lyu. Feature selection with redundancy-complementariness dispersion. Knowledge-Based Systems, 89:203–217, 2015

work page 2015
[10]

A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972

work page 1972
[11]

Enouen and Y

J. Enouen and Y. Liu. Sparse interaction ad- ditive networks via feature interaction detection and sparse selection. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, ed- itors, Advances in Neural Information Processing Systems, volume 35, pages 13908–13920. Curran Associates, Inc., 2022

work page 2022
[12]

Y. Fan, Y. Kong, D. Li, and J. Lv. Interaction pursuit with feature screening and selection, 2016

work page 2016
[13]

R. A. Fisher. Two new properties of mathemati- cal likelihood. Proceedings of the Royal Society of London. Series A, Containing Papers of a Math- ematical and Physical Character , 144(852):285– 307, 1934

work page 1934
[14]

Ganmor, R

E. Ganmor, R. Segev, and E. Schneidman. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proc. Natl. Acad. Sci. U. S. A. , 108(23):9679– 9684, June 2011

work page 2011
[15]

A. E. Gelfand. Gibbs sampling. Journal of the American Statistical Association , 95(452):1300– 1304, 2000

work page 2000
[16]

Geman and D

S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Anal- ysis and Machine Intelligence , PAMI-6(6):721– 741, 1984

work page 1984
[17]

Ghalamkari, M

K. Ghalamkari, M. Sugiyama, and Y. Kawahara. Many-body approximation for non-negative ten- sors. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 74077–74102. Curran As- sociates, Inc., 2023

work page 2023
[18]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Ad- vances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

work page 2014
[19]

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006

work page 2006
[20]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ran- zato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Sys- tems, volume 33, pages 6840–6851. Curran Asso- ciates, Inc., 2020

work page 2020
[21]

Højsgaard

S. Højsgaard. Statistical inference in context spe- cific interaction models for contingency tables. Scandinavian Journal of Statistics , 31(1):143– 158, 2004

work page 2004
[22]

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In 2nd International Confer- ence on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Confer- ence Track Proceedings, 2014. Enouen, Sugiyama

work page 2014
[23]

S.-i. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of markov networks using l 1- regularization. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Informa- tion Processing Systems , volume 19. MIT Press, 2006

work page 2006
[24]

Lowd and J

D. Lowd and J. Davis. Learning markov network structure with decision trees. In 2010 IEEE Inter- national Conference on Data Mining , pages 334– 343, 2010

work page 2010
[25]

F. Lyu, X. Tang, D. Liu, C. Ma, W. Luo, L. Chen, x. He, and X. S. Liu. Towards hybrid-grained feature interaction selection for deep sparse net- work. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 49325–49340. Curran As- sociates, Inc., 2023

work page 2023
[26]

Massam, J

H. Massam, J. Liu, and A. Dobra. A conju- gate prior for discrete hierarchical log-linear mod- els. The Annals of Statistics , 37(6A):3431 – 3467, 2009

work page 2009
[27]

W. McGill. Multivariate information transmis- sion. Transactions of the IRE Professional Group on Information Theory , 4(4):93–111, 1954

work page 1954
[28]

M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable Sparse High-Order Boltzmann Ma- chines. In Proceedings of the Seventeenth In- ternational Conference on Artificial Intelligence and Statistics , volume 33 of Proceedings of Ma- chine Learning Research , pages 614–622, Reyk- javik, Iceland, 22–25 Apr 2014. PMLR

work page 2014
[29]

Nagaoka and S.-i

H. Nagaoka and S.-i. Amari. Differential geome- try of smooth families of probability distributions. Technical report, University of Tokyo, 1982

work page 1982
[30]

Nakahara and S.-I

H. Nakahara and S.-I. Amari. Information- geometric measure for neural spikes. Neural Com- put., 14(10):2269–2316, Oct. 2002

work page 2002
[31]

Nakariyakul

S. Nakariyakul. High-dimensional hybrid feature selection using interaction information-guided search. Knowledge-Based Systems , 145:59–66, 2018

work page 2018
[32]

R. M. Neal. Annealed importance sampling. Statistics and Computing , 11:125–139, 2001

work page 2001
[33]

Nielsen, F

F. Nielsen, F. Critchley, and C. T. J. Dodson. Computational Information Geometry . Signals and Communication Technology. Springer Pub- lishing Company, Incorporated, 2017

work page 2017
[34]

Nyman, J

H. Nyman, J. Pensar, T. Koski, and J. Corander. Stratified Graphical Models - Context-Specific In- dependence in Graphical Models. Bayesian Anal- ysis, 9(4):883 – 908, 2014

work page 2014
[35]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K¨ opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chil- amkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: an imperative style, high- performance deep learning library. In Proceedings of the 33rd International Co...

work page 2019
[36]

J. L. Peixoto. Hierarchical variable selection in polynomial regression models. The American Statistician, 41(4):311–313, 1987

work page 1987
[37]

C. R. Rao. Information and the Accuracy Attain- able in the Estimation of Statistical Parameters , pages 235–247. Springer New York, New York, NY, 1992

work page 1992
[38]

Salakhutdinov and G

R. Salakhutdinov and G. Hinton. Deep boltz- mann machines. In D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statis- tics, volume 5 of Proceedings of Machine Learn- ing Research, pages 448–455, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR

work page 2009
[39]

Schmidt and K

M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Y. W. Teh and M. Titterington, ed- itors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statis- tics, volume 9 of Proceedings of Machine Learn- ing Research, pages 709–716, Chia Laguna Resort, Sardinia, Italy, 13...

work page 2010
[40]

T. J. Sejnowski. Higher-order Boltzmann ma- chines. In Neural Networks for Computing , vol- ume 151 of American Institute of Physics Confer- ence Series, pages 398–403. AIP, Aug. 1986

work page 1986
[41]

C. E. Shannon. A mathematical theory of com- munication. The Bell System Technical Journal , 27(3):379–423, 1948

work page 1948
[42]

Shpitser, R

I. Shpitser, R. J. Evans, T. S. Richardson, and J. M. Robins. Sparse nested markov models with log-linear parameters. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Arti- ficial Intelligence, UAI’13, page 576–585, Arling- ton, Virginia, USA, 2013. AUAI Press. A Complete Decomposition of KL Error using Refined Information

work page 2013
[43]

Sugiyama and K

M. Sugiyama and K. Borgwardt. Finding statis- tically significant interactions between continuous features. In Proceedings of the Twenty-Eighth In- ternational Joint Conference on Artificial Intelli- gence, IJCAI-19, pages 3490–3498. International Joint Conferences on Artificial Intelligence Orga- nization, 7 2019

work page 2019
[44]

Sugiyama, H

M. Sugiyama, H. Nakahara, and K. Tsuda. Leg- endre decomposition for tensors. In S. Ben- gio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Ad- vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018
[45]

T. Sun Han. Multiple mutual informations and multiple interactions in frequency data. Informa- tion and Control , 46(1):26–45, 1980

work page 1980
[46]

Tibshirani

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical So- ciety. Series B (Methodological) , 58(1):267–288, 1996

work page 1996
[47]

Van Haaren and J

J. Van Haaren and J. Davis. Markov network structure learning: A randomized feature genera- tion approach. Proceedings of the AAAI Confer- ence on Artificial Intelligence , 26(1):1148–1154, Sep. 2012

work page 2012
[48]

M. J. Wainwright, J. Lafferty, and P. Raviku- mar. High-dimensional graphical model selec- tion using \ell 1-regularized logistic regression. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Sys- tems, volume 19. MIT Press, 2006

work page 2006
[49]

Z. Zeng, H. Zhang, R. Zhang, and C. Yin. A novel feature selection method considering feature in- teraction. Pattern Recognition, 48(8):2656–2666, 2015. Enouen, Sugiyama A Additional Results In Figures 5 and 6, we show the capacity curves across all sets of hyperparameters chosen for the experiments with the 10-dimensional subset of the 23-dimensional mus...

work page 2015

[1] [1]

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985

work page 1985

[2] [2]

Amari and H

S. Amari and H. Nagaoka. Methods of Infor- mation Geometry . Translations of mathemati- cal monographs. American Mathematical Society, 2000

work page 2000

[3] [3]

S.-I. Amari. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory , 47(5):1701–1711, 2001

work page 2001

[4] [4]

S.-I. Amari. Information Geometry and Its Ap- plications. Springer Publishing Company, Incor- porated, 1st edition, 2016

work page 2016

[5] [5]

A. Aswani. Low-rank approximation and comple- tion of positive tensors. SIAM Journal on Ma- trix Analysis and Applications , 37(3):1337–1364, 2016

work page 2016

[6] [6]

Bennasar, Y

M. Bennasar, Y. Hicks, and R. Setchi. Feature selection using joint mutual information max- imisation. Expert Systems with Applications , 42(22):8520–8532, 2015

work page 2015

[7] [7]

J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of statistics , 41(3):1111, 2013

work page 2013

[8] [8]

S. L. Buhl. On the existence of maximum like- lihood estimators for graphical gaussian mod- els. Scandinavian Journal of Statistics, 20(3):263– 270, 1993

work page 1993

[9] [9]

Z. Chen, C. Wu, Y. Zhang, Z. Huang, B. Ran, M. Zhong, and N. Lyu. Feature selection with redundancy-complementariness dispersion. Knowledge-Based Systems, 89:203–217, 2015

work page 2015

[10] [10]

A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972

work page 1972

[11] [11]

Enouen and Y

J. Enouen and Y. Liu. Sparse interaction ad- ditive networks via feature interaction detection and sparse selection. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, ed- itors, Advances in Neural Information Processing Systems, volume 35, pages 13908–13920. Curran Associates, Inc., 2022

work page 2022

[12] [12]

Y. Fan, Y. Kong, D. Li, and J. Lv. Interaction pursuit with feature screening and selection, 2016

work page 2016

[13] [13]

R. A. Fisher. Two new properties of mathemati- cal likelihood. Proceedings of the Royal Society of London. Series A, Containing Papers of a Math- ematical and Physical Character , 144(852):285– 307, 1934

work page 1934

[14] [14]

Ganmor, R

E. Ganmor, R. Segev, and E. Schneidman. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proc. Natl. Acad. Sci. U. S. A. , 108(23):9679– 9684, June 2011

work page 2011

[15] [15]

A. E. Gelfand. Gibbs sampling. Journal of the American Statistical Association , 95(452):1300– 1304, 2000

work page 2000

[16] [16]

Geman and D

S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Anal- ysis and Machine Intelligence , PAMI-6(6):721– 741, 1984

work page 1984

[17] [17]

Ghalamkari, M

K. Ghalamkari, M. Sugiyama, and Y. Kawahara. Many-body approximation for non-negative ten- sors. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 74077–74102. Curran As- sociates, Inc., 2023

work page 2023

[18] [18]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Ad- vances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

work page 2014

[19] [19]

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006

work page 2006

[20] [20]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ran- zato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Sys- tems, volume 33, pages 6840–6851. Curran Asso- ciates, Inc., 2020

work page 2020

[21] [21]

Højsgaard

S. Højsgaard. Statistical inference in context spe- cific interaction models for contingency tables. Scandinavian Journal of Statistics , 31(1):143– 158, 2004

work page 2004

[22] [22]

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In 2nd International Confer- ence on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Confer- ence Track Proceedings, 2014. Enouen, Sugiyama

work page 2014

[23] [23]

S.-i. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of markov networks using l 1- regularization. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Informa- tion Processing Systems , volume 19. MIT Press, 2006

work page 2006

[24] [24]

Lowd and J

D. Lowd and J. Davis. Learning markov network structure with decision trees. In 2010 IEEE Inter- national Conference on Data Mining , pages 334– 343, 2010

work page 2010

[25] [25]

F. Lyu, X. Tang, D. Liu, C. Ma, W. Luo, L. Chen, x. He, and X. S. Liu. Towards hybrid-grained feature interaction selection for deep sparse net- work. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 49325–49340. Curran As- sociates, Inc., 2023

work page 2023

[26] [26]

Massam, J

H. Massam, J. Liu, and A. Dobra. A conju- gate prior for discrete hierarchical log-linear mod- els. The Annals of Statistics , 37(6A):3431 – 3467, 2009

work page 2009

[27] [27]

W. McGill. Multivariate information transmis- sion. Transactions of the IRE Professional Group on Information Theory , 4(4):93–111, 1954

work page 1954

[28] [28]

M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable Sparse High-Order Boltzmann Ma- chines. In Proceedings of the Seventeenth In- ternational Conference on Artificial Intelligence and Statistics , volume 33 of Proceedings of Ma- chine Learning Research , pages 614–622, Reyk- javik, Iceland, 22–25 Apr 2014. PMLR

work page 2014

[29] [29]

Nagaoka and S.-i

H. Nagaoka and S.-i. Amari. Differential geome- try of smooth families of probability distributions. Technical report, University of Tokyo, 1982

work page 1982

[30] [30]

Nakahara and S.-I

H. Nakahara and S.-I. Amari. Information- geometric measure for neural spikes. Neural Com- put., 14(10):2269–2316, Oct. 2002

work page 2002

[31] [31]

Nakariyakul

S. Nakariyakul. High-dimensional hybrid feature selection using interaction information-guided search. Knowledge-Based Systems , 145:59–66, 2018

work page 2018

[32] [32]

R. M. Neal. Annealed importance sampling. Statistics and Computing , 11:125–139, 2001

work page 2001

[33] [33]

Nielsen, F

F. Nielsen, F. Critchley, and C. T. J. Dodson. Computational Information Geometry . Signals and Communication Technology. Springer Pub- lishing Company, Incorporated, 2017

work page 2017

[34] [34]

Nyman, J

H. Nyman, J. Pensar, T. Koski, and J. Corander. Stratified Graphical Models - Context-Specific In- dependence in Graphical Models. Bayesian Anal- ysis, 9(4):883 – 908, 2014

work page 2014

[35] [35]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K¨ opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chil- amkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: an imperative style, high- performance deep learning library. In Proceedings of the 33rd International Co...

work page 2019

[36] [36]

J. L. Peixoto. Hierarchical variable selection in polynomial regression models. The American Statistician, 41(4):311–313, 1987

work page 1987

[37] [37]

C. R. Rao. Information and the Accuracy Attain- able in the Estimation of Statistical Parameters , pages 235–247. Springer New York, New York, NY, 1992

work page 1992

[38] [38]

Salakhutdinov and G

R. Salakhutdinov and G. Hinton. Deep boltz- mann machines. In D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statis- tics, volume 5 of Proceedings of Machine Learn- ing Research, pages 448–455, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR

work page 2009

[39] [39]

Schmidt and K

M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Y. W. Teh and M. Titterington, ed- itors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statis- tics, volume 9 of Proceedings of Machine Learn- ing Research, pages 709–716, Chia Laguna Resort, Sardinia, Italy, 13...

work page 2010

[40] [40]

T. J. Sejnowski. Higher-order Boltzmann ma- chines. In Neural Networks for Computing , vol- ume 151 of American Institute of Physics Confer- ence Series, pages 398–403. AIP, Aug. 1986

work page 1986

[41] [41]

C. E. Shannon. A mathematical theory of com- munication. The Bell System Technical Journal , 27(3):379–423, 1948

work page 1948

[42] [42]

Shpitser, R

I. Shpitser, R. J. Evans, T. S. Richardson, and J. M. Robins. Sparse nested markov models with log-linear parameters. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Arti- ficial Intelligence, UAI’13, page 576–585, Arling- ton, Virginia, USA, 2013. AUAI Press. A Complete Decomposition of KL Error using Refined Information

work page 2013

[43] [43]

Sugiyama and K

M. Sugiyama and K. Borgwardt. Finding statis- tically significant interactions between continuous features. In Proceedings of the Twenty-Eighth In- ternational Joint Conference on Artificial Intelli- gence, IJCAI-19, pages 3490–3498. International Joint Conferences on Artificial Intelligence Orga- nization, 7 2019

work page 2019

[44] [44]

Sugiyama, H

M. Sugiyama, H. Nakahara, and K. Tsuda. Leg- endre decomposition for tensors. In S. Ben- gio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Ad- vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018

[45] [45]

T. Sun Han. Multiple mutual informations and multiple interactions in frequency data. Informa- tion and Control , 46(1):26–45, 1980

work page 1980

[46] [46]

Tibshirani

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical So- ciety. Series B (Methodological) , 58(1):267–288, 1996

work page 1996

[47] [47]

Van Haaren and J

J. Van Haaren and J. Davis. Markov network structure learning: A randomized feature genera- tion approach. Proceedings of the AAAI Confer- ence on Artificial Intelligence , 26(1):1148–1154, Sep. 2012

work page 2012

[48] [48]

M. J. Wainwright, J. Lafferty, and P. Raviku- mar. High-dimensional graphical model selec- tion using \ell 1-regularized logistic regression. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Sys- tems, volume 19. MIT Press, 2006

work page 2006

[49] [49]

Z. Zeng, H. Zhang, R. Zhang, and C. Yin. A novel feature selection method considering feature in- teraction. Pattern Recognition, 48(8):2656–2666, 2015. Enouen, Sugiyama A Additional Results In Figures 5 and 6, we show the capacity curves across all sets of hyperparameters chosen for the experiments with the 10-dimensional subset of the 23-dimensional mus...

work page 2015