pith. sign in

arxiv: 1906.08654 · v1 · pith:JX7OYSXCnew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

ID3 Learns Juntas for Smoothed Product Distributions

Pith reviewed 2026-05-25 19:39 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords ID3 algorithmk-juntasdecision tree learningsmoothed distributionsproduct distributionslearnabilitypolynomial time
0
0 comments X

The pith

The ID3 algorithm learns k-juntas with k equal to log n in polynomial time when the input distribution is a smoothed product distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the ID3 algorithm learns functions depending on only a logarithmic number of variables in polynomial time when the data follows a smoothed product distribution. This provides a theoretical basis for the practical effectiveness of ID3 on data that is close to independent variable distributions but with some noise. A reader would care because it shows how a simple heuristic avoids worst-case hardness results under realistic assumptions about the data. The result applies specifically when the number of relevant variables is logarithmic in the total number.

Core claim

The paper proves that when the target function is a k-junta with k equal to log n, the ID3 algorithm learns it in polynomial time in the model where the observed distribution is a noisy variant of the original product distribution.

What carries the argument

The ID3 algorithm, which selects splits to build a decision tree, applied to inputs drawn from a smoothed product distribution to recover the k relevant variables of a k-junta.

Load-bearing premise

The target function depends on exactly k variables and the input distribution is a smoothed version of a product distribution.

What would settle it

A specific k-junta instance with k equal to log n on a smoothed product distribution where ID3 requires superpolynomial time to identify the relevant variables.

read the original abstract

In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper analyzes the ID3 decision-tree learning algorithm when the target is a k-junta (depending on exactly k of n Boolean variables). It claims a polynomial-time learnability result for k = log n under the Kalai-Teng smoothed-analysis model: the observed distribution is a noisy perturbation of an underlying product distribution, and the algorithm recovers the junta with high probability.

Significance. If the central proof is correct, the result supplies a smoothed-analysis guarantee for a widely used heuristic on a natural function class, helping explain why ID3 succeeds on junta-like targets even when the underlying distribution is only approximately product. The absence of free parameters or self-referential definitions in the stated claim is a positive feature.

minor comments (2)
  1. [Abstract] Abstract: the precise smoothing parameter (e.g., the noise rate or the Kalai-Teng perturbation radius) is not stated; adding it would clarify the regime in which the polynomial runtime holds.
  2. The manuscript should include a short comparison table or paragraph contrasting the new smoothed bound with prior worst-case or PAC results for ID3 on juntas.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The assessment that the result would supply a smoothed-analysis guarantee for ID3 on junta-like targets (if the central proof holds) aligns with our goals. No major comments appear in the provided report, so we have no specific points to rebut or revise on that basis. We remain available to address any minor suggestions from the editor.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper claims a direct theoretical proof that ID3 learns k-juntas (k=log n) in polynomial time under the external Kalai-Teng smoothed product distribution model. The abstract and claim describe a learnability result for noisy variants of product distributions without any fitted parameters, self-definitional reductions, or load-bearing self-citations. The derivation chain is presented as an analysis within the given smoothed framework rather than a renaming or construction that reduces to its own inputs. No steps matching the enumerated circularity patterns are identifiable from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are identifiable or detailed.

pith-pipeline@v0.9.0 · 9650 in / 1040 out tokens · 58396 ms · 2026-05-25T19:39:45.381342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

  1. [1]

    Learnin g and generalization in overparam- eterized neural networks, going beyond two layers

    Zeyuan Allen-Zhu, Y uanzhi Li, and Yingyu Liang. Learnin g and generalization in overparam- eterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 , 2018

  2. [2]

    A Convergence Theory for Deep Learning via Over-Parameterization

    Zeyuan Allen-Zhu, Y uanzhi Li, and Zhao Song. A convergen ce theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 , 2018

  3. [3]

    Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

    Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019

  4. [4]

    Rank-r decision trees are a subclass of r-dec ision lists

    Avrim Blum. Rank-r decision trees are a subclass of r-dec ision lists. Information Processing Letters, 42(4):183–185, 1992

  5. [5]

    Weakly learning dnf and characterizing statistical query learning using fourier analysis

    Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kea rns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In STOC, volume 94, pages 253–262, 1994

  6. [6]

    Noise-tolera nt learning, the parity problem, and the statistical query model

    Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolera nt learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003

  7. [7]

    SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

    Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Sha lev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linear ly separable data. arXiv preprint arXiv:1710.10174, 2017

  8. [8]

    On the proper learnin g of axis-parallel concepts

    Nader H Bshouty and Lynn Burroughs. On the proper learnin g of axis-parallel concepts. Jour- nal of Machine Learning Research , 4(Jun):157–176, 2003

  9. [9]

    On using extended sta tistical queries to avoid member- ship queries

    Nader H Bshouty and Vitaly Feldman. On using extended sta tistical queries to avoid member- ship queries. Journal of Machine Learning Research , 2(Feb):359–395, 2002

  10. [10]

    Learning dnf from random walks

    Nader H Bshouty, Elchanan Mossel, Ryan O’Donnell, and R occo A Servedio. Learning dnf from random walks. Journal of Computer and System Sciences , 71(3):250–265, 2005

  11. [11]

    Beyond the Low-Degree Algorithm: Mixtures of Subcubes and Their Applications

    Sitan Chen and Ankur Moitra. Beyond the low-degree algo rithm: Mixtures of subcubes and their applications. arXiv preprint arXiv:1803.06521 , 2018

  12. [12]

    Sgd learns the conjugate kernel class of t he network

    Amit Daniely. Sgd learns the conjugate kernel class of t he network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017

  13. [13]

    Gradient Descent Provably Optimizes Over-parameterized Neural Networks

    Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably opti- mizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018

  14. [14]

    Learning deci sion trees from random examples

    Andrzej Ehrenfeucht and David Haussler. Learning deci sion trees from random examples. Information and Computation , 82(3):231–246, 1989

  15. [15]

    New results for learning noisy parities and halfspaces

    Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 2006 47th Annual IEEE Symposium on F oundations of Computer Science (FOCS’06), pages 563–574. IEEE, 2006

  16. [16]

    On agnos- tic learning of parities, monomials, and halfspaces

    Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. On agnos- tic learning of parities, monomials, and halfspaces. SIAM Journal on Computing , 39(2):606– 645, 2009

  17. [17]

    Decision trees: More the oretical justification for practical algorithms

    Amos Fiat and Dmitry Pechyony. Decision trees: More the oretical justification for practical algorithms. In International Conference on Algorithmic Learning Theory , pages 156–170. Springer, 2004

  18. [18]

    Hyperparameter Optimization: A Spectral Approach

    Elad Hazan, Adam Klivans, and Y ang Y uan. Hyperparamete r optimization: A spectral ap- proach. arXiv preprint arXiv:1706.00764 , 2017

  19. [19]

    Learning random log-depth decision trees under the uniform distribution

    Jeffrey C Jackson and Rocco A Servedio. Learning random log-depth decision trees under the uniform distribution. In Learning Theory and Kernel Machines , pages 610–624. Springer, 2003

  20. [20]

    Decision trees are PAC-learnable from most product distributions: a smoothed analysis

    Adam Tauman Kalai and Shang-Hua Teng. Decision trees ar e pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933 , 2008

  21. [21]

    Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework

    Michael Kearns. Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework. In PROCEEDINGS OF THE NATIONAL CON- FERENCE ON ARTIFICIAL INTELLIGENCE , pages 1337–1339, 1996. 9

  22. [22]

    On the boosting abil ity of top–down decision tree learn- ing algorithms

    Michael Kearns and Yishay Mansour. On the boosting abil ity of top–down decision tree learn- ing algorithms. Journal of Computer and System Sciences , 58(1):109–128, 1999

  23. [23]

    Learning decisio n trees using the fourier spectrum

    Eyal Kushilevitz and Yishay Mansour. Learning decisio n trees using the fourier spectrum. SIAM Journal on Computing , 22(6):1331–1348, 1993

  24. [24]

    Wide neural networks of any depth evolv e as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Y asaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolv e as linear models under gradient descent. arXiv preprint arXiv:1902.06720 , 2019

  25. [25]

    A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics

    Chao Ma, Lei Wu, et al. A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics. arXiv preprint arXiv:1904.04326 , 2019

  26. [26]

    Learning monotone decision trees in polynomial time

    Ryan O’Donnell and Rocco A Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing , 37(3):827–844, 2007

  27. [27]

    Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

    Samet Oymak and Mahdi Soltanolkotabi. Overparameteri zed nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004 , 2018

  28. [28]

    Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

    Samet Oymak and Mahdi Soltanolkotabi. Towards moderat e overparameterization: global convergence guarantees for training shallow neural networ ks. arXiv:1902.04674 [cs, math, stat], February 2019. arXiv: 1902.04674

  29. [29]

    Ross Quinlan

    J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986

  30. [30]

    Learning decision lists

    Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987

  31. [31]

    Understanding machine learning: From theory to algorithms

    Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

  32. [32]

    Failures of gradient-based deep learning

    Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machin e Learning-V olume 70, pages 3067–3075. JMLR. org, 2017

  33. [33]

    Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time

    Daniel A Spielman and Shang-Hua Teng. Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

  34. [34]

    Diverse Neural Network Learns True Target Functions

    Bo Xie, Yingyu Liang, and Le Song. Diverse neural networ k learns true target functions. arXiv preprint arXiv:1611.03131, 2016. 10