ID3 Learns Juntas for Smoothed Product Distributions
Pith reviewed 2026-05-25 19:39 UTC · model grok-4.3
The pith
The ID3 algorithm learns k-juntas with k equal to log n in polynomial time when the input distribution is a smoothed product distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that when the target function is a k-junta with k equal to log n, the ID3 algorithm learns it in polynomial time in the model where the observed distribution is a noisy variant of the original product distribution.
What carries the argument
The ID3 algorithm, which selects splits to build a decision tree, applied to inputs drawn from a smoothed product distribution to recover the k relevant variables of a k-junta.
Load-bearing premise
The target function depends on exactly k variables and the input distribution is a smoothed version of a product distribution.
What would settle it
A specific k-junta instance with k equal to log n on a smoothed product distribution where ID3 requires superpolynomial time to identify the relevant variables.
read the original abstract
In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the ID3 decision-tree learning algorithm when the target is a k-junta (depending on exactly k of n Boolean variables). It claims a polynomial-time learnability result for k = log n under the Kalai-Teng smoothed-analysis model: the observed distribution is a noisy perturbation of an underlying product distribution, and the algorithm recovers the junta with high probability.
Significance. If the central proof is correct, the result supplies a smoothed-analysis guarantee for a widely used heuristic on a natural function class, helping explain why ID3 succeeds on junta-like targets even when the underlying distribution is only approximately product. The absence of free parameters or self-referential definitions in the stated claim is a positive feature.
minor comments (2)
- [Abstract] Abstract: the precise smoothing parameter (e.g., the noise rate or the Kalai-Teng perturbation radius) is not stated; adding it would clarify the regime in which the polynomial runtime holds.
- The manuscript should include a short comparison table or paragraph contrasting the new smoothed bound with prior worst-case or PAC results for ID3 on juntas.
Simulated Author's Rebuttal
We thank the referee for the positive summary and recommendation of minor revision. The assessment that the result would supply a smoothed-analysis guarantee for ID3 on junta-like targets (if the central proof holds) aligns with our goals. No major comments appear in the provided report, so we have no specific points to rebut or revise on that basis. We remain available to address any minor suggestions from the editor.
Circularity Check
No significant circularity detected
full rationale
The paper claims a direct theoretical proof that ID3 learns k-juntas (k=log n) in polynomial time under the external Kalai-Teng smoothed product distribution model. The abstract and claim describe a learnability result for noisy variants of product distributions without any fitted parameters, self-definitional reductions, or load-bearing self-citations. The derivation chain is presented as an analysis within the given smoothed framework rather than a renaming or construction that reduces to its own inputs. No steps matching the enumerated circularity patterns are identifiable from the provided text.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Learnin g and generalization in overparam- eterized neural networks, going beyond two layers
Zeyuan Allen-Zhu, Y uanzhi Li, and Yingyu Liang. Learnin g and generalization in overparam- eterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 , 2018
-
[2]
A Convergence Theory for Deep Learning via Over-Parameterization
Zeyuan Allen-Zhu, Y uanzhi Li, and Zhao Song. A convergen ce theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[4]
Rank-r decision trees are a subclass of r-dec ision lists
Avrim Blum. Rank-r decision trees are a subclass of r-dec ision lists. Information Processing Letters, 42(4):183–185, 1992
work page 1992
-
[5]
Weakly learning dnf and characterizing statistical query learning using fourier analysis
Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kea rns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In STOC, volume 94, pages 253–262, 1994
work page 1994
-
[6]
Noise-tolera nt learning, the parity problem, and the statistical query model
Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolera nt learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003
work page 2003
-
[7]
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Sha lev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linear ly separable data. arXiv preprint arXiv:1710.10174, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[8]
On the proper learnin g of axis-parallel concepts
Nader H Bshouty and Lynn Burroughs. On the proper learnin g of axis-parallel concepts. Jour- nal of Machine Learning Research , 4(Jun):157–176, 2003
work page 2003
-
[9]
On using extended sta tistical queries to avoid member- ship queries
Nader H Bshouty and Vitaly Feldman. On using extended sta tistical queries to avoid member- ship queries. Journal of Machine Learning Research , 2(Feb):359–395, 2002
work page 2002
-
[10]
Learning dnf from random walks
Nader H Bshouty, Elchanan Mossel, Ryan O’Donnell, and R occo A Servedio. Learning dnf from random walks. Journal of Computer and System Sciences , 71(3):250–265, 2005
work page 2005
-
[11]
Beyond the Low-Degree Algorithm: Mixtures of Subcubes and Their Applications
Sitan Chen and Ankur Moitra. Beyond the low-degree algo rithm: Mixtures of subcubes and their applications. arXiv preprint arXiv:1803.06521 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[12]
Sgd learns the conjugate kernel class of t he network
Amit Daniely. Sgd learns the conjugate kernel class of t he network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017
work page 2017
-
[13]
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably opti- mizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[14]
Learning deci sion trees from random examples
Andrzej Ehrenfeucht and David Haussler. Learning deci sion trees from random examples. Information and Computation , 82(3):231–246, 1989
work page 1989
-
[15]
New results for learning noisy parities and halfspaces
Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 2006 47th Annual IEEE Symposium on F oundations of Computer Science (FOCS’06), pages 563–574. IEEE, 2006
work page 2006
-
[16]
On agnos- tic learning of parities, monomials, and halfspaces
Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. On agnos- tic learning of parities, monomials, and halfspaces. SIAM Journal on Computing , 39(2):606– 645, 2009
work page 2009
-
[17]
Decision trees: More the oretical justification for practical algorithms
Amos Fiat and Dmitry Pechyony. Decision trees: More the oretical justification for practical algorithms. In International Conference on Algorithmic Learning Theory , pages 156–170. Springer, 2004
work page 2004
-
[18]
Hyperparameter Optimization: A Spectral Approach
Elad Hazan, Adam Klivans, and Y ang Y uan. Hyperparamete r optimization: A spectral ap- proach. arXiv preprint arXiv:1706.00764 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[19]
Learning random log-depth decision trees under the uniform distribution
Jeffrey C Jackson and Rocco A Servedio. Learning random log-depth decision trees under the uniform distribution. In Learning Theory and Kernel Machines , pages 610–624. Springer, 2003
work page 2003
-
[20]
Decision trees are PAC-learnable from most product distributions: a smoothed analysis
Adam Tauman Kalai and Shang-Hua Teng. Decision trees ar e pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933 , 2008
work page internal anchor Pith review Pith/arXiv arXiv 2008
-
[21]
Michael Kearns. Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework. In PROCEEDINGS OF THE NATIONAL CON- FERENCE ON ARTIFICIAL INTELLIGENCE , pages 1337–1339, 1996. 9
work page 1996
-
[22]
On the boosting abil ity of top–down decision tree learn- ing algorithms
Michael Kearns and Yishay Mansour. On the boosting abil ity of top–down decision tree learn- ing algorithms. Journal of Computer and System Sciences , 58(1):109–128, 1999
work page 1999
-
[23]
Learning decisio n trees using the fourier spectrum
Eyal Kushilevitz and Yishay Mansour. Learning decisio n trees using the fourier spectrum. SIAM Journal on Computing , 22(6):1331–1348, 1993
work page 1993
-
[24]
Wide neural networks of any depth evolv e as linear models under gradient descent
Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Y asaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolv e as linear models under gradient descent. arXiv preprint arXiv:1902.06720 , 2019
-
[25]
Chao Ma, Lei Wu, et al. A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics. arXiv preprint arXiv:1904.04326 , 2019
-
[26]
Learning monotone decision trees in polynomial time
Ryan O’Donnell and Rocco A Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing , 37(3):827–844, 2007
work page 2007
-
[27]
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Samet Oymak and Mahdi Soltanolkotabi. Overparameteri zed nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[28]
Samet Oymak and Mahdi Soltanolkotabi. Towards moderat e overparameterization: global convergence guarantees for training shallow neural networ ks. arXiv:1902.04674 [cs, math, stat], February 2019. arXiv: 1902.04674
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[29]
J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986
work page 1986
-
[30]
Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987
work page 1987
-
[31]
Understanding machine learning: From theory to algorithms
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014
work page 2014
-
[32]
Failures of gradient-based deep learning
Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machin e Learning-V olume 70, pages 3067–3075. JMLR. org, 2017
work page 2017
-
[33]
Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time
Daniel A Spielman and Shang-Hua Teng. Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004
work page 2004
-
[34]
Diverse Neural Network Learns True Target Functions
Bo Xie, Yingyu Liang, and Le Song. Diverse neural networ k learns true target functions. arXiv preprint arXiv:1611.03131, 2016. 10
work page internal anchor Pith review Pith/arXiv arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.