ID3 Learns Juntas for Smoothed Product Distributions

Alon Brutzkus; Amit Daniely; Eran Malach

arxiv: 1906.08654 · v1 · pith:JX7OYSXCnew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

ID3 Learns Juntas for Smoothed Product Distributions

Alon Brutzkus , Amit Daniely , Eran Malach This is my paper

Pith reviewed 2026-05-25 19:39 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords ID3 algorithmk-juntasdecision tree learningsmoothed distributionsproduct distributionslearnabilitypolynomial time

0 comments

The pith

The ID3 algorithm learns k-juntas with k equal to log n in polynomial time when the input distribution is a smoothed product distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the ID3 algorithm learns functions depending on only a logarithmic number of variables in polynomial time when the data follows a smoothed product distribution. This provides a theoretical basis for the practical effectiveness of ID3 on data that is close to independent variable distributions but with some noise. A reader would care because it shows how a simple heuristic avoids worst-case hardness results under realistic assumptions about the data. The result applies specifically when the number of relevant variables is logarithmic in the total number.

Core claim

The paper proves that when the target function is a k-junta with k equal to log n, the ID3 algorithm learns it in polynomial time in the model where the observed distribution is a noisy variant of the original product distribution.

What carries the argument

The ID3 algorithm, which selects splits to build a decision tree, applied to inputs drawn from a smoothed product distribution to recover the k relevant variables of a k-junta.

Load-bearing premise

The target function depends on exactly k variables and the input distribution is a smoothed version of a product distribution.

What would settle it

A specific k-junta instance with k equal to log n on a smoothed product distribution where ID3 requires superpolynomial time to identify the relevant variables.

read the original abstract

In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows ID3 learns log n-juntas in poly time under smoothed product distributions, extending prior decision tree analysis with a new positive result in that model.

read the letter

The main thing to know is that this paper proves ID3 learns k-juntas with k = log n in polynomial time when the distribution is a smoothed version of a product distribution in the Kalai-Teng model. The observed data is a noisy variant of the original, and the claim is that this noise is enough to make the algorithm succeed with high probability on the relevant variables. This is new because it pins down a concrete size of junta and applies the smoothed framework specifically to ID3, where earlier work had mostly negative or limited results for exact distributions. The paper does well by linking a common practical heuristic to a model that captures small real-world perturbations without forcing the distribution to be perfectly clean. It gives a positive learnability guarantee where hardness is the usual expectation. The soft spots are minor and mostly about visibility. The abstract states the result but does not include the proof steps, so the exact sample complexity, the required strength of the smoothing parameter, and how the information-gain ordering is controlled cannot be checked here. If the smoothing has to be unrealistically strong, the result would be less useful, but nothing in the claim suggests that is the case. The assumption that the target is exactly a junta is standard for these theorems and does not create an internal problem. The math and citations appear consistent with the smoothed analysis literature. This paper is for theorists working on analysis of greedy algorithms and decision tree learning. A reader who follows smoothed complexity or wants to understand why ID3 behaves well in practice would get value from the result. It deserves a serious referee because it supplies a concrete, non-trivial positive statement in an area that often sees only hardness results.

Referee Report

0 major / 2 minor

Summary. The paper analyzes the ID3 decision-tree learning algorithm when the target is a k-junta (depending on exactly k of n Boolean variables). It claims a polynomial-time learnability result for k = log n under the Kalai-Teng smoothed-analysis model: the observed distribution is a noisy perturbation of an underlying product distribution, and the algorithm recovers the junta with high probability.

Significance. If the central proof is correct, the result supplies a smoothed-analysis guarantee for a widely used heuristic on a natural function class, helping explain why ID3 succeeds on junta-like targets even when the underlying distribution is only approximately product. The absence of free parameters or self-referential definitions in the stated claim is a positive feature.

minor comments (2)

[Abstract] Abstract: the precise smoothing parameter (e.g., the noise rate or the Kalai-Teng perturbation radius) is not stated; adding it would clarify the regime in which the polynomial runtime holds.
The manuscript should include a short comparison table or paragraph contrasting the new smoothed bound with prior worst-case or PAC results for ID3 on juntas.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The assessment that the result would supply a smoothed-analysis guarantee for ID3 on junta-like targets (if the central proof holds) aligns with our goals. No major comments appear in the provided report, so we have no specific points to rebut or revise on that basis. We remain available to address any minor suggestions from the editor.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper claims a direct theoretical proof that ID3 learns k-juntas (k=log n) in polynomial time under the external Kalai-Teng smoothed product distribution model. The abstract and claim describe a learnability result for noisy variants of product distributions without any fitted parameters, self-definitional reductions, or load-bearing self-citations. The derivation chain is presented as an analysis within the given smoothed framework rather than a renaming or construction that reduces to its own inputs. No steps matching the enumerated circularity patterns are identifiable from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are identifiable or detailed.

pith-pipeline@v0.9.0 · 9650 in / 1040 out tokens · 58396 ms · 2026-05-25T19:39:45.381342+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

[1]

Learnin g and generalization in overparam- eterized neural networks, going beyond two layers

Zeyuan Allen-Zhu, Y uanzhi Li, and Yingyu Liang. Learnin g and generalization in overparam- eterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 , 2018

work page arXiv 2018
[2]

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Y uanzhi Li, and Zhao Song. A convergen ce theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[4]

Rank-r decision trees are a subclass of r-dec ision lists

Avrim Blum. Rank-r decision trees are a subclass of r-dec ision lists. Information Processing Letters, 42(4):183–185, 1992

work page 1992
[5]

Weakly learning dnf and characterizing statistical query learning using fourier analysis

Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kea rns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In STOC, volume 94, pages 253–262, 1994

work page 1994
[6]

Noise-tolera nt learning, the parity problem, and the statistical query model

Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolera nt learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003

work page 2003
[7]

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Sha lev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linear ly separable data. arXiv preprint arXiv:1710.10174, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

On the proper learnin g of axis-parallel concepts

Nader H Bshouty and Lynn Burroughs. On the proper learnin g of axis-parallel concepts. Jour- nal of Machine Learning Research , 4(Jun):157–176, 2003

work page 2003
[9]

On using extended sta tistical queries to avoid member- ship queries

Nader H Bshouty and Vitaly Feldman. On using extended sta tistical queries to avoid member- ship queries. Journal of Machine Learning Research , 2(Feb):359–395, 2002

work page 2002
[10]

Learning dnf from random walks

Nader H Bshouty, Elchanan Mossel, Ryan O’Donnell, and R occo A Servedio. Learning dnf from random walks. Journal of Computer and System Sciences , 71(3):250–265, 2005

work page 2005
[11]

Beyond the Low-Degree Algorithm: Mixtures of Subcubes and Their Applications

Sitan Chen and Ankur Moitra. Beyond the low-degree algo rithm: Mixtures of subcubes and their applications. arXiv preprint arXiv:1803.06521 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Sgd learns the conjugate kernel class of t he network

Amit Daniely. Sgd learns the conjugate kernel class of t he network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017

work page 2017
[13]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably opti- mizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Learning deci sion trees from random examples

Andrzej Ehrenfeucht and David Haussler. Learning deci sion trees from random examples. Information and Computation , 82(3):231–246, 1989

work page 1989
[15]

New results for learning noisy parities and halfspaces

Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 2006 47th Annual IEEE Symposium on F oundations of Computer Science (FOCS’06), pages 563–574. IEEE, 2006

work page 2006
[16]

On agnos- tic learning of parities, monomials, and halfspaces

Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. On agnos- tic learning of parities, monomials, and halfspaces. SIAM Journal on Computing , 39(2):606– 645, 2009

work page 2009
[17]

Decision trees: More the oretical justiﬁcation for practical algorithms

Amos Fiat and Dmitry Pechyony. Decision trees: More the oretical justiﬁcation for practical algorithms. In International Conference on Algorithmic Learning Theory , pages 156–170. Springer, 2004

work page 2004
[18]

Hyperparameter Optimization: A Spectral Approach

Elad Hazan, Adam Klivans, and Y ang Y uan. Hyperparamete r optimization: A spectral ap- proach. arXiv preprint arXiv:1706.00764 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Learning random log-depth decision trees under the uniform distribution

Jeffrey C Jackson and Rocco A Servedio. Learning random log-depth decision trees under the uniform distribution. In Learning Theory and Kernel Machines , pages 610–624. Springer, 2003

work page 2003
[20]

Decision trees are PAC-learnable from most product distributions: a smoothed analysis

Adam Tauman Kalai and Shang-Hua Teng. Decision trees ar e pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933 , 2008

work page internal anchor Pith review Pith/arXiv arXiv 2008
[21]

Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework

Michael Kearns. Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework. In PROCEEDINGS OF THE NATIONAL CON- FERENCE ON ARTIFICIAL INTELLIGENCE , pages 1337–1339, 1996. 9

work page 1996
[22]

On the boosting abil ity of top–down decision tree learn- ing algorithms

Michael Kearns and Yishay Mansour. On the boosting abil ity of top–down decision tree learn- ing algorithms. Journal of Computer and System Sciences , 58(1):109–128, 1999

work page 1999
[23]

Learning decisio n trees using the fourier spectrum

Eyal Kushilevitz and Yishay Mansour. Learning decisio n trees using the fourier spectrum. SIAM Journal on Computing , 22(6):1331–1348, 1993

work page 1993
[24]

Wide neural networks of any depth evolv e as linear models under gradient descent

Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Y asaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolv e as linear models under gradient descent. arXiv preprint arXiv:1902.06720 , 2019

work page arXiv 1902
[25]

A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics

Chao Ma, Lei Wu, et al. A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics. arXiv preprint arXiv:1904.04326 , 2019

work page arXiv 1904
[26]

Learning monotone decision trees in polynomial time

Ryan O’Donnell and Rocco A Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing , 37(3):827–844, 2007

work page 2007
[27]

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Samet Oymak and Mahdi Soltanolkotabi. Overparameteri zed nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Samet Oymak and Mahdi Soltanolkotabi. Towards moderat e overparameterization: global convergence guarantees for training shallow neural networ ks. arXiv:1902.04674 [cs, math, stat], February 2019. arXiv: 1902.04674

work page internal anchor Pith review Pith/arXiv arXiv 1902
[29]

Ross Quinlan

J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986

work page 1986
[30]

Learning decision lists

Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987

work page 1987
[31]

Understanding machine learning: From theory to algorithms

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

work page 2014
[32]

Failures of gradient-based deep learning

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machin e Learning-V olume 70, pages 3067–3075. JMLR. org, 2017

work page 2017
[33]

Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time

Daniel A Spielman and Shang-Hua Teng. Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004
[34]

Diverse Neural Network Learns True Target Functions

Bo Xie, Yingyu Liang, and Le Song. Diverse neural networ k learns true target functions. arXiv preprint arXiv:1611.03131, 2016. 10

work page internal anchor Pith review Pith/arXiv arXiv 2016

[1] [1]

Learnin g and generalization in overparam- eterized neural networks, going beyond two layers

Zeyuan Allen-Zhu, Y uanzhi Li, and Yingyu Liang. Learnin g and generalization in overparam- eterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 , 2018

work page arXiv 2018

[2] [2]

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Y uanzhi Li, and Zhao Song. A convergen ce theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[4] [4]

Rank-r decision trees are a subclass of r-dec ision lists

Avrim Blum. Rank-r decision trees are a subclass of r-dec ision lists. Information Processing Letters, 42(4):183–185, 1992

work page 1992

[5] [5]

Weakly learning dnf and characterizing statistical query learning using fourier analysis

Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kea rns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In STOC, volume 94, pages 253–262, 1994

work page 1994

[6] [6]

Noise-tolera nt learning, the parity problem, and the statistical query model

Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolera nt learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003

work page 2003

[7] [7]

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Sha lev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linear ly separable data. arXiv preprint arXiv:1710.10174, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

On the proper learnin g of axis-parallel concepts

Nader H Bshouty and Lynn Burroughs. On the proper learnin g of axis-parallel concepts. Jour- nal of Machine Learning Research , 4(Jun):157–176, 2003

work page 2003

[9] [9]

On using extended sta tistical queries to avoid member- ship queries

Nader H Bshouty and Vitaly Feldman. On using extended sta tistical queries to avoid member- ship queries. Journal of Machine Learning Research , 2(Feb):359–395, 2002

work page 2002

[10] [10]

Learning dnf from random walks

Nader H Bshouty, Elchanan Mossel, Ryan O’Donnell, and R occo A Servedio. Learning dnf from random walks. Journal of Computer and System Sciences , 71(3):250–265, 2005

work page 2005

[11] [11]

Beyond the Low-Degree Algorithm: Mixtures of Subcubes and Their Applications

Sitan Chen and Ankur Moitra. Beyond the low-degree algo rithm: Mixtures of subcubes and their applications. arXiv preprint arXiv:1803.06521 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[12] [12]

Sgd learns the conjugate kernel class of t he network

Amit Daniely. Sgd learns the conjugate kernel class of t he network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017

work page 2017

[13] [13]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably opti- mizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Learning deci sion trees from random examples

Andrzej Ehrenfeucht and David Haussler. Learning deci sion trees from random examples. Information and Computation , 82(3):231–246, 1989

work page 1989

[15] [15]

New results for learning noisy parities and halfspaces

Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 2006 47th Annual IEEE Symposium on F oundations of Computer Science (FOCS’06), pages 563–574. IEEE, 2006

work page 2006

[16] [16]

On agnos- tic learning of parities, monomials, and halfspaces

Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and A shok Kumar Ponnuswami. On agnos- tic learning of parities, monomials, and halfspaces. SIAM Journal on Computing , 39(2):606– 645, 2009

work page 2009

[17] [17]

Decision trees: More the oretical justiﬁcation for practical algorithms

Amos Fiat and Dmitry Pechyony. Decision trees: More the oretical justiﬁcation for practical algorithms. In International Conference on Algorithmic Learning Theory , pages 156–170. Springer, 2004

work page 2004

[18] [18]

Hyperparameter Optimization: A Spectral Approach

Elad Hazan, Adam Klivans, and Y ang Y uan. Hyperparamete r optimization: A spectral ap- proach. arXiv preprint arXiv:1706.00764 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Learning random log-depth decision trees under the uniform distribution

Jeffrey C Jackson and Rocco A Servedio. Learning random log-depth decision trees under the uniform distribution. In Learning Theory and Kernel Machines , pages 610–624. Springer, 2003

work page 2003

[20] [20]

Decision trees are PAC-learnable from most product distributions: a smoothed analysis

Adam Tauman Kalai and Shang-Hua Teng. Decision trees ar e pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933 , 2008

work page internal anchor Pith review Pith/arXiv arXiv 2008

[21] [21]

Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework

Michael Kearns. Boosting theory towards practice: Rec ent developments in decision tree induction and the weak learning framework. In PROCEEDINGS OF THE NATIONAL CON- FERENCE ON ARTIFICIAL INTELLIGENCE , pages 1337–1339, 1996. 9

work page 1996

[22] [22]

On the boosting abil ity of top–down decision tree learn- ing algorithms

Michael Kearns and Yishay Mansour. On the boosting abil ity of top–down decision tree learn- ing algorithms. Journal of Computer and System Sciences , 58(1):109–128, 1999

work page 1999

[23] [23]

Learning decisio n trees using the fourier spectrum

Eyal Kushilevitz and Yishay Mansour. Learning decisio n trees using the fourier spectrum. SIAM Journal on Computing , 22(6):1331–1348, 1993

work page 1993

[24] [24]

Wide neural networks of any depth evolv e as linear models under gradient descent

Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Y asaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolv e as linear models under gradient descent. arXiv preprint arXiv:1902.06720 , 2019

work page arXiv 1902

[25] [25]

A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics

Chao Ma, Lei Wu, et al. A comparative analysis of the opti mization and generalization prop- erty of two-layer neural network and random feature models u nder gradient descent dynamics. arXiv preprint arXiv:1904.04326 , 2019

work page arXiv 1904

[26] [26]

Learning monotone decision trees in polynomial time

Ryan O’Donnell and Rocco A Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing , 37(3):827–844, 2007

work page 2007

[27] [27]

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Samet Oymak and Mahdi Soltanolkotabi. Overparameteri zed nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Samet Oymak and Mahdi Soltanolkotabi. Towards moderat e overparameterization: global convergence guarantees for training shallow neural networ ks. arXiv:1902.04674 [cs, math, stat], February 2019. arXiv: 1902.04674

work page internal anchor Pith review Pith/arXiv arXiv 1902

[29] [29]

Ross Quinlan

J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986

work page 1986

[30] [30]

Learning decision lists

Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987

work page 1987

[31] [31]

Understanding machine learning: From theory to algorithms

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

work page 2014

[32] [32]

Failures of gradient-based deep learning

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machin e Learning-V olume 70, pages 3067–3075. JMLR. org, 2017

work page 2017

[33] [33]

Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time

Daniel A Spielman and Shang-Hua Teng. Smoothed analysi s of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004

[34] [34]

Diverse Neural Network Learns True Target Functions

Bo Xie, Yingyu Liang, and Le Song. Diverse neural networ k learns true target functions. arXiv preprint arXiv:1611.03131, 2016. 10

work page internal anchor Pith review Pith/arXiv arXiv 2016