Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

Florent Krzakala; Jason D. Lee; Yatin Dandi; Yunwei Ren

arxiv: 2601.19756 · v2 · pith:EYFTEP4Snew · submitted 2026-01-27 · 💻 cs.LG · stat.ML

Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

Yunwei Ren , Yatin Dandi , Florent Krzakala , Jason D. Lee This is my paper

classification 💻 cs.LG stat.ML

keywords deephierarchicallearningnetworkslayersmodelsefficientlyexploit

0 comments

read the original abstract

The empirical success of deep learning is often attributed to deep networks' ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results sill focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained with gradient-based methods and standard input-label pairs, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models -- a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training each individual layer suffices to hierarchically learn the target function.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
cs.LG 2026-05 unverdicted novelty 8.0

Neural LoFi models deep learning as layer-wise spectral filtering that selects maximal low-degree correlations, yielding a tractable surrogate for hierarchical representation learning beyond the lazy regime.
Learn from your own latents and not from tokens: A sample-complexity theory
cs.LG 2026-05 unverdicted novelty 7.0

Latent prediction SSL recovers latent trees from PCFG data with sample complexity constant in hierarchy depth L (up to logs), unlike exponential for token-level or supervised methods.
Learning Sparse Compositional Functions with Norm-Constrained Neural Networks
stat.ML 2026-05 unverdicted novelty 4.0

Derives approximation rates and excess risk bounds for Frobenius norm-constrained DNNs learning sparse compositional functions on DAGs, applicable to multi-index models and binary trees while avoiding the curse of dim...