Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

Dat H. Do; Dianbo Liu; Duc V. Le; Rushi Shah

arxiv: 2606.19941 · v1 · pith:D7XJUZKYnew · submitted 2026-06-18 · 💻 cs.LG


Pith reviewed 2026-06-26 18:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords: compositionality, neural networks, depth, connectivity, sparsity, generalization, gradient descent, solution manifolds


The pith

Compositionality in neural networks arises only in a narrow depth and specific sparse connectivity regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural networks trained by gradient descent rarely develop internal compositionality, the reuse of meaningful primitives in new combinations that supports generalization. This work shows the property appears only when both network depth and connectivity fall inside a narrow, target-dependent sweet spot. Specific patterns of sparse connections are required; random or different sparsity patterns do not suffice. Shallower or deeper networks, or those outside the right connectivity pattern, converge instead to fractured non-compositional solutions. The authors supply a pruning procedure to locate the right connectivity, a depth heuristic, and a supporting theory based on compositional sparsity, volume ratios, and feature-interference bounds.

Core claim

Compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis it appears only in certain specifically sparse networks and depends on which connections remain rather than on weight sparsity alone. Along the depth axis it emerges inside a narrow, target-dependent regime, peaking at particular depths while both shallower and deeper networks fail. When either condition is violated, gradient descent silently converges to fractured solutions. The findings are supported by similarity-based pruning to recover compositional connectivity, a heuristic depth predictor, and a theoretical framework of compositional sparsity, volume-ratio arguments, and feature-interference bounds.

What carries the argument

The narrow depth-connectivity regime that constrains reachable solution manifolds, identified through compositional sparsity, volume-ratio arguments, and feature-interference bounds.

If this is right

Gradient descent reaches compositional solutions only when both the depth and the specific connectivity pattern satisfy the narrow regime.
Violating either the depth or connectivity condition causes convergence to fractured rather than compositional solutions.
Similarity-based pruning can recover the connectivity pattern that permits compositional solutions.
A heuristic depth predictor can locate the depths at which compositionality is most likely for a given target.
The theoretical framework of compositional sparsity, volume ratios, and feature-interference bounds accounts for the limited reachability of compositional manifolds.
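The distinction between weight sparsity and which connections remain can be made concrete with a toy example. The sketch below is illustrative only and is not the paper's similarity-based pruning procedure: the `structured` mask, the helper names, and the 4x4 size are all assumptions introduced here. It builds two masks with identical sparsity but different wiring.

```python
import random


def sparsity(mask):
    """Fraction of connections removed (mask entries equal to 0)."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


def random_mask_like(mask, seed=0):
    """Random mask with the same shape and the same number of kept connections."""
    rows, cols = len(mask), len(mask[0])
    kept = sum(sum(row) for row in mask)
    rng = random.Random(seed)
    chosen = set(rng.sample(range(rows * cols), kept))
    return [[1 if r * cols + c in chosen else 0 for c in range(cols)]
            for r in range(rows)]


def overlap(mask_a, mask_b):
    """Share of mask_a's kept connections that mask_b also keeps."""
    both = sum(a & b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb))
    return both / sum(sum(row) for row in mask_a)


# Hypothetical block-structured wiring: inputs 0-1 feed units 0-1,
# inputs 2-3 feed units 2-3. Not taken from the paper.
structured = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
shuffled = random_mask_like(structured, seed=0)
```

Both masks have a sparsity of 0.5, yet `overlap(structured, shuffled)` is typically below 1, which is the sense in which a sparsity level alone does not pin down the connectivity pattern.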

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regime may explain why standard architectures trained end-to-end often fail to exhibit strong compositionality even when the task admits it.
Task-specific depth selection or connectivity search could be used to steer training toward compositional solutions without changing the optimizer.
Different tasks likely possess different optimal depths inside the regime, requiring per-target tuning rather than a universal depth choice.

Load-bearing premise

The observed failure to reach compositional solutions outside the narrow regime is caused by architecture constraints on depth and specific connectivity rather than by optimization dynamics, data distribution, or initialization.

What would settle it

Demonstrating compositional internal structure in networks whose depth or connectivity lies outside the identified narrow regime would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19941 by Dat H. Do, Dianbo Liu, Duc V. Le, Rushi Shah.

**Figure 1.** Comparing internal structure (Red=-1, White=0, Blue=1): an evolutionary algorithm (EA) [11] setup can yield factorized, reusable intermediate features, whereas an SGD-trained network [10] often exhibits fragmented, entangled ones. We quantify compositionality via weight sweeping: perturb each nonzero parameter by noise δ and query a calibrated VLM judge under a fixed prompt to assess whether the image stil…
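The weight-sweeping procedure in the Figure 1 caption can be sketched as follows. This is a minimal sketch under stated assumptions: `render` and `judge` are hypothetical callables, the judge is a stand-in for the calibrated VLM judge (not implemented here), perturbing by both +δ and -δ is a choice made for this example, and aggregating by the fraction of accepted perturbations is an assumed scoring rule, since the caption is truncated before the score is defined.

```python
def weight_sweep(weights, render, judge, delta=0.1):
    """Perturb each nonzero weight by +/-delta and ask `judge` about the result.

    weights: flat list of floats (zeros are treated as pruned and skipped).
    render:  callable mapping a weight list to an output (e.g. an image).
    judge:   callable (baseline_output, perturbed_output) -> bool; a stand-in
             for the calibrated VLM judge, which this sketch does not implement.
    Returns the fraction of perturbations the judge accepts (assumed aggregate).
    """
    baseline = render(weights)
    accepted = 0
    trials = 0
    for i, w in enumerate(weights):
        if w == 0.0:
            continue  # pruned connection: nothing to sweep
        for sign in (1.0, -1.0):
            perturbed = list(weights)
            perturbed[i] = w + sign * delta
            accepted += bool(judge(baseline, render(perturbed)))
            trials += 1
    return accepted / trials if trials else 0.0


# Toy stand-ins: the "image" is just the sum of the weights, and the toy
# judge accepts any perturbation that changes the output noticeably.
toy_score = weight_sweep(
    [0.5, 0.0, -1.0],
    render=sum,
    judge=lambda base, out: abs(out - base) > 0.05,
)
```

In a real setting `render` would evaluate the network over the pixel grid and `judge` would wrap the model-based assessment; the toy stand-ins only exercise the control flow.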

**Figure 2.** Architectural bias shapes compositionality (Red=-1, White=0, Blue=1). The MLP takes pixel coordinate inputs $(x, y, d = \sqrt{x^2 + y^2}, 1)$ and predicts (h, s, v), which are converted into the RGB image. Each square box visualizes the output/activation map of a single neuron over the image grid. Left: Preserving the NEAT sparse wiring and retraining the MLP yields partially compositional intermediate features.…
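The input and output encoding described in the Figure 2 caption can be sketched as below. This is an illustration under assumptions, not the authors' code: the [-1, 1] coordinate range, the clamping of (h, s, v) to [0, 1], and the function names are all choices made here; only the input tuple (x, y, d, 1) and the HSV-to-RGB conversion come from the caption.

```python
import colorsys
import math


def coordinate_inputs(width, height):
    """Return one (x, y, d, 1) tuple per pixel in row-major order.

    x and y are scaled to [-1, 1] (assumed range); width and height must be >= 2.
    """
    inputs = []
    for row in range(height):
        for col in range(width):
            x = 2.0 * col / (width - 1) - 1.0
            y = 2.0 * row / (height - 1) - 1.0
            d = math.sqrt(x * x + y * y)  # distance from the image centre
            inputs.append((x, y, d, 1.0))  # trailing 1 acts as a bias input
    return inputs


def hsv_to_rgb_pixel(h, s, v):
    """Clamp predicted (h, s, v) to [0, 1] (an assumption) and convert to RGB."""
    def clamp(t):
        return min(1.0, max(0.0, t))
    return colorsys.hsv_to_rgb(clamp(h), clamp(s), clamp(v))


grid = coordinate_inputs(3, 3)
```

For a 3x3 grid the centre pixel maps to `(0.0, 0.0, 0.0, 1.0)`, and the corner pixels have `d` equal to the square root of 2.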

**Figure 3.** Compositional score vs depth offset on Picbreeder artifacts. We vary only the network depth around a reference depth while keeping pruning and training settings fixed. Each artifact exhibits a peak at a target-specific depth, while shallower and deeper networks show reduced modularity. The original Picbreeder CPPNs (see Appendix D) already exhibit different depths across artifacts, suggesting that compositio…

**Figure 4.** Image complexity predicts optimally compositional depth. We compute the PNG compression ratio for each target image and compare it with the empirically optimal depth found by depth sweep. Higher image complexity tends to correlate with larger optimal compositional depth. Section 2.4 shows that compositionality peaks at a target-specific optimal compositional depth, implying there is no single depth that …
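The complexity measure in the Figure 4 caption can be approximated with the standard library. The sketch below is a proxy under stated assumptions: it applies `zlib` (the DEFLATE compressor that PNG also uses) to raw pixel bytes rather than writing an actual PNG file, and it defines the ratio as compressed size over raw size, a direction the caption does not specify. The mapping from this ratio to a depth is not given in the source, so none is attempted here.

```python
import random
import zlib


def compression_ratio(pixels: bytes) -> float:
    """Compressed size divided by raw size; higher means less compressible."""
    if not pixels:
        raise ValueError("empty image")
    return len(zlib.compress(pixels, 9)) / len(pixels)


# Two hypothetical 64x64 single-channel images.
flat = bytes([128]) * 4096                                   # uniform grey
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(4096))      # pseudo-random noise

flat_ratio = compression_ratio(flat)
noisy_ratio = compression_ratio(noisy)
```

Under this definition the uniform image yields a much smaller ratio than the noisy one, so a larger ratio plays the role of higher image complexity in the caption's correlation.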

**Figure 5.** Qualitative result of out-of-domain targets beyond Picbreeder. We apply SP and heuristic depth search to images whose underlying compositional structure is unknown. The model exhibits monosemantic intermediate features, meaningful output changes under weight sweeping, and a depth-ablation peak near the predicted depth. More results are included in Appendix H.4. Combining SP with heuristic depth search …

**Figure 6.** Theoretical versus empirical volume ratio.

**Figure 7.** Three mechanisms biasing SGD towards the compositional basin.

**Figure 8.** Feature orthogonality comparison across three Picbreeder artifacts, with and without SP.

**Figure 9.** Sensitivity of the predicted compositional score to the number of primitives.

**Figure 10.** Internal representation of the Picbreeder skull CPPN.

**Figure 11.** Internal representation of the Picbreeder butterfly CPPN.

**Figure 12.** Internal representation of the Picbreeder apple CPPN.

**Figure 13.** Final training loss on Picbreeder’s skull. Using Muon with more Newton-Schulz steps reaches lower loss. While S-Prune can expose more distinctive subnetworks, the optimizer still strongly influences how fractured the learned solution remains.

**Figure 14.** Full visualization of SP on Picbreeder’s butterfly artifact.

**Figure 15.** Full visualization of SP on Picbreeder’s apple artifact.

**Figure 16.** SP on MLPs having 11 layers (optimal depth is 12) on Picbreeder’s skull artifact lead to

**Figure 17.** SP on MLPs having 13 layers (optimal depth is 12) on Picbreeder’s skull artifact lead to

**Figure 18.** SP (2 rounds) + Adam on Picbreeder’s skull.

**Figure 19.** SP (2 rounds) + Muon on Picbreeder’s skull.

**Figure 20.** SP (2 rounds) + Muon (NS step=20) on Picbreeder’s skull.

**Figure 21.** SP (2 rounds) + Adam on Picbreeder’s butterfly.

**Figure 22.** SP (2 rounds) + Muon on Picbreeder’s butterfly.

**Figure 23.** SP (2 rounds) + Muon (NS step=20) on Picbreeder’s butterfly.

**Figure 24.** Full visualization of SP on a car image with some corresponding images from weights

**Figure 25.** Full visualization of SP on a cat image with some corresponding images from weights

**Figure 26.** Full visualization of SP on a butterfly image.

**Figure 27.** Full visualization of SP on an image illustrating a red cube and a yellow sphere.

**Figure 28.** Full visualization of SP on an image illustrating a real butterfly with background.

**Figure 29.** Full visualization of SP on an image illustrating a real butterfly without background.

**Figure 30.** SP round 1 - 1404 parameters

**Figure 31.** Lottery Ticket Hypothesis [13] on skull image, 473 weights

**Figure 32.** Lottery Ticket Hypothesis [13] on skull image, 1452 weights

**Figure 33.** Wanda [14] on skull image, 1404 weights

**Figure 34.** LLM-Pruner [15] on skull image, 1397 weights

**Figure 35.** SP round 1 on butterfly image, 3405 weights

**Figure 36.** Lottery Ticket Hypothesis [13] on butterfly image, 468 weights

**Figure 37.** Lottery Ticket Hypothesis [13] on butterfly image, 3392 weights

**Figure 38.** Training loss with different optimizers using learning rate 5e-3

**Figure 39.** Training loss with different optimizers using learning rate 1e-3

**Figure 40.** Training loss with different optimizers using learning rate 5e-4

**Figure 41.** Training loss with different optimizers using learning rate 1e-4

**Figure 42.** Multi-CPPN for Picbreeder’s skull, n=-1

**Figure 43.** Multi-CPPN for Picbreeder’s butterfly, n=0

**Figure 44.** Multi-CPPN for Picbreeder’s apple, n=1

Original abstract

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality appears only in some specifically sparse networks and depends heavily on which connections remain rather than on weight sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Compositionality appears only in a narrow depth-connectivity window, with two new tools to locate it, but the unreachability claim outside that window rests on observed failures rather than a full dynamics argument.


The main thing to know is that this paper finds compositionality only in a narrow, target-dependent depth band and only for certain specific sparse connectivities rather than sparsity in general. Outside that window gradient descent reaches fractured solutions instead.

What stands out is the pair of practical methods: similarity-based pruning to recover the right connectivity pattern and a heuristic depth predictor. Those are concrete and could be tried by others. The experiments appear to map the regime clearly enough to show the pattern, and the theoretical framing with compositional sparsity, volume ratios, and interference bounds gives a plausible account for why the regime works.

The softer part is the direction of the central claim. The framework explains reachability inside the regime but does not derive that compositional manifolds are inaccessible to gradient descent outside it. The failures could still trace to optimization details, initialization, or training length rather than hard architectural exclusion. A stronger case would need either an explicit argument about the loss landscape or controls showing that alternative training choices still cannot escape the fractured attractors.

This is aimed at people studying how architecture shapes systematic generalization. It has enough new empirical mapping and methods to justify sending it to referees, though the dynamics gap will likely draw comments.

Referee Report

2 major / 2 minor

Summary. The paper claims that compositionality emerges in neural networks only within a narrow depth-connectivity regime: specific sparse connectivity patterns (not mere weight sparsity) along the connectivity axis, and a narrow target-dependent depth range (peaking at specific depths, failing for shallower or deeper nets) along the depth axis. Outside this regime, gradient descent converges to fractured non-compositional solutions. The authors introduce similarity-based pruning (SP) to recover compositional connectivity and a heuristic depth predictor, and support the findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds.

Significance. If the central claim holds, the work identifies architecture constraints that control reachability of compositional solutions under gradient descent, offering both an explanation for why such structure is rare and practical methods (SP and depth heuristic) to induce it. The empirical discovery of the narrow regime combined with the theoretical framing could guide architecture design for compositional generalization tasks.

major comments (2)

[Theoretical Framework] Theoretical Framework section: the compositional sparsity, volume-ratio arguments, and feature-interference bounds are invoked to explain why compositional solutions are reachable only inside the narrow regime. However, these primarily bound manifold measure or interference; they do not derive that gradient-descent trajectories have no connecting paths to compositional solutions outside the regime or that the dynamics are forced into fractured attractors. The 'unreachability' direction therefore rests on extrapolation from observed empirical failures rather than a direct consequence of the bounds.
[Experiments] Empirical results on depth axis (abstract and §Experiments): the claim that both shallower and deeper networks fail to reach compositional solutions is central, yet the manuscript does not report controls that isolate architecture constraints from optimization dynamics, data distribution, or initialization effects. Without such isolation, the narrow-regime conclusion remains vulnerable to the alternative that the failures are optimization artifacts rather than manifold unreachability.

minor comments (2)

The description of similarity-based pruning (SP) would benefit from an explicit algorithm box or pseudocode to clarify how 'which connections remain' are selected versus random or magnitude-based sparsity.
Notation for 'fractured solutions' is used without a formal definition; a short paragraph relating it to the volume-ratio or interference quantities would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below, clarifying the scope of our theoretical results and committing to additional experimental controls where appropriate.


Referee: [Theoretical Framework] Theoretical Framework section: the compositional sparsity, volume-ratio arguments, and feature-interference bounds are invoked to explain why compositional solutions are reachable only inside the narrow regime. However, these primarily bound manifold measure or interference; they do not derive that gradient-descent trajectories have no connecting paths to compositional solutions outside the regime or that the dynamics are forced into fractured attractors. The 'unreachability' direction therefore rests on extrapolation from observed empirical failures rather than a direct consequence of the bounds.

Authors: We agree that the theoretical framework (compositional sparsity, volume-ratio arguments, and feature-interference bounds) establishes that compositional solution manifolds have larger relative measure and lower interference inside the identified regime, thereby making such solutions more accessible under gradient descent. The framework does not, however, derive a rigorous statement that no connecting paths exist in parameter space outside the regime or that the dynamics are provably trapped in fractured attractors. The unreachability claim outside the regime is therefore supported primarily by the empirical evidence of consistent convergence to fractured solutions across multiple depths, connectivities, and tasks. In revision we will explicitly distinguish the theoretical support for preferential reachability inside the regime from the empirical observation of unreachability outside it.

Revision: partial

Referee: [Experiments] Empirical results on depth axis (abstract and §Experiments): the claim that both shallower and deeper networks fail to reach compositional solutions is central, yet the manuscript does not report controls that isolate architecture constraints from optimization dynamics, data distribution, or initialization effects. Without such isolation, the narrow-regime conclusion remains vulnerable to the alternative that the failures are optimization artifacts rather than manifold unreachability.

Authors: We acknowledge that the current experiments do not include exhaustive ablations that fully isolate depth and connectivity constraints from optimizer hyperparameters, initialization distributions, or data variations. While we observe the same narrow-regime pattern across multiple random seeds, datasets, and architectures, additional targeted controls would strengthen the architectural interpretation. We will add such controls (varying learning-rate schedules, initialization scales, and data subsampling while fixing depth/connectivity) in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation remains self-contained


The provided abstract and context describe empirical results on compositionality in a narrow depth-connectivity regime, supported by a theoretical framework of compositional sparsity, volume-ratio arguments, and feature-interference bounds. No equations, self-citations, fitted parameters renamed as predictions, or self-definitional steps are exhibited in the text. The theory is presented as explanatory support for the observed regime rather than reducing to the inputs by construction. Without quotable reductions matching the enumerated patterns, the central claim does not collapse into tautology or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or sections from which free parameters, axioms, or invented entities can be extracted; the theoretical framework is referenced at a high level only.



Reference graph

Works this paper leans on

71 extracted references · 22 canonical work pages · 6 internal anchors

[1] Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a controlled study. arXiv preprint arXiv:2505.00661, 2025.
[2] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288, 2023.
[3] Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, Xudong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, and Angjoo Kanazawa. Visually prompted benchmarks are surprisingly fragile. ArXiv, abs/2512.17875, 2025.
[4] Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, et al. Vp-bench: A comprehensive benchmark for visual prompting in multimodal large language models. arXiv preprint arXiv:2511.11438, 2025.
[5] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
[6] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[7] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.
[8] Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. Geneval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025.
[9] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[10] Akarsh Kumar, Jeff Clune, Joel Lehman, and Kenneth O. Stanley. Questioning representational optimism in deep learning: The fractured entangled representation hypothesis. arXiv preprint arXiv:2505.11581, 2025.
[11] Jimmy Secretan, Nicholas Beato, David B D'Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: evolving pictures collaboratively online. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1759–1768, 2008.
[12] Kenneth O Stanley and Risto Miikkulainen. Efficient evolution of neural network topologies. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), volume 2, pages 1757–1762. IEEE, 2002.
[13] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[14] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
[15] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023.
[16] David A. Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso A. Poggio. Position: A theory of deep learning must include compositional sparsity. ArXiv, abs/2507.02550, 2025.
[17] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
[18] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
[19] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations.
[20] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
[21] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.
[22]

The MIT press, 2017

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf.Elements of causal inference: founda- tions and learning algorithms. The MIT press, 2017

2017
[23]

Can subnetwork structure be the key to out-of-distribution generalization? InInternational conference on machine learning, pages 12356–12367

Dinghuai Zhang, Kartik Ahuja, Yilun Xu, Yisen Wang, and Aaron Courville. Can subnetwork structure be the key to out-of-distribution generalization? InInternational conference on machine learning, pages 12356–12367. PMLR, 2021

2021
[24] Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational analysis of rule-based concept learning. Cognitive science, 32(1):108–154, 2008
[25] Steven Phillips and William H Wilson. Categorial compositionality: A category theory explanation for the systematicity of human cognition. PLoS computational biology, 6(7):e1000858, 2010
[26] Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pages 8489–8510. PMLR, 2023
[27] Sam Spilsbury and Alexander Ilin. Compositional generalization in grounded language learning via induced model sparsity. arXiv preprint arXiv:2207.02518, 2022
[28] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22691–22702, 2023
[29] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024
[30] Martha Lewis, Nihal Nayak, Peilin Yu, Jack Merullo, Qinan Yu, Stephen Bach, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1487–1500, 2024
[31] Tian Yun, Usha Bhalla, Ellie Pavlick, and Chen Sun. Do vision-language pretrained models learn primitive concepts. arXiv preprint arXiv:2203.17271, 3(5):6, 2022
[32] Michael Lepori, Thomas Serre, and Ellie Pavlick. Break it down: Evidence for structural compositionality in neural networks. Advances in Neural Information Processing Systems, 36:42623–42660, 2023
[33] Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. Compositional generalization from first principles. Advances in Neural Information Processing Systems, 36:6941–6960, 2023
[34] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989
[35] Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993
[36] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017
[37] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16091–16101, 2023
[38] Azade Nova, Hanjun Dai, and Dale Schuurmans. Gradient-free structured pruning with unlabeled data. In International Conference on Machine Learning, pages 26326–26341. PMLR, 2023
[39] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015
[40] Advait Harshal Gadhikar, Sohom Mukherjee, and Rebekka Burkholz. Why random pruning is all we need to start sparse. In International Conference on Machine Learning, pages 10542–10570. PMLR, 2023
[41] Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Kumar Jaiswal, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! In The Eleventh International Conference on Learning Representations
[42] Abhimanyu Dubey, Moitreya Chatterjee, and Narendra Ahuja. Coreset-based neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 454–470, 2018
[43] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017
[44] Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11833–11856, 2023
[45] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023
[46] Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1288–1301, 2024
[47] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. Advances in neural information processing systems, 33:20378–20389, 2020
[48] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pages 5544–5555. PMLR, 2020
[49] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019
[50] Yefan Zhou, Yaoqing Yang, Arin Chang, and Michael W Mahoney. A three-regime model of network pruning. In International Conference on Machine Learning, pages 42790–42809. PMLR, 2023
[51] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020
[52] Steven M Frankland and Joshua D Greene. Concepts and compositionality: in search of the brain’s language of thought. Annual review of psychology, 71(1):273–303, 2020
[53] Nicholas T Franklin and Michael J Frank. Compositional clustering in task structure learning. PLoS computational biology, 14(4):e1006116, 2018
[54] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European conference on computer vision, pages 423–439. Springer, 2022
[55] Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021
[56] Guangyue Xu, Parisa Kordjamshidi, and Joyce Chai. Prompting large pre-trained vision-language models for compositional concept learning. arXiv preprint arXiv:2211.05077, 2022
[57] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022
[58] Emanuele Bugliarello and Desmond Elliott. The role of syntactic planning in compositional image captioning. arXiv preprint arXiv:2101.11911, 2021
[59] Colin Conwell and Tomer Ullman. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022
[60] Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022
[61] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022
[62] Jacob Andreas, Marco Baroni, Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes, Jacob Devlin, Alona Fyshe, Leila Wehbe, et al. Measuring compositionality in representation learning. In International conference on learning representations, volume 375, pages 2227–2237. Association for Computational Linguistics, 2019
[63] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018
[64] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017
[65] Lukas Schott, Julius Von Kügelgen, Frederik Träuble, Peter Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, and Wieland Brendel. Visual representation learning does not generalize strongly within the same domain. arXiv preprint arXiv:2107.08221, 2021
[66] Josef Valvoda, Naomi Saphra, Jonathan Rawski, Adina Williams, and Ryan Cotterell. Benchmarking compositionality with formal languages. 2022
[67] Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image generation benchmark with controllable difficulty. Advances in Neural Information Processing Systems, 37:86004–86047, 2024
[68] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11264–11272, 2019
[69] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, 2022
[70] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015
[71] Mansheej Paul, Feng Chen, Brett W Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina Dziugaite. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? arXiv preprint arXiv:2210.03044, 2022

A Proofs for Section 5

[Figure: (W, L) = C(W, P)/Total for P=3, over width W (3, 6, 12, 24, 48, 96, 192) and depth L (1 to 9), with the predicted sweet spot marked.] ...

[1] Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a controlled study. arXiv preprint arXiv:2505.00661, 2025

[2] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv preprint arXiv:2309.12288, 2023

[3] Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, Xudong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, and Angjoo Kanazawa. Visually prompted benchmarks are surprisingly fragile. ArXiv, abs/2512.17875, 2025

[4] Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, et al. Vp-bench: A comprehensive benchmark for visual prompting in multimodal large language models. arXiv preprint arXiv:2511.11438, 2025

[5] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023

[6] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

[7] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

[8] Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. Geneval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025

[9] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

[10] Akarsh Kumar, Jeff Clune, Joel Lehman, and Kenneth O. Stanley. Questioning representational optimism in deep learning: The fractured entangled representation hypothesis. arXiv preprint arXiv:2505.11581, 2025

[11] Jimmy Secretan, Nicholas Beato, David B D’Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: evolving pictures collaboratively online. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1759–1768, 2008

[12] Kenneth O Stanley and Risto Miikkulainen. Efficient evolution of neural network topologies. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), volume 2, pages 1757–1762. IEEE, 2002

[13] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

[14] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

[15] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023

[16] David A. Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso A. Poggio. Position: A theory of deep learning must include compositional sparsity. ArXiv, abs/2507.02550, 2025

[17] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017

[18] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017

[19] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations

[20] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017

[21] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

[22] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT press, 2017

[23] [23]

Can subnetwork structure be the key to out-of-distribution generalization? InInternational conference on machine learning, pages 12356–12367

Dinghuai Zhang, Kartik Ahuja, Yilun Xu, Yisen Wang, and Aaron Courville. Can subnetwork structure be the key to out-of-distribution generalization? InInternational conference on machine learning, pages 12356–12367. PMLR, 2021

2021

[24] [24]

A rational analysis of rule-based concept learning.Cognitive science, 32(1):108–154, 2008

Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational analysis of rule-based concept learning.Cognitive science, 32(1):108–154, 2008

2008

[25] [25]

Categorial compositionality: A category theory explana- tion for the systematicity of human cognition.PLoS computational biology, 6(7):e1000858, 2010

Steven Phillips and William H Wilson. Categorial compositionality: A category theory explana- tion for the systematicity of human cognition.PLoS computational biology, 6(7):e1000858, 2010

2010

[26] [26]

Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc

Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. InInternational conference on machine learning, pages 8489–8510. PMLR, 2023

2023

[27] [27]

Compositional generalization in grounded language learning via induced model sparsity.arXiv preprint arXiv:2207.02518, 2022

Sam Spilsbury and Alexander Ilin. Compositional generalization in grounded language learning via induced model sparsity.arXiv preprint arXiv:2207.02518, 2022

work page arXiv 2022

[28] [28]

Ablating concepts in text-to-image diffusion models

Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22691–22702, 2023

2023

[29] [29]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Ton g, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 11

2024

[30] [30]

Does clip bind concepts? probing compositionality in large image models

Martha Lewis, Nihal Nayak, Peilin Yu, Jack Merullo, Qinan Yu, Stephen Bach, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. InFindings of the Association for Computational Linguistics: EACL 2024, pages 1487–1500, 2024

2024

[31] [31]

Do vision-language pretrained models learn primitive concepts.arXiv preprint arXiv:2203.17271, 3(5):6, 2022

Tian Yun, Usha Bhalla, Ellie Pavlick, and Chen Sun. Do vision-language pretrained models learn primitive concepts.arXiv preprint arXiv:2203.17271, 3(5):6, 2022

work page arXiv 2022

[32] [32]

Break it down: Evidence for structural compositionality in neural networks.Advances in Neural Information Processing Systems, 36:42623–42660, 2023

Michael Lepori, Thomas Serre, and Ellie Pavlick. Break it down: Evidence for structural compositionality in neural networks.Advances in Neural Information Processing Systems, 36:42623–42660, 2023

2023

[33] [33]

Com- positional generalization from first principles.Advances in Neural Information Processing Systems, 36:6941–6960, 2023

Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. Com- positional generalization from first principles.Advances in Neural Information Processing Systems, 36:6941–6960, 2023

2023

[34] [34]

Optimal brain damage.Advances in neural information processing systems, 2, 1989

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage.Advances in neural information processing systems, 2, 1989

1989

[35] [35]

Optimal brain surgeon and general network pruning

Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. InIEEE international conference on neural networks, pages 293–299. IEEE, 1993

1993

[36] [36]

Learning efficient convolutional networks through network slimming

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. InProceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017

2017

[37] [37]

Depgraph: Towards any structural pruning

Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16091–16101, 2023

2023

[38] [38]

Gradient-free structured pruning with unla- beled data

Azade Nova, Hanjun Dai, and Dale Schuurmans. Gradient-free structured pruning with unla- beled data. InInternational Conference on Machine Learning, pages 26326–26341. PMLR, 2023

2023

[39] [39]

Learning both weights and connections for efficient neural network.Advances in neural information processing systems, 28, 2015

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network.Advances in neural information processing systems, 28, 2015

2015

[40] [40]

Why random pruning is all we need to start sparse

Advait Harshal Gadhikar, Sohom Mukherjee, and Rebekka Burkholz. Why random pruning is all we need to start sparse. InInternational Conference on Machine Learning, pages 10542–10570. PMLR, 2023

2023

[41] [41]

Sparsity may cry: Let us fail (current) sparse neural networks together! InThe Eleventh International Conference on Learning Representations

Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, AJAY KUMAR JAISW AL, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! InThe Eleventh International Conference on Learning Representations

[42] [42]

Coreset-based neural network compression

Abhimanyu Dubey, Moitreya Chatterjee, and Narendra Ahuja. Coreset-based neural network compression. InProceedings of the European Conference on Computer Vision (ECCV), pages 454–470, 2018

2018

[43] [43]

Pruning convolutional neural networks for resource efficient inference

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. InInternational Conference on Learning Representations, 2017

2017

[44] [44]

Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11833–11856, 2023

2023

[45] [45]

Deja vu: Contextual sparsity for efficient llms at inference time

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivas- tava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. InInternational Conference on Machine Learning, pages 22137–22176. PMLR, 2023

2023

[46] [46]

Neurons in large language models: Dead, n-gram, positional

Elena V oita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. InFindings of the Association for Computational Linguistics: ACL 2024, pages 1288–1301, 2024. 12

2024

[47] [47]

Movement pruning: Adaptive sparsity by fine-tuning.Advances in neural information processing systems, 33:20378–20389, 2020

Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning.Advances in neural information processing systems, 33:20378–20389, 2020

2020

[48] [48]

Soft threshold weight reparameterization for learnable sparsity

Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pages 5544–5555. PMLR, 2020

2020

[49] [49]

Rethinking the value of network pruning

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. InInternational Conference on Learning Representations, 2019

2019

[50] [50]

A three-regime model of network pruning

Yefan Zhou, Yaoqing Yang, Arin Chang, and Michael W Mahoney. A three-regime model of network pruning. InInternational Conference on Machine Learning, pages 42790–42809. PMLR, 2023

2023

[51] [51]

Comparing rewinding and fine-tuning in neural network pruning

Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. InInternational Conference on Learning Representations, 2020

2020

[52] [52]

Concepts and compositionality: in search of the brain’s language of thought.Annual review of psychology, 71(1):273–303, 2020

Steven M Frankland and Joshua D Greene. Concepts and compositionality: in search of the brain’s language of thought.Annual review of psychology, 71(1):273–303, 2020

2020

[53] [53]

Compositional clustering in task structure learning

Nicholas T Franklin and Michael J Frank. Compositional clustering in task structure learning. PLoS computational biology, 14(4):e1006116, 2018

2018

[54] [54]

Compositional visual generation with composable diffusion models

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean conference on computer vision, pages 423–439. Springer, 2022

2022

[55] [55]

Unsupervised learning of compositional energy concepts.Advances in Neural Information Processing Systems, 34:15608–15620, 2021

Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts.Advances in Neural Information Processing Systems, 34:15608–15620, 2021

2021

[56] [56]

Prompting large pre-trained vision- language models for compositional concept learning.arXiv preprint arXiv:2211.05077, 2022

Guangyue Xu, Parisa Kordjamshidi, and Joyce Chai. Prompting large pre-trained vision- language models for compositional concept learning.arXiv preprint arXiv:2211.05077, 2022

work page arXiv 2022

[57] [57]

When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

work page arXiv 2022

[58] Emanuele Bugliarello and Desmond Elliott. The role of syntactic planning in compositional image captioning. arXiv preprint arXiv:2101.11911, 2021

[59] Colin Conwell and Tomer Ullman. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022

[60] Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022

[61] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

[62] Jacob Andreas, Marco Baroni, Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes, Jacob Devlin, Alona Fyshe, Leila Wehbe, et al. Measuring compositionality in representation learning. In International conference on learning representations, volume 375, pages 2227–2237. Association for Computational Linguistics, 2019

[63] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018

[64] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

[65] Lukas Schott, Julius Von Kügelgen, Frederik Träuble, Peter Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, and Wieland Brendel. Visual representation learning does not generalize strongly within the same domain. arXiv preprint arXiv:2107.08221, 2021

[66] Josef Valvoda, Naomi Saphra, Jonathan Rawski, Adina Williams, and Ryan Cotterell. Benchmarking compositionality with formal languages. 2022

[67] Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image generation benchmark with controllable difficulty. Advances in Neural Information Processing Systems, 37:86004–86047, 2024

[68] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11264–11272, 2019

[69] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, 2022

[70] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015

[71] Mansheej Paul, Feng Chen, Brett W Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina Dziugaite. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? arXiv preprint arXiv:2210.03044, 2022

A Proofs for Section 5

[Figure: grid over width W (3, 6, 12, 24, 48, 96, 192) and depth L (1 to 9), titled "(W,L) = C(W,P)/Total [P=3]", with the predicted sweet spot marked ...]