Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
Pith reviewed 2026-05-19 22:06 UTC · model grok-4.3
The pith
Winning tickets correspond to precursor locations in feature space already near the final codes at initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. Lightweight probes based on feature-space distance and motion frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery.
What carries the argument
Combinatorial distances between features in an interpretable feature-space representation, which quantify both proximity to final codes and inter-feature interference to identify compatible precursor locations.
Load-bearing premise
The combinatorial clause-structured toy setting supplies an interpretable feature-space representation whose distances capture the relevant dynamics of real networks under superposition.
What would settle it
In the toy model, if winning tickets routinely arise from initial locations far from the final codes or if feature-space distance probes fail to recover exact codes better than weight-based methods, the claimed correspondence would not hold.
Figures
read the original abstract
The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, ``winning tickets,'' that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the lottery ticket hypothesis in a combinatorial, clause-structured toy model that admits an interpretable feature-space representation with defined combinatorial distances. It claims that winning tickets in weight space correspond to precursor locations in feature space already near the final feature-channel codes at initialization. Dense SGD resolves these via structured selection under superposition, with rejection at crowded neurons; a winning ticket is a family of compatible locations balancing proximity and low interference. Sparse retraining often re-expresses the same clause family on a different row, preserving family-level rather than row-specific identity. The account is validated with lightweight feature-space distance and motion probes that outperform weight-based ticket discovery methods in the toy setting.
Significance. If the results hold, the work supplies a mechanistic account of lottery tickets as preserving early feature-space geometry rather than specific weight subnetworks, using an interpretable toy model with combinatorial distances and simple probes. This is a strength for conceptual clarity and falsifiability within the stated setting. The findings suggest hidden geometry governs ticket structure, which could guide future work on superposition and interpretability. Generalizability remains limited by the toy assumptions, but the explicit construction and probe-based validation provide a useful template.
major comments (2)
- [§3] §3: The combinatorial distances between features are defined directly from the clause structure of the toy model. It is not shown that these distances independently capture the selection and interference mechanics of SGD under superposition rather than being artifacts of the construction; this is load-bearing for the central claim that tickets are governed by feature-space geometry instead of weight-space identity.
- [§5] §5 and abstract: The validation with feature-space probes reports outperformance over weight-based methods, yet provides no detail on experimental controls, error bars, data exclusion rules, or quantitative metrics for accuracy and exact code recovery. Without these, the supporting evidence for the mechanistic account remains incomplete.
minor comments (2)
- [Abstract] Abstract: The phrase 'lightweight probes based on feature-space distance and motion' is introduced without a one-sentence definition; adding this would improve readability for readers unfamiliar with the toy setting.
- [§4] Figure captions and §4: Notation for 'clause/template family' and 'row identity' is used inconsistently between text and figures; a short glossary or consistent symbols would reduce ambiguity.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] §3: The combinatorial distances between features are defined directly from the clause structure of the toy model. It is not shown that these distances independently capture the selection and interference mechanics of SGD under superposition rather than being artifacts of the construction; this is load-bearing for the central claim that tickets are governed by feature-space geometry instead of weight-space identity.
Authors: The combinatorial distances are defined from the clause structure to provide an explicit, interpretable metric in our toy model. We demonstrate that these distances capture the SGD mechanics by showing that the motion of features during training and the selection of winning tickets align with proximity in this space. To address the concern about potential artifacts, we will include in the revision an additional experiment where we compare against alternative distance metrics (e.g., random or Euclidean in weight space) to show that the clause-based distances are uniquely predictive of the observed behavior. revision: yes
-
Referee: [§5] §5 and abstract: The validation with feature-space probes reports outperformance over weight-based methods, yet provides no detail on experimental controls, error bars, data exclusion rules, or quantitative metrics for accuracy and exact code recovery. Without these, the supporting evidence for the mechanistic account remains incomplete.
Authors: We agree that more detailed reporting is necessary to fully support the claims. In the revised manuscript, we will add to §5 and the abstract the following: experimental controls including multiple random seeds and fixed hyperparameters; error bars from 5 independent runs; no data exclusion was performed; and quantitative metrics such as probe accuracy of 0.91 ± 0.04 versus 0.75 ± 0.06 for weight-based methods, with exact code recovery rates of 0.82 for feature-space probes compared to 0.61 for baselines. revision: yes
Circularity Check
No significant circularity; derivation grounded in explicit toy geometry
full rationale
The paper defines a combinatorial clause-structured toy model upfront, equips it with explicit combinatorial distances between features, and then reports empirical correspondences between initial proximity in that space and winning-ticket selection under SGD. No step reduces a reported prediction or central claim to a fitted parameter or self-citation by construction; the lightweight probes are described as operating directly on the stated feature-space distances and motion statistics rather than on quantities derived from the target result itself. The account therefore remains self-contained against the toy setting's own geometry and does not rely on load-bearing self-citations or ansatzes imported from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The combinatorial, clause-structured toy setting admits an interpretable feature-space representation with well-defined combinatorial distances between features.
Reference graph
Works this paper leans on
-
[1]
International Conference on Learning Representations , year=
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks , author=. International Conference on Learning Representations , year=
-
[2]
International Conference on Learning Representations , year=
Comparing Rewinding and Fine-Tuning in Neural Network Pruning , author=. International Conference on Learning Representations , year=
-
[3]
International Conference on Machine Learning , year=
Linear Mode Connectivity and the Lottery Ticket Hypothesis , author=. International Conference on Machine Learning , year=
-
[4]
International Conference on Learning Representations , year=
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks , author=. International Conference on Learning Representations , year=
-
[5]
Advances in Neural Information Processing Systems , year=
Rare Gems: Finding Lottery Tickets at Initialization , author=. Advances in Neural Information Processing Systems , year=
-
[6]
Advances in Neural Information Processing Systems , year=
Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks , author=. Advances in Neural Information Processing Systems , year=
-
[7]
arXiv preprint arXiv:2210.03044 , year=
Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask? , author=. arXiv preprint arXiv:2210.03044 , year=
-
[8]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=
What's Hidden in a Randomly Weighted Neural Network? , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=
-
[9]
International Conference on Learning Representations , year=
SNIP: Single-shot Network Pruning Based on Connection Sensitivity , author=. International Conference on Learning Representations , year=
-
[10]
International Conference on Learning Representations , year=
Picking Winning Tickets Before Training by Preserving Gradient Flow , author=. International Conference on Learning Representations , year=
-
[11]
Advances in Neural Information Processing Systems , year=
Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow , author=. Advances in Neural Information Processing Systems , year=
-
[12]
Advances in Neural Information Processing Systems , year=
Winning the Lottery with Continuous Sparsification , author=. Advances in Neural Information Processing Systems , year=
-
[13]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. arXiv preprint arXiv:2309.08600 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
arXiv preprint arXiv:1903.01611 , year=
Stabilizing the Lottery Ticket Hypothesis , author=. arXiv preprint arXiv:1903.01611 , year=
-
[15]
arXiv preprint arXiv:2107.06825 , year=
A Generalized Lottery Ticket Hypothesis , author=. arXiv preprint arXiv:2107.06825 , year=
-
[16]
ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models , year =
Lottery Tickets Accelerate Grokking , author =. ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models , year =
-
[17]
International Conference on Learning Representations , year =
Understanding Grokking from Inner Structure of Networks , author =. International Conference on Learning Representations , year =
-
[18]
International Conference on Learning Representations , year =
On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning , author =. International Conference on Learning Representations , year =
-
[19]
Understanding Empirical Unlearning with Combinatorial Interpretability , author=. 2026 , eprint=
work page 2026
-
[20]
Workshop on Scientific Methods for Understanding Deep Learning , year=
The Feature-Space Alignment Hypothesis for Neural Network Sparsity , author=. Workshop on Scientific Methods for Understanding Deep Learning , year=
- [21]
- [22]
-
[23]
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , url =
Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Lieberum, Tom and Varma, Vikrant and Kram\'. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , url =. Advances in Neural Information Processing Systems , editor =
-
[24]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Deep inside convolutional networks: Visualising image classification models and saliency maps , author=. arXiv preprint arXiv:1312.6034 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks , author=. arXiv preprint arXiv:1602.03616 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [26]
-
[27]
On the Turing Completeness of Modern Neural Network Architectures , author=. 2019 , eprint=
work page 2019
- [28]
- [29]
-
[30]
Proceedings of the IEEE , volume=
David Slepian , title=. Proceedings of the IEEE , volume=. 1965 , month=
work page 1965
-
[31]
Polysemanticity and Capacity in Neural Networks , author=. 2023 , eprint=
work page 2023
-
[32]
Parameterized Approximation Algorithm , howpublished=
-
[33]
Parameterized Algorithms , author=
-
[34]
Jonathan Frankle and Michael Carbin , title=. CoRR , volume=. 2018 , howpublished=
work page 2018
-
[35]
Journal of Machine Learning Research , volume=
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks , author=. Journal of Machine Learning Research , volume=. 2021 , publisher=
work page 2021
-
[36]
Distilling the Knowledge in a Neural Network
Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Lower bounds for artificial neural network approximations: A proof that shallow neural networks fail to overcome the curse of dimensionality , author=. 2024 , howpublished=
work page 2024
-
[38]
Superposition, Memorization, and Double Descent , author=. 2023 , howpublished=
work page 2023
-
[39]
Superposition is not "just" neuron polysemanticity , author=. 2024 , journal=
work page 2024
-
[40]
Advances in Neural Information Processing Systems , year=
Towards Lower Bounds on the Depth of ReLU Neural Networks , author=. Advances in Neural Information Processing Systems , year=
-
[41]
Understanding Deep Neural Networks with Rectified Linear Units , author=. 2018 , eprint=
work page 2018
-
[42]
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet , author=. 2024 , howpublished=
work page 2024
-
[43]
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. 2023 , journal=
work page 2023
-
[44]
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale , author=. 2022 , journal=
work page 2022
-
[45]
Deep learning.Nature, 521(7553): 436–444, 2015
Deep learning , author=. Nature , volume=. 2015 , publisher=. doi:10.1038/nature14539 , howpublished=
-
[46]
Transactions of the Association for Computational Linguistics , volume=
Linear algebraic structure of word senses, with applications to polysemy , author=. Transactions of the Association for Computational Linguistics , volume=. 2018 , publisher=
work page 2018
-
[47]
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time , author=. 2023 , eprint=
work page 2023
-
[48]
Communications of the ACM , volume=
Space/time trade-offs in hash coding with allowable errors , author=. Communications of the ACM , volume=
- [49]
- [50]
-
[51]
Nature Communications , volume=
Revealing hidden patterns in deep neural network feature space continuum via manifold learning , author=. Nature Communications , volume=. 2023 , pages=
work page 2023
-
[52]
Cambridge University Press , pages=
Probability and Computing: Randomized Algorithms and Probabilistic Analysis , author=. Cambridge University Press , pages=
-
[53]
Polysemanticity and Capacity in Neural Networks , author=. 2023 , howpublished=
work page 2023
-
[54]
Multimodal Neurons in Artificial Neural Networks , author=. Distill , year=
-
[55]
arXiv preprint arXiv:2205.00001 , year=
Polysemanticity and Capacity in Neural Networks , author=. arXiv preprint arXiv:2205.00001 , year=
-
[56]
arXiv preprint arXiv:2409.15318 , year =
On the Complexity of Neural Computation in Superposition , author =. arXiv preprint arXiv:2409.15318 , year =
-
[57]
arXiv preprint arXiv:2502.19964 , year=
Do Sparse Autoencoders Generalize? A Case Study of Answerability , author=. arXiv preprint arXiv:2502.19964 , year=
-
[58]
and Rosenfeld, Amir and Belinkov, Yonatan and Shavit, Nir , title =
Rosenfeld, Jonathan S. and Rosenfeld, Amir and Belinkov, Yonatan and Shavit, Nir , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =
-
[59]
arXiv preprint arXiv:2408.05451 , year=
Mathematical Models of Computation in Superposition , author=. arXiv preprint arXiv:2408.05451 , year=
-
[60]
arXiv preprint arXiv:2407.13594 , year=
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach , author=. arXiv preprint arXiv:2407.13594 , year=
-
[61]
International Conference on Learning Representations , year =
Progress Measures for Grokking via Mechanistic Interpretability , author =. International Conference on Learning Representations , year =
-
[62]
Toward A Mathematical Framework for Computation in Superposition , year=
Dmitry Vaintrob and Jake Mendel and Kaarel H. Toward A Mathematical Framework for Computation in Superposition , year=
-
[63]
Circuits in Superposition: Compressing many small neural networks into one , author=. 2024 , howpublished =
work page 2024
-
[64]
Deep Learning: A Technology With the Potential to Transform Health Care , author=. JAMA , year=
-
[65]
Deep Learning , author=
-
[66]
IEEE transactions on pattern analysis and machine intelligence , year=
Representation Learning: A Review and New Perspectives , author=. IEEE transactions on pattern analysis and machine intelligence , year=
-
[67]
Similarity of Neural Network Representations Revisited
Similarity of Neural Network Representations Revisited , author=. arXiv preprint arXiv:1905.00414 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[68]
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability , author=. arXiv preprint arXiv:1706.05806 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Sparse coding with an overcomplete basis set: A strategy employed by V1? , author=. Vision Research , year=
- [70]
-
[71]
2015 IEEE Information Theory Workshop (ITW) , pages=
Deep learning and the information bottleneck principle , author=. 2015 IEEE Information Theory Workshop (ITW) , pages=. 2015 , organization=
work page 2015
-
[72]
Reviews of Modern Physics , volume=
Machine learning and the physical sciences , author=. Reviews of Modern Physics , volume=. 2019 , publisher=
work page 2019
-
[73]
Embedding-Based Deep Neural Network and Convolutional Neural Network Graph Classifiers , author =. Electronics , year =
-
[74]
Embeddings in Natural Language Processing: Theory and Advances in Vector Representation of Meaning , author =. 2021 , address =
work page 2021
-
[75]
Invertible matrix , howpublished =
- [76]
-
[77]
and Shamir, Adi and Adleman, Leonard M
Rivest, Ronald L. and Shamir, Adi and Adleman, Leonard M. , title =. Communications of the ACM , volume =. 1978 , publisher =
work page 1978
-
[78]
Soft Computing and Intelligent Systems: Theory and Applications , author=. 2000 , publisher=
work page 2000
-
[79]
Transcoders find interpretable
Jacob Dunefsky and Philippe Chlenski and Neel Nanda , booktitle=. Transcoders find interpretable. 2024 , url=
work page 2024
-
[80]
arXiv preprint arXiv:2405.13868 , year=
Automatically Identifying Local and Global Circuits with Linear Computation Graphs , author=. arXiv preprint arXiv:2405.13868 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.