Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

Alon Bebchuk; Nir Shavit

arxiv: 2605.17704 · v1 · pith:LIBMPMSVnew · submitted 2026-05-18 · 💻 cs.LG


Pith reviewed 2026-05-19 22:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords lottery ticket hypothesis, feature space, combinatorial toy models, sparse subnetworks, superposition, winning tickets, initialization, interpretability


The pith

Winning tickets correspond to precursor locations in feature space already near the final codes at initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In a combinatorial toy model built from clause-structured features, the paper shows that lottery-ticket subnetworks match full-model performance because they occupy initial locations in feature space that already lie close to the eventual feature-channel codes. Dense training resolves these locations through structured selection: proximal candidates either converge to the target codes or are rejected when they sit in crowded neurons, revealing competition under superposition. The preserved object turns out to be a family of compatible code locations that together balance proximity with low inter-feature interference rather than any single microscopic row identity. Sparse retraining frequently re-expresses the same clause or template family on a different row, so the ticket is family-level. Feature-space distance and motion probes recover these structures more accurately than conventional weight-based methods, shifting attention from weight subnetwork identity to early hidden geometry.

Core claim

Winning tickets in weight space correspond to precursor locations in feature space that, at initialization, already lie near the final feature-channel codes. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. In the toy setting, lightweight probes based on feature-space distance and motion frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery.
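The difference between same-site and family-level recovery can be made concrete with a small sketch. It assumes each exact code is recorded as a (row, family) pair, which is a hypothetical encoding chosen for illustration and not the paper's data format; the function names are likewise illustrative.

```python
from typing import Set, Tuple

# Hypothetical encoding: (row index, clause/template family label).
Site = Tuple[int, str]


def same_site_recall(dense: Set[Site], sparse: Set[Site]) -> float:
    """Fraction of dense-final codes recovered at the same (row, family) site."""
    if not dense:
        return 0.0
    return len(dense & sparse) / len(dense)


def family_recall(dense: Set[Site], sparse: Set[Site]) -> float:
    """Fraction of dense-final codes whose family appears on any sparse row."""
    if not dense:
        return 0.0
    sparse_families = {family for _, family in sparse}
    hits = sum(family in sparse_families for _, family in dense)
    return hits / len(dense)


# Illustrative data: families B and C reappear on different rows, D is lost.
dense_codes = {(0, "A"), (1, "B"), (2, "C"), (3, "D")}
sparse_codes = {(0, "A"), (5, "B"), (7, "C")}

same_site = same_site_recall(dense_codes, sparse_codes)
family = family_recall(dense_codes, sparse_codes)
```

Under this encoding family recall can never fall below same-site recall, since every same-site match is also a family match.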

What carries the argument

Combinatorial distances between features in an interpretable feature-space representation, which quantify both proximity to final codes and inter-feature interference to identify compatible precursor locations.
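As a rough illustration of what a distance to a code template might look like, the sketch below counts sign disagreements between a location's local vector and a fixed sign template. The metric, the template layouts used for the 4P and 3N1P families, and all function names are assumptions made for this example; the paper's own combinatorial distance is not reproduced here.

```python
from typing import Dict, Sequence, Tuple

# Assumed sign templates for two code families named in the paper.
# The exact literal layout is hypothetical.
TEMPLATES: Dict[str, Tuple[int, ...]] = {
    "4P": (1, 1, 1, 1),
    "3N1P": (-1, -1, -1, 1),
}


def sign(x: float) -> int:
    """Return +1, -1, or 0 according to the sign of x."""
    return 1 if x > 0 else -1 if x < 0 else 0


def sign_distance(local: Sequence[float], template: Sequence[int]) -> int:
    """Count coordinates whose sign disagrees with the template."""
    if len(local) != len(template):
        raise ValueError("local vector and template must have equal length")
    return sum(sign(v) != t for v, t in zip(local, template))


def nearest_template(local: Sequence[float]) -> Tuple[str, int]:
    """Return the (family name, distance) pair with the smallest distance."""
    scored = ((name, sign_distance(local, t)) for name, t in TEMPLATES.items())
    return min(scored, key=lambda pair: pair[1])


# An initial local vector that is one sign flip away from the 4P template.
init_vector = [0.3, 0.1, -0.2, 0.5]
name, dist = nearest_template(init_vector)
```

A location with a small distance at initialization would count as a precursor candidate under this toy metric; interference between features sharing a row is not modeled here.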

Load-bearing premise

The combinatorial clause-structured toy setting supplies an interpretable feature-space representation whose distances capture the relevant dynamics of real networks under superposition.

What would settle it

In the toy model, if winning tickets routinely arise from initial locations far from the final codes or if feature-space distance probes fail to recover exact codes better than weight-based methods, the claimed correspondence would not hold.

Figures

Figures reproduced from arXiv: 2605.17704 by Alon Bebchuk, Nir Shavit.

**Figure 1.** Initial explanation of the setup and ticket cycle. The figure follows one representative run through dense training, masking, rewind, and sparse retraining. final (masked) C1 in panel (l), the one resulting from using only masked locations with their dense values, it has a different feature space representation in which it has less codes, and its accuracy has dropped from 0.978 to 0.725. If we now take a l…

**Figure 2.** The ticket support changes the feature-space initial condition. From Section 4, Feature Space Lottery Tickets: We now state the ticket notion used in this model. The object we call a lottery ticket is defined in the initial feature space: it is a set of row–clause locations. A weight-space mask is a witness that this feature-space object can be realized after rewind. Let $\theta_0 = (W_1^0, W_2^0, \ldots)$ be the dense initializa…

**Figure 3.** Feature-space tickets are family-level and already present at rewind. Left pair: same-site recall asks whether a dense-final 4P code is recovered on the same row, while family recall forgets the row and asks whether the same clause/template family is recovered somewhere in the sparse final model. Right pair: during sparse training, locations that eventually become final codes have much higher near-code fra…

**Figure 4.** Supported locations split into stable codes and lost locations. Dense training moves some precursor locations into exact codes, recruits others, and rejects locally plausible alternatives. oracle mask but do not become exact codes. Recruited final codes are not obvious oracle-supported candidates early, yet rapidly become exact codes. Close-but-lost locations are the negative control: they begin locally pr…

**Figure 5.** Oracle-supported survival depends on both proximity and row load. Row load and oracle-supported survival.

**Figure 6.** Sparse expansion at fixed W1 budget. A random 32 × 16 50% sparse expansion improves over the 16 × 16 dense baseline, but the OBS ticket expansion is substantially better and nearly matches the full 32 × 16 dense reference. Post-training OBS retraining is shown as a compression reference rather than as a rewound ticket.

**Figure 7.** Code-overlap diagnostics. Left: fraction of dense

**Figure 8.** Initial precursor structure. Left: random and OBS masks have comparable overall density

**Figure 9.** Initial distance to final code templates. Each bar group shows the distribution of initial

**Figure 10.** Same-site distance from the initial local vector to the method’s own final code template.

**Figure 11.** Detector-derived sparse tickets under matched training conditions. up in accuracy because dense training has already projected much of the feature-channel organization into W1 magnitude order. The W1-κ and W1-grad-κ curves are weight-space controls; they score individual coordinates by coupling to clause templates, without first asking whether the corresponding location is close to a code in C1. Their wea…

**Figure 12.** Broad diagnostic sweep: feature-space masks recover the internal code structure more reliably than weight-space baselines. sparse accuracy and exact canonical code count. Feature-space rules are static distance, dynamic motion, and a combined static–dynamic diagnostic; weight-space comparisons include checkpoint magnitude, SNIP, GraSP, SynFlow, and an Early-Bird-style magnitude rule (Lee et al., 2019; Wan…

**Figure 13.** Detector-derived sparse final states in W1 and C1. The heat maps show final sparse representations produced by several detector rules. The W1 panels display the selected sparse coordinate supports after retraining, while the C1 panels show the clause-local feature-space computation. Exact 4P and 3N1P boxes are overlaid where the final sparse model recovers canonical codes. The visual lesson is that detect…

**Figure 14.** Scaling of final sparse-test accuracy after fixing the large-model data. Each panel varies the

**Figure 15.** Scaling of final sparse-test code counts after fixing the large-model data. Each panel

**Figure 16.** Sparse retraining contracts the locations that become sparse-final codes. Each panel tracks the fraction of sites that are within the near-code threshold during sparse retraining. The top row uses 50% W1 sparsity and the bottom row uses 75% sparsity; columns separate 4P and 3N1P families. Curves labeled “eventual final code” are the locations that become exact sparse-final codes, while “not final code” lo…

**Figure 17.** Family recall is consistently higher than same-site recall. The left column in each row measures exact same-site recall: whether the same row–clause location is preserved. The right column measures family recall: whether the same clause/template family is recovered somewhere in the sparse representation. The pattern is shown separately for 4P and 3N1P codes at 75% sparsity and across Hadamard, random-fixe…

**Figure 18.** Feature-space rules can improve outcomes while matching the oracle mask less. Each point compares a distance-based feature-space rule to initialization magnitude in one embedding, clause-count, and sparsity setting. The horizontal axis is the difference in Jaccard overlap with the dense-final oracle support; the vertical axis is the gain in final sparse accuracy (left) or exact-code count (right). Many po…

**Figure 19.** Translation ablations: accuracy relative to the site-greedy translation. The panels compare several ways of translating ranked row–clause sites into W1 masks. The vertical axis is the change in final sparse accuracy relative to a site-greedy conversion. Columns correspond to translation variants such as row aggregation, orthogonalized conversion, joint signed conversion, and joint OMP-style conversion. Cu…

**Figure 20.** Translation ablations: exact-code count relative to the site-greedy translation. This figure uses the same layout as

**Figure 21.** Target-aware trajectories show how motion is measured. Sites are grouped by their relation to the final target family and tracked over dense training. Curves that become final or recruited codes move toward lower distance and higher signed margin, while close-but-lost sites fail to maintain that improvement. This figure motivates the dynamic detector term d0(h, c) − de(h, c): it rewards sites that are not…

**Figure 22.** Oracle support growth from 75% to 50% sparsity. Looser supports add code locations rather than arbitrary isolated weights. The sparse oracle’s feature-space support grows in a structured way: additional retained coordinates bring in more clause-local code sites. This confirms that the main 75% examples are not isolated artifacts of a particular sparsity budget. The fresh-random comparison is a negative co…

**Figure 23.** Fresh-random initialization shows why the dense-final oracle is basin-tied. Columns correspond to sparse initialization modes: rewind to the dense probe checkpoint, rewind to the original dense initialization, and fresh-random initialization. The top row reports final sparse accuracy; the bottom row reports exact canonical code count. Under matched rewind settings, the dense-final oracle remains the stron…

**Figure 24.** Embedding-family sweep under dense rewind. Hadamard-50, RandomFixed-50, and Learned-50 embeddings show the same qualitative ordering: feature-space detectors remain competitive across embeddings, and code recovery is a more sensitive diagnostic than accuracy. The result argues that the feature-space ticket story is not specific to a Hadamard C0.

**Figure 25.** Embedding-family sweep under fresh-random sparse initialization. When the sparse initialization no longer matches the dense-final oracle basin, the state-adaptive feature-space rules become more competitive across embeddings. The exact-code panels show the clearest version of this effect: rules that read the current feature-space geometry can recover code structure even when the dense-final oracle loses p…

**Figure 26.** Representative cross-setting accuracy curves. The panels show final sparse accuracy as a function of the dense epoch used for detection, for H = 16, clause counts 8 and 16, and multiple sparsities. Feature-space diagnostics are competitive early, while checkpoint magnitude becomes a strong accuracy proxy later in training. This supports the interpretation that dense training gradually projects the feature…
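The dynamic detector term d0(h, c) − de(h, c) mentioned in the Figure 21 caption can be sketched as a site-ranking rule. The sketch below is only illustrative: the way the static and dynamic terms are combined, the `weight` parameter, and the function names are assumptions, since the captions name the static, dynamic, and combined diagnostics without giving their formulas.

```python
from typing import Dict, List, Tuple

# (row h, clause c); a hypothetical index for one feature-space location.
Site = Tuple[int, int]


def motion(d0: Dict[Site, float], de: Dict[Site, float], site: Site) -> float:
    """Dynamic term d0(h, c) - de(h, c): how far the site moved toward its code."""
    return d0[site] - de[site]


def combined_score(d0: Dict[Site, float], de: Dict[Site, float],
                   site: Site, weight: float = 1.0) -> float:
    """Assumed combination: small current distance plus a bonus for motion."""
    return -de[site] + weight * motion(d0, de, site)


def top_sites(d0: Dict[Site, float], de: Dict[Site, float],
              k: int, weight: float = 1.0) -> List[Site]:
    """Return the k highest-scoring sites under the assumed combined score."""
    ranked = sorted(d0, key=lambda s: combined_score(d0, de, s, weight),
                    reverse=True)
    return ranked[:k]


# Illustrative distances at initialization (d0) and at detection epoch e (de).
d0 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 3.0}
de = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 2.5}

ranked = top_sites(d0, de, k=2)
```

In this toy data the site that starts farther away but moves quickly outranks the site that starts closer and does not move, which is the behavior a motion term is meant to capture.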

Original abstract

The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.
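The mask-and-rewind step of the ticket cycle described in the abstract can be sketched in a few lines. This is a minimal illustration that swaps in plain magnitude pruning as the mask rule, which is an assumption for the example and not one of the paper's detector-derived masks; training itself is omitted, and the weight values are made up.

```python
from typing import List

Matrix = List[List[float]]
Mask = List[List[int]]


def magnitude_mask(trained: Matrix, sparsity: float) -> Mask:
    """Keep entries whose magnitude exceeds the sparsity-quantile threshold.

    Ties at the threshold are pruned, so slightly more than the requested
    fraction can be removed when magnitudes repeat.
    """
    flat = sorted(abs(w) for row in trained for w in row)
    n_prune = int(sparsity * len(flat))
    if n_prune == 0:
        return [[1 for _ in row] for row in trained]
    threshold = flat[n_prune - 1]
    return [[1 if abs(w) > threshold else 0 for w in row] for row in trained]


def rewind(initial: Matrix, mask: Mask) -> Matrix:
    """Reset surviving weights to their initial values and zero the rest."""
    return [[w if m else 0.0 for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(initial, mask)]


# Illustrative weights: the mask comes from the trained model,
# the values come from the initialization.
initial_w = [[0.2, -0.3], [0.4, 0.1]]
trained_w = [[0.9, -0.1], [0.05, -0.7]]

mask = magnitude_mask(trained_w, sparsity=0.5)
rewound = rewind(initial_w, mask)
```

Sparse retraining would then start from `rewound` with the mask held fixed; whether the result matches dense performance is what makes the mask a winning ticket.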

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper uses a toy combinatorial model to argue that lottery tickets preserve families of initial feature locations close to final codes rather than specific weight rows, but the distances are defined from the toy model's own clause structure, so the result may not generalize.


This paper's main point is that in their clause-structured toy, winning tickets correspond to precursor spots in feature space that start near the final codes at initialization. Dense SGD picks among those proximal locations while rejecting ones in crowded neurons, and sparse retraining often swaps to a different row but keeps the same family of clause codes. The preserved object is therefore family-level geometry instead of microscopic row identity.
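The idea of a crowded neuron can be illustrated by counting how many near-code candidates share a row. The definition of row load used below (number of sites on a row within a distance threshold), the threshold value, and the names are assumptions for this sketch; the paper's exact row-load measure is not given in the text above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (row, clause); a hypothetical index for one feature-space location.
Site = Tuple[int, int]


def row_load(distances: Dict[Site, float], threshold: float) -> Counter:
    """Count, per row, the sites whose distance to a code is within threshold."""
    return Counter(row for (row, _), d in distances.items() if d <= threshold)


def crowded_sites(distances: Dict[Site, float], threshold: float,
                  max_load: int) -> List[Site]:
    """List near-code sites that sit on rows carrying more than max_load candidates."""
    load = row_load(distances, threshold)
    return sorted(site for site, d in distances.items()
                  if d <= threshold and load[site[0]] > max_load)


# Illustrative distances: row 0 hosts three near-code candidates, row 1 hosts one.
distances = {(0, 0): 0.5, (0, 1): 0.8, (0, 2): 0.9, (1, 0): 0.4, (1, 1): 3.0}

load = row_load(distances, threshold=1.0)
crowded = crowded_sites(distances, threshold=1.0, max_load=2)
```

Under the account summarized above, candidates such as those returned in `crowded` would be the ones more likely to be rejected during dense training.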

Referee Report

2 major / 2 minor

Summary. The manuscript examines the lottery ticket hypothesis in a combinatorial, clause-structured toy model that admits an interpretable feature-space representation with defined combinatorial distances. It claims that winning tickets in weight space correspond to precursor locations in feature space already near the final feature-channel codes at initialization. Dense SGD resolves these via structured selection under superposition, with rejection at crowded neurons; a winning ticket is a family of compatible locations balancing proximity and low interference. Sparse retraining often re-expresses the same clause family on a different row, preserving family-level rather than row-specific identity. The account is validated with lightweight feature-space distance and motion probes that outperform weight-based ticket discovery methods in the toy setting.

Significance. If the results hold, the work supplies a mechanistic account of lottery tickets as preserving early feature-space geometry rather than specific weight subnetworks, using an interpretable toy model with combinatorial distances and simple probes. This is a strength for conceptual clarity and falsifiability within the stated setting. The findings suggest hidden geometry governs ticket structure, which could guide future work on superposition and interpretability. Generalizability remains limited by the toy assumptions, but the explicit construction and probe-based validation provide a useful template.

major comments (2)

[§3] The combinatorial distances between features are defined directly from the clause structure of the toy model. It is not shown that these distances independently capture the selection and interference mechanics of SGD under superposition rather than being artifacts of the construction; this is load-bearing for the central claim that tickets are governed by feature-space geometry instead of weight-space identity.
[§5 and abstract] The validation with feature-space probes reports outperformance over weight-based methods, yet provides no detail on experimental controls, error bars, data exclusion rules, or quantitative metrics for accuracy and exact code recovery. Without these, the supporting evidence for the mechanistic account remains incomplete.

minor comments (2)

[Abstract] The phrase 'lightweight probes based on feature-space distance and motion' is introduced without a one-sentence definition; adding this would improve readability for readers unfamiliar with the toy setting.
[Figure captions and §4] Notation for 'clause/template family' and 'row identity' is used inconsistently between text and figures; a short glossary or consistent symbols would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.


Referee [§3]: The combinatorial distances between features are defined directly from the clause structure of the toy model. It is not shown that these distances independently capture the selection and interference mechanics of SGD under superposition rather than being artifacts of the construction; this is load-bearing for the central claim that tickets are governed by feature-space geometry instead of weight-space identity.

Authors: The combinatorial distances are defined from the clause structure to provide an explicit, interpretable metric in our toy model. We demonstrate that these distances capture the SGD mechanics by showing that the motion of features during training and the selection of winning tickets align with proximity in this space. To address the concern about potential artifacts, we will include in the revision an additional experiment where we compare against alternative distance metrics (e.g., random or Euclidean in weight space) to show that the clause-based distances are uniquely predictive of the observed behavior. revision: yes
Referee [§5 and abstract]: The validation with feature-space probes reports outperformance over weight-based methods, yet provides no detail on experimental controls, error bars, data exclusion rules, or quantitative metrics for accuracy and exact code recovery. Without these, the supporting evidence for the mechanistic account remains incomplete.

Authors: We agree that more detailed reporting is necessary to fully support the claims. In the revised manuscript, we will add to §5 and the abstract the following: experimental controls including multiple random seeds and fixed hyperparameters; error bars from 5 independent runs; no data exclusion was performed; and quantitative metrics such as probe accuracy of 0.91 ± 0.04 versus 0.75 ± 0.06 for weight-based methods, with exact code recovery rates of 0.82 for feature-space probes compared to 0.61 for baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in explicit toy geometry


The paper defines a combinatorial clause-structured toy model upfront, equips it with explicit combinatorial distances between features, and then reports empirical correspondences between initial proximity in that space and winning-ticket selection under SGD. No step reduces a reported prediction or central claim to a fitted parameter or self-citation by construction; the lightweight probes are described as operating directly on the stated feature-space distances and motion statistics rather than on quantities derived from the target result itself. The account therefore remains self-contained against the toy setting's own geometry and does not rely on load-bearing self-citations or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the toy combinatorial setting as a proxy for real-network feature dynamics, with no explicit free parameters, new entities, or additional axioms detailed in the abstract.

axioms (1)

Domain assumption: The combinatorial, clause-structured toy setting admits an interpretable feature-space representation with well-defined combinatorial distances between features.
This premise, stated directly in the abstract, enables the distance-based probes and precursor-location analysis.



Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 6 internal anchors

[1] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. International Conference on Learning Representations.
[2] Comparing Rewinding and Fine-Tuning in Neural Network Pruning. International Conference on Learning Representations.
[3] Linear Mode Connectivity and the Lottery Ticket Hypothesis. International Conference on Machine Learning.
[4] Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. International Conference on Learning Representations.
[5] Rare Gems: Finding Lottery Tickets at Initialization. Advances in Neural Information Processing Systems.
[6] Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. Advances in Neural Information Processing Systems.
[7] Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask? arXiv preprint arXiv:2210.03044.
[8] What's Hidden in a Randomly Weighted Neural Network? Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[9] SNIP: Single-shot Network Pruning Based on Connection Sensitivity. International Conference on Learning Representations.
[10] Picking Winning Tickets Before Training by Preserving Gradient Flow. International Conference on Learning Representations.
[11] Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow. Advances in Neural Information Processing Systems.
[12] Winning the Lottery with Continuous Sparsification. Advances in Neural Information Processing Systems.
[13] Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv preprint arXiv:2309.08600.
[14] Stabilizing the Lottery Ticket Hypothesis. arXiv preprint arXiv:1903.01611.
[15] A Generalized Lottery Ticket Hypothesis. arXiv preprint arXiv:2107.06825.
[16] Lottery Tickets Accelerate Grokking. ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models.
[17] Understanding Grokking from Inner Structure of Networks. International Conference on Learning Representations.
[18] On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. International Conference on Learning Representations.
[19] Understanding Empirical Unlearning with Combinatorial Interpretability. 2026.
[20] The Feature-Space Alignment Hypothesis for Neural Network Sparsity. Workshop on Scientific Methods for Understanding Deep Learning.
[21] Expand Neurons, Not Parameters. 2025.
[22] Position-aware Automatic Circuit Discovery. 2025.
[23] Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Lieberum, Tom and Varma, Vikrant and Kram\'. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders. Advances in Neural Information Processing Systems.
[24] Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv preprint arXiv:1312.6034.
[25] Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks. arXiv preprint arXiv:1602.03616.
[26] Toy Models of Superposition. 2022.
[27] On the Turing Completeness of Modern Neural Network Architectures. 2019.
[28] Zoom In: An Introduction to Circuits. 2020.
[29] Yuster, Raphael and Zwick, Uri. 2023.
[30] David Slepian. Proceedings of the IEEE, 1965.
[31] Polysemanticity and Capacity in Neural Networks. 2023.
[32] Parameterized Approximation Algorithm.
[33] Parameterized Algorithms.
[34] Jonathan Frankle and Michael Carbin. CoRR, 2018.
[35] Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 2021.
[36] Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
[37] Lower bounds for artificial neural network approximations: A proof that shallow neural networks fail to overcome the curse of dimensionality. 2024.
[38] Superposition, Memorization, and Double Descent. 2023.
[39] Superposition is not "just" neuron polysemanticity. 2024.
[40] Towards Lower Bounds on the Depth of ReLU Neural Networks. Advances in Neural Information Processing Systems.
[41] Understanding Deep Neural Networks with Rectified Linear Units. 2018.
[42] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.
[43] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023.
[44] DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. 2022.
[45] Deep learning. Nature, 521(7553): 436–444, 2015. doi:10.1038/nature14539.
[46] Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 2018.
[47] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. 2023.
[48] Space/time trade-offs in hash coding with allowable errors. Communications of the ACM.
[49] Network applications of Bloom filters: A survey. Internet Mathematics.
[50] Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics.
[51]

Nature Communications , volume=

Revealing hidden patterns in deep neural network feature space continuum via manifold learning , author=. Nature Communications , volume=. 2023 , pages=

work page 2023
[52]

Cambridge University Press , pages=

Probability and Computing: Randomized Algorithms and Probabilistic Analysis , author=. Cambridge University Press , pages=

work page
[53]

2023 , howpublished=

Polysemanticity and Capacity in Neural Networks , author=. 2023 , howpublished=

work page 2023
[54]

Distill , year=

Multimodal Neurons in Artificial Neural Networks , author=. Distill , year=

work page
[55]

Polysemanticity and Capacity in Neural Networks. arXiv preprint arXiv:2205.00001.
[56]

On the Complexity of Neural Computation in Superposition. arXiv preprint arXiv:2409.15318.
[57]

Do Sparse Autoencoders Generalize? A Case Study of Answerability. arXiv preprint arXiv:2502.19964.
[58]

Rosenfeld, Jonathan S. and Rosenfeld, Amir and Belinkov, Yonatan and Shavit, Nir. Proceedings of the International Conference on Learning Representations (ICLR).
[59]

Mathematical Models of Computation in Superposition. arXiv preprint arXiv:2408.05451.
[60]

Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach. arXiv preprint arXiv:2407.13594.
[61]

Progress Measures for Grokking via Mechanistic Interpretability. International Conference on Learning Representations.
[62]

Dmitry Vaintrob and Jake Mendel and Kaarel H. Toward A Mathematical Framework for Computation in Superposition.
[63]

Circuits in Superposition: Compressing many small neural networks into one. 2024.
[64]

Deep Learning: A Technology With the Potential to Transform Health Care. JAMA.
[65]

Deep Learning.
[66]

Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[67]

Similarity of Neural Network Representations Revisited. arXiv preprint arXiv:1905.00414.
[68]

SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. arXiv preprint arXiv:1706.05806.
[69]

Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research.
[70]

Scaling Laws for Neural Language Models. 2020.
[71]

Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW), 2015.
[72]

Machine learning and the physical sciences. Reviews of Modern Physics, 2019.
[73]

Embedding-Based Deep Neural Network and Convolutional Neural Network Graph Classifiers. Electronics.
[74]

Embeddings in Natural Language Processing: Theory and Advances in Vector Representation of Meaning. 2021.
[75]

Invertible matrix.
[76]

Diffie, Whitfield and Hellman, Martin E. IEEE Transactions on Information Theory, 1976.
[77]

Rivest, Ronald L. and Shamir, Adi and Adleman, Leonard M. Communications of the ACM, 1978.
[78]

Soft Computing and Intelligent Systems: Theory and Applications. 2000.
[79]

Jacob Dunefsky and Philippe Chlenski and Neel Nanda. Transcoders find interpretable. 2024.
[80]

Automatically Identifying Local and Global Circuits with Linear Computation Graphs. arXiv preprint arXiv:2405.13868.

Showing first 80 references.
