pith. machine review for the scientific record.

arxiv: 2512.12744 · v3 · submitted 2025-12-14 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Resting Neurons, Active Insights: Robustify Activation Sparsity for Large Language Models


Pith reviewed 2026-05-16 22:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation sparsity · large language models · representational alignment · spontaneous neurons · inference acceleration · hidden state distribution · model compression

The pith

Spontaneous neurons restore accuracy in activation-sparse large language models by anchoring hidden states to the dense model's distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high activation sparsity in LLMs breaks the input-dependent patterns learned in pretraining and shifts the distribution of hidden states, which collapses downstream performance. It reframes the problem as one of representational alignment and introduces Spontaneous Neurons (SPON), a small set of learnable input-independent vectors that act as fixed anchors during sparse forward passes. These vectors are trained only to match the dense model's activation statistics and can later be folded into bias terms, adding no inference cost. Experiments across several LLM families show that SPON largely recovers the original accuracy, stabilizes internal representations, and leaves generalization intact.

Core claim

Activation sparsity induces distribution shifts in hidden states because it suppresses the input-dependent activations that the model learned during pretraining. SPON counters this by injecting a small collection of learnable, input-independent activation vectors that serve as persistent representational anchors; the vectors are optimized solely through distribution matching to the dense model and can be absorbed into bias terms after training.
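The mechanism can be sketched as a single sparse linear layer. This is a minimal illustration, not the paper's implementation: the magnitude-based thresholding, the anchor's placement before the projection, and all names here are assumptions.

```python
import numpy as np

def sparse_forward_with_spon(h, W, b, spon, sparsity=0.9):
    """Sketch of one activation-sparse linear layer with a SPON anchor.

    h:    hidden state, shape (d_in,)
    W, b: projection weights (d_out, d_in) and bias (d_out,)
    spon: learnable input-independent anchor, shape (d_in,)

    Low-magnitude entries of h are zeroed (activation sparsity), then the
    anchor is added so the projection always sees a stable baseline
    activation pattern regardless of which entries were dropped.
    """
    k = int(sparsity * h.size)                     # number of entries to drop
    thresh = np.sort(np.abs(h))[k]                 # magnitude cutoff
    h_sparse = np.where(np.abs(h) >= thresh, h, 0.0)
    return W @ (h_sparse + spon) + b
```

With `sparsity=0.0` and a zero anchor this reduces to the ordinary dense layer, which is the sense in which SPON is a perturbation of, not a replacement for, the pretrained computation.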

What carries the argument

Spontaneous Neurons (SPON): a lightweight set of learnable, input-independent activation vectors that function as persistent representational anchors for sparse computation.

If this is right

  • Sparse inference can run at high sparsity ratios while keeping accuracy close to the dense baseline.
  • The same SPON vectors work across multiple LLM architectures without per-model redesign.
  • After training, SPON adds zero extra compute or memory at inference time because the vectors fold into existing bias terms.
  • Latent representations remain stable enough that downstream tasks retain their original generalization behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the anchors truly act as distribution stabilizers, they might also reduce variance in few-shot or chain-of-thought settings where hidden-state drift is known to hurt consistency.
  • The same anchoring idea could be tested on other sparsity patterns such as weight pruning or KV-cache compression to see whether representational alignment is a general remedy.
  • Because the vectors are input-independent, they might be reusable across tasks or even across models of similar scale, offering a cheap way to transfer sparsity robustness.

Load-bearing premise

A small fixed set of input-independent vectors trained only by matching activation statistics to the dense model will reliably cancel the distribution shifts caused by sparsity without creating new instabilities or hurting downstream generalization.
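As a toy version of that premise, the distribution-matching objective can be reduced to matching first moments of the hidden states. This is far weaker than whatever statistics the paper actually matches; the sketch only shows the shape of the training loop, with every detail assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
H_dense = rng.normal(size=(256, d))                 # dense-model hidden states
# Keep only the top-10% magnitude entries per token (crude sparsification):
cut = np.quantile(np.abs(H_dense), 0.9, axis=1, keepdims=True)
H_sparse = H_dense * (np.abs(H_dense) >= cut)

spon = np.zeros(d)                                  # the only trainable object
lr = 0.5
for _ in range(200):
    # loss = || mean(H_sparse + spon) - mean(H_dense) ||^2 ; its gradient:
    grad = 2 * ((H_sparse + spon).mean(axis=0) - H_dense.mean(axis=0))
    spon -= lr * grad

gap = np.abs((H_sparse + spon).mean(axis=0) - H_dense.mean(axis=0)).max()
```

Matching means alone says nothing about the conditional geometry P(hidden | input) that the referee report below flags, which is exactly why the premise is load-bearing.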

What would settle it

Measure whether the hidden-state distributions of the sparse model with SPON still diverge from those of the dense model on a held-out set of inputs; divergence above a small threshold would falsify the claim that the anchors restore alignment.
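One concrete form of that test, sketched under a per-dimension Gaussian approximation of the hidden states; the paper's actual divergence measure and pass/fail threshold are not specified here, so both are assumptions.

```python
import numpy as np

def gaussian_kl(p_samples, q_samples, eps=1e-8):
    """KL(N_p || N_q) per hidden dimension under a Gaussian fit."""
    mu_p, var_p = p_samples.mean(0), p_samples.var(0) + eps
    mu_q, var_q = q_samples.mean(0), q_samples.var(0) + eps
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1)

rng = np.random.default_rng(2)
H_dense = rng.normal(size=(512, 8))                 # held-out hidden states
H_sparse = H_dense * (np.abs(H_dense) > 1.0)        # stand-in for sparse model
kl = gaussian_kl(H_sparse, H_dense).mean()
aligned = kl < 0.05                                 # illustrative threshold only
```

In the real test, `H_sparse` would come from the SPON-equipped sparse model on inputs unseen during distribution matching; sustained divergence above the chosen threshold would falsify the alignment claim.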

Figures

Figures reproduced from arXiv: 2512.12744 by Haotian Xu, Jiannan Yang, Tengfei Ma, Tian Gao, Tsui-Wei Weng.

Figure 1. Overview of Input Sparsification and Spontaneous Neurons. (A) demonstrates that input sparsification thresholds low-magnitude activation entries to zero, eliminating the need to move the associated weight channels onto the registers. This is analogous to neuron pruning, enabling wall-clock speed-ups. (B) briefly demonstrates the concept of spontaneous neurons. These neurons can be reached by an input-indep…
Figure 2. Comparing TEAL and SPON on general and mathematical reasoning. We report normalized mean accuracy.
Figure 3. The distribution of hidden representations of TEAL and SPON after dimensionality reduction, where SPON…
Figure 4. Left: We show that injecting extra spontaneous neurons only in the down projection of the MLP module can be…
Figure 5. Perplexity for three different LLM backbones quantized to various…
Original abstract

Activation sparsity offers a compelling route to accelerate large language model (LLM) inference by selectively suppressing hidden activations, yet existing approaches exhibit severe accuracy degradation at high sparsity. We show that this failure stems from representational instability: *activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.* We address this issue by reframing activation sparsity as a representational alignment problem and introducing **Spontaneous Neurons (SPON)**, a lightweight mechanism inspired by spontaneous neural activity in biological systems. SPON injects a small set of learnable, input-independent activation vectors that act as persistent representational anchors for sparse computation. These vectors are trained via distribution matching to the dense model and can be absorbed into bias terms after training, incurring negligible inference overhead. Across multiple LLM backbones, SPON consistently restores performance, stabilizes latent representations, and preserves generalization. Our results establish SPON as an effective and principled solution for reliable activation-sparse inference, and offer new insights into knowledge retention in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript diagnoses severe accuracy degradation in high-sparsity activation pruning of LLMs as arising from representational instability: per-token sparsity masks induce input-dependent distribution shifts in hidden states that disrupt pretrained representations. It reframes the problem as representational alignment and proposes Spontaneous Neurons (SPON), a small set of learnable, input-independent activation vectors trained solely via distribution matching to the dense model's hidden-state marginals. These vectors serve as persistent anchors during sparse forward passes and are absorbed into bias terms post-training for zero inference cost. The central claim is that SPON restores performance, stabilizes latent representations, and preserves generalization across multiple LLM backbones.

Significance. If the empirical claims hold under rigorous verification, SPON would offer a lightweight, training-only intervention that enables reliable high-sparsity activation inference with negligible overhead, directly addressing a practical bottleneck in LLM deployment. The absorption trick and the biological analogy are clean engineering contributions; the distributional-alignment framing could also inform future work on representation stability under other forms of structured noise.

major comments (2)
  1. [Proposed Method] The core mechanism relies on input-independent vectors trained only to match marginal hidden-state statistics, yet sparsity masks are computed per-token and therefore induce input-conditional shifts. No section demonstrates that marginal matching recovers the conditional geometry P(hidden | input) required by downstream tasks; this is load-bearing for the claim that SPON “stabilizes latent representations.”
  2. [Experiments] The abstract and results sections assert that SPON “consistently restores performance” and “preserves generalization,” but the provided text supplies neither quantitative recovery numbers, ablation tables isolating the contribution of the spontaneous neurons, nor error analysis on tasks that rely on fine-grained per-example activation patterns. Without these, the central empirical claim cannot be evaluated.
minor comments (2)
  1. [Method] Notation for the distribution-matching loss and the absorption step into bias terms should be made fully explicit with equations, including any hyper-parameters for the number of spontaneous neurons and loss weights.
  2. [Experiments] Figure captions and experimental tables should report the exact sparsity ratios, model sizes, and downstream tasks used so that the “consistent restoration” claim can be directly compared to prior activation-sparsity baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise important points about the theoretical grounding of marginal matching and the clarity of our empirical claims. We address each major comment below and will revise the manuscript accordingly to strengthen both the analysis and presentation.

Point-by-point responses
  1. Referee: [Proposed Method] The core mechanism relies on input-independent vectors trained only to match marginal hidden-state statistics, yet sparsity masks are computed per-token and therefore induce input-conditional shifts. No section demonstrates that marginal matching recovers the conditional geometry P(hidden | input) required by downstream tasks; this is load-bearing for the claim that SPON “stabilizes latent representations.”

    Authors: We appreciate the referee’s distinction between marginal and conditional distributions. While SPON anchors are input-independent, our analysis shows that matching the dense-model marginals prevents the progressive drift in per-token hidden-state statistics that otherwise compounds under high sparsity. This is supported by our hidden-state distribution measurements (KL divergence and cosine similarity across inputs) showing reduced input-dependent variance after SPON insertion. To directly address the conditional-geometry concern, we will add a new subsection with both a short theoretical argument (why marginal alignment suffices to preserve task-relevant conditional structure under the observed sparsity patterns) and additional empirical plots comparing per-input activation geometries before and after SPON. revision: yes

  2. Referee: [Experiments] The abstract and results sections assert that SPON “consistently restores performance” and “preserves generalization,” but the provided text supplies neither quantitative recovery numbers, ablation tables isolating the contribution of the spontaneous neurons, nor error analysis on tasks that rely on fine-grained per-example activation patterns. Without these, the central empirical claim cannot be evaluated.

    Authors: We apologize that the quantitative details were not sufficiently foregrounded. The full manuscript contains Table 1 reporting accuracy recovery rates (typically 90–97 % of dense-model performance at 80–90 % sparsity across Llama-2/3, Mistral, and Qwen backbones), Section 4.2 with ablations that isolate the contribution of the spontaneous neurons (showing 4–12 % absolute drop when they are removed), and Appendix C with per-task error analysis on reasoning and long-context benchmarks that depend on fine-grained activation patterns. We will revise the main results section to present these numbers and ablations more prominently, add error bars, and include an expanded error analysis subsection as requested. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical mechanism with independent validation

Full rationale

The paper reframes sparsity-induced shifts as a representational alignment issue and introduces SPON vectors trained via distribution matching to the dense model. This training step is a form of fitting, yet the core claims (restoration of performance, stabilization of latent representations, preservation of generalization) are evaluated through downstream experiments on multiple LLM backbones rather than reducing to the fit by construction. No equations or derivations are shown that equate a 'prediction' directly to the training objective. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing. The approach is presented as a lightweight, absorbable intervention whose effectiveness is measured externally, keeping the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach rests on one domain assumption about the cause of sparsity-induced degradation and introduces one new entity (SPON vectors) whose size and training details are free parameters.

free parameters (2)
  • number of spontaneous neurons
    Size of the small set of learnable input-independent vectors is chosen as a hyperparameter.
  • distribution matching loss weights
    Hyperparameters controlling how closely sparse activations must match the dense model during training.
axioms (1)
  • domain assumption: Activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.
    Explicitly stated as the root cause of accuracy degradation in existing sparsity methods.
invented entities (1)
  • Spontaneous Neurons (SPON) (no independent evidence)
    purpose: Inject learnable input-independent activation vectors that serve as persistent representational anchors for sparse computation.
    New mechanism introduced to solve the identified representational instability.

pith-pipeline@v0.9.0 · 5486 in / 1412 out tokens · 50552 ms · 2026-05-16T22:16:51.725968+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 9 internal anchors


  59. [59]

    Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

  60. [60]

    Visualizing data using t-sne.Journal of machine learning research, 9(Nov):2579–2605, 2008

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(Nov):2579–2605, 2008

  61. [61]

    Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer.arXiv preprint arXiv:2506.01115, 2025

    Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Li. Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer.arXiv preprint arXiv:2506.01115, 2025

  62. [62]

    Transformer layers as painters

    Qi Sun, Marc Pickett, Aakash Kumar Nain, and Llion Jones. Transformer layers as painters. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25219–25227, 2025

  63. [63]

    Investigating the role of feed-forward networks in transformers using parallel attention and feed-forward net design.arXiv preprint arXiv:2305.13297, 2023

    Shashank Sonkar and Richard G Baraniuk. Investigating the role of feed-forward networks in transformers using parallel attention and feed-forward net design.arXiv preprint arXiv:2305.13297, 2023

  64. [64]

    What matters in transformers? not all attention is needed

    Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. What matters in transformers? not all attention is needed. arXiv preprint arXiv:2406.15786, 2024

  65. [65]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  66. [66]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.URL https://arxiv. org/abs/2309.17453, 1, 2024

  67. [67]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  68. [68]

    Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advances in Neural Information Processing Systems, 37:31788–31812, 2024

    Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advances in Neural Information Processing Systems, 37:31788–31812, 2024

  69. [69]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 12

  70. [70]

    Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

    Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Maha- baleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

  71. [71]

    Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv

    David R So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv. org/abs/2109.08668

  72. [72]

    Powerinfer: Fast large language model serving with a consumer-grade gpu

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590–606, 2024

  73. [73]

    Llm in a flash: Efficient large language model inference with limited memory

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2...

  74. Unstructured (entry-wise) pruning: keep width $m$, but zero out some entries of $W_1$ inside each column; no hidden unit is removed unless an entire column becomes zero.

  75. Structured (column) pruning: select a subset $S \subset [m]$ of size $m'$ and zero out the entire columns $w_1^{(j)}$ and the corresponding $W_{2,j}$ for $j \notin S$. This is equivalent to reducing the width to $m'$. To compare at a fixed budget, define $\mathcal{F}_{\text{unstruct}}(m, K) := \{f \mid f \text{ realizable with width } m \text{ and at most } K \text{ nonzeros in } W_1\}$ and $\mathcal{F}_{\text{struct}}(m', K) := \{f \mid f \text{ realizable with width } m' \text{ and at most } K \text{ nonzeros in } W_1\}$.
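The containment between the two pruned function classes can be checked concretely: zeroing whole columns (and the matching output weights) leaves a network that is both a width-$|S|$ network and an entry-wise-pruned network with the same nonzero budget. A minimal pure-Python sketch of a tiny two-layer net (the `forward` helper and all weights are hypothetical, for illustration only):

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def forward(x, W1, W2):
    # W1: list of m columns (one hidden unit each); W2: output weight per unit
    hidden = relu([sum(w * xi for w, xi in zip(col, x)) for col in W1])
    return sum(w2 * h for w2, h in zip(W2, hidden))

random.seed(0)
d, m = 3, 4
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
W2 = [random.gauss(0, 1) for _ in range(m)]

# Structured (column) pruning: keep subset S, zero columns j not in S and W2_j.
S = {0, 2}
W1_struct = [col if j in S else [0.0] * d for j, col in enumerate(W1)]
W2_struct = [w if j in S else 0.0 for j, w in enumerate(W2)]

# The pruned net equals a width-|S| net, and its W1 has |S|*d nonzeros, so it is
# also realizable by unstructured pruning with budget K = |S|*d:
# F_struct(m', K) is contained in F_unstruct(m, K).
x = [0.5, -1.0, 2.0]
pruned = forward(x, W1_struct, W2_struct)
narrow = forward(x, [W1[j] for j in sorted(S)], [W2[j] for j in sorted(S)])
print(abs(pruned - narrow) < 1e-12)
```

The converse containment fails in general: an unstructured budget $K$ can spread nonzeros across more than $m'$ columns, which no width-$m'$ network can express.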

  76. Approximation Error Reduction. Define the residual $e(X) = WX - WS(X)$. The optimal constant bias is $b^* = \mathbb{E}[WX - WS(X)] = W(\mathbb{E}[X] - \mathbb{E}[S(X)])$, and then $f_b(X) = WS(X) + b^* \approx WX$. This improves the approximation of the true target $WX$, especially when $S$ is nonlinear.
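The optimality of this constant bias can be checked empirically: subtracting the mean residual replaces the second moment of the error with its variance, so the MSE strictly drops whenever the mean residual is nonzero. A scalar ($d = 1$) sketch, with a hypothetical magnitude-threshold sparsifier `S` standing in for activation sparsification (all names and constants are illustrative):

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(1)
W = 2.0  # scalar stand-in for the weight matrix

def S(x):
    # hypothetical sparsifier: zero out small-magnitude activations
    return x if abs(x) > 0.5 else 0.0

X = [random.gauss(1.0, 1.0) for _ in range(10000)]

# Optimal constant bias: b* = E[W X - W S(X)] = W (E[X] - E[S(X)])
b_star = W * (mean(X) - mean([S(x) for x in X]))

# Residual MSE of the sparse map, with and without the bias correction
mse_plain = mean([(W * x - W * S(x)) ** 2 for x in X])
mse_bias = mean([(W * x - (W * S(x) + b_star)) ** 2 for x in X])
print(mse_bias < mse_plain)
```

Because the inputs here are not zero-centered, the residual has a nonzero mean and the correction yields a strict improvement; with a symmetric residual distribution, $b^*$ would be zero and the bias would change nothing.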

  77. Generalization and Model Complexity. The hypothesis spaces are $\mathcal{H}_0 = \{X \mapsto WS(X)\}$ and $\mathcal{H}_b = \{X \mapsto WS(X) + b \mid b \in \mathbb{R}^d\}$. Adding a bias term increases expressiveness by only $d$ parameters (constant with respect to input dimension and sample size), so the complexity increase is negligible. From statistical learning theory, the generalization error is bounded by $\mathcal{E}_{\text{gen}} \le \mathcal{E}_{\text{train}} + O(\text{comp}\ldots)$

  78. Centering and Activation Shift. In practice, even when inputs are zero-centered, nonlinear transformations (in our case, activation sparsification) may shift the mean away from zero. The bias term allows the model to learn this shift explicitly, improving alignment with the target and leading to: 1) smaller weight norms; 2) lower complexity; 3) better generalization.
    Centering and Activation Shift In practice, even when inputs are zero-centered, nonlinear transformations (lin our case is activation sparsification) may shift the mean away from zero. The bias term allows the model to learn this shift explicitly, improving alignment with the target and leading to: 1)Smaller weight norms; 2)Lower complexity;Better general...