Recognition: 2 theorem links
· Lean TheoremResting Neurons, Active Insights: Robustify Activation Sparsity for Large Language Models
Pith reviewed 2026-05-16 22:16 UTC · model grok-4.3
The pith
Spontaneous neurons restore accuracy in activation-sparse large language models by anchoring hidden states to the dense model's distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Activation sparsity induces distribution shifts in hidden states because it suppresses the input-dependent activations that the model learned during pretraining. SPON counters this by injecting a small collection of learnable, input-independent activation vectors that serve as persistent representational anchors; the vectors are optimized solely through distribution matching to the dense model and can be absorbed into bias terms after training.
What carries the argument
Spontaneous Neurons (SPON): a lightweight set of learnable, input-independent activation vectors that function as persistent representational anchors for sparse computation.
If this is right
- Sparse inference can run at high sparsity ratios while keeping accuracy close to the dense baseline.
- The same SPON vectors work across multiple LLM architectures without per-model redesign.
- After training, SPON adds zero extra compute or memory at inference time because the vectors fold into existing bias terms.
- Latent representations remain stable enough that downstream tasks retain their original generalization behavior.
Where Pith is reading between the lines
- If the anchors truly act as distribution stabilizers, they might also reduce variance in few-shot or chain-of-thought settings where hidden-state drift is known to hurt consistency.
- The same anchoring idea could be tested on other sparsity patterns such as weight pruning or KV-cache compression to see whether representational alignment is a general remedy.
- Because the vectors are input-independent, they might be reusable across tasks or even across models of similar scale, offering a cheap way to transfer sparsity robustness.
Load-bearing premise
A small fixed set of input-independent vectors trained only by matching activation statistics to the dense model will reliably cancel the distribution shifts caused by sparsity without creating new instabilities or hurting downstream generalization.
What would settle it
Measure whether the hidden-state distributions of the sparse model with SPON still diverge from those of the dense model on a held-out set of inputs; divergence above a small threshold would falsify the claim that the anchors restore alignment.
Figures
read the original abstract
Activation sparsity offers a compelling route to accelerate large language model (LLM) inference by selectively suppressing hidden activations, yet existing approaches exhibit severe accuracy degradation at high sparsity. We show that this failure stems from representational instability: *activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.* We address this issue by reframing activation sparsity as a representational alignment problem and introducing **Spontaneous Neurons (SPON)**, a lightweight mechanism inspired by spontaneous neural activity in biological systems. SPON injects a small set of learnable, input-independent activation vectors that act as persistent representational anchors for sparse computation. These vectors are trained via distribution matching to the dense model and can be absorbed into bias terms after training, incurring negligible inference overhead. Across multiple LLM backbones, SPON consistently restores performance, stabilizes latent representations, and preserves generalization. Our results establish SPON as an effective and principled solution for reliable activation-sparse inference, and offer new insights into knowledge retention in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript diagnoses severe accuracy degradation in high-sparsity activation pruning of LLMs as arising from representational instability: per-token sparsity masks induce input-dependent distribution shifts in hidden states that disrupt pretrained representations. It reframes the problem as representational alignment and proposes Spontaneous Neurons (SPON), a small set of learnable, input-independent activation vectors trained solely via distribution matching to the dense model's hidden-state marginals. These vectors serve as persistent anchors during sparse forward passes and are absorbed into bias terms post-training for zero inference cost. The central claim is that SPON restores performance, stabilizes latent representations, and preserves generalization across multiple LLM backbones.
Significance. If the empirical claims hold under rigorous verification, SPON would offer a lightweight, training-only intervention that enables reliable high-sparsity activation inference with negligible overhead, directly addressing a practical bottleneck in LLM deployment. The absorption trick and the biological analogy are clean engineering contributions; the distributional-alignment framing could also inform future work on representation stability under other forms of structured noise.
major comments (2)
- [Proposed Method] The core mechanism relies on input-independent vectors trained only to match marginal hidden-state statistics, yet sparsity masks are computed per-token and therefore induce input-conditional shifts. No section demonstrates that marginal matching recovers the conditional geometry P(hidden | input) required by downstream tasks; this is load-bearing for the claim that SPON “stabilizes latent representations.”
- [Experiments] The abstract and results sections assert that SPON “consistently restores performance” and “preserves generalization,” but the provided text supplies neither quantitative recovery numbers, ablation tables isolating the contribution of the spontaneous neurons, nor error analysis on tasks that rely on fine-grained per-example activation patterns. Without these, the central empirical claim cannot be evaluated.
minor comments (2)
- [Method] Notation for the distribution-matching loss and the absorption step into bias terms should be made fully explicit with equations, including any hyper-parameters for the number of spontaneous neurons and loss weights.
- [Experiments] Figure captions and experimental tables should report the exact sparsity ratios, model sizes, and downstream tasks used so that the “consistent restoration” claim can be directly compared to prior activation-sparsity baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments raise important points about the theoretical grounding of marginal matching and the clarity of our empirical claims. We address each major comment below and will revise the manuscript accordingly to strengthen both the analysis and presentation.
read point-by-point responses
-
Referee: [Proposed Method] The core mechanism relies on input-independent vectors trained only to match marginal hidden-state statistics, yet sparsity masks are computed per-token and therefore induce input-conditional shifts. No section demonstrates that marginal matching recovers the conditional geometry P(hidden | input) required by downstream tasks; this is load-bearing for the claim that SPON “stabilizes latent representations.”
Authors: We appreciate the referee’s distinction between marginal and conditional distributions. While SPON anchors are input-independent, our analysis shows that matching the dense-model marginals prevents the progressive drift in per-token hidden-state statistics that otherwise compounds under high sparsity. This is supported by our hidden-state distribution measurements (KL divergence and cosine similarity across inputs) showing reduced input-dependent variance after SPON insertion. To directly address the conditional-geometry concern, we will add a new subsection with both a short theoretical argument (why marginal alignment suffices to preserve task-relevant conditional structure under the observed sparsity patterns) and additional empirical plots comparing per-input activation geometries before and after SPON. revision: yes
-
Referee: [Experiments] The abstract and results sections assert that SPON “consistently restores performance” and “preserves generalization,” but the provided text supplies neither quantitative recovery numbers, ablation tables isolating the contribution of the spontaneous neurons, nor error analysis on tasks that rely on fine-grained per-example activation patterns. Without these, the central empirical claim cannot be evaluated.
Authors: We apologize that the quantitative details were not sufficiently foregrounded. The full manuscript contains Table 1 reporting accuracy recovery rates (typically 90–97 % of dense-model performance at 80–90 % sparsity across Llama-2/3, Mistral, and Qwen backbones), Section 4.2 with ablations that isolate the contribution of the spontaneous neurons (showing 4–12 % absolute drop when they are removed), and Appendix C with per-task error analysis on reasoning and long-context benchmarks that depend on fine-grained activation patterns. We will revise the main results section to present these numbers and ablations more prominently, add error bars, and include an expanded error analysis subsection as requested. revision: yes
Circularity Check
No significant circularity; empirical mechanism with independent validation
full rationale
The paper reframes sparsity-induced shifts as a representational alignment issue and introduces SPON vectors trained via distribution matching to the dense model. This training step is a form of fitting, yet the core claims (restoration of performance, stabilization of latent representations, preservation of generalization) are evaluated through downstream experiments on multiple LLM backbones rather than reducing to the fit by construction. No equations or derivations are shown that equate a 'prediction' directly to the training objective. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing. The approach is presented as a lightweight, absorbable intervention whose effectiveness is measured externally, keeping the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of spontaneous neurons
- distribution matching loss weights
axioms (1)
- domain assumption Activation sparsity disrupts input-dependent activation learned during pretraining, inducing distribution shifts in hidden states.
invented entities (1)
-
Spontaneous Neurons (SPON)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Y = W·S(X) + W·α⃗ ; L = KL(f(X), f(S(X); α⃗ )); b⋆ = E[e(X)] minimizes expected approximation error
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat identity element unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
spontaneous neurons act as persistent representational anchors... input-independent activation vectors
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
The claude 3 model family: Opus, sonnet, haiku
Anthropic. The claude 3 model family: Opus, sonnet, haiku
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024
work page 2024
-
[6]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 9
work page 1901
-
[7]
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Sparsegpt: Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023
work page 2023
-
[9]
Wasserstein distances, neuronal entanglement, and sparsity.arXiv preprint arXiv:2405.15756, 2024
Shashata Sawmya, Linghao Kong, Ilia Markov, Dan Alistarh, and Nir Shavit. Wasserstein distances, neuronal entanglement, and sparsity.arXiv preprint arXiv:2405.15756, 2024
-
[10]
Training-free activation sparsity in large language models.arXiv preprint arXiv:2408.14690, 2024
James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. Training-free activation sparsity in large language models.arXiv preprint arXiv:2408.14690, 2024
-
[11]
Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, and Yiran Chen. Coreinfer: Accelerating large language model inference with semantics-inspired adaptive sparse activation, 2024
work page 2024
-
[12]
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity.arXiv preprint arXiv:2310.05175, 2023
-
[13]
Llm-pruner: On the structural pruning of large language models
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023
work page 2023
-
[14]
Deja vu: Contextual sparsity for efficient llms at inference time
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023
work page 2023
-
[15]
Relu strikes back: Exploiting activation sparsity in large language models
Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023
-
[16]
Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. ReLU 2 wins: Discovering efficient activation functions for sparse llms.arXiv preprint arXiv:2402.03804, 2024
-
[17]
Donghyun Lee, Je-Yong Lee, Genghan Zhang, Mo Tiwari, and Azalia Mirhoseini. Cats: Contextually-aware thresholding for sparsity in large language models.arXiv preprint arXiv:2404.08763, 2024
-
[18]
Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, and Lulu Hu. La rosa: Enhancing llm efficiency via layerwise rotated sparse activation.arXiv preprint arXiv:2507.01299, 2025
-
[19]
R-sparse: Rank-aware activation sparsity for efficient llm inference
Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, and Steven Li. R-sparse: Rank-aware activation sparsity for efficient llm inference. InThe Thirteenth International Conference on Learning Representations
-
[20]
David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex.The Journal of physiology, 160(1):106, 1962
work page 1962
-
[21]
Amos Arieli, Alexander Sterkin, Amiram Grinvald, and AD Aertsen. Dynamics of ongoing activity: explanation of the large variability in evoked cortical responses.Science, 273(5283):1868–1871, 1996
work page 1996
-
[22]
Spontaneously emerging cortical representations of visual attributes.Nature, 425(6961):954–956, 2003
Tal Kenet, Dmitri Bibitchkov, Misha Tsodyks, Amiram Grinvald, and Amos Arieli. Spontaneously emerging cortical representations of visual attributes.Nature, 425(6961):954–956, 2003
work page 2003
-
[23]
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, and Yiran Chen. Corematching: A co-adaptive sparse inference framework with token and neuron pruning for comprehensive acceleration of vision-language models, 2025
work page 2025
-
[24]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
work page 2017
-
[25]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers.arXiv preprint arXiv:2210.06313, 2022
-
[27]
Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, and Steven Li. R-sparse: Rank-aware activation sparsity for efficient llm inference.arXiv preprint arXiv:2504.19449, 2025
-
[28]
Sparsing law: Towards large language models with greater activation sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, and Maosong Sun. Sparsing law: Towards large language models with greater activation sparsity. arXiv preprint arXiv:2411.02335, 2024. 10
-
[29]
Smarter, not harder: Training-free adaptive computation for transformers
Romain Stora¨ı, Jaeseong Lee, and Seung-won Hwang. Smarter, not harder: Training-free adaptive computation for transformers. InFindings of the Association for Computational Linguistics: ACL 2025, pages 8147–8155, 2025
work page 2025
-
[30]
Brain-like language processing via a shallow untrained multihead attention network
Badr AlKhamissi, Greta Tuckute, Antoine Bosselut, and Martin Schrimpf. Brain-like language processing via a shallow untrained multihead attention network. 2024
work page 2024
-
[31]
Gavin Mischler, Yinghao Aaron Li, Stephan Bickel, Ashesh D Mehta, and Nima Mesgarani. Contextual feature extraction hierarchies converge in large language models and the brain.Nature Machine Intelligence, 6(12):1467– 1477, 2024
work page 2024
-
[32]
Instruction-tuning aligns LLMs to the human brain
Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, and Antoine Bosselut. Instruction-tuning aligns LLMs to the human brain. InFirst Conference on Language Modeling, 2024
work page 2024
-
[33]
Zhaofeng Wu, Xinyan Velocity Yu, Dani Yogatama, Jiasen Lu, and Yoon Kim. The semantic hub hypothesis: Lan- guage models share semantic representations across languages and modalities.arXiv preprint arXiv:2411.04986, 2024
-
[34]
Doai Ngo, Mingxuan Sun, Zhengji Zhang, Ashwin G Ramayya, Mark Schnitzer, and Zhe Zhao. Path to intelligence: Measuring similarity between human brain and large language model beyond language task.arXiv preprint arXiv:2509.08831, 2025
-
[35]
R Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual representation by single neurons in the human brain.Nature, 435(7045):1102–1107, 2005
work page 2005
-
[36]
R Quian Quiroga, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Sparse but not ‘grandmother-cell’coding in the medial temporal lobe.Trends in cognitive sciences, 12(3):87–91, 2008
work page 2008
-
[37]
On-line, voluntary control of human temporal lobe neurons.Nature, 467(7319):1104–1108, 2010
Moran Cerf, Nikhil Thiruvengadam, Florian Mormann, Alexander Kraskov, Rodrigo Quian Quiroga, Christof Koch, and Itzhak Fried. On-line, voluntary control of human temporal lobe neurons.Nature, 467(7319):1104–1108, 2010
work page 2010
-
[38]
Rapid encoding of new memories by individual neurons in the human brain.Neuron, 87(1):220–230, 2015
Matias J Ison, Rodrigo Quian Quiroga, and Itzhak Fried. Rapid encoding of new memories by individual neurons in the human brain.Neuron, 87(1):220–230, 2015
work page 2015
-
[39]
Neural syntax: cell assemblies, synapsembles, and readers.Neuron, 68(3):362–385, 2010
Gy ¨orgy Buzs´aki. Neural syntax: cell assemblies, synapsembles, and readers.Neuron, 68(3):362–385, 2010
work page 2010
-
[40]
The neuronal encoding of information in the brain.Progress in neurobiology, 95(3):448–490, 2011
Edmund T Rolls and Alessandro Treves. The neuronal encoding of information in the brain.Progress in neurobiology, 95(3):448–490, 2011
work page 2011
-
[41]
Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth.Advances in Neural Information Processing Systems, 33:9782–9793, 2020
work page 2020
-
[42]
Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning.Advances in neural information processing systems, 33:20378–20389, 2020
work page 2020
-
[43]
Prune once for all: Sparse pre-trained language models.arXiv preprint arXiv:2111.05754, 2021
Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. Prune once for all: Sparse pre-trained language models.arXiv preprint arXiv:2111.05754, 2021
-
[44]
Michael D Fox and Marcus E Raichle. Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging.Nature reviews neuroscience, 8(9):700–711, 2007
work page 2007
-
[45]
The dynamical balance of the brain at rest.The Neuroscientist, 17(1):107– 123, 2011
Gustavo Deco and Maurizio Corbetta. The dynamical balance of the brain at rest.The Neuroscientist, 17(1):107– 123, 2011
work page 2011
-
[46]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7b, 2023
work page 2023
-
[48]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[49]
The language model evaluation harness, 07 2024
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
work page 2024
-
[50]
CommonsenseQA: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, ...
work page 2019
-
[51]
TruthfulQA: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics
work page 2022
-
[52]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InEMNLP, 2018
work page 2018
-
[53]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Gerardo Flores, George H Chen, Tom Pollard, Joyce C Ho, and Tristan Naumann, editors,Proceedings of the Conference on Health, Inference, and Learning, volume 174 ofProceedings of Machine Learni...
work page 2022
-
[54]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[55]
Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[56]
Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019
Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019
work page 2019
-
[57]
An empirical study of llama3 quantization: From llms to mllms.Visual Intelligence, 2(1):36, 2024
Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. An empirical study of llama3 quantization: From llms to mllms.Visual Intelligence, 2(1):36, 2024
work page 2024
-
[58]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020
work page 2020
-
[59]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023
work page 2023
-
[60]
Visualizing data using t-sne.Journal of machine learning research, 9(Nov):2579–2605, 2008
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(Nov):2579–2605, 2008
work page 2008
-
[61]
Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Li. Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer.arXiv preprint arXiv:2506.01115, 2025
-
[62]
Transformer layers as painters
Qi Sun, Marc Pickett, Aakash Kumar Nain, and Llion Jones. Transformer layers as painters. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25219–25227, 2025
work page 2025
-
[63]
Shashank Sonkar and Richard G Baraniuk. Investigating the role of feed-forward networks in transformers using parallel attention and feed-forward net design.arXiv preprint arXiv:2305.13297, 2023
-
[64]
What matters in transformers? not all attention is needed
Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. What matters in transformers? not all attention is needed. arXiv preprint arXiv:2406.15786, 2024
-
[65]
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023
work page 2023
-
[66]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.URL https://arxiv. org/abs/2309.17453, 1, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024
work page 2024
-
[68]
Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advances in Neural Information Processing Systems, 37:31788–31812, 2024
work page 2024
-
[69]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 12
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[70]
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Maha- baleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024
-
[71]
Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv
David R So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv. org/abs/2109.08668
-
[72]
Powerinfer: Fast large language model serving with a consumer-grade gpu
Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590–606, 2024
work page 2024
-
[73]
Llm in a flash: Efficient large language model inference with limited memory
Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2...
work page 2024
-
[74]
Unstructured (entry-wise) pruning:Keep width m, but zero out some entries of W1 inside each column; no hidden unit is forced to be removed unless an entire column becomes zero
-
[75]
This is equivalent to reducing the width tom ′
Structured (column) pruning:Select a subset S⊂[m] of size m′ and zero entire columns w(j) 1 and the correspondingW 2,j forj /∈S. This is equivalent to reducing the width tom ′. To compare at a fixed ’budget’, define Funstruct(m, K) :={f|realizable with widthmand at mostKnonzeros inW 1}, Fstruct(m′, K) :={f|realizable with widthm ′ and at mostKnonzeros inW...
-
[76]
Approximation Error Reduction Define the residual: e(X) =W X−W S(X) The optimal constant bias is: b∗ =E[W X−W S(X)] =W(E[X]−E[S(X)]) Then: fb(X) =W S(X) +b ∗ ≈W X This improves the approximation of the true targetW X, especially whenSis nonlinear
-
[77]
input dimension or sample size), so the complexity increase is negligible
Generalization and Model Complexity The hypothesis spaces: H0 ={X7→W S(X)} H b ={X7→W S(X) +b|b∈R d} Adding a bias term increases the expressiveness by only d parameters (constant w.r.t. input dimension or sample size), so the complexity increase is negligible. From statistical learning theory, the generalization error is bounded by: Egen ≤ Etrain +O comp...
-
[78]
Centering and Activation Shift In practice, even when inputs are zero-centered, nonlinear transformations (lin our case is activation sparsification) may shift the mean away from zero. The bias term allows the model to learn this shift explicitly, improving alignment with the target and leading to: 1)Smaller weight norms; 2)Lower complexity;Better general...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.