Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
Pith reviewed 2026-05-24 06:32 UTC · model grok-4.3
The pith
Small-magnitude weights in pre-trained LLMs encode vital knowledge for difficult downstream tasks, shown by irreversible monotonic performance drops when pruned.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Junk DNA Hypothesis states that the small-magnitude weights of a pre-trained LLM encode knowledge that is vital for difficult downstream tasks. The evidence offered is a performance drop that grows monotonically as more of these weights are pruned by magnitude, and that grows faster on harder tasks. Downstream continual training does not reverse the impairment. Quantization does not produce an equivalent monotonic separation of task difficulty. The claim rests on new quantifiable difficulty metrics, defined both within and across task categories, and is reported to hold over multiple model sizes, datasets, and pruning techniques.
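For readers unfamiliar with the operation at the center of the claim, the following is a minimal sketch of unstructured magnitude pruning on a flat list of weights. The function name `magnitude_prune`, the `sparsity` parameter, and the toy weights are illustrative assumptions; the paper's own code may prune per layer or with a different procedure.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.

    Illustrative sketch only; not taken from the paper's repository.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = round(sparsity * len(weights))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weights (made up for illustration).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
half_pruned = magnitude_prune(weights, 0.5)
```

At a sparsity of 0.5 the three smallest-magnitude entries are set to zero and the three largest are left untouched, which is the sense in which the hypothesis speaks of removing "small" weights.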
What carries the argument
Monotonic performance degradation on a spectrum of downstream task difficulties when small-magnitude pre-trained weights are removed by increasing magnitude-based pruning ratios.
If this is right
- Performance declines more sharply on difficult tasks than on easy ones as the pruning ratio of small-magnitude weights increases.
- The performance loss on difficult tasks persists even when the pruned model is allowed further training on the downstream task.
- Quantization does not produce the same monotonic relationship between pruning level and task difficulty.
- The pattern appears consistently across model sizes, task categories, datasets, and pruning methods.
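The first prediction in the list above can be checked mechanically once scores are available. The sketch below assumes one score per pruning ratio for an easy and a hard task; every number is a made-up placeholder, not a result from the paper, and the helper names are hypothetical.

```python
def performance_drops(dense_score, pruned_scores):
    """Drop relative to the unpruned model at each pruning ratio."""
    return [dense_score - s for s in pruned_scores]


def is_monotonic_non_decreasing(values, tol=0.0):
    """True if each value is at least the previous one (within `tol`)."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))


def mean_slope(ratios, drops):
    """Average drop per unit of pruning ratio between first and last point."""
    return (drops[-1] - drops[0]) / (ratios[-1] - ratios[0])


# Placeholder numbers for illustration only; NOT results from the paper.
ratios = [0.1, 0.3, 0.5]
easy_drops = performance_drops(90.0, [89.5, 88.0, 86.0])
hard_drops = performance_drops(70.0, [68.0, 62.0, 50.0])

both_monotonic = (is_monotonic_non_decreasing(easy_drops)
                  and is_monotonic_non_decreasing(hard_drops))
hard_is_steeper = mean_slope(ratios, hard_drops) > mean_slope(ratios, easy_drops)
```

A real analysis would repeat this over several seeds and use the paper's difficulty metrics to order the tasks; the endpoint slope here is only one simple way to compare steepness.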
Where Pith is reading between the lines
- If the claim holds, pruning methods intended for high-complexity applications would need to protect small-magnitude weights to avoid permanent capability loss.
- The result suggests that apparent parameter redundancy may actually support robustness on harder problems rather than being removable without consequence.
- One could test whether the same monotonic pattern appears when models are trained from random initialization under magnitude constraints.
Load-bearing premise
That the observed monotonic performance drop after magnitude pruning directly shows small weights encode vital knowledge rather than arising from optimization dynamics or task selection choices.
What would settle it
An experiment in which performance on the most difficult tasks returns to the level of the unpruned model after magnitude pruning once sufficient downstream continual training is performed.
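The settling experiment reduces to a simple comparison once the scores exist. The sketch below assumes pairs of (unpruned score, pruned-then-retrained score) for the hardest tasks; the `tolerance` of 0.5 points and both function names are assumptions chosen for illustration, since the source does not say how close counts as "returns to the level".

```python
def is_recovered(dense_score, retrained_pruned_score, tolerance=0.5):
    """True if the retrained pruned model is within `tolerance` points
    of the unpruned model. The tolerance is an assumed threshold."""
    return dense_score - retrained_pruned_score <= tolerance


def irreversibility_contradicted(results, tolerance=0.5):
    """results: list of (dense_score, retrained_pruned_score) pairs for
    the hardest tasks. Full recovery on all of them would contradict
    the irreversibility claim."""
    return all(is_recovered(d, p, tolerance) for d, p in results)
```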
Original abstract
We present Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained model weights encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression, namely quantization, fails to exhibit similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Codes are available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Junk DNA Hypothesis for pre-trained LLMs, claiming that small-magnitude weights encode vital knowledge for difficult downstream tasks. This is manifested empirically as a monotonic performance drop on tasks of increasing difficulty (measured by newly introduced within- and across-category metrics) when pruning by magnitude; the degradation is claimed to be irreversible even after downstream continual training. The effect is contrasted with quantization, which does not exhibit the same monotonic disentanglement of task difficulty. Extensive experiments across model sizes, tasks, datasets, and pruning methods are presented to substantiate the hypothesis.
Significance. If the central empirical claim holds after addressing potential confounds, the work would challenge prevailing views on weight redundancy in LLMs and inform pruning/quantization strategies by highlighting permanent loss on complex tasks. The introduction of task-difficulty metrics and the public code release are positive contributions that aid reproducibility.
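To make the pruning versus quantization contrast in the summary concrete, here is a minimal sketch of symmetric round-to-nearest uniform quantization. The scheme is an assumption for illustration; this page does not say which quantization method the paper evaluates, and `quantize_uniform` is a hypothetical name.

```python
def quantize_uniform(weights, num_bits):
    """Symmetric round-to-nearest quantization onto a uniform grid.

    Assumed scheme for illustration; not necessarily the paper's method.
    """
    levels = 2 ** (num_bits - 1) - 1      # e.g. 7 positive levels for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels              # width of one grid step
    return [round(w / scale) * scale for w in weights]


# Toy weights (made up for illustration).
weights = [0.7, -0.3, 0.12, -0.02]
quantized = quantize_uniform(weights, 4)
step = max(abs(w) for w in weights) / 7
worst_error = max(abs(q - w) for q, w in zip(quantized, weights))
```

Under this scheme every weight moves by at most half a grid step, whereas magnitude pruning leaves large weights exact and sets the smallest ones to zero. Whether that difference explains the paper's observation is not something this sketch can establish.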
major comments (2)
- [Experimental Results] Experimental section: The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.
- [Continual Training subsection] Continual-training experiments: The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.
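The control requested in the first major comment could look like the following sketch: a random mask with the same sparsity as the magnitude mask, so that any difference in downstream performance is attributable to which weights were removed rather than how many. All names, the seed, and the toy weights are illustrative assumptions.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-|w| fraction of weights."""
    n_prune = round(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def random_prune(weights, sparsity, seed=0):
    """Zero out the same number of weights, chosen uniformly at random."""
    n_prune = round(sparsity * len(weights))
    rng = random.Random(seed)             # fixed seed for reproducibility
    drop = set(rng.sample(range(len(weights)), n_prune))
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Toy weights with no exact zeros (made up for illustration).
weights = [0.9, -0.04, 0.5, 0.03, -0.7, 0.02, 0.6, -0.01, 0.8, 0.05]
mag = magnitude_prune(weights, 0.3)
rnd = random_prune(weights, 0.3, seed=0)
```

Both masks remove exactly three of the ten weights; only the selection rule differs. In practice the baseline would be averaged over several seeds.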
minor comments (2)
- [Task Difficulty Metrics] The two task-difficulty metrics are introduced but would benefit from explicit equations or pseudocode in the main text rather than only in the appendix.
- [Figures] Figure captions and axis labels should explicitly state the pruning ratio range and number of runs for error bars to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, providing clarifications on our experimental design while remaining faithful to the manuscript's content.
- Referee: [Experimental Results] The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.
  Authors: The manuscript already contrasts magnitude pruning against quantization (which removes the monotonic disentanglement of task difficulty) and reports results across multiple pruning methods. These controls demonstrate that the observed monotonic impairment is tied to magnitude-based removal of small weights rather than arbitrary sparsity. A random-pruning baseline would provide an additional comparison, but the existing cross-method and quantization contrasts already isolate the effect sufficiently to support the hypothesis that small-magnitude weights carry task-critical information. We therefore maintain the current interpretation while noting the referee's suggestion for future work.
  Revision: no
- Referee: [Continual Training subsection] The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.
  Authors: The experimental protocol section states that identical fine-tuning settings (learning rate, schedule, epochs, optimizer, and batch size) are applied to all models, pruned or unpruned. To address the concern directly, we will add an explicit sentence in the Continual Training subsection confirming that the protocol is held constant, thereby ruling out optimization-dynamics confounds.
  Revision: yes
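One way to make the "protocol held constant" guarantee checkable rather than asserted is to route every run through a single immutable configuration object. The sketch below uses the five settings named in the response as fields; the class name, the run labels, and all values are hypothetical placeholders, not the paper's settings.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings that must be identical across all runs."""
    learning_rate: float
    schedule: str
    epochs: int
    optimizer: str
    batch_size: int


def protocols_match(run_configs):
    """True if every run (pruned or unpruned) uses exactly the same settings."""
    configs = list(run_configs.values())
    return all(asdict(c) == asdict(configs[0]) for c in configs)


# Placeholder values for illustration only.
shared = FineTuneConfig(learning_rate=2e-5, schedule="linear",
                        epochs=3, optimizer="adamw", batch_size=32)
runs = {"dense": shared, "pruned_30": shared, "pruned_50": shared}
drifted = dict(runs, pruned_50=replace(shared, learning_rate=1e-4))
```

Because the dataclass is frozen, a run cannot silently mutate the shared settings; a deviation has to be introduced as a new object, which `protocols_match` then flags.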
Circularity Check
Empirical hypothesis supported by pruning experiments; no derivation reduces to inputs
The paper advances the Junk DNA Hypothesis solely via experimental results: magnitude pruning of small pre-trained weights produces monotonic performance drops that worsen with task difficulty, with irreversibility after continual training. No equations, uniqueness theorems, ansatzes, or first-principles derivations are presented that could reduce to fitted quantities or self-citations by construction. Task-difficulty metrics are introduced as quantifiable definitions and evaluated directly on data; the central claims rest on these observations across models and datasets rather than any self-referential reduction. This is the most common honest finding for an empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Task difficulty can be meaningfully and reproducibly quantified both within the same task category and across different task categories using the metrics introduced in the paper.