arxiv: 2310.02277 · v4 · submitted 2023-09-29 · 💻 cs.LG · cs.AI

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Pith reviewed 2026-05-24 06:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: junk DNA hypothesis · magnitude pruning · LLM compression · downstream task difficulty · small-magnitude weights · irreversible performance loss · model pruning · continual training

The pith

Small-magnitude weights in pre-trained LLMs encode vital knowledge for difficult downstream tasks, shown by irreversible monotonic performance drops when pruned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the view that LLMs hold large amounts of redundant parameters that pruning can remove at little cost. Instead it proposes that small-magnitude weights carry information required specifically for harder downstream tasks. Experiments show that removing progressively more of these weights by magnitude produces steadily larger performance losses, and that the losses grow as task difficulty increases. The degradation on difficult tasks persists even after the model receives further training on the target task. The same pattern does not appear under quantization, and the authors supply new metrics that rank task difficulty both within one category and across categories.

Core claim

The Junk DNA Hypothesis states that small-magnitude weights of a pre-trained LLM encode vital knowledge needed for difficult downstream tasks. This knowledge is revealed by a monotonic rise in performance drop as more of these weights are pruned by magnitude, with steeper drops on harder tasks. The impairment cannot be reversed by downstream continual training. Quantization does not produce an equivalent monotonic separation of task difficulty. The claim is supported by new quantifiable difficulty metrics defined within and across task categories and holds over multiple model sizes, datasets, and pruning techniques.

What carries the argument

Monotonic performance degradation on a spectrum of downstream task difficulties when small-magnitude pre-trained weights are removed by increasing magnitude-based pruning ratios.
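To make the mechanism concrete, here is a minimal sketch of one-shot magnitude pruning at increasing ratios. It assumes a single weight matrix and a per-tensor threshold; the helper name `magnitude_prune` and the random matrix are illustrative, and the paper's own pruning code (layerwise vs. global thresholds, other pruning methods) may differ.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-|w| fraction set to zero.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Indices of the k smallest magnitudes; argpartition avoids a full sort.
    drop = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[drop] = False
    return weights * mask.reshape(weights.shape)

# Stand-in for a pre-trained weight matrix (random values, an assumption).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for ratio in (0.1, 0.3, 0.5, 0.7):
    pruned = magnitude_prune(w, ratio)
    kept = np.count_nonzero(pruned) / w.size
    print(f"pruning ratio={ratio:.1f} fraction kept={kept:.2f}")
```

In an actual experiment, each pruned model would then be evaluated (or fine-tuned and evaluated) on tasks ordered by difficulty.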

If this is right

  • Performance declines more sharply on difficult tasks than on easy ones as the pruning ratio of small-magnitude weights increases.
  • The performance loss on difficult tasks persists even when the pruned model is allowed further training on the downstream task.
  • Quantization does not produce the same monotonic relationship between compression level and task difficulty.
  • The pattern appears consistently across model sizes, task categories, datasets, and pruning methods.
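The first bullet can be phrased as a check on a table of performance drops. The sketch below assumes a matrix `drops[i, j]` holding the drop for difficulty level `i` (easy to hard) at pruning ratio `j` (low to high); the function names and the tolerance parameter are illustrative, not part of the paper.

```python
import numpy as np

def is_nondecreasing(values, tol=0.0):
    """True if each value is at least the previous one, up to `tol`."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(np.diff(values) >= -tol))

def shows_monotonic_pattern(drops, tol=0.0):
    """Check that drops grow with pruning ratio (along rows) and with
    task difficulty (along columns). Illustrative helper only."""
    drops = np.asarray(drops, dtype=float)
    across_ratio = all(is_nondecreasing(row, tol) for row in drops)
    across_difficulty = all(is_nondecreasing(col, tol) for col in drops.T)
    return across_ratio and across_difficulty

# Synthetic placeholder values, NOT results from the paper.
example = [
    [0.0, 1.0, 2.0],  # easiest task
    [1.0, 2.0, 4.0],
    [2.0, 5.0, 9.0],  # hardest task
]
print(shows_monotonic_pattern(example))
```

A small positive `tol` would allow for run-to-run noise when the drops come from a finite number of seeds.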

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the claim holds, pruning methods intended for high-complexity applications would need to protect small-magnitude weights to avoid permanent capability loss.
  • The result suggests that apparent parameter redundancy may actually support robustness on harder problems rather than being removable without consequence.
  • One could test whether the same monotonic pattern appears when models are trained from random initialization under magnitude constraints.

Load-bearing premise

That the observed monotonic performance drop after magnitude pruning directly shows small weights encode vital knowledge rather than arising from optimization dynamics or task selection choices.

What would settle it

An experiment in which performance on the most difficult tasks returns to the level of the unpruned model after magnitude pruning once sufficient downstream continual training is performed.

Figures

Figures reproduced from arXiv: 2310.02277 by Ajay Jaiswal, Lu Yin, Shiwei Liu, Souvik Kundu, Zhangyang Wang.

Figure 1. Task Difficulty Setting 1: Varying target domain data adequacy. Dense Transfer vs. Sparse Transfer using RoBERTa-Base on various downstream tasks. Task difficulty is measured by the training data volume.
Figure 3. Task Difficulty Setting 3: Varying context length in Retrieval-Augmented QA. Dense vs. Sparse subnetwork performance of Vicuna-7B pruned using the TriviaQA benchmark. Task difficulty is measured by the number of tokens provided in context.
Figure 4. Task Difficulty Setting 4: Few-shot In-context Learning. Dense vs. Sparse subnetwork performance of Vicuna-7B pruned using the MMLU benchmark. Task difficulty is measured by the number K of in-context demonstration examples provided to assist multiple-choice QA.
Figure 6. Across-Task Difficulty for Factoid-based QA and Multiple-Choice QA: Dense vs. Sparse subnetwork performance of Vicuna-7B. Task difficulty is measured by the Human-LLM performance gap normalized by the dense performance.
Figure 5. Across-Task Difficulty via Normalized Human-LLM Performance Gap: Dense vs. Sparse subnetwork performance of Vicuna-13B. Task difficulty is measured by the Human-LLM performance gap normalized by the dense performance.
Figure 7. How is pruning special? Performance comparison of pruning and quantization with varying compression ratios on our task difficulty spectrum. We can observe the monotonic impairment of pruning across task difficulty and pruning ratio. In contrast, quantization fails to capture this monotonic behavior across task difficulty and compression ratio.
Figure 8. Varying target domain data adequacy: Four different fine-tuning settings with RoBERTa-Base on various downstream tasks. All performance is normalized by that of Dense Transfer. Panels: SST-2, MNLI, QNLI, COLA, CSQA, …
Figure 9. Across-Task Difficulty via Normalized Human-LLM Performance Gap: Four different fine-tuning settings with RoBERTa-Large on various downstream tasks. All performance is normalized by that of Dense Transfer.
Figure 10. Linear interpolation from the Dense Transfer (left) model to its corresponding Sparse Transfer models (right) on easy and harder tasks (in terms of across-task difficulty).
Figure 11. Low-Rank Compression using SVD.
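
Figure 10 refers to linear interpolation between a Dense Transfer model and its Sparse Transfer counterpart. The sketch below shows the general technique on flattened parameter vectors. The barrier summary (largest excess of the path loss over the straight line joining the endpoint losses) is one common convention and an assumption here, as are the toy quadratic loss and all names; the paper's exact metric and evaluation setup may differ.

```python
import numpy as np

def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two flattened parameter vectors."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

def losses_along_path(loss_fn, theta_a, theta_b, num_points=11):
    """Evaluate `loss_fn` at evenly spaced points between the two models."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    )
    return alphas, losses

def barrier_height(losses):
    """Largest excess of the path loss over the line joining the endpoints."""
    baseline = np.linspace(losses[0], losses[-1], len(losses))
    return float(np.max(losses - baseline))

# Toy stand-ins (assumptions): a random "dense" vector, a magnitude-thresholded
# "sparse" copy, and a convex quadratic loss centered on the dense solution.
rng = np.random.default_rng(0)
theta_dense = rng.normal(size=100)
theta_sparse = np.where(np.abs(theta_dense) < 0.5, 0.0, theta_dense)

def toy_loss(theta):
    return float(np.sum((theta - theta_dense) ** 2))

alphas, losses = losses_along_path(toy_loss, theta_dense, theta_sparse)
print(barrier_height(losses))
```

With a convex toy loss the path never rises above the line between the endpoints, so no barrier appears; a real fine-tuning loss evaluated on harder tasks is where a bump along the path would be informative.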

Original abstract

We present Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained model weights encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression, namely quantization, fails to exhibit similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Codes are available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Junk DNA Hypothesis for pre-trained LLMs, claiming that small-magnitude weights encode vital knowledge for difficult downstream tasks. This is manifested empirically as a monotonic performance drop on tasks of increasing difficulty (measured by newly introduced within- and across-category metrics) when pruning by magnitude; the degradation is claimed to be irreversible even after downstream continual training. The effect is contrasted with quantization, which does not exhibit the same monotonic disentanglement of task difficulty. Extensive experiments across model sizes, tasks, datasets, and pruning methods are presented to substantiate the hypothesis.

Significance. If the central empirical claim holds after addressing potential confounds, the work would challenge prevailing views on weight redundancy in LLMs and inform pruning/quantization strategies by highlighting permanent loss on complex tasks. The introduction of task-difficulty metrics and the public code release are positive contributions that aid reproducibility.

major comments (2)
  1. [Experimental Results] Experimental section: The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.
  2. [Continual Training subsection] Continual-training experiments: The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.
minor comments (2)
  1. [Task Difficulty Metrics] The two task-difficulty metrics are introduced but would benefit from explicit equations or pseudocode in the main text rather than only in the appendix.
  2. [Figures] Figure captions and axis labels should explicitly state the pruning ratio range and number of runs for error bars to improve clarity.
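
The control requested in major comment 1 can be sketched as follows: build a magnitude mask, then build a random mask that removes exactly the same number of weights. Function names and the random stand-in matrix are illustrative assumptions, and the sketch covers only mask construction, not the fine-tuning and evaluation that would follow.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean keep-mask that drops the smallest-|w| fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    mask = np.ones(flat.size, dtype=bool)
    if k > 0:
        mask[np.argpartition(flat, k - 1)[:k]] = False
    return mask.reshape(weights.shape)

def random_mask_like(mask, rng):
    """Random keep-mask with the same number of removed weights as `mask`."""
    return rng.permutation(mask.ravel()).reshape(mask.shape)

# Stand-in for a pre-trained weight matrix (random values, an assumption).
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))
m_mag = magnitude_mask(w, 0.5)
m_rand = random_mask_like(m_mag, rng)

print("kept (magnitude):", int(m_mag.sum()))
print("kept (random):   ", int(m_rand.sum()))
print("mean |w| removed, magnitude:", float(np.abs(w[~m_mag]).mean()))
print("mean |w| removed, random:   ", float(np.abs(w[~m_rand]).mean()))
```

Because both masks have identical sparsity, any difference in downstream behavior between the two pruned models can be attributed to which weights were removed rather than how many.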

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, providing clarifications on our experimental design while remaining faithful to the manuscript's content.

Point-by-point responses
  1. Referee: [Experimental Results] The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.

    Authors: The manuscript already contrasts magnitude pruning against quantization (which removes the monotonic disentanglement of task difficulty) and reports results across multiple pruning methods. These controls demonstrate that the observed monotonic impairment is tied to magnitude-based removal of small weights rather than arbitrary sparsity. A random-pruning baseline would provide an additional comparison, but the existing cross-method and quantization contrasts already isolate the effect sufficiently to support the hypothesis that small-magnitude weights carry task-critical information. We therefore maintain the current interpretation while noting the referee's suggestion for future work. revision: no

  2. Referee: [Continual Training subsection] The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.

    Authors: The experimental protocol section states that identical fine-tuning settings (learning rate, schedule, epochs, optimizer, and batch size) are applied to all models, pruned or unpruned. To address the concern directly, we will add an explicit sentence in the Continual Training subsection confirming that the protocol is held constant, thereby ruling out optimization-dynamics confounds. revision: yes

Circularity Check

0 steps flagged

Empirical hypothesis supported by pruning experiments; no derivation reduces to inputs

full rationale

The paper advances the Junk DNA Hypothesis solely via experimental results: magnitude pruning of small pre-trained weights produces monotonic performance drops that worsen with task difficulty, with irreversibility after continual training. No equations, uniqueness theorems, ansatzes, or first-principles derivations are presented that could reduce to fitted quantities or self-citations by construction. Task-difficulty metrics are introduced as quantifiable definitions and evaluated directly on data; the central claims rest on these observations across models and datasets rather than any self-referential reduction. This is the most common honest finding for an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim is an empirical observation rather than a derivation; it rests on the validity of newly introduced task-difficulty metrics and the causal interpretation of pruning results. No free parameters or invented physical entities are introduced.

axioms (1)
  • Domain assumption: Task difficulty can be meaningfully and reproducibly quantified both within the same task category and across different task categories using the metrics introduced in the paper.
    These metrics are required to establish the claimed monotonic relationship between pruning level and performance drop across the difficulty spectrum.
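
The figure captions describe one across-task metric as the Human-LLM performance gap normalized by the dense performance. A minimal sketch under a literal reading of that phrase, `(human - dense) / dense`, is given below. The exact formula is not stated on this page, so the expression, the function names, and the task labels are assumptions, and the scores are placeholders rather than reported numbers.

```python
def normalized_gap(human_score, dense_score):
    """Human-LLM gap divided by the dense model's score (assumed formula)."""
    if dense_score <= 0:
        raise ValueError("dense_score must be positive")
    return (human_score - dense_score) / dense_score

def rank_by_difficulty(scores):
    """Sort task names from easiest to hardest by normalized gap.

    `scores` maps task name -> (human_score, dense_score).
    """
    return sorted(scores, key=lambda task: normalized_gap(*scores[task]))

# Placeholder scores for hypothetical tasks, NOT values from the paper.
scores = {
    "task_a": (90.0, 88.0),
    "task_b": (90.0, 60.0),
    "task_c": (95.0, 80.0),
}
for task in rank_by_difficulty(scores):
    print(task, round(normalized_gap(*scores[task]), 3))
```

Under this reading, a larger gap means the dense model sits further below human performance, so the task is treated as harder.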



Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 9 internal anchors

  2. [2]

    Open source parallel corpus of opus. 2020

  3. [3]

    Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

    Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019

  4. [4]

    Junk DNA: a journey through the dark matter of the genome

    Nessa Carey. Junk DNA: a journey through the dark matter of the genome. Columbia University Press, 2015

  5. [5]

    The lottery ticket hypothesis for pre-trained bert networks

    Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained bert networks. Advances in neural information processing systems, 33:15834--15846, 2020

  6. [6]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  7. [7]

    Multitasking models are robust to structural failure: A neural model for bilingual cognitive reserve

    Giannis Daras, Negin Raoof, Zoi Gkalitsiou, and Alex Dimakis. Multitasking models are robust to structural failure: A neural model for bilingual cognitive reserve. Advances in Neural Information Processing Systems, 35:35130--35142, 2022

  8. [8]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  9. [9]

    Learning to prune deep neural networks via layer-wise optimal brain surgeon

    Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems, 30, 2017

  10. [10]

    Rigging the lottery: Making all tickets winners

    Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943--2952. PMLR, 2020

  11. [11]

    Sparsevsr: Lightweight and noise robust visual speech recognition

    Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, and Maja Pantic. Sparsevsr: Lightweight and noise robust visual speech recognition. arXiv preprint arXiv:2307.04552, 2023

  12. [12]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7

  13. [13]

    Linear mode connectivity and the lottery ticket hypothesis

    Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259--3269. PMLR, 2020

  14. [14]

    Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023

    Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023

  15. [15]

    M-fac: Efficient matrix-free approximations of second-order information

    Elias Frantar, Eldar Kurtic, and Dan Alistarh. M-fac: Efficient matrix-free approximations of second-order information. Advances in Neural Information Processing Systems, 34:14873--14886, 2021

  16. [16]

    The State of Sparsity in Deep Neural Networks

    Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019

  17. [17]

    Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

    Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016

  18. [18]

    Second order derivatives for network pruning: Optimal brain surgeon

    Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992

  19. [19]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  20. [20]

    The emergence of essential sparsity in large pre-trained models: The weights that matter

    Ajay Jaiswal, Shiwei Liu, Tianlong Chen, and Zhangyang Wang. The emergence of essential sparsity in large pre-trained models: The weights that matter. arXiv preprint arXiv:2306.03805, 2023a

  21. [21]

    Training your sparse neural network better with any mask

    Ajay Kumar Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, and Zhangyang Wang. Training your sparse neural network better with any mask. In International Conference on Machine Learning, pp. 9833--9844. PMLR, 2022

  22. [22]

    Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models

    Ajay Kumar Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, and Zhangyang Wang. Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. In International Conference on Machine Learning, pp. 14691--14701. PMLR, 2023b

  23. [23]

    Towards more effective and economic sparsely-activated model

    Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou, Xipeng Qiu, Zikai Guo, et al. Towards more effective and economic sparsely-activated model. arXiv preprint arXiv:2110.07431, 2021

  24. [24]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601--1611, 2017

  25. [25]

    Twice fine-tuning deep neural networks for paraphrase identification

    Bowon Ko and Ho-Jin Choi. Twice fine-tuning deep neural networks for paraphrase identification. Electronics Letters, 56(9):444--447, 2020

  26. [26]

    Dnr: A tunable robust pruning framework through dynamic network rewiring of dnns

    Souvik Kundu, Mahdi Nazemi, Peter A Beerel, and Massoud Pedram. Dnr: A tunable robust pruning framework through dynamic network rewiring of dnns. In Proceedings of the 26th Asia and South Pacific Design Automation Conference, pp. 344--350, 2021

  27. [27]

    The optimal bert surgeon: Scalable and accurate second-order pruning for large language models

    Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022

  28. [28]

    Block pruning for faster transformers

    François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838, 2021

  29. [29]

    Optimal brain damage

    Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pp. 598--605, 1990

  30. [30]

    Large models are parsimonious learners: Activation sparsity in trained transformers

    Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, et al. Large models are parsimonious learners: Activation sparsity in trained transformers. arXiv preprint arXiv:2210.06313, 2022

  31. [31]

    Do we actually need dense over-parameterization? in-time over-parameterization in sparse training

    Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. arXiv preprint arXiv:2102.02887, 2021

  32. [32]

    The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training

    Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. arXiv preprint arXiv:2202.02643, 2022a

  33. [33]

    Don't be so dense: Sparse-to-sparse gan training without sacrificing performance

    Shiwei Liu, Yuesong Tian, Tianlong Chen, and Li Shen. Don't be so dense: Sparse-to-sparse gan training without sacrificing performance. arXiv preprint arXiv:2203.02770, 2022b

  34. [34]

    Sparsity may cry: Let us fail (current) sparse neural networks together!

    Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Jaiswal, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! arXiv preprint arXiv:2303.02141, 2023

  35. [35]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  36. [36]

    Multilingual denoising pre-training for neural machine translation

    Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726--742, 2020

  37. [37]

    Llm-pruner: On the structural pruning of large language models

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023

  38. [38]

    Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science

    Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018

  39. [39]

    Pruning Convolutional Neural Networks for Resource Efficient Inference

    Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016

  40. [40]

    Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization

    Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 2019

  41. [41]

    Using relevance to reduce network size automatically

    Michael C Mozer and Paul Smolensky. Using relevance to reduce network size automatically. Connection Science, 1(1):3--16, 1989

  42. [42]

    Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

    Nikita Nangia and Samuel R Bowman. Human vs. muppet: A conservative estimate of human performance on the glue benchmark. arXiv preprint arXiv:1905.10425, 2019

  43. [43]

    Nvidia a100 tensor core gpu architecture

    Nvidia. Nvidia a100 tensor core gpu architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, 2020

  44. [44]

    So much "junk" dna in our genome

    Susumu Ohno. So much "junk" dna in our genome. Brookhaven symposia in biology, 23:366--70, 1972

  45. [45]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023

  46. [46]

    Comparing rewinding and fine-tuning in neural network pruning

    Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In 8th International Conference on Learning Representations, 2020

  47. [47]

    Movement pruning: Adaptive sparsity by fine-tuning

    Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 20378--20389. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/eae15aabaa768ae4a5993...

  48. [48]

    Woodfisher: Efficient second-order approximation for neural network compression

    Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems, 33:18098--18109, 2020

  49. [49]

    A Simple and Effective Pruning Approach for Large Language Models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  50. [50]

    Multilingual translation with extensible multilingual pretraining and finetuning

    Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401, 2020

  51. [51]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  52. [52]

    Eigendamage: Structured pruning in the kronecker-factored eigenbasis

    Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the kronecker-factored eigenbasis. In International Conference on Machine Learning, pp. 6566--6575. PMLR, 2019

  53. [53]

    Best practices for text classification with distillation (part 2/4) – challenging use cases

    Moshe Wasserblat. Best practices for text classification with distillation (part 2/4) – challenging use cases. https://www.linkedin.com/pulse/best-practices-text-classification-distillation-part-24-wasserblat/, 2021

  54. [54]

    Rethinking network pruning--under the pre-train and fine-tune paradigm

    Dongkuan Xu, Ian EH Yen, Jinxi Zhao, and Zhibin Xiao. Rethinking network pruning--under the pre-train and fine-tune paradigm. arXiv preprint arXiv:2104.08682, 2021

  55. [55]

    Dynamic sparsity is channel-level sparsity learner

    Lu Yin, Gen Li, Meng Fang, Li Shen, Tianjin Huang, Zhangyang Wang, Vlado Menkovski, Xiaolong Ma, Mykola Pechenizkiy, and Shiwei Liu. Dynamic sparsity is channel-level sparsity learner. arXiv preprint arXiv:2305.19454, 2023

  56. [56]

    Mest: Accurate and fast memory-economic sparse training framework on the edge

    Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, et al. Mest: Accurate and fast memory-economic sparse training framework on the edge. Advances in Neural Information Processing Systems, 34:20838--20850, 2021

  57. [57]

    Prune once for all: Sparse pre-trained language models

    Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. Prune once for all: Sparse pre-trained language models. arXiv preprint arXiv:2111.05754, 2021

  58. [58]

    Mlprune: Multi-layer pruning for automated neural network compression

    Wenyuan Zeng and Raquel Urtasun. Mlprune: Multi-layer pruning for automated neural network compression. 2018

  59. [59]

    Platon: Pruning large transformer models with upper confidence bound of weight importance

    Qingru Zhang, Simiao Zuo, Chen Liang, Alexander Bukharin, Pengcheng He, Weizhu Chen, and Tuo Zhao. Platon: Pruning large transformer models with upper confidence bound of weight importance. In International Conference on Machine Learning, pp. 26809--26823. PMLR, 2022

  60. [60]

    Role of conserved non-coding dna elements in the foxp3 gene in regulatory t-cell fate

    Ye Zheng, Steven Josefowicz, Ashutosh Chaudhry, Xiao P Peng, Katherine Forbush, and Alexander Y Rudensky. Role of conserved non-coding dna elements in the foxp3 gene in regulatory t-cell fate. Nature, 463(7282):808--812, 2010

  61. [61]

    Learning n: m fine-grained structured sparse neural networks from scratch

    Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010, 2021
