Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Ajay Jaiswal; Lu Yin; Shiwei Liu; Souvik Kundu; Zhangyang Wang

arxiv: 2310.02277 · v4 · submitted 2023-09-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-24 06:32 UTC · model grok-4.3

keywords: junk DNA hypothesis · magnitude pruning · LLM compression · downstream task difficulty · small-magnitude weights · irreversible performance loss · model pruning · continual training

The pith

Small-magnitude weights in pre-trained LLMs encode vital knowledge for difficult downstream tasks, shown by irreversible monotonic performance drops when pruned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the view that LLMs hold large amounts of redundant parameters that pruning can remove at little cost. Instead it proposes that small-magnitude weights carry information required specifically for harder downstream tasks. Experiments show that removing progressively more of these weights by magnitude produces steadily larger performance losses, and that the losses grow with task difficulty. The degradation on difficult tasks persists even after the model receives further training on the target task. The same pattern does not appear under quantization, and the authors supply new metrics that rank task difficulty both within one category and across categories.

Core claim

The Junk DNA Hypothesis states that small-magnitude weights of a pre-trained LLM encode vital knowledge needed for difficult downstream tasks. The evidence offered is a performance drop that grows monotonically as more of these weights are pruned by magnitude, with steeper drops on harder tasks. The impairment cannot be reversed by downstream continual training. Quantization does not produce an equivalent monotonic separation of task difficulty. The claim is supported by new quantifiable difficulty metrics defined within and across task categories, and it holds over multiple model sizes, datasets, and pruning techniques.

What carries the argument

Monotonic performance degradation on a spectrum of downstream task difficulties when small-magnitude pre-trained weights are removed by increasing magnitude-based pruning ratios.
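To make the operation concrete, here is a minimal sketch of one-shot global magnitude pruning on a flat list of weights. It is an illustration only, not the authors' implementation: the function name `magnitude_prune`, the toy weight values, and the choice of a single global threshold (rather than per-layer thresholds) are all assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.

    Hypothetical helper: global (not per-layer) pruning on a flat list.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered by ascending absolute value; sorted() is stable,
    # so ties are broken by original position.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Toy weights (made up for illustration).
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.15, -0.4]
for ratio in (0.25, 0.5, 0.75):
    print(ratio, magnitude_prune(weights, ratio))
```

Sweeping `ratio` upward mirrors the increasing pruning ratios described above: each step removes the next-smallest weights while leaving the large ones untouched.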

If this is right

Performance declines more sharply on difficult tasks than on easy ones as the pruning ratio of small-magnitude weights increases.
The performance loss on difficult tasks persists even when the pruned model is allowed further training on the downstream task.
Quantization does not produce the same monotonic relationship between compression level and task difficulty.
The pattern appears consistently across model sizes, task categories, datasets, and pruning methods.
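The predictions listed above can be phrased as a simple check on a table of performance drops. The sketch below assumes a hypothetical layout (rows ordered from easy to hard tasks, columns ordered by increasing pruning ratio) and uses made-up numbers; none of the values come from the paper.

```python
def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def shows_monotone_pattern(drops_by_task):
    """Check the two monotonic trends on a drop table.

    drops_by_task: rows ordered easy -> hard; each row holds the
    performance drop (dense minus sparse) at increasing pruning ratios.
    """
    # Within each task, the drop should not shrink as more weights go.
    within = all(is_non_decreasing(row) for row in drops_by_task)
    # At each pruning ratio, harder tasks should lose at least as much.
    across = all(is_non_decreasing(col) for col in zip(*drops_by_task))
    return within and across


# Hypothetical drops (percentage points), NOT results from the paper.
pruning_like = [
    [0.0, 0.5, 1.0, 2.0],    # easy task
    [0.2, 1.0, 3.0, 6.0],    # medium task
    [0.5, 2.5, 7.0, 15.0],   # hard task
]
non_monotone = [
    [0.0, 0.5, 0.3, 2.0],    # drop shrinks at the third ratio
    [0.2, 1.0, 3.0, 6.0],
    [0.5, 0.8, 7.0, 15.0],   # hard task loses less than medium at ratio 2
]
print(shows_monotone_pattern(pruning_like))   # True
print(shows_monotone_pattern(non_monotone))   # False
```

A strict check like this is sensitive to noise; with real measurements one would likely allow a tolerance or compare against run-to-run variance.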

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the claim holds, pruning methods intended for high-complexity applications would need to protect small-magnitude weights to avoid permanent capability loss.
The result suggests that apparent parameter redundancy may actually support robustness on harder problems rather than being removable without consequence.
One could test whether the same monotonic pattern appears when models are trained from random initialization under magnitude constraints.

Load-bearing premise

That the observed monotonic performance drop after magnitude pruning directly shows small weights encode vital knowledge rather than arising from optimization dynamics or task selection choices.

What would settle it

An experiment in which performance on the most difficult tasks returns to the level of the unpruned model after magnitude pruning once sufficient downstream continual training is performed.

Figures

Figures reproduced from arXiv: 2310.02277 by Ajay Jaiswal, Lu Yin, Shiwei Liu, Souvik Kundu, Zhangyang Wang.

**Figure 1.** Task Difficulty Setting 1: Varying target domain data adequacy. Dense Transfer vs. Sparse Transfer using RoBERTa-Base on various downstream tasks. Task difficulty is measured by the training data volume. … manipulate the option count for each question from 2 to 4, which provides a random-guess success rate from 50% (2 options) to 25% (4 options). This setting uniquely allows us to control the task difficult f…

**Figure 3.** Task Difficulty Setting 3: Varying context length in Retrieval-Augmented QA. Dense vs. Sparse subnetwork performance of Vicuna-7B pruned using TriviaQA Benchmark. Task difficulty is measured by the number of tokens provided in context. … for this task, we propose to vary the context length, ensuring that the correct answer still resides within the provided context. Retrieval-augmented QA requires LLMs to poss…

**Figure 4.** Task Difficulty Setting 4: Few-shot In-context Learning. Dense vs. Sparse subnetwork performance of Vicuna-7B pruned using MMLU Benchmark. Task difficulty is measured by the number of K-shot in-context demonstration examples provided to assist multiple-choice QA. …ditioned documents. Given the ground truth, we select x% of tokens around it in the context document from the document selection step, to ensur…

**Figure 5.** Across-Task Difficulty via Normalized Human-LLM Performance Gap: Dense vs. Sparse subnetwork performance of Vicuna-13B. Task difficulty is measured by the Human-LLM performance gap normalized by the dense performance. … 3.2.2. Task Difficulty Setting 6: Factoid-based vs. Multiple-choice QA. Rationale and Method: In this setting, we compare two popular QA settings: Factoid-based QA and Multiple-Choice QA. A typ…

**Figure 6.** Across-Task Difficulty for Factoid-based QA and Multiple-Choice QA: Dense vs. Sparse subnetwork performance of Vicuna-7B. Task difficulty is measured by the Human-LLM performance gap normalized by the dense performance. … for the validity of the Junk DNA hypothesis across a broad spectrum of task categories. While it may be feasible to remove small-magnitude weights without significant repercussions in simpl…
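The captions describe across-task difficulty as the human-LLM performance gap normalized by the dense model's performance. One plausible reading, sketched below, is (human − dense) / dense. The exact formula, the function name `normalized_gap`, and every score in the example are assumptions for illustration; the paper's own definition should be checked before relying on it.

```python
def normalized_gap(human_score, dense_score):
    """Assumed difficulty metric: (human - dense) / dense.

    A larger value means the dense model trails humans by more,
    relative to its own score, so the task is ranked as harder.
    """
    if dense_score <= 0:
        raise ValueError("dense_score must be positive")
    return (human_score - dense_score) / dense_score


# Hypothetical task scores, NOT taken from the paper.
tasks = {
    "task_a": {"human": 92.0, "dense": 90.0},
    "task_b": {"human": 80.0, "dense": 64.0},
    "task_c": {"human": 88.0, "dense": 55.0},
}

# Rank tasks from easiest to hardest under the assumed metric.
ranking = sorted(
    tasks, key=lambda t: normalized_gap(tasks[t]["human"], tasks[t]["dense"])
)
print(ranking)
```

Normalizing by the dense score makes gaps comparable across tasks whose metrics live on different scales, which is presumably why a raw difference is not used.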

**Figure 7.** How is pruning special? Performance comparison of pruning and quantization with varying compression ratios on our task difficulty spectrum. We can observe the monotonic impairment of pruning across task difficulty and pruning ratio. On the contrary, quantization fails to capture this monotonic behavior across task difficulty and compression ratio. … extreme end of our task difficulty spectrum, which is again…
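As a rough illustration of how the two compression families differ mechanically, the sketch below compares per-weight error under magnitude pruning and under a simple symmetric round-to-nearest quantizer. The quantizer is a stand-in chosen for brevity; this page does not say which quantization method the paper uses, and the toy weights and helper names are hypothetical. The sketch does not by itself explain the behavior reported in the figure.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-|w| fraction of a flat weight list."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def quantize_rtn(weights, bits):
    """Symmetric uniform round-to-nearest quantization with one scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights], scale


def abs_errors(original, compressed):
    return [abs(o - c) for o, c in zip(original, compressed)]


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.15, -0.4]  # toy values
pruned = magnitude_prune(weights, 0.5)
quantized, scale = quantize_rtn(weights, bits=8)

prune_err = abs_errors(weights, pruned)      # error sits only on small weights
quant_err = abs_errors(weights, quantized)   # bounded error on every weight
print("pruning errors:     ", prune_err)
print("quantization errors:", quant_err)
print("quantization bound: ", scale / 2)
```

Pruning concentrates all of its error on the smallest weights and erases them completely, whereas this quantizer perturbs every weight by at most half a step. At very low bit widths the step grows and small weights can also round to zero, so the contrast is clearest at moderate precision.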

**Figure 8.** Varying target domain data adequacy: Four different fine-tuning settings with RoBERTa-Base on various downstream tasks. All performance is normalized by that of Dense Transfer. [Plot panels: SST-2, MNLI, QNLI, CoLA, CSQA, …]

**Figure 9.** Across-Task Difficulty via Normalized Human-LLM Performance Gap: Four different fine-tuning settings with RoBERTa-Large on various downstream tasks. All performance is normalized by that of Dense Transfer. …culty spectrum normalized by human performance for the aforementioned tasks is presented in Appendix B

**Figure 10.** Linear interpolation from the Dense Transfer (Left) model to its corresponding Sparse Transfer models (Right) on easy and harder tasks (in terms of across-task difficulty). … easier ones. Consequently, the absence of these small weights disrupts the optimal basin, leading to a considerable loss of performance. To test our conjecture, we utilize the linear mode connectivity (LMC) metric proposed by (Frankle …
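The linear-interpolation probe mentioned in this caption can be sketched on a toy problem. The code below interpolates between two parameter vectors and reports a loss barrier, taken here as the worst loss on the path minus the mean of the endpoint losses, which is how linear mode connectivity is commonly measured. The loss functions, endpoints, and helper names are invented for illustration and stand in for the Dense Transfer and Sparse Transfer models.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line from theta_a (alpha=0) to theta_b (alpha=1)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Evaluate the loss along the path and return (alphas, losses, barrier)."""
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    # Barrier: worst loss on the path minus the average endpoint loss.
    barrier = max(losses) - 0.5 * (losses[0] + losses[-1])
    return alphas, losses, barrier


def double_well(theta):
    """Toy loss with separate minima at -1 and +1 in each coordinate."""
    return sum((t * t - 1.0) ** 2 for t in theta)


def bowl(theta):
    """Toy convex loss with a single basin."""
    return sum(t * t for t in theta)


# Endpoints in different basins: the path climbs over a ridge.
_, _, barrier_apart = loss_barrier(double_well, [-1.0, -1.0], [1.0, 1.0])
# Endpoints in one basin: no point on the path is worse than the endpoints.
_, _, barrier_same = loss_barrier(bowl, [1.0, 0.0], [0.0, 1.0])
print(barrier_apart, barrier_same)
```

A barrier near zero suggests the two models sit in the same basin, while a large barrier suggests the sparse model has been pushed into a different one, which is the kind of evidence the caption's conjecture would look for.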

**Figure 11.** Low-Rank Compression using SVD. We noticed the concurrent work (Sharma et al., 2023) suggesting that layer-selective low-rank compression of weights often improves LLM reasoning and generalization without needing re-training. We note, however, that this requires careful …

Original abstract

We present Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained model weights encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression, namely quantization, fails to exhibit similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Codes are available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The monotonic pruning effect on hard tasks is the real observation here, but linking it directly to small weights encoding specific knowledge still needs stronger isolation from optimization confounds.

The paper's central finding is that magnitude pruning of small pre-trained weights produces a monotonic performance drop that hits harder downstream tasks more severely, and that this loss does not recover with continued fine-tuning. The authors contrast this with quantization, which does not show the same pattern, and they introduce metrics to quantify task difficulty both within and across categories. Experiments span multiple model sizes, tasks, and pruning methods, with code released.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Junk DNA Hypothesis for pre-trained LLMs, claiming that small-magnitude weights encode vital knowledge for difficult downstream tasks. This is manifested empirically as a monotonic performance drop on tasks of increasing difficulty (measured by newly introduced within- and across-category metrics) when pruning by magnitude; the degradation is claimed to be irreversible even after downstream continual training. The effect is contrasted with quantization, which does not exhibit the same monotonic disentanglement of task difficulty. Extensive experiments across model sizes, tasks, datasets, and pruning methods are presented to substantiate the hypothesis.

Significance. If the central empirical claim holds after addressing potential confounds, the work would challenge prevailing views on weight redundancy in LLMs and inform pruning/quantization strategies by highlighting permanent loss on complex tasks. The introduction of task-difficulty metrics and the public code release are positive contributions that aid reproducibility.

major comments (2)

[Experimental Results] Experimental section: The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.
[Continual Training subsection] Continual-training experiments: The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.
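The control requested in the first major comment could look roughly like the sketch below: a random-pruning baseline that removes the same number of weights as magnitude pruning, so that any difference in downstream behavior can be attributed to which weights were removed rather than how many. Function names, the seed handling, and the toy weights are hypothetical, and nothing here reflects code from the paper's repository.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-|w| fraction of a flat weight list."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def random_prune(weights, sparsity, seed=0):
    """Zero out the same number of weights, chosen uniformly at random."""
    n_prune = int(round(sparsity * len(weights)))
    rng = random.Random(seed)  # local RNG keeps the baseline reproducible
    pruned = set(rng.sample(range(len(weights)), n_prune))
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def sparsity_of(weights):
    """Fraction of exactly-zero entries."""
    return sum(w == 0.0 for w in weights) / len(weights)


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.15, -0.4]  # toy values
by_magnitude = magnitude_prune(weights, 0.5)
by_chance = random_prune(weights, 0.5, seed=0)

# Matched sparsity is the point of the control.
print(sparsity_of(by_magnitude), sparsity_of(by_chance))
```

In practice one would repeat the random baseline over several seeds and fine-tune each variant with the same protocol before comparing task performance.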

minor comments (2)

[Task Difficulty Metrics] The two task-difficulty metrics are introduced but would benefit from explicit equations or pseudocode in the main text rather than only in the appendix.
[Figures] Figure captions and axis labels should explicitly state the pruning ratio range and number of runs for error bars to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, providing clarifications on our experimental design while remaining faithful to the manuscript's content.

Referee: [Experimental Results] The monotonic degradation and irreversibility after continual training are interpreted as evidence that small-magnitude weights specifically encode vital task knowledge. However, without a random-pruning (or other non-magnitude) baseline of matched sparsity, the results do not isolate this from general effects of weight removal on gradient scales and the fine-tuning loss landscape. This control is load-bearing for the hypothesis.

Authors: The manuscript already contrasts magnitude pruning against quantization (which removes the monotonic disentanglement of task difficulty) and reports results across multiple pruning methods. These controls demonstrate that the observed monotonic impairment is tied to magnitude-based removal of small weights rather than arbitrary sparsity. A random-pruning baseline would provide an additional comparison, but the existing cross-method and quantization contrasts already isolate the effect sufficiently to support the hypothesis that small-magnitude weights carry task-critical information. We therefore maintain the current interpretation while noting the referee's suggestion for future work.
Revision: no

Referee: [Continual Training subsection] The claim of irreparable loss requires explicit confirmation that the post-pruning fine-tuning protocol (learning-rate schedule, epochs, optimizer settings) is held constant across pruned and unpruned models; otherwise the observed impairment on difficult tasks could arise from altered optimization dynamics alone.

Authors: The experimental protocol section states that identical fine-tuning settings (learning rate, schedule, epochs, optimizer, and batch size) are applied to all models, pruned or unpruned. To address the concern directly, we will add an explicit sentence in the Continual Training subsection confirming that the protocol is held constant, thereby ruling out optimization-dynamics confounds.
Revision: yes
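One lightweight way to make a "protocol held constant" claim checkable is to pin the fine-tuning settings in a single immutable object and verify that every run points at an identical copy. The sketch below is an assumption-laden illustration: the class name, field names, run labels, and all hyperparameter values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FineTuneProtocol:
    """Settings the rebuttal says are shared by pruned and unpruned models."""
    learning_rate: float
    schedule: str
    epochs: int
    optimizer: str
    batch_size: int


# Hypothetical values for illustration only.
SHARED = FineTuneProtocol(
    learning_rate=2e-5, schedule="linear", epochs=3,
    optimizer="adamw", batch_size=32,
)

runs = {"dense": SHARED, "sparse_30": SHARED, "sparse_60": SHARED}


def protocol_is_constant(runs):
    """True if every run uses an identical protocol.

    Frozen dataclasses are hashable, so a set collapses equal configs.
    """
    return len(set(runs.values())) == 1


# A run that quietly trains longer breaks the comparison.
mismatched = dict(runs, sparse_60=replace(SHARED, epochs=10))
print(protocol_is_constant(runs), protocol_is_constant(mismatched))
```

Logging the result of such a check alongside each experiment would give a referee the explicit confirmation requested above.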

Circularity Check

0 steps flagged

Empirical hypothesis supported by pruning experiments; no derivation reduces to inputs

The paper advances the Junk DNA Hypothesis solely via experimental results: magnitude pruning of small pre-trained weights produces monotonic performance drops that worsen with task difficulty, with irreversibility after continual training. No equations, uniqueness theorems, ansatzes, or first-principles derivations are presented that could reduce to fitted quantities or self-citations by construction. Task-difficulty metrics are introduced as quantifiable definitions and evaluated directly on data; the central claims rest on these observations across models and datasets rather than any self-referential reduction. This is the most common honest finding for an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim is an empirical observation rather than a derivation; it rests on the validity of newly introduced task-difficulty metrics and the causal interpretation of pruning results. No free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Task difficulty can be meaningfully and reproducibly quantified both within the same task category and across different task categories using the metrics introduced in the paper.
These metrics are required to establish the claimed monotonic relationship between pruning level and performance drop across the difficulty spectrum.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 9 internal anchors

[2]

Open source parallel corpus of opus. 2020

work page 2020
[3]

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[4]

Junk DNA: a journey through the dark matter of the genome

Nessa Carey. Junk DNA: a journey through the dark matter of the genome. Columbia University Press, 2015

work page 2015
[5]

The lottery ticket hypothesis for pre-trained bert networks

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained bert networks. Advances in neural information processing systems, 33: 0 15834--15846, 2020

work page 2020
[6]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[7]

Multitasking models are robust to structural failure: A neural model for bilingual cognitive reserve

Giannis Daras, Negin Raoof, Zoi Gkalitsiou, and Alex Dimakis. Multitasking models are robust to structural failure: A neural model for bilingual cognitive reserve. Advances in Neural Information Processing Systems, 35: 0 35130--35142, 2022

work page 2022
[8]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Learning to prune deep neural networks via layer-wise optimal brain surgeon

Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems, 30, 2017

work page 2017
[10]

Rigging the lottery: Making all tickets winners

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp.\ 2943--2952. PMLR, 2020

work page 2020
[11]

Sparsevsr: Lightweight and noise robust visual speech recognition

Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, and Maja Pantic. Sparsevsr: Lightweight and noise robust visual speech recognition. arXiv preprint arXiv:2307.04552, 2023

work page arXiv 2023
[12]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7

work page 2019
[13]

Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp.\ 3259--3269. PMLR, 2020

work page 2020
[14]

Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023

work page 2023
[15]

M-fac: Efficient matrix-free approximations of second-order information

Elias Frantar, Eldar Kurtic, and Dan Alistarh. M-fac: Efficient matrix-free approximations of second-order information. Advances in Neural Information Processing Systems, 34: 0 14873--14886, 2021

work page 2021
[16]

The State of Sparsity in Deep Neural Networks

Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[17]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016

work page 2016
[18]

Second order derivatives for network pruning: Optimal brain surgeon

Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992

work page 1992
[19]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[20]

The emergence of essential sparsity in large pre-trained models: The weights that matter

Ajay Jaiswal, Shiwei Liu, Tianlong Chen, and Zhangyang Wang. The emergence of essential sparsity in large pre-trained models: The weights that matter. arXiv preprint arXiv:2306.03805, 2023 a

work page arXiv 2023
[21]

Training your sparse neural network better with any mask

Ajay Kumar Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, and Zhangyang Wang. Training your sparse neural network better with any mask. In International Conference on Machine Learning, pp.\ 9833--9844. PMLR, 2022

work page 2022
[22]

Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models

Ajay Kumar Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, and Zhangyang Wang. Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. In International Conference on Machine Learning, pp.\ 14691--14701. PMLR, 2023 b

work page 2023
[23]

Towards more effective and economic sparsely-activated model

Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou, Xipeng Qiu, Zikai Guo, et al. Towards more effective and economic sparsely-activated model. arXiv preprint arXiv:2110.07431, 2021

work page arXiv 2021
[24]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017

work page 2017
[25]

Twice fine-tuning deep neural networks for paraphrase identification

Bowon Ko and Ho-Jin Choi. Twice fine-tuning deep neural networks for paraphrase identification. Electronics Letters, 56 0 (9): 0 444--447, 2020

work page 2020
[26]

Dnr: A tunable robust pruning framework through dynamic network rewiring of dnns

Souvik Kundu, Mahdi Nazemi, Peter A Beerel, and Massoud Pedram. Dnr: A tunable robust pruning framework through dynamic network rewiring of dnns. In Proceedings of the 26th Asia and South Pacific Design Automation Conference, pp.\ 344--350, 2021

work page 2021
[27]

The optimal bert surgeon: Scalable and accurate second-order pruning for large language models

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022

work page arXiv 2022
[28]

Block pruning for faster transformers

Fran c ois Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838, 2021

work page arXiv 2021
[29]

Optimal brain damage

Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pp.\ 598--605, 1990

work page 1990
[30]

Large models are parsimonious learners: Activation sparsity in trained transformers

Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, et al. Large models are parsimonious learners: Activation sparsity in trained transformers. arXiv preprint arXiv:2210.06313, 2022

work page arXiv 2022
[31]

Do we actually need dense over-parameterization? in-time over-parameterization in sparse training

Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. arXiv preprint arXiv:2102.02887, 2021

work page arXiv 2021
[32]

The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. arXiv preprint arXiv:2202.02643, 2022 a

work page arXiv 2022
[33]

Don't be so dense: Sparse-to-sparse gan training without sacrificing performance

Shiwei Liu, Yuesong Tian, Tianlong Chen, and Li Shen. Don't be so dense: Sparse-to-sparse gan training without sacrificing performance. arXiv preprint arXiv:2203.02770, 2022 b

work page arXiv 2022
[34]

Sparsity may cry: Let us fail (current) sparse neural networks together! arXiv preprint arXiv:2303.02141, 2023

Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Jaiswal, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! arXiv preprint arXiv:2303.02141, 2023

work page arXiv 2023
[35]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[36]

Multilingual denoising pre-training for neural machine translation

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8: 0 726--742, 2020

work page 2020
[37]

Llm-pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023

work page arXiv 2023
[38]

Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science

Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9 0 (1): 0 2383, 2018

work page 2018
[39]

Pruning Convolutional Neural Networks for Resource Efficient Inference

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[40]

Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization

Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 2019

work page 2019
[41]

Using relevance to reduce network size automatically

Michael C Mozer and Paul Smolensky. Using relevance to reduce network size automatically. Connection Science, 1 0 (1): 0 3--16, 1989

work page 1989
[42]

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

Nikita Nangia and Samuel R Bowman. Human vs. muppet: A conservative estimate of human performance on the glue benchmark. arXiv preprint arXiv:1905.10425, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[43]

Nvidia a100 tensor core gpu architecture

Nvidia. Nvidia a100 tensor core gpu architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, 2020

work page 2020
[44]

So much "junk" dna in our genome

Susumu Ohno. So much "junk" dna in our genome. Brookhaven symposia in biology, 23: 0 366--70, 1972

work page 1972
[45]

In-context retrieval-augmented language models

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023

work page arXiv 2023
[46]

Comparing rewinding and fine-tuning in neural network pruning

Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In 8th International Conference on Learning Representations, 2020

work page 2020
[47]

Movement pruning: Adaptive sparsity by fine-tuning

Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 20378--20389. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/eae15aabaa768ae4a5993...

work page 2020
[48]

Woodfisher: Efficient second-order approximation for neural network compression

Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems, 33: 0 18098--18109, 2020

work page 2020
[49]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

[50]

Multilingual translation with extensible multilingual pretraining and finetuning

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401, 2020

[51]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

[52]

EigenDamage: Structured pruning in the Kronecker-factored eigenbasis

Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. In International Conference on Machine Learning, pp. 6566–6575. PMLR, 2019

[53]

Best practices for text classification with distillation (part 2/4) – challenging use cases

Moshe Wasserblat. Best practices for text classification with distillation (part 2/4) – challenging use cases. https://www.linkedin.com/pulse/best-practices-text-classification-distillation-part-24-wasserblat/, 2021

[54]

Rethinking network pruning – under the pre-train and fine-tune paradigm

Dongkuan Xu, Ian EH Yen, Jinxi Zhao, and Zhibin Xiao. Rethinking network pruning – under the pre-train and fine-tune paradigm. arXiv preprint arXiv:2104.08682, 2021

[55]

Dynamic sparsity is channel-level sparsity learner

Lu Yin, Gen Li, Meng Fang, Li Shen, Tianjin Huang, Zhangyang Wang, Vlado Menkovski, Xiaolong Ma, Mykola Pechenizkiy, and Shiwei Liu. Dynamic sparsity is channel-level sparsity learner. arXiv preprint arXiv:2305.19454, 2023

[56]

MEST: Accurate and fast memory-economic sparse training framework on the edge

Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, et al. MEST: Accurate and fast memory-economic sparse training framework on the edge. Advances in Neural Information Processing Systems, 34:20838–20850, 2021

[57]

Prune once for all: Sparse pre-trained language models

Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. Prune once for all: Sparse pre-trained language models. arXiv preprint arXiv:2111.05754, 2021

[58]

Mlprune: Multi-layer pruning for automated neural network compression

Wenyuan Zeng and Raquel Urtasun. Mlprune: Multi-layer pruning for automated neural network compression. 2018

[59]

PLATON: Pruning large transformer models with upper confidence bound of weight importance

Qingru Zhang, Simiao Zuo, Chen Liang, Alexander Bukharin, Pengcheng He, Weizhu Chen, and Tuo Zhao. PLATON: Pruning large transformer models with upper confidence bound of weight importance. In International Conference on Machine Learning, pp. 26809–26823. PMLR, 2022

[60]

Role of conserved non-coding DNA elements in the Foxp3 gene in regulatory T-cell fate

Ye Zheng, Steven Josefowicz, Ashutosh Chaudhry, Xiao P Peng, Katherine Forbush, and Alexander Y Rudensky. Role of conserved non-coding DNA elements in the Foxp3 gene in regulatory T-cell fate. Nature, 463(7282):808–812, 2010

[61]

Learning N:M fine-grained structured sparse neural networks from scratch

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010, 2021

