pith. machine review for the scientific record.

arxiv: 2605.06402 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords semi-structured sparsity · LLM pruning · Hessian estimation · soft mask annealing · post-training compression · model efficiency · zero-shot evaluation

The pith

SparseForge recovers LLM accuracy under 2:4 semi-structured sparsity by directly annealing Hessian-guided soft masks rather than scaling retraining data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SparseForge as a post-training method to turn dense LLMs into hardware-friendly semi-structured sparse versions while keeping most of their performance. It estimates which weights matter using the Hessian matrix, then gradually hardens continuous soft masks into fixed 2:4 patterns during limited retraining. On LLaMA-2-7B this yields 57.27 percent average zero-shot accuracy after only 5 billion tokens, beating the original dense model's 56.43 percent and coming close to prior methods that use eight times more tokens. The same pattern holds on other model families. If correct, the work shows that careful mask design can replace much of the expensive data scaling previously needed for sparse recovery.

Core claim

SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity it reaches 57.27 percent average zero-shot accuracy using only 5B retraining tokens, surpassing the dense baseline of 56.43 percent and approaching the 57.52 percent result of a state-of-the-art method that requires 40B tokens, with consistent gains across model families.

What carries the argument

Hessian-guided soft-mask annealing: a process that scores weight importance with second-order curvature information and progressively converts continuous soft masks into discrete 2:4 structured sparsity patterns during retraining.
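
Neither the pith nor the abstract pins down the exact parameterization, so the following is only a minimal sketch of the general mechanism, assuming a diagonal Hessian score and a temperature-controlled soft top-2-of-4 selection; the function names, the 0.5 * H_ii * w_i^2 heuristic, and the geometric temperature decay are illustrative choices, not taken from the paper.

```python
import torch

def soft_mask_2to4(scores: torch.Tensor, temperature: float) -> torch.Tensor:
    """Differentiable relaxation of 'keep the top 2 weights in every group of 4'.

    scores: per-weight importance (e.g. Hessian-weighted magnitudes); numel must
    be divisible by 4. Returns values in (0, 1); as temperature -> 0 the mask
    approaches a hard 2:4 pattern.
    """
    groups = scores.reshape(-1, 4)
    top3 = groups.topk(3, dim=-1).values              # sorted descending per group
    threshold = 0.5 * (top3[:, 1:2] + top3[:, 2:3])   # midpoint between 2nd and 3rd score
    soft = torch.sigmoid((groups - threshold) / max(temperature, 1e-8))
    return soft.reshape(scores.shape)

def annealed_masks(weights: torch.Tensor, hessian_diag: torch.Tensor,
                   steps: int = 1000, t_start: float = 1.0, t_end: float = 1e-3):
    """Heating -> quenching: yield progressively harder masks for a retraining loop.

    Importance follows the classic second-order pruning heuristic 0.5 * H_ii * w_i^2;
    the geometric temperature decay is one plausible schedule, not the paper's.
    """
    scores = 0.5 * hessian_diag * weights.detach() ** 2
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        yield soft_mask_2to4(scores, t)               # caller applies mask * weights in the forward pass
```

Read in the paper's terms, the early high-temperature steps correspond to the heating stage where mask and weights co-adapt, and driving the temperature toward zero plays the role of quenching into a deployable binary pattern.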

If this is right

  • Semi-structured sparse LLMs can exceed dense accuracy on zero-shot tasks with far less retraining compute.
  • Mask optimization serves as a substitute for token scaling in sparse recovery pipelines.
  • The same annealing procedure transfers to other model families without major changes.
  • Hardware-native 2:4 sparsity becomes practical for deployment at lower total training cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Mask design may be a higher-leverage control than previously assumed for balancing sparsity and capability.
  • The approach could be combined with other compression methods such as quantization to compound efficiency gains.
  • If the annealing schedule proves robust at larger scales, it would lower the compute barrier for testing many sparse configurations.

Load-bearing premise

Directly optimizing the sparsity mask through Hessian-guided annealing produces stable accuracy recovery that generalizes across model families without hidden dataset-specific tuning.

What would settle it

If retraining LLaMA-2-7B to 2:4 sparsity with a fixed random mask for 5B tokens produces zero-shot accuracy below the dense 56.43 percent while the annealed mask reaches 57.27 percent, the mask optimization step adds value; the opposite outcome would falsify it.
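
A full version of that test needs 5B-token retraining runs, but the load-bearing mechanics can be dry-run cheaply: apply a random hard 2:4 mask and an importance-ranked one to a single linear layer and compare how much of the layer's output each destroys. This is only a proxy for the zero-shot comparison described above; the layer sizes, the `(x ** 2).mean` curvature proxy, and all names in the sketch are illustrative assumptions, not the paper's setup.

```python
import torch

def hard_2to4_mask(scores: torch.Tensor) -> torch.Tensor:
    """Keep the top-2 scoring weights in every contiguous group of 4."""
    groups = scores.reshape(-1, 4)
    keep = torch.zeros_like(groups)
    keep.scatter_(1, groups.topk(2, dim=-1).indices, 1.0)
    return keep.reshape(scores.shape)

@torch.no_grad()
def masked_output_error(weight, x, mask):
    """Relative error of the layer output after masking (a proxy, not zero-shot accuracy)."""
    dense = x @ weight.T
    sparse = x @ (weight * mask).T
    return ((dense - sparse).norm() / dense.norm()).item()

torch.manual_seed(0)
w = torch.randn(256, 512)               # stand-in for one linear layer's weights
x = torch.randn(1024, 512)              # stand-in calibration activations
hessian_diag = (x ** 2).mean(dim=0)     # crude diagonal curvature proxy per input feature

random_mask = hard_2to4_mask(torch.rand_like(w))
guided_mask = hard_2to4_mask(0.5 * hessian_diag * w ** 2)

print("random 2:4 error:", masked_output_error(w, x, random_mask))
print("guided 2:4 error:", masked_output_error(w, x, guided_mask))
```

Under the paper's framing, the guided mask should typically show a clearly lower reconstruction error before any retraining; the token-matched retraining comparison then tests whether that head start translates into the zero-shot accuracy gap described above.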

Figures

Figures reproduced from arXiv: 2605.06402 by Chaofan Lin, Key, Liu Hanzuo, Mingyu Gao, Rayying, Weixuan Sun, Yulong Wang.

Figure 1
Figure 1: (a) Compared with hard-mask retraining, SparseForge explicitly optimizes a soft mask and progressively anneals it into a deployable binary 2:4 mask. (b) On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with 5B retraining tokens, approaching the 57.52% result of the stronger 40B-token prior SOTA variant [14] while using about 8× fewer training tokens.
Figure 2
Figure 2: Soft masks improve sparse recovery: compared with the hard-mask-style AST baseline, our soft-mask-style SparseForge improves the mean zero-shot accuracy on the 7-task benchmark from 58.62% to 59.20%, supporting the need for soft mask optimization.
Figure 3
Figure 3: (a) Hessian-aware importance provides a better survival signal: replacing it with magnitude-based scoring drops mean zero-shot accuracy from 57.23% to 55.61%. (b) Soft masks must also be progressively hardened toward a deployable 2:4 pattern: structural hardening of the score pushes the top-2 and bottom-2 entries in each group towards 1 and 0, respectively; without such hardening, the mask remains far from binary.
Figure 4
Figure 4: Overview of SparseForge. We first co-optimize the weights and the explicit learnable soft mask in a dual-track retraining loop (§4.1) in the heating stage, where the mask is updated with a Hessian-guided signal (§4.2). Then we progressively binarize the mask to a hard, deployable form in the quenching stage (§4.3).
Figure 5
Figure 5: Accuracy vs. retraining tokens on LLaMA-2-7B (2:4 sparsity, log scale).
read the original abstract

Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only $\textbf{5B}$ retraining tokens, surpassing the dense model's 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using $\textbf{40B}$ tokens. Such improvements on the accuracy-efficiency trade-off from SparseForge are shown to be consistent across model families.
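
The abstract does not say how the Hessian information is obtained; in practice it is usually approximated rather than formed explicitly, for example with a Hutchinson-style stochastic estimate of the Hessian diagonal built from Hessian-vector products (Hutchinson 1990 appears in the paper's bibliography). A minimal autograd sketch, in which `loss_fn` and the probe count are illustrative assumptions rather than the paper's recipe:

```python
import torch

def hessian_diag_estimate(loss_fn, params, n_probes: int = 8):
    """Hutchinson-style estimate of diag(H): E[z * (H z)] with Rademacher z.

    loss_fn() must recompute the loss from `params` on every call so that a
    fresh autograd graph is available for the double backward pass.
    """
    estimates = [torch.zeros_like(p) for p in params]
    for _ in range(n_probes):
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        probes = [torch.rand_like(p).round() * 2 - 1 for p in params]   # +/-1 entries
        grad_dot_z = sum((g * z).sum() for g, z in zip(grads, probes))
        hvps = torch.autograd.grad(grad_dot_z, params)                  # Hessian-vector products
        for est, z, hvp in zip(estimates, probes, hvps):
            est += z * hvp / n_probes
    return estimates                                                    # one diagonal estimate per parameter tensor
```

Scores of the form 0.5 * diag(H) * w² could then feed the grouped top-2 selection sketched earlier; whether SparseForge uses this particular estimator is not stated in the abstract.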

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SparseForge, a post-training framework for semi-structured LLM sparsification that directly optimizes the sparsity mask via Hessian-aware importance estimation combined with progressive annealing of soft masks into hardware-executable structured sparsity. It claims this yields efficient recovery, with the central empirical result that on LLaMA-2-7B under 2:4 sparsity the method reaches 57.27% average zero-shot accuracy using only 5B retraining tokens, surpassing the dense baseline of 56.43% and approaching a prior SOTA result of 57.52% obtained with 40B tokens; similar accuracy-efficiency gains are reported across model families.

Significance. If the accuracy numbers prove robust and the efficiency advantage generalizes without hidden per-model tuning, the work would meaningfully improve the practicality of semi-structured pruning for LLMs by lowering the token budget required for recovery, thereby reducing compute costs while preserving or exceeding dense-model performance on zero-shot tasks.

major comments (2)
  1. [Abstract and Evaluation] The reported 57.27% vs. 56.43% comparison and the 5B-vs-40B token efficiency claim are presented without variance estimates, run counts, statistical tests, or explicit baseline reproduction details (e.g., data exclusion rules or exact hyperparameter matching), which are load-bearing for the central claim that the method surpasses the dense model and approaches SOTA with far fewer tokens.
  2. [Method and Experiments] The assumption that Hessian-guided soft-mask annealing produces stable, generalizable recovery without per-model or per-dataset tuning is not directly tested; the manuscript should provide ablations on annealing schedule hyperparameters and cross-model validation to demonstrate that the reported gains are intrinsic rather than artifacts of schedule choice or corpus selection.
minor comments (1)
  1. [Abstract] The LaTeX bolding of token counts is clear, but ensure the full manuscript consistently reports token counts and accuracy metrics with the same precision and units.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing statistical rigor and experimental validation. We address each major comment point by point below and outline targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The reported 57.27% vs. 56.43% comparison and the 5B-vs-40B token efficiency claim are presented without variance estimates, run counts, statistical tests, or explicit baseline reproduction details (e.g., data exclusion rules or exact hyperparameter matching), which are load-bearing for the central claim that the method surpasses the dense model and approaches SOTA with far fewer tokens.

    Authors: We agree that variance estimates and explicit reproduction details would improve the robustness of the central claims. In the revised manuscript we will report key accuracy results as averages over multiple independent runs (minimum of three random seeds) with standard deviations. We will also expand the Evaluation section and add an appendix subsection detailing the exact retraining corpus composition, any data filtering rules, and hyperparameter settings used for SparseForge as well as for the reproduced baselines, ensuring transparent matching to prior work. revision: yes

  2. Referee: [Method and Experiments] The assumption that Hessian-guided soft-mask annealing produces stable, generalizable recovery without per-model or per-dataset tuning is not directly tested; the manuscript should provide ablations on annealing schedule hyperparameters and cross-model validation to demonstrate that the reported gains are intrinsic rather than artifacts of schedule choice or corpus selection.

    Authors: The current manuscript already reports consistent gains across multiple model families (LLaMA-2-7B and additional families in the Experiments section), providing initial evidence of generalizability. To directly address the request for explicit testing, the revision will include a new ablation subsection varying annealing schedule hyperparameters (e.g., decay rate and temperature progression) and showing that performance remains stable within practical ranges. These results will confirm that the efficiency gains are not artifacts of a single schedule choice. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results stand independently of inputs

full rationale

The paper introduces SparseForge as a framework for semi-structured sparsity via Hessian-guided soft-mask annealing and reports empirical accuracy gains on LLaMA-2-7B (57.27% zero-shot with 5B tokens) and other models. No equations, derivations, or self-citation chains are present that reduce these outcomes to fitted parameters or inputs by construction. The accuracy-efficiency claims rest on experimental measurements against external benchmarks rather than on any self-definitional or load-bearing reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only; the method rests on standard domain assumptions about Hessian importance and the effectiveness of annealing schedules whose exact hyperparameters are not disclosed.

free parameters (1)
  • annealing schedule hyperparameters
    Parameters controlling the rate and shape of the soft-to-hard mask transition are required but not quantified in the abstract; an illustrative schedule sketch follows this ledger.
axioms (1)
  • domain assumption: Hessian matrix entries provide reliable per-weight importance scores for pruning decisions in transformer models
    Central to the importance estimation step described in the abstract.
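
The free-parameter entry above flags the soft-to-hard transition schedule as undisclosed. For concreteness, here is one plausible, purely illustrative way such a schedule could be parameterized, and the kind of sweep the referee's requested ablation implies; the mode names and default values are assumptions, not the paper's settings.

```python
import math

def mask_temperature(step: int, total_steps: int,
                     t_start: float = 1.0, t_end: float = 1e-3,
                     mode: str = "cosine") -> float:
    """Illustrative soft-to-hard mask schedule; none of these values come from the paper.

    High temperature keeps the mask soft (heating); low temperature pushes it
    toward a binary 2:4 pattern (quenching).
    """
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    if mode == "cosine":
        return t_end + (t_start - t_end) * 0.5 * (1 + math.cos(math.pi * frac))
    if mode == "geometric":
        return t_start * (t_end / t_start) ** frac
    raise ValueError(f"unknown schedule mode: {mode}")

# The ablation the referee asks for amounts to a sweep over choices like these:
for mode in ("cosine", "geometric"):
    print(mode, [round(mask_temperature(s, 5, mode=mode), 4) for s in range(5)])
```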

pith-pipeline@v0.9.0 · 5505 in / 1213 out tokens · 48485 ms · 2026-05-08T12:43:15.697350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1]

    allenai/dolmino-mix-1124

    Allen Institute for AI. allenai/dolmino-mix-1124. https://huggingface.co/datasets/allenai/dolmino-mix-1124, 2024. Hugging Face dataset card, accessed: 2026-04-30.

  2. [2]

    Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sid Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang,...

  3. [3]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence,

  4. [4]

    URL https://api.semanticscholar.org/CorpusID:208290939

  5. [5]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. ArXiv, abs/1905.10044, 2019. URL https://api.semanticscholar.org/CorpusID:165163607.

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.

  7. [7]

    Damai Dai, Chengqi Deng, Chenggang Zhao, Runxin Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Annual Meeting of the Association for Computational L...

  8. [8]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Jesse Dodge, Ana Marasovic, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner, and William Agnew. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Conference on Empirical Methods in Natural Language Processing,

  9. [9]

    URL https://api.semanticscholar.org/CorpusID:237568724

  10. [10]

    MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

    Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. Maskllm: Learnable semi-structured sparsity for large language models. ArXiv, abs/2409.17481, 2024. URL https://api.semanticscholar.org/CorpusID:272910976.

  11. [11]

    Sparsegpt: Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. ArXiv, abs/2301.00774, 2023. URL https://api.semanticscholar.org/CorpusID:255372747.

  12. [12]

    The State of Sparsity in Deep Neural Networks

    Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019. URL https://api.semanticscholar.org/CorpusID:67855585.

  13. [13]

    Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In Neural Information Processing Systems, 2015. URL https://api.semanticscholar.org/CorpusID:2238772.

  14. [14]

    Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, 1993.

  15. [15]

    Pruning large language models with semi-structural adaptive sparse training

    Weiyu Huang, Guohao Jian, Yuezhou Hu, Jun Zhu, and Jianfei Chen. Pruning large language models with semi-structural adaptive sparse training. In AAAI Conference on Artificial Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:271544038.

  16. [16]

    Cast: Continuous and differentiable semi-structured sparsity-aware training for large language models

    Weiyu Huang, Yuezhou Hu, Jun Zhu, and Jianfei Chen. Cast: Continuous and differentiable semi-structured sparsity-aware training for large language models. ArXiv, abs/2509.25996,

  17. [17]

    URL https://api.semanticscholar.org/CorpusID:281682355

  18. [18]

    A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines

    M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990. doi: 10.1080/03610919008812866. URL https://doi.org/10.1080/03610919008812866.

  19. [19]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. Race: Large-scale reading comprehension dataset from examinations. ArXiv, abs/1704.04683, 2017. URL https://api.semanticscholar.org/CorpusID:6826032.

  20. [20]

    Optimal brain damage

    Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.

  21. [21]

    Pruning Filters for Efficient ConvNets

    Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ArXiv, abs/1608.08710, 2016. URL https://api.semanticscholar.org/CorpusID:14089312.

  22. [22]

    Learning efficient convolutional networks through network slimming

    Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017. URL https://api.semanticscholar.org/CorpusID:5993328.

  23. [23]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018. URL https://api.semanticscholar.org/CorpusID:52183757.

  24. [24]

    Accelerating sparse deep neural networks

    Asit K. Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. ArXiv, abs/2104.08378, 2021. URL https://api.semanticscholar.org/CorpusID:233296249.

  25. [25]

    Nvidia ampere architecture in-depth

    NVIDIA. Nvidia ampere architecture in-depth. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/, 2020. Accessed: 2026-04-30.

  26. [26]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William ...

  27. [27]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI,

  28. [28]

    URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  29. [29]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2019. URL https://api.semanticscholar.org/CorpusID:204838007.

  30. [30]

    Winogrande

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande. Communications of the ACM, 64:99–106, 2019. URL https://api.semanticscholar.org/CorpusID:198893658.

  31. [31]

    Woodfisher: Efficient second-order approximations for model compression

    Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximations for model compression. ArXiv, abs/2004.14340, 2020. URL https://api.semanticscholar.org/CorpusID:216641895.

  32. [32]

    A simple and effective pruning approach for large language models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. ArXiv, abs/2306.11695, 2023. URL https://api.semanticscholar.org/CorpusID:259203115.

  33. [33]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Ant...

  34. [34]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. ArXiv, abs/1905.00537, 2019. URL https://api.semanticscholar.org/CorpusID:143424870.

  35. [35]

    Learning structured sparsity in deep neural networks

    Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Helen Li. Learning structured sparsity in deep neural networks. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:2056019.

  36. [36]

    Besa: Pruning large language models with block-wise parameter-efficient sparsity allocation

    Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kai-Chuang Zhang, Peng Gao, Fengwei An, Yu Qiao, and Ping Luo. Besa: Pruning large language models with block-wise parameter-efficient sparsity allocation. ArXiv, abs/2402.16880, 2024. URL https://api.semanticscholar.org/CorpusID:268032346.

  37. [37]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Jingren Zhou, Junyan Lin, Kai Dang, Keqin Bao, Ke-Pei Ya...

  38. [38]

    Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity

    Lu Yin, You Wu, Zhenyu (Allen) Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. ArXiv, abs/2310.05175, 2023. URL https://api.semanticscholar.org/CorpusID:263829692.

  39. [39]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:159041722.

  40. [40]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. ArXiv, abs/2205.01068,

  41. [41]

    URL https://api.semanticscholar.org/CorpusID:248496292