arxiv: 2605.15491 · v1 · pith:FVQMEHJWnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.PF

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Pith reviewed 2026-05-19 16:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.PF
keywords layer pruning · LLM compression · activation alignment · training-free recovery · linear operator · boundary mismatch · Transformer decoder · calibration set

The pith

A closed-form linear operator derived from calibration data can reconstruct the hidden-state mismatch caused by removing entire layers from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that layer pruning creates a predictable activation discrepancy between the output of one surviving layer and the input expected by the next. By solving for the single best linear transformation that minimizes this discrepancy on a small calibration set, the authors recover most of the lost performance without any retraining. This works because the derived operator is the mathematically unconstrained optimum of the alignment objective, whereas earlier recovery methods were forced to search inside smaller families of possible transformations. If the claim holds, layer pruning becomes a more reliable way to shrink and speed up large models while preserving accuracy on downstream tasks.

Core claim

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. Ghosted Layers address this by solving a boundary activation alignment problem. The method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. This solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces.

What carries the argument

The closed-form optimal linear operator for boundary activation alignment, obtained by solving the least-squares problem on a calibration set to minimize the difference between pruned and original activations.
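
As a rough illustration of what such a fit looks like, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the calibration activations are stacked column-wise into two matrices: `X` (what the next surviving layer receives after pruning) and `Y` (what it would have received in the unpruned model). The function name `fit_alignment_operator`, the variable names, and the column-per-token layout are assumptions made for this example, and the paper's exact objective may differ (for instance, it may fit a residual or add regularization).

```python
import numpy as np


def fit_alignment_operator(X, Y):
    """Fit W minimizing ||W X - Y||_F^2 with no constraint on W.

    X: (C, N) boundary activations received after pruning (assumed layout).
    Y: (C, N) boundary activations expected by the next surviving layer.
    Returns W with shape (C, C).
    """
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    # lstsq solves X^T W^T ~= Y^T, which is the same problem transposed.
    W_t, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return W_t.T


# Synthetic check: when Y is an exact linear image of X, the fit recovers it.
rng = np.random.default_rng(0)
C, N = 8, 200
W_true = rng.normal(size=(C, C))
X = rng.normal(size=(C, N))
Y = W_true @ X
W_hat = fit_alignment_operator(X, Y)
print(np.max(np.abs(W_hat - W_true)))  # tiny, at floating-point level
```

Under these assumptions, applying the fitted operator at inference is one product `W_hat @ h` per boundary hidden state `h`.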

If this is right

  • The method yields higher accuracy and lower perplexity than prior training-free recovery techniques on multiple LLM families and pruning strategies.
  • The efficiency gains from layer pruning, such as reduced inference latency and memory use, remain intact because the added operator is a single matrix multiplication.
  • Because the solution is the true unconstrained optimum rather than an approximation inside a restricted subspace, further improvements would require changing the objective itself rather than searching harder within the same family.
  • The approach is training-free and uses only a small calibration set, so it can be applied after any pruning decision without additional optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the discrepancy introduced by pruning turns out to be largely linear, similar closed-form operators might correct other compression artifacts such as those from low-rank adaptation or early-exit mechanisms.
  • The calibration-set requirement suggests that periodically refitting the operator on recent user data could keep recovery quality high when the input distribution shifts over time.
  • Because the operator is derived once and then fixed, it could be fused into the adjacent layers at deployment time to eliminate any extra runtime cost beyond the original pruning savings.

Load-bearing premise

The activation discrepancy caused by removed layers can be accurately captured and reversed by one linear transformation that was fitted on limited calibration examples and then works for every input the model will see later.

What would settle it

If the linear operator fitted on the calibration set yields no measurable reduction in activation mismatch and no gain in perplexity or accuracy on a large, held-out set of diverse inputs, the claim that the unconstrained optimum delivers effective recovery would be falsified.
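
The held-out check described above could be sketched as follows. This is a hypothetical harness on synthetic data, not an experiment from the paper: the function names, the column-per-token layout, and the noise model are all assumptions. It fits the operator on calibration pairs only, then compares the mean absolute error on unseen pairs with and without the operator.

```python
import numpy as np


def mean_abs_error(A, B):
    """Mean absolute difference between two equally shaped arrays."""
    return float(np.mean(np.abs(A - B)))


def held_out_mismatch(X_cal, Y_cal, X_test, Y_test):
    """Return (MAE without operator, MAE with operator) on held-out pairs.

    All inputs have shape (C, N): rows are channels, columns are tokens
    (an assumed layout). The operator is fitted on the calibration pairs only.
    """
    W_t, *_ = np.linalg.lstsq(X_cal.T, Y_cal.T, rcond=None)
    W = W_t.T
    before = mean_abs_error(X_test, Y_test)
    after = mean_abs_error(W @ X_test, Y_test)
    return before, after


# Synthetic stand-in: the "true" mismatch is linear plus small noise.
rng = np.random.default_rng(1)
C = 8
W_true = rng.normal(size=(C, C))
X_cal = rng.normal(size=(C, 64))
X_test = rng.normal(size=(C, 500))
Y_cal = W_true @ X_cal + 0.05 * rng.normal(size=(C, 64))
Y_test = W_true @ X_test + 0.05 * rng.normal(size=(C, 500))

before, after = held_out_mismatch(X_cal, Y_cal, X_test, Y_test)
print(f"held-out MAE before: {before:.3f}, after: {after:.3f}")
```

On real models the interesting case is the opposite of this toy setup: if `after` is not clearly below `before` on inputs unlike the calibration data, the linearity premise is in trouble.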

Figures

Figures reproduced from arXiv: 2605.15491 by Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee, Vincent-Daniel Yun.

Figure 1: Mean absolute error between the expected boundary activation and the activation received by downstream …
Figure 2: Ghosted Layers as drop-in replacements for pruned transformer blocks. One or more consecutive transformer …
Figure 3: Frobenius norm decomposition of M∗ into symmetric and anti-symmetric components across two LLM backbones (n = 7, LLM-Streamline). Detailed setups are in Appendix A.2.
Constrained solution space. As established in Theorem 4.1, W∗ is the unconstrained minimizer over all of ℝ^{C×C}, whereas any symmetric operator W satisfies W − W⊤ = 0 and is thus confined to the symmetric subspace. To empirically verify that W…
Figure 4: Per-channel mean absolute error (MAE) between the repaired boundary activation …
Figure 5: Average accuracy across 9 commonsense reasoning benchmarks with LLaMA-3.1-8B.
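
The Figure 3 caption refers to splitting an operator into symmetric and anti-symmetric components and comparing their Frobenius norms. A minimal sketch of that decomposition is given here, assuming a generic square matrix `M`. The helper names are hypothetical, and whether the paper reports norms or squared-norm shares is not stated in the caption. The identity used is standard: the two parts are orthogonal under the Frobenius inner product, so their squared norms add up to the squared norm of `M`.

```python
import numpy as np


def symmetric_split(M):
    """Split a square matrix into symmetric and anti-symmetric parts."""
    M = np.asarray(M, dtype=np.float64)
    S = 0.5 * (M + M.T)  # symmetric part: S == S.T
    A = 0.5 * (M - M.T)  # anti-symmetric part: A == -A.T
    return S, A


def frobenius_shares(M):
    """Fractions of ||M||_F^2 carried by each part (they sum to 1 for nonzero M)."""
    S, A = symmetric_split(M)
    total = np.linalg.norm(M, "fro") ** 2
    sym_share = np.linalg.norm(S, "fro") ** 2 / total
    anti_share = np.linalg.norm(A, "fro") ** 2 / total
    return sym_share, anti_share


rng = np.random.default_rng(0)
M = rng.normal(size=(16, 16))
sym_share, anti_share = frobenius_shares(M)
print(sym_share, anti_share, sym_share + anti_share)
```

A non-negligible anti-symmetric share would indicate that the unconstrained solution uses directions a symmetric operator cannot represent.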
Original abstract

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that layer pruning in LLMs creates a boundary activation mismatch that can be recovered training-free by Ghosted Layers: a closed-form optimal linear operator fitted on a small calibration set to reconstruct the pruned-layer activation discrepancy. It asserts this operator is the unconstrained optimum of the alignment objective (unlike prior methods limited to constrained subspaces), and reports consistent accuracy and perplexity gains over baselines across multiple LLM backbones and pruning strategies.

Significance. If the linear operator generalizes beyond the calibration distribution and the closed-form derivation is independent of downstream task loss, the approach would offer a lightweight, training-free way to mitigate pruning-induced degradation while retaining the efficiency benefits of layer removal. The parameter-free character of the claimed optimum would be a notable strength for reproducibility.

major comments (3)
  1. [Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.
  2. [Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.
  3. [Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.
minor comments (2)
  1. The manuscript should include a clear statement of the exact least-squares objective and the resulting closed-form expression for the linear operator (presumably W = Y X^+ or equivalent) so readers can verify the unconstrained claim.
  2. Figure and table captions would benefit from explicit mention of the calibration-set size and the pruning ratios tested to improve reproducibility.
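
For readers who want the form the first minor comment is asking for, a standard unconstrained least-squares derivation is sketched below. It is offered as an assumption about the general shape of the result, not as the paper's own equations: here X and Y are taken to be C×N matrices whose columns are the received and expected boundary activations on the calibration set, and the paper's notation or objective may differ.

```latex
\begin{aligned}
W^{*} &= \arg\min_{W \in \mathbb{R}^{C \times C}} \lVert W X - Y \rVert_F^{2}, \\
0 &= \nabla_W \lVert W X - Y \rVert_F^{2} = 2\,(W X - Y)\,X^{\top}
  \;\Longrightarrow\; W^{*} X X^{\top} = Y X^{\top}, \\
W^{*} &= Y X^{\top} \bigl(X X^{\top}\bigr)^{-1} = Y X^{+}
  \quad \text{when } X X^{\top} \text{ is invertible.}
\end{aligned}
```

The last line matches the referee's guess W = Y X^+ because X^+ = X^T (X X^T)^{-1} whenever X has full row rank, which requires at least as many calibration tokens as channels.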

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have addressed each major comment below and revised the paper accordingly to improve clarity, rigor, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.

    Authors: We agree that a formal robustness bound or theoretical analysis of distribution shift would strengthen the claims. The linear operator is derived to minimize the immediate boundary mismatch on the calibration set, and while subsequent nonlinear layers can in principle modify the effective mapping, the alignment is applied precisely at the interface to reduce propagation of the discrepancy. Our experiments already test generalization across multiple models, pruning ratios, and evaluation datasets that differ from the calibration distribution. In the revision we have added a dedicated paragraph in Section 3.2 discussing the modeling assumptions and limitations, and we include new experiments evaluating performance on out-of-distribution prompts to provide more explicit empirical support for the assumption. revision: partial

  2. Referee: [Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.

    Authors: We appreciate this observation. The full derivation appears in Section 3, where we formulate the alignment objective as an unconstrained least-squares problem over the linear operator and obtain the closed-form solution via the normal equations; this solution depends only on the observed activation pairs from the calibration set and contains no dependence on any downstream task loss. To address the referee's concern we have expanded the abstract to include a brief reference to the derivation and added a pointer to the relevant equations (Eqs. 3–6) so that readers can immediately locate the proof that the operator is the unconstrained optimum. revision: yes

  3. Referee: [Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.

    Authors: We thank the referee for noting this gap in reporting. The revised Experiments section now reports mean performance together with standard deviations and error bars computed over five independent random seeds for every metric and model. We have also added a new paragraph detailing the calibration-set construction: for each experiment we randomly sample 256 sequences (each of length 512 tokens) from the training split of the respective dataset, with the random seed fixed for reproducibility; sensitivity to calibration-set size is additionally explored in an appendix table. revision: yes
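
The calibration-set construction described in the third response (a fixed number of fixed-length sequences drawn with a fixed seed) could be sketched as below. This is an illustrative assumption about one way to do it, drawing windows from a single tokenized stream; the function name, the stream representation, and sampling with replacement are not taken from the paper.

```python
import numpy as np


def sample_calibration_windows(token_ids, num_sequences=256, seq_len=512, seed=0):
    """Draw fixed-length token windows from a 1-D token stream, reproducibly.

    Windows are sampled with replacement and may overlap (an assumption of
    this sketch). Returns an array of shape (num_sequences, seq_len).
    """
    token_ids = np.asarray(token_ids)
    max_start = len(token_ids) - seq_len
    if max_start < 0:
        raise ValueError("token stream is shorter than seq_len")
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, max_start + 1, size=num_sequences)
    # Broadcasting builds a (num_sequences, seq_len) index grid of windows.
    return token_ids[starts[:, None] + np.arange(seq_len)[None, :]]


# Stand-in token stream; a real run would use tokenized training text.
tokens = np.arange(100_000)
calib = sample_calibration_windows(tokens, num_sequences=256, seq_len=512, seed=0)
print(calib.shape)  # (256, 512)
```

Fixing `seed` makes the sample repeatable, and varying it across runs is one simple way to probe sensitivity to the calibration choice.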

Circularity Check

0 steps flagged

No significant circularity; derivation is a standard closed-form solution to an explicitly stated alignment objective

full rationale

The paper defines a boundary activation alignment objective and derives its unconstrained optimum as a closed-form linear operator fitted on calibration activations. This is a direct mathematical solution to the stated minimization problem rather than a reduction of the claimed result to its own inputs by construction. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported from prior author work, and no fitted parameter is relabeled as an independent prediction. The central claim remains that the derived operator is unconstrained (in contrast to prior constrained subspaces), which follows from the problem formulation itself without tautology. Generalization from calibration to inference is an empirical assumption but does not render the derivation chain circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that activation mismatch after pruning is well-approximated by a linear transformation recoverable from limited calibration data. No explicit free parameters, axioms, or invented entities are named in the abstract, but the linear-operator assumption functions as an unstated domain assumption.

axioms (1)
  • Domain assumption: the activation discrepancy at layer boundaries after pruning is reconstructible by a linear operator derived from a small calibration set.
    Invoked in the description of the boundary activation alignment problem and the closed-form solution.

pith-pipeline@v0.9.0 · 5673 in / 1363 out tokens · 40276 ms · 2026-05-19T16:10:28.262917+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    Fluctuation-based adaptive structured pruning for large language models

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2024

  2. [2]

    SliceGPT: Compress large language models by deleting rows and columns

    Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. In The Twelfth International Conference on Learning Representations, 2024

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...

  4. [4]

    Streamlining redundant layers to compress large language models

    Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. Streamlining redundant layers to compress large language models. In The Thirteenth International Conference on Learning Representations, 2025

  5. [5]

    A simple linear patch revives layer-pruned large language models

    Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. A simple linear patch revives layer-pruned large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  6. [6]

    Prune&comp: Free lunch for layer-pruned LLMs via iterative pruning with magnitude compensation

    Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. Prune&comp: Free lunch for layer-pruned LLMs via iterative pruning with magnitude compensation. arXiv preprint arXiv:2507.18212, 2025

  7. [7]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

  8. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  9. [9]

    The PASCAL recognising textual entailment challenge

    Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer, 2005

  10. [10]

    SparseGPT: Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023

  11. [11]

    The language model evaluation harness

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  12. [12]

    Matrix Computations

    Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013

  13. [13]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    The unreasonable ineffectiveness of the deeper layers

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. The unreasonable ineffectiveness of the deeper layers. In The Thirteenth International Conference on Learning Representations, 2025

  15. [15]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645:633–638, 2025

  16. [16]

    Accuracy and Stability of Numerical Algorithms

    Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, second edition, 2002

  17. [17]

    Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods

    Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods. arXiv preprint arXiv:2402.02834, 2024

  18. [18]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017

  19. [19]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

  20. [20]

    LLM-pruner: On the structural pruning of large language models

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  21. [21]

    Building a large annotated corpus of English: The Penn Treebank

    Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993

  22. [22]

    ShortGPT: Layers in large language models are more redundant than you expect

    Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, July 2025

  23. [23]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017

  24. [24]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  25. [25]

    Compact language models via pruning and knowledge distillation

    Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. In Advances in Neural Information Processing Systems, volume 37, pages 41076–41102, 2024

  26. [26]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  27. [27]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1), 2020

  28. [28]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011

  29. [29]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019

  30. [30]

    Replaceme: Network simplification via depth pruning and transformer block linearization

    Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, and Sergey Zagoruyko. Replaceme: Network simplification via depth pruning and transformer block linearization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  31. [31]

    Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks

    Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks. In Proceedings of the 41st International Conference on Machine Learning, 2024

  32. [32]

    A simple and effective pruning approach for large language models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024

  33. [33]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  34. [34]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  35. [35]

    Smith, and Hannaneh Hajishirzi

    Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng...

  36. [36]

    Sheared llama: Accelerating language model pre-training via structured pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023

  37. [37]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  38. [38]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

    Appendix: Ghosted Layers. This appendix provides supplementary materials that complement the main paper. It includes the proof of ou...

  39. [39]

    We adopt 32 sequences as the default since this is the smallest size at which downstream accuracy is already saturated, and the additional perplexity reduction from larger calibration sets does not translate into accuracy gains.

    D Fine-tuning results
    D.1 Fine-tuning setup
    We follow the fine-tuning protocol of LinearPatch [5] exactly to ensure a fair head-t...