arxiv: 2606.05516 · v1 · pith:N6EHFQ4Snew · submitted 2026-06-03 · 💻 cs.LG

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

Pith reviewed 2026-06-28 06:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization, LLM fine-tuning, dominant layer, activation outliers, residual stream, perturbation sensitivity, memory-efficient tuning

The pith

Zeroth-order fine-tuning of LLMs is dominated by a single decoding layer that matches full-model results when tuned alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that zeroth-order optimization for LLM fine-tuning concentrates its useful adaptation in one specific decoding layer. Tuning only this layer produces results equal to or better than tuning the entire model across multiple LLM families and tasks. The layer is located in advance by spotting the first activation outlier through an inference-only pass on the pre-trained model, with no training required. This concentration occurs because the layer sits early in the residual stream and shows high sensitivity to perturbations, letting its effects accumulate in later layers to drive the optimization signals.

Core claim

Zeroth-order fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. The dominant layer is task-agnostic but model-specific, and aligns with the first activation-outlier layer in the pre-trained model. This layer has high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through subsequent decoding layers and produce strong optimization signals under forward-only updates.
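To make "forward-only updates on a single layer" concrete, here is a minimal sketch of one two-point (SPSA-style) zeroth-order step, the estimator family used by MeZO, restricted to one named parameter block. The dictionary layout, the function name `zo_step_single_layer`, the hyperparameter values, and the toy quadratic loss are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def zo_step_single_layer(params, layer, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    """One two-point zeroth-order step that perturbs only `layer`.

    params:  dict mapping a layer name to an np.ndarray (hypothetical layout).
    loss_fn: callable that takes the params dict and returns a float.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params[layer].shape)  # random search direction

    original = params[layer].copy()
    params[layer] = original + eps * z
    loss_plus = loss_fn(params)                   # forward pass 1
    params[layer] = original - eps * z
    loss_minus = loss_fn(params)                  # forward pass 2

    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    params[layer] = original - lr * g * z         # update the chosen layer only
    return g

# Toy usage: two "layers" whose sum should reach a target vector.
target = np.array([1.0, -2.0, 0.5])
params = {"layer_0": np.zeros(3), "layer_1": np.zeros(3)}

def toy_loss(p):
    return float(np.sum((p["layer_0"] + p["layer_1"] - target) ** 2))

initial_loss = toy_loss(params)
for step in range(500):
    zo_step_single_layer(params, "layer_1", toy_loss, seed=step)
final_loss = toy_loss(params)
```

Only `layer_1` is ever perturbed or updated, so no gradients or optimizer state are stored for the rest of the parameters. Which layer to pick in a real model is the question the paper's outlier analysis addresses.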

What carries the argument

The dominant decoding layer, identified as the first activation-outlier layer, which supplies high perturbation sensitivity and early residual-stream position to generate the primary ZO optimization signals.
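The "early residual-stream position" part of the argument can be written out under the standard residual formulation. The symbols below ($h_k$ for the stream entering block $k$, $F_k$ for the block function, $\delta_k$ for the shift caused by a perturbation at block $\ell$) are notation assumed here, not taken from the paper.

```latex
% Residual stream with L decoding blocks (assumed notation).
\[
  h_{k+1} = h_k + F_k(h_k;\theta_k), \qquad k = 0, \dots, L-1 .
\]
% Perturbing only block \ell, \theta_\ell \to \theta_\ell + \epsilon z,
% leaves h_0, \dots, h_\ell unchanged and shifts the stream afterwards:
\[
  \delta_{\ell+1} = F_\ell(h_\ell;\theta_\ell + \epsilon z) - F_\ell(h_\ell;\theta_\ell),
\]
\[
  \delta_{k+1} = \delta_k + \bigl[F_k(h_k + \delta_k;\theta_k) - F_k(h_k;\theta_k)\bigr],
  \qquad k > \ell .
\]
```

The second recursion is exact. It shows that a perturbation at block $\ell$ is carried forward by the identity path and re-processed by each of the $L-\ell-1$ later blocks, so an earlier $\ell$ gives more blocks a chance to amplify (or damp) the shift before it reaches the loss.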

If this is right

  • Dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning.
  • Dominant-layer ZO fine-tuning achieves a training speedup of up to 4.52×.
  • The dominant layer is task-agnostic but model-specific.
  • The dominant layer can be identified before training through inference-only analysis of activation outliers.
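The inference-only identification step can be sketched as a scan over per-layer activation maxima. The material here does not state the statistic or threshold the authors use, so the rule below is an assumption: it flags the first layer whose maximum magnitude exceeds `ratio` times the median of all preceding layers. The function name and the toy numbers are hypothetical as well.

```python
import numpy as np

def first_outlier_layer(layer_max_abs, ratio=10.0):
    """Return the index of the first layer whose max |activation| exceeds
    `ratio` times the median over all preceding layers, or None.

    The median-of-preceding-layers rule and ratio=10.0 are illustrative
    assumptions, not the paper's definition of an activation outlier.
    """
    magnitudes = np.asarray(layer_max_abs, dtype=float)
    for i in range(1, len(magnitudes)):
        baseline = np.median(magnitudes[:i])
        if magnitudes[i] > ratio * baseline:
            return i
    return None

# Made-up per-layer maxima (not measurements from any model).
toy_maxima = [1.2, 1.5, 1.4, 250.0, 240.0, 230.0, 235.0, 228.0]
dominant = first_outlier_layer(toy_maxima)  # index 3 for this toy input
```

Comparing against preceding layers rather than a global statistic keeps the rule usable when large magnitudes persist in every layer after the first outlier. In practice the per-layer maxima would come from a forward pass over a few calibration inputs on the pre-trained model.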

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-identifying the dominant layer could become a standard first step for any memory-efficient LLM adaptation method.
  • The early-position and sensitivity combination may point to similar single-layer dominance in other forward-only or low-memory training regimes.
  • Model families with different residual architectures might show their dominant layers at different depths, allowing targeted pre-checks.

Load-bearing premise

That the first activation-outlier layer found by inference-only analysis is reliably the one whose sensitivity and position generate the observed optimization signals under ZO updates.

What would settle it

If tuning any non-outlier layer alone, or the full model without the identified dominant layer, produces equal or better results than dominant-layer tuning across the same tasks and models, the dominance claim would not hold.

Figures

Figures reproduced from arXiv: 2606.05516 by Abeer Matar Almalky, Adnan Siraj Rakin, Deliang Fan, Li Yang, Sen Lin, Shuteng Niu, Wanhao Yu, Yihang Zuo, Zheng Wang, Ziyan Wang.

Figure 1: Layer-wise fine-tuning results on LLaMA2-7B across three representative datasets.
Figure 2: Mean and maximum of layer-wise output activation magnitudes for LLaMA2-7B and Qwen3-8B.
Figure 3: Comparison of optimization signal flow in ZO and FO. In ZO, gradients are estimated only from ...
Figure 4: Layer-wise loss change under 1-step perturbation on LLaMA2-7B across 3 tasks. ZO fine-tuning relies entirely on changes in the final loss under parameter perturbations. As a result, layers differ in effectiveness based on how strongly their perturbations influence the final loss, as illustrated in ...
Figure 5: Schematic view of how ZO perturbations injected at the first outlier-activation layer and propagate ...
Figure 6: Training loss curve comparison. In both LLaMA2-7B and Qwen3-8B, the dominant layer reduces ...

Original abstract

Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through remaining subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to 4.52× training speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that zeroth-order (ZO) fine-tuning of LLMs is sharply dominated by a single 'dominant' decoding layer, which can be identified before fine-tuning as the first activation-outlier layer found by an inference-only analysis of the pre-trained model. Fine-tuning only this layer, which is task-agnostic but model-specific, matches or exceeds full-model ZO performance. The explanation rests on the layer's high perturbation sensitivity combined with its early residual-stream position, which lets perturbation effects propagate through later layers. Experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks report improved average performance over MeZO and LoRA-based ZO baselines, plus a training speedup of up to 4.52×.

Significance. If the central empirical claim holds, the result would be significant for memory-efficient LLM adaptation, as it reduces ZO fine-tuning to a single layer while preserving or improving performance. The inference-only identification method and the perturbation-propagation analysis are strengths that could generalize beyond the two model families tested.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim that dominant-layer ZO 'consistently matches or even exceeds' full-model ZO is load-bearing, yet the provided material gives no details on number of random seeds, standard deviations, or statistical tests; without these, it is impossible to determine whether reported gains are reliable or could be explained by variance.
  2. [Analysis section] Analysis of perturbation propagation: the explanation that the first activation-outlier layer produces 'disproportionately strong and stable optimization signals' because of sensitivity plus early residual position requires a quantitative test (e.g., ablation of perturbation magnitude at that layer versus later layers) to confirm it is not post-hoc; the current description risks circularity with the identification rule.
minor comments (2)
  1. [Method] Clarify the precise definition and threshold used to detect 'activation outliers' (e.g., which statistic, which token positions) so the identification procedure is fully reproducible.
  2. [Experiments] Table or figure reporting per-task results should include the full-model ZO, dominant-layer ZO, and random-layer controls side-by-side with the same random seed to allow direct comparison.
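The seed-level reporting and significance testing requested in major comment 1 could look like the sketch below. It reports mean and standard deviation over seeds, then runs an exact two-sided Wilcoxon signed-rank test on paired per-task scores. All numbers are made up for illustration and are not results from the paper; pairing over nine tasks is an assumption, and `wilcoxon_exact` is a hypothetical helper written here in pure Python.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_exact(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value for paired differences.

    Zero differences are dropped and tied magnitudes share average ranks.
    Enumerates all 2^n sign patterns, so it is only meant for small n.
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=n):  # null: each sign equally likely
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n

# Made-up accuracies for one task over 3 seeds.
seed_scores = [0.70, 0.72, 0.71]
seed_mean, seed_std = mean(seed_scores), stdev(seed_scores)

# Made-up per-task averages for nine tasks (full-model vs. dominant-layer).
full_model = [60.0, 70.0, 80.0, 65.0, 75.0, 85.0, 55.0, 90.0, 50.0]
dominant   = [60.5, 71.0, 81.5, 67.0, 77.5, 88.0, 58.5, 94.0, 54.5]
p_value = wilcoxon_exact([b - a for a, b in zip(full_model, dominant)])
```

With only nine pairs the smallest achievable two-sided p-value is 2/512, which is worth keeping in mind when reading significance claims over a benchmark suite of this size.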

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the empirical validation and analysis. We provide point-by-point responses below and will revise the manuscript to address the concerns.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that dominant-layer ZO 'consistently matches or even exceeds' full-model ZO is load-bearing, yet the provided material gives no details on number of random seeds, standard deviations, or statistical tests; without these, it is impossible to determine whether reported gains are reliable or could be explained by variance.

    Authors: We fully agree that providing details on the number of random seeds, standard deviations, and statistical tests is crucial for validating the central claim. The current manuscript does not include these, which is an oversight. In the revised manuscript, we will report that all experiments were conducted with 3 random seeds, include standard deviations in the tables, and perform statistical tests (such as Wilcoxon signed-rank tests) to confirm the significance of the performance improvements. This will allow readers to assess the reliability of the results.
    Revision: yes

  2. Referee: [Analysis section] Analysis of perturbation propagation: the explanation that the first activation-outlier layer produces 'disproportionately strong and stable optimization signals' because of sensitivity plus early residual position requires a quantitative test (e.g., ablation of perturbation magnitude at that layer versus later layers) to confirm it is not post-hoc; the current description risks circularity with the identification rule.

    Authors: We acknowledge the risk of circularity in the current analysis. Although the dominant layer is identified via a task-agnostic, inference-only method based on activation outliers, we agree that additional quantitative evidence is needed to support the perturbation propagation explanation. In the revision, we will add an ablation experiment that varies the perturbation magnitude specifically at the dominant layer compared to later layers and measures the resulting optimization signal strength. This will provide independent validation of the proposed mechanism.
    Revision: yes
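The ablation promised in response 2 amounts to a sweep over (layer, perturbation magnitude) pairs that records how much the final loss moves. A minimal sketch on a toy residual stack follows; the network, the loss, the function names, and the magnitudes are assumptions chosen for illustration and say nothing about which layer dominates in a real LLM.

```python
import numpy as np

def forward(weights, x):
    """Toy residual stack: h <- h + tanh(h @ W) per block, scalar loss at the end."""
    h = x
    for w in weights:
        h = h + np.tanh(h @ w)
    return float(np.mean(h ** 2))

def loss_change(weights, x, layer, eps, seed=0):
    """Absolute change in the final loss when only `layer` is perturbed by eps * z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(weights[layer].shape)
    perturbed = list(weights)                  # shallow copy; other blocks untouched
    perturbed[layer] = weights[layer] + eps * z
    return abs(forward(perturbed, x) - forward(weights, x))

rng = np.random.default_rng(0)
dim, depth = 8, 6
weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(depth)]
x = rng.standard_normal((4, dim))

# Sweep every block at several magnitudes; eps = 0.0 is a sanity check.
sweep = {
    (layer, eps): loss_change(weights, x, layer, eps)
    for layer in range(depth)
    for eps in (0.0, 1e-3, 1e-2)
}
```

Comparing rows of `sweep` across layers at a fixed `eps` gives a layer-wise sensitivity profile measured independently of the activation-outlier rule, which is the kind of evidence that would address the circularity concern.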

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical investigation that identifies a dominant layer via a pre-training, inference-only analysis of activation outliers and then validates through experiments that fine-tuning only that layer matches or exceeds full-model ZO performance. No derivation, equation, or prediction reduces by construction to a fitted quantity defined from the target data; the identification step is explicitly described as independent of the fine-tuning process. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the central claims. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper reports empirical findings from experiments; abstract introduces no free parameters, mathematical axioms, or new postulated entities.

Reference graph

Works this paper leans on

41 extracted references · 7 canonical work pages · 4 internal anchors

  [1] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 7319–7328, 2021.
  [2] Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models, 2025.
  [3] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. The second pascal recognising textual entailment challenge. Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 01 2006.
  [4] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. TAC, 7(8):1, 2009.
  [5] Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. Zo-adamm: Zeroth-order adaptive momentum method for black-box optimization, 2019.
  [6] Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen. Enhancing zeroth-order fine-tuning for language models with low-rank structures. arXiv preprint arXiv:2410.07698, 2024.
  [7] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...
  [8] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine learning challenges workshop, pages 177–190. Springer, 2005.
  [9] Sizhe Dang, Yangyang Guo, Yanjun Zhao, Haishan Ye, Xiaodong Zheng, Guang Dai, and Ivor Tsang. Fzoo: Fast zeroth-order optimizer for fine-tuning large language models towards adam-scale speed, 2025.
  [10] Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124, 2019.
  [11] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc., 2022.
  [12] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...
  [13] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and William B Dolan. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9, 2007.
  [14] Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, et al. Zeroth-order fine-tuning of llms with transferable static sparsity. In The Thirteenth International Conference on Learning Representations, 2025.
  [15] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
  [16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  [17] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, 2018.
  [18] Hector J Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. KR, 2012 (13th):3, 2012.
  [19] Pengxiang Li, Lu Yin, Xiaowei Gao, and Shiwei Liu. Outlier-weighed layerwise sampling for llm fine-tuning, 2025.
  [20] Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, and Hong Xu. Agzo: Activation-guided zeroth-order optimization for llm fine-tuning. arXiv preprint arXiv:2601.17261, 2026.
  [21] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse mezo: Less parameters for better performance in zeroth-order llm fine-tuning. arXiv preprint arXiv:2402.15751, 2024.
  [22] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.
  [23] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
  [24] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. Advances in Neural Information Processing Systems, 37:57018–57049, 2024.
  [25] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392, 2016.
  [26] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011.
  [27] Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, and Xiao-Ming Wu. Understanding layer significance in llm alignment, 2025.
  [28] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Co...
  [29] James C Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control, 37(3):332–341, 2002.
  [30] Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
  [31] Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, and Geng Yuan. Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning, 2025.
  [32] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  [33] Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, and Changxing Ding. Simultaneous computation and memory efficient zeroth-order optimizer for fine-tuning large language models, 2024.
  [34] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023.
  [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  [36] Kai Yao, Penglei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, and Jianke Zhu. Layer-wise importance matters: Less memory for better performance in parameter-efficient fine-tuning of large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1977–1...
  [37] Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Mi Tian, and Hua Huang. Zeroth-order fine-tuning of llms in random subspaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4475–4485, 2025.
  [38] Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, and Huan Zhang. Learning a zeroth-order optimizer for fine-tuning llms, 2025.
  [39] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, and Tianlong Chen. Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark, 2024.
  [40] Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, and Jin Lu. Helene: Hessian layer-wise clipping and gradient annealing for accelerating fine-tuning llm with zeroth-order optimization, 2024.
  [41] Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang. Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer. arXiv preprint arXiv:2402.15173, 2024.

A Experimental Details

A.1 Tasks, Models, and Metrics

Following MeZO [22], we construct each task split by sampling up to 1000 training examples, 500 ...