ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning
Pith reviewed 2026-05-21 05:42 UTC · model grok-4.3
The pith
ChunkFT performs full fine-tuning by computing gradients only on dynamically activated sub-tensors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChunkFT reformulates full-parameter fine-tuning around a dynamically activated working set. It enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. The authors supply a theoretical convergence analysis in the deterministic setting and report that full-parameter fine-tuning of a 7B model with 1K input length requires only 13.72 GB of GPU memory.
What carries the argument
The dynamically activated working set that processes the model in byte-streamed chunks to compute gradients only for selected sub-tensors.
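To make the idea of a working set concrete, here is a minimal pure-Python sketch on a toy least-squares loss. It is not the ChunkFT implementation; the names `chunk_grad` and `cyclic_chunks` are hypothetical, and the fixed cyclic schedule stands in for whatever activation rule the paper actually uses. It computes gradients only for the selected coordinates and checks that, at a fixed parameter vector, the chunk gradients together reproduce the dense gradient.

```python
# Toy illustration only: gradient of 0.5 * sum(r_i^2) restricted to a
# "working set" of coordinates. All names here are hypothetical.

def residuals(w, X, y):
    """r_i = x_i . w - y_i for each row x_i of X."""
    return [sum(xj * wj for xj, wj in zip(row, w)) - yi
            for row, yi in zip(X, y)]

def chunk_grad(w, X, y, idx):
    """Partial derivatives for the coordinates in idx only."""
    r = residuals(w, X, y)
    return {j: sum(ri * row[j] for ri, row in zip(r, X)) for j in idx}

def cyclic_chunks(n_params, chunk_size):
    """Yield index chunks that together cover every parameter once."""
    for start in range(0, n_params, chunk_size):
        yield list(range(start, min(start + chunk_size, n_params)))

X = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 2.0], [2.0, 0.0, 1.0, 1.0]]
y = [1.0, 2.0, 3.0]
w = [0.5, -0.5, 1.0, 0.0]

# Dense gradient, computed in one go.
full = chunk_grad(w, X, y, range(len(w)))

# Same gradient assembled chunk by chunk, without updating w in between.
pieces = {}
for idx in cyclic_chunks(len(w), chunk_size=2):
    pieces.update(chunk_grad(w, X, y, idx))
```

At a fixed `w` the two agree exactly. Any difference from dense training therefore comes from updating parameters between chunks, not from the restriction itself.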
If this is right
- Full fine-tuning of a 7B model with 1K input length requires only 13.72 GB GPU memory.
- Llama 3-8B can be fully fine-tuned on a single RTX 4090-24GB GPU.
- Llama 3-70B can be fully fine-tuned on two H800-80GB GPUs.
- Downstream performance on language understanding, mathematical reasoning, and MT-Bench matches or exceeds standard full-parameter fine-tuning.
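For scale, the memory figures above can be set against a back-of-the-envelope estimate. The sketch below assumes the commonly cited 16 bytes per parameter for dense mixed-precision AdamW (16-bit weight and gradient plus 32-bit master copy, momentum, and variance) and ignores activations. The `active_fraction` parameter and the accounting in `working_set_bytes` are illustrative assumptions, not the paper's memory model.

```python
GB = 1e9  # decimal gigabytes; an assumption about the unit used in the paper

def dense_adamw_bytes(n_params, weight_bytes=2, grad_bytes=2, state_bytes=12):
    """Dense mixed-precision AdamW: weights, gradients, and fp32 states
    (master copy, momentum, variance) for every parameter."""
    return n_params * (weight_bytes + grad_bytes + state_bytes)

def working_set_bytes(n_params, active_fraction, weight_bytes=2,
                      grad_bytes=2, state_bytes=12):
    """Hypothetical accounting: all weights stay resident, but gradients
    and optimizer states exist only for the active fraction."""
    active = n_params * active_fraction
    return n_params * weight_bytes + active * (grad_bytes + state_bytes)

n = 7e9
dense_gb = dense_adamw_bytes(n) / GB                       # 112.0
weights_only_gb = working_set_bytes(n, 0.0) / GB           # 14.0
one_percent_gb = working_set_bytes(n, active_fraction=0.01) / GB
```

Under these assumptions, 16-bit weights for exactly 7e9 parameters already occupy 14.0 decimal GB. The reported 13.72 GB therefore presumably reflects a parameter count somewhat below 7e9, binary units, or both; the source does not say which.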
Where Pith is reading between the lines
- The chunking approach might extend to training from scratch on memory-limited hardware by applying the same sub-tensor schedule.
- It could support selective parameter updates in continual learning scenarios without requiring full gradient storage.
- Integration with quantization or sparse activation might further reduce memory while preserving the full-update guarantee.
Load-bearing premise
Sequentially updating dynamically activated sub-tensors produces the same optimization trajectory and final performance as simultaneous dense gradient updates across the entire model.
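The premise can be stated more precisely by writing the two update rules side by side. The notation is an assumption for illustration and is not taken from the paper: the parameters $\theta$ are partitioned into chunks $\theta_1, \dots, \theta_B$, $L$ is the loss, $\eta$ the step size, and $\nabla_b$ the gradient with respect to chunk $b$. Plain gradient steps are used instead of Adam to keep the comparison short.

```latex
% Dense step: every chunk uses the gradient evaluated at the same point.
\theta^{t+1}_{b} = \theta^{t}_{b} - \eta \, \nabla_{b} L\!\left(\theta^{t}\right),
\qquad b = 1, \dots, B.

% Sequential (cyclic) step: chunk b sees the chunks already updated in this sweep.
\theta^{t+1}_{b} = \theta^{t}_{b} - \eta \, \nabla_{b}
L\!\left(\theta^{t+1}_{1}, \dots, \theta^{t+1}_{b-1}, \theta^{t}_{b}, \dots, \theta^{t}_{B}\right),
\qquad b = 1, \dots, B.
```

The two rules agree for $b = 1$ and generally differ for $b > 1$, unless $\nabla_b L$ does not depend on the other chunks. Read this way, the premise is a claim about block-coordinate-style updates reaching comparable solutions, not about step-by-step identity with the dense trajectory.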
What would settle it
Train the same small model that fits in memory under both ChunkFT and standard full fine-tuning with identical data and hyperparameters, then compare the final weights and downstream accuracy to check for equivalence.
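A toy version of this check can be run in a few lines. The sketch below uses a two-parameter least-squares problem and plain gradient descent as stand-ins for a small model and its optimizer; `dense_step` and `chunked_step` are hypothetical names, and the chunked variant assumes each chunk's gradient is taken at the current, partially updated parameters.

```python
# Dense gradient descent vs. cyclic chunk-wise descent on a toy problem.

def grad(w, X, y, idx):
    """Partial derivatives of 0.5 * sum(r_i^2) for the coordinates in idx."""
    r = [sum(a * b for a, b in zip(row, w)) - yi for row, yi in zip(X, y)]
    return {j: sum(ri * row[j] for ri, row in zip(r, X)) for j in idx}

def dense_step(w, X, y, lr):
    """One simultaneous update of all coordinates."""
    g = grad(w, X, y, range(len(w)))
    return [wj - lr * g[j] for j, wj in enumerate(w)]

def chunked_step(w, X, y, lr, chunks):
    """One sweep over the chunks; later chunks see earlier updates."""
    w = list(w)
    for idx in chunks:
        g = grad(w, X, y, idx)
        for j in idx:
            w[j] -= lr * g[j]
    return w

X = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
chunks = [[0], [1]]

w_dense = [0.0, 0.0]
w_chunk = [0.0, 0.0]
gap_after_first_step = None
for step in range(2000):
    w_dense = dense_step(w_dense, X, y, lr=0.1)
    w_chunk = chunked_step(w_chunk, X, y, lr=0.1, chunks=chunks)
    if step == 0:
        gap_after_first_step = max(abs(a - b) for a, b in zip(w_dense, w_chunk))

final_gap = max(abs(a - b) for a, b in zip(w_dense, w_chunk))
```

On this strictly convex problem the two runs diverge after the first step yet end at the same minimizer. That outcome is specific to the toy setting: a non-convex fine-tuning loss offers no such guarantee, which is why the comparison has to be run on a real model.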
Original abstract
This work presents ChunkFT, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. ChunkFT enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of ChunkFT in the deterministic setting. Empirically, we apply ChunkFT to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2× H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72 GB of GPU memory. The results demonstrate the effectiveness of ChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that ChunkFT consistently outperforms existing memory-efficient baselines. Notably, ChunkFT achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on https://github.com/misonsky/chunk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ChunkFT, a memory-efficient full fine-tuning framework that reformulates optimization around dynamically activated working sets to enable gradient computation on arbitrary sub-tensors without architecture changes. It supplies a deterministic convergence analysis and reports concrete results: full-parameter fine-tuning of a 7B model with a 1K input length uses 13.72 GB of GPU memory, Llama 3-8B is fine-tuned on a single RTX 4090-24GB GPU, and Llama 3-70B on 2× H800-80GB GPUs. Downstream performance on language understanding, mathematical reasoning, and MT-Bench exceeds existing memory-efficient baselines and is comparable to, and in some cases better than, standard full fine-tuning.
Significance. If the central equivalence claim holds, the work would be significant for enabling practical full-parameter fine-tuning of 70B-scale models on modest hardware while preserving optimization quality superior to low-rank methods. The deterministic convergence analysis, open repository, and explicit memory/accuracy numbers on Llama 3 models constitute clear strengths that support reproducibility and practical impact.
major comments (2)
- [Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.
- [§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.
minor comments (1)
- [Abstract / Introduction] The abstract and introduction introduce 'byte-streamed optimization' without an early, self-contained definition; a short clarifying sentence would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the constructive major comments. We address each point below, providing clarifications and indicating planned revisions to the manuscript.
- Referee: [Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.
  Authors: We appreciate the referee's observation regarding the scope of our theoretical analysis. The convergence result is indeed derived under the deterministic setting to establish a foundational guarantee for the ChunkFT framework. In practice, our experiments utilize standard stochastic optimization with mini-batches, as is conventional for large language model fine-tuning. The sequential activation of sub-tensors is implemented such that the gradient for each chunk is computed based on the current model state, and updates are applied sequentially within the same optimization step to closely approximate the dense gradient update. While a complete bias analysis for the stochastic regime is not included, the empirical results on Llama 3 models show performance comparable to or better than standard full fine-tuning, indicating that any deviations due to ordering or noise are not detrimental. In the revised manuscript, we will expand the Theoretical Analysis section to include a discussion on the stochastic extension and potential ordering effects, along with additional empirical validation of the equivalence.
  Revision: partial
- Referee: [§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.
  Authors: We agree that explicit details on optimizer state management are essential for reproducibility and for verifying the equivalence to dense full fine-tuning. We will revise the Experiments section (§5) to include a clear description of how momentum and variance are handled across chunks, ensuring readers can reproduce the memory savings and confirm that the optimization trajectory remains comparable to dense full fine-tuning.
  Revision: yes
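The second exchange turns on a design choice the source leaves open: whether a chunk's Adam moments survive while the chunk is inactive. The sketch below contrasts the two obvious policies. `ChunkAdam`, its `persist` flag, and `state_floats` are hypothetical names for illustration; this is standard Adam arithmetic applied per chunk, not the paper's optimizer.

```python
import math

class ChunkAdam:
    """Adam applied to index chunks. With persist=True every visited chunk
    keeps its moments; with persist=False only the active chunk does."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, persist=True):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.persist = persist
        self.state = {}  # chunk_id -> {"m": [...], "v": [...], "t": int}

    def step(self, params, grads, chunk_id, idx):
        """Update params[j] for j in idx; grads is aligned with idx."""
        if not self.persist:
            # Reset policy: drop the moments of every other chunk.
            self.state = {k: s for k, s in self.state.items() if k == chunk_id}
        s = self.state.setdefault(
            chunk_id, {"m": [0.0] * len(idx), "v": [0.0] * len(idx), "t": 0})
        s["t"] += 1
        for k, j in enumerate(idx):
            g = grads[k]
            s["m"][k] = self.b1 * s["m"][k] + (1 - self.b1) * g
            s["v"][k] = self.b2 * s["v"][k] + (1 - self.b2) * g * g
            m_hat = s["m"][k] / (1 - self.b1 ** s["t"])  # bias correction
            v_hat = s["v"][k] / (1 - self.b2 ** s["t"])
            params[j] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

    def state_floats(self):
        """Moment entries currently held, a proxy for optimizer-state memory."""
        return sum(len(s["m"]) + len(s["v"]) for s in self.state.values())

def run(persist):
    """Visit two chunks of a four-parameter model once each."""
    opt = ChunkAdam(persist=persist)
    p = [1.0, 1.0, 1.0, 1.0]
    opt.step(p, [0.1, 0.2], chunk_id=0, idx=[0, 1])
    opt.step(p, [0.3, 0.4], chunk_id=1, idx=[2, 3])
    return p, opt.state_floats()

p_keep, floats_keep = run(persist=True)     # state for both chunks: 8 entries
p_reset, floats_reset = run(persist=False)  # state for one chunk: 4 entries
```

Persisting moments gives a trajectory closer to dense Adam but makes state memory grow with the number of visited chunks. Resetting caps state memory at one chunk but restarts the moment estimates on every revisit. Which of these, if either, the reported memory figures correspond to is exactly what the referee asks the authors to document.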
Circularity Check
No circularity: algorithmic definition with separate theoretical analysis
The paper defines ChunkFT algorithmically as a reformulation around dynamically activated working sets for gradient computation on sub-tensors. It supplies an independent theoretical convergence analysis in the deterministic setting, distinct from the empirical performance numbers. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. Results are validated against external full fine-tuning and baselines on downstream tasks, keeping the chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- chunk size / working-set size
axioms (1)
- Domain assumption: gradient computation on an arbitrary sub-tensor yields a valid update direction for the full model when chunks are cycled through.