ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

Daling Wang; Ercong Nie; Feiliang Ren; Hinrich Schütze; Mengjie Zhao; Mingyang Wang; Qian Li; Shi Feng; Yongkang Liu; Zijing Wang

arxiv: 2605.21177 · v1 · pith:EH4GZG7Xnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 05:42 UTC · model grok-4.3

keywords memory-efficient fine-tuning · full-parameter fine-tuning · sub-tensor gradients · large language models · convergence analysis · Llama models · gradient computation

The pith

ChunkFT performs full fine-tuning by computing gradients only on dynamically activated sub-tensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChunkFT to reformulate full-parameter fine-tuning around a working set of sub-tensors that activate one at a time. This lets the method compute gradients for arbitrary parts of the model without any changes to the network architecture or the usual requirement to store every gradient simultaneously. A convergence analysis is given for the deterministic case, and experiments demonstrate that an 8B model fine-tunes on a single 24 GB GPU while a 70B model fits on two 80 GB GPUs. Readers would care because the approach keeps the optimization close to standard full fine-tuning yet slashes memory use enough to make complete updates practical on modest hardware. Downstream results on language understanding, math reasoning, and MT-Bench match or exceed both full fine-tuning and other memory-efficient baselines.

Core claim

ChunkFT reformulates full-parameter fine-tuning around a dynamically activated working set. It enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. The authors supply a theoretical convergence analysis in the deterministic setting and report that full-parameter fine-tuning of a 7B model with 1K input length requires only 13.72 GB of GPU memory.

What carries the argument

The dynamically activated working set that processes the model in byte-streamed chunks to compute gradients only for selected sub-tensors.
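
The working-set idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy quadratic loss, the contiguous chunk layout, and the names `chunk_slices`, `grad_for_chunk`, and `chunked_step` are all assumptions made for illustration. The only point is that each step materializes a gradient for one chunk rather than for every parameter.

```python
from typing import List


def chunk_slices(n_params: int, n_chunks: int) -> List[slice]:
    """Split indices 0..n_params-1 into n_chunks contiguous slices."""
    base, extra = divmod(n_params, n_chunks)
    slices, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)
        slices.append(slice(start, start + size))
        start += size
    return slices


def grad_for_chunk(theta, target, sl):
    """Gradient of 0.5 * sum((theta - target)^2), restricted to one chunk."""
    return [t - y for t, y in zip(theta[sl], target[sl])]


def chunked_step(theta, target, sl, lr):
    """Update only the active chunk; all other parameters stay frozen."""
    g = grad_for_chunk(theta, target, sl)  # only len(chunk) values are stored
    theta[sl] = [t - lr * gi for t, gi in zip(theta[sl], g)]
    return len(g)


theta = [0.0] * 10
target = [float(i) for i in range(10)]
slices = chunk_slices(len(theta), n_chunks=3)
peak_grad_len = 0
for sweep in range(50):
    for sl in slices:  # activate one chunk at a time
        peak_grad_len = max(peak_grad_len, chunked_step(theta, target, sl, lr=0.5))
```

In this toy setting the largest gradient buffer ever held covers a single chunk, while every parameter is still updated once per sweep. How the paper selects and schedules sub-tensors in a real network is not described on this page.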

If this is right

Full fine-tuning of a 7B model with 1K input length requires only 13.72 GB GPU memory.
Llama 3-8B can be fully fine-tuned on a single RTX 4090-24GB GPU.
Llama 3-70B can be fully fine-tuned on two H800-80GB GPUs.
Downstream performance on language understanding, mathematical reasoning, and MT-Bench matches or exceeds standard full-parameter fine-tuning.
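
A rough back-of-envelope calculation helps put these memory figures in context. The byte counts below are common mixed-precision conventions, not numbers taken from the paper; activations and framework overhead are ignored; and the assumption that gradients and optimizer state exist only for the active chunk is an illustration rather than a documented property of ChunkFT. `dense_bytes` and `chunked_bytes` are hypothetical helper names.

```python
GIB = 1024 ** 3


def dense_bytes(n_params, weight_b=2, grad_b=2, opt_b=8):
    """Weights + full gradient + Adam moments for every parameter (assumed sizes)."""
    return n_params * (weight_b + grad_b + opt_b)


def chunked_bytes(n_params, n_chunks, weight_b=2, grad_b=2, opt_b=8):
    """All weights stay resident; gradient and optimizer state cover one chunk only."""
    active = -(-n_params // n_chunks)  # ceiling division
    return n_params * weight_b + active * (grad_b + opt_b)


n = 7_000_000_000
dense_gib = dense_bytes(n) / GIB
chunked_gib = chunked_bytes(n, n_chunks=32) / GIB
print(f"dense: {dense_gib:.1f} GiB, chunked (K=32): {chunked_gib:.1f} GiB")
```

The estimate is not meant to reproduce the reported 13.72 GB. It only shows why removing the dense gradient and the full optimizer state dominates the savings once the weights themselves are resident.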

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The chunking approach might extend to training from scratch on memory-limited hardware by applying the same sub-tensor schedule.
It could support selective parameter updates in continual learning scenarios without requiring full gradient storage.
Integration with quantization or sparse activation might further reduce memory while preserving the full-update guarantee.

Load-bearing premise

Sequentially updating dynamically activated sub-tensors produces the same optimization trajectory and final performance as simultaneous dense gradient updates across the entire model.

What would settle it

Train the same small model that fits in memory under both ChunkFT and standard full fine-tuning with identical data and hyperparameters, then compare the final weights and downstream accuracy to check for equivalence.
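
The comparison described above can be prototyped on a toy problem before spending GPU time. The sketch below is an assumption-laden stand-in, not the paper's protocol: it uses a four-parameter linear regression, plain gradient descent instead of the optimizer used in the paper, and a fixed two-chunk split. `dense_step` and `chunked_step` are hypothetical names.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def grad(w, data, idx):
    """Gradient of 0.5 * mean squared error w.r.t. the coordinates in idx only."""
    n = len(data)
    return [sum((predict(w, x) - y) * x[j] for x, y in data) / n for j in idx]


def dense_step(w, data, lr):
    g = grad(w, data, range(len(w)))
    return [wi - lr * gi for wi, gi in zip(w, g)]


def chunked_step(w, data, lr, chunks):
    w = list(w)
    for idx in chunks:          # one chunk active at a time
        g = grad(w, data, idx)  # evaluated at the partially updated weights
        for j, gj in zip(idx, g):
            w[j] -= lr * gj
    return w


true_w = [1.0, -2.0, 0.5, 3.0]
xs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
      [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
data = [(x, predict(true_w, x)) for x in xs]
chunks = [[0, 1], [2, 3]]

w_dense = w_chunk = [0.0] * 4
first_gap = None
for step in range(2000):
    w_dense = dense_step(w_dense, data, lr=0.5)
    w_chunk = chunked_step(w_chunk, data, lr=0.5, chunks=chunks)
    if first_gap is None:  # record how far apart the iterates are after one step
        first_gap = max(abs(a - b) for a, b in zip(w_dense, w_chunk))
final_gap = max(abs(a - b) for a, b in zip(w_dense, w_chunk))
```

On this strictly convex toy the two procedures take different paths (`first_gap` is nonzero) yet arrive at the same minimizer (`final_gap` is negligible). A non-convex language-model loss offers no such guarantee, which is why the check proposed above compares downstream accuracy as well as final weights.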

Figures

Figures reproduced from arXiv: 2605.21177 by Daling Wang, Ercong Nie, Feiliang Ren, Hinrich Schütze, Mengjie Zhao, Mingyang Wang, Qian Li, Shi Feng, Yongkang Liu, Zijing Wang.

**Figure 1.** Loss convergence behavior of ChunkFT (Llama 2-7B). Left: training loss curves of ChunkFT on six natural language understanding tasks. Right: loss curves of training a single block with different update intervals T on BoolQ.

**Figure 2.** Effect of the chunk number K on memory and performance. Left: peak GPU memory under different chunk numbers K. Right: BoolQ accuracy under different chunk numbers K.

Original abstract

This work presents ChunkFT, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. ChunkFT enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of ChunkFT in the deterministic setting. Empirically, we apply ChunkFT to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2× H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72 GB of GPU memory. The results demonstrate the effectiveness of ChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that ChunkFT consistently outperforms existing memory-efficient baselines. Notably, ChunkFT achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on https://github.com/misonsky/chunk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ChunkFT shows how to chunk full fine-tuning updates so that a 70B model fits on two 80 GB GPUs and a 7B model at 1K input length needs about 13.72 GB. The runs look competitive, though the theory covers only the deterministic case and leaves the stochastic case open.

ChunkFT lets you do full-parameter fine-tuning of Llama 3-8B on a single 24GB 4090 and the 70B version on two 80GB cards by activating only a working set of sub-tensors at a time for gradients. The reported memory for a 7B model at 1K length is 13.72GB, and downstream scores on language, math, and MT-Bench come out comparable to or better than the baselines they tested against, including some cases above standard full fine-tuning.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ChunkFT, a memory-efficient full fine-tuning framework that reformulates optimization around dynamically activated working sets to enable gradient computation on arbitrary sub-tensors without architecture changes. It supplies a deterministic convergence analysis and reports concrete results: full-parameter fine-tuning of a 7B model with 1K input length uses 13.72 GB of GPU memory, Llama 3-8B is fine-tuned on a single RTX 4090-24GB GPU, and Llama 3-70B runs on 2× H800-80GB GPUs. Downstream performance on language understanding, mathematical reasoning, and MT-Bench exceeds existing memory-efficient baselines and is comparable to, and in some cases better than, standard full fine-tuning.

Significance. If the central equivalence claim holds, the work would be significant for enabling practical full-parameter fine-tuning of 70B-scale models on modest hardware while preserving optimization quality superior to low-rank methods. The deterministic convergence analysis, open repository, and explicit memory/accuracy numbers on Llama 3 models constitute clear strengths that support reproducibility and practical impact.

major comments (2)

[Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.
[§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.

minor comments (1)

[Abstract / Introduction] The abstract and introduction introduce 'byte-streamed optimization' without an early, self-contained definition; a short clarifying sentence would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive major comments. We address each point below, providing clarifications and indicating planned revisions to the manuscript.

Referee: [Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.

Authors: We appreciate the referee's observation regarding the scope of our theoretical analysis. The convergence result is indeed derived under the deterministic setting to establish a foundational guarantee for the ChunkFT framework. In practice, our experiments utilize standard stochastic optimization with mini-batches, as is conventional for large language model fine-tuning. The sequential activation of sub-tensors is implemented such that the gradient for each chunk is computed based on the current model state, and updates are applied sequentially within the same optimization step to closely approximate the dense gradient update. While a complete bias analysis for the stochastic regime is not included, the empirical results on Llama 3 models show performance comparable to or better than standard full fine-tuning, indicating that any deviations due to ordering or noise are not detrimental. In the revised manuscript, we will expand the Theoretical Analysis section to include a discussion on the stochastic extension and potential ordering effects, along with additional empirical validation of the equivalence. revision: partial
Referee: [§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.

Authors: We agree that explicit details on optimizer state management are essential for reproducibility and for verifying the equivalence to dense full fine-tuning. We will revise the Experiments section (§5) to include a clear description of how momentum and variance are handled across chunks, ensuring readers can reproduce the memory savings and confirm that the optimization trajectory remains comparable to dense full fine-tuning. revision: yes
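
The question of optimizer state across chunks can be made concrete with a small sketch. The source does not say how ChunkFT handles momentum and variance, so both policies below are assumptions: `ChunkAdam` is a hypothetical pure-Python Adam whose `persist` flag either keeps each chunk's moments between visits or discards them after every update.

```python
import math


class ChunkAdam:
    """Adam with per-chunk moment buffers (hypothetical helper, not from the paper)."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, persist=True):
        self.lr, self.betas, self.eps, self.persist = lr, betas, eps, persist
        self.state = {}  # chunk_id -> (m, v, step count)

    def step(self, params, grads, chunk_id):
        b1, b2 = self.betas
        zeros = [0.0] * len(params)
        m, v, t = self.state.get(chunk_id, (zeros, zeros, 0))
        t += 1
        m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grads)]
        v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grads)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]  # bias correction
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        new_params = [p - self.lr * mh / (math.sqrt(vh) + self.eps)
                      for p, mh, vh in zip(params, m_hat, v_hat)]
        if self.persist:
            # Held state grows with the number of distinct chunks visited.
            self.state[chunk_id] = (m, v, t)
        return new_params


persistent = ChunkAdam(lr=0.1, persist=True)
transient = ChunkAdam(lr=0.1, persist=False)
for chunk_id in range(3):
    p_new = persistent.step([1.0], [2.0], chunk_id)
    t_new = transient.step([1.0], [2.0], chunk_id)
```

The two policies pull in opposite directions. Persisting state keeps the update rule identical to dense Adam for each chunk but eventually stores moments for every parameter; discarding it bounds optimizer memory by one chunk but restarts bias correction on every visit. Which trade-off the authors chose is exactly what the referee asks them to document.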

Circularity Check

0 steps flagged

No circularity: algorithmic definition with separate theoretical analysis

The paper defines ChunkFT algorithmically as a reformulation around dynamically activated working sets for gradient computation on sub-tensors. It supplies an independent theoretical convergence analysis in the deterministic setting, distinct from the empirical performance numbers. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. Results are validated against external full fine-tuning and baselines on downstream tasks, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that sub-tensor gradient computation can be performed without architectural modification and that sequential chunk updates preserve convergence behavior.

free parameters (1)

chunk size / working-set size
Chosen dynamically to fit available GPU memory; exact selection rule not visible in abstract.

axioms (1)

domain assumption: Gradient computation on an arbitrary sub-tensor yields a valid update direction for the full model when chunks are cycled through.
Invoked in the reformulation of full-parameter fine-tuning around a dynamically activated working set.
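
One way to read this axiom is through the textbook block coordinate descent update. The formulation below is a generic sketch under assumed notation, not the paper's theorem: the partition into K chunks, the active index k_t, and the block Lipschitz constants L_k are all introduced here for illustration.

```latex
% Assumed notation: \theta = (\theta^{(1)}, \dots, \theta^{(K)}) is a partition into K chunks,
% k_t is the chunk active at step t, \eta is the step size, \mathcal{L} is the loss.
\begin{align}
\theta_{t+1}^{(k_t)} &= \theta_t^{(k_t)} - \eta \, \nabla_{\theta^{(k_t)}} \mathcal{L}(\theta_t), \\
\theta_{t+1}^{(j)}   &= \theta_t^{(j)} \quad \text{for } j \neq k_t .
\end{align}
% If \nabla_{\theta^{(k)}} \mathcal{L} is L_k-Lipschitz in block k and \eta \le 1 / L_{k_t}, then
\begin{equation}
\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t)
  - \frac{\eta}{2} \bigl\| \nabla_{\theta^{(k_t)}} \mathcal{L}(\theta_t) \bigr\|^2 .
\end{equation}
```

Under these assumptions each chunk update cannot increase the loss in the deterministic setting, which is the sense in which a sub-tensor gradient is a valid direction. The inequality says nothing about stochastic mini-batch gradients, the case the referee flags as open.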

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 9 internal anchors

[1]

Bart: Denoising sequence-to-sequence pre- training for natural language generation, translation, and comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre- training for natural language generation, translation, and comprehension. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 7871–7880, 2020

work page 2020
[2]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[3]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019
[4]

Full parameter fine-tuning for large language models with limited resources

Kai Lv, Yuqing Yang, Tengxiao Liu, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8187–8198, 2024

work page 2024
[5]

Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

work page 2023
[6]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Zero: Memory optimiza- tions toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

work page 2020
[8]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

work page 2022
[9]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022

work page 2022
[10]

Ssmlora: Enhancing low-rank adaptation with state space model

Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, and Shi Feng. Ssmlora: Enhancing low-rank adaptation with state space model. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4493–4506, 2025

work page 2025
[11]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019

work page 2019
[12]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021. 10

work page 2021
[13]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021

work page 2021
[14]

Look within or look beyond? a theoretical comparison between parameter- efficient and full fine-tuning.arXiv preprint arXiv:2505.22355, 2025

Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, and Hinrich Schütze. Look within or look beyond? a theoretical comparison between parameter- efficient and full fine-tuning.arXiv preprint arXiv:2505.22355, 2025

work page arXiv 2025
[15]

Adalomo: Low-memory optimiza- tion with adaptive learning rate

Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, and Xipeng Qiu. Adalomo: Low-memory optimiza- tion with adaptive learning rate. InFindings of the Association for Computational Linguistics: ACL 2024, pages 12486–12502, 2024

work page 2024
[16]

Galore: Memory-efficient llm training by gradient low-rank projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection. InForty-first International Conference on Machine Learning

work page
[17]

Apollo: Sgd-like memory, adamw-level performance.Proceedings of Machine Learning and Systems, 7, 2025

Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z Pan, Zhangyang Wang, and Jinwon Lee. Apollo: Sgd-like memory, adamw-level performance.Proceedings of Machine Learning and Systems, 7, 2025

work page 2025
[18]

Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

Qijun Luo, Hengxu Yu, and Xiao Li. Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

work page 2024
[19]

Hift: A hierarchical full parameter fine-tuning strategy

Yongkang Liu, Yiqun Zhang, Qian Li, Tong Liu, Shi Feng, Daling Wang, Yifei Zhang, and Hinrich Schütze. Hift: A hierarchical full parameter fine-tuning strategy. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18266–18287, 2024

work page 2024
[20]

Lisa: Lay- erwise importance sampling for memory-efficient large language model fine-tuning.Advances in Neural Information Processing Systems, 37:57018–57049, 2024

Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Lay- erwise importance sampling for memory-efficient large language model fine-tuning.Advances in Neural Information Processing Systems, 37:57018–57049, 2024

work page 2024
[21]

Chain of lora: Efficient fine-tuning of language models via residual learning.arXiv preprint arXiv:2401.04151, 2024

Wenhan Xia, Chengwei Qin, and Elad Hazan. Chain of lora: Efficient fine-tuning of language models via residual learning.arXiv preprint arXiv:2401.04151, 2024

work page arXiv 2024
[22]

Hydralora: An asymmetric lora architecture for efficient fine-tuning.Advances in Neural Information Processing Systems, 37:9565–9584, 2024

Chunlin Tian, Zhan Shi, Zhijiang Guo, Li Li, and Chengzhong Xu. Hydralora: An asymmetric lora architecture for efficient fine-tuning.Advances in Neural Information Processing Systems, 37:9565–9584, 2024

work page 2024
[23]

High-rank structured modulation for parameter- efficient fine-tuning.arXiv preprint arXiv:2601.07507, 2026

Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, and Hinrich Schütze. High-rank structured modulation for parameter- efficient fine-tuning.arXiv preprint arXiv:2601.07507, 2026

work page arXiv 2026
[24]

Efficient low-rank adaptation for sparse large language model.Tsinghua Science and Technology, 31(4):2292–2303, 2026

Yuxuan Hu, Tian Tian, Xiaodong Chen, Zhe Zhao, Tao Tao, Weifang Zhang, Yuanfeng Li, Yuhang Liang, Cuiping Li, Hong Chen, et al. Efficient low-rank adaptation for sparse large language model.Tsinghua Science and Technology, 31(4):2292–2303, 2026

work page 2026
[25]

Adam-mini: Use fewer learning rates to gain more.arXiv preprint arXiv:2406.16793,

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more.arXiv preprint arXiv:2406.16793, 2024

work page arXiv 2024
[26]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. InInternational Conference on Learning Representations

work page
[27]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...

work page 2019
[28]

Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019. 11

work page 2019
[29]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[30]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022
[31]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[32]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

work page 2023
[33]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Program induction by rationale generation: Learning to solve and explain algebraic word problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. InProceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 158–167, 2017

work page 2017
[35]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[36]

Agieval: A human-centric benchmark for evaluating foundation models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. InFindings of the association for computational linguistics: NAACL 2024, pages 2299–2314, 2024

work page 2024
[37]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Numglue: A suite of fundamental yet challenging mathematical reasoning tasks

Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. Numglue: A suite of fundamental yet challenging mathematical reasoning tasks. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3505–3523, 2022

work page 2022
[39]

Choice of plausible alterna- tives: An evaluation of commonsense causal reasoning

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alterna- tives: An evaluation of commonsense causal reasoning. InAAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011

work page 2011
[40]

The winograd schema challenge.KR, 2012(13th):3, 2012

Hector J Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge.KR, 2012(13th):3, 2012

work page 2012
[41]

The pascal recognising textual entailment challenge

Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. InMachine learning challenges workshop, pages 177–190. Springer, 2005

work page 2005
[42]

The fifth pascal recognizing textual entailment challenge.TAC, 7(8):1, 2009

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge.TAC, 7(8):1, 2009

work page 2009
[43]

Looking beyond the surface: A challenge set for reading comprehension over multiple sentences

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–...

work page 2018
[44]

Wic: the word-in-context dataset for evaluating context-sensitive meaning representations

Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, 2019. 12

work page 2019
[45]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. InProceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013

work page 2013
[46]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Llamafactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

work page 2024
[49]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[50]

Coordinate descent algorithms.Mathematical programming, 151(1):3–34, 2015

Stephen J Wright. Coordinate descent algorithms.Mathematical programming, 151(1):3–34, 2015

work page 2015
[51]

Benjamin Ellis, Matthew T Jackson, Andrei Lupu, Alexander D Goldie, Mattie Fellows, Shimon Whiteson, and Jakob Foerster

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad.arXiv preprint arXiv:2003.02395, 2020

work page arXiv 2003
[52]

Limitations

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023. 13 A Benchmarks To comprehensively evaluate the capabilities of different methods, we conduct experiments on a diverse set of benchmarks coveringmathematical reasoning,natural language un...

